The Framework Process of Data Science: Cross-industry Standard Process for Data Mining (CRISP-DM)

Ivan Muhammad Siegfried
Apr 15, 2021
Figure 1: CRISP-DM Framework Diagram

Introduction

In general, there are two frameworks commonly used by data scientists to extract information and build models from raw data: the Cross-industry Standard Process for Data Mining (CRISP-DM) and the Obtain, Scrub, Explore, Model, and Interpret (OSEMN) framework. In this article, I will explain CRISP-DM and how it is applied directly in a program that has been created.

History

CRISP-DM was first introduced in 1996, when computing capabilities and tools were still limited. The desire to produce good data analysis and good models prompted several large companies of that time, SPSS and Teradata, along with their users Daimler, NCR, and OHRA, to form a research group and produce the codification called CRISP-DM. CRISP-DM remains widely used today.

Business Understanding

The CRISP-DM framework starts with business understanding, that is, understanding the business or project. Understanding the project is the initial foundation; it also examines the impacts or effects that can arise when we analyze the raw data, and the factors that influence the completion of the project. The aim is for the data analysis work to stay on the expected track and not deviate from the initial target.

Figure 2: Complete Business Understanding Diagram

Some of the steps needed to address the Business Understanding phase, following Figure 2, are as follows:

What do you want to achieve from the business side, or from completing this project? A follow-up step is to write down the background, the business goals, and the criteria for business success.
What resources (including hardware and software), constraints, and assumptions apply to completing the project?
How do you obtain the raw data and analyze it so that it has a significant impact on completing the program?
What project planning does the team need to do, and what tools and techniques can be used to make the job a success?

Data Understanding

Then, to understand the raw data, we need to get to know the data further. Some of the steps that can be taken are as follows.

Figure 3: Complete Data Understanding Diagram

This part of the work is used to check and understand the data in more depth, especially once the business problem to be solved is understood. The first stage, collecting initial data, retrieves the data and checks its relevance, suitability, and completeness. The second is describing and exploring the data: steps that aim to understand the meaning of each column, check for outlier values, and analyze trends in the data with various analyses (for example, univariate, bivariate, and multivariate analysis).
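As a minimal sketch of this first stage (assuming the Boston housing data used in the examples below is stored locally in a hypothetical file named boston_housing.csv), the initial data collection and completeness check in pandas could look like this:

```python
import pandas as pd

# Load the raw data; the file name is an assumption for illustration.
df = pd.read_csv("boston_housing.csv")

# Check the relevance, suitability, and completeness of the data.
print(df.shape)           # number of rows and columns
print(df.dtypes)          # data type of each column
print(df.isnull().sum())  # missing values per column
print(df.describe())      # basic summary statistics
```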

Univariate analysis is analysis that looks at the data through a single variable. An example of its use is a frequency distribution table that shows how many houses stand near the river.

Figure 4: (a) Pie chart of the number of houses in Boston near the river
Figure 4: (b) Distribution of the median house price (MEDV), which has positive skewness
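A minimal sketch of this kind of univariate analysis, assuming the DataFrame df from the previous step contains the Boston housing columns CHAS (1 if the house borders the Charles River, 0 otherwise) and MEDV (median house price):

```python
import matplotlib.pyplot as plt

# Frequency distribution of the CHAS column.
counts = df["CHAS"].value_counts()
print(counts)

# Pie chart of houses near the river, as in Figure 4 (a).
counts.plot.pie(autopct="%1.1f%%")
plt.title("Houses in Boston near the Charles River")
plt.ylabel("")
plt.show()

# Distribution of the target MEDV, as in Figure 4 (b).
df["MEDV"].plot.hist(bins=30)
plt.xlabel("MEDV (median house price)")
plt.show()
```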

Besides that, one of the most useful forms of univariate analysis for improving a model is the detection of outliers using a boxplot. The boxplot shows outliers as individual dots. An example is as follows.

Figure 5: Outlier detection using a box plot
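A short sketch of the same idea, assuming df contains numeric Boston housing columns such as CRIM, RM, and MEDV:

```python
import matplotlib.pyplot as plt

# A boxplot draws values beyond 1.5 * IQR from the quartiles as
# individual points, which makes outliers easy to spot visually.
df.boxplot(column=["CRIM", "RM", "MEDV"])
plt.title("Outlier detection using boxplots")
plt.show()
```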

Then, bivariate analysis is analysis that looks at the data through two variables. An example is using a heatmap to see the correlation between one variable and another.

Figure 6: Example of Bivariate Analysis: Correlation of Each Feature with the Target Variable (MEDV)
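A possible sketch of the heatmap in Figure 6, assuming df holds only numeric columns including the target MEDV:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Pairwise Pearson correlation between all columns; the MEDV row/column
# shows how strongly each feature relates to the target.
corr = df.corr()
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation heatmap (bivariate analysis)")
plt.show()
```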

Meanwhile, multivariate analysis is analysis involving more than two variables. An example is the relationship between the plant type (cultivar) and the concentrations of compounds in a particular wine.

Figure 7: Example of Multivariate Analysis on the wine.data dataset (Relationship of Plant Types 1, 2, and 3 to Concentrations of V5 and V4)
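A rough sketch of this multivariate view, assuming the UCI wine.data file is available locally and that the columns are named V1 to V14 with V1 holding the cultivar, as implied by Figure 7:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# wine.data has no header row; V1 is the cultivar (1, 2, or 3) and the
# remaining columns are chemical measurements.
wine = pd.read_csv("wine.data", header=None,
                   names=[f"V{i}" for i in range(1, 15)])
wine["cultivar"] = wine["V1"].astype("category")

# Three variables at once: V4 and V5 on the axes, the cultivar as colour.
sns.scatterplot(data=wine, x="V4", y="V5", hue="cultivar")
plt.title("Multivariate analysis of the wine dataset")
plt.show()
```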

Data Preparation

Figure 8: Data Preparation Diagram

The figure above shows the actual data preparation steps. The main purpose of data preparation is to prepare the data so that it can be easily understood by both the model and the user. One of these steps is data selection, which generally keeps the features that have a strong correlation with the target. One way to make this selection is the bivariate analysis above, namely the correlation calculation shown in the heatmap.
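One way to sketch this selection step, assuming df and the target column MEDV as before (the 0.5 threshold is an arbitrary illustration):

```python
# Keep only the features whose absolute correlation with the target
# exceeds the chosen threshold.
corr_with_target = df.corr()["MEDV"].drop("MEDV")
selected = corr_with_target[corr_with_target.abs() > 0.5].index.tolist()
print("Selected features:", selected)

X = df[selected]
y = df["MEDV"]
```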

Figure 9: (a) Data before outlier treatment
Figure 9: (b) Data after outlier treatment

Other steps include eliminating erroneous values, integrating data when several datasets are used to build the model, and reshaping the data so that new features are generated (for example, breaking a time column down into hour, minute, second, day, date, month, and year).
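For illustration, outlier treatment and a simple date breakdown might look like this (clipping to the 1.5 * IQR fences is one common option, and the timestamp column is purely hypothetical since the Boston data has no time column):

```python
import pandas as pd

# Treat outliers in a numeric column by clipping them to the IQR fences.
q1, q3 = df["MEDV"].quantile([0.25, 0.75])
iqr = q3 - q1
df["MEDV"] = df["MEDV"].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

# Reform a (hypothetical) timestamp column into new features.
# df["timestamp"] = pd.to_datetime(df["timestamp"])
# df["hour"] = df["timestamp"].dt.hour
# df["day"] = df["timestamp"].dt.day
# df["month"] = df["timestamp"].dt.month
# df["year"] = df["timestamp"].dt.year
```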

Modeling Stage

Figure 10: Modelling

This stage comes after the data we have processed is clean and ready to be used for modeling, by looking at the patterns in the training data we have. Some of the phases used:

• Choosing a modeling technique: using initial parameters and some existing modeling techniques, or even our own
• Creating test data: the data is split into training data and test data, which are used to train and evaluate our model
• Creating a model: training the model we have. One result we can give to the user is the parameter setting, that is, the best parameters among those we defined. One way to find the best parameters is GridSearchCV, a scikit-learn utility that loops over a grid of hyperparameters, fits the model on the training set for each combination, and computes a cross-validation score for each set of hyperparameters (see the sketch after this list).

Figure 10: Example of Using GridSearchCV to Find the Best Hyperparameter Value

• Assessing the model: once the model has been built, it is assessed for whether it is fit for purpose. For example, if we have trained the LinearRegression(), Lasso(), and GradientBoostingRegressor() models, the three models are ranked against the metric criteria defined in the objectives of the business understanding section.
• Revising parameter settings: the data scientist decides whether the model that has been created is suitable for use. If not, they can choose different hyperparameters or repeat the modeling so that the evaluation results improve.
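A minimal sketch of the GridSearchCV step, assuming X and y are the feature matrix and target prepared earlier; the parameter grid values are illustrative and not necessarily those used in the figure:

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Split the prepared data into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Grid of hyperparameters to loop over.
param_grid = {
    "n_estimators": [100, 300],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}

# GridSearchCV fits one model per combination using 5-fold
# cross-validation and keeps the best-scoring combination.
search = GridSearchCV(GradientBoostingRegressor(random_state=42),
                      param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Test score (neg MSE):", search.score(X_test, y_test))
```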

Evaluation Stage

Figure 11: Chart of the Evaluation Section

At this stage, data scientists go through several steps to assess whether the model meets the business/project objectives. A concrete example is checking whether the model answers the questions laid out in business understanding. The tasks a data scientist should undertake at this point are as follows:

• Check the quality of the model objectively and how effectively it solves the problem
• Pay attention to the key results of the model that has been built
• Verify that the model produces reasonable results for the business problems posed
• Decide whether to continue to the final stage or return to a previous phase
• If you return to a previous phase, repeat as necessary.
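For illustration, a hedged sketch of such an objective check, assuming the X_train and y_train from the modeling sketch above and using R^2 as an example metric (the actual metric would be whatever was agreed in business understanding):

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import cross_val_score

# Rank the candidate models on the same metric so the result can be
# compared against the objective set during business understanding.
models = {
    "LinearRegression": LinearRegression(),
    "Lasso": Lasso(alpha=0.1),
    "GradientBoostingRegressor": GradientBoostingRegressor(random_state=42),
}

for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="r2")
    print(f"{name}: mean R^2 = {scores.mean():.3f}")
```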

Deployment

Figure 12: Deployment Stages in CRISP-DM Framework

In this section, once the model is ready, it is deployed so that it can be used by the customer. The deployment can be scaled using a combination of Spark, Luigi, Docker, Jenkins, and Kubernetes.
• Apache Spark: an engine used to process large-scale data. Its advantage over Hadoop MapReduce is faster processing, because intermediate data is kept in memory, whereas Hadoop MapReduce writes it to disk
• Luigi: a Python module used to build complex pipelines of batch jobs; an alternative to Apache Oozie
• Docker: a tool used to package the application and its dependencies into containers
• Jenkins: used for continuous building and testing of software. In this case, Jenkins is used to run training, deploy the machine learning model we have created, and run jobs that monitor the quality metrics
• Google Compute Engine: used for the training process
• Google Cloud Storage or Amazon S3: used for data storage
• Kubernetes: an orchestration platform used to manage resources connected to the Kubernetes master.

Stages in the deployment process:
• A trigger starts a Jenkins run
• Jenkins builds a Docker image
• Tests are run regularly
• Luigi orchestrates the performance checks (see the sketch after this list):
o Retrieve test data from storage such as Google Cloud Storage
o Process the dataset using PySpark
o Test the new model and output its performance
• If the performance is good, deploy to Kubernetes
• If the performance is not good enough, do not deploy, or deploy only if the administrator decides to.
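As an illustration of how Luigi could orchestrate these checks (a hypothetical sketch: the task names, file paths, and metric value are placeholders, not the actual pipeline):

```python
import luigi

class FetchTestData(luigi.Task):
    """Fetch the test data; a real pipeline would pull it from
    Google Cloud Storage instead of writing a placeholder file."""

    def output(self):
        return luigi.LocalTarget("data/test_data.csv")

    def run(self):
        with self.output().open("w") as f:
            f.write("placeholder,data\n")

class EvaluateModel(luigi.Task):
    """Score the new model; a real pipeline would process the dataset
    with PySpark and write metrics for Jenkins to inspect."""

    def requires(self):
        return FetchTestData()

    def output(self):
        return luigi.LocalTarget("reports/performance.txt")

    def run(self):
        with self.output().open("w") as f:
            f.write("r2=0.85\n")  # placeholder metric

if __name__ == "__main__":
    luigi.build([EvaluateModel()], local_scheduler=True)
```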

Carry out monitoring for some time to avoid undesirable outcomes. If everything looks good, prepare a final presentation, keeping two audiences in mind: the layperson and the data scientist.

One example of a report that conveys results well is the one produced by Sri Audita Sari et al., with Nurvita Monarizqa as mentor, in the Solver Society Volume 3 program run by IYKRA (page 26). Due to copyright issues, a screenshot of the slide cannot be shown here; the report discussed can be accessed via https://drive.google.com/drive/folders/1tzMv5XqJNmuXazoUAouPQ3fUJlKZUtLd.

The report takes the form of an infographic that is not only rich in information but also enriched with good vector images, so readers will not get bored following the presentation of the data. The slides also avoid too much text while remaining rich in information, so only the important points are shown.

Conclusion

That concludes the explanation of the CRISP-DM framework. In short, CRISP-DM defines several standard steps for processing data, from business understanding through to deployment.

If you have anything to ask, do not hesitate to send an email to ivanmuhsiegfried@gmail.com by including a link to this post in your email.

Happy coding!


Ivan Muhammad Siegfried

A full-stack Data Scientist. Currently working at Telkom Indonesia. Portfolio: ivanmsiegfried.github.io