Data Science Life Cycle

Mustafa Serdar Konca
May 18, 2022


When you enter any computer science department at a university, you will generally learn the software development life cycle in your first year. And you need to learn it if you want to become a software developer.

Similarly, if you want to work with data or become a data scientist, you should learn the data science life cycle.

Today, I will try to explain the data science life cycle. I hope it will be useful for everyone who reads it.

Contents:

1- Business Understanding and Problem Definition

2- Data Preparation

3- Exploratory Data Analysis (EDA)

4- Data Modeling

5- Model Evaluation

6- Model Deployment

What is the Data Science Life Cycle?

Data Science Lifecycle is a step-by-step demonstration of how machine learning and other analytical methods are used to generate insights and predictions from data to achieve a business goal.

The entire process involves several steps like data cleaning, preparation, modelling, model evaluation, etc. It is a long process and may take several months to complete.

I know each project is unique, but we can still outline a general working template.

Now let’s look at the steps of the data science life cycle together.

1. Business Understanding and Problem Definition

Many developments in the world first started with the question of “why”.

Just like any good business or IT-focused life cycle, a good data science life cycle starts with “why”. It is essential to understand the business objective clearly, because it will be the final goal of your analysis.

In this first phase of data analytics, the stakeholders regularly perform the following tasks: examine business trends, study case studies of similar data analytics projects, and study the domain of the business. The entire team assesses the in-house resources, the in-house infrastructure, the total time involved, and the technology requirements. Once these assessments and evaluations are complete, the stakeholders start formulating an initial hypothesis for resolving the business challenges in terms of the current market scenario.

Generally, the project lead or product manager carries out this phase. Its key tasks are to:

  • State clearly the problem to be solved and why
  • Define the potential value of the project
  • Identify the project risks including ethical considerations
  • Develop and communicate a high-level, flexible project plan

2. Data Preparation

The second phase comes after data discovery. Generally, data scientists or business/data analysts carry out this phase.

This includes:

  • Selecting the relevant data,
  • Integrating the data by merging the data sets,
  • Cleaning them,
  • Treating the missing values by either removing them or imputing them,
  • Treating erroneous data by removing them,
  • Checking for outliers using box plots and handling them.

Data preparation is the most time-consuming yet arguably the most important step in the entire life cycle. Your model will only be as good as your data.
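The preparation steps above can be sketched with pandas. The data set below is a made-up example, and the IQR rule stands in for the numeric side of a box-plot outlier check:

```python
import pandas as pd

# Hypothetical sample data: two small sets to integrate, with a missing
# value in "age" and an obvious outlier in "amount".
customers = pd.DataFrame({"id": [1, 2, 3, 4, 5],
                          "age": [25, None, 41, 30, 35]})
orders = pd.DataFrame({"id": [1, 2, 3, 4, 5],
                       "amount": [120.0, 95.0, 110.0, 130.0, 10_000.0]})

# Integrate: merge the data sets on a shared key.
df = customers.merge(orders, on="id")

# Treat missing values by imputing the median instead of dropping rows.
df["age"] = df["age"].fillna(df["age"].median())

# Check outliers with the IQR rule (the numeric analogue of a box plot's
# whiskers) and drop the rows that fall outside the fences.
q1, q3 = df["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
clean = df[df["amount"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
print(clean.shape)
```

Here the 10,000 order falls far outside the IQR fences and is removed; in a real project you would decide case by case whether an outlier is an error or a genuine extreme value.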

3. Exploratory Data Analysis (EDA)

This step involves getting some idea about the solution and the factors affecting it before building the actual model. We use bar graphs, scatter plots, and heat maps to better understand the data and its features.

While exploring the data, we must make sure it is clean: it should have no redundancies, missing values, or nulls. We also have to identify the important variables in the data set and remove any unnecessary noise that may hinder the accuracy of our conclusions when we move on to model building.
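These checks also have a numeric side that is easy to script. A minimal sketch with pandas on a hypothetical toy data set (the correlation column is the numeric counterpart of a heat map):

```python
import pandas as pd

# Hypothetical toy data set for illustration.
df = pd.DataFrame({
    "age": [25, 32, 41, 30, 35],
    "income": [30_000, 42_000, 55_000, 40_000, 47_000],
    "noise": [1, 1, 1, 1, 1],  # a constant column carries no information
})

# Cleanliness checks: redundancies and missing/null values.
print(df.duplicated().sum())   # duplicate rows
print(df.isna().sum().sum())   # missing values

# Identify important variables: correlations with a target-like column,
# the same numbers a heat map would visualize.
print(df.corr(numeric_only=True)["income"].round(2))

# Remove noise: drop columns with no variance.
informative = df.loc[:, df.nunique() > 1]
print(list(informative.columns))
```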

4. Data Modeling

We have spent a lot of time on the previous steps, and now our data is ready for modeling.

This step starts with choosing the appropriate type of model, depending on whether the problem is a classification, regression, or clustering problem. After choosing the model family, we need to pick an algorithm from among the various algorithms in that family and implement it carefully.

Our chosen model will have a number of hyperparameters, so we should find the optimal hyperparameter values for it. At the same time, we must be careful not to overfit.
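As a sketch of this step, assuming scikit-learn and a synthetic classification problem: we pick a model family for a classification task and search hyperparameter values with cross-validation, where limiting tree depth is one guard against overfitting.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical classification problem, so we choose a classifier family.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Grid-search hyperparameter values with 5-fold cross-validation;
# capping max_depth keeps the tree from memorizing the training data.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 8, None]},
    cv=5,
)
search.fit(X_train, y_train)

print(search.best_params_)
print(round(search.score(X_test, y_test), 2))
```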

5. Model Evaluation

We created a model in the previous step. But is our model successful? To make it more successful, we first need to measure its current performance.

There are two methods of evaluating models in data science, Hold-Out and Cross-Validation. The purpose of holdout evaluation is to test a model on different data than it was trained on. This provides an unbiased estimate of learning performance.

Cross-validation is a technique that involves partitioning the original observation data set into a training set, used to train the model, and an independent set used to evaluate the analysis. To avoid over-fitting, both methods use a test set (not seen by the model) to evaluate model performance. If we do not obtain a satisfactory result in the evaluation, we must re-iterate the entire modelling process until the desired level of metrics is achieved.
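Both evaluation methods can be sketched in a few lines with scikit-learn, on hypothetical synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=300, random_state=0)
model = LogisticRegression(max_iter=1000)

# Hold-Out: train on one split, then score on data the model never saw.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
model.fit(X_train, y_train)
holdout_score = model.score(X_test, y_test)

# Cross-Validation: repeat the split k times and average the scores,
# giving a steadier estimate than a single hold-out split.
cv_scores = cross_val_score(model, X, y, cv=5)

print(round(holdout_score, 2), round(cv_scores.mean(), 2))
```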

Common metrics used to evaluate models:

Classification metrics:

  • Precision-Recall,
  • ROC-AUC,
  • Accuracy,
  • Log-Loss

Regression metrics:

  • MSE (Mean Squared Error),
  • MAE (Mean Absolute Error),
  • R-squared,
  • Adjusted R-squared

Unsupervised Models:

  • Rand Index,
  • Mutual Information
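Most of the metrics above are one function call away in scikit-learn. A brief sketch on hypothetical true/predicted values, one metric per task type:

```python
from sklearn.metrics import (accuracy_score, adjusted_rand_score,
                             mean_squared_error, r2_score)

# Hypothetical true/predicted values, just to show the metric calls.
acc = accuracy_score([0, 1, 1, 0], [0, 1, 0, 0])            # classification
mse = mean_squared_error([2.0, 3.0, 5.0], [2.5, 2.5, 5.0])  # regression
r2 = r2_score([2.0, 3.0, 5.0], [2.5, 2.5, 5.0])
ari = adjusted_rand_score([0, 0, 1, 1], [1, 1, 0, 0])       # clustering

print(acc, round(mse, 3), round(r2, 2), ari)
```

Note that the adjusted Rand index ignores label names: the clustering above gets a perfect score because the groupings match even though the labels are swapped.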

We can build multiple models for a given phenomenon, but many of them may be imperfect. Model evaluation helps us choose and build the best one.

6. Model Deployment

We are now at the final stage of our life cycle. This step creates the delivery mechanism you need to get the model out to the users or to another system.

“No machine learning model is valuable, unless it’s deployed to production.”

This step means a lot of different things for different projects. It could be as simple as getting your model output in a Tableau dashboard. Or as complex as scaling it to the cloud to millions of users.
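At its simplest, "getting the model out" can mean persisting a trained model so another process, such as a dashboard or an API, can load and serve it. A minimal sketch, assuming scikit-learn and joblib; real deployments add versioning, monitoring, and an actual serving layer:

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Train on hypothetical data, then persist the fitted model as an artifact.
X, y = make_classification(n_samples=200, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)
joblib.dump(model, "model.joblib")

# The serving side loads the artifact and predicts; it never retrains.
served = joblib.load("model.joblib")
print(served.predict(X[:3]))
```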

Any shortcuts taken earlier, in the minimal-viable-model phase, are upgraded to production-grade systems.

Typically the more “engineering-focused” team members such as data engineers, cloud engineers, machine learning engineers, application developers, and quality assurance engineers execute this phase.

Coming to the end of the article…

Each step in the data science life cycle explained above should be worked upon carefully. If any step is executed improperly, it will affect the next step, and the entire effort goes to waste.

For example, if data is not collected properly, you will lose information and will not be able to build a good model. If data is not cleaned properly, the model will not work. If the model is not evaluated properly, it will fail in the real world. From business understanding to model deployment, each step should be given proper attention, time, and effort.

Stay with the data :) see you in the next article …
