Whitepaper

Data Wrangling and the Machine Learning Life Cycle

September 21, 2022

The Impact of Data Wrangling on Machine Learning

Data wrangling is the process of gathering, selecting, and transforming data to support the next phase
of analytics and learning. This preparation takes up most of the time on a typical project, so doing it
efficiently speeds up the entire machine learning life cycle and frees up time for the data scientist.

The main goal of effective data wrangling is to produce structured analytical data sets (e.g., Analytic
Base Tables) that support better modeling and, once complete, can feed a wide range of machine learning
strategies.

For the data scientist, data wrangling transforms and maps raw data into various formats based on use cases. In practice, that means looking for inconsistencies, missing information, or data format issues. Better data translates well into efficient machine learning. Some of the key areas to consider include:

  • What is your use case and what entities are relevant?
  • Does the data integration pipeline support access to the data?
  • Check data from all sources, including APIs, cloud, and on-premises systems.
  • Generate an unbiased data sample.
  • Protect against data leakage during feature engineering.
  • Check data for correlation and redundancy before exporting it to the model.
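
The correlation and redundancy check in the last item above can be illustrated with a short sketch. It is not how Dextrus performs the check; the function names, the 0.95 threshold, and the sample columns are all assumptions chosen for illustration, and in practice a data analysis library would typically be used instead of hand-written code.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def redundant_pairs(columns, threshold=0.95):
    """Return pairs of column names whose absolute correlation meets the threshold."""
    return [
        (a, b)
        for a, b in combinations(columns, 2)
        if abs(pearson(columns[a], columns[b])) >= threshold
    ]


# Hypothetical data: the same height recorded in two units, plus an unrelated age.
columns = {
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age": [34, 21, 45, 29, 52],
}
print(redundant_pairs(columns))
```

A pair flagged this way is only a candidate for removal; whether to drop one of the two columns remains a judgment for the data scientist or SME.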

With these questions considered, a deeper workflow is needed beyond data wrangling alone; it can be represented as a Machine Learning Life Cycle.

The Life Cycle of Machine Learning – Being Disciplined

Disciplined machine learning (ML) follows a defined set of steps that form a coherent and holistic ML life cycle. Executed correctly, these steps derive business value by producing meaningful insights with practical benefits. It all starts with the data.

If we remember that most ML projects never make it past the experimental phase due to a number of reasons ranging from poor data quality to inadequate modeling to substandard data preparation, then what is the best way to start?

If we follow certain steps, we can establish both a sound workflow and sound data wrangling practices to successfully develop and deploy an effective model.

The Life Cycle of Machine Learning – Top Level

With a thoughtful approach to machine learning, the focus is generally on data collection, data wrangling, data interpretation, data modeling, and model deployment. Software automation during these stages is important because it supplements manual processes and increases speed and data quality.

As shown below, the data pipeline aligns data transformation with rules and governance and feeds the data wrangling phase, where data exploration and cleaning take place. Once the data is “wrangled,” it is split into training and test sets as appropriate for the learning task and then moves to the feature engineering phase. This feeds data modeling and eventual deployment. We can now take a deeper look at the process.

Using the Dextrus ML Studio in the Machine Learning Cycle

Combined with the know-how of a data scientist and a subject matter expert (SME), an automated software tool makes the ML Life Cycle process more complete. RightData’s Dextrus ML Studio platform architecture supports an effective, step-by-step approach to the ML Life Cycle.

1. Data Collection

Whether the data comes from raw files, APIs, or cloud-delivered sources, data collection is an important step in the Life Cycle process. Dextrus provides connectors that access the data and feed it into a governed data pipeline, so enterprises can ingest data from any source and location at speed and scale. Dextrus offers 150+ connectors for databases, applications, events, flat file data sources, cloud platform technologies, SAP sources, REST APIs, and social media platforms, and the list grows with every release.

2. Data Wrangling

Data wrangling is perhaps the most important step in the cycle because the quality of the results obtained from the model depends heavily on the quality of the data the model uses. Raw data may contain many inconsistencies and a lot of noise that must be resolved before it reaches a form suitable for modeling. This phase is therefore critical to the performance of machine learning models: you need high-quality data to trust model outcomes, which drive strategy and spending at the business level.

Dextrus provides a range of data wrangling capabilities through an easy-to-use, no-code user interface for identifying, cleaning, and fixing data issues. The platform makes use of open-source tools such as Apache Spark, which allow it to scale with the data size. In addition, because Dextrus data integration is integrated with the ML Studio, it is easy to manage the workflow described in this paper.

3. Data Understanding

Before performing any wrangling operations, it is important to understand the user's data and the key metrics that describe it.

Tabular data in CSV format can be imported into the Dextrus platform, with any required import customizations, so that wrangling operations can be performed on it.

Once the data is imported, users can see an overview of it. Several metrics play an important role in describing the data in each column. The Dextrus platform presents this information in one place so that users can understand the key metrics and identify any potential issues.

The first step in understanding the data is to identify the categorical and numerical columns. This can be determined from each column's data type and from its number of distinct values compared with the total number of rows.

The platform reports the number of valid values for each column, which helps in identifying issues such as null values and values that are inconsistent with the column's type or domain.

It is also important to understand the distribution of data in each column, because it guides the choice of engineering and modeling operations in later steps of the cycle. To provide those insights, the platform shows descriptive statistics for each column based on its data type.

For numerical columns, for example, users can view statistics such as the mean, median, and number of unique values, and can view a histogram plot to check whether the distribution is normal.
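
To make the profiling steps above concrete, the following is a minimal sketch of a per-column summary. It is an illustration only and does not reflect how Dextrus computes its metrics: the function name, the 0.5 distinct-value ratio used to separate numerical from categorical columns, and the sample table are assumptions.

```python
from statistics import mean, median


def profile_column(values, distinct_ratio=0.5):
    """Summarize one column: row count, valid count, distinct count, kind, basic stats."""
    # Treat None and empty strings as missing (an assumption about the raw data).
    valid = [v for v in values if v is not None and v != ""]
    distinct = len(set(valid))
    numeric = bool(valid) and all(isinstance(v, (int, float)) for v in valid)
    # Heuristic: a numeric column whose distinct values are a large share of its
    # valid rows is treated as numerical; everything else as categorical.
    is_numerical = numeric and distinct / len(valid) > distinct_ratio
    summary = {
        "rows": len(values),
        "valid": len(valid),
        "distinct": distinct,
        "kind": "numerical" if is_numerical else "categorical",
    }
    if numeric:
        summary["mean"] = mean(valid)
        summary["median"] = median(valid)
    return summary


# Hypothetical table with one missing value in each column.
table = {
    "income": [52000, 61000, None, 48000, 75000, 61000],
    "segment": ["a", "b", "a", "", "b", "a"],
}
profiles = {name: profile_column(col) for name, col in table.items()}
for name, summary in profiles.items():
    print(name, summary)
```

Comparing "valid" with "rows" exposes missing values, while a gap between the mean and the median hints at a skewed distribution worth inspecting in a histogram.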

4. Data Processing

Getting an overview of the data helps in identifying issues, fixing them, and performing other data wrangling operations. Such issues include missing values and skewed distributions caused by outliers.

Dextrus provides a wide range of capabilities for fixing such issues, for example replacing missing data with appropriate values or with a metric such as the mean of that column's distribution.

Apart from fixing issues, other operations are available, such as removing outliers, encoding data, and cleaning operations like masking and splitting. All versions of the data cleaning recipes can be accessed in one place and then used in the data modeling steps.
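
Two of the operations described above, outlier removal and mean imputation, can be sketched as follows. This is an illustrative example rather than the Dextrus implementation: the function names are hypothetical, the interquartile range (IQR) rule with a factor of 1.5 is one common convention among several, and the sample values are made up.

```python
from statistics import mean, quantiles


def drop_outliers_iqr(values, k=1.5):
    """Drop values outside [Q1 - k*IQR, Q3 + k*IQR]; missing values (None) are kept."""
    observed = [v for v in values if v is not None]
    q1, _, q3 = quantiles(observed, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in values if v is None or low <= v <= high]


def impute_mean(values):
    """Replace missing values (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical column with one missing value and one extreme value.
raw = [10, 12, None, 11, 13, 14, 100]
cleaned = impute_mean(drop_outliers_iqr(raw))
print(cleaned)
```

Outliers are dropped before imputing here so that the extreme value does not pull the mean used as the fill value; the reverse order is also possible but produces a more skewed fill.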

5. Data Preparation for Modeling

Train-test split: After the data cleaning process, the data needs to be divided into training and test sets so that the model's performance can be tested on data it has not seen, and so that no data leakage introduces illegitimate information into the training step.
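
A minimal sketch of such a split is shown below. The function name, the 20% test fraction, and the fixed seed are assumptions made for illustration; the appropriate split strategy (random, stratified, or time-based) depends on the learning task.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and split it into (train, test) lists."""
    shuffled = list(rows)
    # A seeded generator makes the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical data set of ten row identifiers.
rows = list(range(10))
train, test = train_test_split(rows)
print(len(train), len(test))
```

Every row lands in exactly one of the two sets, which is the property that keeps test rows from leaking into training.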

6. Feature Engineering

After the data is split into training and test sets, feature engineering can be performed on the training data to generate the right set of features. Dextrus provides encoding capabilities such as one-hot encoding and ordinal encoding for categorical columns, as well as a range of normalization methods for numerical columns, such as the standard scaler, the min-max scaler, and others.
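
The sketch below illustrates the idea of fitting transformations on the training data only and then applying them unchanged to other data, which is what prevents leakage at this step. The helper names and sample values are hypothetical, and this is not the Dextrus implementation.

```python
from statistics import mean, pstdev


def fit_standard_scaler(train_values):
    """Learn mean and standard deviation from training data; return a transform."""
    mu, sigma = mean(train_values), pstdev(train_values)
    sigma = sigma or 1.0  # avoid division by zero for constant columns
    return lambda v: (v - mu) / sigma


def fit_min_max_scaler(train_values):
    """Learn min and max from training data; return a transform onto [0, 1]."""
    low, high = min(train_values), max(train_values)
    span = (high - low) or 1.0
    return lambda v: (v - low) / span


def fit_one_hot(train_values):
    """Learn the category list from training data; return an encoder."""
    categories = sorted(set(train_values))
    # A category not seen during training encodes as all zeros.
    return lambda v: [1 if v == c else 0 for c in categories]


# Hypothetical training columns.
standardize = fit_standard_scaler([20, 30, 40])
rescale = fit_min_max_scaler([20, 30, 40])
encode = fit_one_hot(["red", "green", "red"])

print(standardize(40), rescale(30), encode("red"), encode("blue"))
```

Because each transform closes over statistics learned from the training set, applying it to test rows never exposes test-set information to the model.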

7. Modeling and Tuning Hyperparameters

Dextrus supports a range of modeling algorithms that can be used for classification tasks such as predicting binary outputs. It lets users select the metrics used to evaluate those algorithms.
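
As an illustration of the kind of evaluation metrics that can be chosen for a binary classification task, the sketch below computes accuracy, precision, recall, and F1 from true and predicted labels. The function name and the sample labels are assumptions; the source does not list which metrics Dextrus offers.

```python
def binary_metrics(y_true, y_pred):
    """Compute common binary classification metrics from 0/1 label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical labels: one false positive and one false negative out of eight.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Which metric matters most depends on the use case; for example, recall is usually favored when missing a positive case is costly.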

Hyperparameter tuning is an important step in which the hyperparameters of the model are optimized to get the best results. On the Dextrus platform, users can customize their hyperparameter search and compare the results for each selection.
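
One simple search strategy is an exhaustive grid search, sketched below with a small k-nearest-neighbors classifier. The source does not say which search strategies or algorithms Dextrus uses, so the model, the grid, the function names, and the data are all assumptions for illustration. Candidates are scored on a validation set held out from the training data, so the test set stays untouched until final evaluation.

```python
from itertools import product


def knn_predict(train, x, k):
    """Majority label (0 or 1) among the k training points nearest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0


def accuracy(train, validation, k):
    """Share of validation points that the k-NN model labels correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in validation)
    return hits / len(validation)


def grid_search(train, validation, grid):
    """Score every hyperparameter combination; return (best, all_results)."""
    names = list(grid)
    results = []
    for combo in product(*(grid[name] for name in names)):
        params = dict(zip(names, combo))
        results.append((params, accuracy(train, validation, **params)))
    # max() keeps the first of any tied candidates, so grid order breaks ties.
    best = max(results, key=lambda result: result[1])
    return best, results


# Hypothetical 1-D data as (feature, label); (3.4, 1) is a noisy training point.
train = [(1.0, 0), (2.0, 0), (3.0, 0), (3.4, 1), (6.0, 1), (7.0, 1), (8.0, 1)]
validation = [(2.5, 0), (3.3, 0), (6.5, 1), (7.5, 1)]

best, results = grid_search(train, validation, {"k": [1, 3, 5]})
for params, score in results:
    print(params, score)
print("best:", best)
```

Keeping the full list of results, not only the winner, is what allows the selections to be compared side by side.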

8. Testing and Prediction

Once modeling and hyperparameter tuning are complete, Dextrus provides detailed information about all the model versions on the dashboard. The data science professional can use the relevant metrics and model parameters to select the best model for their data and use it for prediction.

Thinking From a Data Scientist’s Perspective

For the data scientist, a balanced approach is needed as data is managed through different stages, which can be automated for faster results and greater scalability.

More importantly, feeding machine learning solutions is an iterative process in which self-service and collaboration dramatically improve learning outcomes. If you want to develop and deploy ML solutions faster, you need an agile platform that combines the different steps of the cycle and integrates easily with the rest of the existing system pipeline. This eases the push and pull of training and testing data, with faster decisions as a major outcome.

Finally, moving from experimentation to operational machine learning (continuous learning, really) requires ML modeling and deployment at scale. With an automated approach that is integrated into both the data phase and the machine learning phase, the concept of a “data scientist” now blends into one “learning scientist.” The industry at large is certainly moving in that direction, and today's platforms are enabling that transition.

Dextrus and the Dextrus ML Studio meet that challenge today by creating a deep model bridge all the way from raw data to pipeline to wrangling to machine learning. With the speed of modern management and processing, ML operational models are emerging quickly; and with a simplified user interface, business users and next-generation “learning scientists” are growing in number.

Dextrus ML Studio represents the future of machine learning at scale. Learn more about how Dextrus manages the entire process and Life Cycle.

About the Author

Suresh Saguturu serves as Vice President of Product Development and Customer
Success at RightData Inc. With extensive data consulting experience with top firms such
as Coca-Cola, Bank of America, and Nike, Suresh has demonstrated both architecture
acumen and project leadership, including deep expertise in SAP and S/4 HANA systems.
Suresh holds many data certifications and a bachelor’s degree from Acharya Nagarjuna
University. suresh@getrightdata.com

About RightData

RightData is a trusted total software company that empowers end-to-end capabilities for modern data, analytics, and machine learning using modern data lakehouse and data mesh frameworks. The combination of Dextrus software for data integration and the RDt for data quality and observability provides a comprehensive DataOps approach. With a commitment to a no-code approach and a user-friendly interface, RightData increases speed to market and provides significant cost savings to its customers. www.getrightdata.com