The Reality of Data Engineering in Practice

Explore our approach to cleaning and migrating messy data from Azure Blob Storage to SQL Server.

Saartje Ly

Data Engineering Intern

August 26, 2024

Introduction

In today's data-driven world, organizations rely on large volumes of data stored in many different formats and locations. Recently, a client approached us with a challenging but common scenario: messy flat files in Azure Blob Storage that needed to be migrated into a SQL Server database. Our team took up the challenge, performing data cleaning and reconciliation along the way. In this case study, we walk through the steps we took to transform the data using an Azure Function, making sure it was clean, accurate, and ready for business use.


Data Reconciliation

Before cleaning and migration, we performed a thorough data reconciliation to ensure the integrity and consistency of the data. This involved matching records by their IDs, identifying and resolving discrepancies between them, and validating the results.
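As a rough illustration, reconciliation of this kind can be done by joining two extracts on their IDs and flagging rows that appear on only one side. This is a minimal sketch using pandas; the column names and sample data are hypothetical, not the client's actual schema:

```python
import pandas as pd

def find_discrepancies(source: pd.DataFrame, target: pd.DataFrame,
                       id_col: str = "record_id") -> pd.DataFrame:
    """Return rows whose ID appears in only one of the two datasets."""
    merged = source.merge(target, on=id_col, how="outer",
                          indicator=True, suffixes=("_src", "_tgt"))
    # "_merge" is "left_only" / "right_only" for unmatched rows
    return merged[merged["_merge"] != "both"]

src = pd.DataFrame({"record_id": [1, 2, 3], "amount": [10, 20, 30]})
tgt = pd.DataFrame({"record_id": [2, 3, 4], "amount": [20, 30, 40]})
diffs = find_discrepancies(src, tgt)  # IDs 1 and 4 need resolving
```

Rows that survive the filter are the conflicts to investigate and resolve before migration.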


Data Cleaning using an Azure Function

The files our client provided were filled with issues that had to be addressed before the data could be migrated to SQL Server. We used an Azure Function to automate the data cleaning process; this serverless solution let us handle the required transformations regardless of data volume:

1. Messy Column Names: The files had varying column names, with differences in spacing, special characters, and length. We standardized these by converting them to lowercase, replacing spaces with underscores, and removing any special characters that could cause issues in SQL Server. Standardizing column names ensures consistency across all datasets, making it easier to merge, query, and analyze the data later on.

2. NAs and NaNs: The data included many NA and NaN values, which were converted to None (SQL NULL). This conversion matters because NULL is the standard way to represent missing or undefined data in SQL, whereas NA/NaN values may not be interpreted correctly by SQL functions. It preserves data integrity, improves query performance, keeps the data compatible across systems, and yields a consistent dataset for decision making.

3. Floats and Scientific Notation: Some numeric fields were stored as floats or in scientific notation and needed to be converted to integers for consistency. The NA values in these columns made this difficult, since their presence blocked the integer conversion. We first replaced the NAs with a 0 placeholder so the columns could be cast to integers, then converted those zeroes back to None afterwards. Every float value ended in '.0', so no data was lost in the conversion.

4. Date Format Discrepancies: Date columns were inconsistent, with multiple formats such as 'DD-MM-YYYY' and 'MM-DD-YYYY' appearing across files. We resolved the ambiguity by using the month name contained in the data, converting it to a number for comparison. We then standardized everything to the SQL-friendly 'YYYY-MM-DD' format so that date-based queries would work correctly in SQL Server. This standardization improves the accuracy of queries and reports and ensures consistency across all datasets, making data integration and analysis easier.
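The four steps above can be sketched in pandas roughly as follows. This is a simplified illustration with hypothetical column names ('quantity', 'order_date') and a single assumed date format, not the client's actual cleaning code:

```python
import re
import numpy as np
import pandas as pd

def clean_frame(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # 1. Standardize column names: lowercase, spaces -> underscores,
    #    then strip any remaining special characters.
    df.columns = [re.sub(r"[^a-z0-9_]", "", c.strip().lower().replace(" ", "_"))
                  for c in df.columns]

    # 3. Integer data stored as floats: fill NAs with a 0 placeholder so the
    #    cast succeeds, cast to int, then restore the missing values as None.
    mask = df["quantity"].isna()
    df["quantity"] = df["quantity"].fillna(0).astype("int64").astype(object)
    df.loc[mask, "quantity"] = None

    # 4. Normalize dates to the SQL-friendly 'YYYY-MM-DD'
    #    (assumes day-first input here; the real ambiguity handling was richer).
    df["order_date"] = pd.to_datetime(df["order_date"],
                                      dayfirst=True).dt.strftime("%Y-%m-%d")

    # 2. Convert any remaining NA/NaN values to None (SQL NULL).
    return df.astype(object).where(pd.notna(df), None)

raw = pd.DataFrame({
    "Order Date ": ["25-12-2023", "01-02-2023"],
    "Quantity!": [3.0, np.nan],
})
cleaned = clean_frame(raw)
```

The placeholder trick in step 3 reflects the workaround described above: plain NumPy int64 columns cannot hold NaN, so the missing values are parked as zeroes during the cast and restored afterwards.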


Migrating the Data with an Azure Function

After each file was processed, the Azure Function migrated it from Azure Blob Storage into SQL Server. This automation reduced the time and effort required for migration and allowed large volumes of data to be handled efficiently. As a serverless solution, the Azure Function scales automatically with load; it can process a large batch of files or a single file with equal efficiency.
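For a sense of what the load step might look like, here is a minimal sketch that builds a parameterized INSERT from a cleaned DataFrame, suitable for a bulk load with pyodbc. The table name and connection details are placeholders; the client's actual setup is not shown in this post:

```python
import pandas as pd

def build_insert(table: str, df: pd.DataFrame):
    """Build a parameterized INSERT statement plus row tuples for executemany."""
    cols = ", ".join(df.columns)
    placeholders = ", ".join("?" for _ in df.columns)
    sql = f"INSERT INTO {table} ({cols}) VALUES ({placeholders})"
    rows = list(df.itertuples(index=False, name=None))
    return sql, rows

df = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})
sql, rows = build_insert("dbo.orders", df)

# Inside the Azure Function, the rows could then be loaded with pyodbc
# (connection string omitted):
# with pyodbc.connect(CONN_STR) as conn:
#     cur = conn.cursor()
#     cur.fast_executemany = True
#     cur.executemany(sql, rows)
```

Because the cleaning step already converted NA/NaN to None, the parameter tuples map missing values straight to SQL NULL on insert.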


The Outcome

The team successfully transformed the client's messy flat files into a clean, reliable SQL Server database. The automated process using an Azure Function ensured data cleanliness and made the migration process efficient. The client now has a solid data foundation that supports their operational and analytical needs.

If your organization faces similar challenges with messy or unstructured data, our team has the expertise to help you navigate the process, making sure your data is truly transformed. Reach out to us to learn how we can help and support your data needs.
