Leveraging cloud services makes it easy to integrate on-premise data sources with cloud applications. ETL (Extract, Transform, and Load) has been a part of every digital transformation project we undertake as data engineers. Whether it’s moving data out of an on-premise legacy system to be consumed by newer cloud-based applications, or combining data from disparate systems into a reporting data warehouse, ETL processes are a necessary part of most enterprise solutions.
In this post, I pen down my thoughts on how we can achieve that end goal by leveraging Azure cloud services. Today, every ETL discussion starts with Azure Data Factory (ADF). ADF offers many ways to ingest data, it scales well, and its Data Flow feature provides a wide range of transformation options without writing any custom code.
High-Level Decision Guide
Microsoft positions ADF as the Azure service for managing ETL and other integrations at big-data scale. While there are many ways to employ ADF in a solution, I’ve found the following questions and answers most useful as a guide:
- If we only need to perform extraction and loading of data (for example, making data from a legacy system available in the cloud), ADF’s basic pipeline activities are sufficient.
- If we also need transformations, we’ll start out using ADF’s Data Flow features. I have found that the majority of transformations we need (joins, unions, derivations, pivots, aggregates, etc.) can be handled with the Data Flow user interface.
- If the transformations involve edge-case scenarios, are hard to visualize in a UI, or the team is simply more comfortable developing them as code, Azure Databricks (ADB) integration can be used to perform these transformations.
Pipeline Activities Approach
The basic unit of work in Azure Data Factory is called a pipeline. A pipeline is an organized set of activities (data movement, row iteration, conditionals, basic filtering, etc.) run against source and target datasets.
Below is a basic pipeline built for the requirement of moving queried data from an on-premises SQL Server to Azure ADLS Gen2, written in Parquet format (one of the most optimized formats for big-data workloads). The pipeline queries SQL Server inside the client’s private data center through a Self-Hosted Integration Runtime (SHIR), pulls the result set, and writes it as a Parquet file in ADLS Gen2 for further processing.
```sql
SELECT d.yearID,
       d.playerID,
       d.gameID,
       d.salary,
       p.nameFirst,
       p.nameGiven
FROM (
    SELECT a.yearID,
           a.gameID,
           a.playerID,
           s.salary
    FROM [LahmansBaseballDB].[dbo].[AllstarFull] AS a
    INNER JOIN [LahmansBaseballDB].[dbo].[Salaries] AS s
        ON a.playerID = s.playerID
) AS d
INNER JOIN [LahmansBaseballDB].[dbo].[People] AS p
    ON d.playerID = p.playerID;
```
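For reference, the Copy activity at the heart of such a pipeline has roughly the following shape in ADF’s JSON definition. This is a sketch, not the actual pipeline export; the activity and dataset names are illustrative, and the reader query is elided.

```json
{
  "name": "CopyOnPremSqlToParquet",
  "type": "Copy",
  "inputs": [ { "referenceName": "OnPremSqlDataset", "type": "DatasetReference" } ],
  "outputs": [ { "referenceName": "AdlsGen2ParquetDataset", "type": "DatasetReference" } ],
  "typeProperties": {
    "source": {
      "type": "SqlServerSource",
      "sqlReaderQuery": "SELECT ... FROM [LahmansBaseballDB].[dbo].[AllstarFull] ..."
    },
    "sink": { "type": "ParquetSink" }
  }
}
```

The `OnPremSqlDataset` here would be bound to a linked service that runs through the SHIR, which is what lets ADF reach inside the private data center.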
The figure below shows the self-hosted integration runtime.
As you can see, we have a successful debug run.
One can visualize further details by clicking the view icon as demonstrated.
The detailed view is shown below.
Note: in order to get Parquet output through the SHIR, 64-bit JRE 8 (Java Runtime Environment) must be installed on the SHIR machine, as mentioned in the official documentation.
Data Flow Approach
A Data Flow is a visually designed data transformation for use in Azure Data Factory. The Data Flow is designed in ADF, then invoked from a pipeline using a Data Flow activity. The transformations offered here provide a lot of power and configuration options through an easy-to-follow interface. Joining and splitting datasets, cleaning data, deriving new columns, filtering and sorting results, and running expression functions on row data are some of the possibilities with Data Flow, all through simple user-interface controls.
Below is an example mapping data flow I created to show just a few of the transformation components that can be used. This data flow reads HR employee data, contractor data, and billing info from three different systems. It performs some filtering and new column generation, then combines all of these results. Finally, it sorts the results and drops them into a CSV file.
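Under the hood, ADF persists a mapping data flow as a “data flow script”. As a rough sketch of what a flow like the one above looks like in that syntax (the stream names, columns, and exact expression syntax here are made up for illustration and may differ from what ADF actually generates):

```
source(output(employeeId as integer, name as string, active as boolean),
  allowSchemaDrift: true) ~> HrEmployees
HrEmployees filter(active == true()) ~> ActiveEmployees
ActiveEmployees derive(fullName = name) ~> WithFullName
WithFullName sort(asc(fullName, true)) ~> Sorted
Sorted sink(allowSchemaDrift: true) ~> CsvOutput
```

You rarely edit this script directly; the point is that the visual designer is just a front end over a declarative transformation definition.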
For more info on Data Flow, check the official docs here.
If you’ve been around the ‘Data and Analytics’ domain for several years, you’ll probably remember the hype around ‘Big Data’. The most common scenario in which companies switch to Big Data is when they lack the analytical power to run a workload on a single server. Such workloads include categorizing documents using Natural Language Processing, creating segmentations of all website visitors, or training a very accurate prediction model. In the most common sense of the term, ‘Big Data’ processing is done in a distributed way, i.e. by coordinating several machines at once.
Spark is an open-source analytics engine that allows technical users to set up a distributed system, letting companies tackle their Big Data projects. Out of the box, Spark can also capture streaming events, provides a set of machine learning algorithms, and supports graph processing.
While Spark is great at what it does, it is hard to maintain and configure: clusters are hard to spin up and down, and hard to scale by adding or removing servers. Databricks addresses this problem by providing ‘Spark as a Service’ while also adding enterprise-grade features. Since much of the Databricks product team also created the core of Spark, they have made API and performance improvements to the engine they provide. As such, we believe Databricks is the most enterprise-ready Big Data and data science platform.
The Databricks Notebook Experience
If you haven’t worked with analytical notebooks (Jupyter, Azure Data Studio, or Databricks), you’re missing out! Being able to write documentation and code in the same document is a big step forward. It helps you make clear to readers why you created the queries and guides them through their first steps in analytics. Unlike some other notebooks, Databricks can connect to version control (Git, TFS, …) and allows you to combine R, Python, SQL, and Scala in the same notebook.
There will be business use cases where it makes sense to simply write code to perform a data transformation. For example:
- There’s a weird edge case that Data Flow’s built-in transformations will not accommodate.
- A high degree of refactoring (change) is expected over time and the data flow will be large. It would be much easier and faster to tweak code than to rework large data flows.
- The source data has already been exported as enormous amounts of unstructured data into Azure Data Lake Storage, a file system natively integrated with Azure Databricks.
- We want to run unit tests on the data using Amazon’s Deequ library.
In these cases, Azure Data Factory pipelines can invoke notebooks in Azure Databricks using a Databricks Notebook activity. Notebooks contain Scala, Python, SQL, or R code to manipulate and query large volumes of data (terabytes), typically against Azure Data Lake Storage.
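For a sense of how little glue is needed, the pipeline activity that triggers such a notebook looks roughly like this in ADF’s JSON (a sketch only; the linked service name, notebook path, and parameter are hypothetical):

```json
{
  "name": "RunTransformNotebook",
  "type": "DatabricksNotebook",
  "linkedServiceName": {
    "referenceName": "AzureDatabricksLinkedService",
    "type": "LinkedServiceReference"
  },
  "typeProperties": {
    "notebookPath": "/Shared/transform-multi-delimiter-csv",
    "baseParameters": { "inputPath": "/mnt/datalake/raw" }
  }
}
```

The linked service holds the Databricks workspace URL and cluster settings, so the pipeline itself only needs to know which notebook to run and what parameters to pass.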
In the example shown below, I created a simple Databricks notebook to read CSV files that have been dropped into Azure Data Lake Storage. These could have easily been Excel, JSON, parquet, or some other file format as long as there is an extension to read them into data frames.
In this example, let’s examine how we can handle a CSV file with multiple delimiters.
After mounting ADLS Gen2 in the Databricks notebook, we can access all the files in the ADLS Gen2 store.
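Mounting is typically done once per workspace with `dbutils`. The following is a Databricks-specific sketch using OAuth; the storage account, container, secret scope, key names, and tenant placeholder are all stand-ins for your own values.

```python
# Databricks-only sketch: mount an ADLS Gen2 container via OAuth.
# All account/container/scope names below are placeholders.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": dbutils.secrets.get("my-scope", "client-id"),
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get("my-scope", "client-secret"),
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

dbutils.fs.mount(
    source="abfss://mycontainer@mystorageaccount.dfs.core.windows.net/",
    mount_point="/mnt/datalake",
    extra_configs=configs,
)
```

Once mounted, the container’s contents appear under `/mnt/datalake` to every notebook in the workspace.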
If we try to read it as a CSV, we aren’t able to parse all the columns.
So we have to read it as a text file instead.
Now we can get the header and schema as shown below.
Finally, we get the required output and write it back to the storage location as a CSV file.
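Since the notebook screenshots aren’t reproduced here, the core idea can be sketched in plain Python. The two-character delimiter `~|`, the sample rows, and the column names below are all made up for illustration; in the notebook the same steps are done with Spark’s text reader and a column split.

```python
# Sketch: a "CSV" whose fields are separated by a multi-character
# delimiter (here the hypothetical "~|") can't be parsed by a standard
# CSV reader, so we read raw lines and split them ourselves.

DELIM = "~|"

# Stand-in for the raw lines read from the file in ADLS Gen2.
raw_lines = [
    "playerID~|yearID~|salary",   # first line carries the header/schema
    "aaronha01~|1975~|240000",
    "ruthba01~|1930~|80000",
]

header = raw_lines[0].split(DELIM)                  # recover the schema
rows = [line.split(DELIM) for line in raw_lines[1:]]
records = [dict(zip(header, row)) for row in rows]  # rows as dicts

print(records[0])   # {'playerID': 'aaronha01', 'yearID': '1975', 'salary': '240000'}
```

From there, the records can be loaded into a data frame and written back out as a standard single-delimiter CSV.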
Note: we can seamlessly switch between languages in a notebook. This one is a Python-based notebook, but we can also switch to Scala, SQL, and R. The environment is also very interactive for debugging against large amounts of data, a nice feature since the ADF user interface can only show limited amounts of data in its debug sessions.
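Language switching is done with a magic command at the top of each cell. In a Python notebook, a sequence of cells might look like this (each block below is a separate cell; the table and variable names are illustrative):

```
%python
df = spark.read.parquet("/mnt/datalake/players")
df.createOrReplaceTempView("players")

%sql
SELECT playerID, salary FROM players ORDER BY salary DESC

%scala
val players = spark.table("players")
println(players.count())
```

Because all the cells share the same Spark session, a temp view registered in one language is immediately queryable from the others.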
Interestingly, ADF’s data flows are implemented as generated Scala code running on a managed Databricks cluster. This all happens behind the scenes, but it explains why it’s actually hard to come up with everyday use cases where data flows aren’t sufficient. For more information on transformations with ADB, check out the docs here.
As awesome as ADF is, it is true that it’s not always the be-all and end-all for ETL. There are still times when ADF is only part of the solution. For example:
- The server infrastructure hosting an on-premise or alternate-cloud database from which ADF needs to pull can’t host the integration runtime (a requirement for ADF to reach the data). In this case, we may have to design and build other ways to get the data out and into Azure where ADF can access it.
- Though reachable over the network, the source data is in a format ADF can’t read. In this case, we may have to build or leverage third-party software to extract the data into a format ADF can digest.
- The source data’s schema is so convoluted that Azure Databricks is necessary to pull off the transformation. However, the client may be unwilling to pay the hefty ongoing cost of ADB, or may not be comfortable supporting it down the road and would rather see a more traditional C# or T-SQL coded solution. In this case, we may fall back to a bare-metal approach.
Like most software projects, one size never fits all. But we highly recommend you give ADF a strong look for your next ETL adventure.