It’s a fair question and one that’s easily answered. But first, let’s review some fundamentals and see what caused such a disturbance in the force.
What is Azure Data Factory?
Azure Data Factory (ADF) is the Azure data integration service in the cloud that enables building, scheduling and monitoring of hybrid data pipelines at scale with a code-free user interface. Additionally, you can process and transform the data along the way by using compute services such as Azure HDInsight, Spark, Azure Data Lake Analytics, Azure SQL Database, Azure SQL Data Warehouse, SQL Server, Azure Databricks and Azure Machine Learning.
In summary, ADF is a fully managed ETL (Extract Transform Load) cloud service by Microsoft that joins other solutions from vendors such as Amazon (AWS Glue) and Google (Cloud Dataflow). Incidentally, the ETL market has no shortage of solutions from pure cloud options, to open source and traditional proprietary software vendors, and it’s common for most enterprises to use at least “one from each.”
How Does Azure Data Factory Differ From Other ETL Tools?
There are some nice features that distinguish ADF from the rest of the cloud pack. ADF has a clean user interface for defining dataflows, plus .NET, REST, and Python APIs if you feel the need to code your own ETL tasks. It can also run SSIS packages, which is handy if you have legacy SSIS integrations hanging around and are looking for a path to something more modern. Finally, there’s a Windows-based integration runtime (aka data gateway) that helps you retrieve data from on-premises systems.
What Can I Use Azure Data Factory For?
You can use ADF in the same way that you’d use any traditional ETL tool: modernizing a data warehouse, aggregating data for analytics and reporting, or acting as a collection hub for transactional data. However, the primary goal in this release is to migrate your data to Azure Data Services for further processing or visualization. It’s not surprising that Power BI is a good complement to ADF, as we can see from the diagram below. In this example, ADF aggregates blog comments from a blob store with data from an on-premises SQL Server. It then transforms the data and delivers the result to an Azure SQL Database. Finally, a Power BI dashboard visualizes the data and insights emerge! Perhaps we learn that avid blog readers also purchase the most widgets on Tuesdays.
The example above looks great, but it’s been carefully crafted to highlight the type of use case where batch-oriented ETL shines. The blog comments are variable data with small payloads and infrequent updates. Realistically, blog comments amount to a few MB at most, and even the greatest viral posts generate responses only every few seconds. Also, notice that the on-premises SQL Server data source is a look-up, where the tables hardly change and are just copied wholesale to the cloud. I think we need to consider a more realistic scenario where we can see ADF in the real world.
A Real-World Retailer
In this scenario, a retailer has launched a data warehouse modernization initiative and has decided to implement a cloud data lake. They’ve decided not to use an on-premises Hadoop solution, opting instead for Azure. The retailer is using Azure Data Factory to populate an Azure Data Lake Store, with Power BI providing visualizations and analysis. The high-level architecture looks something like the diagram below:
ADF Integration Runtime
On paper this looks fantastic. Azure Data Factory can access the field service data files via an HTTP service. The point-of-sale system and SAP can be accessed via the on-premises integration runtime and ODBC. The CRM can be accessed via the native SQL Server connector. The ADF “copy wizard” is used to create the pipelines, and the Azure Data Lake Store is loaded with data. I imagine the Power BI dashboard looks something like the illustration below:
Incremental Copy – A Common ETL Pattern
The obvious solution to keeping data fresh is to schedule Azure Data Factory pipelines to execute every few minutes. However, the ADF copy activity is primarily built for copying whole tables of data, or time-partitioned buckets of data files, rather than just the rows that have changed. So as the data source grows, you need to copy more data, and the tasks take longer to manage and execute.
One solution is to use the ADF “incremental copy template” and only copy new rows from the source. However, this process relies on the data source having a “watermark” table to keep track of database activity over time. In addition, extra pipeline logic is required to calculate the time stamps, retrieve the data for the relevant time period, and update the watermark table ready for the next ADF pipeline run. See below:
The watermark technique works well for batch-oriented use cases where the data source exposes fields that can be queried and compared against a control table.
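To make the watermark pattern concrete, here is a minimal Python sketch using sqlite3, with in-memory databases standing in for the source system and the watermark control table. All table names, column names, and timestamps are illustrative only, not ADF’s actual template:

```python
import sqlite3

# In-memory databases stand in for the source system and the control table.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (id INTEGER, amount REAL, modified_at TEXT)")
src.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, 9.99,  "2018-06-01T10:00:00"),
    (2, 24.50, "2018-06-01T11:30:00"),
    (3, 5.00,  "2018-06-02T09:15:00"),
])

ctl = sqlite3.connect(":memory:")
ctl.execute("CREATE TABLE watermark (table_name TEXT, last_value TEXT)")
ctl.execute("INSERT INTO watermark VALUES ('orders', '2018-06-01T10:30:00')")

def incremental_copy(run_until):
    """Copy only rows modified after the stored watermark, then advance it."""
    # 1. Read the old watermark from the control table.
    (low,) = ctl.execute(
        "SELECT last_value FROM watermark WHERE table_name = 'orders'"
    ).fetchone()
    # 2. Retrieve only the delta rows for the relevant time window.
    rows = src.execute(
        "SELECT id, amount, modified_at FROM orders "
        "WHERE modified_at > ? AND modified_at <= ?",
        (low, run_until),
    ).fetchall()
    # ...here a real pipeline would load `rows` into the target...
    # 3. Advance the watermark so the next run starts where this one ended.
    ctl.execute(
        "UPDATE watermark SET last_value = ? WHERE table_name = 'orders'",
        (run_until,),
    )
    return rows

delta = incremental_copy("2018-06-02T12:00:00")
print(delta)  # only rows 2 and 3 -- row 1 predates the stored watermark
```

Note how every run pays for the extra bookkeeping: a watermark lookup, a range query against the source, and a control-table update, which is exactly the “extra pipeline logic” described above.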
Understanding Change Data Capture (CDC)
Like ETL, Change Data Capture (CDC) is not a new concept. The technology emerged two decades ago to help replication software vendors deliver real-time transactions to data warehouses. It works via standard change-log mechanisms to continuously identify and capture incremental changes to data and data schemas from sources such as enterprise-grade databases. These small incremental changes are then automatically propagated to target systems via a low-latency data transfer. The impact on operational systems is also extremely low since the source systems don’t require integration agents or additional database queries. As a result, CDC has the benefit of being easier to administer and manage than other types of data integration. Furthermore, there’s no “batch window” and the data in target systems are always up-to-date.
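To make the mechanism concrete, here is a minimal Python sketch of how a CDC consumer replays a change log against a target. The event format is invented for illustration and is not Attunity’s actual format; the point is that the target stays in sync by applying only small incremental changes, in order:

```python
# A hypothetical stream of change-log events captured from a source database.
change_log = [
    {"op": "insert", "key": 101, "row": {"name": "Ada",   "balance": 100}},
    {"op": "insert", "key": 102, "row": {"name": "Grace", "balance": 250}},
    {"op": "update", "key": 101, "row": {"name": "Ada",   "balance": 175}},
    {"op": "delete", "key": 102},
]

def apply_changes(target, events):
    """Replay incremental changes in order so the target mirrors the source."""
    for ev in events:
        if ev["op"] == "delete":
            target.pop(ev["key"], None)
        else:
            # Inserts and updates are both handled as upserts on the key.
            target[ev["key"]] = ev["row"]
    return target

replica = apply_changes({}, change_log)
print(replica)  # {101: {'name': 'Ada', 'balance': 175}}
```

Because only the changed rows travel over the wire, there is no batch window to wait for, and the source system never has to answer a full-table scan, which is where the low operational impact comes from.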
Adding Attunity Replicate to Azure Data Factory
Now that we understand the concept of CDC, it’s easy to see how our retail scenario could benefit from adding Attunity Replicate to our architecture. A possible configuration might look something like the diagram below:
Now the data lake is updated automatically by Attunity Replicate when a customer makes a purchase (logged by DB2), an account gets updated (a CSR uses the SAP application), or a field report gets amended (for example, a service agent creates new site visit logs). Also, remember that should the source schemas change due to an event like an application upgrade, those changes are automatically reflected in the data lake too! We can use Azure Data Factory for our ETL-based data pipeline integrations, as in our first scenario. Then we can use Attunity Replicate to ingest transaction data into our Azure Data Lake Store in real time. The combined solution provides the best, most accurate data for our business analysts and users.
Comparing Azure Data Factory and Attunity Replicate
By now you should have a sense that although you can use both solutions to migrate data to Microsoft Azure, the two solutions are quite different. ADF is a cloud-based ETL service, and Attunity Replicate is a high-speed data replication and change data capture solution. Use ADF when you want to create dataflows that copy complete tables of information or incrementally load delta data in batch workflows. Use Attunity Replicate when you want to ingest large volumes of data into Azure Data Services and keep those data targets up to date with incremental data. The two solutions are extremely complementary and work well together. The illustration below is a handy comparison that you can print and stick to your wall as a reminder.
Attunity and Microsoft Strategic Partnership
Attunity and Microsoft have been business partners for over twenty years, from the early days of providing change data capture for Microsoft SQL Server Integration Services to a renewed OEM agreement announced just last month.
In November 2017, Attunity working with Microsoft launched a dedicated product called “Attunity Replicate for Microsoft Migrations” which was designed to accelerate the migration of enterprise data at scale from various database systems to the Microsoft data platform. Both companies sponsor this no-fee migration offering that you can download here.
The strength of the combined solution is easily demonstrated by customers who use both Microsoft Azure and Attunity. A recent example is a global telecommunications company that announced a strategic agreement to leverage Attunity as the key technology for its Microsoft Azure Data Lake initiative. Details can be seen here.
Finally, follow this link for all the information about the Attunity and Microsoft partnership.
Microsoft Azure Data Factory is the Azure data integration service in the cloud that enables building, scheduling and monitoring of hybrid data pipelines at scale with a code-free user interface. The ETL-based nature of the service does not natively support a change data capture integration pattern that is required for many real-time integration scenarios. Adding Attunity Replicate to an ADF installation is good practice and ensures that the combined solution delivers the most relevant and accurate data for business analysis.
Next Steps – Download Attunity Replicate for Microsoft Migrations
Register to download a free version of the award-winning Attunity Replicate that’s specifically tailored to help you migrate commercial data sources to the Microsoft Data Platform. In just a few minutes you’ll be able to easily ingest data into your Microsoft data products from diverse sources such as Oracle databases and enterprise data warehouses like Teradata. Try it for yourself today.