While few technology sectors really sit still, the data segment is especially dynamic. Open source innovators and vendors are churning out a dizzying set of architectural options for today’s architects and CIOs. Placing strategic bets has rarely been more difficult.
At Attunity, we advise some of the world’s largest enterprises as they whiteboard their data environments and data integration plans. We have found that our customers often change plans midstream, based on new business requirements or lessons learned. As they go through this trial-and-error process, they find that data ingest is a repeated requirement rather than a one-time event. Sources and targets may change over time. In addition, data now often flows through pipelines or multiple processing zones, often coming to rest in familiar analytics destinations: data warehouses or data marts.
Common Requirements for Modern Data Integration
Let’s explore common data integration requirements we help enterprises address. Most projects entail at least two of four primary data integration use cases. Here is a summary of the use cases and the motivation for each:
Managing Change Data Capture at a Managed Health Services Provider
As an example, we are working with the CIO of a managed health services provider to publish records in real time from a DB2 iSeries database to Kafka, which in turn feeds their Cloudera Kudu columnar database through Flume for high-performance reporting and analytics.
Our change data capture (CDC) technology starts the whole data flow by non-disruptively copying live records (millions each day) as they are inserted or updated on the production iSeries system. This is a great example of an initiative involving multiple use cases: data lake ingestion, database streaming and production database extraction. Many variables are changing at the same time.
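To make the downstream side of that flow concrete, here is a minimal sketch of a Kafka consumer reading the CDC change events. The topic name, broker address, and JSON envelope fields are illustrative assumptions, not this customer’s actual configuration or Attunity Replicate’s exact message format.

```python
# Minimal sketch: consuming CDC change events from a Kafka topic.
# Assumes the kafka-python client; topic, broker, and field names are hypothetical.
import json

from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "iseries.claims.changes",                  # hypothetical CDC topic
    bootstrap_servers="kafka-broker:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for message in consumer:
    change = message.value
    # A CDC envelope typically carries the operation type and the row image.
    operation = change.get("operation")        # e.g. INSERT, UPDATE, DELETE
    row = change.get("data", {})
    print(f"{operation}: {row}")
```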
The change does not stop there, because this customer has not fully settled on Kudu. They might instead (or also) run analytics on Hive, which effectively serves as a SQL data warehouse within their data lake, depending on what they learn about the analytics workload behavior.
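Because Hive exposes the ingested records through ordinary SQL, shifting (or splitting) the analytics workload between Kudu and Hive is largely a matter of pointing queries at a different engine. A minimal sketch of a warehouse-style query against Hive, assuming a Spark deployment with Hive support and hypothetical database, table, and column names:

```python
# Sketch: warehouse-style analytics on Hive tables in the data lake.
# Assumes Spark built with Hive support; names below are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-analytics-sketch")
    .enableHiveSupport()
    .getOrCreate()
)

# The same aggregation could later target Kudu through a different connector,
# which is why the choice between Hive and Kudu can stay open for a while.
daily_counts = spark.sql("""
    SELECT record_date, COUNT(*) AS record_count
    FROM lake.iseries_records
    GROUP BY record_date
    ORDER BY record_date
""")
daily_counts.show()
```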
Changing Data Integration Ingredients at an International Food Manufacturer
Another example of complexity and change is a Fortune 500 food manufacturer that is using our CDC technology to feed a new Hadoop data lake based on the Hortonworks Data Platform. They efficiently copy an average of 40 SAP record changes every five seconds, decoding that data from complex source SAP pool and cluster tables. Attunity Replicate publishes this data stream, along with periodic metadata and DDL changes, to a Kafka message queue that feeds HDFS and HBase consumers subscribing to the relevant message topics (one topic per source table).
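As a rough illustration of what one of those HBase-side consumers could look like, the sketch below subscribes to the per-table topics by pattern and writes each change as an HBase row. The topic pattern, row-key choice, column family, and message schema are assumptions for illustration, not the manufacturer’s actual implementation.

```python
# Sketch: an HBase-side consumer subscribing to per-table CDC topics.
# Topic pattern, row key, column family, and message fields are hypothetical.
import json

import happybase                              # pip install happybase
from kafka import KafkaConsumer               # pip install kafka-python

consumer = KafkaConsumer(
    bootstrap_servers="kafka-broker:9092",
    group_id="hbase-writers",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)
consumer.subscribe(pattern=r"^sap\..*")       # one topic per source SAP table

hbase = happybase.Connection("hbase-master")

for message in consumer:
    table_name = message.topic.replace(".", "_")      # e.g. sap.vbak -> sap_vbak
    record = message.value
    row_key = str(record.get("order_id", message.offset))
    # Write the change as one row; column family 'cf' is assumed to exist.
    hbase.table(table_name).put(
        row_key,
        {f"cf:{field}".encode(): str(value).encode() for field, value in record.items()},
    )
```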
Once the data arrives in HDFS and HBase, Spark in-memory processing helps match orders to production in real time and maintain referential integrity for purchase order tables within HBase and Hive. As a result, they have accelerated sales and product delivery with accurate real-time operational reporting. They have replaced batch loads with change data capture to operate more efficiently and more profitably.
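The matching step itself might look something like the Spark Structured Streaming sketch below, which joins a stream of purchase-order changes arriving via Kafka against production data already landed in Hive. The topic, table, and column names and the schema are hypothetical, and the Kafka connector is assumed to be on the Spark classpath.

```python
# Sketch: matching purchase-order changes to production runs with Spark.
# Assumes the spark-sql-kafka connector is available; all names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

spark = (
    SparkSession.builder
    .appName("order-matching-sketch")
    .enableHiveSupport()
    .getOrCreate()
)

order_schema = StructType([
    StructField("order_id", StringType()),
    StructField("material", StringType()),
    StructField("quantity", DoubleType()),
])

# Stream of purchase-order changes flowing in from the CDC -> Kafka pipeline.
orders = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka-broker:9092")
    .option("subscribe", "sap.purchase_orders")
    .load()
    .select(from_json(col("value").cast("string"), order_schema).alias("o"))
    .select("o.*")
)

# Reference production data already landed in the lake.
production = spark.table("lake.production_runs")

# Stream-static join: each incoming order is matched against production runs.
matched = orders.join(production, on="material", how="left")

query = (
    matched.writeStream
    .outputMode("append")
    .format("console")        # stand-in sink for illustration
    .start()
)
query.awaitTermination()
```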
But once again this is not the end of the story. They are now moving their data lake to an Azure cloud environment to improve efficiency and reduce cost.
We believe these and other companies are making the right strategic choices based on the best information available at each point in time. As our customers evolve, they are adopting a few guiding principles to navigate the complexity.
There is no decoder ring for architecting your data environment. Our customers are finding that these five guiding principles provide consistent guardrails to improve the odds of success.
To learn more about data integration: