Traditionally, the enterprise data warehouse has been the means by which a corporation consolidates its data for reporting and analytics. Today, the Hadoop data lake has emerged as a modern alternative—or more often, a complement—to the conventional data warehouse. A data lake is a large and diverse reservoir of corporate data stored across a cluster of commodity servers, most often running the Hadoop platform for efficient, distributed data processing. Its low cost, flexibility, and easy scalability make a data lake well suited to supporting big data analytics.

Determining What to Put Into a Data Lake

Part of the appeal of a data lake is that you can load and analyze, in its native format, data and content that would not go into a traditional data warehouse—web server logs, sensor logs, social media content, or image files and associated metadata. Data lake analytics can therefore encompass any historical data or content from which you may be able to derive business insights. But a data lake can play a key role in handling conventional structured data as well—particularly data that you offload from your data warehouse in order to control the warehouse's costs and improve its performance.

For the latter type of data, the question is exactly which data—and which ETL workloads—to offload from your enterprise data warehouse into your Hadoop data lake. Qlik Visibility (formerly Attunity Visibility) was built to answer exactly these sorts of questions. The Qlik Visibility solution for data warehouse optimization shows you which warehouse data is frequently used, rarely used, or never used, and which queries and ETL processes are consuming the most resources. Qlik Visibility also provides insight into the data access and resource consumption patterns of each application, user group, and individual user that accesses the data warehouse. Armed with these analytics around data usage and business value, you can make well-informed decisions about which data and workloads to offload to your data lake.
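Qlik Visibility's internals are not described here, but the core idea—classifying warehouse tables by how recently and how often they are queried, then flagging cold tables as offload candidates—can be sketched in a few lines. The following is a hypothetical illustration, not the product's actual logic; the query-log format, thresholds, and table names are all assumptions.

```python
from collections import Counter
from datetime import datetime, timedelta

def offload_candidates(query_log, now, cold_days=90, min_queries=2):
    """Classify warehouse tables by access frequency.

    query_log: list of (table_name, accessed_at) tuples drawn from
    the warehouse's query history.
    Returns tables not queried within the last `cold_days`, or queried
    fewer than `min_queries` times overall -- candidates to offload."""
    counts = Counter()
    last_seen = {}
    for table, ts in query_log:
        counts[table] += 1
        # Track the most recent access per table.
        last_seen[table] = max(last_seen.get(table, ts), ts)
    cutoff = now - timedelta(days=cold_days)
    return sorted(
        t for t in counts
        if last_seen[t] < cutoff or counts[t] < min_queries
    )

# Illustrative query history: "orders" is hot, "archive_2019" is cold.
now = datetime(2024, 6, 1)
log = [
    ("orders", datetime(2024, 5, 30)),
    ("orders", datetime(2024, 5, 25)),
    ("archive_2019", datetime(2024, 1, 10)),
]
print(offload_candidates(log, now))  # ['archive_2019']
```

In practice the same ranking would also weigh resource consumption per query and per user group, as the product description above notes, rather than access recency alone.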

Loading a Data Lake

Along with determining what should go into your data lake, a critical operational challenge is how to cost-effectively ingest data into the lake from diverse source systems. What is required is an enterprise-class Hadoop data integration tool, and here again Qlik (Attunity) has a proven solution.

Qlik Replicate (formerly Attunity Replicate) is a big data integration platform used by thousands of businesses worldwide, including many of the leading names in the Fortune 500. Designed to meet the demands of today's data-driven enterprises, Qlik Replicate is a unified solution for moving data from nearly any type of source system to nearly any type of destination system, including a Hadoop data lake. With Replicate you have a single solution for loading your data lake with data from any major database, data warehouse, mainframe system, file system, or SAP application. You can configure and execute bulk or real-time incremental data migration processes through an intuitive GUI with no need for manual coding—so you can load and continuously refresh a data lake without deep Hadoop coding expertise.


Five Principles for Effectively Managing Your Data Lake Pipeline