Data Lake Hadoop

The premium cost and rigidity of the traditional enterprise data warehouse have fueled interest in a new type of business analytics environment, the data lake. A data lake is a large, diverse reservoir of enterprise data stored across a cluster of commodity servers that run software such as the open source Hadoop platform for distributed big data analytics. A data lake Hadoop environment has the appeal of costing far less than a conventional data warehouse and being far more flexible in terms of the types of data that can be processed and the variety of analytics applications that can be developed and executed. To maximize these benefits, organizations need to carefully plan, implement and manage their data lake Hadoop systems.

Moving Data into a Data Lake Hadoop Environment

One of the primary attractions of a data lake Hadoop system is its ability to store many data types with little or no pre-processing. But with this source data agnosticism can come a couple of "gotchas" that businesses need to be aware of when planning a data lake Hadoop deployment:

  • Without careful planning and attention to the big picture, it's easy to end up with a hodge-podge of multiple data movement tools and scripts, each specific to a different source data system—making the data migration apparatus as a whole difficult to maintain, monitor, and scale.
  • Each separate data flow may require coding work by programmers with expertise in the source system interfaces and the Hadoop interface. This need for programming man-hours can become a bottleneck for launching a new data lake Hadoop system or making changes to an existing system once it is up and running.
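To make the maintenance problem above concrete, here is a hypothetical Python sketch (all handler names, source types, and landing paths are invented for illustration). Each handler stands in for a separate ingestion script or tool with its own conventions; every new source system adds another one to write, monitor, and keep in sync.

```python
# Hypothetical sketch: how per-source ingestion logic accumulates into
# a hard-to-maintain "hodge-podge". Each handler mimics a separate
# script or tool with its own target-path conventions.

def ingest_rdbms(table: str) -> str:
    # e.g. a batch export of one database table into HDFS
    return f"/data/landing/rdbms/{table}"

def ingest_logs(app: str) -> str:
    # e.g. a tail of application log files -- note the *different*
    # path convention this script's author happened to choose
    return f"/data/raw/logs/{app}"

HANDLERS = {"rdbms": ingest_rdbms, "logs": ingest_logs}

def ingest(source_type: str, name: str) -> str:
    # Every new source system means another handler here, and more
    # bespoke code to maintain, monitor, and scale.
    return HANDLERS[source_type](name)

print(ingest("rdbms", "orders"))   # /data/landing/rdbms/orders
print(ingest("logs", "billing"))   # /data/raw/logs/billing
```

A unified ingestion platform replaces this growing pile of one-off scripts with a single place to configure and monitor all flows.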

These data lake Hadoop problems can be avoided by using a purpose-built big data ingestion solution like Qlik Replicate (formerly Attunity Replicate). Qlik Replicate is a unified platform for configuring, executing, and monitoring data migration flows from nearly any type of source system into any major Hadoop distribution, including support for cloud data transfer to Hadoop-as-a-service platforms like Amazon Elastic MapReduce. It can also feed Kafka Hadoop flows for real-time big data streaming. Best of all, with Qlik Replicate data architects can create and execute big data migration flows without doing any manual coding, sharply reducing reliance on developers and boosting the agility of your data lake analytics program.
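The source does not describe Qlik Replicate's internals, but the general pattern such replication tools automate is change data capture (CDC): detect row-level changes at the source and emit them as ordered change events that a streaming platform like Kafka can carry into Hadoop. The sketch below illustrates that pattern only, using a simple snapshot diff (real CDC tools typically read the database transaction log instead; all data is invented).

```python
# Illustrative CDC sketch (not Qlik Replicate's actual mechanism):
# compare two {primary_key: row} snapshots and emit change events
# of the kind a streaming consumer such as Kafka could ingest.

def diff_snapshots(old: dict, new: dict) -> list:
    """Emit insert/update/delete events between two table snapshots."""
    events = []
    for key, row in new.items():
        if key not in old:
            events.append({"op": "insert", "key": key, "row": row})
        elif old[key] != row:
            events.append({"op": "update", "key": key, "row": row})
    for key in old:
        if key not in new:
            events.append({"op": "delete", "key": key})
    return events

before = {1: {"name": "Ada"}, 2: {"name": "Bob"}}
after = {1: {"name": "Ada L."}, 3: {"name": "Cy"}}
for event in diff_snapshots(before, after):
    print(event)  # each event could be published to a Kafka topic
```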

Maintaining Visibility into a Data Lake Hadoop Environment

A typical data lake Hadoop environment may span dozens or hundreds of nodes, hold a wide range of data types and ages, and support numerous analytics applications and user groups. System operators face difficult questions such as: Is the cluster the right size for our big data storage and processing needs? Will we need to expand soon, and by how much? Which data and resources in the Hadoop cluster are being used by which applications and which user groups? Is our data lake Hadoop environment meeting corporate compliance and data governance requirements?
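As a concrete example of one capacity-planning question above ("Will we need to expand soon, and by how much?"), here is a hypothetical sketch of the kind of calculation usage analytics makes possible. All figures and the function name are invented for illustration.

```python
# Hypothetical capacity-planning sketch: given current storage usage
# and the recent growth rate, estimate how soon the cluster fills up.

def months_until_full(used_tb: float, capacity_tb: float,
                      growth_tb_per_month: float) -> float:
    """Months of headroom left at the current storage growth rate."""
    if growth_tb_per_month <= 0:
        return float("inf")  # flat or shrinking usage: no deadline
    return (capacity_tb - used_tb) / growth_tb_per_month

# e.g. 600 TB used of a 1,000 TB cluster, growing 25 TB per month
print(months_until_full(600, 1000, 25))  # 16.0 months of headroom
```

In practice the value of a usage-analytics tool is supplying the measured inputs (per-application and per-group usage over time) rather than the arithmetic itself.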

Qlik Visibility (formerly Attunity Visibility) answers these questions and more by delivering comprehensive usage analytics for any Hadoop platform, on-premises or in the cloud. Qlik Visibility helps data architects and administrators improve capacity planning and more easily satisfy auditor inquiries about usage of sensitive data. It can also provide usage analytics for conventional enterprise data warehouse systems, helping operators identify data and workloads that are suitable for offloading from high-cost data warehouse environments to low-cost Hadoop environments.