Little Budget, Few Resources – How To Get Started On a Data Lake

I recently had the pleasure of co-presenting on a topic on the mind of many organizations – how to get a faster return on your Artificial Intelligence (AI) and Machine Learning (ML) initiatives. Companies across industries are investing heavily in AI and ML capabilities, but what continues to stymie their efforts to get timely and reliable insights is the lack of a single version of trusted data, ready for consumption on a continuous basis.

I co-led the Qlik-TDWI-AWS joint webinar, titled “One Source of Truth for AI & Analytics: Optimizing Your Data Lake Pipeline for Faster Business Insights,” with Fern Halper, VP of Advanced Analytics Research at TDWI, and Dilip Rajan, Partner Solutions Architect at AWS. The webinar focused on why a managed data lake is critical to the success of AI and ML programs and discussed key considerations for building a performant data lake.

Widely viewed across multiple geographies, the webinar generated more than 100 questions. I am addressing the top five most asked and/or most interesting questions below. You can also register and watch the webinar replay here to listen to us address additional questions.

1. Can you explain the difference between a Data Warehouse and a Data Lake? Which is better?

This question continues to come up in almost every discussion. Although a data warehouse and a data lake are both used to store and manage data, the two are completely different architectural approaches.

A data warehouse is a repository of structured, relational data, where the purpose of data use is defined at the time of storage. Data warehouses normally act as organizations’ systems of record and are designed to support high-level business intelligence (BI) and reporting initiatives. Although they are a great source of curated, analytics-ready data, data warehouses aren’t usually designed to handle raw, semi-structured or unstructured data.

Data lakes, on the other hand, store huge volumes of all kinds of data – structured, semi-structured or unstructured. No data is turned away. Data is loaded from source systems, primarily in raw format, without a defined purpose at the time of storage, making data lakes ideal for data exploration and experimentation, as well as AI, ML and Data Science initiatives.
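The contrast between the two approaches is often described as schema-on-write versus schema-on-read. Here is a minimal Python sketch of that idea, not tied to any particular product. The schema, function names and sample records are all hypothetical and exist only for illustration.

```python
import json

# Hypothetical schema, fixed up front the way a warehouse table would be.
ORDERS_SCHEMA = {"order_id": int, "amount": float}


def load_into_warehouse(table, record):
    """Schema-on-write: reject any record that does not match the schema."""
    if set(record) != set(ORDERS_SCHEMA):
        raise ValueError(f"unexpected fields: {sorted(record)}")
    table.append({name: cast(record[name]) for name, cast in ORDERS_SCHEMA.items()})


def land_in_lake(lake, raw_line):
    """Data lake: keep the raw payload untouched; no purpose is defined yet."""
    lake.append(raw_line)


def read_from_lake(lake, fields):
    """Schema-on-read: interpret raw payloads only when a use case needs them."""
    rows = []
    for raw_line in lake:
        try:
            doc = json.loads(raw_line)
        except json.JSONDecodeError:
            continue  # unstructured payloads stay in the lake for other uses
        if isinstance(doc, dict) and all(f in doc for f in fields):
            rows.append({f: doc[f] for f in fields})
    return rows


warehouse, lake = [], []
load_into_warehouse(warehouse, {"order_id": "1", "amount": "9.5"})
land_in_lake(lake, '{"order_id": 2, "amount": 3.0, "clicks": [1, 2]}')
land_in_lake(lake, "free-text support ticket")
print(read_from_lake(lake, ["order_id", "clicks"]))
```

The warehouse path refuses anything outside its schema, while the lake accepts every payload and defers interpretation to the reader, which is what makes it suited to exploratory work.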

Refer to the table below for more differences between data lakes and data warehouses.


Although both data warehouses and data lakes have been critical to enterprise data management, each comes with its own strengths and limitations. As a result, we are starting to see the emergence of the Data Lakehouse concept, where data warehouse and data lake platforms blend their capabilities into a more unified architecture that provides a single source of truth for all analytic initiatives, including BI, Streaming Analytics, ML and Data Science.

2. How does a company with multiple data architectures rationalize and modernize its environment in the cloud? We have a legacy on-prem data warehouse and a cloud-based data lake.

This is a great question, and if you have this problem, know that you aren’t alone. Most enterprises started with on-prem data warehouses but moved on to data lakes (on-prem or cloud-based) to overcome the limitations of traditional data warehouses. However, in many instances, they continue to use both data warehouses and data lakes. Cloud platforms, as well as the separation of storage and compute, offer great flexibility to organizations like yours that are looking to modernize their data architectures. Nonetheless, the approach you take depends on your data latency and analytics needs.

We have customers that are building smaller, more subject area-focused cloud data warehouses. We are also seeing customers leveraging their cloud data lake as a pre-staging ingest phase, landing raw data in the data lake and then extracting a subset of that data into a cloud data warehouse. No one approach suits every organization or business scenario, which is why Qlik Data Integration is designed to support customers across multiple architectural constructs as their needs evolve, across all major cloud platforms.

3. We have many systems, storing data in multiple, inconsistent formats. How can we use your capabilities to make the data consistent across systems in a central system? Also, how do you support typical ELT operations like match/merge, etc.? Does this require custom coding?

Qlik Data Integration automates data standardization, formatting and merging of raw change data files to create a full history of the data, without any need for custom coding. The solution also allows you to automate data pipelines from multiple sources into target systems of your choice via a simple, user-friendly web-based console. No manual scripting required.
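To make "merging raw change data to create a full history" concrete, here is a minimal Python sketch of the general technique. It is not Qlik's implementation. The `Change` record layout and the `merge_changes` function are hypothetical, and the sketch assumes each change carries a sequence number and that updates may contain only the changed columns.

```python
from dataclasses import dataclass


@dataclass
class Change:
    seq: int      # position in the change stream (assumed to be present)
    op: str       # "insert", "update" or "delete"
    key: str      # primary key of the affected row
    values: dict  # column values carried by the change


def merge_changes(changes):
    """Apply changes in sequence order; return (current_state, history)."""
    current = {}
    history = []
    for change in sorted(changes, key=lambda c: c.seq):
        if change.op == "delete":
            current.pop(change.key, None)
            row = None
        elif change.op in ("insert", "update"):
            # Merge onto the prior row so partial updates keep other columns.
            row = {**current.get(change.key, {}), **change.values}
            current[change.key] = row
        else:
            raise ValueError(f"unknown operation: {change.op}")
        history.append((change.seq, change.key, change.op, row))
    return current, history


changes = [
    Change(2, "update", "c1", {"city": "Oslo"}),
    Change(1, "insert", "c1", {"name": "Ann", "city": "Rome"}),
    Change(3, "insert", "c2", {"name": "Bob"}),
    Change(4, "delete", "c2", {}),
]
current, history = merge_changes(changes)
print(current)
```

The `current` dictionary plays the role of the latest view of each row, while `history` keeps every version in order. Writing and maintaining logic like this by hand for many sources is the kind of work that automation is meant to remove.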

4. How long does it take to migrate information to one platform (on average) and how much does your solution cost? Can you mention some of your customers and how they are benefiting from your solution?

The change data capture and replication component of the Qlik Data Integration solution can be set up, configured and moving data in under an hour. The time it takes to migrate data from a specific source to a target endpoint depends on the volume of data being moved. Pricing varies based on customers’ specific needs, such as the number of sources they wish to replicate data from, the CPU cores of the source systems and the number of target endpoints.

Qlik Data Integration is used by nearly 2,500 customers worldwide, with almost half of Fortune 100 companies using the platform to optimize their data pipelines. Our customers report a multitude of benefits, including lower compute costs, faster deployment timelines, lower build costs and faster decisions. In fact, Vanguard, one of the world’s largest investment management companies, publicly shared at AWS re:Invent in 2019 how it is leveraging a Qlik solution to replicate mainframe transactions to the AWS cloud with one-to-two-second latency, supporting peak volume in excess of 60M row updates per hour. Another customer, Ferguson, one of the largest suppliers of plumbing and HVAC equipment in the United States, reported the ability to migrate 27 databases in just six months using a Qlik solution, where it previously took them two years to move just two. You can read about these customers and more here.

5. How do you start a data lake in an environment where you have budgetary constraints and few resources?

This is a great question to end with, as I am sure it resonates with many.

The cloud is a great place to start, as it gives you the elasticity to buy what you need, from both a storage and a compute perspective. Qlik Data Integration supports all major cloud platform providers, including AWS, Microsoft Azure and Google Cloud Platform, as well as Cloudera and Databricks, giving you complete flexibility to choose your partner. Additionally, we fully automate the ELT code generation process required to stitch together near real-time data changes, freeing up your expensive and scarce data engineering and programming resources from coding tasks in a Hadoop-based data lake implementation. We also automate the entire process of data pipeline setup, configuration and management, so your resources can focus on higher-value analytic tasks.

In closing, many thanks to all webinar attendees for your terrific questions, which helped us show how you can achieve your AI, ML and Data Science goals! For those of you who weren’t able to make it, you can register to watch the webinar by clicking here.

Finally, sign up for a free trial of Qlik Replicate, part of the Qlik Data Integration platform, and experience the ease of use yourself.

Learn how to get started on a #datalake. Read our own Ritu Jain's latest blog post that expands on a recent webinar she co-led on optimizing your data lake pipeline.

 
