Moreover, with the rise of cloud-based data warehousing we’ve recently seen a dramatic shift in that dynamic. David provided his great overview from TDWI’s perspective. Noel expanded the conversation with broad market insights, and I wrapped it up with some practical implementation advice gleaned from customer projects. Also don’t panic If you missed our lively conversation. You can catch the replay here. The summary conclusion was that we’re seeing a rise in the adoption of new cloud data warehousing solutions because they tend to be more agile, cost-effective, and flexible than traditional solutions, resulting in analytics initiatives that help drive businesses forward.
The topic has evidently struck a chord since so many of you attended the live event AND there were so many great questions. In fact, too many to answer in the allotted time. Blogging however doesn’t have such constraints and so here are the answers to some fantastic webinar questions:
What is the best company to provide SaaS for cloud Data Warehouse?
There are several great cloud data warehouses on the market and Qlik’s Data Integration Platform supports Amazon Redshift, Azure SQL Data Warehouse, Google BigQuery and Snowflake equally. This helpful Datamation article lists the criteria that organization should consider when choosing a cloud data warehouse service. https://www.datamation.com/cloud-computing/top-cloud-data-warehouses.html
On the cloud, since databases are columnar (Redshift, Snowflake etc), is there a need for star schema?
This is another great question. There are three concepts with data warehousing:
- The data mart model (star schema)
- The workload or query characteristics (e.g. OLTP vs. OLAP)
- The physical data organization (columnar) on disk.
The data model (star schema) typically produces a data mart (subset of data warehouse data) that’s optimized for reporting and generally has no bearing on column-organized storage.
Is a star schema in the cloud good or bad from cost and performance perspective?
Cloud data warehouses generally charge you for storage and the compute processing of queries. With a star schema we generally increase the storage but reduce the time for query -based compute. Storage is typically cheaper than compute and so there’s definitely a trade-off to consider there. Also, every company has different data storage requirements and query frequencies that vary from business to business and even from department to departments. Therefore, it’s extremely difficult to estimate if a star schema has a negative effect on cost. However, we can be certain that cloud data warehousing is an order of magnitude more cost effective than traditional options regardless of design pattern.
What is the use case to have cloud DW in 10s of Petabytes versus leveraging data lakes?
Data lakes and data warehouses are both widely used for storing data, but they are not interchangeable terms. A data lake is a pool of raw unstructured data where the purpose may not be known at the time of ingest. A data warehouse is a repository of structured data where the purpose is generally known at the time of storage. Consequently, data lakes use the “schema on read” pattern and data warehouse generally use “schema on write”.
Data lake storage technology is typically file-based, and data warehouse technology is generally drawn from database theory. Therefore, data lakes are often used in industries where there’s a requirement to store lots of unstructured data (for example in healthcare there’s physician’s notes, clinical data, medical images, etc.) but the query load is unknown. Data warehouse use-cases focus on providing high-level reporting and analysis that lead to more informed business decisions.
In many the instances companies use BOTH a data warehouse AND a data lake. In fact we’ve started to see the data lake being leveraged as a “pre-staging” ingest phase where raw data is ingested into the lake and a subset later extracted and ingested into the “landing”question is not an While a data lake works for one company, a data warehouse will be a better fit for another.
How are people thinking about security differently?It is interesting to see fraud and customer analysis as primary drivers in the cloud given stricter security requirements everyone is facing.
Wow! This is a huge topic that is worth many dedicated blog posts but let me take a stab at answering this question. I don’t think folks are thinking differently about security, but high-profile security breaches and regulations are ensuring that anyone who handles data acts responsibly and takes appropriate measures to secure their customers information.
When looking to automate in the ETL/data warehousing space do you suggest buying a product or building something new specific to your needs?
To buy or build is the eternal question and each customer must make the determination for themselves. However, whatever you choose please remember that most ETL solutions may be great for initially ingesting data into the warehouse but offer little help for managing the data warehouse lifecycle. Most ETL solutions do not offer data modelling tools, don’t automatically create the internal data warehouse tables, and also require that you manually map the entities between various pipeline stages. As a result, ETL solutions can be very time consuming when applied to data warehouse automation.
That’s why we recommend that you a data warehouse automation
(DWA) solution to your cloud data warehouse implementation. I’m obviously
biased that you should choose Qlik
Compose because of its model-first approach and its automatic ETL-script
generation, but any DWA solution will provide some productivity enhancement.
Are there any new trends in data anonymization you are seeing?
There’s a few of areas where data anonymization is being increasingly used. We’re seeing data being anonymized in development, test and QA environments. We’re also seeing data anonymization being used for web traffic too. Finally, we’re also seeing anonymization as a best practice being enforced on PII data for certain data science use cases when correlation is required, but the actual values are “unimportant”. For example, the identity of specific users is not significant when we study behavior and usage patterns.
Could the panel talk a little about "data vault" and is it really being implemented in US vs. EU?
Data Vault is one of the “big 3” data warehouse design methodologies in use today. The others being Kimball, and Inmon. Unfortunately, I don’t have popularity statistics and so can’t provide an answer. However, I can comment that Qlik Compose is loosely based on the Data Vault approach and generates scripts that follow the Hub and Satellite pattern.
How does a company with a large investment in an on-prem data warehouse like HANA start to move to cloud native DWs like Snowflake. HANA on Cloud does not provide any cost benefit to us.
This is a great question to end on. This situation is one we’ve seen many times. Hana can remain on-premises and you can choose what data you want replicated to a Snowflake staging zone with Qlik Replicate. Then you can use Qlik Compose to create the data model, generate your ETL scripts and make data marts. While every customer situation is unique, the approach is generally the same, but the best advice is “come talk to us and work on a plan for you”.
The cloud has definitively brought new life and interest to the Data Warehousing domain. However, if we truly want to realize the benefits of cloud then the limitless flexibility and scalability still need to be combined with the lessons learned from decades of data warehousing implementations. Finally, check out the terrific webinar to hear the entire conversation and leave feedback below if you have comments on my answers. Yes indeed. Data Warehousing is most definitely back!