I believe I’ve struck a nerve with folks out there on the Internet concerning the webinar – “2020 and Beyond: Architecting Your Data Warehouse for the New Decade” – we held a few weeks ago. For those who didn’t attend, here’s a refresher. The topic focused on the architectural considerations for a new data warehouse, with additional commentary by David Stodder, Senior Director of Research at TDWI, and Matt Basile, Azure Data Program Manager at Microsoft. Both were excellent speakers.
The response to the webinar has been truly amazing. We’ve received almost 100 questions from the live event and post-event emails combined. Consequently, I’ve got tons of fodder for future blog posts, product enhancements and hot topics you’d like to hear about. I’d like to thank all attendees for their fantastic feedback, and I strongly suggest checking out the replay if you missed the live event.
With that said, I’d like to address some of the most frequently asked and interesting questions that came up during the show.
FAQ and Answers
Let’s start with something relatively easy:
Q: What is the difference between ETL and ELT?
A: ETL is an acronym for Extract, Transform and Load, and is a process typically associated with data integration, data warehouse loading and data lake ingestion. While modern architectures have sought alternatives to this type of data delivery, traditional ETL is still used in many companies today. However, ETL loading processes can take significant time to complete because the data transformation needs to finish BEFORE the data is loaded into the target. Extra latency can also occur if the transformations cannot be executed in parallel.
Many existing ETL integration technologies generally operate in a batch mode, require manual programming to map data sources to targets, and often need to deploy software agents that impose additional query workload burden on the data sources. These constraints often combine to restrict how much data ETL software can deliver in real time.
ELT on the other hand is Extract Load and Transform – meaning that data is extracted from the source system and loaded into the target system “as is,” with data transformations occurring later. As a result, ELT is often more performant than ETL since no heavy-duty data transformations occur BEFORE or during data transmission.
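To make the ordering difference concrete, here is a minimal sketch of the two pipelines. All of the function names and the in-memory “source” and “target” are invented for illustration; they stand in for real databases and integration tooling.

```python
# Illustrative only: contrasts where the transform step sits in ETL vs. ELT.

def extract(source):
    """Pull raw rows from the (here, in-memory) source."""
    return list(source)

def transform(rows):
    """Normalize rows -- e.g., uppercase names and drop empty records."""
    return [{"name": r["name"].upper()} for r in rows if r.get("name")]

def load(target, rows):
    """Append rows to the (in-memory) target."""
    target.extend(rows)
    return target

def etl(source, target):
    # ETL: the transformation must complete BEFORE loading,
    # which adds latency ahead of delivery.
    return load(target, transform(extract(source)))

def elt(source, staging):
    # ELT: load the data "as is" first; transform later,
    # after it has already landed in the target system.
    load(staging, extract(source))
    return transform(staging)

source = [{"name": "ada"}, {"name": None}, {"name": "grace"}]
print(etl(source, []))  # [{'name': 'ADA'}, {'name': 'GRACE'}]
```

Both paths produce the same rows; the difference is purely when the transformation runs relative to delivery, which is what drives the latency gap described above.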
ELT is the method typically used by Change Data Capture (CDC) technology. In this scenario, the data change is detected on the source system and then immediately replicated to the target system. This method is extremely performant and is ideal for high-frequency and high-volume transactional systems. In addition, little to no additional programming is required since data is not transformed before delivery.
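The change-detection-and-replicate loop can be sketched as follows. The change-log event format here is invented for illustration and is not the format used by any particular CDC product; real log-based CDC reads the database’s transaction log rather than an in-memory list.

```python
# Hedged sketch of log-based Change Data Capture: each change detected
# on the source is replayed onto the target "as is," with no
# transformation before delivery.

def apply_change(target, change):
    """Replay one change event (insert/update/delete) onto the target."""
    op, key = change["op"], change["key"]
    if op == "delete":
        target.pop(key, None)
    else:
        # insert or update: upsert the value exactly as captured
        target[key] = change["value"]
    return target

# Hypothetical stream of captured source-side changes, in commit order.
change_log = [
    {"op": "insert", "key": 1, "value": "open"},
    {"op": "update", "key": 1, "value": "shipped"},
    {"op": "insert", "key": 2, "value": "open"},
    {"op": "delete", "key": 2},
]

replica = {}
for change in change_log:
    apply_change(replica, change)

print(replica)  # {1: 'shipped'}
```

Because each event is applied immediately and untransformed, the replica stays current with the source at low cost, which is why this pattern suits high-frequency transactional systems.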
Modern integration solutions like Qlik Replicate use both the ELT approach and CDC to deliver masses of rapidly changing data to target systems. In addition, Qlik Replicate does not require developer programming, automatically propagates changes to the target schema, doesn’t require distributed software agents, and effortlessly connects popular data sources to a wide variety of data targets.
Q: Are we only bringing together data warehouses with data lakes because of the heritage? If you were starting from scratch, would you really start with a data warehouse or something else? Also, any opinion on the "Data Lakehouse" paradigm?
A: The Data Lakehouse has been a hot topic of late, and no, we aren’t bringing lakes and warehouses together because of heritage. In my opinion, two factors largely drive their combined use. Warehouses have traditionally operated with a “schema-on-write” mindset, whereby you define the warehouse structure when you first store the data. Lakes, on the other hand, operate on a “schema-on-read” approach, whereby you apply structure when the data is consumed. Each approach has its pros and cons, and they offer the best solution when used together.
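The schema-on-read idea is easiest to see in code. This is a small illustrative sketch with an invented two-field schema: the “lake” stores raw, untyped JSON lines, and structure is only imposed when the data is read.

```python
import json

def read_with_schema(raw_lines):
    """Schema-on-read: the lake accepted anything; types and defaults
    are applied only at consumption time (field names are invented)."""
    for line in raw_lines:
        doc = json.loads(line)
        yield {
            "id": int(doc.get("id", 0)),          # coerce string ids to int
            "amount": float(doc.get("amount", 0)) # default missing amounts
        }

# Raw lake contents: untyped strings, inconsistent fields.
lake = ['{"id": "7", "amount": "9.5"}', '{"id": 8}']

rows = list(read_with_schema(lake))
print(rows)  # [{'id': 7, 'amount': 9.5}, {'id': 8, 'amount': 0.0}]
```

A schema-on-write warehouse would instead have rejected or fixed those records at load time; the trade-off is upfront modeling effort versus cleanup cost deferred to every reader.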
The Qlik Data Integration platform provides the flexibility to support your architecture regardless of the approach you choose.
Q: My concern with diagrams, such as the "Data Warehouse Architecture Drill Down," is that there are a lot of hops from the data sources to the data consumers, which adds friction to delivery. Do you see a demand for decentralization, i.e., moving from tech to empowering others to do more, and, if so, does this affect potential aspects of that, such as data marts and associated responsibilities?
A: The diagram highlights a logical structure and does not suggest that data in each stage of the pipeline resides in different storage mechanisms. Indeed, in many situations, the various stages reside INSIDE of the data warehouse as a collection of materialized views, and the data does not actually move anywhere. Warehouse best practices have been developed to reduce the friction you describe and advise that your business users consume data from curated/governed data marts.
I think you’re implying “self-service” when you talk about “de-centralized access,” and there are a couple of facets to consider. You can certainly offer data marts to consumers in this paradigm if you have the infrastructure to control access and keep the marts fresh. Access to the raw data source can also be granted to your consumers, but the plethora of tables might be a barrier to easy analysis.
Q: “Matt mentioned a white paper about Synapse and Data Vault. Where can I get it?”
A: You can download the paper from the Microsoft website here. One of the paper’s key takeaways is that “[a]utomation gives your project a real productivity boost and is considered a good practice in any modern data warehouse service.” If you want more information about data warehouse automation for Azure Synapse Analytics, then download the Wiley dummies book Matt and I co-authored.
Apologies and Next Steps
I apologize if your question wasn’t mentioned above, but this post has already gotten longer than I anticipated. Let me know in the comments if you’d like me to continue posting answers, or if there are other topics to consider. Finally, for more background, read my first blog about the webinar and check out the replay.