I recently had the pleasure of delivering a Chairman's address at the Financial Information Management Association (FIMA) 2018 event in Boston. FIMA does an excellent job of creating an intimate setting for C-level professionals to explore all things data management in fintech.
The response to my presentation was such that I decided to turn it into a blog post. My address was part of the “Metadata and Data Quality” track, hence data cataloging was at the core of my presentation.
Advancing Metadata and Data Quality for Increased Efficiencies
I've been doing data management for large companies and financial institutions for most of my career, and it's been really interesting to see how waves of technical innovation are adopted, and how they succeed or fall short, in different institutions. To understand those successes and failures, we have to ask: What are the drivers? What are the business drivers behind data and metadata? Are they internal or external? Forrester Research sheds some light on these questions.
One of our customers, a bank I don’t have permission to name publicly, has no branch locations; a very different business model from traditional banking and investment management companies. We’re seeing the same thing in the insurance industry. This Forrester chart shows there are a lot of initiatives under way trying to make this transformation, and of course, the top drivers are about the customer experience.
However, the drivers below customer experience, namely multichannel integration, innovation in products and services, flexibility, and agility, are actually very dependent on data and metadata and on knowing as much as possible about your data assets. Here, the data drivers are diversity and complexity. You've got IoT, you've got online, you've got semi-structured and unstructured data, and the complexity of bringing it all together to get a holistic picture.
Businesses have to have audit trails, and data teams have to be able to manage and understand their data assets to achieve increased speed and scale. These data drivers mandate what we at Podium refer to as a “catalog-centric” or “catalog-first” strategy. And the reason we talk about the catalog being first is that we believe it’s absolutely critical for delivering automation and scale.
We also believe in cataloging continuously: this is not a one-time or intermittent effort; it has to be built into your processes. And something a little more radical: catalog everything. I'll talk a little bit about that.
A Catalog-First Strategy
Let’s further explore this idea of catalog first. One of the things we have found is that, increasingly, there are technologies out there that can scan your data universe, analyze your data, and create what we call a “smart data catalog.”
The smart catalog needs to handle data that doesn’t come from a pristine data warehouse source, but the raw data you're getting off a telemetry system, a web form, or a third party, with complex formats, dirty data, and so on. We believe the catalog should deliver value from the start through automated data profiling and by identifying things like sensitive data. This type of technology allows businesses to build a rich and robust catalog of their data, no matter where it lives. You then use the data and metadata you derive from it to start building out your data strategy, because if you're going to catalog a lot of data, you're not going to engineer and perfect all of it, but you will need to know where it is and what it is.
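To make that concrete, here is a rough, minimal sketch of the kind of automated profiling a smart catalog might run the moment raw data lands: field-level null rates, distinct counts, and a simple flag for values that look sensitive. The regex patterns and thresholds are illustrative assumptions, not how any particular product actually classifies data.

```python
import csv
import re
from collections import Counter

# Illustrative patterns for spotting likely sensitive fields. (Assumption:
# real catalog tools use far richer classifiers than these two regexes.)
SENSITIVE_PATTERNS = [
    re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),   # looks like an email address
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),      # looks like a US SSN
]

def profile_csv(path, sample_rows=10000):
    """Build a minimal field-level profile from a raw CSV extract:
    null rate, distinct count, and a sensitive-data flag per column."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        stats = {col: {"nulls": 0, "values": Counter(), "hits": 0}
                 for col in (reader.fieldnames or [])}
        total = 0
        for row in reader:
            total += 1
            for col, s in stats.items():
                value = row.get(col)
                if not value:
                    s["nulls"] += 1
                    continue
                s["values"][value] += 1
                if any(p.search(value) for p in SENSITIVE_PATTERNS):
                    s["hits"] += 1
            if total >= sample_rows:
                break
    return {
        col: {
            "null_rate": s["nulls"] / max(total, 1),
            "distinct_values": len(s["values"]),
            # Flag a column if most of its non-null values match a pattern.
            "looks_sensitive": s["hits"] > 0.5 * max(total - s["nulls"], 1),
        }
        for col, s in stats.items()
    }
```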
If, for example, you have three or four sources of the same kind of transaction information, you’ll want to pick the best source. We believe an automated catalog becomes the decision-support tool for exactly that question: "Which one of these should we use, and which ones should we make business ready?"
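Here is a toy example of that decision-support idea: score each candidate feed using profile metrics the catalog already holds, such as null rates and load recency. The metric names, weights, and example feeds are assumptions for illustration only.

```python
from datetime import datetime, timezone

def score_source(profile):
    """Rank a candidate feed by completeness and freshness (toy weights)."""
    completeness = 1.0 - profile["avg_null_rate"]              # fewer nulls is better
    age_days = (datetime.now(timezone.utc) - profile["last_loaded"]).days
    freshness = 1.0 / (1.0 + max(age_days, 0))                 # recent loads score higher
    return 0.7 * completeness + 0.3 * freshness

# Two hypothetical feeds carrying the same transaction data.
candidates = {
    "core_banking_extract": {"avg_null_rate": 0.02,
                             "last_loaded": datetime(2018, 4, 2, tzinfo=timezone.utc)},
    "legacy_gl_feed":       {"avg_null_rate": 0.15,
                             "last_loaded": datetime(2018, 3, 1, tzinfo=timezone.utc)},
}

best = max(candidates, key=lambda name: score_source(candidates[name]))
print("Make business ready first:", best)
```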
And we use “business ready” as a very deliberate term; it is not the same as data quality or master data, which sit at the extreme poles. In my view, quality is fit-for-purpose. Getting data business ready does require some cleansing and conforming, so you can, for example, use it for marketing analytics, where it doesn't have to be 100% correct for you to get a good insight. On the other hand, if you're doing regulatory reporting or client-facing reports, it has to be perfect. The degree of effort and engineering you put into identifying the data structure and preparing it differs accordingly. And to enable agility, you want to support multiple kinds of business readiness, so a data scientist can use data in a sandbox quickly while sensitive data is always protected.
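One lightweight way to picture multiple kinds of business readiness is as a set of policies the catalog attaches to a dataset. The tiers and settings below are hypothetical, just to show the spectrum from raw-but-protected to fully engineered; they are not a Podium feature list.

```python
# Hypothetical "business readiness" tiers expressed as catalog policies.
READINESS_TIERS = {
    "raw":       {"cleansing": "none",      "sensitive_data": "masked",            "typical_use": "engineering triage"},
    "sandbox":   {"cleansing": "light",     "sensitive_data": "masked",            "typical_use": "data science exploration"},
    "analytic":  {"cleansing": "conformed", "sensitive_data": "masked",            "typical_use": "marketing analytics"},
    "certified": {"cleansing": "full",      "sensitive_data": "access-controlled", "typical_use": "regulatory and client reporting"},
}
```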
And lastly, provisioning. Getting data should feel like an Amazon shopping experience: you should be able to refresh, browse, and search it, publish it out to other systems, and interact with the data, the metadata, and the catalog through APIs, all while respecting access controls.
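As a sketch of what that shopping-style provisioning could look like through an API, consider the hypothetical REST calls below. The endpoint paths, parameters, response shape, and token handling are assumptions for illustration, not any product's actual interface.

```python
import requests

# Hypothetical catalog API; access controls still apply via the bearer token.
BASE = "https://catalog.example.com/api/v1"
session = requests.Session()
session.headers["Authorization"] = "Bearer <access-token>"

# Browse and search the catalog the way you would a shopping site.
hits = session.get(f"{BASE}/datasets",
                   params={"q": "card transactions", "tag": "business-ready"}).json()

# Ask for the chosen dataset to be published out to a downstream system.
dataset_id = hits[0]["id"]
session.post(f"{BASE}/datasets/{dataset_id}/publish",
             json={"target": "marketing_sandbox"})
```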
In Summary
Our view of this catalog-first strategy does require some new technologies, such as data scanners, analyzers, and profiling tools. Some of them already reside in the Podium platform, but I think that, as a general principle, it's a good best practice.
Catalog continuously, and build metadata maintenance into your automated software processes, so that while you’re building a catalog and bringing in source data, you’re also bringing in the metadata that can answer questions like these (a brief sketch of how follows the list):
What sources should we migrate?
Where is there GDPR and PII data?
What duplicated and related data should we rationalize?
What is the profile, content, and quality of every field?
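Here is a minimal sketch of what building that metadata capture into an ingestion process might look like, using the toy profiler from earlier in this post; the catalog client object and its register_load method are hypothetical stand-ins, not real library calls.

```python
from datetime import datetime, timezone

def ingest_and_catalog(source_name, file_path, catalog_client):
    """On every load, record metadata alongside the data itself."""
    profile = profile_csv(file_path)                        # profile every field on every load
    pii_fields = [col for col, p in profile.items() if p["looks_sensitive"]]
    catalog_client.register_load(
        source=source_name,
        path=file_path,
        loaded_at=datetime.now(timezone.utc).isoformat(),
        field_profiles=profile,                             # profile, content, and quality of every field
        pii_fields=pii_fields,                              # where GDPR and PII data lives
    )
```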
Beware of the hype around data lakes and putting everything on one cluster. On the future platforms we're seeing, data is going to reside in the place that is the best fit for it. And data may have a lifecycle where it should sit on a given platform for a few years, a few months, or maybe just a day; it may be out in the cloud. This is where location-agnostic cataloging will help you adapt to future technologies.
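One way to picture location-agnostic cataloging is a catalog entry that keeps the logical dataset stable while its physical locations change over its lifetime, as in this illustrative sketch (the field names and URIs are assumptions).

```python
# A logical dataset with a history of physical homes; consumers resolve it by
# name and the catalog returns wherever it lives today.
catalog_entry = {
    "dataset": "card_transactions",
    "owner": "payments-data-team",
    "locations": [
        {"platform": "on_prem_hadoop",     "uri": "hdfs://prod/lake/card_txn",  "active": False},
        {"platform": "cloud_object_store", "uri": "s3://bank-lake/card_txn/",   "active": True},
    ],
    "retention": {"hot_days": 90, "archive_after": "1 year"},
}

def current_location(entry):
    """Return the URI of the currently active physical copy."""
    return next(loc["uri"] for loc in entry["locations"] if loc["active"])

print(current_location(catalog_entry))
```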