Prior to the pandemic, the AI and ML market was growing at a robust rate, projected to reach nearly $100B in global spending by 2023, according to IDC. That projection does not appear to have changed. If anything, from my perspective, the crisis has acted as a catalyst for that growth, with companies large and small deploying AI to run projections and prepare for what lies ahead.
Although AI holds great potential to address COVID-19-related disruptions across sectors, spurring novel uses such as touchless robo-deliveries in retail and remote diagnostics in healthcare, the Achilles' heel of AI and ML adoption continues to be appropriate data management. AI and ML models can be exceptional at recognizing patterns too subtle for human operators; to do so successfully, however, the models need to be exposed to and trained on huge volumes of data. And because there are no generic, all-purpose AI and ML models, the data also needs to be relevant: an AI solution that predicts, say, a customer's propensity to respond to a specific offer among thousands of offers, or one that identifies fraudulent activity among millions of transactions, needs to be trained for that particular task, with appropriate data.
Therein lies the problem. Although organizations have massive volumes of data, they lack the right technology infrastructure to ensure that the data is well defined, accessible, of the right quality and integrity, and consumption-ready for AI and ML. At the highest level, three key data challenges hamper the success of AI initiatives. The first challenge:
- Siloed, multi-format data streaming at different velocities. Organizations collect all types of data, such as text, transactional, survey, voice-of-customer (VOC), social media, image and location data, and store it in disparate systems across different business units and geographies. Depending on the source and type of data used to train a model, results can be skewed. To get the most accurate insights from AI and ML, you need a single, unified repository for all relevant data.
Data lakes can provide that single, central source of data to feed and train AI and ML models, thanks to their ability to store massive volumes of all types of data: structured, semi-structured and unstructured. But data lakes on their own offer little value to AI and ML initiatives, which brings us to the second challenge.
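To make the first challenge concrete, here is a minimal sketch of landing heterogeneous sources in one shared repository. The function names (`from_csv`, `from_json_lines`, `land`) and the in-memory "lake" are illustrative assumptions, not any particular product's API; a real pipeline would write to object storage and handle far more formats.

```python
import csv
import io
import json

def from_csv(raw):
    """Parse CSV rows into plain dict records."""
    return list(csv.DictReader(io.StringIO(raw)))

def from_json_lines(raw):
    """Parse newline-delimited JSON events into dict records."""
    return [json.loads(line) for line in raw.splitlines() if line.strip()]

def land(records, source, lake):
    """Append records into one shared store, tagging each with its source system."""
    for r in records:
        lake.append({"_source": source, **r})

# Two different formats from two different systems, one landing zone.
lake = []
land(from_csv("id,amount\n1,19.99\n"), "erp_export.csv", lake)
land(from_json_lines('{"id": 2, "channel": "web"}\n'), "clickstream", lake)
print(len(lake))  # 2
```

The point of the `_source` tag is that once everything sits in one repository, you can still trace each record back to the silo it came from.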
- Raw, unrefined data without consistent metadata. To train ML models, data consumers need to feed them a continuous stream of up-to-date, analytics-ready data. Although data lakes provide a single source of data, they are designed to store raw, untransformed data without common data definitions or metadata. Data with no tagging or common descriptions explaining what it means can't be used for ML, as it lacks the markers for what it is supposed to teach the model. Further, standardizing, formatting and refining raw data into a consumption-ready state can be time-consuming and code-intensive, requiring specialized skills.
Automation can help accelerate the transformation and refinement of raw data into an analytics-ready stage, while alleviating the need for specialized data engineering and programming skills. An integrated catalog can help generate rich metadata to ensure data is easily understandable and searchable.
However, the need for speed has to be balanced with the need for veracity and trust, which brings us to the third and most important data challenge.
- Lack of data confidence. You can blame it on Hollywood. With all the science-fiction movies perpetuating AI distrust, how do you come to believe the results of ML models? While one aspect of AI trust is driven by model auditability, the other, equally important aspect is having confidence in the data feeding those models: is it secure, can you establish lineage, who has access to which data? First-generation Hadoop-based data lakes exacerbated this distrust because they lacked standard data security and governance capabilities.
Data confidence has several equally important facets: change propagation to keep source and target schemas in sync; the ability to persist change history for end-to-end lineage; and integrated security and governance to enable enterprise-level access controls. With these components in place, your data scientists can bypass tedious data preparation and focus on high-value modeling and training tasks.
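The change-history facet above can be sketched as an append-only log that records every ingest, transform or schema change against a dataset, so its lineage can be replayed on demand. The `ChangeLog` class and its event fields are assumptions for illustration, not a real product's data model.

```python
from datetime import datetime, timezone

class ChangeLog:
    """Append-only change history for datasets, enabling simple lineage queries."""

    def __init__(self):
        self.events = []

    def record(self, dataset, operation, source, details=""):
        """Persist one change event with a UTC timestamp."""
        self.events.append({
            "dataset": dataset,
            "operation": operation,   # e.g. "ingest", "transform", "schema_change"
            "source": source,
            "details": details,
            "at": datetime.now(timezone.utc).isoformat(),
        })

    def lineage(self, dataset):
        """Return the ordered history of events for one dataset."""
        return [e for e in self.events if e["dataset"] == dataset]

log = ChangeLog()
log.record("orders_curated", "ingest", "crm_db.orders")
log.record("orders_curated", "schema_change", "crm_db.orders",
           "added column 'region'")
for e in log.lineage("orders_curated"):
    print(e["at"], e["operation"], "<-", e["source"])
```

Because the log is append-only, an auditor can answer "where did this column come from, and when?" without trusting anyone's memory, which is exactly the confidence gap described above.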
AI usage is only going to accelerate in the coming years. Well-architected, modern data lakes can provide you that single source of trusted, analytics-ready data to help you maximize return from your AI investments.
To learn how Qlik can help you architect data lakes of the future to accelerate your AI and ML journey, register for our upcoming webinar on June 3: “One Source of Truth for AI & Analytic Data: Optimizing Your Data Lake Pipeline for Faster Business Insights.” Jointly presented by me and experts from TDWI and AWS, the webinar will highlight the importance of managed data lakes for the success of AI and ML programs and the core capabilities required to build a performant data lake.
We look forward to talking with you and answering your questions on June 3.
In the meantime, we encourage you to learn more about our Data Lake Creation solution and try our Qlik Replicate product for free to see how quick and easy it is to ingest multi-source, multi-format data into a data lake and accelerate your AI and ML initiatives.