Most enterprises leverage a wide variety of data types in high volumes for big data analytics projects: social media data, internal data, log data, mobile device data, sensor data, free public external data, and the list keeps growing. In fact, by 2020 the world is projected to generate 50 times as much data as it does today, while the IT staff responsible for managing it grows only 1.5 times. On top of that challenge, only 40–55% of the data that companies load is ever used. When you consider that it costs $2–6 million to support every 50–100 TB of new data, dormant data left unmanaged represents a tremendous amount of waste.
But the exorbitant cost of poorly managed dormant data isn't just about storage. In fact, it's less about storage and more about CPU capacity. Most data warehouse vendors license by CPU capacity, so as CPU capacity increases, your licensing costs rise with it. Dormant data also drags down performance, since loading data can consume up to 60% of the CPU. A lot of data goes through ETL and transformation, may need to be retained in its original form for compliance, and yet is never used. As a result, it drives up costs and degrades performance unnecessarily.
Data Warehouse Management
The data warehouse is a reflection of the business: it grows in response to business needs, so it makes sense to analyze data activity and usage in business terms. When you group applications, data, or users in the context of the business (for example, by department or line of business), you can begin to analyze utilization and assign accountability via chargeback or showback. For example, when marketing requests more data, IT can show how much of marketing's existing data has never been used and what it costs to keep managing that data alongside the new data coming in. Once the conversation turns to what it costs to load and maintain data, and how much of it sits unused, a data set that seemed so important may lose some of its significance. The standing request might just lose its urgency, particularly if the cost to keep the data comes out of departmental budgets.
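As a rough sketch of what such a showback report might look like, the snippet below applies the $2–6 million per 50–100 TB cost range cited earlier to per-department usage figures. The department names and TB figures are hypothetical illustrations, not real data:

```python
# Hedged sketch: a per-department showback report for dormant data.
# Cost range comes from the $2-6M per 50-100 TB figure in the text;
# department names and usage numbers are made up for illustration.

COST_PER_TB_LOW = 2_000_000 / 100   # low end: $2M per 100 TB -> $20K/TB
COST_PER_TB_HIGH = 6_000_000 / 50   # high end: $6M per 50 TB -> $120K/TB

departments = {
    # name: (TB loaded, TB actually queried in the last 90 days)
    "marketing": (40.0, 18.0),
    "finance":   (25.0, 14.0),
}

def showback(total_tb, used_tb):
    """Return (dormant TB, low cost estimate, high cost estimate)."""
    dormant_tb = total_tb - used_tb
    return (dormant_tb,
            dormant_tb * COST_PER_TB_LOW,
            dormant_tb * COST_PER_TB_HIGH)

for name, (total, used) in departments.items():
    dormant, low, high = showback(total, used)
    print(f"{name}: {dormant:.0f} TB dormant, "
          f"~${low:,.0f}-${high:,.0f} to keep supporting it")
```

Putting a dollar range next to each department's unused terabytes is exactly the framing that makes a standing data request easier to renegotiate.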
Identifying dormant data recovers storage capacity, but it also reduces the cost of loading and transforming data. If you don't need the data anymore, you can stop loading it, which eliminates the portion of your ETL processes that was consuming CPU capacity on its behalf. If you do need the data, say for regulatory reasons, you can offload the ETL work of loading and transforming it onto a lower-cost Hadoop cluster. Either way, you recover not only storage capacity but also ETL CPU capacity: every data set you stop loading and ingesting into the EDW frees CPU on the system.
The key is to gain visibility into the EDW to learn what data is used and what data is unused.
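One simple way to get that visibility, sketched below, is to compare the warehouse's table catalog against the tables referenced in its query log over a lookback window; anything never queried in the window is a dormant-data candidate. The table names, log format, and 90-day window here are illustrative assumptions, not a specific vendor's API:

```python
# Hedged sketch: flag dormant tables by diffing the warehouse catalog
# against tables referenced in the query log within a lookback window.
# Table names and the (table, date) log format are hypothetical.

from datetime import date, timedelta

def find_dormant(catalog, query_log, lookback_days=90, today=None):
    """Return tables in `catalog` with no query-log reference in the
    last `lookback_days` days."""
    today = today or date.today()
    cutoff = today - timedelta(days=lookback_days)
    recently_used = {
        table for table, queried_on in query_log if queried_on >= cutoff
    }
    return sorted(set(catalog) - recently_used)

catalog = ["sales_fact", "web_clicks_raw", "legacy_orders"]
query_log = [
    ("sales_fact", date(2016, 5, 1)),
    ("legacy_orders", date(2014, 1, 10)),  # stale: outside the window
]
print(find_dormant(catalog, query_log, today=date(2016, 5, 15)))
```

In practice the query log would come from the warehouse's own usage-tracking views, but the principle is the same: dormancy is defined by absence from recent query activity, and the resulting list feeds both the showback conversation and the decision of what to stop loading or offload.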