Big data analytics is too broad a discipline for any single tool to cover completely. Big data analytics software is the primary tool, but below we briefly describe the related supporting technologies in the order they appear in the overall process.
Big data replication and change data capture (CDC) tools copy data from master sources to other locations. As described above, these tools provide fast data access, high performance, and an accurate backup of the data.
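To make the idea concrete, here is a minimal sketch of change data capture, assuming a source table with an `updated_at` version column; the table name, columns, and polling approach are illustrative, and real CDC tools typically read the database's transaction log rather than polling:

```python
import sqlite3

def capture_changes(source, replica, last_version):
    """Copy rows changed since last_version from the source to the replica."""
    rows = source.execute(
        "SELECT id, name, updated_at FROM customers WHERE updated_at > ?",
        (last_version,),
    ).fetchall()
    for row in rows:
        # Upsert each changed row so the replica converges on the source state.
        replica.execute(
            "INSERT OR REPLACE INTO customers (id, name, updated_at) VALUES (?, ?, ?)",
            row,
        )
    replica.commit()
    # Remember the highest version seen so the next run only picks up new changes.
    return max((r[2] for r in rows), default=last_version)

source = sqlite3.connect(":memory:")
replica = sqlite3.connect(":memory:")
for db in (source, replica):
    db.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)")

source.execute("INSERT INTO customers VALUES (1, 'Ada', 1), (2, 'Grace', 2)")
version = capture_changes(source, replica, 0)  # replica now mirrors the source
```

Tracking the last synced version is what keeps each run incremental, which is why CDC is faster than repeatedly copying the full dataset.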
Big data ingestion tools move raw big data from a variety of sources to a storage location such as a data warehouse or data lake.
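A batch-ingestion step can be sketched as landing raw records unmodified in a partitioned "raw" zone, a common first stop in a data lake. The directory layout, source names, and file naming below are assumptions for illustration:

```python
import json
import tempfile
from pathlib import Path

def ingest(records_by_source, lake_root, batch_date):
    """Land raw records from each source as newline-delimited JSON files."""
    written = []
    for source_name, records in records_by_source.items():
        # Partition the raw zone by source and batch date (illustrative layout).
        target_dir = Path(lake_root) / "raw" / source_name / batch_date
        target_dir.mkdir(parents=True, exist_ok=True)
        target = target_dir / "part-0000.jsonl"
        with open(target, "w") as f:
            for record in records:
                f.write(json.dumps(record) + "\n")
        written.append(target)
    return written

lake = tempfile.mkdtemp()
files = ingest(
    {"web_clicks": [{"user": 1, "page": "/home"}],
     "pos_sales": [{"sku": "A1", "qty": 2}]},
    lake, "2024-01-01",
)
```

Keeping the records untransformed at this stage preserves the original data, so downstream analytics can reprocess it in different ways later.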
Big data consolidation and storage tools, such as a Hadoop data lake, support big data analytics by making data available to be processed and used flexibly for deep analysis.
- As stated above, the Hadoop data lake framework is popular because it is free, open source software and its distributed computing model can quickly process big data. There are two key components of the Hadoop framework. The first is MapReduce, which distributes (maps) processing tasks to the nodes of the cluster and then reduces the results from each node into a combined answer for a given query. The second is a cluster management technology called YARN, which handles job scheduling and resource management across the cluster.
- A NoSQL database is another good option for raw, unstructured big data because it can handle a variety of data models and its non-relational data management system doesn’t require a fixed schema.
- In contrast, a data warehouse typically stores structured, filtered data that’s been processed for a specific purpose, and is therefore less suited to the often highly unstructured nature of big data. Learn more about data lake vs data warehouse.
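The map and reduce phases described above can be sketched with a toy word count. Hadoop runs these phases in parallel across the cluster's nodes; in this single-process sketch they run in sequence, and the function names are illustrative:

```python
from collections import defaultdict

def map_phase(document):
    """Map phase: emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in document.split()]

def shuffle(mapped):
    """Shuffle: group all emitted values by their key, as Hadoop does between phases."""
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce phase: aggregate the values for each key into a final result."""
    return {key: sum(values) for key, values in groups.items()}

documents = ["big data is big", "data lake"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle(mapped))  # {'big': 2, 'data': 2, 'is': 1, 'lake': 1}
```

Because each document is mapped independently and each key is reduced independently, both phases parallelize naturally, which is what lets Hadoop process big data quickly across many nodes.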