Apache Sqoop is an open source project that provides a simple and economical way to transfer bulk data from relational databases into Hadoop. It's a convenient solution for companies embarking on data lake initiatives, providing basic ingestion functionality such as full and incremental loads from a range of relational databases. It can also be integrated with Apache Oozie for scheduling and can load data directly into Apache Hive.

However, many users struggle with Apache Sqoop data ingestion and quickly encounter performance limitations as their deployments expand.

While Apache Sqoop is a convenient way to load data into Hadoop distributions such as Azure HDInsight, Amazon EMR, Cloudera, Hortonworks and MapR, it poses significant challenges for many big data initiatives. While the software is often free, it is extremely time-consuming to administer, optimize and monitor because of its dependence on manual scripting.

Apache Sqoop uses MapReduce to load data, and performance suffers as a consequence. To increase throughput, Sqoop splits an import across parallel map tasks known as "mappers". Using more mappers leads to more concurrent data transfer tasks and, often, faster job completion; however, it also increases the load on the source database, since Sqoop executes more concurrent queries against it. Increasing the number of mappers does not always speed up a job, and can produce the opposite effect: the cluster spends more time context switching than serving data.
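As a sketch of how this parallelism is configured, a typical Sqoop import might look like the following. The connection string, table, and split column are hypothetical; only the flags (`--num-mappers`, `--split-by`, etc.) are standard Sqoop options.

```shell
# Hypothetical example: import the "orders" table using 8 parallel mappers.
# Each mapper runs its own query against the source database, so raising
# --num-mappers increases both throughput and load on the source system.
sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username etl_user \
  --password-file /user/etl/.dbpass \
  --table orders \
  --split-by order_id \
  --num-mappers 8 \
  --target-dir /data/lake/orders
```

Sqoop divides the range of the `--split-by` column evenly among the mappers, so a skewed or non-indexed split column can leave some mappers doing most of the work regardless of how many are configured.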

A Sqoop Alternative With a Modern Data Integration Platform

Today, when IT managers are asked "What is data integration?" in the context of their own enterprise, the answers steer toward real-time integration between multiple source systems and multiple destination systems. For thousands of organizations worldwide, Qlik is at the center of this many-to-many data integration – increasingly in combination with a Hadoop data lake.

With Qlik Replicate® you can:

  • Use a graphical interface to create real-time data pipelines from producer systems into Hadoop, without having to do any manual coding or scripting. This point-and-click automation lets you get started on data lakes initiatives faster, and maintain the agility to easily integrate additional source systems as business requirements evolve.
  • Ingest data into Hadoop from a wide range of source systems, including all major database and data warehouse platforms.
  • Leverage Qlik Replicate agentless change data capture (CDC) technology to establish real-time data pipelines without negatively impacting the performance of the source database systems.
  • Monitor all your Hadoop data ingest flows through the Qlik Replicate console.
  • Configure Qlik Replicate to notify you of important events regarding your Hadoop ingest flows.

Comparing Apache Sqoop with Qlik Replicate

With Qlik Replicate you can move data where you want it, when you want it – easily, dependably, in real time, and at big data scale.

User Interface

  • Apache Sqoop: Command line. Every Hadoop ingestion task requires time-consuming, error-prone manual scripting.
  • Qlik Replicate: Intuitive GUI. Quickly and easily define Hadoop ingestion tasks, eliminating the need for manual coding.

Architectural Style

  • Apache Sqoop: Generic, query-based. Requires triggers or tables with timestamps. Sqoop captures only inserts and updates; it cannot capture deletes.
  • Qlik Replicate: Native transaction log based. Uses the database's native transaction log to identify changes, taking full advantage of the services and security the database provides.

Application and DB Modification Required

  • Apache Sqoop: Highly intrusive. Requires timestamps: you must either retrofit applications and databases that lack them, or establish a consistent representation across your data sources. This introduces significant development time and risk into your deployment.
  • Qlik Replicate: Non-intrusive. Does not require software to be installed on the source database, Hadoop node or cluster.

Overhead on Operational Systems

  • Apache Sqoop: Resource intensive. Requires significant I/O and CPU utilization, as queries are continuously run against source tables.
  • Qlik Replicate: Minimal resources. Identifies transactional changes from the native transaction log with minimal overhead.

DDL / Schema Changes

  • Apache Sqoop: No CDC for DDL/schema changes. Although Sqoop can load metadata into Apache Hive, it does not capture DDL changes. There is significant risk that the application will break when schema changes occur, requiring more development effort, resources and time.
  • Qlik Replicate: Schema aware. Automatically detects source schema changes and implements them on the target.

Latency

  • Apache Sqoop: High latency. Waits for specified query intervals plus query execution time, and these delays cause data sync issues. A Sqoop job is an atomic step that cannot be paused and resumed; if it fails, you must clean up and start again.
  • Qlik Replicate: Low/no latency. Real-time, immediate data delivery with no intermediary storage requirements.

Scalability

  • Apache Sqoop: Limited scalability. As previously mentioned, Sqoop can be slow to load data and is resource hungry because it uses MapReduce under the hood. Incremental pull is also difficult because each table requires its own incremental pull query.
  • Qlik Replicate: Linear scalability. A multi-server, multi-threaded architecture supports high-volume, rapidly changing environments. Multiple data centers, servers and tasks can be managed and optimized centrally with Qlik Enterprise Manager.
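To illustrate the query-based incremental approach described above, a Sqoop incremental import keys off a check column such as a timestamp. The connection details and column names below are hypothetical; the incremental flags themselves (`--incremental`, `--check-column`, `--last-value`) are standard Sqoop options.

```shell
# Hypothetical incremental import: fetches rows whose last_modified value
# is newer than the recorded --last-value. Inserts and updates are picked
# up this way; deleted rows never match the query and are silently missed.
sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username etl_user \
  --password-file /user/etl/.dbpass \
  --table orders \
  --incremental lastmodified \
  --check-column last_modified \
  --last-value "2024-01-01 00:00:00" \
  --merge-key order_id \
  --target-dir /data/lake/orders
```

Because each table needs its own check column and saved `--last-value`, incremental pulls must be scripted and tracked per table, which is the maintenance burden the "Limited scalability" row refers to.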

Add Qlik Replicate to your Hadoop data ingest workflows and experience a 40% performance improvement over Apache Sqoop.

Finally, Qlik Replicate helps you quickly load data from all major databases and data warehouses, on-premises and in the cloud, into Hadoop.
