The Shortcomings of Apache Sqoop
Firstly, I’m not a hater. I want to categorically state that there’s nothing inherently wrong with Apache Sqoop; it’s a perfectly good solution, especially if you have no alternative. It ships with many of the popular Hadoop distributions and works well for simple use cases. However, you must be willing to invest the time to learn how to optimize Sqoop configuration for peak performance. That means trawling the Internet for hours looking for tips and advice to improve the standard set-up.
Suffice it to say, let’s now review the top FIVE gotchas in an annotated summary.
- No graphical user interface (GUI) – You might not consider the lack of a GUI to be a showstopper, but it’s worth considering that EVERYTHING in Sqoop requires either issuing atomic shell commands or running hand-crafted scripts. Let’s break down a simple command that imports a table from MySQL into HDFS.
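A representative import command might look like the sketch below. The hostname, database, user, and table names are placeholders for illustration, not values from a real environment:

```shell
sqoop import \
  --connect jdbc:mysql://dbserver.example.com:3306/sales \
  --driver com.mysql.jdbc.Driver \
  --username sqoop_user \
  --password MySecret123 \
  --table customers \
  -m 4 \
  --target-dir /user/hadoop/customers
```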
The arguments of the Apache Sqoop import command are as follows:
--connect => Describes the JDBC connection string for the source database.
--driver => Tells Apache Sqoop which JDBC driver we will be using.
--username => The database user name.
--password => The password – in this example the password is in clear text. Obviously, this can lead to all sorts of issues; best practice is to use the standard Unix/Linux method of obfuscated password files or the Apache Sqoop metastore. Please don’t use clear-text passwords in your scripts.
--table => The name of the source database table that we’re going to load into Hadoop.
-m => The number of mappers (more on those later).
--target-dir => The HDFS location where you want to store the imported data.
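The password-file best practice mentioned above looks like this in a sketch (paths and the file name are illustrative). Sqoop reads the `--password-file` path from HDFS, so the file is staged there with restrictive permissions rather than being embedded in the script:

```shell
# Store the password in a file readable only by the job's user
echo -n "MySecret123" > /home/sqoop_user/.db-password
chmod 400 /home/sqoop_user/.db-password
hdfs dfs -put /home/sqoop_user/.db-password /user/sqoop_user/.db-password

sqoop import \
  --connect jdbc:mysql://dbserver.example.com:3306/sales \
  --username sqoop_user \
  --password-file /user/sqoop_user/.db-password \
  --table customers \
  --target-dir /user/hadoop/customers
```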
- Limited connectivity – Apache Sqoop realistically only imports data from files or popular relational databases. It takes a lot of effort to import live data from SAP or mainframes, especially in real time. Incidentally, Apache Sqoop incremental copy can only be performed from databases, but I’m getting ahead of myself. More on incremental copy later.
- Needs installation – Yes, I know all software requires installation, but Sqoop ships as a single binary that incorporates two separate components: a client and a server. The Sqoop server must be installed and configured on the Hadoop cluster. The client can be installed on any remote machine that will issue Sqoop commands.
- Incompatible versions – I’d be remiss not to mention that there are two different and incompatible versions of Apache Sqoop (1.4.7 and 1.99.7). The incompatibility means that you might not be able to share scripts between different installations. Incidentally, Sqoop 1.4.7 supports four major Hadoop releases, while version 1.99.7 supports Hadoop versions 2.6.0 or later.
- No schema detection or DDL change propagation – Some emails I received after my last article told me I was half right. Sqoop does perform some schema detection, but not in all cases. Sqoop doesn’t perform a schema check when loading into HDFS, since it’s essentially copying a file to the cluster. However, if you are using Sqoop to load into Hive, then the create-hive-table or hive-import arguments will read metadata. The gotcha here is that Apache Sqoop won’t automatically propagate changes to the data source schema to the target Hive database. Incremental changes have to be completed by hand.
By now you’ve got the point. Apache Sqoop is not perfect, but there were two bonus questions that repeatedly surfaced:
- Does Apache Sqoop do change data capture (CDC)?
- What impacts Apache Sqoop performance? How does Apache Sqoop’s performance compare to Attunity Replicate?
Apache Sqoop and Change Data Capture
For the most part I cited Apache Sqoop examples where I transferred an entire table’s content from one database into Hadoop as a one-time operation. In Attunity Replicate, we’d call that an “initial load or full copy”. Consequently, many folks asked “What if you needed to keep Hadoop in sync with the source table from the relational database? Say only update the rows that had changed.”
Obtaining a fresh, daily copy of data by reimporting ALL the data is obviously not optimal. Not only would the time needed to import the data proportionally increase, but it would also add an unnecessary performance burden to the source database, since the query data sets get larger and larger. Therefore, Apache Sqoop offers incremental imports for transferring deltas of data. You can see how this is accomplished in the example below.
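A sketch of such an incremental import follows; the connection details are the same illustrative placeholders as before, and the last three arguments are the ones that drive incremental mode:

```shell
sqoop import \
  --connect jdbc:mysql://dbserver.example.com:3306/sales \
  --username sqoop_user \
  --password-file /user/sqoop_user/.db-password \
  --table customers \
  --target-dir /user/hadoop/customers \
  --incremental lastmodified \
  --check-column last_update_date \
  --last-value "2018-05-22 01:01:01"
```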
This time I’ll only explain the additional parameters:
--incremental lastmodified => Tells Sqoop to use incremental mode and import only the data set determined by the next two parameters.
--check-column last_update_date => Specifies which column to examine when deciding which rows to import.
--last-value “2018-05-22 01:01:01” => Specifies the value of the check column from the previous import. In this example I’ve hardcoded the value to “only import data AFTER 1am May 22nd 2018”. The best practice is to create incremental imports as “saved jobs”; otherwise you’ll have to add additional logic to your BASH scripts to track and update this value. Note: another tool called “sqoop-job” is used to repeatedly execute saved jobs.
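A minimal saved-job sketch using the sqoop-job tool mentioned above (the job name is illustrative). The advantage is that Sqoop stores the job definition in its metastore and updates the last-value for you after each run:

```shell
# Define the saved job once
sqoop job --create daily_customers_sync -- import \
  --connect jdbc:mysql://dbserver.example.com:3306/sales \
  --username sqoop_user \
  --password-file /user/sqoop_user/.db-password \
  --table customers \
  --target-dir /user/hadoop/customers \
  --incremental lastmodified \
  --check-column last_update_date \
  --last-value "2018-05-22 01:01:01"

# Execute it repeatedly, e.g. from cron; Sqoop tracks the new last-value
sqoop job --exec daily_customers_sync
```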
Incremental mode requires that a column on the data source holds a date value that identifies when each row was last updated. Then Sqoop will import only those rows that were changed after the last import. However, Sqoop knows only what you tell it and the onus is on your source application to reliably update this column for every row change. If your data source is not modified, then incremental mode CANNOT work. Also importing incremental changes in data sources other than databases can be euphemistically described as “challenging”. OK, virtually impossible with a bog-standard Apache Sqoop installation.
I don’t want to pile more troubles onto Apache Sqoop, but you also need to consider how to manually handle failures, especially in the case of a partial import or saved job failure. Do you delete all the data and resubmit, or skip and log the rows that are failing? You’ll need to decide on a failure strategy and implement the logic to match, regardless of whether you use incremental or standard import mode.
Finally, you may also be wondering how Apache Sqoop implements incremental mode under the hood. The incremental import consists of two standalone MapReduce jobs. The first job imports the delta of changed data into a temporary HDFS directory. The second job merges the old and new data, keeping only the last updated value for each row. Can the two jobs impact cluster performance? Hold that thought!
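The keep-the-latest-row logic of that second merge job can be sketched in a few lines of Python. This is purely an illustration of the idea, not Sqoop’s actual MapReduce code; the column names `id` and `last_update_date` follow the earlier example:

```python
def merge_snapshots(old_rows, new_rows, key="id", ts="last_update_date"):
    """Merge an existing snapshot with a delta, keeping only the most
    recently updated version of each row (the idea behind Sqoop's merge)."""
    merged = {row[key]: row for row in old_rows}       # start from the old data
    for row in new_rows:                               # overlay the delta
        current = merged.get(row[key])
        if current is None or row[ts] > current[ts]:   # keep the newer row
            merged[row[key]] = row
    return sorted(merged.values(), key=lambda r: r[key])

old = [{"id": 1, "name": "Ann", "last_update_date": "2018-05-21"},
       {"id": 2, "name": "Bob", "last_update_date": "2018-05-21"}]
delta = [{"id": 2, "name": "Bobby", "last_update_date": "2018-05-23"},
         {"id": 3, "name": "Cat", "last_update_date": "2018-05-23"}]

print(merge_snapshots(old, delta))
```

Row 2 survives in its newer “Bobby” version, row 1 is carried over unchanged, and row 3 is appended, which mirrors what the merge job does with the temporary delta directory and the previous import.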
To clarify the answer to our first bonus question: Apache Sqoop does not do change data capture.
Performance, Performance, Performance
Most of the Apache Sqoop issues I’ve raised so far are easy to describe with screenshots, but how can I highlight performance issues? Well, so many people asked that I made a quick (non-scientific) “side-by-side” animation comparing Apache Sqoop and Attunity Replicate. But before we get there, let’s review why Apache Sqoop can exhibit such lackluster performance.
It’s important to understand that data ingest is done in two steps. Apache Sqoop first introspects the data source to gather the necessary metadata for the import job. The second step is a map-only Hadoop job that Sqoop submits to the cluster. It is the second job that does the actual data transfer using the metadata captured in the previous step.
As with incremental mode, the imported data is saved in an HDFS directory based on the table being imported. Also remember that since the import jobs are running on the SAME Hadoop cluster as your analytics jobs, there’s additional load on the servers that impacts the performance of ALL jobs. I’d suggest STOPPING analytics query jobs while importing data if you can, then starting them again ONCE the data import has completed.
You can also improve the performance of import jobs by increasing the m value (remember that from earlier in the blog post?) from the default of 4 to 8 or 16, but best practice suggests that you don’t increase the degree of parallelism beyond that available within your MapReduce cluster. According to the Apache Sqoop documentation, tasks will otherwise run serially and will likely increase the amount of time required to perform the import.
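Raising the mapper count looks like the sketch below (names are the same illustrative placeholders as earlier). Note that Sqoop splits the work across mappers by ranges of a column, so a `--split-by` column with evenly distributed values matters as much as the mapper count itself:

```shell
sqoop import \
  --connect jdbc:mysql://dbserver.example.com:3306/sales \
  --username sqoop_user \
  --password-file /user/sqoop_user/.db-password \
  --table customers \
  --split-by customer_id \
  --num-mappers 8 \
  --target-dir /user/hadoop/customers
```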
Finally, do not increase the degree of parallelism beyond what your source database can comfortably support. Connecting 100 concurrent clients to your source database may increase the load on the server to a point where performance suffers. In practice, diligent testing will reveal the maximum number of concurrent queries you can run.
What’s the one takeaway about performance? You’ll need to spend significant time testing and tweaking your set-up to find the optimal configuration for data ingest into your Hadoop cluster.
A Real-World Comparison
Now on to the animation. The video below is a side-by-side comparison of the real-time performance of Apache Sqoop and Attunity Replicate. Let me state that the video is NOT edited for time and the two software solutions performed the SAME test. The test was a simple use case that performed a full load from an Oracle source database into HDFS using the sample Oracle “EMP” table that consisted of 107 records. The total video runtime is about 40 seconds.
It’s clear that Apache Sqoop is slow. It took about 38 seconds to load all 107 records, while Attunity Replicate took just over 15 seconds (15.03), and that included not only transferring the data but also creating tables on the target system. That’s quite a difference. Out of the box, Attunity is almost two-thirds faster than Sqoop in this test. I want to emphasize that I didn’t scientifically instrument the load and used a stopwatch for approximate timings. I did this because I’m not so much interested in the specific times; I wanted to illustrate the massive difference in performance between two vanilla installations. I think you’ll agree that the difference is stunningly obvious.
Some Handy Resources
Still not convinced, or do you know someone who needs more persuading? Then here are a few more resources to hand them.
We know that everything in IT is a tradeoff and nothing is ever perfect. We typically balance time, performance and cost. As the saying goes, “pick any two!”. Apache Sqoop is a great place to start, but you’ll soon realize that it’s not really an “enterprise-grade” solution. Yes, it’s free, but you’ll spend significant time trying to improve performance. Then, even if you achieve acceptable results, incremental copy will almost certainly break you, especially if you want to ingest data deltas from non-relational sources or from relational sources where you CANNOT alter the underlying database to add timestamp columns. Hopefully I’ve explained in more detail why Apache Sqoop is challenging to optimize, and that you’ll consider augmenting your Apache Sqoop usage with an enterprise-grade solution like Attunity Replicate.