A few months ago, I wrote a post that debated the pros and cons of Apache Sqoop, concluding that Attunity Replicate was a better alternative for ingesting data into Hadoop. I also suggested adding Attunity Replicate to your data management arsenal. If you didn’t catch the original article, you can read it here; I can wait. One… Two… Three… OK, now that you’re back, I want to answer some of the questions I’ve received about Apache Sqoop in the months since.
The Shortcomings of Apache Sqoop
Firstly, I’m not a hater. I want to categorically state that there’s nothing inherently wrong with Apache Sqoop; it’s a perfectly good solution, especially if you have no alternative. It ships with many of the popular Hadoop distributions and works well for simple use cases. However, you must be willing to invest the time to learn how to optimize Sqoop configuration for peak performance. That means trawling the Internet for hours looking for tips and advice to improve the standard set-up.
Suffice it to say, let’s now review the top FIVE gotchas in an annotated summary.
The arguments of a typical Apache Sqoop import command are as follows (a full example follows the list):
--connect => The JDBC connection string for the source database.
--driver => Tells Apache Sqoop which JDBC driver we will be using.
--username => The database user name.
--password => The password. In this example the password is in clear text. Obviously, this can lead to all sorts of issues, so best practice is to use the standard Unix/Linux method of obfuscated password files or the Apache Sqoop metastore. Please don’t use clear-text passwords in your scripts.
--table => The name of the source database table that we’re going to load into Hadoop.
-m => The number of mappers (more on those later).
--target-dir => The HDFS location where you want to store the imported data.
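Putting those arguments together, a minimal full-table import might look like the sketch below. This is illustrative only: the connection string, credentials path, table name, and target directory are placeholders, and I’ve swapped the clear-text --password for --password-file in line with the best practice above.

```
# Minimal sketch of a full-table import (all values are placeholders)
# --password-file points at a protected file in HDFS rather than a clear-text password
sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --driver oracle.jdbc.OracleDriver \
  --username scott \
  --password-file /user/hadoop/.sqoop-password \
  --table EMP \
  -m 4 \
  --target-dir /data/raw/emp
```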
By now you’ve got the point. Apache Sqoop is not perfect, but there were two bonus questions that repeatedly surfaced:
Apache Sqoop and Change Data Capture
For the most part I cited Apache Sqoop examples where I transferred an entire table’s content from one database into Hadoop as a one-time operation. In Attunity Replicate, we’d call that an “initial load” or “full copy”. Consequently, many folks asked, “What if you need to keep Hadoop in sync with the source table in the relational database, say by only updating the rows that have changed?”
Obtaining a fresh, daily copy of data by reimporting ALL the data is obviously not optimal. Not only would the time needed to import the data increase proportionally, but it would also add an unnecessary performance burden to the source database, since the query data sets get larger and larger. Therefore, Apache Sqoop offers incremental imports for transferring deltas of data. You can see how this is accomplished in the example below.
This time I’ll only explain the additional parameters (a sketch of the full command follows the list):
--incremental lastmodified => Tells Sqoop to use incremental mode and import only the data set determined by the next two parameters.
--check-column last_update_date => Specifies which column to examine when deciding which rows to import.
--last-value “2018-05-22 01:01:01” => Specifies the value of the check column from the previous import. In this example I’ve hardcoded the value to “only import data AFTER 1am May 22nd 2018”. The best practice is to create incremental imports as “saved jobs”; otherwise you’ll have to add additional logic to your BASH scripts to track and update this value. Note: Another tool called “sqoop-job” is used to repeatedly execute saved jobs.
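To make that concrete, here’s a hedged sketch of the incremental import: first as a one-off command with the hardcoded --last-value, then as a saved job so the Sqoop metastore tracks the last value for you. The connection details, paths, and the EMPNO merge key are assumptions for illustration; also note that repeated lastmodified imports into an existing target directory need --append or --merge-key, which is easy to overlook.

```
# One-off incremental import with a hardcoded --last-value (placeholders throughout)
sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username scott \
  --password-file /user/hadoop/.sqoop-password \
  --table EMP \
  --target-dir /data/raw/emp \
  --incremental lastmodified \
  --check-column last_update_date \
  --last-value "2018-05-22 01:01:01" \
  --merge-key EMPNO

# Better: a saved job, so the metastore records the new last-value after each run
sqoop job --create emp_delta -- import \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username scott \
  --password-file /user/hadoop/.sqoop-password \
  --table EMP \
  --target-dir /data/raw/emp \
  --incremental lastmodified \
  --check-column last_update_date \
  --last-value "2018-05-22 01:01:01" \
  --merge-key EMPNO

# Execute (and re-execute) the saved job
sqoop job --exec emp_delta
```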
Incremental mode requires a column on the data source that holds a date value identifying when each row was last updated. Sqoop will then import only those rows that changed after the last import. However, Sqoop knows only what you tell it, and the onus is on your source application to reliably update this column for every row change. If your source cannot be modified to maintain such a column, then incremental mode CANNOT work. Also, importing incremental changes from data sources other than databases can be euphemistically described as “challenging”. OK, virtually impossible with a bog-standard Apache Sqoop installation.
I don’t want to pile more troubles onto Apache Sqoop, but you also need to consider how to manually handle failures, especially in the case of a partial import or saved-job failure. Do you delete all the data and resubmit, or skip and log the rows that are failing? You’ll need to decide on a failure strategy and implement the logic to match, regardless of whether you use incremental or standard import mode.
Finally, you may also be wondering how Apache Sqoop implements incremental mode under the hood. The incremental import consists of two standalone MapReduce jobs. The first job imports the delta of changed data into a temporary HDFS directory. The second job merges the old and new data, keeping only the last updated value for each row. Can the two jobs impact cluster performance? Hold that thought!
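If you want to see that second job in isolation, Sqoop exposes the merge step as its own tool. The sketch below is illustrative only: the delta and base directories, the EMP class and jar produced by sqoop codegen, and the EMPNO merge key are all assumptions; when you use --incremental lastmodified with --merge-key, Sqoop drives an equivalent merge for you.

```
# Generate the record class/jar that the merge tool needs
# (codegen prints the path of the generated jar; substitute it below)
sqoop codegen \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username scott \
  --password-file /user/hadoop/.sqoop-password \
  --table EMP

# Merge the freshly imported delta onto the existing base data,
# keeping the newest version of each row identified by the merge key
sqoop merge \
  --new-data /data/raw/emp_delta \
  --onto /data/raw/emp \
  --target-dir /data/merged/emp \
  --jar-file EMP.jar \
  --class-name EMP \
  --merge-key EMPNO
```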
To clarify the answer to our first bonus question: Apache Sqoop does not do change data capture.
Performance, Performance, Performance
Most of the Apache Sqoop issues I’ve raised so far are easy to describe with screenshots, but how can I highlight performance issues? Well, so many people asked that I made a quick (non-scientific) “side-by-side” animation comparing Apache Sqoop and Attunity Replicate. But before we get there, let’s review why Apache Sqoop can exhibit such lackluster performance.
It’s important to understand that data ingest is done in two steps. Apache Sqoop first introspects the data source to gather the necessary metadata for the import job. The second step is a map-only Hadoop job that Sqoop submits to the cluster. It is the second job that does the actual data transfer using the metadata captured in the previous step.
As with incremental mode, the imported data is saved in an HDFS directory based on the table being imported. Also remember that since the import jobs are running on the SAME Hadoop cluster as your analytics jobs, there’s additional load on the servers that impacts the performance of ALL jobs. I’d suggest STOPPING analytics query jobs while importing data if you can, then starting the queries again ONCE the data import has completed.
You can also improve the performance of import jobs by increasing the -m value (remember that from earlier in the blog post?) from its default of 4 to 8 or 16, but best practice suggests that you don’t increase the degree of parallelism beyond that available within your MapReduce cluster. According to the Apache Sqoop documentation, the excess tasks will run serially and will likely increase the amount of time required to perform the import.
Finally, do not increase the degree of parallelism beyond what your source database can comfortably support. Connecting 100 concurrent clients to your source database may increase the load on the server to a point where performance suffers as a result. In practice, diligent testing will reveal the maximum number of concurrent queries you can run.
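As a rough illustration of the kind of tuning involved, the sketch below bumps the mapper count and tells Sqoop which column to split on and how many rows to fetch per round trip. The numbers and the EMPNO split column are assumptions you’d need to validate against your own cluster and source database.

```
# Same import, tuned for more parallelism (values are illustrative, not recommendations)
sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username scott \
  --password-file /user/hadoop/.sqoop-password \
  --table EMP \
  --target-dir /data/raw/emp \
  --num-mappers 8 \
  --split-by EMPNO \
  --fetch-size 1000
```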
What’s the one takeaway about performance? You’ll need to spend significant time testing and tweaking your set-up to find the optimal configuration for data ingest into your Hadoop cluster.
A Real-World Comparison
Now on to the animation. The video below is a side-by-side comparison of the real-time performance of Apache Sqoop and Attunity Replicate. Let me state that the video is NOT edited for time and the two software solutions performed the SAME test. The test was a simple use case that performed a full load from an Oracle source database into HDFS using the sample Oracle “EMP” table that consisted of 107 records. The total video runtime is about 40 seconds.
It’s clear that Apache Sqoop is slow. It took about 38 seconds to load all 107 records, while Attunity Replicate took just over 15 seconds (15.03), and that included not only transferring the data but also creating the tables on the target system. That’s quite a difference: out of the box, Attunity Replicate is almost two-thirds faster than Sqoop in this test. I want to emphasize that I didn’t scientifically instrument the load and used a stopwatch for approximate timings. I did this because I’m not so much interested in the specific times; I wanted to illustrate the massive difference in performance between two vanilla installations. I think you’ll agree that the difference is stunningly obvious.
Some Handy Resources
Still not convinced, or do you know someone who needs more persuading? Then here are a few more resources to hand them.
Conclusion
We know that everything in IT is a tradeoff and nothing is ever perfect. We typically balance time, performance, and cost; as the saying goes, “pick any two!” Apache Sqoop is a great place to start, but you’ll soon realize that it’s not really an “enterprise-grade” solution. Yes, it’s free, but you’ll spend significant time trying to improve performance. Then, even if you achieve acceptable results, incremental copy will almost certainly break you, especially if you want to ingest data deltas from non-relational sources or from relational sources where you CANNOT alter the underlying database to add timestamp columns. Hopefully I’ve explained in more detail why Apache Sqoop is challenging to optimize, and you’ll now consider augmenting your Apache Sqoop usage with an enterprise-grade solution like Attunity Replicate.