Apache Sqoop Cookbook
Download Full Version of the eBook "Apache Sqoop Cookbook"
Download - Apache Sqoop Cookbook by Kathleen Ting - PDF
Whether moving a small collection of personal vacation photos between applications or moving petabytes of data between corporate warehouse systems, integrating data from multiple sources remains a struggle. Data storage is more accessible thanks to the availability of a number of widely used storage systems and accompanying tools. Core to that are relational databases (e.g., Oracle, MySQL, SQL Server, Teradata, and Netezza) that have been used for decades to serve and store huge amounts of data across all industries.
Relational database systems often store valuable data in a company. If made available, that data can be managed and processed by Apache Hadoop, which is fast becoming the standard for big data processing. Several relational database vendors championed developing integration with Hadoop within one or more of their products.
Transferring data to and from relational databases is challenging and laborious. Because data transfer requires careful handling, Apache Sqoop, short for “SQL to Hadoop,” was created to perform bidirectional data transfer between Hadoop and almost any external structured datastore. Taking advantage of MapReduce, Hadoop’s execution engine, Sqoop performs the transfers in a parallel manner.
If you’re reading this book, you may have some prior exposure to Sqoop—especially from Aaron Kimball’s Sqoop section in Hadoop: The Definitive Guide by Tom White (O’Reilly) or from Hadoop Operations by Eric Sammer (O’Reilly).
From that exposure, you’ve seen how Sqoop optimizes data transfers between Hadoop and databases. Clearly it’s a tool optimized for power users. A command-line interface providing 60 parameters is both powerful and bewildering. In this book, we’ll focus on applying the parameters in common use cases to help you deploy and use Sqoop in your environment.
Chapter 1 guides you through the basic prerequisites of using Sqoop. You will learn how to download, install, and configure the Sqoop tool on any node of your Hadoop cluster. Chapters 2, 3, and 4 are devoted to the various use cases of getting your data from a database server into the Hadoop ecosystem. If you need to transfer generated, processed, or backed up data from Hadoop to your database, you’ll want to read Chapter 5. In Chapter 6, we focus on integrating Sqoop with the rest of the Hadoop ecosystem. We will show you how to run Sqoop from within a specialized Hadoop scheduler called Apache Oozie and how to load your data into Hadoop’s data warehouse system Apache Hive and Hadoop’s database Apache HBase.
For even greater performance, Sqoop supports database-specific connectors that use native features of the particular DBMS. Sqoop includes native connectors for MySQL and PostgreSQL. Available for download are connectors for Teradata, Netezza, Couchbase, and Oracle (from Dell). Chapter 7 walks you through using them.