Apache Sqoop

Data ingestion is one of the crucial steps in the data lifecycle, and when the source is a relational database, Sqoop is a simple and effective tool for the job.

Apache Sqoop is a MapReduce-based command line utility that uses the JDBC protocol to connect to a database in order to query and transfer data to HDFS.

Apache Sqoop (SQL-to-Hadoop) is a tool designed to support bulk import and export of data between HDFS and structured data stores such as relational databases, enterprise data warehouses, and NoSQL systems. It is a data migration tool built on a connector architecture that supports plugins to provide connectivity to new external systems.

Why do we need Sqoop?

Analytical processing using Hadoop requires loading huge amounts of data from diverse sources into Hadoop clusters. Loading bulk data into Hadoop from heterogeneous sources and then processing it comes with a certain set of challenges: maintaining data consistency and ensuring efficient utilization of resources are some of the factors to consider before selecting the right approach for the data load.

Major Issues solved by Sqoop:-

**Data load using Scripts:-** The traditional approach of using scripts to load data is not suitable for bulk data load into Hadoop; this approach is inefficient and very time-consuming.

**Direct access to external data via Map-Reduce applications:-** Providing map-reduce applications direct access to data residing in external systems (without loading it into Hadoop) complicates those applications. So, this approach is not feasible.

An example use case of Sqoop is an enterprise that runs a nightly Sqoop import to load the day’s data from a production transactional RDBMS into a Hive data warehouse for further analysis.

Sqoop Architecture:-

All existing database management systems are designed with the SQL standard in mind. However, each DBMS differs in dialect to some extent, and these differences pose challenges when transferring data across systems. Sqoop connectors are the components that help overcome these challenges.

Data transfer between Hadoop and external storage systems is made possible with the help of Sqoop's connectors.

Sqoop has connectors for working with a range of popular relational databases, including MySQL, PostgreSQL, Oracle, SQL Server, and DB2. Each of these connectors knows how to interact with its associated DBMS. There is also a generic JDBC connector for connecting to any database that supports Java's JDBC protocol. In addition, Sqoop provides optimized MySQL and PostgreSQL connectors that use database-specific APIs to perform bulk transfers efficiently.
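For a database without a dedicated connector, the generic JDBC connector can be selected by supplying the JDBC driver class explicitly with --driver. A minimal sketch follows; the SQL Server host name and credentials are hypothetical, and the driver jar must already be on Sqoop's classpath:

sqoop list-tables \
--connect "jdbc:sqlserver://dbhost:1433;databaseName=retail_db" \
--driver com.microsoft.sqlserver.jdbc.SQLServerDriver \
--username sqoop_user \
--password sqoop_pass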

Sqoop supports the following four file formats:-

1) Text file format
2) Sequence file format
3) Avro file format
4) Parquet file format
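The format is chosen with a flag on the import command; --as-textfile is the default, and the others are --as-sequencefile, --as-avrodatafile, and --as-parquetfile. A minimal sketch, using the example MySql database described in the Commands section below:

sqoop import \
--connect "jdbc:mysql://cloudera.quickstart:3306/retail_db" \
--username root \
--password cloudera \
--table orders \
--as-avrodatafile \
--target-dir /queryresult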

Commands:-

Let's say we have the following MySql database and a Hadoop cluster to work with. We can use the following commands to import and export data between MySql and Hadoop.

MySql database:-

hostname - cloudera.quickstart
username - root
password - cloudera
port - 3306

Accessing/listing MySql databases from Hadoop using Sqoop:-

sqoop list-databases \
--connect "jdbc:mysql://cloudera.quickstart:3306" \
--username root \
--password cloudera

Accessing/listing MySql tables of a given database from Hadoop using Sqoop:-

sqoop list-tables \
--connect "jdbc:mysql://cloudera.quickstart:3306/retail_db" \
--username root \
--password cloudera

Displaying the data of a given database table from Hadoop using Sqoop eval:-

sqoop eval \
--connect "jdbc:mysql://cloudera.quickstart:3306/retail_db" \
--username root \
--password cloudera \
--query "select * from retail_db.customers limit 10"   # -e also can be used

Import data table from MySql to HDFS using Sqoop:-

Note:- This command will fail if the given table doesn't have a primary key. A Sqoop import job spins up four mappers by default, and a primary key column is necessary to distribute/split the workload among those mappers.

sqoop import \
--connect "jdbc:mysql://cloudera.quickstart:3306/retail_db" \
--username root \
--password cloudera \
--table orders \
--target-dir /queryresult

What if the given table doesn't have a primary key? We can use one of the following two strategies to import tables that do not have a primary key column; both are shown in the sections below.

1) Use only one mapper, thereby eliminating the need to distribute the workload among multiple mappers.

2) Explicitly specify the column that the Sqoop job can use to split the data among mappers.

Sqoop import execution plan/flow:-

When the Sqoop import utility is invoked, it:

1) Fetches the table metadata from the RDBMS and generates a Java class file for the table.

2) Compiles that Java file and packages it into a jar.

3) Runs a boundary query to fetch the min and max values of the primary key (or split-by) column.

4) Calculates the split size as (max - min) / 4, since four mappers are used by default, and assigns each mapper a range of rows.
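As a sketch of step 3, for the orders table with order_id as the primary key, Sqoop issues a boundary query along these lines (the exact SQL it generates may differ slightly):

select min(order_id), max(order_id) from orders

If order_id ranges from 1 to 68883, the split size is roughly (68883 - 1) / 4 ≈ 17220, so each of the four mappers imports about 17220 rows.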

Import data table from MySql to HDFS using Sqoop with one mapper:-

sqoop import \
--connect "jdbc:mysql://cloudera.quickstart:3306/retail_db" \
--username root \
--password cloudera \
--table orders \
--target-dir /queryresult \
--num-mappers 1                        # -m 1 can also be used

Import data table from MySql to HDFS using Sqoop by specifying split column:-

The command will fail if the column given in --split-by is non-numeric, or if the primary key being used for splitting is non-numeric.

To split on a non-numeric primary key or a non-numeric --split-by column, we should set the following property:

org.apache.sqoop.splitter.allow_text_splitter=true

sqoop import \
--connect "jdbc:mysql://cloudera.quickstart:3306/retail_db" \
--username root \
--password cloudera \
--table orders \
--split-by order_id \
--target-dir /queryresult
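When the split column really is textual, the property has to be passed as a generic Hadoop argument, which must appear before the tool-specific options. A minimal sketch, assuming order_status as a hypothetical text column to split on:

sqoop import \
-Dorg.apache.sqoop.splitter.allow_text_splitter=true \
--connect "jdbc:mysql://cloudera.quickstart:3306/retail_db" \
--username root \
--password cloudera \
--table orders \
--split-by order_status \
--target-dir /queryresult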

To import all tables from a given MySql database to HDFS:-

Note:- If --warehouse-dir is specified, Sqoop creates a subfolder inside the given folder for each table that is imported; this is the directive to use with import-all-tables.

If --target-dir is specified (for single-table imports), the data files are created directly inside the specified folder.

sqoop import-all-tables \
--connect "jdbc:mysql://cloudera.quickstart:3306/retail_db" \
--username root \
--password cloudera \
--warehouse-dir /queryresult

Import only selected columns from the given MySql table to HDFS using Sqoop:-

sqoop import \
--connect "jdbc:mysql://cloudera.quickstart:3306/retail_db" \
--username root \
--password cloudera \
--table customers \
--columns "customer_id,customer_fname" \
--target-dir /queryresult

Import with where clause:- We can control which rows are imported by adding a SQL where clause to the import statement.

sqoop import \
--connect "jdbc:mysql://cloudera.quickstart:3306/retail_db" \
--username root \
--password cloudera \
--table orders \
--where "order_status in ('complete', 'closed')" \
--target-dir /queryresult

Import by customizing the boundary query:-

sqoop import \
--connect "jdbc:mysql://cloudera.quickstart:3306/retail_db" \
--username root \
--password cloudera \
--table orders \
--boundary-query "select 1, 68883" \
--target-dir /queryresult

Sqoop auto reset to one mapper:-

While importing all tables from an RDBMS to HDFS, the import will fail if a table does not have a primary key defined. The --autoreset-to-one-mapper directive makes Sqoop fall back to a single mapper whenever such a table is encountered.

sqoop import-all-tables \
--connect "jdbc:mysql://cloudera.quickstart:3306/retail_db" \
--username root \
--password cloudera \
--autoreset-to-one-mapper \
--num-mappers 8 \
--warehouse-dir /queryresult

Using delimiters:-

By default, fields are terminated by commas, and lines are terminated by new line characters.

sqoop import \
--connect "jdbc:mysql://cloudera.quickstart:3306/retail_db" \
--username root \
--password cloudera \
--table orders \
--num-mappers 1 \
--fields-terminated-by '|' \
--lines-terminated-by ';' \
--target-dir /queryresult

Sqoop incremental import:-

Sqoop provides an incremental import mode which can be used to retrieve only rows newer than some previously imported set of rows.

Sqoop supports two types of incremental imports: append and lastmodified. We can use the --incremental directive to specify the type of incremental import to perform.

1) We should specify append mode when importing a table where new rows are continually being added with increasing row ID values. We specify the column containing the row IDs with --check-column, and Sqoop imports rows where the check column has a value greater than the one specified with --last-value.

2) We should use lastmodified mode when rows of the source table may have been updated, and each such update sets the value of a last-modified column to the current timestamp. Rows where the check column holds a timestamp more recent than the timestamp specified with --last-value are imported.

sqoop import \
--connect "jdbc:mysql://cloudera.quickstart:3306/retail_db" \
--username root \
--password cloudera \
--table orders \
--target-dir /queryresult \
--incremental append \
--check-column order_id \
--last-value 0

A lastmodified import supplied with --append will result in duplicate records in the HDFS directory. If we only want the latest version of each record in HDFS, we need to use --merge-key <merge-column> instead.


sqoop import \
--connect "jdbc:mysql://cloudera.quickstart:3306/retail_db" \
--username root \
--password cloudera \
--table orders \
--target-dir /queryresult \
--incremental lastmodified \
--check-column order_date \
--last-value "1970-01-01 00:00:00" \
--merge-key order_id

Create a Hive table based on a given database table:-

The create-hive-table tool populates the Hive metastore with a table definition based on the source database table.

sqoop create-hive-table \
--connect "jdbc:mysql://cloudera.quickstart:3306/retail_db" \
--username root \
--password cloudera \
--table orders \
--hive-table emps \
--fields-terminated-by ','

Compression techniques:-

We can compress the data by using the default gzip algorithm with the -z or --compress arguments.

sqoop import \
--connect "jdbc:mysql://cloudera.quickstart:3306/retail_db" \
--username root \
--password cloudera \
--table orders \
--num-mappers 1 \
--target-dir /queryresult \
--compress                        # -z can also be used

We can also specify any Hadoop compression codec using the --compression-codec argument.

sqoop import \
--connect "jdbc:mysql://cloudera.quickstart:3306/retail_db" \
--username root \
--password cloudera \
--table orders \
--num-mappers 1 \
--compression-codec BZip2Codec \
--target-dir /queryresult
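The codec can also be given by its fully qualified class name. A minimal sketch using Snappy, assuming the codec is available on the cluster:

sqoop import \
--connect "jdbc:mysql://cloudera.quickstart:3306/retail_db" \
--username root \
--password cloudera \
--table orders \
--num-mappers 1 \
--compression-codec org.apache.hadoop.io.compress.SnappyCodec \
--target-dir /queryresult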

Sqoop verbose:-

We can run a Sqoop job with the --verbose flag to generate additional logging and debugging information.

sqoop import-all-tables \
--connect "jdbc:mysql://cloudera.quickstart:3306/retail_db" \
--username root \
--password cloudera \
--verbose \
--warehouse-dir /queryresult

Export data from HDFS to a given MySql table:-

sqoop export \
--connect "jdbc:mysql://cloudera.quickstart:3306/retail_db" \
--username root \
--password cloudera \
--table card_transactions \
--export-dir /data/card_trans.csv \
--fields-terminated-by ','

Export data from HDFS to a given MySql table using an auxiliary staging table:-

If any of the mappers fail during the data export for any reason, it may lead to partial data being committed to the table. We can avoid this by specifying a staging table: data is first loaded into the staging table, and only if no errors are encountered is it then migrated from the staging table to the actual table.

If the staging table contains data and the --clear-staging-table option is specified, Sqoop will delete all of the data before starting the export job.

sqoop export \
--connect "jdbc:mysql://cloudera.quickstart:3306/retail_db" \
--username root \
--password cloudera \
--table card_transactions \
--staging-table card_transactions_stage \
--export-dir /data/card_trans.csv \
--fields-terminated-by ','
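If the staging table might still hold rows from a previous failed run, --clear-staging-table can be added so Sqoop empties it before the export starts. A minimal sketch based on the same tables as above:

sqoop export \
--connect "jdbc:mysql://cloudera.quickstart:3306/retail_db" \
--username root \
--password cloudera \
--table card_transactions \
--staging-table card_transactions_stage \
--clear-staging-table \
--export-dir /data/card_trans.csv \
--fields-terminated-by ','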

Sqoop Job:-

Saved jobs remember the parameters specified when the job was created, so they can be re-executed later.

If a saved job is configured to perform an incremental import, the state regarding the most recently imported rows is updated in the saved job so that the job continually imports only the newest rows.

The state of the job is saved locally in a hidden directory named .sqoop in the user's home directory.

sqoop job \
--create job_orders \
-- import \
--connect "jdbc:mysql://cloudera.quickstart:3306/retail_db" \
--username root \
--password-file file:///home/cloudera/.password-file \
--table orders \
--warehouse-dir /queryresult \
--incremental append \
--check-column order_id \
--last-value 0
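Once created, the saved job can be managed with the other sqoop job actions, sketched below for the job_orders job defined above:

sqoop job --list                    # list all saved jobs
sqoop job --show job_orders         # display the job's saved parameters
sqoop job --exec job_orders         # run the saved job
sqoop job --delete job_orders       # delete the saved job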