Sunday, 29 May 2016

Sqoop & SQL Server #8 - Incremental Imports

So far the examples shown have imported whole tables or subsets of tables. But if the data load is a regular occurrence, reloading data that has already been transmitted and overwriting the existing target is inefficient. The Sqoop import command allows incremental data loads, using either an incrementing integer column or a datetime column as a reference point at the time of the load.

Caution: both methods assume the table is being populated with new rows but existing rows are never updated or deleted.

Both methods require the --incremental argument, but the parameter it takes differs depending on the type of load. Let's start with the incrementing integer column, which uses the parameter append.

--incremental append

Two other arguments must be supplied: --check-column, which names the column holding the incrementing integer, and --last-value, which, unsurprisingly, holds the maximum value of that column from the previous load. The syntax is,

--check-column [column name]
--last-value [n]

An example of these arguments in an import command is,

sqoop import --connect "jdbc:sqlserver://10.88.12.0\\SQL2014;database=Utility;username=ClouderaAccess;password=GiveMeData" --table Customers --incremental append --check-column CustomerID --last-value 1000

The second option is to use a datetime column. That requires the parameter lastmodified for the --incremental argument, like this,
--incremental lastmodified

We continue to use --check-column and --last-value, but in this case the value has to be a datetime enclosed in quotes. Here is an example,

sqoop import --connect "jdbc:sqlserver://10.88.12.0\\SQL2014;database=Utility;username=ClouderaAccess;password=GiveMeData" --table Customers --incremental lastmodified --check-column CreatedDateTime --last-value "2016-05-23 21:11:00"

How the new rows land differs between the two methods. With append, Sqoop simply writes the new rows as additional files in the target directory. With lastmodified, importing into a target directory that already exists requires either --append or a --merge-key column; given a merge key, Sqoop reconciles the old and new versions of each row using a MapReduce merge job.
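For example, adding a merge key to the lastmodified import above lets Sqoop fold updated rows into an existing target directory. This is a sketch assuming CustomerID uniquely identifies a customer row,

sqoop import --connect "jdbc:sqlserver://10.88.12.0\\SQL2014;database=Utility;username=ClouderaAccess;password=GiveMeData" --table Customers --incremental lastmodified --check-column CreatedDateTime --last-value "2016-05-23 21:11:00" --merge-key CustomerID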

Recurring Incremental Data Loads
An extension to this functionality allows Sqoop to remember the last value in a local configuration database called the Sqoop Metastore. This database stores jobs which can be executed by any cluster user or by an automated script. SQL Server pros can compare it to the msdb database, which stores SQL Agent jobs and configuration. Sqoop has no scheduler of its own, but another Apache tool named Oozie can be used to run jobs on a schedule. Oozie is also installed on the Cloudera QuickStart VM.

The main benefit of creating a job is that you don't need to remember the last integer value or date. Sqoop handles that for you, updating the stored value after each run.

To create a recurring job you first need to define it. To do this, use the sqoop job command along with the --create argument and give the job a name. Here is the syntax,

sqoop job --create [my job name]

After this create statement we define the job with the import command and the arguments required, separating the job arguments from the embedded import tool with a lone --. This saves a job in the Sqoop metastore with the name you've defined. For example, to save the earlier append import as a job named testjob (a sketch reusing the connection details from above),
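sqoop job --create testjob -- import --connect "jdbc:sqlserver://10.88.12.0\\SQL2014;database=Utility;username=ClouderaAccess;password=GiveMeData" --table Customers --incremental append --check-column CustomerID --last-value 1000

Note the space on either side of the lone -- before import; Sqoop requires it. The saved job can then be run using the exec argument. For a job named testjob the example is as follows,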

sqoop job --exec testjob

To list the jobs stored in the metastore use the following argument,

sqoop job --list

You can display the definition of a job with the --show argument, which takes the name of the job you wish to display as a parameter. For example,

sqoop job --show testjob

And if a job is no longer needed you can remove it from the Sqoop metastore using the --delete argument, which also takes the name of the job as a parameter. For example,

sqoop job --delete testjob

More about the Sqoop Metastore
Sqoop stores its metadata in a small-footprint HyperSQL database (HSQLDB). This database can be shared so that other users can run the jobs stored within it. However, if you're going to share jobs across multiple users you may want to move the metastore to a more familiar database such as MySQL: the metastore can become corrupted, and in a production environment you'll want solid maintenance procedures.
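Before other users can connect remotely, the shared metastore service has to be running. Sqoop ships with a metastore tool for this; a minimal sketch, assuming sqoop-site.xml already points sqoop.metastore.server.location at a shared path and leaves the port at its default,

sqoop metastore

and, from another session on the same machine, it can be stopped with,

sqoop metastore --shutdown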

Other users can connect to the metastore using the --meta-connect argument with a JDBC URL pointing at the metastore host, in this case the IP address of the virtual machine; the metastore listens on port 16000 by default. For example,

sqoop job --create testjob \
  --meta-connect jdbc:hsqldb:hsql://[VM IP Address]:16000/sqoop \
  -- import ...
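(the -- import ... at the end stands for the full import definition shown earlier). Once the job lives in the shared metastore, any user can work with it by supplying the same URL. For example, to list and then execute the job from another machine, a sketch reusing the placeholder address above,

sqoop job --list --meta-connect jdbc:hsqldb:hsql://[VM IP Address]:16000/sqoop

sqoop job --exec testjob --meta-connect jdbc:hsqldb:hsql://[VM IP Address]:16000/sqoop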

A better model for incremental loads
In reality, if you have administrative access to the source SQL Server, or a good relationship with your DBAs, you're better off pushing the logic for identifying the new data back into a SQL Server view and using Sqoop to import from that view.
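For example, suppose the DBA publishes a view, here hypothetically named NewCustomers, that returns only the rows added since the last export. Sqoop imports a view with --table just as it does a table, but a view has no primary key to split on, so either name a split column or fall back to a single mapper,

sqoop import --connect "jdbc:sqlserver://10.88.12.0\\SQL2014;database=Utility;username=ClouderaAccess;password=GiveMeData" --table NewCustomers --split-by CustomerID --target-dir /user/cloudera/customers_delta

The target directory is also an assumption; point it wherever your incremental files should land.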



This is all well and good for tables that only receive inserts, but what about tables where rows are updated or deleted? How do you reflect those changes in HDFS? For that you can use the Sqoop merge command.
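As a preview, here is a minimal sketch of the merge tool. It folds a newer dataset onto an older one, keeping the latest version of each row identified by the merge key; the HDFS directories and the generated Customers record class below are assumptions for illustration,

sqoop merge --new-data /user/cloudera/customers_new --onto /user/cloudera/customers_old --target-dir /user/cloudera/customers_merged --jar-file Customers.jar --class-name Customers --merge-key CustomerID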
