Wednesday, 8 June 2016

Sqoop & SQL Server #12 - Sqoop Cheat Sheet

Theme
I want to…
Sqoop Example
General
See what version of Sqoop is installed.
sqoop version
General
List the databases I have permission to see.
sqoop list-databases --connect "jdbc:sqlserver://[IPAddress]:[PortNumber];username=[UserName];password=[Password]"
General
List the tables I have permission to see for a specific database.
sqoop list-tables --connect "jdbc:sqlserver://[IPAddress]:[PortNumber];database=[Database];username=[UserName];password=[Password]"
Import
Import a specific table into HDFS.
sqoop import --connect "jdbc:sqlserver://[IPAddress]:[PortNumber];database=[Database];username=[UserName];password=[Password]" --table [TableName]
Import
Import a specified table from a named instance into HDFS.
sqoop import --connect "jdbc:sqlserver://[IPAddress]\\[InstanceName];database=[Database];username=[UserName];password=[Password]" --table [TableName]
Import
Import a specific table from a specified schema into HDFS.
sqoop import --connect "jdbc:sqlserver://[IPAddress]:[PortNumber];database=[Database];username=[UserName];password=[Password]" --table [TableName] -- --schema [SchemaName]
Import
Import all tables from a database.
sqoop import-all-tables --connect "jdbc:sqlserver://[IPAddress]:[PortNumber];database=[Database];username=[UserName];password=[Password]"
Import
Import all tables from a database with specified exceptions.
sqoop import-all-tables --connect "jdbc:sqlserver://[IPAddress]:[PortNumber];database=[Database];username=[UserName];password=[Password]" --exclude-tables [TableName1],[TableName2]
Import
Import a specific table into a defined HDFS location.
sqoop import --connect "jdbc:sqlserver://[IPAddress]:[PortNumber];database=[Database];username=[UserName];password=[Password]" --table [TableName] --target-dir [Location]
Import
Import specified columns from a table into HDFS.
sqoop import --connect "jdbc:sqlserver://[IPAddress]:[PortNumber];database=[Database];username=[UserName];password=[Password]" --table [TableName] --columns "[Column1],[Column2],[Column3]"
Import
Import into HDFS using a T-SQL query.
sqoop import \
--connect "jdbc:sqlserver://[IPAddress]:[PortNumber];database=[Database];username=[UserName];password=[Password]" \
--query 'SELECT a.*, b.* FROM a JOIN b ON (a.id = b.id) WHERE $CONDITIONS' \
--split-by a.id --target-dir [Location]

Import
Spread the load of an import across the cluster nodes.
sqoop import --connect "jdbc:sqlserver://[IPAddress]:[PortNumber];database=[Database];username=[UserName];password=[Password]" --table [TableName] -m [NumberOfMappers]
Import
Import a specific table into a parquet format file in HDFS.
sqoop import --connect "jdbc:sqlserver://[IPAddress]:[PortNumber];database=[Database];username=[UserName];password=[Password]" --table [TableName] --as-parquetfile
General
Define an Options File for code reuse
#
# Options File for Sqoop Import
#

# Command
import

# Connection String
--connect
jdbc:sqlserver://[IPAddress]:[PortNumber]
--database [DatabaseName]

# Credentials
--username [UserName]
--password [Password]

# End

Import
Use an Options File to import data into HDFS
sqoop --options-file [LocationOfFile]/[NameOfFile] --table Customers

Import
Import a table into Hive
sqoop import --connect "jdbc:sqlserver://[IPAddress]:[PortNumber];database=[Database];username=[UserName];password=[Password]" --table [TableName] --hive-import
Import
Overwrite an existing Hive table during an Import
sqoop import --connect "jdbc:sqlserver://[IPAddress]:[PortNumber];database=[Database];username=[UserName];password=[Password]" --table [TableName] --hive-import --hive-overwrite
Import
Import a table into Hive and handle NULLs
sqoop import --connect "jdbc:sqlserver://[IPAddress]:[PortNumber];database=[Database];username=[UserName];password=[Password]" --table [TableName] --hive-import --null-string '\\N' --null-non-string '\\N'
Import
Import a table into Hive and remove Hive delimiters from the data
sqoop import --connect "jdbc:sqlserver://[IPAddress]:[PortNumber];database=[Database];username=[UserName];password=[Password]" --table [TableName] --hive-import --hive-drop-import-delims
Import
Import a table into a Hive partition
sqoop import --connect "jdbc:sqlserver://[IPAddress]:[PortNumber];database=[Database];username=[UserName];password=[Password]" --table [TableName] --hive-import --hive-partition-key [PartitionColumn] --hive-partition-value "[PartitionValue]"
Export
Export data from HDFS to SQL Server
sqoop export --connect 'jdbc:sqlserver://[IPAddress];database=[DatabaseName];instancename=[InstanceName];username=[UserName];password=[Password]' --table [SQLServerTableName] --export-dir [HDFSTableLocation]

Export
Export specific columns from HDFS to SQL Server
sqoop export --connect 'jdbc:sqlserver://[IPAddress];database=[DatabaseName];instancename=[InstanceName];username=[UserName];password=[Password]' --table [SQLServerTableName] --columns '[Column1],[Column2],[Column3]' --export-dir [HDFSTableLocation]

Export
Export data from HDFS to SQL Server and handle NULLs
sqoop export --connect 'jdbc:sqlserver://[IPAddress];database=[DatabaseName];instancename=[InstanceName];username=[UserName];password=[Password]' --table [SQLServerTableName] --export-dir [HDFSTableLocation] --input-null-string '\\N' --input-null-non-string '\\N'

Export
Update SQL Server data from HDFS using an Export command
sqoop export --connect 'jdbc:sqlserver://[IPAddress];database=[DatabaseName];instancename=[InstanceName];username=[UserName];password=[Password]' --table [SQLServerTableName] --export-dir [HDFSTableLocation] --update-key '[KeyColumn]'

Export
Update SQL Server data and Insert new rows from HDFS using an Export command
sqoop export --connect 'jdbc:sqlserver://[IPAddress];database=[DatabaseName];instancename=[InstanceName];username=[UserName];password=[Password]' --table [SQLServerTableName] --export-dir [HDFSTableLocation] --update-key '[KeyColumn]' --update-mode allowinsert
Export
Export data from HDFS to SQL Server in one batch process
sqoop export --connect 'jdbc:sqlserver://[IPAddress];database=[DatabaseName];instancename=[InstanceName];username=[UserName];password=[Password]' --table [SQLServerTableName] --export-dir [HDFSTableLocation] --batch

Tuesday, 7 June 2016

Sqoop & SQL Server #11 - Export from HDFS to SQL Server

Export
For those times when you need to get data back out of an HDFS cluster into a relational database, Sqoop export is the answer.

Export works in a similar way to the import command. First it checks the connection to the SQL Server before mapping the job and delivering it in chunks.

Unlike import, export can’t create a table on the target during the process. The table being exported into must already exist on the target database.

By default export will insert data into the target table, but you can use the --update-key argument, in which case the process will update existing records.

The --export-dir argument and one of --table or --call are required. These specify the directory in HDFS which contains the source data, and the table to populate in the database or, in the case of --call, the stored procedure to execute.
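
The --table form is used throughout this post. The --call form, which pushes each row through a stored procedure instead of an INSERT, would look something like the sketch below; the procedure name is hypothetical and shown purely to illustrate the shape of the command.

sqoop export --connect 'jdbc:sqlserver://[IPAddress];database=[DatabaseName];instancename=[InstanceName];username=[UserName];password=[Password]' --call [InsertCustomerProc] --export-dir [HDFSTableLocation]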

Basic Example
First we'll load data into a specific folder in HDFS to assist with the export examples.

sqoop import --connect "jdbc:sqlserver://10.88.12.0\\SQL2014;database=Utility;username=ClouderaAccess;password=GiveMeData" --table Customers --target-dir /user/cloudera/ExportExample1 -m 1

And on the target SQL Server we'll create a table ready to receive the data load.

CREATE TABLE dbo.CustomersExport
(
     CustomerID int NOT NULL,
     FirstName varchar(50) NOT NULL,
     Surname varchar(50) NOT NULL,
     DateOfBirth date NOT NULL,
     EmailAddress varchar(100) NULL,
     PostCode varchar(8) NULL,
     MarketingOptIn bit NULL,
     AutoRenewalOptIn bit NULL,
     CreatedDateTime datetime NOT NULL,
     ModifiedDateTime datetime NULL
) ON [PRIMARY]

GO

Now we're ready to execute the sqoop export.

sqoop export --connect 'jdbc:sqlserver://10.88.12.0;database=Utility;instancename=SQL2014;username=ClouderaAccess;password=GiveMeData' --table CustomersExport --export-dir /user/cloudera/ExportExample1

Whilst trying to get this to work I found a difference in the syntax from what I'd been using for imports. For example, the named instance is supplied as instancename=SQL2014 within the connection string rather than appended to the IP address. I tried amending the import syntax to reflect this new form and the import failed with a syntax error.

Exporting a Subset of Columns
By default, all columns within a table are exported. You can select a subset of columns with the --columns argument. The syntax for this argument is --columns followed by a comma-delimited list enclosed in quotes. For example, --columns "colA,colB,colC".

sqoop export --connect 'jdbc:sqlserver://10.88.12.0;database=Utility;instancename=SQL2014;username=ClouderaAccess;password=GiveMeData' --table CustomersExport --columns 'CustomerID,FirstName,Surname,DateOfBirth,CreatedDateTime' --export-dir /user/cloudera/ExportExample1

Handling Nulls
Similarly to Hive imports, exports can have issues with NULLs. The arguments --input-null-string and --input-null-non-string fix these issues.

If --input-null-string is not specified, then the string "null" will be interpreted as null for string-type columns. If --input-null-non-string is not specified, then both the string "null" and the empty string will be interpreted as null for non-string columns. The empty string will always be interpreted as null for non-string columns.

sqoop export --connect 'jdbc:sqlserver://10.88.12.0;database=Utility;instancename=SQL2014;username=ClouderaAccess;password=GiveMeData' --table CustomersExport --export-dir /user/cloudera/ExportExample1 --input-null-string '\\N' --input-null-non-string '\\N'

Staging Data - Unfortunately this option is not currently supported for SQL Server
Sqoop export breaks the export process down into multiple transactions. If something fails before all of the transactions have completed, a partial dataset will be present on the target database. Further errors can follow if we rerun the job and insert collisions or duplications occur.

You can overcome this problem by specifying a staging table via the --staging-table option, which acts as a buffer used to stage exported data. The data is then loaded to the target table in a single transaction after the stage has loaded completely.

In order for it to work, the staging table must exist prior to running the export job. The staging table must be identical to the target table and must either be empty or the --clear-staging-table option must be used. The --clear-staging-table option will delete all of the data before starting the load.
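
For connectors that do support staging, the shape of the command would be roughly as follows. Since the SQL Server connector does not currently accept these options, treat this purely as an illustration of the syntax with placeholder names.

sqoop export --connect '[ConnectionString]' --table [TargetTableName] --staging-table [StagingTableName] --clear-staging-table --export-dir [HDFSTableLocation]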

Inserts & Updates
By default, export inserts new rows, with each row transformed into an INSERT statement that is run against the target database table. The load must not violate any existing table constraints or it will fail. This mode is best used when loading data into a new, empty table.
The --update-key argument modifies the existing dataset in the target database. Each row is treated as an UPDATE using the column name(s) specified with --update-key as the matching criteria to ensure the correct row is updated. The argument --update-key can also be given a comma separated list of column names.

To set this scenario up let's change the surname of a customer in the target SQL Server table.

UPDATE Utility.dbo.CustomersExport
SET Surname = 'UpdateTest'
WHERE CustomerID = 1

And now run the update Sqoop export process.

sqoop export --connect 'jdbc:sqlserver://10.88.12.0;database=Utility;instancename=SQL2014;username=ClouderaAccess;password=GiveMeData' --table CustomersExport --export-dir /user/cloudera/ExportExample1 --update-key 'CustomerID'

If an UPDATE statement modifies no rows the export will silently continue. If the column specified with --update-key does not uniquely identify rows then multiple rows are updated by a single statement.

You may also specify the --update-mode argument with allowinsert mode if you want to update rows that already exist in the database and insert rows that do not exist yet. The default is updateonly, which you shouldn't need to specify unless you have new rows in the source that you don't want to export; in that case pass --update-mode updateonly explicitly.

sqoop export --connect 'jdbc:sqlserver://10.88.12.0;database=Utility;instancename=SQL2014;username=ClouderaAccess;password=GiveMeData' --table CustomersExport --export-dir /user/cloudera/ExportExample1 --update-key 'CustomerID' --update-mode allowinsert

Export in Batch Mode
Instead of firing rows into the target database one at a time you can tell Sqoop to batch up the exports. By adding the argument --batch to an export statement Sqoop will complete the work on the cluster before writing any data to the target server.

This could be a useful option if the receiving database doesn't like the drip feed of huge numbers of rows. 

sqoop export --connect 'jdbc:sqlserver://10.88.12.0;database=Utility;instancename=SQL2014;username=ClouderaAccess;password=GiveMeData' --table CustomersExport --export-dir /user/cloudera/ExportExample1 --batch

Transactional consistency
During export jobs Sqoop will spawn multiple processes on the target database which run in parallel. These processes will complete at different times leaving the target table incomplete until all jobs have finished. Sqoop commits data every few thousand rows to make good use of memory. The target table will be available for query during the load.

Sqoop can stage data so inconsistency is not seen in the target object. But this is slow as the process needs to wait for the job to complete and then load the entire contents into the target object from the stage object.

It's worth speaking to the administrators about performance as poorly configured Sqoop jobs may cause unacceptable issues on the source/target database.

Thursday, 2 June 2016

Sqoop & SQL Server #10 - SQL On Hadoop

So you've managed to Sqoop some data into HDFS. Great. What now? You'll want to query it. But what with? You don't know Java so writing a MapReduce job is right out. You know SQL, so why not query the data using SQL? For that you'll need to use one of the many SQL on Hadoop tools that have emerged to make data analysis in Hadoop available to the large number of data professionals who know a variant of SQL.

Since we're concerned here with Sqoop, we'll concentrate on Hive with a special mention for Impala. But there are other SQL on Hadoop tools out there which would satisfy most use cases. 

Hive
Developed by Facebook as a data warehousing framework over Hadoop, Hive provides a SQL-like language called HiveQL which is accessible to those with SQL knowledge. Hive converts queries into MapReduce jobs for processing. Since MapReduce can be slow, query times are also slow compared to SQL Server or other RDBMSs. Hive maintains a metastore which acts as a metadata manager.

Impala
Developed by Cloudera as an MPP query engine over Hadoop, Impala also provides an accessible SQL-like language. Instead of using MapReduce, Impala sends queries to agents installed on each data node in the cluster, which then return their element of the query back to Impala. Impala removes MapReduce as a bottleneck by controlling the query execution itself and utilising the data nodes along with the Hive metastore to map out the query. This makes Impala many times faster than Hive, with execution speeds nearing - but not quite reaching - traditional RDBMSs. Impala is designed for quick data analysis, but when used against large datasets memory issues can be a problem. In cases where you absolutely need the query to complete or have large volumes of data, Hive is the best bet. In cases where you want an answer fast over smaller data sets, Impala is the right choice.

Importing Data Into Hive
By adding the --hive-import option to your Sqoop import command line you can simply load the data into Hive. Sqoop will examine the metadata of the table being imported and create a table in Hive. The data types won't match exactly; Sqoop will approximate the data types by default, but you can gain greater control over this with specific arguments. Also, NULL handling can be tricky, but again you can define exactly how you want NULLs to be handled if you don't like the results of the default Sqoop execution.
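
One such argument is --map-column-hive, which overrides the Hive type used for named columns. A minimal sketch, assuming you want CreatedDateTime stored as a Hive STRING rather than the default mapping:

sqoop import --connect "jdbc:sqlserver://10.88.12.0:1433;database=Utility;username=ClouderaAccess;password=GiveMeData" --table Customers --hive-import --map-column-hive CreatedDateTime=STRING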

As Hive is designed to act like a Data Warehouse a common use case is to update a table periodically with new data. Hive can use partitions to store this new data more efficiently. The analogy here is with SQL Server clustered indexes as we're controlling how data is stored on disk to make the queries run faster.

Hive tables are stored on HDFS. Remember this is a write-once cluster with the potential to be read often by many possible tools, so the data is written to disk in a format that can be read easily. Hive stores the schema in its metastore and only at the point of query is the data interpreted against that schema. This is called schema on read. As the data is stored on HDFS we can use different file types to store the data in the most efficient manner. A common format used with Hive is Parquet, which is optimised for data analysis work.

Basic Hive Import
The --hive-import argument for the import command provides an easy way of populating Hive.
sqoop import --connect "jdbc:sqlserver://10.88.12.0:1433;database=Utility;username=ClouderaAccess;password=GiveMeData" --table Customers --hive-import

Overwrite Existing Data
The --hive-overwrite option is analogous to a TRUNCATE and INSERT in SQL Server.
sqoop import --connect "jdbc:sqlserver://10.88.12.0:1433;database=Utility;username=ClouderaAccess;password=GiveMeData" --table Customers --hive-import --hive-overwrite

If you use --hive-overwrite without --hive-import it won't work; the process will default to an HDFS load instead.

If the table has already been loaded into HDFS and you don't specify a target you'll get an error stating that the table already exists.

Creating Tables In Hive
If your data already exists in HDFS and you want to load it into Hive you will need to define the schema, since Sqoop has no metadata to use to create the Hive table during the load process.

Hive CREATE TABLE syntax is similar to other SQL variants. Here is a simple example,

CREATE TABLE [Table Name]
     ([Column Name 1] [DataType],
      [Column Name 2] [DataType],
      [Column Name 3] [DataType],
      ...
      [Column Name n] [DataType])
STORED AS [File Type];

For more complex examples and a full syntax explanation look in the Hive Wiki here. This also contains information about data types which are different in Hive to SQL Server so beware.

To execute Hive code you need to open the Hive shell. Do this by typing hive in the terminal session and you'll see the text before the cursor change to say hive>.

For our Customers table the create script, storing the data as delimited text, would be as follows,
hive> CREATE TABLE Customers
     (CustomerID INT,
     FirstName VARCHAR(50),
     Surname VARCHAR(50),
     DateOfBirth DATE,
     EmailAddress VARCHAR(100),
     PostCode VARCHAR(8),
     MarketingOptIn TINYINT,
     AutoRenewalOptIn TINYINT,
     CreatedDateTime TIMESTAMP,
     ModifiedDateTime TIMESTAMP)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

Once created the table can be populated using the LOAD DATA LOCAL INPATH command as shown here;

hive> LOAD DATA LOCAL INPATH "[Folder]/[FileName]" INTO TABLE Customers;

Handling NULLs
Sqoop and Hive handle NULLs differently by default so you need to explicitly tell Sqoop how to load the data into Hive. This is easily remedied using the two arguments --null-string and --null-non-string with the parameter '\\N' as shown below.

sqoop import --connect "jdbc:sqlserver://10.88.12.0:1433;database=Utility;username=ClouderaAccess;password=GiveMeData" --table Customers --hive-import --null-string '\\N' --null-non-string '\\N'

Handling Delimiters

After importing into Hive using Sqoop you can run SELECT count(*) FROM your_table to see how many rows were loaded. If the number of rows is larger than in the source table you may have encountered a problem with delimiters.
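
From the terminal that check might look like this, assuming the imported Hive table is called customers:

hive -e "SELECT COUNT(*) FROM customers;"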

The issue is caused when the data contains characters that Hive uses as delimiters, e.g. \n, \t, \01. This can arise when data has been input into text-based columns with newline or tab characters included.

Sqoop can remove these characters from your data if you use the argument --hive-drop-import-delims. For example,

sqoop import --connect "jdbc:sqlserver://10.88.12.0:1433;database=Utility;username=ClouderaAccess;password=GiveMeData" --table Customers --hive-import  --hive-drop-import-delims

If the special characters are required you can use the argument --hive-delims-replacement which will substitute a replacement string, although how useful that would be in the data is not clear. It works like this,

sqoop import --connect "jdbc:sqlserver://10.88.12.0:1433;database=Utility;username=ClouderaAccess;password=GiveMeData" --table Customers --hive-import  --hive-delims-replacement "SPECIAL"

Loading Into Partitions
If you're incrementally loading data into Hive, perhaps as part of a daily load, you can use partitions to separate the data while Hive still treats these separate blocks of data as one table. This could be useful if you want to swap out a day's load that went wrong without reloading all the data.

Partitions also act like Clustered Indexes in SQL Server as the data is written to disk in sequential blocks making retrieval easier.

Sqoop supports Hive partitioning with two additional parameters: --hive-partition-key, the name of the partition column, and --hive-partition-value, the value. So if you partition on the column CreatedDate and you're loading data for 31st May 2016 the value would be 2016-05-31 and the Sqoop command would read as follows;

sqoop import --connect "jdbc:sqlserver://10.88.12.0:1433;database=Utility;username=ClouderaAccess;password=GiveMeData" --table Customers --hive-import --hive-partition-key CreatedDate --hive-partition-value "2016-05-31"

Sqoop mandates that the partition column be of type STRING. The current implementation is limited to a single partition level so you can't do nested partitions.

Using Impala
Using HUE, the Cloudera equivalent of SQL Server Management Studio, you can run Impala queries over the data we've already loaded into Hive. As Impala relies on the Hive metastore, only data which has been loaded into Hive can be queried using Impala.

To refresh Impala's knowledge of the Hive metastore after a data load you can run the Invalidate Metadata Impala command, which is poorly named as it actually does a refresh and leaves Impala ready to be used. Use it as follows,

Invalidate Metadata



Sunday, 29 May 2016

Sqoop & SQL Server #8 - Incremental Imports

So far the examples shown have imported whole tables or subsets of tables. But if the data load is a regular occurrence you'd have to reload data already transmitted and overwrite the existing target, which is inefficient. The Sqoop import command allows incremental data loads using either an incrementing integer column or a datetime column as a reference at the point of load.

Caution: Both methods assume the table is being populated with new data but existing data is not updated or deleted.

For both methods an argument --incremental is required but that argument requires a different parameter depending on the type of load. Let's start with using an incrementing integer column which uses the parameter append.

--incremental append

Two other arguments must be supplied. --check-column which defines the column in which the incrementing integer is stored and --last-value which unsurprisingly stores the previous maximum value for the integer column. The syntax is,

--check-column [column name]
--last-value [n]

An example of these arguments in an import command is,

sqoop import --connect "jdbc:sqlserver://10.88.12.0\\SQL2014;database=Utility;username=ClouderaAccess;password=GiveMeData" --table Customers --incremental append --check-column CustomerID --last-value 1000

The second option is to use a datetime column. That requires the parameter lastmodified for the argument --incremental like this,
--incremental lastmodified

We continue to use --check-column and --last-value but in this case the value has to be a date contained within speech marks. Here is an example,

sqoop import --connect "jdbc:sqlserver://10.88.12.0\\SQL2014;database=Utility;username=ClouderaAccess;password=GiveMeData" --table Customers --incremental lastmodified --check-column CreatedDateTime --last-value "2016-05-23 21:11:00"

For both methods Sqoop merges the data together using a MapReduce job.

Recurring Incremental Data Loads
An extension to this functionality allows Sqoop to remember the last value in a local configuration database called the Sqoop Metastore. This database can store jobs which can be executed by any cluster user or via an automated script. SQL Server pros can compare this to the MSDB database which stores SQL Agent jobs and configuration. Sqoop has no scheduler but another Apache tool named Oozie can be used to create scheduled jobs. Oozie is also installed on the Cloudera Quickstart VM.

The main benefit of creating a job is that you don't need to remember the last integer value or date. Sqoop handles that for you.

To create a recurring job you need to define it. To do this use the Sqoop job command along with the argument --create and give the job a name. Here is the syntax,

sqoop job --create [my job name]

After this create statement we define the job with the import command and the arguments required. This saves a job in the Sqoop metastore with the name you've defined. The job can be run using the exec argument. For a job named testjob the example is as follows,

sqoop job --exec testjob
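
To show the two halves together, here is a hedged sketch that creates a job for the incremental import used earlier and then runs it; note the standalone -- that separates the job arguments from the stored import command.

sqoop job --create testjob \
-- import \
--connect "jdbc:sqlserver://10.88.12.0\\SQL2014;database=Utility;username=ClouderaAccess;password=GiveMeData" \
--table Customers \
--incremental append --check-column CustomerID --last-value 0

sqoop job --exec testjob

After each run Sqoop records the new last value against the saved job, so the next execution only picks up rows added since the previous one.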

To list the jobs in the metastore use the following argument,

sqoop job --list

You can display the definition of a job with the show argument which requires a parameter to define the name of the job you wish to display. For example,

sqoop job --show testjob

And if a job is no longer needed you can remove it from the Sqoop metastore using the --delete argument which also requires the name of the job as a parameter. For example,

sqoop job --delete testjob

More about the Sqoop Metastore
Sqoop uses HSQLDB, a small-footprint HyperSQL database, to store its metadata. This database can be shared so other users can use the jobs stored within it. However, if you're going to share jobs across multiple users you may want to move the metastore to a more familiar database like MySQL, because the metastore can get corrupted and in a production environment you'll want good maintenance procedures.

Other users can connect to the metastore using the argument --meta-connect with the IP address of the virtual machine. It uses port number 16000 by default. For example,

sqoop job --create testjob \
--meta-connect jdbc:hsqldb:hsql://[VM IP Address]:16000/sqoop \
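
Once a job is stored in the shared metastore, other users can point at the same metastore when running it. A hedged sketch of what that might look like:

sqoop job --meta-connect jdbc:hsqldb:hsql://[VM IP Address]:16000/sqoop --exec testjob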

A better model for incremental loads
In reality, if you have administrative access over the source SQL Server or a good relationship with your DBAs, you're better off pushing the logic for identifying the new data back into a SQL Server view and using Sqoop to import it.
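
As an illustration, assuming the DBA has created a view named NewCustomers that returns only the rows added since the last load (the view name is hypothetical), the import is no different from importing a table:

sqoop import --connect "jdbc:sqlserver://10.88.12.0\\SQL2014;database=Utility;username=ClouderaAccess;password=GiveMeData" --table NewCustomers --target-dir /user/cloudera/newcustomers -m 1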



This is all well and good for tables that only ever have rows inserted, but what about tables with update or delete transactions? How do you reflect those changes in HDFS? For that you can use the Sqoop merge command.
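
The general shape of a merge is sketched below; the jar and class name come from a previous sqoop codegen or import run, and the paths and key column here are hypothetical.

sqoop merge --new-data /user/cloudera/customers_increment \
--onto /user/cloudera/customers_base \
--target-dir /user/cloudera/customers_merged \
--jar-file Customers.jar --class-name Customers \
--merge-key CustomerID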

Saturday, 28 May 2016

Sqoop & SQL Server #7 - Reusability with Options Files

Using an Options File for Reusability
An Options File is a text file containing code to be called as part of a Sqoop job. They are useful when you repeat the same code often. You can share them between developers and they interact easily with the rest of the Sqoop command you're running.

To create an Options File simply create a text file with the elements of Sqoop syntax you wish to repeatedly call. Functionality you need to be aware of includes;
  • Each line identifies an option in the order that it appears otherwise on the command line
  • You can specify a single option on multiple lines by using the back-slash character at the end of intermediate lines.
  • Comments are supported when the line is prefixed with the hash character. Comments must be specified on a new line and may not be mixed with option text.
  • All comments and empty lines are ignored.
  • Unless options appear as quoted strings, any leading or trailing spaces are ignored.
  • Quoted strings if used must not extend beyond the line on which they are specified.
  • Option files can be specified anywhere in the command line as long as the options within them follow the otherwise prescribed rules of options ordering.

A simple example,
import
--connect
jdbc:sqlserver://10.88.12.0\\SQL2014
--database Utility
--username ClouderaAccess
--password GiveMeData

An example with comments,
#
# Options File for Sqoop Import
#

# Command
import

# Connection String
--connect
jdbc:sqlserver://10.88.12.0\\SQL2014
--database Utility

# Credentials
--username ClouderaAccess
--password GiveMeData

# End

Using an Options File
To call the file use the argument --options-file plus any other arguments relevant to the file being called, in this case the table to be imported. We'll use the Options File defined in the examples above.

sqoop --options-file /users/cloudera/sqlserver/optionsfiles/StandardImportConnection.txt --table Customers

And this is equivalent to,



sqoop import --connect "jdbc:sqlserver://10.88.12.0\\SQL2014;database=Utility;username=ClouderaAccess;password=GiveMeData" --table Customers

Friday, 27 May 2016

Sqoop & SQL Server #6 - Sqoop Import Functionality

The standard import syntax is quite simple. All statements start with sqoop followed by a command. In this example the command is import. 

sqoop import --connect "jdbc:sqlserver://10.88.12.0:1433;database=Utility;username=ClouderaAccess;password=GiveMeData" --table Customers

Import has many possible arguments but for this simple example we use just two.

--connect specifies the connection string, in which the server, port, database, username and password need to be supplied.

--connect "jdbc:sqlserver://[server]:[port];database=[database];username=[username];password=[password]"

--table is just the name of the table or view to be loaded. Sqoop works with views in the same way it works with tables.

--table [tablename]

When the job is submitted Sqoop connects via JDBC to the source table to retrieve the columns and their datatypes. The SQL datatypes are mapped to Java datatypes at which point there may be some datatype mapping differences.
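
If the default mapping doesn't suit, the --map-column-java argument lets you override the Java type chosen for particular columns. A minimal sketch with hypothetical type choices:

sqoop import --connect "jdbc:sqlserver://10.88.12.0:1433;database=Utility;username=ClouderaAccess;password=GiveMeData" --table Customers --map-column-java CustomerID=Integer,PostCode=String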

You also see the following warning.

Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.

Under The Hood
Don't worry about this warning. It is highlighting that you haven't got Apache Accumulo installed. You don't need it. It's a Key Value store tool like Apache HBase. For more information see the Apache Accumulo project.

A MapReduce job is then initiated to get the data and write it to HDFS. Within the terminal session you'll see log entries prefixed INFO mapreduce.Job displaying the percentage completion of the process.

Simple queries don't take long. For larger tables it's possible to increase performance by splitting the job across multiple nodes. Sqoop will use the Primary Key to judge how to split the table. If no primary key exists Sqoop will make a guess. You can control the column used with the --split-by argument. But be careful to select a column which is uniformly distributed, otherwise some nodes will have little work to do whilst others will be overworked.

When the process is complete the terminal session log will display a message similar to the following.

16/05/27 06:17:14 INFO mapreduce.ImportJobBase: Transferred 201.083 KB in 77.9396 seconds (2.58 KB/sec)
16/05/27 06:17:14 INFO mapreduce.ImportJobBase: Retrieved 1823 records.

Sqoop import has many possible arguments which cater for various use cases. The Sqoop User Guide has excellent documentation for import syntax.

In the next posts we'll look at some of these more advanced import scenarios.

Transactional consistency
During a data load the source table must be consistent otherwise the data imported may not match the data at source. For example, in a transactional system if data is being inserted, updated or deleted during a long running data load some processes may extract inconsistent data or get blocked by other processes.

This is an important consideration to address with the administrators of the source system. It may be that you can arrange to have the data required replicated to a static source or maybe there is a maintenance window during which transactions cease.

Import from a Named Instance
When connecting to a SQL Server instance which is not the default instance the connection string is formatted differently.

This can be important if you have multiple SQL Server instances installed on a single server.

SQL Server instances which are not the default instance need to be named. A named instance is not connected to on the standard 1433 port. That is reserved for the default instance.

When connecting to a named instance you drop the port number and add the name. The format also changes slightly: in place of :[port number] we add \\[instance name] at the end of the host name or IP address.

"jdbc:sqlserver://10.88.12.0\\SQL2014;database=Utility;username=ClouderaAccess;password=GiveMeData"

For example, here I reference a named instance called SQL2014.

sqoop import --connect "jdbc:sqlserver://10.88.12.0\\SQL2014;database=Utility;username=ClouderaAccess;password=GiveMeData" --table Customers

Import from a defined schema
By default SQL Server tables are created in the dbo schema. But there are many instances when other schemas are used for security or administrative reasons. 

Sqoop assumes the default dbo schema has been used. If not, you need to define the schema being referenced.

To reference a schema add the following at the end of the list of arguments

-- --schema [SchemaName]

For example, here I reference a schema called TestSchema.

sqoop import --connect "jdbc:sqlserver://10.88.12.0:1433;database=Utility;username=ClouderaAccess;password=GiveMeData" --table Customers -- --schema TestSchema

Import all tables from a database
The import-all-tables command loads an entire database into HDFS in one go. However, depending on the size of the database that might not be the best solution. A one shot process which fails could require significant clean up before retrying and could take a long time to complete.

The command works through the source tables sequentially to avoid unnecessary load on the RDBMS.

There are some other limitations of this command.
  • Each table must have a single-column primary key.
  • You must intend to import all columns of each table.
  • You must not intend to use a non-default splitting column, nor impose any conditions via a WHERE clause.

If, despite these limitations, import all tables is the right approach for your use case the syntax is as follows,

sqoop import-all-tables --connect "jdbc:sqlserver://10.88.12.0:1433;database=Utility;username=ClouderaAccess;password=GiveMeData"

If you need to import all but a few tables, you can use the parameter --exclude-tables that accepts a comma-separated list of table names. For example,

sqoop import-all-tables --connect "jdbc:sqlserver://10.88.12.0:1433;database=Utility;username=ClouderaAccess;password=GiveMeData" --exclude-tables cities,countries

Import into a defined folder
Sqoop will load your table into a directory with the same name in your HDFS home directory. If the folder already exists the job will fail to ensure existing data is not overwritten.

There are two options for specifying the target location. When loading a single table use the argument --target-dir; the table's files are written directly into the directory specified.

sqoop import --connect "jdbc:sqlserver://10.88.12.0\\SQL2014;database=Utility;username=ClouderaAccess;password=GiveMeData" --table Customers --target-dir /tmp/definedfolderexample -m 1

When sqooping a number of tables you can use the argument --warehouse-dir. A sub folder is created under this root for each table.

sqoop import --connect "jdbc:sqlserver://10.88.12.0\\SQL2014;database=Utility;username=ClouderaAccess;password=GiveMeData" --table Customers --warehouse-dir /tmp/definedfolderexample -m 1

Import a subset of data
There are occasions when you don't want to import an entire table. Sqoop defaults to taking all columns and all rows. You may want to restrict the columns or rows loaded. 

Selecting a subset of columns
Sqoop ships with an argument for the import command named --columns. To use this argument you must also supply a comma separated list enclosed in speech marks. Here is the syntax,

--columns "column1,column2,column3"

And an example,

sqoop import --connect "jdbc:sqlserver://10.88.12.0\\SQL2014;database=Utility;username=ClouderaAccess;password=GiveMeData" --table Customers --columns "FirstName,Surname,DateOfBirth" -m 1

Selecting a subset of rows
To restrict the rows loaded Sqoop uses a --where argument. The syntax is simple,

--where "[ColumnName]=n" or for string values --where "[ColumnName]='stringtext'"

And an example,

sqoop import --connect "jdbc:sqlserver://10.88.12.0\\SQL2014;database=Utility;username=ClouderaAccess;password=GiveMeData" --table Customers --where "MarketingOptIn = 1" -m 1

The text within the speech marks can include anything which the source RDBMS can handle. For example, you can use UDFs and T-SQL specific syntax. The arguments will be passed to the source system for each query submitted. Be careful not to define a clause so complex that one process locks the table and the other mappers cannot continue to work in parallel.
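
As a hedged example, the predicate below uses T-SQL date functions to import only customers over eighteen; the filter itself is hypothetical.

sqoop import --connect "jdbc:sqlserver://10.88.12.0\\SQL2014;database=Utility;username=ClouderaAccess;password=GiveMeData" --table Customers --where "DateOfBirth < DATEADD(YEAR, -18, GETDATE())" -m 1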

Using a free-form query
It is also possible to supply a free-form T-SQL query to Sqoop although there are limitations to the power of this method.

Limitations
  • You must specify a destination directory as Sqoop cannot simply use the table name
  • To use parallelism you must include $CONDITIONS in the WHERE clause of your query (\$CONDITIONS if the query is wrapped in double quotes)
  • You must use --split-by as Sqoop cannot identify a primary key

The syntax is as follows,

sqoop import \
--query 'SELECT a.*, b.* FROM a JOIN b ON (a.id = b.id) WHERE $CONDITIONS' \
--split-by a.id --target-dir /user/results

If you have control over the source database it may be preferable to construct a view which contains logic to present the table as required and thus remove the need for both the --columns and --where arguments.

Import using multiple nodes
For large loads it is possible to increase performance by splitting the job across multiple nodes.

Sqoop will use the Primary Key to judge how to split the table. If no primary key exists Sqoop will make a guess. You can control the column used with the import argument --split-by. But be careful to select a column which is uniformly distributed, otherwise some nodes will have little work to do whilst others will be overworked.

The syntax for using --split-by is as follows.

--split-by [column name]

And an example of this argument in action is,

sqoop import --connect "jdbc:sqlserver://10.88.12.0\\SQL2014;database=Utility;username=ClouderaAccess;password=GiveMeData" --table Customers --split-by CreatedDateTime

Remember Sqoop will examine the column chosen to split by in order to divide the job across multiple nodes. Therefore choosing a column with good distribution of data will allow Sqoop to spread the load across nodes. Choosing a column with lots of repeated data or one value which is very common in the data will skew the import leading to some nodes finishing early and others taking a long time to complete.

You can control the number of mappers used with the import argument -m [n] where n is an integer. The syntax is very simple,

-m 4

And an example of this argument

sqoop import --connect "jdbc:sqlserver://10.88.12.0\\SQL2014;database=Utility;username=ClouderaAccess;password=GiveMeData" --table Customers -m 4

By default 4 tasks are used. During testing you may see better performance by adding more mappers. However, if you increase the number of mappers beyond the number of nodes in your cluster the mappers will run in series not parallel, which defeats the objective. Equally there comes a point where splitting the work across nodes becomes self-defeating, so testing to identify optimal performance is advised.

Into a defined file type
With Sqoop 1.4.6 there are 4 different file types available in the Sqoop import command. The different formats each have different advantages, but plain text is useful for most scenarios and can be used by various other Apache tools. Avro is commonly used when streaming data, which is not Sqoop's purpose. Parquet is commonly used to create efficient storage which can be used for fast analysis. SequenceFiles are used extensively by the HDFS file system behind the scenes.

  1. Plain text - Standard CSV format is the default for Sqoop
  2. Avro - A fast binary data format - see here for details
  3. Parquet - A columnar store format - see here for details
  4. SequenceFile - supports binary data types - see here for details

The syntax for the additional argument is similar for all four options.

--as-avrodatafile
--as-sequencefile
--as-textfile
--as-parquetfile

In this example we'll use Parquet,

sqoop import --connect "jdbc:sqlserver://10.88.12.0\\SQL2014;database=Utility;username=ClouderaAccess;password=GiveMeData" --table Customers --as-parquetfile