Friday, 27 May 2016

Sqoop & SQL Server #6 - Sqoop Import Functionality

The standard import syntax is quite simple. All statements start with sqoop followed by a command. In this example the command is import. 

sqoop import --connect "jdbc:sqlserver://10.88.12.0:1433;database=Utility;username=ClouderaAccess;password=GiveMeData" --table Customers

The import command has many possible arguments, but for this simple example we use just two.

Connect specifies the connection string in which the server, port, database, username and password need to be supplied.

--connect "jdbc:sqlserver://[server]:[port];database=[database];username=[username];password=[password]"

Table is just the name of the table or view to be loaded. Sqoop works with views the same way it works with tables.

--table [tablename]

When the job is submitted Sqoop connects via JDBC to the source table to retrieve the columns and their datatypes. The SQL Server datatypes are then mapped to Java datatypes, and it is at this point that mapping differences can creep in.
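If one of the default mappings doesn't suit, the import argument --map-column-java lets you override the Java type chosen for a particular column. A minimal sketch, assuming a hypothetical money column called Balance that you would rather land as a plain string:

sqoop import --connect "jdbc:sqlserver://10.88.12.0:1433;database=Utility;username=ClouderaAccess;password=GiveMeData" --table Customers --map-column-java Balance=String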

When you run the import you may also see the following warning.

Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.

Under The Hood
Don't worry about this warning. It is highlighting that you haven't got Apache Accumulo installed. You don't need it; it's a key-value store like Apache HBase. For more information see the Apache Accumulo project.

A MapReduce job is then initiated to get the data and write it to HDFS. Within the terminal session you'll see log entries prefixed INFO mapreduce.Job displaying the percentage completion of the process.

Simple imports don't take long. For larger tables it's possible to increase performance by splitting the job across multiple nodes. Sqoop uses the primary key to judge how to split the table; if no primary key exists you need to either nominate a column yourself or fall back to a single mapper. You can control the column used with the --split-by argument, but be careful to select a column whose values are uniformly distributed, otherwise some nodes will have little work to do whilst others are overworked.

When the process is complete the terminal session log will display a message similar to the following.

16/05/27 06:17:14 INFO mapreduce.ImportJobBase: Transferred 201.083 KB in 77.9396 seconds (2.58 KB/sec)
16/05/27 06:17:14 INFO mapreduce.ImportJobBase: Retrieved 1823 records.
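Once the job has finished you can inspect what landed in HDFS. A quick check, assuming the data went to the default location (a folder named after the table in your HDFS home directory) and that the first output file carries the usual part-m-00000 name:

hdfs dfs -ls Customers
hdfs dfs -cat Customers/part-m-00000 | head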

Sqoop import has many possible arguments which cater for various use cases. The Sqoop User Guide has excellent documentation for import syntax.

In the next posts we'll look at some of these more advanced import scenarios.

Transactional consistency
During a data load the source table must remain consistent, otherwise the data imported may not match the data at source. For example, in a transactional system where data is being inserted, updated or deleted during a long-running load, the import may extract inconsistent data or be blocked by other processes.

This is an important consideration to address with the administrators of the source system. It may be that you can arrange to have the data required replicated to a static source or maybe there is a maintenance window during which transactions cease.

Import from a Named Instance
When connecting to a SQL Server instance which is not the default instance the connection string is formatted differently.

This is important if you have multiple SQL Server instances installed on a single server.

SQL Server instances which are not the default instance must be named. A named instance is not reached on the standard port 1433; that is reserved for the default instance.

When connecting to a named instance you drop the port number and add the instance name instead. In place of :[port] we append \\[instance name] to the host name or IP address.

jdbc:sqlserver://10.88.12.0\\SQL2014;database=Utility;username=ClouderaAccess;password=GiveMeData

For example, here I reference a named instance called SQL2014.

sqoop import --connect "jdbc:sqlserver://10.88.12.0\\SQL2014;database=Utility;username=ClouderaAccess;password=GiveMeData" --table Customers
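Before launching a full import it can be worth confirming you are actually hitting the named instance. The sqoop eval tool runs a single statement and prints the result, so a quick check with the same connection string might look like this:

sqoop eval --connect "jdbc:sqlserver://10.88.12.0\\SQL2014;database=Utility;username=ClouderaAccess;password=GiveMeData" --query "SELECT @@SERVERNAME AS ServerInstance"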

Import from a defined schema
By default SQL Server tables are created in the dbo schema. But there are many instances when other schemas are used for security or administrative reasons. 

Sqoop assumes the default dbo schema has been used. If not, you need to specify the schema being referenced.

To reference a schema, add the following at the end of the list of arguments,

-- --schema [SchemaName]

For example, here I reference a schema called TestSchema.

sqoop import --connect "jdbc:sqlserver://10.88.12.0:1433;database=Utility;username=ClouderaAccess;password=GiveMeData" --table Customers -- --schema TestSchema
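To check the schema-qualified table is reachable before running the full import, sqoop eval can again be used to issue a small test query. A sketch, assuming the Customers table lives in TestSchema:

sqoop eval --connect "jdbc:sqlserver://10.88.12.0:1433;database=Utility;username=ClouderaAccess;password=GiveMeData" --query "SELECT TOP 5 * FROM TestSchema.Customers"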

Import all tables from a database
The import-all-tables command loads an entire database into HDFS in one go. However, depending on the size of the database, that might not be the best solution: a one-shot process which fails part-way through could require significant clean-up before retrying, and could take a long time to complete.

The command works through the source tables sequentially to avoid placing unnecessary load on the RDBMS.

There are some other limitations of this command.
  • Each table must have a single-column primary key.
  • You must intend to import all columns of each table.
  • You must not intend to use a non-default splitting column, nor impose any conditions via a WHERE clause.

If, despite these limitations, import-all-tables is the right approach for your use case, the syntax is as follows,

sqoop import-all-tables --connect "jdbc:sqlserver://10.88.12.0:1433;database=Utility;username=ClouderaAccess;password=GiveMeData"

If you need to import all but a few tables, you can use the parameter --exclude-tables, which accepts a comma-separated list of table names. For example,

sqoop import-all-tables --connect "jdbc:sqlserver://10.88.12.0:1433;database=Utility;username=ClouderaAccess;password=GiveMeData" --exclude-tables cities,countries
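To work out which tables need excluding in the first place, the list-tables tool will show everything Sqoop can see through the connection. For example,

sqoop list-tables --connect "jdbc:sqlserver://10.88.12.0:1433;database=Utility;username=ClouderaAccess;password=GiveMeData"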

Import into a defined folder
Sqoop will load your table into a directory with the same name in your HDFS home directory. If the folder already exists the job will fail to ensure existing data is not overwritten.

There are two options for specifying the target location. When loading a single table use the argument --target-dir; the table's data files are written directly into the directory specified.

sqoop import --connect "jdbc:sqlserver://10.88.12.0\\SQL2014;database=Utility;username=ClouderaAccess;password=GiveMeData" --table Customers --target-dir /tmp/definedfolderexample -m 1

When sqooping a number of tables you can use the argument --warehouse-dir. A separate sub-folder, named after each table, is created under this root directory.

sqoop import --connect "jdbc:sqlserver://10.88.12.0\\SQL2014;database=Utility;username=ClouderaAccess;password=GiveMeData" --table Customers --warehouse-dir /tmp/definedfolderexample -m 1
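After the import you can confirm the per-table layout under the warehouse directory with a simple HDFS listing, assuming the folder used above:

hdfs dfs -ls /tmp/definedfolderexample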

Import a subset of data
There are occasions when you don't want to import an entire table. Sqoop defaults to taking all columns and all rows. You may want to restrict the columns or rows loaded. 

Selecting a subset of columns
Sqoop ships with an import argument named --columns. To use it you supply a comma-separated list of column names enclosed in quotation marks. Here is the syntax,

--columns "column1,column2,column3"

And an example,

sqoop import --connect "jdbc:sqlserver://10.88.12.0\\SQL2014;database=Utility;username=ClouderaAccess;password=GiveMeData" --table Customers --columns "FirstName,Surname,DateOfBirth" -m 1

Selecting a subset of rows
To restrict the rows loaded Sqoop uses a --where argument. The syntax is simple,

--where "[ColumnName]=n" or for string values --where "[ColumnName]='stringtext'"

And an example,

sqoop import --connect "jdbc:sqlserver://10.88.12.0\\SQL2014;database=Utility;username=ClouderaAccess;password=GiveMeData" --table Customers --where "MarketingOptIn = 1" -m 1

The text within the quotation marks can include anything the receiving RDBMS can handle; for example, you can use UDFs and T-SQL-specific syntax. The clause is passed to the source system with each query submitted, so be careful not to define one so complex that one process locks the table and the other mappers cannot continue to work in parallel.
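The --columns and --where arguments can of course be combined in a single command. A sketch, reusing the columns from the earlier example:

sqoop import --connect "jdbc:sqlserver://10.88.12.0\\SQL2014;database=Utility;username=ClouderaAccess;password=GiveMeData" --table Customers --columns "FirstName,Surname,DateOfBirth" --where "MarketingOptIn = 1" -m 1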

Using a free-form query
It is also possible to supply a free-form T-SQL query to Sqoop although there are limitations to the power of this method.

Limitations
  • You must specify a destination directory as Sqoop cannot simply use the table name
  • To use parallelism you must include $CONDITIONS in the WHERE clause of your query, escaped as \$CONDITIONS if the query is wrapped in double quotes
  • You must use --split-by as Sqoop cannot identify a primary key

The syntax is as follows,

sqoop import \
--query 'SELECT a.*, b.* FROM a JOIN b ON (a.id = b.id) WHERE $CONDITIONS' \
--split-by a.id --target-dir /user/results
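As a SQL Server flavoured sketch of the same pattern, here the query is wrapped in double quotes, so $CONDITIONS is escaped as \$CONDITIONS. The Orders table, its OrderTotal column and the CustomerId join key are hypothetical names used purely for illustration:

sqoop import --connect "jdbc:sqlserver://10.88.12.0:1433;database=Utility;username=ClouderaAccess;password=GiveMeData" --query "SELECT c.CustomerId, c.FirstName, o.OrderTotal FROM Customers c JOIN Orders o ON c.CustomerId = o.CustomerId WHERE \$CONDITIONS" --split-by c.CustomerId --target-dir /tmp/freeformexample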

If you have control over the source database it may be preferable to construct a view which contains the logic to present the data as required, removing the need for both the --columns and --where arguments.
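Because Sqoop treats views just like tables, the import then collapses back to the simple form. A sketch, assuming a hypothetical view named vw_MarketingCustomers:

sqoop import --connect "jdbc:sqlserver://10.88.12.0:1433;database=Utility;username=ClouderaAccess;password=GiveMeData" --table vw_MarketingCustomers -m 1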

Import using multiple nodes
For large loads it is possible to increase performance by splitting the job across multiple nodes.

Sqoop will use the primary key to judge how to split the table. If no primary key exists you will need to nominate a column yourself, or run a sequential import with a single mapper. You can control the column used with the import argument --split-by, but be careful to select a column whose values are uniformly distributed, otherwise some nodes will have little work to do whilst others are overworked.

The syntax for using --split-by is as follows.

--split-by [column name]

And an example of this argument in action is,

sqoop import --connect "jdbc:sqlserver://10.88.12.0\\SQL2014;database=Utility;username=ClouderaAccess;password=GiveMeData" --table Customers --split-by CreatedDateTime

Remember, Sqoop examines the column chosen to split by in order to divide the job across multiple nodes, so a column with a good distribution of data allows Sqoop to spread the load evenly. A column with lots of repeated data, or one dominant value, will skew the import, leaving some nodes finishing early while others take a long time to complete.

You can control the number of mappers used with the import argument -m [n] (or its long form --num-mappers [n]), where n is an integer. The syntax is very simple,

-m 4

And an example of this argument

sqoop import --connect "jdbc:sqlserver://10.88.12.0\\SQL2014;database=Utility;username=ClouderaAccess;password=GiveMeData" --table Customers -m 4

By default four map tasks are used. During testing you may see better performance by adding more mappers. However, if you increase the number of mappers beyond the capacity of your cluster, some mappers will queue and run in series rather than in parallel, which defeats the objective. Equally, there comes a point where splitting the work further becomes self-defeating, so testing to identify the optimal setting is advised.

Into a defined file type
With Sqoop 1.4.6 there are four different file formats available in the Sqoop import command. Each format has its own advantages, but plain text is useful for most scenarios and can be read by various other Apache tools. Avro is commonly used when streaming data, which is not Sqoop's purpose. Parquet is commonly used to create efficient columnar storage for fast analysis. SequenceFiles are a Hadoop-native binary format used extensively by MapReduce behind the scenes.

  1. Plain text - standard CSV format and the default for Sqoop
  2. Avro - a fast, row-oriented binary data format
  3. Parquet - a columnar storage format
  4. SequenceFile - a Hadoop binary format which supports binary data types

The syntax for the additional argument is similar for all four options.

--as-avrodatafile
--as-sequencefile
--as-textfile
--as-parquetfile

In this example we'll use Parquet,

sqoop import --connect "jdbc:sqlserver://10.88.12.0\\SQL2014;database=Utility;username=ClouderaAccess;password=GiveMeData" --table Customers --as-parquetfile
