Wednesday, 8 June 2016
Sqoop & SQL Server #12 - Sqoop Cheat Sheet

Theme | I want to… | Sqoop Example
General | See what version of Sqoop is installed. | sqoop version
General | List the databases I have permission to see. | sqoop list-databases --connect "jdbc:sqlserver://[IPAddress]:[PortNumber];username=[UserName];password=[Password]"
General | List the tables I have permission to see for a specific database. | sqoop list-tables --connect "jdbc:sqlserver://[IPAddress]:[PortNumber];database=[Database];username=[UserName];password=[Password]"
Import | Import a specific table into HDFS. | sqoop import --connect "jdbc:sqlserver://[IPAddress]:[PortNumber];database=[Database];username=[UserName];password=[Password]" --table [TableName]
Import | Import a specified table from a named instance into HDFS. | sqoop import --connect "jdbc:sqlserver://[IPAddress]\\[InstanceName];database=[Database];username=[UserName];password=[Password]" --table [TableName]
Import | Import a specific table from a specified schema into HDFS. | sqoop import --connect "jdbc:sqlserver://[IPAddress]:[PortNumber];database=[Database];username=[UserName];password=[Password]" --table [TableName] -- --schema [SchemaName]
Import | Import all tables from a database. | sqoop import-all-tables --connect "jdbc:sqlserver://[IPAddress]:[PortNumber];database=[Database];username=[UserName];password=[Password]"
Import | Import all tables from a database with specified exceptions. | sqoop import-all-tables --connect "jdbc:sqlserver://[IPAddress]:[PortNumber];database=[Database];username=[UserName];password=[Password]" --exclude-tables [TableName1],[TableName2]
Import | Import a specific table into a defined HDFS location. | sqoop import --connect "jdbc:sqlserver://[IPAddress]:[PortNumber];database=[Database];username=[UserName];password=[Password]" --table [TableName] --target-dir [Location]
Import | Import specified columns from a table into HDFS. | sqoop import --connect "jdbc:sqlserver://[IPAddress]:[PortNumber];database=[Database];username=[UserName];password=[Password]" --table [TableName] --columns "[Column1],[Column2],[Column3]"
Import | Import into HDFS using a T-SQL query. | sqoop import --connect "jdbc:sqlserver://[IPAddress]:[PortNumber];database=[Database];username=[UserName];password=[Password]" --query 'SELECT a.*, b.* FROM a JOIN b ON (a.id = b.id) WHERE $CONDITIONS' --split-by a.id --target-dir [Location]
Import | Spread the load of an import across the cluster using multiple mappers. | sqoop import --connect "jdbc:sqlserver://[IPAddress]:[PortNumber];database=[Database];username=[UserName];password=[Password]" --table [TableName] -m [NumberOfMappers]
Import | Import a specific table into a Parquet format file in HDFS. | sqoop import --connect "jdbc:sqlserver://[IPAddress]:[PortNumber];database=[Database];username=[UserName];password=[Password]" --table [TableName] --as-parquetfile
General | Define an options file for code reuse. | See the example options file below the table.
Import | Use an options file to import data into HDFS. | sqoop --options-file [LocationOfFile]/[NameOfFile] --table Customers
Import | Import a table into Hive. | sqoop import --connect "jdbc:sqlserver://[IPAddress]:[PortNumber];database=[Database];username=[UserName];password=[Password]" --table [TableName] --hive-import
Import | Overwrite an existing Hive table during an import. | sqoop import --connect "jdbc:sqlserver://[IPAddress]:[PortNumber];database=[Database];username=[UserName];password=[Password]" --table [TableName] --hive-import --hive-overwrite
Import | Import a table into Hive and handle NULLs. | sqoop import --connect "jdbc:sqlserver://[IPAddress]:[PortNumber];database=[Database];username=[UserName];password=[Password]" --table [TableName] --hive-import --null-string '\\N' --null-non-string '\\N'
Import | Import a table into Hive and remove Hive delimiters from the data. | sqoop import --connect "jdbc:sqlserver://[IPAddress]:[PortNumber];database=[Database];username=[UserName];password=[Password]" --table [TableName] --hive-import --hive-drop-import-delims
Import | Import a table into a Hive partition. | sqoop import --connect "jdbc:sqlserver://[IPAddress]:[PortNumber];database=[Database];username=[UserName];password=[Password]" --table [TableName] --hive-import --hive-partition-key [PartitionColumn] --hive-partition-value "[PartitionValue]"
Export | Export data from HDFS to SQL Server. | sqoop export --connect 'jdbc:sqlserver://[IPAddress];database=[DatabaseName];instancename=[InstanceName];username=[UserName];password=[Password]' --table [SQLServerTableName] --export-dir [HDFSTableLocation]
Export | Export specific columns from HDFS to SQL Server. | sqoop export --connect 'jdbc:sqlserver://[IPAddress];database=[DatabaseName];instancename=[InstanceName];username=[UserName];password=[Password]' --table [SQLServerTableName] --columns '[Column1],[Column2],[Column3]' --export-dir [HDFSTableLocation]
Export | Export data from HDFS to SQL Server and handle NULLs. | sqoop export --connect 'jdbc:sqlserver://[IPAddress];database=[DatabaseName];instancename=[InstanceName];username=[UserName];password=[Password]' --table [SQLServerTableName] --export-dir [HDFSTableLocation] --input-null-string '\\N' --input-null-non-string '\\N'
Export | Update SQL Server data from HDFS using an export command. | sqoop export --connect 'jdbc:sqlserver://[IPAddress];database=[DatabaseName];instancename=[InstanceName];username=[UserName];password=[Password]' --table [SQLServerTableName] --export-dir [HDFSTableLocation] --update-key '[KeyColumn]'
Export | Update SQL Server data and insert new rows from HDFS using an export command. | sqoop export --connect 'jdbc:sqlserver://[IPAddress];database=[DatabaseName];instancename=[InstanceName];username=[UserName];password=[Password]' --table [SQLServerTableName] --export-dir [HDFSTableLocation] --update-key '[KeyColumn]' --update-mode allowinsert
Export | Export data from HDFS to SQL Server in one batch process. | sqoop export --connect 'jdbc:sqlserver://[IPAddress];database=[DatabaseName];instancename=[InstanceName];username=[UserName];password=[Password]' --table [SQLServerTableName] --export-dir [HDFSTableLocation] --batch

Example options file referenced in the table above (each option and its value go on separate lines):

#
# Options file for Sqoop import
#
# Command
import
# Connection string
--connect
jdbc:sqlserver://[IPAddress]:[PortNumber];database=[DatabaseName]
# Credentials
--username
[UserName]
--password
[Password]
# End
Tuesday, 7 June 2016
Sqoop & SQL Server #11 - Export from HDFS to SQL Server
Export
For those times when you need to get data back out of an HDFS cluster into a relational database, Sqoop export is the answer. Export works in a similar way to the import command: first it checks the connection to the SQL Server, then it maps the job and delivers the data in chunks.
Unlike import, export can't create a table on the target during the process. The table being exported into must already exist on the target database.
By default export will insert data into the target table, but you can use the update argument, in which case the process will update existing records instead.
The --export-dir argument and one of --table or --call are required. These specify the directory in HDFS which contains the source data, and either the table to populate in the database or, in the case of --call, the stored procedure to execute.
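For completeness, here is a rough sketch of the --call form. I haven't tested it against SQL Server, and it assumes a stored procedure named [StoredProcedureName] already exists on the target database with one parameter per exported column.
sqoop export --connect 'jdbc:sqlserver://[IPAddress];database=[DatabaseName];instancename=[InstanceName];username=[UserName];password=[Password]' --call [StoredProcedureName] --export-dir [HDFSTableLocation]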
Basic Example
First we'll load data into a specific folder in HDFS to assist with the export example.
sqoop import --connect "jdbc:sqlserver://10.88.12.0\\SQL2014;database=Utility;username=ClouderaAccess;password=GiveMeData" --table Customers --target-dir /user/cloudera/ExportExample1 -m 1
And on the target SQL Server we'll create a table ready to receive the data load.
CREATE TABLE dbo.CustomersExport
(
CustomerID int NOT NULL,
FirstName varchar(50) NOT NULL,
Surname varchar(50) NOT NULL,
DateOfBirth date NOT NULL,
EmailAddress varchar(100) NULL,
PostCode varchar(8) NULL,
MarketingOptIn bit NULL,
AutoRenewalOptIn bit NULL,
CreatedDateTime datetime NOT NULL,
ModifiedDateTime datetime NULL
) ON [PRIMARY]
GO
Now we're ready to execute the sqoop export.
sqoop export --connect 'jdbc:sqlserver://10.88.12.0;database=Utility;instancename=SQL2014;username=ClouderaAccess;password=GiveMeData' --table CustomersExport --export-dir /user/cloudera/ExportExample1
Whilst trying to get this to work I found a difference in the syntax from what I'd been using for imports. For example, the instancename=SQL2014 property within the connection string replaced the backslash named-instance syntax I'd used for imports. I tried amending the import syntax to reflect this new syntax and the import failed on a syntax error.
Exporting a Subset of Columns
By default, all columns within a table are exported. You can select a subset of columns with the --columns argument. The syntax for this argument is --columns followed by a comma-delimited list encased in double quotes. For example, --columns "colA,colB,colC".
sqoop export --connect 'jdbc:sqlserver://10.88.12.0;database=Utility;instancename=SQL2014;username=ClouderaAccess;password=GiveMeData' --table CustomersExport --columns 'CustomerID,FirstName,Surname,DateOfBirth,CreatedDateTime' --export-dir /user/cloudera/ExportExample1
Handling Nulls
Similarly to Hive imports, exports can have issues with NULLs. The arguments --input-null-string and --input-null-non-string fix these issues.
If --input-null-string is not specified, then the string "null" will be interpreted as null for string-type columns. If --input-null-non-string is not specified, then both the string "null" and the empty string will be interpreted as null for non-string columns. An empty string will always be interpreted as null for non-string columns.
sqoop export --connect 'jdbc:sqlserver://10.88.12.0;database=Utility;instancename=SQL2014;username=ClouderaAccess;password=GiveMeData' --table CustomersExport --export-dir /user/cloudera/ExportExample1 --input-null-string '\\N' --input-null-non-string '\\N'
Staging Data - Unfortunately this option is not currently supported for SQL Server
Sqoop export breaks the export process down into multiple transactions. If something fails before all of the transactions have completed, a partial dataset will be present on the target database. Further errors can follow if we rerun the job and insert collisions or duplications occur.
You can overcome this problem by specifying a staging table via the --staging-table option, which acts as a buffer used to stage the exported data. The data is then loaded to the target table in a single transaction after the stage has loaded completely.
In order for it to work, the staging table must exist prior to running the export job. The staging table must be identical in structure to the target table and must either be empty, or the --clear-staging-table option must be used. The --clear-staging-table option will delete all of the data from the staging table before starting the load.
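For reference, against a database that does support staging the syntax would look something like the sketch below; the connection string and table names are placeholders and, as noted above, this won't work against SQL Server.
sqoop export --connect '[JDBCConnectionString]' --table [TargetTableName] --staging-table [StagingTableName] --clear-staging-table --export-dir [HDFSTableLocation]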
Inserts & Updates
By default, export inserts new rows, with each row transformed into an INSERT statement that is loaded into the target database table. The load will fail if it violates any existing table constraints. This mode is best used when loading data into a new, empty table.
The --update-key argument modifies the existing dataset in the target database. Each row is treated as an UPDATE, using the column name(s) specified with --update-key as the matching criteria to ensure the correct row is updated. The --update-key argument can also be given a comma-separated list of column names.
To set this scenario up let's change the surname of a customer in the target SQL Server table.
UPDATE Utility.dbo.CustomersExport
SET Surname = 'UpdateTest'
WHERE CustomerID = 1
And now run the update Sqoop export process.
sqoop export --connect 'jdbc:sqlserver://10.88.12.0;database=Utility;instancename=SQL2014;username=ClouderaAccess;password=GiveMeData' --table CustomersExport --export-dir /user/cloudera/ExportExample1 --update-key 'CustomerID'
If an UPDATE statement modifies no rows the export will silently continue. If the column specified with --update-key does not uniquely identify rows then multiple rows are updated by a single statement.
You may also specify the --update-mode argument with allowinsert mode if you want to update rows if they exist in the database already, or insert rows if they do not exist yet. The default is updateonly, which you shouldn't need to specify unless you have new rows in the source which you don't want to export; in that case the syntax is --update-mode updateonly.
sqoop export --connect 'jdbc:sqlserver://10.88.12.0;database=Utility;instancename=SQL2014;username=ClouderaAccess;password=GiveMeData' --table CustomersExport --export-dir /user/cloudera/ExportExample1 --update-key 'CustomerID' --update-mode allowinsert
Export in Batch Mode
Instead of firing rows into the target database one at a time you can tell Sqoop to batch up the exports. By adding the argument --batch to an export statement Sqoop will complete the work on the cluster before writing any data to the target server.
This could be a useful option if the receiving database doesn't like the drip feed of huge numbers of rows.
sqoop export --connect 'jdbc:sqlserver://10.88.12.0;database=Utility;instancename=SQL2014;username=ClouderaAccess;password=GiveMeData' --table CustomersExport --export-dir /user/cloudera/ExportExample1 --batch
Transactional consistency
During export jobs Sqoop will spawn multiple processes on the target database which run in parallel. These processes will complete at different times leaving the target table incomplete until all jobs have finished. Sqoop commits data every few thousand rows to make good use of memory. The target table will be available for query during the load.
Sqoop can stage data so inconsistency is not seen in the target object. But this is slow as the process needs to wait for the job to complete and then load the entire contents into the target object from the stage object.
It's worth speaking to the administrators about performance, as poorly configured Sqoop jobs may cause unacceptable issues on the source or target database.
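If the commit size itself becomes a problem, I believe the generic JDBC export can be tuned with the Hadoop properties sqoop.export.records.per.statement and sqoop.export.statements.per.transaction, passed with -D before the other arguments. I haven't verified these against the SQL Server connector, so treat the line below as a sketch only.
sqoop export -Dsqoop.export.records.per.statement=100 -Dsqoop.export.statements.per.transaction=10 --connect 'jdbc:sqlserver://10.88.12.0;database=Utility;instancename=SQL2014;username=ClouderaAccess;password=GiveMeData' --table CustomersExport --export-dir /user/cloudera/ExportExample1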
Thursday, 2 June 2016
Sqoop & SQL Server #10 - SQL On Hadoop
So you've managed to Sqoop some data into HDFS. Great. What now? You'll want to query it. But what with? You don't know Java, so writing a MapReduce job is right out. You know SQL, so why not query the data using SQL? For that you'll need to use one of the many SQL-on-Hadoop tools that have emerged to make data analysis in Hadoop available to the large number of data professionals who know a variety of SQL.
Since we're concerned here with Sqoop, we'll concentrate on Hive with a special mention for Impala. But there are other SQL on Hadoop tools out there which would satisfy most use cases.
Hive
Developed by Facebook as a Data Warehousing framework over Hadoop, Hive provides a SQL-like language called HiveQL which is accessible to those with SQL knowledge. Hive converts queries to MapReduce jobs to be processed. Since MapReduce can be slow, query times are also slow compared to SQL Server or other RDBMSs. Hive maintains a metastore which acts as a metadata manager.
Impala
Developed by Cloudera as an MPP query engine over Hadoop, Impala also provides an accessible SQL-like language. Instead of using MapReduce, Impala sends queries to agents installed on each data node in the cluster, which then return their element of the query back to Impala. Impala removes MapReduce as a bottleneck by controlling the query execution itself and utilising the data nodes along with the Hive metastore to map out the query. This makes Impala many times faster than Hive, with execution speeds nearing, but not quite reaching, those of traditional RDBMSs. Impala is designed for quick data analysis, but when used against large datasets memory issues can be a problem. In cases when you absolutely need the query to complete or have large volumes of data, Hive is the best bet. In cases when you want an answer fast over smaller data sets, Impala is the right choice.
Importing Data Into Hive
By adding the --hive-import option to your Sqoop import command line you can simply load the data into Hive. Sqoop will examine the metadata of the table being imported and create a table in Hive. The data types won't match exactly: Sqoop will approximate the data types by default, but you can gain greater control over this with specific arguments (there's a sketch of one such argument below). Also, NULL handling can be tricky, but again you can define exactly how you want NULLs to be handled if you don't like the results of the default Sqoop execution.
As Hive is designed to act like a Data Warehouse, a common use case is to update a table periodically with new data. Hive can use partitions to store this new data more efficiently. The analogy here is with SQL Server clustered indexes, as we're controlling how data is stored on disk to make the queries run faster.
Hive tables are stored on HDFS. Remember this is a write-once cluster with the potential to be read often by many possible tools, so the data is written to disk in a format that can be read easily. Hive stores the schema in its metastore, and only at the point of query is the data interpreted against that schema. It's called schema on read. As the data is stored on HDFS we can use different file types to store the data in the most efficient manner. A common format used with Hive is Parquet, which is optimised for data analysis work.
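As a sketch of that finer control over data types, the --map-column-hive argument overrides the default SQL-to-Hive type mapping for named columns. The mappings below are only an illustration against the Customers table used in these posts, not something the defaults necessarily get wrong.
sqoop import --connect "jdbc:sqlserver://10.88.12.0:1433;database=Utility;username=ClouderaAccess;password=GiveMeData" --table Customers --hive-import --map-column-hive PostCode=STRING,MarketingOptIn=BOOLEAN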
Basic Hive Import
The --hive-import argument for the import command provides an easy way of populating Hive.
sqoop import --connect "jdbc:sqlserver://10.88.12.0:1433;database=Utility;username=ClouderaAccess;password=GiveMeData" --table Customers --hive-import
Overwrite Existing Data
The --hive-overwrite option is analogous to a TRUNCATE and INSERT in SQL Server.
sqoop import --connect "jdbc:sqlserver://10.88.12.0:1433;database=Utility;username=ClouderaAccess;password=GiveMeData" --table Customers --hive-import --hive-overwrite
If you use --hive-overwrite without --hive-import it won't work; the process will default to an HDFS load instead. And if the table has already been loaded into HDFS and you don't specify a target, you'll get an error stating that the table already exists.
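If you hit that error on a plain HDFS re-load, the --delete-target-dir import argument should clear the existing directory before the data is written. I haven't needed it in these examples, so the command below (and its target path) is only a sketch.
sqoop import --connect "jdbc:sqlserver://10.88.12.0:1433;database=Utility;username=ClouderaAccess;password=GiveMeData" --table Customers --delete-target-dir --target-dir /user/cloudera/Customers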
Creating Tables In Hive
If your data already exists in HDFS and you want to load it into Hive, you will need to define the schema yourself, since Sqoop has no metadata to use to create the Hive table during the load process.
Hive CREATE TABLE syntax is similar to other SQL variants. Here is a simple example:
CREATE TABLE [Table Name]
([Column Name 1] [DataType],
[Column Name 2] [DataType],
[Column Name 3] [DataType],
...
[Column Name n] [DataType])
STORED AS [File Type];
For more complex examples and a full syntax explanation look in the Hive wiki. It also contains information about data types, which differ between Hive and SQL Server, so beware.
To execute Hive code you need to open the Hive shell. Do this by typing hive in the terminal session and you'll see the text before the cursor change to say hive>.
For our Customers table, the create script would be as follows, storing the data as comma-delimited text so it matches the files Sqoop produces:
hive> CREATE TABLE Customers
(CustomerID INT,
FirstName VARCHAR(50),
Surname VARCHAR(50),
DateOfBirth DATE,
EmailAddress VARCHAR(100),
PostCode VARCHAR(8),
MarketingOptIn TINYINT,
AutoRenewalOptIn TINYINT,
CreatedDateTime TIMESTAMP,
ModifiedDateTime TIMESTAMP)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
hive> LOAD DATA LOCAL INPATH "[Folder]/[FileName]" INTO TABLE Customers;
Handling NULLs
Sqoop and Hive handle NULLs differently by default so you need to explicitly tell Sqoop how to load the data into Hive. This is easily remedied using the two arguments --null-string and --null-non-string with the parameter '\\N' as shown below.
sqoop import --connect "jdbc:sqlserver://10.88.12.0:1433;database=Utility;username=ClouderaAccess;password=GiveMeData" --table Customers --hive-import --null-string '\\N' --null-non-string '\\N'
Handling Delimiters
After importing into Hive using Sqoop you can run SELECT count(*) FROM your_table to see how many rows were loaded. If the number of rows is larger than in the source table you may have encountered a problem with delimiters.
The issue is caused when the data contains characters that Hive uses as delimiters, e.g. \n, \t, \01. This can arise when data has been input into text-based columns with newline or tab characters included.
Sqoop can remove these characters from your data if you use the argument --hive-drop-import-delims. For example,
sqoop import --connect "jdbc:sqlserver://10.88.12.0:1433;database=Utility;username=ClouderaAccess;password=GiveMeData" --table Customers --hive-import --hive-drop-import-delims
If the special characters are required you can use the argument --hive-delims-replacement, which will substitute a replacement string, although how useful that would be in the data is not clear. It works like this,
sqoop import --connect "jdbc:sqlserver://10.88.12.0:1433;database=Utility;username=ClouderaAccess;password=GiveMeData" --table Customers --hive-import --hive-delims-replacement "SPECIAL"
Loading Into Partitions
If you're incrementally loading data into Hive, perhaps as part of a daily load, you can use partitions to separate the data, but Hive still treats these separate blocks of data as one table. This could be useful if you want to swap out a day's load that went wrong without reloading all the data.
Partitions also act like Clustered Indexes in SQL Server as the data is written to disk in sequential blocks making retrieval easier.
Sqoop supports Hive partitioning with two additional parameters: --hive-partition-key, the name of the partition column, and --hive-partition-value, the value. So if you partition on the column CreatedDate and you're loading data for 31st May 2016, the value would be 2016-05-31 and the Sqoop command would read as follows:
sqoop import --connect "jdbc:sqlserver://10.88.12.0:1433;database=Utility;username=ClouderaAccess;password=GiveMeData" --table Customers --hive-import --hive-partition-key CreatedDate --hive-partition-value "2016-05-31"
Sqoop mandates that the partition column be of type STRING. The current implementation is limited to a single partition level, so you can't do nested partitions.
Using Impala
Using HUE, the Cloudera equivalent of SQL Server Management Studio, you can run Impala queries over the data we've already loaded into Hive. As Impala relies on the Hive metastore, only data which has been loaded into Hive can be queried using Impala.
To refresh Impala's knowledge of the Hive metastore after a data load you can run the Impala INVALIDATE METADATA command, which is poorly named as it actually does a refresh and leaves Impala ready to be used. Use it as follows,
INVALIDATE METADATA;
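If you've only loaded a single table, you should be able to limit the refresh to that table rather than the whole metastore; using the Customers table from the earlier examples it would look like this.
INVALIDATE METADATA Customers;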