Export
For those times when you need to get data back out of an
HDFS cluster into a relational database, Sqoop export is the answer.
Export works in a similar way to the import command: first
it checks the connection to the SQL Server before splitting the job into map
tasks and delivering the data in chunks.
Unlike import, export can’t create a table on the target
during the process. The table being exported into must already exist on the
target database.
By default export will insert data into the target table,
but you can use the --update-key argument, in which case the process will
update existing records instead.
The --export-dir argument and one
of --table or --call are required. These specify the
directory in HDFS that contains the source data, and either the table to populate in
the database or, in the case of --call, the stored procedure to execute.
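As a rough sketch of the --call form, assuming a stored procedure named InsertCustomer exists in the Utility database and accepts one parameter per column in the exported data, the command would look something like this.
sqoop export --connect 'jdbc:sqlserver://10.88.12.0;database=Utility;instancename=SQL2014;username=ClouderaAccess;password=GiveMeData' --call InsertCustomer --export-dir /user/cloudera/ExportExample1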
Basic Example
First we'll load data into a specific folder in HDFS to assist with the export example.
sqoop import --connect "jdbc:sqlserver://10.88.12.0\\SQL2014;database=Utility;username=ClouderaAccess;password=GiveMeData" --table Customers --target-dir /user/cloudera/ExportExample1 -m 1
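Before moving on it's worth confirming the import landed where we expect by listing the directory in HDFS; the exact file names will depend on the number of mappers used.
hdfs dfs -ls /user/cloudera/ExportExample1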
And on the target SQL Server we'll create a table ready to receive the data load.
CREATE TABLE dbo.CustomersExport
(
CustomerID int NOT NULL,
FirstName varchar(50) NOT NULL,
Surname varchar(50) NOT NULL,
DateOfBirth date NOT NULL,
EmailAddress varchar(100) NULL,
PostCode varchar(8) NULL,
MarketingOptIn bit NULL,
AutoRenewalOptIn bit NULL,
CreatedDateTime datetime NOT NULL,
ModifiedDateTime datetime NULL
) ON [PRIMARY]
GO
Now we're ready to execute the sqoop export.
sqoop export --connect 'jdbc:sqlserver://10.88.12.0;database=Utility;instancename=SQL2014;username=ClouderaAccess;password=GiveMeData' --table CustomersExport --export-dir /user/cloudera/ExportExample1
Whilst trying to get this to work I found a difference in the syntax from what I'd been using for imports. For example, the argument instancename=SQL2014 replaced the \SQL2014 part of the server address within the connection string. I tried amending the import syntax to reflect this new syntax and the import failed on a syntax error.
Exporting a Subset of Columns
By default, all columns within a table are exported. You can
select a subset of columns with the --columns argument. The syntax
for this argument is --columns followed by a comma-delimited list encased in
double quotes. For example, --columns "colA,colB,colC".
sqoop export --connect 'jdbc:sqlserver://10.88.12.0;database=Utility;instancename=SQL2014;username=ClouderaAccess;password=GiveMeData' --table CustomersExport --columns "CustomerID,FirstName,Surname,DateOfBirth,CreatedDateTime" --export-dir /user/cloudera/ExportExample1
Handling Nulls
Similarly to Hive imports, exports can have issues with
NULLs. The arguments --input-null-string and --input-null-non-string
fix these issues.
If --input-null-string is not specified, then the
string "null" will be interpreted as null for string-type columns.
If --input-null-non-string is not specified, then both the string
"null" and the empty string will be interpreted as null for
non-string columns. Note that the empty string will always be interpreted as null for
non-string columns, regardless of this setting. Both arguments take a value; in the example below we've assumed the data uses \N (the Hive default) to represent nulls, so we pass '\\N' for both.
sqoop export --connect 'jdbc:sqlserver://10.88.12.0;database=Utility;instancename=SQL2014;username=ClouderaAccess;password=GiveMeData' --table CustomersExport --export-dir /user/cloudera/ExportExample1 --input-null-string '\\N' --input-null-non-string '\\N'
Staging Data - Unfortunately this option is not currently supported for SQL Server
Sqoop breaks the export process down into multiple
transactions. If something fails before all of the transactions have completed, a
partial dataset will be left in the target database. Further errors can
follow if we rerun the job and insert collisions or duplications occur.
You can overcome this problem by specifying a staging table
via the --staging-table option, which acts as a buffer used to stage the
exported data. The data is then loaded into the target table in a single
transaction after the staging table has been fully populated.
In order for this to work, the staging table must exist prior
to running the export job. It must be identical in structure to the target
table, and must either be empty or the --clear-staging-table option must be used.
The --clear-staging-table option will delete any existing data in the staging table before
the load starts.
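For a connector that does support staging, the syntax would look something like the sketch below, where CustomersExport_Stage is a hypothetical staging table with the same structure as CustomersExport. With the SQL Server connector this will currently fail, but it illustrates the shape of the command.
sqoop export --connect 'jdbc:sqlserver://10.88.12.0;database=Utility;instancename=SQL2014;username=ClouderaAccess;password=GiveMeData' --table CustomersExport --staging-table CustomersExport_Stage --clear-staging-table --export-dir /user/cloudera/ExportExample1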
Inserts & Updates
By default, export inserts new rows, with each row transformed
into an INSERT statement that is loaded into the target database table. The
load must not violate any existing table constraints or it will fail. This
mode is best used when loading data into a new, empty table.
The --update-key argument modifies the existing
dataset in the target database. Each row is treated as an UPDATE, using the
column name(s) specified with --update-key as the matching criteria to
ensure the correct row is updated. The --update-key argument can also
be given a comma-separated list of column names.
To set this scenario up let's change the surname of a customer in the target SQL Server table.
UPDATE Utility.dbo.CustomersExport
SET Surname = 'UpdateTest'
WHERE CustomerID = 1
And now run the update Sqoop export process.
sqoop export --connect 'jdbc:sqlserver://10.88.12.0;database=Utility;instancename=SQL2014;username=ClouderaAccess;password=GiveMeData' --table CustomersExport --export-dir /user/cloudera/ExportExample1 --update-key 'CustomerID'
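Once the job completes we can check in SQL Server that the surname has been set back to the value held in HDFS.
SELECT CustomerID, Surname
FROM Utility.dbo.CustomersExport
WHERE CustomerID = 1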
If an UPDATE statement modifies no rows the export
will silently continue. If the column specified with --update-key does
not uniquely identify rows, then multiple rows will be updated by a single
statement.
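Where no single column is unique you can pass a comma-separated list so that the combination of columns identifies the row. The example below is only a sketch; the CustomerID and EmailAddress pairing is a hypothetical composite key, not something used elsewhere in this post.
sqoop export --connect 'jdbc:sqlserver://10.88.12.0;database=Utility;instancename=SQL2014;username=ClouderaAccess;password=GiveMeData' --table CustomersExport --export-dir /user/cloudera/ExportExample1 --update-key 'CustomerID,EmailAddress'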
You may also specify the --update-mode argument with allowinsert mode
if you want to update rows where they already exist in the database and insert them
where they do not. The default mode is updateonly, which you shouldn't need to specify explicitly unless you have new rows in the source which you don't want to export.
sqoop export --connect 'jdbc:sqlserver://10.88.12.0;database=Utility;instancename=SQL2014;username=ClouderaAccess;password=GiveMeData' --table CustomersExport --export-dir /user/cloudera/ExportExample1 --update-key 'CustomerID' --update-mode allowinsert
Export in Batch Mode
Instead of firing rows into the target database one at a time you can tell Sqoop to batch up the exports. Adding the --batch argument to an export statement tells Sqoop to use batch mode for the underlying statement execution, grouping multiple statements together and sending them to the target server in batches rather than one by one.
This could be a useful option if the receiving database doesn't like the drip feed of huge numbers of rows.
sqoop export --connect 'jdbc:sqlserver://10.88.12.0;database=Utility;instancename=SQL2014;username=ClouderaAccess;password=GiveMeData' --table CustomersExport --export-dir /user/cloudera/ExportExample1 --batch
Transactional consistency
During export jobs Sqoop will spawn multiple processes on the target database which run in parallel. These processes will complete at different times, leaving the target table incomplete until all of them have finished. Sqoop commits data every few thousand rows to make good use of memory, so the target table will be visible to queries, and partially loaded, while the load runs.
Sqoop can stage data so that this inconsistency is never seen in the target object, but this is slower because the process has to wait for the whole job to complete and then load the entire contents of the staging object into the target object.
It's worth speaking to the database administrators about performance, as poorly configured Sqoop jobs can cause unacceptable issues on the source or target database.