Sunday, 2 December 2012

T.S. Eliot Quote

Where is the wisdom we have lost in knowledge? Where is the knowledge we have lost in information?

T. S. Eliot.

Saturday, 22 September 2012

A 7-Zip Archiving Task for my SSIS Package Template

I use an SSIS package template to speed up project delivery. 7-Zip is my archiving tool of choice because it provides encryption and password protection in addition to good compression, and the command-line version has all the scripting options you'd expect, which makes it easy to use with SSIS. Documentation for command-line scripting can be found here.
I call 7-Zip from an Execute Process Task. This requires the 7-Zip executable to be present on the server where the SSIS package executes. The standard 7-Zip executable (7z.exe) relies on the rest of the installation, so use the standalone console version, 7za.exe, instead. Both can be downloaded here.

In my SSIS template I have a preconfigured Execute Process Task with expressions that build the complex strings required by the command-line tool from SSIS variables. For simple recurring packages these variables can be set directly; with minor amendments they can be populated from configuration values.

Configuring the Task

I always name each task using a standard acronym for the task type followed by a brief description. This makes debugging much easier, both during development (via the Package Explorer pane) and once the package is live, because every package created from my template logs to an audit schema. That audit can be invaluable when searching for a failure within a complex process, and the log also allows reporting on the duration of each step, which aids performance tuning.

The screenshot below shows the general pane for the Execute Process Task used in this example.


Next comes the all-important Process pane. This is where the configuration is set. Or is it? With complex strings required as command-line arguments, it is better to configure the task dynamically using expressions and variables.


In this version of my package template I set only two values, the Arguments and the Working Directory. I probably should have set the Executable location via a variable as well, but that'll have to wait for the next version.

The three string variables used are:
  • [User::ZipFileName]
  • [User::ZipFileLocation]
  • [User::ZipPassword]


The most complex expression is the one that sets the command-line arguments, shown here:
"a -tzip " + @[User::ZipFileName] + "_" + (DT_STR,4,1252)YEAR(GETDATE()) + RIGHT("0" + (DT_STR,2,1252)MONTH(GETDATE()),2) + RIGHT("0" + (DT_STR,2,1252)DAY(GETDATE()),2) + ".zip *.csv -P" +  @[User::ZipPassword]
And this results in the following string: 
a -tzip DataExport_20130822.zip *.csv -PPa55word!

In Summary

This is a useful addition to my SSIS package template, and one I use regularly. I don't believe in reinventing the wheel; instead I prefer simple modular solutions I can combine to make more complex packages. 

The time taken to establish a template is well worth the investment. I'm often very busy in my role and I prefer to focus my limited time and brain cycles on higher level problems than configuring an SSIS task. I recommend building your own template or using mine which I'll be sharing shortly.

Check out my WinSCP sFTP task, which is also part of my SSIS package template.

Monday, 13 August 2012

Book Review - Microsoft SQL Server 2008 Integration Services Problem-design-solution

It doesn't matter that this book relates to SQL Server 2008 because, unlike most of the books on the market, it doesn't set out to explain every element of the current version of SSIS in dry detail.

Instead, SQL Server 2008 Integration Services Problem - Design - Solution teaches you how to use the tool. It's not a book for beginners; it's a book for intermediate and advanced users.

You may have built countless packages during your career and think you're pretty good. That may be the case, but the authors also have many years of experience across multiple projects, and they've learned what works best.

Even advanced SSIS developers can learn from this text. It might simply validate your approach; if so, great. If not, keep reading.

Each chapter covers a different problem. I particularly liked Chapter 2 - SSIS Management Framework Design, from which I've copied the audit schema and framework to use throughout my SSIS development. Chapter 6 - Data Cleansing Design is also good, and we're building a generic ETL cleanser based on advice in this chapter for a single customer view implementation. Chapter 9 - SSAS Processing Architecture describes an excellent method for processing OLAP cubes using XMLA. I'm really keen to get this implemented, but time constraints on my project mean we're still in a drop-and-rebuild model.

In my opinion this is the best SSIS book on the market.

Buy it. Read it. Use it.

Thursday, 26 July 2012

Building an Auditing Process for SSIS - Part 2

In this post I'll continue where I left off last time, with the configuration of an SSIS package to populate the audit schema. 

We use this to support a DBA dashboard showing the progress of ETL operations, and as a method of documenting the package mechanism. I've found that complex, dynamic SSIS packages which use parameters and expressions can be time-consuming to debug, even for relatively simple issues. In my organisation we now use this audit schema to allow the rapid diagnosis of failures and poor performance.

In a future post I'll blog about our DBA dashboard, which links into this schema and provides a snapshot of all SSIS processes in much greater detail than can be derived from SQL Agent errors.

Configuration


Each package requires a connection manager. Keep this distinct from any other connection managers, as a dedicated metadata connection, and configure it with sufficient security to write to the audit schema in your chosen Utility database.

Control Flow Elements

The package logic should be wholly contained within a Sequence Container, with a SQL task as a prerequisite step which initiates the BatchLog and another SQL task as a final step which closes that batch.

Initiate Audit - SQL Task General Pane

The SQL statement reads as follows: EXEC Audit.LogPackageStart @BatchLogID = ?,@PackageName = ?,@ExecutionInstanceID = ?,@MachineName = ?,@UserName = ?,@StartDatetime = ?,@PackageVersionGUID = ?,@VersionMajor = ?,@VersionMinor = ?,@VersionBuild = ?,@VersionComment = ?,@PackageGUID = ?,@CreationDate = ?,@CreatedBy = ?
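
The Audit.LogPackageStart procedure itself ships with the book's framework script, so I won't reproduce it, but if you just want a feel for what it does, a minimal sketch along these lines captures the idea (the table and column names are my assumptions, inferred from the parameter list above): it opens a batch if one isn't already open, records the package execution, and returns the keys for the Result Set pane to map back to the SSIS variables.

 -- Minimal illustrative sketch only; the real Audit.LogPackageStart comes from the
 -- SSIS PDS framework script, and the table and column names here are assumptions.
 CREATE PROCEDURE Audit.LogPackageStart
 (
      @BatchLogID INT,
      @PackageName VARCHAR(255),
      @ExecutionInstanceID UNIQUEIDENTIFIER,
      @MachineName VARCHAR(64),
      @UserName VARCHAR(64),
      @StartDatetime DATETIME,
      @PackageVersionGUID UNIQUEIDENTIFIER,
      @VersionMajor INT,
      @VersionMinor INT,
      @VersionBuild INT,
      @VersionComment VARCHAR(1000),
      @PackageGUID UNIQUEIDENTIFIER,
      @CreationDate DATETIME,
      @CreatedBy VARCHAR(255)
 )
 AS
 SET NOCOUNT ON
 -- Open a new batch if the calling package didn't pass one in
 IF ISNULL(@BatchLogID, 0) = 0
 BEGIN
      INSERT INTO Audit.BatchLog (StartDatetime, [Status])
      VALUES (@StartDatetime, 'R')
      SELECT @BatchLogID = SCOPE_IDENTITY()
 END
 -- Record the start of this package execution
 INSERT INTO Audit.PackageLog
      (BatchLogID, PackageName, ExecutionInstanceID, MachineName, UserName, StartDatetime,
       PackageVersionGUID, VersionMajor, VersionMinor, VersionBuild, VersionComment,
       PackageGUID, CreationDate, CreatedBy, [Status])
 VALUES
      (@BatchLogID, @PackageName, @ExecutionInstanceID, @MachineName, @UserName, @StartDatetime,
       @PackageVersionGUID, @VersionMajor, @VersionMinor, @VersionBuild, @VersionComment,
       @PackageGUID, @CreationDate, @CreatedBy, 'R')
 -- Return the keys so the Result Set pane can map them back to the SSIS variables
 SELECT @BatchLogID AS BatchLogID, CAST(SCOPE_IDENTITY() AS INT) AS PackageLogID
 GO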

Initiate Audit - SQL Task Parameter Mapping Pane

Initiate Audit - SQL Task Result Set Pane

Staying with the Control Flow pane, we move on to the configuration of the audit termination.

Terminate Audit - SQL Task General Pane



The SQL Statement reads as follows: EXEC Audit.LogPackageEnd @PackageLogID = ?, @BatchLogID = ?,@EndBatchAudit = ?
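
Again, the real procedure comes with the framework script; as a rough sketch of the intent only (names assumed), it closes the package row and, when the EndBatchAudit flag is set, the batch row too.

 -- Minimal illustrative sketch only; the real Audit.LogPackageEnd comes from the
 -- SSIS PDS framework script, and the column names here are assumptions.
 CREATE PROCEDURE Audit.LogPackageEnd
 (
      @PackageLogID INT,
      @BatchLogID INT,
      @EndBatchAudit BIT
 )
 AS
 SET NOCOUNT ON
 -- Close the package row
 UPDATE Audit.PackageLog
 SET EndDatetime = GETDATE(), [Status] = 'S'
 WHERE PackageLogID = @PackageLogID
 -- Close the batch row if this package is the last in the batch
 IF @EndBatchAudit = 1
 BEGIN
      UPDATE Audit.BatchLog
      SET EndDatetime = GETDATE(), [Status] = 'S'
      WHERE BatchLogID = @BatchLogID
 END
 GO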

Terminate Audit - SQL Task Parameter Mapping Pane



The configuration of the Control Flow pane is now complete, and as this post is getting long I'll break here. Next time we'll move on to the configuration of the Event Handlers.


Monday, 23 July 2012

Building an Auditing Process for SSIS - Part 1

The audit schema published in the excellent book SSIS Problem – Design – Solution allows the efficient capture of SSIS progress and errors into a simple set of SQL Server tables. We use this to support a DBA dashboard showing the progress of ETL operations, and as a method of documenting the package mechanism.

SSIS can be difficult to debug, particularly when packages are dynamically driven by variables and expressions or when sub packages are used, but this audit mechanism allows for easier identification of problems and performance bottlenecks.

The SQL Server schema

Attached here is the documentation for the SSIS PDS audit schema plus a CREATE script. This schema should be created in a Utility database or, as has been done here, as a separate schema within a database named DataWarehouse.
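
If you just want a feel for the shape of the schema before downloading the script, a cut-down sketch of the two core tables looks something like this (the full schema has more columns and tables, and these names are my shorthand rather than the framework's exact definitions):

 -- Cut-down sketch of the two core audit tables. Illustrative only; use the
 -- attached CREATE script for the real definitions.
 CREATE TABLE Audit.BatchLog
 (
      BatchLogID    INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
      StartDatetime DATETIME NOT NULL,
      EndDatetime   DATETIME NULL,
      [Status]      CHAR(1) NOT NULL   -- e.g. R = running, S = success, F = failed
 )
 GO
 CREATE TABLE Audit.PackageLog
 (
      PackageLogID  INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
      BatchLogID    INT NOT NULL REFERENCES Audit.BatchLog (BatchLogID),
      PackageName   VARCHAR(255) NOT NULL,
      MachineName   VARCHAR(64) NULL,
      UserName      VARCHAR(64) NULL,
      StartDatetime DATETIME NOT NULL,
      EndDatetime   DATETIME NULL,
      [Status]      CHAR(1) NOT NULL
 )
 GO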

The SSIS Components

The SSIS tasks make use of the following variables. They need to be set up as shown below.
  Name            DataType   Value
  BatchLogID      Int32      0
  PackageLogID    Int32      0
  EndBatchAudit   Boolean    False
  VariableValue   String     Unable to convert to string

Our standard SSIS Template has these variables and tasks already plumbed in so all new development should use that package as a starting point.

In Part 2 I'll show how your SSIS package should be configured.

Monday, 2 July 2012

ETL Cleansing : Remove Whitespace

I'm working on a green field data warehouse project with its fair share of legacy data sources. I know the SQL Server based sources are of average quality, but I'm expecting quite a lot of CSV files from external entities and I'll have minimal control over their data quality.

None of this is unusual. My solution is therefore to produce an ETL cleansing mechanism which can accept any data source and use a process control table to iterate through predefined rules. As an example, I've written a sproc which removes whitespace from the front or back of a string. This is an easy thing to do, but my requirement is to create a solution that can work with or without SSIS and which can log all the changes it makes to an audit schema. I also want it to be fast, so it has to work set-based, not RBAR.

An excellent set of tools from Pragmatic Works does the same using SSIS. I'd like the choice of a T-SQL or an SSIS approach, so I'm building my own; besides, my organisation is new to the data warehouse world and budgets are stretched. If you're working for an enterprise-class business or an organisation with the budget, buy the Pragmatic Works tools.

I'll step through the solution and then show the code.
  1. The procedure takes a number of parameters. @Database, @Schema, @Table and @Column specify what will be cleansed. @TrimLeadingWhitespace and @TrimTrailingWhitespace are switches, and @PrintOrExecute allows us to debug the dynamic SQL by calling the procedure and producing, but not executing, the code. @BatchLogID is optional as we'll generate a BatchLogID if it's not passed.
  2. After setting up the variables required to store the dynamic SQL and for use in the loop, I initiate a BatchLogID which is used in my audit metadata to track all changes.
  3. I then begin to build a dynamic SQL string which uses the stored procedure parameters to update the required column in the specified table. This can get quite complex, hence the Print or Execute flag, which is invaluable during development.
  4. Because I log what data was changed in a set-based operation, I want to create a string containing the primary key columns and values for each row altered. To do this for any input parameters I have another procedure, GetPrimaryKey, which, given the details of a table, returns a comma-separated list of the PK columns. I then reconstruct this PK string with the actual values of the row being updated appended into it.
  5. The following step appends an OUTPUT clause to the foot of the UPDATE which will INSERT verbose logging information into a log location defined earlier in the sproc.
  6. Finally I construct the dynamic SQL and either print or execute it.
To test it, simply plug in some AdventureWorks details and set the @PrintOrExecute flag to 'P', like this:

 EXEC Clean.RemoveWhitespace NULL,'AdventureWorks','Person','Address','AddressLine2','Y','Y','P'  
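
For reference, this is roughly what the 'P' call prints, reformatted over several lines for readability; the BatchLogID value and the log table after INTO are illustrative, since those depend on your Audit.BatchLog identity and Common.Config entry.

 -- Approximate output of the 'P' (print) call above, reformatted for readability.
 -- The BatchLogID value and the log table named after INTO are illustrative.
 UPDATE AdventureWorks.Person.Address
 SET AddressLine2 = LTRIM(RTRIM(AddressLine2))
 OUTPUT ('AddressID = ' + CAST(INSERTED.AddressID AS VARCHAR(30))) AS PrimaryKey,
        'AddressLine2' AS [Column],
        DELETED.AddressLine2 AS OldValue,
        INSERTED.AddressLine2 AS NewValue,
        GETDATE() AS ModifiedDate,
        '1' AS BatchLogID,
        'RemoveWhitespace' AS ProcedureName
 INTO DataWarehouse.Clean.CleanseLog
 WHERE AddressLine2 LIKE ' %' OR AddressLine2 LIKE '% '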

And here is the code. Feel free to use it, but if you do, please share your experiences and let me know of any optimisations you make in the comments below.

 ALTER PROCEDURE [Clean].[RemoveWhitespace]  
 (  
      @BatchLogID INT,  
      @Database AS VARCHAR(50),  
      @Schema AS VARCHAR(50),  
      @Table AS VARCHAR(50),  
      @Column AS VARCHAR(50),  
      @TrimLeadingWhitespace AS CHAR(1), -- Options are "Y" or "N"  
      @TrimTrailingWhitespace AS CHAR(1), -- Options are "Y" or "N"  
      @PrintOrExecute AS CHAR(1) -- Options are "P" or "E". Print will just show what the sproc intended to clean, execute will actually do it.  
 )  
 AS  
 SET NOCOUNT ON  
 -- SPROC SPECIFIC VARIABLES  
 DECLARE @SQL VARCHAR(4000)  
 DECLARE @MatchingCode VARCHAR(100)  
 DECLARE @PKID VARCHAR(100)  
 DECLARE @MaxID SMALLINT  
 DECLARE @Counter SMALLINT  
 DECLARE @PKString VARCHAR(500)  
 DECLARE @LogLocation VARCHAR(100)  
 DECLARE @ErrorMessage VARCHAR(255)  
 SET @PKString = ''  
 SET @Counter = 0  
 SELECT @LogLocation = ConfigVariable FROM DataWarehouse.Common.Config WHERE ConfigDescription = 'ETL Cleanse Log Location'  
 -- CHECK FOR BATCHLOGID. IF NOT EXISTS THEN INITIATE BATCHLOG  
 IF ISNULL(@BatchLogID,0) = 0  
 BEGIN  
   INSERT INTO DataWarehouse.Audit.BatchLog(StartDatetime,[Status])  
     VALUES (GETDATE(), 'R')  
   SELECT @BatchLogID = SCOPE_IDENTITY()  
 END  
 -- BEGIN PROCESS  
 BEGIN TRY   
      BEGIN TRANSACTION  
           -- DEFINE THE LIKE CLAUSE FOR USE THROUGHOUT THE SPROC  
           SELECT @MatchingCode =   
           CASE WHEN (@TrimLeadingWhitespace = 'Y' AND @TrimTrailingWhitespace = 'N')     THEN @Column + ' LIKE '' %'''  
                WHEN (@TrimLeadingWhitespace = 'N' AND @TrimTrailingWhitespace = 'Y') THEN @Column + ' LIKE ''% '''  
                ELSE @Column + ' LIKE '' %'' OR ' + @Column + ' LIKE ''% ''' END   
           -- DO THE CLEANSE BY UPDATING THE COLUMN  
                -- BEGIN BUILDING THE DYNAMIC SQL TO PERFORM THE CLEAN                 
                SET @SQL = 'UPDATE ' + @Database + '.' + @Schema + '.' + @Table + ' SET ' + @Column + ' = '  
                -- APPEND TRIM FUNCTIONS REQUIRED  
                SELECT @SQL = @SQL +  
                     CASE WHEN (@TrimLeadingWhitespace = 'Y' AND @TrimTrailingWhitespace = 'N')     THEN 'LTRIM(' + @Column + ')'  
                     WHEN (@TrimLeadingWhitespace = 'N' AND @TrimTrailingWhitespace = 'Y') THEN 'RTRIM(' + @Column + ')'  
                     ELSE 'LTRIM(RTRIM(' + @Column + '))' END   
                -- COLLECT PK INFORMATION FOR OUTPUT COLUMNS  
                     -- GET PK FOR SPECIFIED TABLE  
                     EXEC DataWarehouse.Common.GetPrimaryKey @Database,@Schema,@Table,@PKList = @PKID OUTPUT  
                     -- CREATE TEMP TABLE OF PK COLUMNS WITH IDENTITY  
                     DECLARE @PKTable TABLE (ID INT IDENTITY(1,1), Items VARCHAR(100))  
                     -- POPULATE TABLE VARIABLE  
                     INSERT INTO @PKTable (Items) SELECT Items FROM DataWarehouse.Common.Spliter(@PKID,',')   
                     -- GET MAXIMUM ID VALUE  
                     SELECT @MaxID = MAX(ID) FROM @PKTable  
                      -- BUILD THE STRING IN THE TABLE VARIABLE  
                     UPDATE @PKTable SET Items = Items + ' = # + CAST(INSERTED.' + Items + ' AS VARCHAR(30))'  
                     UPDATE @PKTable SET Items = Items + ' + #' WHERE ID <> @MaxID  
                     UPDATE @PKTable SET Items = '(#' + Items WHERE ID = 1  
                     UPDATE @PKTable SET Items = Items + ',' WHERE ID <> @MaxID  
                     UPDATE @PKTable SET Items = Items + ') AS PrimaryKey' WHERE ID = @MaxID  
                     UPDATE @PKTable SET Items = REPLACE(Items,'#',CHAR(39))  
                     -- CONCATENATE THE ROWS INTO A STRING  
                     WHILE(@Counter < @MaxID + 1)  
                          BEGIN  
                               SELECT @PKString = @PKString + Items FROM @PKTable WHERE ID = @Counter  
                               SET @Counter = @Counter + 1       
                          END  
                -- APPEND PK BUILD STRING TO DYNAMIC SQL       
                 SET @SQL = @SQL + ' OUTPUT ' + @PKString +', ''' + @Column + ''' AS [Column], DELETED.' + @Column + ' AS OldValue, INSERTED.' +  
                     @Column + ' AS NewValue, GETDATE() AS ModifiedDate,''' + CAST(@BatchLogID AS VARCHAR(10)) + ''' AS BatchLogID, ''' +  
                     CAST(OBJECT_NAME(@@PROCID) AS VARCHAR(50)) + ''' AS ProcedureName INTO ' + @LogLocation  
                -- APPEND WHERE CLAUSES TO DYNAMIC SQL       
                SET @SQL = @SQL + ' WHERE ' + @MatchingCode  
                -- PRINT OR EXECUTE DEPENDING ON PARAMETER       
                IF(@PrintOrExecute = 'P')  
                     BEGIN  
                          PRINT (@SQL)  
                     END  
                     ELSE  
                          EXEC (@SQL)  
           -- COMPLETE THE TRANSACTION  
           COMMIT TRANSACTION  
 END TRY   
 -- RETURN ERRORS AND ROLLBACK WHEN ERRORS OCCUR  
 BEGIN CATCH   
      -- ROLLBACK  
      ROLLBACK TRANSACTION  
      -- REPORT ERROR  
      SET @ErrorMessage = ERROR_MESSAGE()       
      RAISERROR (@ErrorMessage,17,1)  
 END CATCH  
 GO  

This is my code for returning a comma separated list containing the primary key columns for any specified table.

 ALTER PROCEDURE [Common].[GetPrimaryKey]  
 (  
      @Database AS VARCHAR(50),  
      @Schema AS VARCHAR(50),  
      @Table AS VARCHAR(50),  
      @PKList AS VARCHAR(200) OUTPUT  
 )  
 AS   
 SET NOCOUNT ON  
 -- DECLARE VARIABLES  
 DECLARE @SQL NVARCHAR(500)  
 -- CATCH ERRORS USING TRY..CATCH  
 BEGIN TRY   
           -- BUILD DYNAMIC SQL TO QUERY SYS TABLES AND RETURN THE PK FOR SPECIFIED TABLE IN A CSV LIST  
      SET @SQL = 'USE ' + @Database + '  
                SELECT @PKList = COALESCE(@PKList + '','','''') + CAST(CCU.COLUMN_NAME AS VARCHAR(20))  
                FROM INFORMATION_SCHEMA.TABLE_CONSTRAINTS TC  
                JOIN INFORMATION_SCHEMA.CONSTRAINT_COLUMN_USAGE CCU ON TC.CONSTRAINT_NAME = CCU.CONSTRAINT_NAME  
                 WHERE TC.CONSTRAINT_TYPE = ''PRIMARY KEY''   
                AND TC.TABLE_CATALOG = ''' + @Database + '''  
                AND TC.TABLE_SCHEMA = ''' + @Schema + '''  
                AND TC.TABLE_NAME = ''' + @Table + ''''  
      -- EXECUTE THE DYNAMIC SQL  
      EXEC SP_ExecuteSQL @SQL, N'@PKList VARCHAR(200) OUTPUT', @PKList OUTPUT  
 END TRY  
 -- RETURN ERRORS AND ROLLBACK WHEN ERRORS OCCUR  
 BEGIN CATCH   
      -- REPORT ERRORS  
      SELECT ERROR_NUMBER() ErrorNumber, ERROR_SEVERITY() ErrorSeverity, ERROR_LINE() ErrorLine, ERROR_MESSAGE() ErrorMessage  
 END CATCH  
 GO  

Here is the code for splitting a comma-separated list into a table, used in the creation of a primary key list with values appended into the string.

 ALTER FUNCTION [Common].[Spliter](@String VARCHAR(8000), @Delimiter CHAR(1))      
 RETURNS @TempTable TABLE (Items VARCHAR(8000))      
 AS   
 BEGIN  
   DECLARE @IDX INT      
   DECLARE @Slice VARCHAR(8000)      
   SELECT @IDX = 1      
     IF LEN(@String)<1 OR @String IS NULL RETURN  
   WHILE @IDX!= 0      
   BEGIN  
     SET @IDX = CHARINDEX(@Delimiter,@String)      
     IF @IDX!=0      
                SET @Slice = LEFT(@String,@IDX - 1)      
           ELSE  
                SET @Slice = @String      
     IF(LEN(@Slice)>0)   
       INSERT INTO @Temptable(Items) VALUES(@Slice)      
     SET @String = RIGHT(@String,LEN(@String) - @IDX)      
     IF LEN(@String) = 0 BREAK  
   END  
 RETURN     
 END  
 GO  

Sunday, 10 June 2012

I broke the Distribution database

I've been dragged away from the creative side of my job recently to fix creaking infrastructure. It's not a part of the job I enjoy, but I recognise I've reached a level at which I may never again be able to escape back into pure development.

This week's fun involved corrupt LDF files on a production server. The server is virtual and appears to be having disk issues. The databases in question continued to mirror and replicate without issue, and a test restore followed by DBCC checks revealed no database corruption or data loss.

To fix the corrupt LDF files we had to fail over to our mirror, break the mirroring for the affected database, restore from a full backup and re-sync with the mirror again. It worked fine. But the second corruption, coming just six days after the first, suggested an underlying disk issue, so our network guys suggested we nuke the log file disk partition and recreate it. We failed over all the mirrored databases and stopped replication from the suspect server. We also moved the LDF files for a few small, less important databases away from the dodgy partition.

The disk work done, we had a few hours of reconfiguration to get the environment back to normal. However, when I attempted to set up replication again I realised I'd made a stupid mistake. The distribution database - which is used extensively by SQL Server in replication - was broken because I hadn't moved the LDF file for this database. Googling for the fix didn't turn up much of use. Most posts were from DBAs having issues with replication after "accidentally" dropping the distribution database, and none of them helped me solve the particular issue I'd created. Consequently it took a couple of hours to find a solution, hence this blog post.

The problem:
My distribution database no longer has an LDF file. I want to drop the broken distribution database, but decommissioning it using the wizard (initiated by right-clicking Replication in the SSMS UI) returns errors.

The solution:
MSDN lists a number of system stored procedures for managing the distribution database, and the procedure shown below, with the addition of the parameter @ignore_distributor = 1, removes the distribution database cleanly with no questions asked. This enabled me to use the simple UI wizard to configure a new distribution database and quickly establish replication again.

 USE [master]  
 EXEC sp_dropdistributor @no_checks = 1, @ignore_distributor = 1  
 GO  

What we learned this week.

  • Creative DBA work is more fun than high stress open heart surgery on your production database.
  • Our DR solution works!!
  • Our documentation is non-existent and despite my best efforts will probably remain that way.


Monday, 4 June 2012

Detecting & Fixing Collation Issues

Differences between database and server collations are a common DBA problem, especially if the environment has developed organically over the years. This is exactly the problem my team faces as we begin our data warehouse development. We're lucky in the sense that we're upgrading and consolidating old SQL Server instances to 2008 R2. This provides an excellent opportunity to standardise the server and database collations, although it does require extra work as some legacy apps will need to be thoroughly tested.
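
Before looking at individual columns, it's worth a quick check of which databases differ from the server collation in the first place. Nothing clever, just the system catalog:

 -- List databases whose collation differs from the server default
 SELECT name,
        collation_name,
        SERVERPROPERTY('Collation') AS server_collation
 FROM sys.databases
 WHERE collation_name <> CAST(SERVERPROPERTY('Collation') AS SYSNAME)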

We found two useful scripts that helped us identify and synchronise the databases which had collation differences.

The first is from respected SQL blogger Tibor Karaszi and identifies tables and columns within a database that do not match the database collation. Download this script here.
 IF OBJECT_ID('tempdb..#res') IS NOT NULL 
   DROP TABLE #res
 GO
 DECLARE @db SYSNAME,
   @sql NVARCHAR(2000)
 CREATE TABLE #res
   (
    server_name SYSNAME,
    db_name SYSNAME,
    db_collation SYSNAME,
    table_name SYSNAME,
    column_name SYSNAME,
    column_collation SYSNAME
   )
 DECLARE c CURSOR
 FOR
 SELECT name
 FROM  sys.databases
 WHERE  NAME NOT IN ('master','model','tempdb','msdb') AND
     state_desc = 'ONLINE'
 OPEN c
 WHILE 1 = 1 
   BEGIN
     FETCH NEXT FROM c INTO @db
     IF @@FETCH_STATUS <> 0 
       BREAK
     SET @sql = 'SELECT
   @@SERVERNAME AS server_name
  ,''' + @db + ''' AS db_name
  ,CAST(DATABASEPROPERTYEX(''' + @db + ''', ''Collation'') AS sysname) AS db_collation
  ,OBJECT_NAME(c.object_id, ' + CAST(DB_ID(@db) AS SYSNAME) + ') AS table_name
  ,c.name AS column_name
  ,c.collation_name AS column_collation 
 FROM ' + QUOTENAME(@db) + '.sys.columns AS c
  INNER JOIN ' + QUOTENAME(@db) + '.sys.tables AS t ON t.object_id = c.object_id
 WHERE t.type = ''U''
  AND c.collation_name IS NOT NULL
  AND c.collation_name <> CAST(DATABASEPROPERTYEX(''' + @db + ''', ''Collation'') AS sysname)
 '
 --PRINT @sql
     INSERT INTO #res
         EXEC (@sql)
   END
 CLOSE c
 DEALLOCATE c
 SELECT *
 FROM  #res

The second script is much bigger and I can't remember where we found it online. If you wrote it I'm happy to reference you here as I certainly can't claim it.

The script will amend a database and the objects within it to a collation of your choice. Line 68 is where you set the variable for the collation you're changing to and line 70 is the database you're altering. It takes time to run as there are some complex conversions involved.

DO NOT RUN THIS SCRIPT IN A PRODUCTION ENVIRONMENT BEFORE TESTING EXTENSIVELY. By publishing this script I am not guaranteeing it is safe to use in your environment and I'm not advocating using it without first understanding it. However, it worked really well for us.

Here is a link to the script.

Thursday, 5 April 2012

Tidying Up SQL Agent Jobs

I've inherited database servers which have grown organically over the years. For each project, procedures, functions or views, as well as SQL Agent jobs, have been added with little thought for their place in the wider scheme of the organisation's MI and with little consideration for their administration over time.

As a starting point for a mini project to document all the SQL Agent jobs on the servers we support - both what they do and what actions we should take in the event of failure - I want to know which jobs are obsolete.

This script shows the last run date and time for each job step and how long it took, whether it succeeded, which database it queries, the location of any log files, the command run by each step, the job's created and last modified dates, and the owner of the job.

 USE MSDB  
 GO  
 SELECT J.Name AS JobName,  
      JS.Step_Name AS StepName,  
      CAST(CAST(NULLIF(Last_Run_Date,0) AS VARCHAR(8)) AS DATETIME)   
           + STUFF(STUFF(RIGHT('00000' + CAST(Last_Run_Time AS VARCHAR(6)),6),3,0,':'),6,0,':') AS Last_Run_DateTime,  
      JS.Last_Run_Duration,  
      CASE WHEN (JS.Last_Run_Outcome = 1) THEN 'Succeeded' ELSE 'Failed' END Last_Run_Status,  
      JS.Database_Name,  
      JS.SubSystem,  
      JS.Command AS Command,  
      JS.Output_File_Name,  
      J.Date_Created,  
      J.Date_Modified,  
      SUSER_NAME(J.Owner_SID) AS Job_Owner  
 FROM SysJobs J  
 INNER JOIN SysJobSteps JS ON J.Job_ID = JS.Job_ID  
 --WHERE JS.SubSystem = 'SSIS'  

With the addition of the WHERE constraint to return only jobs which use the SSIS subsystem, you can determine which jobs run SSIS packages stored either within a file structure or within MSDB.

I plan to use this to remove all obsolete jobs and packages from our servers prior to a move to new hardware later in the year.
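
As a rough first pass at identifying candidates for removal - and 90 days with no run is my arbitrary threshold, not a rule - something like this narrows the list down:

 -- Rough first pass: enabled jobs with no step run in the last 90 days.
 -- The 90-day threshold is an arbitrary choice.
 USE MSDB  
 GO  
 SELECT J.Name AS JobName,  
        MAX(NULLIF(JS.Last_Run_Date,0)) AS Last_Run_Date  
 FROM SysJobs J  
 INNER JOIN SysJobSteps JS ON J.Job_ID = JS.Job_ID  
 WHERE J.Enabled = 1  
 GROUP BY J.Name  
 HAVING ISNULL(MAX(NULLIF(JS.Last_Run_Date,0)),0) < CAST(CONVERT(VARCHAR(8),DATEADD(DAY,-90,GETDATE()),112) AS INT)  
 ORDER BY Last_Run_Date  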

Nothing is going on our new servers unless it's written up in our SQL Agent documentation, which includes the actions to take in the event of failure, any ports to be opened for SFTP tasks and, crucially, who owns the job, both on the technical side and in the business. Ownership is key to making sure SQL Agent jobs are maintained.

Wednesday, 28 March 2012

Easy Error Handling For Data Warehouse Sprocs

Our data warehouse makes use of both SSIS packages and T-SQL stored procedures during the ETL phase. I use an audit schema borrowed from the excellent book SSIS Problem-Design-Solution to collect errors generated by SSIS. For T-SQL stored procedures I needed a simple way to collect errors which could be integrated into our DBA monitoring dashboard. This is what I did.

  1. Identified what data I wanted to capture.
  2. Built a table to store the errors
  3. Built a Stored Procedure to INSERT the errors
  4. Embedded the SP in the CATCH block of a TRY...CATCH in every Data Warehouse Procedure

What Error Data

Keep it simple, I thought. I needed to know what happened and when, who executed which procedure on what server, what the error was and where within the procedure it occurred.

Most of that data can be collected using the built-in T-SQL ERROR_ system functions. This includes the error number, the line number on which the error occurred, the severity, the state, the procedure that caused the error and the full error message. The remaining details can be collected from the system user and server name, and by setting a default on a created date time column.

The Table Design

I used the data types specified for the ERROR_ system functions on MSDN and created a surrogate primary key to ensure the table has a clustered index.

 CREATE TABLE [Metadata].[ErrorLog](  
      [ErrorLogID] [int] IDENTITY(1,1) NOT NULL,  
      [SystemUser] [nvarchar](128) NOT NULL,  
      [CreatedDateTime] [datetime] NOT NULL,  
      [ServerName] [nvarchar](128) NOT NULL,  
      [ErrorNumber] [int] NOT NULL,  
      [ProcedureName] [nvarchar](128) NOT NULL,  
      [LineNumber] [int] NOT NULL,  
      [Severity] [int] NOT NULL,  
      [State] [int] NOT NULL,  
      [Message] [nvarchar](4000) NOT NULL,  
  CONSTRAINT [PK_ErrorLog] PRIMARY KEY CLUSTERED   
 (  
      [ErrorLogID] ASC  
 )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]  
 ) ON [PRIMARY]  
 GO  
 ALTER TABLE [Metadata].[ErrorLog] ADD DEFAULT (getdate()) FOR [CreatedDateTime]  
 GO  

The Procedure Design

Again, keeping it as simple as possible and generic enough to be used across any solution was the main aim.
Most of the details are passed through as parameters from the procedure which has encountered an error, with the exception of the server name and system user, which I SET in this procedure.

The most interesting part of this procedure is the use of SERVERPROPERTY, which can be configured to return a variety of useful information about the server environment. Be careful to CONVERT the returned value when assigning it to a local variable, as it comes back as a SQL_VARIANT. The rest of the procedure is pretty standard fare - a simple INSERT really.

 /*  
 **********************************************************************************************************************************************************************************************************************  
 DW - InsertErrorLog  
 **********************************************************************************************************************************************************************************************************************  
 VERSION     AUTHOR     DATE           COMMENTS  
 **********************************************************************************************************************************************************************************************************************  
 V1.0        GP         27/03/2012     Initial build of InsertErrorLog which is designed to take error details from inside a CATCH block and pass them to a central ErrorLog table  
 **********************************************************************************************************************************************************************************************************************  
 USAGE EXAMPLE   
 EXEC Common.InsertErrorLog 1234,'ProcedureName',1,17,1,'This is a test error message'  
 */  
 CREATE PROCEDURE [Common].[InsertErrorLog]  
      (  
           @ErrorNumber INT,  
           @ProcedureName NVARCHAR(128),  
           @LineNumber INT,  
           @Severity INT,  
           @State INT,  
           @Message NVARCHAR(4000)  
      )  
 AS  
 SET NOCOUNT ON  
 DECLARE @SystemUser AS NVARCHAR(128)  
 DECLARE @ServerName AS NVARCHAR(128)  
 SET @SystemUser = SUSER_NAME()       
 SET @ServerName = CONVERT(SYSNAME, SERVERPROPERTY('SERVERNAME'))  
 -- POPULATE ERROR TABLE WITH SUBMITTED DATA  
 INSERT INTO DataWarehouse.Metadata.ErrorLog (SystemUser,ServerName,ErrorNumber,ProcedureName,LineNumber,Severity,State,Message)  
 VALUES (@SystemUser,@ServerName,@ErrorNumber,@ProcedureName,@LineNumber,@Severity,@State,@Message)  
 GO  

An Example Procedure

Let's step through an example, starting on line 22 where we enter the body of this stored procedure, which is designed to TRUNCATE and repopulate a Time dimension. On line 22 the BEGIN TRY is entered and all the subsequent code down to line 54 is executed.

In the event of an error the CATCH block starting on line 56 is executed.

Within the CATCH block I assign the ERROR_ values to local variables on line 58-63.

On lines 64-65 I then EXECUTE the InsertErrorLog procedure with the local variables passed as parameters. A row is then added to the Metadata.ErrorLog table.

In my team we have a daily monitoring report which allows us to see problems in the overnight processing before end users report issues. This gives us a chance to get ahead of the problem by either fixing it or communicating with stakeholders. Before this was implemented we faced the spectacle of users periodically asking, "Is there a problem with the Data Warehouse?" Nowadays we can answer with certainty.

1:  USE [DataWarehouse]  
2:  GO  
3:  /****** Object: StoredProcedure [DataMart].[LoadDIMTime]  Script Date: 03/28/2012 22:56:55 ******/  
4:  SET ANSI_NULLS ON  
5:  GO  
6:  SET QUOTED_IDENTIFIER ON  
7:  GO  
8:  /*  
9:  ***************************************************************************************************************************************************************  
10:  DW - LoadDIMTime  
11:  ***************************************************************************************************************************************************************  
12:  VERSION AUTHOR  DATE          COMMENTS  
13:  ***************************************************************************************************************************************************************  
14:  V1.0    GP      10/01/2012     Initial build of DIMTime  
15:  V1.1    GP      27/03/2012     Added ErrorLog functionality in the CATCH block  
16:  ***************************************************************************************************************************************************************  
17:  USAGE EXAMPLE   
18:  EXEC DataMart.LoadDIMTime  
19:  */  
20:  ALTER PROCEDURE [DataMart].[LoadDIMTime]  
21:  AS   
22:    BEGIN TRY   
23:      -- START AFRESH WITH AN EMPTY TABLE  
24:      TRUNCATE TABLE DataWarehouse.DataMart.DIMTime  
25:      -- INSERT DIMENSION MEMBERS  
26:      INSERT INTO DataWarehouse.DataMart.DIMTime  
27:          ( Hour ,  
28:           Minute ,  
29:           Meridiem  
30:          )  
31:          SELECT Hour ,  
32:              Minute ,  
33:              Meridiem  
34:          FROM  DataWarehouse.DataSource.vwTime  
35:      OPTION ( MAXRECURSION 0 )  
36:       -- ADD THE UNKNOWN MEMBER. SUSPEND THE IDENTITY DURING INSERT  
37:      IF NOT EXISTS ( SELECT TimeKey  
38:              FROM  DataWarehouse.DataMart.DIMTime  
39:              WHERE  TimeKey = -1 )   
40:        BEGIN  
41:          SET IDENTITY_INSERT DataWarehouse.DataMart.DIMTime ON  
42:          INSERT INTO DataWarehouse.DataMart.DIMTime  
43:              ( TimeKey ,  
44:               Hour ,  
45:               Minute ,  
46:               Meridiem  
47:              )  
48:              SELECT -1 AS TimeKey ,  
49:                  0 AS Hour ,  
50:                  0 AS Minute ,  
51:                  'UN' AS Meridiem  
52:          SET IDENTITY_INSERT DataWarehouse.DataMart.DIMTime OFF  
53:        END            
54:    END TRY   
55:  -- RETURN ERRORS WHEN ERRORS OCCUR  
56:    BEGIN CATCH   
57:      -- REPORT ERRORS  
58:      DECLARE @ErrorNumber INT = ERROR_NUMBER()  
59:      DECLARE @ErrorProcedure NVARCHAR(128) = ERROR_PROCEDURE()  
60:      DECLARE @ErrorLine INT = ERROR_LINE()  
61:      DECLARE @ErrorSeverity INT = ERROR_SEVERITY()  
62:      DECLARE @ErrorState INT = ERROR_STATE()  
63:      DECLARE @ErrorMessage NVARCHAR(4000) = ERROR_MESSAGE()  
64:      EXEC Common.InsertErrorLog @ErrorNumber, @ErrorProcedure, @ErrorLine,  
65:        @ErrorSeverity, @ErrorState, @ErrorMessage  
66:    END CATCH  
67:  GO  
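
For what it's worth, the daily monitoring query behind that report doesn't need to be anything more sophisticated than this sort of thing (the 24-hour window is just an example):

 -- Example monitoring query: errors logged in the last 24 hours.
 -- The time window is illustrative; adjust it to suit your reporting schedule.
 SELECT ErrorLogID,  
        CreatedDateTime,  
        ServerName,  
        ProcedureName,  
        ErrorNumber,  
        LineNumber,  
        Severity,  
        [State],  
        [Message],  
        SystemUser  
 FROM DataWarehouse.Metadata.ErrorLog  
 WHERE CreatedDateTime >= DATEADD(HOUR, -24, GETDATE())  
 ORDER BY CreatedDateTime DESC  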

Feel free to use the code and please email me any improvements you make.


Thursday, 1 March 2012

WinSCP sFTP Tasks for my SSIS Package Template

I have an SSIS template package to speed up development. I use WinSCP for sFTP tasks because the standard SSIS FTP Task can't do sFTP, and all my recent development requirements have specifically asked for it. There are many tools you could use; I like WinSCP because it works well both as an application and from the command line, it has good online documentation for scripting command-line tasks, and it's free.

I call WinSCP from an Execute Process Task. This requires WinSCP to be installed on the server where the SSIS package executes - I don't run ETL tasks on a production server, though I've found many instances where legacy systems do. I prefer to use a powerful workhorse BI/ETL server where I can guarantee an extensive overnight maintenance window and keep the production environment both secure and dedicated to its primary role.

In my SSIS template I have preconfigured PUT and GET tasks, both driven from variables populated from a configuration table in my Metadata database. I could, if required, encrypt the sFTP credentials within the configuration table and then decrypt them within the stored procedure that creates the string passed to the variable. But let's not overcomplicate matters unless we have to.

I build the sFTP command-line arguments from variables which are collected from my configuration control table. This table can be amended - if, for example, an FTP password requires regular changes - without having to change the SSIS package. I feel this is a better solution than using a text file script, as the text files then need to be version controlled, secured and backed up. Using a configuration table and SSIS variables also brings other benefits such as security, centralisation of control data and the potential to encrypt if required.
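
To give a flavour of the configuration table - this is an illustration rather than my production schema, so treat the names and values as placeholders - it only needs to hold the handful of values the expressions below consume:

 -- Illustrative sketch of a configuration control table for the sFTP variables.
 -- Table name, column names and values are placeholders, not my production schema.
 CREATE TABLE Common.FTPConfig  
 (  
      ConfigDescription VARCHAR(100) NOT NULL PRIMARY KEY,  
      ConfigVariable    VARCHAR(255) NOT NULL  
 )  
 GO  
 INSERT INTO Common.FTPConfig (ConfigDescription, ConfigVariable)  
 VALUES ('FTP Host', 'mydomain.ftp.com'),  
        ('FTP Username', 'username'),  
        ('FTP Password', 'password'),  
        ('FTP Destination Folder', 'C:\Temp')  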

I configure the Execute Process Task as shown here.


Below are examples of my SSIS expression and the output it returns.

Get File(s)

Expression - @[User::FTPPrefix] +  @[User::FTPUsername] + ":" +  @[User::FTPPassword] + "@" +  @[User::FTPHost] + "\" \"get " + "\" \"//*.csv \" \"" +  @[User::FTPDestinationFolder] +  @[User::FTPSuffix]

Output - /console /command "option batch on" "option confirm off"  "option transfer binary" "open sftp://username:password@mydomain.ftp.com" "get " "//*.csv " "C:\Temp "close" "exit"

Put File(s)

Expression - @[User::FTPPrefix] +  @[User::FTPUsername] + ":" +  @[User::FTPPassword] + "@" +  @[User::FTPHost] + "\" \"put \"" +  @[User::FTPSourceFolder] +  "InsertFileNameHere" + (DT_STR,4,1252)YEAR(GETDATE()) + RIGHT("0" + (DT_STR,2,1252)MONTH(GETDATE()),2) + RIGHT("0" + (DT_STR,2,1252)DAY(GETDATE()),2) + ".zip" + "\" \"//\"\"" +  @[User::FTPSuffix]

Output - /console /command "option batch on" "option confirm off"  "option transfer binary" "open sftp://username:password@mydomain.ftp.com " "put "\\172.0.0.1\C$\sFTP\InsertFileNameHere20120301.zip" "//"" "close" "exit"

I hope you find this useful. I'll be sharing my full SSIS template shortly. 

UPDATE: Check out my 7-Zip archiving task also used as part of my SSIS package template.

Wednesday, 15 February 2012

Data Warehouse High Level Overview

Most data warehouses have multiple data sources. Conforming similar, but not identical, data from a variety of systems whilst retaining the ability to roll back to a previous version of the truth presents significant challenges. Data from flat files adds a further challenge. In addition, a variety of front-end tools needs to be supported, which adds a requirement for the presentation areas of the warehouse to be generic enough to support these tools plus others we don't yet know about.

The main elements of my high level design are:
  • A Metadata control database 
  • Replicated source data with change tracking enabled.
  • An ETL staging area with no persistent data
  • A data warehouse for persisting the data
  • Data marts for specific business focused OLAP cubes
  • SQL Reporting services

Metadata

I like SSIS, but I dislike having to fix other people's poorly written packages. I spent over three weeks in my last job documenting the processes buried in SSIS tasks so I could understand one of the two main ETL loaders I was supporting. The process had been well designed but never documented, and it relied too heavily on SSIS when a parameterised stored procedure would have been more transparent and easier to amend two or three years after the system went into operation.

The lesson I took from that was to use SSIS where appropriate. I believe T-SQL can achieve many, but not all, ETL tasks better. I like to use SSIS to drive parameterised stored procedures where the parameters are sourced from a metadata database. For example, during ETL cleansing I iterate through the identified cleansing steps using an SSIS package that references a metadata process table. This enables me to add, amend, enable, disable or sequence each cleansing step, and as I add more data sources to my warehouse all I have to do is add rows to the cleansing process table and the automated SSIS package picks up the new processes.
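
As a sketch of the idea only - the real table sits in my Metadata database and has more to it than this, so the names here are placeholders - the cleansing process table boils down to something like:

 -- Illustrative sketch of a cleansing process control table; names are placeholders.
 CREATE TABLE Common.CleanseProcess  
 (  
      CleanseProcessID INT IDENTITY(1,1) NOT NULL PRIMARY KEY,  
      DatabaseName     VARCHAR(50) NOT NULL,  
      SchemaName       VARCHAR(50) NOT NULL,  
      TableName        VARCHAR(50) NOT NULL,  
      ColumnName       VARCHAR(50) NOT NULL,  
      CleanseProcedure VARCHAR(128) NOT NULL,  -- e.g. 'Clean.RemoveWhitespace'  
      SequenceOrder    INT NOT NULL,  
      IsEnabled        BIT NOT NULL DEFAULT (1)  
 )  

The SSIS package simply loops over the enabled rows in sequence and executes the named procedure with those parameters.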

The Metadata database acts as the brain of my warehouse design.

Replicas

Change Tracking and Change Data Capture in SQL Server offer a way to identify data which has changed since the last load. Both come with a performance overhead, so I don't implement them over production databases. By replicating those production databases, or just the elements we need for data warehousing, we suffer less overhead and can work on a copy of the data source in an ETL environment without the constraints of a production server.

By appending the change tracking number to our imported rows we can roll back to a previous version of the truth to fix issues, and then roll forward again when ready, for up to 90 days depending on the change tracking retention you configure.
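
For reference, the retention window is set per database when change tracking is enabled; the database name here is illustrative:

 -- Enable change tracking on the replica with a 90-day retention window.
 -- The database name is illustrative.
 ALTER DATABASE ReplicaSource  
 SET CHANGE_TRACKING = ON  
 (CHANGE_RETENTION = 90 DAYS, AUTO_CLEANUP = ON)  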

ETL Staging

Data does not persist in the staging element of the warehouse. It simply migrates through on its way to a cleansed and conformed schema. The only exception is bad data we don't want to allow through, which often requires manual intervention.

All the cleansing and validation of data is performed in this element. No users, other than the DW admins, have any reason to be here.

Data Warehouse

Once we've established a clean set of data we can conform it. Data warehouses tend to store data from multiple sources, and what at first appears to be common data is often of a different data type or length, and sometimes even a different collation. That's no good for our dimensional model, so we need a place to store conformed data before it goes into the organisation's reporting objects.

It does mean storing a lot of data but the cost of ETL is high and once you've got it right you want to keep it.

I follow the Kimball methodology quite closely, and he probably wouldn't like this. But I try to be pragmatic rather than ideological, and I've seen this method work really well even in organisations where data quality was high. For organisations with data quality challenges it could be a real benefit.

Data Marts

Terminology gets confusing sometimes. To me, a data mart is a subset of the organisation's data conformed into a model focused on a specific business function. Examples would be underwriting performance, clickstream analysis or stock control.

In my data mart database there are dimensions, both common and those specific to a data mart, as well as fact tables. Nothing else. Users may have access to this area and/or the data warehouse, so I use schemas both for security and for ease of identification.

Reporting Services

Even in an OLAP reporting world, traditional canned reports still have a place. SQL Server Reporting Services' Report Manager and its ability to distribute data are invaluable, and it comes with a nice front end. I advocate using it to get both OLAP and non-OLAP reports to users.

In Conclusion

This high level overview outlines my preferred architecture. I believe it answers many of the questions I've faced, but I'm keen to improve it. If you have any feedback that might help, just comment below. After all, we're standing on the shoulders of others.

Wednesday, 8 February 2012

Pragmatic Principles - Templates

For both SSRS and SSIS, create templates which the whole team uses. The SSIS template is purely for fast, consistent development, but the SSRS template also presents a branded product which, once your DW gains acceptance, will reinforce your brand at the expense of any reports not sourced from your DW.

Tuesday, 7 February 2012

Pragmatic Principles - Focus

The DW is a business system, not a techie playground. Use business analysts to focus all development on revenue-generating (or saving) work, work that makes the organisation more data-driven and innovative development, and most of all keep in regular communication with the key business users in each department.

Every n months go and sit with them, one on one, during their reporting cycle or during an analysis task, and learn what they are doing, why, and what is coming up in the next few months. During these at-desk meetings introduce them to new features and tools. Focus future development on the ideas and needs identified during these sessions.

Sunday, 29 January 2012

Dynamic SSIS package for Data Profiling

The data cleansing element of a data warehousing project can take a long time, depending on the quality of the source data. You need to know the extent of the data quality problems early in the project. The Data Profiling task in SSIS is a good starting point, assuming you have access to the source data and know which tables within the source are likely to be required.

I wanted a dynamic data profiling template that I could configure using variables and schedule during a maintenance window to collect profiling data for specified tables within one or more databases. Looking into the problem, I quickly found two good references.

The first was Jamie Thomson's blog series from back in March/April 2008 - I'm not planning on moving to SQL Server 2012 any time soon, so 2008 and 2008 R2 references are fine for now. The 10 articles are an excellent introduction to the Data Profiling task and to interpreting its output.

But the most relevant article I found was by John Welch, who seemed to have had the same idea as me: to create a re-usable data profiling package.

Data profiling template originally from John Welch.


It's a simple package and doesn't do everything I wanted. Specifically, it can't loop through multiple databases or limit profiling to specified tables only. Also, I try to avoid direct-input SQL statements in Execute SQL Tasks, as I find administering SSIS packages with code buried in tasks less efficient than using parameterised stored procedures.

However, being pragmatic about things, it's not a package I'll run every day, and it'll run in environments where I have no access to create an Admin database for DBA scripts and procedures, so I'll accept it as is. And I can schedule a SQL Agent job with multiple steps, each calling the package with different configuration parameters, to achieve my original goal.

You can download the source from the original blog here or with some minor amendments from me here.