Wednesday, 15 February 2012

Data Warehouse High Level Overview

Most data warehouses have multiple data sources. Conforming similar, but not identical, data from a variety of systems whilst retaining the ability to roll back to a previous version of the truth presents significant challenges. Data from flat files adds a further challenge. In addition, a variety of front-end tools need to be supported, which adds a requirement for the presentation areas of the warehouse to be generic enough to support these tools plus others we don't yet know about.

The main elements of my high level design are:
  • A metadata control database
  • Replicated source data with change tracking enabled
  • An ETL staging area with no persistent data
  • A data warehouse for persisting the data
  • Data marts for specific business-focused OLAP cubes
  • SQL Server Reporting Services

Metadata

I like SSIS but I dislike having to fix other people's poorly written packages. In my last job I spent over three weeks documenting the processes buried in SSIS tasks so I could understand one of the two main ETL loaders I was supporting. The process had been well designed but never documented, and it relied too heavily on SSIS when a parameterised stored procedure would have been more transparent and easier to amend two or three years after the system had been operating.

The lesson I took from that was to use SSIS only where appropriate. I believe T-SQL can achieve many, but not all, ETL tasks better. I like to use SSIS to drive parameterised stored procedures where the parameters are sourced from a metadata database. For example, during ETL cleansing I iterate through identified cleansing steps using an SSIS package that references a metadata process table. This enables me to add, amend, enable, disable or sequence each cleansing step. And as I add more data sources to my warehouse, all I have to do is add rows to the cleansing process table and the automated SSIS package picks up the new processes.
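A minimal sketch of what such a process table might look like; the schema, table and column names here are illustrative assumptions, not a prescribed design:

```sql
-- Hypothetical metadata table driving the cleansing loop
CREATE TABLE meta.CleansingProcess (
    ProcessId     int IDENTITY(1,1) PRIMARY KEY,
    SourceSystem  varchar(50)  NOT NULL,
    ProcedureName sysname      NOT NULL,  -- parameterised stored procedure to run
    ParameterList varchar(500) NULL,      -- e.g. '@BatchId, @SourceTable'
    RunSequence   int          NOT NULL,
    IsEnabled     bit          NOT NULL DEFAULT (1)
);

-- The SSIS package reads the enabled steps in order (e.g. into a Foreach Loop
-- container) and runs each procedure via an Execute SQL Task
SELECT ProcedureName, ParameterList
FROM   meta.CleansingProcess
WHERE  IsEnabled = 1
ORDER  BY RunSequence;
```

Adding a new source then really is just an INSERT into this table; the package itself never changes.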

The Metadata database acts as the brain of my warehouse design.

Replicas

Change Tracking and Change Data Capture in SQL Server offer a way to identify data which has changed since the last load. Both come with a performance overhead, so I don't implement them over production databases. By replicating those production databases, or just the elements we need for data warehousing, we suffer less overhead and we can work on a copy of the data source in an ETL environment without the constraints of a production server.
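Enabling Change Tracking on the replica rather than production might look like the following; the database and table names are assumptions for illustration, and the retention period is simply set to match the 90-day window mentioned below:

```sql
-- Enable change tracking on the replicated database, not on production
ALTER DATABASE SalesReplica
SET CHANGE_TRACKING = ON
    (CHANGE_RETENTION = 90 DAYS, AUTO_CLEANUP = ON);

-- Then enable it per table we want to load incrementally
ALTER TABLE dbo.Orders
ENABLE CHANGE_TRACKING
WITH (TRACK_COLUMNS_UPDATED = OFF);
```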

By appending the change tracking version number to our imported rows, we can roll back to a previous version of the truth to fix issues and then roll forward again when ready, for up to 90 days depending on the change tracking retention you configure.
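An incremental load sketched against that setup, assuming a hypothetical meta.LoadHistory table that records the last version loaded (table, column and key names are illustrative):

```sql
-- Rows changed since the version recorded at the last successful load
DECLARE @LastVersion bigint =
    (SELECT LastSyncVersion FROM meta.LoadHistory WHERE TableName = 'dbo.Orders');

SELECT o.*, ct.SYS_CHANGE_VERSION, ct.SYS_CHANGE_OPERATION
FROM   CHANGETABLE(CHANGES dbo.Orders, @LastVersion) AS ct
LEFT   JOIN dbo.Orders AS o ON o.OrderId = ct.OrderId;

-- Record the version we loaded up to, so a batch can be rolled back
-- and re-run from a known point later
UPDATE meta.LoadHistory
SET    LastSyncVersion = CHANGE_TRACKING_CURRENT_VERSION()
WHERE  TableName = 'dbo.Orders';
```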

ETL Staging

Data does not persist in the staging element of the warehouse. It simply migrates through on its way to a cleansed and conformed schema. The only exception is bad data we don't want to allow through, which often requires manual intervention.

In this element all the cleansing and validation of data is performed. No users, other than the DW admins, have any reason to be here.

Data Warehouse

Once we've established a clean set of data we can conform it. Data warehouses tend to store data from multiple sources, and what at first appears to be common data is often of a different data type or length, and sometimes even a different collation. That's no good for our dimensional model, so we need a place to store conformed data before it goes into the organisation's reporting objects.
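A small sketch of what conforming might mean in practice. Suppose two sources hold the same customer name attribute with different types, lengths and collations; the warehouse table picks one definition and every load casts to it (all names and collations here are illustrative):

```sql
-- Source A: CustomerName varchar(40)  COLLATE Latin1_General_CI_AS
-- Source B: CustName     nvarchar(60) COLLATE SQL_Latin1_General_CP1_CI_AS
CREATE TABLE dw.Customer (
    CustomerKey  int IDENTITY(1,1) PRIMARY KEY,
    SourceSystem varchar(20)  NOT NULL,
    SourceId     varchar(30)  NOT NULL,
    -- one conformed definition, wide enough for every source
    CustomerName nvarchar(60) COLLATE Latin1_General_CI_AS NOT NULL
);

INSERT INTO dw.Customer (SourceSystem, SourceId, CustomerName)
SELECT 'SystemA',
       CAST(CustomerId AS varchar(30)),
       CAST(CustomerName AS nvarchar(60)) COLLATE Latin1_General_CI_AS
FROM   stage.SystemA_Customer;
```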

It does mean storing a lot of data but the cost of ETL is high and once you've got it right you want to keep it.

I follow the Kimball methodology quite closely, though he probably wouldn't like this. But I try to be pragmatic, not ideological, and I've seen this method work really well even in organisations where data quality was high. For organisations with data quality challenges it could be a real benefit.

Data Marts

Terminology gets confusing sometimes. To me, a data mart is a subset of the organisation's data conformed into a model focused on a specific business function. Examples would be underwriting performance, click stream analysis or stock control.

In my data mart database there are dimensions, both common and those specific to a data mart, as well as fact tables. Nothing else. Users may have access to this area and/or the data warehouse, so I use schemas both for security and for ease of identification.
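One way the schema approach can be sketched; the schema and role names are assumptions for illustration only:

```sql
-- One schema per mart: identifies the mart and doubles as a security boundary
CREATE SCHEMA uw AUTHORIZATION dbo;   -- underwriting mart
GO
CREATE SCHEMA web AUTHORIZATION dbo;  -- click stream mart
GO

-- Underwriting analysts see only their own mart's objects
GRANT SELECT ON SCHEMA::uw  TO UnderwritingAnalysts;
DENY  SELECT ON SCHEMA::web TO UnderwritingAnalysts;
```

A single GRANT at schema level also covers any tables added to the mart later, which keeps permissions maintenance low.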

Reporting Services

Even in an OLAP reporting world, traditional canned reports still have a place. SQL Server Reporting Services' Report Manager's ability to distribute data is invaluable, and it comes with a nice front end. I advocate using it to get both OLAP and non-OLAP reports to users.

In Conclusion

This high level overview outlines my preferred architecture. I believe it answers many of the questions I've faced, but I'm keen to improve it. If you have any feedback that might help, just comment below. After all, we're standing on the shoulders of others.

Wednesday, 8 February 2012

Pragmatic Principles - Templates

For both SSRS and SSIS, create templates which the whole team uses. SSIS templates are purely for fast, consistent development, but SSRS templates also present a branded product which, after your DW gains acceptance, will reinforce your brand at the expense of any reports not sourced from your DW.

Tuesday, 7 February 2012

Pragmatic Principles - Focus

The DW is a business system, not a techie playground. Use Business Analysts to focus all development on revenue generating (or saving) work, work that makes the organisation more data-driven, and innovative development, and most of all keep in regular communication with the key business users in each department.

Every n months go and sit with them, one on one, during their reporting cycle or during an analysis task, and learn what they are doing, why, and what is coming up in the next few months. During these at-desk meetings introduce them to new features and tools. Focus future development on the ideas and needs identified during these sessions.