Saturday, 27 July 2013

Greenfield Data Warehouse - Problems faced and solutions found

I've been leading the building of a greenfield data warehouse - among many other things - for an SME over the last 18 months. My team works closely with an excellent Analytics department who are responsible - among many other things - for translating the data we provide into usable MI for stakeholders. 

We recently presented the stakeholder focused, agile methodology we've adopted to another part of the organisation who require a reboot of their current data management practices and will be building a data warehouse of their own either with our guidance or under our direct control.

The two slides featured here show the before and after position. Although it has to be said that our project is far from finished, we have nonetheless built a solid foundation on which we can expand. From both a technical and a business point of view our solution is very pragmatic. We knew all along that anything we built, whether process or technical solution, would have to be modular so it could adapt to the changing environment of the vibrant, fast-moving and maturing SME in which we operate.


The Difficulties

As the business has grown, the number of systems has grown and so has the complexity.

This creates challenges unique to us but comparable to challenges faced by all businesses of our size and composition.

We began back in January 2012 and at the time we didn't have many of the foundations we required.

  1. Our data is mostly insurance premiums and claims, but from many different sources.
  2. Within that data there were technical differences but also semantic differences, so the same thing is known by different labels in different systems.
  3. There was no existing method of sharing data with colleagues from a central source.
  4. Even if there had been an MI Portal, there was no central source for reporting from which such a portal could have been fed.
  5. The DBA and Analytics team worked together, but none of us was dedicated full time to this project.
  6. We didn’t have a plan or a PM to drive it forward.
  7. And everything was needed yesterday.

The Solutions

  1. We identified an existing, under-utilized mid-range server and declared it our reporting environment. Wherever possible we switched reporting – which had been running on our live servers – onto this environment. It gave us the power to run resource-intensive processes away from any system that serves customers.
  2. We were already a business run on Microsoft software by colleagues comfortable with that environment. It made sense to use SQL Server as our Business Intelligence back-end.
  3. We already had data distributed across multiple servers and we expect that environment to become more complex over time. We created replicas of all the reporting data sources on our reporting environment. Now we had all the data in one place, but it was still in different formats.
  4. Next we identified key business areas – led by stakeholder requirements – and built a data mart over each process. Sometimes this meant working across multiple data sources, so we had to create a translation layer between the inputs and the outputs (see the sketch after this list).
  5. We then shared the data with users via Microsoft Excel, as it was, and remains today, the easiest tool we have for getting data in front of consumers. The Excel workbooks we provide are dynamic cubes, which means they are connected to the reporting databases and function like massive pivot tables.
  6. The early developments were really only straw men that we used to create a feedback loop with users until we were happy we had accurate data in the shape required to satisfy “all reasonable questions, and some unreasonable questions too.” Without an Agile approach we wouldn't have progressed as fast.
  7. We work closely with the stakeholders. Without their constant input, testing and feedback we might have followed the unfortunate path that many IT geeks take of building a gold-plated technical solution which is of no use to the business.
  8. We've never pretended that the data we present is perfect. Recently we uncovered a few glaring mistakes, not with our development, but in an underlying third-party database. Before we found that issue we were blind to it. Now we have our single view of the facts we’re discovering errors in previous reports and processes.
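
As a sketch of what the translation layer mentioned in point 4 can look like: the same concept carries a different label in each source system, and a conformed view maps them onto one vocabulary. All object and column names below are invented for illustration; in practice this logic sits in views or SSIS transformations feeding each data mart.

-- Hypothetical conformed view: two source systems label the same concepts
-- differently, so the view maps both onto a single business vocabulary.
CREATE VIEW DataMart.vwPolicyPremium
AS
SELECT
    a.PolicyRef        AS PolicyNumber,   -- known as "PolicyRef" in system A
    a.GrossPremium     AS PremiumGross,
    'SystemA'          AS SourceSystem
FROM SystemA.dbo.Policies AS a
UNION ALL
SELECT
    b.Pol_No           AS PolicyNumber,   -- known as "Pol_No" in system B
    b.Premium_Inc_IPT  AS PremiumGross,
    'SystemB'          AS SourceSystem
FROM SystemB.dbo.Quotes AS b;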

In Summary

It's horses for courses. As a business intelligence architect it's important to balance stakeholder and technical needs. There are many technical elements of our data warehouse which I would love to spend more time on, as I love that type of work. But the opportunity to build a data warehouse solution from scratch comes with an implicit responsibility to provide ROI from day 1.

Wednesday, 8 May 2013

Big Data in the Real World - Part 2 of 2

Part 1 of this series described the first 2 stages in the processing of clickstream data to allow sales conversion analysis. This post describes the most complex part of the process - the pivoting of Key-Value-Pair data into a flattened schema.

Loading New Data

The third element of the process is transforming the KVP data from staging into a persistent schema. There are many reasons for collecting this data – it is a rich dataset – and each reason may require the data in a different format. We don’t know what analysis will be done on the data in future or the data format that analysis will require. As a result, we continue to store the data in string format for now and assume any future analysis will have to construct a layer on top of the base collection, or extract the data into further persistent storage if specific transformations are required.

The staged data is still in a KVP format. This is difficult for users and report writers to consume so we pivot the data to create 1 row per click event. This has the effect of turning 150 narrow rows of KVP data per click event into 1 wide row in the persistent schema with a 90% reduction in storage space.

Isolating the click events

To achieve this we isolate the click events within the data. Each event has:

  • SessionID stored as a GUID
  • BornDateTime timestamp stored as a datetime
  • StepID which is generally an integer but can contain characters so is stored as an NVARCHAR(3)
  • LayoutID also stored as an NVARCHAR(3). 
LayoutID is rarely used and exists only in two tables, Log_Flow and Log_Quote, not Log_Sales. It is still part of the unique key for those tables.

StepID is used, but only in Log_Flow and Log_Quote.

Together these 4 columns constitute a unique key across a single click event. We use them to create a Primary Key in the persistent storage tables.

Step one of the SSIS process identifies each distinct unique key within the data. These keys are then inserted into the corresponding persistent storage tables in a DataSource schema, as shown below.

  • Staging.LogFlow → DataSource.ConversionLogFlow
  • Staging.LogQuote → DataSource.ConversionLogQuotes
  • Staging.LogSales → DataSource.ConversionLogSales
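
A rough T-SQL sketch of that key-isolation step for one of the tables follows. The real step is an SSIS task, and the NOT EXISTS guard is my assumption about how re-runs avoid inserting duplicate keys:

-- Insert any click-event keys present in staging but not yet persisted.
INSERT INTO DataSource.ConversionLogFlow (SessionID, BornDateTime, StepID, LayoutID)
SELECT DISTINCT s.SessionID, s.BornDateTime, s.StepID, s.LayoutID
FROM Staging.LogFlow AS s
WHERE NOT EXISTS (
    SELECT 1
    FROM DataSource.ConversionLogFlow AS d
    WHERE d.SessionID    = s.SessionID
      AND d.BornDateTime = s.BornDateTime
      AND d.StepID       = s.StepID
      AND d.LayoutID     = s.LayoutID
);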

Pivoting the data into the persistent schema

Each set of key columns we detect in the data and insert into the persistent storage schema as a row must then be completed by updating the rest of the columns (the non-primary-key columns).

We do this by joining between the Staging and DataSource tables on those recently isolated primary keys. But we don’t know which columns we’ll be updating because, as we found out in the second stage of the process, the schema can change. The only way around this was to use a dynamic SQL statement which lists the columns to be updated. These columns must be sourced from the metadata we stored about the KVP data during stage 2 of the processing.



We build up the variables needed for the dynamic SQL through a series of steps which gradually assemble the list of columns and augment the SQL to construct a valid UPDATE statement. The ultimate aim is to populate one VARCHAR(MAX) variable and then execute the dynamic SQL. That dynamic SQL uses a Common Table Expression which itself utilises the PIVOT function. The CTE and PIVOT reference the Staging KVP data and turn it into rows which the subsequent UPDATE statement then uses to populate the persistent schema.
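
A heavily simplified sketch of the idea follows. The metadata table (Meta.KvpColumns) and the KVP column names (KeyName, KeyValue) are hypothetical stand-ins for our actual objects, and the real package builds this in several steps across all three tables:

DECLARE @cols NVARCHAR(MAX), @setList NVARCHAR(MAX), @sql NVARCHAR(MAX);

-- Column list for the PIVOT, e.g. [Make],[Model],[Postcode], built from the KVP metadata.
SELECT @cols = STUFF((
    SELECT ',' + QUOTENAME(KeyName)
    FROM Meta.KvpColumns
    FOR XML PATH('')), 1, 1, '');

-- SET clause, e.g. d.[Make] = p.[Make], d.[Model] = p.[Model]
SELECT @setList = STUFF((
    SELECT ', d.' + QUOTENAME(KeyName) + ' = p.' + QUOTENAME(KeyName)
    FROM Meta.KvpColumns
    FOR XML PATH('')), 1, 2, '');

SET @sql = N'
;WITH Pivoted AS (
    SELECT SessionID, BornDateTime, StepID, LayoutID, ' + @cols + N'
    FROM (SELECT SessionID, BornDateTime, StepID, LayoutID, KeyName, KeyValue
          FROM Staging.LogFlow) AS src
    PIVOT (MAX(KeyValue) FOR KeyName IN (' + @cols + N')) AS pvt
)
UPDATE d
SET ' + @setList + N'
FROM DataSource.ConversionLogFlow AS d
JOIN Pivoted AS p
    ON  p.SessionID    = d.SessionID
    AND p.BornDateTime = d.BornDateTime
    AND p.StepID       = d.StepID
    AND p.LayoutID     = d.LayoutID;';

EXEC sys.sp_executesql @sql;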

The final element is to update the process control table which stores the last successful timestamp.
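
That last step could be as small as a single statement along these lines (the control table and its columns are named here purely for illustration):

-- Record the high-water mark so the next run only collects newer rows.
UPDATE Meta.ProcessControl
SET LastSuccessfulLoad = (SELECT MAX(BornDateTime) FROM Staging.LogFlow)
WHERE ProcessName = 'ClickstreamConversion';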

In conclusion

Once collected, we’re able to use this data to measure conversion rates on our websites. Each GUID represents a session and we can measure the number of quotes and sales resulting from that session. This data undergoes some further transformation before it’s loaded into a Clickstream Fact table and processed into an OLAP cube hosted by SSAS. The collection process is automated from raw data through to the final OLAP cube and requires very little maintenance. At present we collect about 100K rows daily, which equates to roughly 70MB of new data.

The only issue found to date arose after the number of KVP columns increased and the dynamic SQL, which uses a list of those columns, exceeded 8,000 characters. That broke the process for a few days until we could fix the issue. The fix involved trimming white space from the dynamic SQL to reduce the character count, but this is not a solution, merely a temporary fix. A better solution is to break the dynamic SQL into segments run separately, or to identify unused columns and ignore them.
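
For what it's worth, one common cause of this kind of truncation (an assumption on my part, not a description of our package) is that concatenating ordinary string literals and non-MAX variables produces a non-MAX intermediate result, which is silently capped at 8,000 characters before it ever reaches the VARCHAR(MAX) variable:

DECLARE @sql VARCHAR(MAX), @setList VARCHAR(8000);
SET @setList = 'd.[Col1] = p.[Col1], d.[Col2] = p.[Col2]';  -- imagine thousands of characters here

-- Risky: every operand is varchar(n), so if the combined length exceeds
-- 8000 characters the result is silently truncated before assignment.
SET @sql = 'UPDATE d SET ' + @setList + ' FROM DataSource.ConversionLogFlow AS d';

-- Safer: anchoring the expression with a MAX-typed value promotes the
-- whole concatenation to varchar(max), so nothing is lost.
SET @sql = CAST('' AS VARCHAR(MAX)) + 'UPDATE d SET ' + @setList + ' FROM DataSource.ConversionLogFlow AS d';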

Previous DBAs failed to get anything useful out of this KVP data, primarily because they looked at the problem inflexibly. One had built a report over the raw data, and once the data changed the report failed. Another had used nested cursors – OMG!!! – to perform the data loads, and that process was so slow it never succeeded. Finally, it was outsourced to Mumbai with a poor data brief and without stipulating that the process had to be automated. As a result, the process was defined as a series of rigid steps which some poor soul followed manually before emailing the results back to the UK for analysis. It regularly arrived in different formats and was constantly over two weeks late.

The current process took a couple of months to get right but now works automatically every day. It’s flexible and we’re able to use it for detailed analysis. We no longer pay the outsourcers a significant amount per month for poor quality data. Above all it was a really challenging but fun piece of development.

Wednesday, 1 May 2013

Big Data in the Real World - Part 1 of 2

There is so much talk of Big Data nowadays, mostly from sales people but also from technologists. Karen Lopez aka DataChick took issue with the Wikipedia entry on Big Data, which describes it as "data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications". I think that definition is pretty good.

Over the last few months I've tackled a problem which within my environment counts as a Big Data problem. The solution is outlined in these next 2 articles.

Clickstream Conversion Data Collection Process

The daily process

Clickstream data is collected on a schedule, currently daily, via an SSIS package which is initiated by a step in a SQL Agent job. The underlying package uses the timestamp of the last successful load as a starting point for the next process which allows flexibility in the collection schedule.

The data source

All clickstream data is collected in a Key-Value-Pair (KVP) database. There are 3 main tables used to store the KVP data. Those tables are:

  • LogFlow - information about users' web journeys 
  • LogQuote - all data related to a quote 
  • LogSales - all data related to a sale 

Only the last 5 months of data is stored. This is a hangover from the legacy environment, where storage was limited. We wish to retain all the data, so we collect it each night and store it in our Data Warehouse.

Staging data

As we perform a number of actions on the data once collected, and we don’t want to run those actions over highly transactional replicated tables, a set of staging tables exists which exactly mirrors the schema on both our production and replication servers. Diagram 1 shows the SSIS steps for this section of the process. The staging tables are truncated to remove data from the last run before data flow tasks collect the data added since the last successful load. This is achieved using a view of the source table filtered to return only the rows created after the last datetime recorded in the process control metadata.
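
In T-SQL terms the pattern looks roughly like this for one of the three tables. The object names are illustrative, and in reality the copy is an SSIS data flow task rather than a plain INSERT:

-- Run once: a view over the replicated source that only returns rows
-- created after the last successful load recorded in the control table.
CREATE VIEW Staging.vwNewLogFlow
AS
SELECT lf.*
FROM ReplicatedDb.dbo.LogFlow AS lf
WHERE lf.BornDateTime > (SELECT pc.LastSuccessfulLoad
                         FROM Meta.ProcessControl AS pc
                         WHERE pc.ProcessName = 'ClickstreamConversion');
GO

-- Each run: clear down staging, then pull only the incremental slice.
TRUNCATE TABLE Staging.LogFlow;
INSERT INTO Staging.LogFlow
SELECT * FROM Staging.vwNewLogFlow;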

The KVP data is large, due to the volume of transactions and because the values for each row are stored in NVARCHAR(1000) columns. For each action on our website over 150 rows are currently written to the database. Often the data in the NVARCHAR(1000) key value column uses only a fraction of that space. It’s a drawback of the KVP implementation, and as a result the data is approximately 90% smaller when converted to a relational schema.

Schema processing

Following on from that constraint of the KVP implementation, we also gain an advantage. New KVP key names can be added at any time, which allows us to collect new data without extensive reconfiguration work. However, this means that when collecting the data we cannot transform it into a rigid schema, because the schema may change.

In addition, we can’t constrain the data within the key values to definite data types, as this may also change, therefore we continue to store the data – at least during collection – in NVARCHAR columns.

Finally, the length of the data within the NVARCHAR columns can vary. We know the maximum lengths in the current data for each key name, but that could change in future, so we have to allow for potential changes.



To account for these data constraints we must scan the incoming data, before each load, for new key names and for key values which are larger than previously seen. If we detect new key names we must create a column to store that data from now on. If we detect a key value which is larger than the current NVARCHAR length in the storage schema then we must expand that column. These two actions require prior knowledge of the key names and the lengths of the key values, which we store in a single metadata table. This table continues to be used throughout the remainder of the collection process.
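
A simplified sketch of that scan for one table follows, assuming a hypothetical Meta.KvpColumns metadata table holding each known key name and the longest value seen so far, and KVP columns named KeyName and KeyValue. The real checks run inside the SSIS package and issue the DDL dynamically:

-- 1. Key names present in the incoming data but unknown to the metadata:
--    each of these needs a new NVARCHAR column and a metadata row.
SELECT s.KeyName, MAX(LEN(s.KeyValue)) AS MaxLen
FROM Staging.LogFlow AS s
LEFT JOIN Meta.KvpColumns AS m ON m.KeyName = s.KeyName
WHERE m.KeyName IS NULL
GROUP BY s.KeyName;

-- 2. Known key names whose incoming values have outgrown the stored length:
--    each of these needs its column widened.
SELECT s.KeyName, MAX(LEN(s.KeyValue)) AS NewMaxLen, m.CurrentLength
FROM Staging.LogFlow AS s
JOIN Meta.KvpColumns AS m ON m.KeyName = s.KeyName
GROUP BY s.KeyName, m.CurrentLength
HAVING MAX(LEN(s.KeyValue)) > m.CurrentLength;

-- 3. For each hit the package then issues DDL along these lines (built dynamically):
ALTER TABLE DataSource.ConversionLogFlow ADD [SomeNewKey] NVARCHAR(100) NULL;
ALTER TABLE DataSource.ConversionLogFlow ALTER COLUMN [SomeExistingKey] NVARCHAR(200) NULL;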

This post is continued in part 2.

Sunday, 31 March 2013

Book Review - The Checklist Manifesto


Periodically you read books which change your habits and make you more effective. The Checklist Manifesto is one such book. Authors who can construct enjoyable prose from subjects as dry as this topic first appears should be treasured.

Although this book is primarily a story detailing the introduction of surgical checklist techniques learned from the aviation industry, Atul Gawande backs up his story with a broad range of examples. And as a surgeon himself, the author is a participant in the story too. I for one prefer to learn from experienced doers over pure thinkers.

The crux of the book is simply that by highlighting important steps within a process with defined pauses for team coordination you can eliminate many avoidable errors freeing the mind to work on high-level problems. 

But the problem is that we've come to venerate expertise to a point where we expect the expert to constantly be in control. We accept this situation unquestioned and the expert – whether in control or not – becomes an unapproachable authority. Aviation accidents of yesteryear abound with examples of unchallenged decisions which led to disaster and cost lives. 

How then to implement a simple checklist in teams comprised of subject matter experts? Gawande explains that ego must be left at the door in order that experts can be challenged by members of a hierarchically flattened team. 

Perhaps most difficult of all is consistency. We can all see the advantages provided by checklists which reduce simple errors. Indeed the author provides many statistical examples highlighting the transformative effects realised. But the self-control to step through a list for every repetitive process every time, sometimes through many repetitions in the same working day, requires discipline.

Discipline is hard – harder than trustworthiness and skill and perhaps even selflessness. We are by nature flawed and inconstant creatures. We can’t even keep from snacking in between meals. We are not built for discipline. We are built for novelty and excitement, not for careful attention to detail. Discipline is something we have to work at. The Checklist Manifesto, page 183.
And with this insight we have the habit changing gem that will make you more effective. Expertise alone is not enough. It must be harnessed within a framework that allows experts time to devote to those higher level problems without commonplace distractions.

Buy it. Read it. Use it.

Tuesday, 5 March 2013

Pragmatic Principles - Agile

The Data Warehouse exists to provide Business Intelligence. Its users will not always start with the end in mind. They will browse, discover, probe and dig. As designers it is our responsibility to construct solutions that allow users to answer questions which emerge as they browse. 

How then, can we ask for a precise specification document typical of waterfall projects? 

Sure, there are elements we can map out in advance; dimensions we know will be needed, architecture which will form part of the ETL process. But not everything can be defined.

My preferred approach is to get a proof of concept up and running quickly. This stirs the creative juices of the business users, especially those who think visually, and leads to feedback which drives the design and development forward.

This flavour of agile is not the same as the formal Agile methodology, but it works. 

Monday, 18 February 2013

Refactoring Legacy Databases

  • Know your enemy - It might not always seem so, but legacy code was written with a process in mind. Determining what the code was originally designed to do is the most important step toward improving it
  • Small bites - How do you eat an elephant? One bite at a time. Choose a process to refactor rather than diving straight into an unstructured rewrite
  • Retire old processes - If the process is obsolete, kill it off
  • Start with the end in mind - Work with the system stakeholders to agree the project's goals
  • Set standards - Old code can be in conflict with your current coding standards. (You do have defined coding standards, don't you?) A clean code approach will make the job of refactoring your code easier when the day comes that your code is itself refactored.
  • Stop digging - Don't let developers submit code which contains lazy hardcoding or other bad practices.
  • Engage - Talk to your developers. Explain what constitutes good practice. Educate them.
  • Process - If you've got developers adding poor quality code to your database put a code review process in place. (You do have a code review process don't you?)
  • Permissions - Often developers who've worked with a system for years feel they know it so well they can jump on the box and commit a code change. This is how spaghetti code begins. If the softly-softly approach doesn't work, revoke the developers' permissions to commit changes to your database.
  • Evolve - You're unlikely to get executive sign-off on a project to go and clean up existing code, especially in a hectic environment. So take a medium-term view. Tackle the worst code first and keep steady pressure on to continually improve coding standards.
  • Test-driven refactoring - Changes you make to a legacy system often have to produce the same result, otherwise you've just broken a live application. So using tests is vital to ensure the quality of the refactoring process, just as TDD is a mainstay of modern application development (a lightweight example is sketched below).
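
For example, a framework-free way to test a refactored stored procedure is to capture the old and new result sets and diff them. The procedure names and columns below are placeholders; a fuller option is a unit-testing framework such as tSQLt:

-- Capture both result sets (assumes both procedures return the same shape).
CREATE TABLE #before (PolicyNumber INT, Premium DECIMAL(10,2));
CREATE TABLE #after  (PolicyNumber INT, Premium DECIMAL(10,2));

INSERT INTO #before EXEC dbo.usp_MonthlyPremiums_Legacy     @MonthStart = '20130101';
INSERT INTO #after  EXEC dbo.usp_MonthlyPremiums_Refactored @MonthStart = '20130101';

-- Any rows returned by either query indicate a behavioural change.
SELECT * FROM #before EXCEPT SELECT * FROM #after;
SELECT * FROM #after  EXCEPT SELECT * FROM #before;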

Check out these links for useful resources:

RedGate to the rescue once again with their excellent toolset here and advice on how to use it here. I specifically like the ability to format code, auto-generate aliases, encapsulate code in stored procedures and qualify object names. This tool saves so much time.

http://databaserefactoring.com/ - Lots of good specific advice here.

Sunday, 27 January 2013

SQL Bad Practice - Hardcoding

Legacy systems present design challenges different to problems faced during greenfield development. One mistake I've seen time and again at multiple organisations is the problem of hardcoding. I cringe when I view a stored procedure in SSMS covered with red text. I find it's often accompanied by an absence of annotation within the code.

-- Example of the problem: presentation logic and business rules
-- hardcoded into the query itself.
SELECT
    FirstName + ' ' + LastName as "Full Name",
    case UserRole
        when 2 then 'Admin'
        when 1 then 'Moderator'
        else 'User'
    end as "User's Role",
    case SignedIn
        when 0 then 'Logged in'
        else 'Logged out'
    end as "User signed in?",
    Convert(varchar(100), LastSignOn, 101) as "Last Sign On",
    DateDiff(day, LastSignOn, getDate()) as "Days since last sign on",
    AddrLine1 + ' ' + AddrLine2 + ' ' + AddrLine3 + ' ' +
        City + ', ' + State + ' ' + Zip as "Address",
    'XXX-XX-' + Substring(
        Convert(varchar(9), SSN), 6, 4) as "Social Security #"
FROM Users
These are reasons why it's a bad practice:

  • Concatenating columns for presentation purposes should be done in the application or report to which the data is returned.
  • Extensive CASE statements classifying data are often a sign that a descriptive column is missing from the schema. I've seen CASE statements covering 2 pages of printed A4. 
  • Hardcoding business logic into stored procedures leads to errors when the business logic changes. Alternatively it leads to more complexity being added to the hardcoding to cater for a slightly different case. Eventually, this path leads to code so complex that nobody dares to touch it.
  • Use of string functions, particularly scalar functions, can cause performance deterioration. I once inherited a database where the developer had constructed his own BETWEEN function and used it extensively. It mirrored the built-in BETWEEN operator in its output whilst performing terribly. Aaron Bertrand has an excellent blog post on why "Between is Evil".
So what can you do to eliminate this bad practice? Read my post on refactoring legacy systems.
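
To make the second point above concrete, here is a sketch of the alternative, with table, column and value mappings invented for illustration: store the descriptive labels once in a reference table and join to it, instead of hardcoding them in every query.

-- Reference table holds the labels the CASE expression was hardcoding.
-- (The 0 = 'User' row is an assumption about the default.)
CREATE TABLE dbo.UserRoles (
    UserRoleID INT PRIMARY KEY,
    RoleName   VARCHAR(50) NOT NULL
);

INSERT INTO dbo.UserRoles (UserRoleID, RoleName)
VALUES (0, 'User'), (1, 'Moderator'), (2, 'Admin');

-- The query returns raw values plus the joined description; presentation
-- concatenation and formatting move to the report or application layer.
SELECT
    u.FirstName,
    u.LastName,
    r.RoleName,
    u.SignedIn,
    u.LastSignOn
FROM dbo.Users AS u
LEFT JOIN dbo.UserRoles AS r
    ON r.UserRoleID = u.UserRole;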

If you can think of other reasons why excessive hardcoding is bad news, leave a comment.