Part 1 of this series described the first two stages in the processing of clickstream data to allow sales conversion analysis. This post describes the most complex part of the process: the pivoting of Key-Value-Pair (KVP) data into a flattened schema.
Loading New Data
The third element of the process is transforming the KVP data from staging into a persistent schema. There are many reasons for collecting this data (it is a rich dataset) and each reason may require the data in a different format. We don’t know what analysis will be done on the data in future, or what format that analysis will require. As a result, we continue to store the data in string format for now and assume any future analysis will either build a layer on top of the base collection or extract the data into further persistent storage if specific transformations are required.
The staged data is still in a KVP format. This is difficult for users and report writers to consume, so we pivot the data to create one row per click event. This turns 150 narrow rows of KVP data per click event into a single wide row in the persistent schema, with a 90% reduction in storage space.
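To make that concrete, here is a rough illustration with made-up keys and values (only one of the key columns is shown, and a real event carries far more pairs). In staging, one click event is spread across many KVP rows:

| SessionID | Key | Value |
|---|---|---|
| {GUID-1} | Brand | SiteA |
| {GUID-1} | QuoteValue | 125.00 |
| {GUID-1} | Referrer | Google |

After the pivot, the same event occupies a single wide row in the persistent schema:

| SessionID | Brand | QuoteValue | Referrer |
|---|---|---|---|
| {GUID-1} | SiteA | 125.00 | Google |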
Isolating the click events
To achieve this we isolate the click events within the data. Each event has:
- SessionID stored as a GUID
- BornDateTime timestamp stored as a datetime
- StepID which is generally an integer but can contain characters so is stored as an NVARCHAR(3)
- LayoutID also stored as an NVARCHAR(3).
StepID is only used in Log_Flow and Log_Quote.
Together these 4 columns constitute a unique key across a single click event. We use them to create a Primary Key in the persistent storage tables.
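As a sketch of what that key looks like in the persistent schema (the table name is from the mapping below, but the definition here is illustrative and the full set of pivoted string columns is omitted):

```sql
-- Illustrative definition only: the real table also contains the full set of
-- pivoted KVP columns, all stored as strings.
CREATE TABLE DataSource.ConversionLogFlow
(
    SessionID     UNIQUEIDENTIFIER NOT NULL,  -- GUID identifying the session
    BornDateTime  DATETIME         NOT NULL,  -- timestamp of the click event
    StepID        NVARCHAR(3)      NOT NULL,  -- usually numeric, can contain characters
    LayoutID      NVARCHAR(3)      NOT NULL,
    -- ...pivoted KVP columns follow here...
    CONSTRAINT PK_ConversionLogFlow
        PRIMARY KEY CLUSTERED (SessionID, BornDateTime, StepID, LayoutID)
);
```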
Step one of the SSIS process identifies each distinct unique key within the data. These keys are then inserted into the corresponding persistent storage tables in a DataSource schema, as shown in the table below (a sketch of the insert follows the table).
| Staging Table | Persistent Table |
|---|---|
| Staging.LogFlow | DataSource.ConversionLogFlow |
| Staging.LogQuote | DataSource.ConversionLogQuotes |
| Staging.LogSales | DataSource.ConversionLogSales |
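For the Log_Flow feed, the key-isolation insert looks roughly like this minimal sketch (it assumes each KVP row in Staging.LogFlow carries the four key columns, and it only inserts keys not already present):

```sql
-- Insert the distinct key of every new click event into the persistent table.
-- Assumes Staging.LogFlow carries the four key columns on every KVP row.
INSERT INTO DataSource.ConversionLogFlow (SessionID, BornDateTime, StepID, LayoutID)
SELECT DISTINCT s.SessionID, s.BornDateTime, s.StepID, s.LayoutID
FROM   Staging.LogFlow AS s
WHERE  NOT EXISTS
       (
           SELECT 1
           FROM   DataSource.ConversionLogFlow AS d
           WHERE  d.SessionID    = s.SessionID
           AND    d.BornDateTime = s.BornDateTime
           AND    d.StepID       = s.StepID
           AND    d.LayoutID     = s.LayoutID
       );
```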
Pivoting the data into the persistent schema
Each set of key columns we detect in the data and insert into the persistent storage schema as a row must then be completed by updating the rest of the columns (the non-primary-key columns).
We do this by joining between the Staging and DataSource tables on those recently isolated primary keys. But we don’t know which columns we’ll be updating because, as we found out in the second stage of the process, the schema can change. The only way around this was to use a dynamic SQL statement which lists the columns to be updated. These columns must be sourced from the metadata we stored about the KVP data during stage 2 of the processing.
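The column list itself can be assembled in one step. The sketch below assumes the stage-2 metadata sits in a table called Staging.KVPColumnMetadata with ColumnName and SourceTable columns; those names are illustrative, not the production ones:

```sql
-- Build a comma-separated, quoted column list from the stage-2 metadata.
-- Staging.KVPColumnMetadata and its columns are assumed names.
DECLARE @ColumnList VARCHAR(MAX);

SELECT @ColumnList = STUFF(
    (
        SELECT ',' + QUOTENAME(m.ColumnName)
        FROM   Staging.KVPColumnMetadata AS m
        WHERE  m.SourceTable = 'LogFlow'
        ORDER BY m.ColumnName
        FOR XML PATH(''), TYPE
    ).value('.', 'VARCHAR(MAX)'), 1, 1, '');
```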

We build up the variables ready for the dynamic SQL through a series of steps which gradually assemble the list of columns and construct a valid SQL UPDATE statement. The ultimate aim is to populate one VARCHAR(MAX) variable and then execute the dynamic SQL. That dynamic SQL uses a Common Table Expression (CTE) which itself utilises the PIVOT function. The CTE and PIVOT reference the Staging KVP data and turn it into rows which the subsequent UPDATE statement then uses to populate the persistent schema.
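The generated statement ends up with roughly the shape below. This is a simplified, hand-written version with a fixed pair of columns; in the real process the column lists are spliced in from the metadata-driven variable, and Brand and QuoteValue are placeholder KVP keys, as are the KVPKey and KVPValue column names in staging:

```sql
-- Simplified shape of the generated dynamic SQL. In production the column
-- lists are built from metadata; Brand/QuoteValue/KVPKey/KVPValue are placeholders.
DECLARE @sql VARCHAR(MAX);

SET @sql = '
WITH PivotedKVP AS
(
    SELECT SessionID, BornDateTime, StepID, LayoutID, [Brand], [QuoteValue]
    FROM
    (
        SELECT SessionID, BornDateTime, StepID, LayoutID, KVPKey, KVPValue
        FROM   Staging.LogFlow
    ) AS src
    PIVOT
    (
        MAX(KVPValue) FOR KVPKey IN ([Brand], [QuoteValue])
    ) AS pvt
)
UPDATE d
SET    d.[Brand]      = p.[Brand],
       d.[QuoteValue] = p.[QuoteValue]
FROM   DataSource.ConversionLogFlow AS d
JOIN   PivotedKVP AS p
       ON  p.SessionID    = d.SessionID
       AND p.BornDateTime = d.BornDateTime
       AND p.StepID       = d.StepID
       AND p.LayoutID     = d.LayoutID;';

EXEC (@sql);
```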
The final element is to update the process control table which stores the last successful timestamp.
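That step is just a small update along these lines, assuming a control table called DataSource.ProcessControl (the name and shape are illustrative):

```sql
-- Record the high-water mark so the next run only picks up newer click events.
-- DataSource.ProcessControl and its columns are assumed names.
UPDATE DataSource.ProcessControl
SET    LastSuccessfulLoad = (SELECT MAX(BornDateTime) FROM Staging.LogFlow)
WHERE  ProcessName = 'ClickstreamConversionLoad';
```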
In conclusion
Upon successful collection of the data we’re able to measure conversion rates on our websites. Each GUID represents a session and we can measure the number of quotes and sales resulting from that session. This data undergoes some further transformation before it’s loaded into a Clickstream Fact table and processed into an OLAP cube hosted by SSAS. The collection process is automated from raw data through to the final OLAP cube and requires very little maintenance. At present we collect about 100K rows per day, which equates to roughly 70MB of new data.
The only issue found to date arose when the number of KVP columns increased and the dynamic SQL which lists those columns exceeded 8,000 characters. That failed the process for a few days until we could fix it. The fix involved trimming white space from the dynamic SQL to reduce the character count, but this is a temporary workaround rather than a solution. A better approach would be to break the dynamic SQL into segments that run separately, or to identify unused columns and exclude them.
Previous DBAs failed to get anything useful out of this KVP data, primarily because they looked at the problem inflexibly. One had built a report over the raw data, and once the data changed the report failed. Another had used nested cursors (OMG!!!) to perform the data loads, and that process was so slow it never succeeded. Finally, it was outsourced to Mumbai with a poor data brief and without stipulating that the process had to be automated. As a result, the process was defined as a series of rigid steps which some poor soul followed manually before emailing the results back to the UK for analysis. The output regularly arrived in different formats and was consistently over two weeks late.
The current process took a couple of months to get right but now works automatically every day. It’s flexible and we’re able to use it for detailed analysis. We no longer pay the outsourcers a significant amount per month for poor quality data. Above all it was a really challenging but fun piece of development.