Monday, 21 July 2014

Day #1 Session #4 - Brian Knight - Performance Tuning SSIS



Note on my shorthand: [+] means actionable insight, [i] refers to something of interest, [!] indicates a warning, [?] poses a question I'll need to ask.


[i] Use the right tool for the job. SSIS is not a hammer to hit all nails with. T-SQL and other add-ons or complementary tools have their place.
  • Flat Files
    • [+] The data will be parsed as it goes in. If you can trust the veracity of the source data types, turn on Fast Parse, which gives an 8% performance boost.
    • You'll find it only in the Advanced Editor: Input/Output Properties, per output column, Custom Properties, FastParse = True
    • It only works on reads
    • Also works on Data Conversion task
    • If the data is not as good quality as you thought, the task will still fail, but it'll get to the end first.
  • OLEDB Connections
    • Default network packet size is 4K. You can increase this to 32K for larger datasets.
    • [+] You'll need the infrastructure guys to turn on Jumbo Frames across the network path the data travels. The network guys will need to run a traceroute to see where the data goes.
    • Change the setting in the Connection Properties, All tab: Packet Size = 0 is the default; change it to 32767. (A sketch of the server-side equivalent follows this list.)
    • Set the source output to "pre-sorted" (IsSorted) in the advanced properties so the engine will not have to do this itself, as long as you really have pre-sorted the data.
  • T-SQL
    • [i] Use T-SQL for the things it is good at and avoid the equivalent task in SSIS
  • Asynchronous components
    • e.g. a Sort. These are tasks that by definition have to wait for the full data set before they can act. They're bottlenecks; avoid them if possible.
    • Brian's blog has a page which calls out all the asynchronous tasks
    • [+] Tune the Single Customer View package to remove asynchronous tasks
    • A classic example is the OLE DB Command, which works row by row and shouldn't be used; use set-based T-SQL instead for these actions (see the sketch after this list). It even opens and closes a connection for each row as well. This is a terrible component of SSIS.
  • Default Buffer Max Row
    • [+] In the Data Flow Task properties, trial and error a buffer of between 8K and 12K rows. See what performs best.
    • This could provide up to a 20% performance improvement.
  • SQL Destination
    • [!] It's deprecated now. Don't use it.
  • Bulk Inserts
    • Turn on Fast Load in the OLE DB Destination under Data Access Mode for bulk inserts.
    • But you can't use it if you need some rows to use conditional logic.
  • Manage CPU Load
    • Instead of seeing a massive CPU spike as the package commits the data, change the OLE DB Destination's Maximum insert commit size to around 50K rows to get regular commits and smooth the CPU load (see the batching sketch after this list).
    • [+] Alter the DW load packages, specifically the clickstream collection package, to use this. Monitor the effect with RedGate SQL Monitor.
  • Manage RAM
    • [+] Look at the Buffers in use and Buffers spooled counters in RedGate SQL Monitor. If Buffers spooled > 0, you're running out of RAM. You can also find these counters under SSIS Pipeline in PerfMon.
  • Windows System Resource Manager
    • [+] It's free and you already have it. Use it.
    • Allows the creation of a soft throttle on resources so a .exe can only use n% of resources.
    • It can be set to apply different usage restrictions at different times, so you can allow the DW build full resource during the night but limit it during the day.
  • Work Pile Pattern
    • [i] Check this out on his blog
    • The basic theory is that you can load balance highly transactional SSIS processes by spawning multiple packages when demand spikes. Very cool.
  • Fuzzy Tuning
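
On the OLEDB packet-size bullet above: the server also has a default network packet size, which the SSIS Packet Size property overrides per connection. A minimal T-SQL sketch of checking/changing the server-wide default; the 32767 value is illustrative, not a recommendation.

    -- Server-wide default network packet size (bytes); SSIS's Packet Size
    -- property on the connection manager overrides this per connection.
    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;
    EXEC sp_configure 'network packet size', 32767;  -- default is 4096
    RECONFIGURE;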
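
On the OLE DB Command bullet: the set-based alternative is to land the rows in a staging table and run one statement. A minimal sketch, assuming hypothetical dbo.Staging_Customer and dbo.DimCustomer tables.

    -- Hypothetical tables: dbo.Staging_Customer holds the rows the data flow
    -- just landed; dbo.DimCustomer is the target of the row-by-row updates.
    UPDATE d
    SET    d.Email     = s.Email,
           d.UpdatedOn = SYSDATETIME()
    FROM   dbo.DimCustomer AS d
    JOIN   dbo.Staging_Customer AS s
           ON s.CustomerKey = d.CustomerKey;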
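
And on the commit-size bullet: the raw T-SQL equivalent of regular commits is batching the load rather than running one giant transaction. A sketch using BULK INSERT with BATCHSIZE; the file path and staging table are made up.

    -- Hypothetical file and table; BATCHSIZE commits every 50,000 rows
    -- instead of one huge transaction, which smooths the CPU/log load.
    BULK INSERT dbo.Staging_Clickstream
    FROM 'D:\feeds\clickstream.csv'
    WITH (
        FIELDTERMINATOR = ',',
        ROWTERMINATOR   = '\n',
        TABLOCK,                 -- the T-SQL analogue of Fast Load's table lock
        BATCHSIZE       = 50000  -- regular commits, like Maximum insert commit size
    );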
Overall - Excellent fast-paced session. Really good practical advice delivered by a pro. 5*'s

Day #1 Session #3 - John Welch - Continuous Delivery for Data Warehouses and Marts



[i] The most interesting thing he said was to practice deployments and you'll get better.

[+] We need to include the collection of test data and tests at the requirements stage.
[+] Perhaps we need to add a dataset-compare check to our SQL tests so we can detect, via checksums, changes to a defined query (a sketch follows below).
[i] He also prefers to use a view layer between DIMs/FACTs and SSAS to allow for schema changes without altering the SSAS project.
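
A possible shape for that checksum check, as a sketch only; the table and columns are made up.

    -- Store this value after a known-good run, then compare on later runs;
    -- a different value means the defined query's result set has changed.
    SELECT CHECKSUM_AGG(BINARY_CHECKSUM(*)) AS ResultChecksum
    FROM (
        SELECT CustomerKey, Email, Country   -- the "defined query" under test
        FROM   dbo.DimCustomer
    ) AS q;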

Overall - Very poor session. Content either too high-level or too detailed. Poorly demonstrated. Wasted everybody's time. Lots of people walked out. 0*'s

Day #1 Session #2 - Jen Stirrup - Data Analysts Toolkit using R and PowerBI



Note on my shorthand: [+] means actionable insight, [i] refers to something of interest, [!] indicates a warning, [?] poses a question I'll need to ask.


R is an open source statistical programming language. It's complementary to many other data tools but does not replace them entirely.

Variables are untyped, so naming conventions are important.
Lists are known as Vectors
DataSets are known as DataFrames
A Dimension is known as a Factor

It can import and export data. It can access CSVs and ODBC connections, and even Hadoop.

It's difficult to Google for R resources because of its name. Use RSeek.org, which is an R-specific search engine. Jen Stirrup, the presenter, has a blog which explains how to use the various data analytics tools with R.

[?] Have we got any R skill in the Analytics team or elsewhere in the business?

R is a command-line language, but RStudio is a helpful tool.

The Excel snap-in tools PowerQuery, PowerView, PowerMap and PowerPivot are worth looking at to get Analytics using standard tools for testing and modelling.

PowerQuery is really good for data munging and cleaning. It can create models for re-use. Analytics needs to use this to build automated testing over their Excel estate. It's good for Data Mining as well.

Rattle is a data mining package for R which can reuse models.

Overall - Interesting session. CIG Analytics should use PowerQuery to build models to do testing. Send them the video. If they have any R skills we should encourage them to use it. Well delivered if a bit slow at times but she was following Brent. 3*'s

Day #1 Session #1 - Brent Ozar - Watch Brent Tuning Queries


Note on my shorthand: [+] means actionable insight, [i] refers to something of interest, [!] indicates a warning.


[+] Get Grant Fritchey's 2012 query tuning book; Brent recommends it.
[+] All content available on www.BrentOzar.com/go/tunequeries

He uses a checklist approach based on the Mnemonic BECREEPI. [+] Add this to the DBA OS for Performance Tuning

B = Blitz - Use sp_Blitz and sp_BlitzIndex from BrentOzar.com to rule out fundamental server-level issues before you waste time going any further. They give you advice on what to change to tune the server.
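Once the scripts are installed from the site, they run as stored procedures; a minimal sketch (the database name below is hypothetical).

    -- Server-level health check, then an index sanity check for one database.
    EXEC sp_Blitz;
    EXEC sp_BlitzIndex @DatabaseName = 'MyDatabase';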

E = End-User Requirements - Figure out what the end-user requirements are. Essentially you're defining the goal. It may be a specific level of performance, and they may have an acceptable wait time in mind.

C = Capture Metrics - You need to collect data about the query in order to assist the tuning process.
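The simplest capture is the session-level statistics switches; a minimal sketch, with a placeholder query.

    -- Report logical reads and CPU/elapsed time for the query under test.
    SET STATISTICS IO ON;
    SET STATISTICS TIME ON;

    SELECT TOP (100) *        -- placeholder for the query being tuned
    FROM   dbo.SomeLargeTable;

    SET STATISTICS IO OFF;
    SET STATISTICS TIME OFF;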

R = Read the Metrics - You went to the trouble of collecting and storing the data; now read it. There could be answers in it.
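If the plan cache is your metric store, a query along these lines (a sketch) surfaces the heaviest readers to investigate first.

    -- Top 10 cached queries by logical reads, with a snippet of their text.
    SELECT TOP (10)
           qs.total_logical_reads,
           qs.execution_count,
           SUBSTRING(st.text, 1, 200) AS query_text
    FROM   sys.dm_exec_query_stats AS qs
    CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
    ORDER BY qs.total_logical_reads DESC;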

E = Experimenting - Only now are you starting to refactor the code. Keep the original, version the changes with comments reflecting your thought process.

E = Execution Plan - Dive into the plan - but he warns against trusting an estimated plan. It looks like a foreign language and you should treat it like Arabic - start at the top right and work to the left and down.

P = Parallelism - If MAXDOP = 1, SQL Server cannot use parallelism even when it would be beneficial.
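Checking and changing the instance-wide setting is straightforward; the value 4 below is purely illustrative, as the right number depends on the box.

    -- View/change the instance-wide MAXDOP (0 = let SQL Server decide).
    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;
    EXEC sp_configure 'max degree of parallelism', 4;  -- example value only
    RECONFIGURE;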

I = Index - It's the last resort. Indexes are not free, so don't just throw NC_IDXs at the problem as you'll slow down transactional writes, thus creating another problem elsewhere.

He downloaded and used the StackOverflow DB in his demo. It's free and more "real world" than AdventureWorks.

He used www.StatisticsParser.com to copy his Statistics IO messages into and it presented the output very nicely for him.

[!] - "Estimates are junk" - he says in the actual plan you often get estimates as well so you can't rely on them. You must look at the reads to see the true picture. Often table variables look great in the estimated plan but it treats every table variable as a table of 1 row so it's false. This is precisely because they are variables.

Estimated sub-tree cost is a generic rating of a query based on arcane Microsoft code but valid if comparing queries on the same server. Across servers it has no value.

[i] - He advises no more than 5 indexes per table as a rule of thumb.
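A quick sanity check against that rule of thumb; a sketch using the catalog views.

    -- Tables carrying more than 5 non-clustered indexes.
    SELECT t.name   AS table_name,
           COUNT(*) AS nonclustered_indexes
    FROM   sys.indexes AS i
    JOIN   sys.tables  AS t ON t.object_id = i.object_id
    WHERE  i.type = 2                  -- non-clustered only
    GROUP BY t.name
    HAVING COUNT(*) > 5
    ORDER BY COUNT(*) DESC;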

[+] He uses SQL Sentry Plan Explorer which is free. Add it to the core software build list for the DBAs.

[+] Go back and look again at Indexed Views & Schema Binding. They could be useful in some DW cases: a query across multiple tables can effectively have an index on it. But it does cost, and the schema binding means no schema changes to the underlying objects, so the tables are static unless the view is dropped or altered first. Some ANSI NULL and related SET options need to be aligned as well. (A minimal sketch follows.)
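A minimal sketch of the pattern with made-up table and column names; two-part names and WITH SCHEMABINDING are required, and the first index must be unique and clustered.

    -- Amount is assumed NOT NULL (indexed views disallow SUM over nullable columns).
    CREATE VIEW dbo.vSalesByCustomer
    WITH SCHEMABINDING
    AS
    SELECT s.CustomerKey,
           SUM(s.Amount) AS TotalAmount,
           COUNT_BIG(*)  AS RowCnt     -- COUNT_BIG(*) is required with GROUP BY
    FROM   dbo.FactSales AS s
    GROUP BY s.CustomerKey;
    GO

    CREATE UNIQUE CLUSTERED INDEX IX_vSalesByCustomer
        ON dbo.vSalesByCustomer (CustomerKey);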

Overall - Excellent fast-paced session worth watching again. 5*'s