Monday, 21 July 2014

Day #1 Session #4 - Brian Knight - Performance Tuning SSIS



Note on my shorthand: [+] means actionable insight, [i] refers to something of interest, [!] indicates a warning, [?] poses a question I'll need to ask.


[i] Use the right tool for the job. SSIS is not a hammer to hit all nails with. T-SQL and other add-ons or complementary tools have their place.
  • Flat Files
    • [+] The data will be parsed as it goes in. If you can trust the veracity of the source data types, turn on Fast Parse, which gives an 8% performance boost.
    • You'll find it only in the Advanced Editor: Input/Output Properties, per output column, Custom Properties, FastParse = True
    • It only works on reads
    • Also works on Data Conversion task
    • If the data is not as good quality as you thought, the task will still fail, but it'll get to the end first.
  • OLEDB Connections
    • Default network packet size is 4K. You can increase this to 32K for larger datasets.
    • [+] You'll need the infrastructure guys to turn on Jumbo Frames across the network path the data travels. The network guys will need to run a traceroute to see where the data goes.
    • Change the setting in the Connection Properties, All tab: Packet Size = 0 is the default; change it to 32767. (A sketch of the server-side equivalent follows this list.)
    • Set the source output to "pre-sorted" (IsSorted) in the advanced properties so the engine will not have to do this itself, as long as you really have pre-sorted the data.
  • T-SQL
    • [i] Use T-SQL for the things it is good at and avoid the equivalent task in SSIS
  • Asynchronous components
    • e.g. a Sort. These are tasks that by definition have to wait for the full data set before they can act. They're bottlenecks; avoid them if possible.
    • Brian's blog has a page which calls out all the asynchronous tasks
    • [+] Tune the Single Customer View package to remove asynchronous tasks
    • A classic example is the OLE DB Command, which works row by row and shouldn't be used; use set-based T-SQL instead for these actions (see the sketch after this list). It even opens and closes a connection for each row as well. This is a terrible component of SSIS.
  • Default Buffer Max Row
    • [+] In the Data Flow Task properties, trial and error a buffer of between 8K and 12K rows. See what performs best.
    • This could provide up to a 20% performance improvement.
  • SQL Destination
    • [!] It's deprecated now. Don't use it.
  • Bulk Inserts
    • Turn on Fast Load in the OLE DB Destination under Data Access Mode for bulk inserts.
    • But you can't use it if you need some rows to use conditional logic.
  • Manage CPU Load
    • Instead of seeing a massive CPU spike as the package commits the data, change the OLE DB Destination's Maximum insert commit size to around 50K rows to get regular commits and smooth the CPU load (see the batching sketch after this list).
    • [+] Alter the DW load packages, specifically the clickstream collection package, to use this. Monitor the effect with RedGate SQL Monitor.
  • Manage RAM
    • [+] Look at the Buffers in use and Buffers spooled counters in RedGate SQL Monitor. If Buffers spooled > 0, you're running out of RAM. You can also find these counters under SSIS Pipeline in PerfMon.
  • Windows System Resource Manager
    • [+] It's free and you already have it. Use it.
    • Allows the creation of a soft throttle on resources so a .exe can only use n% of resources.
    • It can be set to apply different usage restrictions at different times, so you can allow the DW build full resource during the night but limit it during the day.
  • Work Pile Pattern
    • [i] Check this out on his blog
    • The basic theory is that you can load balance highly transactional SSIS processes by spawning multiple packages when demand spikes. Very cool.
  • Fuzzy Tuning
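
On the OLEDB packet-size bullet above: the server also has a default network packet size, which the SSIS Packet Size property overrides per connection. A minimal T-SQL sketch of checking/changing the server-wide default; the 32767 value is illustrative, not a recommendation.

    -- Server-wide default network packet size (bytes); SSIS's Packet Size
    -- property on the connection manager overrides this per connection.
    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;
    EXEC sp_configure 'network packet size', 32767;  -- default is 4096
    RECONFIGURE;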
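
On the OLE DB Command bullet: the set-based alternative is to land the rows in a staging table and run one statement. A minimal sketch, assuming hypothetical dbo.Staging_Customer and dbo.DimCustomer tables.

    -- Hypothetical tables: dbo.Staging_Customer holds the rows the data flow
    -- just landed; dbo.DimCustomer is the target of the row-by-row updates.
    UPDATE d
    SET    d.Email     = s.Email,
           d.UpdatedOn = SYSDATETIME()
    FROM   dbo.DimCustomer AS d
    JOIN   dbo.Staging_Customer AS s
           ON s.CustomerKey = d.CustomerKey;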
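
And on the commit-size bullet: the raw T-SQL equivalent of regular commits is batching the load rather than running one giant transaction. A sketch using BULK INSERT with BATCHSIZE; the file path and staging table are made up.

    -- Hypothetical file and table; BATCHSIZE commits every 50,000 rows
    -- instead of one huge transaction, which smooths the CPU/log load.
    BULK INSERT dbo.Staging_Clickstream
    FROM 'D:\feeds\clickstream.csv'
    WITH (
        FIELDTERMINATOR = ',',
        ROWTERMINATOR   = '\n',
        TABLOCK,                 -- the T-SQL analogue of Fast Load's table lock
        BATCHSIZE       = 50000  -- regular commits, like Maximum insert commit size
    );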
Overall - Excellent fast-paced session. Really good practical advice delivered by a pro. 5*'s

Day #1 Session #3 - John Welch - Continuous Delivery for Data Warehouses and Marts



[i] The most interesting thing he said was to practice deployments and you'll get better.

[+] We need to include the collection of test data and tests at the requirements stage.
[+] Perhaps we need to add a dataset-compare check to our SQL tests so we can detect, via checksums, changes to a defined query (a sketch follows below).
[i] He also prefers to use a view layer between DIMs/FACTs and SSAS to allow for schema changes without altering the SSAS project.
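
A possible shape for that checksum check, as a sketch only; the table and columns are made up.

    -- Store this value after a known-good run, then compare on later runs;
    -- a different value means the defined query's result set has changed.
    SELECT CHECKSUM_AGG(BINARY_CHECKSUM(*)) AS ResultChecksum
    FROM (
        SELECT CustomerKey, Email, Country   -- the "defined query" under test
        FROM   dbo.DimCustomer
    ) AS q;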

Overall - Very poor session. Content either too high-level or too detailed. Poorly demonstrated. Wasted everybody's time. Lots of people walked out. 0*'s

Day #1 Session #2 - Jen Stirrup - Data Analysts Toolkit using R and PowerBI



Note on my shorthand: [+] means actionable insight, [i] refers to something of interest, [!] indicates a warning, [?] poses a question I'll need to ask.


R is an open source statistical programming language. It's complementary to many other data tools but does not replace them entirely.

Variables are untyped, so naming conventions are important.
Lists are known as Vectors
DataSets are known as DataFrames
A Dimension is known as a Factor

It can import and export data. It can access CSVs and ODBC connections, and even Hadoop.

It's difficult to Google for R resources because of its name. Use RSeek.org, which is an R-specific search engine. Jen Stirrup, the presenter, has a blog which explains how to use the various data analytics tools with R.

[?] Have we got any R skill in the Analytics team or elsewhere in the business?

R is a command-line language, but RStudio is a helpful tool.

The Excel snap-in tools PowerQuery, PowerView, PowerMap and PowerPivot are worth looking at to get Analytics using standard tools for testing and modelling.

PowerQuery is really good for data munging and cleaning. It can create models for re-use. Analytics needs to use this to build automated testing over their Excel estate. It's good for Data Mining as well.

Rattle is a data mining package for R which can reuse models.

Overall - Interesting session. CIG Analytics should use PowerQuery to build models to do testing. Send them the video. If they have any R skills we should encourage them to use it. Well delivered if a bit slow at times but she was following Brent. 3*'s

Day #1 Session #1 - Brent Ozar - Watch Brent Tuning Queries


Note on my shorthand: [+] means actionable insight, [i] refers to something of interest, [!] indicates a warning.


[+] Get Grant Fritchey's 2012 query tuning book; Brent recommends it.
[+] All content available on www.BrentOzar.com/go/tunequeries

He uses a checklist approach based on the Mnemonic BECREEPI. [+] Add this to the DBA OS for Performance Tuning

B = Blitz - Use sp_Blitz and sp_BlitzIndex from BrentOzar.com to rule out fundamental server-level issues before you waste time going any further. They give you advice on what to change to tune the server.
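Once the scripts are installed from the site, they run as stored procedures; a minimal sketch (the database name below is hypothetical).

    -- Server-level health check, then an index sanity check for one database.
    EXEC sp_Blitz;
    EXEC sp_BlitzIndex @DatabaseName = 'MyDatabase';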

E = End-User Requirements - Figure out what the end-user requirements are. Essentially you're defining the goal. It may be a specific level of performance, and they may have an acceptable wait time in mind.

C = Capture Metrics - You need to collect data about the query in order to assist the tuning process.
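The simplest capture is the session-level statistics switches; a minimal sketch, with a placeholder query.

    -- Report logical reads and CPU/elapsed time for the query under test.
    SET STATISTICS IO ON;
    SET STATISTICS TIME ON;

    SELECT TOP (100) *        -- placeholder for the query being tuned
    FROM   dbo.SomeLargeTable;

    SET STATISTICS IO OFF;
    SET STATISTICS TIME OFF;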

R = Read the Metrics - You went to the trouble of collecting and storing the data; now read it. There could be answers in it.
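If the plan cache is your metric store, a query along these lines (a sketch) surfaces the heaviest readers to investigate first.

    -- Top 10 cached queries by logical reads, with a snippet of their text.
    SELECT TOP (10)
           qs.total_logical_reads,
           qs.execution_count,
           SUBSTRING(st.text, 1, 200) AS query_text
    FROM   sys.dm_exec_query_stats AS qs
    CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
    ORDER BY qs.total_logical_reads DESC;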

E = Experimenting - Only now are you starting to refactor the code. Keep the original, version the changes with comments reflecting your thought process.

E = Execution Plan - Dive into the plan - but he warns against trusting an estimated plan. It looks like a foreign language and you should treat it like Arabic - start at the top right and work to the left and down.

P = Parallelism - If MAXDOP = 1, SQL Server cannot use parallelism even when it would be beneficial.
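Checking and changing the instance-wide setting is straightforward; the value 4 below is purely illustrative, as the right number depends on the box.

    -- View/change the instance-wide MAXDOP (0 = let SQL Server decide).
    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;
    EXEC sp_configure 'max degree of parallelism', 4;  -- example value only
    RECONFIGURE;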

I = Index - It's the last resort. Indexes are not free, so don't just throw NC_IDXs at the problem as you'll slow down transactional writes, thus creating another problem elsewhere.

He downloaded and used the StackOverflow DB in his demo. It's free and more "real world" than AdventureWorks.

He used www.StatisticsParser.com to copy his Statistics IO messages into and it presented the output very nicely for him.

[!] - "Estimates are junk" - he says in the actual plan you often get estimates as well so you can't rely on them. You must look at the reads to see the true picture. Often table variables look great in the estimated plan but it treats every table variable as a table of 1 row so it's false. This is precisely because they are variables.

Estimated sub-tree cost is a generic rating of a query based on arcane Microsoft code but valid if comparing queries on the same server. Across servers it has no value.

[i] - He advises no more than 5 indexes per table as a rule of thumb.
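A quick sanity check against that rule of thumb; a sketch using the catalog views.

    -- Tables carrying more than 5 non-clustered indexes.
    SELECT t.name   AS table_name,
           COUNT(*) AS nonclustered_indexes
    FROM   sys.indexes AS i
    JOIN   sys.tables  AS t ON t.object_id = i.object_id
    WHERE  i.type = 2                  -- non-clustered only
    GROUP BY t.name
    HAVING COUNT(*) > 5
    ORDER BY COUNT(*) DESC;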

[+] He uses SQL Sentry Plan Explorer which is free. Add it to the core software build list for the DBAs.

[+] Go back and look again at Indexed Views & Schema Binding. They could be useful in some DW cases: a query across multiple tables can effectively have an index on it. But it does cost, and the schema binding means no schema changes to the underlying objects, so the tables are static unless the view is dropped or altered first. Some ANSI NULL and related SET options need to be aligned as well. (A minimal sketch follows.)
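A minimal sketch of the pattern with made-up table and column names; two-part names and WITH SCHEMABINDING are required, and the first index must be unique and clustered.

    -- Amount is assumed NOT NULL (indexed views disallow SUM over nullable columns).
    CREATE VIEW dbo.vSalesByCustomer
    WITH SCHEMABINDING
    AS
    SELECT s.CustomerKey,
           SUM(s.Amount) AS TotalAmount,
           COUNT_BIG(*)  AS RowCnt     -- COUNT_BIG(*) is required with GROUP BY
    FROM   dbo.FactSales AS s
    GROUP BY s.CustomerKey;
    GO

    CREATE UNIQUE CLUSTERED INDEX IX_vSalesByCustomer
        ON dbo.vSalesByCustomer (CustomerKey);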

Overall - Excellent fast-paced session worth watching again. 5*'s