SQLintersection session – Top Customer Problems: How to Avoid Calling Bob

I attended the SQLintersection conference in November 2014 in Las Vegas, Nevada.  I am including my notes from the sessions in this blog, primarily for my own reference later.  For anyone who comes across these notes, please take them with a grain of salt – I may very easily have misheard something or have written it down incorrectly.

Session Title:  Top Customer Problems: How to Avoid Calling Bob
Instructor: Bob Ward

  • Biggest issues (in order)
    • Performance
      • Is it a “running” problem (high CPU) vs a “waiting” problem (IO bottleneck, blocking, latch waits, other waits)?
        • Blocking – usually a problem within the app
        • IO bottleneck (PAGEIOLATCH or WRITELOG)
        • Latch waits, hotspots (PAGELATCH_XX)
        • RESOURCE_SEMAPHORE waits
        • ASYNC_NETWORK_IO can be a network problem, but usually a problem with client result processing
        • Other waits, check DMV
      • If the problem is current: Performance Dashboard Reports and DMVs
        • Performance Dashboard Reports
          • DMVs only
          • Create SP in master
          • 2005 version (use for 2005, 2008, 2008R2)
          • 2012 version (use for 2012, 2014)
        • Check for other processes using lots of resources
          • High kernel time (i.e., anti-virus)
        • Within SQL Server
          • Background/system tasks
          • High recompiles
        • Query
          • Usually about cardinality estimation
            • Many root causes, see table
            • Table variables (estimate = 1)
              • TF 2453 (2012 SP2+)
            • Data type mismatch (cannot make estimate, change data types)
          • Poor/missing indexes
          • Query design (lots of scans)
          • Lots of parallelism (check DOP)
        • Post-mortem problems: PSSDiag, SQLNexus
    • Backup/Restore
      • Backup
        • OS error 64 – Network problems (backup started, but connection dropped)
        • OS error 1450 – OS memory problems (backing up to a remote drive)
        • Transaction log growth (full recovery with no t-log backups)
        • VDI backup failure (3rd party app problems)
      • Restore
        • Disk bottleneck
        • Long-running recovery
        • VLF fragmentation
        • CHECKSUM failures
          • Use WITH CONTINUE_AFTER_ERROR to override
        • Consider restoring to a different DB vs replace (if error while restoring, still have fallback)
    • Consistency Issues
      • Message 824 error / DBCC CHECKDB
      • Checksum errors
      • Suspect DB
      • No “magic tricks” to fix
      • Almost always an IO path error (try SQLIOSIM)
      • Can also be memory corruption
        • Constant page sniffing – thread that randomly checks memory pages
      • Could be a SQL bug (replay the log)
      • Session id of -3 = recovery (deferred)
      • Page-level recovery
    • Stackdumps/Exceptions
      • Verify that the faulting module is SQL
    • Obscure errors
      • Message 1205 – Deadlock
      • Message 833 – IO delay warning (15 sec IO delay)
      • Message 17883 – Non-yielding scheduler (almost always a SQL bug)
      • Message 17884, 17888 – usually a long blocking chain
      • Message 844, 845, 846, 847 – latch timeout
      • Use the error_reported XEvent to track down a specific error
    • Startup problems
      • Permissions problem (may have changed service account followed by a hard restart)
      • Config (i.e., max server memory)
      • Damaged files
        • Keep file copies of master (shut down SQL, copy file off) and binaries
      • Backup tail of model causes restart problems
    • Memory issues
      • Out of memory
      • Memory growth (OK if below max server memory)
    • Hidden gems
      • System health session persisted (2014)
      • NUMA guidance
      • System Center Advisor

tempdb bulk operation optimized (fix for 2012, built into 2014)
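The “running vs. waiting” triage at the top of these notes usually starts from the wait-stats DMV. A minimal sketch (the exclusion list of benign background wait types is only a sample, not exhaustive):

```sql
-- Top waits since stats were last cleared; excludes a few benign
-- background wait types (this exclusion list is only a sample).
SELECT TOP (10)
    wait_type,
    wait_time_ms / 1000.0                         AS wait_time_sec,
    (wait_time_ms - signal_wait_time_ms) / 1000.0 AS resource_wait_sec,
    waiting_tasks_count
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN (N'SLEEP_TASK', N'LAZYWRITER_SLEEP',
                        N'XE_TIMER_EVENT', N'BROKER_TASK_STOP')
ORDER BY wait_time_ms DESC;
```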

SQLintersection session – In-Memory OLTP Overview

Session Title:  In-Memory OLTP Overview
Instructor: Kevin Farlee

  • Hekaton = In-Memory OLTP
  • Columnstore = In-Memory DW
  • CPU clock rates stalled out about 2005, but more power now coming in the form of multiple cores
  • Memory costs continue to plummet
  • Saying “It’s fast because it’s in memory” doesn’t really cover the Hekaton story
    • Can add enough memory to load file-based tables into buffer pool
    • Hekaton has many more speed optimizations
  • Focus on data hotspots, and load that data into Hekaton. Doesn’t have to be the entire dataset or even a significant portion of the dataset
  • Compiled stored procedures don’t have to context-switch (but still subject to quantum)
  • Cannot currently replicate from Hekaton to another system, but can replicate from a disk table to a Hekaton table
  • Based on optimistic concurrency – assume that no one else will access simultaneously
  • Validation errors cause rollback; have to treat like deadlock in code (i.e., retry)
    • SQL does retry in some very limited conditions
  • Works with Always On, AG, backups, etc.
  • Not recommended for work queues
  • Can enable delayed durability on Hekaton tables
  • If you run out of memory, can still delete rows, but cannot update or insert
  • Indexes are not persisted on disk; they are always built in memory
  • SCHEMA_ONLY does not even write to transaction log (other than table creation)
  • Case example of online gambling company that accessed web session state from SQL
    • Using Hekaton, went from 15,000 txn/sec to 250,000
  • Uses more efficient logging to improve write throughput
  • Need to set bucket count an order of magnitude (or two) larger than expected row count
  • Tables can be schema and data or schema only
  • Can create natively compiled SPs
    • Calls out to C compiler to create
    • Entire SP is atomic
  • Some transactions cannot run in read committed
    • Must use snapshot, repeatable read, or serializable
  • Cannot comment on whether this will work its way into Azure, but wouldn’t be surprised
  • Don’t necessarily go big bang – convert tables a few at a time until goals are achieved
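As a rough illustration of the bucket-count and native-compilation points above, here is a minimal sketch (SQL Server 2014 syntax; it assumes a memory-optimized filegroup already exists, and all object names are made up):

```sql
-- Hash index BUCKET_COUNT sized an order of magnitude above an
-- expected ~100,000 rows, per the guidance above.
CREATE TABLE dbo.SessionState
(
    SessionId INT NOT NULL
        PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 1000000),
    Payload   VARBINARY(2000) NOT NULL
)
WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);
GO
-- A natively compiled procedure; the whole body is one atomic block.
CREATE PROCEDURE dbo.GetSession @SessionId INT
WITH NATIVE_COMPILATION, SCHEMABINDING, EXECUTE AS OWNER
AS
BEGIN ATOMIC WITH (TRANSACTION ISOLATION LEVEL = SNAPSHOT,
                   LANGUAGE = N'us_english')
    SELECT Payload FROM dbo.SessionState WHERE SessionId = @SessionId;
END;
```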

SQLintersection session – Parallel Query Execution, Part 2

Session Title:  Parallel Query Execution, Part 2
Instructor: Paul White

Batch mode parallel execution

    • Geared toward DW scenarios
    • Row mode model not as efficient on modern CPU architecture
      • Frequently has to access memory rather than L1, L2, L3 cache
    • Batch mode
      • Batches contain 64-900 rows
      • Code designed to fit on L1 cache
      • Data sized to fit on L2 cache
      • Geared toward 64-bit register use
      • Uses push model (whereas row mode is a pull model)
      • Avoids expensive exchanges with low-contention shared structures
      • Has high startup cost, so not good for OLTP workloads
      • Threads are dynamic and can move between operators
    • Batch is a 64KB structure designed to fit in L2 cache
      • Contains a column vector that fits in L1 cache
      • Allocated one time, reused many times
    • TF 9453 – Disable batch mode
    • Must keep columns small so that at least 64 rows will fit into 64KB structure
    • Batch mode not implemented on string aggregates
    • Columnstore
      • High compression, reduces IO
      • Stored as LOBs, optimizations for LOB
      • Created by partitioning into row groups of 1024^2 (about one million)
        • Partitioned into segments; reordered for compression
        • Dictionary per column per segment, plus global dictionary per column
      • SQL 2012 had many limitations, including being read-only
        • Also had very limited plan operators that were supported
        • Small changes to query could change to row mode and be much slower
        • Can often rewrite query to stay in batch mode, but can get ugly and requires a fair bit of internals knowledge – not well suited for end users
        • If operators spill to tempdb, query could start over in row mode
      • SQL 2014 has many improvements
        • Writeable, updateable, better compression, DDL allowed, more column types
        • More operators added, but still limited
          • Still no sort
        • Spills stay in batch mode
        • Allows mixed mode (i.e., plan contains row mode, switch to batch mode, back to row mode) leaving more options for optimizer
        • Improved bitmaps
          • Any supported datatype
          • Both simple and Bloom filters
        • Bookmark support (this is not clustered index/RID lookups)
          • Unique key for columnstore indexes
        • Delta store for inserts; tuple mover to merge when full
          • Deleted rows stay in columnstore (marked as deleted) until REBUILD
          • Update = delete + insert
        • Still has limitations
          • No PK, FK, nonclustered indexes, unique constraints (can work around with materialized views), replication, CDC
          • Not readable on AG secondaries
          • (Interesting note – bug in error message when attempting to create FK – error message lists wrong table!)
        • TF 8607 – Output text of internal execution plan
        • TF 3604 – Output to client
        • TABLESAMPLE SYSTEM (15 PERCENT) – sample 15% of a table
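A minimal sketch of the 2014 columnstore pieces above (table name is illustrative): create a clustered columnstore index, then inspect the row groups, including the OPEN delta store that the tuple mover later compresses:

```sql
CREATE CLUSTERED COLUMNSTORE INDEX CCI_FactSales ON dbo.FactSales;

-- Row-group states: COMPRESSED segments vs. OPEN/CLOSED delta stores.
-- Deleted rows stay counted here until a REBUILD.
SELECT state_description, row_group_id, total_rows, deleted_rows
FROM sys.column_store_row_groups
WHERE object_id = OBJECT_ID(N'dbo.FactSales');
```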

SQLintersection session – Parallel Query Execution, Part 1

Session Title:  Parallel Query Execution, Part 1
Instructor: Paul White

Within a parallel plan, each thread is effectively running a serial plan

  • Parallel seeks and scans
    • 2005 and earlier use parallel page provider
    • 2008+ use range enumerators
      • May see ACCESS_METHODS_DATASET_PARENT latch waits
    • Only forward scans and seeks are supported
      • Backward scans can only happen serially
    • Exchange types
      • Hash – Distributed based on hash function
      • Round Robin – Sent to next available consumer
      • Broadcast – Pushed to all consumers
      • Demand – Row at a time (rare)
      • Range – Mostly index build plans
    • Exchanges can be sort order preserving or not
    • Root node has property called ThreadStats
      • Will show number of threads reserved and number of threads used
      • May not use all reserved threads depending on timing of thread allocation in the query (can reuse threads that are already complete)
    • exec_context_id is the thread number
      • 0 = serial
    • Query to identify node id where CXPACKET wait may be happening
    • Parallel merge join requires order preservation
      • Does not scale well to high DOP
    • Parallel hash join
      • Generally scales well to high DOP
      • Bitmap reduction – can only appear in parallel plans, and can greatly reduce work to be done (i.e., early reduction)
    • Parallel nested loops join
      • Inner side runs serially
      • Optimizer is “biased” against them
      • TF 8649 – Force parallel plan
      • Various tricks or plan guides to force usage
        • Such as specifying row goals
        • Express queries as APPLY – As query is pushed toward parallel nested loop queries, can push toward smaller memory grants and more efficient use of resources
    • Parallel sorts
      • Observe memory fraction percentage
      • Tricks to avoid spills
        • Cast strings to larger strings (i.e., select cast(str as varchar(100))) to force larger memory grant
        • Split sort into smaller sorts (initial select into temp table, then cross apply to temp table)
      • Avoid too much parallelism – set the cost threshold for parallelism
    • Parallel deadlocks
      • Once deadlock is detected, contents of CXPACKET buffers spill to tempdb
      • Interesting crafted demo that causes spills only at even DOPs (but not with odd)!
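The notes above mention a query to identify the plan node where CXPACKET waits may be happening; a sketch of one way to do that (the nodeId is embedded in resource_description):

```sql
-- Current CXPACKET waiters; resource_description looks like
-- 'exchangeEvent id=Pipe... WaitType=e_waitPipeNewRow nodeId=N'.
SELECT wt.session_id,
       wt.exec_context_id,      -- 0 = the coordinator thread
       wt.wait_duration_ms,
       wt.resource_description  -- contains the plan nodeId
FROM sys.dm_os_waiting_tasks AS wt
WHERE wt.wait_type = N'CXPACKET'
ORDER BY wt.session_id, wt.exec_context_id;
```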

SQLintersection session – Deploying SQL Server with Microsoft Azure Virtual Machine

Session Title:  Deploying SQL Server with Microsoft Azure Virtual Machine
Instructor: Bob Ward

Models

    • Azure portal (pay-as-you-go)
    • Bring own VM (use own license)
    • Management Studio (pay-as-you-go)
    • Azure VM Templates (capture Azure VM) (pay-as-you-go)
  • Gallery images
    • Optimized for OLTP or DW
  • VM Sizing
    • A series
      • Don’t use basic tier except for testing
      • A2+ for Standard Edition
      • A3+ for Enterprise Edition
      • A4, A7-A9 optimized
    • D series (faster CPU, SSD)
      • D4, D13-14 optimized
    • Can be changed later
  • Cloud services for things like Availability Groups, Availability Sets, load balancing
  • See MS decision tree for virtual networks
  • Must create TCP endpoint for MSSQL port (1433)
  • Takes about 10 minutes to provision
  • Can provision via SSMS (Tasks, Deploy to Azure VM)
  • Be sure to connect via RDP right after provisioning
  • Storage decisions
    • 500 IOPS/drive limit, but can span across drives
    • Use Windows Server 2012+ to create storage pools
      • One for data, one for log
      • Span across multiple disks
    • Be sure not to use system drive ever (including for system databases)
    • Do not enable Azure options for caching on storage drives (disabled by default)
    • D drive – do not use on A series
      • On D series, use D drive (SSD) for tempdb or BPE
    • Enable instant file initialization
    • Enable locked pages
    • Turn off geo-replication (doesn’t guarantee write ordering, can cause transactional inconsistencies)
    • AZCOPY – like robocopy for Azure
    • Apply fix for tempdb eager writes (do this both on-premise and in Azure) for SQL2012
    • Disable SSAS, SSIS, SSRS if not used
    • Check that SQL authentication is enabled
    • Windows firewall – open up port 1433 or the SQL Server executable
    • Consider ACLs on TCP endpoints
    • Optimized images
      • 12 TB storage pool for data
      • 2 TB storage pool for log
      • Startup –E
      • TF 1117, 1118
      • 8 tempdb files
      • New DB appears to be on c: drive, but is on a mount point
      • Instant file initialization
      • Locked pages enabled
    • Disk transfers/sec on perf monitor to measure IOs
  • Configuration of Availability Groups
    • 3 primary options
      • All in Azure, same datacenter
      • Secondaries in different datacenter
      • Primary on-premises, replica in Azure
    • Requires a DC in Azure
    • Coming soon: Always On template (about 5 images)
    • Add Replica Wizard in SSMS (requires hotfix)
    • Don’t try to upload AG VMs into Azure – won’t work and will have many problems
  • No failover clustering options (yet)
  • Can back up on-premises to Azure storage (2012+)
  • Can store data and log files on Azure storage (2014+)
  • Performance considerations
    • Batch transactions when possible
    • Autogrow considerations for log
  • Do not use autoscale for Azure
  • Future
    • G series coming soon (more cores, RAM, larger SSDs)
    • Premium storage
    • Azure file storage (shared file system)
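The on-premises backup-to-Azure-storage option (2012+) mentioned above looks roughly like this (storage account, container, and credential names are illustrative):

```sql
CREATE CREDENTIAL AzureBackupCred
WITH IDENTITY = 'mystorageaccount',        -- storage account name
     SECRET = '<storage access key>';

BACKUP DATABASE AdventureWorks
TO URL = 'https://mystorageaccount.blob.core.windows.net/backups/AdventureWorks.bak'
WITH CREDENTIAL = 'AzureBackupCred', COMPRESSION;
```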

SQLintersection session – Squeezing the Most out of Row and Page Compression

Session Title:  Squeezing the Most out of Row and Page Compression
Instructor: Andrew J. Kelly

A highly underutilized feature in SQL Server

    • Row compression
      • Fixed length columns stored as variable
      • Does not compress BLOB/LOB (i.e., LOB pages)
      • Storage format is a superset of vardecimal
      • Overhead for more metadata (i.e., variable lengths), but optimized for this
      • Recommended for high update systems
      • Overall better compression than page compression
    • Page compression
      • Does row compression first
      • Once page is full, further compression based on column and page dictionary
      • Also no BLOB/LOB (LOB pages)
      • Recommended for high read workloads
    • Can be enabled at table, index (clustered and/or nonclustered), partition level
      • Partition can be some or all
    • sp_estimate_data_compression_savings
      • Copies sample into tempdb and compresses it; extrapolate to entire table
    • Can be enabled or disabled online (essentially does an online rebuild)
    • Nulls and zero stored as four bits
    • No trailing spaces on char columns
    • Column prefix
      • Determine common prefix on columns
      • Even on binary and integer data
      • Stores prefixes, then uses pointers to prefixes
    • Page dictionary
      • Find repeating columns
      • Store dictionary, use pointer to dictionary entries
    • If bulk-inserting or inserting into a heap with page compression, will only get row compression
      • Unless TABLOCK is specified
    • May increase size of compressed backups
    • Don’t compress small objects without a good reason
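A sketch of the estimate-then-enable workflow described above (table name is illustrative):

```sql
-- Copies a sample into tempdb, compresses it, and extrapolates
-- the savings to the entire table.
EXEC sp_estimate_data_compression_savings
     @schema_name      = N'dbo',
     @object_name      = N'SalesOrderDetail',
     @index_id         = NULL,   -- all indexes
     @partition_number = NULL,   -- all partitions
     @data_compression = N'PAGE';

-- Enabling compression is essentially an (online-capable) rebuild.
ALTER INDEX ALL ON dbo.SalesOrderDetail
REBUILD WITH (DATA_COMPRESSION = PAGE, ONLINE = ON);
```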

SQLintersection session – A Tour through C# and Visual Basic Language Changes

This is one of two development sessions I attended.

Session Title:  A Tour through C# and Visual Basic Language Changes
Instructor: Kathleen Dollard

Visual Studio v14 is coded in native language (C#, VB)

    • Open source, transparent, etc.
    • Language extensibilities through analyzers
    • Language features have cleaner semantics
      • Some of these are in flux and may not happen
      • public int Id {get;} [=1;]
        • Can only init during construction; or, have initializer in code
      • Remove class name from static method calls
      • Initializers for dictionaries
        • new Dictionary<string, Customer> { [“A”] = new Cust(…), [“B”] = new Cust(…) }
      • await in catch/finally blocks
      • Conditional catch (exception filters)
        • catch (Exception ex) if (ex.Code == 4) { … }
      • Expression body members
        • public string FullName => FirstName + “ “ + LastName;
      • Null propagation (aka null conditionals; sometimes called the Elvis operator!)
        • x?.Y
        • Evaluates to null if x is null, else evaluates to x.Y
        • Example
          • If (Event != null) { Event.Invoke (…); }
          • Now becomes: Event?.Invoke(…)
        • array?[index]
          • Evaluates to null if array is null
      • String interpolation
        • Format(“{0} {1} {2}”, x, y, z)
        • Can now be written as $”\{x} \{y} \{z}”  (maybe will require \ prefix instead of $; still in limbo)
      • nameof
        • void method(int abc) { if (abc < 0) throw new ArgumentOutOfRangeException(nameof(abc)); }
    • Visual Studio features
      • Breakpoint peek
        • Debug breakpoints (conditions, actions)
        • Tip: set hit count = 999999, will show actual hit count on exception
        • Tip: if breakpoints don’t hit, close VS, delete .suo files and restart VS
      • Performance tips
        • Stepping through code will show elapsed time for running
      • Name refactoring
        • Ctrl+R, R while on an identifier will change all references
      • Other refactoring
        • Select code block, extract block
      • Touch to zoom or scroll
      • Improved coloration (i.e., hover tips on identifiers)
      • Word part intelligence (ArgExp will find ArgumentException)
      • Proper case menus

Peek in XAML

SQLintersection session – Anatomy of a Data Modification: Locking and Isolation

Session Title:  Anatomy of a Data Modification: Locking and Isolation
Instructor: Kimberly Tripp

  • Versioning is a powerful and underused mode that can solve many problems
  • SQL uses locks on resources as indicators (i.e., intent shared, intent exclusive)
  • Use SSD for most random IO drives; for server consolidation scenarios, this may be a common log drive
  • SQL Server can be in 1 of 4 states at any given time
    • Locking only
    • Locking + versioning
      • RCSI (statement-level read consistency)
      • Snapshot isolation (transaction-level read consistency)
      • RCSI + snapshot isolation
    • Lock compatibility
      • X – not compatible with IX, IS, S
      • IX – not compatible with S; is compatible with IS
      • IS – compatible with S
    • dm_exec_sessions: isolation level + session setting
      • 1 = Read uncommitted
      • 2 = Read committed (no distinction between locking vs. versioning)
      • 3 = Repeatable reads
      • 4 = Serializable
      • 5 = Snapshot
    • Whitepapers
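A sketch of the locking + versioning states above (database name is illustrative):

```sql
ALTER DATABASE MyDb SET READ_COMMITTED_SNAPSHOT ON;  -- RCSI: statement-level read consistency
ALTER DATABASE MyDb SET ALLOW_SNAPSHOT_ISOLATION ON; -- snapshot: transaction-level read consistency

-- 1 = read uncommitted, 2 = read committed, 3 = repeatable read,
-- 4 = serializable, 5 = snapshot (per the mapping above)
SELECT session_id, transaction_isolation_level
FROM sys.dm_exec_sessions
WHERE session_id = @@SPID;
```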

SQLintersection session – Performance Troubleshooting Using Waits and Latches

Session Title:  Performance Troubleshooting Using Waits and Latches
Instructor: Paul Randal

 

  • Keep historical snapshots of wait stats
  • See 2014 whitepaper on using wait stats
  • dm_os_schedulers; interesting columns
    • visible_online, visible_online (dac), hidden_online
    • context_switches_count: number of context switches
    • loads
    • quantum_length_us: length of the scheduler’s quantum in microseconds (time until the thread must yield)
  • Waiter list is completely unordered. Each thread will be notified when its resource is available, and can be notified in any order.  It is then moved to the runnable list
  • No limit to how many threads are on the waiter list
  • No limit to how long a thread can be on the list
  • When a thread’s quantum is exhausted, it is automatically moved to the runnable list
    • “Check for abort or yield” function called by thread – SQLOS will notify if quantum is exhausted or if the user has cancelled
    • NON_YIELDING_THREAD error is always a bug. Must bounce SQL to kill it
    • Must yield even if no other threads are on the scheduler
  • Threads on the waiter list do not periodically wake up to see if resource is available
  • When a thread owning a lock releases the lock and another thread is waiting on the resource, it transfers ownership to the second thread and moves it off the waiter list. That is, the thread is “signaled.”
  • Time that a thread sits on the runnable queue is called signal wait time. This is usually very small and generally not of interest.
  • Runnable queue is ordered as first-in, first-out
    • Special case: Resource governor allows for threads to be given priority. However, this is hard to set up and nobody uses it
  • dm_os_schedulers.runnable_tasks_count. If this amount is large (perhaps >= 10 per scheduler) this indicates CPU pressure
  • Waits happen. It is simply a part of SQL Server.
  • Wait time = resource wait time + signal wait time
  • dm_os_waiting_tasks – What is happening right now?
  • dm_os_wait_stats – What has happened
    • Aggregate view of wait stats since they were last cleared
    • Capture snapshots over time. Every 15-30 minutes should be sufficient
    • Figure out how long to retain
  • Coming by about the end of the year: sqlskills.com will have a repository of wait types! (and latch types)
  • Capture waits via extended events: sqlos.wait_info event
  • A couple of demos on collecting XEvent, including SQL Server call stacks!
  • Latch whitepaper [.pdf]
  • Superlatches – efficiency mechanism on NUMA nodes
    • Can produce benign error message in log: “Failure to calculate super-latch promotion threshold” – nothing to do about it
  • dm_os_latch_stats: Ignore unless wait stats show latch waits
  • Clearing wait/latch stats: DBCC SQLPERF
  • Spinlock – synch mechanism that is lighter than a latch
    • Uses CPU-level instruction to get a bit
    • Tries a number of times (500 to 1000); if still cannot acquire will backoff and sleep on the CPU, then try again
    • Does not yield the thread, so runnable threads have to wait – high signal wait times will be the result
    • In practice, this does not result in a SOS_NONYIELDING thread scenario (which takes about 60 seconds to come up)
    • Rare to see, but can happen
    • Diagnosing requires lots of internals knowledge; see this whitepaper [.pdf]
  • Common wait types
    • CXPACKET – parallelization
    • OLEDB – Linked servers or DMVs
    • ASYNC_NETWORK_IO – TDS packets sent to client; awaiting ack.
  • Common latch types
    • ACCESS_METHODS_DATASET_PARENT – multithreading where worker thread is getting next set of work
    • ACCESS_METHODS_SCAN_RANGE_GENERATION
    • ACCESS_METHODS_HOBT_COUNT
  • Wait types
    • PAGEIOLATCH_XX – Waiting for data to be read into memory
      • XX = SH – Page to be read
      • XX = EX – Page to be written
    • Determining which table a page belongs to
  • dm_os_buffer_descriptors – one row per page in the buffer pool
  • dm_os_virtual_file_stats
    • For each DB file – IOs, etc.
    • Snapshot this table from time to time
  • PAGELATCH_XX
    • Waiting for access to memory page
    • Often indicates tempdb contention
    • TF 1118 – Removes SGAM contention
      • Also, TF 3226 – Don’t write backup success messages to the log
      • TF 4199 – changes query opt., not necessarily recommended for all systems
    • For PFS contention, add more tempdb files
      • If logical cores < 8, tempdb files = number of logical cores
      • Else use 8 tempdb files; if contention in tempdb not alleviated, add 4 more at a time until problem is alleviated
    • May also indicate excessive page splits or an insert hotspot
  • LCK_M_XX wait
    • Analyzing – see this Michael Swart blog post
    • Especially look for LCK_M_RS_XX; indicates serializable isolation level
    • On SharePoint systems consider using snapshot isolation
  • Row locks escalate to table locks
  • Page locks escalate to table locks
  • Row locks never escalate to page locks
  • WRITELOG wait
    • LOGBUFFER wait – copying log block to log cache
    • Once log block is written, waits for SPID 1 (the log writer) to asynchronously flush the log to disk; when done, signals the original thread
    • Limit on how many log writes per database can be enqueued at once
      • Pre-2012: 32
      • 2012+: 112
    • Can be symptomatic of page splits or many small txns
    • Delayed durability
      • Only enable if the SLA permits some data loss
      • Can significantly improve throughput on small transactions
  • CXPACKET
    • Completely normal on thread 0 (execution_context_id = 0) (controller thread)
    • Only of concern if wait is high on a thread other than 0
    • See Bob Dorr’s blog post on how to track foreign memory access (foreign pages in performance monitor)
    • Consider increasing cost threshold for parallelism
  • SOS_SCHEDULER_YIELD
    • Have to use XEvents to view call stacks to really identify source!
  • LATCH_XX
    • Latch class largely controls understanding of root cause; see this resource page
  • FGCB_ADD_REMOVE latches
    • Often auto-growth related
  • DBCC_XX latches
    • Related to CHECKDB
    • Nothing to do about these
    • Note that persisted computed columns can significantly increase latching (DBCC_OBJECT_METADATA)
  • PREEMPTIVE_OS_XX waits
    • Call-out to Windows
    • XX = name of Win API method; Google for “MSDN XX”
  • ASYNC_NETWORK_IO waits
    • Client is likely doing RBAR on the resultset rather than caching the results client-side

Many, many other wait types
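One way to implement the “snapshot wait stats every 15-30 minutes” advice above; a sketch with an illustrative table name, meant to be run from a scheduled job:

```sql
IF OBJECT_ID(N'dbo.WaitStatsHistory') IS NULL
    CREATE TABLE dbo.WaitStatsHistory
    (
        capture_time        DATETIME2    NOT NULL DEFAULT SYSDATETIME(),
        wait_type           NVARCHAR(60) NOT NULL,
        waiting_tasks_count BIGINT       NOT NULL,
        wait_time_ms        BIGINT       NOT NULL,
        signal_wait_time_ms BIGINT       NOT NULL
    );

-- Deltas between consecutive captures give the waits per interval.
INSERT dbo.WaitStatsHistory (wait_type, waiting_tasks_count,
                             wait_time_ms, signal_wait_time_ms)
SELECT wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
FROM sys.dm_os_wait_stats;
```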

SQLintersection session – Mastering SQL Server Execution Plan Analysis

Session Title:  Mastering SQL Server Execution Plan Analysis
Instructor: Paul White

 

  • SQL technically has “bags” not “sets”
  • Binding order: from, on, join, where, group by, cube/rollup, having, select, distinct, order by, top
    • Can refer to elements defined in previous items
  • “Anybody using Hekaton? No, me either.  Don’t see any need for it.”
  • Query optimization pipeline – multiple steps
  • SQL Server not strongly typed; implicit type conversions can have serious consequences for performance
  • Constant folding: Evaluate deterministic expressions, such as 1 + 1
  • OPTION RECOMPILE vs WITH RECOMPILE
    • Option recompile is very powerful tool, often worth paying the price to recompile for each value of a parameter
    • Does not work if assigning to variable
    • Option recompile enables parameter embedding
    • Can provide more aggressive (better) query plans
    • Substitutes parameters, does constant folding
    • Use with caution!
  • TF 3604 – Output to client
  • TF 8605
    • Outputs tree in message tab
    • Particularly useful to view when plan does not optimize as expected
  • SQL Server does not treat CTEs as a “fence” – i.e., not evaluated separately; rather, rolled into the plan (treated as internal view)
    • Even causes view read counts to increase
  • Simplification
    • Eliminate unused columns
    • Rewrite logical subqueries as inner/outer join/apply
    • Full and outer joins to inner joins (if, i.e., WHERE clause does not allow nulls)
    • Remove redundancies and empty expressions
    • Primarily targets auto-generated SQL
    • WITH QUERYRULEOFF XX
    • Can combine query with things like check constraints to further restrict/simplify query
    • Computed columns (persisted) can be simplified, but only if expression is in the same order!
    • Join elimination: joins that only verify existence can be removed when a trusted FK exists
      • See MVP Deep Dives Volume 1 for a method of querying against a generic view (Rob Farley) (or here)
      • Unless FK is multicolumn
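A sketch of join elimination (hypothetical tables): with a trusted single-column foreign key from OrderLines.OrderID to Orders.OrderID, the join below can be removed entirely because no Orders columns are referenced.

```sql
SELECT ol.ProductID, ol.Quantity
FROM dbo.OrderLines AS ol
JOIN dbo.Orders AS o
    ON o.OrderID = ol.OrderID;   -- plan shows only an access of OrderLines
```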
  • Cardinality estimation
    • Uses histograms, density, and tries (for short strings)
    • Uses derived stats
    • Heavily influences initial join order
    • When writing expressions, always ask how the CE will come up with an estimate
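To see the histogram and density information the estimator works from (table and statistics names are hypothetical):

```sql
DBCC SHOW_STATISTICS ('dbo.Orders', 'IX_Orders_OrderDate');
```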
  • New 2014 cardinality estimator
    • The compatibility level of the context DB controls which CE is used
    • TF 2312 = new, TF 9481 = old
    • Lots of square roots (exponential backoff)
      • Example – join 2 tables
        • t0 = most selective
        • t1 = less selective
        • old: t0 * t1
        • new: t0 * sqrt(t1)
    • “If you want an exciting support life, use linked servers and throw in lots of new features”
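A small arithmetic sketch of the exponential backoff pattern for combining selectivities, ordered most to least selective (illustrative numbers only; the example above applies the same idea to joins):

```sql
DECLARE @s1 float = 0.1, @s2 float = 0.2, @s3 float = 0.5;
SELECT old_estimate = @s1 * @s2 * @s3,                    -- full independence
       new_estimate = @s1 * SQRT(@s2) * SQRT(SQRT(@s3));  -- exponential backoff
```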
  • Trivial plan queries
    • Queries without joins only
    • Property on root node of query plan: OptimizationLevel = Trivial vs Full
    • Subqueries prevent trivial plans, so can use SELECT (SELECT col) FROM tbl to force full optimization
    • TF 8757 also prevents trivial plans
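The subquery trick from the session, with a hypothetical table; the scalar subquery forces the query through full cost-based optimization.

```sql
SELECT (SELECT OrderID) AS OrderID
FROM dbo.Orders;
-- Check the plan root node's OptimizationLevel property:
-- FULL instead of TRIVIAL.
```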
  • Cost-based optimization
    • Cascades framework (SearchSpace) – readily available
    • TF 8606 – show tree at different stages
    • TF 8612 – Show cardinality estimate (2012+)
  • Memo groups
    • Logical properties per group are the same (column lists)
    • Physical properties vary – different ways explored for doing the same thing
      • Sort order, cardinality est., parallelism, execution mode, Halloween protection
    • Visualization: tree with single node, then add variations to grow tree on out
    • Based on applying rules (395 of them in SQL 2014)
      • Most rules are not used most of the time
      • Small set of rules used most of the time
    • Works very well in vast majority of cases
    • Query hints work by disabling rules in the optimizer
      • FORCE ORDER query hint disables many rules – has a big impact on plan
      • Other hints imply FORCE ORDER, such as inner LOOP join
      • OPTION (QUERYRULEOFF)
      • DBCC SHOWONRULES – List of rules currently enabled. Examples:
        • JNtoNL = logical join to nested loop
        • LOJNtoNL = left outer logical join to nested loop
        • FOJNnoneqToSM = full outer join, non-equality, to sort merge
      • DBCC SHOWOFFRULES
      • dm_exec_query_transformation_stats – shows how frequently rules are used
      • See Paul’s query Q07 – which rules for test query
      • TF 8619 (2012+) – also shows applied rules
      • TF 8621 (2012+) – more detailed
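A sketch of querying the rule-usage DMV mentioned above. The counters are cumulative since instance startup, so capturing them before and after a test query (as Paul's Q07 does) shows which rules that query exercised.

```sql
SELECT name, promised, succeeded
FROM sys.dm_exec_query_transformation_stats
ORDER BY succeeded DESC;
```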
  • Optimization phases
    • Transaction processing (Search 0)
      • At least 3 tables
      • Mostly OLTP queries, small number of rows with nested loops
    • Quick plan (Search 1)
      • Subset of rules
      • If cost exceeds the threshold for parallelism, run a second time with parallelism enabled
    • Full optimization (Search 2)
      • At least 5 tables
      • If Search 1 resulted in serial, only serial plans searched; same for parallel
      • All rules available
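The instance-wide optimizer counters show how often each phase is actually reached:

```sql
SELECT counter, occurrence, value
FROM sys.dm_exec_query_optimizer_info
WHERE counter IN ('trivial plan', 'search 0', 'search 1', 'search 2');
```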
  • Costing
    • Always assumes cold cache
    • Doesn’t consider specific hardware, mostly (but does account for CPU cores and memory)
    • Many assumptions that may or may not be true
    • Costing numbers originally meant an estimate of the number of seconds on “Nick’s” machine (dev at MS)
    • Now completely meaningless for most purposes
      • Not really valid to compare two queries
      • Not even valid within a query; i.e., if it says a sort is 60% of the query, that doesn’t make it so
      • Useful only as a signpost
    • Plan guides don’t necessarily reduce compilation time
      • Still runs Search 2
      • Guide used to shape search
    • TF 8608 – Shows initial memo groups
    • TF 8675 – Shows search phase info
    • TF 8606 – Shows tree info
    • TF 8615 – Shows final memos
    • TF 8607 – Output tree from optimizer
    • TF 7352 – Copy-out tree (similar to plan)
  • Query execution engine
    • The “true” execution plan is called the MXC (“execution context”)
      • Far more detailed than the user execution plan
      • Many internal details are hidden
    • The building blocks of the execution plan are “iterators”
      • Have a generic interface
    • fn_PhysLocFormatter(%%physloc%%)
      • Returns file, page, slot for row
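A sketch of using the formatter (hypothetical table); it maps each row to its physical location as (file:page:slot).

```sql
SELECT sys.fn_PhysLocFormatter(%%physloc%%) AS FilePageSlot, OrderID
FROM dbo.Orders;
```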
    • TF 8666 – Adds some debugging data to plan (see the F4 properties, InternalDebuggingInfo)
    • Interesting query with 9 different seeks:
      • LastName in (‘aaa’, ‘bbb’, ‘ccc’) AND FirstName in (‘ddd’, ‘eee’, ‘fff’)
      • Plan gives only one index seek, but:
        • Properties show 9 different seek predicates
      • Physically does 9 seeks
      • Limited to 64 seek operations
      • Beware that seeks can be hidden in this way
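The hidden-seek example as a runnable sketch (hypothetical table, assuming an index on LastName, FirstName): the plan shows one Index Seek operator, but its Seek Predicates property lists all 3 × 3 = 9 combinations.

```sql
SELECT LastName, FirstName
FROM dbo.People
WHERE LastName  IN ('aaa', 'bbb', 'ccc')
  AND FirstName IN ('ddd', 'eee', 'fff');
```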
    • Filter iterator – predicates that can’t be used in seeks
      • May have start-up expression: can control whether or not entire subtree is executed at all
    • Residual – hidden filter
      • Filter predicate is pushed to the child iterator
      • Because it is hidden, is easy to overlook
      • “In a lot of ways, this is the major takeaway of the day”
    • TF 9130 – Reveal pushed predicates (doesn’t work for join predicates)
    • Also: If predicate is too large to fit on the tooltip, SSMS will simply leave it off! Have to look at properties window.
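A sketch of surfacing residual predicates with TF 9130 (hypothetical table; undocumented flag, test systems only). With the flag on, the pushed predicate appears as an explicit Filter operator instead of being hidden inside the scan or seek.

```sql
SELECT OrderID
FROM dbo.Orders
WHERE OrderDate >= '20140101'   -- seekable, assuming an index on OrderDate
  AND Amount > 100              -- likely residual
OPTION (QUERYTRACEON 9130);
```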
    • Nested loops
      • Be sure to look at “outer references” property
      • SSMS shows estimated execution stats based on one loop iteration, but actual exec stats aggregated across all iterations, so they generally don’t correlate
      • If having performance issues with nested loop joins, be sure to check for residual predicates. May be able to rewrite to push predicate further down the tree
    • TF 2324 – Disables implications of predicates
    • Row-mode parallel plans
      • TF 8649 – Force parallel plan
      • If a table has a computed column that involves a scalar function, and the table is referenced in the query (even if the computed column is not referenced), will not get a parallel plan
    • Batch mode parallel execution – Rather than operating one row at a time, operates on between 64 and 900 rows at once
      • Batches are queued up; any thread can process the next batch
    • Merge Joins – Check if many-to-many property is true; usually indicates a query error
    • See Paul’s blog on residual predicates on hash joins (make sure types match)
    • Bitmap – either simple or Bloom filter bitmap (parallel plans only)
    • TOP (n) PERCENT must read the entire set and will include an eager spool
    • Sorts have a memory fraction property
    • Spool is a place in tempdb to cache rows in case they may be needed later
      • Eager spool often indicates a missing index
    • Merge join operator can be performing a UNION ALL rather than a standard join!
    • SSMS update operator can actually be updating multiple indexes; view properties to see for sure
    • Wide vs. Narrow update plans
      • Wide: Each index updated by separate operator
      • Narrow: Multiple indexes updated by one operator
    • Split, Sort, Collapse operators in a row generally indicate that a unique index is being updated
    • TF 4138 (documented, supported) – Disable row goal behavior
      • TOP adds row goal behavior
    • Queries that execute LIKE against a variable can have a Constant Expression that computes the range begin/end values at run time
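A sketch of the LIKE-against-a-variable case (hypothetical table, assuming an index on LastName): the plan still gets an index seek, with the range start/end derived from the pattern at run time by an internal expression.

```sql
DECLARE @pattern varchar(50) = 'Smi%';
SELECT LastName
FROM dbo.People
WHERE LastName LIKE @pattern;   -- seek range computed from @pattern at run time
```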
    • GetRangeWithMismatchedTypes (only in XML) indicates a type mismatch that must be converted to a range (i.e., datetime to date)
    • One way to handle catch-all search queries (rather than dynamic SQL or OPTION (RECOMPILE)) is to use a columnstore index