Thursday, November 28, 2013

How to get index usage information in SQL Server

Problem

Databases have two primary storage structures: data pages and index pages.  Understanding and viewing the actual data in your tables is pretty straightforward: run some sample queries to get an idea of which columns are being used and what type of data is actually being stored.  On the flip side, it is often difficult to know exactly which indexes are being used and how they are being used.  So how can you get a better understanding of how your indexes are being used and what operations are occurring (inserts, updates, deletes, selects)?

Solution

SQL Server provides a number of dynamic management objects that give you insight into data that was either not accessible or difficult to get in previous versions of SQL Server.  One function and one view that provide data about index usage are sys.dm_db_index_operational_stats and sys.dm_db_index_usage_stats.

sys.dm_db_index_operational_stats

This function gives you information about the insert, update and delete operations that occur on a particular index.  In addition, it also returns data about locking, latching and access methods.  Several columns are returned, but these are some of the more interesting ones:

  • leaf_insert_count - total count of leaf level inserts
  • leaf_delete_count - total count of leaf level deletes
  • leaf_update_count  - total count of leaf level updates

Here is a sample query that provides some of the key columns that are helpful to determine insert, update and delete operations.

SELECT OBJECT_NAME(A.[OBJECT_ID]) AS [OBJECT NAME],
       I.[NAME] AS [INDEX NAME],
       A.LEAF_INSERT_COUNT,
       A.LEAF_UPDATE_COUNT,
       A.LEAF_DELETE_COUNT
FROM   SYS.DM_DB_INDEX_OPERATIONAL_STATS (NULL, NULL, NULL, NULL) A
       INNER JOIN SYS.INDEXES AS I
         ON I.[OBJECT_ID] = A.[OBJECT_ID]
        AND I.INDEX_ID = A.INDEX_ID
WHERE  OBJECTPROPERTY(A.[OBJECT_ID], 'IsUserTable') = 1

Here is the output from the above query.  From this function we can get an idea of how many inserts, updates and delete operations were performed on each table and index.


sys.dm_db_index_usage_stats

This view gives you information about overall access methods to your indexes.  There are several columns that are returned from this DMV, but here are some helpful columns about index usage:

  • user_seeks - number of index seeks
  • user_scans - number of index scans
  • user_lookups - number of index lookups
  • user_updates - number of insert, update or delete operations

SELECT   OBJECT_NAME(S.[OBJECT_ID]) AS [OBJECT NAME],
         I.[NAME] AS [INDEX NAME],
         USER_SEEKS,
         USER_SCANS,
         USER_LOOKUPS,
         USER_UPDATES
FROM     SYS.DM_DB_INDEX_USAGE_STATS AS S
         INNER JOIN SYS.INDEXES AS I
           ON I.[OBJECT_ID] = S.[OBJECT_ID]
          AND I.INDEX_ID = S.INDEX_ID
WHERE    OBJECTPROPERTY(S.[OBJECT_ID], 'IsUserTable') = 1

Here is the output from the above query.  From this view we can get an idea of how many seeks, scans, lookups and overall updates (insert, update and delete) occurred.


Notes

The values for these counters are reset each time you restart SQL Server.  In addition, the data returned by sys.dm_db_index_operational_stats exists only as long as the metadata cache object that represents the heap or index is available.
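Because the counters only cover the period since the last restart, it helps to know when that was.  Here is a minimal sketch using sys.dm_os_sys_info (available in SQL Server 2008 and later) to check the instance start time:

SELECT sqlserver_start_time
FROM sys.dm_os_sys_info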

Indexes That Appear in the Usage Statistics, But Are Never Used

DECLARE @MinimumPageCount int
SET @MinimumPageCount = 500

SELECT  Databases.name AS [Database],
        Indexes.name AS [Index],
        Objects.Name AS [Table],
        PhysicalStats.page_count AS [Page Count],
        CONVERT(decimal(18,2), PhysicalStats.page_count * 8 / 1024.0) AS [Total Index Size (MB)],
        CONVERT(decimal(18,2), PhysicalStats.avg_fragmentation_in_percent) AS [Fragmentation (%)],
        PartitionStats.row_count AS [Row Count],
        CONVERT(decimal(18,2), (PhysicalStats.page_count * 8.0 * 1024) / PartitionStats.row_count) AS [Index Size/Row (Bytes)]
FROM sys.dm_db_index_usage_stats UsageStats
    INNER JOIN sys.indexes Indexes
        ON Indexes.index_id = UsageStats.index_id
            AND Indexes.object_id = UsageStats.object_id
    INNER JOIN sys.objects Objects
        ON Objects.object_id = UsageStats.object_id
    INNER JOIN sys.databases Databases
        ON Databases.database_id = UsageStats.database_id
    INNER JOIN sys.dm_db_index_physical_stats (DB_ID(), NULL, NULL, NULL, NULL) AS PhysicalStats
        ON PhysicalStats.index_id = UsageStats.index_id
            AND PhysicalStats.object_id = UsageStats.object_id
    INNER JOIN sys.dm_db_partition_stats PartitionStats
        ON PartitionStats.index_id = UsageStats.index_id
            AND PartitionStats.object_id = UsageStats.object_id
WHERE UsageStats.user_scans = 0
    AND UsageStats.user_seeks = 0
    AND UsageStats.user_lookups = 0
    AND PhysicalStats.page_count > @MinimumPageCount    -- ignore indexes with fewer than @MinimumPageCount pages
    AND Indexes.type_desc != 'CLUSTERED'                -- exclude clustered indexes, which should not be dropped
ORDER BY [Page Count] DESC
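Once you have reviewed the candidates this query returns (ideally over a full business cycle, since the counters reset on restart), removing one is a single statement.  The index and table names below are purely hypothetical:

DROP INDEX IX_SalesOrder_LegacyCode ON dbo.SalesOrder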


Update Statistics

SQL Server tries to use statistics to “react intelligently” in its query optimization. Knowing the number of records, the density of pages, the histogram, and the available indexes helps the SQL Server optimizer “guess” more accurately how it can best retrieve data.

Query to get statistics on Tables

SELECT OBJECT_NAME(object_id) AS [ObjectName]
      ,[name] AS [StatisticName]
      ,STATS_DATE([object_id], [stats_id]) AS [StatisticUpdateDate]
FROM sys.stats
ORDER BY STATS_DATE([object_id], [stats_id]) DESC

How are statistics created?

Statistics can be created in a few different ways:
- Statistics are automatically created for each index key you create.


- If the database setting auto create statistics is on, then SQL Server will automatically create statistics for non-indexed columns that are used in queries.


- You can create statistics explicitly with the CREATE STATISTICS command (see the sketch below).
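Here is a minimal sketch of the CREATE STATISTICS option; the table and column names (dbo.SalesOrder, CustomerID) are hypothetical:

CREATE STATISTICS ST_SalesOrder_CustomerID
ON dbo.SalesOrder (CustomerID)
WITH FULLSCAN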

What do statistics look like?


If you’re curious, there are a couple of ways you can peek at what statistics look like.

Option 1 – you can go to the Statistics node for a table in SSMS, right click > Properties, then go to Details. This shows the stats header and the histogram collected for that table.

Option 2 – you can use DBCC SHOW_STATISTICS ... WITH HISTOGRAM, as sketched below.
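A minimal sketch of Option 2, assuming a hypothetical table dbo.SalesOrder with an index named IX_SalesOrder_CustomerID:

DBCC SHOW_STATISTICS ('dbo.SalesOrder', 'IX_SalesOrder_CustomerID') WITH HISTOGRAM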

The histograms are a great way to visualize the data distribution in your table.

How are statistics updated?


The default settings in SQL Server are to auto create and auto update statistics.

Notice that there are two options for Auto Update Statistics.
- Auto Update Statistics basically means that if there is an incoming query but statistics are stale, SQL Server will update the statistics first, before it generates an execution plan.
- Auto Update Statistics Asynchronously, on the other hand, means that if there is an incoming query but statistics are stale, SQL Server uses the stale statistics to generate the execution plan, then updates the statistics afterwards (a sketch for checking and setting these options follows below).
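Here is a minimal sketch for checking these settings and enabling the asynchronous option; the database name YourDatabase is a placeholder:

SELECT name, is_auto_create_stats_on, is_auto_update_stats_on, is_auto_update_stats_async_on
FROM sys.databases
WHERE name = 'YourDatabase'

ALTER DATABASE YourDatabase SET AUTO_UPDATE_STATISTICS_ASYNC ON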

However, if you want to update statistics manually, you can use either sp_updatestats or UPDATE STATISTICS <table name> [<index or statistics name>].
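A minimal sketch of both manual options; the table and index names are hypothetical:

-- Update all statistics in the current database
EXEC sp_updatestats

-- Update a single statistics object on one table, scanning all rows
UPDATE STATISTICS dbo.SalesOrder IX_SalesOrder_CustomerID WITH FULLSCAN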

How do we know statistics are being used?


One good check you can do is when you generate execution plans for your queries:

compare the “Actual Number of Rows” and “Estimated Number of Rows” values.

If these numbers are (consistently) fairly close, then most likely your statistics are up-to-date and are being used by the optimizer for the query. If not, it is time to re-check how often your statistics are created and updated.

What configuration settings should we set?


There may be cases where you want to disable statistics updates temporarily, for example while you’re doing massive updates on a table and you don’t want the load slowed down by the auto update (see the sketch after the list below).

However, for the most part, you will want to keep the SQL Server settings:
- auto create statistics
- auto update statistics
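A minimal sketch of temporarily turning auto update off around a large load; YourDatabase is a placeholder, and remember to turn the setting back on afterwards:

ALTER DATABASE YourDatabase SET AUTO_UPDATE_STATISTICS OFF
-- ... run the large data load here ...
ALTER DATABASE YourDatabase SET AUTO_UPDATE_STATISTICS ON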

StatisticsHeader Details

You can also capture the header output of DBCC SHOW_STATISTICS ... WITH stat_header into a table for easier querying:


CREATE TABLE StatisticsHeader
(
      Name                   sysname,
      Updated                datetime,
      [Rows]                 bigint,
      RowsSampled            bigint,
      Steps                  tinyint,
      Density                decimal(9,5),
      AverageKeyLength       decimal(9,5),
      StringIndex            nchar(3),
      FilterExpression       nvarchar(1000),
      UnfilteredRows         bigint
)
go

INSERT StatisticsHeader
EXEC ('DBCC SHOW_STATISTICS (''FactSMCSubscriptionBaseLIR'', ''IXC_AgreementKey'')
      WITH stat_header')
go

SELECT * FROM StatisticsHeader
go

Memory and IO in SSIS and SQL


1. SSIS is an in-memory pipeline, so ensure that all transformations occur in memory.

The purpose of including Integration Services in the SQL Server feature set is to provide a flexible, robust pipeline that can efficiently perform row-by-row calculations and parse data, all in memory.

While the extract and load phases of the pipeline will touch disk (read and write respectively), the transformation itself should process in memory. If transformations spill to disk (for example with large sort operations), you will see a big performance degradation. Construct your packages to partition and filter data so that all transformations fit in memory.

A great way to check if your packages are staying within memory is to review the SSIS performance counter Buffers spooled, which has an initial value of 0; above 0 is an indication that the engine has started swapping to disk.

2. Plan for capacity by understanding resource utilization.

SQL Server Integration Services is designed to process large amounts of data row by row in memory with high speed. Because of this, it is important to understand resource utilization, i.e., the CPU, memory, I/O, and network utilization of your packages.
CPU Bound

Seek to understand how much CPU is being used by Integration Services and how much CPU is being used overall by SQL Server while Integration Services is running. This latter point is especially important if you have SQL Server and SSIS on the same box, because if there is a resource contention between these two, it is SQL Server that will typically win – resulting in disk spilling from Integration Services, which slows transformation speed.

The perfmon counter that is of primary interest to you is Process / % Processor Time (Total). Measure this counter for both sqlservr.exe and dtexec.exe. If SSIS is not able to drive close to 100% CPU load, this may be indicative of:

  • Application contention: For example, SQL Server is taking on more processor resources, making them unavailable to SSIS.
  • Hardware contention:  A common scenario is that you have suboptimal disk I/O or not enough memory to handle the amount of data being processed.
  • Design limitation:  The design of your SSIS package is not making use of parallelism, and/or the package uses too many single-threaded tasks.
Network Bound

SSIS moves data as fast as your network is able to handle it. Because of this, it is important to understand your network topology and ensure that the path between your source and target have both low latency and high throughput.

The following Network perfmon counters can help you tune your topology:

  • Network Interface / Current Bandwidth: This counter provides an estimate of current bandwidth.
  • Network Interface / Bytes Total / sec: The rate at which bytes are sent and received over each network adapter.
  • Network Interface / Transfers/sec: Tells how many network transfers per second are occurring. If it is approaching 40,000 IOPs, then get another NIC card and use teaming between the NIC cards.
These counters enable you to analyze how close you are to the maximum bandwidth of the system. Understanding this will allow you to plan capacity appropriately whether by using gigabit network adapters, increasing the number of NIC cards per server, or creating separate network addresses specifically for ETL traffic.
I/O Bound
If you ensure that Integration Services is minimally writing to disk, SSIS will only hit the disk when it reads from the source and writes to the target. But if your I/O is slow, reading and especially writing can create a bottleneck.

Memory bound
A very important question that you need to answer when using Integration Services is: "How much memory does my package use?"

The key counters for Integration Services and SQL Server are:

  • Process / Private Bytes (DTEXEC.exe): The amount of memory currently in use by Integration Services. This memory cannot be shared with other processes.
  • Process / Working Set (DTEXEC.exe): The total amount of memory allocated to Integration Services.
  • SQL Server: Memory Manager / Total Server Memory: The total amount of memory allocated by SQL Server. Because SQL Server has another way to allocate memory using the AWE API, this counter is the best indicator of total memory used by SQL Server.
  • Memory / Page Reads / sec: Represents the total memory pressure on the system. If this counter consistently goes above 500, the system is under memory pressure.
 

3. Baseline source system extract speed.

Understand your source system and how fast you extract from it. After all, Integration Services cannot be tuned beyond the speed of your source – i.e., you cannot transform data faster than you can read it.
 
Measure the speed of the source system by creating a very simple package that reads data from your source into a "Row Count" destination:
 
Execute the package from the command line (DTEXEC) and measure the time it took for it to complete its task. Use the Integration Services log output to get an accurate calculation of the time. You want to calculate rows per second:
Rows / sec = Row Count / Time
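For example (hypothetical numbers), if the Row Count package reads 10,000,000 rows and the log shows the data flow took 40 seconds, your baseline is 10,000,000 / 40 = 250,000 rows per second.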
Based on this value, you now know the maximum number of rows per second you can read from the source – this is also the ceiling on how fast you can transform your data. To increase this Rows / sec value, you can do the following:
  • Improve drivers and driver configurations: Make sure you are using the most up-to-date driver configurations for your network, data source, and disk I/O. Often the default network drivers on your server are not configured optimally for the network stack, which results in performance degradation when there is a high number of throughput requests. Note that for 64-bit systems, at design time you may be loading 32-bit drivers; ensure that at run time you are using 64-bit drivers.
  • Start multiple connections: To overcome limitations of drivers, you can try to start multiple connections to your data source. As long as the source can handle many concurrent connections, you may see an increase in throughput if you start several extracts at once. If concurrency is causing locking or blocking issues, consider partitioning the source and having your packages read from different partitions to more evenly distribute the load.
  • Use multiple NIC cards: If the network is your bottleneck and you’ve already ensured that you’re using gigabit network cards and routers, then a potential solution is to use multiple NIC cards per server. Note that you will have to be careful when you configure multiple NIC environments; otherwise you will have network conflicts.

4. Tune your network.

A key network property is the packet size of your connection. By default this value is set to 4,096 bytes. This means a new network packet must be assembled for every 4 KB of data. As noted in the SqlConnection.PacketSize Property topic in the .NET Framework Class Library, increasing the packet size will improve performance because fewer network read and write operations are required to transfer a large data set.
If your system is transactional in nature, with many small data size read/writes, lowering the value will improve performance.
 
Since Integration Services is all about moving large amounts of data, you want to minimize the network overhead. This means that the value 32K (32767) is the fastest option. While it is possible to configure the network packet size at the server level using sp_configure, you should not do this. The database administrator may have reasons to use a different server setting than 32K. Instead, override the server setting in the connection manager, as sketched below.
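As a rough sketch, the connection manager override amounts to adding the Packet Size keyword to the connection string; the server, database, and provider values below are placeholders:

Data Source=MyServer;Initial Catalog=MyDatabase;Provider=SQLNCLI11.1;Integrated Security=SSPI;Packet Size=32767;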
 
Another network tuning technique is to use network affinity at the operating system level. At high throughputs, you can sometimes improve performance this way.
 
For the network itself, you may want to work with your network specialists to enable jumbo frames to increase the default payload of 1,500 bytes to 9,000 bytes. By enabling jumbo frames, you will further decrease the number of network operations required to move large data sets.

SSIS and Memory

One thing that you're bound to run into when doing SSIS is memory issues. Buffer overflows. Virtual memory out of space.

Although I'm no expert in it, I can offer a few tips from things I've run into.

1) If you're running out of physical memory, increase the pagefile amount. (Windows 7)
Right click on "My Computer" --> "Properties"
Select "Advanced System Settings" option on the left hand side
On the "Advanced" tab, under the "Performance" group box, click [Settings]
Click on the "Advanced" tab
On the "Virtual memory" group box, click [Change...]
Uncheck the "Automatically managed paging files for all drives" option as this puts a smaller page file
Under "Custom Size", input the inital and max size. There is a recommended amount. I'd do 10-50% over the recommended amount. Note: Make sure that you have enough disk space for your file.
Click [Set]
Click [OK]
Restart your system

2) SQL Server likes to eat up memory. It's a smart thing to set the maximum amount that it can consume, as it will consume memory until it reaches its max, only releasing it when the service is restarted or the computer is rebooted.
Open up SSMS.
Right click on [Server Name] --> Properties
Click on the "Memory" tab
Set your max server memory to something manageable. If you have 8 GB in your system, you don't want the max server memory to be 8 GB, since your OS also needs memory and the inevitable paging will cause slowdowns. I set it to 50% of my total memory. Once SQL Server reaches that allocation, it stays there, regardless of whether it is currently using the entire space or not. You get the idea.
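If you prefer T-SQL over the SSMS dialog, here is a minimal sketch for the 8 GB example above (4096 MB is just the 50% figure, not a recommendation):

EXEC sp_configure 'show advanced options', 1
RECONFIGURE
EXEC sp_configure 'max server memory (MB)', 4096
RECONFIGURE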
 

What are the different properties which control the number of rows to be stored in the data buffers?

  • Each Data Flow Task has two properties (DefaultBufferMaxRows and DefaultBufferSize) which control the number of rows that can be stored in a single buffer and how large each individual buffer can be. The size of each row plays a pivotal role in deciding the number of rows that can be stored in a single buffer. The DefaultBufferMaxRows property has a default value of 10,000 rows and specifies the maximum number of rows that can be stored in each individual buffer.
  • The DefaultBufferSize property has a default value of 10 MB and specifies the maximum size of each individual buffer. Please note that the maximum and minimum size of each individual buffer is also constrained by two internal properties: MinBufferSize = 64 KB and MaxBufferSize = 100 MB. As a best practice, you should configure these properties so that SSIS creates a small number of large buffers instead of a large number of small buffers. Also keep in mind that you should not make these values too large, especially if you have minimal memory on the system, to the point where SSIS starts swapping/spooling data to disk (see the worked example after this list).
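As a hypothetical worked example: if your estimated row size is 1,200 bytes, a 10 MB DefaultBufferSize holds roughly 10,485,760 / 1,200 ≈ 8,700 rows, so each buffer fills before the 10,000-row DefaultBufferMaxRows limit is ever reached; increasing DefaultBufferSize (or trimming unused columns to shrink the row) lets SSIS work with fewer, larger buffers.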

How does BLOB data get stored in the data buffer and what happens if SSIS does not find enough buffers to hold the BLOB data?

  • As long as the BLOB data can fit in the data buffers (created out of memory) it is stored in the data buffers. However, the size of BLOB data is often too large to fit in data buffers, so in this circumstance SSIS spools the BLOB data to the file system at the default location or at a specified location.
  • Any data, including BLOB data,  which cannot be stored in the data buffers, because of memory scarcity or memory pressure, is spooled to the file system.

When and where are data buffers spooled to disk? How can this be avoided?

  • Often times, the size of BLOB data is too large to fit in the data buffers, so SSIS spools the BLOB data to the file system at the default location or at a specified location. However, anytime data cannot be stored in the data buffers because of memory scarcity or memory pressure, the data is spooled to the file system.
  • By default SSIS derives the physical locations for storing buffer overflow data and BLOB overflow data from the TEMP/TMP environment variables.  You can change this default behavior by changing these two properties of the data flow task:
    • BufferTempStoragePath
    • BLOBTempStoragePath.
  • Note - sometimes spooling causes your system to run slow because of the input/output involved. As such, you should try to keep the data in the data buffers by adding more RAM or removing asynchronous transformations where possible. If this is not possible, then try to point the spooling locations to the fastest drives you have available.

What is parallel processing in SSIS and how do you control it?

  • One of the biggest benefits of using SSIS is its ability to process data in parallel. You can have multiple tasks run in parallel for your ETL operations and utilize the power of today's multi-core/multi-processor machines.
  • Each SSIS package has a MaxConcurrentExecutables property with a default value of -1, which means the maximum number of tasks that can be executed is equal to the total number of processors on the machine plus two. For example, a four-processor machine can execute 6 tasks in parallel in a package with this default property value. You can change this default value to an absolute number to indicate the exact number of tasks that can be executed in parallel.
  • Each Data flow task has the EngineThreads property which specifies the total number of threads that can be created for executing the data flow task. In SQL Server 2008, the default value for this property is 10, which means 10 threads can be allocated at one time to execute different execution trees of the data flow task in parallel.