Monitoring and Improving Database Performance Part I

At TrackAbout, we already do a lot of system performance measurement. We’re about to start doing even more. Database performance plays a large role in the overall performance of the TrackAbout service. In this blog series, we will discuss a new approach to measuring and monitoring the performance of queries against our Microsoft SQL Server relational database environment.

The performance of our software is an important part of our end-user experience. We encourage all developers to adopt the philosophy that performance is a feature. Apps and services that perform well yield many benefits. At its simplest:

  • Speed makes users happy
  • Efficiency keeps costs down, in the form of requiring less hardware to get the work done. This enables us to scale up our company more easily.

In technology, many performance problems can be solved by throwing money at them. Throwing money generally boils down to either (1) upgrading or adding hardware and infrastructure or (2) improving the code. Sometimes it’s not clear which approach yields the best price/performance. First, you need to know where the bottlenecks are. That means measurement.

Ultimately, we intend to define internal Service Level Agreements (SLAs) for query performance to which our engineers will strive to adhere. We do not know yet what our target SLA numbers are going to be. We will have a clearer idea once we start gathering and analyzing current performance data for our queries.

Understanding why certain queries are slow will enable us to better educate our developers as to how to avoid creating performance problems in the first place.

We are open-sourcing all the code we develop as part of this effort. You can find it on GitHub in our sql-perf-monitoring repository.

We want to measure the following:

  • Slowest queries: Fairly self-explanatory, these are the queries that take the longest time to execute.
  • Queries returning an excessive number of rows: We want to identify queries that return thousands, tens of thousands, or even millions of rows. Barring a few exceptions, we do not have scenarios where such a large number of rows should be returned.
  • Queries causing the most logical (as opposed to physical) disk reads and writes: We want our queries to perform as few logical reads and writes as possible. We will be focusing on logical I/O and not physical I/O. The operating system’s file cache helps speed up queries by using RAM instead of disk. This can result in queries that log orders of magnitude more logical reads than physical reads. While having queries that make few physical reads is nice, the queries are still consuming RAM in the form of the file system cache that they may not need to. We want to make room for queries that actually need the RAM.
  • Queries consuming the most CPU: Most of what our database does is I/O-bound. Queries generating high CPU load are generally anomalous and should be investigated.
  • Application logic that causes the most queries to be executed: It may be possible to modify the logic to fetch the same data using fewer queries.

SQL Server Dynamic Management Views

We considered using SQL Profiler or Extended Events to capture the data. These tools can provide the level of detail we are looking for, but they have high overhead, on the order of 10-20%. This is an unacceptable price to pay for the kind of ongoing monitoring we intend to implement.

Using the SQL Server Dynamic Management Views (DMVs) is the best way to get the data we need, but they have a weakness. The data the SQL Server engine collects is reset whenever that engine is restarted or when a cached query plan is evicted from the cache.

We therefore created a mechanism to keep historical data around for analysis. A SQL Agent job queries the DMVs every 30 minutes, inserting results into this reporting table.

QueryPerformanceStats table schema

(The datatype of the QueryHash column is not shown, as it is a computed column. Internally, the data type will be varbinary.)

Most of the columns map directly to those returned from the DMVs.

Columns we added are:

  • DateRecorded: A time-stamp of when the data was inserted.
  • QueryType: Indicates if the query is a stored procedure / user-defined function or a batch query.
  • CycleNumber: Is incremented when the data for a particular query has been reset, either due to the query plan being evicted from the cache or due to a SQL Server engine restart. This is set through an insert trigger on the table. The value is incremented if any one of TotalElapsedTime, TotalLogicalReads, TotalLogicalWrites, TotalRowsReturned, TotalWorkerTime values is less than the last recorded value.
  • QueryHash: This is a MD5 hash of the Query column, generated using the HashBytes function. The Query column is too wide to be indexed (the maximum index length in SQL Server is 900 bytes), so we index this column instead and use it in our queries.

DMV Queries

We query the DMVs to measure the following:

  • Queries that generate the most rows
  • Queries causing the most logical reads
  • Queries causing the most logical writes
  • Queries consuming the most CPU
  • Queries taking the longest time to execute (wall clock time) – We recognize that due to varying workloads and contention on the server, wall clock time is not a dependable measure, but it’s still interesting so we’re going to capture it.

For all of the above, we get the top queries both by average and absolute values.

Here is an example of one of the queries used. (For all the code, see our Github repository.) This query retrieves the top 100 queries that return the most number of rows.

insert into reports.QueryPerformanceStats 
	DateRecorded = GETUTCDATE(),
	QueryText = ISNULL(CASE WHEN OBJECT_NAME(qt.objectid, dbid) IS NOT NULL 
				THEN OBJECT_NAME(qt.objectid, dbid)
							WHEN qs.statement_end_offset = -1 
						 THEN LEN(CONVERT(nvarchar(max), qt.[text])) * 2 
							ELSE qs.statement_end_offset 
						 END - qs.statement_start_offset)/2)
				END, ''),
	QueryType = CASE WHEN OBJECT_NAME(qt.objectid, dbid) IS NOT NULL THEN 0 ELSE 1 END, 
	DatabaseName = ISNULL(DB_NAME(qt.dbid), ''),
	ExecutionCount = qs.execution_count,

	TotalWorkerTime = qs.total_worker_time,
	MaxWorkerTime = qs.max_worker_time,
	AverageWorkerTime = qs.total_worker_time / qs.execution_count,

	TotalLogicalReads = qs.total_logical_reads, 
	MaxLogicalReads = qs.max_logical_reads, 
	AverageLogicalReads = qs.max_logical_reads / qs.execution_count,

	TotalLogicalWrites = qs.max_logical_writes,
	MaxLogicalWrites = qs.max_logical_writes,
	AverageLogicalWrites = qs.max_logical_writes / qs.execution_count,

	TotalElapsedTime = qs.total_elapsed_time,
	MaxElapsedTime = qs.max_elapsed_time,
	AverageElapsedTime = qs.max_elapsed_time / qs.execution_count,

	TotalRows = qs.total_rows,
	MaxRows = qs.max_rows,
	AverageRows = qs.max_rows / qs.execution_count
FROM sys.dm_exec_query_stats AS qs WITH (NOLOCK)
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS qt 

Reporting and Analysis

We’re using SQL Server Reporting Services (SSRS) to generate our analysis reports. We use SSRS internally, we understand it, and we’re not doing anything terribly complicated that would call for a more robust analysis tool (yet).

Reports fall into two broad categories:

  1. Reports that show the worst-performing queries, and
  2. Reports that show trends over time

Worst-Performing Queries

The purpose of this series of reports is to help us determine which queries need to be optimized. We have the following reports under this category:

  • Top Max Logical Reads
  • Top Average Logical Reads
  • Top Max Rows Returned
  • Top Average Rows Returned
  • Top Max Elapsed Time
  • Top Average Elapsed Time

Query Performance Trends over Time

The purpose of our query trend reports is to measure the impact (hopefully positive) that our query optimization efforts are having.

We feel that the Average Logical Reads and Average Elapsed Time trends will be good indicators. We are generating reports for these values that show the daily, weekly, monthly and yearly data.

Next Steps

The work above does not address issues where code may be executing a large volume of inexpensive queries (a typical ORM SELECT N+1 scenario). We will be adding logging shortly to identify these scenarios as well.

Related Posts

Leave a comment