Saturday, May 25, 2013

DB2 9.7: Basic set of system performance monitor elements

A set of about ten system performance metrics provides a good basic set for an ongoing operational monitoring effort.
There are hundreds of metrics to choose from, but collecting all of them can be counter-productive due to the sheer volume of data produced. You want metrics that are:
  • Easy to collect - You don't want to have to use complex or expensive tools for everyday monitoring, and you don't want the act of monitoring to significantly burden the system.
  • Easy to understand - You don't want to have to look up the meaning of the metric each time you see it.
  • Relevant to your system - Not all metrics provide meaningful information in all environments.
  • Sensitive, but not too sensitive - A change in the metric should indicate a real change in the system; the metric should not fluctuate on its own.
This starter set includes about 10 metrics:
  • The number of transactions executed:
    TOTAL_APP_COMMITS 
    This provides an excellent base level measurement of system activity.
  • Buffer pool hit ratios, measured separately for data, index, and temporary data:
    100 * (POOL_DATA_L_READS - POOL_DATA_P_READS) / POOL_DATA_L_READS
    100 * (POOL_INDEX_L_READS - POOL_INDEX_P_READS) / POOL_INDEX_L_READS
    100 * (POOL_TEMP_DATA_L_READS - POOL_TEMP_DATA_P_READS) / POOL_TEMP_DATA_L_READS 
    100 * (POOL_TEMP_INDEX_L_READS - POOL_TEMP_INDEX_P_READS)
      / POOL_TEMP_INDEX_L_READS
    Buffer pool hit ratios are one of the most fundamental metrics, and give an important overall measure of how effectively the system is exploiting memory to avoid disk I/O. Hit ratios of 80-85% or better for data and 90-95% or better for indexes are generally considered good for an OLTP environment, and of course these ratios can be calculated for individual buffer pools using data from the buffer pool snapshot.
    Although these metrics are generally useful, for systems such as data warehouses that frequently perform large table scans, data hit ratios are often irretrievably low, because data is read into the buffer pool and then not used again before being evicted to make room for other data.
  • Buffer pool physical reads and writes per transaction:
    (POOL_DATA_P_READS + POOL_INDEX_P_READS +
      POOL_TEMP_DATA_P_READS + POOL_TEMP_INDEX_P_READS)
      / TOTAL_APP_COMMITS
    
    (POOL_DATA_WRITES + POOL_INDEX_WRITES)
      / TOTAL_APP_COMMITS
    These metrics are closely related to buffer pool hit ratios, but have a slightly different purpose. Although you can consider target values for hit ratios, there are no possible targets for reads and writes per transaction. Why bother with these calculations? Because disk I/O is such a major factor in database performance, it is useful to have multiple ways of looking at it. As well, these calculations include writes, whereas hit ratios only deal with reads. Lastly, in isolation, it is difficult to know, for example, whether a 94% index hit ratio is worth trying to improve. If there are only 100 logical index reads per hour, and 94 of them are in the buffer pool, working to keep those last 6 from turning into physical reads is not a good use of time. However, if a 94% index hit ratio were accompanied by a statistic that each transaction did twenty physical reads (which could be further broken down by data and index, regular and temporary), the buffer pool hit ratios might well deserve some investigation.
    Note that these metrics are not raw counts of physical reads and writes; they are normalized per transaction, a pattern followed by many of the metrics in this set. Normalizing decouples a metric from the length of the collection interval and from how busy the system happened to be at the time, so that similar values are obtained regardless of how and when monitoring data is collected. Some consistency in the timing and duration of data collection is still worthwhile, but normalization reduces its importance from critical to merely helpful. The calculation sketch that follows this list shows the delta-and-normalize arithmetic.
  • The ratio of database rows read to rows selected:
    ROWS_READ / ROWS_RETURNED
    This calculation gives an indication of the average number of rows that are read from database tables in order to find the rows that qualify. Low numbers are an indication of efficiency in locating data, and generally show that indexes are being used effectively. For example, this number can be very high in the case where the system does many table scans, and millions of rows need to be inspected to determine if they qualify for the result set. On the other hand, this statistic can be very low in the case of access to a table through a fully qualified unique index. Index-only access plans (where no rows need to be read from the table) do not cause ROWS_READ to increase.
    In an OLTP environment, this metric is generally no higher than 2 or 3, indicating that most access is through indexes instead of table scans. This metric is a simple way to monitor plan stability over time - an unexpected increase is often an indication that an index is no longer being used and should be investigated.
  • The amount of time spent sorting per transaction:
    TOTAL_SORT_TIME / TOTAL_APP_COMMITS
    This is an efficient way to handle sort statistics, because any extra overhead due to spilled sorts automatically gets included here. That said, you might also want to collect TOTAL_SORTS and SORT_OVERFLOWS for ease of analysis, especially if your system has a history of sorting issues.
  • The amount of lock wait time accumulated per thousand transactions:
    1000 * LOCK_WAIT_TIME / TOTAL_APP_COMMITS
    Excessive lock wait time often translates into poor response time, so it is important to monitor. The value is normalized to one thousand transactions because lock wait time on a single transaction is typically quite low. Scaling up to one thousand transactions simply provides measurements that are easier to handle.
  • The number of deadlocks and lock timeouts per thousand transactions:
    1000 * (DEADLOCKS + LOCK_TIMEOUTS) / TOTAL_APP_COMMITS
    Although deadlocks are comparatively rare in most production systems, lock timeouts can be more common. The application usually has to handle them in a similar way: re-executing the transaction from the beginning. Monitoring the rate at which this happens helps avoid the case where many deadlocks or lock timeouts drive significant extra load on the system without the DBA being aware.
  • The number of dirty steal triggers per thousand transactions:
    1000 * POOL_DRTY_PG_STEAL_CLNS / TOTAL_APP_COMMITS
    A "dirty steal" is the least preferred way to trigger buffer pool cleaning. Essentially, the processing of an SQL statement that is in need of a new buffer pool page is interrupted while updates on the victim page are written to disk. If dirty steals are allowed to happen frequently, they can have a significant impact on throughput and response time.
  • The number of package cache inserts per thousand transactions:
    1000 * PKG_CACHE_INSERTS / TOTAL_APP_COMMITS
    Package cache insertions are part of normal execution of the system; however, in large numbers, they can represent a significant consumer of CPU time. In many well-designed systems, after the system is running at steady-state, very few package cache inserts occur, because the system is using or reusing static SQL or previously prepared dynamic SQL statements. In systems with a high traffic of ad hoc dynamic SQL statements, SQL compilation and package cache inserts are unavoidable. However, this metric is intended to watch for a third type of situation, one in which applications unintentionally cause package cache churn by not reusing prepared statements, or by not using parameter markers in their frequently executed SQL.
  • The time an agent waits for log records to be flushed to disk:
    LOG_WRITE_TIME / TOTAL_APP_COMMITS
    The transaction log has significant potential to be a system bottleneck, whether due to high levels of activity, or to improper configuration, or other causes. By monitoring log activity, you can detect problems both from the DB2® side (meaning an increase in number of log requests driven by the application) and from the system side (often due to a decrease in log subsystem performance caused by hardware or configuration problems).
  • In partitioned database environments, the number of fast communication manager (FCM) buffers sent and received between partitions:
    FCM_SENDS_TOTAL, FCM_RECVS_TOTAL
    These give the rate of flow of data between different partitions in the cluster, and in particular, whether the flow is balanced. Significant differences in the numbers of buffers received from different partitions might indicate a skew in the amount of data that has been hashed to each partition.
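As a worked illustration of the calculations above, here is a minimal Python sketch. It is not DB2 code: it assumes you have already collected the cumulative monitor element values named in this list (for example from snapshots or the MON_GET_* table functions) into plain dictionaries keyed by element name. The helper names (derive_metrics, hit_ratio_pct, per_tx) and the output keys are made up for illustration.

# A minimal sketch, assuming two samples of cumulative monitor element values
# have already been collected into dictionaries keyed by element name.

def hit_ratio_pct(l_reads, p_reads):
    # 100 * (logical reads - physical reads) / logical reads
    return 100.0 * (l_reads - p_reads) / l_reads if l_reads else None

def per_tx(value, commits, scale=1):
    # Normalize a counter per transaction (or per `scale` transactions).
    return scale * value / commits if commits else None

def derive_metrics(prev, curr):
    # Monitor elements are cumulative counters, so work with the change
    # over the interval between the two samples.
    d = {name: curr[name] - prev.get(name, 0) for name in curr}
    commits = d["TOTAL_APP_COMMITS"]
    return {
        "data_hit_pct": hit_ratio_pct(d["POOL_DATA_L_READS"], d["POOL_DATA_P_READS"]),
        "index_hit_pct": hit_ratio_pct(d["POOL_INDEX_L_READS"], d["POOL_INDEX_P_READS"]),
        "temp_data_hit_pct": hit_ratio_pct(d["POOL_TEMP_DATA_L_READS"],
                                           d["POOL_TEMP_DATA_P_READS"]),
        "temp_index_hit_pct": hit_ratio_pct(d["POOL_TEMP_INDEX_L_READS"],
                                            d["POOL_TEMP_INDEX_P_READS"]),
        "phys_reads_per_tx": per_tx(d["POOL_DATA_P_READS"] + d["POOL_INDEX_P_READS"]
                                    + d["POOL_TEMP_DATA_P_READS"]
                                    + d["POOL_TEMP_INDEX_P_READS"], commits),
        "phys_writes_per_tx": per_tx(d["POOL_DATA_WRITES"] + d["POOL_INDEX_WRITES"],
                                     commits),
        "rows_read_per_row_returned": (d["ROWS_READ"] / d["ROWS_RETURNED"]
                                       if d["ROWS_RETURNED"] else None),
        "sort_time_per_tx": per_tx(d["TOTAL_SORT_TIME"], commits),
        "lock_wait_per_1000_tx": per_tx(d["LOCK_WAIT_TIME"], commits, 1000),
        "deadlocks_timeouts_per_1000_tx": per_tx(d["DEADLOCKS"] + d["LOCK_TIMEOUTS"],
                                                 commits, 1000),
        "dirty_steals_per_1000_tx": per_tx(d["POOL_DRTY_PG_STEAL_CLNS"], commits, 1000),
        "pkg_cache_inserts_per_1000_tx": per_tx(d["PKG_CACHE_INSERTS"], commits, 1000),
        "log_write_time_per_tx": per_tx(d["LOG_WRITE_TIME"], commits),
    }

Feeding it two samples taken at the start and end of a monitoring window yields per-interval values that can be trended over time; because most of them are normalized per transaction, comparable numbers come back whether the window covers a busy period or a quiet one.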

Cross-partition monitoring in partitioned database environments

Almost all of the individual monitoring element values mentioned above are reported on a per-partition basis.
In general, you expect most monitoring statistics to be fairly uniform across all partitions in the same DB2 partition group; significant differences might indicate data skew. Sample cross-partition comparisons to track include the following (a simple skew check is sketched after this list):
  • Logical and physical buffer pool reads for data, indexes, and temporary tables
  • Rows read, at the partition level and for large tables
  • Sort time and sort overflows
  • FCM buffer sends and receives
  • CPU and I/O utilization
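To make the cross-partition comparison concrete, here is a small sketch in the same spirit as the one above. It assumes you have one value of a monitor element (ROWS_READ in the example) per partition for a monitoring interval; the find_skew helper, the 20% threshold, and the sample numbers are illustrative choices, not DB2 recommendations.

# Hypothetical skew check: compare one monitor element across partitions
# and flag values that deviate noticeably from the per-partition average.

def find_skew(per_partition, threshold=0.20):
    # per_partition: dict of partition number -> element value for one interval.
    # Returns (partition, value, relative deviation) for outliers.
    if not per_partition:
        return []
    mean = sum(per_partition.values()) / len(per_partition)
    if mean == 0:
        return []
    return [(part, value, (value - mean) / mean)
            for part, value in sorted(per_partition.items())
            if abs(value - mean) > threshold * mean]

# Example with made-up numbers: partition 3 reads noticeably more rows than
# its peers, suggesting that more data has been hashed to that partition.
rows_read = {0: 1000000, 1: 1050000, 2: 980000, 3: 1600000}
for part, value, deviation in find_skew(rows_read):
    print(f"partition {part}: ROWS_READ={value} ({deviation:+.0%} vs. mean)")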
