Saturday, May 25, 2013

db2 9.7 Diagnosing and resolving locking problems

To resolve a locking problem, you need to start by diagnosing the type of lock event causing the SQL query performance slowdown or query completion failure, and the SQL statement or statements involved. This section provides steps to help you diagnose the type of locking problem and steps that can then be taken to resolve the issue.

Introduction

A locking problem is the proper diagnosis if applications are failing to complete their tasks or SQL query performance is slowing down because of locks. The ideal objective, therefore, is not to have any lock timeouts or deadlocks on a database system, both of which result in applications failing to complete their tasks.
Lock waits are normal, expected events, but if the time spent waiting for a lock becomes large, lock waits can slow down both SQL query performance and completion of an application. Excessive lock wait durations also risk turning into lock timeouts, which result in the application not completing its tasks.
Lock escalations are a consideration as a locking problem when they contribute to causing lock timeouts. Ideally, the objective is not to have any lock escalations, but a small number can be acceptable if adverse effects are not occurring.
It is suggested that you monitor lock wait, lock timeout, and deadlock locking events at all times; typically at the workload level for lock waits, and at the database level for lock timeouts and deadlocks.
The diagnosis of the type of locking problem that is occurring and its resolution begins with the collection of information and looking for diagnostic indicators. The following sections help to guide you through this process.

Collect information

In general, to be able to objectively assess that your system is demonstrating abnormal behavior which can include processing delays and poor performance, you must have information that describes the typical behavior (baseline) of your system. A comparison can then be made between your observations of suspected abnormal behavior and the baseline. Collecting baseline data, by scheduling periodic operational monitoring tasks, is a key component of the troubleshooting process. For more detailed information about establishing the baseline operation of your system, see: "Operational monitoring of system performance".
To confirm what type of locking problem is the reason for your SQL query performance slowdown or query completion failure, you must collect information that helps to identify the type of lock event involved, the application requesting or holding the lock, what that application was doing during the event, and the SQL statement or statements that are running noticeably slowly.
You can collect this type of information by creating a locking event monitor, by using a table function, or by using the db2pd command. The information gathered by the locking event monitor can be categorized into three main categories:
  • Information about the lock in question
  • Information about the application requesting this lock and its current activities. In the case of a deadlock, this is information about the statement referred to as the victim.
  • Information about the application owning the lock and its current activities. In the case of a deadlock, this is information about the statement referred to as the participant.
For instructions about how to monitor lock wait, lock timeout, and deadlock locking events, see: Monitoring locking events.
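For example, a quick way to see which applications are currently waiting on locks, and which applications hold them, is the db2pd command or (at recent fix pack levels) the MON_LOCKWAITS administrative view; the database name SAMPLE below is an assumption:
# Show current lock waits, including the waiting and holding applications.
db2pd -db sample -wlocks

# Similar information through SQL, if MON_LOCKWAITS is available at your level.
db2 connect to sample
db2 "SELECT * FROM SYSIBMADM.MON_LOCKWAITS"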

Look for diagnostic indicators

The locking event monitor, a table function, or the db2pd command can collect information that helps isolate the nature of a locking problem. Specifically, the following indicators help you diagnose and confirm the particular type of locking problem you are experiencing.
  • If you are experiencing long lock wait times and no lock timeouts, then you likely have a lock wait problem. To confirm: t0055234.html
  • If you are experiencing more deadlocks than the baseline number, then you likely have a deadlock problem. To confirm: t0055236.html
  • If you are experiencing an increased number of lock timeouts and the locktimeout database configuration parameter is set to a nonzero time value, then you likely have a lock timeout problem (also consider a possible lock wait problem). To confirm: t0055235.html
  • If you are experiencing a higher than typical number of lock waits and the locking event monitor indicates that lock escalations are occurring, then you likely have a lock escalation problem. To confirm: t0055237.html

db2 9.7 Resolving lock escalation problems

After diagnosing a lock escalation problem, the next step is to attempt to resolve the issue resulting from the database manager automatically escalating locks from row level to table level. The guidelines provided here can help you to resolve the lock escalation problem you are experiencing and help you to prevent such future incidents.

About this task

The objective is to minimize lock escalations, or eliminate them, if possible. A combination of good application design and database configuration for lock handling can minimize or eliminate lock escalations. Lock escalations can lead to reduced concurrency and potential lock timeouts, so addressing lock escalations is an important task. The lock_escals monitor element and messages written to the administration notification log can be used to identify and correct lock escalations.
First, ensure that lock escalation information is being recorded. Set the value of the mon_lck_msg_lvl database configuration parameter to 1; this is the default setting. When a lock escalation event occurs, information regarding the lock, workload, application, table, and error SQLCODEs is recorded. The query is also logged if it is a currently executing dynamic SQL statement.
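For example, you can verify the current setting and restore the default from the command line processor; the database name SAMPLE is an assumption, and the grep step assumes a Linux or UNIX shell:
# Display the current lock event notification level.
db2 get db cfg for sample | grep -i mon_lck_msg_lvl

# Restore the default level (1) so that lock escalations are logged.
db2 update db cfg for sample using mon_lck_msg_lvl 1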

Before you begin

Confirm that you are experiencing a lock escalation problem by taking the necessary diagnostic steps for locking problems outlined in Diagnosing and resolving locking problems.

Procedure

Use the following steps to diagnose the cause of the unacceptable lock escalation problem and to apply a remedy:
  1. Gather information from the administration notification log about all tables whose locks have been escalated and the applications involved. This log file includes the following information:
    • The number of locks currently held
    • The number of locks needed before lock escalation is completed
    • The table identifier and table name of each table being escalated
    • The number of non-table locks currently held
    • The new table-level lock to be acquired as part of the escalation. Usually, an S or X lock is acquired.
    • The internal return code that is associated with the acquisition of the new table-level lock
  2. Use the administration notification log information about the applications involved in the lock escalations to decide how to resolve the escalation problems. Consider the following options:
    • Check and possibly adjust either the maxlocks or locklist database configuration parameters, or both (see the sketch after this list). In a partitioned database system, make this change on all database partitions. The value of the locklist configuration parameter might be too small for your current workload. If multiple applications are experiencing lock escalation, this could be an indication that the lock list size needs to be increased. Growth in workloads or the addition of new applications could cause the lock list to be too small. If only one application is experiencing lock escalations, then adjusting the maxlocks configuration parameter could resolve this. However, you might want to increase locklist at the same time that you increase maxlocks: if one application is allowed to use more of the lock list, the other applications could exhaust the remaining locks available in the lock list and experience escalations.
    • You might want to consider the isolation level at which the application and the SQL statements are being run, for example RR, RS, CS, or UR. RR and RS isolation levels tend to cause more escalations because locks are held until a COMMIT is issued. CS and UR isolation levels do not hold locks until a COMMIT is issued, and therefore lock escalations are less likely. Use the lowest possible isolation level that can be tolerated by the application.
    • Increase the frequency of commits in the application, if business needs and the design of the application allow this. Increasing the frequency of commits reduces the number of locks that are held at any given time. This helps to prevent the application from reaching the maxlocks value, which triggers a lock escalation, and helps to prevent all the applications from exhausting the lock list.
    • You can modify the application to acquire table locks by using the LOCK TABLE statement. This is a good strategy for tables where concurrent access by many applications and users is not critical; for example, when the application uses a permanent work table (that is, not a declared global temporary table) that is uniquely named for this instance of the application. Acquiring a table lock is a good strategy in this case because it reduces the number of locks held by the application and improves performance, since row locks no longer need to be acquired and released on the rows that are accessed in the work table.
      If the application does not have work tables and you cannot increase the values for locklist or maxlocks configuration parameters, then you can have the application acquire a table lock. However, care must be taken in choosing the table or tables to lock. Avoid tables that are accessed by many applications and users because locking these tables will lead to concurrency problems which can affect response time, and, in the worst case, can lead to applications experiencing lock timeouts.
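    The following is a minimal sketch of the configuration and LOCK TABLE options discussed above; the database name SAMPLE, the new locklist and maxlocks values, and the work table name are assumptions, not recommendations:
    # Inspect the current lock-related configuration (Linux/UNIX shell assumed).
    db2 get db cfg for sample | grep -iE "locklist|maxlocks"

    # Raise the lock list to 8192 4 KB pages and let one application use up to 20% of it.
    db2 update db cfg for sample using locklist 8192 maxlocks 20

    # Alternatively, take a single table-level lock on a private work table.
    db2 connect to sample
    db2 "LOCK TABLE appschema.work_table IN EXCLUSIVE MODE"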

What to do next

Rerun the application or applications and check the administration notification log for lock-related entries to verify that the locking problem has been eliminated.

db2 9.7 Monitoring database locking

Diagnosing and correcting lock contention situations in large DB2® environments can be complex and time consuming. The lock event monitor and other facilities are designed to simplify this task by collecting locking data.

Introduction

The lock event monitor is used to capture descriptive information about lock events at the time that they occur. The information captured identifies the key applications involved in the lock contention that resulted in the lock event. Information is captured for both the lock requestor (the application that received the deadlock or lock timeout error, or waited for a lock for more than the specified amount of time) and the current lock owner.
The information collected by the lock event monitor is written in binary format to an unformatted event table in the database. The captured data is processed in a post-capture step, which improves the efficiency of the capture process.
You can also directly access DB2 relational monitoring interfaces (table functions) to collect lock event information by using either dynamic or static SQL.
Determining if a deadlock or lock timeout has occurred is also simplified. Messages are written to the administration notification log when either of these events occurs; this supplements the SQL0911N (sqlcode -911) error returned to the application. In addition, a notification of lock escalations is also written to the administration notification log; this information can be useful in adjusting the size of the lock list and the portion of the lock list that any one application can use. There are also counters for lock timeouts (lock_timeouts), lock waits (lock_waits), and deadlocks (deadlocks) that can be checked.
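For example, a minimal sketch that reads these database-level counters from the SNAPDB administrative view (requires a database connection; the column list is abbreviated):
-- Cumulative lock activity counters for the currently connected database.
SELECT db_name, lock_waits, lock_wait_time, lock_timeouts, deadlocks, lock_escals
  FROM SYSIBMADM.SNAPDB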
The types of activities for which locking data can be captured include the following:
  • SQL statements, such as:
    • DML
    • DDL
    • CALL
  • LOAD command
  • REORG command
  • BACKUP DATABASE command
  • Utility requests
The lock event monitor replaces the deprecated deadlock event monitors (CREATE EVENT MONITOR FOR DEADLOCKS statement and DB2DETAILDEADLOCK) and the deprecated lock timeout reporting feature (DB2_CAPTURE_LOCKTIMEOUT registry variable) with a simplified and consistent interface for gathering locking event data, and adds the ability to capture data on lock waits.

Functional overview

Two steps are required to enable the capturing of lock event data using the locking event monitor:
  1. You must create a LOCK EVENT monitor using the CREATE EVENT MONITOR FOR LOCKING statement. You provide a name for the monitor and the name of an unformatted event table into which the lock event data will be written.
  2. You must specify the level for which you want lock event data captured by using one of the following methods:
    • You can specify particular workloads by either altering an existing workload, or by creating a new workload using the CREATE or ALTER WORKLOAD statements. At the workload level you must specify the type of lock event data you want captured (deadlock, lock timeout or lock wait), and whether you want the SQL statement history and input values for the applications involved in the locking. For lock waits you must also specify the amount of time that an application will wait for a lock, after which data is captured for the lock wait.
    • You can collect data at the database level and affect all DB2 workloads by setting the appropriate database configuration parameter:
      mon_lockwait
      This parameter controls the generation of lock wait events.
      Best practice is to enable lock wait data collection at the workload level.
      mon_locktimeout
      This parameter controls the generation of lock timeout events.
      Best practice is to enable lock timeout data collection at the database level if lock timeouts are unexpected by the application; otherwise, enable it at the workload level.
      mon_deadlock
      This parameter controls the generation of deadlock events.
      Best practice is to enable deadlock data collection at the database level.
      mon_lw_thresh
      This parameter controls the amount of time spent in lock wait before a mon_lockwait event is generated.
The capturing of SQL statement history and input values incurs additional overhead, but this level of detail is often needed to successfully debug a locking problem.
After a locking event has occurred, the binary data in the unformatted event table can be transformed into an XML or a text document using a supplied Java-based application called db2evmonfmt. In addition, you can format the binary event data in the unformatted event table BLOB column into either an XML report document, using the EVMON_FORMAT_UE_TO_XML table function, or into a relational table, using the EVMON_FORMAT_UE_TO_TABLES procedure.
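For example, a sketch of the EVMON_FORMAT_UE_TO_XML approach; the unformatted event table name LOCKEVMON and the filter values are assumptions, so adjust them to match your event monitor definition:
-- Format lock timeout events from the unformatted event table as XML documents.
SELECT evmon.*
  FROM TABLE ( EVMON_FORMAT_UE_TO_XML(
                 NULL,
                 FOR EACH ROW OF ( SELECT * FROM lockevmon
                                    WHERE event_type = 'LOCKTIMEOUT'
                                    ORDER BY event_timestamp ) ) ) AS evmon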
To aid in the determination of what workloads should be monitored for locking events, the administration notification log can be reviewed. Each time a deadlock or lock timeout is encountered, a message is written to the log. These messages identify the workload in which the lock requestor and lock owner or owners are running, and the type of locking event. There are also counters at the workload level for lock timeouts (lock_timeouts), lock waits (lock_waits), and deadlocks (deadlocks) that can be checked.
Information collected for a locking event
Some of the information for lock events collected by the lock event monitor includes the following:
  • The lock that resulted in an event
  • The application holding the lock that resulted in the lock event
  • The applications that were waiting for or requesting the lock that resulted in the lock event
  • What the applications were doing during the lock event
Limitations
  • There is no automatic purging of the lock event data written to the unformatted event table. You must periodically purge data from the table (see the sketch after this list).
  • You can output the collected event monitor data to only the unformatted event table. Outputs to file, pipe, and table are not supported.
  • It is suggested that you create only one locking event monitor per database. Each additional event monitor only creates a copy of the same data.
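For example, a hypothetical manual purge of event data older than 30 days; the table name LOCKEVMON and the EVENT_TIMESTAMP column are assumptions, so verify them against your unformatted event table before running anything similar:
-- Remove lock event rows older than 30 days from the unformatted event table.
DELETE FROM lockevmon
 WHERE event_timestamp < CURRENT TIMESTAMP - 30 DAYS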

Deprecated lock monitoring functionality

The deprecated detailed deadlock event monitor, DB2DETAILDEADLOCK, is created by default for each database and starts when the database is activated. The DB2DETAILDEADLOCK event monitor must be disabled and removed, otherwise both the deprecated and new event monitors will be collecting data and will significantly affect performance.
To remove the DB2DETAILDEADLOCK event monitor, issue the following SQL statements:
SET EVENT MONITOR DB2DETAILDEADLOCK STATE 0
DROP EVENT MONITOR DB2DETAILDEADLOCK
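For example, from the command line processor against a database named SAMPLE (an assumed name):
db2 connect to sample
db2 "SET EVENT MONITOR DB2DETAILDEADLOCK STATE 0"
db2 "DROP EVENT MONITOR DB2DETAILDEADLOCK"
db2 terminate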

db2 9.7 Collecting lock event data and generating reports

You can use the lock event monitor to collect lock timeout, lock wait, and deadlock information to help identify and resolve locking problems. The lock event data is collected in binary form in an unformatted event table; this task also describes how to obtain a readable text report from that data.

About this task

The lock event monitor collects relevant information that helps with the identification and resolution of locking problems. For example, some of the information the lock event monitor collects for a lock event is as follows:
  • The lock that resulted in a lock event
  • The applications requesting or holding the lock that resulted in a lock event
  • What the applications were doing during the lock event
This task provides instructions for collecting lock event data for a given workload. You might want to collect lock event data under the following conditions:
  • You notice that lock wait values are longer than usual when using the MON_GET_WORKLOAD table function.
  • An application returns a -911 SQL return code with reason code 68 in the administration notification log, stating that "The transaction was rolled back due to a lock timeout." See also message SQL0911N for further details.
  • You notice a deadlock event message in the administration notification log (-911 SQL return code with reason code 2, stating that "The transaction was rolled back due to a deadlock."). The log message indicates that the lock event occurred between two applications, for example, Application A and B, where A is part of workload FINANCE and B is part of workload PAYROLL. See also message SQL0911N for further details.
Restrictions
To view data values, you need the EXECUTE privilege on the EVMON_FORMAT_UE_* routines, which the SQLADM and DBADM authorities hold implicitly. You also need SELECT privilege on the unformatted event table, which by default is held by users with the DATAACCESS authority and by the creator of the event monitor and the associated unformatted event table.

Before you begin

To create the locking event monitor and collect lock event monitor data, you must have DBADM or SQLADM authority.

Procedure

To collect detailed information regarding potential future lock events, perform the following steps:
  1. Create a lock event monitor called lockevmon by using the CREATE EVENT MONITOR FOR LOCKING statement, as shown in the following example:
    CREATE EVENT MONITOR lockevmon FOR LOCKING
       WRITE TO UNFORMATTED EVENT TABLE
    Note: The following lists important points to remember when creating an event monitor:
    • You can create event monitors ahead of time without worrying about using up disk space, because nothing is written until you activate data collection at the database or workload level.
    • In a partitioned database environment, ensure that the event monitors are placed in a partitioned table space across all nodes. Otherwise, lock events will be missed at partitions where the partitioned table space is not present.
    • Ensure that the table space and buffer pool used for the unformatted event table are set up so that writing event data, and later reading it back, causes minimal interference with high-performance work.
  2. Activate the lock event monitor called lockevmon by running the following statement:
    SET EVENT MONITOR lockevmon STATE 1
  3. To enable lock event data collection at the workload level, issue the ALTER WORKLOAD statement with one of the following COLLECT clauses: COLLECT LOCK TIMEOUT DATA, COLLECT DEADLOCK DATA, or COLLECT LOCK WAIT DATA. Specify the WITH HISTORY option on the COLLECT clause. Alternatively, set the corresponding database configuration parameter to enable lock event data collection at the database level, which affects all workloads.
    For lock wait events
    To collect lock wait data for any lock that takes more than 5 seconds to acquire in the FINANCE workload, and for any lock that takes more than 10 seconds to acquire in the PAYROLL workload, issue the following statements:
    ALTER WORKLOAD finance COLLECT LOCK WAIT DATA WITH HISTORY AND VALUES
       FOR LOCKS WAITING MORE THAN 5 SECONDS
    ALTER WORKLOAD payroll COLLECT LOCK WAIT DATA 
       FOR LOCKS WAITING MORE THAN 10 SECONDS WITH HISTORY
    To set the mon_lockwait database configuration parameter with HIST_AND_VALUES input data value for the SAMPLE database, and to set the mon_lw_thresh database configuration parameter for 10 seconds, issue the following commands:
    db2 update db cfg for sample using mon_lockwait hist_and_values
    db2 update db cfg for sample using mon_lw_thresh 10000000
    For lock timeout events
    To collect lock timeout data for the FINANCE and PAYROLL applications, issue the following statements:
    ALTER WORKLOAD finance COLLECT LOCK TIMEOUT DATA WITH HISTORY
    ALTER WORKLOAD payroll COLLECT LOCK TIMEOUT DATA WITH HISTORY
    To set the mon_locktimeout database configuration parameter with HIST_AND_VALUES input data value for the SAMPLE database, issue the following command:
    db2 update db cfg for sample using mon_locktimeout hist_and_values
    For deadlock events
    To collect data for the FINANCE and PAYROLL applications, issue the following statements:
    ALTER WORKLOAD finance COLLECT DEADLOCK DATA WITH HISTORY
    ALTER WORKLOAD payroll COLLECT DEADLOCK DATA WITH HISTORY
    To set the mon_deadlock database configuration parameter with HIST_AND_VALUES input data value for the SAMPLE database, issue the following command:
    db2 update db cfg for sample using mon_deadlock hist_and_values
  4. Rerun the workload in order to receive another lock event notification.
  5. Connect to the database.
  6. Obtain the locking event report using one of the following approaches:
    1. Use the XML parser tool, db2evmonfmt, to produce a flat-text report based on the event data collected in the unformatted event table and using the default stylesheet, for example:
      java db2evmonfmt -d db_name -ue table_name -ftext -u user_id -p password
    2. Use the EVMON_FORMAT_UE_TO_XML table function to obtain an XML document.
    3. Use the EVMON_FORMAT_UE_TO_TABLES procedure to output the data into a relational table.
  7. Analyze the report to determine the reason for the lock event problem and resolve it.
  8. Turn OFF lock data collection for both FINANCE and PAYROLL applications by running the following statements or resetting the database configuration parameters:
    For lock wait events
    ALTER WORKLOAD finance COLLECT LOCK WAIT DATA NONE
    ALTER WORKLOAD payroll COLLECT LOCK WAIT DATA NONE
    To reset the mon_lockwait database configuration parameter with the default NONE input data value for the SAMPLE database, and to reset the mon_lw_thresh database configuration parameter back to its default value of 5 seconds, issue the following commands:
    db2 update db cfg for sample using mon_lockwait none
    db2 update db cfg for sample using mon_lw_thresh 5000000
    For lock timeout events
    ALTER WORKLOAD finance COLLECT LOCK TIMEOUT DATA NONE
    ALTER WORKLOAD payroll COLLECT LOCK TIMEOUT DATA NONE
    To reset the mon_locktimeout database configuration parameter with the default NONE input data value for the SAMPLE database, issue the following command:
    db2 update db cfg for sample using mon_locktimeout none
    For deadlock events
    ALTER WORKLOAD finance COLLECT DEADLOCK DATA NONE
    ALTER WORKLOAD payroll COLLECT DEADLOCK DATA NONE
    To reset the mon_deadlock database configuration parameter with the default WITHOUT_HIST input data value for the SAMPLE database, issue the following command:
    db2 update db cfg for sample using mon_deadlock without_hist

db2 9.7 Types of data to collect for operational monitoring

Several types of data are useful to collect for ongoing operational monitoring.
  • A basic set of DB2 system performance monitoring metrics.
  • DB2 configuration information
    Taking regular copies of database and database manager configuration, DB2 registry variables, and the schema definition helps provide a history of any changes that have been made, and can help to explain changes that arise in monitoring data.
  • Overall system load
    If CPU or I/O utilization is allowed to approach saturation, this can create a system bottleneck that might be difficult to detect using just DB2 snapshots. As a result, the best practice is to regularly monitor system load with vmstat and iostat (and possibly netstat for network issues) on Linux and UNIX-based systems, and with perfmon on Windows. You can also use administrative views, such as ENV_SYS_RESOURCES, to retrieve operating system, CPU, memory, and other information related to the system (see the example after this list). Typically you look for changes in what is normal for your system, rather than for specific one-size-fits-all values.
  • Throughput and response time measured at the business logic level
    An application view of performance, measured above DB2, at the business logic level, has the advantage of being most relevant to the end user, plus it typically includes everything that could create a bottleneck, such as presentation logic, application servers, web servers, multiple network layers, and so on. This data can be vital to the process of setting or verifying a service level agreement (SLA).
The DB2 system performance monitoring elements and system load data are compact enough that even if they are collected every five to fifteen minutes, the total data volume over time is insignificant in most systems. Likewise, the overhead of collecting this data is typically in the one to three percent range of additional CPU consumption, which is a small price to pay for a continuous history of important system metrics. Configuration information typically changes relatively rarely, so collecting it once a day is usually frequent enough to be useful without creating an excessive amount of data.
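For example, on Linux and UNIX you might sample overall system load with the operating system tools mentioned above, and pull comparable information from within DB2; the 5-second intervals are arbitrary:
# Sample CPU, memory, and I/O activity every 5 seconds.
vmstat 5
iostat -x 5

# Operating system, CPU, and memory information exposed through DB2.
db2 "SELECT * FROM SYSIBMADM.ENV_SYS_RESOURCES"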

db2 9.7 Basic set of system performance monitor elements

About 10 metrics of system performance provide a good basic set to use in an on-going operational monitoring effort.
There are hundreds of metrics to choose from, but collecting all of them can be counter-productive due to the sheer volume of data produced. You want metrics that are:
  • Easy to collect - You don't want to have to use complex or expensive tools for everyday monitoring, and you don't want the act of monitoring to significantly burden the system.
  • Easy to understand - You don't want to have to look up the meaning of the metric each time you see it.
  • Relevant to your system - Not all metrics provide meaningful information in all environments.
  • Sensitive, but not too sensitive - A change in the metric should indicate a real change in the system; the metric should not fluctuate on its own.
This starter set includes about 10 metrics (a query sketch that computes several of them follows the list):
  • The number of transactions executed:
    TOTAL_APP_COMMITS 
    This provides an excellent base level measurement of system activity.
  • Buffer pool hit ratios, measured separately for data, index, and temporary data:
    100 * (POOL_DATA_L_READS - POOL_DATA_P_READS) / POOL_DATA_L_READS
    100 * (POOL_INDEX_L_READS - POOL_INDEX_P_READS) / POOL_INDEX_L_READS
    100 * (POOL_TEMP_DATA_L_READS - POOL_TEMP_DATA_P_READS) / POOL_TEMP_DATA_L_READS 
    100 * (POOL_TEMP_INDEX_L_READS - POOL_TEMP_INDEX_P_READS)
      / POOL_TEMP_INDEX_L_READS
    Buffer pool hit ratios are one of the most fundamental metrics, and give an important overall measure of how effectively the system is exploiting memory to avoid disk I/O. Hit ratios of 80-85% or better for data and 90-95% or better for indexes are generally considered good for an OLTP environment, and of course these ratios can be calculated for individual buffer pools using data from the buffer pool snapshot.
    Although these metrics are generally useful, for systems such as data warehouses that frequently perform large table scans, data hit ratios are often irretrievably low, because data is read into the buffer pool and then not used again before being evicted to make room for other data.
  • Buffer pool physical reads and writes per transaction:
    (POOL_DATA_P_READS + POOL_INDEX_P_READS +
      POOL_TEMP_DATA_P_READS + POOL_TEMP_INDEX_P_READS)
      / TOTAL_APP_COMMITS
    
    (POOL_DATA_WRITES + POOL_INDEX_WRITES)
      / TOTAL_APP_COMMITS
    These metrics are closely related to buffer pool hit ratios, but have a slightly different purpose. Although you can consider target values for hit ratios, there are no possible targets for reads and writes per transaction. Why bother with these calculations? Because disk I/O is such a major factor in database performance, it is useful to have multiple ways of looking at it. As well, these calculations include writes, whereas hit ratios only deal with reads. Lastly, in isolation, it is difficult to know, for example, whether a 94% index hit ratio is worth trying to improve. If there are only 100 logical index reads per hour, and 94 of them are in the buffer pool, working to keep those last 6 from turning into physical reads is not a good use of time. However, if a 94% index hit ratio were accompanied by a statistic that each transaction did twenty physical reads (which could be further broken down by data and index, regular and temporary), the buffer pool hit ratios might well deserve some investigation.
    The metrics are not just physical reads and writes, but are normalized per transaction. This trend is followed through many of the metrics. The purpose is to decouple metrics from the length of time data was collected, and from whether the system was very busy or less busy at that time. In general, this helps ensure that similar values for metrics are obtained, regardless of how and when monitoring data is collected. Some amount of consistency in the timing and duration of data collection is a good thing; however, normalization reduces it from being critical to being a good idea.
  • The ratio of database rows read to rows selected:
    ROWS_READ / ROWS_RETURNED
    This calculation gives an indication of the average number of rows that are read from database tables in order to find the rows that qualify. Low numbers are an indication of efficiency in locating data, and generally show that indexes are being used effectively. For example, this number can be very high in the case where the system does many table scans, and millions of rows need to be inspected to determine if they qualify for the result set. On the other hand, this statistic can be very low in the case of access to a table through a fully-qualified unique index. Index-only access plans (where no rows need to be read from the table) do not cause ROWS_READ to increase.
    In an OLTP environment, this metric is generally no higher than 2 or 3, indicating that most access is through indexes instead of table scans. This metric is a simple way to monitor plan stability over time - an unexpected increase is often an indication that an index is no longer being used and should be investigated.
  • The amount of time spent sorting per transaction:
    TOTAL_SORT_TIME / TOTAL_APP_COMMITS
    This is an efficient way to handle sort statistics, because any extra overhead due to spilled sorts automatically gets included here. That said, you might also want to collect TOTAL_SORTS and SORT_OVERFLOWS for ease of analysis, especially if your system has a history of sorting issues.
  • The amount of lock wait time accumulated per thousand transactions:
    1000 * LOCK_WAIT_TIME / TOTAL_APP_COMMITS
    Excessive lock wait time often translates into poor response time, so it is important to monitor. The value is normalized to one thousand transactions because lock wait time on a single transaction is typically quite low. Scaling up to one thousand transactions simply provides measurements that are easier to handle.
  • The number of deadlocks and lock timeouts per thousand transactions:
    1000 * (DEADLOCKS + LOCK_TIMEOUTS) / TOTAL_APP_COMMITS
    Although deadlocks are comparatively rare in most production systems, lock timeouts can be more common. The application usually has to handle them in a similar way: re-executing the transaction from the beginning. Monitoring the rate at which this happens helps avoid the case where many deadlocks or lock timeouts drive significant extra load on the system without the DBA being aware.
  • The number of dirty steal triggers per thousand transactions:
    1000 * POOL_DRTY_PG_STEAL_CLNS / TOTAL_APP_COMMITS
    A "dirty steal" is the least preferred way to trigger buffer pool cleaning. Essentially, the processing of an SQL statement that is in need of a new buffer pool page is interrupted while updates on the victim page are written to disk. If dirty steals are allowed to happen frequently, they can have a significant impact on throughput and response time.
  • The number of package cache inserts per thousand transactions:
    1000 * PKG_CACHE_INSERTS / TOTAL_APP_COMMITS
    Package cache insertions are part of normal execution of the system; however, in large numbers, they can represent a significant consumer of CPU time. In many well-designed systems, after the system is running at steady-state, very few package cache inserts occur, because the system is using or reusing static SQL or previously prepared dynamic SQL statements. In systems with a high traffic of ad hoc dynamic SQL statements, SQL compilation and package cache inserts are unavoidable. However, this metric is intended to watch for a third type of situation, one in which applications unintentionally cause package cache churn by not reusing prepared statements, or by not using parameter markers in their frequently executed SQL.
  • The time an agent waits for log records to be flushed to disk:
    LOG_WRITE_TIME 
      / TOTAL_APP_COMMITS
    The transaction log has significant potential to be a system bottleneck, whether due to high levels of activity, or to improper configuration, or other causes. By monitoring log activity, you can detect problems both from the DB2® side (meaning an increase in number of log requests driven by the application) and from the system side (often due to a decrease in log subsystem performance caused by hardware or configuration problems).
  • In partitioned database environments, the number of fast communication manager (FCM) buffers sent and received between partitions:
    FCM_SENDS_TOTAL, FCM_RECVS_TOTAL
    These give the rate of flow of data between different partitions in the cluster, and in particular, whether the flow is balanced. Significant differences in the numbers of buffers received from different partitions might indicate a skew in the amount of data that has been hashed to each partition.
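The following sketch computes several of these metrics from the DB2 9.7 monitoring table function MON_GET_WORKLOAD, summed across all workloads and members; the element names follow the list above, and the CASE expressions simply guard against division by zero:
-- Aggregate the raw monitor elements across all workloads and members.
WITH w AS (
  SELECT SUM(total_app_commits) AS commits,
         SUM(rows_read)         AS rows_read,
         SUM(rows_returned)     AS rows_returned,
         SUM(lock_wait_time)    AS lock_wait_time,
         SUM(deadlocks)         AS deadlocks,
         SUM(lock_timeouts)     AS lock_timeouts,
         SUM(pool_data_l_reads) AS data_l_reads,
         SUM(pool_data_p_reads) AS data_p_reads
    FROM TABLE(MON_GET_WORKLOAD(NULL, -2)) AS t
)
-- Derive a few of the per-transaction metrics described above.
SELECT commits,
       CASE WHEN data_l_reads > 0
            THEN 100.0 * (data_l_reads - data_p_reads) / data_l_reads END
         AS data_hit_ratio_pct,
       CASE WHEN rows_returned > 0
            THEN 1.0 * rows_read / rows_returned END
         AS rows_read_per_row_returned,
       CASE WHEN commits > 0
            THEN 1000.0 * lock_wait_time / commits END
         AS lock_wait_time_per_1000_commits,
       CASE WHEN commits > 0
            THEN 1000.0 * (deadlocks + lock_timeouts) / commits END
         AS deadlocks_and_timeouts_per_1000_commits
  FROM w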

Cross-partition monitoring in partitioned database environments

Almost all of the individual monitoring element values mentioned above are reported on a per-partition basis.
In general, you expect most monitoring statistics to be fairly uniform across all partitions in the same DB2 partition group. Significant differences might indicate data skew. Sample cross-partition comparisons to track include the following (see the query sketch after this list):
  • Logical and physical buffer pool reads for data, indexes, and temporary tables
  • Rows read, at the partition level and for large tables
  • Sort time and sort overflows
  • FCM buffer sends and receives
  • CPU and I/O utilization
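A sketch of such a comparison, using the SNAPDB administrative view to break a few of these values out by database partition (the column choice is illustrative):
-- Compare cumulative activity across database partitions to look for skew.
SELECT dbpartitionnum, rows_read, pool_data_l_reads, pool_data_p_reads,
       total_sort_time, sort_overflows
  FROM SYSIBMADM.SNAPDB
 ORDER BY dbpartitionnum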