A new best practice for application and database performance management

Until recently, tuning IT application performance has been largely a guessing game. This is both surprising and unacceptable considering the relentless focus IT organizations put on cost-efficiency and productivity.

The traditional approaches to database and application tuning, which involve collecting large volumes of statistics and making trial-and-error changes, are still in widespread use. Today, most server management and monitoring tools deliver "server-oriented" statistics that don't translate to concrete end-user benefits.

The landscape is changing, however. Leading consultants, DBAs, and training organizations are now focusing on performance tuning practices that are tied directly to end-user service levels and improvements in operating efficiency.

Wait-Time analysis is a new approach to application and database performance improvement that allows users to make tuning decisions based on the optimal service impact. Using the principles of Wait-Time analysis described here, DBAs, developers, and application owners can align their efforts with the service levels desired by their IT customers. Wait-Time analysis lets IT find the root cause of the most important problem impacting customers and identify which critical resource will resolve it.

What Is Wait-Time Analysis?

Measure Time

If you were trying to shorten your commute to work, what would you measure? Would you count the number of tire rotations? Would you measure the car's temperature? Would these statistics have any meaning in the context of your goal? All that really matters is what impacts the time for your trip. All the other statistics are distractions that don't help your mission. Wait-Time analysis gets to the root of the problem to achieve the end business result. Although this seems obvious, common IT practices assume that other measurements hold the answer. Rather than immediately focusing on the time to complete requested services, IT tools barrage the user with detailed statistics that count many different operations. So while the DBA should really be looking at how long it took for the database to return the results of a query, typical tools display the number of input/output (I/O) operations and locks encountered.

Get the Details

Under the trial-and-error approach, what level of detail do you need to actively improve your commute time? If the only statistic you have is that the trip took 40 minutes, you can compare one day to the next, but there's not enough data to help improve the situation. What you need is detailed insight into how long you spent at each stoplight, which stretches of road have the most stop-and-go traffic, and how long you waited there. This detail is essential to making the exercise useful.

The same concept applies to IT performance systems. Typically, when Wait-Time is measured, a "black box" approach is taken, where the user sees how long a server took to respond to a request. However, no indication is given as to which of the thousands of steps performed by the server were actually responsible for the delay. As will be shown here, it's important not just to measure Wait-Time but to break it down into sufficient detail so that you can take action.

Wait-Time analysis for IT applications is the singular focus of measuring and improving the service time to the IT customers. By identifying exactly what contributes to longer service time, IT professionals can focus not on the thousands of available statistics, but on the most important bottlenecks that have direct and quantifiable impact on the IT customer.

Wait-Time Analysis for Service Level Management

Because Wait-Time analysis measures the collective time delays causing end users to wait for an information request, it's the measurement technique most closely matched to end-user service levels. For organizations focused on Service Level Management (SLM) techniques, or those bound by Service Level Agreements (SLAs), Wait-Time analysis techniques allow the IT department to measure the performance that is most relevant to achieving the stated service level goals. Service level management typically identifies technical metrics that define whether performance is adequate, and Wait-Time data is the basis for evaluating those metrics.

The Problem with Conventional Statistics

There are so many management tools gathering thousands of statistics from IT systems. Don't these provide the same answer as Wait-Time methods? Why are they not effective?

Traditional approaches to database tuning and performance analysis introduce the same errors identified in the driving example above.

1. Event Counters versus Wait-Time Methods

Typical tools count the number of events, but don't measure time. These statistics are numerous and easy to capture, so they tend to flood management dashboards. But are they useful?

Broad management dashboards have sophisticated displays of monitored data, but counting events or calculating ratios doesn't indicate or predict better performance for database customers. In fact, this approach can have the effect of covering up, rather than exposing, the real service level bottlenecks.

The example is an excerpt from a long summary of counted statistics. Clearly there's much detail and technical accuracy. But where would you go to begin your diagnosis? Do these raw numbers reveal a performance problem? Is the value for "physical writes direct" in the table too high or too low? There's no indication of impact on the end-user service level to make that judgment.

Wait-Time analysis, on the other hand, ranks individual SQL requests by Wait-Time. The statement with the highest Wait-Time is at the top of the list. Its relative impact on overall user service is reflected in the length of the bar, which measures how much time users spend waiting on this request. Without counting how many times an operation occurred, this is a much more meaningful measure of end-user service.

2. System-Wide Averages

Typically, statistics are gathered across an entire system, rather than on a basis that applies to an individual user request. When averaging performance across all requests, it becomes impossible to tell which requests are the most critical resource drains and which resources are impacting service levels.

Vendor-supplied database tools, for example, typically display data across the entire database without breaking it down into specific user requests. As a result, there's no indication which end-user functions were impacted.

3. Silos versus End-to-End Analysis

Another key problem with typical IT monitoring tools is the creation of individual information "silos" that localize statistics for a single type of system, but don't expose an end-user's view of performance.

Because of the differing technical skill sets, separate groups manage databases, application servers, and Web infrastructure. Each group has a primary focus: to optimize the performance of its own box. And typically they use the most common and convenient statistics to measure and improve performance. For an application server, this often means watching memory utilization, thread counts, and CPU utilization. For a database, this is a count of the number of sessions, number of reads, or number of processes.
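
The idea of ranking individual SQL requests by Wait-Time, and then breaking each request's time down into the steps that caused the delay, can be illustrated with a minimal sketch. Everything in it is assumed for illustration: the sample data, the statement names, the wait event labels, and the helper names `rank_by_wait_time` and `top_bottleneck` are hypothetical and do not come from any particular monitoring product.

```python
from collections import defaultdict

# Hypothetical samples: (sql_id, wait_event, seconds_waited).
# Illustrative data only, not taken from a real system.
SAMPLES = [
    ("order_lookup", "disk_read", 4.0),
    ("order_lookup", "lock_wait", 10.0),
    ("order_lookup", "cpu", 1.0),
    ("daily_report", "disk_read", 3.0),
    ("daily_report", "cpu", 2.0),
    ("login_check", "cpu", 0.5),
]


def rank_by_wait_time(samples):
    """Return [(sql_id, total_wait, {event: wait})], highest total wait first."""
    by_event = defaultdict(lambda: defaultdict(float))
    for sql_id, event, seconds in samples:
        # Accumulate time (not event counts) per request and per wait event.
        by_event[sql_id][event] += seconds
    ranked = [
        (sql_id, sum(events.values()), dict(events))
        for sql_id, events in by_event.items()
    ]
    ranked.sort(key=lambda row: row[1], reverse=True)
    return ranked


def top_bottleneck(samples):
    """Return (sql_id, wait_event, seconds) for the largest wait of the top request."""
    sql_id, _total, events = rank_by_wait_time(samples)[0]
    event = max(events, key=events.get)
    return sql_id, event, events[event]


if __name__ == "__main__":
    for sql_id, total, events in rank_by_wait_time(SAMPLES):
        bar = "#" * int(total)  # bar length reflects time users spent waiting
        print(f"{sql_id:<14}{total:>6.1f}s {bar}")
    print("Top bottleneck:", top_bottleneck(SAMPLES))
```

In this sketch the per-request totals put the request users wait on longest at the top, and the per-event breakdown points to the single resource responsible for most of that wait. A system-wide event count over the same samples would show neither.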