Measurement is the central concept in any performance-related activity. If you are not measuring, you are blind. Just as important as measuring at all is collecting the right measurements. Which metrics are the right ones depends on what you want to do. However, there are some general principles which, when followed, can make your life much easier.
Many people think that measuring application performance means measuring response times. While this is a good starting point, it is not the whole story. First of all, it is not obvious what response time actually means. Depending on whom you ask, you will get different definitions. Week 2 of our Application Performance Almanac, “The many faces of end-user monitoring”, describes different ways to measure response times.
For the sake of simplicity, we define response time as the time it takes for a server to service a request, such as a request for a web page. The response time in this case is the time it takes to deliver the resulting page. This time is the most important metric from an end-user perspective. It is also great input from an SLA management perspective, as it tells us whether user requests stay within certain time constraints.
Response times alone cannot tell us why a response took as long as it did. This is the information we need for optimization, tuning and troubleshooting. So in the next step we have to break down response time into its components. The good thing is that this breakdown is similar for every type of transaction. Once you understand how response times are structured, finding optimization and tuning points becomes much easier. So let’s look at the components of application response time.
The first component is CPU time. CPU time is the time during which we actually use the CPU, i.e. perform some sort of calculation. CPU time tells us how computationally intensive our transaction is. So first we look at the relation between response time and CPU time. If we increase load on our system, we will see that total CPU time in most cases increases proportionally to the number of transactions. If our response times stay stable while this happens, we talk about a system which scales linearly. As this is very often not the case, there must be other components involved which impact response times.
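To make the relation between response time and CPU time concrete, here is a minimal sketch that measures both for a single call. The helper `measure` and the workload `sample_transaction` are hypothetical names used only for illustration. Note that `time.process_time()` reports CPU time for the whole process, so the split is only meaningful here because the sketch is single-threaded.

```python
import time

def measure(func, *args, **kwargs):
    """Run func once and split its elapsed time into CPU time and everything else."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()  # CPU time of the whole process (user + system)
    result = func(*args, **kwargs)
    cpu = time.process_time() - cpu_start
    elapsed = time.perf_counter() - wall_start
    # Whatever was not spent on the CPU was spent waiting or suspended.
    other = max(elapsed - cpu, 0.0)
    return result, {"response": elapsed, "cpu": cpu, "other": other}

def sample_transaction():
    # Hypothetical workload: some computation followed by a simulated wait.
    total = sum(i * i for i in range(200_000))
    time.sleep(0.05)
    return total

result, timing = measure(sample_transaction)
print(timing)
```

The `other` bucket is everything that is not CPU time; the rest of this article is about splitting that bucket further.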
So what is our application doing when it is not performing any calculations? Well, it is waiting for resources to become available. Waiting for resources can mean waiting for a database connection, the response of a service call, file operations or other shared resources. Because these resources are shared, the more transactions there are, the more time each of them spends waiting. Everybody who has ever queued up at the checkout counter of a shop knows this situation.
So if response times increase disproportionately with the number of requests, this means that our resource wait times are increasing. This is essential information for performance tuning, and some formal tuning approaches build heavily on this principle.
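The checkout-queue effect can be sketched with a deliberately simplified model. It assumes all requests arrive at the same moment and compete for a fixed pool of identical resources, each held for the same service time. The function name and all numbers are hypothetical; real arrival patterns are far less regular.

```python
def average_response_time(requests, pool_size, service_time):
    """Average response time when `requests` arrive at once and share
    `pool_size` resources, each held for `service_time` seconds."""
    total = 0.0
    for i in range(requests):
        # Requests beyond the pool size queue up in rounds of `pool_size`.
        wait = (i // pool_size) * service_time
        total += wait + service_time
    return total / requests

for load in (5, 10, 20, 40):
    print(load, round(average_response_time(load, pool_size=10, service_time=0.1), 3))
```

Up to the pool size the average stays at the service time. Beyond that point, every additional batch of requests adds wait time, although the work done per request has not changed.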
So that’s it? Well, not yet. There are other factors which impact response times as well. These are external factors which must be correlated with transaction times, and they can be summed up as suspension times. Suspension times are time periods during which our system was suspended and could not execute any code. A typical reason for suspension is Garbage Collection. For certain steps during Garbage Collection the JVM has to be suspended. This means that massive Garbage Collection can result in high response times. I have seen cases where GC time was about 25 percent of the overall transaction response time.
So if we now put this all together, we can build a very nice graph which shows us the major contributors to our application’s response time. I recommend creating two graphs: one showing sum values and one showing average values. Why? Well, the sum graph shows us how our system behaves as a whole, while the average-value chart shows us general trends for individual transactions. OK, some of you might now say that averages suck. This is true, so if your tooling supports working with percentiles, then go for it. I personally prefer percentiles, as a few extreme transactions can distort an average while a percentile still shows what most transactions experienced.
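A small sketch shows why sums, averages and percentiles tell different stories. All timings are hypothetical values in milliseconds, and `percentile` is a simple nearest-rank implementation written for this example, not the method of any particular tool.

```python
from statistics import mean

# Hypothetical per-transaction timings in milliseconds, one value per transaction.
timings = {
    "cpu":        [20, 22, 19, 21, 20, 23, 20, 25, 21, 19],
    "wait":       [30, 28, 35, 31, 29, 33, 30, 400, 32, 27],
    "suspension": [0, 0, 0, 0, 0, 0, 0, 120, 0, 0],
}

def percentile(values, p):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(values)
    # Integer ceiling of p * n / 100, which avoids floating-point rounding.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]

def summarize(values):
    return {
        "sum": sum(values),
        "avg": mean(values),
        "p50": percentile(values, 50),
        "p90": percentile(values, 90),
    }

for component, values in timings.items():
    print(component, summarize(values))
```

In this made-up data set a single slow transaction pushes the average wait time to 67.5 ms, although half of all transactions waited 30 ms or less.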
Is this all the information we need? While this information helps us understand which kind of problem we have, we can do better and find out where exactly the problem is, or rather which application component caused it. Therefore we split the response time and its components by application component, such as our business layer, database layer etc. This creates a matrix of the different timing components and layers.
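As a rough sketch of how such a matrix can be built, the following code aggregates individual timing records by layer and timing component. The layer names, the record format and every number are made up for illustration.

```python
from collections import defaultdict

COMPONENTS = ("cpu", "wait", "suspension")

# Hypothetical timing records: (layer, timing component, milliseconds).
records = [
    ("web", "cpu", 12), ("web", "wait", 3),
    ("business", "cpu", 25), ("business", "wait", 8),
    ("business", "suspension", 15),
    ("database", "cpu", 5), ("database", "wait", 40),
    ("database", "wait", 20),
    ("web service", "cpu", 4), ("web service", "wait", 180),
]

def build_matrix(records):
    """Sum the timings per layer (rows) and timing component (columns)."""
    matrix = defaultdict(lambda: dict.fromkeys(COMPONENTS, 0))
    for layer, component, ms in records:
        matrix[layer][component] += ms
    return {layer: dict(row) for layer, row in matrix.items()}

def largest_contributor(matrix):
    """Return the (layer, component) cell with the highest total time."""
    cells = ((layer, c) for layer in matrix for c in COMPONENTS)
    return max(cells, key=lambda cell: matrix[cell[0]][cell[1]])

matrix = build_matrix(records)
print(f"{'layer':<12}" + "".join(f"{c:>12}" for c in COMPONENTS))
for layer, row in matrix.items():
    print(f"{layer:<12}" + "".join(f"{row[c]:>12}" for c in COMPONENTS))
print("largest contributor:", largest_contributor(matrix))
```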
This information is highly valuable, as it tells us both which kind of time dominates and where it is spent. We can say, for example, that response time is high because of high Web Service response times. As you can see, this two-dimensional breakdown is a great help for finding application performance issues. The good thing is that this approach is generic and can be used for all kinds of applications. Below you see the tabular representation of this information.
If we now compare this to the data from another time frame, we can immediately see the changes, which makes pinpointing problems very easy.
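Comparing two time frames can be automated in a few lines. The sketch below assumes each time frame is available as a nested dictionary of layer and timing component, and it flags every cell that grew by more than a relative threshold. The 20 percent threshold and all numbers are arbitrary assumptions.

```python
def compare(baseline, current, threshold=0.2):
    """Return (layer, component, before, after) for every cell that grew
    by more than `threshold` relative to the baseline."""
    regressions = []
    for layer, row in current.items():
        for component, after in row.items():
            before = baseline.get(layer, {}).get(component, 0)
            if before == 0:
                # No baseline value: report any new, non-zero timing.
                if after > 0:
                    regressions.append((layer, component, before, after))
            elif (after - before) / before > threshold:
                regressions.append((layer, component, before, after))
    return regressions

# Hypothetical totals in milliseconds for two time frames.
baseline = {
    "business": {"cpu": 25, "wait": 8},
    "database": {"cpu": 5, "wait": 60},
}
current = {
    "business": {"cpu": 26, "wait": 9},
    "database": {"cpu": 5, "wait": 95},
}

for layer, component, before, after in compare(baseline, current):
    print(f"{layer}/{component}: {before} ms -> {after} ms")
```

With these numbers only the database wait time is reported, which points directly at the layer and the kind of time that changed.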
So are we done now? Well, we can do pretty much any kind of analysis using this information, but we can still make our life easier. When it comes to wait times, we can use additional metrics which tell us what resource usage was like at a certain point in time. Depending on your platform, JMX, PMI or PerfMon can be used to get this information. For a database connection pool these are, for example, connection wait and usage times as well as the number of idle and used connections. There are similar metrics for other technologies which help you better understand where wait times originate.
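To illustrate what these pool metrics capture, here is a minimal sketch of an instrumented pool. `InstrumentedPool` is a hypothetical class written for this example and is not the API of any real connection pool. In practice you would read the equivalent values from your platform’s monitoring interface.

```python
import queue
import threading
import time
from contextlib import contextmanager

class InstrumentedPool:
    """Toy pool that records wait time, usage time and idle/used counts."""

    def __init__(self, connections):
        self._idle = queue.Queue()
        for conn in connections:
            self._idle.put(conn)
        self.size = len(connections)
        self._lock = threading.Lock()  # protects the counters below
        self.acquisitions = 0
        self.total_wait = 0.0
        self.total_usage = 0.0

    @contextmanager
    def connection(self, timeout=None):
        start = time.perf_counter()
        conn = self._idle.get(timeout=timeout)  # blocks while the pool is empty
        acquired = time.perf_counter()
        with self._lock:
            self.acquisitions += 1
            self.total_wait += acquired - start
        try:
            yield conn
        finally:
            with self._lock:
                self.total_usage += time.perf_counter() - acquired
            self._idle.put(conn)

    def snapshot(self):
        idle = self._idle.qsize()
        count = max(self.acquisitions, 1)  # avoid division by zero
        return {
            "idle": idle,
            "used": self.size - idle,
            "avg_wait": self.total_wait / count,
            "avg_usage": self.total_usage / count,
        }

pool = InstrumentedPool(["conn-1", "conn-2"])  # placeholder connection objects
with pool.connection() as conn:
    time.sleep(0.01)  # simulated query
    print("during use:", pool.snapshot())
print("after use: ", pool.snapshot())
```

A rising average wait time together with zero idle connections suggests that the pool itself is the bottleneck rather than the database behind it.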
Finally, we must not forget to collect operating system CPU metrics. Why? Don’t we already have the CPU consumption of our transactions? This is true, but these metrics do not tell us whether response times have increased because we cannot get more CPU cycles. So we need to know how much CPU our processes, as well as others, consume. In virtualized environments we also have to understand the CPU consumption of other guest systems.
In the case of high garbage collection activity, we need additional metrics to analyze and understand these problems.
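On Linux, one possible source for such system-wide numbers is the aggregate `cpu` line in `/proc/stat`, which also includes a `steal` counter for time during which the hypervisor ran something other than this guest. The sketch below computes busy and steal shares from two snapshots. The snapshot strings are invented sample data, and on other platforms you would use a different source such as PerfMon.

```python
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")

def parse_cpu_line(stat_text):
    """Return the aggregate counters from the 'cpu' line of /proc/stat content."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            values = [int(v) for v in parts[1:1 + len(FIELDS)]]
            values += [0] * (len(FIELDS) - len(values))  # older kernels report fewer fields
            return dict(zip(FIELDS, values))
    raise ValueError("no aggregate cpu line found")

def cpu_shares(before, after):
    """Share of time the CPUs were busy, and share stolen by the hypervisor."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    if total == 0:
        return {"busy": 0.0, "steal": 0.0}
    not_busy = deltas["idle"] + deltas["iowait"] + deltas["steal"]
    return {"busy": (total - not_busy) / total, "steal": deltas["steal"] / total}

# Invented snapshots; on a real Linux host, read /proc/stat twice with a
# pause in between, e.g. open("/proc/stat").read().
snapshot_1 = "cpu  1000 0 500 8000 100 0 50 10 0 0\ncpu0 500 0 250 4000 50 0 25 5 0 0"
snapshot_2 = "cpu  1700 0 700 8900 150 0 100 110 0 0\ncpu0 850 0 350 4450 75 0 50 55 0 0"

print(cpu_shares(parse_cpu_line(snapshot_1), parse_cpu_line(snapshot_2)))
```

If the busy share is close to 1, or the steal share is noticeable, increased response times may simply mean that our transactions cannot get the CPU cycles they need.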
Measuring application performance is more than simply measuring transaction response times. If we split up the metrics properly and relate timings to application components, we can get an easy, in-depth understanding of application performance and answer many performance-related questions. If we then use this level of detail to compare metrics across time frames, we get an easy-to-use regression analysis.