About the Author

Michael Kopp is a Technical Product Manager at Dynatrace. Reach him at @mikopp

Why Averages Suck and Percentiles are Great

Anyone who has ever monitored or analyzed an application uses, or has used, averages. They are simple to understand and calculate, yet we tend to ignore just how wrong a picture of the world averages can paint. To emphasize the point, let me give you a real-world example from outside the performance space that I recently read in a newspaper.

The article explained that the average salary in a certain region in Europe was 1,900 euros (to be clear, this would be quite good in that region!). However, a closer look revealed that the majority, namely 9 out of 10 people, earned only around 1,000 euros, while one person earned 10,000 (I have oversimplified this of course, but you get the idea). If you do the math you will see that the average is indeed 1,900, but we can all agree that this does not represent the "average" salary as we would use the word in day-to-day life. So now let's apply this thinking to application performance.
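To make the arithmetic concrete, here is a minimal sketch in Python (the salary figures are the ones from the newspaper example):

```python
# Nine people earn roughly 1,000 euros; one earns 10,000.
salaries = [1000] * 9 + [10000]

mean = sum(salaries) / len(salaries)
median = sorted(salaries)[len(salaries) // 2]  # middle value of the sorted list

print(mean)    # 1900.0 -- the reported "average" salary
print(median)  # 1000   -- what 9 out of 10 people actually earn
```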

The Average Response Time

The average response time is by far the most commonly used metric in application performance management. We assume that it represents a "normal" transaction; however, this would only be true if the response time were always the same (all transactions running at equal speed) or if the response time distribution were roughly bell-shaped.

A bell curve represents the "normal" distribution of response times, in which the average and the median are the same. It rarely ever occurs in real applications.

In a bell curve the average (mean) and the median are the same. In other words, the average would represent the majority (half or more than half) of the transactions.

In reality most applications have a few very heavy outliers; a statistician would say that the curve has a long tail. A long tail does not imply many slow transactions, but rather a few that are magnitudes slower than the norm.

This is a typical response time distribution with few but heavy outliers – it has a long tail. The average here is dragged to the right by the long tail.

We see that the average no longer represents the bulk of the transactions and can be a lot higher than the median.
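A quick simulation shows the effect. The distribution parameters below are invented, but the shape is the typical long-tailed one:

```python
import random
import statistics

random.seed(1)
# Hypothetical long-tailed sample: most transactions take ~100ms,
# while a handful of outliers are magnitudes slower.
times = [random.gauss(100, 10) for _ in range(950)]
times += [random.uniform(1000, 5000) for _ in range(50)]

mean = statistics.mean(times)
median = statistics.median(times)
# The mean is dragged to the right by the tail, far above the median.
print(f"median: {median:.0f}ms  mean: {mean:.0f}ms")
```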

You could now argue that this is not a problem as long as the average doesn't look better than the median. I would disagree, but let's look at another real-world scenario experienced by many of our customers:

This is another typical response time distribution. Here we have quite a few very fast transactions that drag the average to the left of the actual median.

In this case a considerable percentage of transactions (10-20 percent) are very, very fast, while the bulk of transactions are several times slower. The median would still tell us the true story, but the average suddenly looks a lot faster than most of our transactions actually are. This is very typical of search engines or applications that involve caches: some transactions are very fast, but the bulk are normal. Another cause of this scenario is failed transactions, more specifically transactions that fail fast. Many real-world applications have a failure rate of 1-10 percent (due to user errors or validation errors). These failed transactions are often magnitudes faster than the real ones and consequently distort the average.
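The fast-failure effect is easy to reproduce with made-up numbers:

```python
import statistics

# Hypothetical mix: 10% of transactions fail fast (e.g. validation errors)
# in about 5ms, while real transactions take about 500ms.
times = [5] * 100 + [500] * 900

print(statistics.mean(times))    # 450.5 -- looks faster than most real transactions are
print(statistics.median(times))  # 500.0 -- the typical real transaction
```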

Of course performance analysts are not stupid and regularly try to compensate with higher-frequency charts (visually looking at smaller aggregates) and by taking the minimum and maximum observed response times into account. However, we can often only do this if we know the application very well; those unfamiliar with the application can easily misinterpret the charts. Because of the depth and type of knowledge required, it's difficult to communicate such an analysis to other people; think how many arguments between IT teams have been caused by this. And that's before we even begin to think about communicating with business stakeholders!

A far better metric is the percentile, because percentiles allow us to understand the distribution. But before we look at percentiles, let's take a look at a key feature of every production monitoring solution: automatic baselining and alerting.

Automatic Baselining and Alerting

In real-world environments, performance gets attention when it is poor and has a negative impact on the business and its users. But how can we identify performance issues quickly and prevent these negative effects? We cannot alert on every slow transaction, since there are always some. In addition, most operations teams have to maintain a large number of applications and are not familiar with all of them, so manually setting thresholds can be inaccurate, quite painful and time-consuming.

The industry has come up with a solution called automatic baselining. Baselining calculates the "normal" performance and only alerts us when an application slows down or produces more errors than usual. Most approaches rely on averages and standard deviations.

Without going into statistical details, this approach again assumes that the response times are distributed over a bell curve:

One standard deviation either side of the mean covers roughly 68% of all transactions, and two standard deviations cover roughly 95%; everything outside that range could be considered an outlier. However, most real-world scenarios are not bell-shaped…

Typically, transactions beyond two standard deviations above the mean are treated as slow and captured for analysis, and an alert is raised if the average moves significantly. In a bell curve this would account for roughly the slowest 2.5 percent (and you can of course adjust that threshold), but if the response time distribution does not form a bell curve the approach becomes inaccurate. We either end up with a lot of false positives (transactions that are a lot slower than the average but lie well within the norm when you look at the curve) or we miss a lot of problems (false negatives). In addition, if the curve is not a bell curve, the average can differ a lot from the median, and applying a standard deviation to such an average can lead to quite a different result than you would expect! To work around this problem these algorithms need many tunable variables and a lot of "hacks" for specific use cases.
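A small sketch (with invented numbers) shows the false-negative problem of the mean-plus-two-standard-deviations cutoff on a skewed distribution:

```python
import statistics

# 900 "normal" transactions at ~100ms plus 100 heavy outliers at ~3000ms.
times = [100] * 900 + [3000] * 100

mean = statistics.mean(times)                  # 390ms, far above the 100ms median
threshold = mean + 2 * statistics.pstdev(times)
print(f"threshold: {threshold:.0f}ms")         # ~2130ms

# A 500ms transaction is five times slower than the norm, yet it stays far
# below the threshold and is never flagged: a false negative.
print(500 > threshold)  # False
```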

Why I love percentiles

A percentile tells me which part of the curve I am looking at and how many transactions are represented by that metric. To visualize this, look at the following chart:

This chart shows the 50th and 90th percentiles along with the average of the same transaction. It shows that the average is influenced far more heavily by the 90th percentile, and thus by outliers rather than by the bulk of the transactions.

The green line represents the average. As you can see, it is very volatile. The other two lines represent the 50th and 90th percentiles. The 50th percentile (or median) is rather stable but has a couple of jumps. These jumps represent real performance degradations for the majority (50%) of the transactions. The 90th percentile (the start of the "tail") is a lot more volatile, which means that the slowness of the outliers depends on data or user behavior. What's important here is that the average is heavily influenced (dragged) by the 90th percentile, the tail, rather than by the bulk of the transactions.

If the 50th percentile (median) of a response time is 500ms, that means that 50% of my transactions are as fast as or faster than 500ms. If the 90th percentile of the same transaction is at 1000ms, it means that 90% are as fast or faster and only 10% are slower. The average in this case could be lower than 500ms (with a heavy front curve), a lot higher (with a long tail) or somewhere in between. A percentile gives me a much better sense of real-world performance, because it shows me a slice of the response time curve.
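In code, this "as fast or faster" reading corresponds to the nearest-rank percentile. A minimal sketch (the helper and the sample values are illustrative, not any particular product's implementation):

```python
import math

def percentile(data, p):
    """Nearest-rank percentile: the smallest value such that at least
    p% of the data is as fast or faster."""
    s = sorted(data)
    return s[max(0, math.ceil(len(s) * p / 100) - 1)]

times = [120, 150, 200, 250, 300, 350, 400, 500, 800, 2000]  # ms, hypothetical

print(percentile(times, 50))  # 300 -- half of the transactions are <= 300ms
print(percentile(times, 90))  # 800 -- 90% are <= 800ms; only the outlier is slower
```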

For exactly that reason percentiles are perfect for automatic baselining. If the 50th percentile moves from 500ms to 600ms I know that 50% of my transactions suffered a 20% performance degradation. You need to react to that.

In many cases we see that the 75th or 90th percentile does not change at all in such a scenario. This means the slow transactions didn't get any slower; only the normal ones did. Depending on how long your tail is, the average might not have moved at all!

In other cases we see the 98th percentile degrading from 1s to 1.5s while the 95th stays stable at 900ms. This means that your application as a whole is stable but a few outliers got worse; nothing to worry about immediately. Percentile-based alerts do not suffer from false positives, are a lot less volatile and don't miss important performance degradations! Consequently, a baselining approach that uses percentiles does not require a lot of tuning variables to work effectively.
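A percentile-based baseline check can therefore stay very simple. A sketch, where the 10% tolerance is an arbitrary illustrative choice:

```python
def violates_baseline(current_ms, baseline_ms, tolerance=0.10):
    """Alert when an observed percentile degrades beyond baseline plus tolerance."""
    return current_ms > baseline_ms * (1 + tolerance)

# The median jumps from its 500ms baseline to 600ms: alert.
print(violates_baseline(600, 500))    # True
# The 98th percentile drifts from 1000ms to 1050ms: within margin, no alert.
print(violates_baseline(1050, 1000))  # False
```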

The screenshot below shows the Median (50th Percentile) for a particular transaction jumping from about 50ms to about 500ms and triggering an alert as it is significantly above the calculated baseline (green line). The chart labeled “Slow Response Time” on the other hand shows the 90th percentile for the same transaction. These “outliers” also show an increase in response time but not significant enough to trigger an alert.

Here we see an automatic baselining dashboard with a violation at the 50th percentile. The violation is quite clear; at the same time, the 90th percentile (upper-right chart) does not violate. Because the outliers are so much slower than the bulk of the transactions, an average would have been influenced by them and would not have reacted as dramatically as the 50th percentile. We might have missed this clear violation!

How can we use percentiles for tuning?

Percentiles are also great for tuning and for giving your optimizations a particular goal. Let's say that something within my application is too slow in general and I need to make it faster. In this case I want to focus on bringing down the 90th percentile. This would ensure that the overall response time of the application goes down. In other cases, when I have unacceptably long outliers, I want to focus on bringing down response times for transactions beyond the 98th or 99th percentile (only the outliers). We see a lot of applications that have perfectly acceptable performance at the 90th percentile, with the 98th percentile being magnitudes worse.

In throughput-oriented applications, on the other hand, I would want to make the majority of my transactions very fast, while accepting that an optimization may make a few outliers slower. I would therefore make sure that the 75th percentile goes down while trying to keep the 90th percentile stable, or at least not getting much worse.

I could not make the same kinds of observations with averages, minimum and maximum, but with percentiles they are very easy indeed.
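The throughput-oriented goal above (bring the 75th percentile down while the 90th stays roughly stable) can be checked directly. The before/after samples here are invented:

```python
import math

def percentile(data, p):
    """Nearest-rank percentile (illustrative helper)."""
    s = sorted(data)
    return s[max(0, math.ceil(len(s) * p / 100) - 1)]

before = [200] * 75 + [400] * 25  # hypothetical pre-tuning sample (ms)
after = [150] * 75 + [420] * 25   # bulk got faster, a few outliers slightly slower

for name, data in (("before", before), ("after", after)):
    print(name, "p75:", percentile(data, 75), "p90:", percentile(data, 90))
# p75 drops from 200ms to 150ms (25% better) while p90 only moves from 400ms to 420ms.
```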


Averages are ineffective because they are too simplistic and one-dimensional. Percentiles are a really great and easy way of understanding the real performance characteristics of your application. They also provide a great basis for automatic baselining, behavioural learning and optimizing your application with a proper focus. In short, percentiles are great!



  1. Rick Viscomi says:

    Is there a particular reason why number of requests is on the y axis and response time is on the x axis? This seems like an odd choice, as user-controlled variables are usually placed on the x axis.

  2. “If the 50th percentile moves from 500ms to 600ms I know that 50% of my transactions suffered a 20% performance degradation.”
    Not true. The same can be observed for example if only the transactions between 500 and 600 ms (plus a few below 500) get past 600ms. That may or may not be 50 % of transactions.

  3. How did you produce the two last charts? Which software did you use?

  4. Great article! Could you explain how the automatic baseline (green lines in the last chart) is calculated? Thanks!

  5. @Rick The reason was simply that otherwise one would not have the bell-shape analogy ;-). Also, all figures here are application-controlled; there is no user control. So no real scientific reason for the choice.

    @9×0 If only the transactions between 500 and 600 move, the 50th percentile is not necessarily impacted; the percentile depends on how many requests shift. But I accept the correction that it is not necessarily 50% of transactions that change when the 50th percentile moves. Rather, I can say that now only 50% of my transactions are as fast as or faster than 600ms. It's an SLA-type metric.

    @x86 the baseline represents the historically recorded 50th/90th percentile plus statistical error margin. Thus if the performance of the currently observed transactions degrades beyond the statistical margin of error it violates that baseline.

  6. Marcel Verkerk says:

    Very good awareness article! Thanks for publishing this! More people need to understand this and realize that Average is an OK calculation but you have to look at data in more ways than one. Go into an average Operations Center at any company that has one and you will see nothing but average values on the boards! So very misleading and very often the reason that significant events (especially unexpected spikes) go unnoticed! Percentiles, histograms, etc. should be standard in any APM tool out there!

    Just for fun: http://filipspagnoli.files.wordpress.com/2009/09/statistician-drowning-in-a-pond-with-an-average-depth-of-3ft.jpg

  7. Alexander says:

    Hi, can you please explain how you drew the average and median vertical lines in your charts? The mathematics behind it.

  8. Hugh Smith says:

    The normal distribution or bell curve is not appropriate for modeling response times or arrival rates.
    There is a physical reason for this: there is a hard limit at 0 on the low side and no limit on the high side.
    Therefore skewed distributions such as the Poisson and Exponential are usually used to model this phenomenon. You won't see any negative values, and even though all values are positive there will not be a hump in the middle with similar tails on each side. In extreme cases there can be long-tailed distributions that are skewed even further than those two distributions. A bell-shaped curve is not an appropriate analogy for response times in the real world. The terminology of "typical" transactions is better than "normal" transactions, because the latter may suggest that the transactions are normally distributed.

    Also, two-tailed statistical tests are not used for analyzing response times. SLAs are set on the high side: a typical SLA says that the 90th percentile of response times should be less than Y seconds. There is no SLA referencing the low side; no one complains that response times are too fast or asks to keep fast transactions (under x seconds) below 5%.

    It is a standard practice in Performance Engineering to use the 90th or 95th percentile rather than averages or outliers to represent response time performance.

  9. Jamie Tabone says:

    Thanks for the article. It is really informative.

    I am currently doing a dissertation which compares one web technology stack (web server and database) with another. One of the test scenarios issues high writes (i.e. an increasing number of users per second submitting a registration). The graphs generated show all the various percentiles but no average.

    In your opinion how should I conclude which stack scales better? In other words which percentiles do you recommend to analyze in order to conclude which stack scales better?

    Thank you

    • In a scenario with increasing load (users and/or traffic) I would watch the different percentile charts: 50th, 75th, 90th and so on. If a web stack truly scales, then the response times at those percentiles would NOT increase with the load but stay relatively stable. E.g. if web stack A shows a stable 90th percentile whereas the 90th of web stack B increases under the same load, then web stack A scales better. If both the 50th and the 90th remain stable, even better, as it means that both the median and the "slower" transactions remain stable with increasing load.

      Now of course stack B, which scales less well (its percentile charts increase more, or sooner, than the other's), can still perform better at a certain load (its 50th or 90th being lower than the other's).

      That is the difference between raw performance at a certain load and scalability.

      • Jamie Tabone says:

        Thank you for your insight. Given that the same request is being used throughout this study, I would then consider the 90th percentile trend for scalability.


  10. Hi Michael,

    I'm currently reporting on the average number of open incidents across work days during the month. It takes the number of open incidents for each day, adds them up and divides by the number of work days. It shows on average how many calls teams were dealing with on a daily basis, replacing an end-of-month figure that is carried over.

    A few concerns have been raised over the baseline. As some calls are open for more than a few days it would be counted more than once and then averaged out.

    I’d like to know your take on this type of average?

  11. Hi Michael
    A simple question on percentiles here. I review many noise impact assessment reports, and all acoustics consultants persist in averaging the percentiles. I have repeatedly said that this is unacceptable, as each percentile is relevant only to the data set from which it is calculated. As such, the average of a set of percentiles represents nothing. Am I correct, or is averaging percentiles acceptable?

    • Michael Kopp says:

      Hi Chris,

      Honestly I am a little bit out of my depth here. I do not know how percentiles are used in noise impact or acoustics, but here are my 2 cents.

      When talking about response time, though, it can make sense depending on the use case. E.g. if I chart the 50th percentile over time, it does make sense to average that chart across time, e.g. to say that the average of the 50th percentile over one week was X.
      A percentile of a percentile, on the other hand (calculating, say, the 50th percentile over time of a median chart), represents nothing to me. So averaging the 50th over time is OK.

      Averaging across different percentiles, e.g. averaging 50th and 90th however does not make any sense to me. As you say, doing this represents nothing.

      Again not sure if that answers your concern though.

  12. Curious, if the data source one gets to work with is aggregated/processed/rolled up data, e.g. a set of data points representing the computed average, median, and/or 90th percentile of the raw data rather than the raw data itself (e.g. 100 vs 10000 data points, say for performance/storage optimization as a reason), and let’s assume the original raw data is not available to use. And there is a possibility we don’t have the count of the sample set used to compute the aggregated data points to be able to compute a weighted average or percentile.

    In such case, is the average of 90th percentile data points useful? As per the note about caveat concerning medians (or average median value) at http://www.incontext.indiana.edu/2013/mar-apr/article3.asp. And for a summarization of results, would a 90th percentile of 90th percentile data points be useful/better than average of 90th percentiles, although 90th of 90th would essentially represent a really worst case scenario if we have quite some bad outliers.

    • FYI, did some follow up analysis comparing the aggregated results to processing stats (average, 90th percentile, etc.) from the raw data, and it looks like if you are using percentiles against aggregated (preprocessed data), then using the average best matches the stats of the raw data (e.g. if the aggregated rollup is for 90th percentiles of the raw data over some intervals, then the average of these 90th percentiles roughly matches the 90th percentile of the raw dataset as a whole – doing a 90th percentile against those 90th percentiles of the intervals will give you a skewed result in this case).

  13. Averages don't suck and percentiles aren't great; both are tools to be used in the appropriate context. Averages being affected by outliers the way they are, which is what you're complaining about, is what averages are supposed to do and one of the reasons they're used. Averages don't exist to paint a picture of the distribution of a dataset; as the name implies, they're supposed to be an average of the dataset, which in many situations might be useful while in others it might not. Averages are typically used to enable higher-level thinking about certain problems where minutiae aren't exactly a requirement. If I were to write an article as you did, I could also argue that percentiles suck and distribution curves FTW, where it should be blatantly obvious that a curve can be overkill for a lot of situations.

    • wx,

      You are of course correct that they are both tools, and as such they have their uses and their weaknesses. The point of this article was that too many people still use an average to monitor the performance of their applications and sites. They do this because they implicitly assume a bell-curve distribution. They also assume that the "normal" user is experiencing this average response time. This is not true! In fact, the value calculated as the average response time might never have occurred at all. The average response time does not necessarily represent the bulk of the users, which is what people actually want to monitor!
      This, and the fact that an average can be very volatile when dominated by outliers (which makes it hard to automatically alert on changes), shows that it is not the right tool to monitor performance.
      A median, on the other hand, explicitly states that 50% of the requests are as fast or faster (this is what many people implicitly assume about the average as well, but it obviously isn't true). It represents actual experience and is not dominated by outliers. A 90th percentile represents the majority of requests while still excluding extreme events. As such, percentiles are the better tool to measure and monitor response time, as they tell me more about the actual experience of the users.

      I could also have named this article “An Average is not the right tool for response time monitoring; percentiles are”. But I wanted to make a strong point that an average really really is not the right tool in this context :).

  14. kimaya Waghmare says:

    The article is really nice 🙂 Great stuff ….

  15. Hi,

    Have you seen a scenario where the average is more than the 99th percentile? Is that possible? I fetched some metrics and see that the average is greater than the 90th, 95th and 99th percentiles.
    Can someone explain with an example?

    • If the outliers are much bigger than the bulk of the data, then the average can spike past the p9x of interest.
      Think of a case in which normal values are between 1 and 100, say you have 100K of them, and then there is one single outlier of 10^1000000. Except for p100, all other percentiles (up to p99.999) would exclude such an outlier and would not be influenced, as a percentile is an order (positional) statistic, while the average would be severely impacted, as it always encompasses the complete set. In our example, even if all 100K values were 100, summed together they would only give us 10M, while the single outlier is clearly still bigger.
      Such a big value acts as an attractor for the average (in this case it actually dominates it).
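      A tiny numeric sketch of that attractor effect:

```python
import statistics

# 100,000 normal values plus one enormous outlier.
values = [100] * 100_000 + [10**9]

mean = statistics.mean(values)
p99 = sorted(values)[int(len(values) * 0.99)]  # value at the 99th-percentile position

# The single outlier drags the mean to roughly 10,100, while the
# 99th percentile ignores it entirely and stays at 100.
print(mean > p99)  # True
```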
