Detecting Bad Deployments on Resource Impact and Not Response Time: Hotspot Garbage Collection

This story came in from Joseph, one of our fellow dynaTrace users and a performance engineer at a large fleet management service company. Their fleet management software runs on .NET, is developed in-house, is load tested with JMeter and is monitored in production with dynaTrace. A usage and configuration change in their dependency injection library turned out to dramatically impact CPU and memory usage while not (yet) impacting end user experience. Lesson learned: resource usage monitoring is just as important as response time and throughput!

On Wednesday, July 3rd, Joseph’s ops team deployed the latest version into their production environment. Load (= throughput) and response time are two key application health measures the application owner team has on their production dashboards. Especially after a new deployment, unchanged behavior is what you are looking and hoping for:

APM Lesson #1: Don’t judge a deployment based on response time and throughput alone

Joseph is an experienced performance engineer, so he knew to check more than just these two metrics. Looking at host and process monitoring data he identified a 6x increase in CPU usage and a 40% increase in process memory usage, which alarmed the folks responsible for the infrastructure hosting the ASP.NET services:

APM Lesson #2: Expand your alerting to resource usage and infrastructure health

Even though the app still shows the same response time and throughput, this is very alarming. Why? Because it means the application is much more expensive to run compared to the previous build.

Whether the app runs on physical hardware, on virtualized infrastructure or in a cloud environment: more resource consumption means higher operational costs. And who knows what happens during peak load: will CPU and memory be enough, or will it eventually impact response time and throughput? That’s why analyzing changes in resource consumption is as critical to performance engineering as optimizing response time and throughput.

If you are a performance engineer, application owner or developer, I suggest you read the rest of this blog post to learn from Joseph’s best practices for analyzing and dealing with situations like this, and what they changed in their development and deployment process:

Step 1: Analyzing the Impact of Memory Usage on CPU

Runtime engines such as the JVM (more on Java Enterprise Performance) or the .NET CLR (more on .NET Performance Analysis) provide detailed performance counters that are a good starting point for analyzing resource usage hotspots on CPU and memory. A good first step is to look at memory usage in the different heap generations as well as the activity of the garbage collector. The following screenshot shows the .NET garbage collection generations (# of collections and heap size) as well as the time spent in garbage collection, taken from their production environment:

APM Lesson #3: Look at memory metrics to figure out how memory usage and GC behavior has changed

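If you don’t have an APM tool collecting these counters for you, the standard .NET CLR Memory performance counters can be read with a few lines of code. The following is only a minimal sketch of that idea: the counter category and counter names are the built-in Windows ones, while the instance name (“w3wp” for an IIS worker process) and the 10-second sampling interval are assumptions for illustration.

    using System;
    using System.Diagnostics;
    using System.Threading;

    class GcCounterSampler
    {
        static void Main()
        {
            // Assumed instance name: for an ASP.NET app this is typically the IIS
            // worker process, e.g. "w3wp" (or "w3wp#1" if several are running).
            const string instance = "w3wp";

            var gen0 = new PerformanceCounter(".NET CLR Memory", "# Gen 0 Collections", instance);
            var gen1 = new PerformanceCounter(".NET CLR Memory", "# Gen 1 Collections", instance);
            var gen2 = new PerformanceCounter(".NET CLR Memory", "# Gen 2 Collections", instance);
            var timeInGc = new PerformanceCounter(".NET CLR Memory", "% Time in GC", instance);
            var heapBytes = new PerformanceCounter(".NET CLR Memory", "# Bytes in all Heaps", instance);

            while (true)
            {
                // One sample per interval; in practice you would ship these values to
                // your monitoring backend instead of printing them to the console.
                Console.WriteLine("Gen0={0} Gen1={1} Gen2={2} %TimeInGC={3:F1} HeapBytes={4}",
                    gen0.NextValue(), gen1.NextValue(), gen2.NextValue(),
                    timeInGc.NextValue(), heapBytes.NextValue());
                Thread.Sleep(10000); // 10-second sampling interval (arbitrary choice)
            }
        }
    }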

What can you read from these two graphs?

  1. Most of the objects live in generation 0 and never get promoted to an older generation. That means we are talking about short-lived objects that are probably created and cleared on a per-request basis. If we saw a shift to older generations, it would be more likely that these objects are part of a cache (the sketch after this list makes that distinction concrete).
  2. We don’t seem to have a memory leak, as the garbage collector clears objects from Gen 0 instead of promoting them to higher generations, which could eventually lead to an out-of-memory crash.
  3. The increase in the number of collections as well as in the time spent in garbage collection indicates that the application either uses more short-lived objects or that these objects grew significantly in size.
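
To make the generation behavior described above tangible, here is a small, self-contained sketch (not taken from Joseph’s application) that shows how short-lived, per-request style objects only drive Gen 0 collections, while an object held by a cache gets promoted all the way to Gen 2:

    using System;
    using System.Collections.Generic;

    class GenerationDemo
    {
        // Simulated "cache": objects that stay referenced across requests.
        static readonly List<byte[]> Cache = new List<byte[]>();

        static void Main()
        {
            int gen0Before = GC.CollectionCount(0);
            int gen2Before = GC.CollectionCount(2);

            // Short-lived, per-request style allocations: created and dropped immediately.
            for (int i = 0; i < 1000000; i++)
            {
                var temp = new byte[1024];
                temp[0] = 1; // touch the array so it is actually used
            }

            // A long-lived object that a cache keeps referencing.
            var cached = new byte[1024];
            Cache.Add(cached);

            GC.Collect(); // survivors of this collection get promoted to Gen 1
            GC.Collect(); // after a second full collection the cached object reaches Gen 2

            Console.WriteLine("Gen 0 collections during the loop: {0}", GC.CollectionCount(0) - gen0Before);
            Console.WriteLine("Gen 2 collections: {0}", GC.CollectionCount(2) - gen2Before);
            Console.WriteLine("Generation of the cached object: {0}", GC.GetGeneration(cached)); // typically 2
        }
    }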

Step 2: More Objects, Larger Objects or Both?

The next step is to find out what has changed in terms of objects on the heap. Joseph therefore takes regular dynaTrace memory snapshots (YouTube Tutorial) and keeps them to later compare heap usage patterns between releases. The next screenshot shows the difference between the previous version and the current version and highlights that there are not only about twice as many ASP.NET MVC controller objects on the heap, but that these objects are also about 6x larger than before:

APM Lesson #4: Continuously take memory snapshots to identify regressions between builds

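How you compare two snapshots is tool-specific; dynaTrace produces the comparison shown above for you. If all you have are per-type statistics exported from two snapshots (type name, instance count, total bytes), the comparison itself is simple to script. The following is a hypothetical sketch: the TypeStats shape, the example numbers and the 1.5x threshold are illustrative assumptions.

    using System;
    using System.Collections.Generic;

    class SnapshotDiff
    {
        // Hypothetical per-type snapshot statistics: instance count and total bytes.
        struct TypeStats { public long Count; public long Bytes; }

        static void Compare(Dictionary<string, TypeStats> previous, Dictionary<string, TypeStats> current)
        {
            foreach (var entry in current)
            {
                TypeStats before;
                if (!previous.TryGetValue(entry.Key, out before)) continue; // new type; skipped for brevity

                double countFactor = (double)entry.Value.Count / Math.Max(1, before.Count);
                double sizeFactor = (double)entry.Value.Bytes / Math.Max(1, before.Bytes);

                // Flag types whose heap footprint changed noticeably between releases
                // (the 1.5x threshold is an arbitrary illustration value).
                if (countFactor >= 1.5 || sizeFactor >= 1.5)
                    Console.WriteLine("{0}: {1:F1}x instances, {2:F1}x bytes", entry.Key, countFactor, sizeFactor);
            }
        }

        static void Main()
        {
            // Example numbers roughly mirroring the pattern from the article:
            // twice as many controller objects, each about 6x larger.
            var previous = new Dictionary<string, TypeStats> { { "PlanningController", new TypeStats { Count = 1200, Bytes = 480000 } } };
            var current = new Dictionary<string, TypeStats> { { "PlanningController", new TypeStats { Count = 2400, Bytes = 5800000 } } };
            Compare(previous, current);
        }
    }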

It turned out that this pattern was not exclusive to the PlanningController shown in the screenshot but occurred across all ASP.NET MVC controller objects. Following the Deep Memory Leak Snapshots best practices, Joseph found the actual referenced objects that caused these controller objects to grow in size. It turned out to be related to objects that were instantiated and referenced using dependency injection.

Finding and identifying the problem was the first step, which raised two more questions:

a) Why wasn’t this found earlier?

b) Which code change caused this problem?

Question #1: Why Wasn’t This Found Earlier?

Performance engineering is a discipline that spans the whole development lifecycle. Along the delivery pipeline of a fully automated deployment process you need to execute tests that allow you to analyze not only functionality (via unit and functional tests) but also any performance or resource regressions. Joseph and his team admit that they don’t always get to run all of these tests before deploying a release, nor did they look at all of these memory-related metrics. That’s why this problem wasn’t detected before it was deployed into production. Want to learn more? Check out the following continuous delivery blogs (Metrics for Continuous Delivery) and slides (Sydney Continuous Delivery Meeting).
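
One lightweight way to act on this is to turn the key resource metrics of a small load test into a pass/fail gate in the pipeline. This is only a sketch of that idea; the metric names, baseline values and the 20% tolerance are assumptions, and in practice the values would come from your monitoring tool’s API or an exported report rather than being hard-coded.

    using System;

    class ResourceRegressionGate
    {
        // Fails the build step if a metric regressed by more than the allowed tolerance
        // compared to the baseline recorded for the previous good build.
        static void Check(string metric, double baseline, double current, double tolerance = 0.20)
        {
            if (current > baseline * (1 + tolerance))
                throw new Exception(string.Format(
                    "Resource regression: {0} went from {1:F1} to {2:F1} (more than {3:P0} above baseline)",
                    metric, baseline, current, tolerance));
        }

        static void Main()
        {
            // Illustrative values only.
            Check("% Time in GC", baseline: 3.5, current: 4.0);
            Check("Gen 0 collections per request", baseline: 0.8, current: 0.9);
            Check("Process CPU time per request (ms)", baseline: 12.0, current: 13.0);
            Console.WriteLine("No resource regressions detected.");
        }
    }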

Question #2: Which Code Change Caused this Problem?

Many code changes across several builds made it into this new release. In order to figure out which code change was responsible for the changed memory behavior, Joseph ran load tests against 5 different builds. He picked builds where he had identified major changes in the controller implementations, based on the code change sets in their configuration management system. Using JMeter to generate the load and dynaTrace to analyze the performance metrics showed him in which of these builds the memory behavior started to change. The following shows several key metrics he looked at for every build. These graphs make it easy to spot in which builds the memory behavior changes were introduced:

APM Lesson #5: Analyze resource behavior on every build along your delivery pipeline to identify regressions early on.

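To build graphs like these yourself without an APM tool, it is enough to record a one-line summary of the resource counters after each build’s load test run. The sketch below is one possible way to do that from inside the process under test; the CSV file name, the chosen metrics and the build number are assumptions for illustration.

    using System;
    using System.Diagnostics;
    using System.IO;

    class BuildMetricsRecorder
    {
        // Appends one summary row per build after the JMeter run against that build has finished.
        static void AppendSummary(string buildNumber)
        {
            var process = Process.GetCurrentProcess();
            string line = string.Join(",", new[]
            {
                buildNumber,
                GC.CollectionCount(0).ToString(),
                GC.CollectionCount(1).ToString(),
                GC.CollectionCount(2).ToString(),
                GC.GetTotalMemory(false).ToString(),                     // managed heap bytes
                process.PeakWorkingSet64.ToString(),                     // peak process memory
                process.TotalProcessorTime.TotalSeconds.ToString("F1")   // CPU seconds consumed
            });
            File.AppendAllText("build-metrics.csv", line + Environment.NewLine);
        }

        static void Main()
        {
            AppendSummary("1.7.14233"); // build number taken from the article's example
        }
    }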

It is easy to spot that GC executions, time spent in GC and overall CPU time consumption jumped with the changes introduced in build 1.7.14233. More changes were introduced with builds 14235 and 14252. This allowed Joseph to narrow down the code change set with which he approached the development team. Looking at the code changes and analyzing PurePaths and memory snapshots on their local development environments showed them that the problem was related to usage changes of their dependency injection (DI) library. PurePaths with memory allocations showed them how their DI configuration changes impacted actual code execution and memory allocations. They optimized the settings and now have the same resource usage as with the previous release.
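
The article does not name the DI library or the exact configuration change, so the following is purely illustrative: it uses Microsoft.Extensions.DependencyInjection (not necessarily what Joseph’s team used) to show how a registration/lifetime choice alone can change how much memory each controller instance drags onto the heap. The ReportFormatter type and its 64 KB buffer are made-up stand-ins.

    using System;
    using Microsoft.Extensions.DependencyInjection;

    // Illustrative types only; the real application's controllers and services are not known.
    class ReportFormatter { public readonly byte[] Buffer = new byte[64 * 1024]; }

    class PlanningController
    {
        public readonly ReportFormatter Formatter;
        public PlanningController(ReportFormatter formatter) { Formatter = formatter; }
    }

    class DiLifetimeDemo
    {
        static void Main()
        {
            // Variant A: a new (large) dependency for every controller, i.e. for every request in MVC.
            var perRequest = new ServiceCollection()
                .AddTransient<ReportFormatter>()
                .AddTransient<PlanningController>()
                .BuildServiceProvider();

            // Variant B: one shared dependency reused by every controller instance.
            var shared = new ServiceCollection()
                .AddSingleton<ReportFormatter>()
                .AddTransient<PlanningController>()
                .BuildServiceProvider();

            // Rough comparison of allocated bytes; numbers are approximate since the GC may run in between.
            long before = GC.GetTotalMemory(true);
            for (int i = 0; i < 1000; i++) perRequest.GetRequiredService<PlanningController>();
            Console.WriteLine("Transient formatter: ~{0} KB allocated", (GC.GetTotalMemory(false) - before) / 1024);

            before = GC.GetTotalMemory(true);
            for (int i = 0; i < 1000; i++) shared.GetRequiredService<PlanningController>();
            Console.WriteLine("Singleton formatter: ~{0} KB allocated", (GC.GetTotalMemory(false) - before) / 1024);
        }
    }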

Lessons Learned: Performance Engineering Must Be Continuous and More Than Just Response Time

The big lessons learned, and with them the new best practices for Joseph and his team, are:

  1. There is no need to wait for problems like this to happen in production, as these patterns can be detected earlier by looking at a few key metrics.
  2. Developers now use the dynaTrace Free Trial on their local machines to better understand code execution and memory allocations done by 3rd-party frameworks.
  3. Test automation now runs small-scale load tests and monitors memory usage and GC impact on every build.

Benefit: With this new approach they eliminate the tedious and time-consuming work of re-running tests just to identify the code change that introduced a problem.

I am sure there are other stories like this out there. Feel free to share them with us.

About The Author
Andreas Grabner
Andreas Grabner has been helping companies improve their application performance for 15+ years. He is a regular contributor within Web Performance and DevOps communities and a prolific speaker at user groups and conferences around the world. Reach him at @grabnerandi.

Comments

  1. Yep, unfortunately not being able to run a load regression test always gets folks when they least expect it. I understand all too well the time constraints and effort for such a “small change.” Depending on how busy that system was before, or what resource consumption it had, this could have tipped them over.

    We once took a patch from a third party who had built the fix from a previous branch with a nasty GC bug in it (.NET). The package change was about 5k and was a fix for a couple of screens. The old version “fixed” their connection leak by calling GC.Collect(), which, as we all know, is just awful.

    Anyway, the patch gets fast-tracked and is installed at 10pm, a couple of days after it’s received. Around 7am users start rolling onto the system, and by 11am the application is maxed out on CPU and response times are suffering horribly.

    If we had tested the patch, even under moderate load, we would have seen the CPU time spike. Luckily, with dynaTrace we instantly saw it was CPU bound and started looking at long-running PurePaths. GC.Collect() was showing up in just about every call. Knowing that it should *NEVER* be in a call stack, we ran the code through the IL Disassembler and found the call.

    We had a new patch from the vendor, built from the proper branch, in about an hour.
