There is a common belief that to make the right decision or to find the root cause of a problem we need to analyze as much information as possible. We, citizens of the information society, however, sometimes feel overloaded and strained with information. As a result, we try to weed out unnecessary bits quickly and focus on only those that can bring us closer to solving the current problem.
One of our customers, offering an e-banking solution, had to identify the cause of a performance problem reported by its customers. Initially, the Operations team thought the issue was caused by one of the front-end tiers. A second glance at what initially looked like a very small increase in transaction time revealed that one of the mainframe database servers was responsible for the performance problem.
Analyzing a performance problem usually requires an end-to-end view of the complete transaction that impacted the end user’s experience. In this article we will show how numbers can be misleading and, if not carefully analyzed, may seem pretty insignificant. We will also discuss how a proper data analysis tool, with a fault domain isolation workflow and an understanding of what is normal (baselining), can help us get to the bottom of the issue: in this case, a database server experiencing a performance problem.
Things Are Getting Slow
Frank, the system operator at FBT, the First Bank of Tsalal (name changed for commercial reasons), got a call from first-line support that some clients were reporting the e-banking solution being way slower than usual. He looked at the real user monitoring dashboard and noticed that one application was under-performing.
Figure 1 shows an overview dashboard with all applications monitored by the Operations team at FBT, with one of the applications indicating a performance problem. The Application Performance metric is calculated as the percentage of transactions that completed within the pre-configured performance threshold.
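The metric described above can be sketched in a few lines. This is an illustrative reimplementation, not the product's actual code; the function name and threshold value are assumptions for the example.

```python
def application_performance(transaction_times_ms, threshold_ms):
    """Return the percentage of transactions that completed within the
    pre-configured performance threshold (higher is better)."""
    if not transaction_times_ms:
        return 100.0  # no traffic: nothing violated the threshold
    fast = sum(1 for t in transaction_times_ms if t <= threshold_ms)
    return 100.0 * fast / len(transaction_times_ms)

# Example: with a 4 ms threshold, three of these four transactions
# complete in time, so Application Performance is 75%.
times = [3, 2, 5, 4]
print(application_performance(times, 4))  # → 75.0
```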
Figure 1. Overview of the applications at the First Bank of Tsalal shows performance problems of the RIB application that impact over 77k users
Mind the Trap
Frank drilled down to see the tiers implementing this application and noticed that a bunch of front-end tiers were reporting Transaction time and Server time above the baseline. He also noticed that one of the mainframe database tiers was reporting higher Transaction times.
He considered the value reported there: 5ms (see Figure 2).
His first thought was that this didn’t seem too bad even though the response time was shown in red. We will later find out why.
The figure below shows the Operation time way above the baseline and five application tiers reporting performance issues. The tiers are ordered from those closest to the end user at the top down to the database tiers at the bottom of the report.
Figure 2. Tier breakdown for the affected RIB application shows performance problems at the four top-most tiers and at the Mainframe DB2 tier
As Frank wanted to get to the bottom of the problem quickly, he decided to analyze those “slow” front-end tiers first (although normally one would drill into the bottommost impacted tier).
Figure 3 shows a report with the list of application servers behind the firewall, to which Frank drilled down from the previous report (see Figure 2) through the tier RIB – 4 – Web Servers -> App Servers Below FW. Only four of the servers are actually reported to be slightly above the baseline; this is indicated by the application performance highlighted in orange and the response time metrics in red.
Figure 3. List of application servers below the Firewall tier
Before drilling down through one of the under-performing servers, Frank, based on his knowledge of the system infrastructure, identified that all four servers were calling the same database region on the mainframe.
Figure 4 shows that all transactions executed on the selected server reported poor Application Performance and increased Operation and Server Time.
Figure 4. All pages delivered from one of the affected servers are impacted, showing that this is not an isolated incident with just a certain type of transaction
Apparently all of the transactions were impacted. Frank could not pinpoint the problem to one transaction or module (e.g., authentication in the case of login problems), so he decided to go back to the report showing the tier status and continue the fault domain isolation process.
Little Big Transaction Time
Analyzing the tiers report again (see Figure 5), Frank focused his attention on the “App Servers – Mainframe DB2” tier, which was reporting a problem with high transaction time. 5ms is almost nothing, but compared to the 3ms baseline and multiplied by the number of times this transaction was executed (3.3M), it started to look like a real problem.
Figure 5. Reconsidering the DB tier with 5ms Transaction time as affecting the application performance
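A quick back-of-the-envelope calculation shows why the "tiny" 2ms delta matters. The sketch below uses only the figures quoted in the article (5ms observed, 3ms baseline, 3.3M executions); the variable names are illustrative.

```python
# Figures from the article: 5 ms observed vs. a 3 ms baseline,
# over 3.3 million transaction executions.
observed_ms = 5
baseline_ms = 3
executions = 3_300_000

# Extra time per call, then aggregated across all executions.
extra_ms_per_call = observed_ms - baseline_ms            # 2 ms
total_extra_seconds = extra_ms_per_call * executions / 1000
print(total_extra_seconds)  # → 6600.0 seconds, i.e. 110 minutes of extra DB time
```

Seen this way, a 2ms regression is no longer noise: it accumulates to almost two hours of additional database time.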
After drilling down to the report showing the servers implementing this tier, Frank noticed that only one database server was reported to be experiencing performance problems (see Figure 6). For that server, not only was the Application Performance below 53%, but the average operation time was also 3x above the baseline (9ms compared to 3ms).
Figure 6. One of the DB servers is reporting performance problems: this could be the root cause of the overall application performance
Figure 7 shows a report with the database queries executed on the affected server. Since all transactions were equally affected, Frank could pinpoint the problem to the infrastructure, i.e., the database server, rather than the application itself or an architectural design issue.
Figure 7. All queries run on the under-performing server are equally affected: this indicates a problem with the server and not with one of the transactions
Isolating Fault Domain One Step at a Time
Using a set of fault domain isolation reports delivered by Dynatrace Application-Aware Network Monitoring (DCRUM), Frank could identify that the cause behind the end user performance problem was one of the database servers. Even though his first, snap decision to investigate the front-end tiers turned out to be a dead end, he could easily restart the process and zero in on the real cause of the problem: one particular database server not performing well.
Figure 8 depicts the four-step fault domain isolation process used by Frank:
- First he chose one of the under-performing applications.
- Next he selected one of the tiers potentially affecting the performance of this application.
- Finally he drilled down to one of the affected servers …
- … to confirm whether all transactions were affected (a server issue) or whether one or more transactions were affecting the overall performance (and further analysis was required).
Figure 8. Four steps of the Fault Domain Isolation workflow
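The drill-down steps above can be sketched as a repeated "find the worst-performing group" operation over measurement records. This is a simplified illustration of the workflow, not DCRUM's implementation; the record fields, data, and the 10-point spread heuristic are all assumptions for the example.

```python
def worst(records, key, score="performance"):
    """Group records by `key` and return the name and records of the
    group with the lowest average performance score."""
    groups = {}
    for r in records:
        groups.setdefault(r[key], []).append(r)
    avg = lambda rs: sum(r[score] for r in rs) / len(rs)
    name = min(groups, key=lambda k: avg(groups[k]))
    return name, groups[name]

# Illustrative measurements: application -> tier -> server -> transaction.
records = [
    {"application": "RIB", "tier": "DB2", "server": "db1", "transaction": "q1", "performance": 52},
    {"application": "RIB", "tier": "DB2", "server": "db1", "transaction": "q2", "performance": 54},
    {"application": "RIB", "tier": "DB2", "server": "db2", "transaction": "q1", "performance": 97},
    {"application": "RIB", "tier": "Web", "server": "web1", "transaction": "p1", "performance": 90},
]

# Steps 1-3: drill from application to tier to server.
app, rs = worst(records, "application")
tier, rs = worst(rs, "tier")
server, rs = worst(rs, "server")

# Step 4: if all transactions on that server score similarly badly,
# it is a server issue rather than a single bad transaction.
scores = [r["performance"] for r in rs]
all_affected = max(scores) - min(scores) < 10
print(server, all_affected)  # → db1 True
```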
Is It a Mystery Or a Puzzle?
When we try to zero in on the root cause of a performance or availability problem in our application, we often assume we are dealing with a puzzle, where more information could help us understand the situation better. On some occasions, however, there is simply too much data to analyze.
Malcolm Gladwell writes about the difference between mysteries and puzzles: when Enron’s shady business collapsed and people sat down to analyze what had really happened, they were not missing pieces of a puzzle; they were overloaded with the information Enron had produced. They were solving a mystery.
Gladwell shows that when solving a mystery we need a set of tools that can help us cut through all that information quickly. In our story, Frank benefited from a fault domain isolation workflow to quickly analyze the root cause of the problem, even though he had to restart his analysis after an initially wrong assumption.
Following the story from the First Bank of Tsalal, we can see the efficiency of a properly designed fault domain isolation workflow, and also take away a number of interesting observations:
- Even as little as 2ms of extra Transaction Time, especially if that is a query execution time, may have a significant impact on the overall performance.
- When considering Transaction Time values we should also keep an eye on the number of transactions that were actually executed and think about the total transaction time before we jump to any (misleading) conclusions.
- A large number of database queries, especially when compared with the number of transactions at other tiers, may indicate architectural problems. Maybe we should check whether the same query is executed multiple times per transaction.
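The last takeaway can be checked with a simple count over the queries observed within a single transaction. This is a hypothetical sketch (the function name and trace data are illustrative), but it captures the pattern worth looking for, such as an N+1 query or a missing cache.

```python
from collections import Counter

def repeated_queries(queries_in_transaction):
    """Return queries that were executed more than once within
    a single transaction, with their execution counts."""
    counts = Counter(queries_in_transaction)
    return {query: n for query, n in counts.items() if n > 1}

# Illustrative trace: one transaction issuing the same lookup three times.
trace = ["SELECT user", "SELECT account", "SELECT user", "SELECT user"]
print(repeated_queries(trace))  # → {'SELECT user': 3}
```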
(This article is based on materials contributed by Nate Austin, drawing on original customer data. The screens presented have been customized but deliver the same value as the out-of-the-box reports.)