Far too often “the database” is blamed for impacting performance and scalability of applications. In many cases, however, it turns out that it’s not the database itself, but the way the database is accessed by the application. Common problem patterns are requesting too much data, inefficient queries, lack of data caching, and waste connection usage, for example.
But, there are cases where the database is to be blamed. This is the scenario of the following story one of our customers shared with me:
The company runs 4 JVMs in a WebLogic cluster with each of them having a connection pool to its Oracle database with 20 max connections. They have setup proper application monitoring which includes both system monitoring and application and real user monitoring. Here is what happened on December 16th at 2PM.
Step 1: Alerting on Connection Pool Exhaustion
As part of its application monitoring, the team members are watching connection pool stats from WebLogic via JMX. They look at measures such as ActiveConnectionsCurrentCount, ActiveConnectionsHighCount as well as ConnectionDelayTime. Besides having these metrics on a Dynatrace dashboard that shows them pool usage per JVM over time they also have an automatic alert configured that triggers when a pool is exhausted. This alert triggered at 2:06PM.
Tip: Alerting on 100% pool utilization is one option. Another option would be to have an alert before the pool is exhausted and combine the alert with ConnectionDelayTime. Why? Because even a 100% utilized pool may just mean that your app is making optimal use of your connections. Only when more connections are requested than available does it becomes a problem which is reflected by increased connection acquisition time.
DO NOT prematurely and automatically increases the connection pool limit if you reach your limit. First: you don’t even know if you have a problem as explained above and Second: This doesn’t solve your problem but just puts more pressure on your database that is already under pressure anyway. Unfortunately I’ve seen this “Best Practice” in several demos of tool vendors with the message: This is how we help you to MANAGE performance and not just MONITOR it. So – please be cautious with that approach! It demos well but that’s about it!
Step 2: Identifying Long Running SQL
A typical behavior in such a case is to blame it on bad connection handling by the application. For example, holding on to connections too long, or a “classical” connection leak. As there was no recent update to the software, a regression was ruled out.
Besides observing the exhausted connection pools, the team also observed much longer transaction response times. The team members also observed a database execution time contribution to the response time of their transactions. That’s why they took a closer look at the database statements executed in that timeframe. They were surprised by the following list of very long running SQL statements which they immediately forwarded to the engineering and database team. There are several UPDATE statements that took up to 15 minutes. These statements were executed by several transactions and with that not only impacting the execution time of these transactions but also the execution time of all other transactions as they had to wait for the next available connection from the pool.
Tip: Every app will have some UPDATE statements that take a long time to execute. We see them often executed as background jobs on a scheduled basis. For example, report generation or batch updates. These statements should be optimized if possible to ensure fast execution times. But it is more important to look at statements that are called by many transactions. With UPDATE statements that update the same table you want to ensure that these statements are not running into locks. A metric such as % of Transactions Calling and Response Time Impact of Database Executions are good metrics to look at.
Step 3: Fixing Database Table Statistics
The database team looked into execution plans and figured out that the problem was actually not caused by locks – even though these update statements were executed pretty much at the same time. It turned out that the table statistics were outdated and with that had the major impacted on these UPDATE statements. This resulted in these very long running statement executions that impacted all other transactions because database connections were bound and not available for transactions executed by other end users.
After updating the table statistics, the response time of the application went back to normal. The following response time chart shows the immediate effect of that database update:
Tip: If you have long running SQL statements that are executed by batch jobs you want to think about executing them in times where there is less traffic on your application. The other option is to give these “special purpose” jobs a dedicated connection pool. If you experienced something like this case, you need to optimize the statement executions by either optimizing the SQL statements itself or optimizing the database.
More on database performance problems
If you are interested in more problem patterns around database I encourage you to read some of our other blogs on database related performance topics and also check out our series of Hibernate blogs as we often see E/R Mappers as performance problems in enterprise applications.
If you happen to have your own stories we are happy to share them on our blog as well as discuss your own approaches to identify and solve problems like this.