Using a Server Performance Monitor to Find Code Problems

If you are reading this blog, you already know that server performance monitoring is valuable for a number of reasons, but one thing you may not have considered is using historical performance data to help identify problems introduced to a system by new code.

A long-time Performance Sentry user of ours is an Internet company that does not have the budget for a dedicated performance team. In fact, they don’t even have the budget for a dedicated performance person. They did, however, see the need to collect and retain server performance data to help identify and solve performance problems when they popped up.

This particular customer runs Performance Sentry to collect performance data on eight different Internet-facing servers. That data is then processed nightly by the Performance Sentry Portal for reporting and trend analysis.

This customer doesn’t often experience performance problems, so they only review performance reports every couple of weeks. In March they noticed an uptick in CPU usage on two of their servers. The uptick was not dramatic, but it was noticeable, so they made a note to keep an eye on those servers.

The next time the reports were run, those same two servers showed periods where the processor queue length reached alarming levels and CPU usage bounced near 100% for brief stretches. Clearly something would need to be done, just not quite yet. The reports were placed in a “needs to be reviewed” stack.

Two servers with high CPU and Processor Queue lengths
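
Performance Sentry and the Portal handle this collection and charting automatically, but if you are curious what the underlying numbers look like, the sketch below shows one way to sample the same two Windows counters discussed here (% Processor Time and Processor Queue Length) using the built-in typeperf tool and append them to a history file. This is purely illustrative and is not how Performance Sentry collects its data; the output file name and sampling approach are assumptions.

```python
"""
Illustrative sketch only: sample the two counters discussed above with
Windows' built-in typeperf tool and append them to a CSV history file.
Not how Performance Sentry works internally; file name is an assumption.
"""
import csv
import subprocess
from datetime import datetime

COUNTERS = [
    r"\Processor(_Total)\% Processor Time",
    r"\System\Processor Queue Length",
]

def sample_counters():
    """Take one sample of each counter via typeperf (-sc 1 = one sample)."""
    result = subprocess.run(
        ["typeperf", *COUNTERS, "-sc", "1"],
        capture_output=True, text=True, check=True,
    )
    # typeperf emits a quoted CSV header row followed by one data row;
    # keep only the quoted lines and skip the header.
    rows = [ln for ln in result.stdout.splitlines() if ln.startswith('"')]
    data_row = next(csv.reader([rows[1]]))
    cpu_pct, queue_len = float(data_row[1]), float(data_row[2])
    return cpu_pct, queue_len

if __name__ == "__main__":
    cpu_pct, queue_len = sample_counters()
    with open("perf_history.csv", "a", newline="") as f:
        csv.writer(f).writerow(
            [datetime.now().isoformat(timespec="seconds"), cpu_pct, queue_len]
        )
    print(f"CPU: {cpu_pct:.1f}%  Processor Queue Length: {queue_len:.0f}")
```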

The very next day our customer received their first customer call complaining of slower-than-normal response times. Well, as you can imagine, that set off alarm bells and caused those performance reports to move to the top of the stack.

Our customer decided to use the Portal again to try to figure out exactly when the processor queue lengths began to increase. Because they store historical data in the Portal, they were able to quickly pinpoint the exact date and time when the trouble started.
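
The Portal’s trend analysis is what did the pinpointing here; purely to illustrate the idea, a small script like the one below could scan a retained history file (such as the perf_history.csv written by the earlier sketch) and report the first time the processor queue length stayed elevated. The threshold and window size are assumptions.

```python
"""
Illustrative sketch only: scan retained counter history for the first
sustained processor queue length spike, as a rough way to pinpoint when
a problem was introduced. Threshold and window size are assumptions.
"""
import csv
from collections import deque

THRESHOLD = 5   # queue length considered "alarming" (assumption)
WINDOW = 6      # consecutive samples that must exceed the threshold

def find_trouble_start(path="perf_history.csv"):
    """Return the timestamp where the first sustained spike began, or None."""
    recent = deque(maxlen=WINDOW)
    with open(path, newline="") as f:
        for timestamp, cpu_pct, queue_len in csv.reader(f):
            recent.append((timestamp, float(queue_len)))
            if len(recent) == WINDOW and all(q > THRESHOLD for _, q in recent):
                return recent[0][0]   # first sample of the sustained spike
    return None

if __name__ == "__main__":
    start = find_trouble_start()
    print(f"Queue length first stayed above {THRESHOLD} at: {start}")
```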

They used this information to go through their code change logs and see what had changed around the time the problem first appeared.

To make a long story short, they determined that some folder-creation code had been changed and that the new code was causing the problem. That code was rewritten and redeployed to one of the production servers at about 2:30 PM. The results of the change can be seen in the graph below.

As you can see, both CPU usage and the processor queue length dropped right back to acceptable levels. Our customer was thrilled to be able to solve what could have been a tricky problem by quickly isolating the time it was introduced into their system.

Single Server after the code fix

This is a great example of the importance of retaining performance information. Many smaller companies rely only on occasional real-time snapshots of server performance. That works well if you’re only concerned with brief windows of time, but for this company, having access to historical data proved invaluable in solving the problem quickly.

We thank them for sharing this success story, as well as a few of their Performance Sentry Portal graphs, for use in this post.
