Performance Tuning In Industry

By 01

Monday, September 29, 2008
IT professionals face an uphill battle in the realm of performance tuning. First, management must be convinced that time needs to be set aside for it. Next, the skill sets to successfully move through the performance tuning life cycle must be present. Then, the tools to effectively identify issues must be available. Finally, how success will be measured must be identified before you start; otherwise, these efforts could turn into a political quagmire. I only briefly touch on most of these. This article is about how to carry out a performance tuning effort in the absence of industry-leading tools. I tie that together with the motivation behind my Masters Project, in which I wanted to explore, in a more formal way, the underpinnings of how I have approached performance tuning over the past ten years.
 
I originally wrote this in November 2007. At the time, I was finishing my Masters Project, which would conclude my Masters Program and my efforts to get said degree. I wrote this article because I was still putting together notes for my project defense presentation. My professor's feedback had been that I had a sufficient amount of detail, but he wanted to see more "story telling" about why I believe this work will be useful to someone in industry--which made sense. It isn't the type of information I thought would be in a project defense and, in the end, it wasn't. But it helped solidify a couple of points in my mind.
 
So, I have to ask myself the question: what use is this material? I've spent nearly eighteen months exploring this subject. For anyone reading this out of context on www.thinkmiddleware.com, my Masters Project involved studying information that can be efficiently gathered from an OS at runtime during a load test. The application under study ran on a Java and J2EE middleware platform. Many of the standard J2EE technologies--Servlets, JSP, EJB, JMS, JDBC--were used by the application. Obviously, any dynamic analysis of a J2EE application that focuses on low-level OS data will produce an abstract set of data. That was the whole idea of the project: using machine learning and statistical analysis techniques, what could be learned?
 
So, what use is this material? In my mind, this bleeds into a much larger question of the place performance tuning has in the modern IT industry. By modern, I mean the past ten years--or, let's not even concern ourselves with the past ten years; let's focus on the past two. When I entered this industry in the late nineties, I was fortunate to be hired into an IT shop that placed a disproportionate emphasis on performance tuning. It was a do-it-yourself shop whose highest concern, after availability, was system responsiveness and performance. For a long time, I thought all shops worked this way. I liked this environment; I liked their way of doing things--my view of the IT universe today is deeply influenced by those early experiences. But all good things must come to an end. Most IT shops don't work this way. It wasn't until I left that first job six years later that I was exposed to how most of the rest of the industry works.
 
The leap I just made--from gathering and analyzing low-level OS data to determining the behavior of high-level application threads--is drastic. But I shall take a slightly different approach in bridging the gap. That first IT shop I worked for in the late nineties emphasized performance tuning, but there weren't sophisticated tools to achieve the goal. In fact, we were largely left to tune Java applications with sticks and stones--metaphorically, as it were. Those sticks and stones involved tools such as truss and prstat on Solaris, or their equivalents, strace and ps/top, on Linux. Later on, at other companies, I started using tools like Wily Introscope. That is a vastly more efficient way to troubleshoot problems in JVMs, but it has its limitations, which are outside the scope of this discussion. In the beginning, though, the only tools available to me for troubleshooting complex problems and performance tuning (which tends to be a side effect of complex problems) were the OS tools I already mentioned. With this limitation, I marched forward as best I could, and it largely motivates the points I am trying to emphasize here. It must also be remembered that I am coming at this from the role of a Middleware Administrator, not a Java/J2EE Developer. I'm assuming an audience of Middleware Administrators whose roles constrain them to treat the application as a closed system where the source code is not available. If that is not the case, then you are in a better position than I have often been. This discussion will be especially relevant if you are working with a third-party vendor application where the source code is simply not available.
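To make those sticks and stones concrete, below is a minimal sketch of the kind of OS-level session I'm describing, assuming a Solaris or Linux host; the process ID is hypothetical, and the exact options will vary by OS release.

    # Sketch only: the PID (12345) is hypothetical.
    # Find the JVM process serving the application.
    ps -ef | grep java

    # Solaris: per-thread (LWP) CPU usage for the JVM, refreshed every 5 seconds.
    prstat -L -p 12345 5

    # Solaris: attach to the running JVM and watch its system calls.
    truss -p 12345

    # Linux equivalents: a per-thread view and a system-call summary
    # (Ctrl-C to stop; -c prints per-syscall counts and time).
    top -H -p 12345
    strace -c -p 12345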
I've alluded to two different performance tuning tool sets for JVMs (especially JVMs running J2EE middleware software): the low-level OS tools (truss, prstat, strace, ps, top) and commercial JVM instrumentation tools such as Wily Introscope.
There are other categories of tools for this job:

  1. The profilers built into the JVM itself
  2. "Heavy-weight" commercial JVM profilers intended for development environments
  3. Profiling or instrumentation at the source-code level
  4. The tracing and performance-monitoring facilities built into the J2EE containers themselves

Each of these categories has its place, but we are focused on production environments, and I'm trying to promote a performance tuning/troubleshooting ethos that is relevant in any production environment. I've had the built-in JVM profilers result in core dumps more often than I've had success; there is also a significant performance penalty associated with their use. The heavy-weight JVM profilers are designed for use in development environments or on developer workstations. I've tried using one in production before; there was a significant increase in system resource usage and response time--and a decrease in total throughput. The third option is practical if you have access to the source code, or a cooperative developer who does. If that is not the case, it isn't really an option. Most of the popular J2EE middleware containers (Websphere, Weblogic) have some type of tracing facility for profiling applications and the J2EE container itself. Websphere and Weblogic each have two pieces: one that gathers detailed diagnostic information for the vendor (support cases) and one that measures performance metrics. The latter has its limitations, but it can be very useful and should not be overlooked. If you have bought one of the industry-leading J2EE containers, then you already have these tools; if not, this category won't help you either. Again, coming from where I did, I am promoting a tool set, techniques, and a methodology that can apply to the widest possible audience. However, if one of the other categories mentioned applies to you, go for it.
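For completeness, here is roughly what invoking that first category looked like on the HotSpot JVMs of the era, via the bundled hprof agent; the main class and option values below are purely illustrative, and, as noted above, I would not point this at a production container.

    # Illustrative only: the main class and option values are hypothetical,
    # and this style of profiling carries a heavy runtime penalty.
    java -Xrunhprof:cpu=samples,depth=10,file=java.hprof.txt \
         com.example.SomeMainClass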
  
Please note that I keep saying IT shop. I speak from the standpoint of an IT professional working for a company whose core business and strengths lie outside of IT--a company that is large enough, with a substantial enough investment in information technology, to maintain its own IT department. This is not every company. I'm also not referring to a technology company (or software company) whose core competency is technology and developing specialized software. What I have to say may apply to that situation, but I'm not sure--I've never worked for a company like that.
In all but the most dysfunctional shops, the business/marketing side of a company drives what happens inside the IT department. They set direction, dates, and requirements (well, hopefully they do). Somehow, the IT department has to accommodate all of this while still building in the reliability, redundancy, scalability, and performance that is almost always expected but rarely seems to be explicitly stated. I've worked for companies where the business set requirements and IT gave realistic dates by which those requirements could be met; I've also worked for companies where the business set both the requirements and the dates for the IT department. I haven't actually seen the latter model work yet, but in some circles it is enthusiastically followed.
Much has been written about all four of these ideas, but I'm only interested in the last one at the moment--performance. Performance encompasses many things, and it means many different things depending upon who is asked.
The list of what performance can mean goes on, but I think I've made my point. Modern web applications tend to be complex. Most people don't appreciate what is involved in just getting a static web page to display in a browser; many technological advances and very intelligent people over the past twenty years have been involved in allowing that to happen. There are many moving parts to web applications, and there are many people involved in supporting them whom most end users will never meet or even know exist. For every one of these unknown specters in the background, there is a deep world of many, many possibilities for small, incremental improvements in performance. But, at the end of the day, the only real performance measure that matters is the last one I mentioned: end-user response time. It is a thin wrapper over the many things that happen, and can potentially go wrong, during a user's experience. So, anyone who is going to tackle a performance issue must be prepared to peel back the onion layers.
 
Over the past couple of years, I've been exposed to several competing philosophies regarding performance tuning, ranging from "who bothers doing that?" to people who know more about OS internals, JVM internals, TCP tuning, and network tracing than I will for some time to come. Which one is typical? I'm not really sure. What I was exposed to in my first job seems to be the exception, not the norm. Places I have worked at since have historically placed performance at the bottom of their agenda. In one shop, I was fortunate to have come in at a pivotal moment, when attitudes were slowly changing and just needed a little final push. I had a lot of fun at that job and had many exceptional opportunities to explore system and application problems. I've had other experiences where convincing management and peers to take performance tuning seriously was a constant challenge. There was never time built into projects to address even basic JVM tuning and garbage collection tuning (something that I recommend be done for any project involving a JVM). Everyone recognized the need for such things, but years of cynicism and neglect made it difficult for anyone to truly believe something would change--which is unfortunate.
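As a rough illustration of what I mean by basic JVM and garbage collection tuning, here is a minimal sketch of HotSpot options from that era; the heap sizes, log path, and main class are placeholders that would have to be sized and named for the real application, ideally based on load-test data.

    # Placeholder values: heap sizes, log path, and main class are hypothetical
    # and must be chosen for the real application and host.
    java -Xms1024m -Xmx1024m \
         -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
         -Xloggc:/var/log/app/gc.log \
         com.example.SomeMainClass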
 
In recent years, the art of performance tuning has been lost in many quarters. A prevailing wisdom has developed over the past ten years that performance tuning isn't really necessary anymore: CPUs are so fast, memory is so cheap, optimizing compilers are so advanced, and so on. And people seem to accept this. But, at the same time, I routinely encounter web applications where a link is clicked and we wait, and wait, and wait. In my version of these affairs, anything taking more than one second to return to the end user is unacceptable. But, with the constraints of reality being what they are, five seconds is probably more achievable, and that figure has been used as the official boundary between "acceptable" and "not acceptable" end-user response time in many IT shops. Still, the definition of acceptable will be unique to each company and should be discussed with business leaders and analysts as part of a performance tuning effort.
 
So, let's assume you've decided to make performance tuning a priority in your company. This doesn't mean that it is the only priority. It doesn't mean every design decision made from here on out is centered around performance. It doesn't mean the company's vision statement will be rewritten to discuss how many milliseconds can be removed from end-user response time if we disable DNS lookups on the load balancers. So, now what? I've already laid out the measure you need to be concerned with--end-user response time. This is the only thing your non-technical keepers will measure your success by, so we should probably use it. "End-user response time" hides a formidable number of details, potential problems, and complex inner workings of computers, networks, operating systems, middleware, and application code. So, where do you begin?
 
Performance tuning a middleware environment running a high-volume web application is going to involve many systems, groups, people, and talents in an organization. Ideally, there would be one person or group of people who specialize in tracking performance problems through all the pieces of the environment. This person (or group) would need a high-level view of what the application does, how it is used in business terms, how requests flow through the environment, and whom to pull in when focusing on a particular piece. That last part is probably the most important; it is not possible to know everything. This person needs a functional working relationship with everyone from the DBA to the LDAP admin to the Web Admin to the developer, and so on.
 
So, you have your performance tuning expert. Someone has complained that a particular application, work-flow, or screen within an application is slow. What happens next? Reproduce it! Before anything is going to get fixed, the problem must be observed under controlled circumstances. If you can't reproduce it, you're going to have to figure out how, and this can be very challenging. If you can't reproduce it in a development or QA environment, then an increased monitoring exercise in production is probably called for; capturing the event in production with as much detail as possible will have to suffice.
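One minimal way to apply a repeatable load while trying to reproduce a complaint--assuming ApacheBench is available and using a purely hypothetical URL--is sketched below; the request count and concurrency should mirror the traffic pattern described in the complaint.

    # Sketch only: the URL, request count, and concurrency are hypothetical.
    ab -n 1000 -c 25 "http://app.example.com/some/slow/workflow"

Whatever the tool, the point is that the slow behavior has to be triggered on demand while the right people and monitoring are in place to watch it.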
So, an end-user clicks on a link and N seconds go by before a response comes back. Let's assume N is much larger than five seconds. This is certainly not good. But where has the time been spent? Behind the simple act of clicking is a veritable cornucopia of technologies obscuring endless perils and pitfalls: the browser, DNS, the network, load balancers, web servers, the J2EE container and its JVM, the application code, databases, LDAP directories, and the operating systems underneath them all.
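As a first, rough cut at splitting those N seconds across a few of those layers, something as simple as curl's built-in timers can help; the URL below is hypothetical, and this only covers a single request from one vantage point.

    # Sketch only: the URL is hypothetical.  The -w variables are standard
    # curl timers (in seconds): DNS lookup, TCP connect, time to first byte,
    # and total elapsed time.
    curl -s -o /dev/null \
         -w "dns=%{time_namelookup} connect=%{time_connect} first_byte=%{time_starttransfer} total=%{time_total}\n" \
         "http://app.example.com/some/slow/link"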
So, what do you do? Where is the problem? Many people will not have the patience to follow this through to the end. There is some high-level advice I have followed for years: before tuning anything, figure out where the time is actually being spent--on the client side or on the server side--and only then start peeling back layers.
In "High Performance Web Sites", Steve Souders says that in his expeirence, 85% of all time spent processing an HTTP request is on the browser side(1).  My experiences are the exact opposite.  I read this book recently.  Many of the conclusions he reached were contradictory to my own.  I spent a lot of time thinking about this.  My conclusion was that at Yahoo they had a dedicated Performance Group that had been working on server-side issues for years.  The server-side was already optimized by the time he made his observation at Yahoo.  At the first place I worked, we eliminated many issues he explored by banning all images in web applications and placing Javascript files, CSS files, and some Java Applet classes on each individual workstation.  I'm not suggesting everyone do this.  But, this made the time spent downloading these things small compared to the time that was spent on the server-side.  So, much time and effort was spent optimizing the server-side.  I think this shows that every situation is different.  But, in my experience, so far,  most companies have spent very little time or effort optimizing their code or environment on the server-side.  So, this part of the environment still accounts for the majority of the End-User Response Time.  Thus, this is where the vast majority of the reported problems still show up.  I'm not suggesting this is true of the ten most popular web sites in the US (which Steve Souders studied).  I'm suggesting this is true in the types of IT shops I described earlier.
 
So, my experiences have resulted in a specialization in troubleshooting and performance tuning on the server side. Some of that experience was looking at or optimizing code, but most of it was staring at network traces, truss (or strace on Linux) output, and JVM debug output to identify where the problem was. So, I finally come around to answering my original question: this material is useful in exactly this situation--troubleshooting a server-side performance problem not unlike the one I've described here.
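To give a flavor of the raw server-side evidence I mean, here are a couple of hedged examples of how such traces might be captured; the PID, interface names, hostnames, and file names are all hypothetical.

    # Sketch only: PIDs, interfaces, hostnames, and file names are hypothetical.

    # Ask a HotSpot JVM for a thread dump (written to its stdout/stderr log).
    kill -3 12345

    # Capture traffic between the application server and the database for later study.
    tcpdump -i eth0 -w db-traffic.pcap host db.example.com      # Linux
    snoop -d hme0 -o db-traffic.snoop db.example.com            # Solaris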
 
With that, I have essentially answered my original question.  I've touched on a much larger subject in the last half of this article.  How would I continue to troubleshoot that performance problem?  That I will save for another day.

References:

  1. Souders, Steve. 2007. High Performance Web Sites. O'Reilly Media, Inc., Sebastopol, CA.
  2. Wily Introscope
  3. IBM Websphere
  4. BEA/Oracle Weblogic

 
