These days, it is common to use software or hardware load-balancing to distribute work evenly among multiple, similar servers. Our ultimate goal is to determine how busy our servers are, but it would be useful if we could consider a complete group of load-balanced servers as a single entity, rather than reporting upon each server individually. However, this approach is only valid if we know for sure that the group is indeed balanced. So, we need a quick and reliable way to determine if load-balancing is working as intended. If we can do that, our performance reporting will be greatly simplified.
Anomalies
We shall soon examine various ways to distinguish between balanced and unbalanced configurations – that is what we are really concerned with here – but let’s first consider the implications of finding that the load on a given group of servers is not evenly distributed. An unbalanced load could result from a poor load-balancing mechanism, but it is far more likely to indicate an issue with one or more of the servers – a hung or looping program, for example. When one of the servers in a group is experiencing problems, it is likely that the load will increase artificially on the remaining servers, reducing overall performance, so it is important to monitor for this situation and take corrective action when it happens.
Anomalies such as this are troublesome for the Capacity Planner too, because (hopefully!) they represent an abnormal situation, so measurements taken at these times are unreliable and should be discarded.
So we have good reason to look for anomalies, but what is the best way to do that? We want to do it in a way that is simple, efficient and effective. What’s more, we want any test that we devise to generate a numeric outcome – not a chart of some kind, because charts need people to interpret them and they take up a lot of space in reports. No, we want a single number that we can perhaps use to trigger some other activity (such as raising an alert or generating additional reporting) when an unbalanced configuration is detected.
Approach
The approach we adopted was to examine a number of statistical measures (“statistics”, for short), and see how well they performed the task in different scenarios. When carrying out an investigation of this nature, we need to bear in mind that a given statistic might perform quite differently under different circumstances. To give a trivial example, consider maximum percent busy – max for short – of a number of servers:
Using the max, we can easily detect a looping (100% busy) server, but we can’t detect a hung (idle) server.
Furthermore, we do not know how much the max varies from the other values, so it is of very limited use in this context. It would be more useful if we combined it with a second statistic – such as the minimum or the arithmetic mean – but we would prefer to use a single measure, if possible.
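The limitation of the max can be seen in a minimal sketch. The percent-busy readings below are invented for illustration, and the helper name `max_busy` is our own. For simplicity, the healthy servers are given the same readings in every group, although in practice their load would probably rise when one server hangs.

```python
def max_busy(percent_busy):
    """Return the highest percent-busy value in a group of servers."""
    return max(percent_busy)

# Hypothetical percent-busy readings for four servers (illustrative values only).
balanced = [48, 52, 50, 50]
looping = [48, 52, 50, 100]   # one server stuck at 100% busy
hung = [48, 52, 50, 0]        # one server idle

for name, group in [("balanced", balanced), ("looping", looping), ("hung", hung)]:
    print(name, max_busy(group))
```

The looping group stands out with a max of 100, but the hung group returns the same max as the balanced group, so the idle server goes unnoticed.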
In our assessment, we considered scenarios of looping servers, hung servers and reasonably well-balanced servers. We considered each of these in the context of server groups that were on average:
- Running a very light load.
- Approximately 50% busy.
- Nearly full.
This was necessary, because some statistics perform well at low utilisation levels, but poorly at high levels (or vice versa).
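To illustrate why the load level matters, here is a small sketch that applies one statistic, the standard deviation, to a grid of scenarios. All readings are invented for illustration and are not measurements from our study. The load label describes the healthy servers in each group. We have assumed the population form of the standard deviation (`statistics.pstdev`); the sample form would serve equally well for this comparison.

```python
from statistics import pstdev

# Hypothetical percent-busy readings for groups of four servers.
# All values are invented for illustration; they are not measurements.
scenarios = {
    ("light", "balanced"):  [5, 6, 4, 5],
    ("light", "hung"):      [5, 6, 4, 0],
    ("light", "looping"):   [5, 6, 4, 100],
    ("medium", "balanced"): [48, 52, 50, 50],
    ("medium", "hung"):     [48, 52, 50, 0],
    ("medium", "looping"):  [48, 52, 50, 100],
    ("heavy", "balanced"):  [92, 95, 93, 94],
    ("heavy", "hung"):      [92, 95, 93, 0],
    ("heavy", "looping"):   [92, 95, 93, 100],
}

# Population standard deviation of each group (an assumption; see the text above).
spread = {key: pstdev(values) for key, values in scenarios.items()}

for (load, condition), sd in spread.items():
    print(f"{load:6} {condition:8} stddev={sd:6.2f}")
```

With this made-up data, a hung server in a lightly loaded group and a looping server in a heavily loaded group both produce a standard deviation within a few points of the balanced groups, whereas the opposite combinations produce a very large one. A single threshold on the raw standard deviation would therefore miss some anomalies.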
Statistical Measures
Our research put to the test the following well-known statistical measures:
- The arithmetic mean.
- The standard deviation (StdDev).
- The maximum.
- The minimum.
- The range.
We also evaluated these:
- The mean range. This is one we invented ourselves. Unlike the range, which measures the difference between the maximum and minimum observations, the mean range statistic measures the difference between the minimum and the mean.
- The relative mean range (RmR). In an attempt to make the results of the mean range more user-friendly, we extended the concept by dividing the mean range by the mean. Consequently, this metric is best expressed as a percentage (of the mean). Low values of the RmR imply that the sampled values are close together and high values imply that there are at least some values that vary significantly from the mean.
- The relative standard deviation (Rel StdDev). The standard deviation is used widely, but it can sometimes be difficult to interpret. We have addressed this in the relative standard deviation, which is calculated by dividing the standard deviation by the mean. We can interpret its values as follows: low values indicate a balanced configuration, medium values imply a quiet or hung server, while high values imply a looping or unusually busy server. This statistic is also known as the coefficient of variation.
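The measures listed above can be computed together in a few lines. The following is a sketch under stated assumptions rather than the code used in our study: the function name `balance_statistics` and the dictionary keys are our own, the population form of the standard deviation is assumed, and the relative measures are returned as `None` when the mean is zero because the division is then undefined.

```python
from statistics import mean, pstdev

def balance_statistics(percent_busy):
    """Compute the candidate balance statistics for one group of servers."""
    avg = mean(percent_busy)
    sd = pstdev(percent_busy)          # population standard deviation (assumed)
    lo, hi = min(percent_busy), max(percent_busy)
    mean_range = avg - lo              # difference between the mean and the minimum
    return {
        "mean": avg,
        "stddev": sd,
        "max": hi,
        "min": lo,
        "range": hi - lo,
        "mean_range": mean_range,
        # Relative measures are undefined when the mean is zero (all servers idle).
        "rmr_pct": 100 * mean_range / avg if avg else None,   # percentage of the mean
        "rel_stddev": sd / avg if avg else None,              # coefficient of variation
    }

# Hypothetical readings for a four-server group.
stats = balance_statistics([40, 50, 60, 50])
for name, value in stats.items():
    print(f"{name:11} {value}")
```

For the sample group, the mean is 50 and the minimum is 40, so the mean range is 10 and the RmR is 20% of the mean.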
Conclusions
We examined a number of statistics to see how well they differentiated between different anomalies under varying load. Based upon our findings we assigned numeric ratings (0 = poor, 1 = quite good, 2 = good, 3 = very good) for each statistic/scenario. In assigning ratings, we gave particular emphasis to the ability of the statistic to distinguish between balanced and unbalanced loadings. The ability to distinguish the particular kinds of abnormality was considered useful, but less significant, because we anticipate always having to investigate further when any unbalanced condition is detected. The results are summarised in the table below.
The clear winner of our contest is the relative standard deviation, which performed well over a fairly wide range of circumstances. Our “home-made” statistic, the relative mean range, came in 2nd, closely followed by the standard deviation. It was rather surprising to find that the well-known but rarely used range statistic performed the task reasonably well: it came 4th.
You can use these statistics to simplify your reporting and proactively detect anomalies in situations where load-balancing is used, particularly when there are three or more servers within a load-balanced group. This is just one example that illustrates how straightforward statistics can simplify reporting on complex systems. We have observed that, increasingly, we are resorting to statistical methods to replace traditional approaches that are impractical or labour-intensive on large configurations.
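As an illustration of how a single number can drive an alert, the sketch below flags groups whose relative standard deviation exceeds a threshold. The threshold of 0.25, the function name `unbalanced_groups` and the sample groups are all hypothetical; a suitable threshold would need to be calibrated against your own data. Groups with fewer than three servers are skipped, in line with the remark above.

```python
from statistics import mean, pstdev

REL_STDDEV_THRESHOLD = 0.25  # hypothetical value; calibrate against your own data

def unbalanced_groups(groups, threshold=REL_STDDEV_THRESHOLD):
    """Return {group name: relative standard deviation} for groups over the threshold."""
    flagged = {}
    for name, percent_busy in groups.items():
        if len(percent_busy) < 3:
            continue                    # too few servers for a meaningful test
        avg = mean(percent_busy)
        if avg == 0:
            continue                    # all servers idle: ratio is undefined
        rel_sd = pstdev(percent_busy) / avg   # population form (assumed)
        if rel_sd > threshold:
            flagged[name] = rel_sd
    return flagged

# Hypothetical percent-busy readings per load-balanced group.
groups = {
    "web": [48, 52, 50, 50],
    "app": [30, 32, 100, 31],   # one unusually busy server
    "db": [60, 62],             # only two servers, so not tested
}

for name, rel_sd in unbalanced_groups(groups).items():
    print(f"ALERT: group {name} looks unbalanced (Rel StdDev = {rel_sd:.2f})")
```

With these sample readings only the "app" group is flagged; it could then be reported server by server, while the balanced groups continue to be reported as single entities.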
Are you struggling with performance reporting across large numbers of load-balanced servers? If you are, contact us.