
You Can Only Fix What You Measure

“Part of a successful network monitoring deployment strategy is defining the exact metrics that need to be alerted on,” says Prineel Padayachy, CTO at ATS Network Management. This is what came to mind while I was reading “You Can Only Fix What You Measure (So Measure What You Want to Fix)”, an article written by SolarWinds Head Geek Leon Adato.

The more granular the detail, the better. A common example is disk space monitoring, i.e. creating an alert to open a ticket when free disk space drops below 10%. This might be useful if all disks in the environment are 10GB in size, in which case 1GB free is likely to cause a problem. The same alert logic applied to a 1TB disk will most likely not indicate a problem, since the disk still has 100GB of free space. By simply adding additional ‘AND’ logic to the trigger (Disk Bytes Free is less than 10000000000 bytes, i.e. 10GB), the alert becomes far more relevant to opening a ticket with your Service Desk.
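As a rough sketch of that compound trigger, written in Python purely for illustration (the thresholds and function name below are assumptions for this example, not an actual Orion alert definition), the logic amounts to:

PERCENT_FREE_THRESHOLD = 10            # fire only when less than 10% free...
BYTES_FREE_THRESHOLD = 10_000_000_000  # ...AND less than 10 GB free in absolute terms

def should_open_ticket(bytes_total: int, bytes_free: int) -> bool:
    """Return True only when both conditions indicate genuine disk pressure."""
    percent_free = 100 * bytes_free / bytes_total
    return percent_free < PERCENT_FREE_THRESHOLD and bytes_free < BYTES_FREE_THRESHOLD

# A 10 GB disk with 0.9 GB free trips the alert; a 1 TB disk at 5% free (50 GB)
# would fire on the percentage rule alone, but the AND clause suppresses it.
print(should_open_ticket(10_000_000_000, 900_000_000))        # True
print(should_open_ticket(1_000_000_000_000, 50_000_000_000))  # False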

Turn off out-of-the-box alerts. Until they are fine-tuned to your specifications and environment, they will do more harm than good, not least by losing the trust of the Support Teams receiving them.

“Slow is the new down” is becoming more and more relevant, and the same thinking should be applied to alert thresholds. Static values are not the best approach here; instead, use the Dynamic Thresholds built into the Orion Platform to help you understand the behaviour and resources of your systems.
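The Orion Platform calculates its dynamic baselines internally; purely to illustrate the idea in Python (a minimal sketch of the concept, not SolarWinds’ algorithm, with hypothetical sample data), a dynamic threshold is derived from a metric’s own recent behaviour rather than from a fixed number:

from statistics import mean, stdev

def dynamic_threshold(history: list[float], sigmas: float = 3.0) -> float:
    """Baseline (mean) plus N standard deviations over a recent sample window."""
    return mean(history) + sigmas * stdev(history)

recent_latency_ms = [42, 45, 40, 44, 43, 41, 46, 44]  # hypothetical recent samples
current_latency_ms = 95.0
if current_latency_ms > dynamic_threshold(recent_latency_ms):
    print("Latency is abnormal for this system, even though no static limit was set.")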

You Can Only Fix What You Measure (So Measure What You Want to Fix)

Recently, my colleagues Pete Di Stefano, Ashley Adams, and I hosted a webcast on the topic of capacity planning and optimization. You can listen to or watch it here; it was a really fun conversation.

As part of the discussion, we talked about the need to measure the right things to get the correct outcome. Keep in mind my oft-repeated mantra: monitoring is simply the collection of data. You need a robust, mature tool to add context, which transforms those metrics into information. Only when you have contextually accurate information can IT folks hope to act—fixing the right problem to achieve the best result. But do you know the result you want? Because if you aren’t clear on this point, you’ll end up fixing the wrong thing.

And often, this leads you to emphasize measuring the wrong thing. Let me start off with a true story to illustrate my point:

As I’ve mentioned in the past, my dad was a musician, a percussionist (drummer), for the Cleveland Orchestra for almost 50 years. Because of the sheer variety of percussion instruments a piece might require—snare drum, kettle drum, bass drum, gong, xylophone, marimba, cymbals, and more—the folks in the section would give “ownership” of specific instruments to each team member. My dad always picked cymbals.

Here’s the punchline: I asked him why, one time, and he told me, “My pay per note is way higher than the guy playing snare drum.”

The ridiculousness of Dad’s comment underscores something I see in IT (and especially in monitoring): two valid metrics (pay and the number of notes you play) don’t necessarily produce a valid insight when you relate them. It’s yet another expression of the old XKCD comic on correlation versus causation.

Here’s a less fanciful but equally ridiculous example I saw at several past jobs: ticket queue management. It’s fair to say if you work on a helpdesk, closing tickets is important. Ostensibly, it’s an indication of completed work. But anyone who has actually worked on a helpdesk knows this isn’t completely true. I can close a ticket without completing the work. Heck, some ticket systems let me mass-close a whole bunch of tickets at the same time, never having done any of the tasks requested in them.

Nevertheless, it’s common to see a young, hungry, eager-to-prove-themselves helpdesk manager implement a policy emphasizing ticket closure rates. These policies use a wide range of carrots (or sticks), all pointing in the same direction: “close ALL your tickets, or else!” The inevitable result is support folks dutifully closing every one of their tickets, whether they’re completed or not, sometimes before the customer has hung up the phone.

The problem stems from using ticket closure as a key metric. Tickets aren’t, in and of themselves, a work product. They’re merely indicators of work. It’s possible to close 100 tickets a day and not help a single person or accomplish a single task. Closed tickets don’t speak to quality, satisfaction, or expertise. They also don’t speak to the durability of the solution. “Have you tried turning it off and on again?” will probably make the problem go away (and let you close the ticket), but it’s highly likely the problem will come back, since nothing was actually fixed, only deferred.

We who make our career (or at least spend part of our day) with monitoring solutions are also familiar with how this plays out day to day. Measure high CPU (or RAM, or disk usage) with no other context or input, where a spike over 80% triggers an alert, and you’ll quickly end up with technicians who over-provision the system. Measure uptime on individual systems without the context of load-balancing, and you’ll quickly end up with managers who demand five-nines even though the customer experience was never impacted. And so on.
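To make the CPU example concrete, here is a minimal Python sketch of adding that missing context (the helper name and thresholds are hypothetical, not a feature of any particular tool): alert on sustained high utilization rather than on a single spike.

def sustained_high_cpu(samples: list[float], threshold: float = 80.0,
                       min_consecutive: int = 5) -> bool:
    """True only if `min_consecutive` samples in a row exceed the threshold."""
    streak = 0
    for value in samples:
        streak = streak + 1 if value > threshold else 0
        if streak >= min_consecutive:
            return True
    return False

print(sustained_high_cpu([55, 92, 60, 58, 61]))      # False: a single spike
print(sustained_high_cpu([85, 88, 91, 90, 87, 89]))  # True: sustained pressure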

The point—both of this blog and of my original discussion with Ashley and Pete—is to understand how optimization cannot happen without first gathering metrics; and gathering raw metrics without also including context leads to all manner of tomfoolery.

Prineel Padayachy
Director
ATS Network Management (Pty) Ltd
Tel: +2711 886 1740
Email: info@ats.co.za
