Large and small companies alike have it easy when it comes to network monitoring. Large companies can afford the biggest and best solution available, and an army of people to monitor every little twitch in the network.
Small companies, on the other hand, either don't monitor at all (this is the much-vaunted, "Hey… the Internet is down" approach--also called "Canary" monitoring), or only have a very small number of devices to keep an eye on. What happens if you're in the middle somewhere? What happens if you're big enough to have a staff, but not big enough to have one dedicated person assigned to sitting around watching for failures?
If you are like me, you buy a great monitoring solution like SolarWinds's Network Performance Monitor (NPM). You do the initial installation, run through some wizards to get some basic monitoring up and going, then you start playing with the "nerd knobs" in the software, and boy does NPM have that in spades. NPM monitors everything from port states to power supplies, link errors to wireless authentications, and everything in between. You can then send alerts to a distribution list that your team tracks, assign responsibilities for monitoring and escalation, and everyone on the team now has visibility into failures and problems anywhere on the network.
Nirvana. Bliss. You have now become the Network Whisperer, a beacon of light in the darkness. Except, now you have 150 email messages about a printer because that printer has been offline for a few hours waiting for a part and someone forgot to kill the monitoring for that device. Easily fixed. Ooh… 300 more emails… someone rebooted a server.
I exaggerate a bit, but you see the problem. Pretty soon you set up mail rules, you shove your monitoring messages into a folder, and your entire monitoring solution is reduced to an in-house spam-generation machine. You find you don't actually know when stuff breaks because you're so used to the mail folder filling up, you ignore it all. The only thing you've accomplished is the creation of another after-the-fact analysis tool. Someone from accounting tells you that so-and-so system is down, you can quickly see that, yes, it is down. Well, go you.
I'll talk about how we work to solve this problem in my next post, but I'm curious how everyone here deals with these very real issues:
* What do you monitor and why?
* How do you avoid the "monitoring-spam" problem?