Beware of silent failures

What's worse than experiencing a failure in a production system? I'll tell you: not knowing that it occurred.

I was inspired to write this post after seeing a neat presentation at work about a new internal tool. One fact that jumped out at me during the presentation was that, if a failure occurred while running a periodic background task, an email would be fired off informing the admin(s) of the problem. On the surface, that seems like a reasonable action to take: be silent unless something's wrong.

But there's a fundamental flaw in this concept. If something does indeed go wrong, how can you be sure that email will still function? Or, more generally, if a failure occurs, how can you be sure that a push-based notification of that failure can still be delivered? The answer, of course, is that you can't. And that's why you shouldn't rely on it.

So, what options are there? Fundamentally, you can do two things to verify that your system is working correctly. The first is to push success notifications, in addition to or instead of failure ones. In the particular example of the internal tool, it could send out an email every time the periodic task completes successfully, or, if the task runs often, a summary after every 30th run. The point here is to let the admin(s) know that the system is working as expected. If the emails suddenly stop or begin reporting failures (e.g., "Tasks ran; 2 of 30 failed"), then appropriate action can be taken to remedy the situation.
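To make the "summary every Nth run" idea concrete, here's a rough Python sketch. Everything in it is made up for illustration: the `RunReporter` name, the `notify` callback (which in practice might send an email), and the default of 30 runs. The internal tool I saw may work nothing like this.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RunReporter:
    """Counts task runs and pushes a summary every `every` runs (hypothetical helper)."""

    notify: Callable[[str], None]  # assumption: delivers the message, e.g. by email
    every: int = 30                # assumption: summarize after this many runs
    runs: int = 0
    failures: int = 0

    def record(self, succeeded: bool) -> None:
        # Call this once after each run of the periodic task.
        self.runs += 1
        if not succeeded:
            self.failures += 1
        if self.runs >= self.every:
            # A summary goes out whether or not anything failed,
            # so silence itself becomes the warning sign.
            self.notify(f"Tasks ran; {self.failures} of {self.runs} failed")
            self.runs = 0
            self.failures = 0
```

Wiring it up would look something like `reporter = RunReporter(notify=send_email)`, where `send_email` is whatever delivery function you already have, followed by a `reporter.record(ok)` call at the end of each run.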

But what if you don't want to spam your admin(s) with useless status reports that they'll just ignore anyway? Well, it should be part of their job to monitor this stuff, so don't be afraid to do it! Alternatively, you could use a pull-based approach to this problem. If you have an external monitoring solution set up, use it to get status reports from your system. You can have your system publish a report of its background activities at a special URL or a shared network location and then have your external monitoring solution periodically check that report for problems. For websites, this could even be achieved by using uptime monitoring services like Uptime Robot. You can have a special reporting URL show its status (as simple as "OK" or "ERROR"), computed on each request from two checks: whether any failures occurred and whether the periodic activities actually ran. Then the uptime monitoring service can check for keywords (like "OK") in the reporting page's contents to verify proper functionality, and alert you should the keywords fail to match.
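Here's a minimal sketch of what such a reporting URL could look like, using only Python's standard library. The names (`status_text`, `StatusHandler`, the `state` dict), the 15-minute staleness threshold, and the port are all assumptions of mine; your background task would need to update `state` after each run.

```python
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

MAX_AGE_SECONDS = 15 * 60  # assumption: the task should run at least this often

# Hypothetical shared state, updated by the periodic task after each run.
state = {"last_run": None, "last_run_failed": False}


def status_text(last_run, last_run_failed, now, max_age=MAX_AGE_SECONDS):
    """Return "OK" only if the task ran recently AND did not fail."""
    if last_run is None or now - last_run > max_age:
        return "ERROR"  # the task never ran, or has silently stopped running
    if last_run_failed:
        return "ERROR"  # the task ran, but reported a failure
    return "OK"


class StatusHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Always answer 200; the monitor matches on the keyword in the body.
        body = status_text(
            state["last_run"], state["last_run_failed"], time.time()
        ).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    # Blocks forever; point the external monitor at http://<host>:<port>/
    HTTPServer(("127.0.0.1", port), StatusHandler).serve_forever()
```

The important bit is the first check in `status_text`: a task that never runs produces no failures, so "no failures recorded" alone would happily report "OK" forever.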

Of course, if you have your own separate monitoring solution, you'll need to ensure that the monitoring solution itself continues to function properly. Yes, what I'm getting at is that at some point you should still have periodic success notifications for certain critical services. After all, if your monitoring solution stops being able to alert you, that'd be a major cause for concern.

 