Advice on Setting Up Monitoring Alarms for Your Service

Basil A.
Feb 5, 2023

These are simple rules that you should take seriously when setting up alarms for your service:

Rule #1: Alert on the Symptom (The What), not the Cause (The Why)

First, make sure your alarm alerts on the symptom, meaning the pain being felt (“the what”), and not on the cause (“the why”).

Examples of Symptoms:

  • Responses are slow (latency higher than 500 ms)
  • 5xx HTTP responses are being returned by the service
  • 404 HTTP responses are being returned by the service

Examples of Causes:

  • A service that yours depends on is down or experiencing high load
  • A database that your service depends on ran out of connections
  • A machine that your service depends on is experiencing high CPU usage
  • A package dependency was upgraded and now returns an unexpected error message
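To make the distinction concrete, here is a minimal sketch of a check that looks only at symptoms. The `Request` record, the function name, and the 1% error-ratio limit are assumptions made for illustration; only the 500 ms latency threshold comes from the examples above. It is not tied to any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """One served request, as a hypothetical monitoring record."""
    status: int        # HTTP status code returned to the caller
    latency_ms: float  # time taken to serve the request


def symptom_alerts(requests, latency_threshold_ms=500.0, max_error_ratio=0.01):
    """Return the symptoms observed in one window of requests.

    Only what the caller experiences is inspected (slow or failed
    responses). Nothing here looks at CPU, memory, or dependencies.
    """
    if not requests:
        return []
    alerts = []

    # Symptom: responses are slow (upper median latency above the threshold).
    latencies = sorted(r.latency_ms for r in requests)
    median = latencies[len(latencies) // 2]
    if median > latency_threshold_ms:
        alerts.append("slow responses")

    # Symptom: the service is returning 5xx responses.
    # The 1% limit is an assumed value; pick one that fits your service.
    errors = sum(1 for r in requests if 500 <= r.status <= 599)
    if errors / len(requests) > max_error_ratio:
        alerts.append("5xx responses")

    return alerts
```

A cause such as an exhausted connection pool would show up here only through its effect on callers, which is the point of the rule.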

Rule #2: Alarms should be Actionable to find out the Cause (Why?)

“Actionable” here means that the responder to the alarm has all the information necessary to debug the incident. If the service is missing traceability logs, then the incident is not actionable, since the responder doesn’t have enough information to perform root cause analysis.

Make sure your service is actionable by covering these points:

  • Does the service provide sufficient logs and traces?
  • Are the error messages meaningful, and do they contain enough context to help debug the issue?
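One way to attach context to error messages is to log structured records. The sketch below uses only Python's standard `logging` and `json` modules; the `JsonFormatter` class, the `log_failure` helper, the `context` field, and the identifiers in the usage lines (`request_id`, `dependency`) are hypothetical names chosen for illustration.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object, including context fields."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            # 'context' is a hypothetical attribute supplied via `extra=`.
            "context": getattr(record, "context", {}),
        }
        return json.dumps(payload, sort_keys=True)


def log_failure(logger, message, **context):
    """Log an error together with the identifiers a responder would need."""
    logger.error(message, extra={"context": context})


# Usage: the emitted line carries the request and the failing dependency.
logger = logging.getLogger("orders")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
log_failure(logger, "payment lookup failed",
            request_id="req-123", dependency="payments-db")
```

With records like this, a responder can search for a single request identifier across services instead of guessing from a bare error string.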

Rule #3: Alarms that have many “false-positives” should be removed

Alarms with a high rate of false positives are better removed or suppressed. For example, alerting on low-memory conditions when we already know that a process consumes a lot of memory from time to time will generate many false positives, so that alarm is better removed.
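A rough way to spot such alarms is to track how often a firing actually required action. The sketch below assumes you keep a per-alarm history of booleans (True when the responder had to act); the function names and the 50% cutoff are assumptions for illustration, not a recommended standard.

```python
def false_positive_rate(firings):
    """Fraction of alarm firings that needed no action.

    `firings` is a list of booleans, one per time the alarm fired:
    True when the responder had to act, False when it was noise.
    """
    if not firings:
        return 0.0
    noise = sum(1 for acted in firings if not acted)
    return noise / len(firings)


def should_review(firings, max_false_positive_rate=0.5):
    """Flag an alarm for removal or tuning when most firings were noise."""
    return false_positive_rate(firings) > max_false_positive_rate


# Usage: a low-memory alarm that was real only once in four firings.
low_memory_history = [True, False, False, False]
print(false_positive_rate(low_memory_history))  # 0.75
print(should_review(low_memory_history))        # True
```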

To avoid False-Positive Alarms, ask these questions:

  1. Is another alarm firing at the same time for a different service or team? If yes, then one of the two alarms should probably be removed. To decide, remove the alarm that is farther away from the service where the root cause lies.
  2. Is the incident affecting the customer? If there is no customer impact, then some signals should probably be filtered out of the alarm to make it more meaningful. For example, if the service was unavailable for 4 minutes but no users were using the system during that time, it should not alarm. For more on this approach, see Google’s Meaningful Availability paper (3).
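The second question can be sketched as a filter that counts only the downtime users actually experienced. This is a simplified illustration and not the metric defined in the Meaningful Availability paper; the function names, the per-minute data layout, and the one-minute threshold are all assumptions.

```python
def impacted_minutes(down_minutes, active_users_per_minute):
    """Count the down minutes during which at least one user was active.

    `down_minutes` is an iterable of minute indexes when the service was
    unavailable; `active_users_per_minute` maps a minute index to the
    number of users active in that minute (missing minutes mean zero).
    """
    return sum(
        1 for minute in down_minutes
        if active_users_per_minute.get(minute, 0) > 0
    )


def should_alarm(down_minutes, active_users_per_minute, min_impacted_minutes=1):
    """Alarm only when the outage overlapped with real user activity."""
    impacted = impacted_minutes(down_minutes, active_users_per_minute)
    return impacted >= min_impacted_minutes


# Usage: a 4-minute outage with nobody online does not alarm.
outage = [10, 11, 12, 13]
print(should_alarm(outage, {}))       # False
print(should_alarm(outage, {12: 3}))  # True
```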

References:

(1) Google’s SRE Book. Monitoring Distributed Systems. https://sre.google/sre-book/monitoring-distributed-systems/#tying-these-principles-together-nqsJfw

(2) Observability Engineering (O’Reilly). Chapter 12: Using Service-Level Objectives for Reliability.

(3) Google’s Meaningful Availability paper. https://www.usenix.org/conference/nsdi20/presentation/hauer

Basil A.

A Software Engineer with interests in System Design and Software Engineering done right.