
Kubernetes Alerting Done Right: How to Turn Signals Into Actionable Alerts (Not Noise)

Most Kubernetes alerting setups produce more noise than signal. Engineers get paged at 3am for non-critical warnings, alert fatigue sets in, and real incidents get lost in the flood. Here is how to build an alerting system that surfaces what matters and routes it where it needs to go.

Davi Nunes
March 17, 2026 · 14 min read

The average SRE team receives over 500 alerts per week. Of those, fewer than 5% require human intervention. The rest are noise — transient spikes, informational warnings, and duplicate notifications that train engineers to ignore alerts entirely.

This is not a technology problem. Prometheus, Grafana, and modern monitoring stacks can detect anything. The problem is that most teams alert on everything instead of alerting on what matters.

Effective Kubernetes alerting is built on three principles: alert only on symptoms that affect users, route notifications to the right channel at the right urgency, and continuously refine the system based on what actually required action.

Here is how to implement these principles across your Kubernetes infrastructure.

The Alert Fatigue Problem

Alert fatigue is the single biggest risk to operational reliability. When engineers receive hundreds of alerts daily, they develop coping mechanisms: muting channels, ignoring notifications, and assuming alerts are false positives. When a real incident occurs, it gets the same treatment — ignored until a customer reports the problem.

The root cause is usually one of three patterns.

Alerting on causes instead of symptoms. An alert for "CPU usage above 80%" fires constantly but rarely indicates an actual problem. Kubernetes handles high CPU utilization well — what matters is whether latency or error rates have increased as a result.

No severity differentiation. Critical outages and informational warnings arrive through the same channel with the same urgency. When everything is urgent, nothing is urgent.

Missing deduplication. A single failing pod generates alerts from the pod itself, the deployment controller, the node, and the monitoring stack — four notifications for one problem.

Building an Alerting Strategy That Works

Step 1: Define What Deserves an Alert

Not every metric anomaly needs a human response. Structure your alerts into three tiers.

Tier 1 — Page (wake someone up): User-facing impact right now. Error rate above SLO threshold. Full service outage. Data loss risk. Security breach indicators. These go to PagerDuty or OpsGenie and page the on-call engineer.

Tier 2 — Notify (next business day): Potential impact if not addressed soon. Disk usage above 85%. Certificate expiring in 7 days. Pod restart count increasing. Deployment stuck in rollout. These go to a Slack channel or email — urgent but not immediate.

Tier 3 — Record (for investigation): Informational signals useful for debugging. Temporary CPU spikes. Single pod restarts. Brief network latency increases. These go to dashboards and logs, not to humans.

Most teams start with everything in Tier 1. The discipline is moving things down.
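The three tiers can be sketched as a small classification function. This is a minimal illustration, not a prescribed implementation: the `Signal` fields and the `classify` function are hypothetical names chosen for this example, and a real system would derive these flags from SLOs and alert labels.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    PAGE = 1    # wake someone up
    NOTIFY = 2  # next business day
    RECORD = 3  # dashboards and logs only


@dataclass
class Signal:
    name: str
    user_facing_impact: bool     # users are affected right now
    data_or_security_risk: bool  # data loss risk or breach indicator
    needs_action_soon: bool      # becomes a problem if left alone


def classify(signal: Signal) -> Tier:
    """Map a signal to a tier, defaulting to the lowest tier."""
    if signal.user_facing_impact or signal.data_or_security_risk:
        return Tier.PAGE
    if signal.needs_action_soon:
        return Tier.NOTIFY
    return Tier.RECORD
```

Defaulting to `Tier.RECORD` mirrors the discipline described above: a signal has to earn its way up to a page rather than start there.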

Step 2: Alert on Symptoms, Not Causes

The golden rule of SRE alerting: alert on what users experience, not on what you think might cause problems.

Instead of: "Node CPU above 90%"
Alert on: "P99 request latency exceeds 500ms for 5 minutes"

Instead of: "Pod memory usage above 80%"
Alert on: "OOMKill events detected in production namespace"

Instead of: "Disk usage above 75%"
Alert on: "Projected disk exhaustion within 4 hours at current write rate"

Symptom-based alerts reduce noise by 80% because they only fire when users are actually affected. A node running at 95% CPU is fine if latency is normal. A node at 40% CPU with spiking latency is a real problem.
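To make the disk example concrete, here is a rough sketch of a "projected exhaustion" check. It assumes a simple linear extrapolation between the first and last usage sample, which is only one of several reasonable ways to project growth; the function names and the sample format are illustrative.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, bytes used)


def hours_until_full(samples: Sequence[Sample], capacity_bytes: float) -> Optional[float]:
    """Linear projection from the first and last sample.

    Returns None when there is too little data or usage is not growing.
    """
    if len(samples) < 2:
        return None
    (t0, used0), (t1, used1) = samples[0], samples[-1]
    if t1 <= t0:
        return None
    rate = (used1 - used0) / (t1 - t0)  # bytes per second
    if rate <= 0:
        return None
    remaining = max(capacity_bytes - used1, 0.0)
    return remaining / rate / 3600.0


def should_alert(samples: Sequence[Sample], capacity_bytes: float,
                 horizon_hours: float = 4.0) -> bool:
    """Fire only when the disk is projected to fill within the horizon."""
    eta = hours_until_full(samples, capacity_bytes)
    return eta is not None and eta <= horizon_hours
```

With these definitions, a disk sitting at 76% that grows slowly never fires, while a disk at 80% that gained 10 GB in the last hour with 20 GB left does.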

Step 3: Implement Smart Routing

Different alerts need different channels. A critical outage should not arrive in the same Slack channel as a certificate renewal reminder.

Configure routing rules based on severity and service ownership:

Slack channels: Create dedicated channels per team or service area. Route Tier 2 alerts to the owning team's channel. Include enough context in the message for the engineer to decide whether to act now or later.

Email notifications: Use for Tier 2 alerts that do not need immediate visibility but should be tracked. Tier 3 signals still never notify individually, but a weekly digest email summarizing their trends helps teams identify slow-burning issues.

Webhooks: Integrate with ticketing systems (Jira, Linear) to automatically create tickets for Tier 2 alerts. This ensures nothing falls through the cracks without adding notification pressure.

PagerDuty/OpsGenie: Reserved exclusively for Tier 1. If an alert is not worth waking someone up at 3am, it does not belong in the paging system.
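The routing rules above can be expressed as a small lookup. This is a sketch under assumptions: the destination strings (such as `pager:payments-oncall`) are made-up identifiers for illustration, not the configuration format of any particular tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Alert:
    name: str
    tier: int  # 1 = page, 2 = notify, 3 = record
    team: str  # owning team, for example "payments"


def route(alert: Alert) -> List[str]:
    """Return illustrative destination identifiers for an alert."""
    if alert.tier == 1:
        # The paging system is reserved for Tier 1.
        return [f"pager:{alert.team}-oncall"]
    if alert.tier == 2:
        # Team channel for visibility, plus a ticket so it is tracked.
        return [f"slack:#{alert.team}-alerts", f"ticket:{alert.team}"]
    # Tier 3 goes to dashboards and logs, not to people.
    return ["dashboard"]
```

Keeping the routing decision in one function makes it easy to verify that nothing below Tier 1 can ever reach the pager.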

Platforms like SRExpert provide multi-channel alert routing with smart throttling built in. You define alert rules from Prometheus metrics and Kubernetes events, set severity levels, and route to Slack, Email, PagerDuty, or custom webhooks. The platform handles deduplication and delivery tracking, so you know every alert was delivered and acknowledged — no more "I did not see the alert" during postmortems.

Step 4: Add Intelligent Throttling and Deduplication

A single incident can trigger dozens of related alerts. Without deduplication, the on-call engineer receives 50 notifications for one problem and wastes time triaging instead of fixing.

Alert grouping: Group alerts by namespace, service, or failure domain. If 10 pods in the same deployment are failing, send one alert with the count — not 10 individual alerts.

Throttle repeated alerts: If the same alert fires continuously, send the initial notification, a reminder at 15 minutes if unacknowledged, and then suppress until the condition changes. Do not send the same alert every 60 seconds for hours.

Correlation: When multiple alerts fire within a 2-minute window for related services, correlate them into a single incident. A database alert plus an API alert plus a frontend alert is one incident, not three.

Inhibition: If a node-level alert fires, suppress pod-level alerts on that node. The node problem is the root cause, and the individual pod alerts are downstream effects of it that add no new information.
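Grouping, throttling, and inhibition can be sketched in a few lines. The names below (`Alert`, `inhibit`, `group`, `Throttle`) are hypothetical and the logic is deliberately simplified; in a Prometheus stack, Alertmanager provides grouping and inhibition natively, so treat this as an explanation of the ideas rather than a replacement.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Alert:
    name: str
    namespace: str
    deployment: str
    node: str
    level: str  # "node" or "pod"


def inhibit(alerts: List[Alert]) -> List[Alert]:
    """Drop pod-level alerts on nodes that already have a node-level alert."""
    bad_nodes = {a.node for a in alerts if a.level == "node"}
    return [a for a in alerts if a.level == "node" or a.node not in bad_nodes]


def group(alerts: List[Alert]) -> Dict[Tuple[str, str, str], int]:
    """Count alerts per (name, namespace, deployment): one notification, with a count."""
    counts: Dict[Tuple[str, str, str], int] = defaultdict(int)
    for a in alerts:
        counts[(a.name, a.namespace, a.deployment)] += 1
    return dict(counts)


class Throttle:
    """Send the first notification, one reminder after 15 minutes, then stay quiet."""

    def __init__(self, reminder_after_s: float = 15 * 60):
        self.reminder_after_s = reminder_after_s
        # key -> (time of first notification, reminder already sent)
        self._state: Dict[str, Tuple[float, bool]] = {}

    def should_send(self, key: str, now: float, acknowledged: bool = False) -> bool:
        if key not in self._state:
            self._state[key] = (now, False)
            return True
        first, reminded = self._state[key]
        if not acknowledged and not reminded and now - first >= self.reminder_after_s:
            self._state[key] = (first, True)
            return True
        return False

    def resolve(self, key: str) -> None:
        """Forget the alert once the condition changes, so a recurrence notifies again."""
        self._state.pop(key, None)
```

Running `inhibit` before `group` means a failed node produces one node alert instead of a node alert plus one grouped notification per affected deployment.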

Step 5: Monitor Your Alerting System

Your alerting system itself needs monitoring. Channel health checks ensure that Slack webhooks, email relays, and PagerDuty integrations are actually working. A perfectly configured alert that cannot reach the on-call engineer is worse than no alert at all.

Track these meta-metrics about your alerting system:

Alert delivery rate: What percentage of alerts are successfully delivered to their configured channel? Webhook failures, email bounces, and API rate limits can silently drop alerts.

Acknowledgment time: How long between alert delivery and human acknowledgment? If Tier 1 alerts average 30 minutes to acknowledge, your routing or escalation needs adjustment.

False positive rate: What percentage of Tier 1 alerts do not require action? If more than 20% are false positives, your alert thresholds need tuning.

Alert-to-incident ratio: How many alerts does it take to identify one real incident? A ratio above 10:1 indicates excessive noise.
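As a rough sketch, these four meta-metrics can be computed from a log of alert records. The `AlertRecord` shape is an assumption made for this example; real data would come from your notification provider or incident tracker.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AlertRecord:
    tier: int
    delivered: bool
    ack_seconds: Optional[float]  # None if never acknowledged
    required_action: bool


def delivery_rate(records: List[AlertRecord]) -> Optional[float]:
    """Fraction of alerts that reached their configured channel."""
    if not records:
        return None
    return sum(r.delivered for r in records) / len(records)


def mean_ack_seconds(records: List[AlertRecord], tier: int = 1) -> Optional[float]:
    """Average time to acknowledgment for acknowledged alerts in a tier."""
    acks = [r.ack_seconds for r in records
            if r.tier == tier and r.ack_seconds is not None]
    return sum(acks) / len(acks) if acks else None


def false_positive_rate(records: List[AlertRecord], tier: int = 1) -> Optional[float]:
    """Fraction of alerts in a tier that did not require action."""
    in_tier = [r for r in records if r.tier == tier]
    if not in_tier:
        return None
    return sum(not r.required_action for r in in_tier) / len(in_tier)


def alert_to_incident_ratio(alert_count: int, incident_count: int) -> Optional[float]:
    """Alerts fired per real incident; None when there were no incidents."""
    return alert_count / incident_count if incident_count else None
```

Each function returns `None` instead of a number when there is no data, so an empty week is not mistaken for a perfect one.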

SRExpert includes channel health checks and delivery tracking for every configured notification channel. If a Slack webhook starts failing or an email relay goes down, you get notified about the notification failure — ensuring your alerting system never silently degrades.

Practical Alert Rules for Kubernetes

Here are battle-tested alert rules that balance signal quality with coverage.

Workload Health

High error rate: Alert when the 5xx error rate exceeds 1% of total requests for 5 minutes. This catches real service degradation while ignoring transient errors.

Pod CrashLoopBackOff: Alert when a pod has restarted more than 5 times in 15 minutes. CrashLoopBackOff is almost always a real problem — bad config, missing dependencies, or resource exhaustion.

Deployment rollout stuck: Alert when a deployment has not completed its rollout within 10 minutes. Stuck rollouts indicate image pull failures, health check failures, or resource contention.

HPA at maximum: Alert when the Horizontal Pod Autoscaler has been at maximum replicas for 30 minutes. This means the service cannot handle current load with its configured scaling limits.
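Two of the workload rules above, sketched as plain functions. This assumes request counts are already bucketed per minute and restart times are available as timestamps; in practice a Prometheus rule with a `for:` duration would express the same conditions, and the function names here are illustrative.

```python
from typing import Sequence, Tuple


def error_rate_sustained(per_minute: Sequence[Tuple[int, int]],
                         threshold: float = 0.01,
                         minutes: int = 5) -> bool:
    """per_minute holds (total_requests, errors_5xx) pairs, oldest first.

    Fires only if every one of the last `minutes` buckets is above the threshold,
    so a single bad minute does not trigger the alert.
    """
    window = per_minute[-minutes:]
    if len(window) < minutes:
        return False
    return all(total > 0 and errors / total > threshold
               for total, errors in window)


def restarts_exceeded(restart_times: Sequence[float], now: float,
                      limit: int = 5, window_s: float = 15 * 60) -> bool:
    """True when more than `limit` restarts happened in the last `window_s` seconds."""
    recent = [t for t in restart_times if now - window_s <= t <= now]
    return len(recent) > limit
```

Requiring every bucket in the window to breach the threshold is what filters out transient errors: one minute at a 50% error rate followed by recovery stays silent.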

Infrastructure Health

Node not ready: Alert immediately when a node transitions to NotReady. This affects all pods scheduled on that node.

Persistent volume near capacity: Alert when PV usage exceeds 85%. In most configurations a persistent volume does not grow on its own: expanding it requires a StorageClass that allows volume expansion and an explicit change to the PersistentVolumeClaim.

etcd latency: Alert when etcd commit duration stays above 100ms for a sustained period, not on a single slow commit. High etcd latency affects the entire control plane.
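The node and volume checks can be illustrated against plain dictionaries shaped like Kubernetes API objects. The sketch assumes the node object has already been fetched (for example as JSON from the API server) and that volume usage figures come from your metrics pipeline; the function names are hypothetical.

```python
def node_is_ready(node: dict) -> bool:
    """Inspect status.conditions for the condition of type "Ready".

    The status value is the string "True", "False", or "Unknown".
    """
    for cond in node.get("status", {}).get("conditions", []):
        if cond.get("type") == "Ready":
            return cond.get("status") == "True"
    # No Ready condition reported: treat the node as not ready.
    return False


def volume_near_capacity(used_bytes: float, capacity_bytes: float,
                         threshold: float = 0.85) -> bool:
    """True when usage exceeds the threshold fraction of capacity."""
    return capacity_bytes > 0 and used_bytes / capacity_bytes > threshold
```

Treating `"Unknown"` the same as `"False"` is a deliberate choice here: a node whose kubelet has stopped reporting is as unavailable to its pods as one that reports failure.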

Security Events

Privileged container detected: Alert when a new pod runs with privileged security context in a production namespace. This should never happen in a properly secured cluster.

Unexpected namespace creation: Alert on new namespace creation outside of your GitOps pipeline. Manual namespace creation often indicates unauthorized access or shadow operations.
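A privileged-container check can be sketched against a pod manifest loaded as a dictionary. It reads `securityContext.privileged` on each container entry of the pod spec; the function names and the `production_namespaces` parameter are assumptions for this example, and a real deployment would more likely enforce this with an admission policy and alert on violations.

```python
from typing import Iterable, List


def privileged_containers(pod: dict) -> List[str]:
    """Return names of containers that set securityContext.privileged to true."""
    spec = pod.get("spec") or {}
    names: List[str] = []
    for field in ("containers", "initContainers", "ephemeralContainers"):
        for container in spec.get(field) or []:
            security = container.get("securityContext") or {}
            if security.get("privileged") is True:
                names.append(container.get("name", "<unnamed>"))
    return names


def should_alert(pod: dict, production_namespaces: Iterable[str]) -> bool:
    """Alert only for privileged containers in a production namespace."""
    namespace = (pod.get("metadata") or {}).get("namespace", "default")
    return namespace in set(production_namespaces) and bool(privileged_containers(pod))
```

Checking init and ephemeral containers as well as regular ones closes an easy gap, since a privileged init container is just as much a finding as a privileged main container.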

Continuous Improvement: The Alert Review Process

Alerting is never "done." Schedule a monthly alert review with your SRE team.

Review every Tier 1 alert from the past month. For each one, ask: did it require human action? If not, move it to Tier 2 or tune the threshold. Was the alert actionable — did the engineer know what to do? If not, add a runbook link. Was the alert timely — did it fire before users reported the problem? If not, tighten the detection window.

Review Tier 2 alerts that were ignored. If a Tier 2 alert is consistently ignored for weeks, either it is not important (remove it) or the routing is wrong (move it to a more visible channel).

Review incidents that had no alerts. These are the most dangerous gaps. If an incident occurred without a corresponding alert, you have a detection blind spot that needs a new alert rule.
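Part of the monthly review can be prepared automatically. The sketch below lists Tier 1 rules whose firings rarely or never required action, as candidates to demote or retune; the `Firing` tuple and the `max_action_rate` parameter are illustrative assumptions, and the final decision still belongs to the review meeting.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Firing = Tuple[str, int, bool]  # (rule name, tier, required human action)


def demotion_candidates(firings: Iterable[Firing],
                        max_action_rate: float = 0.0) -> List[str]:
    """Tier 1 rules whose share of actionable firings is at or below max_action_rate."""
    fired: Dict[str, int] = defaultdict(int)
    actioned: Dict[str, int] = defaultdict(int)
    for rule, tier, required_action in firings:
        if tier != 1:
            continue
        fired[rule] += 1
        actioned[rule] += int(required_action)
    return sorted(rule for rule in fired
                  if actioned[rule] / fired[rule] <= max_action_rate)
```

With the default of `0.0`, only rules that never required action are listed; raising it to, say, `0.2` widens the net to rules that were actionable in at most one firing out of five.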

This monthly cycle continuously reduces noise, closes detection gaps, and ensures your alerting system evolves with your infrastructure.

Conclusion

Great Kubernetes alerting is not about detecting more — it is about detecting better. Every alert in your system should either wake someone up, trigger a next-business-day action, or feed a dashboard. If it does not fit one of these categories, it should not exist.

Build your alerting around user-facing symptoms, not infrastructure metrics. Route alerts to the right channel at the right severity. Deduplicate and throttle to prevent fatigue. Monitor your monitoring to ensure reliability. And review continuously to keep the system sharp.

The goal is an alerting system where every notification is expected, understood, and actionable. When your on-call engineer sees an alert, their first thought should be "I know what to do" — not "here we go again."

