Auto-Diagnosing Kubernetes Alerts: How STCLab Uses HolmesGPT & CNCF Tools

STCLab’s SRE team automated their Kubernetes alert triage using HolmesGPT, Robusta, and Markdown runbooks. This reduced manual incident investigation time from 20 minutes to under 2 minutes per alert, with the LLM autonomously diagnosing 40% of common cluster issues.

Millie

Apr 27, 2026

Contents

Executive Summary
The Problem: Alert Fatigue & Manual Triage
The Solution: HolmesGPT + AI Runbooks
The Impact

Executive Summary

The Challenge: Severe alert fatigue. The SRE team was spending 15 to 20 minutes manually correlating data across Prometheus, Loki, Tempo, and clusters for every single Kubernetes alert.
The Solution: An automated first-pass triage pipeline powered by HolmesGPT (an LLM-driven agent that dynamically selects investigative tools), Robusta (for Slack integration and alert routing), and strict Markdown runbooks.
The Business Impact: Manual triage time fell from as much as 20 minutes to under 2 minutes per alert. The AI pipeline now autonomously and correctly diagnoses 40% of routine alerts (like OOMKilled or ImagePullBackOff) at a cost of just $0.04 per investigation.
Key Takeaway: Giving the LLM strict constraints via runbooks (telling it exactly which tools to use and what to ignore) improved investigation quality far more than upgrading the LLM itself.

The Problem: Alert Fatigue & Manual Triage

Even with a modern observability stack (Prometheus, Loki, Tempo), STCLab’s SRE team was spending 15 to 20 minutes manually correlating data for every single Kubernetes alert. They needed the first pass of triage to happen automatically.

The Solution: HolmesGPT + AI Runbooks

STCLab built an automated pipeline relying on three core components:

HolmesGPT (CNCF Sandbox): An LLM-driven agent that dynamically selects tools (like kubectl commands or PromQL queries) to investigate live cluster data.
Robusta Custom Playbook: A 200-line Python "glue code" script that manages operational logistics. It matches LLM outputs to the correct Slack threads, routes to namespace-specific channels, and deduplicates noisy workload-level alerts.
Strict Markdown Runbooks: Instead of letting the AI guess, STCLab gave the model strict boundaries (e.g., "skip Loki in this namespace, use kubectl logs only").
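
The routing and deduplication duties described for the glue code can be sketched in a few lines. This is a minimal illustration, not STCLab's actual playbook: it does not use the Robusta API, and the `Alert` fields, channel names, and 10-minute window are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Alert:
    name: str       # alert name, e.g. "KubePodCrashLooping"
    namespace: str
    workload: str   # owning Deployment/StatefulSet, not the individual pod
    pod: str


# Hypothetical namespace-to-channel map; unknown namespaces use the default.
CHANNELS = {"payments": "#alerts-payments", "web": "#alerts-web"}
DEFAULT_CHANNEL = "#alerts-sre"


def channel_for(alert: Alert) -> str:
    """Pick the Slack channel from the alert's namespace."""
    return CHANNELS.get(alert.namespace, DEFAULT_CHANNEL)


@dataclass
class Deduplicator:
    """Suppress repeat alerts for the same workload inside a time window."""
    window_seconds: float = 600.0  # assumed value, not from the source
    _last_seen: dict = field(default_factory=dict)

    def should_investigate(self, alert: Alert, now: float) -> bool:
        # Key on the workload so every pod of one Deployment shares an entry.
        key = (alert.name, alert.namespace, alert.workload)
        last = self._last_seen.get(key)
        if last is not None and now - last < self.window_seconds:
            return False  # same workload alerted recently: skip the LLM call
        self._last_seen[key] = now
        return True
```

Keying on the workload rather than the pod means that ten crashing replicas of one Deployment trigger a single investigation instead of ten.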

The Impact

Giving the AI strict runbook constraints proved far more effective than simply upgrading to a larger LLM.
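
As a rough illustration of what such a constraint might look like in practice, the sketch below attaches a runbook to the investigation instructions when one exists for the alert. The `RUNBOOKS` registry, the `build_instructions` helper, the alert name, and the `batch` namespace are hypothetical; only the "skip Loki, use kubectl logs only" rule comes from the article, and this is not HolmesGPT's own runbook mechanism.

```python
# Hypothetical registry keyed by (alert name, namespace).
RUNBOOKS = {
    ("KubePodCrashLooping", "batch"): (
        "# KubePodCrashLooping in batch\n"
        "- Skip Loki in this namespace; use kubectl logs only.\n"
    ),
}


def build_instructions(alert_name: str, namespace: str) -> str:
    """Return the text handed to the investigating agent for one alert."""
    base = f"Investigate alert {alert_name} in namespace {namespace}."
    runbook = RUNBOOKS.get((alert_name, namespace))
    if runbook is None:
        return base  # no constraints known: the agent picks its own tools
    return (
        base
        + "\nFollow these steps strictly and do not use tools they exclude:\n\n"
        + runbook
    )
```

The point of the design is that the constraint travels with the alert: the model is told up front which data sources are useless in a given namespace rather than discovering it through failed tool calls.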

Faster Triage: A manual investigation of up to 20 minutes was replaced by an AI summary that can be read in under 2 minutes.
High Auto-Diagnosis Rate: The LLM now autonomously and correctly diagnoses 40% of routine issues (like OOMKilled or ImagePullBackOff).
Low Cost: Automated investigations cost just $0.04 per alert (roughly $12/month).
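
The cost figures above imply an alert volume that the article does not state directly. The back-of-the-envelope calculation below derives it, along with an upper bound on engineer time saved. The time figure assumes that every investigation replaces a full 20-minute manual pass, which is an assumption rather than a reported result.

```python
COST_PER_INVESTIGATION_USD = 0.04   # from the article
MONTHLY_COST_USD = 12.0             # from the article
MANUAL_MINUTES = 20                 # upper end of the 15-20 minute range
AI_SUMMARY_MINUTES = 2              # "under 2 minutes", so treated as a ceiling

# Implied volume: monthly spend divided by per-investigation cost.
investigations_per_month = round(MONTHLY_COST_USD / COST_PER_INVESTIGATION_USD)

# Upper-bound time saved if each investigation replaces a full manual pass.
minutes_saved = investigations_per_month * (MANUAL_MINUTES - AI_SUMMARY_MINUTES)
hours_saved = minutes_saved / 60

print(f"{investigations_per_month} investigations/month")
print(f"up to {hours_saved:.0f} engineer-hours saved/month")
```

Under these assumptions the pipeline handles about 300 investigations a month, or roughly 10 per day.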

Learn About STCLab ➔ https://cloud.stclab.com

Read the full story on the CNCF blog ➔ https://www.cncf.io/blog/2026/04/21/auto-diagnosing-kubernetes-alerts-with-holmesgpt-and-cncf-tools/
