
    Auto-Diagnosing Kubernetes Alerts: How STCLab Uses HolmesGPT & CNCF Tools

    STCLab’s SRE team automated their Kubernetes alert triage using HolmesGPT, Robusta, and Markdown runbooks. This reduced manual incident investigation time from 20 minutes to under 2 minutes per alert, with the LLM autonomously diagnosing 40% of common cluster issues.
    Millie
    Apr 27, 2026
    Contents
    Executive Summary
    The Problem: Alert Fatigue & Manual Triage
    The Solution: HolmesGPT + AI Runbooks
    The Impact

    Executive Summary

    • The Challenge: Severe alert fatigue. For every Kubernetes alert, the SRE team was spending 15 to 20 minutes manually correlating data across Prometheus, Loki, Tempo, and the clusters themselves.

    • The Solution: An automated first-pass triage pipeline powered by HolmesGPT (an LLM-driven agent that dynamically selects investigative tools), Robusta (for Slack integration and alert routing), and strict Markdown runbooks.

    • The Business Impact: Manual triage time fell from 20 minutes to under 2 minutes per alert. The AI pipeline now autonomously and correctly diagnoses 40% of routine alerts (like OOMKilled or ImagePullBackOff) at a cost of just $0.04 per investigation.

    • Key Takeaway: Constraining the LLM with strict runbooks (telling it exactly which tools to use and what to ignore) improved investigation quality far more than upgrading the LLM itself.


    The Problem: Alert Fatigue & Manual Triage

    Even with a modern observability stack (Prometheus, Loki, Tempo), STCLab’s SRE team was spending 15 to 20 minutes manually correlating data for every single Kubernetes alert. They needed the first pass of triage to happen automatically.

    The Solution: HolmesGPT + AI Runbooks

    STCLab built an automated pipeline relying on three core components:

    • HolmesGPT (CNCF Sandbox): An LLM-driven agent that dynamically selects tools (like kubectl or PromQL) to investigate live cluster data.

    • Robusta Custom Playbook: A 200-line Python "glue code" script that manages operational logistics. It matches LLM outputs to the correct Slack threads, routes to namespace-specific channels, and deduplicates noisy workload-level alerts.

    • Strict Markdown Runbooks: Instead of letting the AI guess, STCLab gave the model strict boundaries (e.g., "skip Loki in this namespace, use kubectl logs only").
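
    To make the playbook's responsibilities concrete, here is a minimal sketch of two of them: deduplicating workload-level alerts and routing by namespace. It is plain Python and deliberately does not use the Robusta API. Every name in it (Alert, Deduplicator, route, the channel map, the 10-minute window) is a hypothetical chosen for illustration, not taken from STCLab's actual script.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Alert:
    """Hypothetical minimal alert shape used only for this sketch."""
    name: str
    namespace: str
    workload: str
    pod: str


# Hypothetical namespace-to-channel mapping; real channel names will differ.
CHANNELS: Dict[str, str] = {
    "payments": "#alerts-payments",
    "platform": "#alerts-platform",
}
DEFAULT_CHANNEL = "#alerts-general"


def route(alert: Alert) -> str:
    """Pick the Slack channel for an alert based on its namespace."""
    return CHANNELS.get(alert.namespace, DEFAULT_CHANNEL)


class Deduplicator:
    """Suppress repeat investigations for the same workload within a window.

    Alerts are keyed by (alert name, namespace, workload), so several pods
    of one Deployment firing the same alert trigger one investigation.
    """

    def __init__(self, window_seconds: float = 600.0) -> None:
        self.window_seconds = window_seconds
        self._last_investigated: Dict[Tuple[str, str, str], float] = {}

    def should_investigate(self, alert: Alert, now: float) -> bool:
        key = (alert.name, alert.namespace, alert.workload)
        last = self._last_investigated.get(key)
        if last is not None and now - last < self.window_seconds:
            # Same workload was investigated recently: skip this one.
            return False
        # Record only alerts that are actually investigated, so the window
        # is measured from the last investigation, not the last duplicate.
        self._last_investigated[key] = now
        return True
```

    In this sketch a caller would pass a timestamp such as time.time() as now, call should_investigate first, and only then run the LLM investigation and post the result to the channel returned by route. Keying on the workload rather than the pod is the design choice that collapses a noisy Deployment into a single investigation.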

    The Impact

    Giving the AI strict runbook constraints proved far more effective than simply upgrading to a larger LLM.

    • Faster MTTR: A 20-minute manual investigation was replaced by an AI summary that can be read in under 2 minutes.

    • High Auto-Diagnosis Rate: The LLM now autonomously and correctly diagnoses 40% of routine issues (like OOMKilled or ImagePullBackOff).

    • Low Cost: Automated investigations cost just $0.04 per alert (roughly $12/month).
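
    The two cost figures above can be cross-checked with a little arithmetic. The sketch below assumes both numbers are exact, which the article does not claim (it says "roughly"). The function names are hypothetical, and the investigation count is derived here rather than reported by STCLab.

```python
# Figures quoted in the article.
COST_PER_INVESTIGATION_USD = 0.04
MONTHLY_COST_USD = 12.00


def implied_investigations_per_month(monthly_usd: float, per_investigation_usd: float) -> int:
    """Monthly investigation volume implied by the two quoted costs."""
    return round(monthly_usd / per_investigation_usd)


def projected_monthly_cost(investigations: int, per_investigation_usd: float) -> float:
    """Projected monthly spend at a different volume, assuming a flat per-investigation cost."""
    return round(investigations * per_investigation_usd, 2)


implied = implied_investigations_per_month(MONTHLY_COST_USD, COST_PER_INVESTIGATION_USD)
print(f"Implied volume: about {implied} investigations per month")
```

    Under that assumption, $12 / $0.04 implies about 300 investigations per month. The projection helper assumes the per-investigation cost stays flat, which may not hold if alert types or prompt sizes change.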

    Learn About STCLab ➔ https://cloud.stclab.com

    Read the full story on the CNCF blog ➔ https://www.cncf.io/blog/2026/04/21/auto-diagnosing-kubernetes-alerts-with-holmesgpt-and-cncf-tools/


    STCLab Inc.
