
    From Legacy to Cloud-Native: How STCLab Reduced Observability Costs by 72%

    We share how we resolved cost and performance challenges during our cloud native transition by adopting OpenTelemetry and the LGTM stack.
    Millie
    Dec 26, 2025
    Contents

    • The Challenge: High-Stakes Traffic Management
    • The Solution: Standardizing on OpenTelemetry
    • Transformative Results
    • Observability Architecture Overview: Why OTel & LGTM?
    • Key Architectural Decisions
    • Key Challenges We Encountered
    • Conclusion

    Originally featured on the CNCF Blog

    Posted on December 16, 2025 by Grace Park, DevOps Engineer, STCLab SRE Team


    The Challenge: High-Stakes Traffic Management

    At STCLab, we operate NetFUNNEL (a virtual waiting room solution) and BotManager (a bot mitigation solution), platforms that handle massive traffic surges during flash sales and global online voting. Supporting 3.5 million simultaneous users across 200 countries requires absolute visibility.

    By 2023, our 20-year legacy architecture reached its limit. As we rebuilt NetFUNNEL 4.x as a global, Kubernetes-native SaaS, we hit a wall with traditional monitoring:

    • The "Sampling Trap": High costs forced us to disable APM in dev/staging and sample only 5% of production traffic.

    • Reactive Firefighting: Performance bugs were only caught after they reached production.

    • Unsustainable Costs: Scaling our observability was becoming more expensive than scaling our service.

    The Solution: Standardizing on OpenTelemetry

    We migrated to a full CNCF-backed observability stack using OpenTelemetry (OTel) for instrumentation and the LGTM stack (Loki, Grafana, Tempo, and Mimir) for the backend.


    Transformative Results

    By moving away from proprietary vendors to an open-standard architecture, we achieved:

    | Metric | Previous Vendor | New OTel + LGTM Stack |
    | --- | --- | --- |
    | Total Cost | 100% (Baseline) | 72% Reduction |
    | APM Trace Coverage | 5% (Production only) | 100% (All Environments) |
    | Vendor Lock-in | High | Zero (CNCF Standard) |
    | Environment Parity | Production Only | Unified Dev/Staging/Prod |


    Observability Architecture Overview: Why OTel & LGTM?

    As we transitioned our core platforms to a global SaaS model, we realized that traditional proprietary observability tools created a kind of “success tax”: as traffic increased, the costs quickly became unsustainable. To monitor 3.5M concurrent users across NetFUNNEL and BotManager in a scalable and cost-efficient way, we built our foundation on the CNCF ecosystem.

    Key architectural decisions

    1. Centralized Backend via Multi-Tenancy

    Instead of the overhead of deploying full LGTM stacks in every cluster, we centralized all telemetry into a single management cluster. This ensures consistent governance while reducing resource consumption across our global infrastructure.

    Technical Implementation:

    • Edge Collection: Every cluster runs only a lightweight OTel Collector.

    • Tenant Identification: Collectors automatically inject tenant IDs via the X-Scope-OrgID header (for example, scp-dev or scp-prod).

    • Data Isolation: The central Mimir, Loki, and Tempo instances isolate data strictly by tenant ID.

    • Resiliency: Per-tenant rate limiting prevents "noisy neighbor" scenarios. A metric surge in a development environment only throttles that specific tenant, ensuring production remains stable and unaffected.
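    The two halves of this design can be sketched in configuration. Below, an edge collector tags outgoing telemetry with its tenant ID, and the central Mimir applies per-tenant rate limits via runtime overrides. The endpoint hostnames and limit values are illustrative placeholders, not our production settings:

    ```yaml
    # Edge collector (per cluster): stamp every export with this cluster's tenant ID.
    exporters:
      prometheusremotewrite:
        endpoint: https://mimir.central.example/api/v1/push   # placeholder hostname
        headers:
          X-Scope-OrgID: scp-dev    # tenant ID for this cluster
      otlphttp:
        endpoint: https://tempo.central.example:4318          # placeholder hostname
        headers:
          X-Scope-OrgID: scp-dev
    ---
    # Central Mimir runtime overrides: per-tenant ingestion limits, so a surge
    # in one tenant throttles only that tenant. Numbers are illustrative.
    overrides:
      scp-dev:
        ingestion_rate: 50000
        ingestion_burst_size: 500000
      scp-prod:
        ingestion_rate: 500000
        ingestion_burst_size: 5000000
    ```

    Because the tenant ID is injected at the collector, application teams never see or manage it; isolation is enforced entirely at the edge and in the central backend.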

    2. OpenTelemetry as the Universal Ingestion Layer

    The OTel Collector serves as the primary engine for multi-tenancy tagging, batching, buffering, and tail sampling. By leveraging OTel auto-instrumentation for Java and Node.js workloads, we enabled full APM capabilities without modifying any application code.

    Key Payoffs:

    • Complete Backend Decoupling: Our instrumentation is entirely independent of the storage layer.

    • Seamless Migration: Moving from Tempo to Jaeger requires only a single configuration line change with zero impact on the application layer.

    • Operational Agility: We can update or swap backend providers without requiring developer intervention or service restarts.
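    The "single configuration line" claim is concrete because trace export is just an OTLP endpoint, and Jaeger ingests OTLP natively. A minimal sketch (hostnames are illustrative placeholders):

    ```yaml
    # Swapping the tracing backend touches only the exporter endpoint;
    # instrumented applications are never redeployed or restarted.
    exporters:
      otlp:
        endpoint: tempo.central.example:4317      # current backend: Tempo
        # endpoint: jaeger.central.example:4317   # alternative: Jaeger (native OTLP)
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [otlp]
    ```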

    Key Challenges We Encountered

    1. Metric Explosion

    Deploying the OTel Collector as a DaemonSet caused our metric volume to explode by 20-40x: every collector scraped every cluster-wide target, so on a 14-node cluster, kubelet metrics were scraped 14 times.

    We fixed this with a Target Allocator per-node strategy that assigns scrape jobs only to collectors on the same node as targets:
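    In the OpenTelemetry Operator's OpenTelemetryCollector resource, this is the targetAllocator block with the per-node allocation strategy. A minimal sketch (resource name is a placeholder, and the scrape configuration itself is elided):

    ```yaml
    apiVersion: opentelemetry.io/v1beta1
    kind: OpenTelemetryCollector
    metadata:
      name: otel-collector        # placeholder name
    spec:
      mode: daemonset
      targetAllocator:
        enabled: true
        allocationStrategy: per-node   # each collector scrapes only its own node's targets
        prometheusCR:
          enabled: true                # discover ServiceMonitor/PodMonitor resources
    ```

    With this in place, each kubelet endpoint is assigned to exactly one collector, eliminating the N-fold duplication.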

    Monitoring Signals

    Watch the following metrics:

    • otelcol_receiver_refused_metric_points_total: a non-zero value indicates data loss
    • opentelemetry_allocator_targets_per_collector: uneven distribution is normal in per-node mode, because the number of pods per node varies

    2. Version Alignment Issues

    Prometheus scraping failed after enabling the Target Allocator because the Operator, Collector, and Target Allocator were running mismatched versions.

    Example failure log:

    2025-06-27T05:31:27.578Z    error    error creating new scrape pool    {"resource": {"service.instance.id": "bfa11ae0-f6ad-4d5b-97e8-088b8cd0a7f4", "service.name": "otelcol-contrib", "service.version": "0.128.0"}, "otelcol.component.id": "prometheus", "otelcol.component.kind": "receiver", "otelcol.signal": "metrics", "err": "invalid metric name escaping scheme, got empty string instead of escaping scheme", "scrape_pool": "otel-collector"}

    We traced the issue to breaking changes in Prometheus dependencies between versions.

    • v0.127.0: Did not recognize, or improperly implemented, the configuration for the new escaping scheme.

    • v0.128.0: Was built on a newer Prometheus dependency that enforced stricter validation of the escaping scheme, causing the prometheusreceiver to fail when it received the older-style configuration.
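    One way to guard against this drift is to pin the Collector and Target Allocator images to the same release in the OpenTelemetryCollector resource, rather than relying on operator defaults. A sketch, assuming the operator itself is upgraded in lockstep (tags shown are illustrative):

    ```yaml
    apiVersion: opentelemetry.io/v1beta1
    kind: OpenTelemetryCollector
    metadata:
      name: otel-collector        # placeholder name
    spec:
      mode: daemonset
      # Pin both components to the same release so their Prometheus
      # dependencies (and escaping-scheme behavior) stay aligned.
      image: otel/opentelemetry-collector-contrib:0.128.0
      targetAllocator:
        enabled: true
        image: ghcr.io/open-telemetry/opentelemetry-operator/target-allocator:0.128.0
    ```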

    3. Collector OOM on Small Nodes

    On nodes with 2 GB of memory, collectors consistently OOM’d even with the memory_limiter processor enabled. The reason: graceful shutdown requires additional memory headroom that simply does not exist on small nodes.

    Solution: Enforce Minimum Node Size via nodeAffinity

    Collectors should only run on nodes with 4GB+ memory.
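    Kubernetes has no built-in node label for memory size, so this assumes nodes are labeled at provisioning time; the stclab.io/memory-class label below is hypothetical. The constraint then becomes a required nodeAffinity rule in the collector pod spec:

    ```yaml
    # Collector pod spec: schedule only onto nodes labeled with >= 4Gi memory.
    # "stclab.io/memory-class" is a hypothetical label applied at node provisioning.
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: stclab.io/memory-class
                  operator: In
                  values: ["4Gi", "8Gi", "16Gi"]
    ```

    Using requiredDuringScheduling (rather than preferred) makes the floor a hard guarantee: undersized nodes simply never run a collector.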


    Conclusion

    When we began this journey, there were few production references for operating OpenTelemetry Collectors at scale. Much of our progress came from hands-on experimentation and support from the open source community. We hope these learnings help teams implementing multi-tenant observability architectures in real-world environments.
