From Legacy to Cloud-Native: How STCLab Reduced Observability Costs by 72%
Originally featured on the CNCF Blog
Posted on December 16, 2025 by Grace Park, DevOps Engineer, STCLab SRE Team
The Challenge: High-Stakes Traffic Management
At STCLab, we operate NetFUNNEL (a virtual waiting room solution) and BotManager (a bot mitigation solution), platforms that handle massive traffic surges during flash sales and global online voting. Supporting 3.5 million simultaneous users across 200 countries requires absolute visibility.
By 2023, our 20-year-old legacy architecture had reached its limit. As we rebuilt NetFUNNEL 4.x as a global, Kubernetes-native SaaS, we hit a wall with traditional monitoring:
The "Sampling Trap": High costs forced us to disable APM in dev/staging and sample only 5% of production traffic.
Reactive Firefighting: Performance bugs were only caught after they reached production.
Unsustainable Costs: Scaling our observability was becoming more expensive than scaling our service.
The Solution: Standardizing on OpenTelemetry
We migrated to a full CNCF-backed observability stack using OpenTelemetry (OTel) for instrumentation and the LGTM stack (Loki, Grafana, Tempo, and Mimir) for the backend.
Transformative Results
By moving away from proprietary vendors to an open-standard architecture, we achieved:
| Metric | Previous Vendor | New OTel + LGTM Stack |
| --- | --- | --- |
| Total Cost | 100% (baseline) | 72% reduction |
| APM Trace Coverage | 5% (production only) | 100% (all environments) |
| Vendor Lock-in | High | Zero (CNCF standard) |
| Environment Parity | Production only | Unified dev/staging/prod |
Observability Architecture Overview: Why OTel & LGTM?
As we transitioned our core platforms to a global SaaS model, we realized that traditional proprietary observability tools created a kind of “success tax”: as traffic increased, the costs quickly became unsustainable. To monitor 3.5M concurrent users across NetFUNNEL and BotManager in a scalable, cost-efficient way, we built our foundation on the CNCF ecosystem.
Key architectural decisions
1. Centralized Backend via Multi-Tenancy
Instead of the overhead of deploying full LGTM stacks in every cluster, we centralized all telemetry into a single management cluster. This ensures consistent governance while reducing resource consumption across our global infrastructure.
Technical Implementation:
Edge Collection: Every cluster runs only a lightweight OTel Collector.
Tenant Identification: Collectors automatically inject tenant IDs using the X-Scope-OrgID header (for example, scp-dev or scp-prod); see the collector sketch after this list.
Data Isolation: The central Mimir, Loki, and Tempo instances isolate data strictly by tenant ID.
Resiliency: Per-tenant rate limiting prevents "noisy neighbor" scenarios. A metric surge in a development environment throttles only that specific tenant, keeping production stable and unaffected.
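To make this concrete, here is a minimal sketch of what an edge collector configuration with tenant tagging can look like. The endpoints, pipeline shape, and the scp-dev tenant value are illustrative assumptions rather than our exact production settings.

```yaml
# Minimal sketch of an edge OTel Collector config (otelcol-contrib).
# Endpoints and the scp-dev tenant ID are illustrative placeholders.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch: {}

exporters:
  # Metrics -> central Mimir; tenant isolation via the X-Scope-OrgID header
  prometheusremotewrite:
    endpoint: https://mimir.central.example.com/api/v1/push
    headers:
      X-Scope-OrgID: scp-dev
  # Traces -> central Tempo; same tenant header
  otlphttp/tempo:
    endpoint: https://tempo.central.example.com:4318
    headers:
      X-Scope-OrgID: scp-dev

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/tempo]
```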
2. OpenTelemetry as the Universal Ingestion Layer
The OTel Collector serves as the primary engine for multi-tenancy tagging, batching, buffering, and tail sampling. By leveraging OTel auto-instrumentation for Java and Node.js workloads, we enabled full APM capabilities without modifying any application code.
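For reference, zero-code instrumentation like this is typically wired through the OpenTelemetry Operator: an Instrumentation resource plus a pod annotation. The sketch below assumes an Operator-managed cluster; the resource names, namespace, and collector endpoint are hypothetical.

```yaml
# Sketch: auto-instrumentation via the OpenTelemetry Operator (no code changes).
# Names and the collector endpoint are hypothetical.
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: default-instrumentation
spec:
  exporter:
    endpoint: http://otel-collector.observability:4317
  propagators:
    - tracecontext
    - baggage
---
# Opting a Java workload in requires only a pod-template annotation:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-java-service          # hypothetical workload
spec:
  selector:
    matchLabels:
      app: sample-java-service
  template:
    metadata:
      labels:
        app: sample-java-service
      annotations:
        instrumentation.opentelemetry.io/inject-java: "true"   # Node.js uses inject-nodejs
    spec:
      containers:
        - name: app
          image: sample-java-service:latest   # hypothetical image
```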
Key Payoffs:
Complete Backend Decoupling: Our instrumentation is entirely independent of the storage layer.
Seamless Migration: Moving from Tempo to Jaeger requires only a single configuration line change, with zero impact on the application layer (see the sketch after this list).
Operational Agility: We can update or swap backend providers without requiring developer intervention or service restarts.
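To illustrate the single-line change mentioned above: applications only ever send OTLP to the collector, so swapping the trace backend is purely a collector-side edit. The hostnames below are illustrative.

```yaml
# Sketch: repointing the trace exporter is the only change needed to swap
# backends; applications keep sending OTLP to the collector untouched.
exporters:
  otlphttp/traces:
    # Before: central Tempo
    # endpoint: https://tempo.central.example.com:4318
    # After: Jaeger, which ingests OTLP natively
    endpoint: https://jaeger-collector.central.example.com:4318
```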
Key Challenges We Encountered
1. Metric Explosion
Deploying the OTel Collector as a DaemonSet caused our metric volume to explode by 20-40x: every collector scraped every cluster-wide target, so with 14 nodes, kubelet metrics were scraped 14 times.
We fixed this with the Target Allocator's per-node strategy, which assigns scrape jobs only to the collector running on the same node as its targets.
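A minimal sketch of that setup, assuming the collectors are managed by the OpenTelemetry Operator (the resource name is illustrative, and the debug exporter stands in for the real remote-write pipeline):

```yaml
# Sketch of the relevant OpenTelemetryCollector spec (Operator v1beta1 API).
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-node
spec:
  mode: daemonset                    # one collector per node
  targetAllocator:
    enabled: true
    allocationStrategy: per-node     # each collector only scrapes targets on its own node
    prometheusCR:
      enabled: true                  # discover ServiceMonitor/PodMonitor resources
  config:
    receivers:
      prometheus:
        config:
          scrape_configs: []         # populated at runtime by the Target Allocator
    exporters:
      debug: {}
    service:
      pipelines:
        metrics:
          receivers: [prometheus]
          exporters: [debug]
```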
Monitoring Signals
Watch the following metrics:
• otelcol_receiver_refused_metric_points_total: a non-zero value indicates data loss
• opentelemetry_allocator_targets_per_collector: uneven distribution is normal in per-node mode, because the number of pods per node varies
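As a starting point, a Prometheus alerting rule on the first signal could look like the following sketch; the threshold, duration, and group name are assumptions to tune for your environment.

```yaml
# Hypothetical alerting rule for collector-side data loss; tune before use.
groups:
  - name: otel-collector
    rules:
      - alert: OtelCollectorRefusedMetricPoints
        expr: rate(otelcol_receiver_refused_metric_points_total[5m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "OTel Collector is refusing metric points (possible data loss)"
```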
2. Version Alignment Issues
Prometheus scraping failed after enabling the Target Allocator because the Operator, Collector, and Target Allocator were running mismatched versions.
Example failure log:
```
2025-06-27T05:31:27.578Z error error creating new scrape pool {"resource": {"service.instance.id": "bfa11ae0-f6ad-4d5b-97e8-088b8cd0a7f4", "service.name": "otelcol-contrib", "service.version": "0.128.0"}, "otelcol.component.id": "prometheus", "otelcol.component.kind": "receiver", "otelcol.signal": "metrics", "err": "invalid metric name escaping scheme, got empty string instead of escaping scheme", "scrape_pool": "otel-collector"}
```
We traced the issue to breaking changes in Prometheus dependencies between versions.
v0.127.0: did not recognize, or improperly implemented, the configuration for the new escaping scheme.
v0.128.0: was built on a newer Prometheus dependency that enforced stricter validation of the escaping scheme, causing the prometheus receiver to fail when it received the older-style configuration.
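One way to avoid this kind of drift is to pin the Collector and Target Allocator images to a single matched release line instead of relying on operator defaults. The fragment below would be merged into the OpenTelemetryCollector resource shown earlier; the versions and image paths are illustrative assumptions.

```yaml
# Sketch: pin both images to one release line (versions are illustrative).
spec:
  image: otel/opentelemetry-collector-contrib:0.128.0
  targetAllocator:
    enabled: true
    image: ghcr.io/open-telemetry/opentelemetry-operator/target-allocator:0.128.0
```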
3. Collector OOM on Small Nodes
On nodes with 2 GB of memory, collectors consistently OOM'd even with the memory_limiter processor enabled.
Reason:
Graceful shutdown requires additional memory headroom that simply does not exist on small nodes.
Solution: Enforce Minimum Node Size via nodeAffinity
Collectors should only run on nodes with at least 4 GB of memory, as in the sketch below.
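A sketch of that constraint, assuming nodes carry a label describing their memory class (the node-memory label here is hypothetical; matching on a cloud instance-type label works just as well). In an Operator-managed deployment, this goes under the OpenTelemetryCollector spec.affinity field.

```yaml
# Sketch: keep collectors off small nodes. The "node-memory" label is
# hypothetical and must be applied when nodes are provisioned.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: node-memory
              operator: In
              values: ["4Gi", "8Gi", "16Gi"]
```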
Conclusion
When we began this journey, there were few production references for operating OpenTelemetry Collectors at scale. Much of our progress came from hands-on experimentation and support from the open source community. We hope these learnings help teams implementing multi-tenant observability architectures in real-world environments.