From Legacy to Cloud-Native: How STCLab Reduced Observability Costs by 72%
Originally featured on the CNCF Blog
Posted on December 16, 2025 by Grace Park, DevOps Engineer, STCLab SRE Team
The Challenge: High-Stakes Traffic Management
At STCLab, we operate NetFUNNEL (a virtual waiting room solution) and BotManager (a bot mitigation solution), platforms that handle massive traffic surges during flash sales and global online voting. Supporting 3.5 million simultaneous users across 200 countries requires absolute visibility.
By 2023, our 20-year-old legacy architecture had reached its limit. As we rebuilt NetFUNNEL 4.x as a global, Kubernetes-native SaaS, we hit a wall with traditional monitoring:
The "Sampling Trap": High costs forced us to disable APM in dev/staging and sample only 5% of production traffic.
Reactive Firefighting: Performance bugs were only caught after they reached production.
Unsustainable Costs: Scaling our observability was becoming more expensive than scaling our service.
The Solution: Standardizing on OpenTelemetry
We migrated to a full CNCF-backed observability stack using OpenTelemetry (OTel) for instrumentation and the LGTM stack (Loki, Grafana, Tempo, and Mimir) for the backend.
Transformative Results
By moving away from proprietary vendors to an open-standard architecture, we achieved:
| Metric | Previous Vendor | New OTel + LGTM Stack |
| --- | --- | --- |
| Total Cost | 100% (baseline) | 72% reduction |
| APM Trace Coverage | 5% (production only) | 100% (all environments) |
| Vendor Lock-in | High | Zero (CNCF standard) |
| Environment Parity | Production only | Unified dev/staging/prod |
Observability Architecture Overview: Why OTel & LGTM?
As we transitioned our core platforms to a global SaaS model, we realized that traditional proprietary observability tools created a kind of “success tax”: as traffic increased, the costs quickly became unsustainable. To monitor 3.5M concurrent users across NetFUNNEL and BotManager in a scalable, cost-efficient way, we built our foundation on the CNCF ecosystem.
Key architectural decisions
1. Centralized Backend via Multi-Tenancy
Instead of the overhead of deploying full LGTM stacks in every cluster, we centralized all telemetry into a single management cluster. This ensures consistent governance while reducing resource consumption across our global infrastructure.
Technical Implementation:
Edge Collection: Every cluster runs only a lightweight OTel Collector.
Tenant Identification: Collectors automatically inject tenant IDs using the X-Scope-OrgID header (for example, scp-dev or scp-prod); see the collector sketch after this list.
Data Isolation: The central Mimir, Loki, and Tempo instances isolate data strictly by tenant ID.
Resiliency: Per-tenant rate limiting prevents "noisy neighbor" scenarios. A metric surge in a development environment throttles only that specific tenant, keeping production stable and unaffected.
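To make this concrete, here is a minimal sketch of what an edge collector configuration with tenant tagging can look like. The endpoints, pipeline shape, and the scp-dev tenant value are illustrative assumptions rather than our exact production settings.

```yaml
# Minimal sketch of an edge OTel Collector config (otelcol-contrib).
# Endpoints and the scp-dev tenant ID are illustrative placeholders.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch: {}

exporters:
  # Metrics -> central Mimir; tenant isolation via the X-Scope-OrgID header
  prometheusremotewrite:
    endpoint: https://mimir.central.example.com/api/v1/push
    headers:
      X-Scope-OrgID: scp-dev
  # Traces -> central Tempo; same tenant header
  otlphttp/tempo:
    endpoint: https://tempo.central.example.com:4318
    headers:
      X-Scope-OrgID: scp-dev

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/tempo]
```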
2. OpenTelemetry as the Universal Ingestion Layer
The OTel Collector serves as the primary engine for multi-tenancy tagging, batching, buffering, and tail sampling. By leveraging OTel auto-instrumentation for Java and Node.js workloads, we enabled full APM capabilities without modifying any application code.
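For reference, zero-code instrumentation like this is typically wired through the OpenTelemetry Operator: an Instrumentation resource plus a pod annotation. The sketch below assumes an Operator-managed cluster; the resource names, namespace, and collector endpoint are hypothetical.

```yaml
# Sketch: auto-instrumentation via the OpenTelemetry Operator (no code changes).
# Names and the collector endpoint are hypothetical.
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: default-instrumentation
spec:
  exporter:
    endpoint: http://otel-collector.observability:4317
  propagators:
    - tracecontext
    - baggage
---
# Opting a Java workload in requires only a pod-template annotation:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-java-service          # hypothetical workload
spec:
  selector:
    matchLabels:
      app: sample-java-service
  template:
    metadata:
      labels:
        app: sample-java-service
      annotations:
        instrumentation.opentelemetry.io/inject-java: "true"   # Node.js uses inject-nodejs
    spec:
      containers:
        - name: app
          image: sample-java-service:latest   # hypothetical image
```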
Key Payoffs:
Complete Backend Decoupling: Our instrumentation is entirely independent of the storage layer.
Seamless Migration: Moving from Tempo to Jaeger requires only a single configuration line change, with zero impact on the application layer (see the sketch after this list).
Operational Agility: We can update or swap backend providers without requiring developer intervention or service restarts.
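To illustrate the single-line change mentioned above: applications only ever send OTLP to the collector, so swapping the trace backend is purely a collector-side edit. The hostnames below are illustrative.

```yaml
# Sketch: repointing the trace exporter is the only change needed to swap
# backends; applications keep sending OTLP to the collector untouched.
exporters:
  otlphttp/traces:
    # Before: central Tempo
    # endpoint: https://tempo.central.example.com:4318
    # After: Jaeger, which ingests OTLP natively
    endpoint: https://jaeger-collector.central.example.com:4318
```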
Key Challenges We Encountered
1. Metric Explosion
Deploying the OTel Collector as a DaemonSet caused our metric volume to explode by 20-40x: every collector scraped every cluster-wide target, so with 14 nodes, kubelet metrics were scraped 14 times.
We fixed this with the Target Allocator's per-node strategy, which assigns scrape jobs only to the collector running on the same node as its targets.
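A minimal sketch of that setup, assuming the collectors are managed by the OpenTelemetry Operator (the resource name is illustrative, and the debug exporter stands in for the real remote-write pipeline):

```yaml
# Sketch of the relevant OpenTelemetryCollector spec (Operator v1beta1 API).
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-node
spec:
  mode: daemonset                    # one collector per node
  targetAllocator:
    enabled: true
    allocationStrategy: per-node     # each collector only scrapes targets on its own node
    prometheusCR:
      enabled: true                  # discover ServiceMonitor/PodMonitor resources
  config:
    receivers:
      prometheus:
        config:
          scrape_configs: []         # populated at runtime by the Target Allocator
    exporters:
      debug: {}
    service:
      pipelines:
        metrics:
          receivers: [prometheus]
          exporters: [debug]
```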
Monitoring Signals
Watch the following metrics:
• otelcol_receiver_refused_metric_points_total: a non-zero value indicates data loss
• opentelemetry_allocator_targets_per_collector: uneven distribution is normal in per-node mode, because the number of pods per node varies
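As a starting point, a Prometheus alerting rule on the first signal could look like the following sketch; the threshold, duration, and group name are assumptions to tune for your environment.

```yaml
# Hypothetical alerting rule for collector-side data loss; tune before use.
groups:
  - name: otel-collector
    rules:
      - alert: OtelCollectorRefusedMetricPoints
        expr: rate(otelcol_receiver_refused_metric_points_total[5m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "OTel Collector is refusing metric points (possible data loss)"
```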
2. Version Alignment Issues
Prometheus scraping failed after enabling the Target Allocator because the Operator, Collector, and Target Allocator were running mismatched versions.
Example failure log:
```
2025-06-27T05:31:27.578Z error error creating new scrape pool {"resource": {"service.instance.id": "bfa11ae0-f6ad-4d5b-97e8-088b8cd0a7f4", "service.name": "otelcol-contrib", "service.version": "0.128.0"}, "otelcol.component.id": "prometheus", "otelcol.component.kind": "receiver", "otelcol.signal": "metrics", "err": "invalid metric name escaping scheme, got empty string instead of escaping scheme", "scrape_pool": "otel-collector"}
```
We traced the issue to breaking changes in Prometheus dependencies between versions.
v0.127.0: did not recognize, or improperly implemented, the configuration for the new escaping scheme.
v0.128.0: was built on a newer Prometheus dependency that enforced stricter validation of the escaping scheme, causing the prometheus receiver to fail when it received the older-style configuration.
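One way to avoid this kind of drift is to pin the Collector and Target Allocator images to a single matched release line instead of relying on operator defaults. The fragment below would be merged into the OpenTelemetryCollector resource shown earlier; the versions and image paths are illustrative assumptions.

```yaml
# Sketch: pin both images to one release line (versions are illustrative).
spec:
  image: otel/opentelemetry-collector-contrib:0.128.0
  targetAllocator:
    enabled: true
    image: ghcr.io/open-telemetry/opentelemetry-operator/target-allocator:0.128.0
```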
3. Collector OOM on Small Nodes
On nodes with 2 GB of memory, collectors consistently OOM'd even with the memory_limiter processor enabled.
Reason:
Graceful shutdown requires additional memory headroom that simply does not exist on small nodes.
Solution: Enforce Minimum Node Size via nodeAffinity
Collectors should only run on nodes with at least 4 GB of memory, as in the sketch below.
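A sketch of that constraint, assuming nodes carry a label describing their memory class (the node-memory label here is hypothetical; matching on a cloud instance-type label works just as well). In an Operator-managed deployment, this goes under the OpenTelemetryCollector spec.affinity field.

```yaml
# Sketch: keep collectors off small nodes. The "node-memory" label is
# hypothetical and must be applied when nodes are provisioned.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: node-memory
              operator: In
              values: ["4Gi", "8Gi", "16Gi"]
```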
Conclusion
When we began this journey, there were few production references for operating OpenTelemetry Collectors at scale. Much of our progress came from hands-on experimentation and support from the open source community. We hope these learnings help teams implementing multi-tenant observability architectures in real-world environments.