Per-Tenant SLOs in a Multi-Tenant SaaS with OpenTelemetry
Aggregate SLOs hide the customer who's about to churn. Here's how to track per-tenant error rates and latency using standard OpenTelemetry conventions.
Aggregate SLOs are a lie. Your service is "99.9% available" because most customers had a great month and the one customer paying you $50K/year had a 4-second checkout for two weeks straight. Their churn risk doesn't show up in the dashboard.
Per-tenant SLOs are the fix. The OpenTelemetry data model already supports them — you just have to tag the spans. For broader SaaS observability context, see the SaaS observability page.
The data model
Every span gets a tenant.id attribute. That's the entire architecture. Once it's there, every metric and detection rule can be sliced by tenant.
In OpenTelemetry semantic conventions, the standard attribute is enduser.id for the user and a custom tenant.id for the multi-tenant axis. Use both — they answer different questions ("which user?" vs "which paying customer organization?").
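Concretely, at the point where you know both identities (a hypothetical fragment; `userId` and `tenantId` stand in for whatever your auth layer provides):

```typescript
import { trace } from '@opentelemetry/api';

// Hypothetical values; in practice these come from your auth layer
const userId = 'u_123';
const tenantId = 'acme-corp';

// "Which user?" vs. "which paying customer organization?"
const span = trace.getActiveSpan();
span?.setAttribute('enduser.id', userId);  // the human
span?.setAttribute('tenant.id', tenantId); // the org paying the bill
```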
Tagging at request entry
The cleanest pattern is to extract tenant ID from the request (JWT claim, subdomain, header) and set it on the active span at the top of your handler:
```typescript
import { trace, context, propagation } from '@opentelemetry/api';

app.use((req, res, next) => {
  // Tenant from the auth layer (e.g. a JWT claim) or the subdomain, with a fallback
  const tenantId = req.user?.tenantId || req.subdomains?.[0] || 'unknown';

  // Tag the active server span created by your HTTP instrumentation
  const span = trace.getActiveSpan();
  span?.setAttribute('tenant.id', tenantId);

  // Also propagate via baggage so downstream services see it
  const baggage = (propagation.getActiveBaggage() ?? propagation.createBaggage())
    .setEntry('tenant.id', { value: tenantId });
  context.with(propagation.setBaggage(context.active(), baggage), () => next());
});
```
This single middleware tags every span in the entry service and, through baggage, makes the tenant ID available to every downstream service. One caveat: baggage travels in request headers and is not copied onto spans automatically, so each downstream service needs a small hook to stamp it on.
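A minimal sketch of that hook, assuming the Node SDK (`@opentelemetry/sdk-trace-base`); the OpenTelemetry contrib ecosystem also publishes a baggage span processor that does roughly this:

```typescript
import { Context, propagation } from '@opentelemetry/api';
import { Span, SpanProcessor } from '@opentelemetry/sdk-trace-base';

// Copies tenant.id from incoming baggage onto every span as it starts,
// so spans created deep inside this service are queryable by tenant too.
class TenantIdSpanProcessor implements SpanProcessor {
  onStart(span: Span, parentContext: Context): void {
    const entry = propagation.getBaggage(parentContext)?.getEntry('tenant.id');
    if (entry) span.setAttribute('tenant.id', entry.value);
  }
  onEnd(): void {}
  forceFlush(): Promise<void> { return Promise.resolve(); }
  shutdown(): Promise<void> { return Promise.resolve(); }
}

// Registered once at SDK setup, e.g.:
// provider.addSpanProcessor(new TenantIdSpanProcessor());
```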
The query side
Once tagged, per-tenant queries are simple aggregations.
Per-tenant error rate (last hour):
```sql
SELECT
  span_attributes['tenant.id'] AS tenant,
  countIf(span_attributes['http.status_code'] LIKE '5%') / count() AS error_rate,
  count() AS total_requests
FROM otel_traces
WHERE timestamp > now() - INTERVAL 1 HOUR
GROUP BY tenant
HAVING total_requests > 100
ORDER BY error_rate DESC;
```
Per-tenant p99 latency:
```sql
SELECT
  span_attributes['tenant.id'] AS tenant,
  quantile(0.99)(duration_ns / 1000000) AS p99_ms,
  count() AS requests
FROM otel_traces
WHERE
  span_kind = 'SERVER' AND
  timestamp > now() - INTERVAL 1 HOUR
GROUP BY tenant
HAVING requests > 100
ORDER BY p99_ms DESC;
```
SLO budget consumption per tenant (for a 99.9% availability SLO):
```sql
SELECT
  tenant,
  total_requests,
  failed_requests,
  failed_requests::float / total_requests AS error_rate,
  -- 99.9% SLO means a 0.1% error budget; over 30 days, allowed errors:
  total_requests * 0.001 AS allowed_errors,
  failed_requests > total_requests * 0.001 AS budget_blown
FROM (
  SELECT
    span_attributes['tenant.id'] AS tenant,
    count() AS total_requests,
    countIf(span_attributes['http.status_code'] LIKE '5%') AS failed_requests
  FROM otel_traces
  WHERE timestamp > now() - INTERVAL 30 DAY
  GROUP BY tenant
)
WHERE total_requests > 1000;
```
Alerts that matter
The detection rules write themselves once you have per-tenant aggregation. The high-value ones:
- Tenant-specific error rate spike. Any tenant whose error rate exceeds 1% over 5 minutes (their normal is probably <0.1%).
- Tenant-specific latency degradation. P99 above 2× the tenant's own 7-day baseline. Aggregate latency can look fine while one tenant's experience tanks.
- SLO budget burn-down. Alert when a tenant has consumed 80% of their 30-day error budget, so you can reach out proactively before they churn.
The aggregate equivalents of these alerts mostly miss the long tail. Per-tenant equivalents catch the customer who's having a bad day.
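A sketch of the third rule, assuming the same otel_traces schema as the queries above; run it on a schedule and alert on any rows it returns:

```sql
-- Tenants that have burned >= 80% of a 99.9% monthly error budget
SELECT
  tenant,
  failed_requests / (total_requests * 0.001) AS budget_consumed
FROM (
  SELECT
    span_attributes['tenant.id'] AS tenant,
    count() AS total_requests,
    countIf(span_attributes['http.status_code'] LIKE '5%') AS failed_requests
  FROM otel_traces
  WHERE timestamp > now() - INTERVAL 30 DAY
  GROUP BY tenant
)
WHERE total_requests > 1000 AND budget_consumed >= 0.8
ORDER BY budget_consumed DESC;
```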
The cardinality concern
The standard objection: "tenant ID is high cardinality, won't this blow up our metrics storage?"
For traces, no. Each span is stored as an independent record (a row in most trace stores, a set of column values in ClickHouse), and a tenant.id attribute adds 30–50 bytes per span. For 10M spans/day across 1,000 tenants, that's 300–500 MB/day of additional data — trivial.
For metrics (Prometheus-style), yes. Per-tenant metrics blow up the time series count. The fix is to compute per-tenant metrics from trace data on demand rather than emitting them as continuous time series. Trace data is queryable cheaply; pre-aggregated time series are not.
Per-tenant cost attribution
The same data drives cost attribution. If you want to know which customer is consuming 40% of your CPU, group spans by tenant and aggregate the CPU-equivalent metric (request count × average duration × server count is a reasonable proxy).
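A sketch, again assuming the otel_traces schema above. Request count × average duration collapses to a plain sum of span durations, which serves as the CPU-time proxy:

```sql
SELECT
  span_attributes['tenant.id'] AS tenant,
  count() AS requests,
  sum(duration_ns) / 1e9 AS busy_seconds -- request count x avg duration, collapsed
FROM otel_traces
WHERE span_kind = 'SERVER'
  AND timestamp > now() - INTERVAL 30 DAY
GROUP BY tenant
ORDER BY busy_seconds DESC;
```

Divide one tenant's busy_seconds by the fleet-wide total and multiply by the monthly compute bill to get a per-tenant dollar figure.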
This is what enables honest enterprise pricing: "this customer's usage is costing us $4,300/month" is concrete and defensible. "Their workload is heavy" is not.
Privacy and the tenant.id leak
One consideration: tenant IDs in logs and traces are sensitive. They reveal who your customers are, information that competitors, vendors, or contractors with dashboard access might find interesting.
Two practices help:
- Hash tenant IDs in attributes that go to third-party tools. Keep the original ID in your own backend; send a SHA-256 hash to vendors (a sketch follows this list).
- Don't expose tenant IDs in logs unless your team needs them for support. Use a separate identifier scheme for support cases.
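A minimal sketch of the hashing step in Node. The salt is my addition, since unsalted hashes of guessable IDs can be reversed by brute force:

```typescript
import { createHash } from 'node:crypto';

// Deterministic pseudonym for a tenant ID. Keep the raw ID in your own
// backend; send only the hash to third-party tools.
// TENANT_HASH_SALT is a hypothetical secret of your choosing.
function hashTenantId(tenantId: string): string {
  const salt = process.env.TENANT_HASH_SALT ?? '';
  return createHash('sha256').update(salt + tenantId).digest('hex').slice(0, 16);
}

// In the middleware above, for the vendor-bound exporter:
// span?.setAttribute('tenant.id', hashTenantId(tenantId));
```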
This is more about hygiene than urgent risk. Most SaaS exposes tenant IDs in URLs and API responses anyway. But it's worth thinking about.
How SecureNow uses tenant.id
In the SecureNow backend, every dashboard view supports a tenant.id filter. Per-tenant SLOs, per-tenant security investigations, per-tenant cost attribution — all from one query layer. The firewall extends this: per-tenant blocklist/allowlist rules, per-tenant kill switches.
The architectural pattern is the same regardless of vendor: one attribute, propagated correctly, queried widely.
Frequently Asked Questions
What's a per-tenant SLO?
A service-level objective measured per individual customer/tenant rather than across the whole user base. Lets you see if one specific customer is having a bad day even when fleet-wide metrics look healthy.
How do I tag spans with tenant ID?
Use OpenTelemetry's baggage API to propagate `tenant.id` from request entry through every downstream call, plus a small span processor to copy the baggage entry onto each span. In Express this is a short middleware at the top of your request handler (see the example above).
What's the storage cost?
Tenant ID is one extra attribute per span — typically 30–50 bytes. For most apps this is negligible. ClickHouse compresses high-cardinality attributes well, so the disk impact is small.
Can I use this for billing?
Yes. Counting requests per tenant from trace data is one of the cleanest ways to drive usage-based billing. Just be careful about sampling — if you sample traces, your billing numbers need to scale up to compensate.