Per-Tenant SLOs in a Multi-Tenant SaaS with OpenTelemetry
Aggregate SLOs hide the customer who's about to churn. Here's how to track per-tenant error rates and latency using standard OpenTelemetry conventions.
Aggregate SLOs are a lie. Your service is "99.9% available" because most customers had a great month and the one customer paying you $50K/year had a 4-second checkout for two weeks straight. Their churn risk doesn't show up in the dashboard.
Per-tenant SLOs are the fix. The OpenTelemetry data model already supports them — you just have to tag the spans. For broader SaaS observability context, see the SaaS observability page.
The data model
Every span gets a tenant.id attribute. That's the entire architecture. Once it's there, every metric and detection rule can be sliced by tenant.
In OpenTelemetry semantic conventions, the standard attribute is enduser.id for the user and a custom tenant.id for the multi-tenant axis. Use both — they answer different questions ("which user?" vs "which paying customer organization?").
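Concretely, at the point where you know both identities (a hypothetical fragment; `userId` and `tenantId` stand in for whatever your auth layer provides):

```typescript
import { trace } from '@opentelemetry/api';

// Hypothetical values; in practice these come from your auth layer
const userId = 'u_123';
const tenantId = 'acme-corp';

// "Which user?" vs. "which paying customer organization?"
const span = trace.getActiveSpan();
span?.setAttribute('enduser.id', userId);  // the human
span?.setAttribute('tenant.id', tenantId); // the org paying the bill
```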
Tagging at request entry
The cleanest pattern is to extract tenant ID from the request (JWT claim, subdomain, header) and set it on the active span at the top of your handler:
```typescript
import { trace, context, propagation } from '@opentelemetry/api';

app.use((req, res, next) => {
  // Tenant from the auth layer (e.g. a JWT claim) or the subdomain, with a fallback
  const tenantId = req.user?.tenantId || req.subdomains?.[0] || 'unknown';

  // Tag the active server span created by your HTTP instrumentation
  const span = trace.getActiveSpan();
  span?.setAttribute('tenant.id', tenantId);

  // Also propagate via baggage so downstream services see it
  const baggage = (propagation.getActiveBaggage() ?? propagation.createBaggage())
    .setEntry('tenant.id', { value: tenantId });
  context.with(propagation.setBaggage(context.active(), baggage), () => next());
});
```
This single middleware tags every span in the entry service and, through baggage, makes the tenant ID available to every downstream service. One caveat: baggage travels in request headers and is not copied onto spans automatically, so each downstream service needs a small hook to stamp it on.
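A minimal sketch of that hook, assuming the Node SDK (`@opentelemetry/sdk-trace-base`); the OpenTelemetry contrib ecosystem also publishes a baggage span processor that does roughly this:

```typescript
import { Context, propagation } from '@opentelemetry/api';
import { Span, SpanProcessor } from '@opentelemetry/sdk-trace-base';

// Copies tenant.id from incoming baggage onto every span as it starts,
// so spans created deep inside this service are queryable by tenant too.
class TenantIdSpanProcessor implements SpanProcessor {
  onStart(span: Span, parentContext: Context): void {
    const entry = propagation.getBaggage(parentContext)?.getEntry('tenant.id');
    if (entry) span.setAttribute('tenant.id', entry.value);
  }
  onEnd(): void {}
  forceFlush(): Promise<void> { return Promise.resolve(); }
  shutdown(): Promise<void> { return Promise.resolve(); }
}

// Registered once at SDK setup, e.g.:
// provider.addSpanProcessor(new TenantIdSpanProcessor());
```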
The query side
Once tagged, per-tenant queries are simple aggregations.
Per-tenant error rate (last hour):
```sql
SELECT
  span_attributes['tenant.id'] AS tenant,
  countIf(span_attributes['http.status_code'] LIKE '5%') / count() AS error_rate,
  count() AS total_requests
FROM otel_traces
WHERE timestamp > now() - INTERVAL 1 HOUR
GROUP BY tenant
HAVING total_requests > 100
ORDER BY error_rate DESC;
```
Per-tenant p99 latency:
```sql
SELECT
  span_attributes['tenant.id'] AS tenant,
  quantile(0.99)(duration_ns / 1000000) AS p99_ms,
  count() AS requests
FROM otel_traces
WHERE
  span_kind = 'SERVER' AND
  timestamp > now() - INTERVAL 1 HOUR
GROUP BY tenant
HAVING requests > 100
ORDER BY p99_ms DESC;
```
SLO budget consumption per tenant (for a 99.9% availability SLO):
```sql
SELECT
  tenant,
  total_requests,
  failed_requests,
  failed_requests::float / total_requests AS error_rate,
  -- 99.9% SLO means a 0.1% error budget; over 30 days, allowed errors:
  total_requests * 0.001 AS allowed_errors,
  failed_requests > total_requests * 0.001 AS budget_blown
FROM (
  SELECT
    span_attributes['tenant.id'] AS tenant,
    count() AS total_requests,
    countIf(span_attributes['http.status_code'] LIKE '5%') AS failed_requests
  FROM otel_traces
  WHERE timestamp > now() - INTERVAL 30 DAY
  GROUP BY tenant
)
WHERE total_requests > 1000;
```
Alerts that matter
The detection rules write themselves once you have per-tenant aggregation. The high-value ones:
- Tenant-specific error rate spike. Any tenant whose error rate exceeds 1% over 5 minutes (their normal is probably <0.1%).
- Tenant-specific latency degradation. P99 above 2× the tenant's own 7-day baseline. Aggregate latency can look fine while one tenant's experience tanks.
- SLO budget burn-down. Alert when a tenant has consumed 80% of their 30-day error budget, so you can reach out proactively before they churn.
The aggregate equivalents of these alerts mostly miss the long tail. Per-tenant equivalents catch the customer who's having a bad day.
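A sketch of the third rule, assuming the same otel_traces schema as the queries above; run it on a schedule and alert on any rows it returns:

```sql
-- Tenants that have burned >= 80% of a 99.9% monthly error budget
SELECT
  tenant,
  failed_requests / (total_requests * 0.001) AS budget_consumed
FROM (
  SELECT
    span_attributes['tenant.id'] AS tenant,
    count() AS total_requests,
    countIf(span_attributes['http.status_code'] LIKE '5%') AS failed_requests
  FROM otel_traces
  WHERE timestamp > now() - INTERVAL 30 DAY
  GROUP BY tenant
)
WHERE total_requests > 1000 AND budget_consumed >= 0.8
ORDER BY budget_consumed DESC;
```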
The cardinality concern
The standard objection: "tenant ID is high cardinality, won't this blow up our metrics storage?"
For traces, no. Each span is stored as an independent record (a row in most trace stores, a set of column values in ClickHouse), and a tenant.id attribute adds 30–50 bytes per span. For 10M spans/day across 1,000 tenants, that's 300–500 MB/day of additional data — trivial.
For metrics (Prometheus-style), yes. Per-tenant metrics blow up the time series count. The fix is to compute per-tenant metrics from trace data on demand rather than emitting them as continuous time series. Trace data is queryable cheaply; pre-aggregated time series are not.
Per-tenant cost attribution
The same data drives cost attribution. If you want to know which customer is consuming 40% of your CPU, group spans by tenant and aggregate the CPU-equivalent metric (request count × average duration × server count is a reasonable proxy).
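A sketch, again assuming the otel_traces schema above. Request count × average duration collapses to a plain sum of span durations, which serves as the CPU-time proxy:

```sql
SELECT
  span_attributes['tenant.id'] AS tenant,
  count() AS requests,
  sum(duration_ns) / 1e9 AS busy_seconds -- request count x avg duration, collapsed
FROM otel_traces
WHERE span_kind = 'SERVER'
  AND timestamp > now() - INTERVAL 30 DAY
GROUP BY tenant
ORDER BY busy_seconds DESC;
```

Divide one tenant's busy_seconds by the fleet-wide total and multiply by the monthly compute bill to get a per-tenant dollar figure.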
This is what enables honest enterprise pricing: "this customer's usage is costing us $4,300/month" is concrete and defensible. "Their workload is heavy" is not.
Privacy and the tenant.id leak
One consideration: tenant IDs in logs and traces are sensitive. They reveal who your customers are, information that competitors, vendors, or contractors with dashboard access might find interesting.
Two practices help:
- Hash tenant IDs in attributes that go to third-party tools. Keep the original ID in your own backend; send a SHA-256 hash to vendors (a sketch follows this list).
- Don't expose tenant IDs in logs unless your team needs them for support. Use a separate identifier scheme for support cases.
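A minimal sketch of the hashing step in Node. The salt is my addition, since unsalted hashes of guessable IDs can be reversed by brute force:

```typescript
import { createHash } from 'node:crypto';

// Deterministic pseudonym for a tenant ID. Keep the raw ID in your own
// backend; send only the hash to third-party tools.
// TENANT_HASH_SALT is a hypothetical secret of your choosing.
function hashTenantId(tenantId: string): string {
  const salt = process.env.TENANT_HASH_SALT ?? '';
  return createHash('sha256').update(salt + tenantId).digest('hex').slice(0, 16);
}

// In the middleware above, for the vendor-bound exporter:
// span?.setAttribute('tenant.id', hashTenantId(tenantId));
```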
This is more about hygiene than urgent risk. Most SaaS exposes tenant IDs in URLs and API responses anyway. But it's worth thinking about.
How SecureNow uses tenant.id
In the SecureNow backend, every dashboard view supports a tenant.id filter. Per-tenant SLOs, per-tenant security investigations, per-tenant cost attribution — all from one query layer. The firewall extends this: per-tenant blocklist/allowlist rules, per-tenant kill switches.
The architectural pattern is the same regardless of vendor: one attribute, propagated correctly, queried widely.
Frequently Asked Questions
What's a per-tenant SLO?
A service-level objective measured per individual customer/tenant rather than across the whole user base. Lets you see if one specific customer is having a bad day even when fleet-wide metrics look healthy.
How do I tag spans with tenant ID?
Use OpenTelemetry's baggage API to propagate `tenant.id` from request entry through every downstream call, plus a small span processor to copy the baggage entry onto each span. In Express this is a short middleware at the top of your request handler (see the example above).
What's the storage cost?
Tenant ID is one extra attribute per span — typically 30–50 bytes. For most apps this is negligible. ClickHouse compresses high-cardinality attributes well, so the disk impact is small.
Can I use this for billing?
Yes. Counting requests per tenant from trace data is one of the cleanest ways to drive usage-based billing. Just be careful about sampling — if you sample traces, your billing numbers need to scale up to compensate.