Tracking Customer Cost-of-Serve from Your Trace Data
If you can't see which customer is consuming 40% of your CPU, you can't price your enterprise tier. Here's how to derive cost-of-serve per customer from OpenTelemetry traces.
If you sell enterprise tiers, you've had this conversation: a customer wants a 30% discount, your CFO asks what that does to gross margin, and nobody actually knows because per-customer cost is a guess. The data to answer this is in your traces. You just have to compute it.
For broader SaaS observability context, see the SaaS observability page and the per-tenant SLOs guide.
The simplest cost model
Start with a single approximation: the cost to serve a customer is proportional to the time your servers spend handling their requests. If a customer accounts for 5% of total request-seconds, they account for roughly 5% of your variable infrastructure cost.
The formula:
customer_cost_share = sum(duration of all spans for this tenant) / sum(duration of all spans across all tenants)
customer_monthly_cost = customer_cost_share × monthly_infrastructure_cost
For a SaaS with $20K/month in cloud infrastructure and a customer accounting for 8% of request-seconds, their cost-of-serve is roughly $1,600/month. If their MRR is $1,500, that customer is unprofitable.
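The arithmetic above can be sketched in a few lines of Python. The figures are the hypothetical ones from the example (4M of 50M request-seconds is the 8% share):

```python
# Hypothetical figures matching the example above.
monthly_infra_cost = 20_000         # total cloud spend, $/month
tenant_request_seconds = 4_000_000  # this tenant's summed span durations (8%)
total_request_seconds = 50_000_000  # all tenants combined

cost_share = tenant_request_seconds / total_request_seconds  # 0.08
monthly_cost = cost_share * monthly_infra_cost               # ~$1,600

mrr = 1_500
print(f"cost-of-serve ${monthly_cost:,.0f}, margin ${mrr - monthly_cost:,.0f}")
```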
The query
Assuming you've tagged spans with tenant.id (see per-tenant SLOs):
SELECT
span_attributes['tenant.id'] AS tenant,
sum(duration_ns) / 1e9 AS total_seconds,
count() AS request_count,
sum(duration_ns) / 1e9 / 86400 / 30 AS avg_concurrent_load -- busy seconds / seconds in 30 days
FROM otel_traces
WHERE
span_kind = 'SERVER' AND
timestamp > now() - INTERVAL 30 DAY
GROUP BY tenant
ORDER BY total_seconds DESC;
This gives you the raw per-tenant compute time. Multiply by your cost-per-second to get dollars.
To compute cost-per-second:
cost_per_second = monthly_infrastructure_cost / (total_request_seconds_across_all_tenants)
For a service running 24/7 on $5K/month of compute serving 50M request-seconds/month, that's $0.0001 per request-second. A customer with 5M request-seconds costs you $500/month in compute alone.
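As a sanity check on those numbers, a minimal Python sketch using the same hypothetical $5K/month and 50M request-seconds:

```python
# Hypothetical figures from the example above.
monthly_compute_cost = 5_000        # $/month of compute
total_request_seconds = 50_000_000  # summed SERVER span seconds, all tenants

cost_per_second = monthly_compute_cost / total_request_seconds  # $0.0001

tenant_seconds = 5_000_000          # one heavy tenant
print(f"${tenant_seconds * cost_per_second:,.0f}/month")  # $500/month
```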
Refinements
The simple model is good enough for first-pass pricing decisions. Three refinements make it more accurate:
Database query cost. Traces include database span durations. Database compute is typically 30–50% of total backend cost; weighting database span time at 1.5× general server time captures this.
Outbound API cost. If you call paid third-party APIs (Stripe, OpenAI, Twilio), those have per-request costs. Tag outbound spans with app.api.cost_usd and sum per tenant.
Storage by tenant. Traces don't include this directly; query your database tables grouped by tenant to get bytes-per-tenant. Most SaaS data costs ~$0.10/GB/month at scale.
A more complete formula:
total_cost = (compute_seconds × $/s) +
(db_seconds × 1.5 × $/s) +
sum(api_costs) +
(gb_stored × $/gb)
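The complete formula translates directly into a function. This is a sketch with the illustrative rates used throughout this post ($0.0001/s, 1.5× DB weight, $0.10/GB); the example tenant's numbers are made up:

```python
def tenant_cost(compute_s, db_s, api_costs, gb_stored,
                usd_per_s=0.0001, db_weight=1.5, usd_per_gb=0.10):
    """Cost-of-serve combining the four components above.

    Rates are the illustrative figures from the article, not universal
    constants -- substitute your own blended costs.
    """
    return (compute_s * usd_per_s           # general server time
            + db_s * db_weight * usd_per_s  # DB time, weighted 1.5x
            + sum(api_costs)                # per-request third-party fees
            + gb_stored * usd_per_gb)       # storage

# Hypothetical tenant: 2M compute-seconds, 600K DB-seconds,
# $40 of third-party API calls, 120 GB stored.
print(round(tenant_cost(2_000_000, 600_000, [25.0, 15.0], 120), 2))  # 342.0
```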
Per-customer profitability dashboard
Joined with your billing data:
SELECT
t.tenant,
t.total_cost,
b.monthly_revenue,
b.monthly_revenue - t.total_cost AS gross_margin,
(b.monthly_revenue - t.total_cost) / b.monthly_revenue AS margin_pct
FROM (
SELECT
span_attributes['tenant.id'] AS tenant,
sum(duration_ns) / 1e9 * 0.0001 AS compute_cost,
-- ... add other cost components
compute_cost AS total_cost -- reusing a SELECT alias works in ClickHouse; other dialects need a subquery
FROM otel_traces
WHERE timestamp > now() - INTERVAL 30 DAY
GROUP BY tenant
) t
JOIN billing_data b ON t.tenant = b.tenant_id
ORDER BY margin_pct ASC;
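If you'd rather prototype the join in code before wiring up the dashboard, the same ranking is a few lines of Python. Tenant names and figures here are hypothetical:

```python
# Hypothetical per-tenant figures: trace-derived cost vs. billed MRR.
costs = {"acme": 1600.0, "globex": 90.0, "initech": 410.0}
revenue = {"acme": 1500.0, "globex": 299.0, "initech": 399.0}

rows = sorted(
    ((t, revenue[t] - costs[t], (revenue[t] - costs[t]) / revenue[t])
     for t in costs),
    key=lambda r: r[2],  # worst margin first, like the ORDER BY above
)
for tenant, margin, pct in rows:
    print(f"{tenant:8} margin ${margin:8.2f}  ({pct:6.1%})")
```

Negative-margin tenants sort to the top, which is exactly the "problem children" list described below.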
The customers at the bottom of this list are your problem children. Sometimes they're early-stage customers ramping up usage; sometimes they're plan-mismatched (using enterprise features on a starter plan); sometimes they're outright unprofitable and should be repriced.
What to do with the data
Three concrete uses:
Plan repricing. If your $99/month plan customers are costing you $130/month to serve, the plan is wrong. Either raise the price or move the high-cost customers to a higher tier. Trace data tells you which customers, not just that the plan is bad.
Enterprise contract sizing. When negotiating a custom enterprise deal, you can quote a price floor based on actual cost-of-serve plus a margin target. Beats guessing.
Churn risk identification. Counterintuitively, the customers most likely to churn are often the ones costing you the most — they're using the product hard and any pricing change disproportionately affects them. Cross-reference cost-of-serve with usage frequency and support ticket volume to find at-risk accounts.
The vendor-pricing parallel
The same logic applies in reverse to your APM/observability bill. If your APM vendor charges per-host but only 3 hosts are doing 80% of the actual work, you're overpaying. Per-host pricing is a rough approximation of actual cost-of-serve from the vendor's perspective; usage-based pricing matches it better.
This is one reason SecureNow and similar tools use $/TB scanned: it scales with your real usage, not with infrastructure that may or may not be doing useful work.
The honest limitation
This methodology assumes your cost is proportional to compute time. For most SaaS that's roughly true; for some it isn't.
- Storage-heavy SaaS (file hosting, video) — compute time is misleading, you need to track bytes-stored per tenant.
- AI/ML-heavy SaaS — GPU time per inference dominates and isn't visible in regular traces. Tag inference spans separately.
- Bandwidth-heavy SaaS (CDN, video) — egress cost dominates. Tag spans with response byte counts and weight accordingly.
For the typical web app SaaS, compute time is the right proxy and the simple model works. For specialized cases, refine the cost model with the dominant axis.
Setup time
If you already have OpenTelemetry traces with tenant.id attribution, this is one query and one dashboard panel — about 30 minutes. If you don't have tenant ID tagging yet, that's the prerequisite (instructions) and adds 1–2 hours.
The first time you see your customer profitability sorted by margin, three numbers will surprise you. That's the point.
Frequently Asked Questions
What's customer cost-of-serve?
The cost your business incurs to serve one specific customer — server time, database load, third-party API calls, storage, support burden. Used for pricing decisions, churn analysis, and identifying which enterprise deals are actually profitable.
Why use trace data instead of cloud billing tags?
Cloud billing tags work for shared infrastructure but miss the granular per-request CPU/memory consumption that traces capture directly. Combine them — cloud billing for fixed costs, traces for variable consumption.
What's the simplest cost model?
Total request count × average request duration, weighted by some cost-per-second factor (your blended infrastructure cost / total request seconds). Refine from there.
How accurate is this?
Within 10–20% of true cost for most SaaS. Refine by adding database query cost, outbound API cost, and storage by tenant. The first iteration is good enough to drive pricing decisions.
Recommended reading
- If your team uses Sentry for frontend errors and needs backend distributed tracing without doubling the Sentry bill, here's the OpenTelemetry path that doesn't make you choose.
- Five approaches to bot blocking in Express, ranked by effort vs. effectiveness. From a 5-line allowlist to a full IP-reputation firewall — all without Cloudflare, AWS WAF, or any new infrastructure.
- Fastify hooks (onRequest) and the SecureNow preload both work cleanly. Here's the production setup for IP blocking and user-agent filtering.