Private working draft. Please do not circulate or reproduce without permission.

Measurement for Shared Compute

This guide develops a practical framework for evaluating routing, rate limits, cache policies and other allocation decisions in AI inference systems where an intervention affects both an individual request and the shared capacity system around it.

It combines my research into state-dependent inference costs and fat-tailed demand with my working experience as a pricing and product data scientist. An earlier version of these ideas was discussed on Live with Tim O'Reilly.

The central challenge is that compute allocation decisions should usually be evaluated as policies under shared system state, rather than as isolated user-level treatments.

01

Start With the Question

Turn the operational decision into a concrete causal question before choosing a method.

For any causal question, start by describing the ideal experiment: if nothing constrained you, what experiment would you run to uncover this effect? Answering that tends to clarify how the effect can be identified even without the ideal measurement environment.

Consider a running example. A router sends latency-sensitive, high-value requests to a premium fleet when capacity is available, then falls back to a cheaper fleet during congestion. Five steps turn that into a measurable question:
  1. Name the decision. Is the decision to ship broadly, target a segment, change a routing policy, reserve more capacity, or instrument before acting?
  2. Choose the comparison. The counterfactual might be the old policy, the untreated users, a different cluster, a prior model checkpoint, or a threshold-adjacent group.
  3. Pick the estimand. Average Treatment Effect (ATE) is for broad rollout, Average Treatment Effect on the Treated (ATT) is for the users who actually received treatment, Conditional Average Treatment Effect (CATE) is for targeting, and policy value is for routing or allocation decisions.
  4. Add unit economics as a guardrail. A policy can improve task success while burning too much compute. Pair retention or task completion with latency, errors, GPU-hours, and cost per successful request.
  5. End with a rule. Ship, ramp, target, hold out, pause, roll back, or instrument more. The point is to state the recommendation and the tradeoff that could change it.

Estimand Cheat Sheet

Estimand | Use it when the decision is... | Potential Tradeoff
ATE | Should the policy ship to the whole eligible population? | Can hide that only one workload or customer segment benefits enough to justify the cost.
ATT | What happened to the users or requests that actually received the treatment? | Can be less useful for future rollout if treated users were unusually selected.
CATE | Who should receive scarce compute, higher limits, or better routing? | Can overfit unless the segment is stable, interpretable, and operationally usable.
Policy value | Which routing, scheduling, or allocation rule should run? | Needs support in the logs, system guardrails, and a clear cost/value objective.

Policy value, and why it's not ATE or CATE: ATE and CATE are properties of a treatment ("what does action A do, on average or for segment x?"), whereas policy value is a property of a decision rule ("what is the expected outcome if the system runs rule π?"), where π looks at the context and picks an action.

A dynamic rule treats only some units, in some states, and its value depends on how often those contexts occur and on the constraints in force. As the next section develops, this is the central challenge of measuring resource requirements: allocation policies are not static treatments.

02

Allocation Policies are Not Static Treatments

When the action depends on workload and system state, evaluate the rule that made the decision.

In shared compute, the "treatment" is often a policy rather than a single static action. A router, rate limit, cache policy, or capacity allocation rule is dynamic, and chooses an action based on the request-level workload and the live system state.

A = π(X, S)
X = pre-decision workload and user information
S = live decision-time system state: queue, cache, region, capacity, incidents, and recent demand

This changes what has to be measured. A credible evaluation requires the inputs the policy saw, the intended action, the executed action, the fallback path, and the reason for any divergence. Otherwise the analysis can confuse the policy effect with the state that caused the policy to act.

In the router example, premium-fleet assignment is not simply "treated." The same request may be routed to premium capacity when the queue is shallow and sent to a cheaper fleet when the system is congested. The estimand must therefore attach to the allocation rule, not merely to the serving path that happened to execute.
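To make the distinction between a treatment effect and the value of a rule concrete, here is a minimal sketch. The context probabilities, the expected net outcomes, and the function names are all hypothetical, and the sketch treats system state as fixed at decision time, so it ignores the feedback from routing onto later congestion.

```python
# Hypothetical context distribution: P(workload X, system state S).
CONTEXT_PROBS = {
    ("high_value", "shallow_queue"): 0.2,
    ("high_value", "congested"): 0.1,
    ("low_value", "shallow_queue"): 0.5,
    ("low_value", "congested"): 0.2,
}

# Hypothetical expected net outcome (value minus compute cost) per action.
EXPECTED_OUTCOME = {
    ("high_value", "shallow_queue"): {"premium": 1.0, "cheap": 0.6},
    ("high_value", "congested"): {"premium": 0.5, "cheap": 0.4},
    ("low_value", "shallow_queue"): {"premium": 0.3, "cheap": 0.4},
    ("low_value", "congested"): {"premium": 0.0, "cheap": 0.3},
}


def router_policy(x, s):
    """A = pi(X, S): premium only for high-value requests on a shallow queue."""
    return "premium" if x == "high_value" and s == "shallow_queue" else "cheap"


def always(action):
    """A static 'treatment': the same action in every context."""
    return lambda x, s: action


def policy_value(policy):
    """Expected outcome when the rule picks the action in every context."""
    return sum(
        prob * EXPECTED_OUTCOME[(x, s)][policy(x, s)]
        for (x, s), prob in CONTEXT_PROBS.items()
    )


value_router = policy_value(router_policy)
value_cheap = policy_value(always("cheap"))
value_premium = policy_value(always("premium"))
ate_premium_vs_cheap = value_premium - value_cheap

print(f"ATE of premium vs cheap: {ate_premium_vs_cheap:+.2f}")
print(f"Value of the state-dependent router: {value_router:.2f}")
```

In these made-up numbers the ATE of premium routing is slightly negative, yet the state-dependent rule beats both static options. That is the sense in which the estimand attaches to the rule rather than to the serving path.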

Minimum Logging for a Policy Readout

Field | Why it matters | Failure if missing
Decision-time state | Queue depth, region, cache state, incidents, and capacity pressure may drive both assignment and outcome. | You mistake a hard-to-serve state for a treatment effect.
Intended action | This is the policy recommendation: route, admit, defer, cache, limit, or prioritize. | You cannot report intent-to-treat.
Executed action | Serving paths can change because of timeouts, provider issues, or fallback logic. | You analyze the policy as if it actually ran.
System guardrails | Latency, errors, queue depth, cost per successful request, and untreated-user experience show spillovers. | A user-level lift hides damage to the shared system.
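One way to picture the logging requirement is as a record type with separate fields for intent and execution. The sketch below is illustrative only: the class name, field names, and sample rows are hypothetical, not a schema from any real system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PolicyDecisionLog:
    request_id: str
    queue_depth: int                 # decision-time state S
    region: str                      # decision-time state S
    intended_action: str             # what the policy recommended
    executed_action: str             # what actually served the request
    fallback_reason: Optional[str]   # why intent and execution differ, if they do
    succeeded: bool


def divergence_rate(logs):
    """Share of requests where the executed action differs from the intent."""
    if not logs:
        return 0.0
    return sum(r.intended_action != r.executed_action for r in logs) / len(logs)


def intent_to_treat_success(logs, action):
    """Success rate grouped by intended action, regardless of what executed."""
    group = [r for r in logs if r.intended_action == action]
    return sum(r.succeeded for r in group) / len(group) if group else None


# Hypothetical log rows for the router example.
logs = [
    PolicyDecisionLog("r1", 3, "us-east", "premium", "premium", None, True),
    PolicyDecisionLog("r2", 40, "us-east", "premium", "cheap", "timeout", False),
    PolicyDecisionLog("r3", 5, "eu-west", "cheap", "cheap", None, True),
    PolicyDecisionLog("r4", 7, "eu-west", "premium", "premium", None, True),
]

print(divergence_rate(logs))
print(intent_to_treat_success(logs, "premium"))
```

Without the `intended_action` field, the failed request `r2` would be counted against the cheap fleet rather than against the policy that tried to send it to premium capacity.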

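With the policy's inputs and actions logged, the next question is which of those variables belong in the adjustment set and which are part of the effect itself.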

03

State, Mechanisms, and Bias [needs better title]

Use the DAG as a short checklist: confounder, mediator, collider, spillover.

The choice of control variables should come from the assignment story, not from feature importance. Adjustment variables should exist before the policy decision and plausibly cause both assignment and the outcome. The analysis should not control away the mechanism by which the treatment worked.

Basic System DAG [I'll update this later]

[Figure placeholder. Nodes: workload / user X (before decision); system state S (before decision); allocation action A = π(X, S); mechanism (latency, errors); outcome Y.]

X and S are candidates for adjustment when they are pre-treatment common causes. Latency and errors sit after the action, so they should be reported as mechanism metrics unless the question is specifically about a direct effect.

System State Is Both a Confounder and an Outcome

Shared inference systems make the timing of state measurement central. At time t, system state can be a pre-treatment confounder for the current routing, admission, or fallback decision. At time t + 1, that same class of state variables can be downstream of earlier policy decisions.

A_t = π(X_t, S_t)
S_{t+1} = f(S_t, A_t, W_t)
W_t = arrivals, failures, provider events, cache eviction, and other external disturbances

Queue depth before a routing decision may be an adjustment variable. Queue depth after the policy has run may be part of the policy's effect. In the router example, congestion can determine whether a high-value request reaches the premium fleet, while the routing decision itself can also change later congestion for untreated requests.

  1. Control for common causes. If queue depth, customer tier, workload length, or provider health affected both routing and success, it belongs in the adjustment story.
  2. Do not control for mediators in a total-effect question. Latency, errors, cache hit, and completion may be how the routing change affects retention or task success.
  3. Do not select only completed requests. Completion can be caused by the treatment. Dropping failures makes a bad serving policy look cleaner than it is.
  4. Name the spillover path. If the treatment changes a shared queue, cache, fleet, or regional capacity pool, user-level treatment and control groups may no longer be independent.
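The third item in the list above is easy to demonstrate with arithmetic. The sketch below uses made-up counts for a serving policy that drops more requests; the arm sizes and the `make_arm` helper are hypothetical.

```python
def make_arm(n_completed_ok, n_completed_bad, n_failed):
    """Build (completed, satisfied) rows for one experiment arm."""
    return (
        [(True, True)] * n_completed_ok
        + [(True, False)] * n_completed_bad
        + [(False, False)] * n_failed
    )


control = make_arm(70, 20, 10)   # 90 of 100 requests complete
treated = make_arm(60, 5, 35)    # the new policy drops many requests


def success_rate(arm, completed_only=False):
    """Share of satisfied requests, optionally conditioning on completion."""
    rows = [r for r in arm if r[0]] if completed_only else arm
    return sum(satisfied for _, satisfied in rows) / len(rows)


effect_all = success_rate(treated) - success_rate(control)
effect_completed_only = (
    success_rate(treated, completed_only=True)
    - success_rate(control, completed_only=True)
)

print(f"All requests:       {effect_all:+.3f}")
print(f"Completed requests: {effect_completed_only:+.3f}")
```

On all requests the policy lowers success by ten points. Restricting to completed requests flips the sign, because the failures the policy caused have been filtered out of the sample.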

Probably Bad Controls

Variable | How to treat it | Why
Queue depth before routing | Usually adjust for it. | It can drive both assignment and outcome.
Queue depth after routing | Report it as an outcome or mechanism metric. | It can be downstream of the policy and part of the effect on the shared system.
Latency after routing | Do not control for it in a total-effect estimate. | It is probably part of the mechanism.
Completed requests only | Avoid conditioning on this sample. | Treatment may affect completion, so failures disappear from the readout.
Untreated-user latency | Use as a spillover guardrail. | It shows whether treated traffic harmed the shared system.

Operational principle: Controls come from the assignment story, not from feature importance. Use pre-treatment common causes for adjustment and keep downstream variables visible as mechanisms or outcomes.

04

Fat-Tailed Costs and Capacity Risk**

Cost tails matter both across users and across the system states that set capacity risk.

For AI infrastructure and inference products, a plain difference-in-means readout can be fragile when the metric is tokens, cost, or GPU-hours per user. Those outcomes can be dominated by a small number of very large users, long-context workflows, or high-concurrency windows.

Tokens per request are bounded by the context window, but the variance can still be large as use cases get more heterogeneous. True resource cost is messier because it depends on live system state: cache pressure, batch concurrency, prefill/decode balance, provider health, and the current capacity boundary. The sample mean can move because the policy worked, or because one arm happened to receive more unusually large workloads.

Two tail risks should be kept separate. Cross-sectional tail risk means a small number of users, accounts, or workflows dominate cost. Temporal and system-state tail risk means peak-concurrency periods push the system toward capacity boundaries and alter batching, cache pressure, fallback behavior, and untreated-user experience.

Y = tokens, cost, or GPU-hours per user
Problem = the sample mean can be dominated by tail users or tail system states
First move = report the raw mean, then stress-test whether it is carrying the whole conclusion

Trimming tail users can stabilize an estimate, but it does not answer whether a policy remains safe in the tail states that determine fleet capacity, fallback rates, and customer experience.
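A small numeric sketch shows how one tail user can carry a mean difference, and how winsorizing or reading a quantile answers a different question. The GPU-hours below are invented, the helper names are hypothetical, and the cap is taken from the pooled sample, which is one convention among several.

```python
from statistics import mean


def empirical_quantile(values, pct):
    """Nearest-rank empirical quantile; pct is an integer percent (0-100)."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, (pct * len(ordered)) // 100)
    return ordered[index]


def winsorize_upper(values, cap):
    """Cap every value at `cap`. This changes the estimand."""
    return [min(v, cap) for v in values]


# Hypothetical GPU-hours per user: one very large user in the treated arm.
control = [1.0] * 19 + [5.0]
treated = [1.0] * 19 + [100.0]

raw_diff = mean(treated) - mean(control)

pooled_cap = empirical_quantile(control + treated, 90)
winsorized_diff = (
    mean(winsorize_upper(treated, pooled_cap))
    - mean(winsorize_upper(control, pooled_cap))
)

p95_diff = empirical_quantile(treated, 95) - empirical_quantile(control, 95)

print(f"Raw mean difference:        {raw_diff:.2f}")
print(f"Winsorized mean difference: {winsorized_diff:.2f}")
print(f"p95 difference:             {p95_diff:.2f}")
```

The raw difference is entirely one account, the winsorized difference is zero, and the p95 difference is large. None of the three is wrong; they answer total cost, central-user cost, and tail exposure respectively.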

Robust Cost Readouts [needs review]

Move | When to use it | Decision caveat
Raw mean difference | Start here when the business question is aggregate cost, margin, or total capacity consumed. | The tail is part of the economics, so it should not be thrown away automatically. Sensitivity checks show whether one account or time window carries the result.
Winsorized or trimmed mean | Use as a robustness readout for tokens, cost, or GPU-hours per user. | This changes the estimand. The readout is no longer total average cost; it is a capped or central-user version of cost.
Quantile treatment effects | Use when the tail is the question: p90, p95, p99 cost, latency, or GPU-hours. | This answers a different question than average ROI. It is about tail risk and capacity exposure.
Log transform | Use for a more stable multiplicative readout, such as percent movement in cost or usage. | It downweights very large users and can be easier to model, but the result is no longer in raw dollars or raw GPU-hours.
Randomization inference | Use when CLT-based confidence intervals feel shaky because the sample is small, clustered, or tail-dominated. | It is tied to the actual assignment mechanism, which is good, but it does not solve bad randomization or spillovers.
CUPED / regression adjustment | Use pre-period usage as a covariate, especially for heavy users with persistent behavior. | This is often the biggest variance win, but the covariate has to be pre-treatment and measured consistently.

For a heavy-tailed cost metric, the raw mean should remain visible because it is the business cost, but it should not be the only readout. Pair it with a winsorized or trimmed estimate, a quantile readout if tail risk matters, and CUPED or regression adjustment using pre-period usage. If the confidence interval depends heavily on normal approximations, use randomization inference tied to the actual assignment mechanism.
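The CUPED adjustment mentioned above can be sketched in a few lines. This is a minimal illustration with invented pre-period and in-experiment usage; the coefficient is estimated on the pooled sample, and the function name is hypothetical.

```python
from statistics import mean, pvariance


def cuped_adjust(y, x_pre):
    """Return Y - theta * (X_pre - mean(X_pre)), theta = cov(Y, X) / var(X)."""
    mx, my = mean(x_pre), mean(y)
    var_x = sum((x - mx) ** 2 for x in x_pre)
    if var_x == 0:
        return list(y)  # a constant covariate carries no information
    cov_xy = sum((x - mx) * (yi - my) for x, yi in zip(x_pre, y))
    theta = cov_xy / var_x
    return [yi - theta * (x - mx) for x, yi in zip(x_pre, y)]


# Hypothetical usage per user: pre-period (x) and experiment period (y).
x_control, y_control = [1.0, 2.0, 3.0, 10.0], [1.1, 1.9, 3.2, 9.8]
x_treated, y_treated = [1.0, 2.0, 3.0, 10.0], [1.6, 2.4, 3.7, 10.3]

# Estimate theta on the pooled sample, then split back into arms.
y_adj = cuped_adjust(y_control + y_treated, x_control + x_treated)
adj_control, adj_treated = y_adj[:4], y_adj[4:]

raw_diff = mean(y_treated) - mean(y_control)
adj_diff = mean(adj_treated) - mean(adj_control)

print(f"Raw difference:      {raw_diff:.3f}")
print(f"Adjusted difference: {adj_diff:.3f}")
print(f"Variance before: {pvariance(y_control + y_treated):.3f}")
print(f"Variance after:  {pvariance(y_adj):.3f}")
```

Because the heavy user's pre-period usage predicts most of their in-experiment usage, the adjusted outcome has far lower variance while the point estimate stays the same in this balanced example. The covariate must be measured before assignment for this to be valid.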

When the policy changes a shared system, uncertainty may need to be computed at the time-window, cluster, fleet, or account level. In the router example, a policy can look efficient on ordinary traffic while creating unacceptable fallback rates during peak-concurrency blocks. That is a capacity-risk question, not only a central-tendency question.
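When the unit of analysis is a handful of time blocks, randomization inference can be computed exactly by enumerating the assignments that could have occurred. The sketch below assumes complete randomization of four treated blocks out of eight; the fallback rates and the function name are hypothetical.

```python
from itertools import combinations


def randomization_p_value(outcomes, treated_idx):
    """Exact two-sided p-value under the sharp null of no effect.

    Re-draws which blocks are treated, holding the number treated fixed,
    and compares each difference in means with the observed one.
    """
    n, k = len(outcomes), len(treated_idx)

    def diff_in_means(idx):
        chosen = set(idx)
        treated = [outcomes[i] for i in chosen]
        control = [outcomes[i] for i in range(n) if i not in chosen]
        return sum(treated) / len(treated) - sum(control) / len(control)

    observed = abs(diff_in_means(treated_idx))
    draws = [abs(diff_in_means(c)) for c in combinations(range(n), k)]
    # Small tolerance so floating-point ties count as "at least as extreme".
    return sum(d >= observed - 1e-12 for d in draws) / len(draws)


# Hypothetical fallback rate per peak-concurrency block.
fallback_rate = [0.02, 0.03, 0.02, 0.04, 0.08, 0.09, 0.07, 0.10]
treated_blocks = (4, 5, 6, 7)

p_value = randomization_p_value(fallback_rate, treated_blocks)
print(f"Exact randomization p-value: {p_value:.4f}")
```

The test is only as good as its match to the real assignment mechanism. If blocks were stratified by weekday or assigned in pairs, the enumeration should respect those constraints.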

**I'm couching this section since I don't know what the actual distributions look like. It also needs to be rewritten, given that tail-risk events can represent overcommitted resources and system crashes.

05

Choosing a Credible Design

Pick the strongest credible design, then state the assumption that could fail.

If the situation is... | Recommendation | Key identification question
Randomization is feasible and spillovers are low. | Run a user/workspace/request RCT, depending on the outcome. | Check SRM, balance, guardrails, and whether the randomization unit matches the decision.
The intervention changes shared system state. | Use cluster randomization or a switchback experiment. | Do not pretend requests are independent if the policy changes queue or cache state.
The rollout happened in waves. | Use DiD or an event study. | Show pre-trends and check for incidents or launches that happened at the same time.
Treatment changes at a threshold. | Use RDD around the cutoff. | Keep the claim local to users near the cutoff and check manipulation.
Assignment nudges treatment but take-up varies. | Use IV / encouragement and report LATE. | Check first stage and defend the exclusion restriction.
Only observational logs are available. | Use regression, propensity weighting, or doubly robust estimation if pre-treatment confounders are logged. | Say clearly that unobserved confounding is the remaining risk.
One region, fleet, or provider changed. | Use synthetic control if untreated units match the pre-period. | Show pre-fit and placebo tests. If pre-fit is bad, do not oversell it.
[Note: this section might be redundant? It's a summary of the preceding sections and a lead-in to the experimental designs.]

06

Experimental Designs Under Shared Compute

The unit of randomization has to match the user experience and the capacity system.

User or Workspace RCT

Use for: rate limits, prices, model access, priority tiers, or any change where the same user should keep seeing the same experience.

Why: if the outcome is retention, expansion, or sustained usage, the randomization unit should match how the user actually experiences the product.

Potential downside: randomizing by user usually has less statistical power than randomizing by request, and takes longer to read.

Mitigation: use short-run system metrics as early checks, keep a holdout for long-run outcomes.

Request RCT

Use for: latency, error rate, cost per request, retry rate, or whether a route succeeds.

Why: request-level randomization gives much more statistical power, especially for high-volume traffic. It is a good fit for isolated serving changes where the main outcome is request-level.

Potential downside: it can be a bad user experience if the same workflow gets mixed treatment.

Mitigation: use it only when the change is invisible or low-risk to the user.
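For either RCT unit, the sample ratio mismatch (SRM) check listed in the design table can be run before any outcome is read. Below is a minimal sketch using a chi-square goodness-of-fit test with one degree of freedom; the function name, the counts, and any alarm threshold applied to the result are assumptions.

```python
import math


def srm_p_value(n_control, n_treated, expected_treated_share=0.5):
    """Chi-square goodness-of-fit p-value (1 d.o.f.) for the observed split."""
    total = n_control + n_treated
    expected_treated = total * expected_treated_share
    expected_control = total - expected_treated
    chi2 = (
        (n_treated - expected_treated) ** 2 / expected_treated
        + (n_control - expected_control) ** 2 / expected_control
    )
    # With one degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


# Hypothetical unit counts from a 50/50 experiment.
print(srm_p_value(5000, 5050))  # small imbalance
print(srm_p_value(5000, 5300))  # larger imbalance, worth investigating
```

A very small p-value suggests that assignment, eligibility filtering, or logging is broken. In that case the outcome comparison should wait until the cause is found.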

Cluster Randomization

Use for: changes to batching, caching, regional routing, or fleet allocation where the intervention changes a shared capacity pool.

Why: if users share the same capacity pool, individual randomization can break down because spillover effects from the treated group can change the experience of the control group. Randomizing by cluster contains most interference within the randomized unit.

Potential downside: there are fewer independent units, so statistical power is worse. The clusters may also be of different sizes or otherwise look different: different chip generation, geography, user types, latency, customer value, or incident patterns.

Mitigation: stratify before randomization, check pre-period balance, analyze at the cluster level, and report system-level outcomes instead of only the treated users' outcomes.
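The "stratify before randomization" step can be sketched as a shuffle within each stratum. The cluster names, the chip-generation strata, and the function name are hypothetical; with an odd-sized stratum this version sends the extra cluster to control.

```python
import random


def stratified_cluster_assignment(clusters, stratum_of, seed=0):
    """Randomize treatment within each stratum so arms stay comparable."""
    rng = random.Random(seed)  # fixed seed makes the assignment reproducible
    by_stratum = {}
    for cluster in clusters:
        by_stratum.setdefault(stratum_of[cluster], []).append(cluster)

    assignment = {}
    for stratum in sorted(by_stratum):
        members = sorted(by_stratum[stratum])
        rng.shuffle(members)
        half = len(members) // 2
        for position, cluster in enumerate(members):
            assignment[cluster] = "treatment" if position < half else "control"
    return assignment


# Hypothetical fleets, stratified by chip generation.
clusters = [f"c{i}" for i in range(1, 9)]
stratum_of = {c: ("gen_a" if i < 4 else "gen_b") for i, c in enumerate(clusters)}

assignment = stratified_cluster_assignment(clusters, stratum_of, seed=7)
for cluster in clusters:
    print(cluster, stratum_of[cluster], assignment[cluster])
```

Stratifying on the attributes most likely to drive the outcome, such as chip generation or region, keeps a single unlucky draw from putting all the fast clusters in one arm.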

Switchback Experiment

Use for: cache policy, global routing configuration, fleet allocation, scheduler settings, or any change that has to be turned on for a whole system at once.

Why: if the policy changes queue state or cache state, compare blocks of time under policy A versus policy B rather than pretending individual requests are independent.

Potential downside: demand changes over time. Hour-of-day, day-of-week, incidents, launches, and seasonal traffic can get mixed into the treatment effect. Carryover is also real: cache warmth, queue backlog, retry storms, and provider throttling can persist across blocks.

Mitigation: randomize across comparable time blocks, balance treatment across weekends and weekdays, record incidents, and include washout when cache or queue persistence is material. Blocks must be long enough for the policy to take effect but short enough to avoid conflating treatment with demand cycles.
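A switchback schedule along these lines might be generated as follows. The sketch assumes day 0 is a Monday, balances policies A and B within each combination of day type and block-of-day, and flags the first block after every switch as washout. Flagging a whole block is a simplification, and all names are hypothetical.

```python
import random


def switchback_schedule(n_days, blocks_per_day, seed=0):
    """Assign policy A or B to time blocks, balanced by day type and slot."""
    rng = random.Random(seed)

    # Group blocks into strata: (weekend or weekday, block-of-day slot).
    groups = {}
    for day in range(n_days):
        is_weekend = day % 7 in (5, 6)  # assumes day 0 is a Monday
        for slot in range(blocks_per_day):
            groups.setdefault((is_weekend, slot), []).append((day, slot))

    # Within each stratum, shuffle an evenly split list of labels.
    policy = {}
    for key in sorted(groups):
        blocks = groups[key]
        labels = ["A", "B"] * (len(blocks) // 2) + ["A"] * (len(blocks) % 2)
        rng.shuffle(labels)
        for block, label in zip(blocks, labels):
            policy[block] = label

    # Emit blocks in time order and flag the first block after each switch.
    schedule, previous = [], None
    for day in range(n_days):
        for slot in range(blocks_per_day):
            label = policy[(day, slot)]
            switched = previous is not None and label != previous
            schedule.append(
                {"day": day, "slot": slot, "policy": label, "washout": switched}
            )
            previous = label
    return schedule


schedule = switchback_schedule(n_days=14, blocks_per_day=4, seed=1)
analysis_blocks = [b for b in schedule if not b["washout"]]
print(len(schedule), "blocks,", len(analysis_blocks), "kept after washout")
```

Dropping washout blocks costs sample size, so the block length and the washout rule should be chosen together based on how long cache warmth or queue backlog actually persists.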

[Note: I wonder if I should frontload this. It is the most useful/tractable framework to my mind.
Could do with mention of cluster-level randomization for network-based products, and switchbacks for marketplaces (e.g., DoorDash).]

07

Observational Methods

Use them when randomization is unavailable, but be honest about measured versus unmeasured confounding.
[Note: Everything from this section onwards is pure LLM prose. will clean and refine.]

Observational work should not oversell the estimator. The real question is whether the historical comparison has the right pre-treatment variables and enough overlap to support the recommendation.

Method | When to use it | Identification risk
Outcome regression | Good first pass when the main assignment variables are logged: model, account tier, region, time of day, queue depth, prompt length, and past usage. | A model that predicts retention well can still estimate the treatment effect badly if the assignment story is wrong.
Propensity weighting or matching | Useful when treated and untreated units overlap but had different probabilities of treatment. | Extreme weights or poor overlap mean the result depends on a few unusual records. Show overlap and trim if needed.
Doubly robust estimation | Stronger default when both assignment and outcome can be modeled. | It still cannot fix missing confounders. Lead with the identification assumption, not the estimator name.
Double ML | Useful with rich telemetry and many covariates where flexible models help with nuisance functions. | Better prediction is not better causality by itself. Explain cross-fitting plainly and still show diagnostics.
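As an illustration of the propensity weighting row, the sketch below computes a normalized (Hajek) inverse-propensity-weighted effect and reports how many records were trimmed for poor overlap. The propensities are treated as known, which is an assumption; in practice they are estimated. The data and function name are hypothetical.

```python
def ipw_ate(rows, min_p=0.05, max_p=0.95):
    """Normalized IPW estimate of the ATE on the overlap region.

    rows: (treated, outcome, propensity) tuples, where propensity is the
    probability of treatment given pre-treatment state.
    Returns (estimate, number of rows trimmed for poor overlap).
    """
    kept = [r for r in rows if min_p <= r[2] <= max_p]
    weight_t = sum(1 / p for t, _, p in kept if t)
    weight_c = sum(1 / (1 - p) for t, _, p in kept if not t)
    if weight_t == 0 or weight_c == 0:
        raise ValueError("no overlap: one arm is empty after trimming")
    mean_t = sum(y / p for t, y, p in kept if t) / weight_t
    mean_c = sum(y / (1 - p) for t, y, p in kept if not t) / weight_c
    return mean_t - mean_c, len(rows) - len(kept)


# Hypothetical logs: premium routing is rare when congested (p = 0.2)
# and common when the queue is shallow (p = 0.8). True effect is 0.1.
rows = (
    [(True, 0.5, 0.2)] * 2 + [(False, 0.4, 0.2)] * 8      # congested
    + [(True, 0.9, 0.8)] * 8 + [(False, 0.8, 0.8)] * 2    # shallow queue
)

treated_y = [y for t, y, _ in rows if t]
control_y = [y for t, y, _ in rows if not t]
naive_diff = sum(treated_y) / len(treated_y) - sum(control_y) / len(control_y)

ipw_estimate, n_trimmed = ipw_ate(rows)
print(f"Naive difference: {naive_diff:.2f}")
print(f"IPW estimate:     {ipw_estimate:.2f} ({n_trimmed} rows trimmed)")
```

The naive comparison credits premium routing with the easier system state it was assigned in. Weighting recovers the effect here only because queue state, the variable that drove assignment, was logged.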
If the policy used a field that was not logged, no observational method fixes that. The recommendation should be instrumentation or an experiment, not a fancier model.

08

Quasi-Experimental Designs

Use rollout timing, thresholds, instruments, or a synthetic comparison only when the assignment story supports it.

These are strongest when the business already created quasi-random variation: a phased rollout, a threshold, a beta invite, or one region/fleet/provider changing before the rest. Pick the design based on the assignment story, then state the assumption that could fail.

Design | Recommendation | Tradeoff to check
Difference-in-differences | Use when capacity, routing, pricing, or provider changes roll out to some units before others. | Pre-trends have to look credible. Also check incidents, launches, or demand shifts at the same time.
Synthetic control | Use when one region, fleet, provider, or customer tier changes and there is a long pre-period. | A clean post-period gap is not enough. Show pre-period fit, placebo units, and donor weights.
Instrumental variables / LATE | Use when assignment nudges treatment but actual take-up is imperfect or self-selected. | The estimate is local to compliers, and the instrument has to affect the outcome through treatment only.
Regression discontinuity | Use when eligibility, limits, priority, or pricing changes sharply at a known cutoff. | The result is local to the cutoff. Check bunching, covariate continuity, and bandwidth sensitivity.

09

Heterogeneity and Targeting

Use segments to make a decision, not to decorate the analysis.

Heterogeneity analysis earns its place only when it changes the allocation rule: who gets capacity, which workload is routed differently, or which customer segment is served differently. If none of those would change, the segmentation is not informing a decision.

  1. Start with the average effect. That shows whether there is a real lever worth segmenting.
  2. Pre-specify operational segments. Good candidates are workload length, customer tier, model family, region, latency sensitivity, and peak versus off-peak demand.
  3. Use CATE only if targeting is the decision. The segment has to be stable, interpretable, and simple enough to operate.
  4. Convert lift into net value. A high treatment effect can still be a bad allocation if it uses expensive capacity for low-margin traffic.
  5. Keep some exploration alive. If the router only exploits the current best segment, future evaluation loses support for alternatives.
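Step 4 in the list above is a short calculation once lift, value, and cost are on the same per-request basis. The segments, numbers, and field names below are hypothetical.

```python
# Hypothetical per-segment readouts: lift in success probability,
# value of one additional success, and extra compute cost per request.
segments = [
    {"name": "free_tier_batch", "lift": 0.10,
     "value_per_success": 2.0, "extra_cost_per_request": 0.50},
    {"name": "long_context_enterprise", "lift": 0.04,
     "value_per_success": 50.0, "extra_cost_per_request": 1.50},
    {"name": "realtime_pro", "lift": 0.02,
     "value_per_success": 20.0, "extra_cost_per_request": 0.10},
]


def net_value_per_request(segment):
    """Expected incremental value minus incremental compute cost."""
    return (
        segment["lift"] * segment["value_per_success"]
        - segment["extra_cost_per_request"]
    )


def rank_segments(segments):
    """Segments with positive net value, best first."""
    worthwhile = [s for s in segments if net_value_per_request(s) > 0]
    return sorted(worthwhile, key=net_value_per_request, reverse=True)


for segment in rank_segments(segments):
    print(segment["name"], round(net_value_per_request(segment), 2))
```

The segment with the largest lift drops out entirely, because each extra success there is worth less than the compute it consumes.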
Segment lens | What it helps decide | What to watch
Workload shape | Whether long-context, batch, real-time, or agentic workflows should receive different routing or limits. | Aggregate usage can hide very different cost and latency profiles.
Customer tier | Whether scarce capacity should be prioritized for customers with higher value or stricter SLAs. | Higher value may also come with higher reliability expectations.
System state | Whether the policy should change during peak demand, incidents, or low-utilization periods. | The same treatment can be worth it off-peak and too expensive during congestion.

Operational principle: The average effect answers whether the lever works. Heterogeneity answers where scarce compute should be spent first.

10

Offline Evaluation of Routing Policies

Use OPE only when the logs actually support the policy you want to evaluate.

Offline policy evaluation is useful for routers, schedulers, ranking policies, and allocation rules, but only if the logs contain enough examples of the actions the new policy wants to take. Check support before estimating anything.

Logged propensities can support short-horizon contextual routing evaluation, especially when the action affects the current request and the relevant state was logged. They are less sufficient for full dynamic-policy evaluation when actions alter future queue and cache state. In that setting, IPS or doubly robust estimates can help screen policies, but live validation is still needed before broad deployment.

  1. Check support. Did the old policy try the same actions for similar state often enough?
  2. Check propensities. If action probabilities were logged, IPS and doubly robust methods become more credible.
  3. Inspect weights. If a few records drive the answer, the offline estimate is too fragile for a broad rollout.
  4. Validate live. Even a good OPE result should usually become a limited ramp or controlled exploration bucket before full deployment.
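The first three checks in the list above can be sketched together: a support check, an IPS estimate from logged propensities, and two weight diagnostics. The log format, rewards, and function names are hypothetical, and the sketch covers only the short-horizon contextual case where actions do not alter future state.

```python
def unsupported_pairs(logs, target_policy):
    """(context, action) pairs the target policy wants but the logs never saw."""
    seen = {(ctx, action) for ctx, action, _, _ in logs}
    contexts = {ctx for ctx, _, _, _ in logs}
    return sorted(
        (ctx, action)
        for ctx in contexts
        for action, prob in target_policy(ctx).items()
        if prob > 0 and (ctx, action) not in seen
    )


def ips_value(logs, target_policy):
    """Inverse propensity scoring estimate of the target policy's value.

    logs: (context, action, reward, logging propensity of that action).
    Returns (estimate, effective sample size, largest weight).
    """
    weights = [target_policy(ctx).get(a, 0.0) / p for ctx, a, _, p in logs]
    estimate = sum(w * r for w, (_, _, r, _) in zip(weights, logs)) / len(logs)
    sum_sq = sum(w * w for w in weights)
    ess = sum(weights) ** 2 / sum_sq if sum_sq > 0 else 0.0
    return estimate, ess, max(weights)


# Hypothetical logs from a router that randomized 50/50 in every state.
logs = [
    ("shallow", "premium", 1.0, 0.5),
    ("shallow", "cheap", 0.6, 0.5),
    ("congested", "premium", 0.2, 0.5),
    ("congested", "cheap", 0.5, 0.5),
]


def target_policy(ctx):
    """New rule: premium on a shallow queue, cheap under congestion."""
    return {"premium": 1.0} if ctx == "shallow" else {"cheap": 1.0}


def defer_policy(ctx):
    """A rule that uses an action the old router never tried."""
    return {"defer": 1.0} if ctx == "congested" else {"premium": 1.0}


print(unsupported_pairs(logs, target_policy))
print(ips_value(logs, target_policy))
print(unsupported_pairs(logs, defer_policy))
```

If the unsupported list is non-empty, or the effective sample size is a small fraction of the log, the estimate rests on too few records and controlled exploration is the better next step.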
Estimator | When to use it | Identification risk
Direct method | Quick baseline when rich state and enough outcomes exist for each action. | Low variance, but biased if the outcome model is wrong for rarely chosen actions.
IPS | Stochastic routers or experiments where action probabilities are known. | Unstable when weights explode, especially if the new policy chooses actions the old one rarely tried.
Doubly robust OPE | Logged propensities plus a reasonable reward model. | Stronger default, but it still needs support, the right state variables, and no major interference.

Operational principle: Before trusting OPE, verify that the old policy actually tried the actions the new policy wants to take in comparable states. If support is weak, add controlled exploration.

11

Long-Run Outcomes and Short-Run Signals

Use short-term metrics as evidence only after showing why they should predict the delayed outcome.

Short-run metrics are useful, but they should not be treated as business outcomes by default. First state what part of the mechanism they measure, then explain why that mechanism should translate into retention, expansion, or LTV.

Surrogate Ladder

  1. System metric: TTFT, latency, cache hit rate, error rate, retry rate, or cost per successful request.
  2. Session outcome: task completion, abandonment, re-run rate, or successful workflow completion.
  3. User behavior: return rate, deeper usage, broader adoption, or more high-value workflows.
  4. Business outcome: retention, expansion, renewal, LTV, or margin.

The ladder can break. A faster request is not automatically higher retention, and more usage can still hurt margin if the workload is expensive or low value. A metric that predicts retention is not necessarily a valid surrogate for retention. A valid surrogate claim requires evidence that movement in the short-run metric captures the causal path by which the policy changes the long-run outcome.

If the long-run outcome takes too long to observe, the recommendation should be conditional: ramp only if the mechanism metric improves, the cost and reliability guardrails hold, and early behavior moves in the direction past experiments suggest. For policies expected to affect retention or expansion, keep the smallest durable holdout that can answer the long-run question. If no holdout is possible, label the surrogate risk explicitly rather than pretending the short-run metric proves the whole case.

12

Checks That Change the Decision

Checks should tell you whether to ship, ramp, target, pause, roll back, or instrument more.

A check that cannot change the recommendation is diligence for its own sake. For each check below, decide in advance which result would move the decision.

Check | What to look for | How it changes the recommendation
Assignment | SRM, eligibility, exposure time, and whether treatment started when the logs say it did. | If assignment is broken, pause the readout.
Execution | Intended action, executed action, fallback reason, timeout path, and retry path. | If execution diverged, report ITT and separately analyze the operational failure.
Balance and overlap | Balance tables, propensity overlap, pre-trends, or pre-period synthetic fit depending on the design. | If overlap is poor, narrow the estimand or downgrade the claim.
System spillover | Untreated-user latency, errors, queue depth, cache state, and utilization. | If spillovers are material, move to cluster or switchback evidence.
Economic significance | User value per GPU-hour, cost per successful request, margin impact, and reliability risk. | If the practical effect does not clear the cost bar, do not ship just because the p-value is good.
Rollback trigger | Latency above X, error rate above Y, cost per success above Z, or no movement in task completion. | If those triggers fire, pause or roll back even if the average user metric looks fine. In the router example, elevated fallback rates during congestion should stop the ramp even if premium-routed requests improved.

13

Decision Readout Template

A compact structure for a technical decision memo or experiment readout.

Start by naming the counterfactual: for this population, what would this outcome have looked like under the old policy or the best operational alternative?

State how treatment or allocation was assigned. If randomization is available and spillovers are low, use the right RCT unit. If the policy changes shared capacity, use cluster randomization or a switchback. If the evidence is historical, look for rollout timing, a threshold, an instrument, or enough pre-treatment controls.

Name the main failure mode: confounding, bad controls, noncompliance, spillovers, weak support, or tail-state risk. Mitigate it with balance checks, pre-trends, full-funnel logging, holdouts, cluster-level uncertainty, washout, or controlled exploration.

Close with the decision: ship, ramp, target a segment, keep a holdout, pause, roll back, instrument more, or run a stronger design, based on user lift, system guardrails, and compute cost.

Common Decision Patterns

Observed situation | Decision pattern | Contingency
Routed users retained better. | Do not compare routed vs not-routed directly. Ask what drove routing. | If queue state or workload type drove routing, control for pre-treatment state or run an experiment.
The policy can be tested safely. | Randomize at the unit that matches the outcome. | If shared capacity matters, move up to cluster or switchback.
The policy already rolled out. | Use DiD/event study if the rollout timing gives a credible comparison. | If pre-trends fail or there was a simultaneous incident, do not lean on DiD.
There is a cutoff. | Use RDD near the threshold. | Check manipulation and keep the claim local.
Only router logs are available. | Use OPE only if propensities and support exist. | If support is weak, recommend controlled exploration.

Operational principle: End with the decision, not the method. State what should ship, what should be monitored, and what result should stop the ramp.