Diagnostic model

Token and Authority

A token meter counts consumption. A session defines the economic event.

In April 2026, reporting surfaced that Uber had exhausted its 2026 Claude Code budget within the first months of the year. CEO Dara Khosrowshahi acknowledged the overrun and noted that roughly ten percent of code changes were being produced by autonomous agents under human review. The cost surface was agent-driven. The tools were doing what they were marketed to do. The budgeting assumptions did not survive contact with the consumption pattern.

In May 2026, The Verge reported that Microsoft's Experiences and Devices division, covering Windows, Microsoft 365, Outlook, Teams, and Surface, was winding down most Claude Code usage by the end of June and steering developers toward GitHub Copilot CLI. The reporting pointed to operating-expense pressure aligned with the June 30 fiscal-year close, and to the difficulty of forecasting token-based consumption at scale.

In the same period, China Mobile, China Telecom, and China Unicom began rolling out tiered subscriber plans metering not bandwidth but inference. Ten million tokens a month at the consumer tier. Two hundred and fifty million at the enterprise tier. Bundled connectivity, security, API access, multi-agent routing, and model ecosystem entitlements packaged with the plan. The 1990s were voice minutes. The 2000s were SMS. The 2010s were gigabytes. The 2026 plan is intelligence, billed by the token.

Two enterprise overruns and a carrier monetization move. Different industries, different stacks, different decisions in response. The same architectural cause underneath.

The press has framed these as a subscription-versus-utility problem, a predictability problem, a procurement problem, the rise of the CFO in AI buying decisions. All of those framings are accurate. None of them is the story.

The story is that the token has been treated as a cost model when it is actually a meter without a transaction. And once inference becomes a billable participant event inside a carrier subscriber relationship, the architectural absence that produced Uber's overrun and Microsoft's pullback becomes a structural property of every tokenized AI deployment at scale.

Token Is Four Different Things

The same overloaded vocabulary that collapsed storage, retention, memory, and focus into one word has collapsed four distinct economic functions into the word token. The collapse hides the architecture. Pulling the word apart is the first move, because cost control begins with knowing what cost was actually incurred.

Model unit

The computational quantity the model consumes or produces. This is what the GPU accounts for: electricity, cooling, depreciation on the silicon, capacity opportunity cost. When the system retries a failed call, runs an extra summarization pass, or sustains a longer context window than the task required, it is spending compute. The compute cost is paid by whoever owns the inference infrastructure, which may not be the actor who is billed for the consumption.

Billing unit

The line item on the invoice. This is what the vendor counts and what the buyer pays. The meter advances regardless of whether the underlying work was authorized, useful, or repeated. Microsoft's exposure was at least partly visible in this ledger. Uber's overrun was more directly visible there. The meter does not distinguish between a valid inference and a retry caused by agent failure, or between two agents that triggered the same query because they were unaware of each other.

Entitlement unit

The quota the plan allocates. A family plan with a shared cap. An enterprise plan with role-specific limits. A carrier plan with content restrictions. Entitlement is drawn from a pool sized against an assumption of who and what would consume from it. If an agent workflow consumes in an afternoon what a senior developer uses in a month, the plan is still being honored. The drawdown has simply exceeded the assumption on which the plan was sized.

Authority event

The participation act admitted into a live interaction. This is the ledger nobody is currently counting. The cost is governance load: the policy reconciliation that runs before the inference, the audit binding that runs during, the compliance review that runs after, the forensic reconstruction that runs when something goes wrong, and the liability exposure that accumulates across all of it. When the authority chain remains open longer than it should, every retry, tool call, and delegated agent is operating under a permission that nobody has re-evaluated.

Four ledgers. One word. The vendor counts the billing ledger. The model provider counts the compute ledger. The procurement team counts the entitlement ledger. Nobody is counting the authority ledger, because nobody has a place to count it.

Unit-cost deflation at the model ledger does not establish cost deflation at the session ledger. The four ledgers can move in different directions at the same time, and the field has been arguing about which direction the cost is moving while operating under a unit that does not specify which cost is being discussed.

Cost Accounting versus Cost Control

The enterprise response to AI cost surprise has been to reach for forecasting and attribution tools. CloudZero, Apptio, Anodot, and a growing category of FinOps vendors are building dashboards, alerts, and spend attribution infrastructure. The tools are real. The work is competent. The market need is being addressed.

None of it is cost control. The distinction matters, and it is the load-bearing economic argument of this page.

Cost accounting tells you where the money went. It produces invoices, dashboards, forecasts, and reports. It tells the enterprise that the AI budget was exhausted in four months instead of twelve. It does not change what happens next. It does not stop the agent that is about to retry a failed call. It does not deny the entitlement drawdown that is about to push the team over its cap. It does not terminate the session that is about to spawn a multi-agent cascade. It produces the receipt.

Cost control requires three things that cost accounting does not provide. The first is classification: the ability to identify which ledger the cost belongs to, because the four cost centers respond to different interventions and treating them as one number means every intervention is a blunt instrument. The second is influence over the cause before it produces more cost: not a forecast that the budget will exceed its cap by Tuesday, but a control surface that can deny the next retry, expire the next authority grant, or terminate the next session. The third is a unit small enough to act on at inference-time. Cost accounting reports against the month or the quarter. Cost control acts against the next inference. The unit that operates at inference-time is not the budget. It is the session.

A company cannot control AI cost merely by knowing what was spent. It must know which ledger created the cost and have authority to change the condition that would create the next one.

A token meter is not a cost-control system. It is a receipt. The FinOps category is building accounting infrastructure for an economy that has not yet defined its unit of account. The tools aggregate four different currencies into a column labeled in dollars and report the total. The total is technically accurate. It does not tell the enterprise what was bought, under what authority, or whether the purchase was admissible.

When Inference Becomes a Participant Event

The Chinese carrier plans are more than a pricing story. They represent the crossing of a structural threshold: inference has moved from application usage into subscriber entitlement. Once that happens, the model is no longer merely called. It is admitted.

Bandwidth was transport. A gigabyte is a quantity of data that moved through a pipe. It carries no opinion about who sent it, who received it, what it contained, or what was done with it. The metering unit and the governance unit were the same thing: the pipe.

Tokens are not transport. A token is a unit of inference. It represents a participant act inside the interaction the subscriber is having. It carries a who, a what, a where, a why, and a downstream consequence. The metering unit is the same word as before, but the governance unit is no longer the pipe. It is the act of participation.

Most current architectures treat the model as infrastructure, something the application calls, something invisible to the user, something the system orchestrates. That framing was always architecturally imprecise. It becomes operationally wrong the moment the model is metered as a participant in a subscriber relationship. A participant in an interaction can observe, transform, summarize, remember, recommend, act, trigger downstream effects, persist state, and return later with the history of what it did the first time. None of that is what infrastructure does. All of it is what participants do.

The model is not the session. It is a participant admitted into the session. Admission is an authority act. Treating admission as if it were infrastructure provisioning is what produces the failure modes the field has been documenting. The Cursor agent that deleted a production database in a reported April 2026 incident was admitted to a session that gave it production-grade authority because the system had no place to evaluate the admission as an authority act. It was provisioned. It was not admitted.

The carrier billing for tokens acknowledges, in economic form, that something was participating. The architectural question is whether the admission was governed.

Agents Make the Problem Recursive

A model invocation is bounded by a prompt and a response. The cost is roughly predictable at the per-call level: input length, output length, model selected, rate card. Enterprises have been pricing model use for two years and the per-call unit economics are understood.

An agent invocation is bounded by a task, and a task can require an unbounded number of model calls, tool calls, retries, memory reads, memory writes, and delegations to other agents before the task is considered complete. The cost of an agent invocation is not roughly predictable. It is a sequence of cost-bearing acts, each of which appears locally rational, with no bound on the sequence's total cost beyond whatever the agent eventually decides is done. That is the cost surface that exhausted Uber's budget. Not the price per token. The unboundedness of task-level consumption that token pricing was never designed to govern.

Multi-agent systems compound the problem. Agent A delegates to Agent B, which calls a third-party model, which routes to a tool, which writes to a memory that Agent C later acts on. Each step appears locally valid. None of the steps was authorized as part of the original task in any way the cost ledgers can recognize. The compute fires, the meter advances, the entitlement drains, and the authority chain extends across actors who never explicitly consented to be on the chain.

A model spends tokens to answer. An agent spends authority to act, and the authority spends across all four ledgers at once. The model creates tokens. The agent creates token liability.

In a model world, the question is: was this inference allowed? In an agent world, the question becomes: was this chain of inferences, tool calls, state changes, and delegated actions still operating under the authority originally granted? That is not a metering problem. It is a session-governance problem, and the carrier plans currently being marketed do not have a place to evaluate it.

The Failure Domains This Activates

Once inference becomes a billable participant event inside a carrier subscriber relationship, the architecture is asked to coordinate several failure domains simultaneously. Each is a known site of industry failure independently. The carrier-as-AI-distributor pattern activates them at once.

Authority ambiguity: who authorized the model? The end user, the device, the carrier, the application, the enterprise IT policy, the family plan owner, the model provider's terms of service? When inference happens inside an enterprise account on a personal device on a corporate plan calling a third-party model under a regional regulation, the authority is fragmented across at least six actors, none of whom is currently structurally responsible for resolving the fragmentation.

Billing-authority mismatch: the model may be invoked by one actor, billed to another, routed by a third, governed by a fourth, and held accountable by a fifth. A toll road cannot charge a vehicle coherently unless the system knows who entered, where they entered, which account applies, and which rules govern the trip. Tokenized inference has the same structure, without the toll plaza.

Model admission failure: no widely deployed architecture distinguishes between "the model is available" and "the model is admitted into this specific interaction for this specific participant with this specific authority." The first is a capability statement. The second is an authority statement. Once the carrier meter starts, the model is, by definition, participating. Whether it should be participating in this session, at this moment, under this policy, is the question the meter does not ask.

Policy fragmentation: consumer plan, family plan, enterprise plan, jurisdictional rule, model provider rule, carrier rule, device rule, and application rule can all disagree with one another in any given session. None of them is currently authoritative over the others. The interaction proceeds under whichever rule was checked last, or whichever rule the model decides to weight.

Compute placement and residency: once a token can route through carrier edge compute, regional cloud, private cloud, model partner infrastructure, or third-party GPU capacity, the placement decision is not just an optimization. It is a policy act with sovereignty, privacy, contract, audit, and liability dimensions. It must be resolved against the session's policy stack before the inference fires.

Audit discontinuity: token consumption, model selection, prompt context, output return, billing event, and policy decision are typically logged to different systems run by different parties. The carrier can prove tokens were used. The model provider can prove the model was invoked. The enterprise can prove the user was authenticated. None of them can prove, jointly, why this specific inference was admitted under what authority. The forensic gap is not a software bug. It is a structural property of how the participants in this market are configured to log.

The efficiency paradox: AI was introduced to reduce friction. Tokenized AI creates a pre-inference coordination layer that consumes infrastructure before any output exists. Authority resolution, entitlement check, model admission, compute placement, policy reconciliation, billing assignment, and audit binding all run before the model answers. The token meter starts after this work has already happened. The system spends on coordination, then sells inference. The marketing claims efficiency. The infrastructure absorbs the cost of producing it.

The Session Is the Economic Container

Tokens are the accounting unit. Energy is the physical input. Coordination is the hidden tax that determines how much energy survives into useful output. Before the system produces useful output, it must spend energy maintaining identity, policy, state, authority, routing, memory, and admissibility across the interaction. Token output competes with coordination energy, and the coordination cost is not on the meter. It is what the meter is built on top of.

The four ledgers only become governable if a session binds them together. Without a session, the four ledgers describe one live interaction from four incompatible positions. The model counts compute. The vendor counts billing. The plan counts entitlement. The compliance system tries to reconstruct authority after the fact. Each is locally accurate. None is jointly meaningful.

A session is what closes the gap. The session is the bounded unit that can say: this inference was admitted here, billed here, allocated here, authorized here, and terminated here. With that boundary in place, the four ledgers describe the same economic event from four complementary positions, and the system can act on the event before, during, and after its execution.

When the session is the governor, the four ledgers cohere. Compute spending is bounded by the session's compute scope. Billing is bounded by the session's accounting scope. Entitlement is drawn against the session's authorization scope. Authority is held by the session's policy scope. When the session terminates, all four close at once. The compute stops, the billing closes, the entitlement is restored, and the authority chain dies. That is the economic shape of a governed transaction, and it is the shape no current architecture produces.

Token billing without a session is invoicing without a contract. The system can count what crossed the meter. It cannot define what was bought, by whom, under what authority, for what purpose. The FinOps tools sit on top of the four ledgers and try to make sense of them as observed phenomena. The architectural answer sits underneath the four ledgers and binds them into a single transaction the system can act on before cost is incurred.

A toll authority cannot run a toll road by assuming each driver self-reports. A bank cannot clear settlements by assuming each counterparty self-attests. A power grid cannot remain synchronized if each generator decides locally what frequency to produce. In every case where capability has consequences at scale, the system requires an authority structure that exists outside the participants and refuses operations that would let the system slip into incoherence, regardless of what any individual participant decides. Tokenized inference is now in that category.

Due Diligence Test

The diligence question is not whether a vendor can count tokens. Most can. The diligence question is whether the vendor can bind compute, billing, entitlement, and authority to the same governed session, then influence the next cost-bearing event before it occurs. If the answer is no, the system has cost accounting but not cost control.

A governed session can say: this inference was admitted here, billed here, allocated here, authorized here, and terminated here. A cost-accounting system can say: this many tokens were consumed this month. Those are not the same capability, and acquiring or licensing the second does not provide the first.

Cost per token is not a well-formed measurement until the token is attached to a bounded economic event.

The token is a metered unit of movement. The session is the clearinghouse. Without the clearinghouse, there is no final settlement. There are only accumulated meter readings in four currencies that do not convert against each other.

The model does not govern the session. The session governs the model, and the session is what makes the cost of the model expressible in the first place.