Every Token Has a Price: Why LLM Cost Telemetry Is Now Production Infrastructure

A team ships an internal assistant that “just summarizes docs.” Usage triples after rollout. Two weeks later, finance flags a spike in LLM spend. Engineering cannot answer basic questions: Which app caused it? Which prompts? Which users? Which model? Which retries or agent loops? The system is working. The bill is not explainable.

This is not a failure of the model. It is a failure of visibility.

Between 2023 and 2025, AI observability and FinOps moved from optional tooling to core production infrastructure for LLM applications. The driver is straightforward: LLM costs are variable per request, difficult to attribute after the fact, and able to outpace traditional cloud cost controls. Unlike traditional compute, where costs correlate roughly with traffic, LLM costs can spike without any change in user volume. A longer prompt, a retrieval payload that grew, an agent loop that ran one extra step: each of these changes the bill, and none of them are visible without instrumentation built for this purpose.

Context: A Three-Year Shift

Research shows a clear timeline in how this capability matured:

2023: Early, purpose-built LLM observability tools emerge (Helicone, LangChain’s early LangSmith development). The core problem was visibility into prompts, models, and cost drivers across providers. At this stage, most teams had no way to answer “why did that request cost what it cost.”

2024: LLM systems move from pilot to production more broadly. This is the point where cost management becomes operational, not experimental. LangSmith’s general availability signals that observability workflows are becoming standard expectations, not optional add-ons.

2025: Standardization accelerates. LLM semantic conventions enter the OpenTelemetry specification in January 2025. Enterprise LLM API spend grows rapidly. The question shifts from “should we instrument” to “how fast can we instrument.”

Across these phases, “observability” expands from latency and error rates into token usage, per-request cost, prompt versions, and evaluation signals.

How the Mechanism Works

This section describes the technical pattern that research indicates is becoming standard, keeping the build pattern separate from its interpretation.

1. The AI Gateway Pattern as the Control Point

The dominant production architecture for LLM observability and cost tracking is the “AI gateway” (or proxy).

What it does: the gateway sits between applications and model providers, captures request metadata, applies guardrails and policy checks, invokes the model, and forwards the resulting telemetry to the observability pipeline.

Why it matters mechanically: because LLM usage is metered at the request level (tokens), the gateway becomes the most reliable place to measure tokens, compute cost, and attach organizational metadata. Without a gateway, instrumentation depends on every team doing it correctly. With a gateway, instrumentation happens once.

Typical request flow: User request → Gateway (metadata capture) → Guardrails/policy checks → Model invocation → Response → Observability pipeline → Analytics

(A minimal code sketch of this pattern follows subsection 3 below.)

2. Token-Based Cost Telemetry

Token counts are the base unit for cost attribution. Typical per-request capture fields: input tokens, output tokens, the model invoked, the computed cost, and the organizational metadata (app, team, user, feature) attached at the gateway.

Research emphasizes that the real drivers of cost complexity appear only when measuring at this granularity: input versus output token price asymmetry, caching discounts, long-context tier pricing, retries, and fallback routing. None of these are visible in aggregate metrics.

3. OpenTelemetry Tracing and LLM Semantic Conventions

Distributed tracing is the backbone for stitching together an LLM request across multiple services. OpenTelemetry introduced standardized LLM semantic conventions (attributes) for capturing model identity, token usage, and related request and response details on spans. This matters because it makes telemetry portable across backends (Jaeger, Datadog, New Relic, Honeycomb, vendor-specific systems) and reduces re-instrumentation work when teams change tools.
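To make subsections 1 and 2 concrete, here is a minimal sketch of the gateway pattern: a single wrapper that invokes the model, reads token counts from the response, computes per-request cost from a price table with separate input and output rates, and attaches team and feature metadata. Everything here is illustrative: the model names, prices, and function signatures are hypothetical, not any particular provider's API.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical per-million-token prices in USD; real prices vary by provider,
# model, context tier, and caching discounts.
PRICE_TABLE = {
    "example-small": {"input": 0.50, "output": 1.50},
    "example-large": {"input": 5.00, "output": 15.00},
}

@dataclass
class LLMResponse:
    text: str
    input_tokens: int
    output_tokens: int

@dataclass
class CostRecord:
    model: str
    team: str
    feature: str
    input_tokens: int
    output_tokens: int
    cost_usd: float

def compute_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Apply input/output price asymmetry; prices are per one million tokens."""
    prices = PRICE_TABLE[model]
    return (input_tokens * prices["input"] + output_tokens * prices["output"]) / 1_000_000

def gateway_call(
    call_model: Callable[[str, str], LLMResponse],  # stand-in for the real provider call
    model: str,
    prompt: str,
    *,
    team: str,
    feature: str,
    ledger: List[CostRecord],
) -> LLMResponse:
    """Single choke point: invoke the model, then record tokens, cost, and ownership metadata."""
    response = call_model(model, prompt)
    ledger.append(CostRecord(
        model=model,
        team=team,
        feature=feature,
        input_tokens=response.input_tokens,
        output_tokens=response.output_tokens,
        cost_usd=compute_cost(model, response.input_tokens, response.output_tokens),
    ))
    return response

if __name__ == "__main__":
    def fake_call_model(model: str, prompt: str) -> LLMResponse:
        # In a real gateway, token counts come back from the provider's API response.
        return LLMResponse(text="...", input_tokens=len(prompt.split()), output_tokens=42)

    ledger: List[CostRecord] = []
    gateway_call(fake_call_model, "example-small", "Summarize this document ...",
                 team="docs-platform", feature="doc-summarizer", ledger=ledger)
    print(ledger[0])
```

In production this wrapper lives in the gateway itself, which is the point of the pattern: every team's traffic is metered the same way, without per-team instrumentation.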
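And here is a minimal sketch of how that same per-request record can be emitted as an OpenTelemetry span. The gen_ai.* attribute names follow the GenAI semantic conventions discussed above; because the conventions are still maturing, exact names should be checked against the current specification. The cost and team attributes shown are custom additions, not part of the standard.

```python
# Requires the opentelemetry-api package; real deployments also configure an SDK and exporter.
from opentelemetry import trace

tracer = trace.get_tracer("llm.gateway")

def record_llm_span(model: str, input_tokens: int, output_tokens: int,
                    cost_usd: float, team: str) -> None:
    """Record one model invocation as a span, using GenAI semantic-convention attribute names."""
    with tracer.start_as_current_span(f"chat {model}") as span:
        # Attribute names follow the OpenTelemetry GenAI semantic conventions;
        # the conventions are still evolving, so verify names against the current spec.
        span.set_attribute("gen_ai.request.model", model)
        span.set_attribute("gen_ai.usage.input_tokens", input_tokens)
        span.set_attribute("gen_ai.usage.output_tokens", output_tokens)
        # Cost and ownership are not standardized attributes; these names are custom.
        span.set_attribute("llm.cost_usd", cost_usd)
        span.set_attribute("llm.team", team)

record_llm_span("example-small", input_tokens=812, output_tokens=157,
                cost_usd=0.0006, team="docs-platform")
```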
4. Cost Attribution and Showback Models

Research describes three allocation approaches. Operationally, “showback” is the minimum viable step: make cost visible to the teams generating it, even without enforcing chargeback. Visibility alone changes behavior.

What Happens Without This Infrastructure

Consider a second scenario. A product team launches an AI-powered search feature. It uses retrieval-augmented generation: fetch documents, build context, call the model. Performance is good. Users are happy.

Three months later, the retrieval index has grown. Average context length has increased from 2,000 tokens to 8,000 tokens. The model is now hitting long-context pricing tiers. Costs have quadrupled, but traffic has only doubled.

Without token-level telemetry, this looks like “AI costs are growing with usage.” With token-level telemetry, it is diagnosable: context length per request increased, triggering a pricing-tier change. The fix might be retrieval tuning, context compression, or a model swap. But without the data, there is no diagnosis, only a budget conversation with no actionable next step.

Analysis

Why This Matters Now

Three factors explain the timing:

LLM costs scale with usage variability, not just traffic. Serving a “similar number of users” can become dramatically more expensive if prompts grow, retrieval payloads expand, or agent workflows loop. Traditional capacity planning does not account for this.

LLM application success is not binary. Traditional telemetry answers “did the request succeed.” LLM telemetry needs to answer “was it good, how expensive was it, and what changed.” A 200 OK response tells you almost nothing about whether the interaction was worth its cost.

The cost surface is now architectural. Cost is a design constraint that affects routing, caching, evaluation workflows, and prompt or context construction. In this framing, cost management becomes something engineering owns at the system layer, not something finance reconciles after the invoice arrives.

Implications for Enterprises

Operational implications:

Technical implications:

The Quiet Risk: Agent Loops

One pattern deserves particular attention. Agentic workflows, where models call tools, evaluate results, and decide next steps, introduce recursive cost exposure.

A simple example: an agent is asked to research a topic. It searches, reads, decides it needs more context, searches again, reads again, summarizes, decides the summary is incomplete, and loops. Each step incurs tokens. Without step-level telemetry and loop limits, a single user request can generate dozens of billable model calls.

Research flags this as an open problem. The guardrails are not yet standardized. Teams are implementing their own loop limits, step budgets, and circuit breakers. But without visibility into agent step counts and per-step costs, even well-intentioned guardrails cannot be tuned effectively. (A minimal sketch of such a guard closes this piece.)

Risks and Open Questions

These are open questions that research raises directly, not predictions.
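To close, here is a minimal sketch of the per-request guard described in the agent-loops section: a step budget and cost budget that trip a circuit breaker before a loop runs away. The thresholds, names, and placeholder stop condition are all hypothetical; a real implementation would derive step cost from measured token usage and surface these counters through the same telemetry pipeline.

```python
from dataclasses import dataclass

@dataclass
class AgentBudget:
    """Per-request guardrail: caps steps and spend for a single agent run (illustrative limits)."""
    max_steps: int = 8
    max_cost_usd: float = 0.50
    steps_used: int = 0
    cost_used_usd: float = 0.0

    def charge(self, step_cost_usd: float) -> None:
        """Meter one agent step and trip the breaker before the loop runs away."""
        self.steps_used += 1
        self.cost_used_usd += step_cost_usd
        if self.steps_used > self.max_steps:
            raise RuntimeError(f"step budget exceeded: {self.steps_used} > {self.max_steps}")
        if self.cost_used_usd > self.max_cost_usd:
            raise RuntimeError(
                f"cost budget exceeded: ${self.cost_used_usd:.4f} > ${self.max_cost_usd:.2f}"
            )

def run_agent(task: str, budget: AgentBudget) -> str:
    """Skeleton agent loop: every step is metered, so a runaway loop fails fast and visibly."""
    while True:
        step_cost_usd = 0.03  # in a real system, derived from the step's measured token usage
        budget.charge(step_cost_usd)
        if budget.steps_used >= 3:  # placeholder stop condition standing in for the agent's own logic
            return f"finished '{task}' in {budget.steps_used} steps (${budget.cost_used_usd:.2f})"

print(run_agent("research a topic", AgentBudget()))
```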