Standardizing Observability Across LLM and Agent Workflows
An enterprise incident responder investigating a failed customer request can usually follow a distributed trace through APIs, services, databases, and infrastructure components. That process becomes significantly more difficult when the request passes through an AI agent that performs retrieval, invokes tools, calls multiple models, accesses memory, delegates work to other agents, and dynamically changes its execution path.
As AI systems become more complex, organizations need a way to observe and troubleshoot these workflows using the same operational infrastructure they already use for traditional software systems. OpenTelemetry's Generative AI semantic conventions represent one of the most significant efforts to create a common telemetry model for AI workloads, extending existing observability practices into the world of models, agents, retrieval systems, memory operations, and tool execution.
Traditional Observability and Agent-Based Systems
Traditional observability systems were designed around relatively predictable application architectures. A user request typically follows a defined path through APIs, services, databases, and supporting infrastructure.
Agent-based AI systems introduce a different execution model. A single user request may trigger retrieval operations, multiple model invocations, planning steps, memory access, tool execution, workflow orchestration, MCP interactions, and sub-agent delegation. The exact sequence can vary from one request to the next.
This variability creates operational challenges. Different frameworks often expose telemetry differently. Model providers report usage and performance metrics in different formats. Retrieval activity, tool execution, memory access, and orchestration logic may be difficult to observe consistently across environments.
OpenTelemetry's Generative AI semantic conventions were developed to address this problem. Rather than creating a separate monitoring ecosystem for AI systems, the effort extends OpenTelemetry's existing tracing, metrics, and logging architecture through a standardized set of AI-specific telemetry definitions.
How the Mechanism Works
The GenAI semantic conventions do not standardize model behavior. They standardize how model behavior is observed.
Many of the newer workflow-oriented conventions remain in Development status and may change as the specification evolves. Even so, they illustrate how OpenTelemetry is attempting to represent increasingly complex AI workflows within existing observability systems.
Model Telemetry
The specification defines common attributes for identifying model providers and models. It distinguishes between the model requested by the application and the model that ultimately serves the request. This distinction is important because providers may route requests to specific underlying model versions. For example, an application may request a model by a general alias while the response reports a specific dated version of that model.
The conventions also standardize token accounting. Input tokens, output tokens, reasoning tokens, and cache activity can be represented using a common telemetry structure, creating a consistent measurement model across providers.
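The requested-versus-served distinction and the token counters can be pictured as a flat set of span attributes. The following is a minimal sketch using only the Python standard library, not the OpenTelemetry SDK. The `ModelCall` record, the provider name, and the model names are hypothetical, and the attribute keys reflect the Development-status conventions as understood at the time of writing, so they should be checked against the current specification.

```python
from dataclasses import dataclass


@dataclass
class ModelCall:
    """Hypothetical record of one model invocation (not part of any SDK)."""
    provider: str
    requested_model: str
    served_model: str
    input_tokens: int
    output_tokens: int


def to_span_attributes(call: ModelCall) -> dict:
    """Map a call record onto GenAI-style attribute keys.

    The key names are assumed from the Development-status conventions
    and may change; verify them before relying on them.
    """
    return {
        "gen_ai.provider.name": call.provider,
        "gen_ai.request.model": call.requested_model,   # what the app asked for
        "gen_ai.response.model": call.served_model,     # what actually answered
        "gen_ai.usage.input_tokens": call.input_tokens,
        "gen_ai.usage.output_tokens": call.output_tokens,
    }


# Placeholder values for illustration only.
call = ModelCall("example-provider", "example-model", "example-model-2025-01-01", 120, 45)
attrs = to_span_attributes(call)
```

Because the keys are the same regardless of provider, a query such as "sum of output tokens grouped by served model" can be written once and reused across backends.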
Performance telemetry is similarly normalized. Metrics such as request duration, streaming behavior, and time-to-first-token can be collected and analyzed using a shared vocabulary.
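To make the timing measurements concrete, the sketch below times a streamed response with the standard library alone. `fake_stream` is a hypothetical stand-in for a provider's streaming API, and the code measures time to the first chunk, which approximates time-to-first-token only when the first chunk carries a single token.

```python
import time
from typing import Iterable, Iterator, Optional, Tuple


def fake_stream(chunks: Iterable[str], delay_s: float = 0.01) -> Iterator[str]:
    """Hypothetical stand-in for a streaming model response."""
    for chunk in chunks:
        time.sleep(delay_s)  # simulate generation latency per chunk
        yield chunk


def measure_stream(stream: Iterator[str]) -> Tuple[str, Optional[float], float]:
    """Consume a stream and return (text, seconds_to_first_chunk, total_seconds)."""
    start = time.perf_counter()
    first_chunk_at = None
    parts = []
    for chunk in stream:
        if first_chunk_at is None:
            # Record the elapsed time when the first chunk arrives.
            first_chunk_at = time.perf_counter() - start
        parts.append(chunk)
    total = time.perf_counter() - start
    return "".join(parts), first_chunk_at, total


text, ttft, duration = measure_stream(fake_stream(["Hel", "lo", "!"]))
```

In an instrumented application these two values would typically be recorded as metrics or span attributes under the names the conventions define, rather than returned to the caller.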
Planning and Memory Telemetry
Development-status conventions extend observability beyond model execution into planning and memory operations.
Planning spans allow planning activity to be represented as a distinct stage within a distributed trace, with model calls used during planning appearing as child spans. This provides visibility into how agents structure and sequence work rather than exposing only the underlying model invocations.
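The parent-child relationship described here can be sketched with a toy span model. This is an illustration in plain Python, not the OpenTelemetry API: the `Span` class is hypothetical, and the span names are placeholders rather than names mandated by the conventions.

```python
import secrets
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    """Minimal illustrative span; real SDK spans carry far more state."""
    name: str
    trace_id: str
    parent_span_id: Optional[str] = None
    span_id: str = field(default_factory=lambda: secrets.token_hex(8))


def child_of(parent: Span, name: str) -> Span:
    """Create a span in the parent's trace that points back at the parent."""
    return Span(name=name, trace_id=parent.trace_id, parent_span_id=parent.span_id)


def children(spans: List[Span], parent: Span) -> List[str]:
    """Names of the spans whose direct parent is the given span."""
    return [s.name for s in spans if s.parent_span_id == parent.span_id]


root = Span(name="invoke_agent", trace_id=secrets.token_hex(16))
planning = child_of(root, "plan")                # planning stage (placeholder name)
draft = child_of(planning, "chat draft-plan")    # model call made while planning
refine = child_of(planning, "chat refine-plan")  # second model call, same stage
spans = [root, planning, draft, refine]
```

Calling `children(spans, planning)` returns the two model calls, which is what lets a trace viewer group them under the planning stage instead of listing them as unrelated invocations.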
Memory spans provide visibility into memory creation, search, update, and deletion activities. These operations become observable components of the workflow rather than remaining hidden inside framework-specific implementations.
Agent and Workflow Observability
The specification also includes Development-status telemetry models for agent creation, agent invocation, and workflow execution.
Agents can be represented as observable entities with associated metadata such as names, identifiers, descriptions, and versions. Workflow spans make orchestration activity visible within traces, allowing investigators to understand how requests move through agent-driven systems.
Retrieval and Tool Execution
Retrieval systems have become a foundational component of many enterprise AI architectures, particularly retrieval-augmented generation workflows.
Recent versions of the specification introduced retrieval spans that make retrieval operations visible within traces. These spans can capture the data source queried, the documents returned, and relevance information for those documents.
Tool execution is similarly represented through dedicated spans. Execution duration, success states, failure conditions, and tool identity become traceable components within a broader workflow.
MCP Observability
The conventions also introduce telemetry structures for Model Context Protocol interactions.
MCP-specific attributes enable protocol operations to be observed within the same traces as model execution, retrieval, planning activity, memory access, and tool execution. This creates a unified view of AI workflows that span multiple systems and services.
Trace Correlation
The architecture relies on the same distributed tracing mechanisms already used across modern software environments.
Trace IDs, span IDs, parent-child relationships, and W3C Trace Context headers allow telemetry to remain correlated as requests move across services, frameworks, tools, model gateways, and MCP servers.
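The header that carries this correlation is small enough to build and parse by hand. The sketch below handles version `00` of the W3C `traceparent` header using only the standard library. The helper names are hypothetical, and production code would normally rely on an OpenTelemetry propagator rather than hand-rolled parsing.

```python
import re
import secrets
from typing import Optional, Tuple

# Version "00": 32 hex chars of trace-id, 16 of parent-id, 2 of trace-flags.
_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def make_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Format a version-00 traceparent value for an outgoing request."""
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(header: str) -> Optional[Tuple[str, str, bool]]:
    """Return (trace_id, parent_span_id, sampled), or None if the header is invalid."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero identifiers are defined as invalid by the specification.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return trace_id, parent_id, bool(int(flags, 16) & 0x01)


# An agent attaches the header when calling a tool or MCP server ...
trace_id = secrets.token_hex(16)
agent_span_id = secrets.token_hex(8)
header = make_traceparent(trace_id, agent_span_id)

# ... and the receiving service recovers the same trace ID for its own spans.
parsed = parse_traceparent(header)
```

The receiving side keeps the trace ID, treats the incoming span ID as the parent of any span it starts, and generates a fresh span ID of its own.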
The result is a single distributed trace that can represent an entire AI workflow from initial request through final response.
Analysis: Why This Matters Now
The significance of recent developments lies less in any individual telemetry attribute than in the continued expansion of observability beyond model invocation. The conventions increasingly attempt to represent complete AI workflows within distributed traces, so enterprises can observe how AI systems operate across multiple components rather than viewing model calls in isolation.
Equally significant is the decision to extend existing observability infrastructure rather than create a separate AI monitoring model. Organizations can apply established tracing, logging, metrics, and governance practices to AI systems instead of building parallel operational processes.
Implications for Enterprises
Operational Integration
Organizations can integrate AI telemetry into existing distributed tracing platforms, application performance monitoring systems, logging infrastructure, incident management processes, and service-level objective programs.
AI workflows become part of the same operational view used to manage traditional applications and services.
Cost Visibility
Standardized token telemetry enables more consistent measurement of model usage across providers. While the specification does not currently define a standard cost attribute, the telemetry can support cost attribution and chargeback models when combined with external pricing information.
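As a rough illustration of joining token telemetry with external pricing, the sketch below estimates the cost of one call. The price table is entirely hypothetical and is not real provider pricing. The attribute keys are assumed from the Development-status conventions, and the model name is a placeholder.

```python
from decimal import Decimal

# Hypothetical prices in USD per one million tokens; not real provider pricing.
PRICES_PER_MILLION = {
    "example-model-2025-01-01": {"input": Decimal("2.00"), "output": Decimal("8.00")},
}


def estimate_cost(attributes: dict) -> Decimal:
    """Estimate the cost in USD of one model call from its token attributes."""
    # Price the model that actually served the call, not the requested alias.
    model = attributes["gen_ai.response.model"]
    price = PRICES_PER_MILLION[model]
    input_cost = price["input"] * attributes["gen_ai.usage.input_tokens"]
    output_cost = price["output"] * attributes["gen_ai.usage.output_tokens"]
    return (input_cost + output_cost) / Decimal(1_000_000)


span_attributes = {
    "gen_ai.response.model": "example-model-2025-01-01",
    "gen_ai.usage.input_tokens": 1200,
    "gen_ai.usage.output_tokens": 300,
}
cost = estimate_cost(span_attributes)
```

With these made-up prices the call works out to 0.0048 USD. Keeping the price table outside the telemetry means prices can be corrected later without re-instrumenting applications, and `Decimal` avoids floating-point drift when many small amounts are summed for chargeback.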
Security Operations
Security teams gain the ability to correlate AI activity with broader operational telemetry.
Because AI workflow spans share trace context with the surrounding services, they can be examined within the same investigative processes already used for application and infrastructure events.
Governance Controls
The OpenTelemetry Collector is emerging as an important governance control point.
Organizations can use collector processors to filter, redact, transform, route, and sample telemetry before export. This allows governance policies to be applied centrally without requiring changes to application code.
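The kind of policy such a processor applies can be sketched in a few lines. The example below is plain Python that mimics the filter-and-redact step on a dictionary of span attributes. It is not Collector configuration, which is written in YAML, and it does not use any Collector API. The dropped keys are assumed from the Development-status conventions, while `app.user_note` and the email pattern are hypothetical.

```python
import re

# Hypothetical policy: keys to drop entirely, plus a pattern to mask in string values.
DROP_KEYS = {"gen_ai.input.messages", "gen_ai.output.messages"}
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def apply_policy(attributes: dict) -> dict:
    """Return a copy of the attributes with sensitive content removed or masked."""
    cleaned = {}
    for key, value in attributes.items():
        if key in DROP_KEYS:
            continue  # filter: never export captured message content
        if isinstance(value, str):
            value = EMAIL.sub("[REDACTED_EMAIL]", value)  # redact: mask email addresses
        cleaned[key] = value
    return cleaned


raw = {
    "gen_ai.request.model": "example-model",
    "gen_ai.usage.input_tokens": 42,
    "gen_ai.input.messages": "full prompt text that should not leave the pipeline",
    "app.user_note": "contact jane.doe@example.com for access",  # hypothetical attribute
}
exported = apply_policy(raw)
```

Running the policy in the telemetry pipeline rather than in each application is what allows a single rule change to take effect for every service at once.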
Risks and Open Questions
The GenAI semantic conventions remain in Development status. Breaking changes are still possible, and backward compatibility is not guaranteed.
The move of the conventions into a dedicated semantic-conventions-genai repository reflects the growing scope of AI observability work, but it should not be interpreted as a sign of specification stability. Significant portions of the schema remain under active development.
Multi-agent systems remain an area without comprehensive standardization. Agent-to-agent collaboration patterns continue to evolve faster than observability standards.
Other gaps include cost attribution, prompt version tracking, model routing visibility, policy and guardrail decision telemetry, and evaluation workflows.
Memory management remains only partially standardized. Development-status spans exist for memory operations, but broader guidance around memory semantics, lifecycle management, propagation, conversation continuity, and multi-agent memory coordination remains unresolved.
Privacy and security considerations also remain significant.
Prompt content, retrieval results, memory contents, tool arguments, and model interactions may contain sensitive information. For this reason, capture of prompt and response content is intentionally disabled by default and must be explicitly enabled through configuration.
Organizations adopting AI observability must therefore balance operational visibility with data protection, compliance requirements, retention policies, and governance controls.
Further Reading
- OpenTelemetry Semantic Conventions SIG
- OpenTelemetry Generative AI Semantic Conventions
- semantic-conventions-genai Repository
- OpenLLMetry Project
- W3C Trace Context Specification
- OpenTelemetry Collector Documentation