Serving Composite Multimodal AI Systems as Executable Dataflow Graphs
A customer uploads a damaged equipment photo and asks for a repair recommendation. The request passes through a vision encoder to interpret the image, a retrieval system to pull maintenance records, a language model to generate an explanation, and a speech component to produce an audio response. To the user, it appears to be a single AI interaction. Operationally, it is a coordinated workflow involving multiple specialized systems.
Traditional model serving infrastructure was designed around a simple assumption: one model, one endpoint, one execution path. Recent systems research suggests that this assumption is becoming a constraint. A growing body of work, most recently the June 2026 M* paper, treats multimodal AI systems not as monolithic models but as executable dataflow graphs composed of independently managed components.
The shift is architectural rather than algorithmic. The question is no longer only how to train increasingly capable models, but how to efficiently operate increasingly complex AI systems.
The Heterogeneity Problem
Modern multimodal systems combine heterogeneous components with very different computational characteristics. Language models, vision encoders, speech systems, diffusion generators, retrieval services, and action models often require different hardware resources, different batching strategies, and different scaling policies.
This creates a mismatch with conventional serving architectures. In a monolithic deployment, all components share the same execution environment and scaling unit even when their workloads differ substantially. Some stages become bottlenecks while others remain underutilized.
In some cases, monolithic deployment is not merely inefficient but infeasible. Cornserve's evaluation found that a monolithic deployment of Qwen 3 Omni 30B exhausted available memory, while a disaggregated deployment was able to serve the model successfully.
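The bottleneck effect described above can be illustrated with a small calculation. The sketch below assumes a simple linear pipeline with one replica per stage, so end-to-end throughput is capped by the slowest stage. The stage names and capacities are made-up numbers for illustration, not measurements from any of the cited systems.

```python
def stage_utilization(rates):
    """Fraction of each stage's capacity in use when the pipeline runs at the bottleneck rate."""
    bottleneck = min(rates.values())
    return {stage: bottleneck / rate for stage, rate in rates.items()}


# Hypothetical per-stage capacities in requests per second (illustrative only).
RATES = {"vision_encoder": 40.0, "llm": 10.0, "speech_synth": 20.0}

for stage, used in stage_utilization(RATES).items():
    print(f"{stage}: {used:.0%} utilized")
```

Under these assumed numbers the language model is saturated while the vision encoder sits mostly idle, which is the kind of imbalance that a shared scaling unit cannot correct.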
Several research efforts in 2025 and 2026 explored different forms of disaggregated serving, including Encode-Prefill-Decode (EPD) architectures, stage-based multimodal serving systems, and fully disaggregated multimodal runtimes. The June 2026 M* paper extends this direction by introducing a more general graph-based abstraction intended to support a broad range of multimodal and agentic model architectures.
How the Mechanism Works
The central idea is to represent an AI system as a directed graph of interconnected components.
Each node in the graph performs a specific function. Examples include:
- Language models for reasoning and text generation
- Vision encoders for image understanding
- Audio models for speech recognition or synthesis
- Diffusion or flow-based generators for image creation
- Retrieval components for external knowledge access
- Tool or action models for external operations
- Routing and orchestration components
The edges between nodes represent both data movement and execution dependencies.
Instead of sending every request through a single model endpoint, the serving system executes only the graph path required for a particular task. Different requests may traverse different subgraphs depending on the modalities involved and the outputs requested.
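As a rough illustration of executing only the required graph path, the sketch below models the components as a dependency graph and derives an execution order for the outputs a request asks for. The node names and the `GRAPH`, `required_subgraph`, and `execution_plan` identifiers are hypothetical and are not taken from M*, Cornserve, or any other cited system; a real runtime would also account for input modalities, batching, and placement.

```python
from graphlib import TopologicalSorter

# Hypothetical component graph: each node lists the nodes whose outputs it consumes.
GRAPH = {
    "vision_encoder": [],
    "retriever": ["vision_encoder"],
    "llm": ["vision_encoder", "retriever"],
    "speech_synth": ["llm"],
    "image_generator": ["llm"],
}


def required_subgraph(graph, outputs):
    """Collect the requested output nodes plus everything they depend on."""
    needed = {}
    stack = list(outputs)
    while stack:
        node = stack.pop()
        if node not in needed:
            needed[node] = graph[node]
            stack.extend(graph[node])
    return needed


def execution_plan(graph, outputs):
    """Order the required nodes so that each runs after its dependencies."""
    return list(TopologicalSorter(required_subgraph(graph, outputs)).static_order())


# A spoken answer never touches the image generator.
print(execution_plan(GRAPH, ["speech_synth"]))
# A text-only answer stops at the language model.
print(execution_plan(GRAPH, ["llm"]))
```

Walking backward from the requested outputs keeps unused components, here the image generator, out of the plan entirely.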
This architecture allows each component to be deployed independently. Components can be placed on different accelerators, scaled separately, batched according to their own workload characteristics, and upgraded without requiring changes to the entire system.
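One way to picture independent deployment is a per-component specification that records placement, replica count, batch size, and version separately for each node. The `ComponentSpec` type, its fields, and the values below are assumptions made for illustration; none of the cited systems is claimed to use this schema.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ComponentSpec:
    """Hypothetical deployment settings for a single graph node."""
    accelerator: str
    replicas: int
    max_batch_size: int
    version: str


# Illustrative values only.
DEPLOYMENT = {
    "vision_encoder": ComponentSpec("gpu-small", replicas=1, max_batch_size=32, version="v1"),
    "llm": ComponentSpec("gpu-large", replicas=4, max_batch_size=8, version="v1"),
    "speech_synth": ComponentSpec("gpu-small", replicas=2, max_batch_size=16, version="v1"),
}


def upgrade(deployment, name, version):
    """Return a new deployment in which only `name` changes version."""
    updated = dict(deployment)
    updated[name] = replace(deployment[name], version=version)
    return updated


upgraded = upgrade(DEPLOYMENT, "llm", "v2")
print(upgraded["llm"])
```

Because each node carries its own settings, upgrading the language model leaves the specifications of the other components untouched.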
Research systems such as Cornserve, vLLM-Omni, HydraInfer, ModServe, and M* all implement variations of this idea. While the details differ, they share a common principle: the execution graph becomes the primary operational unit rather than the individual model.
Analysis: Why This Matters Now
The significance of this development is tied to the changing structure of AI applications.
Many current AI systems are no longer single-model workloads. They are collections of specialized models connected through orchestration logic and intermediate data flows. As multimodal capabilities expand, the number of components involved in a single request often increases rather than decreases.
The graph-based serving model reflects this reality.
The reported performance gains can be substantial. Cornserve, for example, reported up to 3.8× higher throughput than monolithic serving.
Equally important, graph-based serving changes how infrastructure teams think about AI operations. Scaling decisions, performance bottlenecks, deployment strategies, and observability models increasingly become component-level concerns rather than endpoint-level concerns.
In that sense, the architectural transition resembles the earlier shift in software engineering from monolithic applications to distributed services. The goal is not simply higher performance but an operational model that better matches the structure of modern AI systems.
Implications for Enterprises
Infrastructure and Resource Management
Graph-based serving enables organizations to allocate hardware according to the needs of individual components rather than the needs of an entire application.
Language models, speech systems, and visual encoders can each run on infrastructure optimized for their specific workload profiles.
Independent Scaling
Organizations can scale high-demand components without replicating the entire application stack. This becomes increasingly important as multimodal systems grow larger and more heterogeneous.
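A back-of-the-envelope comparison shows why this matters. The sketch below assumes each component replica sustains a fixed request rate and compares per-component scaling with replicating the whole stack as a unit. The capacities and the target load are made-up numbers, and the model ignores batching effects and transfer overhead.

```python
import math

# Hypothetical per-replica capacity in requests per second (illustrative only).
CAPACITY = {"vision_encoder": 40.0, "llm": 10.0, "speech_synth": 20.0}


def replicas_per_component(capacity, target_rps):
    """Replicas each component needs when components scale independently."""
    return {name: math.ceil(target_rps / rps) for name, rps in capacity.items()}


def instances_for_full_stack_copies(capacity, target_rps):
    """Component instances needed when the whole stack is replicated together."""
    copies = math.ceil(target_rps / min(capacity.values()))
    return copies * len(capacity)


independent = replicas_per_component(CAPACITY, target_rps=40)
print(independent, "total:", sum(independent.values()))
print("full-stack total:", instances_for_full_stack_copies(CAPACITY, target_rps=40))
```

With these assumed figures, independent scaling needs seven component instances to serve 40 requests per second, while replicating the full stack needs twelve, because every copy carries components that are not the bottleneck.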
Improved Observability
The architecture naturally exposes component-level telemetry, including queue depth, latency, resource utilization, transfer times, and execution-path frequency.
This creates a more detailed operational view than a single end-to-end latency metric.
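A minimal sketch of such component-level telemetry follows, covering two of the signals mentioned above: per-node latency and execution-path frequency. The `GraphTelemetry` class and its methods are hypothetical and do not correspond to the instrumentation of any cited system.

```python
from collections import Counter, defaultdict
from statistics import mean


class GraphTelemetry:
    """Hypothetical collector for per-node latency and execution-path counts."""

    def __init__(self):
        self.latencies_ms = defaultdict(list)
        self.path_counts = Counter()

    def record_request(self, node_latencies_ms):
        """Record one request given ordered (node, latency_ms) pairs."""
        path = tuple(node for node, _ in node_latencies_ms)
        self.path_counts[path] += 1
        for node, latency in node_latencies_ms:
            self.latencies_ms[node].append(latency)

    def mean_latency_ms(self):
        """Average latency per node across all recorded requests."""
        return {node: mean(values) for node, values in self.latencies_ms.items()}


telemetry = GraphTelemetry()
telemetry.record_request([("vision_encoder", 12.0), ("llm", 300.0)])
telemetry.record_request([("vision_encoder", 8.0), ("llm", 500.0), ("speech_synth", 90.0)])
print(telemetry.mean_latency_ms())
print(telemetry.path_counts.most_common())
```

Keeping measurements keyed by node and by path makes it possible to see which component dominates latency and which subgraphs receive the most traffic.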
Failure Isolation
Problems in one component do not necessarily require restarting or scaling the entire system. Individual nodes can be isolated, replaced, or upgraded independently.
Multi-Application Reuse
Several systems demonstrate the ability to share common components across multiple applications. Shared encoders, retrieval systems, or foundation-model backbones can reduce infrastructure duplication and improve utilization.
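The potential saving can be sketched by counting component deployments with and without sharing. The applications and component names below are invented for illustration, and the count assumes one deployment per component regardless of load.

```python
# Hypothetical applications and the components each one uses.
APPS = {
    "repair_assistant": {"vision_encoder", "retriever", "llm", "speech_synth"},
    "visual_search": {"vision_encoder", "retriever"},
    "voice_chat": {"llm", "speech_synth"},
}


def shared_components(apps):
    """Map each component used by more than one application to those applications."""
    users = {}
    for app, components in apps.items():
        for component in components:
            users.setdefault(component, set()).add(app)
    return {component: used_by for component, used_by in users.items() if len(used_by) > 1}


def deployment_counts(apps):
    """Return (deployments with per-application copies, deployments with sharing)."""
    dedicated = sum(len(components) for components in apps.values())
    shared = len(set().union(*apps.values()))
    return dedicated, shared


print(sorted(shared_components(APPS)))
print(deployment_counts(APPS))
```

In this toy setup every component is used by two applications, so sharing halves the number of deployments from eight to four.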
Risks and Open Questions
Despite strong technical results, several important questions remain unresolved.
First, much of the work remains at the research-prototype stage. Performance improvements have been demonstrated, but long-term operational experience is still limited.
Second, graph execution introduces additional architectural complexity. Scheduling, routing, deployment management, and observability become more sophisticated than in monolithic deployments.
Finally, security and governance considerations remain underdeveloped across the current literature. Most systems focus primarily on performance, scheduling, and resource efficiency. Cornserve, for example, assumes model code originates from a trusted entity, while intermediate tensors may move between components through shared-memory or high-speed networking mechanisms designed for performance rather than governance. As AI systems become increasingly distributed, organizations may need new controls for intermediate outputs, component-level access management, graph-wide provenance tracking, and data-handling policies.
As organizations move toward increasingly composite AI architectures, these operational and governance questions may become as important as the performance gains that motivated the architectural shift.
Further Reading
- M*: A Modular, Extensible, Serving System for Multimodal Models
- Cornserve: A Distributed Serving System for Any-to-Any Multimodal Models
- vLLM-Omni: Fully Disaggregated Serving for Any-to-Any Multimodal Models
- ModServe
- HydraInfer
- EPD Disaggregation
- EPD-Serve
- Cost-Efficient Multimodal LLM Inference via Cross-Tier GPU Heterogeneity
- The Shift from Models to Compound AI Systems