Beyond the Loop: Architecting a VAST, Multi-Agent AI System
Why the future of AI is a swarm of millions of agents, and how the VAST Data Platform’s Shared-Everything architecture provides the foundation to scale it.
We are at a critical inflection point, one that fundamentally reshapes the future for both cloud service providers (CSPs) and enterprise buyers. The industry is rapidly moving beyond “AI as a chat box” and hurtling toward “AI as an autonomous workforce.” Today’s single-agent demos, running in a simple while loop, are the “hello, world” prototypes of a new computing era.
The future isn’t one hyper-intelligent agent; it’s a swarm of millions. Why? Specialization.
A single company isn’t one hyper-productive CEO. It’s a team of specialists in finance, engineering, marketing, and logistics, all collaborating. Complex AI problems are no different. You wouldn’t ask one agent to run a global supply chain, design a silicon chip, or manage a live marketing campaign. These tasks require a team of specialized agents — a PlannerAgent, a ResearchAgent, a DataAnalystAgent, a SecurityMonitorAgent, and a NotificationAgent — all working in concert.
For CSPs, this shift represents the next generation of massive compute demand and a new, high-value AI platform to host. For enterprise buyers, this marks the opportunity to achieve unprecedented levels of automation and operational efficiency.
However, this also creates an unprecedented infrastructure problem. A swarm of millions of autonomous agents, all reasoning, communicating, and accessing data simultaneously, cannot run on today’s siloed architecture. It’s an infrastructure nightmare, creating problems of cost, security, and complexity.
The Bottleneck: The Environment vs. The Swarm
The primary challenge in scaling from a single agent to a swarm is the lack of infrastructure designed for multi-agent orchestration. Today’s agent deployments typically rely on general-purpose infrastructure—virtual machines, containers, or serverless functions—that wasn’t built with swarm coordination in mind. While modern platforms can scale individual components, they lack the integrated primitives agents need to operate collectively at scale.
This creates critical bottlenecks in coordination, shared state, event throughput, and observability that make swarm deployment challenging.
These aren’t agent failures; they’re platform gaps—the absence of infrastructure primitives purpose-built for multi-agent systems.
A Shared, Decoupled Foundation for Running AI Agents
VAST’s architectural approach to this problem is built on a core design principle: compute and state must be fully separated, but all compute resources must have shared, high-speed access to a global, consistent state layer.
It turns out that agentic workloads are a perfect fit for this model. Rather than running agents in an “all-in-one-box” environment, we can run the separate functions of an agent—its compute logic, its reasoning, its state, and its communication—as their own layers on our platform. The agent’s logic (CPU) runs on one set of resources, while its state and memory (storage) live on another, accessible by all. CPU compute, GPU compute, storage, and networking can all be scaled independently, which opens the door for efficiently and economically serving large numbers of autonomous agents.
The primary advantage of this disaggregated architecture is the ability to scale only the component that is the bottleneck.
Architectural and Engineering Deep Dive: Pillars of a Scalable Agentic Platform
This section examines the core components required to run disaggregated agents at scale on the VAST AI OS, addressing the fundamental challenges of coordination, state, scaling, and operability.
1. The Execution Model: Stateless Compute with Shared State Access
The fundamental architectural pattern is simple: agent logic runs as stateless compute, while all persistent state lives in a globally-accessible storage layer. Applying this pattern to agents requires solving specific problems around latency, consistency, and scale.
Agent Kernels as Ephemeral Functions
Agent “kernels” (the executable logic of an agent) run as serverless functions on VAST CNodes (compute nodes). These functions are stateless and short-lived: each one spins up in response to an event, reads the context it needs from the shared state layer, does its work, writes its results back, and terminates.
This is a different architectural approach from traditional container-based agents, where the agent’s logic, memory, and state are bundled. Here, the kernel is throwaway; only the storage persists.
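As a minimal sketch of this lifecycle (the `Event` shape, the `shared_store` client, and its `read_json`/`write_json` methods below are illustrative assumptions, not a real VAST SDK), an ephemeral kernel looks roughly like this:

```python
# Hypothetical sketch of an ephemeral agent kernel. All names here
# (Event, shared_store, analyze) are illustrative, not a VAST API.
from dataclasses import dataclass


@dataclass
class Event:
    kind: str       # e.g., "file_written" or "row_inserted"
    path: str       # the object or table the event refers to
    task_id: str    # correlation ID carried through the workflow


def handle(event: Event, shared_store) -> None:
    """Stateless kernel: everything it needs comes from shared storage,
    everything it produces is written back before it terminates."""
    # 1. Load this task's context from the shared state layer.
    context = shared_store.read_json(f"/tasks/{event.task_id}/context.json")

    # 2. Do the agent's actual work (reasoning, analysis, tool calls).
    result = analyze(context, shared_store.read_bytes(event.path))

    # 3. Persist the result; this write is what can trigger downstream agents.
    shared_store.write_json(f"/tasks/{event.task_id}/analysis.json", result)
    # 4. Return and exit -- the kernel itself keeps no state between runs.


def analyze(context: dict, payload: bytes) -> dict:
    # Placeholder for the agent's domain logic.
    return {"summary": f"processed {len(payload)} bytes", "inputs": context}
```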
The Data Locality Advantage
Because kernels run on CNodes that are part of the storage fabric, data access is local NVMe, not remote network I/O. When an agent needs to read a 10GB dataset, that data doesn’t move across a network—the compute moves to the data. This is critical for agents doing data-intensive work (e.g., a DataAnalysisAgent processing logs or a VectorSearchAgent scanning embeddings).
Execution Triggers and Lifecycle
Agent kernel triggers are environmental events: a new file written to a watched path, a row inserted into a table, a vector added to an index, or a message published by another agent. A kernel runs only for the lifetime of the task that triggered it, then writes its output and terminates.
When should an agent’s output be written to which storage tier?
In-memory storage is used for short-lived, task-specific context such as conversational state, temporary results, inter-agent messages, or intermediate computations. This minimizes latency and provides fast, stateful interactions crucial for responsiveness in ongoing agent flows.
Persistent storage (databases, object stores, file systems) holds data that must survive agent restarts, session losses, or system failures—such as project histories, decision logs, audit trails, user profiles, and learned strategies or policies. This ensures reliability, traceability, and long-term knowledge accumulation.
Write to persistent storage whenever the output must survive the kernel’s termination, needs to be auditable later, or will be read by other agents: decision logs, project state, and any artifact that downstream agents are triggered by. The split looks roughly like the sketch below.
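The following is illustrative only, assuming a hypothetical `db` client for the persistent table store; the table and column names are invented:

```python
# Illustrative only: "db" stands in for any persistent, transactional table
# store; the table and column names are hypothetical.
import time


def run_step(task_id: str, scratch: dict, db) -> None:
    # Ephemeral, task-specific context stays in memory for speed...
    scratch["last_prompt_tokens"] = 1842
    scratch["draft_answer"] = "..."

    # ...but anything that must survive a restart, be auditable, or be
    # visible to other agents goes to persistent storage immediately.
    db.insert("agent_decisions", {
        "task_id": task_id,
        "decision": "escalate_to_human",
        "reason": "confidence below threshold",
        "ts": time.time(),
    })
```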
2. Event-Driven Coordination: Communication Through the Environment
Agents don’t communicate through direct messages alone. They communicate through the environment — by modifying shared state, writing files, updating tables, or inserting vectors. Other agents observe these changes and react.
This is a fundamentally different communication model from traditional RPC or message-passing. Instead of AgentA explicitly calling AgentB, AgentA modifies the environment (writes a file, updates a table), and the platform automatically triggers AgentB based on registered event patterns.
Environment-Driven Agent Triggers
Agents register themselves to be triggered by specific environmental changes: a new file appearing under a given path, a row inserted into a particular table, or a vector added to an index.
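Registration might look something like the sketch below; the broker client, the `subscribe` call, and the pattern syntax are assumptions made for illustration, not the actual Event Broker API:

```python
# Hypothetical trigger registration -- the pattern syntax and broker client
# below are illustrative, not the real VAST Event Broker SDK.
from dataclasses import dataclass


@dataclass
class Trigger:
    agent: str      # which agent kernel to invoke
    pattern: str    # environmental change to watch for


# A ResearchAgent wakes up when a new brief lands under /research/briefs;
# a DataAnalystAgent wakes up when findings rows are inserted.
TRIGGERS = [
    Trigger(agent="ResearchAgent", pattern="file_written:/research/briefs/*"),
    Trigger(agent="DataAnalystAgent", pattern="row_inserted:research_findings"),
    Trigger(agent="NotificationAgent", pattern="row_inserted:analysis_results"),
]


def register_all(broker) -> None:
    for t in TRIGGERS:
        broker.subscribe(pattern=t.pattern, invoke=t.agent)
```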
Here’s an illustrative workflow:
1. A PlannerAgent writes a research brief to shared storage.
2. The Event Broker detects the new file and triggers the ResearchAgent registered on that path.
3. The ResearchAgent writes its findings to a table, and that insert triggers a DataAnalystAgent.
4. The DataAnalystAgent inserts its results, which in turn triggers a NotificationAgent to alert the user.
No agent explicitly called another agent. They communicated through changes to the shared environment, with the Event Broker mediating the triggers.
Why This Model Scales
Direct agent-to-agent messaging requires agents to know about each other. At scale (10,000+ agents), this creates a dependency graph nightmare. When 10,000 agents are operating simultaneously, each producing dozens of state changes per second, you’re looking at hundreds of thousands to millions of events per second. Most event systems collapse under this load.
Environment-driven communication decouples agents completely: a producer doesn’t need to know which consumers exist, new agents can subscribe to existing event patterns without modifying any other agent, and no explicit dependency graph ever has to be maintained.
VAST’s Event Broker is architecturally different because it’s not a separate service—it’s integrated directly into the storage fabric. This integration provides scalability and performance characteristics that are critical for multi-agent environments. The broker doesn’t maintain events in separate in-memory buffers or on dedicated broker nodes. Events are written directly to D-Nodes as structured data on the same storage fabric that handles all other I/O. This means event throughput scales with the storage fabric itself rather than with a dedicated broker cluster, events inherit the durability and consistency of any other data on the platform, and there is no separate messaging tier to size, operate, or fail over.
Because events are part of the storage fabric, the event history is itself structured, queryable data: agents and operators can inspect it with the same tools they use for any other table, rather than treating events as transient messages that disappear once consumed.
3. State Management: Transactional Tables, Vectors, and Consistency
Agent state must be persistent, queryable, and consistent. This is where most traditional agent architectures fail—they rely on external databases (e.g., Postgres for state, Pinecone for vectors), an approach that introduces latency, complexity, and data movement.
VAST DataBase unifies three capabilities in a single storage layer: transactional tables for structured agent state, vector storage and similarity search for embeddings, and consistent, shared access to that state from every compute node.
The DataBase is built on VAST’s disaggregated storage architecture. Data is stored in columnar format on D-Nodes, with multiple layers of indexing on top.
Vector Operations at Scale
The vector storage subsystem is designed for trillion-scale vector search, with embeddings stored in the same layer as the rest of an agent’s state rather than in a separate vector database.
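Purely for illustration, a similarity lookup against embeddings stored alongside the agent’s other state might look like this; the `db.similarity_search` call and the table name are hypothetical, not a documented VAST API:

```python
# Illustrative only: "db" is a hypothetical client for a store that keeps
# vectors alongside transactional tables; similarity_search is not a real API.
import numpy as np


def find_related_findings(db, query_embedding: np.ndarray, k: int = 5):
    # Nearest-neighbour search over embeddings that live in the same storage
    # layer as the agent's tables -- no separate vector database to sync.
    return db.similarity_search(
        table="research_embeddings",
        vector=query_embedding,
        top_k=k,
    )
```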
4. The Memory Hierarchy: KV Cache Offloading for GPU Efficiency and Endless Context
In multi-agent systems, the GPU inference layer is the most expensive and constrained resource. When thousands of agents are simultaneously calling LLM reasoning services, GPU utilization becomes the critical bottleneck that determines system throughput and cost efficiency.
The KV cache is the key to unlocking high GPU utilization, reducing latency of response, and the ability to support endless context conversations that multi-agent systems require.
The GPU Utilization Problem
A GPU’s compute capacity is wasted if its memory (HBM) fills up before its processors are fully utilized.
Long-context inference on models like Llama 3.1 70B faces different bottlenecks across its two phases. The prefill phase to compute a 128K-token KV cache (~40 GB in FP16, 50% of an H100 80GB’s memory) saturates tensor cores at 60-90% utilization—expensive computation you’d want to reuse whenever possible. The subsequent autoregressive decode phase becomes memory-bound, dropping to only 20-30% utilization as each step fetches the entire KV cache but performs minimal computation. With model weights at ~140 GB in FP16, a 4×H100 80GB configuration (320 GB total HBM) leaves only ~20-25 GB per GPU for KV cache after accounting for operational overhead. This memory constraint severely limits concurrent long-context requests, forcing deployments to provision 3-5× more GPUs than compute needs would suggest—purely for memory capacity to hold KV caches. Even with optimizations like FP8 quantization, continuous batching, and multi-query attention, memory capacity remains the primary scaling constraint.
The result: you’re paying for a $30,000 GPU but only using a fraction of its processing power because memory is the bottleneck, not compute. At scale, this means you need 3-5x more GPUs than you should, purely due to memory constraints.
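The arithmetic behind those numbers is easy to reproduce. The sketch below assumes the published Llama 3.1 70B architecture parameters (80 layers, 8 grouped-query KV heads, head dimension 128) and FP16 precision:

```python
# Back-of-the-envelope KV cache sizing for a Llama 3.1 70B-class model.
layers = 80          # transformer layers
kv_heads = 8         # grouped-query attention KV heads
head_dim = 128       # dimension per attention head
bytes_fp16 = 2       # bytes per element in FP16
context_len = 128 * 1024

# K and V caches, per token, across all layers.
bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_fp16   # 327,680 bytes
kv_cache_bytes = bytes_per_token * context_len
weights_bytes = 70e9 * bytes_fp16

print(f"KV cache per token:       {bytes_per_token / 1024:.0f} KiB")    # 320 KiB
print(f"KV cache for 128K tokens: {kv_cache_bytes / 1024**3:.0f} GiB")  # ~40 GiB
print(f"FP16 model weights:       {weights_bytes / 1e9:.0f} GB")        # ~140 GB
```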
The Solution: KV Cache Offloading
By offloading the KV cache from GPU HBM to VAST’s high-speed storage fabric, you break the memory bottleneck: HBM holds only the caches for requests that are actively decoding, while contexts that are idle or long-running are persisted to the fabric and streamed back on demand instead of being recomputed from scratch.
This dramatically increases GPU utilization because memory is no longer the constraint. You can now serve 50-100+ concurrent agent requests per GPU instead of 10-20, fully utilizing the GPU’s compute capacity.
Enabling Endless Context for Multi-Agent Systems
Multi-agent systems require agents to maintain long-running context across multiple interactions. A ResearchAgent might accumulate more than 100,000 tokens of context over hours of operation. A ConversationAgent might maintain weeks of chat history. Without KV cache offloading, this is impossible—the context simply won’t fit in GPU memory.
With KV cache offloading, context length is no longer bounded by GPU memory: an agent’s accumulated context is persisted to the storage fabric between interactions and reloaded when the agent resumes, so long-running sessions continue where they left off without recomputing the entire prefill.
The Technical Mechanism
VAST achieves KV cache offloading by integrating with inference frameworks (e.g., NVIDIA Dynamo, LMCache) at the KV cache layer: instead of evicting cache blocks or recomputing long prefixes, the inference engine writes those blocks to the VAST storage fabric and reads them back when the context is needed again.
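Conceptually, the flow looks like the framework-agnostic sketch below; the `CacheStore` interface and the `engine.prefill`/`engine.decode` calls are stand-ins for the frameworks’ own KV-transfer hooks, not real APIs:

```python
# Illustrative KV cache offload flow; CacheStore and "engine" are hypothetical
# stand-ins for an external cache tier and an inference engine.
import hashlib


class CacheStore:
    """Stands in for the external, fabric-backed KV cache tier."""

    def __init__(self):
        self._blocks = {}

    def get(self, key):            # returns cached KV blocks, or None on a miss
        return self._blocks.get(key)

    def put(self, key, blocks):    # persists KV blocks outside GPU HBM
        self._blocks[key] = blocks


def prefix_key(token_ids: list[int]) -> str:
    # Key derived from the token prefix, so identical prefixes (system
    # prompts, shared documents) map to the same cached entry.
    return hashlib.sha256(str(token_ids).encode("utf-8")).hexdigest()


def generate(prompt_tokens: list[int], engine, store: CacheStore):
    key = prefix_key(prompt_tokens)
    kv = store.get(key)
    if kv is None:
        # Cache miss: pay for prefill once, then persist the result.
        kv = engine.prefill(prompt_tokens)
        store.put(key, kv)
    # Decode resumes from the (possibly reloaded) cache instead of
    # recomputing the entire prefix.
    return engine.decode(prompt_tokens, kv_cache=kv)
```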
Technically, there is a trade-off when offloading KV cache to external storage: generation latency increases by 1-2ms per cache miss. However, this small latency increase enables far higher concurrency per GPU, context lengths that are no longer bounded by HBM, and reuse of expensive prefill computation across sessions and agents.
For multi-agent systems, where agents are performing reasoning tasks rather than real-time chat, this trade-off is overwhelmingly favorable. As costs add up fast and GPU efficiency becomes a bottleneck, KV cache offloading is a crucial component in multi-agent deployments.
5. Security and Isolation: Identity, Permissions, and Audit
When thousands of agents are executing autonomously, security cannot be an afterthought. The traditional approach—giving each service a long-lived API key or shared credentials—breaks down catastrophically at scale. A compromised key becomes a skeleton key to your entire system. Credential rotation becomes an operational nightmare. And when something goes wrong, tracing which agent did what becomes nearly impossible.
The alternative is to treat agents as first-class principals in your security model, with the same rigor you’d apply to human users or external services. Each agent must have its own identity, short-lived credentials, fine-grained permissions over exactly the data it needs, and a complete audit trail of its actions.
Identity Fabric: Short-Lived Credentials
Agents don’t have static API keys. Instead, when an agent kernel spawns, it requests a short-lived identity token from the platform’s identity service. This token is bound to the agent’s identity and role, carries only the permissions that role grants, and expires after a short lifetime, so a leaked credential has a narrow blast radius.
When the agent makes a request (read from DataBase, publish to Event Broker), the token is validated. If expired or unauthorized, the request fails.
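A rough sketch of that flow, with a hypothetical `identity_service` and `db` client standing in for the platform’s actual identity and data APIs:

```python
# Hypothetical identity flow -- the names below are illustrative, not a VAST API.
def spawn_kernel(identity_service, agent_name: str):
    # Each kernel receives a fresh, short-lived token scoped to its role.
    return identity_service.issue_token(principal=agent_name, ttl_seconds=300)


def read_table(db, token, table: str):
    # Every request carries the token; the storage layer validates it and
    # rejects the call if the token is expired or lacks permission.
    return db.read(table=table, auth=token)
```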
Fine-Grained Permissions: Row and Column Security
VAST DataBase enforces row- and column-level security. For example, a DataAnalystAgent might be allowed to read the aggregate columns of a transactions table while being denied the columns containing customer PII, and a row-level policy can restrict each agent to the rows belonging to the tenant or task that spawned it.
This is enforced at the storage layer, not in application code. An agent cannot bypass permissions by crafting a clever SQL query.
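For illustration, such a policy could be described roughly as follows; the `TablePolicy` structure and its field names are invented, not a VAST policy format:

```python
# Hypothetical policy objects -- illustrative of row/column-level rules
# enforced at the storage layer, not an actual VAST policy API.
from dataclasses import dataclass


@dataclass
class TablePolicy:
    table: str
    allowed_columns: list[str]   # column-level security
    row_filter: str = ""         # row-level predicate, if any


POLICIES = {
    "DataAnalystAgent": TablePolicy(
        table="transactions",
        allowed_columns=["region", "amount", "ts"],   # no PII columns
        row_filter="tenant_id = :agent_tenant",       # only its own tenant's rows
    ),
}
```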
Audit Logging
Every operation is logged: which agent identity made the request, what data or event it touched, when, and whether the request was allowed or denied.
These logs are stored in the DataBase itself, queryable for compliance and debugging.
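Because the audit trail is ordinary queryable data, a compliance or debugging query can be issued like any other; the table and column names below are hypothetical:

```python
# Example of querying the audit log; the schema here is invented for illustration.
def recent_denied_requests(db, agent_id: str):
    return db.query(
        """
        SELECT ts, operation, target, allowed
        FROM agent_audit_log
        WHERE agent_id = ? AND allowed = FALSE
        ORDER BY ts DESC
        LIMIT 100
        """,
        [agent_id],
    )
```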
6. Observability: Tracing and Performance Monitoring
In a multi-agent system, there can be an incredible amount of internal traffic and inter-agent communication. For example, a single user task might spawn 50 agents across 20 compute nodes. When something fails, you need to trace the entire causal chain. Standard observability tools like Grafana and Prometheus are inadequate for modern multi-agent, high-velocity workflows: they were built around per-service metrics and dashboards, not around reconstructing the causal chain of events and short-lived kernel executions that a single task fans out into.
VAST’s built-in observability tracks agent kernel executions, event flow through the broker, and platform resource utilization.
These metrics feed into a real-time, comprehensive dashboard that gives system administrators a holistic view of operational health and resource utilization. The dashboard visualizes key performance indicators (KPIs) and resource counters, facilitating quick pattern recognition and trend analysis, and by correlating metrics (CPU, memory, I/O latency, network throughput, queue depths) it rapidly surfaces bottlenecks and accelerates root cause analysis and resolution.
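One common way to make that causal chain reconstructable is to carry a single task identifier through every event an agent emits, so the whole fan-out can be stitched back together afterward; the `broker.publish` call below is a hypothetical stand-in:

```python
# Illustrative trace propagation: a single task_id ties every downstream
# event and kernel invocation back to the originating user request.
import time
import uuid


def emit(broker, kind: str, payload: dict, task_id: str, parent_agent: str):
    broker.publish({
        "task_id": task_id,        # correlates all work spawned by one request
        "parent": parent_agent,    # which agent caused this event
        "kind": kind,
        "ts": time.time(),
        "payload": payload,
    })


task_id = str(uuid.uuid4())        # assigned once, when the user request arrives
```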
An Agentic OS Built for the Future
Of course, the example above represents just a single prompt executed by one user. Multiply this by thousands of employees, all their automatable workflows, and the potential for agent-instigated actions — and there will be a meaningful impact on AI infrastructure.
We built the VAST AI OS to handle AI inference and agentic workflows at any scale. The platform makes this a reality with an event broker that can handle huge volumes of events, which invoke the various functionalities that agents require. A unique enterprise feature is the ability to offload the KV cache from expensive GPU memory to cost-effective VAST Data storage, providing “infinite context” for LLMs and generative agents. This innovation enables stateful, long-running conversations, extensive document analysis, and complex reasoning.
In essence, the VAST AI OS eliminates bottlenecks to deploy and scale enterprise AI, offering a high-performance, unified environment with maximum context and minimal latency.