Beyond the Loop: Architecting a VAST, Multi-Agent AI System
Authored by Anat Heilper, Director of AI Infrastructure, and Ofir Zan, VP, AI Solutions & Enterprise Lead

Why the future of AI is a swarm of millions of agents, and how the VAST Data Platform’s Shared-Everything architecture provides the foundation to scale it.


We are at a critical inflection point, one that fundamentally reshapes the future for both cloud service providers (CSPs) and enterprise buyers. The industry is rapidly moving beyond “AI as a chat box” and hurtling toward “AI as an autonomous workforce.” Today’s single-agent demos, running in a simple while loop, are the “hello, world” prototypes of a new computing era. 

The future isn’t one hyper-intelligent agent; it’s a swarm of millions. Why? Specialization.

A single company isn’t one hyper-productive CEO. It’s a team of specialists in finance, engineering, marketing, and logistics, all collaborating. Complex AI problems are no different. You wouldn’t ask one agent to run a global supply chain, design a silicon chip, or manage a live marketing campaign. These tasks require a team of specialized agents — a PlannerAgent, a ResearchAgent, a DataAnalystAgent, a SecurityMonitorAgent, and a NotificationAgent — all working in concert.

For CSPs, this shift represents the next generation of massive compute demand and a new, high-value AI platform to host. For enterprise buyers, this marks the opportunity to achieve unprecedented levels of automation and operational efficiency.

However, this also creates an unprecedented infrastructure problem. A swarm of millions of autonomous agents, all reasoning, communicating, and accessing data simultaneously, cannot run on today’s siloed architecture. It’s an infrastructure nightmare, creating problems of cost, security, and complexity.

The Bottleneck: The Environment vs. The Swarm

The primary challenge in scaling from a single agent to a swarm is the lack of infrastructure designed for multi-agent orchestration. Today’s agent deployments typically rely on general-purpose infrastructure—virtual machines, containers, or serverless functions—that wasn’t built with swarm coordination in mind. While modern platforms can scale individual components, they lack the integrated primitives agents need to operate collectively at scale.

This creates critical bottlenecks that make swarm deployment challenging:

  • The coupled scaling problem: General-purpose infrastructure doesn’t natively understand agent-specific resource patterns. An agent’s compute needs (LLM inference) and memory needs (vector stores, conversation history) scale differently, but standard deployment models don’t provide first-class primitives for independent scaling of agent-specific resources. You end up either over-provisioning or constantly reconfiguring infrastructure to match workload patterns.
  • The state and amnesia problem: Agents need durable, accessible state across invocations—conversation history, tool outputs, intermediate reasoning. Standard platforms offer generic databases or object storage, but lack agent-native state management that understands semantic context, supports efficient retrieval, handles state synchronization across multiple agents, and survives infrastructure changes. Developers must cobble together solutions from generic components, leading to fragility.
  • The communication bottleneck: Coordinating thousands of agents requires high-throughput, asynchronous, observable communication patterns. While message queues and event brokers exist, they’re not agent-aware—they don’t understand task delegation, result aggregation, or semantic routing. Implementing swarm coordination on generic messaging infrastructure means building the orchestration layer yourself, which is complex and error-prone.
  • The security and identity silo: In a monolithic setup, each agent is an island. It has to manage its own secrets and API keys. The platform provides no central, secure identity or tool execution service, creating thousands of individual security risks instead of one manageable one. This lack of centralized management extends to vital infrastructure like Model Context Protocol tools, which also become disparate and difficult to govern across the agent ecosystem.

These aren’t agent failures; they’re platform gaps—the absence of infrastructure primitives purpose-built for multi-agent systems.

A Shared, Decoupled Foundation for Running AI Agents

VAST’s architectural approach to this problem is built on a core design principle: compute and state must be fully separated, but all compute resources must have shared, high-speed access to a global, consistent state layer.

It turns out that agentic workloads are a perfect fit for this model. Rather than running agents in an “all-in-one-box” environment, we can run the separate functions of an agent—its compute logic, its reasoning, its state, and its communication—as their own layers on our platform. The agent’s logic (CPU) runs on one set of resources, while its state and memory (storage) live on another, accessible by all. CPU compute, GPU compute, storage, and networking can all be scaled independently, which opens the door for efficiently and economically serving large numbers of autonomous agents.

The primary advantage of this disaggregated architecture is the ability to scale only the component that is the bottleneck.

  • Scaling CPU (kernels): If 10,000 tasks arrive, the platform spins up 10,000 cheap, stateless Agent Kernels (on VAST CNodes).
  • Scaling GPU (reasoning): If those agents create 100,000 reasoning requests, only the GPU servers are scaled, backed by the VAST KVCache to maximize GPU utility.
  • Scaling storage (memory): If agents need to access shared knowledge, the VAST Data Platform (on VAST D-Nodes) scales its read replicas and internal bandwidth independently of compute.
  • Scaling I/O (comms): If agents pass massive data payloads, the Event Broker and the VAST fabric’s internal network handle the I/O without the compute nodes becoming a bottleneck.


Architectural and Engineering Deep Dive: Pillars of a Scalable Agentic Platform

This section examines the core components required to run disaggregated agents at scale on the VAST AI OS, addressing the fundamental challenges of coordination, state, scaling, and operability.

1. The Execution Model: Stateless Compute with Shared State Access

The fundamental architectural pattern is simple: agent logic runs as stateless compute, while all persistent state lives in a globally-accessible storage layer. Applying this pattern to agents requires solving specific problems around latency, consistency, and scale.

Agent Kernels as Ephemeral Functions

Agent “kernels” (the executable logic of an agent) run as serverless functions on VAST CNodes (compute nodes). These functions are stateless and short-lived. They:

  • Execute in response to events (e.g., file writes, object creation, table mutations)
  • Have direct NVMe access to the underlying storage fabric
  • Scale horizontally without coordination, so spawning 10,000 agent kernels doesn’t require inter-kernel communication
  • Die after execution; state persistence is the storage layer’s responsibility, not the kernel’s

This is a different architectural approach from traditional container-based agents, where the agent’s logic, memory, and state are bundled. Here, the kernel is throwaway; only the storage persists.
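
To make the pattern concrete, here is a minimal sketch of what a stateless agent kernel could look like in Python. The event payload, handler signature, and storage paths are illustrative assumptions for this sketch, not the platform’s actual SDK.

```python
# Minimal sketch of a stateless agent kernel (hypothetical handler signature).
# The event fields and paths below are illustrative; the real API may differ.
import json
from pathlib import Path

def handle_event(event: dict) -> None:
    """Entry point invoked by the platform when a registered event fires."""
    # 1. Read input directly from the shared storage fabric (local NVMe path).
    input_path = Path(event["object_path"])   # e.g. /vast/datasets/sales_2025_q3.parquet
    payload = input_path.read_bytes()

    # 2. Do the agent's work: validate, transform, call an LLM endpoint, etc.
    result = {"source": str(input_path), "size_bytes": len(payload), "status": "processed"}

    # 3. Persist the result to shared state; the kernel itself keeps nothing.
    out_path = Path("/vast/state/results") / (input_path.stem + ".json")
    out_path.write_text(json.dumps(result))

    # 4. Exit. Any follow-on work is triggered by the write above,
    #    not by this kernel calling another agent directly.
```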

The Data Locality Advantage

Because kernels run on CNodes that are part of the storage fabric, data access is local NVMe, not remote network I/O. When an agent needs to read a 10GB dataset, that data doesn’t move across a network—the compute moves to the data. This is critical for agents doing data-intensive work (e.g., a DataAnalysisAgent processing logs or a VectorSearchAgent scanning embeddings).

Execution Triggers and Lifecycle

Agent kernel triggers can be:

  1. Event-driven: File write, object creation, table insert → kernel spawns automatically
  2. API-driven: External system calls an endpoint → kernel executes on-demand
  3. Scheduled: Cron-like triggers for periodic agent tasks
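
As an illustration, the sketch below shows how one agent’s triggers might be declared across all three styles. The registration schema and field names are hypothetical; they simply mirror the trigger types listed above.

```python
# Hypothetical trigger registrations for one agent; field names are illustrative only.
TRIGGERS = [
    {   # 1. Event-driven: spawn when a matching file is created
        "type": "event",
        "event": "file.created",
        "pattern": "/vast/datasets/sales_*.parquet",
        "handler": "data_validator.handle_event",
    },
    {   # 2. API-driven: execute on demand when an endpoint is called
        "type": "api",
        "route": "/agents/data-validator/run",
        "handler": "data_validator.handle_request",
    },
    {   # 3. Scheduled: cron-like periodic execution
        "type": "schedule",
        "cron": "0 * * * *",   # hourly
        "handler": "data_validator.run_periodic_check",
    },
]
```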

When to Write Agent Output to Persistent Storage

In-memory storage is used for short-lived, task-specific context such as conversational state, temporary results, inter-agent messages, or intermediate computations. This minimizes latency and provides fast, stateful interactions crucial for responsiveness in ongoing agent flows.

Persistent storage (databases, object stores, file systems) holds data that must survive agent restarts, session losses, or system failures—such as project histories, decision logs, audit trails, user profiles, and learned strategies or policies. This ensures reliability, traceability, and long-term knowledge accumulation.

Write everything to persistent storage when:

  • Regulatory or compliance requirements mandate traceability (finance, healthcare)
  • System reliability is paramount
  • Agents work on long-horizon tasks
  • Agent interactions should be inspectable, replayable, or learning-enabled for future deployments.

2. Event-Driven Coordination: Communication Through the Environment

Agents don’t communicate through direct messages alone. They communicate through the environment — by modifying shared state, writing files, updating tables, or inserting vectors. Other agents observe these changes and react.

This is a fundamentally different communication model from traditional RPC or message-passing. Instead of AgentA explicitly calling AgentB, AgentA modifies the environment (writes a file, updates a table), and the platform automatically triggers AgentB based on registered event patterns.

Environment-Driven Agent Triggers

Agents register themselves to be triggered by specific environmental changes:

  • File system events: An agent triggers when a file matching /vast/datasets/sales_*.parquet is created
  • Database mutations: An agent triggers when a new row is inserted into the pending_tasks table
  • Vector updates: An agent triggers when new embeddings are added to the knowledge base
  • Time-based events: An agent triggers on a schedule to perform periodic analysis

Here’s an example workflow:

  1. DataCollectorAgent writes a new file: /vast/datasets/sales_2025_q3.parquet
  2. The file write generates a file.created event on the Event Broker
  3. DataValidatorAgent, registered to trigger on file.created events in /vast/datasets/, automatically spawns
  4. DataValidatorAgent reads the file, validates it, and updates a database table with status = “validated”
  5. The table update generates a table.updated event
  6. AnalysisPipelineAgent, registered to trigger on validation status changes, automatically spawns

No agent explicitly called another agent. They communicated through changes to the shared environment, with the Event Broker mediating the triggers.
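
A hedged sketch of the DataValidatorAgent step in that workflow: it reacts to a file.created event, validates the file, and records the result in a table, letting the platform fire the next trigger. The event fields and the state_db handle are assumptions for illustration, not a real client library.

```python
# Illustrative DataValidatorAgent kernel; the event fields and the `state_db`
# helper are assumptions for this sketch, not a real SDK.
import pyarrow.parquet as pq

REQUIRED_COLUMNS = {"order_id", "amount", "region"}

def on_file_created(event: dict, state_db) -> None:
    """Triggered by a file.created event under /vast/datasets/."""
    path = event["path"]                 # e.g. /vast/datasets/sales_2025_q3.parquet
    table = pq.read_table(path)          # local NVMe read on the storage fabric

    status = "validated" if REQUIRED_COLUMNS.issubset(table.column_names) else "rejected"

    # Writing the status row is the only "message" this agent sends.
    # The table.updated event it produces is what spawns AnalysisPipelineAgent.
    state_db.execute(
        "UPDATE dataset_status SET status = ? WHERE path = ?",
        (status, path),
    )
```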

Why This Model Scales

Direct agent-to-agent messaging requires agents to know about each other. At scale (10,000+ agents), this creates a dependency graph nightmare. When 10,000 agents are operating simultaneously, each producing dozens of state changes per second, you’re looking at hundreds of thousands to millions of events per second. Most event systems collapse under this load.

Environment-driven communication decouples agents completely:

  • Agents don’t need to know which other agents exist
  • New agents can be added without modifying existing agents
  • Failed agents don’t cause cascading failures—their work simply remains in the environment until they recover

VAST’s Event Broker is architecturally different because it’s not a separate service—it’s integrated directly into the storage fabric. This integration provides scalability and performance characteristics that are critical for multi-agent environments. The broker doesn’t maintain events in separate in-memory buffers or on dedicated broker nodes. Events are written directly to D-Nodes as structured data on the same storage fabric that handles all other I/O. This means:

  • No broker bottleneck: Event capacity scales with storage capacity, not broker node count
  • Unlimited retention: Events can persist indefinitely because storage is cheap and abundant
  • Parallel consumption: Thousands of agents can consume events concurrently without overwhelming broker nodes—they’re reading directly from distributed storage

Because events are part of the storage fabric:

  • Sub-millisecond publish latency: Writing an event is a storage write, leveraging NVMe and RDMA
  • Massive parallel throughput: The storage fabric can handle millions of events per second across the cluster
  • No data movement for triggers: When an event triggers a DataEngine function, the function executes on the same CNode where the event is stored—zero data movement

3. State Management: Transactional Tables, Vectors, and Consistency

Agent state must be persistent, queryable, and consistent. This is where most traditional agent architectures fail—they rely on external databases (e.g., Postgres for state, Pinecone for vectors) which introduces latency, complexity, and data movement.

VAST DataBase unifies three capabilities in a single storage layer:

  1. OLTP: Transactional state (agent metadata, task status, workflow state)
  2. OLAP: Analytical queries (aggregate agent performance, system metrics)
  3. Vector storage: Embeddings for RAG, semantic search, and memory retrieval

The DataBase is built on VAST’s disaggregated storage architecture. Data is stored in columnar format on D-Nodes, with multiple layers of indexing:

  • Row-based index: For transactional lookups (e.g., “fetch state for agent_id=123”)
  • Column-based index: For analytical scans (e.g., “count all agents with status=failed”)
  • Vector index: For approximate nearest neighbor search over embeddings
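
To illustrate how one table layer can serve all three access patterns, here are hedged example queries. The SQL dialect, the vector-search call, and the db and embed handles are generic placeholders, not VAST’s actual query interface.

```python
# Illustrative access patterns against a unified table layer. The `db` and `embed`
# handles, query dialect, and vector-search method are assumptions of this sketch.
def example_queries(db, embed):
    # 1. OLTP: transactional point lookup via the row-based index
    agent_state = db.query("SELECT * FROM agent_state WHERE agent_id = 123")

    # 2. OLAP: analytical scan via the column-based index
    failed_count = db.query("SELECT count(*) FROM agent_state WHERE status = 'failed'")

    # 3. Vector: approximate nearest-neighbor search over stored embeddings
    memories = db.vector_search(
        table="agent_memory",
        query_embedding=embed("quarterly sales anomalies"),  # hypothetical embedding helper
        top_k=10,
    )
    return agent_state, failed_count, memories
```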

Vector Operations at Scale

The vector storage subsystem is designed for trillion-scale vector search. Key characteristics include:

  • Parallel updates: Unlike traditional vector DBs that require full re-indexing, VAST’s parallel architecture allows concurrent inserts without locking the entire index
  • Data freshness: All data is searchable, including data that has not yet been ingested into the vector database
  • Trillion-vector scale: No sharding and no performance degradation at scale

4. The Memory Hierarchy: KV Cache Offloading for GPU Efficiency and Endless Context

In multi-agent systems, the GPU inference layer is the most expensive and constrained resource. When thousands of agents are simultaneously calling LLM reasoning services, GPU utilization becomes the critical bottleneck that determines system throughput and cost efficiency.

The KV cache is the key to unlocking high GPU utilization, reducing response latency, and supporting the endless-context conversations that multi-agent systems require.

The GPU Utilization Problem

A GPU’s compute capacity is wasted if its memory (HBM) fills up before its processors are fully utilized. 

Long-context inference on models like Llama 3.1 70B faces different bottlenecks across its two phases. The prefill phase to compute a 128K-token KV cache (~40 GB in FP16, 50% of an H100 80GB’s memory) saturates tensor cores at 60-90% utilization—expensive computation you’d want to reuse whenever possible. The subsequent autoregressive decode phase becomes memory-bound, dropping to only 20-30% utilization as each step fetches the entire KV cache but performs minimal computation. With model weights at ~140 GB in FP16, a 4×H100 80GB configuration (320 GB total HBM) leaves only ~20-25 GB per GPU for KV cache after accounting for operational overhead. This memory constraint severely limits concurrent long-context requests, forcing deployments to provision 3-5× more GPUs than compute needs would suggest—purely for memory capacity to hold KV caches. Even with optimizations like FP8 quantization, continuous batching, and multi-query attention, memory capacity remains the primary scaling constraint.
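
The ~40 GB figure follows from the standard KV-cache size formula; here is a quick back-of-the-envelope check using the published Llama 3.1 70B configuration (80 layers, 8 KV heads under grouped-query attention, head dimension 128).

```python
# Back-of-the-envelope KV-cache size for Llama 3.1 70B at 128K context in FP16.
layers      = 80          # transformer layers
kv_heads    = 8           # grouped-query attention: 8 key/value heads
head_dim    = 128         # per-head dimension (8192 hidden size / 64 query heads)
bytes_fp16  = 2
context_len = 128 * 1024  # 128K tokens

# Factor of 2 covers both keys and values.
bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_fp16   # 327,680 B ≈ 320 KiB
total_bytes     = bytes_per_token * context_len

print(f"{bytes_per_token / 1024:.0f} KiB per token, {total_bytes / 1e9:.1f} GB total")
# -> 320 KiB per token, 42.9 GB total: roughly the ~40 GB cited above,
#    i.e. about half of a single H100's 80 GB of HBM before model weights are counted.
```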

The result: you’re paying for a $30,000 GPU but only using a fraction of its processing power because memory is the bottleneck, not compute. At scale, this means you need 3-5x more GPUs than you should, purely due to memory constraints.

The Solution: KV Cache Offloading

By offloading the KV cache from GPU HBM to VAST’s high-speed storage fabric, you break the memory bottleneck:

  • Hot cache blocks (actively being used for generation) stay in GPU HBM
  • Cold cache blocks (common prefixes, older context, inactive requests) are moved to VAST storage
  • When needed, cold blocks are fetched back to HBM in under 1ms

This dramatically increases GPU utilization because memory is no longer the constraint. You can now serve 50-100+ concurrent agent requests per GPU instead of 10-20, fully utilizing the GPU’s compute capacity.

Enabling Endless Context for Multi-Agent Systems

Multi-agent systems require agents to maintain long-running context across multiple interactions. A ResearchAgent might accumulate more than 100,000 tokens of context over hours of operation. A ConversationAgent might maintain weeks of chat history. Without KV cache offloading, this is impossible—the context simply won’t fit in GPU memory.

With KV cache offloading:

  • Agents can maintain effectively unlimited context length—bounded only by storage, not GPU memory
  • Long-running agents don’t need to truncate or summarize their context to fit memory constraints
  • Multi-turn reasoning tasks can operate over massive context windows without performance degradation

The Technical Mechanism

VAST achieves KV cache offloading by integrating with inference frameworks (e.g., NVIDIA Dynamo, LMCache) at the KV cache layer:

  • The inference engine generates KV cache blocks during forward passes
  • The cache manager (modified to use VAST as backing storage) keeps frequently-accessed blocks in GPU HBM
  • When HBM fills up, cold blocks are evicted to VAST storage via high-speed NVMe and RDMA
  • On a cache miss, the block is fetched from VAST storage with sub-millisecond latency
  • Prefetch hints and access patterns allow proactive loading of likely-needed blocks
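
Below is a deliberately simplified sketch of the hot/cold tiering pattern described above, with an in-process dictionary standing in for GPU HBM and a directory standing in for the VAST tier. It is not the NVIDIA Dynamo or LMCache integration itself, only an illustration of the evict-and-fetch logic.

```python
# Simplified hot/cold KV-cache tiering sketch (illustrative only; real integrations
# such as NVIDIA Dynamo or LMCache manage GPU memory and transport very differently).
from collections import OrderedDict
from pathlib import Path
import pickle

class TieredKVCache:
    def __init__(self, cold_dir: str, hbm_capacity_blocks: int):
        self.hot = OrderedDict()           # stands in for blocks resident in HBM
        self.cold_dir = Path(cold_dir)     # stands in for the VAST storage tier
        self.cold_dir.mkdir(parents=True, exist_ok=True)
        self.capacity = hbm_capacity_blocks

    def put(self, block_id: str, block) -> None:
        self.hot[block_id] = block
        self.hot.move_to_end(block_id)
        if len(self.hot) > self.capacity:  # HBM full: evict the coldest block
            victim_id, victim = self.hot.popitem(last=False)
            (self.cold_dir / victim_id).write_bytes(pickle.dumps(victim))

    def get(self, block_id: str):
        if block_id in self.hot:           # hot hit: block is already "in HBM"
            self.hot.move_to_end(block_id)
            return self.hot[block_id]
        # Cold miss: fetch from the storage tier and promote back to the hot tier.
        block = pickle.loads((self.cold_dir / block_id).read_bytes())
        self.put(block_id, block)
        return block
```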

Technically, there is a trade-off when offloading KV cache to external storage: generation latency increases by 1-2ms per cache miss. However, this small latency increase enables:

  • 3-5x higher throughput: Serve more agents per GPU by breaking the memory bottleneck
  • 3-4x lower GPU costs: Need fewer GPUs to handle the same agent workload
  • Unlimited context: Support long-running agents with massive context windows

For multi-agent systems, where agents are performing reasoning tasks rather than real-time chat, this trade-off is overwhelmingly favorable. As costs add up fast and GPU efficiency becomes a bottleneck, KV cache offloading is a crucial component in multi-agent deployments.

5. Security and Isolation: Identity, Permissions, and Audit

When thousands of agents are executing autonomously, security cannot be an afterthought. The traditional approach—giving each service a long-lived API key or shared credentials—breaks down catastrophically at scale. A compromised key becomes a skeleton key to your entire system. Credential rotation becomes an operational nightmare. And when something goes wrong, tracing which agent did what becomes nearly impossible.

The alternative is to treat agents as first-class principals in your security model, with the same rigor you’d apply to human users or external services. Each agent must have:

  1. Identity: Who is this agent?
  2. Authorization: What can this agent access?
  3. Auditability: What did this agent do?

Identity Fabric: Short-Lived Credentials

Agents don’t have static API keys. Instead, when an agent kernel spawns, it requests a short-lived identity token from the platform’s identity service. This token:

  • Has a TTL (e.g., 15 minutes)
  • Carries scoped permissions (e.g., read access to /vast/datasets/sales/*)
  • Is tied to a specific agent instance (not reusable across spawns)

When the agent makes a request (read from DataBase, publish to Event Broker), the token is validated. If expired or unauthorized, the request fails.
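
A hedged sketch of that lifecycle: a scoped, short-lived credential is minted per kernel spawn and checked on every data-plane call. The token shape and validation logic here are assumptions for illustration, not the platform’s identity API.

```python
# Illustrative short-lived, scoped agent credential; the token shape and checks
# are assumptions of this sketch, not the platform's actual identity service.
import time
import uuid
from dataclasses import dataclass, field
from fnmatch import fnmatch

@dataclass
class AgentToken:
    agent_instance: str
    scopes: dict                       # e.g. {"read": ["/vast/datasets/sales/*"]}
    ttl_seconds: int = 900             # 15-minute lifetime
    issued_at: float = field(default_factory=time.time)
    token_id: str = field(default_factory=lambda: uuid.uuid4().hex)

    def allows(self, action: str, resource: str) -> bool:
        if time.time() - self.issued_at > self.ttl_seconds:
            return False               # expired: the request must fail
        return any(fnmatch(resource, pattern) for pattern in self.scopes.get(action, []))

# Usage: issued per kernel spawn, never shared across spawns.
token = AgentToken(agent_instance="data-validator-7f3a",
                   scopes={"read": ["/vast/datasets/sales/*"]})
assert token.allows("read", "/vast/datasets/sales/2025_q3.parquet")
assert not token.allows("write", "/vast/datasets/sales/2025_q3.parquet")
```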

Fine-Grained Permissions: Row and Column Security

VAST DataBase enforces row- and column-level security. For example:

  • AgentA (a DataAnalysisAgent) has read access to all rows in sales_data
  • AgentB (a ReportGeneratorAgent) has read access only to aggregated views
  • AgentC (a SecurityAuditorAgent) has read access to audit logs but not raw sales data

This is enforced at the storage layer, not in application code. An agent cannot bypass permissions by crafting a clever SQL query.

Audit Logging

Every operation is logged:

  • Agent identity
  • Resource accessed
  • Operation type (read, write, query)
  • Timestamp
  • Result (success, denied, error)

These logs are stored in the DataBase itself, queryable for compliance and debugging.
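
Because audit records land in an ordinary queryable table, compliance and debugging questions become simple queries. A generic illustration follows (the schema, dialect, and db client handle are assumptions of this sketch).

```python
# Illustrative audit-log query; the `audit_log` schema, SQL dialect, and `db`
# client are assumptions of this sketch.
import datetime

def recent_denied_operations(db, hours: int = 24):
    """Which agents were denied access to which resources in the last N hours?"""
    cutoff = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(hours=hours)
    return db.query(
        "SELECT agent_identity, resource, operation_type, timestamp "
        "FROM audit_log WHERE result = 'denied' AND timestamp > ? "
        "ORDER BY timestamp DESC",
        (cutoff,),
    )
```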

6. Observability: Tracing and Performance Monitoring

In a multi-agent system, there can be an incredible amount of internal traffic and inter-agent communication. For example, a single user task might spawn 50 agents across 20 compute nodes. When something fails, you need to trace the entire causal chain. Standard observability tools like Grafana and Prometheus are inadequate for modern multi-agent, high-velocity workflows due to four key limitations:

  • Lack of causal tracing: They show what failed (e.g., latency), but not the multi-step why across an event-driven agent swarm.
  • Data locality ignorance: They treat compute and storage separately, failing to diagnose performance issues caused by remote data fetching instead of local access. VAST’s integrated observability handles this.
  • Massive event volume: The millions of state-change events generated by agent swarms overwhelm standard logging pipelines. VAST handles this scale by treating events as structured data within the storage fabric.
  • No agent-centric metrics: They lack native tracking for crucial agent metrics like Cost per Agentic Task, Agent Latency per Token, and KV Cache Utilization.

VAST’s built-in observability tracks:

  • I/O patterns: Which agents are reading/writing what data, and at what rate
  • Latency breakdown: Time spent in compute vs. storage vs. network
  • Resource utilization: Compute, storage, and network usage for each executing job

These metrics feed a real-time dashboard that gives system administrators a holistic view of operational health and resource utilization. The dashboard visualizes key performance indicators (KPIs) and resource counters, making it easy to spot patterns and trends. By correlating metrics (CPU, memory, I/O latency, network throughput, queue depths), it quickly surfaces performance bottlenecks and accelerates root-cause analysis and resolution.

An Agentic OS Built for the Future

Of course, the above example represents just a single prompt executed by one user. Multiply this by thousands of employees, all their automatable workflows, and the potential for agent-instigated actions — and there will be a meaningful impact on AI infrastructure.

We built the VAST AI OS to handle AI inference and agentic workflows at any scale. The platform makes this a reality with an event broker that can handle huge volumes of events, which invoke the various functionalities that agents require. A unique enterprise feature is the ability to offload the KV cache from expensive GPU memory to cost-effective VAST Data storage, providing “infinite context” for LLMs and generative agents. This innovation enables stateful, long-running conversations, extensive document analysis, and complex reasoning. 

In essence, the VAST AI OS eliminates bottlenecks to deploy and scale enterprise AI, offering a high-performance, unified environment with maximum context and minimal latency.
