🐢 LLM as Judge for LLM as Judge: Turtles All the Way Down

Komodor Klaudia Refactor: When Your AI System Needs to Judge Its Own Judge

At Komodor, we're deep into the biggest architectural refactor in Klaudia's history. What started as a quest to modernize our multi-agent Kubernetes troubleshooting system has led us down a fascinating rabbit hole: How do you validate the system that validates your AI?

The Plot Twist: Klaudia Was Never Just One Agent

Here's something that might surprise you: despite appearing as a single, coherent AI assistant, Klaudia has always been dozens of specialized agents working in concert. We have agents for GPU diagnostics, network analysis, storage investigation, security policy validation, and many more - each one an expert in its domain.

When we built Klaudia originally, the multi-agent landscape was barren. No solid frameworks, no Model Context Protocol (MCP), no established patterns. So we did what any startup does: we built everything custom. And it worked brilliantly.

But technology evolves. Fast.

Why We're Rebuilding: The Framework Revolution

The agent orchestration space has matured dramatically. Modern frameworks like Agno offer robust patterns for multi-agent coordination, state management, and tool integration - patterns that make maintaining our custom architecture feel like reinventing the wheel.

The benefits are clear:

  • Maintainability: Standard patterns instead of custom glue code
  • Extensibility: Easier to add new agents and capabilities
  • Reliability: Battle-tested frameworks instead of homegrown solutions
  • Performance: Optimized orchestration and resource management

So we made the call: migrate everything to a modern, established framework.

The Validation Paradox

But here's where it gets interesting (and recursive).

Our "LLM as Judge" system has been the cornerstone of Klaudia's reliability. It's the AI system that scores, validates, and ensures the quality of every root cause analysis Klaudia produces. It's what gives us confidence in our 95%+ accuracy claims.

Now we're migrating this judge to the new framework too. But early results show significant differences between the old judge and the new judge when evaluating the same Klaudia outputs.

The question that keeps me up at night: Which judge is right?

Enter the Meta-Judge: LLM as Judge for LLM as Judge

We've reached peak recursion: we're now building an LLM as Judge system to evaluate our LLM as Judge system.

Think about the philosophical depth here:

  • Judge A (old framework): Validates Klaudia's Kubernetes analysis
  • Judge B (new framework): Also validates Klaudia's analysis, but scores differently
  • Meta-Judge: Evaluates which judge is more accurate

It's validation systems all the way down - a perfect example of the "turtles all the way down" problem in epistemology.
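One way to picture the meta-judge layer is as a toy selection rule over competing verdicts. The `Verdict` type and the "closest to reality wins" heuristic below are illustrative assumptions, not our production logic:

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    judge: str    # e.g. "old-framework" or "new-framework"
    score: float  # quality score in [0.0, 1.0] for the same analysis

def meta_judge(verdicts: list[Verdict], actually_resolved: bool) -> str:
    """Toy rule: trust the judge whose score best matched what really
    happened (1.0 if the analysis led to a fix, else 0.0)."""
    target = 1.0 if actually_resolved else 0.0
    return min(verdicts, key=lambda v: abs(v.score - target)).judge
```

In practice the meta-judge would itself be an LLM weighing both rationales, which is exactly where the recursion bites: something still has to validate *it*.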

The Real-World Engineering Challenge

This isn't just academic navel-gazing - the validation uncertainty has real implications:

  • Quality Assurance: How do we maintain Klaudia's reliability during the transition?
  • Customer Trust: Our accuracy claims depend on validated results - which validator do we trust?
  • Development Velocity: We can't ship improvements without confidence in our quality metrics.

Our Approach: Empirical Validation at Scale

We're solving this methodically:

  1. Shadow Testing: Running both judges simultaneously on historical incidents where we know the ground truth
  2. Customer Outcome Correlation: Tracking which judge's scores better predict successful issue resolution
  3. Expert Review: Having our Kubernetes experts evaluate cases where the judges disagree most strongly
  4. Feedback Loop Integration: Using customer feedback to tune the meta-validation process

The Deeper Lesson: Trust in Autonomous Systems

This challenge illuminates something fundamental about building AI systems that users actually trust. It's not enough to have sophisticated algorithms - you need sophisticated ways to validate those algorithms.

As we move toward more autonomous AI-SRE systems, this meta-validation becomes critical. When Klaudia eventually starts taking autonomous remediation actions (coming soon), the stakes for getting validation right become exponentially higher.

What We're Learning

Three key insights from this journey:

Validation is a Product: Your "LLM as Judge" system isn't internal tooling - it's a core product component that needs the same engineering rigor as customer-facing features.

Recursive Complexity is Inevitable: As AI systems become more sophisticated, you'll inevitably need AI to help you understand and validate your AI. Embrace it rather than fight it.

Trust Requires Transparency: The more layers of AI validation you add, the more important it becomes to make the entire process observable and explainable to humans.

The Road Ahead

We're not just refactoring Klaudia's architecture - we're building the foundation for the next generation of autonomous Kubernetes operations. A system that doesn't just diagnose problems with expert precision, but validates its own reasoning with the same rigor.

The goal remains unchanged: democratizing Kubernetes expertise and eliminating the 3 AM debugging sessions that plague every engineering team.

But now we're doing it with a validation system that validates itself. Because in the world of autonomous AI-SRE, trust isn't just nice to have - it's everything.




What's your experience with validating AI systems? Have you encountered similar recursive validation challenges? I'd love to hear how other teams are solving the "who watches the watchers" problem in AI operations.

#Kubernetes #AI #SRE #CloudNative #MultiAgent #TechLeadership #Innovation #Klaudia #AgentFramework #Validation

Vasyl Kazakov

Translating B2B SaaS trends into insights for smarter UI/UX design decisions at Uitop

1w

The recursive AI solution itself is both fascinating and scary 😅.

Just as long as they read the manual, we're good.

Sharone Revah Zitzman 🎗️🇮🇱

Chief DevRel | Community Organizer (@tlvcommunity) | @shar1z

1w

Well, I for one, welcome our machine overlord judges.

Nikki G.

☸️☕ Overly Caffeinated | K8s for Humans | Marketing Jedi | LinkedIn Cheerleader🤖🚀

1w

Love a good plot twist!

Ilan Adler

Product, Technical Marketing, and BizDev

1w

Validation on validation 😉
