Today, we’re releasing our latest findings from our work on CyberGym, a large-scale, high-quality cybersecurity evaluation framework designed to rigorously assess the capabilities of AI agents on real-world vulnerability analysis tasks. This benchmark was recently included in the latest Anthropic model card. Our system achieved a 53% success rate, a 90% improvement over existing systems. The takeaway? Craft matters. The right architecture and tooling turn good models into great agents, and that matters for deploying these capabilities effectively at enterprise scale. Read the full report in the comments.
It’s great to see another example of how deriving intuition from first principles and taking the time to dive deep and iterate on a small eval set yields quality insights that generalize well beyond it! Thanks a lot for sharing these write-ups. This perspective "from the trenches" is invaluable :-)
Amazing job John Heyer! Great to see us being more public about all the work we are doing internally to secure everyone's codebase
Wow this is quite significant! Talk about ruining the class curve ;) Terrific work and you guys are only getting started!
Woooo!
amazing job! excited for what's to come!
Great progress!
Great work!
Super impressive! This is great proof that context matters so much and that frontier models really do need an explicit supporting framework to do well. Even more so beyond benchmarking, in the messy real world where anything is fair game as long as it uncovers vulnerabilities, giving the agent clear tactics from a domain expert should be extremely useful. Looking forward to learning more about this in future iterations!