Realistic cyber environments

Inside 0Labs' approach to high-fidelity cyber range design

Simulating cyber environments
Cyber ranges are the foundation of our work. Whether we're evaluating the autonomous capabilities of offensive AI agents, stress-testing detection logic, or training models to perform defensive tasks, the quality of the environment determines the quality of the result. Here we discuss how 0Labs thinks about ranges - why realism matters, what distinguishes a range from a task benchmark, and how ranges serve both offensive and defensive research.
Realism is non-negotiable
A meaningful evaluation requires a meaningful environment. Many existing benchmarks fall short here since they rely on simplified environments that don’t reflect the complexity of the real world. An agent that succeeds in a stripped-down lab environment may fail entirely against a network with realistic noise, layered defenses, diverse operating systems, and real software stacks.
Our ranges are built from reality. For capability evaluations, we design environments around documented historical incidents - the 2015 Sandworm attack on the Ukrainian power grid, for instance - drawing from threat intelligence, published reports, and independent expert review. For enterprise environments, we deploy traditional security tooling across defended endpoints, common enterprise applications, and cloud infrastructure. If a service typically runs on a server, it runs on a VM. If it's containerised in production, we containerise it. We don't simplify deployments - doing so introduces blind spots that can skew evaluation results and reduce attack fidelity.
Our environments are also designed to be modular - attack paths, security controls, and noise levels can all be set independently. This lets us dial difficulty deliberately.
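To make the modularity concrete, here is a minimal sketch of what an independently configurable range definition could look like. The schema, field names, and `harden` helper are illustrative assumptions, not 0Labs' actual configuration format.

```python
from dataclasses import dataclass, field
from enum import Enum


class NoiseLevel(Enum):
    QUIET = "quiet"      # minimal benign background traffic
    TYPICAL = "typical"  # workday-level user and service activity
    HEAVY = "heavy"      # high-volume noise that buries attacker signal


@dataclass
class RangeConfig:
    """Hypothetical modular range definition: each difficulty axis
    (attack paths, controls, noise) is set independently."""
    attack_paths: list[str] = field(default_factory=list)       # e.g. "web-shell->db-exfil"
    security_controls: list[str] = field(default_factory=list)  # e.g. "edr", "network-ids"
    noise: NoiseLevel = NoiseLevel.TYPICAL

    def harden(self) -> "RangeConfig":
        """Dial difficulty up on two axes without touching the attack paths."""
        return RangeConfig(
            attack_paths=self.attack_paths,
            security_controls=self.security_controls + ["siem-alerting"],
            noise=NoiseLevel.HEAVY,
        )


easy = RangeConfig(attack_paths=["web-shell->db-exfil"], security_controls=["network-ids"])
hard = easy.harden()
```

Because each axis is independent, the same attack path can be evaluated against quiet, lightly defended infrastructure and against a noisy, heavily instrumented one - which is what dialling difficulty deliberately means in practice.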
The goal is an environment hard enough to be honest.
Ranges vs. task benchmarks
We recognise a clear distinction between evaluating AI capabilities across full cyber ranges and evaluating discrete skills through targeted tasks. Both serve an important purpose.
A range tests the thing that actually matters for safety: can an agent autonomously chain an end-to-end attack against realistic defended infrastructure? This is where dangerous capability thresholds reveal themselves. To inform that threshold, we engage with industry experts and organisations to understand what a concerning attack actually looks like in their context - those conversations shape our specifications.
Task benchmarks serve a different, complementary function. An isolated cryptography challenge, for example, doesn't require standing up an enterprise environment, and it gives us precise signal about what a model knows and how it reasons. That granular understanding feeds back into range evaluations: it helps explain how an agent chains an attack in the wild, not just whether it can.
Together, they paint a fuller picture.
Attack simulation for better defense
The same infrastructure we use to simulate attacks also powers defensive research. In one experiment, we gave an agent access to a SIEM connected to a live range under attack. It investigated the activity, identified attack sequences, and surfaced detection gaps much faster than we could. That result points toward something significant: simulation as a mechanism for scaling detection engineering.
This is core to what we're building. By running adaptive, AI-driven attacks against our ranges and feeding that signal into detection pipelines, we can validate existing coverage, identify gaps, and refine logic continuously. The range becomes a closed loop. Attacks improve, and so do defenses.
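One turn of that closed loop can be sketched as a coverage comparison: which techniques did the simulated attack exercise, and which did the detection pipeline actually catch? The event and alert shapes below are illustrative, not a real SIEM schema, and the technique IDs follow the MITRE ATT&CK numbering convention.

```python
def detection_gap_report(attack_events: list[dict], fired_alerts: list[dict]) -> dict:
    """Compare techniques exercised by a simulated attack against the
    alerts a detection pipeline fired, and surface coverage gaps."""
    exercised = {e["technique"] for e in attack_events}
    detected = {a["technique"] for a in fired_alerts}
    return {
        "covered": sorted(exercised & detected),
        "gaps": sorted(exercised - detected),  # techniques needing new detection logic
    }


# One loop iteration: run the attack, score coverage, refine logic, repeat.
events = [{"technique": "T1059"}, {"technique": "T1021"}, {"technique": "T1041"}]
alerts = [{"technique": "T1059"}]
report = detection_gap_report(events, alerts)
# report["gaps"] == ["T1021", "T1041"]
```

Feeding the `gaps` list back into detection engineering, then re-running the attack, is what turns the range into a loop rather than a one-off test.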
Training on high-quality trajectories
Ranges also serve as training environments. We've successfully trained models on expert-quality trajectories generated within our infrastructure - attack traces that represent genuinely skilled behavior. These trajectories are automatically labelled by our harness and used to train open-source models that perform as well as frontier models.
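A minimal sketch of what automatic trajectory labelling could look like, assuming a simple step-level record of agent actions and outcomes - the harness interface shown here is hypothetical, not 0Labs' actual one.

```python
from dataclasses import dataclass


@dataclass
class Step:
    action: str       # command the agent ran
    observation: str  # what the environment returned
    success: bool     # did the step advance the objective?


def label_trajectory(steps: list[Step], goal_reached: bool) -> dict:
    """Keep trajectories that cleanly reached the goal and emit
    (action, observation) pairs suitable for supervised fine-tuning."""
    expert = goal_reached and all(s.success for s in steps)
    return {
        "quality": "expert" if expert else "discard",
        "examples": [(s.action, s.observation) for s in steps],
    }


traj = [
    Step("nmap -sV 10.0.0.5", "open: 22,80", True),
    Step("curl http://10.0.0.5/login", "200 OK", True),
]
labelled = label_trajectory(traj, goal_reached=True)
# labelled["quality"] == "expert"
```

Filtering on outcome like this is one simple way to ensure only genuinely skilled traces make it into a training set.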
We have also started exploring using these environments for reinforcement learning, with the goal of training agents that improve on defensive tasks like vulnerability research and SOC operations.
We are excited about this direction of research and have been inviting cybersecurity and AI labs to partner with us.
Closing thoughts
As AI capability in offensive security accelerates, the environments we use to evaluate and train against it will define how clearly we see the threats, and how quickly we defend against them.
If you are interested in working with us, or gaining access to any of our simulated environments, please contact us below.