


Cradle’s Innovation Engine: How Benchmarking Powers Breakthrough Protein Design
Cradle’s lab-in-the-loop benchmarking drives our best-in-class protein design—helping customers create better proteins, faster and with less effort

Thomas & Noé
May 13, 2025
This year, Cradle’s automated machine learning (ML) platform for protein design won the international Adaptyv Bio protein design challenge, outperforming 130 competitors to design the molecule that bound most strongly to a key therapeutic target in cancer treatment. After the win, Adaptyv Bio offered to test 11 additional, diverse protein sequences submitted by Cradle. All 11 also beat the original competition field—meaning Cradle protein designs would have ranked 1 through 12. This kind of consistent performance isn’t luck. It’s the result of a rigorous, data-driven approach to machine learning—and particularly to benchmarking.
Benchmarking rarely makes headlines, but at Cradle, it’s a key driver of success. Crucially, our benchmarking is grounded in real-world experimental data: we generate ground truth in our own lab, where we synthesize and test our novel proteins.
At Cradle, every new release is validated in our internal lab. We ensure our active learning works across diverse protein modalities, including antibodies, peptides, enzymes, and vaccines. We ensure our system can learn from and improve multiple properties simultaneously. And we ensure all of this can be done without requiring additional code, customized training procedures, or other types of manual intervention.
In this post, we unpack how benchmarking underpins everything we do at Cradle—and why it’s essential to building the most consistent, high-performance ML platform in protein engineering.
The challenge: Building a system that works out of the box
In simple terms, Cradle is a software platform that enables biologists to generate optimized protein sequences using models customized to their problem. As more data is collected and uploaded, the models are automatically fine-tuned, resulting in improved sequences round over round. Models and data remain securely stored on Cradle’s cloud, private to the customer, and accessible at any time through an intuitive user interface or well-documented, easy-to-use APIs.
Cradle operates through a powerful combination of two types of ML models (a simplified sketch of how they work together follows this list):
Generators: Large protein language models that propose new protein sequences—mutants or variants—to efficiently explore the vast space of possible, promising proteins.
Predictors: Models that forecast how well those proposed sequences are likely to perform in the lab (in vitro).

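To make that division of labor concrete, here is a deliberately simplified sketch of how a generator and a predictor can work together in one design round. The sequence, model stand-ins, and scores below are illustrative placeholders, not Cradle’s actual architecture.

```python
# Illustrative sketch only: a minimal generator/predictor loop.
# The models here are random placeholders, not Cradle's implementation.
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def generate_variants(parent: str, n: int, mutations: int = 2) -> list[str]:
    """Stand-in for a generator model: propose mutants of a parent sequence."""
    variants = []
    for _ in range(n):
        seq = list(parent)
        for pos in random.sample(range(len(seq)), k=mutations):
            seq[pos] = random.choice(AMINO_ACIDS)
        variants.append("".join(seq))
    return variants

def predict_fitness(sequence: str) -> float:
    """Stand-in for a predictor model: score how a sequence might perform in vitro."""
    return random.random()  # a real predictor would be a trained model

parent = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # hypothetical starting sequence
candidates = generate_variants(parent, n=1000)

# Rank candidates by predicted performance and keep the top few for lab testing.
top_designs = sorted(candidates, key=predict_fitness, reverse=True)[:16]
```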
Benchmarking allows us to understand and measure the performance of both. It’s the process of comparing the quality of model outputs against verified, ground truth data. For predictors, this is relatively straightforward—we draw on our rich and extensive collection of public and proprietary datasets covering key protein characteristics such as thermostability, expression, and binding affinity.
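As a rough picture of what predictor benchmarking can look like in practice, the sketch below compares predicted values against measured thermostability for a handful of hypothetical held-out variants and computes a rank correlation. The variants, values, and predictor are made up for the example.

```python
# Illustrative sketch: benchmarking a predictor against held-out assay data.
# The dataset and predictor here are placeholders, not Cradle's internals.
from scipy.stats import spearmanr

# Measured thermostability (e.g. melting temperature, in degrees C) for held-out variants.
measured = {"VARIANT_A": 52.1, "VARIANT_B": 61.7, "VARIANT_C": 48.9, "VARIANT_D": 66.3}

def predict_thermostability(variant_id: str) -> float:
    """Placeholder for a trained predictor model."""
    return {"VARIANT_A": 50.0, "VARIANT_B": 63.2, "VARIANT_C": 47.5, "VARIANT_D": 64.8}[variant_id]

y_true = list(measured.values())
y_pred = [predict_thermostability(v) for v in measured]

# A rank correlation is a common choice: what often matters is whether the
# predictor orders variants correctly, not whether it matches absolute values.
rho, _ = spearmanr(y_true, y_pred)
print(f"Spearman rho on held-out variants: {rho:.2f}")
```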
But benchmarking generator models is an altogether harder problem. You can’t benchmark a novel protein sequence unless you know how it performs in the lab—and, by definition, no one does, because it’s never existed before.
This challenge isn’t unique to protein engineering. It also shows up in fields such as large language models (LLMs), where models generate new text and researchers have to ask: Does this response make sense? Is it addressing the user’s prompt? Often, training these systems relies on humans to read the output and score its quality—a process that’s slow, subjective, and hard to scale. In generating protein designs, we face a similar dilemma. But we can’t rely on opinion; we need objective, accurate data.
Closing the loop: How our lab sets us apart
That’s why we built our own lab, which allows us to synthesize and test the proteins our models generate with short turnaround time, often just a week or two. This gives us an objective, repeatable source of ground truth to evaluate both generators and predictors, and it closes the loop between our cutting-edge algorithms and biological reality. Cradle’s automated lab plays the role of a high-throughput, unbiased evaluator, measuring how well different models and algorithms truly navigate the protein space.
This tight feedback loop is a key part of what sets Cradle apart. It transforms benchmarking from an academic exercise into a core part of how we develop, iterate, and improve the platform so our customers can depend on its outcomes. And it ensures Cradle stays grounded in real-world, lab-validated performance, not just theoretical promise.
Cradle benchmarking: Rigorous and real-world
Customers start off with Cradle’s robust, pre-trained models that are continuously improved through our internal benchmarking. Each customer operates within their own private, secure workspace, where they may create and run their various protein engineering projects. For each project, they can fine-tune Cradle’s pre-trained models using their own proprietary assay data. This allows for targeted optimization based on each project’s specific goals. Importantly, customer data is never used to train or improve Cradle’s base models, nor is it shared with other customers’ workspaces.
We continuously refine our algorithms to push the boundaries of protein design, with every algorithmic enhancement undergoing rigorous benchmarking to ensure it improves the quality of the platform across the board, not just for a single problem or customer.
Our benchmarking strategy is twofold:
In silico validation: For models that predict lab performance, such as how stable or active a novel protein might be, we benchmark improvements using existing ground truth data available internally. This lets us rapidly iterate and refine our predictor models. Beyond the predictor models themselves, we also benchmark the algorithms that are used to train or fine-tune predictor models when new lab data is uploaded (see the sketch after this list).
In vitro validation: Major algorithmic changes, especially those impacting the generator models, must first be tested and proven effective in the lab. We dedicate lab capacity to this experimental validation. This is critical, because the real test of a newly generated protein is how it performs in the real world.

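As a rough picture of the in silico side, the sketch below compares a baseline fine-tuning recipe against a candidate recipe on the same held-out ground truth and counts how often the candidate wins across benchmark datasets. The recipes, data splits, and win-rate metric are illustrative assumptions, not Cradle’s internal tooling.

```python
# Illustrative sketch of in silico validation: compare a baseline fine-tuning
# recipe against a candidate one on the same held-out ground truth.
# Recipes, datasets, and the metric are placeholder assumptions.
from scipy.stats import spearmanr

def evaluate(train_fn, train_set, heldout_set):
    """Fine-tune a predictor with `train_fn`, then score it on held-out data."""
    model = train_fn(train_set)                 # returns a callable: sequence -> predicted value
    y_pred = [model(seq) for seq, _ in heldout_set]
    y_true = [measured for _, measured in heldout_set]
    rho, _ = spearmanr(y_true, y_pred)          # rank correlation with assay values
    return rho

def compare_recipes(baseline_fn, candidate_fn, benchmark_datasets):
    """Count how often the candidate recipe beats the baseline across datasets."""
    wins = sum(
        evaluate(candidate_fn, train, heldout) > evaluate(baseline_fn, train, heldout)
        for train, heldout in benchmark_datasets
    )
    return wins / len(benchmark_datasets)       # fraction of datasets the candidate wins
```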
This rigorous benchmarking process is the biological equivalent of A/B testing on retail websites—except instead of tweaking a “Buy” button in hopes of boosting clicks and sales, we’re testing new algorithms, aiming to generate better-performing proteins. It’s how we rapidly translate cutting-edge ML research into real-world improvements for every customer.
Take Direct Preference Optimization (DPO), for example, a method originally developed for fine-tuning LLMs. We didn’t develop DPO, but we benchmarked it extensively to evaluate its potential for improving Cradle’s generator models. This meant testing DPO-trained models across multiple tasks, including those in the Align To Innovate protein engineering tournament (which ran in 2023), in addition to extensive wet-lab validation. That competition featured 28 teams from industry and academia, tasked with predicting experimental outcomes across 19 datasets and four distinct protein engineering tasks.
Cradle’s models demonstrated strong, consistent predictive performance. Using Cradle in auto mode, with no human tuning and no custom engineering, we tied or beat the first-place results across all four tasks. This “A/B test” gave us the evidence needed to integrate DPO into the platform.
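For readers curious about the method itself, here is a minimal sketch of the DPO objective from the original paper (Rafailov et al., 2023), applied to per-sequence log-likelihoods. It illustrates the general idea, not Cradle’s implementation; the notion of “preferred” versus “rejected” sequences (for instance, stronger versus weaker binders in assay data) is an assumption for the example.

```python
# Minimal sketch of the DPO loss (Rafailov et al., 2023) on sequence
# log-likelihoods. Illustrative only; not Cradle's implementation.
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_preferred, policy_logp_rejected,
             ref_logp_preferred, ref_logp_rejected, beta=0.1):
    """DPO pushes the fine-tuned (policy) model to assign relatively higher
    likelihood to preferred sequences than a frozen reference model does,
    and relatively lower likelihood to rejected ones."""
    policy_margin = policy_logp_preferred - policy_logp_rejected
    ref_margin = ref_logp_preferred - ref_logp_rejected
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Toy usage: per-sequence log-likelihoods for a batch of two preference pairs.
loss = dpo_loss(
    policy_logp_preferred=torch.tensor([-10.2, -9.8]),
    policy_logp_rejected=torch.tensor([-11.5, -10.9]),
    ref_logp_preferred=torch.tensor([-10.5, -10.1]),
    ref_logp_rejected=torch.tensor([-11.0, -10.7]),
)
```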
The best of both worlds for our customers
All Cradle customers benefit from our rigorous benchmarking—whether they realize it or not. Every validated improvement feeds directly into the platform, so Cradle’s models and algorithms get more capable over time, delivering better and more consistent outcomes across a broad range of protein engineering objectives.
And for customers with specific goals, Cradle supports deeper, highly targeted optimization. Within their secure workspaces, customers can fine-tune Cradle’s base models using their own data, with complete control and privacy. It’s a best-of-both-worlds approach: global improvements for everyone, private customization where it counts.
Accelerating progress through rigorous benchmarking
Rigorous, lab-based benchmarking underpins Cradle’s ability to deliver consistent, high-quality results across a wide range of protein engineering challenges.
By continuously validating our models on diverse datasets and real-world use cases, we ensure that improvements translate into tangible benefits for every customer. Working across sectors gives us insights into varied design problems—insights that feed directly into the benchmarking process and strengthen our models over time.
The result is a platform that improves with every iteration—grounded in data, tested in the lab, and built to help customers design better proteins faster, with greater confidence.