Backbone of high-throughput experimentation: Design principles for ML-ready sample tracking

Sample tracking seems simple until it breaks—we redesigned our system around five principles to make it easier to identify samples across experiments.

Oct 17 · Harmen, Daan

Sample tracking is the hidden foundation of high-throughput experimentation. It’s a critical component that often goes unnoticed until something goes wrong.

In this article, we break down our approach to sample tracking step by step, including the critical integration with our LIMS.

This exact solution is highly specific to our own lab workflow. Every lab is different, but we think the design considerations and technical structure behind it can still be useful for other labs to compare against and learn from.

When sample tracking fails, everything fails

At its core, sample tracking is the process of monitoring and recording the location, status, and history of each sample as it moves through various stages of an experimental workflow.

This seemingly simple concept becomes increasingly complex as the scale and speed of experiments grow, particularly in high-throughput environments and labs where many variants might be tested simultaneously.

The importance of robust sample tracking cannot be overstated. It ensures data integrity by linking each piece of data to its corresponding sample, prevents mix-ups that could invalidate results, and provides a clear audit trail for regulatory compliance. Effective sample tracking is the backbone for automating and scaling high-throughput experimentation.

Our Laboratory Information Management System (LIMS) is key: it integrates sample tracking with data management, providing a centralized platform for managing the complex web of information generated in modern biological research. At Cradle, we have selected Benchling as our LIMS platform.

The thinking behind our design principles

When we encountered issues with our original sample tracking schema, we took a step back to think about what design principles should guide our approach. This is a common scenario for labs: you start with something that sort of works, but then you run into issues and need to redesign based on everything you've learned since.

One of the key issues we faced was sample identification. We had entities that linked everything together (DNA, protein, and host strain), but if someone said "this is strain X," we couldn't tell whether they meant a particular sample in one plate or a sample from another experiment, because the IDs would be identical. This made it hard to uniquely identify samples.

From these challenges, we developed several core design principles:

Flexibility Above All: Our most important principle is that you should be able to add experimental steps to your workflow even after it has already been run many times, without breaking the system. Workflows constantly evolve, and the schema needs to accommodate new assays or protocol changes seamlessly. Production methods are a good example: we wanted a schema that works equally well whether you're using E. coli to produce proteins, cell-free protein expression systems, or other methods in the future. The system shouldn't be tied to one specific production approach.

Abstract What Makes Samples Unique: We realized we needed to separate out the core components that make each sample unique: the DNA that encodes the protein, the protein itself, and the production machinery (whether that's a specific E. coli strain like SHuffle or BL21, or a cell-free protein synthesis kit). By abstracting these elements, our system works across different production methods.

Clear Sample Context and Labeling: Each sample should get a clear label indicating what type it is - expression sample, culture sample, purified protein, etc. Anyone should be able to take a sample ID and relatively easily figure out what protein it contains, what round or experiment it belongs to, and what the goal was.

Parent-Child Traceability: We implemented a universal rule: whenever you move a sample from one container to another, you create a child sample. This creates a tree structure where you can always walk back up to figure out where any sample came from, eliminating ambiguity about which plate a sample originated from.
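As a toy illustration of this last principle, here is a minimal Python sketch of the transfer rule. The Sample class and field names below are made up for the example; they are not our actual Benchling schema.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Iterator, Optional

@dataclass
class Sample:
    """Illustrative stand-in for a tracked sample."""
    sample_id: str
    container: str                      # e.g. plate barcode + well position
    parent: Optional["Sample"] = None   # set for every child sample

def transfer(sample: Sample, new_container: str, new_id: str) -> Sample:
    """Moving material to a new container always creates a child sample."""
    return Sample(sample_id=new_id, container=new_container, parent=sample)

def lineage(sample: Sample) -> Iterator[Sample]:
    """Walk back up the tree to see where a sample came from."""
    node: Optional[Sample] = sample
    while node is not None:
        yield node
        node = node.parent

root = Sample("ht_sample123", "seq_plate_1/A1")
child = transfer(root, "culture_plate_1/A1", "ht_sample223")
print([s.sample_id for s in lineage(child)])  # ['ht_sample223', 'ht_sample123']
```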

These principles guided our thinking about how to build a system that would be robust, flexible, and broadly applicable to similar lab processes beyond just our specific workflow.

How we built flexible sample tracking

We've implemented a sample tracking system built on the Benchling LIMS platform. Our approach uses a nested sample structure that follows the flow of our protein engineering process, encompassing six main stages: DNA Assembly, Sequencing, Expression, Protein, Normalized Protein, and Measurements. This system balances thorough tracking with operational adaptability, supporting our high-throughput protein engineering work.

Short context: Our lab workflow

Our wet lab’s role is to validate and test hypotheses from our Machine Learning team. To start the process off, we get designs from our ML team and order DNA from Twist. We then transform the DNA into E. coli, pick single colonies, and perform a sequencing QC step. We then grow E. coli and lyse the cells, followed by protein purification.

Finally, we test the proteins for different assays based on the ML hypothesis - this could be expression, binding affinity, thermostability, or enzyme activity depending on what we care about improving in the algorithm.

Starting with protein designs

The process begins with ht_prots (ht stands for high-throughput), entities that contain the protein sequences designed by our ML workflow. From these protein sequences, we generate corresponding ht_part entities that hold the reverse-translated DNA sequence.
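"Reverse translation" here just means mapping each amino acid back to a codon. A toy version in Python, using one fixed codon per amino acid (a real implementation would optimize codon usage for the expression host):

```python
# Toy codon table: one arbitrary codon per amino acid, plus a stop codon.
CODON = {
    "A": "GCG", "C": "TGC", "D": "GAT", "E": "GAA", "F": "TTT",
    "G": "GGC", "H": "CAT", "I": "ATT", "K": "AAA", "L": "CTG",
    "M": "ATG", "N": "AAC", "P": "CCG", "Q": "CAG", "R": "CGT",
    "S": "AGC", "T": "ACC", "V": "GTG", "W": "TGG", "Y": "TAT",
    "*": "TAA",
}

def reverse_translate(protein_sequence: str) -> str:
    """Derive an ht_part-style DNA sequence from an ht_prot sequence."""
    return "".join(CODON[aa] for aa in protein_sequence.upper())

print(reverse_translate("MKV"))  # ATGAAAGTG
```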

Assembling and sequencing constructs

After synthesis at Twist Bioscience, these ht_parts enter our DNA assembly workflow. When we express in E. coli, we assemble the constructs; when we use in vitro cell-free protein expression, we skip directly to the following steps.

Following assembly with our protein expression backbone (bb###), the constructs are sequenced on the Nanopore platform. An in silico (Flyte) workflow takes the raw sequencing data and generates several Benchling entities, starting with an ht_vect entity that stores the DNA sequence along with QC-related fields such as whether it passed QC, the QC method, and the reason for failure. The ht_vect entity also links to the protein (ht_prot) that is produced upon expression.

Creating biological objects

This workflow also generates ht_blob entities, where "blob" stands for "biological little object" - our entity type for anything that generates a protein from DNA, such as a strain or an in vitro transcription/translation reaction. Besides storing details on the host or expression machinery, the ht_blob entity has a calculated field, resolved_protein, which takes the ht_prot from the ht_vect that the ht_blob links to.

Sample creation and quality control

In the final step of the in silico workflow, Benchling plates are created whose wells map to ht_sample entities. Each ht_sample links to an ht_blob and has calculated fields for the vector (resolved_vector) and protein (resolved_protein), meaning it takes this information from the ht_blob it links to.
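To make the linking structure concrete, here is a condensed sketch of these entities as plain Python dataclasses. The field names mirror the description above, but this is an illustration of the relationships only, not Benchling's actual data model (schemas there are configured in the platform, not written as Python classes).

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional

@dataclass
class HtProt:
    entity_id: str               # e.g. "ht_prot456"
    sequence: str                # protein sequence from the ML workflow

@dataclass
class HtVect:
    entity_id: str               # e.g. "ht_vect122"
    dna_sequence: str
    qc_passed: bool
    qc_method: str
    qc_failure_reason: Optional[str]
    protein: HtProt              # the protein produced upon expression

@dataclass
class HtBlob:
    entity_id: str               # e.g. "ht_blob456"
    host: str                    # strain (e.g. BL21) or cell-free kit
    vector: HtVect

    @property
    def resolved_protein(self) -> HtProt:
        # Mirrors the calculated field: follow the vector link.
        return self.vector.protein

@dataclass
class HtSample:
    entity_id: str               # e.g. "ht_sample123"
    sample_type: str             # sequencing, culture, protein, ...
    qc_status: Optional[str] = None
    blob: Optional[HtBlob] = None            # set on the root sample
    parent_sample: Optional[HtSample] = None
```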

Hierarchical samples unlock traceability

ht_sample entities also record quality control data, including QC status, values, and messages, ensuring that only samples meeting quality standards move forward in the process. Importantly, ht_sample entities have a parent_sample field. If this field is populated, the resolved_blob, resolved_vector, and resolved_protein fields link to the entities derived from the parent sample. As a rule of thumb, we generate new samples whenever we transfer samples to a new plate and/or run some form of (QC) analysis on them.

One of the key enabling features we discovered in Benchling is the ability to have calculated fields as part of an entity. This is what makes the whole resolved protein and resolved vector system work: the very first parent sample holds the actual ht_blob, and for a grandchild sample the system simply looks at the parent or grandparent sample and finds the information corresponding to that ht_blob. So we don't have to link the ht_blob to every child sample manually; it automatically looks at the parent, or the parent's parent, and takes the information from there.
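Continuing the illustrative dataclass sketch from above, the resolution logic amounts to a walk up the parent chain until a sample that links directly to an ht_blob is found:

```python
def resolve_blob(sample: HtSample) -> HtBlob:
    """Mimic the calculated field: find the nearest ancestor (or the
    sample itself) that links directly to an ht_blob."""
    node: Optional[HtSample] = sample
    while node is not None:
        if node.blob is not None:
            return node.blob
        node = node.parent_sample
    raise ValueError(f"{sample.entity_id} has no ancestor with an ht_blob")

def resolve_vector(sample: HtSample) -> HtVect:
    return resolve_blob(sample).vector

def resolve_protein(sample: HtSample) -> HtProt:
    # Information lives in one place (the root's ht_blob) and is
    # derived everywhere else.
    return resolve_vector(sample).protein
```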

This follows the "don't repeat yourself" principle - information only needs to be maintained in one place. This prevents the ambiguity that would arise if someone updated information for one sample but forgot to update it for a child sample - at that point, we wouldn't know what's true anymore. We had to reach out to Benchling support to implement this feature, but it has been a real enabler for our workflow.

Building complete sample histories

These samples link to their parent samples from the previous stage, forming a nested structure that allows complete sample histories to be traced. As samples progress, they receive evolving "sample type" tags corresponding to their current stage: sequencing, culture, protein, normalized protein, or assay. This tagging system, combined with the nested structure, maintains traceability while allowing workflow modifications such as branching or the removal or addition of steps.

Moving from sequencing to expression

After the sequencing in silico workflow has finished, we continue with the samples that have passed QC by preparing seed culture plates. These plates have new ht_samples, where each sample links to the ht_sample from the sequencing plate via the parent_sample link. This allows for multiple ht_samples to share the same parent sample, facilitating experimental replication or parallel processing. This feature adds flexibility to our workflow and enables us to adapt to various experimental designs.
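In terms of the sketch above, preparing a seed culture plate amounts to creating new child samples, several of which may point at the same parent, while carrying forward only QC-passed samples. The helper below is hypothetical, not part of our actual system:

```python
def make_seed_culture_samples(
    sequencing_samples: list[HtSample],
    replicates: int = 2,
) -> list[HtSample]:
    """Create culture-stage children for every QC-passed sequencing sample."""
    children: list[HtSample] = []
    for parent in sequencing_samples:
        if parent.qc_status != "pass":
            continue  # only samples meeting quality standards move forward
        for i in range(replicates):
            children.append(HtSample(
                entity_id=f"{parent.entity_id}-c{i + 1}",  # illustrative IDs
                sample_type="culture",
                parent_sample=parent,  # replicates share one parent
            ))
    return children
```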

Normalizing and expressing our proteins

As we move to protein production, our protein QC workflow guides samples through expression and normalization. The system records protein concentration at the Protein stage and normalized protein concentration at the Normalized Protein stage, automatically integrating this data with the sample information. This automatic tracking of key parameters aids in maintaining data integrity throughout the process.
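The normalization step itself is a plain C1·V1 = C2·V2 dilution calculation. A small worked example, with made-up concentrations and volumes:

```python
def dilution_volumes(
    stock_conc: float,    # concentration measured at the Protein stage
    target_conc: float,   # desired normalized concentration
    final_volume: float,  # final volume per well
) -> tuple[float, float]:
    """Return (stock volume, buffer volume) such that C1*V1 = C2*V2."""
    if target_conc > stock_conc:
        raise ValueError("cannot reach a higher concentration by diluting")
    stock_volume = target_conc * final_volume / stock_conc
    return stock_volume, final_volume - stock_volume

# Normalize a 2.5 mg/mL sample to 1.0 mg/mL in a 100 uL well:
print(dilution_volumes(2.5, 1.0, 100.0))  # (40.0, 60.0)
```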

Adapting assays to different needs

The final Assay stage accommodates different assay workflows. Here, the system demonstrates its adaptability by allowing the plate type to be adjusted for each assay, accommodating a wide range of experimental requirements. Assay values are added using Benchling’s results tables, which associate assay values directly with key entities such as protein IDs, sample IDs, and plates. Attaching results to these entities greatly enhances traceability: if you have a specific protein ID, such as your control protein (e.g., wild type), you can immediately access all assay results linked to that protein over time, simplifying data retrieval and monitoring.
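As a rough analogy for how such a results table behaves, consider the snippet below. The rows and field names are invented for the example; Benchling stores these as structured results, not Python dicts:

```python
# Invented example rows: each assay value is attached to a protein,
# a sample, and a plate, so it can be retrieved from any of them.
results = [
    {"protein_id": "ht_prot456", "sample_id": "ht_sample323",
     "plate": "assay_plate_7", "assay": "thermostability_tm_c", "value": 62.5},
    {"protein_id": "ht_prot456", "sample_id": "ht_sample401",
     "plate": "assay_plate_9", "assay": "thermostability_tm_c", "value": 61.8},
]

def results_for_protein(protein_id: str) -> list[dict]:
    """All assay results ever linked to one protein, e.g. the wild type."""
    return [row for row in results if row["protein_id"] == protein_id]

print(results_for_protein("ht_prot456"))  # both rows above
```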

Following the journey of a sample

Moreover, this meticulous tracking of sample IDs is crucial for our machine learning models, especially in multi-property improvement projects. By maintaining precise records of sample IDs, we ensure that our ML models can accurately correlate experimental outcomes with specific samples and their processing history. For example, assays are often done at different stages of the workflow: expression might be measured on the culture sample, while thermostability is measured on the purified protein sample. The first common parent that these samples share is the sample ID on which you can correlate the experimental data. This level of detail is essential for training robust models that can effectively navigate the complex landscape of protein engineering.
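Finding that first common parent is a standard nearest-common-ancestor walk over the sample tree. Continuing the illustrative sketch from earlier:

```python
def first_common_parent(a: HtSample, b: HtSample) -> Optional[HtSample]:
    """Nearest shared ancestor of two samples: the sample on which assay
    data from different workflow stages can be correlated."""
    seen = set()
    node: Optional[HtSample] = a
    while node is not None:           # collect all ancestors of a
        seen.add(node.entity_id)
        node = node.parent_sample
    node = b
    while node is not None:           # first ancestor of b seen from a
        if node.entity_id in seen:
            return node
        node = node.parent_sample
    return None
```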

Each stage in our process utilizes a plate layout, typically in 96-well format, facilitating high-throughput processing. While we use 96-well plates because all lab equipment is standardized for this format, the beauty of our sample tracking approach is that it's completely agnostic to plate format - it could work equally well with a 1-well plate or a 101-well plate. This standardized approach allows us to efficiently manage large numbers of samples while maintaining detailed tracking of each individual sample's journey through our workflow.

Researchers trace any sample's history

For example, consider the life of a specific sample: a DNA part (ht_part123) is first combined with a backbone (bb42) to create a construct. Once sequenced, it becomes a vector (ht_vect122) with an associated blob (ht_blob456) and protein (ht_prot456). The encompassing sample (ht_sample123) and its child samples (e.g., ht_sample223, ht_sample224) store all relevant metadata: sequencing quality, growth conditions, protein concentration, and finally, assay performance metrics. A researcher can trace this complete history through our nested sample structure, allowing for comprehensive analysis of how each experimental condition influenced the final results.

This balanced approach between structure and flexibility to track our samples and manage our results forms the backbone of Cradle's high-throughput protein engineering efforts. It allows us to incorporate new assays or modify protocols while preserving a consistent structure for data tracking, supporting our ongoing research and development in protein engineering.


Building tomorrow's laboratory

As we continue to refine our approach to sample tracking and data handling, we have several areas for future improvement: we’ll start using the UniteLabs Data Warehouse to improve variant lookup and inventory management, we’re considering tighter integration between laboratory equipment and our LIMS for automated data transfer, and we’re working on support for multi-chain and other more complex proteins.

We're sharing this approach because we believe these foundational systems are crucial for AI-accelerated biotechnology. The design principles and nested sample structure we've developed could be adapted by other labs facing similar challenges in high-throughput experimentation and data tracking.

Share your experiences, questions, and ideas - whether you're just beginning to explore laboratory automation or you're optimizing existing high-throughput workflows. The future of AI-accelerated biotechnology depends on these foundational systems working seamlessly together.
