The Emerging AI Bio Playbook
Why the availability of data is both a limit and an opportunity
Somewhere between the views of AI as savior or scam, the truth is that it is already helping solve some interesting problems today. Those problems tend to share a few properties:
Measurable predictions: Quantifiable goals provide feedback for improvement.
Complex patterns: The problem is too complex for human intuition or heuristics.
Large search spaces: Inputs and outputs are too large to exhaustively search.
High data availability: Enough data to find hidden patterns and relationships.
There are a lot of biology problems that look like this. Biology is the universe’s most complex system, following a set of rules built up atom by atom, yet so inscrutably elaborate that our current understanding barely scratches the surface. It is the perfect setup for AI.
Take protein structure prediction, the problem of translating a protein sequence into 3D coordinates for each atom. Proteins share plenty of common patterns, motifs like alpha helices and beta sheets, yet they interact in complex and unpredictable ways that sometimes make little intuitive sense. There’s enough similarity to find patterns, but not enough to make rules. The 2024 Nobel Prize in Chemistry was awarded for successfully using AI to predict these protein structures.
Or liquid biopsies, the problem of detecting the presence of cancer from signals in blood. The diversity of data in our bloodstreams is almost infinitely complex and unique to each individual. Yet today you can walk into a doctor’s office and order tests that use patterns in those biomolecules to detect cancer with surprising precision and accuracy.
Or designing protein binders. Or identifying patient drug responses. Or scaling up biomanufacturing. The examples in biology are endless, and each bears the hallmarks of a good AI problem. So it’s no surprise that AI has now cured all diseases and... oh wait. Actually, most of these problems are missing something: good data. So much of biology looks like a problem waiting for AI to solve, but limited data availability means we’ve only just begun to solve it.
Good Data, Bad Data
Protein structure prediction is an excellent case study. AI models used the Protein Data Bank, a publicly available dataset of more than 200K protein structures, as well as more than two billion protein sequences that provided evolutionary context. The data was well structured, providing clear mappings from sequence to position, and covered a wide diversity of the patterns seen in most proteins.
Contrast this with binding affinity, the problem of predicting how well a molecule binds to a protein. There are tens of millions of publicly available datapoints associating molecules with proteins, yet no major AI breakthroughs. The available data is homogeneous and poorly distributed across the global space of possibilities, and it is often reduced to a single number, stripping out relevant context about the assay, protein, or molecule. As a result, AI continues to underperform rules-based physics approaches in real-world settings.
These examples highlight what it means to have good data in biology. The ideal dataset is:
Diverse: AI’s strength is interpolation, and the more diverse the data is, the wider the range of properties a model can confidently predict. Biology changes dramatically across species, genotypes, tissues, and environments, making it critical to cover edge cases and vary parameters as much as possible.
Large: The more complex the data is, the larger the dataset must be to sufficiently cover that diversity. Some highly constrained problems can be solved surprisingly well with minimal data, but almost any problem benefits from more of it.
Contextual: The long arc of AI has been to remove preconceived biases and provide richer data to models. High-dimensional modalities like images, 3D coordinates, and time series capture that context, and the metadata they come with help avoid overfitting to noise. Biology itself is inherently multimodal, spanning from genomics to proteins to diseases, and data should reflect that.
Relevant: The most relevant data in biology comes from humans, which can be expensive or impossible to collect. In those cases, we rely on proxies: biological models that we hope correlate with human outcomes like disease progression. The further the data is from human biology, the greater the risk it fails to translate clinically.
Getting Good Data
In a few cases, we are fortunate to have great publicly available datasets. The Protein Data Bank made protein structure prediction models possible. Similarly, the UK Biobank has helped identify the genetic origins of various diseases, and the Cancer Genome Atlas has been a source of models for cancer detection and stratification.
In most cases, we aren’t so lucky. Sometimes, the data is fragmented and locked away by healthcare and pharma companies, who view the data as their competitive advantage and are increasingly using it for internal AI projects. Efforts to share that data through federated learning and partnerships are promising, but still early.
Otherwise, it’s likely that the data doesn’t exist at all. Enter a new crop of AI biotech companies tackling this problem by generating the data themselves. They are betting that good data is the last barrier to accurate models, and that they can generate it. They are building automated wet labs, developing high-throughput synthesis and assays, and annotating data at scale. This approach also gives them a flywheel, a continuous feedback loop in which models improve the data, which in turn improves the models. This “lab in the loop” creates durable competitive moats: ever-improving data generation capabilities and proprietary datasets. The AI Biology revolution is upon us.
Ok, not quite. But the opportunity is here, for those who can create data themselves.
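The flywheel above can be sketched as a simple active-learning loop. This is a minimal illustration, not any company’s production system: `train_model`, `run_assay`, and `propose_candidates` are hypothetical stand-ins for a real model, a real wet-lab assay, and a real acquisition strategy.

```python
import random

def train_model(dataset):
    # Placeholder model: a real version would fit a regressor on
    # (candidate, measurement) pairs. Here we just memorize the mean.
    mean = sum(y for _, y in dataset) / len(dataset)
    return lambda candidate: mean

def run_assay(candidate):
    # Placeholder for a wet-lab measurement: a noisy linear signal.
    return candidate * 0.5 + random.gauss(0, 0.1)

def propose_candidates(model, pool, batch_size):
    # A real acquisition function would pick candidates the model is
    # least certain about; with a point-estimate model we sample randomly.
    return random.sample(pool, batch_size)

def lab_in_the_loop(pool, seed_data, rounds=3, batch_size=4):
    dataset = list(seed_data)
    for _ in range(rounds):
        model = train_model(dataset)                          # 1. train on all data so far
        batch = propose_candidates(model, pool, batch_size)   # 2. pick the next experiments
        results = [(c, run_assay(c)) for c in batch]          # 3. run the assays
        dataset.extend(results)                               # 4. fold results back in
    return dataset, train_model(dataset)
```

Each pass through the loop grows the proprietary dataset and retrains the model, which is exactly the feedback cycle that compounds into a moat.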
The Risks Of Data Generation
Generating data sounds easier than it is, so let’s take a look at risks to avoid.
Biological Risk: Picking The Wrong Problem
Biotech startups are often spared market risk - the question is usually whether a solution works, not whether it is useful. But they do need to be intentional about what biological risk they are taking.
One form is translation risk: that the solution will work in a petri dish or monkey, but fail in a human. Developing a treatment for sickle cell anemia, where the mutational cause is well understood and validated, is less risky than a treatment for Alzheimer’s Disease, where the biological models are far less reliable. When taking translation risk, it’s possible for AI models to perfectly solve a given problem, only to fail because the inputs or outputs they trained on were not representative of human physiology.
Another form is unknown biology risk: that solving the given problem is not sufficient for making an impact. For example, when building an AI model to optimize a new drug modality, it may be necessary to design the assay for it, work out new synthesis techniques, and mitigate some new form of toxicity. Making things even more difficult, these questions often arise in sequence only after solving a previous problem. The more disparate or poorly understood the requirements are, the harder it is for AI to tackle them alone.
For a startup already taking risk elsewhere, it is often wise to minimize biological risk by working in well-understood biology where possible.
Quality Risk: Homogenous Data
It’s an old story. A policeman sees a drunk man searching for something under a streetlight and asks what he has lost. The man says he lost his keys, and they both look under the streetlight together. After a few minutes the policeman asks if he is sure he lost them here, and the man replies no, he lost them in the park. The policeman asks why he is searching here, and the man replies, “this is where the light is”.
Seems like an obvious mistake, but this is one reason Pharma hasn’t already solved these problems despite what seems like a goldmine of data. Internal biotech data generally comes from drug programs, which means the data is clustered around key molecules that are being optimized. Trained models work well on those clusters, but trying to generalize - to a new molecular scaffold, a new protein, or a new disease - pitches them back into the dark.
To take best advantage of AI, it’s important to intentionally generate diverse data in unexplored regions. These regions may seem unintuitive from the perspective of traditional approaches, but that is precisely why they must be deliberately explored - filling in the blank spots on the map makes it easier to see where the opportunities are.
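One common way to deliberately cover unexplored regions is greedy max-min selection: repeatedly pick the candidate farthest from everything already measured. A toy sketch, assuming candidates are represented as numeric feature vectors (in practice these might be molecular fingerprints or learned embeddings):

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def maxmin_select(candidates, existing, k, dist=euclidean):
    """Greedily pick k candidates, each as far as possible from
    everything already measured (existing) and already picked.
    Assumes existing is non-empty and k <= len(candidates)."""
    picked = []
    covered = list(existing)
    for _ in range(k):
        # Score each remaining candidate by its distance to the
        # NEAREST covered point, then take the one farthest away.
        best = max(
            (c for c in candidates if c not in picked),
            key=lambda c: min(dist(c, p) for p in covered),
        )
        picked.append(best)
        covered.append(best)
    return picked
```

With existing data clustered near the origin, the selector reaches for the far corners of the space first - the opposite of the drunkard’s streetlight.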
Execution Risk: Data That Doesn’t Scale
Scaling laws are among the most compelling properties of neural networks: increases in data and parameter count lead to consistently better performance. But that growth requires access to large amounts of data, and not all biological data can be generated at the scale needed. It’s no accident that many AI Bio companies started with an insight about high-throughput data collection and only then found a good use case for that data.
Some categories of data lend themselves well to scale, like barcoded screens and combinatorial chemistry. Others are more difficult and require innovation. Any biotech looking to generate data must plan for how to scale: through lab automation, multiplexed testing, or sometimes just by spending more than anyone else is willing to spend.
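For intuition, scaling behavior is often summarized as a power law, loss ≈ a·n⁻ᵇ for dataset size n, which becomes a straight line in log-log space. A toy fit on synthetic numbers (not real biology data) shows how the exponent is recovered:

```python
import math

def fit_power_law(ns, losses):
    """Fit loss ~ a * n**(-b) by ordinary least squares in log-log space."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(l) for l in losses]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / \
            sum((x - x_mean) ** 2 for x in xs)
    intercept = y_mean - slope * x_mean
    return math.exp(intercept), -slope  # (a, b) with loss = a * n**(-b)

# Synthetic example: a dataset whose loss follows 2.0 * n**(-0.3).
ns = [1e3, 1e4, 1e5, 1e6]
losses = [2.0 * n ** -0.3 for n in ns]
a, b = fit_power_law(ns, losses)
```

The exponent b is what a data-generation plan has to budget against: a small b means each additional order of magnitude of data buys only a modest improvement.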
Finding Opportunities
Talking with companies already applying these tactics makes it clear: a lack of data is not an insurmountable hurdle for applying AI to biology. But data and AI alone won’t solve every biology problem, either. The key is finding the right problems paired with the right data. This suggests an approach to identifying new opportunities:
A relevant problem that is measurable, complex, and has a large search space.
Well understood mechanisms and data that minimize biological risk.
The ability to generate or acquire diverse, high context data at scale.
Relevant Problem + Biological Validation + Data Generation = AI Bio Opportunity.
Startups are already collecting patient tumor samples and single cell transcriptomics to predict patient responses. They are building fully automated biomanufacturing plants to optimize protein synthesis. They are leveraging combinatorial chemistry to predict drug potency.
And there are hundreds more problems waiting for someone to discover and scale up a new data generation approach. I am confident that over the coming decade, we will see more biotechs shift away from public and partnership data, decide to generate their own data, and build their AI Bio flywheel. I’m excited to see what they’ll accomplish.
As always, if you’re exploring these ideas, I’d love to hear what you’re building or learning. Reach out anytime.