Models and Molecules

The Emerging AI Bio Playbook

Zach Kagin — Tue, 18 Nov 2025 17:40:37 GMT

Somewhere between the views of AI as savior or scam, the truth is that it is already helping solve some interesting problems today. Those problems tend to share a few properties:

Measurable predictions: Quantifiable goals provide feedback for improvement.
Complex patterns: The problem is too complex for human intuition or heuristics.
Large search spaces: Inputs and outputs are too large to exhaustively search.
High data availability: Enough data to find hidden patterns and relationships.

There are a lot of biology problems that look like this. Biology is the universe’s most complex system, following a set of rules built up atom by atom, yet so inscrutably elaborate that our current understanding barely scratches the surface. It is the perfect setup for AI.

Take protein structure prediction, the problem of translating a protein sequence into 3D coordinates for each atom. Proteins share plenty of common patterns, motifs like alpha helices and beta sheets, yet they interact in complex and unpredictable ways that sometimes make little intuitive sense. There’s enough similarity to find patterns, but not enough to make rules. The 2024 Nobel Prize in Chemistry was awarded for successfully using AI to predict these protein structures.

Or liquid biopsies, the problem of detecting the presence of cancer from signals in blood. The diversity of data in our bloodstreams is almost infinitely complex and unique to each individual. Yet today you can walk into a doctor’s office and order tests that use patterns in those biomolecules to detect cancer with surprising precision and accuracy.

Or designing protein binders. Or identifying patient drug responses. Or scaling up biomanufacturing. The examples in biology are endless, and each bears the hallmarks of a good AI problem. So it’s no surprise that AI has now cured all diseases and Oh Wait Actually most of these problems are missing something: good data. So much of biology looks like a problem waiting for AI to solve, but limited data availability means we’ve only just scratched the surface.

Good Data, Bad Data

Protein structure prediction is an excellent case study. AI models used the Protein Data Bank, a publicly available dataset of more than 200K protein structures, as well as more than two billion genome sequences that provided evolutionary context. The data was well structured, providing clear mappings from sequence to position, and covered a wide diversity of patterns seen in most proteins.

Contrast this with binding affinity, the problem of predicting how well a molecule binds to a protein. There are tens of millions of publicly available datapoints associating molecules with proteins, yet no major AI breakthroughs. The available data is homogeneous and poorly distributed across the global space of possibilities, and it is often reduced to a single number, stripping away relevant context about the assay, protein, or molecule. As a result, AI continues to underperform rules-based physics approaches in real-world contexts.

These examples highlight what it means to have good data in biology. The ideal dataset is:

Diverse: AI’s strength is interpolation, and the more diverse the data is, the wider the range of properties a model can confidently predict. Biology changes dramatically across species, genotypes, tissues, and environments, making it critical to cover edge cases and vary parameters as much as possible.
Large: The more complex the data is, the larger the dataset must be to sufficiently cover that diversity. Some highly constrained problems can be solved surprisingly well with minimal data, but almost any problem benefits from more of it.
Contextual: The long arc of AI has been to remove preconceived biases and provide richer data to models. High-dimensional modalities like images, 3D coordinates, and time series capture that context, and the metadata they come with helps avoid overfitting to noise. Biology itself is inherently multimodal, spanning from genomics to proteins to diseases, and data should reflect that.
Relevant: The most relevant data in biology is from humans, which can be expensive or impossible to collect. In those cases, we rely on proxies, biological models that hopefully correlate with human outcomes like disease progression. The further a proxy is from human biology, the greater the risk that results fail to translate clinically.

Getting Good Data

In a few cases, we are fortunate to have great publicly available datasets. The Protein Data Bank made protein structure prediction models possible. Similarly, the UK Biobank has helped identify the genetic origins of various diseases, and the Cancer Genome Atlas has been a source of models for cancer detection and stratification.

In most cases, we aren’t so lucky. Sometimes, the data is fragmented and locked away by healthcare and pharma companies, who view the data as their competitive advantage and are increasingly using it for internal AI projects. Efforts to share that data through federated learning and partnerships are promising, but still early.

Otherwise, it’s likely that the data doesn’t exist at all. Enter a new crop of AI biotech companies, tackling this problem by generating the data themselves. They are making a bet that good data is the last barrier to accurate models, and that they can generate it. They are building automated wet labs, developing high throughput syntheses and assays, and annotating data at scale. This approach also gives them a flywheel, a continuous feedback loop where models improve data, which improves the models. This “lab in the loop” creates amazing competitive moats with improving data generation capabilities and proprietary datasets. The AI Biology revolution is upon us.

Ok, not quite. But the opportunity is here, for those who can create data themselves.

The Risks Of Data Generation

Generating data sounds easier than it is, so let’s take a look at risks to avoid.

Biological Risk: Picking The Wrong Problem

Biotech startups are often spared market risk - the question is usually whether a solution works, not whether it is useful. But they do need to be intentional about what biological risk they are taking.

One form is translation risk: that the solution will work in a petri dish or monkey, but fail in a human. Developing a treatment for sickle cell anemia, where the mutational cause is well understood and validated, is less risky than a treatment for Alzheimer’s Disease, where the biological models are far less reliable. When taking translation risk, it’s possible for AI models to perfectly solve a given problem, only to fail because the inputs or outputs they trained on were not representative of human physiology.

Another form is unknown biology risk: that solving the given problem is not sufficient for making an impact. For example, when building an AI model to optimize a new drug modality, it may be necessary to design the assay for it, work out new synthesis techniques, and mitigate some new form of toxicity. Making things even more difficult, these questions often arise in sequence only after solving a previous problem. The more disparate or poorly understood the requirements are, the harder it is for AI to tackle them alone.

For a startup already taking risk elsewhere, it can be helpful when possible to minimize risk by working in well understood biology.

Quality Risk: Homogeneous Data

It’s an old story. A policeman sees a drunk man searching for something under a streetlight and asks what the drunk has lost. He says he lost his keys and they both look under the streetlight together. After a few minutes the policeman asks if he is sure he lost them here, and the drunk replies, no, and that he lost them in the park. The policeman asks why he is searching here, and the drunk replies, “this is where the light is”.

Seems like an obvious mistake, but this is one reason Pharma hasn’t already solved these problems despite what seems like a goldmine of data. Internal biotech data generally comes from drug programs, which means the data is clustered around key molecules that are being optimized. Trained models work well on those clusters, but trying to generalize - to a new molecular scaffold, a new protein, or a new disease - pitches them back into the dark.
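
To make the streetlight problem concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: `true_activity` is a made-up stand-in for an assay readout, the single number `x` stands in for a molecule's position in chemical space, and the model is a plain nearest-neighbor lookup rather than anything a real program would use. Training data is clustered in one region, the way program data clusters around a lead series.

```python
import math
import random


def true_activity(x: float) -> float:
    """Hypothetical ground truth standing in for a real assay readout."""
    return math.sin(2.0 * x)


def nearest_neighbor_predict(train, x):
    """Predict the label of the closest training point (1-nearest-neighbor)."""
    _, nearest_y = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest_y


def mean_abs_error(train, xs):
    """Average absolute gap between prediction and ground truth."""
    errors = [abs(nearest_neighbor_predict(train, x) - true_activity(x)) for x in xs]
    return sum(errors) / len(errors)


rng = random.Random(0)

# Training data clustered in [0, 1], like data from a single drug program.
train_xs = [rng.uniform(0.0, 1.0) for _ in range(200)]
train = [(x, true_activity(x)) for x in train_xs]

# Test points under the streetlight, and test points out in the park.
in_cluster = [rng.uniform(0.0, 1.0) for _ in range(200)]
out_of_cluster = [rng.uniform(2.0, 3.0) for _ in range(200)]

in_error = mean_abs_error(train, in_cluster)
out_error = mean_abs_error(train, out_of_cluster)

print(f"in-cluster error:     {in_error:.3f}")
print(f"out-of-cluster error: {out_error:.3f}")
```

The model looks excellent when evaluated near its training data and falls apart away from it. A random train/test split of clustered data would only ever report the first number.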

To take best advantage of AI, it’s important to intentionally generate diverse data in unexplored regions. These regions may be unintuitive using traditional approaches, but that is precisely why they must be intentionally explored - by helping fill in blanks on the map it becomes easier to see where the opportunities are.
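
One simple way to think about "filling in blanks on the map" is farthest-point selection: repeatedly pick the candidate experiment that is furthest from everything already measured. The sketch below is an illustration under stated assumptions, not a description of how any company actually plans experiments. The two-dimensional coordinates are hypothetical descriptors, and `pick_diverse_experiments` is a name invented here.

```python
import math


def pick_diverse_experiments(candidates, already_measured, budget):
    """Greedily pick up to `budget` candidates, each time choosing the one
    whose nearest measured (or already picked) neighbor is furthest away."""
    covered = list(already_measured)
    remaining = list(candidates)
    picks = []
    for _ in range(min(budget, len(remaining))):
        best = max(
            remaining,
            # Distance to the closest covered point; infinite if nothing is covered yet.
            key=lambda c: min((math.dist(c, m) for m in covered), default=math.inf),
        )
        picks.append(best)
        covered.append(best)
        remaining.remove(best)
    return picks


# Existing data clustered in one corner of a hypothetical descriptor space.
measured = [(0.10, 0.10), (0.15, 0.10), (0.10, 0.20)]

# Candidate experiments on a coarse grid covering the whole space.
candidates = [(i / 4, j / 4) for i in range(5) for j in range(5)]

picks = pick_diverse_experiments(candidates, measured, budget=3)
print(picks)
```

The first pick lands in the corner opposite the existing cluster, and each later pick avoids both the old data and the earlier picks. A real campaign would balance this kind of exploration against cost, feasibility, and what the model is most uncertain about.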

Execution Risk: Data That Doesn’t Scale

Scaling laws are one of the most compelling properties of neural networks, where increases in data and parameter size lead to consistently better performance. Yet that growth requires access to large amounts of data, and not all biological data can be generated at the scale needed. It’s no accident that many AI Bio companies came from insights related to high throughput data collection first and then found a good use case for that data.
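
Scaling laws are often summarized as a power law, where error falls as a fixed power of dataset size. As a rough sketch of why that makes data scale so important, the Python below fits a power law on a log-log scale. The numbers are synthetic and chosen for illustration: the constant 5.0 and the exponent 0.3 are assumptions, not measurements from any biological model.

```python
import math


def fit_power_law(sizes, errors):
    """Least-squares fit of log(error) = log(a) - b * log(size). Returns (a, b)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic learning curve: error = 5.0 * size ** -0.3 (made-up constants).
sizes = [1_000, 10_000, 100_000, 1_000_000]
errors = [5.0 * n ** -0.3 for n in sizes]

a, b = fit_power_law(sizes, errors)

# Under a power law, halving the error requires multiplying the data by 2 ** (1 / b).
data_multiplier_to_halve_error = 2 ** (1 / b)

print(f"fitted exponent: {b:.2f}")
print(f"data needed to halve error: {data_multiplier_to_halve_error:.1f}x")
```

With an exponent in this range, each halving of error costs roughly ten times more data, which is why the ability to generate data cheaply can matter as much as the model itself.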

Some categories of data lend themselves well to scale, like barcoded screens and combinatorial chemistry. Others may be more difficult and require innovation. Any biotech looking to generate data must plan for how to scale: through lab automation, multiplex testing, or sometimes just by spending more than anyone else is willing to spend on it.

Finding Opportunities

Talking with companies already applying these tactics makes it clear: lack of data is not an impossible hurdle for applying AI to biology. But data and AI alone also won’t solve every biology problem. The key is finding the right problems together with the right data. This suggests an approach to identify new opportunities:

A relevant problem that is measurable, complex, and has a large search space.
Well understood mechanisms and data that minimize biological risk.
The ability to generate or acquire diverse, high context data at scale.

Relevant Problem + Biological Validation + Data Generation = AI Bio Opportunity.

Startups are already collecting patient tumor samples and single cell transcriptomics to predict patient responses. They are building fully automated biomanufacturing plants to optimize protein synthesis. They are leveraging combinatorial chemistry to predict drug potency.

And there are hundreds more problems waiting for someone to discover and scale up a new data generation approach. I am confident that over the coming decade, we will see more biotechs shift away from public and partnership data, decide to generate their own data, and build their AI Bio flywheel. I’m excited to see what they’ll accomplish.

As always, if you’re exploring these ideas, I’d love to hear what you’re building or learning. Reach out anytime.

Why Clinical Trials Fail

Zach Kagin — Tue, 21 Oct 2025 16:45:43 GMT

This post reviews a number of clinical trial anecdotes. If you’re less familiar with drug discovery and would like a refresher, check out this deep dive on drug discovery to get some additional context.

Everyone knew “bad” LDL cholesterol caused heart disease, and “good” HDL cholesterol prevented it. So when Eli Lilly developed Evacetrapib and saw HDL-C increase by 130% while decreasing LDL-C by 35%, the company thought they had a blockbuster. Much to their surprise, the 2016 trial showed no change in cardiovascular deaths at all. The conventional wisdom, backed up by decades of genetic and observational studies, was misleading. While these biomarkers were correlated with cardiovascular health, it seemed they were not causal. The drug was scrapped, Eli Lilly’s stock fell 8% in a day, and researchers were sent back to the drawing board looking for more nuanced biomarkers.

“I can see any failure as a chance. That result will teach you something else, something new.”

- Shinya Yamanaka, Nobel Laureate in Medicine

Numbers don’t lie (usually)

Clinical trial stats are grim. Trials can take more than a decade, cost upward of a billion dollars, and have a likelihood of success of around 10%. Every drug hunter has faced that moment of despair, knowing the odds are stacked against them and years of work may amount to nothing. They push through anyways, because on the other side, for the few drugs that succeed, lives are forever changed for the better. And the more we study those failures, the more we learn from them as we tackle the next drug.

Many studies [1] have categorized trial failures, and largely agree on three reasons:

60% Efficacy: The drug doesn’t work well enough.
20% Safety: The side effects are too toxic.
20% Commercial: Strategic portfolio shifts or financial difficulties prevent follow-up trials.

While these categories cover how a drug fails, they don’t answer why - and more importantly, what could be done differently. Clinical trials are preceded by a long discovery process focused on efficacy and safety, so it’s worth asking: Why did we believe the drug would be safe and effective, and what did we miss?

Every clinical trial is a hypothesis to be proven or disproven, different for every drug. Look through enough individual examples, and a few stories continually show up to explain the disconnect between hypothesis and reality.

Poor patient selection: Humans are heterogeneous - we have different genes, environments, diets, gut microbiomes, and more. So it’s no surprise that we also respond differently to drugs. In most clinical trials, only some of the patients show positive responses, and the challenge is figuring out beforehand which ones will respond well. Selecting the wrong patient population or not collecting the right data to stratify patients can prevent an effective drug from showing positive results, because the signal from responding patients is diluted by the non-responding ones.

Inaccurate models: Scientists use preclinical models as a proxy for human physiology. Computational tools, cell assays, and animal models all feed the iteration cycle to find and improve drug candidates. But those models can fail to capture the full complexity of a system: cells behave differently in test tubes than bodies, induced diseases look different from natural ones, and animals have different genes and proteins than humans do. Those disconnects can lead to false confidence, where a drug works well in the model but not in humans.

Non-causal biomarkers: We’ve never had more human data, from enormous health databases like the UK Biobank to clinical trial measurements kept at every pharma company. These datasets can correlate inputs like genetics, diets, and blood markers to outcomes like lifespan or disease risk, ultimately providing ideas for new drug targets and measurements to evaluate progress. But as with all data analysis, correlation does not mean causation. Even after taking precautions, it’s easy to think a biomarker drives a disease right up until a clinical trial disproves the hypothesis.
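
The patient selection point above can be made concrete with a back-of-the-envelope power calculation. This is a simplified sketch with made-up numbers: it assumes a two-arm trial analyzed with a two-sided z-test, that non-responders get exactly zero benefit, and that mixing the two groups adds no extra variance. The function name `approximate_power` and all parameter values are illustrative assumptions.

```python
from statistics import NormalDist


def approximate_power(effect_in_responders, responder_fraction,
                      patients_per_arm, alpha=0.05):
    """Approximate power of a two-arm trial under a normal approximation.

    Effects are expressed in units of the outcome's standard deviation.
    """
    # The average effect across everyone enrolled is diluted by non-responders.
    diluted_effect = effect_in_responders * responder_fraction
    standard_error = (2.0 / patients_per_arm) ** 0.5
    z_critical = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    # Chance the observed benefit clears the significance threshold
    # (ignoring the negligible chance of significance in the wrong direction).
    return 1.0 - NormalDist().cdf(z_critical - diluted_effect / standard_error)


# Hypothetical drug: a strong benefit (0.8 SD), but only 25% of patients respond.
unselected = approximate_power(0.8, responder_fraction=0.25, patients_per_arm=100)

# Same drug, enrolling only responders: a quarter as many patients per arm.
selected = approximate_power(0.8, responder_fraction=1.0, patients_per_arm=25)

print(f"unselected population, 100 per arm: {unselected:.2f}")
print(f"responders only, 25 per arm:        {selected:.2f}")
```

Under these assumptions, the smaller trial that enrolls only responders is far more likely to succeed than the larger unselected one, even though the drug is identical in both.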

“All successful drugs are alike; each failed drug has failed in its own way.”

- Leo Tolstoy, almost

Of the near infinite stories to tell, let’s look at a few more examples.

Finding the right patients

Cancer is a notoriously challenging target because every tumor is different. So in the early 2000s, scientists were excited to find a pattern: the protein EGFR was overly abundant in many cancers and played an important role in cell growth. Yet hopes were dashed when the 2005 trial for gefitinib showed no advantage over standard chemotherapies. Undeterred, researchers segmented the patient population and found it worked better in non-smokers and Asians, but they couldn’t explain why. By the time a follow-up study was commissioned in 2009, a new technology emerged that could help. DNA sequencing costs had fallen from $10,000,000 to $100,000 in those four years, just cheap enough that sequencing patient tumors was now feasible. A sequence analysis revealed the true cause: patients with specific EGFR mutations responded well to gefitinib, while those with only overexpression did not. As it turned out, both non-smokers and Asians were far more likely to have those mutations than others. Approval and mutation-based screening quickly followed.

Working with imperfect models

Inspired by the success of EGFR drugs, scientists tried again with another overexpressed protein in cancers, IGF-1R. Repeated clinical trials over the course of the 2010s failed to outperform chemotherapy, and unlike EGFR, retrospective analysis could not identify reliable biomarkers to segment the population. Yet animal models continued to work, leading researchers to wonder where the disconnect was. More mechanistic experiments revealed the problem: in humans, when IGF-1R is blocked, other proteins like IR-A step in to compensate, quickly ramping up expression to take over where IGF-1R left off. Animal models largely miss or suppress this compensatory mechanism, and the difference in animal biology prevented researchers from noticing the issue.

Toxic side effects can be even more challenging than efficacy to predict, as a fateful trial in 1999 showed. Jesse Gelsinger suffered from a rare genetic disease affecting his liver, and doctors suggested he try an experimental new gene therapy: a modified adenovirus intended to insert corrected versions of the mutated gene into his cells. After receiving the therapy, his health degraded rapidly, and four days later, he died from a massive immune reaction. Unbeknownst to the clinicians, Jesse had been previously exposed to the adenovirus, likely from a common cold, priming his immune system to respond more strongly the second time. Jesse’s death sent shockwaves through the field [2], leading to more scrutiny and a decade of careful safety studies before experimentation came back. Yet the central problem facing gene therapy remains mostly the same even today: unpredictable immune responses.

A bad biomarker, maybe?

The “amyloid hypothesis” has been the prevailing theory of Alzheimer’s Disease for more than 30 years: that the accumulation of beta amyloid plaques in the brain causes neurodegeneration. In that time, hundreds of clinical trials have tried and failed to treat Alzheimer’s by reducing these plaques. And if this story had been written five years ago, the summary would be clear - amyloids are a symptom and not a cause. But three recent and contentious drug approvals for Alzheimer’s add some ambiguity to the story, raising the question of whether amyloids are a bad biomarker, or if we just haven’t had the right drug for it. Others have written extensively on the subject [3], and this is a story still being written. When we finally have good treatments for Alzheimer’s, there will be much to learn from the amyloid approach as either a success or failure of biomarkers.

New questions

Finding accurate models, good biomarkers, and a relevant patient population is a daunting gauntlet of challenges to tackle. Yet they also provide us a new set of questions to answer, ones that will hopefully lead to more successful clinical trials. Some of the questions I have are:

How do we segment patient populations?

How can we identify which drugs can be repurposed for new patient populations?
How can we design clinical trials to personalize medicines for each patient?

How do we build better preclinical models?

How can we generate relevant training data at unprecedented scale, and will that be enough for AI models to scale our way into biological understanding?
How can we represent increasingly complex systems with 3D organoids and organs-on-a-chip?

How do we find causal biomarkers?

What data should we be collecting in clinical trials, and how can we make that easy to obtain?
How can we break down data silos and make more data available to researchers?
How can we more effectively separate causality from correlation, and correlation from noise?

Each of these questions is deserving of its own essay, and intrepid researchers and entrepreneurs are already tackling them head on. I’m excited to see what they’ll accomplish.

[1] Arrowsmith, J. Phase III and submission failures: 2007–2010. Nat Rev Drug Discov 10, 87 (2011). https://doi.org/10.1038/nrd3375
Arrowsmith, J. Phase II failures: 2008–2010. Nat Rev Drug Discov 10, 328–329 (2011). https://doi.org/10.1038/nrd3439
Arrowsmith, J., Miller, P. Phase II and Phase III attrition rates 2011–2012. Nat Rev Drug Discov 12, 569 (2013). https://doi.org/10.1038/nrd4090
Harrison, R. Phase II and phase III failures: 2013–2015. Nat Rev Drug Discov 15, 817–818 (2016). https://doi.org/10.1038/nrd.2016.184
Hwang, T.J., Carpenter, D., Lauffenburger, J.C., Wang, B., Franklin, J.M., Kesselheim, A.S. Failure of Investigational Drugs in Late-Stage Clinical Development and Publication of Trial Results. JAMA Intern Med 176(12), 1826–1833 (2016). https://doi.org/10.1001/jamainternmed.2016.6008

[2] For more on how Jesse’s death impacted the field, I highly recommend The Death of Jesse Gelsinger, 20 years later.

[3] A few of the most highly referenced articles include:

Starting In Biotech: Part II

Zach Kagin — Mon, 25 Aug 2025 23:42:08 GMT

If you’re just learning about biotech and want a broad industry view, consider starting with this overview of the biotech industry and then diving into drug discovery.

Look up the top 10 companies in biotech, and you’ll get a list of pharmaceutical companies: Eli Lilly, Novo Nordisk, Merck, Pfizer. Biotech is dominated by the development of drugs, in mindshare and in dollars. And while we’ve been trying new medicines for thousands of years, the last few decades have seen an explosion in new approaches and methods. CRISPR, mRNA vaccines, and CAR-T therapy are just a few of the latest approaches to mitigating diseases we’ve suffered from since the dawn of humanity.

To break it down, let’s first step through the different stages of drug discovery. Afterwards, we’ll dive into the different types of drugs and their unique advantages.

Stage

The drug discovery process is a long journey, spanning from years to decades, that starts with a disease state and ends with a therapeutic. The first step is usually to understand the disease better, allowing scientists to come up with therapeutic ideas and build models to test their hypotheses. These hypotheses are developed and refined over time, ultimately becoming drug candidates that get validated first in animals and then humans, and finally manufactured at scale and distributed to the market.

Target Identification

For most of human history, we had no real understanding of how diseases happen. Hippocrates, father of medicine, felt that an imbalance of the four bodily fluids (black bile, yellow bile, phlegm, and blood) led to disease. Others believed it came from witchcraft, or the alignment of the stars. With no mechanistic knowledge of disease, the only way to find medicine was to try different ingredients and see what worked.

Today our understanding is much improved, and we can often point to a “target” for therapeutics to focus on. Some of the broad categories of disease origin include:

Infectious: An external pathogen like a bacterium or virus (influenza)
Genetic: A DNA mutation, at birth or developed over time (sickle cell anemia)
Immune-mediated: The immune system attacks healthy cells or overreacts to external stimuli (rheumatoid arthritis)
Nutritional: Deficiencies in nutrition lead to lack of resources (scurvy)
Environmental: An external toxin causes damage (lead poisoning)
Regulatory: Hormone or metabolic feedback systems break down (type 2 diabetes)
Vascular: Cumulative structural damage leading to loss of function (stroke)

There is no one journey to understand a disease, and two very different approaches highlight that.

A natural drug: In 1964, a medical expedition to Easter Island was collecting samples for potential new drugs (“trying different ingredients to see what worked”). One of the samples found was initially tested as an antifungal, but after the research team saw evidence of immunosuppressive and anti-tumor activity, they started investigating how it worked. It wasn’t until the mid-90s that biologists were able to identify which protein this compound was affecting. That target, mTOR, turns out to be a central regulator of cell growth and proliferation, and the drug, rapamycin, is now used for a variety of purposes in cancer treatment, organ transplants, and longevity.

Insight in data: In 2003, French researchers were studying families that had extraordinarily high LDL cholesterol. After sequencing their genomes, they found that these families consistently carried mutations in a single gene, PCSK9. Two years later, an American team found a cohort with ultra-low LDL, identifying a different mutation in the same gene. This natural experiment surfaced PCSK9 as a prime target for a new drug, and by 2015 two new antibody drugs were approved that dramatically and safely lower LDL cholesterol.

For some diseases, we are now confident about their cause. Sickle cell anemia is almost certainly caused by a mutation in the beta-globin gene (HBB) that produces hemoglobin S, and tuberculosis by infection with the bacterium Mycobacterium tuberculosis. The causes of other diseases, like Alzheimer’s, are still mostly a mystery despite decades of investment. That lack of understanding has made it that much harder to find drugs that work.

The volume and diversity of data being collected is helping to illuminate new possible causes, and a new wave of biotechs are generating and building models based on that data. The hope is that with increasing data and model scale, we’ll start to see emergent behavior that can help us understand how different biological components interact. If these models can more accurately simulate cells or organoids, we can rapidly test changes in input, like DNA, and see how that changes outputs like protein expression, eventually giving us new hypotheses and new therapeutic targets.

Phenotypic Discovery

Some traditional drug discovery approaches skip target identification, focusing on the downstream effects of a drug without requiring an understanding of the mechanism by which it operates (for example, we still don’t really know how Tylenol works). This approach, working solely based on observable changes called “phenotypes,” can remove assumptions about how biology works, potentially unlocking therapeutics for diseases with no known or intractably complex causes. But it provides only minimal information about which direction to take a drug’s development, often leaving us to stumble upon one that works.

Examples: Noetik (Biology Foundation Models), Recursion (Phenotypic AI Models)

Drug Development

With a target in mind, the next stage is identifying a compound or mechanism for affecting that target. Scientists develop proxies to evaluate a compound’s potential effectiveness in humans. Those range from simple chemical tests called assays, to animals genetically modified to exhibit more human-like attributes. At the beginning of the drug development process, faster methods are required to quickly evaluate many different compounds, while late-stage development requires more accurate, but more expensive tests.

Most drug development starts by “screening,” or evaluating a library of compounds. Traditionally, this meant physically running potentially millions of compounds through the assay to see if anything works. Many AI/ML drug discovery companies will now use “virtual screens” instead, where they generate and evaluate ideas computationally before chemically testing only a smaller subset.
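
In its simplest form, a virtual screen is a rank-and-filter step: score every compound in silico, then send only the best-scoring few to the lab. The Python sketch below shows that shape and nothing more. The library, the `descriptor` field, and `toy_score` are all hypothetical placeholders. A real pipeline would use docking, a trained model, or physics-based scoring in place of the toy function.

```python
import heapq


def virtual_screen(library, predict_score, top_k):
    """Score every compound computationally and keep the top_k for lab testing."""
    return heapq.nlargest(top_k, library, key=predict_score)


def toy_score(compound):
    """Hypothetical stand-in for a model's predicted activity (higher is better).

    Pretends that compounds with a descriptor near 0.7 bind best.
    """
    return -((compound["descriptor"] - 0.7) ** 2)


# A tiny made-up library; a real one might hold millions of compounds.
library = [{"id": f"cmpd-{i}", "descriptor": i / 10} for i in range(11)]

shortlist = virtual_screen(library, toy_score, top_k=3)
shortlist_ids = [compound["id"] for compound in shortlist]
print(shortlist_ids)
```

Only the three shortlisted compounds would go on to a physical assay. The economics depend entirely on how well the scoring function tracks real activity, which is where the data quality issues discussed earlier come in.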

If an initial compound is found, scientists continue iterating, trying small changes to see how they affect effectiveness. Other properties become more relevant over time, like how effectively the compound is delivered into a cell, how easy it is to make, or how toxic it might be to healthy cells. Because these properties trade off against each other, optimizing one will often degrade another, making this problem ripe for multi-parameter optimization techniques.
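
One common building block of multi-parameter optimization is the Pareto front: the set of candidates that no other candidate beats on every property at once. The sketch below is a minimal illustration with invented compounds and invented scores for three properties (potency, permeability, safety), each scaled so that higher is better. It is not a description of any particular company's method.

```python
def dominates(a, b):
    """True if `a` is at least as good as `b` on every property and strictly
    better on at least one (higher is better for all properties)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates):
    """Return the candidates that are not dominated by any other candidate."""
    return {
        name: scores
        for name, scores in candidates.items()
        if not any(
            dominates(other_scores, scores)
            for other_name, other_scores in candidates.items()
            if other_name != name
        )
    }


# Hypothetical scores: (potency, permeability, safety).
candidates = {
    "A": (0.9, 0.2, 0.5),  # very potent, poorly delivered
    "B": (0.6, 0.7, 0.6),  # balanced
    "C": (0.5, 0.6, 0.5),  # worse than B on everything
    "D": (0.3, 0.9, 0.9),  # safe and well delivered, weakly potent
}

front = pareto_front(candidates)
print(sorted(front))
```

Compound C drops out because B is better on all three properties, while A, B, and D each represent a different trade-off. Choosing among them is a judgment call about which property matters most for the program.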

As potential candidate drugs advance through the iterative process, they are tested more thoroughly, both in cell assays (“in vitro”) and in living organisms (“in vivo”). For now, animal studies are still the best and most accurate way to test for the wide variety of effects a drug may have, but a new crop of biotech companies is focused on supplementing or replacing them with AI-predicted properties.

When all is done, a single candidate drug is declared, and an IND (Investigational New Drug) application is filed with the FDA to begin human studies.

Examples: Insilico (Generative AI for drug discovery), Axiom Bio (AI drug toxicity models)

Delivery

While a therapeutic may work well once inside a cell, delivering it into the right cells is an entirely different challenge. Our body has numerous defense systems to prevent this from happening, and any delivery mechanism needs to evade the immune system, avoid being metabolized by enzymes in the blood, and pass through the cell membrane intact.

Some drugs can be formulated into pills or liquids and taken orally, while larger and more complex treatments sometimes require more thought. Nanocarriers and other encapsulation systems can safely shepherd drugs through the blood, while a modified virus can deliver DNA directly into a cell and insert it into the genome. Developing the vehicle of delivery is sometimes even more important than developing the drug itself.

Examples: Dyno Therapeutics (capsid delivery), Capsida Biotherapeutics (viral vector development)

Clinical Development

The FDA generally breaks clinical trials into three phases.

Phase 1 is primarily testing for safety and what dosage can be reasonably tolerated.
Phase 2 focuses on effectiveness, measuring actual disease progress in patients, like symptom frequency or tumor size.
Phase 3 is a large-scale study, validating Phases 1 and 2 in a much larger population and monitoring for less common side-effects.

At the end of phase 3, the drug is submitted to the FDA for full approval. The regulatory pathway can also be more complex, with post-approval monitoring (“phase 4”), accelerated approvals for major breakthroughs, and expanded access for patients in need during the trial.

Clinical trials are risky, with over 90% of drugs never making it to FDA approval. They can cost hundreds of millions of dollars to run, making it hard for all but the largest pharma companies to bring a drug to market. Because of this, many smaller biotech companies out-license their drugs early in the process to mitigate the risk of failure.

Executing clinical trials presents a new set of challenges for a drug discovery company.

Protocol design determines in advance which hypotheses will be tested, what the success criteria will be, which patients will be eligible for the trial, and which controls the results will be compared against. Patient selection in particular is a critical decision, as many drugs work only on a subset of a population. If a trial is designed poorly, a drug that works very well on a subset of the population might fail the trial anyway.

Immuno-oncology, the use of the immune system to target our own cancers, is one of the most exciting developments in cancer treatment, and can lead to complete remission even in some late-stage patients. It has been transformative for some, but one of the biggest open questions is why it only works on 20-50% of patients. If we could identify which patients these drugs worked on, it would enable better precision therapy and give us a clue on how to design drugs for the remaining patients.

Site selection and trial recruitment: The logistics of managing live patients is a large part of what drives clinical trial costs into the hundreds of millions. Companies need to coordinate with hospitals on recruitment and treatment plans, monitor patient progress, and handle patient data, all under heavy regulatory scrutiny.

Matching patients with clinical trials is a classic marketplace problem, but one complicated by extreme time pressure, geographic constraints, and a lack of information. New approaches for pairing patients and trials are critical to help drive down the extreme and growing costs of these trials.

Examples: Foundation Medicine (Genomic profiling for clinical trials), Unlearn (AI digital twins)

Go-to-market

Once a drug is approved by the FDA, it can then be sold to patients. This is usually done through clinicians and hospitals, but Ozempic is a good example of a drug that appears so effective that patients are asking for it themselves.

The good news is that by this point, there’s rarely a “product-market fit” issue. If the drug works, people will usually want it and doctors will usually want to prescribe it. The bad news is there are still a lot of challenges remaining. Go-to-market still involves handling pricing, insurance coverage, and reimbursement protocols for drugs. And scaling up manufacturing can require optimizing chemical synthesis, building bioreactors, and creating a supply chain to deliver the drug where it needs to go.

Even at this stage, some drugs can fail to live up to expectations. Exubera, an inhaled insulin drug, failed because of an unwieldy and embarrassing inhaler and a high price. Provenge, a treatment for prostate cancer, took too long to negotiate coverage from Medicare/Medicaid, and had sufficiently severe manufacturing bottlenecks that it was eventually outcompeted by other drugs a few years later. Especially in fast-moving industries where new drugs are on the horizon, biotechs sometimes only have a few years to capitalize on their lead.

Examples: Culture Biosciences (Cloud-connected bioreactors), Komodo (patient journey tracking)

Drug Modalities

While most drug discovery goes through a similar process from target identification to marketing, the approach can differ based on the type of drug. To introduce these types, it’s helpful to briefly review some cell biology.

DNA is the central information repository of the cell. It’s made up of long chains of molecules called nucleotides, and these encode information the same way bits do in computers (but in base 4). Small snippets of the DNA are “read” to make copies called RNA. That RNA is used as a template, where each section of three nucleotides is mapped to produce an amino acid (a codon table shows which patterns map to which amino acids). Chains of amino acids fold into complex shapes as they are synthesized, producing proteins. This is the Central Dogma of biology, that “DNA makes RNA, and RNA makes protein.”
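The "DNA makes RNA, and RNA makes protein" pipeline is mechanical enough to sketch in a few lines. The snippet below is a simplified illustration, not a bioinformatics tool: it assumes the input is the coding strand, starts at the first AUG it finds, and ignores real-world details like splicing. The codon table is the standard genetic code; the function names are my own.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each RNA codon (e.g. "AUG") to a one-letter amino acid; "*" marks a stop codon.
# Four bases in groups of three give 4**3 = 64 codons.
CODON_TABLE = {
    "".join(codon).replace("T", "U"): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna: str) -> str:
    """Copy a DNA coding-strand sequence into mRNA by swapping T for U."""
    return dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Read codons from the first AUG until a stop codon or the end of the sequence."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("ATGGCCATTTAA")
protein = translate(mrna)
print(mrna, protein)
```

Here the twelve-letter DNA snippet becomes the mRNA AUGGCCAUUUAA, which reads as methionine, alanine, isoleucine, and then a stop codon, giving the three-residue chain "MAI".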

Proteins are the primary actors of the cell, responsible for everything from transporting molecules to defending against pathogens to regulating bodily functions. Those proteins communicate directly with each other, but often use other smaller biomolecules, like hormones and neurotransmitters, to signal information to other proteins. I often think of proteins like functions and small molecules like the variable inputs to those functions, though proteins will often interact directly with each other as well.

With that out of the way, let’s jump into the different ways we can intervene when something breaks.

Small Molecules

You probably consume small molecule “drugs” every day without realizing it. While Advil and Allegra sit in your medicine cabinet, caffeine and ethanol work the same way. These molecules bind into a pocket in proteins, inducing the protein to change shape or enabling the next step in a reaction. The small molecule functions like an ignition key to a protein “lock,” dialing protein function up and down, and new small molecules can be designed to inhibit dangerous proteins or activate dormant ones.

Over the last few decades, structure-based design has become a very popular method for small molecule discovery. It involves looking at the 3D shape of the protein, and using that information to design molecules that might fit into the same pocket. These 3D structures usually come from X-ray crystallography or cryo-EM (cryogenic electron microscopy), but the development of protein folding models like AlphaFold has allowed scientists to view hypothesized structures without needing to crystallize a protein.

While this sounds as simple as designing the key for a lock, it becomes much more difficult knowing that:

The key and lock are both flexible, constantly moving and changing shape in the cell.
The key and lock will adapt to each other in surprising ways, changing shape as they get close.
The key can only fit this specific lock, as fitting other locks can cause unwanted side effects.
The key can sometimes fit into the lock, but not “turn,” such that there’s no effect on the protein.

Physics-based simulations like Free Energy Perturbation (FEP) are used to help calculate the likelihood that molecules bind to proteins. More recently, AI-based methods have built on that information to directly predict how well those molecules bind.
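For a sense of what these simulations output: binding strength is usually reported as a free energy, which converts to a dissociation constant through the standard thermodynamic relation ΔG = RT ln(Kd). The sketch below just applies that formula. The example energies are made up, the function names are my own, and it is not how FEP itself is computed.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def kd_from_delta_g(delta_g_kcal: float, temperature_k: float = 298.15) -> float:
    """Dissociation constant in mol/L from a standard binding free energy.

    More negative delta G means tighter binding and a smaller Kd.
    """
    return math.exp(delta_g_kcal / (R_KCAL * temperature_k))


def affinity_fold_change(ddg_kcal: float, temperature_k: float = 298.15) -> float:
    """How many times tighter a molecule binds after a free energy change.

    A negative ddG (more favorable binding) gives a value above 1.
    """
    return math.exp(-ddg_kcal / (R_KCAL * temperature_k))


# Hypothetical example values.
kd = kd_from_delta_g(-10.0)               # a fairly tight binder
improvement = affinity_fold_change(-1.364)  # a modest-looking energy gain

print(f"Kd = {kd:.2e} M, improvement = {improvement:.1f}x")
```

The takeaway is how sensitive binding is to small energy differences: at room temperature, a change of roughly 1.4 kcal/mol corresponds to about a tenfold change in affinity, which is why these calculations need to be so precise to be useful.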

Examples: Genesis Therapeutics (AI for small molecule discovery), Rowan Sci (computational chemistry tools)

Protein Biologics

If a protein isn’t working correctly, one option is to add more of that protein directly. Diabetes happens when the body is unable to produce insulin in the right quantities and at the right time in response to glucose levels, and the first mainstream solution to that was directly injecting insulin. As technology has improved, companies can now design more complex proteins to either replicate functionality in the body, or add entirely new mechanisms. Ozempic is one of the most popular examples of this, mimicking the structure of the protein GLP-1 to regulate blood sugar levels, but with improved stability and half-life.

Historically, protein biologics have focused on making small changes to known proteins, helping them last longer or be absorbed better, but protein folding models can again help unlock fully new proteins. If a specific shape is desired, inverse folding models can generate sequences that should fold to produce the desired structure. Understanding the function of a protein from its structure remains a large challenge, and many designs still lean on known structures for inspiration.

Examples: Generate:Biomedicines, Arzeda (AI for protein design)

Antibodies

Antibodies are a special class of protein biologics. The immune system naturally produces antibodies as a defense against malfunctioning cells and foreign invaders, but we can’t always produce the right antibodies to protect ourselves.

Some of the most popular drugs currently on the market are designed antibodies. Keytruda is an antibody cancer drug, part of an exciting wave of therapies in immuno-oncology where the immune system is harnessed to attack cancer. Keytruda targets and inhibits a protein responsible for applying brakes to our own immune system, allowing the immune system to go after cancer cells. Humira is an antibody used for rheumatoid arthritis, targeting a protein that stimulates inflammation. Both are examples of monoclonal antibodies, which target a single cell marker. Current research aims to develop antibodies targeting multiple markers simultaneously (“bispecific”), and to combine antibodies with small molecules to increase effectiveness (“antibody-drug conjugates”).

Antibody generation has historically worked differently than small molecules and other biologics, because we already have a natural system for rapidly designing and testing new antibodies - the immune system itself. To leverage that, scientists can inject the antigen into an animal and then extract the antibodies generated. Those antibodies then need to be “humanized” so they will work in a different species. This approach is called hybridoma technology.

Instead of using live animals, researchers also use “phage display” to simulate the immune system’s generation of antibodies. It involves producing a large library of protein sequences and fusing them into a special organism called a bacteriophage, which produces the protein on its surface. Phages are tested for how well they bind to the antigen, and the most effective phages have their DNA sequenced to identify the displayed protein. That DNA can then be attached to an existing antibody, which will then hopefully also bind to the same antigen.
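In practice this selection is usually repeated over several rounds, and a toy model shows why that helps. The sketch below is a deliberately simplified illustration with made-up numbers: each variant has a hypothetical probability of sticking to the antigen, and each round keeps phages in proportion to that probability before regrowing the pool. Real experiments have many more sources of noise and bias.

```python
def panning_round(fractions, bind_probs):
    """One round of selection: keep binders, then renormalize the pool (amplification)."""
    kept = [f * p for f, p in zip(fractions, bind_probs)]
    total = sum(kept)
    return [k / total for k in kept]


# Hypothetical four-variant library. The best binder starts out as 0.1% of the pool.
bind_probs = [0.001, 0.01, 0.1, 0.5]   # chance each variant sticks to the antigen
fractions = [0.97, 0.02, 0.009, 0.001]  # starting share of each variant

history = [fractions]
for _ in range(3):
    fractions = panning_round(fractions, bind_probs)
    history.append(fractions)

for round_number, pool in enumerate(history):
    print(f"Round {round_number}: best binder is {pool[-1]:.1%} of the pool")
```

With these assumed numbers, the strongest binder goes from one phage in a thousand to the majority of the pool within three rounds, which is what makes sequencing the survivors informative.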

While designing antibodies often already works well, optimizing for a wide variety of other properties remains a challenge. New antibodies can accidentally bind to other proteins, trigger unwanted immune system responses, or be difficult to manufacture and keep stable at scale. New computational methods aim both to design antibodies and simultaneously optimize for these additional properties to accelerate the development process.

Examples: Absci, Chai Discovery (AI for antibodies)

Vaccines

Vaccines are also a special case of protein biologics. Instead of serving some function of its own, the protein in a vaccine is there only to trigger the immune system to produce antibodies against it. The protein itself is an inactive replica or portion of the pathogen the vaccine is meant to prevent. This approach works well when the pathogen is relatively stable, but is less effective for diseases that rapidly evolve or come in a variety of types. For example, the common cold is the result of hundreds of different viruses, while the flu mutates quickly enough that new vaccines are designed for it every year.

Examples: Evaxion (AI for Immunology)

RNA Therapy

In late 2020, the world quickly became experts on RNA therapies, due to the exciting and timely development of the mRNA vaccine for COVID. When mRNA is directly added, cells take it up and start producing proteins from the coded information. In a vaccine’s case, that protein is a component (like the famous “spike protein”) of the virus or undesirable cell, causing the body to ramp up its immune system to clear the protein out and leaving residual protection when the virus later shows up.

Another form of RNA therapy is siRNA, or small interfering RNA. These are short pieces of double-stranded RNA that, instead of being translated into proteins, trigger the degradation of existing mRNA. They can be used to suppress production of specific harmful proteins. Antisense oligonucleotides (ASOs) work similarly, but in the form of a single strand.
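Both approaches rely on base pairing: the therapeutic strand has to be complementary to a stretch of the target mRNA. A minimal sketch of that core idea is below. The 21-nucleotide window is a typical siRNA length, the example sequence is invented, and the function names are my own. Real design tools also score candidates for off-target matches, stability, and other properties that this ignores.

```python
RNA_COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement_rna(rna: str) -> str:
    """Return the strand that base-pairs with the given RNA sequence."""
    return "".join(RNA_COMPLEMENT[base] for base in reversed(rna.upper()))


def candidate_guides(target_mrna: str, length: int = 21):
    """Yield (position, guide) pairs for every window along the target mRNA."""
    for start in range(len(target_mrna) - length + 1):
        window = target_mrna[start:start + length]
        yield start, reverse_complement_rna(window)


# Hypothetical 30-nucleotide stretch of a target mRNA.
target = "AUGGCCAUUGUAAUGGGCCGCUGAAAGGGU"
guides = list(candidate_guides(target))

print(f"{len(guides)} candidate guide strands, first: {guides[0][1]}")
```

This is part of why sequence design is the comparatively easy step: given a target, the candidates can be enumerated directly, and the hard work shifts to picking among them and getting them into the right cells.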

The design of an RNA sequence is often relatively straightforward, but challenges remain trying to deliver it effectively. If injected directly, RNA is rapidly degraded by the body’s own enzymes. Wrapping RNA in a delivery vehicle can help it survive longer, but risks more unwanted immune responses or accumulation of RNA into the wrong organs or cell types.

RNA is hypothesized to have been the very first biomolecule, the true origin of life. The list of RNA types is long and growing, with a new one being discovered as recently as 2021 (GlycoRNA). While early, it seems likely these discoveries will eventually lead to vast new types of therapeutics.

Examples: Atomic AI (AI for RNA therapies), Deep Genomics (AI for RNA therapies)

Gene Therapy

Since the discovery of DNA as the information source of the cell, scientists have dreamed of editing it directly, and modern methods like CRISPR now allow us to add, remove, and replace nucleotides in targeted fashion. For diseases that originate with a genetic mutation, there is no better solution than to fix it at the source.

Gene therapy is most effective when specific genes are broken or missing. In sickle cell anemia, adding new copies of a hemoglobin gene seems sufficient to restore patients to health. Even if the mutation continues to exist in some cells, having other cells with the repaired DNA can often mitigate the disease. In cases like cancer, where even a single cell can cause problems when it replicates, gene therapy has been less effective.

Examples: Profluent (AI designing new gene editors), Mammoth Biosciences (CRISPR gene editing)

Cell Therapy

Instead of adding proteins or DNA, it can sometimes be advantageous to add entire cells. The first bone marrow transplant was done in 1956, and while stem cell treatments didn’t turn out to be the miracle drug we expected in the 2000s, new developments continue to push the envelope of what’s possible. CAR T-cell therapy for cancer is a new approach, where immune cells are extracted from the patient, modified to track down specific tumor cells, and then injected back into the body.

Examples: Sana Biotechnology (engineered cells), Elevate Bio (cell therapy development)

New Approaches

Scientists are always coming up with new and creative ways of developing therapies. Some of those experimental and less validated ideas include:

Microbiome therapeutics: Leveraging healthy bacteria to improve the microbiome.
Oncolytic viruses: Engineering viruses specifically to seek out cancer cells.
Radiopharmaceuticals: Drugs that target particular cells and emit radiation to kill them, reveal them during imaging, or further target them with other methods.
Molecular glues: Drugs designed to bring multiple proteins together instead of disrupting a single protein.

What Next?

While there is so much we still don’t understand about biology, there are many opportunities for software and AI to work at the cutting edge of research. You hopefully now have a framework to help inform what you’re excited about, and which questions to ask next.

If you’re interested in following along as we investigate some of these biological and technical questions in more depth, subscribe and stay up-to-date with new blog posts. And as always, if you're exploring this space too, I'd love to hear what you're building or learning. Reach out anytime.

Thank you to Alex Goldberg, Anton Krutiansky, Ken Leidal, Lina Garcia, Matteus Pan, Ryan Kagin, and Wojtek Swiderski for taking the time to review this post.

Starting In Biotech

Zach Kagin — Mon, 25 Aug 2025 23:39:37 GMT

The biotech developments of the last decade sometimes sound like science fiction. We can directly edit genes to correct inherited diseases, predict protein structures from their sequences alone, and develop a working vaccine for a global pandemic in just a few weeks. But for someone who wants to learn more, it's hard to know where to start asking questions.

This is the overview I wish I had when deciding how to join biotech. My aim is to give you a practical framework to understand the breadth of research, a starting point to follow your own curiosity, and some insight into where software and AI can help bring us into the future that biotech promises.

Why work in Biotech?

Biotech is science applied to improve health and save lives. Its impact is deeply human, giving us a few more years with a loved one and the freedom to age gracefully. Its nature is deeply intellectual, interrogating the most complex system the universe has ever created. And its pace is accelerating, with new discoveries coming every year.

There has never been a better time to join the field. While breakthroughs will continue to come from laboratory experiments and clinical insights, software plays an increasingly important role enabling those discoveries and bringing them to market. And with the explosion of biomedical data, AI has the opportunity to uncover connections that were never possible before.

It no longer requires a biology or chemistry PhD to contribute. But it does help to understand a little more of the science, so let’s jump in.

How does biotech improve health and save lives?

We’ll start by looking at how biotech directly helps patients. It does so by answering two questions about our health: what is the problem (diagnostics), and how do we solve it (therapeutics). With that context, we will then view the development journey of a product, and how biotech enables researchers, scientists, and clinicians to create and deliver those products to patients.

The Patient’s View

Identifying the Problem: Diagnostics

Every health issue needs a diagnosis before a treatment can be applied. Diagnostics help a clinician choose the right treatment, stop an infectious disease from spreading, and prevent a catastrophic failure like a heart attack or stroke. One way to categorize the different approaches is to look at the source and frequency of information.

Clinical screening uses what can be seen, heard, or felt to identify problems. Every time a doctor asks you to cough, every time your gym teacher asks you to run a mile, and every time you have to read EDFCZP at the bottom of an eye chart, you are getting clinical diagnostics. These tests are the first indication of whether something is wrong and whether to follow up with other tests. More recently, these tests are being brought into the comfort of our own homes using nothing but an iPhone or handheld device.

Example biotech companies: TytoCare (home care), BrainCheck (neurocognitive testing)

Molecular diagnostics uses the presence of specific molecules in bodily fluids to diagnose potential issues. Blood panels can tell if your Vitamin D is low or your cholesterol is high. Stool samples can analyze your microbiome for gut health, while an all-too-familiar nasal swab test can quickly detect COVID-19. New tests can even use DNA markers in the blood to identify early signs of cancer.

Examples: GRAIL (cancer detection), Viome (microbiome tests), Sherlock Biosciences (abnormal DNA/RNA detection)

Genetic testing is a special case of molecular diagnostics, extracting DNA to analyze further. While 23andMe could tell us what percent of our DNA is shared with Neanderthals, our genome sequence can also determine the presence of a genetically linked disease or a propensity for it. For example, more than 50% of women with a BRCA1 mutation will get breast cancer by age 80, leading many to get preventative mastectomies to avoid it.

The more genomic information we can collect, the more patterns we can find that associate genetic mutations with observable changes. Researchers are analyzing enormous datasets containing genome data, lifestyle information, physical measurements, medical imaging, and death registries. This complex multimodal data is ripe for AI models to extract information, and the hope is they can connect genetic changes to health issues, pointing the way towards future solutions.

Examples: Nebula Genomics (whole genome sequencing), Genomenon (gene analysis)

Imaging diagnostics visually examine what’s going on inside the body. The alphabet soup of MRIs, PET scans, CT scans, X-rays, and other imaging tools each have their own advantages. X-rays see dense structures like bones the best, and are great for diagnosing bone fractures. MRIs can provide a 3D view of soft tissues in the entire body, but are much more expensive. And expectant mothers are familiar with ultrasounds, allowing them to hear the most precious sound in the world in real-time, their baby’s heartbeat. New methods are also continually emerging, like photoacoustic imaging, that improve on accuracy or safety.

The most accurate images can only be taken by removing some tissue or fluid first. With samples removed, doctors can see what’s happening at the cellular level under a microscope, like a throat swab to check for strep throat, or a tumor biopsy to identify the type of cancer. The growing field of digital pathology takes images of these samples and analyzes them for more complex and generalizable patterns.

Examples: PathAI (digital pathology), Prenuvo (MRIs for cancer detection), Rad AI (AI for radiology)

Monitoring systems take measurements and give live feedback to the wearer about what’s going on. Digital watches and rings may be fashion statements, but can also track heart rates, blood oxygen levels, and sleep quality. Continuous Glucose Monitors (CGMs) were built to help diabetics manage their insulin, but have also become popular among consumers to help understand how they respond to different foods. And headbands using EEGs are recording brain waves to analyze sleep patterns and help diagnose neurological disorders.

Examples: Dexcom (CGMs), Oura (wearable rings), Beacon Biosignals (brain monitoring)

Where Diagnostics Can Go From Here

Diagnostics is evolving in three different directions, each presenting new opportunities for researchers.

Adding new types of data: How else can we measure what’s happening? Joy Milne demonstrated the astonishing ability to smell Parkinson’s, and researchers are now trying to translate that into a standardized test. Others are working to identify new sources of data inside the cell, extracting ever more granular information on every molecule it contains.

Collecting more data: How do we improve the accuracy, throughput, and frequency of data? CGMs transformed diabetes management by checking glucose levels continually rather than in manual tests, and others are working on monitoring for other key biomarkers. DNA sequencing costs have fallen 100,000x over 25 years, enabling whole genome dataset collection at an unprecedented scale. How else can we scale up existing methods?

Interpreting that data: Across every diagnostic type, data requires analysis to be useful. Data is siloed in individual companies and in disparate formats. Datasets can easily reach into the petabytes, requiring bespoke data pipelines to manage. And all of that data needs interpretation to look for patterns and insights. AI models are already used to review medical images, identify potentially causal genetic mutations, and detect disease markers. As the volume and diversity of data increases, analysis will become ever more challenging and important.
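To get a feel for the sequencing cost figure above, it helps to convert "100,000x over 25 years" into a yearly rate. This is plain arithmetic on the two numbers from the text and assumes, for simplicity, that the decline was smooth (in reality it came in jumps).

```python
import math

FOLD_DROP = 100_000  # total cost reduction, from the text
YEARS = 25           # time span, from the text

# Constant yearly factor that compounds to the total drop.
annual_factor = FOLD_DROP ** (1 / YEARS)

# Time for the cost to fall by half at that constant rate.
halving_time_years = YEARS * math.log(2) / math.log(FOLD_DROP)

print(f"Cost falls about {annual_factor:.2f}x per year")
print(f"Cost halves about every {halving_time_years:.1f} years")
```

That works out to costs falling by roughly 1.6x every year, or halving about every year and a half, sustained for a quarter of a century.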

All of these methods can help identify a health issue. What happens next?

Solving the Problem: Therapeutics

A diagnosis alone isn’t usually helpful without a treatment to accompany it. One way to categorize therapeutics is by the type of intervention: behavioral, physical, and medicinal.

Lifestyle and behavior are changes we can enact on our own, though we are often resistant to doing so. The perennial advice from doctors and parents is to eat well, sleep well, exercise, and manage mental health, and those pillars of health still hold true. But as anyone who has tried to squeeze in a morning run before work can attest, it is easier said than done. Companies are building platforms to help make these behavioral changes easier, but it’s likely it will always require some hard work on our part.

Many behavioral changes are preventative in nature. This can make them extremely effective, as it is easier to prevent rather than cure a disease. But it can also make them much harder to apply when there isn’t a burning need to do so immediately (e.g. vitamins vs. painkillers).

Despite their importance, nutrition and exercise science are still not as well understood as many hope, and clean data collection remains a key barrier to understanding. One challenge is the financial incentive, as it is hard to make money recommending something that is freely accessible to all. As a result, clinical trials for diet and exercise have largely been funded by government and non-profits.

Examples: MyFitnessPal (smart food tracking), Tonal (exercise machines), Elemind (inducing sleep), Headspace (meditation)

Physical interventions directly interact with the body. Surgeries can remove objects, repair damage, or add something new. Whether the surgery gives us a little more confidence or a little more mobility, the technology behind it is quickly advancing. Robots now assist in more than 20% of all surgeries in the U.S., and testing has started for fully autonomous robotic surgeries. Brain-computer interfaces are adding new ways for our minds to communicate with the outside world. And we are finding new ways to augment the human body directly, like vagus nerve stimulation devices and advanced prosthetics.

Non-invasive energy-based interventions are also advancing. Radiation therapy is a common treatment to kill targeted cancer cells, focused ultrasound can stimulate or destroy tissue to treat diseases like Parkinson’s, and phototherapy uses ultraviolet light to treat skin conditions and reduce inflammation.

Examples: Neuralink (brain-computer interfaces), eGenesis (human-compatible organs for transplant), RefleXion (biology-guided radiotherapy)

Medicinal drugs are usually the first idea that comes to mind when someone thinks of biotech. Humans have been using plants as medicine for as long as there have been written records, both medically and recreationally (cannabis was one of the first medicines in recorded history). Today, drugs take a variety of forms, from small molecules to larger proteins to gene therapies. Some of the most exciting developments in biotech involve using AI to discover new drugs, which I’ll cover in more depth in the next blog post.

Examples: Genesis Therapeutics (AI for drug discovery), Schrödinger (physics-based discovery tools)

Development Of A New Biotech Product

If diagnostics and therapeutics are the products being delivered, how are they made? The development lifecycle follows a path from experimentation, to validation, to delivery.

Experimentation

The seeds of an idea are planted in an errant image, a surprising spike on a chart, or a miraculous mouse. It forms into a hypothesis, and that hypothesis forms into a set of experiments. Every researcher’s workflow looks different, but each has the same goal of illuminating some new truth of biology. This unique path is part of what makes experimentation hard to automate, as there is not a standard set of tasks or steps. That initial “Eureka” moment so often relies on the observant scientist thinking outside of the box.

But despite the bespoke nature of research, there are some common tools to make it faster and more effective. Manual pipetting and stirring have given way to automated robots running experiments programmatically. Beside them, researchers leverage LIMS (laboratory information management systems) and ELNs (electronic lab notebooks) to keep track of results and share them with colleagues. And new foundational biology models aim to enable in-silico experimentation that will inform what experiments or parameters to try next.

A core part of experimentation is the data layer. Collected through every blood panel or clinical image, the amount of data generated in biotech is enormous and quickly growing. By some estimates, a single person’s whole genome scan can reach into the gigabytes even when storing only changes from a reference - multiply that by millions of people, dozens of different types of information, and add in images, and you can see the problem. “Omics”, the shorthand for the collective analysis of any biomolecule (genomics for genes, proteomics for proteins, etc.) has exploded, with new types and scales of data arising all the time, and it is often no longer feasible to analyze on a local computer.

Beyond sheer size, biotech data also suffers from other challenges. The data is often siloed between organizations, batch effects show up with different devices, and it can be hard to infer causality from correlational studies.

All of this data feeds into experiments in an iterative feedback loop. More data raises more questions, requiring more lab work or new data sources to answer. The goal of some biology ML models is to build “closed-loop lab automation,” where the AI is identifying experiments to run, collecting the data, and using that data to inform new experiments without any intervention.
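The shape of that loop is simple to sketch, even though real systems are far more sophisticated. In the toy version below, the "experiment" is a made-up function standing in for a lab assay, and the "AI" is just a greedy search that tries the untested conditions next to the current best. All names and numbers are hypothetical.

```python
def simulated_assay(temperature_c: int) -> float:
    """Stand-in for a real experiment: yield peaks at 37 C (an invented response)."""
    return 100.0 - (temperature_c - 37) ** 2


def closed_loop(run_experiment, start: int, step: int, max_rounds: int = 20):
    """Propose conditions, run them, and use the results to pick the next ones."""
    results = {start: run_experiment(start)}
    best = start
    for _ in range(max_rounds):
        # Propose: untested conditions on either side of the current best.
        candidates = [c for c in (best - step, best + step) if c not in results]
        if not candidates:
            break
        # Run: in a real lab this step would be a robot and an instrument.
        for condition in candidates:
            results[condition] = run_experiment(condition)
        # Learn: stop once no neighbor beats the current best.
        new_best = max(results, key=results.get)
        if new_best == best:
            break
        best = new_best
    return best, results


best_temperature, results = closed_loop(simulated_assay, start=25, step=2)
print(f"Best condition: {best_temperature} C after {len(results)} experiments")
```

Swapping the greedy search for a learned model and the simulated function for real instruments is where the actual difficulty lies, but the propose, run, learn cycle is the same.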

Whether automated or manual, the result of all this experimentation is the prototype of a product - a drug, a test, or a new approach.

Examples: Benchling (cloud R&D platform), Emerald Cloud Lab (Lab automation)

Validation

To ensure every new product is safe and effective, it must go through validation in humans, a process tightly controlled by the FDA.

For diagnostics, validation can mean experimentation across manufacturing sites, tissue types, or patients. Methods are also evaluated for their sensitivity and specificity, making sure they do more good than harm. In therapeutics, it most often means clinical trials. Running these trials is complex, requiring teams to manage the trial design, recruitment of potential patients, testing of the new product, and coordination between partners.
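Sensitivity and specificity have precise definitions, and a quick calculation shows why both matter. The counts below are invented for illustration: a hypothetical test that is 99% sensitive and 99% specific, applied to 10,000 people where 1% have the disease.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Share of people with the disease that the test correctly flags."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Share of healthy people that the test correctly clears."""
    return true_neg / (true_neg + false_pos)


def positive_predictive_value(sens: float, spec: float, prevalence: float) -> float:
    """Chance that a person with a positive result actually has the disease."""
    true_pos = sens * prevalence
    false_pos = (1 - spec) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


# Hypothetical screening of 10,000 people: 100 sick, 9,900 healthy.
sens = sensitivity(true_pos=99, false_neg=1)
spec = specificity(true_neg=9801, false_pos=99)
ppv = positive_predictive_value(sens, spec, prevalence=0.01)

print(f"Sensitivity {sens:.0%}, specificity {spec:.0%}, PPV {ppv:.0%}")
```

Even with 99% on both measures, only half of the positive results in this scenario are real, because healthy people vastly outnumber sick ones. This is why screening tests for rare conditions need extremely high specificity before they do more good than harm.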

Throughout the process, there are strict regulations and compliance requirements designed to make sure quality controls are put in place at each step of the pipeline, patient information is safely handled, and clinical decisions are made ethically.

Examples: Science 37 (clinical trial operations), Flatiron Health (patient and site selection)

Delivery

As with all new products, developing a new diagnostic or therapy is necessary but not sufficient, and the challenge of bringing it to market is complicated by a host of biotech-specific constraints.

FDA approval is often thought of as a binary decision, but it is just the start of an ongoing journey of approvals. Companies are required to monitor for rare adverse events and continual real-world effectiveness, considered the “phase 4” of a clinical trial. They also must maintain their quality control systems, which the FDA will periodically audit.

Selling a biotech product is very different from selling a consumer good. Though the patient is the one receiving the product, the clinician is often the one deciding whether to use it, and the insurer is the one covering it. This leads to a complex set of negotiations around pricing, coverage, and reimbursements, as well as a marketing challenge to disseminate information about new treatments. While patients ask Dr. Google (or increasingly, ChatGPT), clinicians are also leveraging tools to inform their clinical decisions. The U.S. and New Zealand are unique as the only two countries in the world to allow direct-to-consumer advertising of prescription drugs, as others want to avoid the potential for patient misinformation.

Even manufacturing a product can become quite complicated. One of the reasons Novo Nordisk may have fallen behind Eli Lilly in the diabetes/weight loss market despite being first is that it had substantial difficulties scaling up its manufacturing (though it didn’t help that Eli Lilly’s drug was also more effective). Manufacturing often involves custom buildouts of bioreactors, spinning up cold supply chains, and negotiations around key input ingredients. And regulation and compliance are even more strict here, ensuring anything that reaches patients maintains quality throughout the process.

Examples: Cellares (cell therapy manufacturing), Ketryx (AI for FDA compliance)

What Next?

This overview still only scratches the surface of biotech. Each paragraph above contains entire industries and enough depth to sustain teams of PhDs and engineers for decades. But if you’ve made it this far and you're still excited, go talk to someone! Communities like Bits in Bio are full of researchers, engineers, and founders eager to collaborate and share ideas. Many biotech companies are hiring, and open to software-first thinkers.

If you want to dive deeper, the next post in this series explores drug discovery in detail. In future posts, I’ll be diving deeper into specific biotech challenges, from clinical trial patterns to biology foundation models. Leave a comment with topics you want to learn more about! And if you're exploring this space too, I'd love to hear what you're building or learning. Reach out anytime.

Thank you to Alex Goldberg, Anton Krutiansky, Ken Leidal, Lina Garcia, Matteus Pan, Ryan Kagin, and Wojtek Swiderski for taking the time to review this post.