Introduction

How our genes work together to build our cells, organs and bodies, and how mutations in many genes contribute to disease remain fundamental questions in genetics. Despite more than a century of exploration, our inventory of the genes required by each of our cell types for their development and their specific functions within the body remains sparse, as is our view of the regulatory links between genes. Even setting aside the need to use laboratory animals as proxies for human biology, defining the genes essential for each cell type is a daunting prospect. Experiments must disrupt one or more genes within the animal and then assess the outcome across multiple cell types and potentially thousands of changes in gene expression at many different developmental time points.

Emerging measurement technologies based on sequencing1,2, imaging or both3 offer a route to opening this longstanding bottleneck in genetics. Single-cell transcriptome sequencing can be performed on millions of cells4 from thousands of independent samples2. Other aspects of the molecular states of cells, such as chromatin accessibility5 or DNA methylation6, can be measured in single cells at scale, often in conjunction with RNA7. Spatial molecular profiling is rapidly maturing, enabling surveys of the entire transcriptome across large fields of view8. Paralleling improvements in single-cell measurement technologies, improvements in the efficiency, precision and throughput of genome-engineering techniques now allow us to manipulate genes and cells in vivo9. Single-cell sequencing and CRISPR genome editing go hand in hand as a means of revealing how mutating a gene alters the expression of every other gene in each cell type. Systematically searching for genes that ‘phenocopy’ one another in terms of the molecular, anatomic or behavioural effects they elicit often reveals genes that function together through regulatory or biochemical interactions. Particularly in the context of large-scale perturbation studies10, ‘guilt-by-association’ analyses for finding sets of genes with similar phenotypes have proved invaluable to understand gene function, with whole biochemical complexes revealed11. Together, these technologies promise to generate an avalanche of data that captures the consequences of genome-scale experiments in which we measure the effects of our interventions on every gene in every cell type in a whole animal.

However, exploiting the full power of single-cell technologies to understand gene function demands thoughtful attention to fundamental statistical issues at the core of several recurring computational problems. A first problem is to catalogue and characterize the cell types based on differences in how they respond to perturbation. A second lies in discerning how each cell type regulates, and is regulated by, genes. A third is to understand how cell types descend from one another in the lineage or depend on each other via signalling. A fourth is how the genes depend on and regulate one another, and how that control varies within different cell types. And finally, a fifth problem is to integrate such knowledge to make accurate, quantitative forecasts. For example, what will happen in a mouse model of disease under yet-untested genetic, drug or environmental perturbations? What will happen in a specific patient when that person is prescribed a particular drug? Although not all these problems are purely or even primarily statistical in nature, attention to the statistical issues that arise when bringing single-cell technology to bear on them is key. Analogous statistical issues will arise when studying gene function with other measurement tools from the molecular to the anatomic scale.

A well formulated statistical model can do the following: weigh up the contribution of each of many input factors; transform qualitative phenotypes into quantitative ones; resolve technical limitations of the instrument that produces the data; or connect measurements made at various scales to capture or even forecast the behaviour of molecules, cells, tissues and whole organisms, including under perturbations and in new environments or contexts. Good models establish confidence that observed effects and phenotypes are ‘real’, can help to separate direct versus indirect effects, and can describe how variation in one variable propagates to or compounds variation in others, thus guiding subsequent experimental design decisions and saving time and laboratory resources.

In this Review, I highlight statistical concepts, models, tools and algorithms that can be repurposed to solve problems now arising or becoming increasingly urgent in genetic and molecular biology studies of development and disease. I outline the workflow for phenotyping with single-cell molecular profiling, including differential cell composition and gene expression analysis. I then introduce strategies for quantitatively modelling a gene’s regulation, for understanding lineage relationships and cell–cell signalling interactions, and for inferring gene regulatory networks. Finally, I turn to the problem of forecasting cell fate, briefly touching on the promise of advanced artificial intelligence systems for such problems. I do not discuss the relevant technologies in detail to avoid overlap with many recent reviews on single-cell sequencing, spatial molecular profiling and genome editing. I also avoid statistical subjects already very familiar to geneticists, such as genome-wide association studies and expression quantitative trait loci analysis.

Phenotyping at single-cell resolution

For many diseases, we know neither the principal cell types nor the genes involved, much less the mechanisms that might be targeted as part of therapy. Single-cell genomic experiments that compare healthy to diseased tissue or track the development of embryos over time typically aim to identify subpopulations of cells that differ between conditions or timepoints at the molecular level. For example, cells that exist only in diseased samples are of central interest, as are healthy cell types that are absent in diseased samples. In principle, comparing the transcriptomes of individual cells in one condition (for example, mutant or disease) against control cells (for example, wild-type or healthy) is straightforward: first, cluster the cells, then classify the clusters according to type based on the expression of discriminative marker genes. Next, count the number of cells of each type in each sample and compare these counts across sample groups to assess which cell types are more abundant in one group than another12. Finally, scrutinize and compare the genes specifically active in each of these states to extract insights about possible mechanisms of pathogenesis and progression. However, the nuances at each of these analysis steps demand attention to statistical issues and the adoption of best practices (reviewed recently13).

Cell-type annotation

Recognizing disease-specific changes in cell-type proportions or transcriptomes requires annotating each cell according to type (Fig. 1a). Classifying cells according to type is done by comparing their transcriptomes to annotated cells in reference cell atlases4,14,15,16,17,18,19. Several bioinformatic tools have been developed to open up this bottleneck20,21,22,23,24,25. Garnett uses elastic net regression26 to train a classifier that annotates cells in a new experiment based on prior single-cell data according to an ontology of cell types defined by a literature-derived set of markers for each cell type20. An alternative strategy matches each cell from a new experiment onto cells from an existing atlas with similar overall transcriptomes and then transfers any annotations from the reference (for example, cell type) onto each query cell27. This so-called label transfer is fast and straightforward but requires a reliably annotated reference atlas. The Human Cell Atlas and various model organism atlases are improving rapidly as more groups contribute data, but annotations are frequently updated, and cell types are still being catalogued, particularly in the context of disease.
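As a minimal illustration of label transfer, the sketch below projects a query dataset into a principal component space learnt from a hypothetical annotated reference and assigns each query cell the majority label of its nearest reference neighbours. This is a schematic of the idea rather than the implementation of any particular published tool, and all names and data are invented.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical log-normalized expression matrices (cells x genes): an annotated
# reference atlas and an unannotated query experiment.
rng = np.random.default_rng(0)
ref_expr = rng.poisson(1.0, size=(500, 200)).astype(float)
ref_labels = rng.choice(["T cell", "B cell", "monocyte"], size=500)
query_expr = rng.poisson(1.0, size=(100, 200)).astype(float)

# Project reference and query cells into a shared low-dimensional space.
pca = PCA(n_components=20).fit(ref_expr)
ref_pcs = pca.transform(ref_expr)
query_pcs = pca.transform(query_expr)

# Transfer each query cell's label from its nearest reference neighbours.
knn = KNeighborsClassifier(n_neighbors=15).fit(ref_pcs, ref_labels)
query_labels = knn.predict(query_pcs)
query_confidence = knn.predict_proba(query_pcs).max(axis=1)
```

Query cells with low neighbour agreement can be flagged as ‘unknown’ rather than forced into an existing category.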

Fig. 1: Regression analysis at single-cell resolution isolates cell types and genes central to pathobiology.
figure 1

a, A hypothetical experimental design with several cases (people with disease) and controls (people without disease). Peripheral blood mononuclear cells are collected from each individual and subjected to single-cell RNA sequencing (RNA-seq), identifying five different cell types. b, Clustering and cell type annotation reveal major cell types in proportions that vary across individuals of different ages. c, One cell type (activated T cells) seems to be more abundant in cases than in controls, but is the difference statistically significant even if we control for differences in cell type proportion as a function of age? d, Linear regression can model the proportion of a cell type relative to the others across samples. A first model describes the activated T cell proportions as a function of age and disease status. In a second model, disease status is excluded. These models can be compared with a likelihood ratio test to assess whether the improvement in the first model’s ability to explain variation in proportions across samples is enough to justify its additional complexity in terms of parameters. The plotted lines correspond to the fits from the alternative model. e, Individual genes can also be tested for variation across cells from different samples. A gene may vary within cells from patients treated with a drug as a function of time since therapy began, but the trend may differ between individuals, who may respond differently. These donor-specific trends should be accounted for during differential expression analysis. One approach is to use a mixed model that captures donor-specific ‘random effects’ in addition to the fixed effect of age and other experimental variables. The ‘(1 | patient)’ notation denotes a patient-specific random effect.

Differential cell composition

Once cells have been identified, one can ask, for example, whether their proportions vary by donor age (Fig. 1b), disease (Fig. 1c) or other study variables. There are two strategies for comparing abundances of each cell type across specimens. A first strategy is to count the cells according to type and then compare the counts across samples using a simple two-group test. However, two-group tests can have difficulty detecting differences as a function of one variable (for example, disease) while controlling for others (for example, age). A more flexible approach is to model the cell-type counts across groups via regression (Fig. 1d). Propeller is a recently introduced tool that compares the proportions of different cell types by regressing each cell type’s proportion in a sample on experimental variables (for example, case versus control or age)12. The advantages of regression-based analyses of cell proportions are that they are easy to implement and interpret, but a disadvantage is that they require accurate cell-type annotations. Moreover, cell ontologies necessarily impose arbitrary classifications on cells, specifying, for example, at what point a developing cell becomes ‘terminally’ differentiated.
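The regression logic of Fig. 1d can be sketched with a binomial generalized linear model of per-sample cell-type counts, comparing nested models with and without disease status via a likelihood ratio test. This is a generic GLM on invented counts, not the Propeller implementation.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy import stats

# Hypothetical per-sample counts of activated T cells out of all cells profiled.
df = pd.DataFrame({
    "t_act":   [120, 90, 200, 230, 60, 75, 210, 260],
    "total":   [1000, 950, 1100, 1200, 900, 980, 1050, 1150],
    "age":     [35, 42, 61, 68, 30, 38, 65, 72],
    "disease": [0, 0, 1, 1, 0, 0, 1, 1],
})
endog = np.column_stack([df["t_act"], df["total"] - df["t_act"]])  # (successes, failures)

# Full model: activated T cell proportion as a function of age and disease status.
full = sm.GLM(endog, sm.add_constant(df[["age", "disease"]]),
              family=sm.families.Binomial()).fit()
# Reduced model: age only.
reduced = sm.GLM(endog, sm.add_constant(df[["age"]]),
                 family=sm.families.Binomial()).fit()

# Likelihood ratio test: is the extra disease-status parameter justified?
lr = 2 * (full.llf - reduced.llf)
p_value = stats.chi2.sf(lr, df=1)
print(f"likelihood ratio test for disease status: P = {p_value:.3g}")
```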

Several methods have emerged to navigate the trade-offs between resolution and interpretability, which become especially fraught with sparse single-cell data, when testing for enrichment or changes in cell proportions. SEACells groups cells into ‘metacells’, between which biological variation is preserved but technical variation is minimized28. SCAVENGE relies on network propagation to associate genetic variants with distinct molecular states in which they are enriched29. Uncertainty surrounding cell annotation can be at least partly accounted for in such models via bootstrap resampling30, at additional computational expense. MELD31 and Milo32 are alternative approaches that model the distribution of cells throughout the high-dimensional space of possible transcriptomes and then test whether sample groups are distributed differently within it. The advantage of such an approach is that it does not require prior clustering or annotation; it is therefore robust to the parameter choices of clustering algorithms and can work in the absence of an accurately annotated reference. A disadvantage is that more complex contrasts (for example, mutant versus wild type, controlling for differences in age) can be more complicated to specify and compute.

Differential gene expression

A third major statistical challenge arises when comparing cells of different types or states. What genes distinguish each cell type? What, if anything, is different about nominally similar cells from different donors or treatment groups, even after accounting for systematic technical effects shared by ‘batches’ of specimens? Such comparisons are central, for example, to identifying disease biomarkers or characterizing the genes that are dysregulated in disease-specific cell states relative to the healthy ones from which they emerged. For example, differential gene expression of single nuclei from mice exposed to silica dust revealed that lung macrophages activate genes normally expressed by osteoclasts, including proteases and other molecules that damage the lung33. In such analyses, it is essential to compare the levels of genes in cells with a statistical model and test that is well suited to the data34. For simple two-group comparisons, a Wilcoxon rank-sum test (also known as a Mann–Whitney U test) or similar non-parametric approach can accurately capture genuine differences while controlling for false discoveries that may arise from sampling error, but with the same limitation as in differential cell-composition analysis: contrasts involving more than one variable are difficult to specify. For example, one might want to compare the kinetics of a gene over time in two different strains, or test whether there is an age-dependent difference in how a particular cell type responds to a drug treatment (Fig. 1e). Generalized linear models (GLMs) can test for genes that vary over one or more experimental variables (having controlled for the others). GLMs treat each observation (that is, each cell) as independent, but this assumption is not met in most single-cell RNA sequencing (RNA-seq) experiments because cells from the same specimen are not fully independent from one another. If this ‘grouping structure’ among the observations is not accounted for, the analysis will return many false positives. There are two ways of accounting for grouping structure in single-cell datasets: one approach is to use generalized linear mixed models (GLMMs) that allow parameters (for example, age or dose effects) to differ between groups of cells. A second approach is to aggregate data from individual cells of the same type and from the same sample, to create ‘pseudobulk’ profiles34,35 that can be analysed with conventional RNA-seq tools36,37. Pseudobulking is straightforward but requires accurate cell-type annotations. GLMMs do not require annotations but are much more computationally expensive to fit.
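To make the two strategies concrete, the sketch below aggregates hypothetical per-cell counts into pseudobulk profiles and fits a negative binomial GLM across samples, then fits a linear mixed model with a patient-specific random intercept at single-cell resolution (a Gaussian approximation on log-normalized values, standing in for a full GLMM). The data, column names and effect sizes are invented.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)

# Hypothetical per-cell table: counts of one gene plus metadata for each cell.
cells = pd.DataFrame({
    "gene_count": rng.poisson(2.0, size=600),
    "total_counts": rng.integers(2000, 6000, size=600),
    "cell_type": rng.choice(["T cell", "B cell"], size=600),
    "patient": rng.choice([f"P{i}" for i in range(6)], size=600),
    "weeks_on_drug": rng.choice([0, 4, 8, 12], size=600),
})

# Strategy 1 (pseudobulk): sum counts per patient and cell type, then model the
# aggregated counts across samples with a negative binomial GLM and a depth offset.
pb = (cells.groupby(["patient", "cell_type"], as_index=False)
           .agg(gene=("gene_count", "sum"),
                depth=("total_counts", "sum"),
                weeks=("weeks_on_drug", "mean")))
nb_fit = smf.glm("gene ~ weeks + cell_type", data=pb,
                 family=sm.families.NegativeBinomial(),
                 offset=np.log(pb["depth"])).fit()

# Strategy 2 (mixed model): single-cell resolution with a patient-specific random
# intercept, here a Gaussian approximation on log-normalized expression rather
# than a full generalized linear mixed model.
cells["log_expr"] = np.log1p(1e4 * cells["gene_count"] / cells["total_counts"])
mm_fit = smf.mixedlm("log_expr ~ weeks_on_drug + cell_type",
                     data=cells, groups=cells["patient"]).fit()
print(nb_fit.summary())
print(mm_fit.summary())
```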

Because cell types are often defined on the basis of how they cluster, and cells are clustered according to how similar their transcriptomes are, care must be taken not to ‘double dip’. Using exactly the same data both to define clusters and to characterize their differences risks reporting spurious biomarkers of novel cell types38. Two solutions have been proposed for studies that seek both to discover new cell types and to characterize their molecular markers38. In the first, termed ‘sample splitting’, cells from a first group of samples are clustered according to transcriptomic similarity. Then, cells from a second group of samples, which were not used to define the clusters, are assigned to the most similar cluster. Finally, the cells from this second group are used for differential expression analysis to define markers. This approach is simple and does not require specialized statistical procedures, but the experiment must collect enough samples that they can be split between the two stages. A second approach, called data thinning, splits the sequencing data from each cell into two statistically independent groups, the first of which is used for clustering and the second for differential expression39,40, avoiding the ‘double dip’ without adding to the overall cost of the experiment.
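The idea behind data thinning can be sketched with binomial ‘count splitting’: under a Poisson model of the raw counts, the two resulting matrices are statistically independent, so one can be used for clustering and the other for marker testing. This is a schematic of the principle, not the implementation of the cited methods.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical raw UMI count matrix (cells x genes).
counts = rng.poisson(1.5, size=(1000, 500))

# Binomial thinning: if each entry is Poisson, the two pieces below are
# statistically independent, so clusters defined on 'train' can be tested
# for marker genes on 'test' without double dipping.
eps = 0.5
train = rng.binomial(counts, eps)   # use for clustering and cell-type discovery
test = counts - train               # reserve for differential expression
```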

Modelling gene regulation

A useful model of gene expression can help to discern the contribution of several molecular inputs across diverse genetic, environmental, developmental and disease contexts (Fig. 2a). How much mRNA is transcribed from a gene, given a cell’s concentrations of its upstream kinases and transcription factors? How active is protein signalling in this cell? What non-coding sequences determine whether a gene is expressed in each cell type, and are they accessible for binding in the cell? How do mutations within those sequences alter the gene’s output?

Fig. 2: Statistical models of gene expression aim to quantify relative contributions of regulatory DNA sequence, proteins and signals to a gene’s mRNA and protein output.
figure 2

a, The transcription of a hypothetical gene with diverse molecular inputs that are integrated to determine the rate of RNA production. Binding of a ligand (dark purple) of varying concentration to a receptor (light purple) of varying surface expression triggers signalling through kinases A and B in a pathway that ultimately leads to the phosphorylation (P) and translocation of a transcription factor (TF) to the nucleus. The nuclear concentration of this factor determines the probability that it occupies the regulatory DNA of the gene, which is also a function of how accessible chromatin is around that DNA, and whether it contains any sequence variants that might alter the factor’s binding affinity. b, A linear regression model for RNA production for the locus in panel a, which assumes that each input has an additive effect on expression. c, A Bayesian network model of RNA production from the locus, which describes the conditional dependencies between the various inputs.

Multi-modal single-cell assays are increasingly powering quantitative descriptions of how individual genes are regulated across the molecular layers of the central dogma41. For example, CITE-seq co-assays cellular transcriptomes and their immunophenotypes via oligo-conjugated antibodies that can be detected in single-cell RNA-seq (scRNA-seq) libraries42. ASAP-seq co-assays the immunophenotypes of cells with their chromatin accessibility profile via scATAC-seq43. Several RNA and chromatin accessibility co-assays have been reported7,44, enabling the correlation of gene expression with the accessibility of nearby regulatory DNA elements (for example, putative enhancers).

A quantitative model of a given gene’s transcription aims to find a mathematical function that accurately estimates the number of its mRNA or protein products present in the cell under different conditions. This function must somehow capture the molecular biology underlying the central dogma. In their 1961 classic paper on lactose metabolism in Escherichia coli, Jacob and Monod showed that the kinetics of a gene could be used as a quantitative phenotype and for causal inference regarding its control45. In the decades since, many others have returned to this system as a testbed for new ideas in quantitative modelling of individual genes through mathematical functions of chemical reaction rates, the affinities of transcription factors for their DNA binding sites and other biochemical properties46. Despite the rigour, sophistication and utility of such models for discriminating between possible mechanisms of gene regulation, an inescapable conclusion of these efforts is that mathematical equations that describe the kinetics of mRNA synthesis even for simple genes become intractable when many inputs must be considered.

An alternative approach to modelling a gene’s expression is to use numerical optimization to learn more abstract functions that nevertheless accurately predict its output from its many inputs. These functions may not be explicitly specified in biochemical terms but can still quantify the contributions of genetic mutations, epigenetic states and cell-type-specific signalling, among other variables (Fig. 2b). With the inputs, output and general form of a given gene’s model specified, the computer can search through the space of all such functions to find the one that best explains the multi-modal data using an appropriate numerical optimization algorithm.

Statistical models can vary tremendously in their complexity, with increasing fidelity to the underlying molecular biology. For example, a simple linear model will describe how the amount of mRNA changes with addition or subtraction of each input. Such models are straightforward to interpret but have difficulty describing how inputs interact, and so will not detect, for example, how transcription factors cooperate to activate a gene or block each other from doing so. The function can be augmented with terms that explicitly model the interaction between explanatory variables, albeit at a potentially substantial cost in terms of interpretability and computational burden. Graphical modelling approaches such as Bayesian networks can capture complex interdependencies between the inputs (Fig. 2c). For example, suppose a gene’s mRNA levels can be predicted accurately from measurements of a particular signalling pathway’s activity or from the nuclear abundance of a key upstream transcription factor whose translocation to the nucleus depends on that pathway (Fig. 2a). A Bayesian network could learn that the signalling measurements are conditionally independent of the mRNA levels, given the transcription factor levels, and so provide no additional predictive value.

Especially when the inputs depend on one another, more complex models are not necessarily better: although adding terms will always explain more variance in training data, those terms may actually hurt performance on ‘held-out’ test data (that is, data not used to train model parameters). Model selection, the formal assessment of whether adding explanatory variables and dependencies can be justified in terms of how well a model explains data, is critical for quantifying the contributions different inputs make to gene regulation. Fortunately, the statistical literature is rich in tools for assessing whether a complex model is significantly better than a simpler one ‘nested’ within it (for example, likelihood ratio tests), for selecting a parsimonious set of inputs from a large collection of potentially correlated ones (for example, LASSO47) and for measuring overall model complexity (for example, the Akaike information criterion48).
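A minimal sketch of these model selection tools applied to a hypothetical gene: nested linear models with and without an interaction term are compared by a likelihood ratio test and the Akaike information criterion, and LASSO selects a sparse set of inputs from a larger, partly irrelevant collection. All data here are simulated.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(3)
n = 300

# Simulated single-cell measurements for one gene and two candidate inputs.
tf = rng.normal(size=n)        # nuclear transcription factor abundance
access = rng.normal(size=n)    # chromatin accessibility at the locus
mrna = 1.0 + 0.8 * tf + 0.5 * access + 0.6 * tf * access \
       + rng.normal(scale=0.5, size=n)
df = pd.DataFrame({"mrna": mrna, "tf": tf, "access": access})

# Nested linear models: additive only versus additive plus interaction.
additive = smf.ols("mrna ~ tf + access", data=df).fit()
interact = smf.ols("mrna ~ tf * access", data=df).fit()

# Likelihood ratio test and AIC: is the extra interaction term justified?
lr = 2 * (interact.llf - additive.llf)
p = stats.chi2.sf(lr, df=1)
print(f"LRT P = {p:.2e}; AIC additive = {additive.aic:.1f}, "
      f"AIC interaction = {interact.aic:.1f}")

# LASSO selects a parsimonious subset from many (partly irrelevant) inputs.
many_inputs = np.column_stack([tf, access, tf * access, rng.normal(size=(n, 20))])
lasso = LassoCV(cv=5).fit(many_inputs, mrna)
print("inputs retained by LASSO:", np.flatnonzero(lasso.coef_))
```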

Statistical models of gene regulation can be especially powerful in conjunction with genetic experiments. For example, the bone morphogenetic protein (BMP) pathway transduces signals through multiple ligands bound by receptors composed of multiple subunits. A recent quantitative analysis determined how ligand and receptor compositions and stoichiometry define how cells transduce BMP into diverse patterns of downstream gene expression49. The statistical methods developed as part of the study discriminated between additive, synergistic and antagonistic effects between the ligands and the receptor subunits. Modulating BMP receptor composition by knocking out subunits altered how cells responded to BMP, both quantitatively in levels of downstream expression and also qualitatively, with one cell type made to resemble another. This study demonstrates that combining genetic perturbations with statistical modelling to discriminate between possible functions for genes using the quantitative kinetics of the cellular behaviours they control remains as powerful today as it was when Jacob and Monod used it more than 60 years ago.

Disentangling interactions and dependencies

How our many diverse cells arise from a single cell, how they communicate with each other, and how they depend on one another to form organized, structured tissue are fundamental questions in developmental biology. What genes does each cell use to sense and influence its neighbours? What genes does it use to ensure its daughters are properly positioned and specified? Genes that function non-autonomously through cell–cell interactions or by conditioning a cell (or its descendants) to behave a certain way in the future are notoriously difficult to study. Although understanding the lineal and signalling dependencies between cell types may seem like disparate goals, new tools for studying cells across space and time promise to decode these parts of the genetic programme.

Spatial relationships

Identifying the specific proteins, peptides and small molecules that drive cell-fate decisions is challenging. Highly multiplexed spatial transcriptomics and proteomic techniques (reviewed recently50,51) have emerged as a way of measuring heterogeneity across cells within a tissue at the molecular level, promising to characterize the molecular basis of signalling dependencies between cell types, and one day may enable quantitative spatiotemporal models of tissue morphogenesis or tumour growth and invasion. Tools from spatial statistics are increasingly being deployed to tackle common tasks in spatial transcriptomic or proteomic analyses. For example, a recent analysis of the mouse liver combined spatial transcriptomics with scRNA-seq to statistically deconvolve the spatial data, estimating each gene’s expression within each cell in the tissue; to automatically annotate veins in the image and classify them as central or portal; and to identify modules of spatially autocorrelated genes within hepatocytes that vary as a function of position along the lobular axis and proximity to portal veins52. A separate study used molecular cartography to compare livers from wild-type and Wnt2/Wnt9b double-knockout mice, finding these genes to be required for properly zoned gene expression and liver regeneration following acetaminophen injury53. These two examples highlight how new technologies and rigorous spatial statistics, coupled with genetic and drug perturbations, facilitate causal inferences about the role of individual genes in cell–cell interactions and molecular phenotypes.

The central statistical challenge associated with spatial molecular profiling is to quantify the contribution that the locations of cells within a tissue make to explaining variation in their forms and functions. This challenge is fundamental to answering diverse questions in cell, molecular and developmental biology. Are T cells that have infiltrated a tumour different from those at the margin? How does distance from an organizer and the morphogens it emits correlate with a cell’s fate in the embryo? Spatial statistics is rich with parametric and nonparametric methods of exploring such questions, which frequently arise in demography, geology, meteorology, oceanography, ecology and other disciplines that analyse measurements over maps and volumes. Consider a hypothetical spatial molecular profiling experiment aiming to identify ligand–receptor interactions that determine the expression of a downstream gene through signalling (Fig. 3). Statistical models could be used to determine whether variation in target gene T across the tissue can be explained as a function of the local concentration of ligands A and B, conditional on whether the receptor is also expressed. The variance explained by these models can be taken as the strength of evidence for interactions between each ligand and the receptor. However, as more candidate ligands are considered, model selection can be burdensome as the number of candidate models explodes.
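A sketch of the model comparison in Fig. 3c, with invented per-cell values: each candidate model explains a target gene’s expression from the neighbourhood concentration of a ligand, with or without conditioning on receptor expression, and the fits are ranked by variance explained and AIC.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n_cells = 400

# Hypothetical per-cell table from a spatial assay: a target gene, receptor
# expression and the neighbourhood-averaged concentration of two candidate ligands.
cells = pd.DataFrame({
    "ligand_a": rng.gamma(2.0, size=n_cells),
    "ligand_b": rng.gamma(2.0, size=n_cells),
    "receptor": rng.binomial(1, 0.5, size=n_cells),
})
cells["target"] = (0.2 + 1.5 * cells["ligand_a"] * cells["receptor"]
                   + rng.normal(scale=0.3, size=n_cells))

# Candidate models: each ligand alone, and each ligand conditional on the receptor.
candidates = {
    "ligand A":            "target ~ ligand_a",
    "ligand B":            "target ~ ligand_b",
    "ligand A x receptor": "target ~ ligand_a * receptor",
    "ligand B x receptor": "target ~ ligand_b * receptor",
}
for name, formula in candidates.items():
    fit = smf.ols(formula, data=cells).fit()
    print(f"{name:>20}: adjusted R2 = {fit.rsquared_adj:.2f}, AIC = {fit.aic:.1f}")
```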

Fig. 3: Identifying molecular mediators of cell–cell interaction with spatial statistics.
figure 3

a, Multiplexed imaging or spatial transcriptomics collects measurements of each gene at many positions in a field of view or tissue section. Using Gaussian process regression or ‘kriging’ produces an estimate of each gene’s expression even at coordinates that were not directly measured. b, Measurements from one gene can be correlated with another across space to identify genes that vary as a function of concentration of ligands and other key regulatory molecules. c, A set of spatial regression models that aim to explain variation in a target gene’s expression across cells as a function of one or more candidate extrinsic signalling factors. *** indicates statistical significance. Pseudo-R2 is a measure of the variability in a dependent variable explained by a model of several independent variables. d, Ligands can in principle be identified for receptors by evaluating the amount of variation they explain in targets known or suspected to be downstream of those receptors. n.s., not significant.

Statistical methods are also critical for overcoming limitations in the underlying measurement technologies. Techniques such as Slide-seq54 that release mRNA from tissues and then blot it onto a substrate prior to library construction are not, strictly speaking, single-cell techniques, in that libraries correspond to spots on a spatial grid contributed by multiple (and sometimes many) cells of different types and in different functional states. There are now numerous algorithms for deconvolving the spots under various mixture models for how individual cells contribute their diffused mRNA to each nearby spot55,56. Techniques that extract single cells or nuclei prior to sequencing such as sci-Space57 avoid this issue, but they face others: because cells are physically extracted from tissue prior to single-cell sequencing, the original tissue coordinates of those cells must be triangulated (again, with uncertainty) from the sequencing data. Triangulating cells can also be formulated as a mixture modelling problem in which each cell is labelled according to its relative proximity to a reference point within the assay’s spatial coordinate system. A second problem common to many spatial molecular profiling assays is that measurements for some probes or cells may be missing at some coordinates. Fortunately, there is a rich statistical literature on how to estimate missing measurements through interpolation. One classic approach to the spatial interpolation problem is ‘kriging’, whereby a ‘variogram’ is constructed that describes the correlation in measurements collected at two points based on the distance between them, enabling one to predict values at unsampled coordinates based on sampled ones. A third problem common to any analysis of tissue sections is how to register multiple sections from the same specimen into a common, three-dimensional coordinate framework. Here too, spatial statistical methods have proven invaluable, especially in conjunction with deep neural networks, for aligning sections against one another in three dimensions58.
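As an illustration of the interpolation step, the sketch below uses Gaussian process regression (closely related to kriging; the kernel length scale plays the role of the variogram’s range and the white-noise term the nugget) to predict a gene’s expression, with uncertainty, at coordinates that were not measured. Coordinates and expression values are simulated.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(5)

# Hypothetical measured spots: (x, y) coordinates and one gene's expression.
coords = rng.uniform(0, 100, size=(300, 2))
expression = np.sin(coords[:, 0] / 15.0) + 0.1 * rng.normal(size=300)

# Gaussian process regression: the kernel's length scale plays the role of the
# variogram's range, and the white-noise term the nugget.
kernel = 1.0 * RBF(length_scale=10.0) + WhiteKernel(noise_level=0.1)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(coords, expression)

# Predict expression, with uncertainty, at coordinates that were not measured.
unmeasured = np.array([[25.0, 40.0], [60.0, 10.0], [90.0, 75.0]])
mean, sd = gp.predict(unmeasured, return_std=True)
```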

Lineage relationships

In some organisms, the lineage relationships between cells can be ascertained simply by observing them under the microscope. Sulston and colleagues characterized the entire cell lineage of the nematode Caenorhabditis elegans by meticulously documenting individual cell-division events in worm embryos59. The worm is ideal for such an effort because it is small and has an invariant lineage that over developmental time follows a deterministic genetic programme, so that observations of different embryos can be registered into a unified ‘coordinate system’ of the lineage. Unlike the worm, most animals do not have invariant lineages; even genetically identical animals do not contain exactly the same number of cells, of exactly the same cell types, positioned in exactly the same ways in their tissues and organs. Experiments that interrogate the cells arising from a population marked by a particular gene (assuming a suitable reporter could even be designed) leave the remainder of the animal unmapped, and so the lineal origins of many cell types remain unclear.

Emerging genomic and imaging technologies for lineage tracing offer a path to comprehensively characterizing the lineage relationships between cells at the whole-animal scale (reviewed recently60,61). These technologies draw from two basic strategies for connecting cells to their descendants through developmental time. A first class of prospective lineage tracing strategies labels cells by manipulating their genomes (for example, with unique barcodes), allowing them to expand, sequencing the barcodes of their descendants, and then associating each descendant with its ancestor through these barcodes. A second, retrospective set of strategies scrutinizes a population of cells and reconstructs their ancestry through phylogenetic analysis of their shared genetic variants. Recently, several techniques that draw from both of these strategies have been developed, such as GESTALT, which uses CRISPR–Cas9 to introduce cumulative edits to a synthetic construct installed in the genome of developing zebrafish62 (Table 1). As the zebrafish embryo develops, cells accumulate unique patterns of repair to this construct, prospectively and continuously labelling ever-smaller clades of cells (Fig. 4a). On the assumption that descendants will share the genomic scars formed when their shared ancestors received edits, their lineage relationships can be inferred by reconstructing a lineage tree that minimizes the amount of editing that would be needed to generate their ‘alleles’ (for example, maximum parsimony). Intense efforts are underway to improve the depth and breadth of lineage recording capacity, to co-capture other measurements (such as the transcriptome), to incorporate spatial information and to deploy recorders in animal models. For example, the DARLIN mouse, which records cellular lineage via an inducible Cas9-barcoding system, revealed that clonal memory in haematopoietic stem cells is associated more with DNA methylation than with chromatin accessibility or the transcriptome63. Further experiments, possibly involving mutants on the DARLIN background, will be required to identify the genes that mediate memory through DNA methylation.
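To make the reconstruction step concrete, the sketch below clusters cells hierarchically on the fraction of target sites at which their recorded ‘alleles’ differ, a crude stand-in for the maximum parsimony and likelihood-based reconstructions used by dedicated lineage tools. The allele table is invented.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, to_tree

# Hypothetical allele table: rows are cells, columns are CRISPR target sites;
# 0 means unedited and positive integers index distinct insertion/deletion scars.
alleles = np.array([
    [1, 0, 3, 0, 0],
    [1, 0, 3, 2, 0],
    [1, 4, 0, 0, 0],
    [1, 4, 0, 0, 5],
    [6, 0, 0, 0, 0],
])

# Cells sharing scars are presumed to share ancestors: cluster cells on the
# fraction of target sites at which their recorded alleles differ.
distances = pdist(alleles, metric="hamming")
dendrogram = linkage(distances, method="average")
root = to_tree(dendrogram)  # a binary tree approximating the cell lineage
```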

Table 1 Emerging genomics technologies that enable statistical inference of gene function
Fig. 4: Statistical inference of cell lineage relationships during development using molecular recorders.
figure 4

a, Molecular lineage recorders work by first introducing synthetic sequences into the genome and then targeting those sequences with genome-editing reagents (such as CRISPR–Cas9) to introduce mutations in them as the embryo develops. Each cell accumulates a distinct (ideally unique) pattern of mutations, producing an embryo that is massively mosaic at the target locus. The ‘alleles’ can be used to infer a phylogenetic tree corresponding to the cell lineage. b, For organisms without a deterministic lineage programme, the cell lineage tree for each individual will differ, and how to produce a ‘consensus lineage’ for a population of individuals from the same species is an unsolved problem. However, coalescent theory offers a conceptual roadmap for building such consensus lineages.

Understanding how animal genomes sweep out a lineage that is variable without compromising reproducibility and robustness during development is a central aim of molecular recording technology. Realizing this goal will require developing statistical models that describe and quantify variability in the lineages of many individual embryos. Statistical methods from phylogenetics and population genetics offer guidance on how to approach the problem of inferring lineage relationships from molecular recorders. In population genetics, coalescent theory formulates models of how alleles observed in a population arose from a common ancestral allele over successive generations64 (Fig. 4b). Although coalescent theory rests on assumptions that may not hold for the synthetic alleles introduced by molecular recorders, such as the assumption that edits are introduced in cells at a constant rate, much of the mathematical and statistical infrastructure may be reusable. Elaborations to coalescent theory capture phenomena in evolutionary histories that have analogues in cell and developmental biology, such as bottlenecks (programmed death) and dispersal (cell migration). A coalescent theory of molecular recorder experiments would enable the use of inferred cell histories as quantitative phenotypes. Phenotyping with molecular recorders would then allow the question of how genomes generate reproducible yet stochastic cell lineages during development to be tackled directly.

Dissecting gene circuits

The fate decisions, lineal histories, physical positions and molecular messaging of cells in the embryo are ultimately determined by genes and by how those genes are regulated. Systematically or even automatically inferring how genes regulate one another has long been one of the most difficult challenges in computational biology. With each new generation of genomic measurement technologies, new algorithms emerge to exploit them, hoping to realize the promise of data-driven dissection of genetic programmes in development, disease and other contexts.

The problem of gene network reconstruction has been formulated in many ways, most ambitiously as a problem in genome-scale causal inference: given the expression levels of all genes in many cells or samples, infer the direct regulatory relationships between the genes (Fig. 5). A statistical approach to this problem models each gene as a dependent variable explained by all others, as well as any extrinsic factors such as drug treatments or environmental stresses (Fig. 5a). However, a pattern of correlation between two genes is insufficient to declare one a regulator of the other. As more pairs of correlated genes are considered, one must discriminate between an exploding number of regulatory architectures consistent with the correlations (Fig. 5b); for a genome with 20,000 genes, there are 400 million possible regulator–target pairs to consider. Even to conclude that two genes are correlated and that changes in one follow the other requires some care because sparse count data can bias simple measures of relatedness65 (for example, Pearson’s r). But the greater challenge lies in distinguishing causal links from correlations. Casting John Stuart Mill’s classic criteria for causal inference66 in genetic terms, defending a regulatory link between one gene A and another gene B requires establishing that changes in A precede B, that measurements of A are correlated with B, and that no other factor, neither a third gene C nor some external variable, explains changes in both A and B. That is, are two genes still correlated, even if one accounts for co-variation with every other gene in the genome? Computing these conditional dependencies massively increases both the computational burden and the amount of data needed for accurate estimates. There are so many parameters to estimate that a naive approach will produce either a uselessly inaccurate result or, after accounting for multiple testing, no result at all.
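The residual-based logic of Fig. 5c can be sketched as follows: two genes that appear strongly correlated are no longer correlated once each is regressed on a shared driver and a measured stress covariate, arguing against a direct regulatory link. All values here are simulated.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(7)
n = 500

# Simulated cells in which a stress response drives both gene A and gene B,
# and gene C (a transcription factor) also contributes to A.
stress = rng.normal(size=n)
c = rng.normal(size=n)
a = 0.9 * stress + 0.8 * c + rng.normal(scale=0.5, size=n)
b = 0.9 * stress + rng.normal(scale=0.5, size=n)
df = pd.DataFrame({"a": a, "b": b, "c": c, "stress": stress})

# Marginal correlation suggests that A and B regulate one another...
print("marginal r:", stats.pearsonr(df["a"], df["b"])[0])

# ...but the partial correlation, computed from the residuals of A and B after
# regressing each on C and the measured stress covariate, vanishes once the
# shared drivers are controlled for.
res_a = smf.ols("a ~ c + stress", data=df).fit().resid
res_b = smf.ols("b ~ c + stress", data=df).fit().resid
print("partial r:", stats.pearsonr(res_a, res_b)[0])
```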

Fig. 5: Analysing conditional dependence relationships between genes and other experimental factors in single-cell data can reveal regulatory interactions between them.
figure 5

a, In a hypothetical analysis of co-expression between three genes A, B and C across cells, positive correlations between them suggest that they may regulate one another. However, the drivers for variation in the expression of these genes could be some external factor, such as stress. b, Three possible regulatory architectures that are consistent with the correlations shown in panel a. c, Fitting regression models that correlate the residuals of two genes, having controlled for the third, along with any other measured experimental variables, eliminates confounders and discriminates between the three possible architectures. TF, transcription factor.

Three main statistical strategies make this causal inference problem more tractable, each served by a suite of new technologies. A first strategy is to reduce the number of genes or gene–gene pairs in the inference by ruling out genes that cannot regulate others or by declaring some gene–gene interactions to be mechanistically implausible. For example, ‘structural genes’, such as actins, do not regulate most other genes. Focusing only on transcriptional networks shrinks the set of putative regulatory genes to transcription factors and imposes the requirement that such regulators recognize their target genes through cognate binding sequences in flanking non-coding DNA. This strategy (reviewed recently67) is served by multi-modal single-cell assays that read out not only expression levels (a measure of gene output) but also chromatin accessibility, histone modifications or intracellular signalling (all measures of gene input). A second strategy is to perturb the genes (for example, through CRISPR68,69,70,71), which establishes (or excludes) their roles as causal regulators of other genes. Expansions of this concept can be used to explore the vast space of combinations of knockouts efficiently72, how genetic mutations interact with drug treatments73 and how genes determine how cells interact with each other74. An important limitation in such screens is that not all cells receive functional edits. Statistical tools75, improved genome editing76 and direct genotyping protocols77 could provide clearer genotype-to-phenotype associations. A third strategy is to collect vast amounts of data, from diverse conditions, to isolate the most informative correlations. Major improvements to single-cell protocols now enable the analysis of cells from many samples2,78,79,80, each perturbed in different ways, in the same experiment, opening the door to screens and systematic reverse genetic studies.

Forecasting cell fate

Understanding what fates and functions cells will adopt in the future is central in both development and disease. Single-cell sequencing experiments carried out over time can capture a nearly continuous view of how cells regulate genes as they differentiate or enter pathological states. Among the first applications of single-cell RNA-seq towards understanding development was trajectory inference, the computational reconstruction of transcriptional programmes that cells execute as they differentiate81,82. Trajectory inference algorithms83 organize differentiating cells in pseudotime according to their maturity and fate decisions (Fig. 6a), enabling one to construct statistical models that anticipate the kinetics of each gene in each cell type. These algorithms were the starting point for more ambitious efforts to forecast the transcriptomes and proportions of cell types in developing tissues, including those subjected to genetic, environmental and other perturbations (Fig. 6b,c).
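As a sketch of modelling a gene’s kinetics along an inferred trajectory (the pseudotime values here are simulated rather than produced by a trajectory inference tool), a Poisson GLM with a spline of pseudotime can be tested against a flat model in which expression never changes.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(8)
n = 400

# Hypothetical cells ordered in pseudotime along a differentiation trajectory.
pseudotime = np.sort(rng.uniform(0, 1, size=n))
counts = rng.poisson(np.exp(1.0 + 2.0 * np.sin(np.pi * pseudotime)))
df = pd.DataFrame({"counts": counts, "pt": pseudotime})

# Model the gene's kinetics as a smooth (spline) function of pseudotime with a
# Poisson GLM, and test it against a flat model.
smooth = smf.glm("counts ~ bs(pt, df=4)", data=df,
                 family=sm.families.Poisson()).fit()
flat = smf.glm("counts ~ 1", data=df, family=sm.families.Poisson()).fit()
lr = 2 * (smooth.llf - flat.llf)
p = stats.chi2.sf(lr, df=smooth.df_model - flat.df_model)
print(f"pseudotime-dependent expression: P = {p:.2e}")
```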

Fig. 6: Forecasting cell fates, states and phenotypes.
figure 6

a, Visualizing cells undergoing fate decisions by embedding transcriptionally similar cells near one another in two or three dimensions can reveal molecular trajectories that capture the events leading up to and following their fate decisions, but may require time series experiments to orient the trajectory. b, Comparing recently synthesized, incompletely processed transcripts to their mature counterparts (RNA velocity) anticipates each cell’s near-term future transcriptional state. Direct RNA labelling experimentally distinguishes more and less recently synthesized transcripts to achieve similar ends. Either approach can provide a forecast of cells’ future states. c, Inferred gene regulatory networks (GRNs) can be used to forecast the effects of knocking out genes in the network on the distribution of cells over the trajectory, losses of specific fates (indicated by red arrows) and changes in individual genes that would otherwise be expected in the wild type. d, Foundation models can be trained on a large single-cell data corpus (for example, the Human Cell Atlas) and then adapted to perform numerous statistical inference and forecasting tasks. ΔA and ΔB, mutant genotypes lacking genes A and B, respectively; NA, not applicable; TF, transcription factor.

Statistical methods for time series analysis aim to predict the future state of a system using its past and current states. For example, autoregressive models, which learn correlation structure between present data and the recent past to predict the future84, are widely applied in econometrics and to forecast financial markets. Gaussian process models elaborate this concept and have been used to pinpoint the moment individual genes begin to undergo divergent expression as bipotent progenitors differentiate85. However, because current single-cell sequencing technologies are destructive, they do not repeatedly sample the same individual cells, limiting the amount of auto-correlation in the data86.
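For intuition, an autoregressive model of order one can be fitted by least squares to a hypothetical expression series measured at successive time points and then iterated forwards to produce a forecast; the series below is simulated.

```python
import numpy as np

rng = np.random.default_rng(9)

# Hypothetical expression of one gene averaged over cells captured at
# successive (pseudo)time points.
t = 50
x = np.zeros(t)
for i in range(1, t):
    x[i] = 0.8 * x[i - 1] + rng.normal(scale=0.2)

# Fit an AR(1) model, x[t] = a + b * x[t - 1] + noise, by least squares,
# then iterate it forwards to forecast the next few time points.
design = np.column_stack([np.ones(t - 1), x[:-1]])
a, b = np.linalg.lstsq(design, x[1:], rcond=None)[0]

forecast = [x[-1]]
for _ in range(5):
    forecast.append(a + b * forecast[-1])
print(forecast[1:])
```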

Both experimental techniques and computational insights have been proposed to learn how past and future states are linked without requiring repeated measures. Single-cell RNA-seq destroys each cell, but labelling the RNA while the cell is alive provides a means of discriminating between older RNAs and more recent ones87. RNAs from newly activated genes will appear in the unlabelled but not the labelled fraction, and therefore indicate where a cell is ‘headed’, transcriptionally speaking. RNA velocity is a purely computational technique that achieves similar ends by comparing the relative abundances of incompletely processed transcripts to mature, spliced isoforms88 (Fig. 6b). Transcriptional velocities have proved powerful in computational ‘fate mapping’. For example, CellRank estimates RNA velocity fields, taking care to propagate uncertainty in its estimates into downstream computations, to accurately predict cellular outcomes even outside normal development, such as cellular reprogramming and tissue regeneration89. Dynamo is a recent algorithm that computes transcriptional velocity vectors from labelled RNA (if available) or in a manner similar to RNA velocity (if not) and then uses the resulting vector field to infer gene regulatory networks and predict future cell fates90.
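The steady-state model underlying RNA velocity can be sketched in a few lines: a per-gene ratio γ is estimated by regressing unspliced on spliced counts through the origin, and the residual u − γs serves as the velocity. Real implementations add extensive filtering, neighbour pooling and dynamical models; the matrices here are simulated.

```python
import numpy as np

rng = np.random.default_rng(10)
n_cells, n_genes = 500, 50

# Hypothetical spliced (s) and unspliced (u) abundance matrices for the same cells.
s = rng.gamma(2.0, size=(n_cells, n_genes))
gamma_true = rng.uniform(0.3, 1.0, size=n_genes)
u = gamma_true * s + rng.normal(scale=0.1, size=(n_cells, n_genes))

# Steady-state estimate of the per-gene splicing/degradation ratio gamma:
# regress unspliced on spliced through the origin.
gamma_hat = (u * s).sum(axis=0) / (s * s).sum(axis=0)

# Velocity: positive where a gene is being induced (excess unspliced RNA),
# negative where it is being repressed.
velocity = u - gamma_hat * s
```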

Extrapolating cell fates is even more challenging in the context of genetic, environmental or other perturbations. Accurate models for predicting what would happen to a population of cells in an as-yet-unseen mutant or under particular environmental conditions are of paramount interest, and techniques for ‘out-of-sample’ prediction are beginning to emerge. CellOracle is a tool that builds on ideas from trajectory analysis and RNA velocity, but also constructs approximate gene regulatory networks to predict what would happen if one of the genes in those networks were absent91 (Fig. 6c). To do so, the tool deploys several of the strategies discussed above to make the gene regulatory network problem tractable. For example, CellOracle focuses on transcriptional regulatory networks and leverages prior knowledge from single-cell ATAC-seq data and transcription factor binding motifs to constrain the problem91. It also limits the number of regulators through a statistical technique called penalization, prioritizing overall accuracy in forecasting the future transcriptome of the cell rather than capturing every regulator of each gene. Linear models of gene regulation as used by CellOracle are straightforward to interpret, which means that in silico predictions can be readily tested in the laboratory. A disadvantage of linear models is that they may be too simple to capture the complex, nonlinear interactions between genes, cell types and environmental factors needed for accurate out-of-distribution forecasting.
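The general idea, not CellOracle’s actual implementation, can be sketched as a penalized linear model of a target gene on the transcription factors supported by a hypothetical accessibility and motif prior, followed by an in silico knockout of one regulator.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(11)
n_cells, n_tfs = 800, 30

# Hypothetical transcription factor expression matrix and one target gene.
tf_expr = rng.normal(size=(n_cells, n_tfs))
target = 1.2 * tf_expr[:, 0] - 0.7 * tf_expr[:, 3] + rng.normal(scale=0.5, size=n_cells)

# Hypothetical prior from chromatin accessibility and motif scanning: which TFs
# have a binding site in accessible regulatory DNA near the target gene.
motif_prior = np.zeros(n_tfs, dtype=bool)
motif_prior[[0, 3, 7]] = True

# Penalized linear model restricted to the prior-supported regulators.
model = Ridge(alpha=1.0).fit(tf_expr[:, motif_prior], target)

# In silico knockout: set one regulator to zero and predict the shift in the target.
knockout = tf_expr[:, motif_prior].copy()
knockout[:, 0] = 0.0  # remove the first prior-supported regulator
shift = model.predict(knockout) - model.predict(tf_expr[:, motif_prior])
print("mean predicted change in target expression:", shift.mean())
```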

Deep learning methods are emerging to facilitate in silico predictions of cell states and fates. These techniques trade interpretability for accuracy in forecasting and require dramatically more training data and computational power to build. A first class of methods is built for specific single-cell analysis tasks: scGen92 and a related subsequent method, CPA93, use variational autoencoders to learn low-dimensional representations of cell transcriptomes that enable one to predict gene-expression changes following gene knockouts, chemical perturbations and other insults. For example, by first embedding cells treated with individual small molecules into a low-dimensional space, and then learning how each drug moves cells around in that space, CPA can predict the effects of combinations of drugs that were not included in the training dataset. A second class of techniques, exemplified by Geneformer94 and scGPT95, builds on the same principles that led to the success of large language models (LLMs) such as ChatGPT: a foundation model based on attention networks is first trained to learn how genes co-vary with one another across a huge, diverse collection of cell types and states. This foundation model can be used as is for some tasks (so-called zero-shot prediction) such as in silico gene knockout analysis, or it can be fine-tuned to perform a more specialized application, such as cell-type annotation or identifying transcription factor targets (Fig. 6d).
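The latent-space arithmetic that underlies perturbation prediction in methods such as scGen can be sketched with PCA standing in for the learned autoencoder latent space; the data, cell types and treatment shift below are all invented.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(12)
n_genes = 100

# Hypothetical expression matrices: cell type A observed untreated and treated,
# and cell type B observed only untreated. The goal is to predict treated B.
a_ctrl = rng.normal(0.0, 1.0, size=(300, n_genes))
a_treat = a_ctrl + rng.normal(2.0, 0.3, size=n_genes)   # per-gene treatment shift
b_ctrl = rng.normal(1.0, 1.0, size=(300, n_genes))

# Embed all observed cells in a shared latent space (PCA here, standing in for
# a variational autoencoder's learned latent space).
pca = PCA(n_components=10).fit(np.vstack([a_ctrl, a_treat, b_ctrl]))
delta = pca.transform(a_treat).mean(axis=0) - pca.transform(a_ctrl).mean(axis=0)

# Apply the learned perturbation vector to untreated B cells and decode back to
# expression space to obtain predicted treated transcriptomes.
b_treat_predicted = pca.inverse_transform(pca.transform(b_ctrl) + delta)
```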

There is a robust discussion ongoing within the machine learning community regarding how to combine the strengths of deep learning methods and more traditional modelling techniques, especially for very challenging problems such as forecasting a gene’s regulation in a developing embryo. On the one hand, discriminating between equally plausible mechanistic models may demand the accuracy that thus far only deep learning tools have attained. On the other hand, given many free parameters, a deep learning method may discover a highly accurate but totally uninterpretable or mechanistically completely wrong model for a given system. If we are to use the internal structure of our models to guide mechanistic experiments in the laboratory, we must demand more than accuracy of our statistical methods and artificial intelligence (AI) tools. They must not only learn biology but also be able to convey it to us in terms we understand. Accordingly, intense effort in the AI community is being directed towards making the output of AI models more readily interpretable96.

Conclusions and future directions

As dramatic as the advances in genomic technologies over the past decade have been, there is plenty of room for further progress. Protein abundances and other critical measurements remain challenging to capture in single cells. In terms of scale, experiments with hundreds of specimens remain cost-prohibitive for many laboratories, limiting access to cohort studies and screens. Costs of single-cell sequencing have plummeted in recent years, with new protocols97,98 and new commercial offerings increasingly democratizing the toolset, but experiments remain expensive in absolute terms and further cost reductions are needed. One can profile a handful of molecular modalities in single cells, but quantifying the contribution of the many inputs to gene regulation may require simultaneous measurement of myriad aspects of each cell’s molecular state. Gene knockouts with CRISPR are routine, but highly efficient, precise in vivo base editing would supercharge efforts to stratify the many thousands of outstanding genetic variants of unknown significance. The value of measuring cells in their tissue contexts was made abundantly clear by early spatial transcriptomic applications, but the current technology is largely two-dimensional, rather than three- or four-dimensional, distorting or excluding whole dimensions of context. Multiplexed single-cell measurement is destructive, limiting efforts to forecast cellular behaviour. Molecular recorders may mitigate this challenge but remain in their infancy, with limited ‘memory’ and programmability, and have yet to reach the level of democratization needed to see them deployed except in a few select applications. Fortunately, the pace of single-cell technology development remains as rapid as ever.

Although we have statistical tools to meet the challenges and opportunities presented by single-cell genomic and multiplexed imaging technologies, many of these tools remain inaccessible. There are at least three areas of investment that would help biologists to unlock their data using ideas from statistics. The first is bespoke statistical software built specifically to work with single-cell or spatial sequencing data. For example, many regression analysis packages are general and can be configured to work with all kinds of different data, but running them on a single-cell experiment requires detailed, intimate knowledge of the package, possibly intimidating a user without formal statistical training. Furthermore, the size and scale of single-cell data often exceed the capacity of data structures and algorithms of general-purpose software. Experiments that perturb hundreds or even thousands of genes will demand not only scalable data structures but also intuitive, responsive visualization tools to help users to interpret the many phenotypes that emerge at such scale11. The second is statistical education and outreach. The data science community has made tremendous progress educating scientists without computational backgrounds in using regression analysis and other techniques discussed here, but barriers to access could be lower. The third is closer communication between statisticians, geneticists and technology developers to shape future protocols and assays. By connecting developers of statistical methods with developers of genomic technologies, we will produce powerful, integrated solutions to answer questions that have frustrated geneticists for decades. New technologies invariably face trade-offs, and a statistical perspective can be tremendously useful for navigating these. For example, a new technique might favour more samples over profiling each sample more deeply. Will this limit statistical power when applying it to key questions in genetics? Such conversations might induce technology developers to invent assays they never would have considered had they not been made aware of the pain points that arise during statistical analysis or genetic interpretation.

The rise of machine learning has inescapable implications for biology. Although the accuracy of such models is still being assessed and a truly general AI model of gene regulation at genome scale may require much more data than is currently available, the pace of single-cell technology development, data generation and transfer learning suggests a future in which biologists can make extremely precise predictions about genes, cells, tissues and individual people. Such predictions will have both obvious and unanticipated applications across diverse areas of biomedicine. However, the convergence of statistical and deep learning and massive single-cell datasets will not alleviate the need for real-world experiments using conventional molecular and genetic approaches. Even the best models will be complex in terms of their internal structure, many will be largely or wholly uninterpretable in terms of underlying biological mechanisms, and there will probably be many equally plausible models for the same observations. Nevertheless, models will continue to help to ascribe functions to genes, either directly or by excluding reasonable hypotheses about how these genes might work. To paraphrase George Box’s famous aphorism, these models will all be wrong, but some will be useful in guiding mechanistic experiments. The models will come and go, but the mechanistic knowledge to which they direct us will be forever.