Abstract
Single-cell and spatial molecular profiling assays have shown large gains in sensitivity, resolution and throughput. Applying these technologies to specimens from human and model organisms promises to comprehensively catalogue cell types, reveal their lineage origins in development and discern their contributions to disease pathogenesis. Moreover, rapidly dropping costs have made well-controlled perturbation experiments and cohort studies widely accessible, illuminating mechanisms that give rise to phenotypes at the scale of the cell, the tissue and the whole organism. Interpreting the coming flood of single-cell data, much of which will be spatially resolved, will place a tremendous burden on existing computational pipelines. However, statistical concepts, models, tools and algorithms can be repurposed to solve problems now arising in genetic and molecular biology studies of development and disease. Here, I review how the questions that recent technological innovations promise to answer can be addressed by the major classes of statistical tools.
Introduction
How our genes work together to build our cells, organs and bodies, and how mutations in many genes contribute to disease remain fundamental questions in genetics. Despite more than a century of exploration, our inventory of the genes required by each of our cell types for their development and their specific functions within the body remains sparse, as is our view of the regulatory links between genes. Notwithstanding the requirement to use laboratory animals as proxies for human biology, defining the genes essential for each cell type is a daunting prospect. Experiments must disrupt one or more genes within the animal to then assess the outcome on multiple cell types and potentially thousands of changes in gene expression at many different developmental time points.
Emerging measurement technologies based on sequencing1,2, imaging or both3 offer a route to opening this longstanding bottleneck in genetics. Single-cell transcriptome sequencing can be performed on millions of cells4 from thousands of independent samples2. Other aspects of the molecular states of cells, such as chromatin accessibility5 or DNA methylation6, can be measured in single cells at scale, often in conjunction with RNA7. Spatial molecular profiling is rapidly maturing, enabling surveys of the entire transcriptome across large fields of view8. Paralleling improvements in single-cell measurement technologies, improvements in the efficiency, precision and throughput of genome-engineering techniques now allow us to manipulate genes and cells in vivo9. Single-cell sequencing and CRISPR genome editing go hand in hand as a means of revealing how mutating a gene alters the expression of every other gene in each cell type. Systematically searching for genes that ‘phenocopy’ one another in terms of the molecular, anatomic or behavioural effects they elicit often reveals genes that function together through regulatory or biochemical interactions. Particularly in the context of large-scale perturbation studies10, ‘guilt-by-association’ analyses for finding sets of genes with similar phenotypes have proved invaluable to understand gene function, with whole biochemical complexes revealed11. Together, these technologies promise to generate an avalanche of data that captures the consequences of genome-scale experiments in which we measure the effects of our interventions on every gene in every cell type in a whole animal.
However, exploiting the full power of single-cell technologies to understand gene function demands thoughtful attention to fundamental statistical issues at the core of several recurring computational problems. A first problem is to catalogue and characterize the cell types based on differences in how they respond to perturbation. A second lies in discerning how each cell type regulates, and is regulated by, genes. A third is to understand how cell types descend from one another in the lineage or depend on each other via signalling. A fourth is how the genes depend on and regulate one another, and how that control varies within different cell types. And finally, a fifth problem is to integrate such knowledge to make accurate, quantitative forecasts. For example, what will happen in a mouse model of disease in yet-untested genetic, drug or environmental perturbations? What will happen in a specific patient when that person is prescribed a particular drug? Although not all these problems are purely or even primarily statistical in nature, attention to the statistical issues that arise when bringing single-cell technology to bear on them is key. Analogous statistical issues will arise when studying gene function with other measurement tools from the molecular to the anatomic scale.
A well-formulated statistical model can do the following: weigh up the contribution of each of many input factors; transform qualitative phenotypes into quantitative ones; resolve technical limitations of the instrument that produces the data; or connect measurements made at various scales to capture or even forecast the behaviour of molecules, cells, tissues and whole organisms, including under perturbations and in new environments or contexts. Good models establish confidence that observed effects and phenotypes are ‘real’, can help to separate direct versus indirect effects, and can describe how variation in one variable propagates to or compounds variation in others, thus guiding subsequent experimental design decisions and saving time and laboratory resources.
In this Review, I highlight statistical concepts, models, tools and algorithms that can be repurposed to solve problems now arising or becoming increasingly urgent in genetic and molecular biology studies of development and disease. I outline the workflow for phenotyping with single-cell molecular profiling, including differential cell composition and gene expression analysis. I then introduce strategies for quantitatively modelling a gene’s regulation, for understanding lineage relationships and cell–cell signalling interactions, and for inferring gene regulatory networks. Finally, I turn to the problem of forecasting cell fate, briefly touching on the promise of advanced artificial intelligence systems for such problems. I do not discuss the relevant technologies in detail to avoid overlap with many recent reviews on single-cell sequencing, spatial molecular profiling and genome editing. I also avoid statistical subjects already very familiar to geneticists, such as genome-wide association studies and expression quantitative trait loci analysis.
Phenotyping at single-cell resolution
For many diseases, we know neither the principal cell types nor the genes involved, much less the mechanisms that might be targeted as part of therapy. Single-cell genomic experiments that compare healthy to diseased tissue or track the development of embryos over time typically aim to identify subpopulations of cells that differ between conditions or timepoints at the molecular level. For example, cells that exist only in diseased samples are of central interest, as are healthy cell types that are absent in diseased samples. In principle, comparing the transcriptomes of individual cells in one condition (for example, mutant or disease) against control cells (for example, wild-type or healthy) is straightforward: first, cluster the cells, then classify the clusters according to type based on the expression of discriminative marker genes. Next, count the number of cells of each type in each sample and compare these counts across sample groups to assess which cell types are more abundant in one group than another12. Finally, scrutinize and compare the genes specifically active in each of these states to extract insights about possible mechanisms of pathogenesis and progression. However, the nuances at each of these steps demand attention to the statistical issues and adoption of best practices (reviewed recently13).
Cell-type annotation
Recognizing disease-specific changes in cell-type proportions or transcriptomes requires annotating each cell according to type (Fig. 1a). Classifying cells according to type is done by comparing their transcriptomes to annotated cells in reference cell atlases4,14,15,16,17,18,19. Several bioinformatic tools have been developed to open up this bottleneck20,21,22,23,24,25. Garnett uses elastic net regression26 to train a classifier that annotates cells in a new experiment based on prior single-cell data according to an ontology of cell types defined by a literature-derived set of markers for each cell type20. An alternative strategy matches each cell from a new experiment onto cells from an existing atlas with similar overall transcriptomes and then transfers any annotations from the reference (for example, cell type) onto each query cell27. This so-called label transfer is fast and straightforward but requires a reliably annotated reference atlas. The Human Cell Atlas and various model organism atlases are improving rapidly as more groups contribute data, but annotations are frequently updated, and cell types are still being catalogued, particularly in the context of disease.
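The logic of label transfer can be illustrated with a minimal sketch. Tools such as Seurat's anchor-based transfer are considerably more sophisticated; here a simple k-nearest-neighbour classifier (using scikit-learn, with simulated two-dimensional embeddings standing in for real principal components) conveys the core idea: each query cell inherits the majority annotation of the reference cells it most resembles.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(6)

# Hypothetical reference atlas: 300 annotated cells in a low-dimensional
# embedding (for example, principal components), comprising two cell types
ref = np.vstack([rng.normal(0, 1, size=(150, 2)),
                 rng.normal(5, 1, size=(150, 2))])
ref_labels = np.array(["T cell"] * 150 + ["B cell"] * 150)

# Query cells from a new experiment, projected into the same embedding
query = np.vstack([rng.normal(0, 1, size=(20, 2)),
                   rng.normal(5, 1, size=(20, 2))])

# Label transfer: each query cell inherits the majority annotation
# of its nearest reference neighbours
knn = KNeighborsClassifier(n_neighbors=15).fit(ref, ref_labels)
transferred = knn.predict(query)
```

In practice, the quality of the transfer depends on the reference and query being embedded in a shared, batch-corrected space, which is where most of the methodological effort in real tools is spent.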
a, A hypothetical experimental design with several cases (people with disease) and controls (people without disease). Peripheral blood mononuclear cells are collected from each individual and subjected to single-cell RNA sequencing (RNA-seq), identifying five different cell types. b, Clustering and cell type annotation reveal major cell types in proportions that vary across individuals of different ages. c, One cell type (activated T cells) seems to be more abundant in cases than in controls, but is the difference statistically significant even if we control for differences in cell type proportion as a function of age? d, Linear regression can model the proportion in a cell type relative to the others across samples. A first model describes the activated T cell proportions as a function of age and disease status. In a second model, disease status is excluded. These models can be compared with a likelihood ratio test to assess whether the improvement in the first model’s ability to explain variation in proportions across samples is enough to justify its additional complexity in terms of parameters. The plotted lines correspond to the fits from the alternative model. e, Individual genes can also be tested for variation across cells from different samples. A gene may vary within cells from patients treated with a drug as a function of treatment time since therapy began, but the trend may differ between individuals, who may respond differently. These donor-specific trends should be accounted for during differential expression analysis. One approach is to use a mixed model that captures donor-specific ‘random effects’ in addition to the fixed effect of age and other experimental variables. The ‘(1 | patient)’ notation denotes a patient-specific random effect.
Differential cell composition
Once cells have been identified, one can ask, for example, whether their proportions vary by donor age (Fig. 1b), disease (Fig. 1c) or by other study variables. There are two strategies for comparing abundances of each cell type across specimens. A first strategy is to simply count the cells according to type and then compare the counts across samples, using a simple two-group test. However, two-group tests can have difficulty detecting differences as a function of one variable (for example, disease) while controlling for others (for example, age). A more flexible approach is to model the cell-type counts across groups via regression (Fig. 1d). Propeller is a recently introduced tool that compares the proportions of different cell types by regressing experimental variables (for example, case versus control or age) against each cell type’s proportions in the sample12. The advantages of regression-based analyses of cell proportions are that they are easy to implement and interpret, but a disadvantage is that they require accurate cell-type annotations. Moreover, cell ontologies necessarily impose arbitrary classifications on cells, specifying, for example, at what point a developing cell becomes ‘terminally’ differentiated.
Several methods have emerged to navigate the trade-offs between resolution and interpretability, which become especially fraught with sparse single-cell data, when testing for enrichment or changes in cell proportions. SEACells groups cells into ‘metacells’, between which biological variation is preserved but technical variation is minimized28. SCAVENGE relies on network propagation to associate genetic variants with distinct molecular states in which they are enriched29. Uncertainty surrounding cell annotation can be at least partly accounted for in such models via bootstrap resampling30 at additional computational expense. MELD31 and Milo32 are alternative approaches that model the distribution of cells throughout the high-dimensional space of possible transcriptomes and then test whether sample groups are distributed differently within it. The advantage of such an approach is that it does not require prior clustering or annotation and is therefore robust to choices of parameters of clustering algorithms and can work in the absence of an accurately annotated reference. A disadvantage is that more complex contrasts (for example, mutant versus wild type, controlling for differences in age) can be more complicated to specify and compute.
Differential gene expression
A third major statistical challenge arises when comparing cells of different types or states. What genes distinguish each cell type? What, if anything, is different about nominally similar cells from different donors or treatment groups, even after accounting for systematic technical effects shared by ‘batches’ of specimens? Such comparisons are central, for example, to identifying disease biomarkers or characterizing the genes that are dysregulated in disease-specific cell states relative to the healthy ones from which they emerged. For example, differential gene expression of single nuclei from mice exposed to silica dust revealed that lung macrophages activate genes normally expressed by osteoclasts, including proteases and other molecules that damage the lung33. It is essential in such comparisons to compare the levels of genes in cells with a statistical model and test that is well suited to the data34. For simple two-group comparisons, a Wilcoxon rank-sum test (also known as a Mann–Whitney U test) or similar non-parametric approach can accurately capture genuine differences while controlling for false discoveries that may arise from sampling error, but with the same limitation as in differential cell-composition analysis: two-group tests cannot easily control for additional covariates. For example, one might want to compare the kinetics of a gene over time in two different strains, or test whether there is an age-dependent difference in how a particular cell type responds to a drug treatment (Fig. 1e). Generalized linear regression models (GLMs) can test for genes that vary over one or more experimental variables (having controlled for the others). GLMs treat each observation (that is, each cell) as independent, but this assumption is not met in most single-cell RNA sequencing (RNA-seq) experiments because cells from the same specimen are not fully independent from one another. If this ‘grouping structure’ among the observations is not accounted for, the analysis will return many false positives.
There are two ways of accounting for grouping structure in single-cell datasets: one approach is to use generalized linear mixed models (GLMMs) that allow parameters (for example, age or dose effects) to differ between groups of cells. A second approach is to aggregate data from individual cells of the same type and from the same sample, to create ‘pseudobulk’ profiles34,35 that can be analysed with conventional RNA-seq tools36,37. Pseudobulking is straightforward but requires accurate cell-type annotations. GLMMs can be fitted without annotations but are computationally much more expensive to fit.
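The pseudobulk strategy amounts to a simple aggregation: collapse the cells of one type within each sample into a single profile, so that samples, rather than cells, become the units of replication. A minimal sketch with simulated counts for a single gene follows; with real data the aggregated counts would be passed to a conventional bulk RNA-seq tool (for example, DESeq2 or edgeR), so the t-test on log counts here is only a stand-in for the sample-level comparison.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical cell-level counts for one gene: 4 samples x 200 cells each
cells = pd.DataFrame({
    "sample": np.repeat(["s1", "s2", "s3", "s4"], 200),
    "group":  np.repeat(["ctrl", "ctrl", "case", "case"], 200),
    "count":  np.concatenate([rng.poisson(2, 400), rng.poisson(3, 400)]),
})

# Pseudobulk: sum counts within each sample so that samples,
# not individual cells, are the units of replication
pb = cells.groupby(["sample", "group"], as_index=False)["count"].sum()

# Stand-in for a bulk differential expression test: compare the
# per-sample log counts between groups
ctrl = np.log1p(pb.loc[pb.group == "ctrl", "count"])
case = np.log1p(pb.loc[pb.group == "case", "count"])
t, p = stats.ttest_ind(case, ctrl)
```

Note that the test now has only as many observations as there are samples, which is exactly the point: the thousands of cells per sample no longer masquerade as independent replicates.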
Because cell types are often defined on the basis of how they cluster, and cells are clustered according to how similar their transcriptomes are, care must be taken not to ‘double dip’. Using exactly the same data both to define clusters and to characterize their differences risks reporting spurious biomarkers of novel cell types38. Two solutions have been proposed for studies that seek both to discover new cell types and to characterize their molecular markers38. In the first, termed ‘sample splitting’, cells from a first group of samples are clustered according to transcriptomic similarity. Then, samples from a second group of samples that were not used in the definition of clusters are assigned to the cluster most similar to them. Finally, the cells from this second group are used for differential expression analysis to define markers. This approach is simple and does not require specialized statistical procedures, but the experiment must collect enough samples that they can be split between the two stages. A second approach, called data thinning, splits the sequencing data from each cell into two statistically independent groups, the first of which is used for clustering and the second for differential expression39,40, avoiding the ‘double dip’ without adding to the overall cost of the experiment.
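Data thinning for counts exploits a convenient property of the Poisson distribution: binomially subsampling a Poisson count yields two statistically independent Poisson counts. A brief sketch on a simulated UMI matrix:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical UMI count matrix (cells x genes), Poisson-distributed
counts = rng.poisson(5.0, size=(500, 100))

# Binomial thinning: if counts ~ Poisson(lam), then 'train' and 'test'
# are independent draws from Poisson(eps * lam) and Poisson((1 - eps) * lam)
eps = 0.5
train = rng.binomial(counts, eps)  # use these counts for clustering
test = counts - train              # use these for differential expression
```

Because the two halves are independent, clusters defined on the first half can be tested for marker genes on the second without the circularity that afflicts double dipping.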
Modelling gene regulation
A useful model of gene expression can help to discern the contribution of several molecular inputs across diverse genetic, environmental, developmental and disease contexts (Fig. 2a). How much mRNA is transcribed from a gene, given a cell’s concentrations of its upstream kinases and transcription factors? How active is protein signalling in this cell? What non-coding sequences determine whether a gene is expressed in each cell type, and are they accessible for binding in the cell? How do mutations within those sequences alter the gene’s output?
a, The transcription of a hypothetical gene with diverse molecular inputs that are integrated to determine the rate of RNA production. Binding of a ligand (dark purple) of varying concentration to a receptor (light purple) of varying surface expression triggers signalling through kinases A and B in a pathway that ultimately leads to the phosphorylation (P) and translocation of a transcription factor (TF) to the nucleus. The nuclear concentration of this factor determines the probability that it occupies the regulatory DNA of the gene, which is also a function of how accessible chromatin is around that DNA, and whether it contains any sequence variants that might alter the factor’s binding affinity. b, A linear regression model for RNA production for the locus in panel a, which assumes that each input has an additive effect on expression. c, A Bayesian network model of RNA production from the locus, which describes the conditional dependencies between the various inputs.
Multi-modal single-cell assays are increasingly powering quantitative descriptions of how individual genes are regulated across the molecular layers of the central dogma41. For example, CITE-seq co-assays cellular transcriptomes and their immunophenotypes via oligo-conjugated antibodies that can be detected in single-cell RNA-seq (scRNA-seq) libraries42. ASAP-seq co-assays the immunophenotypes of cells with their chromatin accessibility profile via scATAC-seq43. Several RNA and chromatin accessibility co-assays have been reported7,44, enabling the correlation of gene expression with the accessibility of nearby regulatory DNA elements (for example, putative enhancers).
A quantitative model of a given gene’s transcription aims to find a mathematical function that accurately estimates the number of its mRNA or protein products present in the cell under different conditions. This function must somehow capture the molecular biology underlying the central dogma. In their 1961 classic paper on lactose metabolism in Escherichia coli, Jacob and Monod showed that the kinetics of a gene could be used as a quantitative phenotype and for causal inference regarding its control45. In the decades since, many others have returned to this system as a testbed for new ideas in quantitative modelling of individual genes through mathematical functions of chemical reaction rates, the affinities of transcription factors for their DNA binding sites and other biochemical properties46. Despite the rigour, sophistication and utility of such models for discriminating between possible mechanisms of gene regulation, an inescapable conclusion of these efforts is that mathematical equations that describe the kinetics of mRNA synthesis even for simple genes become intractable when many inputs must be considered.
An alternative approach to modelling a gene’s expression is to use numerical optimization to learn the more abstract functions that nevertheless accurately predict its output from its many inputs. These functions may not be explicitly specified in biochemical terms but nevertheless quantify the contributions of genetic mutations, epigenetic states, cell-type-specific signalling, among other variables (Fig. 2b). With the inputs, output and general form of a given gene’s model specified, the computer can search through the space of all such functions to find the one that best explains the multi-modal data using an appropriate numerical optimization algorithm.
Statistical models can vary tremendously in their complexity, with increasing fidelity to the underlying molecular biology. For example, a simple linear model will describe how the amount of mRNA changes with addition or subtraction of each input. Such models are straightforward to interpret but have difficulty describing how inputs interact, and so will not detect, for example, how transcription factors cooperate to activate a gene or block each other from doing so. The function can be augmented with terms that explicitly model the interaction between explanatory variables, albeit at a potentially substantial cost in terms of interpretability and computation burden. Graphical modelling approaches such as Bayesian networks can capture complex interdependencies between the inputs (Fig. 2c). For example, suppose a gene’s mRNA levels can be predicted accurately from measurements of a particular signalling pathway’s activity or from the nuclear abundance of a key upstream transcription factor whose translocation to the nucleus depends on that pathway (Fig. 2a). A Bayesian network could learn that the signalling measurements are conditionally independent of the mRNA levels, given the transcription factor levels, providing no additional predictive value.
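The conditional independence described above can be probed with a simple partial-correlation check, simulating the chain of Fig. 2a (signalling activity drives nuclear transcription factor levels, which drive mRNA output). Learning a full Bayesian network structure is considerably more involved; this sketch only illustrates the statistical signature such a network would exploit.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 2000

# Simulate the chain: signalling -> nuclear TF abundance -> mRNA
signalling = rng.normal(size=n)
tf_nuclear = 0.9 * signalling + rng.normal(scale=0.3, size=n)
mrna = 1.2 * tf_nuclear + rng.normal(scale=0.3, size=n)

def partial_corr(x, y, z):
    """Correlation between x and y after regressing z out of each."""
    rx = x - np.polyval(np.polyfit(z, x, 1), z)
    ry = y - np.polyval(np.polyfit(z, y, 1), z)
    return np.corrcoef(rx, ry)[0, 1]

# Marginally, signalling activity predicts mRNA levels...
marginal = np.corrcoef(signalling, mrna)[0, 1]
# ...but conditional on TF levels it adds nothing: the partial correlation
# is near zero, the signature of conditional independence
conditional = partial_corr(signalling, mrna, tf_nuclear)
```

A network-learning algorithm that detects this pattern would omit a direct edge from signalling to mRNA, encoding the biological insight that the pathway acts entirely through the transcription factor.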
Especially when the inputs depend on one another, more complex models are not necessarily better: although adding terms will always explain more variance in training data, those terms may actually hurt performance on ��held-out’ test data (that is, data not used to train model parameters). Model selection, the formal assessment of whether adding explanatory variables and dependencies can be justified in terms of how well a model explains data, is critical for quantifying the contributions different inputs make to gene regulation. Fortunately, the statistical literature is rich in tools with which to assess whether a complex model is significantly better than a simpler one ‘nested’ within it (for example, likelihood ratio tests), at selecting a parsimonious set of inputs from a large collection of potentially correlated ones (for example, LASSO47), and at measuring overall model complexity (for example, the Akaike information criterion48).
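The cost of unnecessary complexity can be made concrete with the Akaike information criterion. In this sketch (simulated data, ordinary least squares, Gaussian errors assumed), expression truly depends additively on two transcription factor levels; an interaction term is offered to the model anyway, and the AIC weighs its marginal gain in fit against its parameter cost.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200

# Hypothetical inputs: levels of two transcription factors
tf_a = rng.normal(size=n)
tf_b = rng.normal(size=n)
# Expression depends additively on both; there is no true interaction
expr = 1.0 + 0.8 * tf_a + 0.5 * tf_b + rng.normal(scale=0.3, size=n)

def fit_aic(inputs, y):
    """OLS fit; AIC = n * log(RSS / n) + 2k under Gaussian errors."""
    X = np.column_stack([np.ones(len(y)), *inputs])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    return len(y) * np.log(rss / len(y)) + 2 * X.shape[1]

aic_additive = fit_aic([tf_a, tf_b], expr)
# Offering a spurious interaction term costs 2 AIC units, which the
# chance improvement in fit will rarely repay
aic_interact = fit_aic([tf_a, tf_b, tf_a * tf_b], expr)
```

For formally nested models such as these, the likelihood ratio test provides a significance threshold in place of the AIC's penalty; for non-nested candidates, criteria such as the AIC remain applicable.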
Statistical models of gene regulation can be especially powerful in conjunction with genetic experiments. For example, the bone morphogenetic protein (BMP) pathway transduces signals through multiple ligands bound by receptors composed of multiple subunits. A recent quantitative analysis determined how ligand and receptor compositions and stoichiometry define how cells transduce BMP into diverse patterns of downstream gene expression49. The statistical methods developed as part of the study discriminated between additive, synergistic and antagonistic effects between the ligands and the receptor subunits. Modulating BMP receptor composition by knocking out subunits altered how cells respond to BMP, both quantitatively in levels of downstream expression and also qualitatively, with one cell type made to resemble another. This study demonstrates that combining genetic perturbations with statistical modelling to discriminate between possible functions for genes using the quantitative kinetics of the cellular behaviours they control remains as powerful today as it was when Jacob and Monod used it more than 60 years ago.
Disentangling interactions and dependencies
How our many diverse cells arise from a single cell, how they communicate with each other, and how they depend on one another to form organized, structured tissue are fundamental questions in developmental biology. What genes does each cell use to sense and influence its neighbours? What genes does it use to ensure its daughters are properly positioned and specified? Genes that function non-autonomously through cell–cell interactions or by conditioning a cell (or its descendants) to behave a certain way in the future are notoriously difficult to study. Although understanding the lineal and signalling dependencies between cell types may seem like disparate goals, new tools for studying cells across space and time promise to decode these parts of the genetic program.
Spatial relationships
Identifying the specific proteins, peptides and small molecules that drive cell-fate decisions is challenging. Highly multiplexed spatial transcriptomics and proteomic techniques (reviewed recently50,51) have emerged as a way of measuring heterogeneity across cells within a tissue at the molecular level, promising to characterize the molecular basis of signalling dependencies between cell types, and one day may enable quantitative spatiotemporal models of tissue morphogenesis or tumour growth and invasion. Tools from spatial statistics are increasingly being deployed to overcome common tasks in spatial transcriptomics or proteomic analyses. For example, a recent analysis of the mouse liver used a combination of spatial transcriptomics and scRNA-seq to statistically deconvolve the spatial transcriptomics data using scRNA-seq, estimating each gene’s expression within each cell in the tissue; to automatically annotate veins in the image and classify them as central or portal; and to identify modules of spatially autocorrelated genes within hepatocytes that vary as a function of position along the lobular axis and proximity to portal veins52. A separate study used molecular cartography to compare livers from wild-type and Wnt2/Wnt9b double-knockout mice, finding these genes to be required for properly zoned gene expression and liver regeneration following acetaminophen injury53. These two examples highlight how new technology (and rigorous spatial statistics), coupled with genetic and drug perturbations, facilitate causal inferences about the role of individual genes in cell–cell interactions and molecular phenotypes.
The central statistical challenge associated with spatial molecular profiling is to quantify the contribution that the locations of cells within a tissue make to explaining variation in their forms and functions. This challenge is fundamental to answering diverse questions in cell, molecular and developmental biology. Are T cells that have infiltrated a tumour different from those at the margin? How does distance from an organizer and the morphogens it emits correlate with a cell’s fate in the embryo? Spatial statistics is rich with parametric and nonparametric methods of exploring such questions, which frequently arise in demography, geology, meteorology, oceanography, ecology and other disciplines that analyse measurements over maps and volumes. Consider a hypothetical spatial molecular profiling experiment aiming to identify ligand–receptor interactions that determine the expression of a downstream gene through signalling (Fig. 3). Statistical models could be used to determine whether variation in target gene T across the tissue can be explained as a function of the local concentration of ligands A and B, conditional on whether the receptor is also expressed. The variance explained by these models can be taken as the strength of evidence for interactions between each ligand and the receptor. However, as more candidate ligands are considered, model selection can be burdensome as the number of candidate models explodes.
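The ligand-screening logic of this hypothetical experiment (compare Fig. 3c,d) can be sketched with simulated per-cell measurements. Here the target gene responds to ligand A only in cells that express the receptor, and a plain coefficient of determination stands in for the pseudo-R2 of a fitted spatial regression model.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400

# Hypothetical per-cell measurements from a spatial assay
ligand_a = rng.gamma(2.0, 1.0, size=n)   # local concentration of ligand A
ligand_b = rng.gamma(2.0, 1.0, size=n)   # local concentration of ligand B
receptor = rng.binomial(1, 0.5, size=n)  # is the receptor expressed?
# Target responds to ligand A only when the receptor is present
target = 2.0 + 1.5 * ligand_a * receptor + rng.normal(scale=0.5, size=n)

def r_squared(inputs, y):
    """Fraction of variance in y explained by a linear model of inputs."""
    X = np.column_stack([np.ones(len(y)), *inputs])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return 1.0 - (y - X @ beta).var() / y.var()

# Compare candidate models of the target gene
r2_a = r_squared([ligand_a], target)
r2_b = r_squared([ligand_b], target)
r2_a_receptor = r_squared([ligand_a * receptor], target)
```

Ranking candidate models by explained variance recovers the true ligand–receptor dependency, but, as noted above, the number of candidate models grows combinatorially with the number of ligands considered, so exhaustive comparisons quickly become burdensome.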
a, Multiplexed imaging or spatial transcriptomics collects measurements of each gene at many positions in a field of view or tissue section. Using Gaussian process regression or ‘kriging’ produces an estimate of each gene’s expression even at coordinates that were not directly measured. b, Measurements from one gene can be correlated with another across space to identify genes that vary as a function of concentration of ligands and other key regulatory molecules. c, A set of spatial regression models that aim to explain variation in a target gene’s expression across cells as a function of one or more candidate extrinsic signalling factors. *** indicates statistical significance. Pseudo-R2 is a measure of the variability in a dependent variable explained by a model of several independent variables. d, Ligands can in principle be identified for receptors by evaluating the amount of variation they explain in targets known or suspected to be downstream of those receptors. n.s., not significant.
Statistical methods are also critical for overcoming limitations in the underlying measurement technologies. Techniques such as Slide-seq54 that release mRNA from tissues and then blot it onto a substrate prior to library construction are not, strictly speaking, single-cell techniques, in that libraries correspond to spots on a spatial grid contributed by multiple (and sometimes many) cells of different types and in different functional states. There are now numerous algorithms for deconvolving the spots under various mixture models for how individual cells contribute their diffused mRNA to each nearby spot55,56. Techniques that extract single cells or nuclei prior to sequencing such as sci-Space57 avoid this issue, but they face others: because cells are physically extracted from tissue prior to single-cell sequencing, the original tissue coordinates of those cells must be triangulated (again, with uncertainty) from the sequencing data. Triangulating cells can also be formulated as a mixture modelling problem in which each cell is labelled according to its relative proximity to a reference point within the assay’s spatial coordinate system. A second problem common to many spatial molecular profiling assays is that measurements for some probes or cells may be missing at some coordinates. Fortunately, there is a rich statistical literature on how to estimate missing measurements through interpolation. One classic approach to the spatial interpolation problem is ‘kriging’, whereby a ‘variogram’ is constructed that describes the correlation in measurements collected at two points based on the distance between them, enabling one to predict values at unsampled coordinates based on sampled ones. A third problem common to any analysis of tissue sections is how to register multiple sections from the same specimen into a common, three-dimensional coordinate framework.
Here too, spatial statistical methods, especially in conjunction with deep neural networks, have proved invaluable for aligning sections against one another in three dimensions58.
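Kriging is mathematically equivalent to Gaussian process regression, with the kernel playing the role of the variogram. A brief sketch with simulated spatial measurements, using scikit-learn's Gaussian process implementation (an assumption; dedicated geostatistics packages fit variograms more directly):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(4)

# Hypothetical spatial assay: one gene's expression measured at 60
# scattered (x, y) coordinates in a tissue section
coords = rng.uniform(0, 10, size=(60, 2))
expr = (np.sin(coords[:, 0]) + 0.1 * coords[:, 1]
        + rng.normal(scale=0.1, size=60))

# The kernel encodes how correlation between measurements decays with
# distance (the role of the variogram); WhiteKernel absorbs assay noise
gp = GaussianProcessRegressor(
    kernel=RBF(length_scale=1.0) + WhiteKernel(noise_level=0.01),
    normalize_y=True,
).fit(coords, expr)

# Predict expression, with uncertainty, at unmeasured coordinates
grid = np.array([[2.5, 5.0], [7.5, 5.0]])
mean, sd = gp.predict(grid, return_std=True)
```

The predictive standard deviation grows with distance from the sampled points, which makes explicit where interpolated values should and should not be trusted.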
Lineage relationships
In some organisms, the lineage relationships between cells can be ascertained simply by observing them under the microscope. Sulston and colleagues characterized the entire cell lineage of the nematode Caenorhabditis elegans by meticulously documenting individual cell-division events in worm embryos59. The worm is ideal for such an effort because it is small and has an invariant lineage that over developmental time follows a deterministic genetic programme, so that observations of different embryos can be registered into a unified ‘coordinate system’ of the lineage. Unlike the worm, most animals do not have invariant lineages; even genetically identical animals do not contain exactly the same number of cells, of exactly the same cell types, positioned in exactly the same ways in their tissues and organs. Experiments that interrogate the cells arising from a population marked by a particular gene (assuming a suitable reporter could even be designed) leave the remainder of the animal unmapped, and so the lineal origins of many cell types remain unclear.
Emerging genomic and imaging technologies for lineage tracing offer a path to comprehensively characterizing the lineage relationships between cells at the whole-animal scale (reviewed recently60,61). These technologies draw from two basic strategies for connecting cells to their descendants through developmental time. A first class of prospective lineage tracing strategies labels cells by manipulating their genomes (for example, with unique barcodes), allows them to expand, sequences the barcodes of their descendants and then associates each descendant with its ancestor through these barcodes. A second, retrospective set of strategies scrutinizes a population of cells and reconstructs their ancestry through phylogenetic analysis of their shared genetic variants. Recently, several techniques that draw from both of these strategies have been developed, such as GESTALT, which uses CRISPR–Cas9 to introduce cumulative edits to a synthetic construct installed in the genome of developing zebrafish62 (Table 1). As the zebrafish embryo develops, cells accumulate unique patterns of repair to this construct, prospectively and continuously labelling ever-smaller clades of cells (Fig. 4a). On the assumption that descendants will share the genomic scars formed when their shared ancestors received edits, their lineage relationships can be inferred by reconstructing a lineage tree that minimizes the amount of editing that would be needed to generate their ‘alleles’ (for example, maximum parsimony). Intense efforts are underway to improve the depth and breadth of lineage recording capacity, to co-capture other measurements (such as the transcriptome), to incorporate spatial information and to deploy recorders in animal models.
For example, the DARLIN mouse, which records cellular lineage via an inducible Cas9-barcoding system, revealed that clonal memory in haematopoietic stem cells is associated more with DNA methylation than with chromatin accessibility or the transcriptome63. Further experiments, possibly involving mutants on the DARLIN background, will be required to identify the genes that mediate memory through DNA methylation.
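The parsimony logic behind reconstructing recorder lineages can be sketched in a few lines. Assuming edits are irreversible and never arise twice independently (a perfect-phylogeny idealization that real reconstruction tools relax), cells sharing an edit must form a clade, so a tree can be grown by recursively splitting on the most widely shared edit; the edit labels and cell names below are hypothetical:

```python
def build_lineage(cells):
    """Greedy perfect-phylogeny reconstruction: `cells` maps each cell
    name to the set of edits observed in its recorder locus. Recursively
    split the cells on the most widely shared informative edit."""
    if len(cells) <= 1:
        return list(cells)
    counts = {}                           # how many cells carry each edit
    for edits in cells.values():
        for e in edits:
            counts[e] = counts.get(e, 0) + 1
    # only edits absent from at least one cell can split the group
    splits = [e for e, c in counts.items() if c < len(cells)]
    if not splits:
        return list(cells)                # no informative edit remains
    best = max(splits, key=lambda e: counts[e])
    with_e = {c: s - {best} for c, s in cells.items() if best in s}
    without = {c: s for c, s in cells.items() if best not in s}
    return [build_lineage(with_e), build_lineage(without)]
```

For cells a = {1, 2}, b = {1, 3} and c = {4}, the shared edit 1 groups a and b into one clade before c joins, yielding the nested grouping [[['a'], ['b']], ['c']].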
a, Molecular lineage recorders work by first introducing synthetic sequences into the genome and then targeting those sequences with genome-editing reagents (such as CRISPR–Cas9) to introduce mutations in them as the embryo develops. Each cell accumulates a distinct (ideally unique) pattern of mutations, producing an embryo that is massively mosaic at the target locus. The ‘alleles’ can be used to infer a phylogenetic tree corresponding to the cell lineage. b, For organisms without a deterministic lineage programme, the cell lineage tree for each individual will differ, and how to produce a ‘consensus lineage’ for a population of individuals from the same species is an unsolved problem. However, coalescent theory offers a conceptual roadmap for building such consensus lineages.
Understanding how animal genomes sweep out a lineage that is variable without compromising reproducibility and robustness during development is a central aim of molecular recording technology. Realizing this goal will require developing statistical models that describe and quantify variability in the lineages of many individual embryos. Statistical methods from phylogenetics and population genetics offer guidance on how to approach the problem of inferring lineage relationships from molecular recorders. In population genetics, coalescent theory formulates models of how alleles observed in a population arose from a common ancestral allele over successive generations64 (Fig. 4b). Although coalescent theory rests on assumptions that may not hold for the synthetic alleles introduced by molecular recorders, such as the assumption that edits are introduced in cells at a constant rate, much of the mathematical and statistical infrastructure may be reusable. Elaborations to coalescent theory capture phenomena in evolutionary histories that have analogues in cell and developmental biology such as bottlenecks (programmed death) and dispersal (cell migration). A coalescent theory of molecular recorder experiments would enable the use of inferred cell histories as quantitative phenotypes. Phenotyping with molecular recorders would then allow us to tackle directly the question of how genomes sweep out reproducible yet stochastic cell lineages in development.
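For intuition, the standard Kingman coalescent — the baseline model that a recorder-specific theory would elaborate — can be simulated directly: with k lineages, the waiting time to the next merger is exponential with rate k(k − 1)/2N, and the merging pair is chosen uniformly. The constant effective population size N is an assumption that developmental bottlenecks would violate:

```python
import random

def kingman_coalescent(n, N=1000, seed=0):
    """Simulate a Kingman coalescent for n sampled lineages in a
    population of constant effective size N. Returns the genealogy
    (as nested tuples of sample labels) and its total depth."""
    rng = random.Random(seed)
    lineages = [str(i) for i in range(n)]
    t = 0.0
    while len(lineages) > 1:
        k = len(lineages)
        rate = k * (k - 1) / (2 * N)      # total pairwise coalescence rate
        t += rng.expovariate(rate)        # waiting time to the next merger
        i, j = rng.sample(range(k), 2)    # uniformly chosen pair merges
        merged = (lineages[i], lineages[j])
        lineages = [x for m, x in enumerate(lineages) if m not in (i, j)]
        lineages.append(merged)
    return lineages[0], t
```

Repeating the simulation across seeds produces a distribution of genealogies against which an observed recorder-derived lineage could, in principle, be compared as a quantitative phenotype.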
Dissecting gene circuits
The fate decisions, lineal histories, physical positions and molecular messaging of cells in the embryo are ultimately determined by the regulation of and by genes. Systematically or even automatically inferring how genes regulate one another has long been one of the most difficult challenges in computational biology. With each new generation of genomic measurement technologies, new algorithms emerge to exploit them, hoping to realize the promise of data-driven dissection of genetic programmes in development, disease and other contexts.
The problem of gene network reconstruction has been formulated in many ways, most ambitiously as a problem in genome-scale causal inference: given the expression levels of all genes in many cells or samples, infer the direct regulatory relationships between the genes (Fig. 5). A statistical approach to this problem models each gene as a dependent variable explained by all others, as well as any extrinsic factors such as drug treatments or environmental stresses (Fig. 5a). However, a pattern of correlation between two genes is insufficient to declare one a regulator of the other. As more pairs of correlated genes are considered, one must discriminate between an exploding number of regulatory architectures consistent with the correlations (Fig. 5b); for a genome with 20,000 genes, there are 400 million possible regulator–target pairs to consider. Even to conclude that two genes are correlated and that changes in one follow the other requires some care because sparse count data can bias simple measures of relatedness65 (for example, Pearson’s r). But the greater challenge lies in distinguishing causal links from correlations. Casting John Stuart Mill’s classic criteria for causal inference66 in genetic terms, defending a regulatory link between one gene A and another gene B requires establishing that changes in A precede changes in B, that measurements of A are correlated with B, and that no other factor, neither a third gene C nor some external variable, explains changes in both A and B. That is, are two genes still correlated, even if one accounts for co-variation with every other gene in the genome? Computing these conditional dependencies massively increases both the computational burden and the amount of data needed for accurate estimates. There are so many parameters to estimate that a naive approach will produce either a uselessly inaccurate result or, after accounting for multiple testing, no result at all.
a, In a hypothetical analysis of co-expression between three genes A, B and C across cells, positive correlations between them suggest that they may regulate one another. However, the drivers for variation in the expression of these genes could be some external factor, such as stress. b, Three possible regulatory architectures that are consistent with the correlations shown in panel a. c, Fitting regression models that correlate the residuals of two genes, having controlled for the third, along with any other measured experimental variables, eliminates confounders and discriminates between the three possible architectures. TF, transcription factor.
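The residual-regression idea in panel c is, at its core, a partial correlation: regress gene C (and any measured covariates) out of both A and B, then correlate what remains. A minimal sketch with simulated data (ordinary least squares on Gaussian values; real single-cell counts would need the count-aware measures discussed above65):

```python
import numpy as np

def partial_corr(x, y, z):
    """Correlation between x and y after regressing z out of both —
    a basic test of conditional dependence between two genes given a third."""
    Z = np.column_stack([np.ones_like(z), z])           # intercept + confounder
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]   # residuals of x given z
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]   # residuals of y given z
    return float(np.corrcoef(rx, ry)[0, 1])
```

When a hidden driver induces both genes, the raw correlation is strong but the partial correlation collapses towards zero, arguing against a direct link between A and B.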
Three main statistical strategies make this causal inference problem more tractable, each served by a suite of new technologies. A first strategy is to reduce the number of genes or gene–gene pairs in the inference by ruling out genes that cannot regulate others or by declaring some gene–gene interactions to be mechanistically implausible. For example, ‘structural genes’, such as actins, do not regulate most other genes. Focusing only on transcriptional networks shrinks the set of putative regulatory genes to transcription factors and imposes the requirement that such regulators recognize their target genes through cognate binding sequences in flanking non-coding DNA. This strategy (reviewed recently67) is served by multi-modal single-cell assays that read out not only expression levels (a measure of gene output) but also chromatin accessibility, histone modifications or intracellular signalling (all measures of gene input). A second strategy is to perturb the genes (for example, through CRISPR68,69,70,71), which establishes (or excludes) their roles as causal regulators of other genes. Expansions of this concept can be used to explore the vast space of knockout combinations efficiently72, to probe how genetic mutations interact with drug treatments73 and to ask how genes determine the way cells interact with each other74. An important limitation in such screens is that not all cells receive functional edits. Statistical tools75, improved genome editing76 and direct genotyping protocols77 could provide clearer genotype-to-phenotype associations. A third strategy is to collect vast amounts of data, from diverse conditions, to isolate the most informative correlations. Major improvements to single-cell protocols now enable the analysis of cells from many samples2,78,79,80, each perturbed in different ways, in the same experiment, opening the door to screens and systematic reverse genetic studies.
Forecasting cell fate
Understanding what fates and functions cells will adopt in the future is central in both development and disease. Single-cell sequencing experiments carried out over time can capture a nearly continuous view of how cells regulate genes as they differentiate or enter pathological states. Among the first applications of single-cell RNA-seq towards understanding development was trajectory inference, the computational reconstruction of transcriptional programmes that cells execute as they differentiate81,82. Trajectory inference algorithms83 organize differentiating cells in pseudotime according to their maturity and fate decisions (Fig. 6a), enabling one to construct statistical models that anticipate the kinetics of each gene in each cell type. These algorithms were the starting point for more ambitious efforts to forecast the transcriptomes and proportions of cell types in developing tissues, including those subjected to genetic, environmental and other perturbations (Fig. 6b,c).
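At its crudest, pseudotime is an ordering of cells along a dominant axis of transcriptional variation. The toy sketch below uses the first principal component, oriented by a user-chosen root cell; genuine trajectory inference tools83 instead fit graphs or principal curves and handle branching, but the underlying idea of projecting cells onto a latent ordering is the same:

```python
import numpy as np

def crude_pseudotime(X, root=0):
    """Toy pseudotime: centre the cells-by-genes matrix X, project each
    cell onto the leading principal axis and orient the axis so that the
    chosen root cell sits near the start. Returns values scaled to [0, 1]."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)  # PCA via SVD
    t = Xc @ Vt[0]                                     # projection on PC1
    if t[root] > t.mean():
        t = -t                                         # orient away from root
    t -= t.min()
    return t / t.max()
```

On data with a single dominant progression, the recovered ordering tracks the true one closely; on branched or cyclic processes a one-dimensional projection would fail, which is exactly what motivates the graph-based trajectory algorithms.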
a, Visualizing cells undergoing fate decisions by embedding transcriptionally similar cells near one another in two or three dimensions can reveal molecular trajectories that capture the events leading up to and following their fate decisions, but may require time series experiments to orient the trajectory. b, Comparing recently synthesized, incompletely processed transcripts to their mature counterparts (RNA velocity) anticipates each cell’s near-term future transcriptional state. Direct RNA labelling experimentally distinguishes more and less recently synthesized transcripts to achieve similar ends. Either approach can provide a forecast of cells’ future states. c, Inferred gene regulatory networks (GRNs) can be used to forecast the effects of knocking out genes in the network on the distribution of cells over the trajectory, losses of specific fates (indicated by red arrows) and changes in individual genes that would otherwise be expected in the wild type. d, Foundation models can be trained on a large single-cell data corpus (for example, the Human Cell Atlas) and then adapted to perform numerous statistical inference and forecasting tasks. ΔA and ΔB, mutant genotypes lacking genes A and B, respectively; NA, not applicable; TF, transcription factor.
Statistical methods for time series analysis aim to predict the future state of a system using its past and current states. For example, autoregressive models, which learn correlation structure between present data and the recent past to predict the future84, are widely applied in econometrics and to forecast financial markets. Gaussian process models elaborate this concept and have been used to pinpoint the moment individual genes begin to undergo divergent expression as bipotent progenitors differentiate85. However, because current single-cell sequencing technologies are destructive, they do not repeatedly sample the same individual cells, limiting the amount of auto-correlation in the data86.
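As a minimal illustration of the autoregressive idea, an AR(1) model can be fitted by ordinary least squares and iterated forward to forecast. The data here are generic simulated time series values, standing in for any quantity tracked over time:

```python
import numpy as np

def fit_ar1(x):
    """Least-squares fit of the AR(1) model x[t] = c + phi * x[t-1] + noise."""
    X = np.column_stack([np.ones(len(x) - 1), x[:-1]])   # intercept + lag-1
    c, phi = np.linalg.lstsq(X, x[1:], rcond=None)[0]
    return float(c), float(phi)

def forecast(x_last, c, phi, steps=1):
    """Iterate the fitted recurrence forward to forecast future values."""
    out = []
    for _ in range(steps):
        x_last = c + phi * x_last
        out.append(x_last)
    return out
```

For example, forecast(0.0, 0.5, 0.8, steps=3) iterates the recurrence to give 0.5, 0.9 and 1.22, converging towards the stationary mean c/(1 − phi) = 2.5. The fit relies on observing the same series at successive times, which is precisely what destructive single-cell assays deny us.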
Both experimental techniques and computational insights have been proposed to learn how past and future states are linked without requiring repeated measures. Single-cell RNA-seq destroys each cell, but labelling RNA while the cell is still alive provides a means of discriminating between older RNAs and more recent ones87. RNAs from newly activated genes will appear in the unlabelled but not the labelled fraction, and therefore indicate where a cell is ‘headed’, transcriptionally speaking. RNA velocity is a purely computational technique that achieves similar ends by comparing the relative abundances of incompletely processed transcripts to mature, spliced isoforms88 (Fig. 6b). Transcriptional velocities have proved powerful in computational ‘fate mapping’. For example, CellRank estimates RNA velocity fields, taking care to propagate uncertainty in its estimates into downstream computations, to accurately predict cellular outcomes even outside normal development, such as cellular reprogramming and tissue regeneration89. Dynamo is a recent algorithm that computes transcriptional velocity vectors from labelled RNA (if available) or in a manner similar to RNA velocity (if not) and then uses the vector field to infer gene regulatory networks and use them to predict future cell fates90.
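In its simplest steady-state form, RNA velocity reduces to comparing each cell’s unspliced abundance u with what its spliced abundance s predicts: fit the steady-state ratio gamma across cells, then score each cell by u − gamma·s. A toy sketch, with the kinetic rate constants folded into gamma and no uncertainty handling; real implementations88,89,90 model the kinetics and propagate uncertainty:

```python
import numpy as np

def fit_gamma(u, s):
    """Estimate the steady-state ratio gamma as the no-intercept slope
    of unspliced (u) versus spliced (s) abundance across cells."""
    return float(np.dot(s, u) / np.dot(s, s))

def rna_velocity(u, s, gamma):
    """Steady-state-style velocity: positive when a cell carries more
    unspliced transcript than its spliced level predicts (induction),
    negative when it carries less (repression)."""
    return u - gamma * s
```

Cells above the fitted u-versus-s line are inferred to be up-regulating the gene, and cells below it to be shutting it off, giving each cell a direction in expression space.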
Extrapolating cell fates is even more challenging in the context of genetic, environmental or other perturbations. Accurate models for predicting what would happen to a population of cells in an as-yet-unseen mutant or under particular environmental conditions are of paramount interest, and techniques for ‘out-of-sample’ prediction are beginning to emerge. CellOracle is a tool that builds on ideas from trajectory analysis and RNA velocity, but also constructs approximate gene regulatory networks to predict what would happen if one of those genes were absent91 (Fig. 6c). To do so, the tool deploys several of the strategies discussed above to make the gene regulatory network problem tractable. For example, CellOracle focuses on transcriptional regulatory networks and leverages prior knowledge from single-cell ATAC-seq data and transcription factor binding motifs to constrain the problem91. It also limits the number of regulators through a statistical technique called penalization, prioritizing overall accuracy in forecasting the future transcriptome of the cell rather than capturing every regulator of each gene. Linear models of gene regulation as used by CellOracle are straightforward to interpret, which means that in silico predictions can be readily tested in the laboratory. A disadvantage of linear models is that they may be too simple to capture the complex, nonlinear interactions between genes, cell types and environmental factors needed for accurate out-of-distribution forecasting.
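Penalization can be illustrated with a bare-bones lasso fitted by coordinate descent: the soft-thresholding step sets the coefficients of weakly supported regulators exactly to zero, leaving each target gene with a short list of candidate regulators. This is a generic sketch on simulated data, not CellOracle’s actual implementation:

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate-descent lasso: regress target-gene expression y on
    candidate-regulator expression X with an L1 penalty lam, so weak
    regulators receive exactly zero weight."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]     # partial residual
            rho = X[:, j] @ r
            # soft-thresholding: shrink towards zero, clip at zero
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return beta
```

With the penalty set high enough, only the regulators with strong evidence survive, at the cost of slightly shrinking their estimated effects — the accuracy-versus-completeness trade-off described above.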
Deep learning methods are emerging to facilitate in silico predictions of cell states and fates. These techniques trade interpretability for accuracy in forecasting and require dramatically more training data and computational power to build. A first class of methods is built for specific single-cell analysis tasks: scGen92 and a related subsequent method, CPA93, use variational autoencoders to learn low-dimensional representations of cell transcriptomes that enable one to predict gene-expression changes following gene knockouts, chemical perturbations and other insults. For example, by first embedding cells treated with individual small molecules into a low-dimensional space, and then learning how each drug moves cells around in that space, CPA can predict the effects of combinations of drugs that were not included in the training dataset. A second class of techniques, exemplified by Geneformer94 and scGPT95, builds on the same principles that led to the success of large language models (LLMs) such as ChatGPT: a foundation model based on attention networks is first trained to learn how genes co-vary with one another across a huge, diverse collection of cell types and states. This foundation model can be used as is for some tasks (so-called zero-shot prediction), such as in silico gene knockout analysis, or it can be fine-tuned to perform a more specialized application, such as cell-type annotation or identifying transcription factor targets (Fig. 6d).
There is a robust discussion ongoing within the machine learning community regarding how to combine the strengths of deep learning methods and more traditional modelling techniques, especially for very challenging problems such as forecasting a gene’s regulation in a developing embryo. On the one hand, discriminating between equally plausible mechanistic models may demand the accuracy that thus far only deep learning tools have attained. On the other hand, given many free parameters, a deep learning method may discover a highly accurate but totally uninterpretable or mechanistically completely wrong model for a given system. If we are to use the internal structure of our models to guide mechanistic experiments in the laboratory, we must demand more than accuracy of our statistical methods and artificial intelligence (AI) tools. They must not only learn biology but also be able to convey it to us in terms we understand. Accordingly, intense effort in the AI community is being directed towards making the output of AI models more readily interpretable96.
Conclusions and future directions
As dramatic as the advances in genomic technologies over the past decade have been, there is plenty of room for further progress. Protein abundances and other critical measurements remain challenging to capture in single cells. In terms of scale, experiments with hundreds of specimens remain cost-prohibitive for many laboratories, limiting access to cohort studies and screens. Costs of single-cell sequencing have plummeted in recent years, with new protocols97,98 and new commercial offerings increasingly democratizing the toolset, but they remain expensive in absolute terms and further cost reductions are needed. One can profile a handful of molecular modalities in single cells, but quantifying the contribution of the many inputs to gene regulation may require simultaneous measurement of myriad aspects of each cell’s molecular state. Gene knockouts with CRISPR are routine, but highly efficient, precise in vivo base editing would supercharge efforts to stratify the many thousands of outstanding genetic variants of unknown importance. The value of measuring cells in their tissue contexts was made abundantly clear by early spatial transcriptomic applications, but the current technology is largely two-dimensional, rather than three- or four-dimensional, distorting or excluding whole dimensions of context. Multiplexed single-cell measurement is destructive, limiting efforts to forecast cellular behaviour. Molecular recorders may mitigate this challenge but remain in their infancy, with limited ‘memory’ and programmability, and have yet to reach the level of democratization needed to see them deployed except in a few select applications. Fortunately, the pace of single-cell technology development remains as rapid as ever.
Although we have statistical tools to meet the challenges and opportunities presented by single-cell genomic and multiplexed imaging technologies, many of these tools remain inaccessible. There are at least three areas of investment that would help biologists to unlock their data using ideas from statistics. The first is bespoke statistical software that is purpose-built to work with single-cell or spatial sequencing data. For example, many regression analysis packages are general and can be configured to work with all kinds of different data, but running them on a single-cell experiment requires detailed, intimate knowledge of the package, possibly intimidating a user without formal statistical training. Furthermore, the size and scale of single-cell data often exceed the capacity of data structures and algorithms of general-purpose software. Experiments that perturb hundreds or even thousands of genes will demand not only scalable data structures but intuitive, responsive visualization tools to help users to interpret the many phenotypes that emerge at such scale11. The second is statistical education and outreach. The data science community has made tremendous progress educating scientists without computational backgrounds in using regression analysis and other techniques discussed here, but barriers to access could be lower. The third is closer communication between statisticians, geneticists and technology developers to shape future protocols and assays. By connecting developers of statistical methods with developers of genomic technologies, we will produce powerful, integrated solutions to answer questions that have frustrated geneticists for decades. New technologies invariably face trade-offs, and a statistical perspective can be tremendously useful for navigating these. For example, a new technique might favour more samples over profiling each sample more deeply. Will this limit statistical power when applying it to key questions in genetics?
Such conversations might induce technology developers to invent assays they never would have considered had they not been made aware of the pain points that arise during statistical analysis or genetic interpretation.
The rise of machine learning has inescapable implications for biology. Although the accuracy of such models is still being assessed and a truly general AI model of gene regulation at genome scale may require much more data than is currently available, the pace of single-cell technology development, data generation and transfer learning suggests a future in which biologists can make extremely precise predictions about genes, cells, tissues and individual people. Such predictions will have both obvious and unanticipated applications across diverse areas of biomedicine. However, the convergence of statistical learning, deep learning and massive single-cell datasets will not alleviate the need for real-world experiments using conventional molecular and genetic approaches. Even the best models will be complex in terms of their internal structure, many will be largely or wholly uninterpretable in terms of underlying biological mechanisms, and there will probably be many equally plausible models for the same observations. Nevertheless, models will continue to help to ascribe functions to genes, either directly or by excluding reasonable hypotheses about how these genes might work. To paraphrase George Box’s famous aphorism, these models will all be wrong, but some will be useful in guiding mechanistic experiments. The models will come and go, but the mechanistic knowledge to which they direct us will be forever.
References
Replogle, J. M. et al. Mapping information-rich genotype-phenotype landscapes with genome-scale Perturb-seq. Cell 185, 2559–2575.e28 (2022).
Srivatsan, S. R. et al. Massively multiplex chemical transcriptomics at single-cell resolution. Science 367, 45–51 (2020).
Funk, L. et al. The phenotypic landscape of essential human genes. Cell 185, 4634–4653.e22 (2022).
Cao, J. et al. The single-cell transcriptional landscape of mammalian organogenesis. Nature 566, 496–502 (2019).
Cusanovich, D. A. et al. Multiplex single cell profiling of chromatin accessibility by combinatorial cellular indexing. Science 348, 910–914 (2015).
Mulqueen, R. M. et al. Highly scalable generation of DNA methylation profiles in single cells. Nat. Biotechnol. 36, 428–431 (2018).
Cao, J. et al. Joint profiling of chromatin accessibility and gene expression in thousands of single cells. Science 361, 1380–1385 (2018).
Chen, A. et al. Spatiotemporal transcriptomic atlas of mouse organogenesis using DNA nanoball-patterned arrays. Cell 185, 1777–1792.e21 (2022).
Chavez, M., Chen, X., Finn, P. B. & Qi, L. S. Advances in CRISPR therapeutics. Nat. Rev. Nephrol. 19, 9–22 (2023).
Kemmeren, P. et al. Large-scale genetic perturbations reveal regulatory networks and an abundance of gene-specific repressors. Cell 157, 740–752 (2014).
Dorrity, M. W., Saunders, L. M., Queitsch, C., Fields, S. & Trapnell, C. Dimensionality reduction by UMAP to visualize physical and genetic interactions. Nat. Commun. 11, 1537 (2020).
Phipson, B. et al. propeller: testing for differences in cell type proportions in single cell data. Bioinformatics 38, 4720–4726 (2022).
Heumos, L. et al. Best practices for single-cell analysis across modalities. Nat. Rev. Genet. 24, 550–572 (2023).
Rozenblatt-Rosen, O., Stubbington, M. J. T., Regev, A. & Teichmann, S. A. The human cell atlas: from vision to reality. Nature 550, 451–453 (2017).
Saunders, L. M. et al. Embryo-scale reverse genetics at single-cell resolution. Nature 623, 782–791 (2023). This paper deploys massively scalable single-cell RNA-seq on many developing wild-type and mutant zebrafish to measure the consequences of gene disruption on the whole transcriptome in each cell in the animal.
Cao, J. et al. A human cell atlas of fetal gene expression. Science 370, eaba7721 (2020).
Domcke, S. et al. A human cell atlas of fetal chromatin accessibility. Science 370, eaba7612 (2020).
BRAIN Initiative Cell Census Network (BICCN). A multimodal cell census and atlas of the mammalian primary motor cortex. Nature 598, 86–102 (2021).
Sikkema, L. et al. An integrated cell atlas of the lung in health and disease. Nat. Med. 29, 1563–1577 (2023).
Pliner, H. A., Shendure, J. & Trapnell, C. Supervised classification enables rapid annotation of cell atlases. Nat. Meth. 16, 983–986 (2019).
Fu, R. et al. clustifyr: an R package for automated single-cell RNA sequencing cluster classification. F1000Res. 9, 223 (2020).
Domínguez Conde, C. et al. Cross-tissue immune cell analysis reveals tissue-specific features in humans. Science 376, eabl5197 (2022).
Lotfollahi, M. et al. Mapping single-cell data to reference atlases by transfer learning. Nat. Biotechnol. 40, 121–130 (2022).
Kang, J. B. et al. Efficient and precise single-cell reference atlas mapping with Symphony. Nat. Commun. 12, 5890 (2021).
Hao, Y. et al. Integrated analysis of multimodal single-cell data. Cell 184, 3573–3587.e29 (2021).
Friedman, J., Hastie, T. & Tibshirani, R. Lasso and elastic-net regularized generalized linear models. glmnet https://glmnet.stanford.edu (2009).
Hao, Y. et al. Dictionary learning for integrative, multimodal and scalable single-cell analysis. Nat. Biotechnol. 42, 293–304 (2024).
Persad, S. et al. SEACells infers transcriptional and epigenomic cellular states from single-cell genomics data. Nat. Biotechnol. 41, 1746–1757 (2023).
Yu, F. et al. Variant to function mapping at single-cell resolution through network propagation. Nat. Biotechnol. 40, 1644–1653 (2022).
Cao, Y. et al. scDC: single cell differential composition analysis. BMC Bioinform. 20, 721 (2019).
Burkhardt, D. B. et al. Quantifying the effect of experimental perturbations at single-cell resolution. Nat. Biotechnol. 39, 619–629 (2021).
Dann, E., Henderson, N. C., Teichmann, S. A., Morgan, M. D. & Marioni, J. C. Differential abundance testing on single-cell data using k-nearest neighbor graphs. Nat. Biotechnol. 40, 245–253 (2022).
Hasegawa, Y. et al. Pulmonary osteoclast-like cells in silica induced pulmonary fibrosis. Preprint at bioRxiv https://doi.org/10.1101/2023.02.17.528996 (2023).
Squair, J. W. et al. Confronting false discoveries in single-cell differential expression. Nat. Commun. 12, 5692 (2021).
Zimmerman, K. D., Espeland, M. A. & Langefeld, C. D. A practical solution to pseudoreplication bias in single-cell studies. Nat. Commun. 12, 738 (2021).
Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 550 (2014).
Robinson, M. D., McCarthy, D. J. & Smyth, G. K. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139–140 (2010).
Neufeld, A., Popp, J., Gao, L. L., Battle, A. & Witten, D. Negative binomial count splitting for single-cell RNA sequencing data. Preprint at arxiv https://doi.org/10.48550/arXiv.2307.12985 (2023).
Neufeld, A., Dharamshi, A., Gao, L. L. & Witten, D. Data thinning for convolution-closed distributions. Preprint at arxiv https://doi.org/10.48550/arXiv.2301.07276 (2023).
Dharamshi, A. et al. Generalized data thinning using sufficient statistics. Preprint at arxiv https://doi.org/10.48550/arXiv.2303.12931 (2023).
Vandereyken, K., Sifrim, A., Thienpont, B. & Voet, T. Methods and applications for single-cell and spatial multi-omics. Nat. Rev. Genet. 24, 494–515 (2023).
Stoeckius, M. et al. Simultaneous epitope and transcriptome measurement in single cells. Nat. Meth. 14, 865–868 (2017).
Mimitou, E. P. et al. Scalable, multimodal profiling of chromatin accessibility, gene expression and protein levels in single cells. Nat. Biotechnol. 39, 1246–1258 (2021).
Ma, S. et al. Chromatin potential identified by shared single-cell profiling of RNA and chromatin. Cell 183, 1103–1116.e20 (2020).
Jacob, F. & Monod, J. Genetic regulatory mechanisms in the synthesis of proteins. J. Mol. Biol. 3, 318–356 (1961).
Phillips, R. Napoleon is in equilibrium. Annu. Rev. Condens. Matter Phys. 6, 85–111 (2015). This paper discusses the power and limitations of statistical mechanics in constructing quantitative models of gene regulation.
Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B 58, 267–288 (1996).
Akaike, H. Information theory and an extension of the maximum likelihood principle. In Selected Papers of Hirotugu Akaike (eds Parzen, E., Tanabe, K. & Kitagawa, G.) 199–213 (Springer, 1998).
Klumpe, H. E. et al. The context-dependent, combinatorial logic of BMP signaling. Cell Syst. 18, 388–407.e10 (2022). This study combines genetic perturbations of the BMP pathway with statistical modelling to discriminate between possible functions for genes using the quantitative kinetics of the cellular behaviours they control.
Moffitt, J. R., Lundberg, E. & Heyn, H. The emerging landscape of spatial profiling technologies. Nat. Rev. Genet. 23, 741–759 (2022).
Elhanani, O., Ben-Uri, R. & Keren, L. Spatial profiling technologies illuminate the tumor microenvironment. Cancer Cell 41, 404–420 (2023).
Hildebrandt, F. et al. Spatial transcriptomics to define transcriptional patterns of zonation and structural components in the mouse liver. Nat. Commun. 12, 7046 (2021).
Hu, S. et al. Single-cell spatial transcriptomics reveals a dynamic control of metabolic zonation and liver regeneration by endothelial cell Wnt2 and Wnt9b. Cell Rep. Med. 3, 100754 (2022).
Rodriques, S. G. et al. Slide-seq: a scalable technology for measuring genome-wide expression at high spatial resolution. Science 363, 1463–1467 (2019).
Moncada, R. et al. Integrating microarray-based spatial transcriptomics and single-cell RNA-seq reveals tissue architecture in pancreatic ductal adenocarcinomas. Nat. Biotechnol. 38, 333–342 (2020).
Cable, D. M. et al. Robust decomposition of cell type mixtures in spatial transcriptomics. Nat. Biotechnol. 40, 517–526 (2022). This paper exemplifies the utility of statistical deconvolution techniques to overcome limitations of spatial transcriptomics, improving the resolution and power of the technology.
Srivatsan, S. R. et al. Embryo-scale, single-cell spatial transcriptomics. Science 373, 111–117 (2021).
Jones, A., Townes, F. W., Li, D. & Engelhardt, B. E. Alignment of spatial genomics data using deep Gaussian processes. Nat. Meth. 20, 1379–1387 (2023). This paper explores how spatial statistics can be augmented with techniques from deep learning to solve difficult problems in spatial data integration.
Sulston, J. E., Schierenberg, E., White, J. G. & Thomson, J. N. The embryonic cell lineage of the nematode Caenorhabditis elegans. Dev. Biol. 100, 64–119 (1983).
Sankaran, V. G., Weissman, J. S. & Zon, L. I. Cellular barcoding to decipher clonal dynamics in disease. Science 378, eabm5874 (2022).
Li, Z. et al. Reconstructing cell lineage trees with genomic barcoding: approaches and applications. J. Genet. Genom. 51, 35–47 (2023).
McKenna, A. et al. Whole-organism lineage tracing by combinatorial and cumulative genome editing. Science 353, aaf7907 (2016). This paper introduces the concept of cumulative, CRISPR-based genome editing to write lineage histories into the genomes of developing embryos.
Li, L. et al. A mouse model with high clonal barcode diversity for joint lineage, transcriptomic, and epigenomic profiling in single cells. Cell 186, 5183–5199.e22 (2023).
Rosenberg, N. A. & Nordborg, M. Genealogical trees, coalescent theory and the analysis of genetic polymorphisms. Nat. Rev. Genet. 3, 380–390 (2002).
Serra, A., Coretto, P., Fratello, M., Tagliaferri, R. & Stegle, O. Robust and sparse correlation matrix estimation for the analysis of high-dimensional genomics data. Bioinformatics 34, 625–634 (2018).
Shadish, W. R., Cook, T. D. & Campbell, D. T. Experimental and Quasi-experimental Designs for Generalized Causal Inference (Houghton Mifflin, 2002).
Badia-I-Mompel, P. et al. Gene regulatory network inference in the era of single-cell multi-omics. Nat. Rev. Genet. 24, 739–754 (2023).
Adamson, B. et al. A multiplexed single-cell CRISPR screening platform enables systematic dissection of the unfolded protein response. Cell 167, 1867–1882.e21 (2016).
Dixit, A. et al. Perturb-seq: dissecting molecular circuits with scalable single-cell RNA profiling of pooled genetic screens. Cell 167, 1853–1866.e17 (2016).
Jaitin, D. A. et al. Dissecting immune circuits by linking CRISPR-pooled screens with single-cell RNA-seq. Cell 167, 1883–1896.e15 (2016).
Datlinger, P. et al. Pooled CRISPR screening with single-cell transcriptome readout. Nat. Methods 14, 297–301 (2017).
Norman, T. M. et al. Exploring genetic interaction manifolds constructed from rich single-cell phenotypes. Science 365, 786–793 (2019).
McFaline-Figueroa, J. L. et al. Multiplex single-cell chemical genomics reveals the kinase dependence of the response to targeted therapy. Cell Genom. 4, 100487 (2024).
Liu, S. J. et al. In vivo perturb-seq of cancer and immune cells dissects oncologic drivers and therapy response. Preprint at bioRxiv https://doi.org/10.1101/2023.09.01.555831 (2023).
Papalexi, E. et al. Characterizing the molecular regulation of inhibitory immune checkpoints with multimodal single-cell screens. Nat. Genet. 53, 322–331 (2021).
Martin-Rufino, J. D. et al. Massively parallel base editing to map variant effects in human hematopoiesis. Cell 186, 2456–2474.e24 (2023).
Olsen, T. R. et al. Scalable co-sequencing of RNA and DNA from individual nuclei. Preprint at bioRxiv https://doi.org/10.1101/2023.02.09.527940 (2023).
Stoeckius, M. et al. Cell hashing with barcoded antibodies enables multiplexing and doublet detection for single cell genomics. Genome Biol. 19, 224 (2018).
Gehring, J., Hwee Park, J., Chen, S., Thomson, M. & Pachter, L. Highly multiplexed single-cell RNA-seq by DNA oligonucleotide tagging of cellular proteins. Nat. Biotechnol. 38, 35–38 (2020).
McGinnis, C. S. et al. MULTI-seq: sample multiplexing for single-cell RNA sequencing using lipid-tagged indices. Nat. Methods 16, 619–626 (2019).
Trapnell, C. et al. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat. Biotechnol. 32, 381–386 (2014).
Bendall, S. C. et al. Single-cell trajectory detection uncovers progression and regulatory coordination in human B cell development. Cell 157, 714–725 (2014).
Deconinck, L., Cannoodt, R., Saelens, W., Deplancke, B. & Saeys, Y. Recent advances in trajectory inference from single-cell omics data. Curr. Opin. Syst. Biol. 27, 100344 (2021).
Diggle, P. Time Series: A Biostatistical Introduction 257 (Oxford Univ. Press, 1990).
Boukouvalas, A., Hensman, J. & Rattray, M. BGP: identifying gene-specific branching dynamics from single-cell data with a branching Gaussian process. Genome Biol. 19, 65 (2018).
Qiu, X. et al. Inferring causal gene regulatory networks from coupled single-cell expression dynamics using scribe. Cell Syst. 10, 265–274.e11 (2020).
Cao, J., Zhou, W., Steemers, F., Trapnell, C. & Shendure, J. Sci-fate characterizes the dynamics of gene expression in single cells. Nat. Biotechnol. 38, 980–988 (2020).
La Manno, G. et al. RNA velocity of single cells. Nature 560, 494–498 (2018).
Lange, M. et al. CellRank for directed single-cell fate mapping. Nat. Methods 19, 159–170 (2022).
Qiu, X. et al. Mapping transcriptomic vector fields of single cells. Cell 185, 690–711.e45 (2022).
Kamimoto, K. et al. Dissecting cell identity via network inference and in silico gene perturbation. Nature 614, 742–751 (2023). This paper introduces CellOracle, an algorithm for forecasting the effects of genetic perturbations on cell fates in developmental or reprogramming contexts.
Lotfollahi, M., Wolf, F. A. & Theis, F. J. scGen predicts single-cell perturbation responses. Nat. Methods 16, 715–721 (2019). This paper explores the potential of variational autoencoders for making predictions about the future behaviour of individual cells after genetic perturbations.
Lotfollahi, M. et al. Predicting cellular responses to complex perturbations in high-throughput screens. Mol. Syst. Biol. 19, e11517 (2023).
Theodoris, C. V. et al. Transfer learning enables predictions in network biology. Nature 618, 616–624 (2023). This paper introduces Geneformer, a ‘foundational model’ of gene regulation constructed from the Human Cell Atlas, and demonstrates its versatility for addressing diverse problems in human genetics.
Cui, H. et al. scGPT: toward building a foundation model for single-cell multi-omics using generative AI. Nat. Methods https://doi.org/10.1038/s41592-024-02201-0 (2024).
Novakovsky, G., Dexter, N., Libbrecht, M. W., Wasserman, W. W. & Mostafavi, S. Obtaining genetics insights from deep learning via explainable artificial intelligence. Nat. Rev. Genet. 24, 125–137 (2023).
Martin, B. K. et al. Optimized single-nucleus transcriptional profiling by combinatorial indexing. Nat. Protoc. 18, 188–207 (2022).
Sziraki, A. et al. A global view of aging and Alzheimer’s pathogenesis-associated cell population dynamics and molecular signatures in human and mouse brains. Nat. Genet. 55, 2104–2116 (2023).
Macosko, E. Z. et al. Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell 161, 1202–1214 (2015).
Klein, A. M. et al. Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell 161, 1187–1201 (2015).
Clark, I. C. et al. Microfluidics-free single-cell genomics with templated emulsification. Nat. Biotechnol. 41, 1557–1566 (2023).
Lareau, C. A. et al. Droplet-based combinatorial indexing for massive-scale single-cell chromatin accessibility. Nat. Biotechnol. 37, 916–924 (2019).
Cao, J. et al. Comprehensive single-cell transcriptional profiling of a multicellular organism. Science 357, 661–667 (2017).
Rosenberg, A. B. et al. Single-cell profiling of the developing mouse brain and spinal cord with split-pool barcoding. Science 360, 176–182 (2018).
O’Huallachain, M. et al. Ultra-high throughput single-cell analysis of proteins and RNAs by split-pool synthesis. Commun. Biol. 3, 213 (2020).
Peterson, V. M. et al. Multiplexed quantification of proteins and transcripts in single cells. Nat. Biotechnol. 35, 936–939 (2017).
Swanson, E. et al. Simultaneous trimodal single-cell measurement of transcripts, epitopes, and chromatin accessibility using TEA-seq. eLife 10, e63632 (2021).
Blair, J. D. et al. Phospho-seq: integrated, multi-modal profiling of intracellular protein dynamics in single cells. Preprint at bioRxiv https://doi.org/10.1101/2023.03.27.534442 (2023).
Liscovitch-Brauer, N. et al. Profiling the genetic determinants of chromatin accessibility with scalable single-cell CRISPR screens. Nat. Biotechnol. 39, 1270–1277 (2021).
Shah, S., Lubeck, E., Zhou, W. & Cai, L. In situ transcription profiling of single cells reveals spatial organization of cells in the mouse hippocampus. Neuron 92, 342–357 (2016).
Chen, K. H., Boettiger, A. N., Moffitt, J. R., Wang, S. & Zhuang, X. Spatially resolved, highly multiplexed RNA profiling in single cells. Science 348, aaa6090 (2015).
Ståhl, P. L. et al. Visualization and analysis of gene expression in tissue sections by spatial transcriptomics. Science 353, 78–82 (2016).
Russell, A. J. C. et al. Slide-tags enables single-nucleus barcoding for multimodal spatial genomics. Nature 625, 101–109 (2024).
Goltsev, Y. et al. Deep profiling of mouse splenic architecture with CODEX multiplexed imaging. Cell 174, 968–981.e15 (2018).
Spencer Chapman, M. et al. Lineage tracing of human development through somatic mutations. Nature 595, 85–90 (2021).
Ludwig, L. S. et al. Lineage tracing in humans enabled by mitochondrial mutations and single-cell genomics. Cell 176, 1325–1339.e22 (2019).
Zafar, H., Lin, C. & Bar-Joseph, Z. Single-cell lineage tracing by integrating CRISPR-Cas9 mutations with transcriptomic data. Nat. Commun. 11, 3055 (2020).
Forrow, A. & Schiebinger, G. LineageOT is a unified framework for lineage tracing and trajectory inference. Nat. Commun. 12, 4940 (2021).
Biddy, B. A. et al. Single-cell mapping of lineage and identity in direct reprogramming. Nature 564, 219–224 (2018).
Weinreb, C., Rodriguez-Fraticelli, A., Camargo, F. D. & Klein, A. M. Lineage tracing on transcriptional landscapes links state to fate during differentiation. Science 367, eaaw3381 (2020).
Frieda, K. L. et al. Synthetic recording and in situ readout of lineage information in single cells. Nature 541, 107–111 (2016).
Choi, J. et al. A time-resolved, multi-symbol molecular recorder via sequential genome editing. Nature 608, 98–107 (2022).
Chen, W. et al. Multiplex genomic recording of enhancer and signal transduction activity in mammalian cells. Preprint at bioRxiv https://doi.org/10.1101/2021.11.05.467434 (2021).
Acknowledgements
The author is grateful to D. Kimelman and members of their own laboratory for critical feedback on the manuscript. The author's work is supported by grants from the Paul G. Allen Frontiers Group, the Chan Zuckerberg Initiative and the National Institutes of Health (UM1HG011586, 1R01HG010632, RC2DK114777 and R01HG012761), as well as by the Seattle Hub for Synthetic Biology, a collaboration between the Allen Institute, the Chan Zuckerberg Initiative (award number CZIF2023-008738) and the University of Washington.
Ethics declarations
Competing interests
C.T. is a scientific advisory board member, consultant and/or co-founder of Algen Biotechnologies, Altius Therapeutics and Scale Biosciences.
Peer review information
Nature Reviews Genetics thanks Vijay G. Sankaran, and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Related links
Human Cell Atlas: https://www.humancellatlas.org/
Model organism atlases: https://descartes.brotmanbaty.org/
Glossary
- Akaike information criterion: Weighs the explanatory value of a model against its complexity (in terms of parameters that must be estimated from data).
- Attention network: A neural network that focuses training on parts of a larger problem, then composes those parts into an overall solution; can be trained in a semi-supervised or self-supervised manner, scaling to massive datasets; useful for modelling context in complex biological tasks such as predicting protein structure from sequence.
- Autoregressive model: A model that treats the past and current output of a process as input for predicting its future output; broadly useful in time series analysis and forecasting (for example, in financial markets).
- Bayesian network: A type of graphical model that captures the direction of dependencies between input variables, facilitating causal inference.
- Coalescent theory: A model of how alleles sampled from a population may have arisen from a common ancestor, with numerous applications in population genetics; potentially repurposable to model cell lineages.
- Data thinning: A technique that avoids circular reasoning when testing for differences between cell clusters defined from single-cell data.
- Deep learning: Machine learning methods that use layers of interconnected artificial neural networks to automatically discover how to represent input data for various tasks such as classification.
- Deep neural network: A class of artificial neural network that uses many layers of interconnected ‘artificial neurons’, each capable of performing a very simple calculation, to solve a complex task in machine learning (for example, classification or natural language translation).
- Foundation model: An artificial intelligence model trained on a vast amount of unlabelled data to perform a core task that can subsequently be adapted to many other tasks through transfer learning; useful for building large language models such as ChatGPT and text-to-image models such as DALL-E.
- Generalized linear mixed model (GLMM): A regression model that accounts for grouping structure in single-cell datasets.
- Generalized linear regression model (GLM): Used to quantify the effects of genotype, drugs, environments or other factors on gene expression or cell proportions.
- Graphical modelling: Captures the hierarchical (conditional) dependences between a set of input variables that contribute to an output.
- Kriging: A Gaussian process regression method for interpolating at unmeasured points based on measurements at nearby points; useful for both spatial and temporal data.
- LASSO: A regression tool used to identify a parsimonious set of variables that explain variation across cells or samples.
- Maximum parsimony: A method of reconstructing the phylogenetic tree that explains a set of alleles with the fewest changes from a common ancestral allele; useful for reconstructing cell lineages from molecular recorders.
- Mixture models: A modelling approach that represents variables as mixtures of two or more distributions, simultaneously estimating the component distributions and their relative proportions from data.
- Network propagation: A technique for identifying the most relevant reference data points or documents for a given query by measuring their distance to the query in a network of relatedness between all the documents in the database.
- Pseudotime: Single-cell trajectory inference; a data-driven, unsupervised approach for ordering cells according to developmental maturity based on their transcriptomes or other aspects of molecular state; useful for defining the sequence of gene regulatory events in a cell’s development.
- RNA velocity: A technique for forecasting a cell’s future transcriptome based on the differences between its fully processed mRNAs and its pre-mRNAs.
- Transfer learning: A machine learning strategy that adapts a model trained to perform one task to perform a different, related task; useful when training data for one task are limited but training data for the other are more abundant.
- Variational autoencoder: A type of artificial neural network that learns, in an unsupervised manner, both how to encode a set of unlabelled data into a low-dimensional representation and how to decode it; useful for single-cell data visualization and forecasting unseen perturbations.
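To make the maximum-parsimony idea in the glossary concrete, the following minimal Python sketch implements Fitch's small-parsimony algorithm: given a fixed, rooted binary lineage tree whose leaves carry observed barcode states, it counts the fewest state changes needed to explain the leaves. The tuple-based tree encoding and the example states are illustrative assumptions, not from the article.

```python
# Fitch's small-parsimony algorithm on a rooted binary tree.
# A tree is either a leaf (a single-character state string) or a
# (left, right) tuple of subtrees.

def fitch(tree):
    """Return (candidate ancestral states, minimum number of changes)."""
    if isinstance(tree, str):            # leaf: observed barcode state
        return {tree}, 0
    left, right = tree
    l_states, l_cost = fitch(left)
    r_states, r_cost = fitch(right)
    common = l_states & r_states
    if common:                           # children can agree: no new change
        return common, l_cost + r_cost
    return l_states | r_states, l_cost + r_cost + 1   # one change needed

# Four cells whose CRISPR-edited barcodes read 'A' or 'B'
tree = (("A", "A"), ("A", "B"))
states, changes = fitch(tree)            # changes == 1, states == {"A"}
```

A single mutation on the branch leading to the 'B' cell suffices, so the tree's parsimony score is 1; real lineage reconstructions search over many candidate trees for the one minimizing this score.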
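Likewise, the kriging entry can be illustrated with a short Gaussian process interpolation sketch in NumPy: the posterior mean of a zero-mean GP predicts a signal at an unmeasured position from noisy nearby measurements. The squared-exponential kernel, length scale and toy expression gradient below are illustrative choices, not from the article.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential covariance between 1D coordinate arrays a and b."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * d2 / length_scale**2)

def krige(x_obs, y_obs, x_new, length_scale=1.0, noise=1e-6):
    """Posterior mean of a zero-mean GP at x_new, given (x_obs, y_obs)."""
    K = rbf_kernel(x_obs, x_obs, length_scale) + noise * np.eye(len(x_obs))
    K_star = rbf_kernel(x_new, x_obs, length_scale)
    return K_star @ np.linalg.solve(K, y_obs)

# Interpolate an expression gradient measured at four positions along
# a tissue axis, predicting the value at an unmeasured midpoint.
x_obs = np.array([0.0, 1.0, 2.0, 3.0])
y_obs = np.array([1.0, 2.0, 3.0, 4.0])
y_mid = krige(x_obs, y_obs, np.array([1.5]))   # close to 2.5
```

Because the zero-mean prior shrinks predictions slightly toward zero, the interpolated value sits just under the linear-trend answer of 2.5; adding a mean function or fitting the length scale would tighten this.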
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Trapnell, C. Revealing gene function with statistical inference at single-cell resolution. Nat. Rev. Genet. (2024). https://doi.org/10.1038/s41576-024-00750-w