Introduction

How our genes work together to build our cells, organs and bodies, and how mutations in many genes contribute to disease remain fundamental questions in genetics. Despite more than a century of exploration, our inventory of the genes required by each of our cell types for their development and their specific functions within the body remains sparse, as is our view of the regulatory links between genes. Even setting aside the need to use laboratory animals as proxies for human biology, defining the genes essential for each cell type is a daunting prospect. Experiments must disrupt one or more genes within the animal and then assess the outcome across multiple cell types and potentially thousands of changes in gene expression at many different developmental time points.

Emerging measurement technologies based on sequencing1,2, imaging or both3 offer a route to opening this longstanding bottleneck in genetics. Single-cell transcriptome sequencing can be performed on millions of cells4 from thousands of independent samples2. Other aspects of the molecular states of cells, such as chromatin accessibility5 or DNA methylation6, can be measured in single cells at scale, often in conjunction with RNA7. Spatial molecular profiling is rapidly maturing, enabling surveys of the entire transcriptome across large fields of view8. Paralleling improvements in single-cell measurement technologies, improvements in the efficiency, precision and throughput of genome-engineering techniques now allow us to manipulate genes and cells in vivo9. Single-cell sequencing and CRISPR genome editing go hand in hand as a means of revealing how mutating a gene alters the expression of every other gene in each cell type. Systematically searching for genes that ‘phenocopy’ one another in terms of the molecular, anatomic or behavioural effects they elicit often reveals genes that function together through regulatory or biochemical interactions. Particularly in the context of large-scale perturbation studies10, ‘guilt-by-association’ analyses for finding sets of genes with similar phenotypes have proved invaluable to understand gene function, with whole biochemical complexes revealed11. Together, these technologies promise to generate an avalanche of data that captures the consequences of genome-scale experiments in which we measure the effects of our interventions on every gene in every cell type in a whole animal.

However, exploiting the full power of single-cell technologies to understand gene function demands thoughtful attention to fundamental statistical issues at the core of several recurring computational problems. A first problem is to catalogue and characterize the cell types based on differences in how they respond to perturbation. A second lies in discerning how each cell type regulates, and is regulated by, genes. A third is to understand how cell types descend from one another in the lineage or depend on each other via signalling. A fourth is how the genes depend on and regulate one another, and how that control varies within different cell types. And finally, a fifth problem is to integrate such knowledge to make accurate, quantitative forecasts. For example, what will happen in a mouse model of disease under yet-untested genetic, drug or environmental perturbations? What will happen in a specific patient when that person is prescribed a particular drug? Although not all these problems are purely or even primarily statistical in nature, attention to the statistical issues that arise when bringing single-cell technology to bear on them is key. Analogous statistical issues will arise when studying gene function with other measurement tools from the molecular to the anatomic scale.

A well formulated statistical model can do the following: weigh up the contribution of each of many input factors; transform qualitative phenotypes into quantitative ones; resolve technical limitations of the instrument that produces the data; or connect measurements made at various scales to capture or even forecast the behaviour of molecules, cells, tissues and whole organisms, including under perturbations and in new environments or contexts. Good models establish confidence that observed effects and phenotypes are ‘real’, can help to separate direct versus indirect effects, and can describe how variation in one variable propagates to or compounds variation in others, thus guiding subsequent experimental design decisions and saving time and laboratory resources.

In this Review, I highlight statistical concepts, models, tools and algorithms that can be repurposed to solve problems now arising or becoming increasingly urgent in genetic and molecular biology studies of development and disease. I outline the workflow for phenotyping with single-cell molecular profiling, including differential cell composition and gene expression analysis. I then introduce strategies for quantitatively modelling a gene’s regulation, for understanding lineage relationships and cell–cell signalling interactions, and for inferring gene regulatory networks. Finally, I turn to the problem of forecasting cell fate, briefly touching on the promise of advanced artificial intelligence systems for such problems. I do not discuss the relevant technologies in detail to avoid overlap with many recent reviews on single-cell sequencing, spatial molecular profiling and genome editing. I also avoid statistical subjects already very familiar to geneticists, such as genome-wide association studies and expression quantitative trait loci analysis.

Phenotyping at single-cell resolution

For many diseases, we know neither the principal cell types nor the genes involved, much less the mechanisms that might be targeted as part of therapy. Single-cell genomic experiments that compare healthy to diseased tissue or track the development of embryos over time typically aim to identify subpopulations of cells that differ between conditions or timepoints at the molecular level. For example, cells that exist only in diseased samples are of central interest, as are healthy cell types that are absent in diseased samples. In principle, comparing the transcriptomes of individual cells in one condition (for example, mutant or disease) against control cells (for example, wild-type or healthy) is straightforward: first, cluster the cells, then classify the clusters according to type based on the expression of discriminative marker genes. Next, count the number of cells of each type in each sample and compare these counts across sample groups to assess which cell types are more abundant in one group than another12. Finally, scrutinize and compare the genes specifically active in each of these states to extract insights about possible mechanisms of pathogenesis and progression. However, the nuances at each of these analysis steps demand attention to statistical issues and the adoption of best practices (reviewed recently13).

Cell-type annotation

Recognizing disease-specific changes in cell-type proportions or transcriptomes requires annotating each cell according to type (Fig. 1a). Classifying cells according to type is done by comparing their transcriptomes to annotated cells in reference cell atlases4,14,15,16,17,18,19. Several bioinformatic tools have been developed to open up this bottleneck20,21,22,23,24,25. Garnett uses elastic net regression26 to train a classifier that annotates cells in a new experiment based on prior single-cell data according to an ontology of cell types defined by a literature-derived set of markers for each cell type20. An alternative strategy matches each cell from a new experiment onto cells from an existing atlas with similar overall transcriptomes and then transfers any annotations from the reference (for example, cell type) onto each query cell27. This so-called label transfer is fast and straightforward but requires a reliably annotated reference atlas. The Human Cell Atlas and various model organism atlases are improving rapidly as more groups contribute data, but annotations are frequently updated, and cell types are still being catalogued, particularly in the context of disease.
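As a minimal illustration of label transfer, the sketch below projects a query dataset into a principal component space learnt from a hypothetical annotated reference and assigns each query cell the majority label of its nearest reference neighbours. This is a schematic of the idea rather than the implementation of any particular published tool, and all names and data are invented.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical log-normalized expression matrices (cells x genes): an annotated
# reference atlas and an unannotated query experiment.
rng = np.random.default_rng(0)
ref_expr = rng.poisson(1.0, size=(500, 200)).astype(float)
ref_labels = rng.choice(["T cell", "B cell", "monocyte"], size=500)
query_expr = rng.poisson(1.0, size=(100, 200)).astype(float)

# Project reference and query cells into a shared low-dimensional space.
pca = PCA(n_components=20).fit(ref_expr)
ref_pcs = pca.transform(ref_expr)
query_pcs = pca.transform(query_expr)

# Transfer each query cell's label from its nearest reference neighbours.
knn = KNeighborsClassifier(n_neighbors=15).fit(ref_pcs, ref_labels)
query_labels = knn.predict(query_pcs)
query_confidence = knn.predict_proba(query_pcs).max(axis=1)
```

Query cells with low neighbour agreement can be flagged as ‘unknown’ rather than forced into an existing category.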

Fig. 1: Regression analysis at single-cell resolution isolates cell types and genes central to pathobiology.
figure 1

a, A hypothetical experimental design with several cases (people with disease) and controls (people without disease). Peripheral blood mononuclear cells are collected from each individual and subjected to single-cell RNA sequencing (RNA-seq), identifying five different cell types. b, Clustering and cell type annotation reveal major cell types in proportions that vary across individuals of different ages. c, One cell type (activated T cells) seems to be more abundant in cases than in controls, but is the difference statistically significant even if we control for differences in cell type proportion as a function of age? d, Linear regression can model the proportion of a cell type relative to the others across samples. A first model describes the activated T cell proportions as a function of age and disease status. In a second model, disease status is excluded. These models can be compared with a likelihood ratio test to assess whether the improvement in the first model’s ability to explain variation in proportions across samples is enough to justify its additional complexity in terms of parameters. The plotted lines correspond to the fits from the alternative model. e, Individual genes can also be tested for variation across cells from different samples. A gene may vary within cells from patients treated with a drug as a function of time since therapy began, but the trend may differ between individuals, who may respond differently. These donor-specific trends should be accounted for during differential expression analysis. One approach is to use a mixed model that captures donor-specific ‘random effects’ in addition to the fixed effect of age and other experimental variables. The ‘(1 | patient)’ notation denotes a patient-specific random effect.

Differential cell composition

Once cells have been identified, one can ask, for example, whether their proportions vary by donor age (Fig. 1b), disease (Fig. 1c) or other study variables. There are two strategies for comparing abundances of each cell type across specimens. A first strategy is to count the cells according to type and then compare the counts across samples using a simple two-group test. However, two-group tests can have difficulty detecting differences as a function of one variable (for example, disease) while controlling for others (for example, age). A more flexible approach is to model the cell-type counts across groups via regression (Fig. 1d). Propeller is a recently introduced tool that compares the proportions of different cell types by regressing each cell type’s proportion in a sample on experimental variables (for example, case versus control or age)12. The advantages of regression-based analyses of cell proportions are that they are easy to implement and interpret, but a disadvantage is that they require accurate cell-type annotations. Moreover, cell ontologies necessarily impose arbitrary classifications on cells, specifying, for example, at what point a developing cell becomes ‘terminally’ differentiated.
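The regression logic of Fig. 1d can be sketched with a binomial generalized linear model of per-sample cell-type counts, comparing nested models with and without disease status via a likelihood ratio test. This is a generic GLM on invented counts, not the Propeller implementation.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy import stats

# Hypothetical per-sample counts of activated T cells out of all cells profiled.
df = pd.DataFrame({
    "t_act":   [120, 90, 200, 230, 60, 75, 210, 260],
    "total":   [1000, 950, 1100, 1200, 900, 980, 1050, 1150],
    "age":     [35, 42, 61, 68, 30, 38, 65, 72],
    "disease": [0, 0, 1, 1, 0, 0, 1, 1],
})
endog = np.column_stack([df["t_act"], df["total"] - df["t_act"]])  # (successes, failures)

# Full model: activated T cell proportion as a function of age and disease status.
full = sm.GLM(endog, sm.add_constant(df[["age", "disease"]]),
              family=sm.families.Binomial()).fit()
# Reduced model: age only.
reduced = sm.GLM(endog, sm.add_constant(df[["age"]]),
                 family=sm.families.Binomial()).fit()

# Likelihood ratio test: is the extra disease-status parameter justified?
lr = 2 * (full.llf - reduced.llf)
p_value = stats.chi2.sf(lr, df=1)
print(f"likelihood ratio test for disease status: P = {p_value:.3g}")
```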

Several methods have emerged to navigate the trade-offs between resolution and interpretability, which become especially fraught with sparse single-cell data, when testing for enrichment or changes in cell proportions. SEACells groups cells into ‘metacells’, between which biological variation is preserved but technical variation is minimized28. SCAVENGE relies on network propagation to associate genetic variants with distinct molecular states in which they are enriched29. Uncertainty surrounding cell annotation can be at least partly accounted for in such models via bootstrap resampling30, at additional computational expense. MELD31 and Milo32 are alternative approaches that model the distribution of cells throughout the high-dimensional space of possible transcriptomes and then test whether sample groups are distributed differently within it. The advantage of such an approach is that it does not require prior clustering or annotation; it is therefore robust to the parameter choices of clustering algorithms and can work in the absence of an accurately annotated reference. A disadvantage is that more complex contrasts (for example, mutant versus wild type, controlling for differences in age) can be more complicated to specify and compute.

Differential gene expression

A third major statistical challenge arises when comparing cells of different types or states. What genes distinguish each cell type? What, if anything, is different about nominally similar cells from different donors or treatment groups, even after accounting for systematic technical effects shared by ‘batches’ of specimens? Such comparisons are central, for example, to identifying disease biomarkers or characterizing the genes that are dysregulated in disease-specific cell states relative to the healthy ones from which they emerged. For example, differential gene expression of single nuclei from mice exposed to silica dust revealed that lung macrophages activate genes normally expressed by osteoclasts, including proteases and other molecules that damage the lung33. In such analyses, it is essential to compare the levels of genes in cells with a statistical model and test that is well suited to the data34. For simple two-group comparisons, a Wilcoxon rank-sum test (also known as a Mann–Whitney U test) or similar non-parametric approach can accurately capture genuine differences while controlling for false discoveries that may arise from sampling error, but with the same limitation as in differential cell-composition analysis: contrasts involving more than one variable are difficult to specify. For example, one might want to compare the kinetics of a gene over time in two different strains, or test whether there is an age-dependent difference in how a particular cell type responds to a drug treatment (Fig. 1e). Generalized linear models (GLMs) can test for genes that vary over one or more experimental variables (having controlled for the others). GLMs treat each observation (that is, each cell) as independent, but this assumption is not met in most single-cell RNA sequencing (RNA-seq) experiments because cells from the same specimen are not fully independent from one another. If this ‘grouping structure’ among the observations is not accounted for, the analysis will return many false positives. There are two ways of accounting for grouping structure in single-cell datasets: one approach is to use generalized linear mixed models (GLMMs) that allow parameters (for example, age or dose effects) to differ between groups of cells. A second approach is to aggregate data from individual cells of the same type and from the same sample, to create ‘pseudobulk’ profiles34,35 that can be analysed with conventional RNA-seq tools36,37. Pseudobulking is straightforward but requires accurate cell-type annotations. GLMMs do not require annotations but are much more computationally expensive to fit.
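To make the two strategies concrete, the sketch below aggregates hypothetical per-cell counts into pseudobulk profiles and fits a negative binomial GLM across samples, then fits a linear mixed model with a patient-specific random intercept at single-cell resolution (a Gaussian approximation on log-normalized values, standing in for a full GLMM). The data, column names and effect sizes are invented.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)

# Hypothetical per-cell table: counts of one gene plus metadata for each cell.
cells = pd.DataFrame({
    "gene_count": rng.poisson(2.0, size=600),
    "total_counts": rng.integers(2000, 6000, size=600),
    "cell_type": rng.choice(["T cell", "B cell"], size=600),
    "patient": rng.choice([f"P{i}" for i in range(6)], size=600),
    "weeks_on_drug": rng.choice([0, 4, 8, 12], size=600),
})

# Strategy 1 (pseudobulk): sum counts per patient and cell type, then model the
# aggregated counts across samples with a negative binomial GLM and a depth offset.
pb = (cells.groupby(["patient", "cell_type"], as_index=False)
           .agg(gene=("gene_count", "sum"),
                depth=("total_counts", "sum"),
                weeks=("weeks_on_drug", "mean")))
nb_fit = smf.glm("gene ~ weeks + cell_type", data=pb,
                 family=sm.families.NegativeBinomial(),
                 offset=np.log(pb["depth"])).fit()

# Strategy 2 (mixed model): single-cell resolution with a patient-specific random
# intercept, here a Gaussian approximation on log-normalized expression rather
# than a full generalized linear mixed model.
cells["log_expr"] = np.log1p(1e4 * cells["gene_count"] / cells["total_counts"])
mm_fit = smf.mixedlm("log_expr ~ weeks_on_drug + cell_type",
                     data=cells, groups=cells["patient"]).fit()
print(nb_fit.summary())
print(mm_fit.summary())
```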

Because cell types are often defined on the basis of how they cluster, and cells are clustered according to how similar their transcriptomes are, care must be taken not to ‘double dip’. Using exactly the same data both to define clusters and to characterize their differences risks reporting spurious biomarkers of novel cell types38. Two solutions have been proposed for studies that seek both to discover new cell types and to characterize their molecular markers38. In the first, termed ‘sample splitting’, cells from a first group of samples are clustered according to transcriptomic similarity. Then, cells from a second group of samples, which were not used to define the clusters, are assigned to the most similar cluster. Finally, the cells from this second group are used for differential expression analysis to define markers. This approach is simple and does not require specialized statistical procedures, but the experiment must collect enough samples that they can be split between the two stages. A second approach, called data thinning, splits the sequencing data from each cell into two statistically independent groups, the first of which is used for clustering and the second for differential expression39,40, avoiding the ‘double dip’ without adding to the overall cost of the experiment.
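The idea behind data thinning can be sketched with binomial ‘count splitting’: under a Poisson model of the raw counts, the two resulting matrices are statistically independent, so one can be used for clustering and the other for marker testing. This is a schematic of the principle, not the implementation of the cited methods.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical raw UMI count matrix (cells x genes).
counts = rng.poisson(1.5, size=(1000, 500))

# Binomial thinning: if each entry is Poisson, the two pieces below are
# statistically independent, so clusters defined on 'train' can be tested
# for marker genes on 'test' without double dipping.
eps = 0.5
train = rng.binomial(counts, eps)   # use for clustering and cell-type discovery
test = counts - train               # reserve for differential expression
```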

Modelling gene regulation

A useful model of gene expression can help to discern the contribution of several molecular inputs across diverse genetic, environmental, developmental and disease contexts (Fig. 2a). How much mRNA is transcribed from a gene, given a cell’s concentrations of its upstream kinases and transcription factors? How active is protein signalling in this cell? What non-coding sequences determine whether a gene is expressed in each cell type, and are they accessible for binding in the cell? How do mutations within those sequences alter the gene’s output?

Fig. 2: Statistical models of gene expression aim to quantify relative contributions of regulatory DNA sequence, proteins and signals to a gene’s mRNA and protein output.
figure 2

a, The transcription of a hypothetical gene with diverse molecular inputs that are integrated to determine the rate of RNA production. Binding of a ligand (dark purple) of varying concentration to a receptor (light purple) of varying surface expression triggers signalling through kinases A and B in a pathway that ultimately leads to the phosphorylation (P) and translocation of a transcription factor (TF) to the nucleus. The nuclear concentration of this factor determines the probability that it occupies the regulatory DNA of the gene, which is also a function of how accessible chromatin is around that DNA, and whether it contains any sequence variants that might alter the factor’s binding affinity. b, A linear regression model for RNA production for the locus in panel a, which assumes that each input has an additive effect on expression. c, A Bayesian network model of RNA production from the locus, which describes the conditional dependencies between the various inputs.

Multi-modal single-cell assays are increasingly powering quantitative descriptions of how individual genes are regulated across the molecular layers of the central dogma41. For example, CITE-seq co-assays cellular transcriptomes and their immunophenotypes via oligo-conjugated antibodies that can be detected in single-cell RNA-seq (scRNA-seq) libraries42. ASAP-seq co-assays the immunophenotypes of cells with their chromatin accessibility profile via scATAC-seq43. Several RNA and chromatin accessibility co-assays have been reported7,44, enabling the correlation of gene expression with the accessibility of nearby regulatory DNA elements (for example, putative enhancers).

A quantitative model of a given gene’s transcription aims to find a mathematical function that accurately estimates the number of its mRNA or protein products present in the cell under different conditions. This function must somehow capture the molecular biology underlying the central dogma. In their 1961 classic paper on lactose metabolism in Escherichia coli, Jacob and Monod showed that the kinetics of a gene could be used as a quantitative phenotype and for causal inference regarding its control45. In the decades since, many others have returned to this system as a testbed for new ideas in quantitative modelling of individual genes through mathematical functions of chemical reaction rates, the affinities of transcription factors for their DNA binding sites and other biochemical properties46. Despite the rigour, sophistication and utility of such models for discriminating between possible mechanisms of gene regulation, an inescapable conclusion of these efforts is that mathematical equations that describe the kinetics of mRNA synthesis even for simple genes become intractable when many inputs must be considered.

An alternative approach to modelling a gene’s expression is to use numerical optimization to learn more abstract functions that nevertheless accurately predict its output from its many inputs. These functions may not be explicitly specified in biochemical terms but can still quantify the contributions of genetic mutations, epigenetic states and cell-type-specific signalling, among other variables (Fig. 2b). With the inputs, output and general form of a given gene’s model specified, the computer can search through the space of all such functions to find the one that best explains the multi-modal data using an appropriate numerical optimization algorithm.

Statistical models can vary tremendously in their complexity, with increasing fidelity to the underlying molecular biology. For example, a simple linear model will describe how the amount of mRNA changes with addition or subtraction of each input. Such models are straightforward to interpret but have difficulty describing how inputs interact, and so will not detect, for example, how transcription factors cooperate to activate a gene or block each other from doing so. The function can be augmented with terms that explicitly model the interaction between explanatory variables, albeit at a potentially substantial cost in terms of interpretability and computational burden. Graphical modelling approaches such as Bayesian networks can capture complex interdependencies between the inputs (Fig. 2c). For example, suppose a gene’s mRNA levels can be predicted accurately from measurements of a particular signalling pathway’s activity or from the nuclear abundance of a key upstream transcription factor whose translocation to the nucleus depends on that pathway (Fig. 2a). A Bayesian network could learn that the signalling measurements are conditionally independent of the mRNA levels, given the transcription factor levels, and so provide no additional predictive value.

Especially when the inputs depend on one another, more complex models are not necessarily better: although adding terms will always explain more variance in training data, those terms may actually hurt performance on ‘held-out’ test data (that is, data not used to train model parameters). Model selection, the formal assessment of whether adding explanatory variables and dependencies can be justified in terms of how well a model explains data, is critical for quantifying the contributions different inputs make to gene regulation. Fortunately, the statistical literature is rich in tools for assessing whether a complex model is significantly better than a simpler one ‘nested’ within it (for example, likelihood ratio tests), for selecting a parsimonious set of inputs from a large collection of potentially correlated ones (for example, LASSO47) and for measuring overall model complexity (for example, the Akaike information criterion48).
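A minimal sketch of these model selection tools applied to a hypothetical gene: nested linear models with and without an interaction term are compared by a likelihood ratio test and the Akaike information criterion, and LASSO selects a sparse set of inputs from a larger, partly irrelevant collection. All data here are simulated.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(3)
n = 300

# Simulated single-cell measurements for one gene and two candidate inputs.
tf = rng.normal(size=n)        # nuclear transcription factor abundance
access = rng.normal(size=n)    # chromatin accessibility at the locus
mrna = 1.0 + 0.8 * tf + 0.5 * access + 0.6 * tf * access \
       + rng.normal(scale=0.5, size=n)
df = pd.DataFrame({"mrna": mrna, "tf": tf, "access": access})

# Nested linear models: additive only versus additive plus interaction.
additive = smf.ols("mrna ~ tf + access", data=df).fit()
interact = smf.ols("mrna ~ tf * access", data=df).fit()

# Likelihood ratio test and AIC: is the extra interaction term justified?
lr = 2 * (interact.llf - additive.llf)
p = stats.chi2.sf(lr, df=1)
print(f"LRT P = {p:.2e}; AIC additive = {additive.aic:.1f}, "
      f"AIC interaction = {interact.aic:.1f}")

# LASSO selects a parsimonious subset from many (partly irrelevant) inputs.
many_inputs = np.column_stack([tf, access, tf * access, rng.normal(size=(n, 20))])
lasso = LassoCV(cv=5).fit(many_inputs, mrna)
print("inputs retained by LASSO:", np.flatnonzero(lasso.coef_))
```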

Statistical models of gene regulation can be especially powerful in conjunction with genetic experiments. For example, the bone morphogenetic protein (BMP) pathway transduces signals through multiple ligands bound by receptors composed of multiple subunits. A recent quantitative analysis determined how ligand and receptor compositions and stoichiometry define how cells transduce BMP into diverse patterns of downstream gene expression49. The statistical methods developed as part of the study discriminated between additive, synergistic and antagonistic effects between the ligands and the receptor subunits. Modulating BMP receptor composition by knocking out subunits altered how cells responded to BMP, both quantitatively in levels of downstream expression and also qualitatively, with one cell type made to resemble another. This study demonstrates that combining genetic perturbations with statistical modelling to discriminate between possible functions for genes using the quantitative kinetics of the cellular behaviours they control remains as powerful today as it was when Jacob and Monod used it more than 60 years ago.

Disentangling interactions and dependencies

How our many diverse cells arise from a single cell, how they communicate with each other, and how they depend on one another to form organized, structured tissue are fundamental questions in developmental biology. What genes does each cell use to sense and influence its neighbours? What genes does it use to ensure its daughters are properly positioned and specified? Genes that function non-autonomously through cell–cell interactions or by conditioning a cell (or its descendants) to behave a certain way in the future are notoriously difficult to study. Although understanding the lineal and signalling dependencies between cell types may seem like disparate goals, new tools for studying cells across space and time promise to decode these parts of the genetic programme.

Spatial relationships

Identifying the specific proteins, peptides and small molecules that drive cell-fate decisions is challenging. Highly multiplexed spatial transcriptomics and proteomic techniques (reviewed recently50,51) have emerged as a way of measuring heterogeneity across cells within a tissue at the molecular level, promising to characterize the molecular basis of signalling dependencies between cell types, and one day may enable quantitative spatiotemporal models of tissue morphogenesis or tumour growth and invasion. Tools from spatial statistics are increasingly being deployed to tackle common tasks in spatial transcriptomic or proteomic analyses. For example, a recent analysis of the mouse liver combined spatial transcriptomics with scRNA-seq to statistically deconvolve the spatial data, estimating each gene’s expression within each cell in the tissue; to automatically annotate veins in the image and classify them as central or portal; and to identify modules of spatially autocorrelated genes within hepatocytes that vary as a function of position along the lobular axis and proximity to portal veins52. A separate study used molecular cartography to compare livers from wild-type and Wnt2/Wnt9b double-knockout mice, finding these genes to be required for properly zoned gene expression and liver regeneration following acetaminophen injury53. These two examples highlight how new technologies and rigorous spatial statistics, coupled with genetic and drug perturbations, facilitate causal inferences about the role of individual genes in cell–cell interactions and molecular phenotypes.

The central statistical challenge associated with spatial molecular profiling is to quantify the contribution that the locations of cells within a tissue make to explaining variation in their forms and functions. This challenge is fundamental to answering diverse questions in cell, molecular and developmental biology. Are T cells that have infiltrated a tumour different from those at the margin? How does distance from an organizer and the morphogens it emits correlate with a cell’s fate in the embryo? Spatial statistics is rich with parametric and nonparametric methods of exploring such questions, which frequently arise in demography, geology, meteorology, oceanography, ecology and other disciplines that analyse measurements over maps and volumes. Consider a hypothetical spatial molecular profiling experiment aiming to identify ligand–receptor interactions that determine the expression of a downstream gene through signalling (Fig. 3). Statistical models could be used to determine whether variation in target gene T across the tissue can be explained as a function of the local concentration of ligands A and B, conditional on whether the receptor is also expressed. The variance explained by these models can be taken as the strength of evidence for interactions between each ligand and the receptor. However, as more candidate ligands are considered, model selection can be burdensome as the number of candidate models explodes.
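A sketch of the model comparison in Fig. 3c, with invented per-cell values: each candidate model explains a target gene’s expression from the neighbourhood concentration of a ligand, with or without conditioning on receptor expression, and the fits are ranked by variance explained and AIC.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n_cells = 400

# Hypothetical per-cell table from a spatial assay: a target gene, receptor
# expression and the neighbourhood-averaged concentration of two candidate ligands.
cells = pd.DataFrame({
    "ligand_a": rng.gamma(2.0, size=n_cells),
    "ligand_b": rng.gamma(2.0, size=n_cells),
    "receptor": rng.binomial(1, 0.5, size=n_cells),
})
cells["target"] = (0.2 + 1.5 * cells["ligand_a"] * cells["receptor"]
                   + rng.normal(scale=0.3, size=n_cells))

# Candidate models: each ligand alone, and each ligand conditional on the receptor.
candidates = {
    "ligand A":            "target ~ ligand_a",
    "ligand B":            "target ~ ligand_b",
    "ligand A x receptor": "target ~ ligand_a * receptor",
    "ligand B x receptor": "target ~ ligand_b * receptor",
}
for name, formula in candidates.items():
    fit = smf.ols(formula, data=cells).fit()
    print(f"{name:>20}: adjusted R2 = {fit.rsquared_adj:.2f}, AIC = {fit.aic:.1f}")
```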

Fig. 3: Identifying molecular mediators of cell–cell interaction with spatial statistics.
figure 3

a, Multiplexed imaging or spatial transcriptomics collects measurements of each gene at many positions in a field of view or tissue section. Using Gaussian process regression or ‘kriging’ produces an estimate of each gene’s expression even at coordinates that were not directly measured. b, Measurements from one gene can be correlated with another across space to identify genes that vary as a function of concentration of ligands and other key regulatory molecules. c, A set of spatial regression models that aim to explain variation in a target gene’s expression across cells as a function of one or more candidate extrinsic signalling factors. *** indicates statistical significance. Pseudo-R2 is a measure of the variability in a dependent variable explained by a model of several independent variables. d, Ligands can in principle be identified for receptors by evaluating the amount of variation they explain in targets known or suspected to be downstream of those receptors. n.s., not significant.

Statistical methods are also critical for overcoming limitations in the underlying measurement technologies. Techniques such as Slide-seq54 that release mRNA from tissues and then blot it onto a substrate prior to library construction are not, strictly speaking, single-cell techniques, in that libraries correspond to spots on a spatial grid contributed by multiple (and sometimes many) cells of different types and in different functional states. There are now numerous algorithms for deconvolving the spots under various mixture models for how individual cells contribute their diffused mRNA to each nearby spot55,56. Techniques that extract single cells or nuclei prior to sequencing such as sci-Space57 avoid this issue, but they face others: because cells are physically extracted from tissue prior to single-cell sequencing, the original tissue coordinates of those cells must be triangulated (again, with uncertainty) from the sequencing data. Triangulating cells can also be formulated as a mixture modelling problem in which each cell is labelled according to its relative proximity to a reference point within the assay’s spatial coordinate system. A second problem common to many spatial molecular profiling assays is that measurements for some probes or cells may be missing at some coordinates. Fortunately, there is a rich statistical literature on how to estimate missing measurements through interpolation. One classic approach to the spatial interpolation problem is ‘kriging’, whereby a ‘variogram’ is constructed that describes the correlation in measurements collected at two points based on the distance between them, enabling one to predict values at unsampled coordinates based on sampled ones. A third problem common to any analysis of tissue sections is how to register multiple sections from the same specimen into a common, three-dimensional coordinate framework. Here too, spatial statistical methods have proven invaluable, especially in conjunction with deep neural networks, for aligning sections against one another in three dimensions58.
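As an illustration of the interpolation step, the sketch below uses Gaussian process regression (closely related to kriging; the kernel length scale plays the role of the variogram’s range and the white-noise term the nugget) to predict a gene’s expression, with uncertainty, at coordinates that were not measured. Coordinates and expression values are simulated.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(5)

# Hypothetical measured spots: (x, y) coordinates and one gene's expression.
coords = rng.uniform(0, 100, size=(300, 2))
expression = np.sin(coords[:, 0] / 15.0) + 0.1 * rng.normal(size=300)

# Gaussian process regression: the kernel's length scale plays the role of the
# variogram's range, and the white-noise term the nugget.
kernel = 1.0 * RBF(length_scale=10.0) + WhiteKernel(noise_level=0.1)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(coords, expression)

# Predict expression, with uncertainty, at coordinates that were not measured.
unmeasured = np.array([[25.0, 40.0], [60.0, 10.0], [90.0, 75.0]])
mean, sd = gp.predict(unmeasured, return_std=True)
```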

Lineage relationships

In some organisms, the lineage relationships between cells can be ascertained simply by observing them under the microscope. Sulston and colleagues characterized the entire cell lineage of the nematode Caenorhabditis elegans by meticulously documenting individual cell-division events in worm embryos59. The worm is ideal for such an effort because it is small and has an invariant lineage that over developmental time follows a deterministic genetic programme, so that observations of different embryos can be registered into a unified ‘coordinate system’ of the lineage. Unlike the worm, most animals do not have invariant lineages; even genetically identical animals do not contain exactly the same number of cells, of exactly the same cell types, positioned in exactly the same ways in their tissues and organs. Experiments that interrogate the cells arising from a population marked by a particular gene (assuming a suitable reporter could even be designed) leave the remainder of the animal unmapped, and so the lineal origins of many cell types remain unclear.

Emerging genomic and imaging technologies for lineage tracing offer a path to comprehensively characterizing the lineage relationships between cells at the whole-animal scale (reviewed recently60,61). These technologies draw from two basic strategies for connecting cells to their descendants through developmental time. A first class of prospective lineage tracing strategies labels cells by manipulating their genomes (for example, with unique barcodes), allowing them to expand, sequencing the barcodes of their descendants, and then associating each descendant with its ancestor through these barcodes. A second, retrospective set of strategies scrutinizes a population of cells and reconstructs their ancestry through phylogenetic analysis of their shared genetic variants. Recently, several techniques that draw from both of these strategies have been developed, such as GESTALT, which uses CRISPR–Cas9 to introduce cumulative edits to a synthetic construct installed in the genome of developing zebrafish62 (Table 1). As the zebrafish embryo develops, cells accumulate unique patterns of repair to this construct, prospectively and continuously labelling ever-smaller clades of cells (Fig. 4a). On the assumption that descendants will share the genomic scars formed when their shared ancestors received edits, their lineage relationships can be inferred by reconstructing a lineage tree that minimizes the amount of editing that would be needed to generate their ‘alleles’ (for example, maximum parsimony). Intense efforts are underway to improve the depth and breadth of lineage recording capacity, to co-capture other measurements (such as the transcriptome), to incorporate spatial information and to deploy recorders in animal models. For example, the DARLIN mouse, which records cellular lineage via an inducible Cas9-barcoding system, revealed that clonal memory in haematopoietic stem cells is associated more with DNA methylation than with chromatin accessibility or the transcriptome63. Further experiments, possibly involving mutants on the DARLIN background, will be required to identify the genes that mediate memory through DNA methylation.
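To make the reconstruction step concrete, the sketch below clusters cells hierarchically on the fraction of target sites at which their recorded ‘alleles’ differ, a crude stand-in for the maximum parsimony and likelihood-based reconstructions used by dedicated lineage tools. The allele table is invented.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, to_tree

# Hypothetical allele table: rows are cells, columns are CRISPR target sites;
# 0 means unedited and positive integers index distinct insertion/deletion scars.
alleles = np.array([
    [1, 0, 3, 0, 0],
    [1, 0, 3, 2, 0],
    [1, 4, 0, 0, 0],
    [1, 4, 0, 0, 5],
    [6, 0, 0, 0, 0],
])

# Cells sharing scars are presumed to share ancestors: cluster cells on the
# fraction of target sites at which their recorded alleles differ.
distances = pdist(alleles, metric="hamming")
dendrogram = linkage(distances, method="average")
root = to_tree(dendrogram)  # a binary tree approximating the cell lineage
```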

Table 1 Emerging genomics technologies that enable statistical inference of gene function
Fig. 4: Statistical inference of cell lineage relationships during development using molecular recorders.
figure 4

a, Molecular lineage recorders work by first introducing synthetic sequences into the genome and then targeting those sequences with genome-editing reagents (such as CRISPR–Cas9) to introduce mutations in them as the embryo develops. Each cell accumulates a distinct (ideally unique) pattern of mutations, producing an embryo that is massively mosaic at the target locus. The ‘alleles’ can be used to infer a phylogenetic tree corresponding to the cell lineage. b, For organisms without a deterministic lineage programme, the cell lineage tree for each individual will differ, and how to produce a ‘consensus lineage’ for a population of individuals from the same species is an unsolved problem. However, coalescent theory offers a conceptual roadmap for building such consensus lineages.

Understanding how animal genomes sweep out a lineage that is variable without compromising reproducibility and robustness during development is a central aim of molecular recording technology. Realizing this goal will require developing statistical models that describe and quantify variability in the lineages of many individual embryos. Statistical methods from phylogenetics and population genetics offer guidance on how to approach the problem of inferring lineage relationships from molecular recorders. In population genetics, coalescent theory formulates models of how alleles observed in a population arose from a common ancestral allele over successive generations64 (Fig. 4b). Although coalescent theory rests on assumptions that may not hold for the synthetic alleles introduced by molecular recorders, such as the assumption that edits are introduced in cells at a constant rate, much of the mathematical and statistical infrastructure may be reusable. Elaborations to coalescent theory capture phenomena in evolutionary histories that have analogues in cell and developmental biology, such as bottlenecks (programmed death) and dispersal (cell migration). A coalescent theory of molecular recorder experiments would enable the use of inferred cell histories as quantitative phenotypes. Phenotyping with molecular recorders would then allow the question of how genomes generate reproducible yet stochastic cell lineages during development to be tackled directly.

Dissecting gene circuits

The fate decisions, lineal histories, physical positions and molecular messaging of cells in the embryo are ultimately determined by genes and by how those genes are regulated. Systematically or even automatically inferring how genes regulate one another has long been one of the most difficult challenges in computational biology. With each new generation of genomic measurement technologies, new algorithms emerge to exploit them, hoping to realize the promise of data-driven dissection of genetic programmes in development, disease and other contexts.

The problem of gene network reconstruction has been formulated in many ways, most ambitiously as a problem in genome-scale causal inference: given the expression levels of all genes in many cells or samples, infer the direct regulatory relationships between the genes (Fig. 5). A statistical approach to this problem models each gene as a dependent variable explained by all others, as well as any extrinsic factors such as drug treatments or environmental stresses (Fig. 5a). However, a pattern of correlation between two genes is insufficient to declare one a regulator of the other. As more pairs of correlated genes are considered, one must discriminate between an exploding number of regulatory architectures consistent with the correlations (Fig. 5b); for a genome with 20,000 genes, there are 400 million possible regulator–target pairs to consider. Even to conclude that two genes are correlated and that changes in one follow the other requires some care because sparse count data can bias simple measures of relatedness65 (for example, Pearson’s r). But the greater challenge lies in distinguishing causal links from correlations. Casting John Stuart Mill’s classic criteria for causal inference66 in genetic terms, defending a regulatory link between one gene A and another gene B requires establishing that changes in A precede B, that measurements of A are correlated with B, and that no other factor, neither a third gene C nor some external variable, explains changes in both A and B. That is, are two genes still correlated, even if one accounts for co-variation with every other gene in the genome? Computing these conditional dependencies massively increases both the computational burden and the amount of data needed for accurate estimates. There are so many parameters to estimate that a naive approach will produce either a uselessly inaccurate result or, after accounting for multiple testing, no result at all.
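The residual-based logic of Fig. 5c can be sketched as follows: two genes that appear strongly correlated are no longer correlated once each is regressed on a shared driver and a measured stress covariate, arguing against a direct regulatory link. All values here are simulated.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(7)
n = 500

# Simulated cells in which a stress response drives both gene A and gene B,
# and gene C (a transcription factor) also contributes to A.
stress = rng.normal(size=n)
c = rng.normal(size=n)
a = 0.9 * stress + 0.8 * c + rng.normal(scale=0.5, size=n)
b = 0.9 * stress + rng.normal(scale=0.5, size=n)
df = pd.DataFrame({"a": a, "b": b, "c": c, "stress": stress})

# Marginal correlation suggests that A and B regulate one another...
print("marginal r:", stats.pearsonr(df["a"], df["b"])[0])

# ...but the partial correlation, computed from the residuals of A and B after
# regressing each on C and the measured stress covariate, vanishes once the
# shared drivers are controlled for.
res_a = smf.ols("a ~ c + stress", data=df).fit().resid
res_b = smf.ols("b ~ c + stress", data=df).fit().resid
print("partial r:", stats.pearsonr(res_a, res_b)[0])
```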

Fig. 5: Analysing conditional dependence relationships between genes and other experimental factors in single-cell data can reveal regulatory interactions between them.
figure 5

a, In a hypothetical analysis of co-expression between three genes A, B and C across cells, positive correlations between them suggest that they may regulate one another. However, the drivers for variation in the expression of these genes could be some external factor, such as stress. b, Three possible regulatory architectures that are consistent with the correlations shown in panel a. c, Fitting regression models that correlate the residuals of two genes, having controlled for the third, along with any other measured experimental variables, eliminates confounders and discriminates between the three possible architectures. TF, transcription factor.

Three main statistical strategies make this causal inference problem more tractable, each served by a suite of new technologies. A first strategy is to reduce the number of genes or gene–gene pairs in the inference by ruling out genes that cannot regulate others or by declaring some gene–gene interactions to be mechanistically implausible. For example, ‘structural genes’, such as actins, do not regulate most other genes. Focusing only on transcriptional networks shrinks the set of putative regulatory genes to transcription factors and imposes the requirement that such regulators recognize their target genes through cognate binding sequences in flanking non-coding DNA. This strategy (reviewed recently67) is served by multi-modal single-cell assays that read out not only expression levels (a measure of gene output) but also chromatin accessibility, histone modifications or intracellular signalling (all measures of gene input). A second strategy is to perturb the genes (for example, through CRISPR68,69,70,71), which establishes (or excludes) their roles as causal regulators of other genes. Expansions of this concept can be used to explore the vast space of combinations of knockouts efficiently72, how genetic mutations interact with drug treatments73 and how genes determine how cells interact with each other74. An important limitation in such screens is that not all cells receive functional edits. Statistical tools75, improved genome editing76 and direct genotyping protocols77 could provide clearer genotype-to-phenotype associations. A third strategy is to collect vast amounts of data, from diverse conditions, to isolate the most informative correlations. Major improvements to single-cell protocols now enable the analysis of cells from many samples2,78,79,80, each perturbed in different ways, in the same experiment, opening the door to screens and systematic reverse genetic studies.

Forecasting cell fate

Understanding what fates and functions cells will adopt in the future is central in both development and disease. Single-cell sequencing experiments carried out over time can capture a nearly continuous view of how cells regulate genes as they differentiate or enter pathological states. Among the first applications of single-cell RNA-seq towards understanding development was trajectory inference, the computational reconstruction of transcriptional programmes that cells execute as they differentiate81,82. Trajectory inference algorithms83 organize differentiating cells in pseudotime according to their maturity and fate decisions (Fig. 6a), enabling one to construct statistical models that anticipate the kinetics of each gene in each cell type. These algorithms were the starting point for more ambitious efforts to forecast the transcriptomes and proportions of cell types in developing tissues, including those subjected to genetic, environmental and other perturbations (Fig. 6b,c).
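As a sketch of modelling a gene’s kinetics along an inferred trajectory (the pseudotime values here are simulated rather than produced by a trajectory inference tool), a Poisson GLM with a spline of pseudotime can be tested against a flat model in which expression never changes.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(8)
n = 400

# Hypothetical cells ordered in pseudotime along a differentiation trajectory.
pseudotime = np.sort(rng.uniform(0, 1, size=n))
counts = rng.poisson(np.exp(1.0 + 2.0 * np.sin(np.pi * pseudotime)))
df = pd.DataFrame({"counts": counts, "pt": pseudotime})

# Model the gene's kinetics as a smooth (spline) function of pseudotime with a
# Poisson GLM, and test it against a flat model.
smooth = smf.glm("counts ~ bs(pt, df=4)", data=df,
                 family=sm.families.Poisson()).fit()
flat = smf.glm("counts ~ 1", data=df, family=sm.families.Poisson()).fit()
lr = 2 * (smooth.llf - flat.llf)
p = stats.chi2.sf(lr, df=smooth.df_model - flat.df_model)
print(f"pseudotime-dependent expression: P = {p:.2e}")
```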

Fig. 6: Forecasting cell fates, states and phenotypes.
figure 6

a, Visualizing cells undergoing fate decisions by embedding transcriptionally similar cells near one another in two or three dimensions can reveal molecular trajectories that capture the events leading up to and following their fate decisions, but may require time series experiments to orient the trajectory. b, Comparing recently synthesized, incompletely processed transcripts to their mature counterparts (RNA velocity) anticipates each cell’s near-term future transcriptional state. Direct RNA labelling experimentally distinguishes more and less recently synthesized transcripts to achieve similar ends. Either approach can provide a forecast of cells’ future states. c, Inferred gene regulatory networks (GRNs) can be used to forecast the effects of knocking out genes in the network on the distribution of cells over the trajectory, losses of specific fates (indicated by red arrows) and changes in individual genes that would otherwise be expected in the wild type. d, Foundation models can be trained on a large single-cell data corpus (for example, the Human Cell Atlas) and then adapted to perform numerous statistical inference and forecasting tasks. ΔA and ΔB, mutant genotypes lacking genes A and B, respectively; NA, not applicable; TF, transcription factor.

Statistical methods for time series analysis aim to predict the future state of a system using its past and current states. For example, autoregressive models, which learn correlation structure between present data and the recent past to predict the future84, are widely applied in econometrics and to forecast financial markets. Gaussian process models elaborate this concept and have been used to pinpoint the moment individual genes begin to undergo divergent expression as bipotent progenitors differentiate85. However, because current single-cell sequencing technologies are destructive, they do not repeatedly sample the same individual cells, limiting the amount of auto-correlation in the data86.
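For intuition, an autoregressive model of order one can be fitted by least squares to a hypothetical expression series measured at successive time points and then iterated forwards to produce a forecast; the series below is simulated.

```python
import numpy as np

rng = np.random.default_rng(9)

# Hypothetical expression of one gene averaged over cells captured at
# successive (pseudo)time points.
t = 50
x = np.zeros(t)
for i in range(1, t):
    x[i] = 0.8 * x[i - 1] + rng.normal(scale=0.2)

# Fit an AR(1) model, x[t] = a + b * x[t - 1] + noise, by least squares,
# then iterate it forwards to forecast the next few time points.
design = np.column_stack([np.ones(t - 1), x[:-1]])
a, b = np.linalg.lstsq(design, x[1:], rcond=None)[0]

forecast = [x[-1]]
for _ in range(5):
    forecast.append(a + b * forecast[-1])
print(forecast[1:])
```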

Both experimental techniques and computational insights have been proposed to learn how past and future states are linked without requiring repeated measures. Single-cell RNA-seq destroys each cell, but labelling the RNA while the cell is alive provides a means of discriminating between older RNAs and more recent ones87. RNAs from newly activated genes will appear in the unlabelled but not the labelled fraction, and therefore indicate where a cell is ‘headed’, transcriptionally speaking. RNA velocity is a purely computational technique that achieves similar ends by comparing the relative abundances of incompletely processed transcripts to mature, spliced isoforms88 (Fig. 6b). Transcriptional velocities have proved powerful in computational ‘fate mapping’. For example, CellRank estimates RNA velocity fields, taking care to propagate uncertainty in its estimates into downstream computations, to accurately predict cellular outcomes even outside normal development, such as cellular reprogramming and tissue regeneration89. Dynamo is a recent algorithm that computes transcriptional velocity vectors from labelled RNA (if available) or in a manner similar to RNA velocity (if not) and then uses the resulting vector field to infer gene regulatory networks and predict future cell fates90.
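The steady-state model underlying RNA velocity can be sketched in a few lines: a per-gene ratio γ is estimated by regressing unspliced on spliced counts through the origin, and the residual u − γs serves as the velocity. Real implementations add extensive filtering, neighbour pooling and dynamical models; the matrices here are simulated.

```python
import numpy as np

rng = np.random.default_rng(10)
n_cells, n_genes = 500, 50

# Hypothetical spliced (s) and unspliced (u) abundance matrices for the same cells.
s = rng.gamma(2.0, size=(n_cells, n_genes))
gamma_true = rng.uniform(0.3, 1.0, size=n_genes)
u = gamma_true * s + rng.normal(scale=0.1, size=(n_cells, n_genes))

# Steady-state estimate of the per-gene splicing/degradation ratio gamma:
# regress unspliced on spliced through the origin.
gamma_hat = (u * s).sum(axis=0) / (s * s).sum(axis=0)

# Velocity: positive where a gene is being induced (excess unspliced RNA),
# negative where it is being repressed.
velocity = u - gamma_hat * s
```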

Extrapolating cell fates is even more challenging in the context of genetic, environmental or other perturbations. Accurate models for predicting what would happen to a population of cells in an as-yet-unseen mutant or under particular environmental conditions are of paramount interest, and techniques for ‘out-of-sample’ prediction are beginning to emerge. CellOracle is a tool that builds on ideas from trajectory analysis and RNA velocity, but also constructs approximate gene regulatory networks to predict what would happen if one of the genes in those networks were absent91 (Fig. 6c). To do so, the tool deploys several of the strategies discussed above to make the gene regulatory network problem tractable. For example, CellOracle focuses on transcriptional regulatory networks and leverages prior knowledge from single-cell ATAC-seq data and transcription factor binding motifs to constrain the problem91. It also limits the number of regulators through a statistical technique called penalization, prioritizing overall accuracy in forecasting the future transcriptome of the cell rather than capturing every regulator of each gene. Linear models of gene regulation as used by CellOracle are straightforward to interpret, which means that in silico predictions can be readily tested in the laboratory. A disadvantage of linear models is that they may be too simple to capture the complex, nonlinear interactions between genes, cell types and environmental factors needed for accurate out-of-distribution forecasting.
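The general idea, not CellOracle’s actual implementation, can be sketched as a penalized linear model of a target gene on the transcription factors supported by a hypothetical accessibility and motif prior, followed by an in silico knockout of one regulator.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(11)
n_cells, n_tfs = 800, 30

# Hypothetical transcription factor expression matrix and one target gene.
tf_expr = rng.normal(size=(n_cells, n_tfs))
target = 1.2 * tf_expr[:, 0] - 0.7 * tf_expr[:, 3] + rng.normal(scale=0.5, size=n_cells)

# Hypothetical prior from chromatin accessibility and motif scanning: which TFs
# have a binding site in accessible regulatory DNA near the target gene.
motif_prior = np.zeros(n_tfs, dtype=bool)
motif_prior[[0, 3, 7]] = True

# Penalized linear model restricted to the prior-supported regulators.
model = Ridge(alpha=1.0).fit(tf_expr[:, motif_prior], target)

# In silico knockout: set one regulator to zero and predict the shift in the target.
knockout = tf_expr[:, motif_prior].copy()
knockout[:, 0] = 0.0  # remove the first prior-supported regulator
shift = model.predict(knockout) - model.predict(tf_expr[:, motif_prior])
print("mean predicted change in target expression:", shift.mean())
```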

Deep learning methods are emerging to facilitate in silico predictions of cell states and fates. These techniques trade interpretability for accuracy in forecasting and require dramatically more training data and computational power to build. A first class of methods is built for specific single-cell analysis tasks: scGen92 and a related subsequent method, CPA93, use variational autoencoders to learn low-dimensional representations of cell transcriptomes that enable one to predict gene-expression changes following gene knockouts, chemical perturbations and other insults. For example, by first embedding cells treated with individual small molecules into a low-dimensional space, and then learning how each drug moves cells around in that space, CPA can predict the effects of combinations of drugs that were not included in the training dataset. A second class of techniques, exemplified by Geneformer94 and scGPT95, builds on the same principles that led to the success of large language models (LLMs) such as ChatGPT: a foundation model based on attention networks is first trained to learn how genes co-vary with one another across a huge, diverse collection of cell types and states. This foundation model can be used as is for some tasks (so-called zero-shot prediction) such as in silico gene knockout analysis, or it can be fine-tuned to perform a more specialized application, such as cell-type annotation or identifying transcription factor targets (Fig. 6d).
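The latent-space arithmetic that underlies perturbation prediction in methods such as scGen can be sketched with PCA standing in for the learned autoencoder latent space; the data, cell types and treatment shift below are all invented.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(12)
n_genes = 100

# Hypothetical expression matrices: cell type A observed untreated and treated,
# and cell type B observed only untreated. The goal is to predict treated B.
a_ctrl = rng.normal(0.0, 1.0, size=(300, n_genes))
a_treat = a_ctrl + rng.normal(2.0, 0.3, size=n_genes)   # per-gene treatment shift
b_ctrl = rng.normal(1.0, 1.0, size=(300, n_genes))

# Embed all observed cells in a shared latent space (PCA here, standing in for
# a variational autoencoder's learned latent space).
pca = PCA(n_components=10).fit(np.vstack([a_ctrl, a_treat, b_ctrl]))
delta = pca.transform(a_treat).mean(axis=0) - pca.transform(a_ctrl).mean(axis=0)

# Apply the learned perturbation vector to untreated B cells and decode back to
# expression space to obtain predicted treated transcriptomes.
b_treat_predicted = pca.inverse_transform(pca.transform(b_ctrl) + delta)
```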

There is a robust discussion ongoing within the machine learning community regarding how to combine the strengths of deep learning methods and more traditional modelling techniques, especially for very challenging problems such as forecasting a gene’s regulation in a developing embryo. On the one hand, discriminating between equally plausible mechanistic models may demand the accuracy that thus far only deep learning tools have attained. On the other hand, given many free parameters, a deep learning method may discover a highly accurate but totally uninterpretable or mechanistically completely wrong model for a given system. If we are to use the internal structure of our models to guide mechanistic experiments in the laboratory, we must demand more than accuracy of our statistical methods and artificial intelligence (AI) tools. They must not only learn biology but also be able to convey it to us in terms we understand. Accordingly, intense effort in the AI community is being directed towards making the output of AI models more readily interpretable96.

Conclusions and future directions

As dramatic as the advances in genomic technologies over the past decade have been, there is plenty of room for further progress. Protein abundances and other critical measurements remain challenging to capture in single cells. In terms of scale, experiments with hundreds of specimens remain cost-prohibitive for many laboratories, limiting access to cohort studies and screens. Costs of single-cell sequencing have plummeted in recent years, with new protocols97,98 and new commercial offerings increasingly democratizing the toolset, but experiments remain expensive in absolute terms and further cost reductions are needed. One can profile a handful of molecular modalities in single cells, but quantifying the contribution of the many inputs to gene regulation may require simultaneous measurement of myriad aspects of each cell’s molecular state. Gene knockouts with CRISPR are routine, but highly efficient, precise in vivo base editing would supercharge efforts to stratify the many thousands of outstanding genetic variants of unknown significance. The value of measuring cells in their tissue contexts was made abundantly clear by early spatial transcriptomic applications, but the current technology is largely two-dimensional, rather than three- or four-dimensional, distorting or excluding whole dimensions of context. Multiplexed single-cell measurement is destructive, limiting efforts to forecast cellular behaviour. Molecular recorders may mitigate this challenge but remain in their infancy, with limited ‘memory’ and programmability, and have yet to reach the level of democratization needed to see them deployed except in a few select applications. Fortunately, the pace of single-cell technology development remains as rapid as ever.

Although we have statistical tools to meet the challenges and opportunities presented by single-cell genomic and multiplexed imaging technologies, many of these tools remain inaccessible. There are at least three areas of investment that would help biologists to unlock their data using ideas from statistics. The first is bespoke statistical software built specifically to work with single-cell or spatial sequencing data. For example, many regression analysis packages are general and can be configured to work with all kinds of different data, but running them on a single-cell experiment requires detailed, intimate knowledge of the package, possibly intimidating a user without formal statistical training. Furthermore, the size and scale of single-cell data often exceed the capacity of data structures and algorithms of general-purpose software. Experiments that perturb hundreds or even thousands of genes will demand not only scalable data structures but also intuitive, responsive visualization tools to help users to interpret the many phenotypes that emerge at such scale11. The second is statistical education and outreach. The data science community has made tremendous progress educating scientists without computational backgrounds in using regression analysis and other techniques discussed here, but barriers to access could be lower. The third is closer communication between statisticians, geneticists and technology developers to shape future protocols and assays. By connecting developers of statistical methods with developers of genomic technologies, we will produce powerful, integrated solutions to answer questions that have frustrated geneticists for decades. New technologies invariably face trade-offs, and a statistical perspective can be tremendously useful for navigating these. For example, a new technique might favour more samples over profiling each sample more deeply. Will this limit statistical power when applying it to key questions in genetics? Such conversations might induce technology developers to invent assays they never would have considered had they not been made aware of the pain points that arise during statistical analysis or genetic interpretation.

The rise of machine learning has inescapable implications for biology. Although the accuracy of such models is still being assessed and a truly general AI model of gene regulation at genome scale may require much more data than is currently available, the pace of single-cell technology development, data generation and transfer learning suggests a future in which biologists can make extremely precise predictions about genes, cells, tissues and individual people. Such predictions will have both obvious and unanticipated applications across diverse areas of biomedicine. However, the convergence of statistical and deep learning and massive single-cell datasets will not alleviate the need for real-world experiments using conventional molecular and genetic approaches. Even the best models will be complex in terms of their internal structure, many will be largely or wholly uninterpretable in terms of underlying biological mechanisms, and there will probably be many equally plausible models for the same observations. Nevertheless, models will continue to help to ascribe functions to genes, either directly or by excluding reasonable hypotheses about how these genes might work. To paraphrase George Box’s famous aphorism, these models will all be wrong, but some will be useful in guiding mechanistic experiments. The models will come and go, but the mechanistic knowledge to which they direct us will be forever.