DNA mismatch and damage patterns revealed by single-molecule sequencing

Abstract

Mutations accumulate in the genome of every cell of the body throughout life, causing cancer and other diseases1,2. Most mutations begin as nucleotide mismatches or damage in one of the two strands of the DNA before becoming double-strand mutations if unrepaired or misrepaired3,4. However, current DNA-sequencing technologies cannot accurately resolve these initial single-strand events. Here we develop a single-molecule, long-read sequencing method (Hairpin Duplex Enhanced Fidelity sequencing (HiDEF-seq)) that achieves single-molecule fidelity for base substitutions when present in either one or both DNA strands. HiDEF-seq also detects cytosine deamination—a common type of DNA damage—with single-molecule fidelity. We profiled 134 samples from diverse tissues, including from individuals with cancer predisposition syndromes, and derive from them single-strand mismatch and damage signatures. We find correspondences between these single-strand signatures and known double-strand mutational signatures, which resolves the identity of the initiating lesions. Tumours deficient in both mismatch repair and replicative polymerase proofreading show distinct single-strand mismatch patterns compared to samples that are deficient in only polymerase proofreading. We also define a single-strand damage signature for APOBEC3A. In the mitochondrial genome, our findings support a mutagenic mechanism occurring primarily during replication. As double-strand DNA mutations are only the end point of the mutation process, our approach to detect the initiating single-strand events at single-molecule resolution will enable studies of how mutations arise in a variety of contexts, especially in cancer and ageing.

Fig. 1: Overview of HiDEF-seq.
Fig. 2: ssDNA call burdens and patterns in cancer predisposition syndromes.
Fig. 3: Hypermutating tumours deficient in both mismatch repair and polymerase proofreading.
Fig. 4: ssDNA damage signatures of sperm and heat-treated DNA.
Fig. 5: Mitochondrial genome dsDNA and ssDNA call burdens and patterns.

Data availability

Sequencing data generated in this study (FASTQ files for Illumina sequencing; subreads BAM files for PacBio data) are available at the NCBI database of Genotypes and Phenotypes under accession code phs003604 (all of the samples except those from the International Replication Repair Deficiency Consortium and participants D1 and D2) and at the European Genome–Phenome Archive under accession number EGAS50000000318 (samples from the International Replication Repair Deficiency Consortium). Sequencing data of participants D1 and D2 were not deposited in these databases due to consent limitations. Accession IDs of specific samples are provided in Supplementary Table 1.

Code availability

The source code for the HiDEF-seq analysis pipeline is available at GitHub (https://github.com/evronylab/HiDEF-seq), and the version used for this manuscript (v.1.1) is archived in Zenodo (https://doi.org/10.5281/zenodo.10898439).

References

  1. Mustjoki, S. & Young, N. S. Somatic mutations in “benign” disease. N. Engl. J. Med. 384, 2039–2052 (2021).

  2. Vijg, J. & Dong, X. Pathogenic mechanisms of somatic mutation and genome mosaicism in aging. Cell 182, 12–23 (2020).

  3. Seplyarskiy, V. B. & Sunyaev, S. The origin of human mutation in light of genomic data. Nat. Rev. Genet. 22, 672–686 (2021).

  4. Koh, G., Degasperi, A., Zou, X., Momen, S. & Nik-Zainal, S. Mutational signatures: emerging concepts, caveats and clinical applications. Nat. Rev. Cancer 21, 619–637 (2021).

  5. Evrony, G. D. et al. Single-neuron sequencing analysis of L1 retrotransposition and somatic mutation in the human brain. Cell 151, 483–496 (2012).

  6. Blokzijl, F. et al. Tissue-specific mutation accumulation in human adult stem cells during life. Nature 538, 260–264 (2016).

  7. Lee-Six, H. et al. The landscape of somatic mutation in normal colorectal epithelial cells. Nature 574, 532–537 (2019).

  8. Abascal, F. et al. Somatic mutation landscapes at single-molecule resolution. Nature 593, 405–410 (2021).

  9. Schmitt, M. W. et al. Detection of ultra-rare mutations by next-generation sequencing. Proc. Natl Acad. Sci. USA 109, 14508 (2012).

  10. Sloan, D. B., Broz, A. K., Sharbrough, J. & Wu, Z. Detecting rare mutations and DNA damage with sequencing-based methods. Trends Biotechnol. 36, 729–740 (2018).

  11. Wenger, A. M. et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat. Biotechnol. 37, 1155–1162 (2019).

  12. Baid, G. et al. DeepConsensus improves the accuracy of sequences with a gap-aware sequence transformer. Nat. Biotechnol. 41, 232–238 (2022).

  13. Moore, L. et al. The mutational landscape of human somatic and germline cells. Nature 597, 381–386 (2021).

  14. Halldorsson, B. V. et al. Characterizing mutagenic effects of recombination through a sequence-level genetic map. Science 363, eaau1043 (2019).

  15. Hoang, M. L. et al. Genome-wide quantification of rare somatic mutations in normal human tissues using massively parallel sequencing. Proc. Natl Acad. Sci. USA 113, 9846–9851 (2016).

  16. Xing, D., Tan, L., Chang, C.-H., Li, H. & Xie, X. S. Accurate SNV detection in single cells by transposon-based whole-genome amplification of complementary strands. Proc. Natl Acad. Sci. USA 118, e2013106118 (2021).

  17. Robinson, P. S. et al. Increased somatic mutation burdens in normal human cells due to defective DNA polymerases. Nat. Genet. 53, 1434–1442 (2021).

  18. Zou, X. et al. A systematic CRISPR screen defines mutational mechanisms underpinning signatures caused by replication errors and endogenous DNA damage. Nat. Cancer 2, 643–657 (2021).

  19. Sanders, M. A. et al. Life without mismatch repair. Preprint at bioRxiv https://doi.org/10.1101/2021.04.14.437578 (2021).

  20. Yurchenko, A. A. et al. XPC deficiency increases risk of hematologic malignancies through mutator phenotype and characteristic mutational signature. Nat. Commun. 11, 5834 (2020).

  21. Robinson, P. S. et al. Inherited MUTYH mutations cause elevated somatic mutation rates and distinctive mutational signatures in normal human cells. Nat. Commun. 13, 3949 (2022).

  22. Lujan, S. A., Williams, J. S. & Kunkel, T. A. DNA polymerases divide the labor of genome replication. Trends Cell Biol. 26, 640–654 (2016).

  23. Alexandrov, L. B. et al. The repertoire of mutational signatures in human cancer. Nature 578, 94–101 (2020).

  24. Lujan, S. A. et al. Heterogeneous polymerase fidelity and mismatch repair bias genome variation and composition. Genome Res. 24, 1751–1764 (2014).

  25. Shinbrot, E. et al. Exonuclease mutations in DNA polymerase epsilon reveal replication strand specific mutation patterns and human origins of replication. Genome Res. 24, 1740–1750 (2014).

  26. Tomkova, M., Tomek, J., Kriaucionis, S. & Schuster-Böckler, B. Mutational signature distribution varies with DNA replication timing and strand asymmetry. Genome Biol. 19, 129 (2018).

  27. Haradhvala, N. J. et al. Mutational strand asymmetries in cancer genomes reveal mechanisms of DNA damage and repair. Cell 164, 538–549 (2016).

  28. Bulock, C. R., Xing, X. & Shcherbakova, P. V. Mismatch repair and DNA polymerase δ proofreading prevent catastrophic accumulation of leading strand errors in cells expressing a cancer-associated DNA polymerase ϵ variant. Nucleic Acids Res. 48, 9124–9134 (2020).

  29. Shlien, A. et al. Combined hereditary and somatic mutations of replication error repair genes result in rapid onset of ultra-hypermutated cancers. Nat. Genet. 47, 257–262 (2015).

  30. Hodel, K. P. et al. Explosive mutation accumulation triggered by heterozygous human Pol ε proofreading-deficiency is driven by suppression of mismatch repair. eLife 7, e32692 (2018).

  31. Haradhvala, N. J. et al. Distinct mutational signatures characterize concurrent loss of polymerase proofreading and mismatch repair. Nat. Commun. 9, 1746 (2018).

  32. Hodel, K. P. et al. POLE mutation spectra are shaped by the mutant allele identity, its abundance, and mismatch repair status. Mol. Cell 78, 1166–1177 (2020).

  33. Kunkel, T. A. & Erie, D. A. Eukaryotic mismatch repair in relation to DNA replication. Ann. Rev. Genet. 49, 291–313 (2015).

  34. Shinmura, K. et al. Defective repair capacity of variant proteins of the DNA glycosylase NTHL1 for 5-hydroxyuracil, an oxidation product of cytosine. Free Radic. Biol. Med. 131, 264–273 (2019).

  35. Dizdaroglu, M. Oxidatively induced DNA damage and its repair in cancer. Mutat. Res. Rev. Mutat. Res. 763, 212–245 (2015).

  36. Madugundu, G. S., Cadet, J. & Wagner, J. R. Hydroxyl-radical-induced oxidation of 5-methylcytosine in isolated and cellular DNA. Nucleic Acids Res. 42, 7450–7460 (2014).

  37. Chen, G., Mosier, S., Gocke, C. D., Lin, M.-T. & Eshleman, J. R. Cytosine deamination is a major cause of baseline noise in next-generation sequencing. Mol. Diagn. Ther. 18, 587–593 (2014).

  38. Tretyakova, N., Villalta, P. W. & Kotapati, S. Mass spectrometry of structurally modified DNA. Chem. Rev. 113, 2395–2436 (2013).

  39. Grolleman, J. E. et al. Mutational signature analysis reveals NTHL1 deficiency to cause a multi-tumor phenotype. Cancer Cell 35, 256–266 (2019).

  40. Krokan, H. E. & Bjørås, M. Base excision repair. Cold Spring Harb. Perspect. Biol. 5, a012583 (2013).

  41. Stringer, J. M., Winship, A., Liew, S. H. & Hutt, K. The capacity of oocytes for DNA repair. Cell. Mol. Life Sci. 75, 2777–2792 (2018).

  42. Guo, Q. et al. The mutational signatures of formalin fixation on the human genome. Nat. Commun. 13, 4487 (2022).

  43. Clark, T. A., Spittle, K. E., Turner, S. W. & Korlach, J. Direct detection and sequencing of damaged DNA bases. Genome Integr. 2, 10 (2011).

  44. Petljak, M. et al. Mechanisms of APOBEC3 mutagenesis in human cancer cells. Nature 607, 799–807 (2022).

  45. Sanchez-Contreras, M. et al. A replication-linked mutational gradient drives somatic mutation accumulation and influences germline polymorphisms and genome composition in mitochondrial DNA. Nucleic Acids Res. 49, 11103–11118 (2021).

  46. Ju, Y. S. et al. Origins and functional consequences of somatic mitochondrial DNA mutations in human cancer. eLife 3, e02935 (2014).

  47. Kauppila, J. H. K. & Stewart, J. B. Mitochondrial DNA: radically free of free-radical driven mutations. Biochim. Biophys. Acta 1847, 1354–1361 (2015).

  48. Kennedy, S. R., Salk, J. J., Schmitt, M. W. & Loeb, L. A. Ultra-sensitive sequencing reveals an age-related increase in somatic mitochondrial mutations that are inconsistent with oxidative damage. PLoS Genet. 9, e1003794 (2013).

  49. Yuan, Y. et al. Comprehensive molecular characterization of mitochondrial genomes in human cancers. Nat. Genet. 52, 342–352 (2020).

  50. Fontana, G. A. & Gahlon, H. L. Mechanisms of replication and repair in mitochondrial DNA deletion formation. Nucleic Acids Res. 48, 11244–11258 (2020).

  51. Lodato, M. A. et al. Aging and neurodegeneration are associated with increased mutations in single human neurons. Science 359, 555–559 (2018).

  52. Matsuda, T., Matsuda, S. & Yamada, M. Mutation assay using single-molecule real-time (SMRT™) sequencing technology. Genes Environ. 37, 15 (2015).

  53. Hestand, M. S., Houdt, J. V., Cristofoli, F. & Vermeesch, J. R. Polymerase specific error rates and profiles identified by single molecule sequencing. Mutat. Res. 784–785, 39–45 (2016).

  54. Agarwal, A., Gupta, S. & Sharma, R. in Andrological Evaluation of Male Infertility: A Laboratory Guide (eds Agarwal, A. et al.) 101–107 (Springer, 2016).

  55. Buisson, R. et al. Passenger hotspot mutations in cancer driven by APOBEC3A and mesoscale genomic features. Science 364, eaaw2872 (2019).

  56. Wu, H., de Gannes, M. K., Luchetti, G. & Pilsner, J. R. Rapid method for the isolation of mammalian sperm DNA. BioTechniques 58, 293–300 (2015).

  57. Jenkins, T. G., Liu, L., Aston, K. I. & Carrell, D. T. Pre-screening method for somatic cell contamination in human sperm epigenetic studies. Syst. Biol. Reprod. Med. 64, 146–155 (2018).

  58. Nurk, S. et al. The complete sequence of a human genome. Science 376, 44–53 (2022).

  59. Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. Preprint at https://arxiv.org/abs/1303.3997 (2013).

  60. Van der Auwera, G. A. & O’Connor, B. D. Genomics in the Cloud: Using Docker, GATK, and WDL in Terra. (O’Reilly Media, 2020).

  61. Poplin, R. et al. A universal SNP and small-indel variant caller using deep neural networks. Nat. Biotechnol. 36, 983–987 (2018).

  62. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).

  63. Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).

  64. R Core Team. R: A Language and Environment for Statistical Computing (2021).

  65. Morgan, M., Pagès, H., Obenchain, V. & Hayden, N. Rsamtools: binary alignment (BAM), FASTA, variant call (BCF), and tabix (2020).

  66. Lawrence, M. et al. Software for computing and annotating genomic ranges. PLoS Comput. Biol. 9, e1003118 (2013).

  67. Knaus, B. J. & Grünwald, N. J. vcfr: a package to manipulate and visualize variant call format data in R. Mol. Ecol. Resour. 17, 44–53 (2017).

  68. Wickham, H. The split-apply-combine strategy for data analysis. J. Stat. Softw. 40, 1–29 (2011).

  69. Li, J. configr: an implementation of parsing and writing configuration file (2020).

  70. Ching, T. qs: quick serialization of R objects https://CRAN.R-project.org/package=qs (2021).

  71. Blokzijl, F., Janssen, R., van Boxtel, R. & Cuppen, E. MutationalPatterns: comprehensive genome-wide analysis of mutational processes. Genome Med. 10, 33 (2018).

  72. Milton, S. & Wickham, H. magrittr: a forward-pipe operator for R (2020).

  73. Wickham, H., Hester, J. & Bryan, J. readr: read rectangular text data (2022).

  74. Wickham, H., François, R., Henry, L. & Müller, K. dplyr: a grammar of data manipulation (2021).

  75. Lee, S., Cook, D. & Lawrence, M. plyranges: a grammar of genomic data transformation. Genome Biol. 20, 4 (2019).

  76. Wickham, H. stringr: simple, consistent wrappers for common string operations (2019).

  77. Eddelbuettel, D. digest: create compact hash digests of R objects (2021).

  78. Lawrence, M., Gentleman, R. & Carey, V. rtracklayer: an R package for interfacing with genome browsers. Bioinformatics 25, 1841–1842 (2009).

  79. Danecek, P. et al. Twelve years of SAMtools and BCFtools. GigaScience 10, giab008 (2021).

  80. Kuhn, R. M., Haussler, D. & Kent, W. J. The UCSC genome browser and associated tools. Brief. Bioinform. 14, 144–161 (2013).

  81. Zerbino, D. R., Johnson, N., Juettemann, T., Wilder, S. P. & Flicek, P. WiggleTools: parallel processing of large collections of genome-wide datasets for visualization and statistical analysis. Bioinformatics 30, 1008–1009 (2014).

  82. Shen, W., Le, S., Li, Y. & Hu, F. SeqKit: a cross-platform and ultrafast toolkit for FASTA/Q file manipulation. PLoS ONE 11, e0163962 (2016).

  83. Kokot, M., Długosz, M. & Deorowicz, S. KMC 3: counting and manipulating k-mer statistics. Bioinformatics 33, 2759–2761 (2017).

  84. Hunt, M. et al. REAPR: a universal tool for genome assembly evaluation. Genome Biol. 14, R47 (2013).

  85. Ponstingl, H. & Ning, Z. SMALT - a new mapper for DNA sequencing reads [poster]. F1000Posters 1, 313 (2010).

  86. Kent, W. J., Zweig, A. S., Barber, G., Hinrichs, A. S. & Karolchik, D. BigWig and BigBed: enabling browsing of large distributed datasets. Bioinformatics 26, 2204–2207 (2010).

  87. Vollger, M. R. et al. Segmental duplications and their variation in a complete human genome. Science 376, eabj6965 (2022).

  88. Smit, A. F. A., Hubley, R. & Green, P. RepeatMasker Open-4.0 (2015).

  89. Karimzadeh, M., Ernst, C., Kundaje, A. & Hoffman, M. M. Umap and Bismap: quantifying genome and methylome mappability. Nucleic Acids Res. 46, e120 (2018).

  90. Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).

  91. Wickham, H. ggplot2: Elegant Graphics for Data Analysis (Springer, 2009).

  92. Liao, W.-W. et al. A draft human pangenome reference. Nature 617, 312–324 (2023).

  93. Jónsson, H. et al. Parental influence on human germline de novo mutations in 1,548 trios from Iceland. Nature 549, 519 (2017).

  94. Zhu, C.-H. et al. Investigation of the mechanisms leading to human sperm DNA damage based on transcriptome analysis by RNA-seq techniques. Reprod. BioMed. Online 46, 11–19 (2023).

  95. Gori, K. & Baez-Ortega, A. sigfit: flexible Bayesian inference of mutational signatures. Preprint at bioRxiv https://doi.org/10.1101/372896 (2020).

  96. Cagan, A. et al. Somatic mutation rates scale with lifespan across mammals. Nature 604, 517–524 (2022).

  97. Hansen, R. S. et al. Sequencing newly replicated DNA reveals widespread plasticity in human replication timing. Proc. Natl Acad. Sci. USA 107, 139–144 (2010).

  98. Seplyarskiy, V. B. et al. APOBEC-induced mutations in human cancers are strongly enriched on the lagging DNA strand during replication. Genome Res. 26, 174–182 (2016).

  99. Flusberg, B. A. et al. Direct detection of DNA methylation during single-molecule, real-time sequencing. Nat. Methods 7, 461–465 (2010).

  100. Gu, Z., Eils, R. & Schlesner, M. Complex heatmaps reveal patterns and correlations in multidimensional genomic data. Bioinformatics 32, 2847–2849 (2016).

  101. Wagih, O. ggseqlogo: a versatile R package for drawing sequence logos. Bioinformatics 33, 3645–3647 (2017).

  102. Freudenthal, B. D., Beard, W. A., Shock, D. D. & Wilson, S. H. Observing a DNA polymerase choose right from wrong. Cell 154, 157–168 (2013).

  103. Verderio, P. et al. External quality assurance programs for processing methods provide evidence on impact of preanalytical variables. New Biotechnol. 72, 29–37 (2022).

Acknowledgements

This work was supported by grants from the NIH Common Fund (UG3NS132024 – Somatic Mosaicism across Human Tissues Network, G.D.E.; DP5OD028158, G.D.E.), the Eunice Kennedy Shriver National Institute of Child Health and Human Development (R21HD105910; G.D.E. and J.E.S.), the Sontag Foundation (G.D.E.), the Pew Charitable Trusts (G.D.E.) and the Jacob Goldfield Foundation (G.D.E.). Sequencing performed at the New York University (NYU) Grossman School of Medicine Genome Technology Center was supported in part by the National Cancer Institute (P30CA016087) and a National Institutes of Health Shared Instrumentation Grant (1S10OD023423-01). The computational work was supported in part by the New York University Information Technology High Performance Computing resources, services and staff expertise, and by the New York University Grossman School of Medicine High Performance Computing Core. U.T. was supported by a Stand Up To Cancer–Bristol-Myers Squibb Catalyst Research Grant (SU2C-AACR-CT-07-17), SickKids Foundation donors Harry and Agnieszka Hall, Meagan’s Walk (MW-2014-10), BRAINchild Canada, the LivWise Foundation, the Canadian Institutes for Health Research (CIHR; grant 108188) and a Canadian Cancer Society/CIHR/Brain Canada Spark Grant (Spark-21, 707089). J.E.S. was supported by the Damon Runyon Cancer Research Foundation, the Vinney Family Scholars Award and the Bristol Myers Squibb Foundation. M.G.-P. was supported by NIH grants T32AG052909 and F32AG076287. We thank B. Neel, H. Klein and A. Chakravarti (NYU Grossman School of Medicine) for discussions; D. Dimartino and P. Zappile (Genome Technology Center at NYU Grossman School of Medicine) for assistance with sequencing; M. Fridrikh, N. Francoeur and R. Sebra (Genomics Core Facility at the Icahn School of Medicine at Mount Sinai) for assistance with sequencing; S. Wang (NYU Information Technology) for assistance with high-performance computing; and the NIH NeuroBioBank and its staff at the University of Maryland (R. Johnson) for providing human tissues.

Author information

Contributions

G.D.E. conceived the project. G.D.E., M.H.L., B.M.C., U.C. and J.E.S. designed the experiments. U.T., V.B., L.S., N.M.N., T.P. and R.E.B. collected some of the samples. E.L., D.R. and A.-B.S. recruited research participants for sperm samples. D.R. prepared ZyMot sperm samples. M.H.L., R.C.B., A.S., Z.R.M., C.A.L., T.K.T. and G.D.E. prepared tissues and cell samples. M.H.L., B.M.C. and U.C. performed technology development experiments. M.H.L. and B.M.C. prepared HiDEF-seq sequencing libraries. M.G.-P. prepared NanoSeq libraries. M.H.L., B.M.C. and E.C.B. performed other experiments. J.R.W. assisted with interpretation of cytosine deamination data. G.D.E. created the computational pipeline with input from U.C. M.H.L., B.M.C. and G.D.E. performed the analysis. M.H.L. and G.D.E. wrote the initial manuscript, with input from B.M.C. and J.E.S. All of the authors contributed to the final manuscript.

Corresponding author

Correspondence to Gilad D. Evrony.

Ethics declarations

Competing interests

A patent application for HiDEF-seq has been filed (G.D.E.). G.D.E. owns equity in DNA sequencing companies (Illumina, Oxford Nanopore Technologies, and Pacific Biosciences). The other authors declare no competing interests.

Peer review

Peer review information

Nature thanks Francesca Storici and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 HiDEF-seq library preparation and sequencing metrics.

a, Representative DNA sizing electropherogram after Hpy166II restriction enzyme digestion (top) and after completion of the HiDEF-seq library preparation, which removes fragments <1 kb (bottom). b, Two-dimensional histogram of all molecules from a representative HiDEF-seq sequencing run of each molecule’s longest strand read length (bp, base pairs) versus its total polymerase read length (PRL). Dashed line signifies the expected strand length distribution. The red diagonal line reflects 18% of molecules with <1 strand pass, which is typical in PacBio sequencing. c, Histogram (200 bp bins) for representative HiDEF-seq samples (n = 51) of molecule consensus sequence lengths (i.e., molecule sizes). Line and shaded region show average and standard deviation, respectively, across samples for each bin. The average of these samples’ median lengths is 1.7 kilobases (kb). d, Histogram as in panel (c), showing HiDEF-seq (n = 51 representative samples) yields smaller molecule lengths than standard PacBio (HiFi) samples (n = 10 samples). The average of samples’ median lengths are 1.7 kb and 18.3 kb for HiDEF-seq and HiFi, respectively. e, Two-dimensional histogram of the number of passes (bin width of 5 passes) vs. consensus sequence lengths (bin width of 200 bp) for molecules from the 51 representative HiDEF-seq samples plotted in panels (c,d). Bins are coloured if there is at least one molecule in the bin. f, Box plots of the fraction of a molecule’s consensus sequence bases (average of forward and reverse strands) that have the maximum predicted quality (quality=93, as predicted by ccs, Methods) versus the number of passes per strand, across all molecules of the same samples included in panels (c-e). Note: 93 is the quality required for HiDEF-seq analysis. This plot illustrates that the number of passes is a key determinant of consensus quality in both HiDEF-seq and HiFi. b, Plot generated by SMRT Link (Pacific Biosciences) software. 
c-e, The single-molecule consensus sequence length is the average of the forward and reverse strand lengths. Bin values are normalized to the bin with the highest molecule count. e,f, The number of passes per strand is the average of the forward and reverse strand ‘ec’ tags (Methods). c-f, Plots show data of HiDEF-seq molecules that are output by the primary data processing step of the HiDEF-seq analysis pipeline and standard PacBio HiFi molecules that are output by the ccs HiFi pipeline (Methods). f, Box plot: middle line, median; boxes, 1st and 3rd quartiles; whiskers, the maximum/minimum values within 1.5 x interquartile range. X-axis: square brackets and parentheses signify inclusion and exclusion of interval endpoints, respectively.
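The box plot convention and the per-strand averaging described in this legend can be sketched in a few lines of Python. This is an illustrative sketch only, not code from the HiDEF-seq pipeline: the function names are hypothetical, and the quartile interpolation method (`inclusive`) is an assumption, since the legend does not state which one was used.

```python
import statistics


def passes_per_strand(ec_forward, ec_reverse):
    """Average the forward and reverse strand 'ec' tag values of one molecule."""
    return (ec_forward + ec_reverse) / 2


def box_plot_stats(values):
    """Return the median, quartiles and whisker ends for a box plot.

    Whiskers extend to the most extreme data points that lie within
    1.5 x the interquartile range of the 1st and 3rd quartiles.
    The 'inclusive' quartile method is an assumption of this sketch.
    """
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "whisker_low": min(v for v in data if v >= low_fence),
        "whisker_high": max(v for v in data if v <= high_fence),
    }


if __name__ == "__main__":
    # Made-up pass counts; 100 falls outside the upper fence.
    print(box_plot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 100]))
```

In this toy input the value 100 lies beyond the upper fence, so the upper whisker stops at 9 rather than at the maximum.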

Extended Data Fig. 2 Schematic of analysis pipeline.

Primary data processing (blue) is followed by call filtering (green) along with germline sequencing analysis (orange), which is then followed by call burden and signature analysis (purple). See Methods for full details. On the left of primary data processing steps are the average percentage of molecules filtered by each step across 17 representative HiDEF-seq sequencing runs. Approximately half of molecules filtered by the ‘Generate consensus sequence’ step are molecules with less than 3 full-length passes (default setting of the ccs tool that creates consensus sequences), and the other half are due to molecules with read quality (‘rq’ tag) <0.99. At the end of the call filtering steps are listed the percentage of bases filtered by all the call filtering steps, calculated out of the total bases of molecules that pass primary data processing, for the same 17 representative HiDEF-seq sequencing runs. The filter for ‘low-quality genomic regions and gnomAD variants with allele frequency (AF) > 0.1% in the population’ covers approximately 15% and 7% of the genome when using Illumina and PacBio germline sequencing data, respectively (i.e., when PacBio germline sequencing data is used, the pipeline uses less restrictive filters due to fewer genome alignment errors and artifacts). WGS, whole-genome sequencing.
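The two criteria named for the ‘Generate consensus sequence’ step (at least 3 full-length passes and a read quality ‘rq’ of at least 0.99) can be illustrated with a small filter. This is a hedged sketch, not the pipeline's implementation: the `Molecule` class, its field names and the tallying logic are hypothetical, and only the two thresholds come from the legend above.

```python
from dataclasses import dataclass

MIN_FULL_PASSES = 3  # default of the ccs tool, per the legend
MIN_RQ = 0.99        # minimum read quality ('rq' tag), per the legend


@dataclass
class Molecule:
    name: str
    full_passes: int  # hypothetical field: number of full-length passes
    rq: float         # predicted consensus accuracy ('rq' tag)


def filter_molecules(molecules):
    """Split molecules into kept ones and a tally of reasons for removal."""
    kept = []
    dropped = {"too_few_passes": 0, "low_rq": 0}
    for m in molecules:
        if m.full_passes < MIN_FULL_PASSES:
            dropped["too_few_passes"] += 1
        elif m.rq < MIN_RQ:
            dropped["low_rq"] += 1
        else:
            kept.append(m)
    return kept, dropped


if __name__ == "__main__":
    # Made-up molecules for illustration.
    demo = [
        Molecule("m1", full_passes=25, rq=0.9995),
        Molecule("m2", full_passes=2, rq=0.9990),
        Molecule("m3", full_passes=10, rq=0.9800),
    ]
    kept, dropped = filter_molecules(demo)
    print([m.name for m in kept], dropped)
```

Tallying the reason for each removal mirrors the legend's breakdown of how many molecules are lost to each criterion.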

Extended Data Fig. 3 Analysis thresholds and comparison of analyses using short- versus long-read germline sequencing.

a, Histogram of predicted consensus sequence accuracy (‘rq’ tag, bin width=0.0001) for DNA molecules that pass primary data processing steps from 3 representative sperm samples profiled by HiDEF-seq (with nick ligation) (21yo: SPM-1002; 39yo: SPM-1004; 44yo: SPM-1020; yo, years old). Note, these are consensus sequence accuracies predicted by the ccs consensus calling software (Methods), which are used to filter low-quality molecules, but these accuracies do not reflect the true accuracy that is significantly higher. b, Box plot of passes per strand for different consensus sequence accuracy bins, for molecules from the 3 samples included in the prior panel, showing that higher minimum accuracies select for molecules with higher numbers of passes. c, Fraction of post-primary data processing molecules that are filtered (left plot) and fraction of post-primary data processing base pairs that remain for interrogation (right plot) using different minimum passes per strand and consensus sequence accuracy thresholds. Values show average of the 3 samples included in the prior panels, after completing all steps of the mutation filtering pipeline. d,e, dsDNA mutation burdens for the 3 samples included in the prior panels using different minimum passes per strand and consensus sequence accuracy thresholds. Panel (e) shows data from (d) at consensus accuracy of 0.99 with Poisson 95% confidence intervals. These data illustrate stability of dsDNA mutation burden estimates at broad thresholds using sperm as the most stringent test of fidelity. f, Fraction of high-quality, known heterozygous germline variants detected using different minimum required fraction of molecule passes (i.e., subreads) that detect the variant (filter applied separately to each strand). This value is used for sensitivity correction (Methods). Values show average of the 3 samples included in prior panels. 
g,h, dsDNA mutation burdens for the 3 samples included in the prior panels using different minimum required fraction of molecule passes that detect the variant (filter applied separately to each strand), after correcting for sensitivity (g), and using different minimum required distances from the end of the read (h). Panel (g) illustrates that correcting for sensitivity maintains stable burden estimates. The analysis pipeline requires a minimum of 10 bp from the ends of reads to remove rare alignment artifacts, although this does not significantly alter burden estimates. i, ssDNA call burdens for the 3 sperm samples included in the prior panels using different minimum passes per strand and consensus sequence accuracy thresholds. Plot shows a small decrease in ssDNA call burdens with a higher minimum required passes per strand at low consensus sequence accuracy thresholds, and convergence to similar burdens at high consensus sequence accuracy thresholds. Data shown with minimum fraction of 0.5 molecule passes that detect the variant. j, ssDNA call burdens for the 3 sperm samples included in the prior panels using different minimum required fraction of molecule passes that detect the variant, after correcting for sensitivity. Data shown with a minimum consensus sequence accuracy of 0.999 and a minimum of 20 passes per strand. k,l, Concordant dsDNA mutation and ssDNA call burdens obtained by HiDEF-seq using short-read (Illumina) or long-read (PacBio, Pacific Biosciences) germline sequencing during analysis, for two samples (1301 and 1901 blood). a-d,i, Consensus sequence accuracies are the average of forward and reverse strand accuracies. b, Box plot: middle line, median; boxes, 1st and 3rd quartiles; whiskers, the maximum/minimum values within 1.5 x interquartile range. X-axis: square brackets and parentheses signify inclusion and exclusion of interval endpoints, respectively. c-e,i, Threshold for minimum required passes per strand is applied to both strands. 
c-j, The symbols ‡ and § mark the final thresholds chosen for dsDNA and ssDNA analyses, respectively. c,f, Error bars: standard deviation; note, panel (f) error bars are small and therefore not well visualized. d,e,g-l, Mutation and call burdens are corrected for sensitivity and trinucleotide context opportunities of the full genome relative to interrogated bases (Methods). e,g,h,j-l, Dots and error bars: point estimates and their Poisson 95% confidence intervals.
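
The Poisson 95% confidence intervals quoted throughout these legends can be illustrated with a minimal sketch. It assumes an exact (Garwood-type) interval on the raw call count, scaled by the number of interrogated bases. The function names are hypothetical, the paper's pipeline may compute the interval differently, and the sensitivity and trinucleotide-context corrections described in Methods are omitted here.

```python
import math


def poisson_cdf(k, mu):
    """P(X <= k) for X ~ Poisson(mu), summed term by term.

    Adequate for the modest counts in this sketch; exp(-mu) underflows
    for very large mu (roughly mu > 700).
    """
    if k < 0:
        return 0.0
    term = math.exp(-mu)
    total = term
    for i in range(1, k + 1):
        term *= mu / i
        total += term
    return min(total, 1.0)


def _bisect_decreasing(f, lo, hi, iterations=200):
    """Find the root of a function that decreases from positive to negative."""
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def poisson_exact_ci(k, alpha=0.05):
    """Exact two-sided confidence interval for a Poisson mean given count k."""
    upper_bound = k + 10 * math.sqrt(k + 1) + 10  # generous search bracket
    if k == 0:
        lower = 0.0
    else:
        # Lower limit: the mean at which P(X >= k) equals alpha / 2.
        lower = _bisect_decreasing(
            lambda mu: poisson_cdf(k - 1, mu) - (1 - alpha / 2), 0.0, upper_bound
        )
    # Upper limit: the mean at which P(X <= k) equals alpha / 2.
    upper = _bisect_decreasing(
        lambda mu: poisson_cdf(k, mu) - alpha / 2, 0.0, upper_bound
    )
    return lower, upper


def burden_with_ci(calls, interrogated_bases, alpha=0.05):
    """Uncorrected burden (calls per base) with its Poisson interval."""
    lower, upper = poisson_exact_ci(calls, alpha)
    return (
        calls / interrogated_bases,
        lower / interrogated_bases,
        upper / interrogated_bases,
    )
```

For example, `burden_with_ci(10, 1e9)` gives a point estimate of 1e-8 calls per base with an interval that is asymmetric around it, which is why the error bars on low-count samples in these panels are wider on the upper side.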

Extended Data Fig. 4 dsDNA mutation burdens of HiDEF-seq without ssDNA nick ligation and removal of ssDNA artifacts by ssDNA nick ligation.

a, dsDNA mutation burdens in two sperm samples (left to right: SPM-1004, SPM-1020) profiled by HiDEF-seq without ssDNA nick ligation and by NanoSeq, compared for each age (yo, years old) to paternally phased de novo mutations in children from a prior study of 2,976 trios (ref. 14). See Fig. 1c for sperm samples profiled by HiDEF-seq with nick ligation. b, dsDNA mutation burdens versus age, measured by HiDEF-seq without nick ligation (see Fig. 1d for samples profiled by HiDEF-seq with nick ligation). Dashed lines (liver, kidney): weighted least-squares linear regression. Dotted lines (blood, neurons): these only connect two data points to aid visualization of burden difference, since regression cannot be performed with two samples. c, Mutational signature contribution to dsDNA mutations detected in samples profiled by HiDEF-seq without nick ligation (see Extended Data Fig. 5i for samples profiled by HiDEF-seq with nick ligation). All samples, except blood from a 62-year-old individual with a history of kidney disease (1901, asterisk), were jointly analysed with fitting of SBS1 and de novo extraction of one additional signature SBSi (Methods). The blood sample of the 62-year-old was analysed separately together with 5 other HiDEF-seq (with nick ligation) blood samples from this individual, due to identification of an additional signature SBSii with strong and moderate similarity to SBS19 and SBS23, respectively. Analysis of samples grouped by tissue type, excluding the 62-year-old blood sample, produced similar results. For de novo extracted signatures (SBSi and SBSii), the cosine similarities to the most similar COSMIC signatures are shown in parentheses. Sperm samples and kidney and liver samples from an infant (1443) were not included here since the number of mutations is too low for reliable signature extraction.
d, Burdens of dsDNA mutations (left plot) and ssDNA calls (right plot) of a blood sample (individual 1301) measured by HiDEF-seq without versus with nick ligation. Nick ligation eliminates T > A ssDNA artifacts that match the illustrated GTTBVH motif. The motif was derived using the ggseqlogo R package (ref. 101) using all ssDNA T > A calls from the sample profiled by HiDEF-seq without nick ligation. Grey bar is calls matching the motif with log-odds score > 2 calculated with the score_match function of the universalmotif R package. e,f, Proposed mechanism for the GTTBVH motif of ssDNA artifactual calls in HiDEF-seq without ssDNA nick ligation. The known GTNNAC motif of the Hpy166II restriction enzyme used in HiDEF-seq may arise if Hpy166II operates as a dimer (cut sites signified by triangles) with each monomer binding opposite strands, and the GTTBVH motif is due to intersection (∩) and union (U) combinatorial logic for the outer and inner 2 bases, respectively (e). Without nick ligation, ssDNA GT[T > A]BVH artifactual calls may arise from rare Hpy166II monomer nicking events, pyrophosphorolysis of the ‘T’ upstream of the nick, and addition of a mismatched ‘A’ during the Klenow dATP/ddBTP A-tailing reaction. Further extension with ddBTP does not occur due to the mismatch (ref. 102). This process is prevented in HiDEF-seq by nick ligation. g, Nick ligation increases HiDEF-seq library yield by 66% for post-mortem tissues, likely by repairing nicks in the original input DNA so that the molecules are not eliminated in the final nuclease treatment step. Bars show average yield for each group; number of samples per group (left to right): 8, 8, 5, 9 (**, p = 0.002; ns, not significant; two-sided unpaired t-test). a, Box plots: middle line, median; boxes, 1st and 3rd quartiles; whiskers, 5% and 95% quantiles. For each sample, HiDEF-seq and NanoSeq confidence intervals were normalized to reflect an equivalent number of interrogated base pairs (Methods). 
a,b,d, Error bars: Poisson 95% confidence intervals. g, Error bars: standard deviation.
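
The two motifs named in panels d to f use standard IUPAC ambiguity codes (N, any base; B, not A; V, not T; H, not G). The sketch below expands them into regular expressions for exact matching. This is a simplification for illustration only: the legend states that the authors scored motif matches with log-odds scores from the universalmotif R package, which this sketch does not reproduce, and the helper names are hypothetical.

```python
import re

# Standard IUPAC nucleotide codes needed for the two motifs in the legend.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "N": "[ACGT]",  # any base
    "B": "[CGT]",   # not A
    "V": "[ACG]",   # not T
    "H": "[ACT]",   # not G
}


def motif_regex(motif):
    """Compile an IUPAC motif string into a regular expression."""
    return re.compile("".join(IUPAC[code] for code in motif))


ARTIFACT_MOTIF = motif_regex("GTTBVH")   # ssDNA T > A artifact context
HPY166II_MOTIF = motif_regex("GTNNAC")   # Hpy166II recognition site


def matches(pattern, sequence):
    """True if the whole sequence fits the motif exactly."""
    return pattern.fullmatch(sequence.upper()) is not None
```

With these definitions, `GTTCAC` fits both motifs, whereas `GTTAAC` fits only the Hpy166II site because position 4 of GTTBVH excludes A.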

Extended Data Fig. 5 HiDEF-seq without A-tailing removes ssDNA artifacts of post-mortem tissues with fragmented DNA.

a, Fraction of ssDNA calls that are T > A (corrected for trinucleotide context opportunities) versus the ssDNA T > A burden in all samples profiled by HiDEF-seq with A-tailing (i.e., Klenow reaction +dATP/+ddBTP) from healthy individuals and cell lines (i.e., excluding cancer predisposition syndromes). Post-mortem kidney and liver consistently have the highest fraction of ssDNA calls that are T > A. b, ssDNA call spectrum for a liver sample profiled by HiDEF-seq with A-tailing exhibiting a high ssDNA T > A burden (6.8 × 10^−7 T > A burden; 7.6 × 10^−7 total ssDNA call burden), corrected for trinucleotide context opportunities. Parentheses show total number of calls. c, Correlation between ssDNA T > A artifact burden and the input DNA’s DNA Integrity Number measured by TapeStation electrophoresis (ref. 103) across all samples profiled by HiDEF-seq with A-tailing from healthy individuals and cell lines (i.e., excluding cancer predisposition syndromes). Lower DNA Integrity Number corresponds to more fragmented DNA. d, Proposed mechanism for the ssDNA T > A artifact calls in fragmented DNA when performing HiDEF-seq with A-tailing and its prevention in HiDEF-seq without A-tailing. e, Modifications of the HiDEF-seq protocol to eliminate ssDNA T > A artifacts in fragmented DNA. All trials were from the same DNA extraction aliquot (liver from individual 5697). See Methods for details. PNK, polynucleotide kinase; Bst, Bst large fragment; min, minutes. f, ssDNA call spectra for three of the samples shown in panel (e): standard HiDEF-seq with A-tailing (top, same spectrum as panel (b)), HiDEF-seq with a Klenow reaction that contains neither dATP nor ddBTP (middle), and HiDEF-seq with a Klenow reaction containing only ddBTP (bottom). The total number of ssDNA calls and total ssDNA call burden (calls per base) are shown.
g, Fraction of ssDNA calls that are T > A (corrected for trinucleotide context opportunities) versus the ssDNA T > A burden in post-mortem liver (n = 5) and kidney (n = 5) samples profiled by HiDEF-seq without A-tailing (i.e., Klenow reaction -dATP/+ddBTP). h, Concordant dsDNA mutation burdens in sperm sample SPM-1013 measured by HiDEF-seq with A-tailing (i.e., Klenow reaction +dATP/+ddBTP) and without A-tailing (i.e., Klenow reaction -dATP/+ddBTP). yo, years old. i, Mutational signature contribution to dsDNA mutations detected by HiDEF-seq in primary human tissues from individuals without cancer predisposition. Post-mortem liver and kidney samples were profiled by HiDEF-seq without A-tailing. All samples, except blood from a 62-year-old individual with a history of kidney disease (1901, asterisk), were jointly analysed with fitting of SBS1 and de novo extraction of one additional signature SBSiii. Blood samples of the 62-year-old profiled by HiDEF-seq were analysed separately (plot shows average signature contributions across 5 blood samples) due to identification of an additional signature SBSiv. Analysis of samples grouped by tissue type, excluding the 62-year-old blood sample, produced similar results. For de novo extracted signatures (SBSiii and SBSiv), the cosine similarities to the most similar COSMIC signatures are shown in parentheses. Sperm, kidney and liver samples from an infant (1443) and 18-year-old (1409), and blood from a 4-year-old (5203) are not included here since their numbers of mutations are too low for reliable signature extraction. e,h, Bars (e) and dots (h) show point estimates, and error bars are their Poisson 95% confidence intervals. e-g, Rxn, reaction.

Extended Data Fig. 6 Comparison of HiDEF-seq and NanoSeq.

a, Comparison of HiDEF-seq versus NanoSeq dsDNA mutation spectra for individual 63143. b, Comparison of HiDEF-seq versus NanoSeq ssDNA call burdens, separated by call type. For each call type (i.e., C > A, C > G, etc.), each bar represents a different sample. Samples for each call type, from left to right, are 1105 and 6501 for healthy blood; 63143 for POLE blood; and 1443 for kidney. Comparison for sperm samples is shown in Fig. 1g. c, Comparison of HiDEF-seq versus NanoSeq ssDNA call spectra for 6501 (Blood, 43 yo), 63143 (POLE blood), and SPM-1060 (sperm, 49 yo). a-c, yo, years old; mo, months old.

Extended Data Fig. 7 dsDNA mutation burdens and patterns in cancer predisposition syndromes.

a, Fraction of dsDNA mutations in each context. Non-cancer predisposition samples are (left to right): Blood (B) 5203, 1105, 1301, 6501, and 1901; lymphoblastoid cell line (LCL) GM12812; primary fibroblasts GM02036 and GM03348. Cancer predisposition samples are (left-to-right, in the same order and annotated sample types as top-to-bottom cancer predisposition samples in panel (c)): GM16381, GM01629, GM28257, 55838, 58801, 57627, 1400, 1324, 1325, 60603, 59637, 57615, 63143 (LCL), 63143 (B), CC-346-253, CC-388-290, CC-713-555. Affected genes annotated below. Note, GM02036 (asterisk) has a significant increase in C > T mutations with a spectrum matching COSMIC SBS7a (ultraviolet light exposure), likely due to the fibroblasts deriving from sun-exposed skin. b, Representative dsDNA mutation spectra of a sample for each affected gene, corrected for trinucleotide context opportunities. Sample IDs are in parentheses. Ages (yo, years old) are listed for blood samples. c, Fraction of dsDNA mutations attributable to de novo extracted dsDNA mutational signatures. Sample genotypes are on the right (hom., homozygous; compound heterozygous variants separated by ‘/’). In parentheses is the cosine similarity to the most similar COSMIC signature when the similarity is ≥ 0.8 (weak similarity: 0.8 – 0.85; moderate similarity: 0.85 – 0.9; strong similarity: ≥ 0.9; Methods). In ERCC6 and ERCC8 mutant cell lines, whose mutational patterns are unknown, we identified signature SBSB with weak similarity (cosine similarity 0.82) to the COSMIC SBS36 signature. For SBSF, the most similar COSMIC signature is SBS10c, but the cosine similarity of 0.79 is not considered significant. For SBSG, the most similar COSMIC signature is SBS40, but the cosine similarity of 0.76 is not considered significant. SBSG had non-significant similarities to SBS18 (0.69) and SBS36 (0.59), which have been previously associated with MUTYH (ref. 21).
These MUTYH signatures were not extracted due to the normal mutation burdens of our MUTYH blood samples (see panel (d)), which is expected at these sample ages and our interrogated base coverage (ref. 21). Note that SBS40 resembles SBS18 and SBS36 in the C > A spectrum that is enriched in MUTYH syndrome (ref. 21). Signature extraction was performed for samples of each DNA repair pathway (except XPC separately from ERCC6/ERCC8), while simultaneously fitting COSMIC SBS1 and SBS5 (Methods). Samples are in the same top-to-bottom order as left-to-right cancer predisposition samples in panel (a). d, dsDNA mutation burden per base pair divided by the age of the individual in years at the time of blood collection, corrected for trinucleotide context opportunities and sensitivity. Only blood samples are shown, since blood can be annotated with the age of the individual. Accordingly, since we did not profile blood samples from individuals with a nucleotide excision repair syndrome, this category is not shown. Non-cancer predisposition blood samples are the same (left-to-right) as in panel (a) (left-to-right). Cancer predisposition blood samples are the same (left-to-right) as blood samples in panel (c) (top-to-bottom). Affected genes annotated below. e, Replication strand asymmetry based on replication timing data (Methods) of AGA > ATA ssDNA mismatches and dsDNA mutations in POLE PPAP samples. Reference (+) refers to the human reference genome plus strand. Non-reference (-) strand lagging and leading strand synthesis corresponds to negative and positive fork polarity values, respectively (Methods). The ‘strand ratio’ (Y-axis) is calculated as the fraction of all AGA > ATA non-reference strand events that have the specified fork polarity divided by the fraction of all AGA > ATA reference strand mutations that have the specified fork polarity (Methods). *, p = 0.015; ***, p < 10^−15 (chi-squared test; n = 73 ssDNA AGA > ATA mismatches; n = 3,871 dsDNA AGA > ATA mutations).
For dsDNA mutations, bars show the average across PPAP samples (n = 4), and for ssDNA mismatches, due to their low number, bars show a single estimate for calls pooled across PPAP samples. See (f) for analysis of dsDNA mutations separated by fork polarity quantiles (rather than positive versus negative polarity), which cannot be plotted for ssDNA mismatches due to the low number of ssDNA mismatches per quantile. ssDNA strand ratios were calculated using calls of all POLE PPAP samples, since there are too few calls to reliably analyse individual samples. dsDNA strand ratios were calculated separately for each sample (plot shows average and standard deviation). After excluding calls overlapping genes to remove transcription strand biases, the asymmetry remained significant for dsDNA mutations (p < 10^−15) but not for ssDNA mismatches, although the latter analysis had significantly reduced power due to a 55% reduction in the number of analysed ssDNA calls. f, Replication strand asymmetry of AGA > ATA dsDNA mutations in POLE PPAP samples calculated for each fork polarity quantile. Fork polarity quantiles divide fork polarity values into 9 quantile bins from 0 to 1, with higher values corresponding to a greater probability of the non-reference strand being replicated in the leading rather than lagging strand direction (Methods). Random loci are the average of 50 sets of 1,000 random genomic loci with either the sequence AGA or TCT for which there is replication timing data at the locus. The ‘strand ratio’ is calculated for POLE PPAP samples as in (e), and it is calculated for random genomic loci as the fraction of all AGA non-reference strand loci that are in the fork polarity quantile bin divided by the fraction of all AGA reference strand loci that are in the fork polarity quantile bin. PPAP samples are in the same top-to-bottom order in the legend as top-to-bottom PPAP samples in (c).
Asterisks signify statistical significance in comparison of the POLE PPAP 4-sample average (dashed line) to random loci (heteroscedastic two-tailed t-test); p-values left-to-right for asterisks: 3.7 × 10^−17, 0.001, 0.009, 0.02, 0.003. Excluding mutations overlapping genes to exclude transcription strand biases produced similar results (p = 3.1 × 10^−10, 0.003, and 0.04 for quantiles 0-0.1, 0.1-0.2, and 0.6-0.7, respectively), but this analysis has reduced power due to the 55% reduction in the number of mutations. a-f, See additional samples details in Supplementary Tables 14. e,f, Error bars: standard deviation.
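
The ‘strand ratio’ defined in panel (e) is a ratio of two fractions, which a short sketch may make concrete. The input representation (a list of strand and fork-polarity labels per event) and the function name are assumptions made for illustration. Deriving fork polarity from replication timing data, as described in Methods, is not shown.

```python
def strand_ratio(events, polarity):
    """Strand ratio for one fork-polarity class.

    events:   list of (strand, fork_polarity) pairs, where strand is
              "nonref" or "ref" and fork_polarity is "positive" or
              "negative" (hypothetical labels for this sketch).
    polarity: the fork-polarity class to evaluate.

    Returns the fraction of non-reference-strand events with that polarity
    divided by the fraction of reference-strand events with that polarity.
    """
    def fraction(strand):
        polarities = [p for s, p in events if s == strand]
        if not polarities:
            raise ValueError(f"no events on strand {strand!r}")
        return sum(p == polarity for p in polarities) / len(polarities)

    reference_fraction = fraction("ref")
    if reference_fraction == 0:
        raise ValueError("reference-strand fraction is zero; ratio undefined")
    return fraction("nonref") / reference_fraction
```

As a toy illustration, if 30 of 40 non-reference-strand events and 10 of 40 reference-strand events fall in positive fork polarity, the ratio for positive polarity is 0.75 / 0.25 = 3, and the ratio for negative polarity is 0.25 / 0.75, about 0.33. A value of 1 would indicate no strand asymmetry.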

Extended Data Fig. 8 Hypermutating tumours deficient in both mismatch repair and polymerase proofreading.

a, Burdens of dsDNA mutations (left) and ssDNA calls (right). Burdens are corrected for trinucleotide context opportunities and detection sensitivity (Methods). b,c, Fraction of dsDNA mutation burdens (b) and ssDNA call burdens (c) by context, corrected for trinucleotide context opportunities. d, ssDNA mismatch signature SBS14ss extracted from tumour samples, while simultaneously fitting SBS30ss*. e, Fraction of dsDNA mutations attributed to each dsDNA signature. Cosine similarity of the extracted signature SBSH to the most similar COSMIC SBS signature is shown in parentheses. Cosine similarities of original spectra of samples to spectra reconstructed from component signatures are (left to right): 0.94 and 0.998. f, Fraction of ssDNA calls attributed to each ssDNA signature. Cosine similarities of original spectra of samples to spectra reconstructed from component signatures are (left to right): 0.91 and 0.98. a, Dots and error bars: point estimates and their Poisson 95% confidence intervals. a-c,e,f, MB, medulloblastoma (ID: Tumour 8); GBM, glioblastoma (ID: Tumour 10). See Supplementary Table 1 for sample details.
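
Cosine similarity is the measure used in these legends both to compare extracted signatures with COSMIC signatures and to compare original spectra with spectra reconstructed from component signatures. The sketch below computes it for two spectra given as equal-length vectors and applies the similarity categories quoted in the Extended Data Fig. 7 legend. Assigning a boundary value such as 0.85 to the higher category is an assumption of this sketch, and the function names are hypothetical.

```python
import math


def cosine_similarity(spectrum_a, spectrum_b):
    """Cosine similarity of two spectra listed in the same context order."""
    if len(spectrum_a) != len(spectrum_b):
        raise ValueError("spectra must have the same number of contexts")
    dot = sum(a * b for a, b in zip(spectrum_a, spectrum_b))
    norm_a = math.sqrt(sum(a * a for a in spectrum_a))
    norm_b = math.sqrt(sum(b * b for b in spectrum_b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for an all-zero spectrum")
    return dot / (norm_a * norm_b)


def similarity_label(cosine):
    """Category names from the Extended Data Fig. 7 legend.

    Boundary values are assigned to the higher category (an assumption).
    """
    if cosine >= 0.9:
        return "strong"
    if cosine >= 0.85:
        return "moderate"
    if cosine >= 0.8:
        return "weak"
    return "not significant"
```

Under this labelling, the reconstruction similarities of 0.94 and 0.998 in panel (e) would both count as strong.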

Extended Data Fig. 9 Burdens of ssDNA C > T calls, kinetic interpulse duration profiles, and profiling of heat treatment in varied buffers.

a, Fraction of ssDNA calls that are C > T (corrected for trinucleotide context opportunities) across all HiDEF-seq samples from healthy individuals and cell lines (i.e., excluding cancer predisposition syndromes), versus the ssDNA C > T burden. Data shown for liver and kidney samples profiled by HiDEF-seq without A-tailing. Sperm consistently have the highest fraction of ssDNA calls that are C > T. LCL, lymphoblastoid cell line. b, Cosine similarity of ssDNA call spectra to SBS30 after projecting ssDNA spectra to central pyrimidine contexts. c, Average ratio of pulse widths (left) and interpulse durations (right) at C > T calls and 30 flanking bases relative to molecules aligning to the same locus without the call (sperm: n = 1799 calls; blood DNA 72 °C heat, 3 and 6 h: n = 626 calls; dsDNA C > T mutations in a larger set of non-heat treated blood DNA, 56 °C and 72 °C heat treated blood DNA, sperm, kidney, and liver samples: n = 1202 mutations; Methods). Positions +1 and +3 (stars) best discriminate ssDNA C > T damage from dsDNA C > T mutations. Yellow box is the span shown in Fig. 4f. d, Average ratio of pulse width (left column) and interpulse duration (right column) after randomizing labels of molecules with and without the calls, for the same samples and calls as in panel (c). e, dsDNA mutation and ssDNA call burdens of heat-treated blood DNA in an additional experiment testing the effect of different buffers and different DNA extraction methods (orange underline, Puregene alcohol precipitation; all other samples, MagAttract with magnetic beads). MgAc, magnesium acetate; MgCl2, magnesium chloride; KCl, potassium chloride; KAc, potassium acetate; Alb, albumin; Tris buffer is Tris-HCl except for the MgAc/KAc/Alb that is Tris-Acetate (see Supplementary Table 1 for concentrations). Non-heat treated DNA samples were placed on ice for 6 h. The percentage of ssDNA sequencing calls that are C > T are annotated above each sample. 
Cosine similarity to COSMIC dsDNA signature SBS30 is annotated below each sample, after collapsing ssDNA calls to central pyrimidine trinucleotide contexts and correcting for trinucleotide context opportunities, except for the no-heat treatment samples that do not have sufficient C > T calls (‘N/A’). f, SBS30ss* signature (reproduced from Fig. 4d) compared to spectra of ssDNA calls after 72 °C heat damage of blood DNA for 6 h (h) in only 10 mM Tris buffer (n = 10,852 calls) or only water (n = 2,751 calls). Spectra are plotted after correcting for trinucleotide context opportunities. Bottom, odds ratios of spectrum contributions at C > T contexts of the Tris-only and water-only samples compared to SBS30ss* (which was derived from sperm and salt-buffer heat-treated samples). Pyr, pyrimidine, Pur, purine. g, Heat map of average pulse width ratios for ssDNA and dsDNA C > T calls for positions −1 to +6, for blood DNA samples heated at 72 °C for 6 h in different buffers or water, and for additional samples for comparison. Unbiased clustering (dendrogram) separates kinetic profiles of ssDNA C > T calls from dsDNA C > T calls and from kinetic profiles after randomizing labels of molecules with and without the calls. dsDNA ‘Blood, heat’: blood DNA heat-treated at 56 °C and 72 °C (both 3 h and 6 h for each); dsDNA ‘Blood’: 4 samples, not heat treated. dsDNA ‘Kidney and liver’: 10 samples, not heat treated. b, HiDEF-seq spectra are corrected for trinucleotide context opportunities. c,d, Error bars: standard error of the mean. e, Bars and error bars: point estimates and their Poisson 95% confidence intervals.

Extended Data Fig. 10 APOBEC3A-induced dsDNA and ssDNA call burdens and patterns.

a,b, Burdens (corrected for trinucleotide context opportunities and sensitivity) of dsDNA mutations (a) and ssDNA calls (b) in fibroblasts transduced with lentivirus-expressing green fluorescent protein (GFP) as a control or APOBEC3A with or without a nuclear localization signal (NLS). Two biological replicates are shown for each condition. c, Spectra of dsDNA mutations corrected for trinucleotide context opportunities. d, Fraction of dsDNA mutations attributed to each dsDNA signature. Cosine similarity of the de novo extracted signature SBSI to the most similar COSMIC SBS signature is shown in parentheses. Cosine similarities of original spectra of samples to spectra reconstructed from component signatures are (left to right): 0.99, 0.98, 0.98, and 0.97. e, Spectra of ssDNA calls corrected for trinucleotide context opportunities. f, SBS2ss* obtained by de novo signature extraction from APOBEC3A samples. Cosine similarity to SBS2 is calculated after projecting to central pyrimidine trinucleotide context. a,b, Error bars: Poisson 95% confidence intervals.
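
Comparing an ssDNA spectrum with a COSMIC dsDNA signature such as SBS2 requires projecting the ssDNA calls to central pyrimidine trinucleotide contexts, as noted in panel (f). A minimal sketch of that projection follows, assuming calls are keyed by reference trinucleotide and alternate base; this data layout and the function names are hypothetical, not the authors' implementation.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Reverse complement of an uppercase DNA string."""
    return sequence.translate(_COMPLEMENT)[::-1]


def to_pyrimidine_context(trinucleotide, alt):
    """Express a call with a central purine as its reverse-complement call.

    Calls whose central base is already C or T are returned unchanged.
    """
    if trinucleotide[1] in "CT":
        return trinucleotide, alt
    return reverse_complement(trinucleotide), alt.translate(_COMPLEMENT)


def project_spectrum(ss_counts):
    """Collapse strand-resolved counts onto central-pyrimidine contexts."""
    projected = Counter()
    for (trinucleotide, alt), count in ss_counts.items():
        projected[to_pyrimidine_context(trinucleotide, alt)] += count
    return projected
```

For instance, an AGA > ATA call maps to TCT > TAT, so ssDNA calls on either strand of the same site accumulate in a single pyrimidine-centred context after projection. The strand information that distinguishes ssDNA spectra is lost in this step, which is why it is applied only for comparison with dsDNA signatures.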

Extended Data Fig. 11 Mitochondrial genome dsDNA mutation rates, similarity between SBS30ss* and mitochondrial genome heavy strand A > G dsDNA mutations, and mitochondrial ssDNA call spectra.

a, Mitochondrial dsDNA mutation burdens versus age in liver and kidney samples, including liver samples from which mitochondria were enriched. Dashed lines: weighted least-squares linear regression. Shaded ribbon: 95% confidence interval. b, SBS30ss* (cytosine deamination) spectrum is projected to central pyrimidine trinucleotide contexts and compared to mitochondria heavy strand A > G dsDNA mutation spectrum (corrected for trinucleotide context opportunities), for different sample sets: (i) HiDEF-seq liver and kidney samples, including liver samples from which mitochondria were enriched (i.e., same set of samples in Fig. 5a, c and Extended Data Fig. 11a); (ii) 5697 purified liver mitochondria samples only (plot includes 89% of the mutations in (i)); (iii) Sample set (i), excluding the 5697 purified liver mitochondria samples (plot includes 11% of the mutations in (i)). Note, the contexts of SBS30ss* are matched with the reverse complement flanking base contexts of mitochondria heavy strand A > G mutations. The number of dsDNA A > G mutations is indicated. c, Spectrum of mitochondrial ssDNA calls combined from the liver and kidney samples shown in Fig. 5a, c and Extended Data Fig. 11a. The spectrum is corrected for trinucleotide context opportunities, separately for each strand. See Fig. 5d for a spectrum that includes bulk (i.e., non-mitochondria enriched) samples profiled by HiDEF-seq with A-tailing. a, Dots and error bars: point estimates and their Poisson 95% confidence intervals.

Supplementary information

Supplementary Notes

Supplementary Notes 1–12, including Supplementary Figs. 1–7 and Supplementary References.

Reporting Summary

Supplementary Tables

Supplementary Table 1: details of samples profiled in the study and sequencing statistics. Supplementary Table 2: ssDNA call and dsDNA mutation burdens for all samples. Supplementary Tables 3 and 4: the raw counts and spectra of ssDNA calls and dsDNA mutations, respectively. Spectra (normalized to sum = 1) are corrected for the trinucleotide content of the genome relative to the trinucleotide content of the interrogated bases, except for ssDNA calls of NanoSeq. Supplementary Tables 5 and 6: details of the ssDNA calls and dsDNA mutations, respectively. Supplementary Table 7: profiles of single-strand signatures. Supplementary Table 8: details of lentiviruses.
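
The trinucleotide correction described for Supplementary Tables 3 and 4 can be sketched as a reweighting of raw counts by the ratio of genome to interrogated trinucleotide frequencies, followed by normalization to sum to 1. This is one plausible reading of the description above rather than the authors' exact procedure (see Methods), and the function and argument names are hypothetical.

```python
def correct_spectrum(counts, interrogated_tri, genome_tri):
    """Reweight raw call counts by trinucleotide opportunity and normalize.

    counts:           {(trinucleotide, alt): raw count}
    interrogated_tri: {trinucleotide: occurrences among interrogated bases}
    genome_tri:       {trinucleotide: occurrences in the full genome}
    """
    interrogated_total = sum(interrogated_tri.values())
    genome_total = sum(genome_tri.values())
    weighted = {}
    for (trinucleotide, alt), count in counts.items():
        genome_fraction = genome_tri[trinucleotide] / genome_total
        interrogated_fraction = interrogated_tri[trinucleotide] / interrogated_total
        # Contexts under-represented among interrogated bases are up-weighted.
        weighted[(trinucleotide, alt)] = count * genome_fraction / interrogated_fraction
    total = sum(weighted.values())
    if total == 0:
        raise ValueError("no calls to normalize")
    return {context: value / total for context, value in weighted.items()}
```

In a toy case with equal raw counts in two contexts, where one context makes up 25% of interrogated bases but 50% of the genome, that context ends up with 75% of the corrected spectrum.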

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Liu, M.H., Costa, B.M., Bianchini, E.C. et al. DNA mismatch and damage patterns revealed by single-molecule sequencing. Nature 630, 752–761 (2024). https://doi.org/10.1038/s41586-024-07532-8

  • DOI: https://doi.org/10.1038/s41586-024-07532-8
