Data Descriptor
Open access
Published: 05 June 2024

Haplotype-resolved chromosome-level genome assembly of Ehretia macrophylla

Scientific Data volume 11, Article number: 589 (2024)

Abstract

Ehretia macrophylla Wall, known as wild loquat, is an ecologically, economically, and medicinally significant tree species widely grown in China, Japan, Vietnam, and Nepal. In this study, we generated a haplotype-resolved chromosome-scale genome assembly of E. macrophylla by integrating PacBio HiFi long reads, Illumina short reads, and Hi-C data. The genome assembly consists of two haplotypes, with sizes of 1.82 Gb and 1.58 Gb and contig N50 lengths of 28.11 Mb and 21.57 Mb, respectively. In total, 99.41% of the assembly was anchored onto 40 pseudo-chromosomes. We predicted 58,886 protein-coding genes, of which 99.60% were functionally annotated against public databases. We also detected 2.65 Gb of repeat sequences, 659,290 rRNAs, 4,931 tRNAs, and 4,688 other ncRNAs. This high-quality genome assembly offers a solid basis for molecular breeding and functional genomics of E. macrophylla.

Background & Summary

Ehretia macrophylla Wall is a perennial tree belonging to the genus Ehretia in the family Boraginaceae. It can reach 15 m in height and is widely distributed in the southwest, south, and east of China, as well as in certain regions of Japan, Vietnam, and Nepal¹,²,³. E. macrophylla, also known as wild loquat in China, is a rare tree with ecological, gardening, ornamental, and medicinal value. To date, no species in the genus Ehretia has had its genome completely sequenced. Despite the many applications of E. macrophylla, genetic studies of the species are impeded by the absence of a high-quality reference genome.

E. macrophylla is an excellent tree species for urban greening and as a border tree, especially where dust retention is needed, owing to its tall trunk, strong dust absorption ability, and resistance to pests and diseases². The foliage of E. macrophylla is both a potential food source and a medicinal resource⁴. It is traditionally used to activate the meridians, treat rheumatism, dispel wind and dampness, and relieve joint pain. The bark of E. macrophylla is reported to dissipate blood stasis and swelling, making it suitable for treating fall injuries³. The fruit of E. macrophylla serves as a functional food supplement; it is eaten as a traditional fruit and used in herbal tea. It can help soothe the throat and alleviate coughs, and is usually used to treat diseases such as bronchitis, acute and chronic pharyngitis, cough, and asthma²,⁵. As a prominent species within the genus Ehretia, E. macrophylla owes its diverse range of applications to the abundance of bioactive compounds in its fruit and other tissues. These bioactive substances exhibit remarkable antioxidant, antitumor, anti-inflammatory, antiviral, and antibacterial properties. Key compounds found in E. macrophylla include quercetin, flavonoids, kaempferol, rosmarinate, caffeic acid, and pectin polysaccharide²,⁴,⁵.

High-quality genomes are of great importance for in-depth research, rational development, and adequate protection of plants. Here, we present a high-quality genome assembly of E. macrophylla produced with an integrated approach that combines PacBio HiFi long-read sequencing, Illumina short-read sequencing, and Hi-C sequencing. The assembled genome (~3.40 Gb) comprises haplotype a (1.82 Gb) and haplotype b (1.58 Gb), with contig N50 lengths of 28.11 Mb and 21.57 Mb, respectively. The assembled scaffolds were anchored to 40 pseudochromosomes with an anchoring rate of 99.41%. We predicted a total of 58,886 protein-coding genes, 29,805 in haplotype a and 29,081 in haplotype b. Among these genes, 99.60% were functionally annotated. In addition, we identified 2.65 Gb of repeat sequences (1.44 Gb in haplotype a and 1.21 Gb in haplotype b), and annotated a total of 668,909 non-coding RNA genes, including 659,290 rRNA genes (415,016 in haplotype a and 244,274 in haplotype b), 4,931 tRNA genes (2,522 in haplotype a and 2,409 in haplotype b), and 4,688 other ncRNA genes (2,428 in haplotype a and 2,260 in haplotype b). Our data will serve as a valuable genetic resource for revealing the genetic mechanisms behind the special properties of the species, conducting evolutionary studies of the genus Ehretia and the family Boraginaceae, and supporting molecular breeding of E. macrophylla.

Methods

Plant materials, library construction, and genome size estimation

Fresh leaf tissue for genome and RNA sequencing was sampled in 2022 from a mature E. macrophylla individual growing in Luoyang, Henan Province, China (34.663041 N, 112.434468 E) (Fig. 1a). High-quality genomic DNA was isolated using the Plant Genomic DNA Kit (Tiangen, China). The concentration and purity of the genomic DNA were assessed using a NanoDrop 8000 spectrophotometer (Thermo Fisher Scientific, USA). Total RNA was extracted from E. macrophylla samples using TRIzol reagent. The isolated RNA was then treated with RNase-free DNase I and eluted with RNase-free water. RNA integrity was measured using an Agilent 2100 Bioanalyzer (Agilent Technologies, Palo Alto, CA, USA).

DNA that passed quality control was used to construct a genome library with the Pacific Biosciences SMRTbell Express Template Prep Kit. A 20-kb insert library was size-selected using a BluePippin system. Sequencing was carried out on the Pacific Biosciences Sequel II platform (Pacific Biosciences, Menlo Park, CA, USA). We obtained ~152.18 Gb of PacBio HiFi raw data (~84×) with an average read length of 16.05 kb (Table 1). Illumina sequencing was performed on the HiSeq X Ten platform (Illumina) in 150-bp paired-end (PE150) mode, yielding approximately 54.03 Gb of raw data (~30×). The Hi-C libraries were constructed, enriched, and sheared according to methods described previously⁶,⁷. Hi-C sequencing was conducted on the Illumina HiSeq X Ten platform, and approximately 208.36 Gb (~114×) of raw Hi-C data were acquired. For RNA sequencing, a cDNA library was constructed using an RNA Library Prep Kit (NEB, UK). Approximately 9.66 Gb of raw data were obtained from the HiSeq X Ten platform (Illumina).
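The depth figures quoted above follow from dividing each data volume by a haploid genome size. The sketch below reproduces that arithmetic in Python. The data volumes come from the text; the reference size of 1.82 Gb (the haplotype a assembly size) is an assumption, since the text does not state which genome size was used for the depth calculation.

```python
# Sequencing depth = total sequenced bases / haploid genome size.
# Data volumes (Gb) are taken from the text above.
RAW_DATA_GB = {
    "PacBio HiFi": 152.18,
    "Illumina": 54.03,
    "Hi-C": 208.36,
}

# Assumption: the haplotype a assembly size is used as the reference size.
ASSUMED_GENOME_SIZE_GB = 1.82


def sequencing_depth(total_gb: float, genome_gb: float) -> float:
    """Return the average depth of coverage for a given data volume."""
    if genome_gb <= 0:
        raise ValueError("genome size must be positive")
    return total_gb / genome_gb


depths = {
    name: sequencing_depth(gb, ASSUMED_GENOME_SIZE_GB)
    for name, gb in RAW_DATA_GB.items()
}

for name, depth in depths.items():
    print(f"{name}: ~{depth:.0f}x")
```

Under this assumption the rounded depths agree with the values reported in the text (~84×, ~30×, and ~114×).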

Table 1 The statistics of the genome sequencing data of E. macrophylla.

Genome survey and assembly

Before assembly, adaptor sequences, low-quality regions, and overly short sequences were removed using fastp v0.19.3⁸. Jellyfish v2.3.0⁹ was used to compute the k-mer depth frequency distribution of the clean data with a k-mer size of 17, and GenomeScope v2.0¹⁰ was used to estimate the genome size. The estimated haploid genome size of E. macrophylla was approximately 1.84 Gb (Fig. 2). HiFi reads and Hi-C short reads were used together as input for the genome assembler Hifiasm v0.16.1¹¹. The assembly, run in Hi-C mode with default settings, produced two sets of contigs representing haplotype a and haplotype b. For chromosome-level scaffolding, we first aligned the Hi-C reads to the assembly using Juicer v1.6¹². Next, the draft genome assembly was scaffolded with 3D-DNA¹³ using the Hi-C reads. We then manually curated the chromosome construction in Juicebox¹⁴, removing incorrect insertions and adjusting orientations to correct visible errors as far as possible. To further refine the assembly, three rounds of correction were performed with Illumina reads using NextPolish v1.4.0¹⁵, and redundant sequences were removed using Redundans v0.14a27¹⁶. In total, approximately 99.41% of the assembled sequence was anchored onto 40 pseudochromosomes across the two haplotypes (Supplementary Table 1). The result was a high-quality haplotype-resolved chromosome-level genome of E. macrophylla (Fig. 1b, Fig. 3). The assembly (~3.40 Gb) comprised two haplotypes, haplotype a and haplotype b, with genome sizes of 1.82 Gb and 1.58 Gb, respectively (Table 2). Because no parental information was available for assigning chromosomes to parental haplotypes, we designated the longer chromosome of each homologous pair as haplotype a and the other as haplotype b. The contig N50 and scaffold N50 lengths were 28.11 Mb and 92.55 Mb for haplotype a, and 21.57 Mb and 83.31 Mb for haplotype b. A total of 307 gaps were identified in the current genome assembly (Table 2). LR_Gapcloser¹⁷ was run for two iterations with PacBio HiFi reads to fill gaps. Furthermore, we assembled a chloroplast genome of 156,639 bp and a mitochondrial genome of 702,890 bp using GetOrganelle v1.7.5.0¹⁸.
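For readers unfamiliar with the N50 statistic reported above, the following minimal Python sketch shows the standard definition: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. The function name and the toy lengths are illustrative only and are not taken from the assembly.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one sequence length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first sequence that brings the running sum to half the total.
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


# Toy example (hypothetical lengths in Mb, not real contigs).
toy_lengths = [80, 70, 50, 40, 30, 20]
print(n50(toy_lengths))
```

In the toy example the total is 290, so the two longest sequences (80 + 70 = 150) are the first to cover at least half of it, and the N50 is 70.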

Table 2 Summary of the E. macrophylla genome assembly data.

Genomic repeat annotation

To annotate the repeat sequences in the E. macrophylla genome, a transposable element (TE) library was first constructed de novo by running the Extensive de novo TE Annotator (EDTA) pipeline with the parameters --sensitive 1 --anno 1¹⁹. Then, we used RepeatMasker v4.1.3 (http://www.repeatmasker.org/RepeatMasker/) to mask repeats using the repeat library acquired from the Repbase database (https://www.girinst.org/repbase/). For E. macrophylla haplotype a, a total of 2,751,291 repetitive sequences with a cumulative length of 1.44 Gb were identified, constituting approximately 79.18% of the genome. Long terminal repeats (LTRs) were the main repeat class, with 851,702 elements totaling 790.85 Mb and accounting for 43.48% of the assembled genome. They were followed by terminal inverted repeat (TIR) DNA transposons at 29.36%. The copia- and gypsy-like LTRs totaled 109.80 Mb and 351.66 Mb, respectively, accounting for 6.04% and 19.33% of haplotype a (Table 3). For haplotype b, a total of 2,258,809 repetitive sequences (76.29% of the genome) with a cumulative length of 1.21 Gb were identified. LTRs were again the primary repetitive elements, with 788,470 elements totaling 713.16 Mb and representing 45.13% of the assembled genome. They were followed by TIRs, accounting for 24.32%. The copia- and gypsy-like LTRs totaled 109.44 Mb and 314.71 Mb, respectively, making up 6.93% and 19.92% of haplotype b (Table 3).

Table 3 The repetitive sequences identified in the genome of E. macrophylla.

Gene identification and functional annotations

To annotate protein-coding genes with high confidence, we combined homology-based, de novo, and transcriptome-based predictions. A total of 319,767 non-redundant protein sequences from related species (Echium plantagineum²⁰, Solanum lycopersicum²¹, Coffea canephora²², Eucommia ulmoides²³, Tectona grandis²⁴, Daucus carota²⁵, Nyssa sinensis²⁶, Rhododendron simsii²⁷, Lonicera japonica²⁸, Lactuca saligna²⁹, Vitis vinifera³⁰, and Arabidopsis thaliana³¹) were gathered as protein homology evidence and aligned using Exonerate v2.4.0³². The RNA-seq data were aligned to the genome sequences using HISAT2 v2.2.0¹⁹ with default parameters, and the aligned reads were assembled using StringTie 2 v2.1.2³³. Subsequently, all splicing variations were identified and classified by aligning full-length transcripts with the PASA v2.3.3³⁴ pipeline. All complete gene structures predicted by the PASA v2.3.3 pipeline were used to train a gene model with AUGUSTUS v3.3.3³⁵ under default parameters.

In addition, putative protein-coding gene structures were predicted using MAKER2³⁶. The ab initio predictions of gene structure were conducted using AUGUSTUS v3.3. We aligned the transcript evidence to the genome using BLAST+³⁷ and refined the alignments with Exonerate v2.4.0³². To increase the accuracy of the annotation, we integrated and updated the gene prediction results using EVidenceModeler (EVM)³⁸ and PASA. In total, we annotated 29,805 protein-coding genes in E. macrophylla haplotype a, with an average length of 4,956.40 bp. These comprised 36,131 coding DNA sequences (CDSs), 200,786 exons, and 164,655 introns. The average lengths were 1,243.30 bp for CDSs, 281 bp for exons, and 803 bp for introns (Table 4). Additionally, we identified 29,081 protein-coding genes in haplotype b, with an average length of 5,199.10 bp. A total of 34,686 CDSs, 191,925 exons, and 157,239 introns were detected, with average lengths of 1,248.6 bp, 279.1 bp, and 854.6 bp, respectively (Table 4).
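The exon, intron, and CDS counts above can be cross-checked with a simple identity: a transcript with n exons has n - 1 introns, so the number of exons minus the number of introns equals the number of transcripts. The Python sketch below applies this check to the counts quoted in the text. Treating the CDS count as the number of annotated transcripts is an interpretation made here for illustration; the text does not state it explicitly.

```python
# Counts quoted in the text for each haplotype.
ANNOTATION_COUNTS = {
    "haplotype a": {"genes": 29805, "cds": 36131, "exons": 200786, "introns": 164655},
    "haplotype b": {"genes": 29081, "cds": 34686, "exons": 191925, "introns": 157239},
}


def implied_transcripts(exons: int, introns: int) -> int:
    """Each transcript with n exons has n - 1 introns, so exons - introns = transcripts."""
    return exons - introns


for name, counts in ANNOTATION_COUNTS.items():
    transcripts = implied_transcripts(counts["exons"], counts["introns"])
    # Assumption: one CDS record per annotated transcript.
    consistent = transcripts == counts["cds"]
    per_gene = transcripts / counts["genes"]
    print(f"{name}: implied transcripts = {transcripts}, "
          f"matches CDS count = {consistent}, "
          f"transcripts per gene = {per_gene:.2f}")
```

For both haplotypes the implied transcript number equals the reported CDS count, which suggests that the CDS figures count transcript isoforms rather than genes.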

Table 4 Statistical analysis of gene annotations.

Functional annotation of protein-coding genes was carried out using three strategies. First, we mapped gene sequences against the eggNOG 5.0³⁹ database using eggNOG-mapper v2.16⁴⁰, which annotated 97.94% of the genes. Of these, 48.80% and 47.94% were annotated with Gene Ontology (GO, http://geneontology.org/) and Kyoto Encyclopedia of Genes and Genomes (KEGG, https://www.genome.jp/kegg) terms, respectively. Second, 98.40% of genes were annotated using DIAMOND v2.0.12⁴¹ against four protein databases: Swiss-Prot⁴² (78.96%), TrEMBL⁴² (98.39%), NR⁴³ (98.23%), and Arabidopsis thaliana genes (91.53%). Finally, InterProScan v5.5.2-86.0⁴⁴ was used to annotate 98.74% of the genes against 14 databases (Table 5).

Table 5 Statistics of protein-coding gene functional annotation for E. macrophylla.

For the annotation of non-coding RNA genes, we detected a total of 415,016 rRNA genes, 2,522 tRNA genes, and 2,428 other ncRNA genes in haplotype a using Barrnap (https://github.com/tseemann/barrnap), tRNAscan-SE⁴⁵, and Rfam⁴⁶, respectively. In haplotype b, a total of 244,274 rRNA genes, 2,409 tRNA genes, and 2,260 other ncRNA genes were detected (Table 6).

Table 6 Statistics for non-coding RNA genes in the genome of E. macrophylla.

Genome comparison between haplotype assemblies

The two haplotypes were aligned using minimap2⁴⁷, and syntenic regions and structural variations were identified using SyRI v1.6⁴⁸. The structural rearrangements identified between the haplotype genomes were visualized using Plotsr v0.5.4⁴⁹ (Fig. 4). Chromosomes 01, 02, and 04–10 exhibited more structural variation than the other chromosomes (Fig. 4a). A total of 13,045 syntenic regions (~953 Mbp) were detected, indicating high similarity between the two haplotypes (Fig. 4b). Numerous smaller variations were also detected, including small insertions/deletions and SNPs (Fig. 4c,d), and two relatively large inversions were found, one on chr07 and one on chr10 (Fig. 4a). A dot plot of syntenic blocks generated from the minimap2 alignment showed that the two haplotypes were very similar, with essentially the same chromosome order (Fig. 5).

Data Records

The sequencing data for this study have been deposited in the NCBI database under BioProject number PRJNA945189. The genomic PacBio sequencing data are available in the NCBI Sequence Read Archive (SRA) under accession numbers SRR23907027⁵⁰, SRR23907028⁵¹, SRR23907029⁵², and SRR23907030⁵³. The Hi-C sequencing data are available in the SRA under accession numbers SRR23907031⁵⁴ and SRR23907036⁵⁵. The genomic Illumina sequencing data are available under accession numbers SRR23907047⁵⁶ and SRR23907058⁵⁷. The final genome assemblies were deposited in GenBank under accession numbers GCA_037974685.1⁵⁸ and GCA_037974665.1⁵⁹. In addition, the final chromosome assembly and annotation data were deposited in the Genome Warehouse (GWH) of the National Genomics Data Center (NGDC) under accession number GWHEQHN00000000⁶⁰ and BioProject number PRJCA021125.

Technical Validation

To evaluate the completeness and accuracy of the genome, we used BWA⁶¹, minimap2⁴⁷, and HISAT2¹⁹ to align Illumina reads, HiFi reads, and RNA-Seq reads, respectively, to our reference genome. In addition, BUSCO v5.2.2⁶² was used to evaluate genome completeness against the embryophyta_odb10 and eukaryota_odb10 databases. Both haplotypes showed high completeness, with complete BUSCOs (single-copy plus multi-copy) accounting for 98.1% and 97.1% of the expected embryophyta genes, respectively (Table 7). The E. macrophylla genome size was evaluated using k-mer analysis (Fig. 2). After filtering out non-primary alignments, we calculated the mapping ratio and coverage percentage for each read type, and found that the sequencing data covered the genome at a relatively high rate (Table 8). We conducted additional quality control analysis of the genome assembly using Merqury⁶³ (k = 16) based on PacBio HiFi reads (Fig. 6, Table 9). The consensus quality values (QVs) of haplotype a, haplotype b, and the two haplotypes combined were 34.98, 34.74, and 34.87, respectively. The k-mer completeness scores of haplotype a, haplotype b, and the two haplotypes combined were approximately 82.08%, 81.07%, and 94.46%, respectively. Further BUSCO analysis showed that single-copy and multi-copy genes had approximately the same sequencing depth, indicating that the assembly contained no redundancy (Fig. 7).
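A consensus quality value is a Phred-scaled error probability, QV = -10 log10(E), so the QVs above can be converted into approximate per-base error rates. The short Python sketch below performs this conversion for the three reported values. It illustrates the Phred scale only and does not reproduce how Merqury estimates the error rate from k-mers; the function names are illustrative.

```python
import math


def qv_to_error_rate(qv: float) -> float:
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate: float) -> float:
    """Convert a per-base error probability to a Phred-scaled quality value."""
    if not 0 < error_rate <= 1:
        raise ValueError("error rate must be in (0, 1]")
    return -10 * math.log10(error_rate)


# QVs reported in the text.
REPORTED_QV = {"haplotype a": 34.98, "haplotype b": 34.74, "combined": 34.87}

for name, qv in REPORTED_QV.items():
    error = qv_to_error_rate(qv)
    print(f"{name}: QV {qv} -> error rate {error:.2e} "
          f"(about 1 error per {1 / error:,.0f} bases)")
```

On this scale a QV near 35 corresponds to roughly three errors per 10,000 bases.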

Table 7 Statistical analysis of BUSCO for both haplotypes and proteins.

Table 8 Statistics of map rate and coverage of three types of sequencing reads.

Table 9 Statistical analysis of Merqury for evaluating the quality of haplotypes.

To evaluate the single-base error rate and heterozygosity, next-generation (Illumina) reads were mapped to the genome using BWA, and variant loci were detected using BCFtools v1.11⁶⁴. Heterozygous sites were used to compute the heterozygosity rate, whereas homozygous variant sites were used to determine the error rate. The heterozygosity rate was approximately 0.19%, and the error rate was approximately 0.012%. By analyzing the coverage depth and GC content distribution of the second-generation (Illumina) and third-generation (PacBio HiFi) data, we found that the second-generation data had a significant guanine-cytosine (GC) bias (Fig. 8). Juicer¹² was used to map the Hi-C data to the final genome assembly. The chromosome clustering was normal, with no obvious chromosome assembly errors, although abnormal signals were present in some regions (Fig. 3). The chromatin interaction data from the Hi-C map showed only low-level interactions between pseudochromosomes, supporting the quality and reliability of our chromosome-level anchoring (Supplementary Table 1).
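The rate calculation described above can be sketched as follows: heterozygous calls are counted towards the heterozygosity rate and homozygous non-reference calls towards the error rate, each divided by the number of assessed bases. The Python example below is a minimal illustration under that reading. The genotype strings follow the usual VCF GT notation, and the function names, toy genotypes, and base count are hypothetical rather than taken from the study.

```python
from typing import Iterable, Tuple


def classify_genotype(gt: str) -> str:
    """Classify a VCF-style GT string such as '0/1', '1|1', or './.'."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return "missing"
    if len(set(alleles)) > 1:
        return "het"
    if alleles[0] != "0":
        return "hom_alt"
    return "hom_ref"


def variant_rates(genotypes: Iterable[str], assessed_bases: int) -> Tuple[float, float]:
    """Return (heterozygosity rate, error rate) as fractions of assessed bases."""
    if assessed_bases <= 0:
        raise ValueError("assessed_bases must be positive")
    classes = [classify_genotype(gt) for gt in genotypes]
    het_sites = classes.count("het")
    hom_alt_sites = classes.count("hom_alt")
    return het_sites / assessed_bases, hom_alt_sites / assessed_bases


# Toy input (hypothetical): six calls over 1,000 assessed bases.
toy_genotypes = ["0/1", "1/1", "0/1", "0|1", "./.", "0/0"]
het_rate, error_rate = variant_rates(toy_genotypes, assessed_bases=1000)
print(f"heterozygosity = {het_rate:.3%}, error rate = {error_rate:.3%}")
```

The rationale is that reads mapped back to the assembly of the same individual should not support a homozygous alternative allele, so such sites are treated as likely base errors in the assembly.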

The chromosomal locations of characteristic sequences, such as telomeres, rDNA, and tandem repeats, were determined by mapping repetitive sequences onto the genome. The majority of chromosome telomere sequences were completely assembled; however, a few were partial or missing. We detected a high-copy tandem repeat on the chromosomes (Supplementary txt 1). This sequence contains 5S rDNA, and its distribution is essentially consistent with that of 5S rDNA, suggesting that it represents 5S rDNA and its adjacent regions. In addition, the 18S-5.8S-28S rDNA and 5S rDNA arrays are very abundant and widely distributed (Supplementary Fig. 1).

BUSCO v5.2.2⁶² was used to assess the annotated and integrated protein set against the embryophyta_odb10 and eukaryota_odb10 databases. The proportion of complete core genes was 96.4% (Table 7), comprising 7.1% single-copy genes and 89.3% duplicated genes. Only 0.9% of genes were fragmented and 2.7% were missing, indicating that the genome annotation is of high quality.

Code availability

All software and pipelines were run according to the manuals and protocols of the published bioinformatics tools. The software versions and the code/parameters used are described in the Methods section.

References

Gottschling, M., Mai, D. H. & Hilger, H. H. The systematic position of Ehretia fossils (Ehretiaceae, Boraginales) from the European Tertiary and implications for character evolution. Review of Palaeobotany and Palynology 121, 149–156, https://doi.org/10.1016/S0034-6667(01)00147-6 (2002).
Deng, N., Zheng, B., Li, T., Hu, X. & Liu, R. H. Phenolic profiles, antioxidant, antiproliferative, and hypoglycemic activities of Ehretia macrophyla Wall. (EMW) fruit. J Food Sci 85, 2177–2185, https://doi.org/10.1111/1750-3841.15185 (2020).
Xu, X., Cheng, Y., Tong, L., Tian, L. & Xia, C. The complete chloroplast genome sequence of Ehretia dicksonii Hance (Ehretiaceae). Mitochondrial DNA B Resour 7, 661–662, https://doi.org/10.1080/23802359.2022.2061873 (2022).
Dong, M., Oda, Y. & Hirota, M. (10E,12Z,15Z)-9-hydroxy-10,12,15-octadecatrienoic acid methyl ester as an anti-inflammatory compound from Ehretia dicksonii. Biosci Biotechnol Biochem 64, 882–886, https://doi.org/10.1271/bbb.64.882 (2000).
Xu, D. et al. Potential prebiotic functions of a characterised Ehretia macrophylla Wall. fruit polysaccharide. Int J Food Sci Tech 57, 35–47, https://doi.org/10.1111/ijfs.15005 (2022).
Wang, C. et al. Genome-wide analysis of local chromatin packing in Arabidopsis thaliana. Genome Res 25, 246–256, https://doi.org/10.1101/gr.170332.113 (2015).
Niu, S. et al. The Chinese pine genome and methylome unveil key features of conifer evolution. Cell 185, 204–217 e214, https://doi.org/10.1016/j.cell.2021.12.006 (2022).
Chen, S., Zhou, Y., Chen, Y. & Gu, J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34, i884–i890, https://doi.org/10.1093/bioinformatics/bty560 (2018).
Marçais, G. & Kingsford, C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27, 764–770, https://doi.org/10.1093/bioinformatics/btr011 (2011).
Ranallo-Benavidez, T. R., Jaron, K. S. & Schatz, M. C. GenomeScope 2.0 and smudgeplot for reference-free profiling of polyploid genomes. Nat Commun 11, 1432, https://doi.org/10.1038/s41467-020-14998-3 (2020).
Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nature Methods 18, 170–175, https://doi.org/10.1038/s41592-020-01056-5 (2021).
Durand, N. C. et al. Juicer provides a one-click system for analyzing loop-resolution Hi-C experiments. Cell Syst 3, 95–98, https://doi.org/10.1016/j.cels.2016.07.002 (2016).
Dudchenko, O. et al. De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science 356, 92–95, https://doi.org/10.1126/science.aal3327 (2017).
Durand, N. C. et al. Juicebox provides a visualization system for Hi-C contact maps with unlimited zoom. Cell Syst 3, 99–101, https://doi.org/10.1016/j.cels.2015.07.012 (2016).
Hu, J., Fan, J., Sun, Z. & Liu, S. NextPolish: a fast and efficient genome polishing tool for long-read assembly. Bioinformatics 36, 2253–2255, https://doi.org/10.1093/bioinformatics/btz891 (2020).
Pryszcz, L. P. & Gabaldon, T. Redundans: an assembly pipeline for highly heterozygous genomes. Nucleic Acids Res 44, e113, https://doi.org/10.1093/nar/gkw294 (2016).
Xu, G. C. et al. LR_Gapcloser: a tiling path-based gap closer that uses long reads to complete genome assembly. Gigascience 8, https://doi.org/10.1093/gigascience/giy157 (2019).
Jin, J. J. et al. GetOrganelle: a fast and versatile toolkit for accurate de novo assembly of organelle genomes. Genome Biol 21, 241, https://doi.org/10.1186/s13059-020-02154-5 (2020).
Kim, D., Paggi, J. M., Park, C., Bennett, C. & Salzberg, S. L. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat Biotechnol 37, 907–915, https://doi.org/10.1038/s41587-019-0201-4 (2019).
Tang, C. Y., Li, S., Wang, Y. T. & Wang, X. Comparative genome/transcriptome analysis probes Boraginales’ phylogenetic position, WGDs in Boraginales, and key enzyme genes in the alkannin/shikonin core pathway. Mol Ecol Resour 20, 228–241, https://doi.org/10.1111/1755-0998.13104 (2020).
Hosmani, P. S. et al. An improved de novo assembly and annotation of the tomato reference genome using single-molecule sequencing, Hi-C proximity ligation and optical maps. bioRxiv, 767764, https://doi.org/10.1101/767764 (2019).
Denoeud, F. et al. The coffee genome provides insight into the convergent evolution of caffeine biosynthesis. Science 345, 1181–1184, https://doi.org/10.1126/science.1255274 (2014).
Li, Y. et al. High-quality de novo assembly of the Eucommia ulmoides haploid genome provides new insights into evolution and rubber biosynthesis. Hortic Res-England 7, https://doi.org/10.1038/s41438-020-00406-w (2020).
Zhao, D. et al. A chromosomal-scale genome assembly of Tectona grandis reveals the importance of tandem gene duplication and enables discovery of genes in natural product biosynthetic pathways. Gigascience 8, https://doi.org/10.1093/gigascience/giz005 (2019).
Iorizzo, M. et al. A high-quality carrot genome assembly provides new insights into carotenoid accumulation and asterid genome evolution. Nature Genetics 48, 657–+, https://doi.org/10.1038/ng.3565 (2016).
Yang, X. et al. A chromosome-level genome assembly of the Chinese tupelo Nyssa sinensis. Sci Data 6, 282, https://doi.org/10.1038/s41597-019-0296-y (2019).
Yang, F. S. et al. Chromosome-level genome assembly of a parent species of widely cultivated azaleas. Nat Commun 11, 5269, https://doi.org/10.1038/s41467-020-18771-4 (2020).
Pu, X. D. et al. The honeysuckle genome provides insight into the molecular mechanism of carotenoid metabolism underlying dynamic flower coloration. New Phytologist 227, 930–943, https://doi.org/10.1111/nph.16552 (2020).
Reyes-Chin-Wo, S. et al. Genome assembly with in vitro proximity ligation data and whole-genome triplication in lettuce. Nat Commun 8, 14953, https://doi.org/10.1038/ncomms14953 (2017).
Jaillon, O. et al. The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla. Nature 449, 463–467, https://doi.org/10.1038/nature06148 (2007).
Cheng, C. Y. et al. Araport11: a complete reannotation of the Arabidopsis thaliana reference genome. Plant J 89, 789–804, https://doi.org/10.1111/tpj.13415 (2017).
Slater, G. S. & Birney, E. Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics 6, 31, https://doi.org/10.1186/1471-2105-6-31 (2005).
Pertea, M. et al. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat Biotechnol 33, 290–295, https://doi.org/10.1038/nbt.3122 (2015).
Haas, B. J. et al. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res 31, 5654–5666, https://doi.org/10.1093/nar/gkg770 (2003).
Stanke, M., Diekhans, M., Baertsch, R. & Haussler, D. Using native and syntenically mapped cDNA alignments to improve de novo gene finding. Bioinformatics 24, 637–644, https://doi.org/10.1093/bioinformatics/btn013 (2008).
Cantarel, B. L. et al. MAKER: an easy-to-use annotation pipeline designed for emerging model organism genomes. Genome Res 18, 188–196, https://doi.org/10.1101/gr.6743907 (2008).
Camacho, C. et al. BLAST+: architecture and applications. BMC Bioinformatics 10, 421, https://doi.org/10.1186/1471-2105-10-421 (2009).
Haas, B. J. et al. Automated eukaryotic gene structure annotation using EVidenceModeler and the program to assemble spliced alignments. Genome Biology 9 (2008).
Huerta-Cepas, J. et al. eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids Res 47, D309–D314, https://doi.org/10.1093/nar/gky1085 (2019).
Huerta-Cepas, J. et al. Fast Genome-wide functional annotation through orthology assignment by eggNOG-Mapper. Molecular Biology and Evolution 34, 2115–2122, https://doi.org/10.1093/molbev/msx148 (2017).
Buchfink, B., Xie, C. & Huson, D. H. Fast and sensitive protein alignment using DIAMOND. Nat Methods 12, 59–60, https://doi.org/10.1038/nmeth.3176 (2015).
UniProt, C. UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Res 49, D480–D489, https://doi.org/10.1093/nar/gkaa1100 (2021).
Sayers, E. W. et al. Database resources of the national center for biotechnology information. Nucleic Acids Res 50, D20–D26, https://doi.org/10.1093/nar/gkab1112 (2022).
Jones, P. et al. InterProScan 5: genome-scale protein function classification. Bioinformatics 30, 1236–1240, https://doi.org/10.1093/bioinformatics/btu031 (2014).
Chan, P. P., Lin, B. Y., Mak, A. J. & Lowe, T. M. tRNAscan-SE 2.0: improved detection and functional classification of transfer RNA genes. Nucleic Acids Res 49, 9077–9096, https://doi.org/10.1093/nar/gkab688 (2021).
Kalvari, I. et al. Rfam 14: expanded coverage of metagenomic, viral and microRNA families. Nucleic Acids Res 49, D192–D200, https://doi.org/10.1093/nar/gkaa1047 (2021).
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100, https://doi.org/10.1093/bioinformatics/bty191 (2018).
Goel, M., Sun, H. Q., Jiao, W. B. & Schneeberger, K. SyRI: finding genomic rearrangements and local sequence differences from whole-genome assemblies. Genome Biology 20, https://doi.org/10.1186/s13059-019-1911-0 (2019).
Goel, M. & Schneeberger, K. plotsr: visualizing structural similarities and rearrangements between multiple genomes. Bioinformatics 38, 2922–2926, https://doi.org/10.1093/bioinformatics/btac196 (2022).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR23907027 (2023).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR23907028 (2023).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR23907029 (2023).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR23907030 (2023).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR23907031 (2023).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR23907036 (2023).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR23907047 (2023).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR23907058 (2023).
NCBI GenBank https://identifiers.org/ncbi/insdc.gca:GCA_037974685.1 (2024).
NCBI GenBank https://identifiers.org/ncbi/insdc.gca:GCA_037974665.1 (2024).
NGDC Genome Warehouse https://ngdc.cncb.ac.cn/gwh/Assembly/83111/show (2023).
Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv: Genomics (2013).
Manni, M., Berkeley, M. R., Seppey, M. & Zdobnov, E. M. BUSCO: Assessing genomic data quality and beyond. Curr Protoc 1, e323, https://doi.org/10.1002/cpz1.323 (2021).
Rhie, A., Walenz, B. P., Koren, S. & Phillippy, A. M. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biol 21, 245, https://doi.org/10.1186/s13059-020-02134-9 (2020).
Narasimhan, V. et al. BCFtools/RoH: a hidden Markov model approach for detecting autozygosity from next-generation sequencing data. Bioinformatics 32, 1749–1751, https://doi.org/10.1093/bioinformatics/btw044 (2016).

Acknowledgements

This study was supported by the Foundation for Invigorating Forestry through Science and Technology (YLK202216), the Central Plain Scholar’s Workstation of Henan Province (ZYGZZ2021048), and the Scientific and Technological Research Project of Henan Province (222102110480, 222102110444, 222102110448).

Author information

These authors contributed equally: Shiping Cheng, Qikun Zhang.

Authors and Affiliations

Henan Province Key Laboratory of Germplasm Innovation and Utilization of Eco-economic Woody Plant, Pingdingshan University, Pingdingshan, 467000, China
Shiping Cheng, Xining Geng, Lihua Xie, Minghui Chen, Siqian Jiao, Shuaizheng Qi & Pengqiang Yao
Kaitai-bio Company, Hangzhou, 310000, China
Qikun Zhang
Henan Forestry Vocational College, Luoyang, 471000, China
Mailin Lu & Mengren Zhang
Henan Senzhuang Cukang Agriculture and Forestry Technology Co., Ltd, Luoyang, 471000, China
Wenshan Zhai
Kaitai Mingjing Genetech Corporation, Beijing, 100070, China
Quanzheng Yun
College of Life and Environmental Science, Hangzhou Normal University, Hangzhou, 310036, China
Shangguo Feng
Zhejiang Provincial Key Laboratory for Genetic Improvement and Quality Control of Medicinal Plants, Hangzhou Normal University, Hangzhou, 310036, China
Shangguo Feng

Contributions

Cheng S.P. and Feng S.G. conceived and designed the study. Cheng S.P. collected the samples. Zhang Q.K., Geng X.N., Xie L.H., Chen M.H., Jiao S.Q., Qi S.Z., Yao P.Q., Lu M.L., Zhang M.R., Zhai W.S., and Yun Q.Z. performed the bioinformatics analyses. Feng S.G., Cheng S.P., and Zhang Q.K. wrote and revised the manuscript. Cheng S.P. and Zhang Q.K. contributed equally to this work. All authors have read and approved the final manuscript.

Corresponding authors

Correspondence to Shiping Cheng or Shangguo Feng.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Cheng, S., Zhang, Q., Geng, X. et al. Haplotype-resolved chromosome-level genome assembly of Ehretia macrophylla. Sci Data 11, 589 (2024). https://doi.org/10.1038/s41597-024-03431-9

Received: 21 November 2023
Accepted: 28 May 2024
Published: 05 June 2024
DOI: https://doi.org/10.1038/s41597-024-03431-9
