Background & Summary

Orobanche coerulescens, a member of the Orobanchaceae family, is a holoparasitic plant with an obligate life cycle. Approximately 4,750 species of angiosperms are parasitic, relying wholly or partially on their host plants for water and nutrients to sustain their lives1. Parasitic plants can be classified based on their reliance on hosts2,3,4. Obligate parasites necessitate a host to complete their life cycle, while facultative ones do not. Hemiparasites are capable of photosynthesis but obtain water, minerals, and nutrients from their hosts. On the other hand, holoparasites lack chlorophyll and derive their fixed carbon from their hosts, presenting non-green colours. The Orobanchaceae family encompasses diverse autotrophic genera as well as hemiparasitic and holoparasitic lineages, providing an ideal model for evolutionary studies given its comprehensive range of parasitic states5,6,7.

While O. coerulescens is widely distributed across Eurasia, it is classified as scarce and endangered in Central Europe8. Likewise, there is increasing concern about its risk of extinction in Korea9. Notably, the species is found on the volcanic islands of Ulleungdo and Dokdo in Korea, exhibiting a distinct geographical niche compared to its presence on the Korean peninsula10 (Fig. 1). Interestingly, the plants on these islands are glabrous and tend to parasitize on Artemisia japonica rather than Artemisia capillaris, distinguishing them from those on the mainland counterparts9. Despite this ecological specialization, O. coerulescens faces a precarious existence due to its limited habitat range and susceptibility to external disturbances, leading to potential declines9. The unique geographic and ecological circumstances of O. coerulescens found on Ulleungdo and Dokdo Islands underscore the importance of our comprehensive genomic analysis, which aims to understand the evolution of parasitic plants and conserve this unique biodiversity.

Fig. 1
figure 1

Habitats and morphology of Orobanche coerulescens on Korean volcanic islands. (a) Habitats of O. coerulescens on Korean volcanic islands. (b) Glabrous type of O. coerulescens collected from Ulleungdo Island.

To understand the phylogeny of Orobanchaceae and genome evolution mechanisms in parasitic plants, various studies have been conducted. These focus on aspects such as reductive evolution under relaxed selection and structural rearrangement of the genome. Specifically, several studies have targeted on constructing the chloroplast genome of Orobanchaceae11,12,13. However, given the limited scope of these studies to chloroplast genomes, there is a compelling need for complementary research focused on the nuclear genome.

With the ongoing advancements in sequencing technologies, we utilized Oxford Nanopore Technologies (ONT), Illumina, and Hi-C methods to assemble a chromosomal-level genome of O. coerulescens (Fig. 2). The final assembly consists of 19 chromosome-level sequences and spans a total length of 3,648 Mb, with an N50 value of 195 Mb. Repeat sequences were found to constitute 86.3% of the entire genome. A total of 29,395 protein-coding genes were annotated with an average length of 4,849 bp, composed of 159,445 exons. This high-quality genome will serve as a valuable genetic resource for understanding the parasitic plant evolutionary mechanisms within Orobanchaceae and for conserving the unique biodiversity of endemics on Korean volcanic islands.

Fig. 2
figure 2

Contact map and overview of the O. coerulescens genome. (a) Contact map of 19 chromosome-level genome. (b) Circos plot features from the outermost to the innermost circles indicate gene density, GC content, and density of repeat elements. The figure depicts the host plant, Artemisia japonica, and the holoparasitic plant, O. coerulescens.

Methods

Sampling and sequencing

O. coerulescens was collected from Ulleungdo Island, Korea (37.544826 N, 130.908320E) (Fig. 1). All experimental procedures were conducted in compliance with national and international guidelines. The plant materials used was sourced from adult plants and adhered to with Korean guidelines.

The stem of O. coerulescens was used to obtain high molecular weight (HMW) DNA following a nuclei isolation method14 and genomic DNA was extracted using a modified cetyltrimethylammonium bromide (CTAB) protocol15 in the presence of 2% polyvinylpyrrolidone (1% of molecular weight 10,000 and 1% of molecular weight 40,000) (Sigma-Aldrich, Burlington, MA, USA). For Nanopore sequencing, short genomic fragments were removed using a Short Read Eliminator Kit (Circulomics, Baltimore, MD, USA) and the library was prepared using the ONT 1D ligation Sequencing kit (SQK-LSK109, Oxford Nanopore Technologies, Oxford, UK) with the native barcoding expansion kit (EXPNBD104) in accordance with the manufacturer’s protocol. The library was loaded onto a MinION flow cell FLO-MIN106 (Oxford Nanopore Technologies) and PromethION flow cell FLO-PRO002 (Oxford Nanopore Technologies) and sequencing was run on a MinION MK1b and PromethION. A total of long-read DNA data, 354,136,057,138 bp in length, was generated with an N50 value of 33,824 bp and a sequencing depth of 97x. To generate short high-quality sequences, we also used an Illumina platform with TruSeq DNA PCR-Free DNA library (Illumina, San Diego, CA, USA) and then paired-end DNA library was sequenced in the Illumina NovaSeq. 6000 with the length of 151 bp and the depth of 82x (Table S1).

We additionally employed Hi-C data for chromosome-level genome assembly. The Hi-C library was constructed as follows. Ground flower, stem, and root were mixed with 1% formaldehyde for fixing chromatin, and then the nuclei were isolated following a nuclei isolation method14. Fixed chromatin was digested with HindIII-HF (New England BioLabs), and we filled the 5′ overhangs with nucleotides and biotin-14-dCTP (Invitrogen) and ligated free blunt ends. After ligation, we purified DNA and removed biotin from unligated DNA ends. Fragmentation and size selection were performed to shear the Hi-C DNA. Hi-C library preparation was performed using the ThruPLEX® DNA-seq Kit (Takara Bio USA, Inc, Mountain View, CA, USA) and sequenced in Illumina NovaSeq. 6000 (Illumina) with the length of 151 bp paired-end reads. The Hi-C data were sequenced in 212,809,764,588 bp with 1,409,336,338 reads (Table S1).

Total RNA was isolated using TRIzol Reagent (Invitrogen) from three tissues of the same O. coerulescens-flower, stem, and root-following the manufacturer’s protocol. Messenger RNA (mRNA) was isolated using the Magnosphere™ UltraPure mRNA purification kit (Takara) according to the manufacturer’s instructions. The cDNA library was prepared using the cDNA-PCR Sequencing Kit (SQK-PCS109, Oxford Nanopore Technologies) with the PCR Barcoding Kit (SQK-PBK004, Oxford Nanopore Technologies) in accordance with the manufacturer’s protocol. The library was loaded onto the MinION flow cell FLO-MIN106 (Oxford Nanopore Technologies) and then sequenced on MinION Mk1b. We also used an Illumina platform data using TruSeq Stranded mRNA Prep kit for constructing cDNA library and Illumina NovaSeq. 6000 (Illumina) with the length of 101 bp paired-end reads. The results of RNA sequencing data were described in Table S1.

Chromosome level genome assembly

De novo assembly was performed using a combination of NextDenovo v2.4.016 and NextPolish17 with default parameters, utilizing ONT long reads and Illumina short reads. To obtain a high-quality chromosome-level genome, we applied the Juicer 1.618 and 3D-DNA v180419 pipeline19 in diploid mode with default parameters. The assembly was further refined and manually curated using a contactmap generated with Juicebox v1.13.0120. The final assembly comprised 19 chromosomes, spanning a total length of 3,648,003,138 bp and featuring an N50 value of 195,125,857 bp (Fig. 2 and Table 1).

Table 1 Genome assembly statistics.

Repeat and protein gene annotation

Repeat annotation in the O. coerulescens genome was conducted through two different algorithms: de novo and homology-based methods. For de novo repeat identification, a custom repeat element library was constructed using RepeatModeler 2.0.321 with the long terminal repeat (LTR) structural discovery pipeline option, as well as RECON 1.0822, RepeatScout 1.0.623 and TRF 4.0924, utilizing the RMBLAST 2.10.0 search engine. Additionally, MITETracker25 was used to detect miniature inverted-repeat transposable element (MITE) repeat sequences. This custom library was then augmented with data from Repbase 2018102626 and Dfam 3.627 databases to identify repeat sequences homologous to previously known repeat elements. Based on this comprehensive library, RepeatMasker 4.1.428 was applied to annotate the repeat elements. In the constructed genome, a total of 3,147,830,327 bp, accounting for 86.29%, was composed of repeat sequences. Of these, retroelements such as short interspersed nuclear elements (SINEs), long interspersed nuclear elements (LINEs), and LTRs accounted for 0.02%, 1.35%, and 55.85%, respectively, while DNA transposons constituted 5.52% (Table 2).

Table 2 Statistics of repetitive elements.

To annotate protein-coding genes, we first assembled transcriptomes by hybrid approach of long reads and short reads from three tissues. Reads from Illumina and Nanopore were trimmed using Trimmomatic v0.3929 and Porechop v0.2430, respectively. The cleaned Illumina reads were then mapped to the O. coerulescens genome using STAR and assembled alongside the cleaned Nanopore reads through StringTie v2.2.131. Extracted transcriptomes were processed with TransDecoder v5.5.028 using default settings. Gene structure prediction was carried out using the MAKER v3.01.0432 pipeline, which integrates three different methods: de novo-based, transcript evidence-based, and homology-based. De novo-based method employed Augustus v3.4.033,34, BRAKER v2.1.635,36,37,38,39, GeneMark-ET v3.6240, and SNAP v2.51.741 for gene model annotation. Transcript evidence-based method utilizing Exonerate applied to expressed sequence tags (EST) data, while homology-based method sourced data for Sesamum indicum (GCF_000512975.1) and Striga asiatica (GCA_008636005.1) from National Center for Biotechnology Information (NCBI). Using default settings, MAKER merged the outputs from these three methods to generate weighted consensus gene structures, resulting in the acquisition of 29,395 gene composed of 159,445 exons. The annotated genes included 4,203 mono-exon genes and 25,192 multi-exon genes, with an average gene length of 4,849 bp (Table 3).

Table 3 Statistics of protein coding gene annotation.

Data Records

The genome assembly was deposited with Genbank accession number GCA_033398695.142.

The whole genome sequencing DNA data were deposited with accession number SRR26425285, SRR26465287, SRR26492730, SRR26538103, SRR26538104, SRR26542269, SRR26588572, SRR26588573, SRR26588574, SRR26588624, SRR26588625, SRR26622679, SRR26661892 and SRR26661893 under Sequence Read Archive SRP46718043.

The transcriptome data of 3 tissues were deposited with accession numbers SRR26426871, SRR26426870, SRR26426869, SRR26503964, SRR26503965 and SRR26503966 under Sequence Read Archive SRP46718043.

The Hi-C data were deposited with accession number SRR26462611, SRR26522788, SRR26522789 and SRR26622680 under Sequence Read Archive SRP46718043.

All results from genome annotation are available in figshare44.

Technical Validation

Evaluating genome assembly

In order to assess the quality of chromosome-level genome of O. coerulescens, Benchmarking Universal Single-Copy Orthologs (BUSCO) v5.2.145 analysis was performed using the embryophyta_odb10 database, which includes 1,614 single-copy genes. The BUSCO analysis revealed 1,267 complete single-copies and 56 duplicates, accounting for 82.0% completeness, along with 21 fragmented and 270 missing genes (Table 1). Additionally, we carried out BUSCO analysis on the protein-coding genes. Illumina short reads were subsequently realigned to the genome assembly using BWA v0.7.1746, yielding a mapping rate of 99.32%. We measured the quality vale (QV) and completeness using Merqury v1.347 with k-mer database, resulting in QV value of 29.03 and 92.0% respectively.

Gene prediction and annotation validation

To evaluate the gene annotation, protein sequences were searched against reference sequences from S. lycopersicum, A. thaliana, as well as the NCBI plant RefSeq and the Swiss-Prot databases. The BLASTx v2.8.148 was utilized with an e-value threshold below 1e-05 and a query coverage threshold above 40%. As a result, out of a total of 29,395 sequences, successfully matches were found for 24,706 sequences against S. lycopersicum, 23,590 sequences against A. thaliana, and 29,699 sequences against the combined NCBI and Swiss-Prot databases. Further annotation for motifs, domains and Gene Ontology (GO) was performed using InterProSCan v5.60–9249. This resulted in the identification of domains and motifs in 24,252 proteins according to the InterPro database, while 16,681 proteins were annotated with GO terms.