Article
Open access
Published: 09 July 2024

Partitioning and aggregating cross-tissue and tissue-specific genetic effects to identify gene-trait associations

Nature Communications volume 15, Article number: 5769 (2024) Cite this article

498 Accesses
Metrics details

Subjects

Abstract

TWAS have shown great promise in extending GWAS loci to a functional understanding of disease mechanisms. In an effort to fully unleash the TWAS and GWAS information, we propose MTWAS, a statistical framework that partitions and aggregates cross-tissue and tissue-specific genetic effects in identifying gene-trait associations. We introduce a non-parametric imputation strategy to augment the inaccessible tissues, accommodating complex interactions and non-linear expression data structures across various tissues. We further classify eQTLs into cross-tissue eQTLs and tissue-specific eQTLs via a stepwise procedure based on the extended Bayesian information criterion, which is consistent under high-dimensional settings. We show that MTWAS significantly improves the prediction accuracy across all 47 tissues of the GTEx dataset, compared with other single-tissue and multi-tissue methods, such as PrediXcan, TIGAR, and UTMOST. Applying MTWAS to the DICE and OneK1K datasets with bulk and single-cell RNA sequencing data on immune cell types showcases consistent improvements in prediction accuracy. MTWAS also identifies more predictable genes, and the improvement can be replicated with independent studies. We apply MTWAS to 84 UK Biobank GWAS studies, which provides insights into disease etiology.

Integrating single cell expression quantitative trait loci summary statistics to understand complex trait risk genes

Article Open access 20 May 2024

Quantifying genetic effects on disease mediated by assayed gene expression levels

Article 18 May 2020

Multi-context genetic modeling of transcriptional regulation resolves novel disease loci

Article Open access 28 September 2022

Introduction

Genome-wide association studies (GWAS) have identified tens of thousands of susceptibility loci for complex diseases, bringing insights into disease etiologies¹. However, it remains a challenge to understand and interpret the GWAS findings^2,3, especially those significant hits in noncoding regions. One hypothesis is that some noncoding variants influence traits through the regulation of gene expression⁴. However, to identify the path from genomic variants to gene expression is challenging. A simple strategy is to assign the associated variant a causal link to its nearest gene, but a shorter physical distance does not necessarily indicate a closer functional connection. A counterexample is that the obesity-associated single nucleotide polymorphisms (SNPs) within FTO form long-range functional connections with IRX3^2,5. In recent years, some large-scale consortiums have provided rich resources to identify expression quantitative trait loci (eQTLs) across various human tissues. For example, version 8 of the Genotype-Tissue Expression (GTEx v8) project collects a comprehensive set of 54 tissues from hundreds of donors⁶. Integrating eQTLs with GWAS studies reveals that a large proportion of phenotype variability in disease risk can be explained by variants that regulate the expression levels of genes⁷.

Recently, transcriptome-wide association studies (TWAS) have provided a successful path to bridge SNPs, gene expressions, and complex phenotypes. TWAS includes two stages: the first stage builds a linear model to predict genetically regulated gene expression from the genotype information, and the second stage associates complex diseases with the predicted gene expression. For simplicity, we refer to the two stages as the “prediction stage” and “association stage” throughout the paper. Widely used TWAS methods include PrediXcan², which trains an elastic net model in the prediction stage, and FUSION⁸, which includes more models such as top eQTL, LASSO, and Bayesian sparse linear models. A recently proposed method, TIGAR, utilizes a data-driven nonparametric prior for the eQTL effect sizes in the prediction stage, and estimates with a latent Dirichlet process regression model^9,10. The identified gene-trait associations shed light on the genetic bases of complex diseases. However, the limited sample sizes of eQTL studies have become a bottleneck in TWAS, resulting in low power in both prediction and association stages. In the GTEx v8 dataset, for example, the sample sizes for 21 out of the 54 tissues with genotypic information are smaller than 200.

Given the tissue-dependent nature of transcription regulation and the presence of shared eQTLs across various tissues, a joint modeling approach incorporating multiple tissues can potentially enhance the performance of TWAS. There have been some methods for improving statistical power in the association stage, including MultiXcan, which tests the joint effects of gene expression variation from different tissues¹¹; and sparse-canonical-correlation-analysis-TWAS, which combines the predicted expression of multiple tissues in TWAS¹². As for the prediction stage, the recently developed UTMOST method formulates cross-tissue expression prediction as a penalized multivariate regression problem, and introduces a group-lasso penalty on the cross-tissue effects and encourages the presence of eQTLs shared across tissues¹³. Despite the improved prediction accuracy compared with single-tissue prediction, we note that the penalty emphasizes that the effects of eQTLs are shared across all tissues, but does not distinguish between biologically related or irrelevant tissues. In other words, the method tends to only identify eQTLs shared by all tissues. However, the regulatory effect of a SNP may present in only subsets of tissues (e.g., brain-related tissues), sometimes even just one tissue (e.g., testis)¹⁴. Furthermore, tissue-specific eQTLs can provide a more focused mechanistic interpretation for GWAS associations than eQTLs shared across all tissues^15,16.

In this manuscript, we develop MTWAS, a flexible method to aggregate multiple tissue information for TWAS analysis. The method partitions and aggregates both cross-tissue and tissue-specific genetic effects. Considering the inherent characteristics of the genetic data, where the transcriptome of one tissue can exhibit strong correlations with others, we utilize a nonparametric missing value imputation method for inaccessible tissues. Thus, MTWAS allows for complex interactions and nonlinear data structures across tissues. In addition, we employ a stepwise procedure to select eQTLs and train prediction models by minimizing the extended Bayesian information criteria (EBIC)^17,18.

We show that MTWAS significantly improves the prediction accuracy of genetically regulated gene expression over existing methods, across all available tissues and cell types in the GTEx v8, DICE, and OneK1K datasets, and the prediction weights can then be used for identifying gene-trait associations with either individual-level genotype data or GWAS summary statistics.

Results

Model overview

The prediction stage of MTWAS has three steps (Fig. 1). The first step imputes values for missing entries in the sample-by-tissue expression matrix of each gene via a nonparametric imputation procedure¹⁹. Specifically, for each column of the matrix, we impute its missing entries by leveraging the information from other columns corresponding to relevant tissues. The imputed expression matrices are subsequently used for the identification of eQTLs in the following steps. We note that this imputation step does not involve the genotype data.

**Fig. 1: Schematic diagram of the prediction stage of MTWAS.**

The second step detects cross-tissue eQTLs (ct-eQTLs), i.e., genetic variations that are associated with gene expression in multiple tissues. For each gene, we extract the first few principal components (PCs) from its imputed sample-by-tissue expression matrix. Note that the first PC typically characterizes overall cross-tissue expression patterns (Supplementary Figs. 1–3), whereas others likely reflect differences across subsets of tissues (e.g., brain-related tissues versus others). The number of PCs selected is determined by magnitudes of eigenvalues of the correlation matrix, for which 2.0 is set as a default cutoff (on average 5 PCs have been chosen in the GTEx studies). Then, we regress each selected PC (can be thought of the expression vector for a “super-tissue”) against the individuals’ genotypes of the cis-SNPs and select ct-eQTLs based on EBIC (Methods). We show in Supplementary Fig. 4 that the performance of our method is robust to the number of PCs selected.

The third step identifies tissue-specific eQTLs (ts-eQTLs, i.e., genetic variations associated with gene expression within a particular tissue after accounting for the effects of ct-eQTLs) by the stepwise sparse linear regression method SODA¹⁸. Specifically, with the tissue-specific gene expression as the response and genotypes of the cis-SNP as the predictors in a linear model, SODA selects ts-eQTLs by minimizing EBIC (after accounting for ct-eQTLs). Effects of ct-eQTLs and ts-eQTLs are estimated using a weighted least squares method (“Methods”), and further utilized to infer gene-trait associations.

In the association stage, for a trait of interest, we perform an association test in a tissue-specific manner to retain the tissue specificity of gene-trait associations. We derive an explicit form of the MTWAS statistics with estimated effects of ct-eQTLs and ts-eQTLs, the GWAS summary statistics, and the reference LD matrix (“Methods”).

Improvements in prediction accuracy on GTEx tissues

We evaluate MTWAS in comparison with a few state-of-the-art methods including PrediXcan, TIGAR, and UTMOST with the GTEx v8 dataset. After quality control, 47 tissues were retained with sample sizes larger than 100 (“Methods”). The goal is to predict each individual’s gene expression in these tissues from his/her genotype information. The prediction accuracy is assessed by fivefold cross-validation (CV). Compared with PrediXcan, MTWAS achieved an average improvement in prediction R² of 0.02 (SD = 0.007) across 47 tissues, which amounts to about 47.4% (SD = 17.3%) improvements (Fig. 2). In addition, MTWAS showed an average increase in prediction R² of 40.1% (SD = 25.1%) and 9.2% (SD = 2.1%) over TIGAR and UTMOST, respectively.

**Fig. 2: The prediction R² evaluated in the GTEx datasets.**

Figure 2a shows the improvements of the prediction R² over that of PrediXcan for all methods (0 for no improvement; negative values mean that the average prediction R² is smaller than that of PrediXcan). We note that the improvement of our method is more significant in tissues with smaller sample sizes. For example, MTWAS improves the prediction R² by an average of 60.9% over that of PrediXcan for tissues with sample sizes smaller than 200; and by 42.2% for tissues with sample sizes between 200 and 400; and by 26.9% for tissues with sample sizes larger than 400. UTMOST also outperformed PrediXcan significantly, but was inferior to MTWAS uniformly. TIGAR performed comparably to UTMOST in tissues with sample sizes larger than 400, but performed worse in tissues with smaller sample sizes. This can be explained by our intuition that cross-tissue information helps improve the prediction accuracy, especially in tissues with small sample sizes. We also provide the average signed prediction R², calculated as the product of Pearson’s correlation coefficient sign and the R² value (Table 1). Given that the signed prediction R² can be negative, its average tends to be lower and more conservative compared with the prediction R² for all methods. Nevertheless, MTWAS still showed the best performance in terms of the signed prediction R².

Table 1 The prediction accuracy of MTWAS, PrediXcan, UTMOST, and TIGAR

Full size table

MTWAS’s improvement in prediction accuracy can be attributed to three aspects: (i) the imputation for unobserved entries in expression matrices utilizes cross-tissue information and increases effective sample sizes of training datasets; (ii) while cross-tissue effects aggregate information from various tissues, tissue-specific effects retain their inherent tissue-specific characteristics, and both contribute to the prediction; (iii) although computationally more demanding, EBIC for linear and generalized linear regression models under high-dimensional settings appears to be superior to other variable selection criteria such as those used in lasso and elastic net¹⁸.

We performed more experiments to further demonstrate each of the aforementioned points. For point (i), we applied PrediXcan, TIGAR, and UTMOST to the imputed expression matrix (denoted as PrediXcan-imp, TIGAR-imp, and UTMOST-imp, respectively), and compared the results with those obtained by the respective original methods. Across the 47 tissues, all methods performed better with our imputed data than with the original unimputed data (Fig. 2b, c and Supplementary Fig. 5). For example, PrediXcan-imp improved the prediction R² over PrediXcan by a minimum of 5.32% in muscle skeletal (sample size = 601) to a maximum of 32.7% in the uterus (sample size = 107).

For point (ii), instead of partitioning eQTLs into ct-eQTLs and ts-eQTLs, we directly performed regression with all cis-SNPs as predictors on each tissue, and selected variables using EBIC. We denoted this method as MTWAS-tissue. As shown in Fig. 2b, c and Supplementary Fig. 5, MTWAS achieved higher prediction R² compared with MTWAS-tissue for all 47 tissues, suggesting that our partition and aggregation of ct-eQTLs and ts-eQTLs indeed helped. For point (iii), we compared the performances of MTWAS-tissue, PrediXcan-imp, TIGAR-imp, and UTMOST-imp. We found that MTWAS-tissue outperformed the other three methods across all tissues. As all the tested methods use imputed gene expression as input, the advantage of MTWAS-tissue is solely attributed to the use of EBIC.

We consider two criteria for calling a gene predictable for a method: a common criterion (with R² > 0.01) and a more stringent one (with false discovery rate (FDR) <0.05). MTWAS identified more predictable genes than the other three methods under both criteria (Table 1, Fig. 3, and Supplementary Fig. 6). The numbers of predictable genes identified by MTWAS ranged from 10,444 (for whole blood) to 17,052 (for uterus) under the common criterion, and from 2451 (for vagina) to 8229 (for nerve tibial) under the stringent criterion. Compared to PrediXcan and UTMOST, MTWAS identified an average of 41.6% (SD = 10.0%) and 2.6% (SD = 1.3%) more predictable genes under the common criterion, and 92.0% (SD = 48.6%) and 19.2% (SD = 4.0%) more under the stringent criterion, respectively.

**Fig. 3: Predictable genes in the GTEx datasets.**

In addition to the GTEx analysis, we also performed an independent replication study by applying the weights trained with the GTEx samples for Epstein–Barr virus (EBV) transformed lymphocytes to the expression levels of the 373 European individuals from the GEUVADIS lymphoblastoid cell lines (LCLs)²⁰. MTWAS also achieved a better prediction R² (Table 2) compared to PrediXcan, TIGAR, and UTMOST. The prediction R² was improved by 47.8%, 126.7%, and 6.3%, respectively. MTWAS achieved consistently higher prediction R² than the other three methods in different quantiles (P < 2.2 × 10⁻¹⁶; one-sided Kolmogorov–Smirnov test). MTWAS also identified more predictable genes with both the common and stringent criteria (Table 2).

Table 2 Replication study on the GEUVADIS cohort for lymphoblastoid cell lines

Full size table

Improvements in prediction accuracy on immune cell types

We applied our method in two eQTL datasets of multiple cell types. The DICE dataset includes the bulk RNA-seq data of 91 samples on 13 types of immune cells and 2 activation conditions²¹. We omitted the results of TIGAR because the inadequate sample size may lead to an over-fitted model for the method¹⁰. MTWAS showed consistent improvement of prediction accuracy compared with PrediXcan and UTMOST, with average improvements of 77.69% (SD = 2.09%) and 5.87% (SD = 1.17%) in prediction R² (Supplementary Fig. 7a). MTWAS also identified about 5 times as many predictable genes as PrediXcan, and about 14 times as UTMOST in total (Supplementary Fig. 7b).

Our method is also applicable to single-cell RNA-seq (scRNA-seq) data. We generated pseudo-bulk data matrices comprising 14 immune cell types from the OneK1K dataset, which consist of 1.27 million peripheral blood mononuclear cells collected from 982 donors²². Consistent with the results obtained from the bulk data, we found that MTWAS achieved the highest prediction R², along with identifying the largest number of predictable genes across all cell types (Supplementary Fig. 8). Specifically, MTWAS identified from 84 (CD4 SOX4 cells, N_cell = 4065) to 3665 (NK cells, N_cell = 463,528) predictable genes under the stringent criterion, with an improvement of 104.8% (SD = 61.6%) and 73.3% (SD = 12.1%) compared with PrediXcan and UTMOST, respectively. We note that the improvements of multi-cell-type TWAS methods (both MTWAS and UTMOST) over the single-cell-type method (PrediXcan) are more significant in cell types with fewer cell counts and thus lower statistical power in identifying predictable genes. This highlights the value of leveraging cross-cell-type information to improve predictions in cell types with limited data.

Investigation of cross-tissue and tissue-specific effects

MTWAS partitions eQTLs into ct-eQTLs and ts-eQTLs, and aggregates their effects in prediction. We observed that both ct-eQTLs and ts-eQTLs are significantly enriched in promoter regions (P < 2.2 × 10⁻²⁶, one-sided Fisher’s exact test), and more enriched than those selected by PrediXcan and UTMOST (Fig. 4a). In addition, the combined annotation dependent depletion (CADD) scores of ct-eQTLs are also significantly higher than that of ts-eQTLs (P = 9.4 × 10⁻¹¹, one-sided t test, Fig. 4b). Higher CADD scores indicate more severe deleteriousness of the eQTLs shared across multiple tissues. CADD scores of both ct-eQTLs and ts-eQTLs identified by MTWAS are higher than those eQTLs selected by PrediXcan and UTMOST.

**Fig. 4: Characterization of the eQTLs identified by MTWAS, PrediXcan, and UTMOST.**

We further consider two subsets of the predictable genes of MTWAS according to whether the genes are regulated by ct-eQTLs only or by ts-eQTLs only, referring to as cross-tissue genes (ct-genes) and tissue-specific genes (ts-genes), respectively. Different characteristics can be observed for the two subsets. KEGG enrichment analysis demonstrates that ct-genes are enriched for lysosome (P = 5.6 × 10⁻⁴), peroxisome (P = 5.6 × 10⁻⁴), phagosome (P = 8.8 × 10⁻⁴), and metabolism pathways that are abundant in all tissues and cell types (Supplementary Figs. 9 and 10). In contrast, ts-genes are enriched in pathways that are critical in differentiating between cell types, as was also observed in a recent study²³.

For example, ts-genes in the brain cerebellar are enriched in the neuroactive ligand-receptor interaction pathway (P = 5.4 × 10⁻⁵) and the taste transduction pathway (P = 1.7 × 10⁻⁴). The esophagus mucosa ts-genes are enriched in the gastric acid secretion pathway (P = 2.1 × 10⁻⁴), and the cAMP signaling pathway (P = 3.8 × 10⁻⁵). It has been shown that esophageal mucosa expresses predominantly EP2 receptors and esophageal ulceration increases the expression of the EP2 receptor, activation of CREB, which is the downstream target of the cAMP signaling. The ts-genes in the pancreas are also enriched in multiple pathways, such as circadian entrainment (P = 4.8 × 10⁻⁵), protein digestion and absorption (P = 6.8 × 10⁻⁵), dopaminergic synapse (P = 2.7 × 10⁻⁴), insulin secretion (P = 2.9 × 10⁻⁴), morphine addiction (P = 3.7 × 10⁻⁴), and pancreatic secretion (P = 6.3 × 10⁻⁴), etc.

Furthermore, we found that ts-genes are more intolerant to protein-loss-of-function compared with ct-genes, as evaluated by LOEUF²⁴ and pLI²⁵. Specifically, ts-genes had significantly lower LOEUF (P = 3.2 × 10⁻¹¹, two-sided t test) and higher pLI (P = 4.1 × 10⁻¹⁶, two-sided t test) than ct-genes, indicating that ts-genes are more variation intolerant compared with ct-genes. This is consistent with a previous finding that ts-genes endure stronger selective pressures compared with ct-genes²³. This has important implications for understanding genetic bases of disease, as mutations in tissue-specific genes may be more likely to lead to specific diseases affecting particular tissues or organs.

Applications to GWAS studies on UKBB phenotypes

We applied MTWAS, UTMOST, TIGAR, and PrediXcan to 84 UKBB self-reported cancer and non-cancer illness phenotypes with effective sample sizes larger than 5000 (a detailed summary can be found in Supplementary Data 1). We use the Bonferroni correction to account for multiple testing, and report genes with R² > 0.01 in the prediction stage and adjusted P value <0.05 in the association stage (Supplementary Data 2). MTWAS, UTMOST, and PrediXcan identified significant genes in 44, 34, and 39 of the 84 phenotypes, respectively. Specifically, for 8 phenotypes including heart valve problems, gastric/stomach ulcers, irritable bowel syndrome, peritonitis, muscle/soft tissue problems, iron deficiency anemia, other renal/kidney problems, and chronic fatigue syndrome, MTWAS identified genes whose expressions are significantly associated, whereas UTMOST and PrediXcan did not find any associated genes. In addition, MTWAS identified a greater number of genes than both UTMOST and PrediXcan in 20 phenotypes, while UTMOST and PrediXcan identified more genes than MTWAS in 2 and 19 phenotypes, respectively. UTMOST identified the highest number of gene-tissue association pairs in 24 phenotypes. MTWAS followed closely, identifying the majority of gene-tissue pairs in 16 phenotypes. In contrast, PrediXcan only managed to identify the most gene-tissue pairs in 3 phenotypes. We also found that MTWAS identified more gene-tissue pairs than MTWAS-tissue, showing the benefits of incorporating ct-eQTLs in the prediction model.

A Venn diagram is provided to illustrate genes that are significantly associated with heart attack/myocardial infarction (MI) in the UKBB identified by MTWAS, UTMOST, PrediXcan, and TIGAR (Supplementary Fig. 11). MTWAS identified 30 independent genes and 114 gene-tissue pairs associated with MI, among which 7 genes were not identified by UTMOST, PrediXcan, or TIGAR (Table 3). We performed KEGG pathway analysis on the ct-genes and ts-genes that are significantly associated with MI. The ct-genes are enriched in metabolism and lysosome pathways; while the ts-genes are enriched in more tissue-specific pathways (Supplementary Fig. 12).

Table 3 Independent TWAS risk genes of heart attack/myocardial infarction (MI) in the UKBB, identified by MTWAS

Full size table

Below we discuss biological insights for genes that are uniquely identified by MTWAS. The BRCA-1 associated protein gene, BRAP (P = 1.87 × 10⁻¹⁰ in artery aorta and 5.39 × 10⁻⁸ in artery coronary), is crucial for both cardiac development and myocardial function. The absence of BRAP, as observed in knockout models, resulted in cell cycle arrest, diminished proliferation of cardiomyocytes, and early onset of heart failure²⁶. Conversely, the introduction of transgenic BRAP overexpression led to an augmentation in myocardial mass and increased cell cycle activity specifically in neonatal cardiomyocytes. Given the critical role of the aorta in systemic circulation and the coronary arteries in supplying oxygen-rich blood to the myocardium, the dysregulation of BRAP expression in arterial tissues could disrupt the delicate equilibrium of cardiomyocyte growth and function, potentially leading to the development and progression of heart failure. Another example is that prohibitin (PHB) (P = 4.66 × 10⁻⁸ in adipose subcutaneous) has emerged as a potential therapeutic target for diabetic cardiomyopathy (DCM). In a type 2 diabetic rat model, PHB overexpression alleviated insulin resistance, left ventricular dysfunction, and fibrosis, suggesting its promise in treating human DCM²⁷. In addition, FOSB (significant in multiple tissues) has been identified as a potential diagnostic biomarker and therapeutic target for heart failure in a recent study²⁸. We also notice that ERP29 (P = 7.12 × 10⁻⁹ in artery tibial), involved in Connexin43 (Cx43) hemichannel assembly, plays a crucial role in Cx43 stability. Dysregulation of Cx43 is linked to myocardial diseases, and ERP29’s function suggests a potential connection to these conditions, providing insights into connexin-related myocardial disorders²⁹.

In addition, MTWAS identified that the expression of ANKRD55 in colon transverse is significantly associated with rheumatoid arthritis (P = 4.69 × 10⁻⁸). ANKRD55 encodes a protein called ankyrin repeat domain 55, which is well-established to be a potential risk factor for autoimmune diseases³⁰. Another example is the association between NOX4 and malignant melanoma in the thyroid identified by MTWAS (P = 8.37 × 10⁻¹²). As a regulator of glycolysis within thyroid cells, NOX4 facilitates the proliferation of cancerous thyroid cells through the generation of mitochondrial reactive oxygen species³¹. In papillary thyroid carcinomas, the down-regulation of the sodium/iodide symporter induced by the BRAFV600E mutation is mediated by NOX4³². Our results indicate that NOX4 could be a shared molecular link between malignant melanoma and thyroid dysfunction. In fact, emerging evidence suggests that malignant melanoma and papillary thyroid carcinoma may occur concurrently^33,34,35. In addition, the hypothyroidism condition is likely to promote melanoma spread, which suggests the protective effect of thyroid hormones against disease progression³⁴. Remarkably, NOX4 inhibitors have shown promises as adjuncts to current therapies, particularly for melanoma patients with BRAF mutations³⁶. These findings highlight the comparative advantages of MTWAS in gene and gene-tissue pair identification across multiple phenotypes, emphasizing its potential as a valuable tool in genetic research and analysis.

Computational efficiency

We compared the computational efficiency of MTWAS with the other multi-tissue TWAS method UTMOST. For MTWAS, we reported the CPU time for (i) imputing the missing entries of the expression matrices; and (ii) identifying eQTLs and estimating the eQTL weights, while UTMOST only involves the second process (Supplementary Table 1). Taking chromosome 1 (1911 genes) as an example, it takes MTWAS about 18 min to impute missing entries. The eQTL identification and weights estimation process take about 45 min for MTWAS, which is about half the time required by UTMOST (87 min). The computation was performed with an Intel Xeon processor with 2.90 GHz and 128 cores.

Discussion

Although TWAS have supplemented GWAS loci with valuable insights into disease mechanisms, the accuracy of gene expression prediction remains moderate. TWAS mainly include two key steps: the prediction of genetically regulated gene expression and the association of genetically regulated gene expression with disease traits. Previous multi-tissue TWAS methods, such as UTMOST, aggregate information across multiple tissues in the prediction step by encouraging the presence of shared eQTLs across all tissues, thus improving prediction accuracy. However, when eQTLs are only shared in a subset of tissues for a gene, to encourage common effects for all tissues may not be an optimal modeling approach.

To address these limitations, we propose a statistical framework, MTWAS, which partitions and aggregates cross-tissue and tissue-specific genetic effects in predicting the genetically regulated gene expression. MTWAS first imputes the gene expression data of multiple tissues, and then employs the EBIC criterion to select ct-eQTLs and ts-eQTLs based on the imputed dataset. We have demonstrated that MTWAS outperforms existing methods in achieving higher accuracy in predicting gene expression across all tissues, and has greater power in identifying gene-trait associations. The classification of ct-eQTLs and ts-eQTLs also brings insights into the genetic regulation of gene expression.

Applications to multi-cell-type bulk and single-cell RNA-seq datasets showcase that MTWAS works well with single-cell transcriptomes. Currently, we directly applied MTWAS to pseudo-bulk data of single-cell studies. For future work, it would be interesting to develop prediction models tailored for scRNA-seq data studies.

To identify ct-eQTLs, we performed PCA analysis to the sample-by-tissue matrix for each gene to extract the cross-tissue information. We clarify that the PCA used here is different from that used for identifying batch effects. To correct for the batch effect in the gene expression dataset, we regressed out the probabilistic estimation of expression residuals (PEER) factors in data preprocessing.

It is important to acknowledge that, like other TWAS methods, the gene-trait associations identified by MTWAS do not necessarily indicate causal relationships. Instead, they serve as valuable pointers toward potential functional links between genes and diseases, warranting further investigations through experimental validations and functional studies. We note that leveraging some fine-mapping methods, such as SuSiE³⁷ and DAP-K^38,39 for identifying causal QTLs, can improve the prediction accuracy for certain genes (Supplementary Table 2). Therefore, it is conceptually advantageous to integrate fine-mapping methods into our cross-tissue TWAS framework to prioritize causal genes, which can be a promising future direction of our research.

In conclusion, the MTWAS framework improves the prediction accuracy of gene expression and the statistical power for identifying gene-trait associations over existing TWAS methods. We believe that MTWAS is a valuable tool in deciphering the genetic bases of complex traits and facilitating personalized treatment strategies.

Methods

Imputation of gene expression data

The first step of our method is to impute expression data. Genotype information is not used at this step. For each gene, we consider its expression matrix containing N samples and K tissues, with missing entries in the matrix corresponding to unobserved expression levels of the gene. We used a nonparametric method missForest¹⁹ to impute the missing entries. Specifically, we first sorted the tissues in ascending order based on the number of missing samples and performed an initial imputation using the mean value of the observed samples. Then, for each tissue, we trained a random forest model to impute the missing values. For example, for the kth tissue (column), we trained a model using samples that have the non-missing entries in the kth column, with the kth column as the response and other columns as predictors (with missing entries in other columns imputed). The missing entries of the kth column can be predicted using the trained model. We continue this iterative training process until convergence. For each gene, the final imputed matrix $\widetilde{{{{{{{{\bf{E}}}}}}}}}$, which is of dimensions N × K, is used for subsequent analyses.

Identification of ct-eQTLs and ts-eQTLs

To identify ct-eQTLs, we perform Principal Component Analysis (PCA) on $\widetilde{{{{{{{{\bf{E}}}}}}}}}$ of each gene. Then, we treat each of the top principal components (PCs) of $\widetilde{{{{{{{{\bf{E}}}}}}}}}$ as the response variable and regress it against genotypes of cis-SNPs of the corresponding gene. We identify the set of ct-eQTLs minimizing:

$$EBI{C}_{\uplambda }({{{{{{{\mathcal{S}}}}}}}})=-2{\ell }_{N}({\hat{\beta }}_{{{{{{{{\mathcal{S}}}}}}}}})+| {{{{{{{\mathcal{S}}}}}}}}| \log N+2\uplambda | {{{{{{{\mathcal{S}}}}}}}}| \log M,\,\,{{{{{{{\mathcal{S}}}}}}}}\supseteq {{{{{{{{\mathcal{S}}}}}}}}}_{0},$$

(1)

where ${{{{{{{\mathcal{S}}}}}}}}$ is the selected set; $| {{{{{{{\mathcal{S}}}}}}}}|$ is the size of set ${{{{{{{\mathcal{S}}}}}}}}$; ${\hat{\beta }}_{{{{{{{{\mathcal{S}}}}}}}}}$ is the estimated effect of the selected variables; ${\ell }_{N}({\hat{\beta }}_{{{{{{{{\mathcal{S}}}}}}}}})$ is the log-likelihood; M is the number of cis-SNPs; λ is the tuning parameter; and ${{{{{{{{\mathcal{S}}}}}}}}}_{0}$ is a pre-fixed set of cis-SNPs. When using the first PC as the response variable, we have ${{{{{{{{\mathcal{S}}}}}}}}}_{0}=\varnothing$. As we proceed to the next PC, ${{{{{{{{\mathcal{S}}}}}}}}}_{0}$ encompasses the ct-eQTLs identified previously. This inclusion acts as a form of penalization, to avoid selecting an excessive number of ct-eQTLs. The number of PCs used to identify ct-eQTLs is determined by the magnitudes of eigenvalues of $\widetilde{{{{{{{{\bf{E}}}}}}}}}$. Intuitively, an eigenvalue of 2 means that the corresponding PC explains about two variables’ worth of the variability. We use 2.0 as a default cutoff, which leads to about 5 PCs on average in the GTEx studies. Supplementary Fig. 4a shows the number of the ct-eQTLs identified in GTEx cohorts as the number of PCs increases. We also evaluate the prediction performance across varying numbers of PCs (Supplementary Fig. 4b, c). In general, prediction accuracy improves as the number of PCs increases. However, beyond 4 PCs, the improvement levels off, suggesting that an average of 5 PCs strikes a good balance between prediction accuracy and computational efficiency.

Next, we identify ts-eQTLs. For each tissue, we regress its each gene’s expression levels against genotypes of the gene’s cis-SNPs, and select variables based on the EBIC criterion (Eq. (1)). The set ${{{{{{{{\mathcal{S}}}}}}}}}_{0}$ comprises all identified ct-eQTLs. We utilize a stepwise method for efficient computation, which follows the SODA procedure proposed in ref. ¹⁸ without considering interaction terms. We note that both imputed and observed samples are used in selecting ts-eQTLs. Despite the imputation is mostly capturing the tissue-shared part, it also involves interaction and nonlinear information that may carry tissue-specific component. We performed a replication study showing that including imputed samples in ts-eQTL identification has moderate benefits in prediction accuracy (Supplementary Table 3). The algorithm employs a forward step selecting predictors with significant overall effects, culminating in subsequent backward elimination steps to precisely refine the model. We adapt the algorithm by allowing a prior set of pre-selected terms, which ensures that terms selected in earlier steps are retained in the model. Effects of all selected terms are updated each time we introduce or eliminate a variable. The procedure is summarized in Algorithm 1.

Algorithm 1 The algorithm for identifying ct-eQTLs and ts-eQTLs with EBIC criterion.

Require: A fixed term set ${{{{{{{{\mathcal{S}}}}}}}}}_{0}$, which can be $\varnothing$.

1: Forward procedure for selecting eQTLs. Let ${{{{{{{{\mathcal{M}}}}}}}}}_{t}$ denote the selected set of eQTLs at step t. Start with ${{{{{{{{\mathcal{M}}}}}}}}}_{1}={{{{{{{{\mathcal{S}}}}}}}}}_{0}$.

2: while not terminated do

3: for each $j\notin {{{{{{{{\mathcal{M}}}}}}}}}_{t}$ do

4: create a candidate set ${{{{{{{{\mathcal{M}}}}}}}}}_{t,j}={{{{{{{{\mathcal{M}}}}}}}}}_{t}\cup \{j\}$, and evaluate its EBIC

5: end for

6: select predictor j^⋆ with the lowest EBIC: ${j}^{\star }=\arg {\min }_{j}{{{{{{{\rm{EBIC}}}}}}}}({{{{{{{{\mathcal{M}}}}}}}}}_{t,j})$

7: if ${{{{{{{\rm{EBIC}}}}}}}}({{{{{{{{\mathcal{M}}}}}}}}}_{t,{j}^{\star }}) < {{{{{{{\rm{EBIC}}}}}}}}({{{{{{{{\mathcal{M}}}}}}}}}_{t})$ then

8: continue with ${{{{{{{{\mathcal{M}}}}}}}}}_{t+1}={{{{{{{{\mathcal{M}}}}}}}}}_{t,{j}^{\star }}$.

9: else

10: terminate and obtain set $\tilde{{{{{{{{\mathcal{M}}}}}}}}}={{{{{{{{\mathcal{M}}}}}}}}}_{t}$.

11: end if

12: end while

13: Backward procedure for eliminating unimportant terms. Let ${{{{{{{{\mathcal{S}}}}}}}}}_{t}$ denote the selected set at the step t of the backward stage. Start with ${{{{{{{{\mathcal{S}}}}}}}}}_{1}=\tilde{{{{{{{{\mathcal{M}}}}}}}}}$.

14: while not terminated do

15: for each $j\in {{{{{{{{\mathcal{S}}}}}}}}}_{t}\backslash {{{{{{{{\mathcal{S}}}}}}}}}_{0}$ do

16: create a candidate set ${{{{{{{{\mathcal{S}}}}}}}}}_{t,j}={{{{{{{{\mathcal{S}}}}}}}}}_{t}\backslash \{j\}$, and evaluate its EBIC

17: end for

18: find term j with the lowest EBIC: ${j}^{\star }=\arg {\min }_{j}{{{{{{{\rm{EBIC}}}}}}}}({{{{{{{{\mathcal{S}}}}}}}}}_{t,j})$

19: if ${{{{{{{\rm{EBIC}}}}}}}}({{{{{{{{\mathcal{S}}}}}}}}}_{t,{j}^{\star }}) < {{{{{{{\rm{EBIC}}}}}}}}({{{{{{{{\mathcal{S}}}}}}}}}_{t})$ then

20: remove term j^⋆

21: else

22: terminate and retain set $\tilde{S}={S}_{t}$.

23: end if

24: end while

25: Enumerate all possible combinations of non-fixed terms and find the subset that reaches the smallest EBIC

$${\hat{{{{{{\mathcal{A}}}}}}}}={\arg\min}_{{{{{{\mathcal{A}}}}}}\subset {{\tilde{{{{{{\mathcal{S}}}}}}} \backslash {{{{{\mathcal{S}}}}}}_0}}} {{{{{\rm{EBIC}}}}}}({{{{{\mathcal{A}}}}}}\cup {{{{{\mathcal{S}}}}}}_0)$$

With the selected ct-eQTLs and ts-eQTLs, we perform weighted least squares to estimate their effect sizes. Specifically, for a tissue with an observed sample size N_obs (the number of non-missing entries), and an imputed sample size N_imp (the number of missing entries), we assign a weight of $\min (1,{N}_{obs}/{N}_{imp})$ to the imputed samples, and a weight of 1 to the observed samples. This weighting scheme prioritizes the impact of the observed data, especially when the imputed data volume surpasses the observed data, so as to retain tissue-specific signals in estimating eQTL effects. We show that it improves the prediction accuracy compared with estimating effect sizes using only observed samples, in both GTEx studies (P < 0.05 for 46 out of 47 tissues, one-sided paired Wilcoxon test for prediction R²) and the replication study (see Supplementary Table 3).

Gene-trait association studies

For each trait, we compute gene-level summary statistics based on the training weights for the eQTLs derived by MTWAS. We denote the sample size as N. For each gene on each tissue, we assume a linear model between a complex phenotype (Y) and the gene expression (E):

$${{{{{{{\bf{Y}}}}}}}}={{{{{{{\bf{E}}}}}}}}\gamma+\eta,$$

(2)

where Y is an N × 1 phenotype vector, E is an N × 1 expression vector, and γ is the gene-level effect size. We assume both Y and E have been standardized, and the error term η follows a normal distribution with mean 0. The MTWAS Z-score vector is:

$$Z=\frac{\hat{\gamma }}{{{{{{{{\rm{se}}}}}}}}(\hat{\gamma })}.$$

(3)

If individual-level genotype data are accessible, we could directly estimate gene expression with $\widehat{{{{{{{{\bf{E}}}}}}}}}={\sum}_{j\in {{{{{{{\mathcal{A}}}}}}}}}{\hat{\beta }}_{j}{X}_{j}$, where X_j is the genotype of the jth SNP; ${\hat{\beta }}_{j}$ is the estimated effects on gene expression of the jth SNP; and ${{{{{{{\mathcal{A}}}}}}}}$ is the set of identified eQTLs for the gene. Then we regress Y against $\widehat{{{{{{{{\bf{E}}}}}}}}}$ to derive the association between the trait and the gene expression. If the individual-level genotype data are not available, we could derive MTWAS test statistics with GWAS summary statistics. Specifically, the MTWAS Z-score can be approximated with^2,40

$$Z=\frac{\hat{\gamma }}{{{{{{{{\rm{se}}}}}}}}(\hat{\gamma })}\approx \,\sum\limits_{j\in {{{{{{{\mathcal{A}}}}}}}}}{\hat{\beta }}_{j}\frac{{\hat{\sigma }}_{j}}{\hat{\sigma }}{z}_{j},$$

(4)

where z_j is the z-score for the jth SNP in GWAS summary statistics; ${\hat{\sigma }}_{j}^{2}$ is the sample variance of SNP j; ${\hat{\sigma }}^{2}$ is the sample variance of the gene expression. We remove the major histocompatibility complex region (6p21.3; GRCh38 coordinates 6: 28,510,120–33,480,577) in the TWAS analysis of UKBB phenotypes. We apply the Bonferroni correction to account for multiple testing. The independent TWAS genes are extracted by calculating squared Pearson correlation (r²) between the predicted expressions of all gene pairs within each tissue. For any gene pair with r² > 0.5, we only keep the gene with the lower TWAS P value⁴¹.

Compared methods

PrediXcan

PrediXcan is a TWAS method testing the molecular mechanisms through which genetic variation affects phenotype². For each gene on each target tissue, PrediXcan trains a prediction model with elastic net, using the accessible genome variation and gene expression levels. Barbeira et al.³ extend the PrediXcan to S-PrediXcan, which can be applied when only GWAS summary statistics are available. The PrediXcan software is available at https://github.com/hakyimlab/PrediXcan/tree/master/Software.

TIGAR

TIGAR is an improved Bayesian tool for transcriptomic data prediction. Specifically, they adopt a nonparametric Bayesian approach by assuming a Dirichlet process prior for the distribution of the effect-size variance^9,10. The method can flexibly model the genetic architecture of gene expression levels, and is more general than elastic net and other Bayesian methods with a specific prior. The TIGAR software is available at https://github.com/yanglab-emory/TIGAR. We implemented TIGAR with the default settings.

UTMOST

UTMOST is a TWAS method that trains a cross-tissue expression prediction model by using the genotype information and matched expression data from multiple tissues¹³. Specifically, the cross-tissue expression prediction is formulated as a penalized multivariate regression problem, and effect sizes are estimated by minimizing a squared loss function with a lasso penalty on columns (within-tissue effects) and a group-lasso penalty on rows (cross-tissue effects). The UTMOST software is available at https://github.com/Joker-Jerome/UTMOST.

Data preprocessing

We followed the quality control and sample exclusion process provided by the GTEx portal for the genotype and gene expression datasets⁶. SNPs with minor allele frequency (MAF) <0.05 or with strand-ambiguities were removed. We examine cis-SNPs located within a genomic range of ± 1 base pair from the gene and overlapped with HapMap Project Phase 3 SNPs. For the GTEx dataset, gene expression data of 17,566 genes were normalized through rank-based inverse normal transformation, and were further adjusted for sex, 3 genotyping PCs, and top 15 PEER factors (to quantify batch effects and experimental confounders)^2,42. We trained models based on 47 tissues with European ancestry sample sizes larger than 100. The GEUVADIS dataset was used for external validation, and we focused on the 373 individuals of European ancestry. For the DICE dataset, we utilized 2 genotyping PCs as covariates as suggested in ref. ²¹. A total of 19,020 genes were evaluated. For the OneK1K dataset, in line with the guidelines provided by Yazar et al.²², we excluded genes expressed in fewer than 10% of the cohort for each cell type. We focused on genes that were retained in more than one cell type after filtering. The expression values were log-transformed ($\log (x+1)$). The gene expression was subsequently adjusted for sex, age, 6 genotyping PCs, and 2 PEER factors.

Model training and evaluation

For the GTEx, DICE, and OneK1K studies, we used the fivefold CV for performance evaluation. Specifically, we randomly divided data into five subsets. For each fold, we performed imputation on four of these subsets designated as training data, and the remaining subset was used for testing, which were unimputed. The performance of predicting gene expression levels from genotypes was evaluated with the prediction R² and the number of predictable genes. We considered two criteria for identifying predictable genes, including prediction R² > 0.01, which is a common standard widely used in TWAS analysis such as PrediXcan². We also considered a stringent criterion, where we performed an F test with degrees of freedom 1 and N − 2 to assess the significance of the predictability, where N is the sample size. The P values were then adjusted with the Benjamini–Hochberg (BH) procedure to control the FDR at a 0.05 level⁴³.

In addition to the internal CV analysis, we conducted an external validation study with the GEUVADIS dataset, which consists of LCLs of 373 individuals of European ancestry. We trained the model on the EBV transformed lymphocytes from the GTEx dataset (N = 116).

Pathway enrichment analysis

We tested the enrichment of ct-genes and ts-genes in the KEGG pathway using the R package clusterProfiler^44,45. We employed the BH procedure to control the FDR at a 0.05 level⁴³.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

The MTWAS summary statistics for 84 UKBB phenotypes are provided in Supplementary Data 2. The eQTL weights generated by MTWAS, utilizing the GTEx v8, DICE, OneK1K datasets are available at Zenodo⁴⁶ (https://doi.org/10.5281/zenodo.11647460). The genotype and gene expression data of GTEx v8 project were downloaded from the database of Genotypes and Phenotypes (dbGaP) under accession number phs000424.v8.p2. The genotype and gene expression data of GEUVADIS LCLs were downloaded from the EBI ArrayExpress database under accession code E-GEUV-1. The DICE project provides anonymized gene expression data for public access at https://dice-database.org. The genotype data can be accessed from the dbGaP under accession number phs001703.v4.p1. The OneK1K single-cell gene expression and genotype data are available via Gene Expression Omnibus under accession number GSE196830. GWAS summary statistics from the UKBB were downloaded from the repository at http://www.nealelab.is/uk-biobank. The effective sample sizes for the binary UKBB phenotypes were calculated by $\frac{4{n}_{case}{n}_{control}}{{n}_{case}+{n}_{control}}$. The LD matrix was estimated with UKBB European ancestry samples, which can be downloaded from https://pan.ukbb.broadinstitute.org⁴⁷. We partitioned the genome into 1703 independent blocks using LDetect⁴⁸ at https://bitbucket.org/nygcresearch/ldetect-data/src/master/, based on the 1000G reference panel with European ancestry. Source data are provided with this paper.

Code availability

An R package implementing MTWAS is available on the GitHub repository (https://github.com/szcf-weiya/MTWAS) and Zenodo⁴⁶ (https://doi.org/10.5281/zenodo.11647460).

References

Visscher, P. M. et al. 10 years of GWAS discovery: biology, function, and translation. Am. J. Hum. Genet. 101, 5–22 (2017).
Article CAS PubMed PubMed Central Google Scholar
Gamazon, E. R. et al. A gene-based association method for mapping traits using reference transcriptome data. Nat. Genet. 47, 1091–1098 (2015).
Article CAS PubMed PubMed Central Google Scholar
Barbeira, A. N. et al. Exploring the phenotypic consequences of tissue specific gene expression variation inferred from GWAS summary statistics. Nat. Commun. 9, 1–20 (2018).
Article CAS Google Scholar
Cloney, R. Integrating gene variation and expression to understand complex traits. Nat. Rev. Genet. 17, 194–194 (2016).
Article CAS PubMed Google Scholar
Smemo, S. et al. Obesity-associated variants within FTO form long-range functional connections with IRX3. Nature 507, 371–375 (2014).
Article CAS PubMed PubMed Central ADS Google Scholar
Lonsdale, J. et al. The genotype-tissue expression (GTEx) project. Nat. Genet. 45, 580–585 (2013).
Article CAS Google Scholar
Yao, D. W., O’connor, L. J., Price, A. L. & Gusev, A. Quantifying genetic effects on disease mediated by assayed gene expression levels. Nat. Genet. 52, 626–633 (2020).
Article CAS PubMed PubMed Central Google Scholar
Gusev, A. et al. Integrative approaches for large-scale transcriptome-wide association studies. Nat. Genet. 48, 245–252 (2016).
Article CAS PubMed PubMed Central Google Scholar
Nagpal, S. et al. TIGAR: an improved Bayesian tool for transcriptomic data imputation enhances gene mapping of complex traits. Am. J. Hum. Genet. 105, 258–266 (2019).
Article CAS PubMed PubMed Central Google Scholar
Parrish, R. L., Gibson, G. C., Epstein, M. P. & Yang, J. TIGAR-V2: efficient TWAS tool with nonparametric Bayesian eQTL weights of 49 tissue types from GTEx V8. Hum. Genet. Genomics Adv. 3, 100068 (2022).
Barbeira, A. N. et al. Integrating predicted transcriptome from multiple tissues improves association detection. PLoS Genet. 15, e1007889 (2019).
Article PubMed PubMed Central Google Scholar
Feng, H. et al. Leveraging expression from multiple tissues using sparse canonical correlation analysis and aggregate tests improves the power of transcriptome-wide association studies. PLoS Genet. 17, e1008973 (2021).
Article CAS PubMed PubMed Central Google Scholar
Hu, Y. et al. A statistical framework for cross-tissue transcriptome-wide association analysis. Nat. Genet. 51, 568–576 (2019).
Article CAS PubMed PubMed Central Google Scholar
Urbut, S. M., Wang, G., Carbonetto, P. & Stephens, M. Flexible statistical methods for estimating and testing effects in genomic studies with multiple conditions. Nat. Genet. 51, 187–195 (2019).
Article CAS PubMed Google Scholar
Brown, C. D., Mangravite, L. M. & Engelhardt, B. E. Integrative modeling of eQTLs and cis-regulatory elements suggests mechanisms underlying cell type specificity of eQTLs. PLoS Genet. 9, e1003649 (2013).
Article CAS PubMed PubMed Central Google Scholar
Chen, L. et al. TIVAN: tissue-specific cis-eQTL single nucleotide variant annotation and prediction. Bioinformatics 35, 1573–1575 (2019).
Article CAS PubMed Google Scholar
Chen, J. & Chen, Z. Extended Bayesian information criteria for model selection with large model spaces. Biometrika 95, 759–771 (2008).
Article MathSciNet Google Scholar
Li, Y. & Liu, J. S. Robust variable and interaction selection for logistic regression and general index models. J. Am. Stat. Assoc. 114, 271–286 (2019).
Article MathSciNet CAS PubMed Google Scholar
Stekhoven, D. J. & Bühlmann, P. MissForest-non-parametric missing value imputation for mixed-type data. Bioinformatics 28, 112–118 (2012).
Article CAS PubMed Google Scholar
Lappalainen, T. et al. Transcriptome and genome sequencing uncovers functional variation in humans. Nature 501, 506–511 (2013).
Article CAS PubMed PubMed Central ADS Google Scholar
Schmiedel, B. J. et al. Impact of genetic polymorphisms on human immune cell gene expression. Cell 175, 1701–1715 (2018).
Article CAS PubMed PubMed Central Google Scholar
Yazar, S. et al. Single-cell eQTL mapping identifies cell type-specific genetic control of autoimmune disease. Science 376, eabf3041 (2022).
Article CAS PubMed Google Scholar
Arvanitis, M., Tayeb, K., Strober, B. J. & Battle, A. Redefining tissue specificity of genetic regulation of gene expression in the presence of allelic heterogeneity. Am. J. Hum. Genet. 109, 223–239 (2022).
Article CAS PubMed PubMed Central Google Scholar
Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).
Article CAS PubMed PubMed Central ADS Google Scholar
Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291 (2016).
Article CAS PubMed PubMed Central Google Scholar
Volland, C. et al. Control of p21^Cip by BRCA1-associated protein is critical for cardiomyocyte cell cycle progression and survival. Cardiovasc. Res. 116, 592–604 (2020).
Article CAS PubMed Google Scholar
Dong, W.-q. et al. Prohibitin overexpression improves myocardial function in diabetic cardiomyopathy. Oncotarget 7, 66 (2016).
Article PubMed Google Scholar
Yu, Y.-d, Xue, Y.-t & Li, Y. Identification and verification of feature biomarkers associated in heart failure by bioinformatics analysis. Sci. Rep. 13, 3488 (2023).
Article PubMed PubMed Central ADS Google Scholar
Brecker, M., Khakhina, S., Schubert, T., Thompson, Z. & Rubenstein, R. The probable, possible, and novel functions of ERp29. Front. Physiol. 11, 574339 (2020).
Ugidos, N. et al. Interactome of the autoimmune risk protein ANKRD55. Front. Immunol. 10, 2067 (2019).
Article CAS PubMed PubMed Central Google Scholar
Tang, P. et al. NADPH oxidase NOX4 is a glycolytic regulator through mROS-HIF1α axis in thyroid carcinomas. Sci. Rep. 8, 15897 (2018).
Article PubMed PubMed Central ADS Google Scholar
Azouzi, N. et al. NADPH oxidase NOX4 is a critical mediator of BRAFV600E-induced downregulation of the sodium/iodide symporter in papillary thyroid carcinomas. Antioxid. Redox Signal. 26, 864–877 (2017).
Article CAS PubMed PubMed Central Google Scholar
Lazzara, D. R., Zarkhin, S. G., Rubenstein, S. N. & Glick, B. P. Melanoma and thyroid carcinoma: our current understanding. J. Clin. Aesthetic Dermatol. 12, 39 (2019).
Google Scholar
Ulisse, S. et al. Is melanoma progression affected by thyroid diseases? Int. J. Mol. Sci. 23, 10036 (2022).
Article CAS PubMed PubMed Central Google Scholar
Ozgun, A. et al. Malignant melanoma and papillary thyroid carcinoma that were diagnosed concurrently and treated simultaneously: a case report. Oncol. Lett. 9, 468–470 (2015).
Article PubMed Google Scholar
Beretti, F. et al. The interplay between HGF/c-met axis and NOX4 in BRAF mutated melanoma. Int. J. Mol. Sci. 22, 761 (2021).
Article CAS PubMed PubMed Central Google Scholar
Wang, G., Sarkar, A., Carbonetto, P. & Stephens, M. A simple new approach to variable selection in regression, with application to genetic fine mapping. J. R. Stat. Soc. Ser. B: Stat. Methodol. 82, 1273–1300 (2020).
Article MathSciNet Google Scholar
Wen, X., Lee, Y., Luca, F. & Pique-Regi, R. Efficient integrative multi-snp association analysis via deterministic approximation of posteriors. Am. J. Hum. Genet. 98, 1114–1129 (2016).
Article CAS PubMed PubMed Central Google Scholar
Barbeira, A. N. et al. Fine-mapping and qtl tissue-sharing information improves the reliability of causal gene identification. Genet. Epidemiol. 44, 854–867 (2020).
Article PubMed PubMed Central Google Scholar
Song, S. et al. Openness weighted association studies: leveraging personal genome information to prioritize non-coding variants. Bioinformatics 37, 4737–4743 (2021).
Article CAS PubMed PubMed Central Google Scholar
Dai, Q. et al. OTTERS: a powerful TWAS framework leveraging summary-level reference data. Nat. Commun. 14, 1271 (2023).
Article CAS PubMed PubMed Central ADS Google Scholar
Stegle, O., Parts, L., Piipari, M., Winn, J. & Durbin, R. Using probabilistic estimation of expression residuals (PEER) to obtain increased power and interpretability of gene expression analyses. Nat. Protoc. 7, 500–507 (2012).
Article CAS PubMed PubMed Central Google Scholar
Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B (Methodol.) 57, 289–300 (1995).
Article MathSciNet Google Scholar
Kanehisa, M. et al. KEGG for linking genomes to life and the environment. Nucleic Acids Res. 36, D480–D484 (2007).
Article PubMed PubMed Central Google Scholar
Yu, G., Wang, L.-G., Han, Y. & He, Q.-Y. clusterProfiler: an R package for comparing biological themes among gene clusters. Omics: A J. Integr. Biol. 16, 284–287 (2012).
Article CAS Google Scholar
Song, S., Wang, L., Hou, L. & Liu, J. S. MTWAS: Partitioning and aggregating cross-tissue and tissue-specific genetic effects to identify gene-trait associations. Zenodo https://doi.org/10.5281/zenodo.11647460 (2024).
Pan-UKB team. https://pan.ukbb.broadinstitute.org (2020).
Berisa, T. & Pickrell, J. K. Approximately independent linkage disequilibrium blocks in human populations. Bioinformatics 32, 283 (2016).
Article CAS PubMed Google Scholar
Lalonde, S. et al. Integrative analysis of vascular endothelial cell genomic features identifies AIDA as a coronary artery disease candidate gene. Genome Biol. 20, 1–13 (2019).
Article CAS Google Scholar
Castillo-Avila, R. G. et al. Association between genetic variants of CELSR2-PSRC1-SORT1 and cardiovascular diseases: a systematic review and meta-analysis. J. Cardiovas. Dev. Dis. 10, 91 (2023).
CAS Google Scholar
Joshua, J., Caswell, J., O’Sullivan, M. L., Wood, G. & Fonfara, S. Feline myocardial transcriptome in health and in hypertrophic cardiomyopathy-a translational animal model for human disease. PLoS ONE 18, e0283244 (2023).
Article CAS PubMed PubMed Central Google Scholar
Li, X. et al. Meta-analysis identifies robust association between SNP rs17465637 in MIA3 on chromosome 1q41 and coronary artery disease. Atherosclerosis 231, 136–140 (2013).
Article CAS PubMed Google Scholar
Aggarwal, S., Narang, R., Saluja, D. & Srivastava, K. Diagnostic potential of SORT1 gene in coronary artery disease. Gene 909, 148308 (2024).
Article CAS PubMed Google Scholar
Nordestgaard, B. G. & Langsted, A. Lipoprotein (a) as a cause of cardiovascular disease: insights from epidemiology, genetics, and biology. J. Lipid Res. 57, 1953–1975 (2016).
Article CAS PubMed PubMed Central Google Scholar
Enas, E. A., Varkey, B., Dharmarajan, T., Pare, G. & Bahl, V. K. Lipoprotein (a): An independent, genetic, and causal factor for cardiovascular disease and acute myocardial infarction. Indian Heart J. 71, 99–112 (2019).
Article PubMed PubMed Central Google Scholar
Paquette, M., Dufour, R. & Baass, A. PHACTR1 genotype predicts coronary artery disease in patients with familial hypercholesterolemia. J. Clin. Lipidol. 12, 966–971 (2018).
Article PubMed Google Scholar
Yuan, W. et al. New findings in the roles of Cyclin-dependent Kinase inhibitors 2B Antisense RNA 1 (CDKN2B-AS1) rs1333049 G/C and rs4977574 A/G variants on the risk to coronary heart disease. Bioengineered 11, 1084–1098 (2020).
Article CAS PubMed PubMed Central Google Scholar
Ozaki, K. et al. SNPs in BRAP associated with risk of myocardial infarction in Asian populations. Nat. Genet. 41, 329–333 (2009).
Article CAS PubMed Google Scholar
Hinohara, K. et al. Validation of eight genetic risk factors in East Asian populations replicated the association of BRAP with coronary artery disease. J. Hum. Genet. 54, 642–646 (2009).
Article CAS PubMed Google Scholar
Karamanavi, E. et al. The FES gene at the 15q26 coronary-artery-disease locus inhibits atherosclerosis. Circ. Res. 131, 1004–1017 (2022).
Article CAS PubMed PubMed Central Google Scholar
Ken-Dror, G., Talmud, P. J., Humphries, S. E. & Drenos, F. APOE/C1/C4/C2 gene cluster genotypes, haplotypes and lipid levels in prospective coronary heart disease risk among UK healthy men. Mol. Med. 16, 389–399 (2010).
Article CAS PubMed PubMed Central Google Scholar

Download references

Acknowledgements

The Genotype-Tissue Expression (GTEx) Project was supported by the Common Fund of the Office of the Director of the National Institutes of Health, and by NCI, NHGRI, NHLBI, NIDA, NIMH, and NINDS. This research has been conducted using the publicly available UK Biobank GWAS summary statistics released by Neale’s lab. This work was supported by the National Key R&D Program of China (2021YFF1200905) to L.H. and by the National Institute of General Medical Sciences Grant R01 GM152814-01 to J.S.L.

Author information

Authors and Affiliations

Center for Statistical Science, Department of Industrial Engineering, Tsinghua University, Beijing, China
Shuang Song & Lin Hou
Department of Biostatistics, Yale School of Public Health, New Haven, CT, USA
Lijun Wang
MOE Key Laboratory of Bioinformatics, School of Life Sciences, Tsinghua University, Beijing, China
Lin Hou
Department of Statistics, Harvard University, Cambridge, MA, USA
Jun S. Liu

Authors

Shuang Song
View author publications
You can also search for this author in PubMed Google Scholar
Lijun Wang
View author publications
You can also search for this author in PubMed Google Scholar
Lin Hou
View author publications
You can also search for this author in PubMed Google Scholar
Jun S. Liu
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

S.S., L.H., and J.S.L. conceived the study. S.S. developed the statistical methods and performed simulations and real data analyses with guidance from L.H. and J.S.L. S.S. and L.J.W. developed the MTWAS software. S.S. and L.H. applied the data. All authors contributed to this work in data analysis, result interpretation, and manuscript writing. All authors have reviewed and approved the final version of the manuscript.

Corresponding authors

Correspondence to Lin Hou or Jun S. Liu.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Communications thanks the anonymous reviewers for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Peer Review File

Description of Additional Supplementary Files

Supplementary Data 1

Supplementary Data 2

Reporting Summary

Source data

Source Data

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Song, S., Wang, L., Hou, L. et al. Partitioning and aggregating cross-tissue and tissue-specific genetic effects to identify gene-trait associations. Nat Commun 15, 5769 (2024). https://doi.org/10.1038/s41467-024-49924-4

Download citation

Received: 27 November 2023
Accepted: 25 June 2024
Published: 09 July 2024
DOI: https://doi.org/10.1038/s41467-024-49924-4

Comments

By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.

Subjects

Abstract

Similar content being viewed by others

Introduction

Results

Model overview

Improvements in prediction accuracy on GTEx tissues

Improvements in prediction accuracy on immune cell types

Investigation of cross-tissue and tissue-specific effects

Applications to GWAS studies on UKBB phenotypes

Computational efficiency

Discussion

Methods

Imputation of gene expression data

Identification of ct-eQTLs and ts-eQTLs

Gene-trait association studies

Compared methods

PrediXcan

TIGAR

UTMOST

Data preprocessing

Model training and evaluation

Pathway enrichment analysis

Reporting summary

Data availability

Code availability

References

Acknowledgements

Author information

Authors and Affiliations

Contributions

Corresponding authors

Ethics declarations

Competing interests

Peer review

Peer review information

Additional information

Supplementary information

Source data

Rights and permissions

About this article

Cite this article

Share this article

Comments

Search

Quick links