Main

Decades of public and private investment in large-scale data collection and aggregation have in recent years yielded data repositories called biobanks, which link health outcomes to biological samples for hundreds of thousands of individuals (for example, refs. 1,2,3). Such resources often include thousands of variables drawn from electronic health records (EHR), self-report survey measures, laboratory assays, and physical and cognitive assessments3, thus enabling research across diverse data types.

Although these massive resources now power discovery in human health and disease (for example, refs. 4,5), the tremendous breadth and depth of data can obfuscate larger patterns present across a biobank. For example, indicators of a particular health construct might be reflected across a range of variables nested within different data modalities in ways that are both anticipated and unexpected. To more completely consider the correlated human health landscape, approaches are needed that can identify underlying structure and reduce thousands of variables into a smaller number of constructs, refining data in a way that is digestible and scalable for human interpretation.

Dimensionality reduction is a common task in many domains, with a recent proliferation of methods having been applied to biobank-scale data. Principal component analysis (PCA), for example, provides a lower-dimensional representation of the strongest axes of variation in a dataset. It has been leveraged to, for instance, identify dimensions of genetic ancestry in genotype data6, extract features from individual biobank questionnaires7, and identify sets of genetic variants with similar patterns of association across thousands of genome-wide association studies (GWAS)8. Other data reduction approaches prioritize identifying correlated sets of variables across data types9, modelling latent classes10 or creating lower-dimensional representations for visualization purposes11,12. Deep learning methods such as autoencoders and transformers have been used to integrate ’omics data across modalities13,14 and to extract relevant features from electronic health records (EHRs)15,16, respectively.

Biobank analyses have devoted relatively less attention to factor analysis (FA)17,18,19, an approach commonly used in the social sciences that models the observed correlation between variables as arising from one or more shared continuous latent (unobserved) factors. Factor analysis has the benefit of being model based, facilitating more direct statistical inference than descriptive summaries (for example, PCA) or ‘black box’ algorithmic solutions, and it directly prioritizes extracting factors that have a simple relationship to the observed items, when possible. Conventionally, factor analysis is applied to sets of items within a single questionnaire to identify or confirm underlying structure and estimate scores that more accurately measure the latent constructs captured. This approach has scaled successfully to large cohorts, for example, in modelling measures of cognition20 or wellbeing21, or in identifying structure across disease comorbidities22. Recently, factor analysis has been adapted to model the structure of genetic, rather than observed phenotypic, correlations across traits23,24,25, with further extensions to other types of large-scale ’omics data proposed26.

In the current study, we modify and extend the factor analytic approach to explore the structure of a much broader set of multimodal biobank phenotypes. We apply this approach to the cross-sectional phenome of UK Biobank (UKB)3 and evaluate whether the identified structure is (1) informative about relationships that might be unexpected or normally obscured and (2) effectively summarized by factor scores to enable more powerful analyses of linked phenotypic and genetic data. We also consider the limitations and trade-offs involved in applying factor analysis at this scale. While our approach is one of many that may be used to distill biobank data, these analyses reemphasize the value of principled dimensionality reduction, and reveal important insights into human variation across the complex and multifaceted human phenome.

We note that these analyses are restricted to individuals of predominantly European genetic ancestry, which is a key limitation to their generalizability. This analytic decision is driven by the statistical necessity of having a large ancestrally homogeneous group of research participants in one biobank, to limit impacts of population stratification. Since there is a relatively limited number of respondents of predominently non-European genetic ancestry in UKB, and since results from factor analysis are non-transferable across datasets, we neither have the necessary sample size within UKB to meaningfully compare results across ancestries, nor could we compare our European-ancestry-only factor analysis results from UKB to factor analysis results from a sample in a different, primarily non-European-ancestry biobank. However, we avidly urge current and future biobank efforts to collect large enough samples sizes in diverse genetic ancestries to make these sorts of critical comparisons possible. This step is imperative to not risk exacerbating health disparities27.

Results

Distilling the phenotypic landscape

We use an adapted multistage factor analytic approach to distill UKB’s phenotypic data, first considering 2,772 phenotypes and all unrelated individuals with predominantly estimated European genetic ancestry (N = 361,144; Extended Data Fig. 1 and Supplementary Text). FA methods treat observed variables (items) as measures of a small number of unobserved (latent) factors F, with corresponding effect sizes or ‘loadings’, Λ, and item-specific residuals ϵ, allowing the observed covariance between items Σ to be fit as \(\varSigma =\varLambda {\rm{Cov}}(F)\varLambda {\prime} +{\rm{Cov}}(\epsilon )\)19. We deconvolve the latent factors and loadings by prioritizing sparsity criteria for the loadings while requiring the factors to be uncorrelated (orthogonal). The content of each fitted factor can therefore be interpreted as reflecting the correlation between sets of items conditional on the relationships captured by other factors (Supplementary Text). To reduce the impact of differential availability of certain assessments across participants, we begin by fitting a factor model in a core subsample of 42,325 individuals with high assessment-level completion and 898 phenotypes with low (that is, mean = 9.1%, s.d. = 10.7%) missingness and prevalence above 1% (Supplementary Text and Table 1). Genetic comparison of the fitted factors between the core subsample and the full European-ancestry cohort suggests that the identified architecture generalizes well (Supplementary Fig. 1).

Fig. 1: Makeup of factors in the final model.
figure 1

Horizontal bars represent proportion of variance explained in a given factor score by each of 8 major categories of assessment in UKB, estimated using hierarchical partitioning. To the left, factors are numbered in order of variance extraction in the exploratory factor analysis. To the right, brief descriptions of the items contained within a factor are listed, arrived at by expert consensus of coauthors and colleagues.

The final model consists of 35 orthogonal latent factors drawn from 505 observed items in UKB (Fig. 1 and Extended Data Fig. 2). Initial model fitting with exploratory FA (Methods, and Supplementary Figs. 2 and 3) selected a model explaining 18.5% of overall variance across input phenotypes that demonstrated good absolute fit in the modelling sample (N = 33,860; root mean square error of the approximation (RMSEA) = 0.015; comparative fit index (CFI) = 0.883; see Supplementary Table 2 for additional fit metrics across the modelling and holdout samples; see Supplementary Text for more on interpretation of absolute and relative fit in the context of this analysis). While total variance explained is low for a conventional factor analysis with a much smaller set of individual items taken from a single survey, the aim was to model and extract major axes of variation across a biobank with a broad and diverse set of items. The same number of components extracted from a principal component analysis explains a similar amount of overall variance (21.6%), but the allocation of items and weights across components differs from that of the factors, highlighting important distinctions between these two common dimensionality reduction algorithms (Extended Data Fig. 3, and Supplementary Table 3 and Text). The initial exploratory model was then refined for more robust modelling in confirmatory FA (Methods and Supplementary Text, Figs. 4 and 5, and Table 4). We adopt a final confirmatory FA model that demonstrates acceptable fit based on absolute fit metrics in a holdout sample of 8,465 UKB participants (RMSEA = 0.028; standardized root mean squared residual (SRMR) = 0.076; Supplementary Table 2 and Text).

The loadings of a factor’s component items reflect their relative importance within the construct captured17,18,19. On average, each factor influences 32.49 items (s.d. = 20.48; range 3–84; Supplementary Table 5), with items known to be correlates of numerous health outcomes, such as ‘Health satisfaction’ and ‘Overall health rating’28,29, loading on as many as 10 factors. Because individual assessments in UKB were designed to capture specific phenotypic domains, many factors draw heavily from individual categories within the 72 UKB questionnaires and assessments included in the factor model (Supplementary Table 6), with 9 deriving most of their items from a single source. Most factors, however, draw items from multiple sources (mean = 11.11; s.d. = 6.26; range 2–26), highlighting the utility of a factor analytic approach in identifying phenotypic structure across assessments.

Factors related to medical diseases recapitulate previous clinical, epidemiological and biological knowledge without any expert or manual curation. Factor 12, for example, captures correlates of hypertension across the phenome including diagnostic items (for example, self-reported hypertension and measured blood pressure), risk factors (for example, family history, body mass index (BMI) and waist circumference), comorbidities (for example, self-reported high cholesterol and diabetes) and relevant medications (for example, diuretics and calcium channel blockers). Along similar lines, Factor 16 captures commonly used diagnostic indicators of coronary artery disease, such as self-report and hospital inpatient diagnoses (for example, chronic ischaemic heart disease and myocardial infarction); symptoms (for example, angina, pain in the chest or throat and shortness of breath); and medications (for example, aspirin, beta-blockers and statins; Supplementary Table 5). Beyond these more expected associations, the factors capture less established links between variables such as BMI and frequency of computer gaming (Factor 7), diet and education (Factor 10), and information technology occupations and performance on computerized cognitive tests (Factor 34).

It is important to note, however, that interpretation of the factors relies on their forced orthogonality. Returning to the medical domain, Factor 28 captures diabetes diagnoses and responsive lifestyle changes such as altering one’s diet and reducing sugar intake. Had factors not been modelled as independent, Factor 28 would probably share variance with Factors 12 and 16 above, as well as with Factor 7, which captures BMI and adiposity; each of these factors captures aspects of the well-known ‘metabolic syndrome’. In the context of our model, Factor 28, and each factor more generally, is most accurately interpreted as representing the remaining covariance of the items within it (for example, diabetes diagnosis and treatment), once accounting for the variance explained by the other related (for example, cardiometabolic) and unrelated factors.

Characterizing factors with linked biobank data

To further characterize the 35 factors and their relation to public health outcomes, we generate weighted sum scores indexing individuals’ values for each factor in the full sample of individuals with predominately European-estimated genetic ancestry (N = 361,099 due to subsequent participant withdrawals from UKB). Scores are computed on the basis of the fitted factor model using weighted linear combinations of observed items while accounting for missingness (Supplementary Text and Figs. 6 and 7). The expected correlation between estimated scores and the corresponding latent factors30 are generally strong (mean = 0.863, s.d. = 0.073, range: 0.611[Factor 28] to 0.974[Factor 4]), suggesting good reliability. Using these scores, we conduct a series of follow-up analyses including: correlating factors with 403 top-level medical diagnostic codes (or ‘phecodes’; Supplementary Table 7 and Extended Data Fig. 4) and 28 biomarkers (Extended Data Fig. 5); predicting all-cause mortality from survey completion to the last date at which participant death records are available (Fig. 2a); and performing a series of genetic analyses including GWAS of the factors (Supplementary Table 8), heritability estimation and enrichment (Fig. 2b and Extended Data Fig. 6), and genetic correlations (Extended Data Fig. 7, and Supplementary Fig. 8 and Table 9). In each case we control for covariates including age, sex, genetic principal components and UK Biobank assessment centre, since we observe that most factors are significantly related to these features (Methods and Supplementary Table 10). We also create polygenic scores (PGS) in the National Longitudinal Study of Adolescent to Adult Health (Add Health31) on the basis of factor GWAS summary statistics to supplement and extend our findings in an independent, US-based cohort.

Fig. 2: Prospective mortality hazard ratios and heritability estimates across all 35 factors.
figure 2

a, Mortality hazards per factor. Factors are ordered by hazard ratio for mortality from time of last survey completion to the last date at which death records were available for analysis. T0 was defined as the last contact an individual had with the UK Biobank study, within the context of the items included in the factors. b, Heritability estimates for each factor, ordered by the point estimate. Results in a reflect estimated hazard ratios ±95% CIs, while results in b reflect estimated SNP heritabilities ±1 standard error. For both panels, darker blue boxes remain significant after adjustment for multiple comparisons. Covariates for both analyses included 20 genetic PCs, age, chromosomal sex, age2, age × chromosomal sex, age2 × chromosomal sex and assessment centre. Mortality analyses additionally included a covariate representing days from baseline assessment to T0.

All but three factors are associated with prospective mortality after correction for multiple testing (that is, P < 0.0014 for 35 factors), reflecting the significance of these axes of variation across individuals (Fig. 2a). Factor 20, which includes items such as number of surgeries, cancer diagnosis and ‘diagnosis with a life-threatening illness’ has the highest mortality prediction across all 35 factors (hazard ratio [HR] = 1.62[1.59–1.64]). Other factors among the strongest predictors of mortality capture constructs such as joint pain and disability (Factor 14), trauma (Factor 9), social and economic stability (Factor 15) and physical activity (Factor 23). These results provide a benchmark for assessing which constructs are most central to health outcomes and highlight the utility of using factor scores prospectively as additional longitudinal data are incorporated into a biobank.

Genetic results from GWAS of the factor scores are generally consistent with previous findings for related phenotypes. For example, Factor 11, which captures asthma diagnosis and related medications, comorbidities and lab findings, has a strong genetic correlation with a previous GWAS of asthma32 (rg = 0.89(0.01); InterceptGcov = 0.59[0.02]; UKB-excluded rg = 0.82[0.04]), and its heritability is enriched in regions of the genome associated with blood and immune cell types (P = 1.61 × 10−6; Extended Data Fig. 6), highlighting a known key role for the immune system in the pathogenesis of chronic respiratory disease. Similarly, GWAS of Factor 16 captures known lipid biology, with many of the 33 significant loci mapped to core lipid metabolism genes such as LPA, LPL, LDLR, SORT1, APOE and PCSK9 (Fig. 3d), and exhibits strong genetic correlation with a previous coronary artery disease GWAS33 (rg = 0.87[0.02]; InterceptGcov = 0.42[0.01]; UKB-excluded rg = 0.81[0.04]). Comparison of Factor 28 to a previous GWAS of type 2 diabetes34, however, reveals imperfect capture of the clinical definition (rg = 0.68[0.02], InterceptGcov = 0.36[0.01]; UKB-excluded rg = 0.70[0.03]; Extended Data Fig. 8), probably due to its modelled orthogonality from other factors capturing cardiometabolic constructs. More specifically, the factor has higher genetic overlap with cholesterol measures (for example, total cholesterol35 rg = 0.29[0.04] vs 0.04[0.03]) but lower overlap with BMI36 (rg = 0.23[0.03] vs 0.49[0.03]) and an inverse correlation with blood pressure37 (rg = −0.18[0.02] vs 0.20[0.02]), reflective of its inclusion of high cholesterol in its factor definition and its forced independence of Factors 7 and 12 described above.

Fig. 3: Genetic properties of factors vs items.
figure 3

a, Distributions of SNP-based heritability point estimates for items and factors, with density curves overlaid. Dashed vertical lines represent the median point estimate for each category. b, SNP heritability point estimates with standard error bars shown for an example factor, Factor 16, and its top 10 component items by loading. c, Number of GWAS significant loci (P < 5 × 10−8 for Bonferroni-adjusted significance within a given phenotype) across all 35 factors and their top 5 component items by loading. Loci shown in purple are significant only in GWAS of one or more top items. Loci shown in green are significant in GWAS of the factors. For example, there are 2,350 loci that are significant in GWAS of only one of a factor’s top 5 items (second bar in the graph). Of these loci, 486 are also significant in GWAS of the corresponding factor (shown in green). d, Comparison of loci identified in Factor 16 (top of Miami plot) versus its top 5 items by loading (bottom of Miami plot). Below the Miami plot are all loci across the factor (in blue) and top items (in orange), demonstrating the patterns presented in c at the single-factor level. All P values for c and d are from two-sided tests in GWAS using linear regression for each variant and the covariates described in Methods.

Even where genetic results are qualitatively similar, factor analysis aids in genetic discovery by combining shared information across items, which decreases measurement error of the underlying construct. Across all 35 factors, this increase in power from modelling covariance across items yields 548 loci, of 2,329 total, that are not genome-wide significant in GWAS of their top 5 component items (Fig. 3c). More generally, all but one factor are significantly heritable after multiple testing correction (mean factor hg2 = 0.10[0.09]; Fig. 2b), and the observed-scale single nucleotide polymorphism (SNP) heritability of the 35 factors is on the whole higher than for the 505 component items (mean item observed-scale hg2 = 0.05[0.07]; 2-sample t-test P = 0.002; Fig. 3a). Within factors, 20 of 35 have higher SNP heritability point estimates than all 5 of their top-loading items (for example, Fig. 3b). Although factors that capture measured physical characteristics have higher heritabilities on average (mean = 0.22, s.d. = 0.10), the largest benefits are observed for factors containing mostly dichotomous or ordinal self-report items; these items are likely to have higher measurement error than empirically measured continuous items (Supplementary Note of ref. 38; Extended Data Fig. 9).

Furthermore, identifying patterns of shared and non-shared genetic signal across items within a factor can provide insight into how they relate. Factor GWAS are best powered for shared signal, identifying nearly all (that is, 91.5% of) loci significant in at least 3 of 5 top-loading items, but much fewer (that is, 20.7% of) loci significant in only 1 of the top items, which are thus likely to be item-specific (Fig. 3c; for example, Fig. 3d). In other words, loci that are common to multiple component items are more likely to capture shared covariance across these items and thus be picked up in the factor GWAS. Item-specific effects could be hypothesized to reflect qualitatively distinct genetic mechanisms, but attempts to partition non-shared effects between top items and their corresponding factors generally identify similar loci and tissue enrichments as the factor GWAS, implying that ‘item-specific’ signal residual on the factor may commonly reflect differential relative effects of shared loci instead (Supplementary Text and Fig. 9). Direct comparisons between GWAS of the top items in a factor have the potential to be more informative by suggesting causal relationships between observed items, such as evidence for a partially causal effect of doctor-diagnosed asthma on self-reported on-the-job breathing problems within Factor 11 (latent causal variable (LCV) genetic causality proportion = 0.80, P = 3.98 × 10−5; Supplementary Fig. 10; see Supplementary Text for further details on LCV analyses); however, power to establish these relationships is limited by the power of the GWAS for each item.

Combining factor definitions with individual-level data and linked biobank phenotypes allows us to further characterize the factors and uncover associations relevant for health. Beyond the general trends described above, we use the remaining results sections to highlight key findings in individual factors across the sociodemographic, psychiatric and behavioural domains.

Disentangling subdomains of socioeconomic status

Socioeconomic status (SES) is one of largest single predictors of health and mortality39, and, for research purposes, is traditionally estimated with indicators of education, occupation and income40. Three factors predominantly include SES-related variables and reflect correlates of occupation (Factor 5), educational attainment (EA; Factor 10), and social and economic stability (Factor 15), respectively, across the phenome. These factors pull from a range of questionnaires, from diet to employment history to social support, capturing items both traditionally and non-traditionally considered SES indicators (Supplementary Table 5). Factor 10 includes classic SES items such as educational attainment and job codes, as well as apparent cohort-specific correlates such as intake of ground coffee/espresso, muesli and wine. Factor 5 captures jobs such as low-ranking military, physical labour and factory occupations, as well as work environments full of fumes or noise. Finally, Factor 15 reflects social and economic stability, including social support networks, loneliness, home ownership, household income and never having been divorced.

The SES factors are pervasively, yet differentially, associated with health outcomes in UKB (Supplementary Table 7 and Extended Data Fig. 4). Factors 5 and 10 are much more similar to each other in their patterns of association to linked hospital inpatient phecodes (Pearson correlation of regression betas |r| = 0.61) than they are to Factor 15 (|r| = 0.24 and 0.33, respectively). Factors 5 and 10 are most distinguished by differential associations with respiratory and cardiometabolic diseases, with Factor 5 being more associated with respiratory diseases and Factor 10 with cardiometabolic diseases. Factor 15, in contrast, is much more protective against hospitalization for mental health and substance use disorders than the other two factors. Factor 15 therefore suggests a distinct domain of SES that is protective against the ‘diseases and deaths of despair’ that have been shown to be the most significant drivers of decreasing life expectancy in recent years41. In fact, Factor 15 is the most prospectively predictive of survival of all 35 factors (HR = 0.75[0.74–0.76]; Fig. 2a).

Genetic data provide additional insight into how these factors parse multiple axes of SES. Although genetic associations with SES are often correlated with environmental influences and thus cannot be interpreted as purely causal, they remain an informative tool. For example, genetic correlations across the SES factors are low to moderate, suggesting partially overlapping yet distinct domains of SES (Fig. 4a). Factor 10 shares substantial genetic overlap with previous GWAS of EA42 (rg = 0.93[0.01]; InterceptGcov = 0.29[0.01]; UKB-excluded rg = 0.92[0.02]) and its correlates, including household income43 and cognitive performance44. Notably, the SNP heritability of Factor 10 (hg2 = 0.21[0.01]) is greater than that of the EA GWAS38 (hg2 = 0.12), reflecting a potential benefit of including correlated measures of traditional and non-traditional SES. Factor 5, meanwhile, reflects strong and roughly equal genetic overlap with EA, region-based social deprivation45 and household income. Factor 15 is the most distinct SES factor, with moderate genetic correlations with social deprivation and household income, but low and non-significant associations with EA and cognitive performance, respectively. Factors 5 and 15 are associated only with genetic effects on EA that operate through the inherited family environment, while Factor 10 is also associated with direct genetic effects46. Associations between previous SES GWAS and other factors within our model reveal that Factors 5, 10 and 15 alone do not fully capture SES, with numerous other residual correlations further reflecting the complex interconnections between SES measures and health (Supplementary Text and Fig. 11).

Fig. 4: Genetic associations of factors within the SES domain.
figure 4

a, Genetic correlation across factors in the SES domain and previous GWAS of SES indicators. All genetic associations are flipped to be in the direction reflecting greater SES for consistency (for example, ‘Social deprivation’ becomes ‘Social enrichment’). Colour of each box within the heat map indicates the strength of genetic overlap across the two corresponding phenotypes. b, Associations between polygenic scores derived from the SES factors and SES-related items in an outside cohort (Add Health) with corresponding sample sizes (N). Barplots show estimated incremental variance explained (change in R2 or Nagelkerke’s pseudo-R2 from adding polygenic scores to regression models for continuous or binary outcomes, respectively (Methods)) with error bars representing 95% bootstrapped confidence intervals.

Polygenic score analyses of the SES factors conducted in the independent National Longitudinal Study of Adolescent to Adult Health (Add Health) dataset further solidify these key distinctions and provide evidence of generalization beyond the UKB sample (Supplementary Table 11 and Fig. 4b). The Factor 10 PGS outperforms those of Factors 5 and 15 in its associations with years of education (R2 = 7.7%), college completion (pseudo-R2 = 5.5%) and verbal cognition (R2 = 4.8%). In contrast, Factor 5 scores are the best predictors of income (R2 = 5.3%) and high-paying vs low-paying jobs (pseudo-R2 = 5.7%), while Factor 15 scores are most associated with neighbourhood satisfaction (R2 = 3.9%). Not only do these findings demonstrate replication of our results in an independent, non-UKB sample, but the level of differentiation in prediction also articulates a clear disentangling of factors related to SES.

Elucidating associations between trauma and health outcomes

Unlike somatic diseases with well characterized, directly measurable criteria, psychiatric constructs are not biologically defined, and diagnostic boundaries are imprecise. Experiences of trauma, for example, represent one of the most critical but understudied public health concerns globally, with significant associations to downstream chronic health problems and mortality47. Factor 9 places trauma within the larger behavioural and medical landscape without the need to decide which trauma-related measures to include or how to combine and weight them. The factor’s component items include exposure to traumas such as feeling hated as a child or being physically abused by a family member, as well as related mental health outcomes such as post-traumatic stress disorder (PTSD), major depressive disorder, mania, psychosis, addiction and self-harm (Supplementary Table 5).

Of the 35 factors, Factor 9 has one of the highest prospective all-cause mortality hazards (HR = 1.36[1.31–1.41]; Fig. 2a), reinforcing the critical public health importance of the construct. Its associations with psychiatric and medical diagnoses are uniquely broad (Fig. 5), with strong phenotypic correlations observed across all diagnostic categories, including common circulatory (for example, hypertension, odds ratio (OR) = 1.29[1.27–1.32]), digestive (for example, acid reflux, OR = 1.34[1.30–1.38]), respiratory (for example, asthma, OR = 1.39[1.36–1.43]) and endocrine (for example, type 2 diabetes OR = 1.62[1.57–1.68]) outcomes (Supplementary Table 7). The biomarker most associated with this factor is C-reactive protein (β = 0.11[0.004], z = 31.47), a blood-based indicator of inflammation and inflammatory disorders, providing further evidence for its relevance as a clinical flag for suites of trauma-related exposures and outcomes48.

Fig. 5: Factor 9 associations across top-level inpatient diagnostic phecodes.
figure 5

Box-and-whisker plots are shown for associations within UKB with 403 derived medical phecodes grouped by category. These associations are defined as the test statistics (that is, z-scores) for the factor score in a logistic regression model including our standard covariates (that is, first 20 genetic PCs, age, chromosomal sex, age2, age × chromosomal sex, age2 × chromosomal sex and dummy variables representing the assessment centres of origin). Boxes represent the middle quartiles of Factor 9’s test statistics across phecodes within a category, with whiskers extending to maximum and minimum observed values, excluding outliers >1.5× the interquartile range away from the middle quartiles which are plotted individually. Median values per category are indicated by individual black lines inside the boxes. The dotted grey lines represent the critical test statistics for significance at two-sided P < 0.05 after correcting for multiple comparisons across all 403 phecodes.

Factor 9 shows strong genetic correlations with previous GWAS of trauma exposure49 (rg = 0.93[0.02]; InterceptGcov = 0.53[0.01]), childhood maltreatment50 (rg = 0.81[0.02]; InterceptGcov = 0.46[0.01]) and PTSD51 (rg = 0.75[0.07]; InterceptGcov = 0.13[0.01]). It has moderate genetic correlations with external GWAS of psychiatric (for example, schizophrenia52 rg = 0.35[0.03]; InterceptGcov = 0.02[0.01]) and substance use (for example, cannabis use disorder53 rg = 0.45[0.05]; InterceptGcov = 0.01[0.01]) outcomes (Extended Data Fig. 7). Of 8 genome-wide significant factor loci, 4 have been previously identified in GWAS of trauma phenotypes49,50. The remaining 4 loci novel to trauma associations have been identified in previous GWAS of psychiatric, behavioural, neural and medical outcomes (for example, refs. 54,55,56,57; Supplementary Table 8). Within the Add Health cohort, the Factor 9 PGS outperforms those derived from previous GWAS of trauma-related constructs49,50,51 in predicting adulthood psychiatric diagnoses (Supplementary Table 12), providing further evidence for an increase in genetic signal when aggregating across multiple facets of trauma-related experiences, as well as for the generalizability of that signal to an independent sample.

Bundling related health behaviours to boost association power

Health behaviours represent a strong candidate for factor analysis since multiple observed indicators probably reflect a generalized underlying tendency capturing individual differences. Factor 23, for example, includes traditional self-reported measures of exercise frequency and duration, as well as other physical and social activity measures such as spending time outdoors, playing sports and walking for pleasure (Supplementary Table 5). It also incorporates dietary items such as fruit, vegetable and oily fish intake, probably reflecting broader correlations across pro-health behaviours. Modifiable health behaviours such as these have been shown to mitigate adverse medical outcomes, with inactivity accounting for 1% of disability-adjusted life years lost globally58. Indeed, Factor 23 has strong correlations across all categories of medical phecodes and a substantial association with prospective survival (HR = 0.77[0.76-0.79]; Fig. 2a).

Analyses linking Factor 23 to other biobank data identify clearer associations than comparable analyses using any individual top item (Fig. 6). Prospective analyses of all-cause mortality, for example, show weaker effects for each of Factor 23’s 10 top-loading items (incremental pseudo-R2 = 1.27 × 10−3 for factor vs 6.75 × 10−10 to 1.12 × 10−3 for individual items). An unweighted sum of z-scores across these top items is also more weakly associated with survival (incremental pseudo-R2 = 1.00 × 10−3) than the weighted sum used by the factor scores (Fig. 6a). Even with optimal weighting, at least 7 items are necessary to capture 80% of total factor variance, suggesting that signal in Factor 23 relies on the ability to summarize information across related items.

Fig. 6: Comparative performance of Factor 23 versus individual items.
figure 6

The factor score is compared to each of the top 10 items and to an unweighted sum of z-scores of those items. a, Comparison of incremental R2 for mortality prediction in N = 217,393 individuals with complete data for the included items and sufficient accuracy in the independent-variable factor scores for Factor 23 (Methods). The comparative baseline model for each included covariates for the first 20 genetic PCs, age, chromosomal sex, age2, age × chromosomal sex, age2 × chromosomal sex, dummy variables representing the assessment centres of origin and days from baseline assessment to T0. b, Comparison of point estimates of heritability. Results show estimated observed-scale SNP heritability ±1 standard error from GWAS with the listed sample size (N). c, Comparison of variance explained by polygenic scores for Factor 23 vs its top 3 component items for 5 relevant traits in the external Add Health study. Barplots show estimated variance explained (change in R2 from adding polygenic scores to linear regression models for each outcome (Methods)) with error bars representing 95% bootstrapped confidence intervals, with a lower bound of 0 for visualization purposes. See Supplementary Table 13 for comparison to all top 10 items.

This stronger signal enables GWAS of Factor 23 to identify the largest number of significant loci so far for a self-reported measure of physical activity. Of 34 loci, 24 are not significant in GWAS of the top 5 items, and 25 have not previously been identified in GWAS of self-reported physical activity, although 14 were significant at a less stringent threshold (that is, P < 5 × 10−5 (ref. 59)). Similarly, consistent with the trend across the factors, the SNP heritability of Factor 23, 0.07(0.003), is higher than most of its component items and sum of top-item z-scores (hg2 = 0.05 [0.004]; Fig. 6b). This increase in power probably comes as a result of leveraging the correlated structure across items and moving towards a more continuous construct measure. Furthermore, heritability for Factor 23 is enriched for regions of the genome associated with central nervous system cell types (P = 2.52 × 10−9; Extended Data Fig. 6), indicating that physical activity is primarily linked to brain and behaviour, rather than traits such as muscle tone or cardiovascular health. Among top items in Factor 23, there is nominal evidence for a partially causal effect of recent participation in strenuous sports on current self-rated health (LCV genetic causality proportion = 0.37, P = 5 × 10−3; Supplementary Text and Fig. 10), additionally supporting a role for motivation and socialization in linking activity to perceived health.

Echoing results within UKB, polygenic score analyses within the independent Add Health dataset reveal that the Factor 23 PGS outperforms polygenic scores formed from each of its top 10 items in predicting medical and health behaviour outcomes traditionally thought of as relevant to physical activity (for example, R2 = 1.4% for hours of physical activity per week, R2 = 2.1% for BMI and pseudo-R2 = 3.4% for cardiovascular disease; Supplementary Table 13 and Fig. 6c). The Factor 23 PGS even outperforms a multivariate regression model including all 10 item-based PGS (Factor 23 R2 = 3.4% [2.2%,4.7%]; combined top item PGSs R2 = 2.6% [1.4%,3.9%]), further emphasizing the power gained from combining measures across the phenome in a way that leverages underlying correlation structure.

Discussion

The results of this study demonstrate that applying a model-based data reduction technique to hundreds of diverse items across a biobank can distill the phenotypic landscape into tractable latent constructs that differentiate individuals across interpretable axes of variation. Latent factors extracted from UKB link causes, correlates and consequences of health, behaviour and disease at the cohort level, empowering deeper analyses of the identified domains with linked genetic, registry and biomarker data, and revealing patterns that are otherwise obscured by the sheer volume of data.

For researchers interested in the structure of human variation across the phenome, these results provide a critical advance. While approaches such as principal component analysis can reduce dimensionality and maximize variance explained, FA explicitly models relationships between observed variables and the broad constructs that underlie them, with its emphasis on ‘simple structure’ ensuring that the content of each factor can be inferred from its top-loading items17,18,19 (see Supplementary Text and Table 3, and Extended Data Fig. 3 for comparison). Indeed, within the medical domain in particular, our extracted factors capture causes, correlates and consequences of diagnoses such as asthma (Factor 11) and coronary artery disease (Factor 16), in a manner that is hypothesis-free and driven entirely by the correlation structure of the underlying data.

Summarizing items across the phenome into a smaller number of estimated factor scores based on correlation structure can increase heritability and power for downstream analyses (Fig. 3). We show that factors are, on average, more heritable than their component items, which leads to greater power for genetic discovery, particularly for loci that capture shared signal. Notably, these relative gains in power are not uniformly distributed, with factors containing mostly dichotomous or self-report survey items, rather than empirical measurements, experiencing the greatest increases in heritability (Extended Data Fig. 9). Factor 23, for example, leverages the correlation structure between self-reported physical activity, engagement in sports and other pro-health behaviours to yield factor scores that have greater heritability than 9 of the factor’s top 10 items, and outperform all top 10 items in prospective survival prediction in UKB and in variance explained in morbidity outcomes by polygenic scores in a separate, US-based sample (Fig. 6 and Supplementary Table 13). Polygenic scores for Factor 9, which combines indicators of trauma and its sequelae in a hypothesis-free, data-driven manner, similarly outperform PGS based on previous binary or summative trauma measures49,50,51 (Supplementary Table 12). In each case, these results support considering multiple indicators across the phenome, particularly when studying complex human phenotypes that cannot be empirically assayed.

Beyond yielding increases in power, factors underscore important trends in the social and behavioural sciences, a number of which we have described in vignettes within this manuscript. Factors 5, 10 and 15, for example, successfully deconvolve indicators of socioeconomic status into subdomains, in a way that generalizes to an independent US-based sample (Fig. 4). This data-driven demonstration of the multifaceted nature of SES supports longstanding hypotheses (for example, ref. 40) of separate axes of education, income, occupation and other elements of social standing, and begins to parse their differential relationships to health outcomes (Supplementary Table 7 and Extended Data Fig. 4). Factor 23 links physical activity with self-rated health and significantly predicts mortality despite having been constructed as orthogonal to other factors capturing physical and mental health outcomes (Fig. 2a), reinforcing evidence that physical activity improves quality of life and has numerous health benefits independent of disease status60,61. Factor 9, meanwhile, draws on the strength of relationships between multiple trauma indicators across the phenome, combining them in a data-driven way to demonstrate robust and ubiquitous associations with morbidity and mortality (Fig. 5). While such associations with trauma are neither unexpected nor individually novel, no analysis to our knowledge has articulated such associations at this scale across the phenome; this underscores the powerful public health impact of trauma and the value of linking multiple items together when assessing this complex construct47.

Our approach represents only one possible way of distilling biobank-scale data into informative measures. Although comparison to PCA provides some initial intuition on the benefits of factor analysis (see Supplementary Text and Extended Data Fig. 2), this study is exploratory and not intended to recommend a single best approach. Indeed, we anticipate that comparisons across methods will prove most informative. For instance, our results modelling observed phenotypic correlations may serve as a useful comparator to those of recently developed methods for decomposing genetic correlations8,25. Furthermore, within the medical domain, supplementing curated, clinician-driven methods for defining disease phenotypes from registry data (for example, refs. 62,63) with data-driven connections to survey and lab-based measures where available, as suggested by our factors, could prove useful in stratifying individuals for prognosis and treatment.

The results are subject to several important caveats and limitations (see Supplementary Text for additional discussion). Perhaps most importantly, latent constructs identified via factor analysis are not ‘real’; they are simply statistical relationships, or the weighted linear combination of items that capture the correlation structure from the overall dataset, regardless of whether there is some true underlying phenomenon driving the measured variables. Factors are therefore non-trivially dependent on the nature of the dataset used, including the variables measured, the characteristics of the participants and the sociodemographic context in which the data were collected64.

Participants at the core of this study are, for example, unlikely to be representative of the global population. Analyses were restricted to UKB participants of predominately estimated European genetic ancestry, thus limiting generalizability of results, particularly as it relates to genetic inference27. We openly recognize this as a key limitation of these analyses, which we highlight in the introduction to this paper to emphasize its critical importance.

As an additional limitation, UK Biobank participants are known to be non-representative of the UK population as a whole, with documented ascertainment and participation biases (for example, refs. 3,65,66). Even among those participating in the study, individuals providing more complete initial responses and contributing to later optional follow-up surveys are known to be, in general, healthier and more educated67,68. This trend is reflected in our own core, low-missingness data group being substantially more likely to report having completed college or university (45.7% core group, 30.7% non-core group; \({\chi }^{2}\) = 3,816.0, P < 0.001). While polygenic score analyses in Add Health suggest that genetic signal captured in the fitted factors at least partially generalizes to individuals of European ancestry in the United States, identifying which constructs elucidated in this study are most robustly maintained across different sociopolitical, cultural and diagnostic contexts requires future work in additional cohorts.

Adapting factor analysis to permit execution at the scale and structure of UK Biobank also forced us to make analytic decisions that introduce a number of methodological limitations. Most prominently, we required orthogonality across the factors to reduce the number of parameters to be estimated and substantially simplify matrix algebra operations. As such, the variance captured by each factor is forced to be independent, complicating interpretation; specifically, each factor must be viewed as representing the covariance structure of the items within it, once accounting for covariance modelled by the other factors. Caution must therefore be exercised in interpreting relationships within and between factors, since this independence is an artefact of the modelling approach. Relatedly, although full information maximum-likelihood estimation methods remain the ‘gold standard’ for modelling missingness and categorical items, their computational burden proved intractable at this scale69. Instead, we used pairwise deletion and Pearson correlations for the exploratory factor analysis (EFA), and used diagonal weighted least squares (DWLS) to fit the confirmatory factor analysis (CFA), which we expect to have weaker relative performance but to nevertheless be an acceptable alternative (see Supplementary Text for additional details)70. Future computational advances, including the use of graphics processing units, along with continued methods development70, may allow for factor analysis and structural equation modelling to be more widely and more optimally applied at biobank scale.

As the first example to our knowledge of the factor analytic approach applied across multimodal data at biobank scale, these results provide a proof-of-concept that such methods can return both sensible and insightful relationships. This approach provides an important first step towards better embracing the full and complex measured phenome to power discovery for human health and wellbeing.

Methods

Ethics approvals

All research was conducted in compliance with all relevant ethics regulations. Use of the UKB data was approved under application 31063. Analysis of the UKB data was reviewed by the Partners HealthCare Institutional Review Board (IRB) (Partners Human Research), which determined in expedited review that the project met the US federal criteria definition of ‘not human subjects research’. Analysis of the Add Health data was reviewed by the Office of Research Subject Protection (OSRP) at the Broad Institute of MIT and Harvard, which determined that the project met US federal criteria for exemption from IRB review.

UK Biobank cohort

Sample selection and quality control

The UK Biobank is a longitudinal health study of roughly 500,000 volunteers between the ages of 40–69 at recruitment time, which was between the years of 2006–2010. Starting from the 487,409 genotyped participants in the UK Biobank second round release, we first subset to unrelated individuals with low autosomal missingness rates used for PCA in ref. 3. We adopted this selection to aid consistency with other UK Biobank applications and analyses. We then restricted the cohort to individuals of predominantly estimated European genetic ancestry based on analysis of the top six principal components (PCs). This selection was intended to be more broadly inclusive than the ‘white British’ criteria used by the UK Biobank genetics team3 while still restricting to a sufficiently homogeneous and unrelated set of individuals to permit GWAS with conventional linear regression. After ancestry selection, we made final exclusions for individuals who withdrew from UK Biobank participation before the GWAS analysis and individuals who were omitted from imputation phasing (for example, individuals with sex chromosome aneuploidies). After all sample quality control (QC), there were 361,194 QC positive individuals. Between initial QC and the start of analyses for the current study, an additional 50 participants withdrew from UKB, resulting in a final N of 361,144.

Phenotype curation

A core challenge to the analysis of such a wide range of phenotypes as those available in the UK Biobank is the curation and harmonization of the large number of variable scalings, categorizations and follow-up responses. To automate this process, we used a modified version of the PHEnome Scan Analysis Tool (PHESANT71). Unlike standard PHESANT, the modified version does not perform association analyses but simply generates a collection of recoded phenotypes.

The incorporation of new phenotypes requires careful examination of raw data codings and, in the case of binary phenotypes, consideration of control definition. Recodings of variables and inherent orderings of ordinal categorical variables are defined in the data-coding file, available in the GitHub repository https://github.com/astheeggeggs/PHESANT. We restricted the phenotype data to those that belong to individuals in the unrelated GWAS subset, and ran the modified version of PHESANT on the phenotypes in the UK Biobank application. For continuous phenotypes, we retained the raw version of the continuous phenotype, with no transformation applied to the data. We processed 3,011 unique phenotypes using PHESANT. For all binary phenotypes, we required a minimum case count of 100.

Along with these PHESANT-curated phenotypes, we also processed 633 ICD10 disease codes, treating all individuals with a specific ICD10 code as cases. The remaining UK Biobank samples were treated as controls. Curation of the ICD10 codes was carried out separately for computational efficiency. For the ICD10 phenotypes, individuals were assigned a vector of ICD10 primary diagnoses. We truncated these codes to the three-digit category level and assigned each individual to either case or control status for that ICD10 code in turn by checking whether their vector of primary diagnoses contains that code. Throughout, we assumed that the data contained no missingness, so the sum of cases and controls is the number of individuals in the European-ancestry subset of the UK Biobank data. Consistent with our treatment of binary phenotypes, ICD10 code case/control phenotypes were removed if less than 100 individuals in the European-ancestry subset had a given phenotype as a primary diagnosis.

Add Health sample

Add Health originated as an in-school survey of a nationally representative sample of US adolescents enrolled in grades 7 through 12 during the 1994–1995 school year31. Respondents were born between 1974 and 1983, and a subset of the original Add Health respondents had been followed up with in-home interviews, allowing researchers to assess correlates of outcomes in the transition to early adulthood. In Add Health, the mean birth year of respondents is 1979 (s.d. = 1.8) and the mean age at the time of assessment (Wave 4) is 29.0 years (s.d. = 1.8).

Factor analysis modelling

The factor analysis model treats observed variables X as measures of a smaller number of unobserved latent factors F, with corresponding effect sizes or ‘loadings’, Λ, and item-specific residuals ϵ17,18,19,72. Specifically, let

$$X=F{\varLambda }^{{\prime} }+\epsilon$$
(1)

where X is the n × p matrix of p centred and standardized observed variables for n individuals, F is the n × t matrix of n individuals’ values for t latent variables, and Λ is the p × t matrix of the effects of the t latent variables on the p observed variables. This model is easily extendable to the inclusion of ‘nuisance’ covariates, either by explicitly adding terms for covariates or residualizing them out of X. For the purposes of our analyses, we chose the latter approach, which is described in later sections.

We are able to fit this model by considering the observed covariance across items.

$$\Sigma =\varLambda {F}^{{\prime} }F{\varLambda }^{{\prime} }+{\epsilon }^{{\prime} }\epsilon$$
(2)

Assuming that the observed covariance between the items is fully explained by the latent factors, we denote \(\Psi =E[{\epsilon }^{{\prime} }\epsilon ]\) as the p × p diagonal matrix of residual variances per item (that is, item uniquenesses). We additionally chose to model the latent factors as independent and fixed their scale to have unit variance, allowing us to model the observed covariance between the items with the matrix decomposition

$$\Sigma =\varLambda {\varLambda }^{{\prime} }+\Psi$$
(3)

Multiple methods exist for the extraction of latent factors which generally yield similar results; we considered multiple such methods below. Once factors are extracted, however, there are an infinite number of equivalent Λs up to a particular rotation; this is referred to as ‘rotational indeterminacy’. To uniquely fit the model, additional optimization criteria must be specified for Λ. For the current analysis, we relied on the ‘varimax’ rotation, one of a number of standard rotations that encourages sparsity, or a ‘simple structure’, on the model by penalizing factors or items with multiple large loadings. This rotational restriction facilitates interpretability, as the majority of the signal associated with a particular factor can be identified on the basis of a limited number of its top-loading items.

In the current study, we undertook a multistage approach to best meet the assumptions of the factor analysis methodology while adapting it for such large-scale data (Extended Data Fig. 1). We first identified a core data group consisting of 42,325 individuals and 898 items from which a stable pairwise correlation matrix could be estimated across a range of questionnaires and assessments. We split this group into modelling (N = 33,860) and holdout (N = 8,465) subgroups on the basis of an 80:20 split. After systematically removing collinear items from the dataset (730 items remaining), we performed an EFA within the modelling subgroup to determine the factor structure (that is, the number of latent factors t and which elements of Λ are non-zero). We then further refined the model suggested by the EFA, utilizing a structural equation model for CFA in the same modelling subgroup. We tested the fit of this final model, with constrained parameters, in the holdout sample. These steps are described in more detail below, as well as in the Supplementary Text.

Core data group

To enable the estimation of reasonably unbiased pairwise correlations between variables, we began with all individuals of European ancestry (N = 361,144) and all phenotypes analysed in both sexes in the initial release of the Neale Lab UKB Round 2 mega-GWAS (2,772 phenotypes; https://www.nealelab.is/uk-biobank/ukbround2announcement). We first identified a core group of individuals with a high rate of assessment completion (N = 42,325; see Supplementary Text: Selection of individuals for core data group). We then identified items with low missingness in this core group, sufficient prevalence (>1%), and non-structured and non-item-dependent missingness, from which pairwise correlations could be successfully estimated (898 items; see Supplementary Text: Item selection for core data group). The overall missingness rate in this final core data group was 9.1%, with missingness on each item of up to 28.6% (s.d. 10.7%), and for each individual up to 33.3% (s.d. 7.9%). See Supplementary Text: Characteristics of core data group, for more information about demographic and item composition.

From this core data group, we systematically removed collinear items to improve the stability of factor analysis estimation. Starting with a Pearson correlation matrix residualized for our chosen ‘nuisance’ covariates (that is, first 20 genetic PCs, age, chromosomal sex, age2, age × chromosomal sex, and age2 × chromosomal sex; see Supplementary Text: Selection of ‘nuisance’ covariates), we removed items that were redundant between observed and derived items (67 items), highly correlated with missingness (6 items), had pairwise correlations r > 0.95 (43 items), had squared multiple correlation (SMC) > 0.98 (43 items) or had correlation induced by ‘None of the above’ response categories (6 items). After these exclusions, 730 items remained for factor analysis.

Finally, this core group was further divided into modelling (N = 33,860) and holdout (N = 8,465) groups on the basis of an 80:20 split to avoid bias from overfitting in the evaluation of model fit.

Exploratory factor analysis

We performed an EFA using the ‘psych’ package in R (v.4.0.2) on a partial pairwise Pearson correlation matrix—residualized for the first 20 genetic PCs, age, chromosomal sex, age2, age × chromosomal sex, and age2 × chromosomal sex—within the modelling subgroup (N = 33,860) to determine the factor structure.

Conventional methods for selecting the number of latent factors provided inconsistent results for the data: the scree plot suggested 30–50 factors (Supplementary Fig. 2), parallel analysis suggested 177 factors, and 253 eigenvalues of the correlation matrix were >1. To best model large-scale structure across phenotypic items within UKB, we therefore based our decision for number of factors to extract primarily on the inflection point of the scree plot, in tandem with measures of model stability and factor non-triviality (see Supplementary Text: Exploratory factor analysis, for further discussion).

We explored factor solutions with an increasing number of factors using weighted least square (WLS), generalized weighted least square (GLS), minimum residual (MINRES) and unweighted least square (ULS), all with ‘varimax’ rotation to extract orthogonal factors. As an upper bound, Heywood and ultra-Heywood cases were observed when fitting more than 169 factors with GLS, 186 factors with WLS or 38 factors with ULS or MINRES. Inspection of these models found that WLS and GLS yielded many factors with strong loadings (>0.3) for at most 1 item, while the ULS or MINRES solutions generally yielded factors incorporating variation from multiple items (Supplementary Fig. 3).

On the basis of stability and non-triviality (that is, lack of factors with only 1–2 items with loadings >0.3), we selected a 36-factor MINRES solution (MINRES-36; root mean square of residuals = 0.02, variance explained = 18.5%; Supplementary Fig. 3d) as our preferred EFA model for subsequent refinement.

Principal component analysis

We compared the factor analysis to PCA, another commonly used dimensionality reduction method, using the same partial Pearson correlation matrix that served as input to the EFA. Eigenvalues and eigenvectors of this input matrix were computed using the linalg.eig function within numpy in Python. Eigenvectors corresponding to the 36 greatest eigenvalues, in order, were extracted for comparison to the 36-component model of the EFA.

Confirmatory factor analysis

We further refined the model suggested by the EFA, utilizing a structural equation model in the same modelling subgroup. CFA allowed us to test the fit of a more parsimonious model (that is, omitting loadings with small estimates in the EFA). Within the CFA, we also more appropriately modelled the covariance structure of the diverse variable types (that is, binary, ordinal, continuous), and more robustly handled missingness, in contrast to our decision to use a partial pairwise Pearson correlation matrix as input to the EFA.

For the 564 variables from the EFA with loadings >0.1 on at least one factor (see Supplementary Text: Selection of minimum loading for factor inclusion; Supplementary Fig. 4), missing data were imputed using classification and regression trees (CART) within the Multivariate Imputation by Chained Equations (MICE) package73 in R, with all covariates as well as 20 additional auxiliary variables (for example, previously excluded ‘gatekeeper’ items and assessment centre) included as predictors. Items whose missingness pattern depended on the target item’s missingness were omitted as predictors for that target item. Evaluation of this approach using synthetic missingness at random (MAR) and completely at random (MCAR) showed good convergence and minimal systematic bias. Comparisons of correlation and covariance matrices generated using pairwise deletion versus imputation revealed them to be nearly identical (Supplementary Text: Multiple imputation of core data group; Supplementary Fig. 5); as such, to conserve computational resources, only a single imputed dataset was used for modelling purposes.

Confirmatory factor analysis models were fit to the imputed data for the modelling group (now N = 33,854 due to participant withdrawals during the course of the study) using structural equation modelling with an extensively modified version of the lavaan package74 (v.0.6-3) in R (see Supplementary Text: Computing aspects of structural equation modelling; adapted code is available at https://github.com/ce-carey/ukb-factor-analysis). Correlations between variables were estimated as appropriate for their measurement scale (for example, polychoric for pairs of ordered variables, Pearson for pairs of continuous variables and polyserial for pairs containing one of each), assuming an underlying normal distribution. Continuous variables were standardized before modelling, and all variables were modelled conditional on exogenous ‘nuisance’ covariates (that is, first 20 genetic PCs, age, chromosomal sex, age2, age × chromosomal sex, and age2 × chromosomal sex). Model parameters were estimated using DWLS, with final robust standard errors and test statistics calculated using a scale-shifted approach (that is, the WLSMV option in lavaan).

Fitting the EFA-derived model using CFA yielded a number of initial errors due to a lack of estimable pairwise correlation (due to collinearity) and cell sizes of 0 for ordinal variables. After removing 9 items to address these errors, 23 Heywood cases remained, indicative of overfitting and near collinearity. These new instances of collinearity were observed in part due to the lenient pairwise r > 0.95 threshold used for the initial EFA and the addition of latent modelling of ordinal and categorical variables in the CFA. Items were iteratively removed until no negative residual variances remained (505 final items). In addition, once these items were removed, one factor (Factor 8) overlapped completely with another (Factor 4) and was removed to facilitate model fitting (see Supplementary Text: Differences between EFA and CFA). Supplementary Table 4 documents the reason for each variable’s exclusion from the EFA to the final factor model.

Finally, we noticed that misfit in certain parts of the model was being driven by the presence of extreme outliers (see Supplementary Text: Extreme outliers of continuous variables). Therefore, we removed from analysis all individuals in the core group with values greater than 20 standard deviations from the mean on any continuous variable (N in modelling group = 52; N in holdout group = 13). This resulted in a final N of 33,802 in the modelling subgroup.

To evaluate the applicability of the factor model beyond the modelling subgroup, we obtained fit metrics in the validation holdout subgroup (initially N = 8,465; N = 8,452 after removing continuous-variable outliers) while constraining the model parameters (that is, factor loadings) to those estimated in the training subgroup.

Traditional assessments of model fit include model chi-square (P value should not be significant), RMSEA (values 0.01, 0.05 and 0.08 indicate excellent, good and acceptable fit, respectively), SRMR (values < 0.08 indicate good fit), CFI (values > 0.90 indicate good fit) and Tucker Lewis index (TLI; values > 0.90 indicate good fit). For the sake of completeness, we report all of the above values for the EFA and for both the modelling and holdout samples for the CFA in Supplementary Table 2. In assessing overall model fit, we relied primarily on the absolute fit measures (for example, RMSEA) since they are better suited to evaluating how well our fitted model approximates the structure of the observed phenotypic data (see Supplementary Text: Assessment of model fit, for more on interpretation of absolute and relative fit in the context of this analysis). In addition, where applicable we report both the uncorrected and ‘scaled’ versions of these metrics, where the scaled values rely on the adjusted test statistic from fitting the model with DWLS. We caution that neither of these values are fully robust for use with categorical data, but no better alternatives are currently available without computation of the likelihood, which is currently infeasible here70.

Factor scores

On the basis of the final factor model, we then generated latent factor scores for each individual from the values of their observed indicator items (see Supplementary Text: Factor score generation, for full details). For each individual i, the estimated factor score for factor t is a weighted sum of the items xj

$${\hat{f}}_{i,t}=\sum _{j}{a}_{j,t}{x}_{i,\;j}$$
(4)

If we take the factor model as true, then the resulting estimates are of an individual’s ‘true’ score for the underlying latent construct, otherwise they simply estimate the value that best approximates the observed data for each individual with the low rank approximation of the complete data modelled by the CFA. The current analysis used two sets of factor scores, corresponding to two different estimation methods to compute the factor scoring coefficients \({a}_{j,t}\), following previous recommendations to avoid biased results in factor score regression75.

Dependent variable factor scoring coefficients

Where factor scores were used as the dependent variable in an analysis (for example, GWAS), we calculated \({\hat{f}}_{i,t}\) using factor scoring coefficients computed with Bartlett’s method76:

$${A}_{\rm{B}}={\hat{\Psi }}^{-1}{\hat{\varLambda }}{({\hat{\varLambda}}^{{\prime} }{\hat{\Psi }}^{-1}{\hat{\varLambda}})}^{-1}$$
(5)

where X is the n × p matrix of p residualized and standardized observed variables for n individuals, \({\hat{\varLambda}}\) is the \(p\times t\) matrix of estimated factor loadings and \({\hat{\Psi}}\) is the \(p\times p\) diagonal matrix of estimated residual variances (that is, item uniquenesses). Bartlett’s estimator is a weighted least squares solution that minimizes the residual variance of the items given the factor scores, weighting by the fitted item uniquenesses from the model.

Independent variable factor scoring coefficients

Where factor scores were used as the independent variable in the analysis (for example, mortality), we used factor scoring coefficients computed with the Thomson–Thurstone (regression) method72,77,

$${A}_{{\rm{TT}}}={\hat{\Psi }}^{-1}{\hat{\varLambda}}{(I+{\hat{\varLambda }}^{{\prime} }{\hat{\Psi }}^{-1}{\hat{\varLambda}})}^{-1}$$
(6)

where \(I\) is an identity matrix. These factor scoring coefficients give the best linear prediction of the factor score, minimizing the sum of the expected mean squared error across factors.

Adjustments for categorical items

The above framework for the factor score estimators assumes that all of the item data \(X\) are observed and residualized for exogenous ‘nuisance’ covariates. Although this is true for observed continuous items in the model, it is not true for the CFA model where the observed data are categorical and modelled through a link function.

To address the different measurement scale for categorical variables, we estimated the expected value of each individual’s latent continuous variable given the observed categorical item and the fitted probit regression with exogenous ‘nuisance’ covariates. The residual between this expected latent value and the value predicted by the covariates was then substituted for the observed categorical variable in the factor scoring calculations.

The categorical variables would also have weaker covariances with the factor than the unobserved latent values modelled for the CFA loadings. To account for this attenuation, we transformed the loadings and residual variances corresponding to categorical items for use in factor scoring (see Supplementary Text: Modifications for categorical and missing data). Specifically, we used

$${\hat{\lambda }}_{j,t}\sqrt{\mathrm{var}\left({x}_{j},|,{z}_{k}\right)}\times \frac{\sum _{c\epsilon C}h(c)}{{\sigma }_{x}}$$
(7)

as loadings, where \({\hat{\lambda }}_{j,t}\) is the loading for item j on trait t estimated in the CFA, \(h(c)\) is the density of the standard normal distribution at the fitted probit threshold for each category \(c\) of the categorical variable, the variance of the categorical item conditional on the covariates \(\mathrm{var}\left({x}_{j},|,{z}_{k}\right)\) was estimated empirically from the residual expected latent values described above, and the standard deviation of the categorical item \({\sigma }_{x}\) was estimated from the class probabilities. Similarly, we substituted

$$\mathrm{var}\left({x}_{j},|,{z}_{k}\right)\times\left(1-\left[1-{\psi }_{{jj}}\right]\frac{{\left[\sum _{c\epsilon C}h(c)\right]}^{2}}{{\sigma }_{x}^{2}}\right)$$
(8)

as the residual variance for categorical items in the factor scoring equations, where \({\psi }_{{jj}}\) is the estimated residual variance in the CFA and the remaining terms are defined as in the transformation of the loadings.

Adjustments for missingness

Factor score estimates that sum across all items as described above cannot be computed for individuals with missing data. Instead, for each individual with a set of missing items M, we computed factor scoring coefficients optimized for the subset of items that are observed, that is

$${A}_{{\rm{B}},-M}={\hat{\Psi}}_{-M}^{-1}{\hat{\varLambda}}_{-M}{\left({\hat{\varLambda}}^{\prime}_{-M}{\hat{\Psi}}_{-M}^{-1}{\hat{\varLambda}}_{-M}\right)}^{-1}$$
(9)

for dependent variable (Bartlett) factor scores and

$${A}_{{{\rm{TT}}},-M}={\hat{\Psi}}_{-M}^{-1}{\hat{\varLambda}}_{-M}{\left(I+{\hat{\varLambda}}^{\prime}_{-M}{\hat{\Psi }}_{-M}^{-1}{\hat{\varLambda}}_{-M}\right)}^{-1}$$
(10)

for independent variable (Thomson–Thurstone) factor scores.

The resulting factor score estimates maintain their desired relationship to other variables conditional on each missingness pattern, consistent with the bias-avoiding method of factor score regression, but the different amounts of information about the factor available for each individual leads to heteroskedasticity in factor scores across the missingness patterns. For analyses with the factor score as the dependent variable, we addressed this heteroskedasticity using WLS regression with estimated inverse-variance weights

$${w}_{i,t}=\frac{1}{{A}_{-M,t}^{{\prime} }{S}_{-M}{A}_{-M,t}}$$
(11)

where S is the sample covariance matrix of pairwise complete observations after residualization for exogenous ‘nuisance’ covariates. For linear regression analyses with factor scores as the independent variable, we used Huber–White sandwich standard errors78. Detail on motivation for each of these adjustments is provided in the Supplementary Text.

To further limit potential artefacts from heteroskedasticity related to structured missingness and ensure consistent interpretation of scores across UKB participants, individuals were included in factor score regressions if the factor score from their observed items was expected to correlate (\({r}^{2}\ge 0.8\)) with what their estimated factor score would be with if all items were observed (see Supplementary Text: Minimum correlation with complete data scores; Supplementary Fig. 6). Because we observed this threshold to be more universally liberal in the independent than in dependent factor scores, to allow for better concordance in samples across phenotypic and genetic analyses, we further restricted phenotypic analyses with the independent variable factor score to only those individuals included in the genetic analyses, resulting in sample sizes ranging from 75,226 (Factor 24) to 360,656 (Factor 20) (mean N = 252,219.571[121,829.646]).

Validation of factor scoring adjustments

To validate our factor-score-generating methodologies, we compared scores from our methods to those generated using a maximum-likelihood (ML)-based method in lavaan74 for the core data group. Factor scoring was performed using the ML option in lavaan for all 10 multiple imputations of the core dataset. To test for phenotypic concordance across methods, we obtained Pearson correlation coefficients between factor scores generated using our method and the mean of those obtained in lavaan across all 10 imputations. GWAS of the lavaan-generated scores were conducted with WLS using inverse-variance weights estimated on the basis of the observed variance in scores across individuals and across imputation replicates. Heritability and genetic correlation between these GWAS results and the GWAS for our dependent-variable factor scores were compared using linkage disequilibrium (LD) score regression79 (Supplementary Fig. 7).

Phenotypic association analyses

To further characterize the latent factors and also reveal potentially interesting associations, we tested the independent-variable factor scores for associations with 403 top-level phecodes and 28 biomarkers in the UK Biobank, as well as with prospective mortality. Covariates for these phenotypic analyses included the first 20 genetic PCs, age, chromosomal sex, age2, age × chromosomal sex, age2 × chromosomal sex and dummy variables representing the assessment centres of origin, to account for residual correlation between factors and these key variables (Supplementary Table 10).

Phecodes

Phecodes (1,685 items) were taken from the Pan-UK Biobank pan-ancestry GWAS project (https://pan.ukbb.broadinstitute.org/)80 and were derived from ICD-9 and ICD10 codes across a patient’s inpatient hospitalization records and, if applicable, death registry data. These diagnostic codes were mapped to descriptive phecodes using scripts from the University of Michigan (available at https://github.com/umich-cphds/createUKBphenome), which derived their mappings81,82 from those supplied by PheWAS Catalog (https://phewascatalog.org/). Phecodes were filtered to have a minimum case count of 250 in the full EUR sample, with a minimum of 25 cases per chromosomal sex, leaving 940 for analysis. Given the nested nature of the phecodes, such that top-level codes contain all diagnoses listed in subcodes, we restricted analyses to the 403 remaining top-level phecodes only. Associations were performed using a generalized linear model with a binomial link function and Huber–White (‘HC0’) robust standard errors78 using the statsmodels package (v.0.13.1) in Python.

Biomarkers

Serum biomarkers were obtained from the UK Biobank (28 items, after excluding rheumatoid factor and oestradiol for known QC issues83). Associations between each independent-variable factor score and biomarker were performed using ordinary least squares regression and Huber–White (‘HC0’) robust standard errors78 using the statsmodels package (v.0.13.1) in Python. Due to known issues with sample dilution83, an additional covariate was included representing the estimated serum sample dilution factor.

Mortality analyses

Given that the factors from our analyses represent major axes of measured phenotypic variation in the UK Biobank, it is plausible that they would be differentially associated with downstream mortality. We therefore performed Cox proportional hazards regression to assess relative risk of mortality across individuals on the basis of independent-variable factor scores.

Since surveys and assessments were administered at different times, to avoid issues of immortal time bias, \({T}_{0}\) was defined as the last contact an individual had with the UK Biobank study, within the context of the items included in the factors. Of the items included in the final factor model, the differently timed assessments included: baseline, a maximum of five 24-hour diet follow-up questionnaires, a work environment questionnaire and a mental health questionnaire. For example, if an individual completed the baseline assessment, mental health questionnaire and a 24-hour diet follow-up questionnaire, their \({T}_{0}\) would be their most recent questionnaire completion date. We included several ‘continuously updating’ items within the factors (for example, primary ICD10 codes and items relating to hospital stays, which are updated on the basis of linked inpatient hospital records). Therefore, for the mortality analyses, we recoded each individual’s factor scores on factors containing these ‘continuously updating’ items with their values at \({T}_{0}.\) If a person’s first instance of a primary ICD10 code was dated after their \({T}_{0}\), they would be recoded and rescored as ‘not’ having that code.

Date of death was obtained with linked death registries, and analyses were right censored to the earliest recommended censoring date across the death registries specified by UKB for England, and Wales and Scotland (that is, 30 September 2021 at the time of analyses). In addition to the standard phenotypic analysis covariates (that is, first 20 genetic PCs, age, chromosomal sex, age2, age × chromosomal sex, age2 × chromosomal sex and dummy variables representing the assessment centres of origin), a covariate was added representing days from baseline assessment to \({T}_{0}\). Analyses were performed using the lifelines package (v.0.26.4) in Python.

Genome-wide association analysis

Variant QC

Over 92 million imputed autosomal and X chromosome variant dosages were available in the UK Biobank release. The variant QC process focused on using widely adopted GWAS QC parameters to retain high-quality variants. After restricting to the 361,194 QC positive individuals, we retained SNPs with minor allele frequency (MAF) > 0.001, Hardy–Weinberg Equilibrium (HWE) P value > 1 × 10−10 and imputation INFO score > 0.8. INFO scores were taken directly from the UK Biobank SNP manifest file. The only exception involved SNPs annotated as having protein-truncating or missense consequences (from Ensembl VEP consequence annotation), where we relaxed the cut-off to MAF > 1 × 106. After variant QC, 13,364,303 autosomal variants were retained for association analysis.

GWAS model and implementation

GWAS of the dependent-variable factor scores were performed in Hail (https://hail.is/) using WLS regression. Variance weights were calculated as described in the ‘Factor score generation’ section above. To limit the impact of structured missingness and ensure consistent interpretation of scores across individuals, individuals were included only if their score based on their missingness pattern explained 80% of variance in a hypothetical observed factor score for which no items were missing (Supplementary Fig. 6). Covariates included the first 20 genetic PCs, age, chromosomal sex, age2, age × chromosomal sex, age2 × chromosomal sex and dummy variables representing the assessment centres of origin. Post-association test statistics were corrected for LD score regression (LDSC) intercept to reduce potential impacts of stratification. Effective sample size for genetic analyses, taking into account missingness, was calculated as the sum of each person’s inverse-variance weight divided by the regression weight assigned to a hypothetical person with 0 missingness, and ranged from 74,782 to 359,419 (mean =236,980.029[112899.969]).

Identification of significant independent loci

We used the FUMA84 pipeline to identify independent genomic loci. We considered an independent locus as the region including all SNPs in pairwise LD (r2 > 0.6), with the lead SNPs in a range of 250 kb from each other and independent from other loci at r2 < 0.1. We used the 1000 Genomes Phase 3 Europeans reference panel to determine LD.

Comparison of factor vs item GWAS

To investigate the genetic properties of the factors, we compared factor GWAS to the GWAS of component items. Summary statistics for all 505 items included in the final factor model are publicly available via the Neale Lab UKB Round 2 mega-GWAS (https://www.nealelab.is/uk-biobank/ukbround2announcement), and were adjusted for LD score intercept to be consistent with the factor summary statistics.

Independent loci for the 5 top-loading items per factor were identified using a local version of FUMA84 and identical parameters to those used to define the factor GWAS loci. Given that the intention for these analyses was to compare loci across top items and corresponding factors, loci were henceforth defined by their basepair intervals, and loci across GWAS (for example, for a factor and its top item) were combined if their basepair intervals overlapped.

Comparison of factor vs item SNP heritability

For each of the 505 items included in the final factor model, observed-scale hg2 estimates were downloaded from the publicly available Neale Lab UKB Round 2 mega-GWAS Heritability Browser (https://nealelab.github.io/UKBB_ldsc/index.html).

It should be noted that conventionally, to account for differences in heritability estimates by measurement type (that is, continuous vs binary items), estimates are reported on the liability scale. Indeed, this is the default estimate reported in the Heritability Browser. However, for factor scores there is not a clear transformation to the liability scale since they are neither binary (since they are weighted sums over a large number of items) nor fully continuous (as when individual binary/categorical items contribute substantial weight). For this reason, we report all heritabilities (for both factors and constituent items) on the observed scale. Although this will tend to understate the amount of underlying genetics that may exist for binary items, it will equitably indicate the amount of genetic signal captured in the item or factor as it is currently measured.

LD score regression analyses

LD score regression analyses of heritability, enrichment85 and genetic correlation79 were performed using LDSC (available at https://github.com/bulik/ldsc) with LD scores computed in individuals of European genetic ancestry from the 1000 Genomes Project. All analyses were performed with default settings except where otherwise indicated.

SNP heritability

SNP-based heritability was estimated on the observed scale for each factor using stratified LD score regression (S-LDSR)85 and v.1.1 of the baseline-LD model (available at https://alkesgroup.broadinstitute.org/LDSCORE/). We used this stratified model to more robustly fit variation in genetic signal across the genome, estimating per-SNP heritability conditional on 75 annotations, including functional categories, evolutionary constraint, histone marks, and LD- and allele frequency-related annotations. The default filter in LDSC for maximum chi-square was omitted to avoid truncating top hits at our large sample size.

Cell-type enrichment analyses

To gain insights into the underlying biology of the factors, we evaluated heritability enrichment of regions associated with cell-type-specific chromatin marks using S-LDSC85 and annotations derived from the Roadmap Epigenomics Consortium86, as described in ref. 85. To reduce multiple testing burden and give a broader summary of systems-level biology, we grouped the updated cell-type–specific annotations described in ref. 87 into 9 tissue groups (adipose, blood/immune, cardiovascular, central nervous system, digestive, liver, musculoskeletal/connective, pancreas, and other) following the same procedure described in ref. 85, taking the union of annotations belonging to each group. Consistent with recommendations in ref. 85, only factors with strongly significant heritability estimates (z > 7) were included in these analyses.

Genetic correlation analyses

Genetic correlations across factors were computed using LD score regression with LD score estimates derived from European-ancestry individuals in the 1000 Genomes Project. We additionally ran genetic correlations between our factors and 68 traits selected from previous GWAS, spanning a number of anthropometric, medical, psychological, behavioural and sociodemographic domains (Supplementary Table 9). Traits used for genetic correlation analyses were chosen before conducting the analyses, with the agreement of the coauthors. Note that given the near-ubiquity of the UK Biobank in modern genetics research, a number of these previous GWAS share overlapping samples and, at times, component phenotype definitions, with the current study. As such, we have reported genetic correlations with UKB samples removed within the text as well, when such summary statistics were available. We also report the genetic covariance intercepts, which are a proxy for the amount of sample overlap present across pairs of summary statistics within a genetic correlation analysis79.

Polygenic scoring

Polygenic scores in Add Health were constructed with LDpred88. LDpred estimates polygenic scores using SNP weights that estimate the conditional association of each SNP accounting for LD and the estimated genetic architecture of the trait, and has been shown to have greater prediction accuracy than conventional LD pruning followed by P value thresholding.

For the Add Health sample, we used the genotyped data from the Add Health prediction cohort to create the LD reference file. After imputing the genetic data to the Haplotype Reference Consortium89 using the Michigan Imputation Server90, we used only HapMap3 variants with a call rate >98% and a minor allele frequency >1% to construct the polygenic scores. We limited the analyses to European-ancestry individuals. Polygenic scores were calculated with an expected fraction of causal genetic markers set at 100%. In total, we used 1,168,025 HapMap3 variants to construct the polygenic scores in Add Health. We then used Plink91 to multiply the genotype probability of each variant by the corresponding LDpred posterior mean over all variants. For all sets of summary statistics, that is, from Factor Score GWAS summary statistics, individual item GWAS summary statistics or outside publicly available GWAS summary statistics, we ensured that Add Health was not included in the discovery GWAS sample.

We then determined the association of a given polygenic score and an outcome of interest in Add Health. Outcomes used in analyses were as follows: (1) years of completed education, (2) income, (3) a binary measure of having ever completed college, (4) a binary indicator of working either a high-paying or low-paying job, (5) a measure of cognition called the Peabody Picture Vocabulary Test, (6) a 1–5 Likert scale measure of neighbourhood satisfaction (‘If, for any reason, you had to move from here to some other neighbourhood, how happy or unhappy would you be?’), self-reports of doctor-diagnosed (7) panic or anxiety disorder, (8) bipolar disorder or (9) PTSD, (10) self-report of doctor-diagnosed cardiovascular disease, (11) self-report of doctor-diagnosed diabetes, (12) self-report of number of hours of physical activity per week, (13) measured BMI and (14) measured systolic blood pressure.

Prediction accuracy was based on an ordinary least squares (OLS) or logistic regression (depending on the outcome) of the outcome phenotype on the polygenic score and a set of standard controls, which included birth year, sex, an interaction between birth year and sex, and the first 10 genetic principal components of the variance–covariance matrix of the genetic data. Variance explained by the polygenic scores was calculated in regression analyses as either the R2 change (for continuous or quasi-continuous phenotypes), or the Nagelkerke’s pseudo-R2 change (for binary outcomes), that is, the R2 or pseudo-R2 of the model including polygenic scores and covariates minus the R2 or pseudo-R2 of the model including only covariates. The 95% confidence intervals around all pseudo-R2 values were bootstrapped with 1,000 repetitions each.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.