Main

Cell-specific transcriptional programmes in metazoans are established by transcription factors (TFs) binding specific DNA elements mostly within transcriptional enhancers1,2,3. However, the principles governing how thousands of enhancers and hundreds of TFs active in any cell type interact to produce cell-specific transcriptional programmes are largely unknown3,4,5. One major challenge is that virtually all genome-scale studies focus on characterizing sequences in enhancers and transcriptional regulators that have strong transcriptional activity measured in gene reporter systems6,7,8,9,10,11,12,13. However, emerging evidence suggests that critical developmental information is encoded in enhancers that drive weak tissue-specific expression patterns14,15,16. Such weak enhancers contain suboptimal TF-binding motifs and spacing, and mutant enhancers with optimized motifs drive elevated but less-specific patterns of transcription, leading to developmental defects14,15,16,17. These results suggest an important evolutionary trade-off between activity and specificity encoded within weak enhancers, also referred to as ‘suboptimization’14. Whether such a trade-off is encoded in TFs themselves is unclear. If so, understanding the sequence features that encode such a trade-off could enable the design of natural TF variants with customized cellular reprogramming and other functionalities.

The investigation of trade-offs in TFs is impeded by current models that TF specificity and activity are encoded in separate protein portions. The activity of mammalian TFs is thought to be mediated by sequence motifs that comprise a ‘minimal’ activation domain, which is distinct from the DNA-binding domain (DBD) that determines binding specificity (Fig. 1a). Minimal activation domains are typically short (9–40 amino acids) and tend to assume secondary structure when bound to co-activators10,11,13. The minimal activation domains however are almost invariably embedded within much longer intrinsically disordered regions (IDRs) that do not have a stable secondary structure (Fig. 1a)6,7,8,9,10,11. An emerging view suggests that TF IDRs may contribute to transcriptional activity by engaging in multivalent weak interactions. Such interactions can drive phase separation of TFs in vitro and partitioning of TFs into condensates enriched in co-activators and RNA polymerase II (RNAPII) in cells18,19,20,21,22,23. Whether the ability of TFs to form condensates is important for their in vivo function is debated18,19,20,24. Nevertheless, the deletion of IDRs of yeast TFs was shown to reduce genomic binding25, suggesting that TF IDRs may contribute to transcriptional activity and also to binding specificity.

Fig. 1: Traces of aromatic periodicity in human TF IDRs.

a, Model of a TF (top) and the method used to identify aromatic periodic blocks (bottom). b, The top 80 TFs ranked according to the IDR periodicity score. Ranks are shown in parentheses. The height of the bars in the outer circle is proportional to the periodicity score. The inner circles indicate whether the IDR contains a minimal activation domain (AD) identified in the four studies. c, Positioning of aromatic residues in NFAT5. Red dots indicate the position of aromatic residues in periodic block; yellow dots indicate the position of all other aromatic residues. d, Omega plot of the NFAT5 IDR. The empirical P value is reported. Red dots indicate aromatic residues, white dots indicate any other residue. e, Disorder plot (Metapredict; black) and AlphaFold2 pLDDT score (yellow) for HOXC4. f, Omega plots of the HOXC4 IDR (top) and the portion encoding the periodic aromatic block (bottom). The coordinates, ΩAro scores and the percentage of randomly generated sequences that have a lower ΩAro score than the actual sequence are provided. g, Representative images of droplet formation of purified recombinant HOXC4 IDR–mEGFP proteins. Scale bars, 5 μm. h, Relative amount of condensed protein in the droplet assays. Data are the mean ± s.d. of n = 10 images from two replicates. The curves were generated as nonlinear regressions to a sigmoidal curve function. i, Schematic (top) and results of luciferase reporter assays (bottom). The luciferase values were normalized to an internal Renilla control and the values are displayed as percentages of the activity measured using an empty vector. Data are the mean ± s.d. of n = 3 biological replicates. P values are from two-sided unpaired Student’s t-tests. j, Pipeline for the identification of regions with significant periodicity. k, Density plot of protein regions with significant periodicity. The length of the region is plotted against the lowest P value from the K–S test within the region. 
The depth of the colour is proportional to the density of the dots. The numbers of proteins that contain a region with significant periodicity over the total number of proteins in each category are shown. l, Omega scores of IDRs in various protein classes. P values are from one-way analysis of variance with Tukey’s multiple comparisons post test. For the box plots, the centre line shows the median, the bounds of the box correspond to interquartile (25th–75th) percentile, and whiskers extend to Q3 + 1.5× the interquartile range and Q1 − 1.5× the interquartile range; the dots beyond the whiskers show Tukey’s fences outliers. m, Schematic models of prion-like domains (PLDs) and TF IDRs, and their omega scores.


In this study we set out to investigate whether human TF IDRs are suboptimized (that is, their activity and specificity are submaximal because they are in a trade-off). To do so, we took inspiration from recent insights into prion-like IDRs of RNA-binding proteins to identify a single sequence feature in human TF IDRs that contributes to both transcriptional activity and binding specificity. Prion-like IDRs of RNA-binding proteins (for example, FUS, HNRNPA1 and TDP-43) encode regularly spaced aromatic residues whose number and periodic arrangement promote phase separation26,27. We found that hundreds of TF IDRs encode traces of aromatic periodicity. Optimization of aromatic dispersion enhanced the activity and reduced the specificity of TFs, with consistent changes in in vitro phase separation.

Results

Human TFs encode short periodic blocks of aromatic residues

Prion-like domains of RNA-binding proteins contain periodically arranged aromatic residues that promote phase separation27, but it is not known whether TFs contain periodically arranged aromatic residues. To gain initial insights into the extent of periodicity of aromatic residues in human TFs, we developed a computational pipeline to identify short blocks of periodically arranged aromatic residues with varying spacer lengths in approximately 1,500 human TFs that had been previously curated (Fig. 1a)1. We filtered for periodic blocks of at least four aromatic residues that overlap IDRs and identified 531 TF IDRs containing at least one periodic block (Fig. 1b, Extended Data Fig. 1a,b and Supplementary Table 1). Only 60 of the 531 TF IDRs that contained a short periodic block also contained a minimal activation domain annotated from four recent studies8,9,10,11 and they overlapped in only 31 TF IDRs (Fig. 1a–c, Extended Data Fig. 1c and Supplementary Table 1), suggesting that the periodic blocks are distinct from minimal activation domains. Transcription factor IDRs with periodic blocks were enriched for aromatic residues and serines, and were depleted of charged residues (Extended Data Fig. 1d–f), consistent with typical aromatic ‘stickers’ and serine/glycine-rich ‘spacers’ in prion-like domains26,27,28.
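As an illustration of what a search for short periodic blocks could look like, the following is a minimal sketch and not the pipeline used in this study. It assumes that aromatic residues are F, W and Y, and that a 'block' is a run of consecutive aromatic residues whose spacings stay within a tolerance of the first spacing in the run; the function names and the `tolerance` parameter are hypothetical.

```python
# Minimal sketch (assumptions: aromatic = F/W/Y; a block is a run of
# aromatic residues whose spacings match the first spacing +/- tolerance).
AROMATIC = set("FWY")

def aromatic_positions(seq):
    """Return 0-based positions of aromatic residues in seq."""
    return [i for i, aa in enumerate(seq) if aa in AROMATIC]

def periodic_blocks(seq, min_residues=4, tolerance=1):
    """Return (start, end, n_aromatics) tuples for periodic blocks.

    start and end are the positions of the first and last aromatic
    residue in the block.
    """
    pos = aromatic_positions(seq)
    blocks = []
    i = 0
    while i < len(pos) - 1:
        reference = pos[i + 1] - pos[i]  # spacing that defines this run
        j = i + 1
        # Extend the run while the next spacing stays close to the reference.
        while j + 1 < len(pos) and abs((pos[j + 1] - pos[j]) - reference) <= tolerance:
            j += 1
        n = j - i + 1
        if n >= min_residues:
            blocks.append((pos[i], pos[j], n))
            i = j  # the next run may start at the last residue of this block
        else:
            i += 1
    return blocks
```

For example, `periodic_blocks("YGGGG" * 4 + "S" * 10)` reports a single block spanning the four evenly spaced tyrosines; filtering such blocks for overlap with annotated IDRs would be a separate step.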

To quantify the extent of periodicity, we generated a ‘periodicity score’ as a weighted sum of the periodic blocks, and ranked TFs based on the periodicity score of their IDRs (Fig. 1b and Supplementary Table 1). The periodicity score was further validated by calculating a previously described patterning parameter (the omega score, ΩAro)27. The ΩAro score measures the extent of mixing of aromatic residues—where high dispersion leads to a low ΩAro value—which is then compared with the mean dispersion of 1,000 randomly generated sequences27. For example, the 30 aromatic residues in the NFAT5 IDR are more uniformly dispersed than in 1,000/1,000 randomly generated sequences of identical composition (ΩAro = 0.124, empirical P = 0; Fig. 1c,d). These results suggest that approximately 30% of human TF IDRs contain short blocks of periodically arranged aromatic residues and some of the observed periodicity seems to be non-random.
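The comparison against randomly generated sequences of identical composition can be sketched as a shuffle test. The example below does not compute the published ΩAro parameter; as a stand-in it uses the variance of spacer lengths between aromatic residues (including the flanking segments), where a lower value means more uniform dispersion. The statistic, the function names and the F/W/Y definition of aromatic residues are assumptions made for illustration only.

```python
import random
from statistics import pvariance

AROMATIC = set("FWY")  # assumption: F, W and Y count as aromatic

def spacer_variance(seq):
    """Variance of spacer lengths between aromatic residues.

    Stand-in dispersion statistic (NOT the published omega score).
    The segments before the first and after the last aromatic residue
    are included, so clustered residues give a high value.
    """
    pos = [-1] + [i for i, aa in enumerate(seq) if aa in AROMATIC] + [len(seq)]
    spacers = [b - a - 1 for a, b in zip(pos, pos[1:])]
    return pvariance(spacers)

def empirical_p(seq, n_shuffles=1000, seed=0):
    """Fraction of shuffled sequences at least as uniformly dispersed."""
    rng = random.Random(seed)
    observed = spacer_variance(seq)
    residues = list(seq)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(residues)  # preserves amino-acid composition
        if spacer_variance("".join(residues)) <= observed:
            hits += 1
    return observed, hits / n_shuffles
```

An empirical P of 0 in this scheme means that none of the shuffled sequences was as uniformly dispersed as the real one, which matches the way the NFAT5 result above is reported (1,000/1,000).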

Three TF IDRs that encode periodic aromatic blocks were selected for functional testing (HOXB1, HOXD4 and HOXC4). All three purified recombinant monomeric enhanced green fluorescent protein (mEGFP)-tagged IDRs formed droplets in a concentration-dependent manner in the presence of a crowding agent (10% polyethylene glycol 8000 (PEG 8000); Fig. 1e–h and Extended Data Fig. 2a–d). The droplets underwent fusion and wetted the surface of the microscopy slide (Supplementary Videos 1–6), which are hallmarks of liquid–liquid phase separation29. Substitution of aromatic residues (AroLITE) reduced droplet formation (Fig. 1g,h and Extended Data Fig. 2c,d,f–h). As a test of transcriptional activity, the wild-type IDRs fused to the GAL4 DBD activated transcription of a luciferase reporter driven by five repeats of the upstream activation sequence (5×UAS) when transfected into various cells (P < 0.05, Student’s t-test) and substitution of aromatic residues virtually abolished activity of all six IDRs tested (Fig. 1i and Extended Data Fig. 2e,i–m). These findings suggest that aromatic residues are necessary for in vitro phase separation and transactivation capacities of TF IDRs that contain periodic blocks of aromatic residues.
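The reporter normalization described in the figure legends (luciferase signal divided by the internal Renilla control, then expressed as a percentage of the empty-vector activity) can be written out as a short calculation. This is a sketch with hypothetical function and variable names; it assumes one (luciferase, Renilla) pair of readings per biological replicate and uses the sample standard deviation.

```python
from statistics import mean, stdev

def normalized_activity(sample, empty_vector):
    """Return (mean, s.d.) of reporter activity as % of empty vector.

    sample, empty_vector: lists of (luciferase, renilla) readings,
    one tuple per biological replicate (assumed input layout).
    """
    # Reference level: mean luciferase/Renilla ratio of the empty vector.
    reference = mean(luc / ren for luc, ren in empty_vector)
    percent = [100.0 * (luc / ren) / reference for luc, ren in sample]
    return mean(percent), stdev(percent)
```

With three replicates per construct, as in the assays reported here, `stdev` is computed over three percentage values.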

Submaximal periodicity of aromatic residues in TF IDRs

We noted that many TF IDRs contain short periodic aromatic blocks, but their overall periodicity tends to be limited (Fig. 1e,f). Thus, we hypothesized that aromatic dispersion of TF IDRs might be lower than the theoretical maximum. To test this idea, we quantified periodicity using several approaches. We developed a method to identify protein regions with significant periodicity, independent of sequence length and composition. The spacer length between adjacent aromatic residues was calculated for each protein and the observed distribution of spacer lengths within a sequence was compared with the expected geometric distribution using the Kolmogorov–Smirnov (K–S) test (Fig. 1j). The mean of the geometric distribution was extrapolated from the proportion of aromatic residues, implicitly modelling their occurrence by a Poisson process. The method was applied to 100-amino-acid-long regions using a sliding window approach and the P value of the K–S test was plotted against the position of each window in every protein in the human proteome. The P value and length of the regions encompassing 100 residue windows below the P value threshold were used to define regions with significant periodicity (Fig. 1j). Of note, our approach captured the previously described periodic region in HNRNPA1 (Extended Data Fig. 3a)27.
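The windowed test can be illustrated with a short standard-library sketch. It is not the implementation used here: it assumes F, W and Y as aromatic residues, estimates the geometric parameter from the aromatic fraction of each window, and uses the asymptotic Kolmogorov distribution for the P value, which is only approximate for the small number of spacers in a 100-residue window and for a discrete reference distribution. The threshold `alpha` and all function names are hypothetical.

```python
import math
from bisect import bisect_left, bisect_right

AROMATIC = set("FWY")  # assumption: F, W and Y count as aromatic

def spacer_lengths(seq):
    """Number of non-aromatic residues between adjacent aromatic residues."""
    pos = [i for i, aa in enumerate(seq) if aa in AROMATIC]
    return [b - a - 1 for a, b in zip(pos, pos[1:])]

def kolmogorov_sf(lam, terms=100):
    """Asymptotic Kolmogorov survival function Q(lambda)."""
    if lam < 0.2:
        return 1.0  # Q is indistinguishable from 1 for very small lambda
    total = sum((-1) ** (j - 1) * math.exp(-2.0 * j * j * lam * lam)
                for j in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))

def ks_vs_geometric(seq):
    """Return (D, approximate P) for spacers versus a geometric law, or None."""
    spacers = sorted(spacer_lengths(seq))
    n = len(spacers)
    if n < 3:
        return None  # too few aromatic residues to test
    p = (n + 1) / len(seq)  # aromatic fraction of the window

    def cdf(k):
        # Geometric CDF on k = 0, 1, 2, ... (failures before a success).
        return 1.0 - (1.0 - p) ** (k + 1) if k >= 0 else 0.0

    d = 0.0
    for k in set(spacers):
        below = bisect_left(spacers, k) / n   # empirical CDF just below k
        at = bisect_right(spacers, k) / n     # empirical CDF at k
        d = max(d, abs(at - cdf(k)), abs(below - cdf(k - 1)))
    return d, kolmogorov_sf(math.sqrt(n) * d)

def significant_windows(seq, window=100, step=1, alpha=0.05):
    """Return (start, P) for windows whose spacers deviate from geometric."""
    hits = []
    for start in range(0, max(1, len(seq) - window + 1), step):
        result = ks_vs_geometric(seq[start:start + window])
        if result is not None and result[1] < alpha:
            hits.append((start, result[1]))
    return hits
```

Adjacent or overlapping windows below the threshold would then be merged into a region, and the lowest P value within the region kept, as plotted in Fig. 1k. Note that the test itself flags any deviation from the geometric expectation, so a sketch like this one would also need a check that the deviation reflects regular spacing.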

Regions with significant periodicity were identified in 2,202 human proteins and 396/2,202 of the periodic regions overlapped IDRs annotated by Metapredict (Extended Data Fig. 3b,c and Supplementary Table 2). The proteins containing regions of significant periodicity were enriched for prion-like proteins and were not enriched for TFs (Fig. 1k and Extended Data Fig. 3d,e). Only 134/1,542 TFs were found to contain a region of significant periodicity and only 63 of these regions were in the IDR (Fig. 1k). Furthermore, the average ΩAro score of IDRs in TFs was significantly higher than that of prion-like domains (P < 1 × 10−4, one-way analysis of variance; Fig. 1l,m). These results demonstrate that TF IDRs have lower periodicity than prion-like domains and suggest that the periodicity of TF IDRs may be submaximal.

Increasing aromatic dispersion enhances transactivation

If TF IDRs have submaximal aromatic dispersion, one could expect that increasing their aromatic dispersion enhances activity. We tested this idea using the HOXD4 IDR as a proof-of-concept (Fig. 2a). We first substituted seven non-aromatic residues with tyrosines in regions of spacer lengths of >15 amino acids in the IDR, increasing its periodicity (AroPLUS; Fig. 2a). Purified mEGFP-tagged AroPLUS IDR protein formed droplets at a lower concentration (Csat) than the wild-type HOXD4 IDR in vitro (Fig. 2b,c) and had a twofold higher activity in the GAL4-DBD transactivation assay (P = 0.032, Student’s t-test; Fig. 2a), which was specific to adding aromatic residues in positions that increase periodicity (Fig. 2a–c). We also generated a HOXD4 IDR mutant in which aromatic residues in the native sequence were uniformly dispersed (AroPERFECT; Fig. 2a). The AroPERFECT IDR formed liquid-like droplets at a similar Csat to the wild-type IDR in vitro (Fig. 2b,c and Supplementary Videos 1,2,7,8). However, fluorescence recovery after photobleaching (FRAP) analyses revealed an increase in the recovery of fluorescence (Fig. 2d and Extended Data Fig. 4a), suggesting enhanced liquid-like features of IDR droplets. Moreover, the AroPERFECT IDR had an approximately fivefold higher activity in the GAL4-DBD transactivation assay compared with the wild-type IDR (P < 1 × 10−4, Student’s t-test; Fig. 2a and Extended Data Fig. 4b). These results suggest that increased aromatic dispersion in the HOXD4 IDR enhances its activity.

Fig. 2: Increasing aromatic dispersion in TF IDRs enhances transactivation.

a, Schematic models of HOXD4 IDRs (left). Aromatic residues (orange dots) and alanine mutations (white dots) are highlighted. Additionally introduced tyrosines are shown as red dots. Omega plots of the HOXD4 IDRs and ΩAro scores (middle). Results of luciferase reporter assays (right). Data are from three biological replicates. b, Representative images of droplet formation of purified HOXD4 IDR–mEGFP fusion proteins at the indicated concentrations in droplet formation buffer. Scale bars, 5 μm. c, Relative amount of condensed protein per concentration quantified in the droplet formation assays. Data are the mean ± s.d. of n = 15 images from three replicates. The curves were generated as nonlinear regressions to a sigmoidal curve function. d, Fluorescence intensity of wild-type and AroPERFECT HOXD4 in vitro droplets before, during and after photobleaching. Data are the mean ± s.d. of n = 20 images from two replicate imaging experiments. e, Results of a HOXD4 IDR tiling experiment using luciferase reporter assays. Sequences were tiled into fragments of 40 amino acids with 20-amino-acid overlaps. The activities of the full-length IDRs are indicated with dashed horizontal lines. A predicted activation domain (AD) in the HOXD4 wild-type IDR is highlighted (light blue bar). Luciferase activity is reported as the fold change relative to cells transfected with empty vector. f, Results of luciferase reporter assays of the indicated HOXD4 IDR constructs. The position of the 40-mer tile containing the AD in e is illustrated. Data are from three biological replicates. g, Schematic models of synthetic sequences (left); tyrosine residues are highlighted (orange dots). Results of luciferase reporter assays (right). Data are from two (bottom) or three (top) biological replicates. a,e–g, Luciferase values were normalized to an internal Renilla control and the values are displayed as percentages normalized to the activity measured using an empty vector. Data are the mean ± s.d. 
*P < 0.05, **P < 0.01 and ***P < 1 × 10−3; two-sided unpaired Student’s t-test.

Further mutagenesis of the HOXD4 IDR revealed that increasing the aromatic dispersion enhances transactivation within the confines of additional sequence features but independent of predicted structural elements. The HOXD4 IDR contains a predicted minimal activation domain (Fig. 2e). A 40-amino-acid fragment containing this element, however, had lower activity in the AroPERFECT sequence (Fig. 2e). Furthermore, the elevated activity of the AroPERFECT IDR could not be explained by the creation of additional minimal activation domains (Fig. 2e) and no correlation with short linear motifs13 was apparent (Extended Data Fig. 4c,d and Supplementary Table 3). A shift of the uniformly spaced aromatic residues by two positions, but not by one position, towards the amino (N) terminus led to moderately elevated activity (Extended Data Fig. 4c,d), and the degree of enhancement correlated with the number of small inert residues adjacent to aromatic residues (Extended Data Fig. 4e), consistent with previous studies on prion-like sequences28,30,31,32. Finally, we complemented the IDR portion downstream of the minimal activation domain with a short periodic portion of the FUS IDR, which also enhanced activity (WT(N)-FUSNxs; Fig. 2f).
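The tiling scheme used for these experiments (fragments of 40 amino acids with 20-amino-acid overlaps) can be sketched as follows. How the final, incomplete stretch of a sequence is handled is not stated in the text; the extra end-anchored tile below is an assumption, and the function name is hypothetical.

```python
def tile_sequence(seq, tile_len=40, overlap=20):
    """Split seq into overlapping tiles; return (start, fragment) tuples."""
    step = tile_len - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than tile_len")
    tiles = [(start, seq[start:start + tile_len])
             for start in range(0, max(1, len(seq) - tile_len + 1), step)]
    # Assumption: if the regular tiles stop short of the C terminus,
    # add one final tile anchored at the end of the sequence.
    if tiles[-1][0] + tile_len < len(seq):
        tiles.append((len(seq) - tile_len, seq[-tile_len:]))
    return tiles
```

A 100-residue IDR yields four tiles starting at positions 0, 20, 40 and 60; each tile would then be fused to the GAL4 DBD and assayed separately.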

Increased aromatic dispersion enhanced the transcriptional activity of multiple other TF IDRs (HOXC4, OCT4, PDX1 and FOXA3; Extended Data Figs. 4f–k and 5a–c), whereas reducing aromatic dispersion of the periodic EGR1 IDR reduced activity (Extended Data Fig. 5e,f). The spacer residues seemed to constrain the effect of aromatic dispersion, as increased aromatic dispersion of the HOXB1 IDR did not enhance its already strong activity (Extended Data Fig. 5g). Supporting this model, aromatic dispersion in a synthetic neutral IDR backbone correlated with activity, but in a negatively charged backbone it did not (Fig. 2g). These results suggest that optimizing aromatic dispersion can enhance the activity of TFs but not without limitations that require further investigation.

Evidence for gain-of-function of periodic HOXD4 mutants

To investigate the impact of the periodic HOXD4 mutants in vivo, we generated HAP1 cell lines in which monomeric enhanced GFP (mEGFP)-tagged full-length HOXD4 variants were knocked in at the endogenous locus (Extended Data Fig. 6a–d). Surprisingly, knock-in of the AroPERFECT and AroPLUS HOXD4 mutants altered the morphology of the colonies, suggesting a gain-of-function effect (Fig. 3a and Supplementary Fig. 1a). The wild-type HOXD4–mEGFP protein was modestly enriched in the nucleus, whereas AroPERFECT and AroPLUS HOXD4 were expressed at higher levels and formed intense nuclear clusters (Fig. 3a and Supplementary Fig. 1a). To probe nuclear HOXD4 clusters in cells that express the three variants at comparable levels, we integrated doxycycline (DOX)-inducible mEGFP-tagged alleles using a PiggyBac transposon. The average granularity (that is, normalized s.d. of the fluorescence signal) in cells expressing AroPERFECT and especially AroPLUS HOXD4 transgenes was higher compared with cells expressing wild-type HOXD4 (Fig. 3b,c and Supplementary Fig. 1b). These results suggest that increased aromatic periodicity in the HOXD4 IDR has a gain-of-function effect in vivo.
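The granularity measure, the s.d. of the nuclear mEGFP signal divided by its mean, amounts to a coefficient of variation per nucleus. The sketch below assumes that the pixel intensities of one segmented nucleus are available as a flat list and uses the population s.d.; both choices, and the function name, are assumptions.

```python
from statistics import mean, pstdev

def granularity(pixel_intensities):
    """Normalized s.d. (s.d. / mean) of the signal within one nucleus."""
    avg = mean(pixel_intensities)
    if avg == 0:
        raise ValueError("mean intensity is zero; granularity is undefined")
    return pstdev(pixel_intensities) / avg
```

A nucleus with a perfectly even signal scores 0, whereas a nucleus in which the signal is concentrated in a few bright clusters scores higher, independent of the overall expression level.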

Fig. 3: Evidence for gain-of-function of periodic HOXD4 mutants in vivo.

a, Differential interference contrast (DIC) microscopy of the indicated cell lines (top). Representative fluorescence microscopy images of cell nuclei (bottom). The fusion proteins were visualized using anti-GFP immunofluorescence in fixed cells. Dashed white lines represent the nuclear contour. Scale bars, 0.4 mm (DIC microscopy) and 10 μm (fluorescence microscopy). b, Representative images of HAP1 HOXD4 wild type–mEGFP, HOXD4 AroPERFECT–mEGFP and HOXD4 AroPLUS–mEGFP nuclei after 24 h of HOXD4 expression. The fusion proteins were visualized using mEGFP fluorescence in fixed cells. The number of individual nuclei per condition is provided. Scale bar, 5 μm. a,b, The normalized signal intensity was calculated by dividing the s.d. of the mEGFP signal of each nucleus by the corresponding mean mEGFP signal. c, Granularity scores of nuclei with the corresponding mean nuclear mEGFP intensities. Data are the mean ± s.d. of n = 536 (wild-type), 565 (AroPERFECT) and 504 (AroPLUS) nuclei pooled from two independent replicates. a.u., arbitrary units. d, Principal component (PC) analysis of the RNA-seq expression profiles of parental HAP1, HOXD4-knockout and the indicated knock-in HAP1 cell lines. e, Differential expression analysis of HOXD4 AroPERFECT–mEGFP and HOXD4 AroPLUS–mEGFP versus HOXD4 wild type–mEGFP HAP1 cells. P values were determined using the Benjamini–Hochberg method. f, Western blot analysis of HOXD4–mEGFP, IFI16 and ARHGAP4 in the indicated cell lines. HOXD4–mEGFP proteins were probed with anti-GFP. HSP90 was used as the loading control. HOXD4 targets (blue dot) and non-HOXD4 targets (red dot) are highlighted. g, Schematic model of the condensate tethering system (left). Fluorescence images of ectopically expressed YFP–RNAPII CTD in live U2OS cells cotransfected with the indicated cyan fluorescent protein (CFP)–LacI-HOXD4 IDR fusion constructs (right). The dashed line represents the nuclear contour. Inserts: magnified views of the regions in the red boxes. 
Scale bars, 10 μm (main images) and 40 μm (inserts). h, Relative YFP signal intensity in the tether foci. Data are the mean ± s.d. of n = 50 (wild-type YFP and wild-type YFP–RNAPII CTD), 51 (AroPERFECT YFP) and 53 (AroPERFECT YFP–RNAPII CTD) nuclei pooled from two independent replicates. c,h, P values are from two-sided unpaired Student’s t-tests; NS, not significant.


To gain insights into the genes that are deregulated by the periodic HOXD4 mutants, we performed RNA sequencing (RNA-seq) on the HAP1 cell lines that encode integrated HOXD4 variants at the endogenous locus. Principal component analysis of approximately 16,000 quantified transcripts revealed that the expression profiles of AroPERFECT and AroPLUS HOXD4-expressing cells were distinct from those of the wild-type and HOXD4-knockout cells (Fig. 3d). We annotated 1,133 HOXD4 target genes based on differential expression between the parental and HOXD4-knockout cells. In the AroPERFECT and AroPLUS cells, 76% of the HOXD4 target genes were deregulated in the same direction as in knockout cells, consistent with loss of heterodimerization with PBX factors33 (Fig. 3e, Extended Data Fig. 6e and Supplementary Table 4). However, we identified 396 genes that were upregulated in the AroPERFECT- and AroPLUS-expressing cells but downregulated in the knockout cells. One of the genes was HOXD4 itself, consistent with previous studies showing that HOXD4 autoregulates its own gene34,35,36. The elevated levels of HOXD4 and ARHGAP4 were validated with western blots (Fig. 3f). We also identified 43 genes that were upregulated in the AroPERFECT-expressing cells and 64 genes that were upregulated in the AroPLUS-expressing cells, which were not HOXD4 targets, for example IFI16 (Fig. 3f and Extended Data Fig. 6e,f). Morphology and expression phenotypes were confirmed in PiggyBac cells expressing similar levels of wild-type and periodic HOXD4 transgenes (Extended Data Fig. 6f–i and Supplementary Fig. 1c). These results indicate that increased aromatic dispersion in the HOXD4 IDR is associated with enhanced activity and altered gene specificity, which seems to be partly gain-of-function.

To further probe the link between aromatic dispersion, transcriptional activity and condensates, we measured RNAPII CTD recruitment into HOXD4 IDR condensates using a cell-based condensate system37. Wild-type or AroPERFECT HOXD4 IDRs were tethered to a LacO array in U2OS cells expressing an ectopic RNAPII CTD–yellow fluorescent protein (YFP) fusion protein (Fig. 3g). RNAPII CTD was mildly enriched in the tethered HOXD4 wild-type IDR condensates and its enrichment was significantly higher in the AroPERFECT IDR condensates (Fig. 3g,h). These results suggest that the enhanced activity and altered gene specificity of periodic HOXD4 IDR is associated with reduced heterodimerization and enhanced RNAPII interaction.

Optimizing C/EBPα enhances transactivation

Transcription factors can reprogramme cell identity4,5; we therefore tested the impact of optimizing aromatic dispersion of well-known reprogramming TFs.

C/EBPα is a master regulator of myeloid cell differentiation38 (Fig. 4a). Purified recombinant mEGFP-tagged C/EBPα IDRs formed in vitro droplets with liquid-like features (Fig. 4b and Supplementary Videos 9–14) and had transactivation capacity in the GAL4-DBD luciferase system (Fig. 4a). IDR droplet formation and transactivation were dependent on the presence of aromatic residues (Fig. 4a,b and Extended Data Fig. 7a). To test the impact of increased aromatic dispersion, we generated an IDR in which the aromatic residues were dispersed with perfectly uniform spacing (AroPERFECT IS15). Increased dispersion did not affect the Csat for droplet formation (Fig. 4a,b and Extended Data Fig. 7a) but enhanced recovery after photobleaching in droplets (Fig. 4c) and enhanced transactivation twofold in the GAL4-DBD luciferase system compared with the wild-type IDR (P < 1 × 10−4, Student’s t-test; Fig. 4a and Extended Data Fig. 7b). Moreover, RNAPII CTD was more enriched in AroPERFECT IS15 condensates compared with wild-type IDR condensates tethered onto the LacO array (Fig. 4d,e). In vitro, an increase in both the number of aromatic residues and their dispersion (AroPERFECT IS10) resulted in a decrease in FRAP (Fig. 4c) as well as decreased transactivation in the GAL4-DBD luciferase system compared with the wild-type IDR (P < 1 × 10−3, Student’s t-test; Fig. 4a). These results suggest that increased aromatic dispersion enhances transactivation of the C/EBPα IDR but the increase in aromaticity inhibits it.

Fig. 4: Optimizing aromatic dispersion in C/EBPα enhances transactivation.

a, Schematic models of wild-type and mutant C/EBPα proteins (left). The positions of the bZIP DBD (grey box) and aromatic residues (orange dots) are indicated. Omega plots and ΩAro scores (middle). Results of luciferase reporter assays (right). Data are the mean ± s.d. of n = 3 biological replicates with three technical replicates each. b, Representative images of droplet formation of purified C/EBPα IDR–mEGFP fusion proteins at the indicated concentrations in droplet formation buffer. Scale bars, 5 μm. c, Fluorescence intensity of C/EBPα wild type, AroLITE and AroPERFECT IS15 IDR in in vitro droplets before, during and after photobleaching. Data are the mean ± s.d. of n = 15 (wild-type) and 14 (AroPERFECT IS15 and AroPERFECT IS10) droplets from two replicates. d, Fluorescence images of ectopically expressed YFP–RNAPII CTD in live U2OS cells that were cotransfected with the indicated CFP–LacI-C/EBPα IDR fusion constructs. The dashed line represents the nuclear contour. Inserts: magnified views of the regions in the red boxes. Scale bars, 10 μm (main images) and 40 μm (inserts). e, Relative YFP signal intensity in the tether foci. Data are the mean ± s.d. of n = 51 (wild-type YFP, AroPERFECT YFP and wild-type YFP–RNAPII CTD) and 56 (AroPERFECT YFP–RNAPII CTD) nuclei pooled from two independent replicates. f, Results of a C/EBPα IDR tiling experiment using luciferase reporter assays. C/EBPα wild type and AroPERFECT IS15 IDR sequences were tiled into fragments of 40 amino acids with 20-amino-acid overlaps. The activities of the full-length IDRs are indicated with dashed horizontal lines. g, Results of luciferase reporter assays of the indicated IDR constructs. a,f,g, Luciferase values were normalized to an internal Renilla control and the values are displayed as percentages normalized to the activity measured using an empty vector. f,g, Data are the mean ± s.d. of n = 3 biological replicates. a,e,g, P values are from two-sided unpaired Student’s t-tests.

Further mutagenesis of the C/EBPα IDR revealed that increased aromatic dispersion enhances transactivation within the confines of additional sequence features. The C/EBPα IDR encodes a minimal activation domain39. First, the activity of this element was lower in the AroPERFECT IS15 IDR and the elevated activity of the AroPERFECT IS15 IDR was not caused by the creation of additional minimal activation domains (Fig. 4f). Second, when we increased the aromatic dispersion only in the portion of the C/EBPα IDR downstream of the activation domain (WT(N)-IS15), the activity of the IDR was elevated threefold compared with the wild type and twofold compared with the N-terminal portion (Fig. 4g). Third, replacement of the downstream IDR portion with portions of the periodic FUSN-IDR (WT(N)-FUSN and WT(N)-FUSNxs) enhanced activity over the wild-type C/EBPα IDR (Fig. 4g). Fourth, a shift of the aromatic pattern of AroPERFECT IS15 IDR by one amino acid towards the carboxy (C) terminus resulted in higher transactivation compared with the wild type, whereas a shift by two positions did not (Extended Data Fig. 7c), and the magnitude of change correlated with the proportion of small inert residues adjacent to the aromatic residues (Extended Data Fig. 4e). Aromatic dispersion therefore enhances transactivation independent of the known C/EBPα activation domain and within the confines of the spacer residues.

Optimizing C/EBPα enhances macrophage reprogramming

We next measured the cellular reprogramming capacity of stably transduced C/EBPα variants in a leukaemic human B cell line (RCH-rtTA cells). In this system, induction of C/EBPα by DOX reprogrammes B cells into terminally differentiated macrophages while arresting the cell cycle40,41. Cell conversion was monitored through fluorescence-activated-cell-sorting (FACS) analysis of the B cell marker CD19 and the macrophage marker Mac1 (also known as CD11b; encoded by the gene ITGAM; Fig. 5a,b and Extended Data Fig. 7d)40,41. As expected, C/EBPα expression led to a gradual increase in the proportion of Mac1+CD19− macrophages among the GFP+ cell population over seven days (Fig. 5c and Extended Data Fig. 7d,e). Expression of the AroPERFECT IS15 C/EBPα mutant increased both the speed of appearance and proportion of Mac1+ cells among the GFP+ population (Fig. 5c and Extended Data Fig. 7d,e).

Fig. 5: Optimizing aromatic dispersion in C/EBPα enhances macrophage reprogramming, and leads to stronger and more promiscuous genomic binding.

a, Schematic models of wild-type and mutant C/EBPα proteins. The transactivation data are identical to the data displayed in Fig. 4a. P values are from two-sided unpaired Student’s t-tests. b, Schematic model of C/EBPα-mediated transdifferentiation of B cells to macrophages. c, FACS quantification of GFP+ RCH-rtTA cells encoding C/EBPα overexpression cassettes. The proportions of CD19−Mac1+ cells were measured 48, 96 and 168 h after transgene induction. Data are the mean ± s.d. of n = 5 (wild type and AroPERFECT IS15) and 3 (AroLITE and AroPERFECT IS10) independent experiments. d, Graph-based clustering (uniform manifold approximation and projection, UMAP) of the scRNA-seq data of C/EBPα-mediated transdifferentiation. Clusters were annotated based on marker genes. Overlayed is the partition-based graph abstraction (PAGA) showing the cell trajectory based on dynamic modelling of RNA velocity. Inset: pseudotime plot. e, Proportion of mEGFP+ cells in the macrophage clusters (colour-coded as in d). f, Heatmap representation of ChIP–Seq read densities of wild-type and AroPERFECT IS15 C/EBPα within a 1.5-kb window around all shared C/EBPα peaks and differentially enriched peaks in AroPERFECT IS15 C/EBPα. ‘Peaks unique to IS15 and reported before’ denotes binding sites differentially enriched in IS15 binding that overlap C/EBPα peaks reported in previous literature. FE, fold enrichment. g, Enrichment scores of bZIP TF motifs and adjusted (adj.) P values of enrichment at the three indicated peak sets. P values were determined using the Benjamini–Hochberg method. h,j, AroPERFECT IS15 C/EBPα shows enhanced binding at the FAM98A (h) and GBP5 (j) loci. Displayed are genome browser tracks of ChIP–Seq data of C/EBPα 24 and 48 h after C/EBPα induction. The coordinates are hg38 genome assembly coordinates. i,k, UMAPs coloured on FAM98A (i) and GBP5 (k) expression. The numbers denote the mean ± s.d. expression in the whole samples. 
l, Luciferase assays using the indicated reporter plasmids cotransfected with expression vectors encoding either wild-type or AroPERFECT IS15 C/EBPα. Luciferase values were normalized to an internal Renilla control and the values are displayed as percentages of the activity measured using the ‘basic’ vector. Data are the mean ± s.d. of four biological replicates. P values are from two-sided unpaired Student’s t-tests.

To gain insights into the transcriptional programmes driven by the C/EBPα proteins, we performed single-cell RNA-seq (scRNA-seq) of cultures expressing wild-type and AroPERFECT IS15 C/EBPα variants after seven days of transgene induction. The culture expressing the transcriptionally inert AroPERFECT IS10 C/EBPα variant was included as a negative control. Cross-referencing the clusters on the combined scRNA cell-state map of the three cultures with marker genes of known cell populations identified terminally differentiated macrophages, macrophage precursors and various B cell subpopulations in our data (Fig. 5d and Extended Data Fig. 8a–g). Consistent with the FACS analysis, the proportion of late macrophages was higher among the GFP+ cells in the AroPERFECT IS15-transduced population (Fig. 5e), indicating enhanced reprogramming capacity. A comparative analysis of the transcriptomes of late macrophages expressing wild-type or AroPERFECT IS15 C/EBPα revealed largely similar expression profiles; however, the AroPERFECT IS15 macrophages expressed a small set of 31 genes that were not detected in the wild-type C/EBPα-expressing macrophages (Extended Data Fig. 8h,i and Supplementary Table 5), suggesting slightly altered gene specificity.

Optimizing C/EBPα leads to stronger genomic binding

To dissect the molecular basis of enhanced reprogramming we performed chromatin immunoprecipitation with sequencing (ChIP–Seq) of C/EBPα–GFP proteins, using an anti-GFP antibody, after 24 and 48 h of transgene induction in isolated clonal cell lines (Extended Data Fig. 8j). The majority of sites bound by wild-type C/EBPα were also bound by AroPERFECT IS15 C/EBPα, but the read densities at the bound sites were consistently higher in the AroPERFECT IS15 samples (Fig. 5f, Extended Data Fig. 8k and Supplementary Fig. 2a,b). Overall, approximately 100× more differentially bound peaks had higher read densities in AroPERFECT IS15 than the other way around (Fig. 5f and Supplementary Fig. 2a).

Differential genomic binding of AroPERFECT IS15 C/EBPα was associated with differences in motif composition at the binding sites. For these analyses, we used approximately 28,000 ChIP–Seq peaks that were identified as ‘shared’ by both wild-type and AroPERFECT IS15 C/EBPα, and approximately 60,000 ChIP–Seq peaks that were uniquely bound by AroPERFECT IS15 C/EBPα at least at one time point (Fig. 5f). Cross-referencing the peaks with published C/EBPα ChIP–Seq datasets revealed that approximately 50,000 of the sites were previously reported as binding sites of wild-type C/EBPα (‘peaks unique to IS15, reported before’ in Fig. 5f) and about 10,000 were specific to our AroPERFECT IS15 C/EBPα data (‘peaks specific to IS15’ in Fig. 5f). The shared binding peaks and peaks unique to IS15 reported previously were highly enriched for the same canonical C/EBPα motif but the peaks specific to IS15 were less enriched for the C/EBPα motif and more enriched for other basic-leucine zipper (bZIP) TF motifs, including C/EBPβ and NFIL3 (Fig. 5g).

The impact of differential binding on gene expression was confirmed using multiple approaches. IS15-specific binding at several loci was associated with detectable IS15-specific expression of the gene in the scRNA-seq data of B cell and macrophage clusters (Fig. 5h–k and Supplementary Fig. 2c–f). Furthermore, cloning of IS15-specific peaks in a luciferase reporter revealed elevated activity when cotransfected with an AroPERFECT IS15 C/EBPα vector compared with the wild type (Fig. 5l). Finally, differential expression was confirmed with FACS analysis of the products of two macrophage-restricted genes: CD66 (the product of the CEACAM genes; Extended Data Fig. 8l–n) and FCGR2A (Extended Data Fig. 8o–q). Together, these results suggest that the enhanced reprogramming capacity of AroPERFECT IS15 C/EBPα is associated with stronger and more promiscuous genomic binding.

Optimizing NGN2 enhances neural differentiation

As a second proof-of-concept, we tested the impact of optimizing aromatic dispersion on the reprogramming capacity of the neurogenic TF neurogenin-2 (NGN2; ref. 42; Fig. 6a).

Fig. 6: Optimizing aromatic dispersion in NGN2 enhances neural differentiation.

a, Schematic models of wild-type and mutant NGN2 proteins (left). The positions of the bHLH DBD (grey box) and aromatic amino acids (yellow dots) are indicated. Omega plots and ΩAro scores (right). b, Fluorescence intensity of NGN2 wild-type and AroPERFECT IDR in in vitro droplets before, during and after photobleaching. Data are the mean ± s.d. of n = 20 droplets pooled from two independent replicates. c, Schematic model of the NGN2-mediated human iPSC-to-neuron differentiation experiment. ROCKi, Rho-kinase inhibitor. d, Representative fluorescence microscopy images of differentiating human iPSCs expressing the indicated NGN2 proteins. Hoechst dye was used as a nuclear counterstain; mEGFP, NGN2-T2A–mEGFP. Insets: magnified views of the regions in the white boxes. Scale bars, 0.1 mm (main images) and 0.05 mm (insets). e, Number of cells, based on Hoechst nuclear staining, in the NGN2-directed differentiation experiments. f, Neurite density (fraction of tubulin-covered area) in the NGN2-directed differentiation experiments. e,f, Data are the mean ± s.d. of n = 6 images pooled from two independent experiments. P values from a two-sided unpaired Student’s t-test. g, Principal component analysis of the RNA-seq expression profiles of parental ZIP13K2 human iPSCs and human iPSCs expressing the indicated NGN2 transgenes. h, Differential expression analysis of human iPSCs expressing the indicated transgenes. NGN2 target genes are highlighted. P values were determined using the Benjamini–Hochberg method. i, Heatmap representation of ChIP–Seq read densities of cells expressing wild-type, AroLITE and AroPERFECT NGN2 within a 1.5 kb window around all shared NGN2 peaks (top), differentially enriched peaks in AroPERFECT NGN2 (centre) and differentially enriched peaks in wild-type NGN2 (bottom). FE, fold over input. j, NGN2 differential binding at the TMEM97 locus. Genome browser tracks of ChIP–Seq data after 24 and 48 h of NGN2 expression are displayed. 
The arrowhead highlights a differentially bound peak at 24 h. The coordinates are hg38 genome assembly coordinates. k, Nascent transcription (TT-SLAM-Seq) metagene profiles at approximately 9,000 NGN2 target genes. TSS, transcription start site; TES, transcription end site.

Wild-type recombinant, mEGFP-tagged NGN2 C-terminal IDR (C-IDR) formed liquid-like droplets in a concentration-dependent manner, dependent on the presence of aromatic residues (Extended Data Fig. 9a–c and Supplementary Videos 15,16). Similar to results with the IDRs of C/EBPα, HOXD4 and HOXC4, a mutant NGN2 C-IDR in which the five aromatic residues were uniformly dispersed (AroPERFECT C-IDR) formed droplets similar to the wild-type IDR in vitro and showed only a small, statistically non-significant difference in FRAP (Fig. 6b). None of the IDRs had measurable activity in the GAL4-DBD luciferase system (Extended Data Fig. 9a), consistent with a report that a minimal activation domain is located within the NGN2 DBD43.

To assay the reprogramming capacity of NGN2 mutants, DOX-inducible FLAG-tagged NGN2 transgenes were stably integrated in ZIP13K2 human induced pluripotent stem cells (iPSCs) using a PiggyBac transposon (Fig. 6c and Extended Data Fig. 9d,e). The transposon also encoded mEGFP separated by a T2A sequence. Following 24 h of DOX induction, mEGFP+ cells were FACS-sorted and replated at a defined density. After 48 h, the medium was exchanged with medium supporting neural differentiation and the cells were eventually characterized by staining nuclei and tubulin (Fig. 6c). Twice as many sorted cells expressing the AroPERFECT NGN2 mutant survived and half as many cells expressing the AroLITE NGN2 mutant survived compared with the wild-type NGN2-expressing cells after five days of transgene induction (P < 0.05, Student’s t-test; Fig. 6d,e). Consistent with these data, the density of cell projections was significantly higher in the AroPERFECT NGN2-expressing cultures compared with cultures of cells expressing wild-type NGN2 after five days of transgene induction (P < 0.05, Student’s t-test; Fig. 6d,f and Supplementary Fig. 3a–c). These results indicate that the increased aromatic dispersion in the C-terminal IDR of NGN2 enhances its capacity to reprogramme iPSCs into neuron-like cells.

To investigate the molecular basis of enhanced reprogramming by the AroPERFECT NGN2 mutant, we performed RNA-seq after five days as well as NGN2 ChIP–Seq 24 and 48 h after transgene induction. The global RNA-seq profiles of cultures expressing wild-type, AroLITE and AroPERFECT NGN2 proteins were largely similar and included NGN2 target genes annotated based on previous studies (Fig. 6g,h and Extended Data Fig. 9f,g), consistent with media conditions promoting the survival of neurons but not iPSCs after the media switch on day 2 (Fig. 6c). The ChIP–Seq data revealed that most sites bound by wild-type NGN2 were also bound by the AroLITE and AroPERFECT proteins (Fig. 6i and Extended Data Fig. 9h), but the read densities at the binding sites were consistently lower in the AroLITE-expressing cells and moderately higher in AroPERFECT-expressing cells at 24 h (Fig. 6i,j and Extended Data Fig. 9i). The basic helix–loop–helix (bHLH) TF motif composition of the binding peaks was largely similar (Extended Data Fig. 9j). Consistent with these results, measurements of genome-wide nascent transcription after short-term NGN2 induction revealed elevated transcription of NGN2 target genes in AroPERFECT-expressing cells (Fig. 6k, Extended Data Fig. 9k,l and Supplementary Fig. 4). These results suggest that optimizing the aromatic dispersion in the NGN2 C-terminal IDR enhances neural reprogramming and slightly alters genomic binding.

Optimizing MYOD1 enhances myotube differentiation

Finally, we tested the impact of optimizing aromatic dispersion on the function of the myogenic TF MYOD1 (ref. 44; Fig. 7a). Both the N-terminal and C-terminal MYOD1 IDRs had transactivation capacity in the GAL4-DBD luciferase system in myoblasts (Fig. 7a). Increased dispersion of aromatic residues abolished transactivation of the N-terminal IDR that contains a minimal activation domain but increased transactivation of the C-terminal IDR (Fig. 7a and Extended Data Fig. 10a), and the enhanced activity of the AroPERFECT C-IDR was not caused by the creation of minimal activation domains (Extended Data Fig. 10b).

Fig. 7: Optimizing aromatic dispersion in MYOD1 enhances myotube differentiation.

a, Schematic models of wild-type and mutant MYOD1 proteins (left). The position of the bHLH DBD (grey box) and aromatic amino acids (orange dots) are indicated. Omega plots and ΩAro scores of the N-terminal and C-terminal IDRs (middle). Results of luciferase reporter assays in C2C12 mouse myoblasts (right). Luciferase values were normalized to an internal Renilla control and the values are displayed as percentages normalized to the activity measured using an empty vector. Data are the mean ± s.d. of three biological replicates. P values are from two-sided unpaired Student’s t-tests. b, Schematic model of the MYOD1-mediated myotube differentiation experiment. c, Representative fluorescence microscopy images of differentiating C2C12 myoblasts expressing the indicated MYOD1 proteins on day 3 after DOX induction. The mEGFP signal of the MYOD1-T2A–mEGFP construct was used as a cytoplasmic marker. Nuclear counterstain (DAPI) is shown in magenta. Magnified views of the regions in the white boxes are provided (zoom; bottom). Scale bars, 0.5 mm (main images) and 0.2 mm (zoom). d, MYOD1-driven myotube differentiation efficiency. The fusion index was calculated as the percentage of nuclei in fused cells (cells containing at least three nuclei). Data are the mean ± s.d. of n = 15 images per genotype pooled from three biological replicates. P values are from two-sided unpaired Student’s t-tests. e, Principal component analysis of RNA-seq expression profiles of parental C2C12 cells as well as cells expressing the indicated MYOD1 transgenes. f, Differential expression analysis of C2C12 cells expressing AroLITE or AroPERFECT C MYOD1 versus C2C12 cells expressing wild-type MYOD1. MYOD1 target genes are represented as blue dots. Highlighted genes were differentially expressed and are involved in cell adhesion. P values were calculated using the Benjamini–Hochberg method.

To assay the reprogramming capacity of MYOD1 mutants, DOX-inducible MYOD1 transgenes were stably integrated into C2C12 murine myoblasts using a PiggyBac transposon (Fig. 7b). The transposon also encoded mEGFP separated by a T2A sequence from MYOD1 (Fig. 7b). In this system, forced expression of MYOD1 differentiates myoblasts into multinucleated myotubes within a few days45. Cell fusion was quantified as the percentage of 4′,6-diamidino-2-phenylindole (DAPI)-stained nuclei in multinucleated cells visualized using the mEGFP fluorescence signal as the cytoplasmic marker46. Approximately 50% of nuclei expressing wild-type MYOD1 were found in fused cells after three days of transgene induction (Fig. 7c,d and Extended Data Fig. 10c). Mutation of the aromatic residues into alanines in both IDRs (AroLITE) prevented fusion, whereas mutation of the aromatic residues in the C-terminal IDR (AroLITE C) had a negligible effect (Fig. 7c,d). Expression of the MYOD1 mutant with enhanced periodicity in its C-terminal IDR (AroPERFECT C) led to a significant increase in fusion after three days (P < 0.05, Student’s t-test; Fig. 7c,d). These results suggest that increased periodicity of aromatic residues in the C-terminal IDR of MYOD1 enhances myotube differentiation.
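
The fusion index described above can be illustrated with a minimal Python sketch. It assumes that each DAPI-stained nucleus has already been assigned to one mEGFP-outlined cell, and it uses the threshold of at least three nuclei per fused cell given in the legend of Fig. 7d. The function name, the input format and the toy data are hypothetical and are not taken from the authors' image-analysis workflow.

```python
from collections import Counter


def fusion_index(cell_ids, min_nuclei=3):
    """Percentage of nuclei that lie in cells containing at least `min_nuclei` nuclei.

    `cell_ids` holds one entry per nucleus: the (hypothetical) label of the
    mEGFP-outlined cell that the nucleus was assigned to.
    """
    if not cell_ids:
        raise ValueError("no nuclei supplied")
    nuclei_per_cell = Counter(cell_ids)
    # Count every nucleus that sits in a cell at or above the threshold.
    fused = sum(n for n in nuclei_per_cell.values() if n >= min_nuclei)
    return 100.0 * fused / len(cell_ids)


# Toy image: cell "a" has 4 nuclei, "b" has 2, "c" and "d" have 1 each.
example = ["a"] * 4 + ["b"] * 2 + ["c", "d"]
print(fusion_index(example))  # 4 of 8 nuclei are in a fused cell -> 50.0
```

In this toy case the binucleated cell "b" does not count as fused because it falls below the three-nuclei threshold.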

RNA-sequencing analysis of differentiating cells expressing various MYOD1 proteins revealed signatures consistent with observed morphological differences. Principal component analysis of the RNA-Seq data demonstrated that the global expression profiles of AroLITE-expressing cells were similar to those of the parental myoblasts (Fig. 7e and Extended Data Fig. 10d). The expression profile of AroPERFECT C-expressing cells was largely similar to that of cells expressing wild-type MYOD1 but included 290 differentially expressed genes, 197 of which were MYOD1 targets and were enriched for genes implicated in cell adhesion (Fig. 7e,f and Extended Data Fig. 10d–g). These results suggest that the morphological differences are associated with differences in the gene expression profiles of differentiating myotubes expressing various MYOD1 proteins.

Discussion

The results presented here support a model that human TFs have suboptimal transcriptional activity. We present evidence that suboptimality in several TFs is encoded as submaximal dispersion of aromatic residues in their IDRs. In several cellular reprogramming systems, an increase in aromatic dispersion enhanced the activity and compromised gene specificity of the TFs. Together with previous work showing that enhancer DNA sequences are suboptimal for TF binding14,16, the results suggest an important evolutionary trade-off between activity and specificity at multiple levels in eukaryotic transcriptional control.

The results provide insights into how human TFs work. Some TFs encode short linear motifs that can fold into secondary structures and mediate specific interactions with effector proteins47. Such sequences are typically identified as minimal activation domains that are sufficient to activate transcription of a reporter gene7,8,9,10,13,48,49. Our results suggest that some TF IDRs encode periodically arranged aromatic residues that contribute to activity via multivalent interactions with other disordered protein regions. This mode of activity may be distinct from, and complementary to, the transcriptional activity conferred by minimal activation domains. Consistent with this proposal, hydrogels of periodic low-complexity domains can bind RNAPII CTD that itself is highly periodic50, and we found that periodic TF IDRs recruit RNAPII CTD more efficiently than wild-type TF IDRs in the cell-based condensate tethering system. This model may help explain why minimal activation domains are typically embedded in large disordered sequences6,7,8,9,10 and why some TF IDRs can be substituted with the periodic FUS prion-like domain51,52. This model predicts that important regulatory information may be encoded in sequences with weak or no activity.

Transcription factor-mediated differentiation and reprogramming are generally stochastic and inefficient, and the inefficiency is thought to be explained by chromatin barriers or lack of TF effector partners4,5,53,54,55,56,57,58. Our results suggest that an additional impediment to directed differentiation and reprogramming may be the suboptimal activity of native TFs, and that reprogramming efficiency may be improved by enhancing a prion-like phase separation ‘grammar’ in native TFs. In summary, we propose that altering phase separation capacity may be a universal strategy to optimize any TF-dependent process.

Methods

Ethics statement

The research complied with all relevant ethical regulations and was approved by the Max Planck Institute for Molecular Genetics and the Centre for Genomic Regulation.

Cell culture

The cell lines HAP1, HEK293T, V6.5 mESCs, ZIP13K2 human iPSCs, Kelly, SH-SY5Y, C2C12 murine myoblasts and U2OS were cultured as per American Type Culture Collection guidelines. RCH-rtTA cells were derived from the RCH-ACV lymphoblastic leukaemia cell line59. RCH-rtTA cells and their derivatives were cultured in RPMI medium (Gibco) containing 10% fetal bovine serum supplemented with 1% glutamine (Gibco), 1% penicillin–streptomycin (Thermo Fisher Scientific) and 550 µM β-mercaptoethanol (Gibco). Cells were maintained at a density of 0.1–6 × 10⁶ cells ml⁻¹. The cell lines were checked for mycoplasma contamination and tested negative.

Genomic DNA extraction

Genomic DNA of cultured cells was extracted using a GeneJET genomic DNA purification kit (Thermo Fisher Scientific) following the manufacturer’s instructions.

Generation of HOXD4–GFP knock-in and knockout lines

For an endogenous knock-in of mEGFP-tagged HOXD4 variants, we cloned a synthesized, codon-optimized sequence for wild-type, AroPERFECT or AroPLUS HOXD4 (Twist Bioscience) into a pUC19 backbone (Addgene, catalogue number 50005) that was linearized by restriction digest with BamHI (NEB) and HindIII (NEB). Besides the aforementioned HOXD4 coding sequences, the repair template contained N- and C-terminal homology regions for the HOXD4 genomic locus amplified from HAP1 genomic DNA, a synthesized GS-linker sequence (Sigma) and a mEGFP fluorescent protein sequence amplified from a pET45 plasmid (Extended Data Fig. 6a). All plasmids were cloned via Gibson Assembly using a NEBuilder HiFi DNA assembly kit (NEB).

The endogenous HOXD4 locus was targeted by two guide RNAs cutting at the N- or C-terminal end of the HOXD4 coding sequence, respectively (Extended Data Fig. 6a). Both guide RNA sequences (Supplementary Table 6) were cloned into the sgRNA-Cas9 vector px459 (Addgene, catalogue number 62988). Repair template and guide RNA vectors were cotransfected into HAP1 cells using Lipofectamine 3000 transfection reagent (Thermo Fisher Scientific) at a molar ratio of 5:1:1 following the manufacturer’s instructions. To screen for functional integration, the transfected cells were sorted for mEGFP expression by flow cytometry after four days and a second time after an additional week. Positive cells were seeded into 96-well plates as single cells. After expansion, the clones were genotyped for correct integration by PCR on extracted genomic DNA (Extended Data Fig. 6c). Positive clones for every HOXD4-expressing line with similar mEGFP expression levels were selected. To generate a HOXD4-knockout cell line, HAP1 cells were transfected with both guide RNAs only. After four days, the cells were seeded as single cells by flow cytometry and genotyped for HOXD4 deletion by PCR on extracted genomic DNA and quantitative real-time PCR on synthesized complementary DNA (Extended Data Fig. 6c,d).

Generation of cells encoding DOX-inducible transgenes using the PiggyBac system

To generate a DOX-inducible overexpression system of HOXD4, we randomly integrated the coding sequences of wild-type, AroPERFECT and AroPLUS HOXD4 into HAP1 cells using the PiggyBac transposon system. To generate a DOX-inducible overexpression system of NGN2, we randomly integrated the coding sequences of wild-type, AroLITE and AroPERFECT NGN2 into ZIP13K2 cells using the PiggyBac transposon system. Similarly, to generate a DOX-inducible overexpression system of MYOD1, we randomly integrated the coding sequences of wild-type, AroLITE, AroPERFECT C and AroLITE C MYOD1 into C2C12 cells using the PiggyBac transposon system. The details are described in the Supplementary Information.

Generation of DOX-inducible C/EBPα overexpression lines in RCH cells

TetO-C/EBPα–mEGFP plasmids were cloned via Gibson assembly using a pHAGE2-tetO backbone. HEK293T cells were cotransfected with vector plasmid and packaging plasmid using calcium phosphate transfection. Viral supernatants were collected 48 h later and concentrated by ultracentrifugation at 20,000g and 20 °C for 2 h. The viral concentrates were resuspended in PBS. RCH cells were transduced by centrifugation with concentrated virus solution for 2 h at 32 °C and 1,000g in culturing medium.

MYOD1-mediated myogenic differentiation of C2C12 myoblasts

C2C12 myoblasts with an integrated MYOD1 overexpression cassette were seeded on chambered µ-Slide 8 well ibiTreat coverslips (Ibidi). Once 85–90% confluence was reached, 2 µg ml⁻¹ DOX was added to the culture medium to induce expression of the MYOD1 transgene. The differentiation medium was changed every day for three days. For imaging, the cells were washed with PBS and fixed with 4% paraformaldehyde for 15 min at room temperature. The cells were counterstained with DAPI (Fig. 7c and Extended Data Fig. 10c).

RNA isolation and quantitative real-time PCR

RNA from cultured cells was extracted using a Direct-zol RNA MicroPrep kit (Zymo Research) following the manufacturer’s instructions. Subsequently, 1 μg of extracted RNA was used as input material for cDNA synthesis with the RevertAid first strand cDNA synthesis kit (Thermo Fisher Scientific) using random hexamer primers as per the manufacturer’s instructions. The synthesized cDNA was diluted 1:10 with water and stored at −20 °C. Quantitative real-time PCR was performed using 2×PowerUP SYBR green master mix (Applied Biosystems) and the primers listed in Supplementary Table 6.

KAPA stranded messenger RNA-seq of HAP1 HOXD4 knock-in cells

Six-well plates were seeded with HAP1 cells at a density of 1 × 10⁵ cells per well and cultured for three days until 80% confluency was reached. RNA was extracted using a Direct-zol RNA MicroPrep kit (Zymo Research) following the manufacturer’s instructions. For each sample, 1 μg RNA was used as input for library preparation using the KAPA stranded mRNA-seq kit (Roche) according to the manufacturer’s instructions. Unique dual-indexed set-B (UDI; Kapa Biosystems) adaptors were ligated and the library was amplified for eight cycles. The libraries were sequenced on a NovaSeq 6000 system as paired-end 100 with 50 × 10⁶ fragments per library (Fig. 3d,e and Extended Data Fig. 6e).

Generation of DNA constructs for protein purification

For the purification of mEGFP- or mCherry-labelled fusion proteins, we amplified sequences from codon-optimized gene fragments (Twist Bioscience) for HOXD4 wild type, AroLITE A, AroLITE G, AroLITE S, AroPLUS, AroPLUS patched, AroPLUS LITE, AroPLUS LITE patched and AroPERFECT; HOXC4 wild type, AroLITE S and AroPERFECT; HOXB1 wild type and AroLITE A; NANOG wild type and AroLITE A; C/EBPα wild type, AroLITE A, AroPERFECT IS15 and AroPERFECT IS10; and NGN2 wild type, AroLITE A and AroPERFECT C IDRs. The primers used are listed in Supplementary Table 6. The amplified gene fragments were cloned into a pET45-mEGFP or pET45-mCherry backbone21, linearized by restriction digest with AscI (NEB) and HindIII (NEB), via NEBuilder HiFi assembly. All sequences of interest were cloned C-terminally to the fluorescence marker.

Protein purification

Overexpression of recombinant protein in BL21 (DE3) (NEB M0491S) was performed as described20. Escherichia coli pellets were resuspended in 25 ml of ice-cold Buffer A (50 mM Tris pH 7.5, 500 mM NaCl and 20 mM imidazole) supplemented with cOmplete protease inhibitors (Sigma, catalogue number 11697498001) and 0.1% Triton X-100 (Thermo Fisher Scientific, catalogue number 851110), and sonicated for ten cycles (15 s on, 45 s off) on a Qsonica Q700 sonicator. The bacterial lysate was cleared by centrifugation at 15,500g and 4 °C for 30 min. For protein purification, we used the Äkta avant 25 chromatography system. All 25 ml of the cleared lysate was loaded onto a cOmplete His-Tag purification column (Merck, catalogue number 6781543001) pre-equilibrated in Buffer A. The loaded column was washed with 15×column volumes (CV) of Buffer A. Fusion protein was eluted in 10×CV of Elution Buffer (50 mM Tris pH 7.5, 500 mM NaCl and 250 mM imidazole) and diluted 1:1 in Storage Buffer (50 mM Tris pH 7.5, 125 mM NaCl, 1 mM dithiothreitol and 10% glycerol). The fractions enriched for GFP were pooled after His-affinity purification and manually loaded through an injection valve connected to a 500 μl capillary tube onto an equilibrated Superdex 200 increase 10/300 GL column (Cytiva, 28-9909-44). The loaded column was equilibrated with 0.15×CV of ice-cold Buffer A supplemented with cOmplete protease inhibitors. The fusion proteins were eluted with 1.1×CV of ice-cold Buffer A supplemented with cOmplete protease inhibitors. The elution fractions were pooled. The eluates were further concentrated by centrifugation at 10,000g and 4 °C for 30 min using 3000 MWCO Amicon Ultra centrifugal filters (Merck, UFC803024). The concentrated fraction was diluted 1:100 in Storage Buffer, re-concentrated and stored at −80 °C.

In vitro droplet fusion and surface wetting assay

For the in vitro fusion and surface wetting assays, we measured the concentration of purified mEGFP-tagged fusion proteins using a NanoDrop 2000 system (Thermo Fisher Scientific) and subsequently diluted the measured protein preparations to 50 μM in Storage Buffer. The protein preparations were mixed 1:1 with 5 μl of 20% PEG 8000 in de-ionized water (wt/vol). The resulting 10 μl was immediately pipetted on a chambered coverslip (Ibidi, 80826-90). Images of the contact interface between the drop and the slide were acquired using an LSM880 confocal microscope equipped with a plan-apochromat ×63, numerical aperture (NA) = 1.40 oil DIC objective with a ×5 zoom, resulting in a lateral pixel resolution of 0.04 μm. A total of 25 images were taken in a time series with 15 s intervals for each video (Supplementary Videos 1–16). C/EBPα droplet fusion and surface wetting assays were performed with protein preparations different from those used for the in vitro droplet formation assay.

In vitro droplet assay

For the in vitro droplet formation experiments (Figs. 1g, 2b, 4b and Extended Data Figs. 2c,g, 4h, 9b), we measured the concentration of purified mEGFP IDR fusion proteins using a NanoDrop 2000 system (Thermo Fisher Scientific) and subsequently diluted the protein preparations to the required concentration in Storage Buffer. The in vitro droplet formation assay was performed as previously described21. The protein preparations were mixed 1:1 with 5 μl of 20% PEG 8000 in de-ionized water (wt/vol) and equilibrated for 30 min at room temperature. The resulting 10 μl was pipetted on a chambered coverslip (Ibidi, 80826-90). After equilibration for 3 min, images of the drop on the slide were acquired with an LSM880 confocal microscope equipped with a Plan-Apochromat ×63, NA = 1.40 oil DIC objective with a ×2.5 zoom, resulting in a lateral pixel resolution of 0.04 μm, if indicated. Quantification of condensate formation was based on at least ten images acquired in at least two independent image series per condition.
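
As a worked illustration of the 1:1 mixing step, the following Python sketch computes the composition of the final 10 μl drop. It assumes that the mix is by volume (5 μl of protein preparation plus 5 μl of 20% PEG 8000), which is implied by the resulting 10 μl. Whether the protein concentrations quoted in the figures refer to the preparation before or after mixing is not stated in this section, so the numbers below are only an arithmetic example. The function name is hypothetical.

```python
def after_mixing(protein_um, protein_ul=5.0, peg_percent=20.0, peg_ul=5.0):
    """Return (protein concentration in uM, PEG % wt/vol) after combining the two volumes."""
    total_ul = protein_ul + peg_ul
    if total_ul <= 0:
        raise ValueError("total volume must be positive")
    # Each component is diluted by the ratio of its own volume to the total volume.
    return protein_um * protein_ul / total_ul, peg_percent * peg_ul / total_ul


# A 50 uM preparation mixed with an equal volume of 20% PEG 8000.
print(after_mixing(50.0))  # (25.0, 10.0)
```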

Image analysis of in vitro droplet formation

Protein droplets were detected using the ZEN blue 3.4 Image Analysis and Intellesis software packages. By use of a previously trained Intellesis model in spectral mode, we achieved image segmentation of individual pixels into objects (droplet area) or background (image background). A minimum cutoff of 120 nm in diameter was applied on the identified objects. Relative amounts of condensed protein were calculated by dividing the sum of mEGFP signal in objects defined as droplet area by the overall sum of mEGFP signal in the field of view. All values were calculated using RStudio. Plots were generated using GraphPad Prism 9. To fit data to a sigmoidal curve, we applied the in-built nonlinear regression function (Sigmoidal; x is the concentration; Figs. 1h, 2c and Extended Data Figs. 2d,h, 4i, 7a, 9c).
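
The relative amount of condensed protein defined above is a ratio of summed intensities. The Python sketch below shows that calculation on a toy image. It is not the authors' RStudio code: the segmentation mask is assumed to come from an upstream step (here a simple threshold stands in for the Intellesis model), and the function name and data are hypothetical.

```python
def condensed_fraction(intensity, droplet_mask):
    """Sum of signal inside droplet pixels divided by the sum of signal in the whole field.

    `intensity` and `droplet_mask` are equally shaped lists of rows; the mask
    is True where a pixel was segmented as droplet area.
    """
    if len(intensity) != len(droplet_mask):
        raise ValueError("image and mask must have the same number of rows")
    total = 0.0
    in_droplets = 0.0
    for row_values, row_mask in zip(intensity, droplet_mask):
        if len(row_values) != len(row_mask):
            raise ValueError("image and mask rows must have the same length")
        for value, is_droplet in zip(row_values, row_mask):
            total += value
            if is_droplet:
                in_droplets += value
    if total <= 0:
        raise ValueError("field of view contains no signal")
    return in_droplets / total


# Toy 3 x 3 field of view with two bright droplet pixels.
image = [[1.0, 1.0, 1.0],
         [1.0, 9.0, 9.0],
         [1.0, 1.0, 1.0]]
# Stand-in for the trained segmentation model: a plain intensity threshold.
mask = [[value > 5.0 for value in row] for row in image]
print(condensed_fraction(image, mask))  # 18 / 25 = 0.72
```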

FRAP

For FRAP experiments, droplets were formed as described above but without the 30 min of pre-assembly at room temperature and at a protein concentration of 25 μM. The droplets were bleached immediately after pipetting the protein mixture onto the slide using ten iterations of 488 nm light at 70% laser power. Bleaching was performed on a central region of a settled single droplet. Fluorescence recovery was measured over a time course of 60 s at intervals of 2 s. Quantification of FRAP data was based on at least ten images acquired in at least two independent image series per condition. The resulting signal recovery was normalized to the background and fitted to a power law model in Microsoft Excel. All figures were generated using GraphPad Prism 9 (Figs. 2d, 4c, 6b and Extended Data Fig. 4a,j,k).
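
The normalization and power-law fit can be sketched in Python as follows. Two assumptions are made that are not specified in the text: the normalization subtracts the background and scales to the pre-bleach intensity, and the power law y = a·t^b is fitted by least squares on log-transformed data, which is one common way to fit such a model. The function names and the synthetic recovery curve are hypothetical, and the analysis reported here was done in Microsoft Excel, not with this code.

```python
import math


def normalize_recovery(signal, background, prebleach):
    """Background-subtract and scale so that the pre-bleach intensity equals 1."""
    reference = prebleach - background
    if reference <= 0:
        raise ValueError("pre-bleach intensity must exceed background")
    return [(s - background) / reference for s in signal]


def fit_power_law(times, values):
    """Least-squares fit of values = a * times**b on log-log axes; returns (a, b)."""
    pairs = [(math.log(t), math.log(v)) for t, v in zip(times, values) if t > 0 and v > 0]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two positive data points")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    if sxx == 0:
        raise ValueError("time points must differ")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    b = sxy / sxx                          # slope on log-log axes is the exponent
    a = math.exp(mean_y - b * mean_x)      # intercept gives the prefactor
    return a, b


# Synthetic curve sampled every 2 s for 60 s, as in the acquisition scheme above.
times = list(range(2, 62, 2))
raw = [100.0 + 900.0 * 0.1 * t ** 0.5 for t in times]  # background 100, pre-bleach 1000
norm = normalize_recovery(raw, background=100.0, prebleach=1000.0)
a, b = fit_power_law(times, norm)
print(round(a, 3), round(b, 3))  # 0.1 0.5
```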

Generation of DNA constructs for transactivation assays

To study the transactivation strength of TF IDRs, we amplified sequences from codon-optimized gene fragments (Twist Bioscience) using the primers listed in Supplementary Table 6. The amplified gene fragments were cloned into a pGAL4 (Addgene, catalogue number 145245) backbone, linearized with AsiSI (NEB) and BsiWI (NEB), via NEBuilder HiFi assembly.

Generation of DNA constructs for TF-IDR tiling assays

To control for the potential creation of short linear motifs in TF-IDR mutants, we tiled up the HOXD4 wild-type and AroPERFECT, C/EBPα wild-type and AroPERFECT IS15, OCT4 wild-type C and AroPERFECT C, MYOD1 wild-type C and AroPERFECT C, and EGR1 wild-type and AroSCRAMBLED IDRs into 40-amino-acid segments with 20-amino-acid overlaps. We amplified all 40-amino-acid tiles in steps of 20 amino acids starting from the first amino acid of the sequence using the primers listed in Supplementary Table 6. The amplified gene fragments were cloned into a pGAL4 backbone, linearized with AsiSI (NEB) and BsiWI (NEB), via NEBuilder HiFi assembly (Figs. 2e,4f and Extended Data Figs. 5d,f, 10b).
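
The tiling scheme (40-amino-acid tiles taken in steps of 20 amino acids from the first residue) can be illustrated with a short Python sketch. The function name and the toy sequence are hypothetical. How a C-terminal remainder shorter than a full tile was handled is not stated in the text; this sketch simply omits it.

```python
def tile_sequence(sequence, tile_len=40, step=20):
    """Return (1-based start position, tile) pairs for full-length tiles only."""
    if tile_len <= 0 or step <= 0:
        raise ValueError("tile_len and step must be positive")
    tiles = []
    # The range is empty when the sequence is shorter than one tile.
    for start in range(0, len(sequence) - tile_len + 1, step):
        tiles.append((start + 1, sequence[start:start + tile_len]))
    return tiles


# Toy 100-residue sequence (not a real IDR).
toy_idr = "ACDEFGHIKLMNPQRSTVWY" * 5
tiles = tile_sequence(toy_idr)
print([start for start, _ in tiles])  # [1, 21, 41, 61]
# Neighbouring tiles share 20 residues.
print(tiles[0][1][20:] == tiles[1][1][:20])  # True
```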

Transactivation assay

The transactivation activity of TF IDRs was assayed using the Dual-Glo Luciferase Assay system (Promega). Mouse embryonic stem cells were seeded at a density of 1 × 10⁵ cells cm⁻² on 24-well plates that had been pre-coated with gelatin. For feeder-free culture conditions, mESC medium was supplemented with 2× leukemia inhibitory factor (LIF). HEK293T, SH-SY5Y and Kelly cells as well as C2C12 mouse myoblasts were seeded on 24-well plates at a density of 1 × 10⁵ cells cm⁻². After 24 h, every well was transfected with 200 ng pGal4 empty vector control or the equimolar amount of the expression construct carrying an IDR of interest, 250 ng of the firefly luciferase expression vector (Promega) and 15 ng of the Renilla luciferase expression vector (Promega) using FuGENE HD transfection reagent (Promega) according to the manufacturer’s instructions. After another 24 h, the cells were washed once with PBS and lysed in 100 μl of 1× Passive Lysis Buffer (Promega) for 15 min on a shaker at room temperature. Subsequently, 10 μl of cell lysate was pipetted, in triplicate, onto a white-bottomed 96-microwell plate, followed by quantification of firefly and Renilla luciferase luminescence using the Dual-Glo Luciferase Assay System Quick Protocol for 96-well plates (Promega). Triplicate data were normalized to Renilla luminescence of the respective well and finally normalized to the empty vector control. Data are shown as the mean ± s.d. All data shown were generated from three independent transfections from at least two cell passages (Figs. 1i, 2a,e–g, 4a,f,g, 5a, 7a and Extended Data Figs. 2e,i–m, 4d,f, 5a,e,g, 7c, 9a) and were plotted using GraphPad Prism 9. Two-tailed Student’s t-tests were performed to assess statistical significance.
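
The two-step normalization (firefly signal to the Renilla signal of the same well, then to the empty vector control) can be sketched in Python as follows. The sketch assumes that the second step divides each ratio by the mean ratio of the empty-vector wells, and it reports fold activity over control. The function names and luminescence values are hypothetical and are not data from this study.

```python
from statistics import mean, stdev


def relative_activity(firefly, renilla):
    """Per-well firefly luminescence divided by the Renilla luminescence of the same well."""
    if len(firefly) != len(renilla):
        raise ValueError("firefly and Renilla readings must be paired per well")
    return [f / r for f, r in zip(firefly, renilla)]


def fold_over_control(sample_ratios, control_ratios):
    """Sample ratios divided by the mean ratio of the empty-vector control wells."""
    control_mean = mean(control_ratios)
    return [ratio / control_mean for ratio in sample_ratios]


# Made-up luminescence counts for three wells per condition.
control = relative_activity([100.0, 110.0, 90.0], [1000.0, 1000.0, 1000.0])
sample = relative_activity([500.0, 600.0, 400.0], [1000.0, 1000.0, 1000.0])
fold = fold_over_control(sample, control)
print(round(mean(fold), 2), round(stdev(fold), 2))  # 5.0 1.0
```

Multiplying the fold values by 100 gives the percentage scale used in some of the figure panels.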

Western blots

Cultured cells were washed twice in PBS and lysed in RIPA buffer for 30 min on an orbital shaker at 4 °C. Subsequently, the cell lysate was centrifuged for 20 min at 20,000g. The cleared lysate was transferred to a new tube and total protein was quantified by BCA assay (Thermo Fisher Scientific). Extracted protein (20 μg) was run on a 4–12% NuPAGE SDS gel and transferred onto a polyvinylidene fluoride membrane using an iBlot2 dry gel transfer device (Invitrogen) following the manufacturer’s instructions. For GAL4-DBD blots, 50 μg of extracted protein was used. The membranes were blocked with 5% skim milk in TBST and incubated overnight with primary antibodies at 4 °C. The primary antibodies used in this study include antibodies to IFI16 (Santa Cruz Biotechnology, sc-8023; 1:200), GFP (Invitrogen, A11122; 1:2,000), HSP90 (BD, 610419; 1:4,000), ARHGAP4 (Santa Cruz Biotechnology, sc-376251; 1:200), ESX1 (Santa Cruz Biotechnology, sc-365740; 1:200), GAL4-DBD (Santa Cruz Biotechnology, sc-510; 1:200) and FLAG (Merck, F1804; 1:2,000). Horseradish peroxidase-conjugated secondary antibodies to the host species were used at dilutions of 1:3,000–1:5,000 and visualized with HRP substrate SuperSignal West Dura (Thermo Fisher Scientific; Fig. 3f and Extended Data Figs. 4b,g, 5c, 6f, 7b, 10a).

Generation of DNA constructs for locus reconstruction assays

To confirm mutant-specific regulation of C/EBPα target promoters and enhancers, we amplified promoter and enhancer regions of GBP5, FAM98A and S100A using the primers listed in Supplementary Table 6. The amplified fragments were cloned into a pGL3-Basic vector (Promega), linearized with BamHI (NEB) and SalI (NEB) in case of an enhancer region or with HindIII (NEB) and KpnI (NEB) in case of a promoter, via NEBuilder HiFi assembly. Full-length C/EBPα wild type and AroPERFECT IS15 sequences for overexpression were cloned into a pGAL4 backbone, linearized with EcoRI (NEB) and AsiSI (NEB), via NEBuilder HiFi assembly.

Locus reconstruction with pGL3 reporter assays

Transcription factor activity at genomic loci was assayed using the Dual-Glo Luciferase Assay system (Promega). Mouse embryonic stem cells were seeded at a density of 1 × 10⁵ cells cm−2 on 24-well plates that had been pre-coated with gelatin. For feeder-free culture conditions, mESC medium was supplemented with 2× leukemia inhibitory factor (LIF). After 24 h, every well was transfected with 200 ng of plasmid containing a C/EBPα wild type or AroPERFECT IS15 overexpression cassette, 250 ng of pGL3-Basic control or an equimolar amount of the pGL3 construct carrying enhancer/promoter sequences of interest and 15 ng of the Renilla luciferase expression vector (Promega) using FuGENE HD transfection reagent (Promega) following the manufacturer’s instructions. After a further 24 h, the cells were washed once with PBS and lysed in 100 μl of 1× Passive Lysis Buffer (Promega) for 15 min on a shaker at room temperature. Subsequently, 10 μl of the cell lysate was pipetted, in triplicate, onto a white-bottomed 96-microwell plate, followed by quantification of firefly and Renilla luminescence using the Dual-Glo Luciferase Assay System Quick Protocol for 96-well plates (Promega). Triplicate firefly data were normalized to the Renilla luminescence of the respective well and then normalized to the pGL3-Basic vector control. Data are shown as the mean ± s.d. All data shown were generated from three independent transfections from at least two cell passages (Fig. 5l) and were plotted using GraphPad Prism 9. Two-tailed Student’s t-tests were performed to assess statistical significance.

LacO-LacI tethering assay

For the LacO-LacI tethering experiments (Figs. 3g and 4d), we used a vector containing CFP–LacI, followed by a previously published multiple cloning site20. The RNAPII-CTD plasmid was cloned via digestion with AsiSI (NEB) and BsiWI (NEB) using the NEBuilder HiFi assembly master mix.

The tethering experiments were adapted from a previous report20. Imaging was performed on live U2OS cells 48 h after transfection with 100 ng CFP–LacI-HOXD4 wild type, HOXD4 AroPERFECT, C/EBPα wild type or C/EBPα AroPERFECT IS15 plasmid and 100 ng RNAPII-CTD–YFP-NLS using the FuGENE HD transfection reagent. Images were acquired using an LSM880 confocal microscope equipped with a plan-apochromat ×63 NA = 1.40 oil DIC objective with a ×2 zoom. The laser intensities were adjusted before imaging to prevent possible channel bleed. Images were acquired across two experimental replicates.

LacO-LacI tethering assay analysis

For the analysis of LacO-LacI images (Figs. 3h and 4e), regions of interest corresponding to CFP–LacI-IDR fusion proteins were detected manually based on the cyan channel using ImageJ v2.0.0. The mean intensities of these selected regions of interest were measured in both the YFP and CFP channels. The background intensity of the YFP channel was defined using a mean intensity measurement of a random nuclear region of the same size and shape as the primary region of interest. Enrichment of the YFP signal in the regions of interest (predefined by the CFP signal) was calculated by dividing the YFP mean signal intensity of the region of interest by the YFP mean signal intensity of the random nuclear region. Values were plotted as indicated using GraphPad Prism 9; n, number of observations.
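The enrichment value described above is a simple ratio of mean intensities. The sketch below assumes that the pixel intensities of each region have already been exported (for example from ImageJ) as plain lists of numbers; the function name `yfp_enrichment` is hypothetical.

```python
from statistics import mean

def yfp_enrichment(roi_yfp, background_yfp):
    """YFP enrichment at a CFP-defined region of interest.

    roi_yfp: YFP pixel intensities inside the region of interest.
    background_yfp: YFP pixel intensities inside a random nuclear region
    of the same size and shape.
    """
    background = mean(background_yfp)
    if background == 0:
        raise ValueError("background YFP intensity is zero")
    return mean(roi_yfp) / background
```

A value of 1 indicates no enrichment of the YFP-tagged protein at the tethered locus relative to the rest of the nucleus.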

Generation of HAP1 cells expressing DOX-inducible HOXD4 transgenes with the PiggyBac system

To generate a DOX-inducible overexpression system of HOXD4, we randomly integrated the coding sequences of wild-type, AroPERFECT and AroPLUS HOXD4 into HAP1 cells using the PiggyBac transposon system.

N-terminally FLAG-tagged coding sequences of human wild-type, AroPERFECT or AroPLUS HOXD4 (Twist Bioscience) with a downstream 5×GS-linker (Sigma) were cloned into a backbone of the inducible Caspex expression vector (Addgene, catalogue number 97421), linearized by restriction digest with NcoI (NEB) and KpnI (NEB). Carrier plasmids and PiggyBac transposase expression vector (SBI, PB210PA-1) were cotransfected at a molar ratio of 6:1 into wild-type HAP1 cells using Lipofectamine 3000 according to the manufacturer’s instructions. The transfected bulk population was screened for integration by addition of 2 μg ml−1 puromycin (Gibco) to the cell culture medium 24 h after transfection for a total of four days. Bulk populations of every condition were induced by addition of 2 μg ml−1 DOX (Sigma) and screened for matching mEGFP expression levels across conditions using flow cytometry. For the generation of clonal HOXD4 overexpression lines, bulk cells were single-cell sorted by FACS. HAP1 HOXD4 cells were directly sorted into wells of a 96-well plate. Wells without any cells or with more than two cells were discarded. The other clones were expanded and eventually tested for HOXD4 expression following DOX induction by FACS (Extended Data Fig. 6h). Cells with the most similar expression levels were selected for further experiments.

Generation of DOX-inducible NGN2 overexpression systems in human iPSCs

To generate a DOX-inducible overexpression system of NGN2, we randomly integrated the coding sequences of wild-type, AroLITE and AroPERFECT NGN2 into ZIP13K2 cells using the PiggyBac transposon system.

N-terminally FLAG-tagged coding sequences of human wild-type, AroLITE or AroPERFECT NGN2 (Twist Bioscience) with a downstream T2A tag (Sigma) were cloned into a backbone of the inducible Caspex expression vector linearized by restriction digest with NcoI (NEB) and KpnI (NEB). Carrier plasmids and PiggyBac transposase expression vector were cotransfected at a molar ratio of 6:1 into wild-type ZIP13K2 cells using Lipofectamine stem transfection reagent (Thermo Fisher Scientific) following the manufacturer’s instructions. The transfected bulk population was screened for integration by addition of 2 μg ml−1 puromycin (Gibco) to the cell culture medium 24 h after transfection for a total of four days. The surviving cells were seeded at low density with added 1×Y-27632 Rho-kinase inhibitor (biogems, 1293823) for the first 24 h and expanded for several days until colonies derived from single cells were big enough to be picked and cultured separately. Clones of every condition were induced by addition of 2 μg ml−1 DOX (Sigma) and screened for matching mEGFP expression levels across conditions using flow cytometry.

Generation of DOX-inducible MYOD1 overexpression lines in C2C12 cells

To generate a DOX-inducible overexpression system of MYOD1, we randomly integrated the coding sequences of wild-type, AroLITE, AroPERFECT C and AroLITE C MYOD1 into C2C12 cells using the PiggyBac transposon system.

N-terminally FLAG-tagged coding sequences of human wild-type, AroLITE, AroPERFECT C or AroLITE C MYOD1 (Twist Bioscience) with a downstream T2A tag (Sigma) were cloned into a backbone of the inducible Caspex expression vector linearized by restriction digest with NcoI (NEB) and KpnI (NEB). Carrier plasmids and PiggyBac transposase expression vector were cotransfected at a molar ratio of 6:1 into wild-type C2C12 cells using Lipofectamine 3000 transfection reagent following the manufacturer’s instructions. The transfected bulk population was screened for integration by addition of 2 μg ml−1 puromycin (Gibco) to the cell culture medium 24 h after transfection for a total of four days. Cells of every condition were induced by addition of 2 μg ml−1 DOX (Sigma) and screened for matching mEGFP expression levels across conditions by flow cytometry.

Imaging of HAP1 HOXD4 PiggyBac overexpression cells

For the subnuclear localization analysis of HOXD4 mutants, HAP1 cells with integrated HOXD4 overexpression cassettes were seeded onto chambered coverslips. After 24 h, the culture medium was supplemented with 2 µg ml−1 DOX to induce expression of HOXD4 transgenes. The following day, the cells were washed with PBS and fixed with 4% paraformaldehyde for 15 min at room temperature. The cells were then stained with 0.25 µg ml−1 DAPI (Invitrogen). Images were acquired using a Stellaris 8 confocal microscope and a plan-apochromat ×100 NA = 1.40 oil CS2 objective (Leica). For the analysis of subnuclear localization, a mosaic of at least 100 tile regions was imaged for each condition over two replicates. Object quantification was performed using the ZEN 3.4 software (Zeiss). Briefly, DAPI counterstain was used to segment objects after Gaussian smoothing. The mean mEGFP intensities were then individually calculated for each segmented nucleus and the granularity was calculated by dividing the s.d. of the mEGFP signal of each nucleus by the corresponding mean mEGFP signal using custom ImageJ/FIJI routines (Fig. 3b)60.
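The granularity measure defined above is the coefficient of variation of the mEGFP signal within each nucleus. A minimal sketch follows, assuming the per-nucleus pixel intensities are available as a list; the function name `granularity` is hypothetical, and the use of the population s.d. rather than the sample s.d. is an assumption, as the text does not say which was used.

```python
from statistics import mean, pstdev

def granularity(pixels):
    """Coefficient of variation (s.d. / mean) of one nucleus's mEGFP signal."""
    m = mean(pixels)
    if m == 0:
        raise ValueError("mean intensity is zero")
    # Population s.d. over all pixels of the nucleus (assumption).
    return pstdev(pixels) / m
```

A perfectly uniform nucleus scores 0, and the score rises as the signal concentrates into bright foci against a dim background.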

Imaging of HAP1 HOXD4 knock-in cells

For imaging of HOXD4 knock-in cells, 2 × 10⁴ cells were seeded onto chambered coverslips. After 24 h, the cells were washed with PBS and fixed with 4% paraformaldehyde for 15 min at room temperature. The cells were permeabilized with PBS supplemented with 0.1% Tween-20 (Sigma) for 5 min and PBS supplemented with 0.25% Tween-20 for 15 min. The cells were then stained with primary (anti-GFP; Invitrogen, A11122; 1:500) and secondary (goat anti-rabbit Alexa Fluor 594; Jackson ImmunoResearch, 2338059, 1:500) antibodies. Nuclei were stained with 0.25 µg ml−1 DAPI. Images were acquired using a Stellaris 8 confocal microscope and a Plan-Apochromat ×100/1.40 oil CS2 objective (Leica). For the analysis of subnuclear localization, a mosaic of at least 100 tile regions was imaged for each condition over two replicates. Object quantification was performed using the ZEN 3.4 software (Zeiss). Briefly, DAPI counterstain was used to segment objects after Gaussian smoothing. The mean mEGFP intensities were then individually calculated for each segmented nucleus and the granularity was calculated by dividing the s.d. of the mEGFP signal of each nucleus by the corresponding mean mEGFP signal using custom ImageJ/FIJI routines (Fig. 3a)60.

NGN2-mediated neural differentiation of human iPSCs

We adapted our protocol for the differentiation of human iPSCs into neurons by overexpression of NGN2 from a previous study42. ZIP13K2 cells with an integrated NGN2 overexpression cassette were cultured on 10 cm culture plates that had been pre-coated with Matrigel (Corning). When the cultures reached a confluency of approximately 80%, 2 μg ml−1 DOX (Sigma) was added to the culture medium to induce expression of the NGN2 transgene. After 24 h, the induced cultures were sorted for mEGFP-expressing cells by flow cytometry. Positive cells were seeded at a density of 2 × 10⁴ cells cm−2 in mTeSR+ medium plus 1×Rho-kinase inhibitor on Matrigel-pre-coated 96-well microclear plates (Greiner bio-one). On day 2, the mTeSR+ medium was replaced with N2B27 neural cell culture medium supplemented with 5 μg ml−1 human BDNF (Bio-Techne). The differentiation medium was changed every day for a total of four days. Living cells were stained with 0.25 μg ml−1 Hoechst and Spy650-TUB (1:2,000; Spirochrome) and incubated in the microscope before image acquisition to equilibrate and thermalize all materials (Fig. 6d–f).

KAPA stranded mRNA-seq of ZIP13K2 NGN2 PiggyBac cells

On day 5 of NGN2-mediated neural differentiation, RNA was extracted from ZIP13K2 induced neurons following the Direct-zol RNA MicroPrep Kit (Zymo Research) standard protocol. Complementary DNA libraries were then prepared and sequenced as described earlier in the ‘KAPA stranded messenger RNA-seq of HAP1 HOXD4 knock-in cells’ section (Fig. 6g,h and Extended Data Fig. 9f,g).

Live-cell imaging of human iPSC-derived neurons

Living cells were imaged using the Celldiscoverer 7 imaging platform (Zeiss) in wide-field mode running under the ZEN Blue 3.1 imaging software and full environmental control (5% vol/vol CO2, 100% humidity and 37 °C). The final experiments were performed using a plan-apochromat ×20, NA = 0.7 objective and a ×2 tube lens (Zeiss), and captured on an Axiocam 506 camera (Zeiss) with 3 × 3 binning, resulting in a lateral pixel resolution of 0.347 μm per pixel. The fully automated imaging approach typically captured 20–40% of individual well surfaces. Focus stabilization was achieved by the surface method in every third tile region. All images were acquired with one or two additional transmitted-light or contrasting-method (brightfield, oblique or phase gradient contrast) channels. Each individual image position was acquired in consecutive sections of three slices surrounding the focus position with a z-spacing of 0.63 μm to ensure that all neurites were captured. All parameters were kept identical during the experimental time course. The resulting large overview tile scan underwent a maximum-intensity projection and subsequent channel stitching using the nuclear counterstain (Hoechst) as reference (Fig. 6d). We quantified cell numbers (Hoechst) and neurite density (SPY650) based on the respective channel.

Image analysis of nuclei and neurite densities in differentiated neurons

Wide-field images were acquired using a ×20 air objective (NA = 0.7) with ×2 optical post magnification on a Celldiscoverer 7 microscope under the ZEN Blue 3.2 software (Zeiss). For each well and replicate, a mosaic of 201 tile regions was imaged. A definite hardware focus was defined as the centre for three slices of a consecutive z-stack with a slice distance of 0.34 µm. Image acquisition was performed using a Zeiss Axiocam 506 camera in 3 × 3 binning mode, resulting in a lateral resolution of 0.34 µm per pixel. The resulting images were projected using maximum-intensity projection in ZEN 3.4 on a dedicated Zeiss analysis workstation. Object quantification was performed in the image analysis module in ZEN 3.4 (Zeiss, Germany). Briefly, within maximum-intensity projections, nuclei were identified by nuclear counterstaining using Otsu intensity thresholds after faint smoothing (Gauss: 2.0) and nearby objects were segmented downstream by standard watershedding. Neurites were segmented by a fixed intensity threshold on the respective staining without any watershedding (Fig. 6e,f).

FLAG-NGN2 ChIP–Seq

To study the chromatin association of wild-type, AroLITE and AroPERFECT NGN2, we performed ChIP–Seq experiments in ZIP13K2 cells expressing the respective constructs 24 and 48 h after induction of NGN2-mediated neural differentiation (Fig. 6i,j and Extended Data Fig. 9h,i). The previously published ChIPmentation protocol was used61.

The cells were detached using Accutase solution (Sigma), washed twice in PBS and fixed by incubation with 1% formaldehyde for 10 min at room temperature with rotation. Subsequently, the reaction was quenched by the addition of glycine to a final concentration of 125 mM. Per replicate, 3 × 10⁶ cells were used as starting material. Briefly, we followed the ChIPmentation protocol version 3 for histone marks and TFs62. The cells were lysed in lysis buffer 3 (10 mM Tris–HCl pH 8.0, 100 mM NaCl, 1 mM EDTA pH 8.0, 0.5 mM EGTA, 0.1% sodium deoxycholate and 0.5% N-lauroylsarcosine) supplemented with 1×cOmplete protease inhibitor cocktail. The chromatin was then sonicated for 10 min using a Covaris E220 Evolution focused-ultrasonicator with 2% duty cycles, 105 W peak incident power and 200 cycles per burst. The lysates were clarified by centrifugation for 10 min at 20,000g and 10% of the clarified lysate was put aside as input control. The remaining lysate was mixed with 50 µl of equilibrated anti-FLAG (Merck, F1804; 1 µg total) coupled to Dynabeads Protein G magnetic beads (Invitrogen) and incubated on a 3D-shaker overnight at 4 °C. The next day, the samples were washed twice in TF-wash buffer I (20 mM Tris–HCl pH 7.4, 150 mM NaCl, 0.1% SDS, 1% Triton X-100 and 2 mM EDTA pH 8.0), followed by two washes in TF-wash buffer III (10 mM Tris–HCl pH 8.0, 250 mM LiCl, 1% Triton X-100, 0.7% sodium deoxycholate and 1 mM EDTA pH 8.0) and a final wash with 10 mM Tris–HCl pH 8.0. All samples were tagmented for 5 min at 37 °C using the Illumina Tagment DNA kit and immediately put on ice. The tagmented chromatin was washed twice in ice-cold wash buffer I and twice in TET buffer (10 mM Tris–HCl pH 8.0, 5 mM EDTA pH 8.0 and 0.2% Tween-20), and reverse-crosslinked for 1 h at 55 °C and 9 h at 65 °C in the presence of 300 mM NaCl and proteinase K (Ambion). Subsequently, DNA was purified using AMPureXP beads. Sequencing libraries were amplified using the Kapa HiFi HotStart ready mix (Roche) and Nextera custom primers (Illumina)61 for a total of 12 cycles and paired-end sequenced on a NovaSeq 6000 system (Illumina) with a depth of approximately 50 × 10⁶ fragments per library (Fig. 6i,j and Extended Data Fig. 9h,i).

TT-SLAM-Seq

To study the immediate transcriptional effects of wild-type, AroLITE and AroPERFECT NGN2 overexpression on ZIP13K2 human iPSCs, the cells were treated with DOX for 12 or 24 h and subjected to 15 min of 4-thiouridine labelling using 500 µM 4-thiouridine. TT-SLAM-Seq was performed as previously described21.

Image analysis of differentiated C2C12 myotubes

Wide-field images were acquired using a ×20 air objective (NA = 0.7) with ×2 optical post magnification on a Celldiscoverer 7 under the ZEN Blue 3.2 software (Zeiss). For each well and replicate, a mosaic of 49 tile regions was covered. We defined the definite hardware focus as the centre for three slices of a consecutive z-stack with a slice distance of 0.34 µm. Image acquisition was performed using a Zeiss Axiocam 506 camera in 3 × 3 binning mode, resulting in a lateral resolution of 0.34 µm per pixel. The resulting images were projected using maximum-intensity projection in ZEN 3.4 (Zeiss) on a dedicated Zeiss analysis workstation. Quantification of fusion scores was conducted by implementation of a simple hierarchy order, which was built within the image analysis module in ZEN 3.4 (Zeiss). We designed two segregating parent classes by fixed intensity thresholds based on mEGFP signal, resulting in fused myotubes and non-myotubes. Within these primary regions, nuclei were identified. Secondary objects were identified exclusively within primary objects (myotubes and non-myotubes) by applying Gaussian smoothing and fixed intensity thresholds on the nuclear counterstaining, followed by standard watershedding of the respective fluorescence image. All nuclei objects were filtered according to an area between 30 and 300 µm² (Fig. 7d).

C/EBPα-mediated transdifferentiation of B cells to macrophages

To induce C/EBPα-mediated B cell-to-macrophage transdifferentiation, infected RCH-rtTA cells were seeded at 0.3 × 10⁶ cells ml−1 in RCH culture medium supplemented with IL-2 (PeproTech, 200-03) and CSF-1 (PeproTech, 315-03B), both at 10 ng ml−1, as well as 2 µg ml−1 DOX. The macrophage transdifferentiation was monitored by flow cytometry. Briefly, blocking was carried out for 10 min at room temperature using a 1:20 dilution of human FcR binding inhibitor (eBiosciences, 16-9161-73). Subsequently, the cells were stained with antibodies to CD19 (APC–Cy7 mouse anti-human CD19; BD Pharmingen, catalogue number 557791) and Mac1 (APC mouse anti-human CD11b/Mac1; BD Pharmingen, catalogue number 550019) at 4 °C for 20 min in the dark. After washing, DAPI counterstaining was performed just before analysis. All analyses were performed using an LSR Fortessa instrument (BD Biosciences). Data analysis was completed using the FlowJo software (Fig. 5c and Extended Data Fig. 7e).

FACS analysis of CD66a and FCGR2A

CD66a and FCGR2A expression levels were monitored by FACS analysis during C/EBPα-mediated transdifferentiation of B cells to macrophages. RCH-rtTA cells expressing DOX-inducible wild-type or AroPERFECT IS15 CEBPA were seeded at 0.5 × 10⁶ cells ml−1 in RCH culture medium supplemented with IL-2 and CSF-1, both at 10 ng ml−1, as well as 2 µg ml−1 DOX. The cells were collected at 24 and 48 h. Blocking was carried out for 10 min at room temperature using a 1:20 dilution of human FcR binding inhibitor. Subsequently, the cells were stained with antibodies to CD66a (Alexa Fluor 647 anti-human CD66a; BioLegend, catalogue number 398905) and FCGR2A (PE anti-human FCGR2A; BioLegend, catalogue number 305503) at 4 °C for 20 min in the dark. After washing, DAPI counterstaining was performed just before analysis. All analyses were performed using an LSR Fortessa instrument (BD Biosciences). Data analysis was completed using the FlowJo software (Extended Data Fig. 8n,q).

Generation of scRNA-seq data

One week after induction of C/EBPα-mediated B cell-to-macrophage transdifferentiation, the cells were collected and washed twice in PBS to remove dead cells and debris. The cells were then resuspended in solution at a density of 700 cells µl−1. We used the Chromium Next GEM Single Cell 3′ technology for generating gene expression libraries from single cells. Briefly, gel beads-in-emulsion (GEMs) are generated by the combination of barcoded Single Cell 3′ v3.1 Gel Beads, a master mix containing cells and partitioning oil on a Chromium Next GEM Chip G. To achieve single-cell resolution, the cells are delivered at a limiting dilution, such that the majority (approximately 90–99%) of generated GEMs contain no cell, whereas the remainder largely contain a single cell. Immediately following GEM generation, gel beads were dissolved, primers were released and any co-partitioned cell was lysed. Primers (containing an Illumina TruSeq Read 1, 16 nucleotide 10X Barcode, 12 nucleotide unique molecular identifier and 30 nucleotide poly-dT sequence) were mixed with the cell lysate and a master mix containing reverse transcription reagents. Incubation of the GEMs produced barcoded full-length cDNA from poly-adenylated mRNA. After incubation, the GEMs were broken and pooled fractions were recovered. Silane magnetic beads were used to purify the first-strand cDNA from the post GEM-reverse transcription reaction mixture, which includes leftover biochemical reagents and primers. Barcoded full-length cDNA was amplified via PCR to generate sufficient mass for library construction. The cDNA was analysed using an Agilent Bioanalyzer assay (catalogue number 5067-4626) to check size distribution profile and for quantification. Only 25% of the cDNA was used for 3′ Gene Expression Library construction. Enzymatic fragmentation and size selection were used to optimize the cDNA amplicon size. TruSeq Read 1 (read 1 primer sequence) was added to the molecules during GEM incubation. 
P5, P7, a sample index and TruSeq Read 2 (read 2 primer sequence) were added via end repair, A-tailing, adaptor ligation and PCR. The final libraries contained the P5 and P7 primers used in Illumina bridge amplification. The final libraries were analysed using an Agilent Bioanalyzer assay to estimate the quantity and check size distribution, and were then quantified by quantitative PCR using a library quantification kit (Kapa Biosystems, catalogue number KK4835).

C/EBPα–GFP ChIP–Seq

To study the chromatin association of C/EBPα wild type and AroPERFECT IS15, we performed ChIP–Seq in C/EBPα wild type and AroPERFECT RCH-rtTA cells 24 and 48 h after induction of C/EBPα-mediated macrophage transdifferentiation (Fig. 5f–h,j, Extended Data Fig. 8k,l,o and Supplementary Fig. 2a–c,e). The protocol was previously described41. The cells (5 × 10⁶) were collected, crosslinked for 10 min using 1% formaldehyde and quenched using a final concentration of 0.125 M glycine. After a wash in cold PBS and centrifugation, the pellets were lysed in 500 µl pre-cooled SDS lysis buffer (1% SDS, 10 mM EDTA, 50 mM Tris pH 8 and 1×protease inhibitor cocktail) and incubated on ice for 15 min. The chromatin was sheared using a Bioruptor Pico sonicator (Diagenode) at 4 °C for 18 cycles of 30 s on and 30 s off. After sonication, the solution was clarified by centrifugation at 1,000g and 4 °C for 5 min; the supernatant was transferred to a low-bind tube and mixed with 900 µl ChIP dilution buffer (0.01% SDS, 1.1% Triton X-100, 1.2 mM EDTA, 16.7 mM Tris–HCl pH 8.0, 167 mM NaCl and 1×protease inhibitor cocktail) containing antibody-coupled beads (10 µl anti-GFP; clone 3E6, Thermo Fisher Scientific, A-11120, and 35 µl of protein G magnetic beads; Thermo Fisher Scientific, 10003D). Five per cent was saved as input and the samples were incubated overnight at 4 °C under rotation. The beads were then collected and washed with 500 µl low salt buffer (0.1% SDS, 1% Triton X-100, 2 mM EDTA, 20 mM Tris–HCl pH 8.0 and 150 mM NaCl), high salt buffer (0.1% SDS, 1% Triton X-100, 2 mM EDTA, 20 mM Tris–HCl pH 8.0 and 500 mM NaCl), RIPA-LiCl buffer (10 mM Tris–HCl pH 8.0, 1 mM EDTA, 250 mM LiCl, 0.5% NP-40 and 0.5% sodium deoxycholate) and twice with TE buffer (10 mM Tris–HCl pH 8.0 and 1 mM EDTA). The beads were then collected and eluted in 70 µl Elution buffer (10 mM Tris–HCl pH 8.0, 5 mM EDTA, 300 mM NaCl and 0.5% SDS), followed by incubation with proteinase K for 1 h at 55 °C and then overnight at 65 °C to reverse the crosslinking. The beads were collected and transferred to a new tube and a second step of elution was performed with 30 µl Elution buffer. Finally, DNA was purified using a Qiagen MinElute column and 3 ng DNA was used to construct sequencing libraries with a NEBNext ultra DNA library prep kit for Illumina (E7370L). The libraries were sequenced on Illumina NextSeq 2000 instruments using the 50 nucleotides single-end mode to obtain around 50 × 10⁶ reads per sample.

Identification of periodic blocks in TF IDRs

We used 1,392 full-length TF protein sequences from Animal Transcription Factor DataBase (AnimalTFDB) v3.0 (ref. 63) and determined the positions of all aromatic residues F, Y and W (stickers) within them. Next, we identified spacers, that is, stretches of non-aromatic residues between the stickers. A periodic block of aromatic residues was defined as a region that comprises at least four aromatic amino acids. We considered spacer lengths of 4–9, 10–20 or 21–30 amino acids. The ranges of different spacer lengths used for the analysis were chosen based on previous modelling studies on biopolymers using the stickers-and-spacers formalism64,65,66. Next, we identified periodic blocks that overlap IDR regions using the Metapredict v2 IDR prediction network67. This resulted in the identification of periodic blocks of aromatic residues in 531 TF IDRs (Extended Data Fig. 1a,b, Supplementary Table 1). For an internal ranking of periodic TF IDRs, we calculated a periodicity score comprising the number of periodic blocks that overlapped with the protein IDRs. The three spacer subgroups were weighted by 1, 1.1 and 1.2 for the lengths of 4–9, 10–20 and 21–30 residues in a single spacer, respectively. The weighting values were arbitrarily chosen with the assumption that uniform aromatic dispersion with long spacers may be less likely to occur randomly (Extended Data Fig. 1a,b and Supplementary Table 1).
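One way to read the block definition and scoring scheme above is sketched below. It is an interpretation, not the authors' code: it assumes that a periodic block is a run of at least four aromatic residues whose consecutive spacers all fall within one spacer-length class, that IDR boundaries are supplied by the caller (for example from Metapredict) as 0-based inclusive intervals, and that each overlapping block contributes its class weight to the score. The names `periodic_blocks` and `periodicity_score` are hypothetical.

```python
AROMATIC = set("FYW")
# Spacer-length classes and their weights, as stated in the text.
SPACER_CLASSES = {(4, 9): 1.0, (10, 20): 1.1, (21, 30): 1.2}

def periodic_blocks(seq, min_len, max_len, min_stickers=4):
    """Return (start, end) 0-based inclusive spans of periodic blocks."""
    stickers = [i for i, aa in enumerate(seq) if aa in AROMATIC]
    blocks = []
    run = stickers[:1]
    for prev, cur in zip(stickers, stickers[1:]):
        spacer = cur - prev - 1  # non-aromatic residues between two stickers
        if min_len <= spacer <= max_len:
            run.append(cur)
        else:
            if len(run) >= min_stickers:
                blocks.append((run[0], run[-1]))
            run = [cur]
    if len(run) >= min_stickers:
        blocks.append((run[0], run[-1]))
    return blocks

def periodicity_score(seq, idr_regions):
    """Weighted count of periodic blocks overlapping any IDR interval."""
    score = 0.0
    for (lo, hi), weight in SPACER_CLASSES.items():
        for start, end in periodic_blocks(seq, lo, hi):
            if any(start <= r_end and end >= r_start
                   for r_start, r_end in idr_regions):
                score += weight
    return score
```

For a toy sequence with four aromatic residues separated by five glycines each, `periodic_blocks(seq, 4, 9)` returns a single block spanning the first to the last aromatic residue.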

Prion-like domain analysis

For all predictions, if not stated otherwise, the total human proteome was used from the GRCh38.p13 assembly. For this, we filtered out all non-canonical proteins using Ensembl v104 annotation. For genes that did not have any isoform classified as ‘Ensembl canonical’, the longest ‘GENCODE basic’ isoform was considered. The AnimalTFDB v3.0 database63 was used as the reference set for annotating TFs and TF families (Fig. 1k,l and Extended Data Fig. 3d,e). Prion-like domains were identified using the PLAAC web application with default settings68. From the above-described set of human proteins, aromatic-rich prion-like domains were defined as those with 10% or more aromatic content.

Identification of intrinsically disordered protein regions

Intrinsically disordered protein regions were predicted using Metapredict with default settings using the Metapredict v2 network67.

Identification of regions with significant periodicity in the human proteome

We developed an in-house method to identify regions with significant, albeit not necessarily perfect, periodicity. Briefly, the number of residues between adjacent aromatic residues (that is, spacer length) was calculated for each protein and the observed distribution of spacer lengths within a sequence was compared with the expected geometric distribution using a K–S test. The mean of the geometric distribution was extrapolated from the proportion of aromatic residues, implicitly modelling their occurrence by a Poisson process. Next, the method was applied to every 100-amino-acid-long region using a sliding window approach and the P value of the K–S test was plotted against the position of each window in every protein. After plotting the P value of every 100-amino-acid-long region of each protein, the consecutive points below a P-value threshold (0.5 × average P value) were identified as periodic regions. Those regions were compared with the Metapredict IDRs and InterPro domain regions (https://www.ebi.ac.uk/interpro/), and overlap was defined as the overlap between regions of at least one amino acid. Only regions that contained at least five aromatic residues in the 100-amino-acid window with the lowest P value were included. Regions with significant periodicity were defined by the minimum P-value cutoff of 0.01 (Fig. 1k,l and Extended Data Fig. 3a,b). All regions are listed in Supplementary Table 2.
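The core comparison, observed spacer lengths against a geometric null, can be illustrated as follows. This is a simplified sketch rather than the in-house method: it assumes a geometric distribution on spacer lengths 0, 1, 2, ... with success probability equal to the aromatic fraction, and it approximates the P value with the asymptotic Kolmogorov series, which is only approximate for small samples and for discrete distributions. The function names are hypothetical.

```python
import math

AROMATIC = set("FYW")

def spacer_lengths(seq):
    """Number of residues between each pair of adjacent aromatic residues."""
    pos = [i for i, aa in enumerate(seq) if aa in AROMATIC]
    return [b - a - 1 for a, b in zip(pos, pos[1:])]

def ks_vs_geometric(spacers, p_aro):
    """Return (D, approximate P value) for spacers versus a geometric null.

    p_aro: proportion of aromatic residues, used as the per-residue
    probability of encountering the next aromatic residue.
    """
    n = len(spacers)
    if n == 0 or not 0 < p_aro < 1:
        raise ValueError("need at least one spacer and 0 < p_aro < 1")
    d = 0.0
    for k in range(max(spacers) + 1):
        empirical = sum(s <= k for s in spacers) / n
        expected = 1.0 - (1.0 - p_aro) ** (k + 1)  # geometric CDF at k
        d = max(d, abs(empirical - expected))
    lam = math.sqrt(n) * d
    if lam < 0.2:
        # The series is numerically unstable here; the P value is ~1.
        return d, 1.0
    p = 2.0 * sum((-1) ** (j - 1) * math.exp(-2.0 * (j * lam) ** 2)
                  for j in range(1, 101))
    return d, min(1.0, max(0.0, p))
```

Applying `ks_vs_geometric` to the spacers of each 100-residue window, with the window start as the x coordinate, would give a P-value profile of the kind described above; a strongly periodic window yields a small P value because its spacers cluster around one length instead of following the geometric decay.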

Omega score calculation

The ΩAro score was calculated using a modified localCIDER version69. Given that the omega score function is not length normalized, we adapted the Python code to allow for a variable interspace size, referred to in the package as the blob size. This parameter is now calculated by dividing the sequence length by the fraction of aromatic residues. For this analysis, only IDRs with a minimum of three aromatic residues were included. The mean random score was defined as the mean of 1,000 κ-score calculations of randomly shuffled sequence from the original sequence. The ggplot2 program (ref. 70) was used for plotting violin plots and custom R code was used to generate a distribution plot for the mean of random (Figs. 1f, 2a, 4a, 6a, 7a and Extended Data Figs. 2b, 4c,f, 5g, 7c, 9a). One-way analysis of variance with Tukey’s post hoc test was used to compare IDR sets (Fig. 1l).
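The shuffled baseline described above can be sketched independently of the scoring function. In the snippet below, `mean_shuffled_score` is a hypothetical helper that accepts any callable score; `adjacent_aromatic_pairs` is a toy stand-in used only to make the example runnable and is not the localCIDER-based score used in the study.

```python
import random
from statistics import mean

def mean_shuffled_score(seq, score_fn, n_shuffles=1000, seed=0):
    """Mean of score_fn over composition-preserving shuffles of seq."""
    rng = random.Random(seed)  # fixed seed for reproducibility (assumption)
    residues = list(seq)
    scores = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)  # shuffling keeps amino-acid composition fixed
        scores.append(score_fn("".join(residues)))
    return mean(scores)

def adjacent_aromatic_pairs(seq):
    """Toy score (illustration only): number of adjacent aromatic pairs."""
    return sum(a in "FYW" and b in "FYW" for a, b in zip(seq, seq[1:]))
```

Comparing the score of the native sequence with this baseline indicates whether the aromatic residues are more or less evenly dispersed than expected for a sequence of the same composition.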

Bulk RNA-seq analysis

RNA-seq raw data were filtered and trimmed using cutadapt71 with default settings. Filtered data from HAP1 and ZIP13K2 cells were mapped to a custom human genome hg38 including the cloned mEGFP sequence using STAR aligner72. Read count tables were generated by the same program. C2C12 RNA-seq data were mapped to the mm10 mouse genome using the abovementioned programs. Differential expression analysis was performed using the DESeq2 package73 in R version 4.2 (ref. 74). Differentially expressed genes were defined as having a fold change ≥ 1.5, Benjamini–Hochberg P ≤ 0.01 and a minimum mean read count across the experiment samples of 50 reads. For the HAP1 dataset, knockout samples were compared with the parental lines, and AroPERFECT and AroPLUS were compared with the HOXD4 wild-type line. For the ZIP13K2 datasets, the NGN2 wild-type line was compared with the parental ZIP13K2 line. AroLITE and AroPERFECT NGN2 were compared with the wild-type NGN2 line. Genes were considered as NGN2 targets if they were differentially expressed in the parental ZIP13K2 versus wild-type NGN2 comparison and had a peak assigned in the wild-type NGN2 ChIP–Seq analysis. For the C2C12 experiments, we compared the gene expression in the wild-type MYOD1 line with parental C2C12 cell gene expression and AroLITE, AroLITE C, AroPERFECT and AroPERFECT C variants with wild-type MYOD1. The differentially expressed genes are listed in Supplementary Table 4.
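The thresholds used to define differentially expressed genes can be illustrated with a short Python sketch. It implements the standard Benjamini–Hochberg adjustment and then applies the three cutoffs stated above. The function names are hypothetical, and applying the fold-change cutoff symmetrically to up- and downregulated genes is an assumption of this example; the actual analysis was run in DESeq2, which performs additional steps not reproduced here.

```python
import math


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def is_differentially_expressed(log2_fc, padj, mean_count,
                                min_fc=1.5, max_padj=0.01,
                                min_mean_count=50):
    """Apply the fold-change, adjusted-P and mean-read-count cutoffs."""
    return (abs(log2_fc) >= math.log2(min_fc)  # assumption: either direction
            and padj <= max_padj
            and mean_count >= min_mean_count)
```

A gene with a twofold change, an adjusted P value of 0.005 and a mean count of 120 reads would pass all three filters, whereas the same gene with a mean count of 10 reads would not.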

Principal component analysis was carried out using the plotPCA function from the DESeq2 package on the normalized read matrix that was transformed using the variance stabilizing transformation function from the DESeq2 package and plotted using ggplot2 (Figs. 3d, 6g, 7e and Extended Data Fig. 10d). Volcano plots were plotted using ggplot2 (Figs. 3e, 6h, 7f and Extended Data Figs. 8i, 10e). Heatmaps were plotted with the aid of the ComplexHeatmap package75 in R and cluster analysis was done by k-means clustering using the cluster76 package in R (Extended Data Figs. 6e, 9f and 10f).

Gene-set-enrichment analysis of the MYOD1 RNA-seq was conducted using GSEAPreranked v6.0.12 (ref. 77) with 1,000 permutations on the ranked lists of genes from the comparisons of AroPERFECT C versus wild type and wild type versus parental, sorted according to the Wald statistic (stat)73, against the Wikipathways cell adhesion gene set in Mus musculus78 (Extended Data Fig. 10g). Empirical P values were used for the plots. The highest-ranking genes in the AroPERFECT C versus wild type comparison that are MYOD1 targets were highlighted in the volcano plots (Fig. 7f and Extended Data Fig. 10e).

The marker genes shown in Extended Data Fig. 9g were identified as single-cell cluster markers in NGN2-induced neural differentiation in previous studies79,80.

ScRNA-seq analysis

Data pre-processing

The scRNA-seq datasets were processed using 10X Genomics’ Cell Ranger pipeline v3.1.0 (ref. 81) and mapped to a custom human genome hg38 including mEGFP and codon-optimized wild-type, AroPERFECT IS15 and AroPERFECT IS10 C/EBPα sequences. The Cell Ranger hdf5 files were processed using the Seurat package v4.0.6 (ref. 82).

Filtering and normalization

We kept cells with more than 2,000 expressed genes, and genes with >5 reads across the samples were considered for analysis. Further filtering was done by removing cells with >20% mitochondrial gene expression and <5% ribosomal gene expression. The top ten genes associated with PCA components were then checked for mitochondrial and ribosomal genes. Next, cells were scored for cell cycle, and gene expression on S and G2M genes was regressed out to eliminate any dependence of clustering on the cell cycle. Doublets were also identified and filtered out. mEGFP and C/EBPα wild-type, AroPERFECT IS15 and AroPERFECT IS10 reads were then used to identify mEGFP+ cells, and their expression was then moved to the metadata so it would not affect clustering. Finally, the Harmony package was used to batch correct the three libraries.
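The per-cell quality filters can be sketched in Python as shown below. This is a simplified illustration operating on a plain dictionary of per-cell metrics rather than on a Seurat object; the function names, the metric keys and the example barcodes are hypothetical, and treating the mitochondrial and ribosomal cutoffs as independent criteria (a cell is removed if it fails either one) is an assumption of this example.

```python
def passes_qc(n_genes, pct_mito, pct_ribo,
              min_genes=2000, max_pct_mito=20.0, min_pct_ribo=5.0):
    """True if a cell satisfies the gene-count, mitochondrial and ribosomal cutoffs."""
    return (n_genes > min_genes
            and pct_mito <= max_pct_mito
            and pct_ribo >= min_pct_ribo)


def filter_cells(cells):
    """Keep cells whose metrics pass QC.

    cells maps a barcode to a dict with the keys 'n_genes', 'pct_mito'
    and 'pct_ribo' (percentages of total counts per cell).
    """
    return {
        barcode: m
        for barcode, m in cells.items()
        if passes_qc(m["n_genes"], m["pct_mito"], m["pct_ribo"])
    }


# Hypothetical example cells
example = {
    "AAAC-1": {"n_genes": 3500, "pct_mito": 4.2, "pct_ribo": 18.0},
    "AAAG-1": {"n_genes": 1500, "pct_mito": 3.0, "pct_ribo": 20.0},
    "AACT-1": {"n_genes": 4100, "pct_mito": 35.0, "pct_ribo": 12.0},
}
kept = filter_cells(example)
```

Only the first cell in this example is retained: the second expresses too few genes and the third exceeds the mitochondrial threshold.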

Cluster identification

Cluster identification was then carried out using Seurat’s built-in functions FindvariableGenes, RunPCA, RunUMAP and FindClusters by first identifying the genes with the highest variation across all samples and cell types, building a shared-nearest-neighbour graph and then running the Louvain algorithm on it. The number of clusters was determined by the optimum of the modularity function from the Louvain algorithm. The number of mEGFP+ cells was then calculated for each cluster and this was used to filter untransformed cell clusters, mainly cluster 0 and cluster 2.
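The modularity function optimized by the Louvain algorithm can be illustrated with a short Python sketch that scores a given partition of an unweighted, undirected graph. This example only evaluates modularity for a fixed assignment and does not perform the optimization itself; the function name and the toy graph are hypothetical, and the resolution parameter is taken to be 1.

```python
from collections import defaultdict


def modularity(edges, community):
    """Newman modularity Q of a partition of an unweighted, undirected graph.

    edges: list of (u, v) pairs without self-loops or duplicates.
    community: dict mapping each node to its community label.
    Q = sum over communities of [L_c / m - (d_c / 2m) ** 2], where L_c is
    the number of edges inside community c, d_c is the summed degree of
    its nodes and m is the total number of edges.
    """
    m = len(edges)
    internal = defaultdict(int)
    degree_sum = defaultdict(int)
    for u, v in edges:
        degree_sum[community[u]] += 1
        degree_sum[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)


# Toy graph: two triangles joined by a single edge
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
merged = {node: "a" for node in range(6)}
q_split = modularity(edges, split)
q_merged = modularity(edges, merged)
```

Splitting the toy graph into its two triangles gives a higher modularity than placing all nodes in one community, which is the kind of comparison that the Louvain algorithm makes repeatedly when merging nodes and communities.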

Assignment of cell types to clusters

Cell-type cluster assignment was based on the comparison of marker sets from a published bulk RNA-seq experiment83 and augmented by both RNA velocity analysis and known markers for both B cell and macrophage cell types. Briefly, RNA-seq data and marker sets were retrieved from ref. 83; raw FASTQ files were aligned and reads were counted using STAR aligner against the human genome v38. Raw count data were then processed in DESeq2 and normalized using the variance stabilizing transformation. Marker set variance stabilizing transformation data were then retrieved and clustered according to the methods described previously83 and each gene was assigned a gene cluster for Early, Early–inter, Inter1, Inter2, Inter–late, Late1 and Late2 as described in the publication. This assignment was designated ‘Choi et al. differentiation clusters’ in Extended Data Fig. 8a,b. To quantify the number of genes that are highly expressed in each single-cell cluster, single-cell gene expression was averaged within the single-cell cluster and normalized to z-scores. Normalized gene expression for the abovementioned marker set was then clustered by k-means clustering with k = 8 in an effort to separate each single-cell cluster by expression profile and a heatmap was generated using ComplexHeatmap to visualize the expression profile (Extended Data Fig. 8a). For each k-means cluster, the gene list was retrieved and the number of terms of Choi et al. differentiation clusters was quantified for each cluster (Extended Data Fig. 8b). This analysis helped to define the B cell and macrophage populations and to assign them to differentiation stages. Pseudotime and PAGA graph analysis were also used to aid trajectory assignment by giving temporal context to the single-cell clusters.
Based on the differentiation term quantification, the expression pattern of the marker set, pseudotime and PAGA graph, we manually assigned each single-cell cluster to a differentiation state as follows. Clusters 0, 2 and 3 were considered as the earliest cell stage as they showed the least marker induction and also the lowest pseudotime scores. As mentioned earlier, clusters 0 and 2 were excluded based on mEGFP quantification (Extended Data Fig. 8d) and were considered as untransduced B cells. Cluster 3 cells were assigned as initial B cells. Cluster 4 was assigned to Early based on its high number of Early and Early–Inter terms and on the difference in the proportion of Early–Inter terms, which was higher for that cluster. Cluster 1 had similar term quantification but was assigned as Early–Inter based on PAGA analysis. Finally, clusters 5, 6 and 7 had the highest quantity of Inter2, Late1 and Late2 macrophage markers. Clusters 5 and 6 had very similar quantifications and were thus assigned Differentiating macrophage 2 and 1, respectively, based on PAGA analysis. Late macrophage assignment was based on the unique expression signature and the highest pseudotime score (Extended Data Fig. 8c). To confirm this assignment, we also used cell-type markers and visualized the normalized expression in a UMAP graph. Markers for B cells (CD19) and macrophage cell types (ITGAM, CD14, CD68 and PTPRC), as well as CEACAM1, CEACAM4, CEACAM6, CEACAM8, FCGR2A, FCGR2B and FCGR3A, were used (Fig. 5i,k, Extended Data Fig. 8f,m,p and Supplementary Fig. 2d,f).

Differential expression analysis

Inter-cluster differential expression analysis was performed with the Wilcoxon test using the FindMarkers function, with default settings, and inter-sample cluster differential expression analysis between wild-type and IS15 cells in cluster 7 was performed using the FindMarkers and DESeq2 functions. The differentially expressed genes within the clusters are listed in Supplementary Table 5. A q-value cutoff of 0.05 was used to define differentially expressed genes for the Wilcoxon test and an adjusted Benjamini–Hochberg P value of 0.05 was used for the inter-sample test (Supplementary Table 5). Volcano and bar plots were generated in ggplot2; and violin, UMAP and feature plots were generated using Seurat’s VlnPlot, FeaturePlot and DimPlot functions. The dot plot was made using a custom function to modify the output of the ComplexHeatmap package (Extended Data Fig. 8i,j).

RNA velocity

We generated the loom files necessary for RNA velocity using velocyto84 and exported barcodes, expression matrix, metadata and UMAP coordinates from Seurat to CSV files. scVelo85 was used to build the manifold and to calculate and visualize the RNA velocity, using the generalized dynamical model to solve the full transcriptional dynamics. The PAGA graph86 was calculated from this model to visualize the cell trajectory. Pseudotime was calculated using the Markov diffusion process and plotted by the scVelo built-in function (Extended Data Fig. 8c).

ChIP–Seq analysis

ChIP–Seq data from C/EBPα and NGN2 were mapped to a custom human genome hg38 using BWA v0.7.17 (ref. 87). SAMtools88 was used for SAM to BAM file conversion, sorting and indexing, and Genome Analysis Toolkit v4 (ref. 89) was used to remove duplicate reads. Peak calling was then performed using MACS3 v3.0.0 b1 (ref. 90) using the input of the respective sample. Analysis and differential peak calling were done with DiffBind v3.6.5 (ref. 91). Normalization was done with the native method and background input. Differential calling was done using the DESeq2 method; the false-discovery-rate threshold was set to 0.01. Peak visualization was performed using the DiffBind ‘plotprofile’ function with default settings for general profiles, unless otherwise stated. Sets of overlapping sites were identified using the intersect function of bedtools v2.6.0. The profiles in Supplementary Fig. 2b were plotted using ‘percentOfRegion’ with 27 windows and 300% extension. The regions plotted correspond to a merged set of promoters, a merged set of enhancers and separate sets for B cell and macrophage superenhancers from a previous study83. Principal component analysis was done on normalized count samples and plotted with DiffBind (Extended Data Figs. 8k and 9h).
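The logic of identifying overlapping sites between two peak sets can be sketched in Python as follows. This is a simplified stand-in for an interval intersection and not a reimplementation of bedtools; it assumes BED-style half-open coordinates, counts any overlap of at least one base pair, and uses hypothetical function names and example coordinates. The bedtools options used in the study are not specified in the text, so no particular option set is implied.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share >= 1 bp."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def intersect(peaks_a, peaks_b):
    """Peaks from A that overlap at least one peak in B, each reported once.

    This brute-force comparison is quadratic in the number of peaks and is
    meant only to show the overlap rule, not to scale to whole genomes.
    """
    return [a for a in peaks_a if any(overlaps(a, b) for b in peaks_b)]


# Hypothetical peak coordinates
set_a = [("chr1", 100, 200), ("chr1", 300, 400), ("chr2", 100, 200)]
set_b = [("chr1", 199, 250), ("chr1", 400, 500)]
shared = intersect(set_a, set_b)
```

In this example only the first peak of set A is reported: the second peak ends exactly where a peak in set B begins, which does not count as an overlap under half-open coordinates, and the third lies on a different chromosome.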

TT-SLAM-Seq analysis

Raw reads were filtered and trimmed as described earlier for bulk RNA-seq samples. Filtered reads were aligned to the SILVA database69 (downloaded 6 March 2020) using STAR v2.7.9a with the parameters ‘--outFilterMultimapNmax 50 --outReadsUnmapped Fastx’ to remove ribosomal RNA content. Unaligned reads were then reverse-complemented using seqtk ‘seq’ v1.3-r106 with the ‘-r’ parameter (https://github.com/lh3/seqtk). Reverse-complemented reads were processed using SLAM-DUNK92 with the ‘all’ pipeline v0.4.1 using the ‘-rl 100 -5 0’ parameters with the GENCODE gene annotation v39 as ‘-b’ option. Reads with a ‘T>C’ conversion representing nascent transcription were filtered from the BAM files using alleyoop (provided together with SLAM-DUNK) with the ‘read-separator’ command. Counts per gene were quantified based on the ‘T>C’-converted reads using featureCounts v2.0.6 (ref. 93) with the -s 1 and -t gene parameters for stranded and gene body counting. Samples were then submitted to differential expression using the method described above. Heatmap representation was plotted as described earlier (Extended Data Fig. 9k). For genome-wide coverage tracks, technical replicates were merged using SAMtools ‘merge’. BigWig files for single and merged replicates were obtained as described above. DeepTools2 v3.5.1 (ref. 94) was used to generate a metaplot using two separate BED files containing separate stranded genes in each file (Fig. 6k).
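Two of the operations described above, reverse-complementing reads and recognizing T>C conversions, can be illustrated with a minimal Python sketch. The function names are hypothetical, and the conversion counter assumes an ungapped read already aligned to a reference segment of the same length; dedicated tools such as SLAM-DUNK additionally handle alignment, strand and quality considerations that this example ignores.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T and N only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_tc_conversions(reference, read):
    """Count positions where the reference has T and the read has C.

    Both strings must have the same length (ungapped alignment).
    """
    if len(reference) != len(read):
        raise ValueError("reference and read must have the same length")
    return sum(r == "T" and q == "C"
               for r, q in zip(reference.upper(), read.upper()))


def has_tc_conversion(reference, read, min_conversions=1):
    """True if the read carries at least min_conversions T>C mismatches."""
    return count_tc_conversions(reference, read) >= min_conversions
```

A read would be classified as derived from nascent RNA in this simplified scheme if it carries at least one T>C mismatch; the threshold is exposed as a parameter because the appropriate value depends on the labelling and sequencing error rates.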

Sequence disorder and pLDDT calculation for HNRNPA1

Disorder and pLDDT scores were calculated using Metapredict v2, and score plots were made using the built-in Metapredict graph plotting function (Extended Data Fig. 3a).

AlphaFold predicted models

AlphaFold models were computed by an in-house implementation of AlphaFold95 using version 2.0.0 (16 July 2021). The preset parameter was set to ‘--preset=casp14’, matching the CASP14 prediction pipeline. In addition, templates were restricted to those available before the CASP14 predictions using the parameter ‘--max_template_date=2020-05-14’. Models were rendered using UCSF ChimeraX, colouring the structure for aromatic residues (Extended Data Figs. 3c and 5a).

Spacer analysis

The IDR composition was measured by calculating the frequency of each amino acid as a probability with the ‘alphabetFrequency’ function from Biostrings package v2.40.2 divided by the frequency of the amino acid calculated over the full human proteome in R. Quantification was performed for IDRs with and without periodic blocks. The frequency bar chart was plotted using ggplot (Extended Data Fig. 1d). To calculate the amino acid composition around the aromatic residues, we extracted the sequence, in FASTA format, of every periodic block for positions −2, −1, 0, +1 and +2 around the aromatic residue (0 represents the aromatic residue) using a custom Python script. The FASTA file was then submitted to GLAM2 analysis to calculate the frequency of amino acids and to output a position weight matrix. The cumulative bar plot was plotted using ggplot, masking the position weight matrix table into disorder-promoting, order-promoting and neutral residues (Extended Data Fig. 1e). Periodic block motif analysis was performed by extracting sequences of the periodic blocks in TF IDRs described in this study, and charged blocks from a previous study96, in FASTA format and submitting them to GLAM2 analysis. The top three position weight matrices were plotted (Extended Data Fig. 1f).
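The composition ratio and the extraction of positions −2 to +2 around aromatic residues can be sketched in Python as follows. This is an illustrative example rather than the custom script used in the study: the function names are hypothetical, F, W and Y are assumed to be the aromatic residues, and aromatic residues lying closer than two positions to either end of a sequence are skipped in this sketch.

```python
from collections import Counter

AROMATIC = frozenset("FWY")  # assumption: residues counted as aromatic


def aa_frequencies(sequences):
    """Fraction of each amino acid across a collection of sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq)
    total = sum(counts.values())
    return {aa: c / total for aa, c in counts.items()}


def enrichment(idr_seqs, background_seqs):
    """Per-residue frequency in IDRs divided by the background frequency."""
    fg = aa_frequencies(idr_seqs)
    bg = aa_frequencies(background_seqs)
    # Residues absent from the background are skipped to avoid dividing by zero.
    return {aa: fg[aa] / bg[aa] for aa in fg if aa in bg}


def aromatic_flanks(seq, flank=2):
    """Windows spanning -flank..+flank around each aromatic residue."""
    return [seq[i - flank:i + flank + 1]
            for i in range(flank, len(seq) - flank)
            if seq[i] in AROMATIC]


def to_fasta(windows, prefix="window"):
    """Format the extracted windows as FASTA text for motif tools."""
    return "".join(f">{prefix}_{n}\n{w}\n" for n, w in enumerate(windows, 1))
```

The FASTA text returned by `to_fasta` has one record per aromatic residue and could be passed to a motif discovery tool to obtain a position weight matrix.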

Gene-set-enrichment analysis

Gene ontology enrichment analyses for proteins that contain a region with significant periodicity and for the TFs that contain a periodic block were done using gProfiler97. Gene ontology categories for biological process with a term size of >1,000 genes were filtered out to remove general categories. An adjusted P-value cutoff of 0.001 was used. For the analysis of periodic-block-containing TFs, REAC and WikiPathways enrichment was also done with gProfiler. Gene-set-enrichment analysis was done using clusterProfiler98,99 (Extended Data Fig. 3d,e).

UCSC track visualization

For track visualization, MACS3 background-subtracted bigWig files from each replicate were merged using the UCSC bigWigMerge tool and then converted from bedGraph format back into bigWig using the UCSC bedGraphToBigWig tool. Visualization was done using the pygenometracks tool set100.

Statistics and reproducibility

All experiments were repeated as stated in the figures, legends and methods. Statistical details are presented in the figure legends and as detailed below. Comparisons were performed in GraphPad Prism 9.0. No statistical method was used to pre-determine sample size. Data distribution was assumed to be normal but this was not formally tested. The experiments were not randomized. Data collection and analysis were not performed blind to the conditions of the experiments. For the neural reprogramming experiments, wells were excluded in case of wash-off or out-of-focus events. Investigators were not blinded to allocation during experiments and outcome assessment.

In the box plots in Fig. 1l, the centre line shows the median, the bounds of the box correspond to the interquartile range (25th–75th percentiles), and whiskers extend to Q3 + 1.5× the interquartile range and Q1 − 1.5× the interquartile range. Dots beyond the whiskers show Tukey’s fences outliers.
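The box plot elements described above can be computed as in the following Python sketch. The function name and the example values are hypothetical, and the quartile convention (linear interpolation between data points, the ‘inclusive’ method of the Python standard library) is an assumption of this example; plotting software may use a slightly different quartile definition.

```python
from statistics import quantiles


def tukey_summary(values):
    """Median, quartiles, Tukey fences and outliers for a box plot."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "outliers": outliers,
    }


# Hypothetical data with one extreme value
summary = tukey_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
```

For these example values the quartiles are 3.25 and 7.75, the fences lie at −3.5 and 14.5, and the value 100 is flagged as an outlier.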

Exact P values were as follows: Fig. 2a, P(WT versus AroPLUS) = 0.03172, P(AroPLUS versus AroPLUS LITE) = 0.07727, P(AroPLUS versus AroPLUS patched) = 0.00729, P(AroPLUS versus AroPLUS patched LITE) = 0.03433, P(WT versus AroPERFECT) = 0.00006, P(AroLITE versus AroPERFECT) = 0.00004, P(AroPLUS versus AroPERFECT) = 0.000461, P(AroPLUS LITE versus AroPERFECT) = 0.00252, P(AroPLUS patched versus AroPERFECT) = 0.00014 and P(AroPLUS patched LITE versus AroPERFECT) = 0.00008; Fig. 2f, P(WT(N)-FUSNxs versus wild type) = 0.00942, P(WT(N)-FUSNxs versus wild type (N)) = 0.01837 and P(WT(N)-FUSNxs versus FUSNxs) = 0.01054; Extended Data Fig. 5b(top), P(wild-type N-IDR versus AroPERFECT N-IDR) = 0.00121, P(AroLITE N-IDR versus AroPERFECT N-IDR) = 0.000003, P(wild-type C-IDR versus AroPERFECT C-IDR) = 0.01711 and P(AroLITE C-IDR versus AroPERFECT C-IDR) = 0.000005; Extended Data Fig. 5b(middle), P(wild type versus AroPERFECT) = 0.02946 and P(AroLITE versus AroPERFECT) = 0.00069; Extended Data Fig. 5b(bottom), P(wild type versus AroPERFECT) = 0.02079 and P(AroLITE versus AroPERFECT) = 0.02087; Extended Data Fig. 6j, P(HOXD4 wild-type YFP versus HOXD4 wild-type YFP-RNAPII CTD) = 0.9999, P(HOXD4 wild-type YFP versus HOXD4 AroPERFECT YFP) = 0.0509, P(HOXD4 AroPERFECT YFP versus HOXD4 AroPERFECT YFP-RNAPII CTD) = 0.0325, P(HOXD4 wild-type YFP-RNAPII CTD versus HOXD4 AroPERFECT YFP-RNAPII CTD) = 0.9999, P(C/EBPα wild-type YFP versus C/EBPα AroPERFECT YFP) = 0.9999, P(C/EBPα wild-type YFP versus C/EBPα wild-type YFP-RNAPII CTD) = 0.9999, P(C/EBPα AroPERFECT YFP versus C/EBPα AroPERFECT YFP-RNAPII CTD) = 0.1524 and P(C/EBPα wild-type YFP-RNAPII CTD versus C/EBPα AroPERFECT YFP-RNAPII CTD) = 0.2275.

The imaging experiments in Figs. 3a, 4b, Extended Data Fig. 6g and Supplementary Fig. 1a,c were performed twice independently with similar results. The imaging experiments in Extended Data Fig. 10c were performed three times independently with similar results. The western blot experiments in Fig. 3f and Extended Data Figs. 4b,g, 5c, 6f, 7b, 10a were performed twice independently with similar results. The genotyping experiments in Extended Data Fig. 6c were performed once.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.