Article
Open access
Published: 03 July 2024

Geographical migration and fitness dynamics of Streptococcus pneumoniae

Nature (2024)Cite this article

905 Accesses
106 Altmetric
Metrics details

Subjects

Abstract

Streptococcus pneumoniae is a leading cause of pneumonia and meningitis worldwide. Many different serotypes co-circulate endemically in any one location^1,2. The extent and mechanisms of spread and vaccine-driven changes in fitness and antimicrobial resistance remain largely unquantified. Here using geolocated genome sequences from South Africa (n = 6,910, collected from 2000 to 2014), we developed models to reconstruct spread, pairing detailed human mobility data and genomic data. Separately, we estimated the population-level changes in fitness of strains that are included (vaccine type (VT)) and not included (non-vaccine type (NVT)) in pneumococcal conjugate vaccines, first implemented in South Africa in 2009. Differences in strain fitness between those that are and are not resistant to penicillin were also evaluated. We found that pneumococci only become homogenously mixed across South Africa after 50 years of transmission, with the slow spread driven by the focal nature of human mobility. Furthermore, in the years following vaccine implementation, the relative fitness of NVT compared with VT strains increased (relative risk of 1.68; 95% confidence interval of 1.59–1.77), with an increasing proportion of these NVT strains becoming resistant to penicillin. Our findings point to highly entrenched, slow transmission and indicate that initial vaccine-linked decreases in antimicrobial resistance may be transient.

Designing ecologically optimized pneumococcal vaccines using population genomics

Article 03 February 2020

Within-host microevolution of Streptococcus pneumoniae is rapid and adaptive during natural colonisation

Article Open access 10 July 2020

Evidence for the intermediate disturbance hypothesis and exponential decay in replacement in Streptococcus pneumoniae following use of conjugate vaccines

Article Open access 07 May 2022

Main

The greatest public health burden from infectious diseases remains stubbornly endemic pathogens. Once established, pathogens such as Mycobacterium tuberculosis, HIV and, now, SARS-CoV-2 are difficult to control, even when vaccines are available¹. Their persistence in the population can be partially explained by the co-circulation of multiple strains of the same pathogen. Endemic pathogens are complicated to study as we rarely understand the mechanisms that drive spread, including the role of human behaviour and why some lineages increase in prevalence over time whereas others disappear. Underlying genetic diversity is particularly extreme in the case of the bacterium S. pneumoniae (the pneumococcus), which is the leading cause of morbidity and mortality worldwide because of lower respiratory infections^2,3,4. The pneumococcus comprises >100 known antigenically distinct serotypes and >900 classified lineages (also known as global pneumococcal sequence clusters (GPSCs))^5,6,7. Moreover, it is not uncommon for more than 30 antigenically distinct serotypes to co-circulate within a country or region or for a human host to concurrently carry multiple serotypes⁸. We refer to lineage synonymously with GPSC, whereas a strain references any particular circulating phenotype (including specific serotypes and antimicrobial resistance (AMR)). Here we develop mathematical models using thousands of geolocated genome sequences from South Africa collected over a 15-year period to clarify several key uncertainties in pneumococcal migration. We model the rate and breadth of mobility geographically and how fitness changes linked to vaccine implementation and AMR may affect its spread.

The pneumococcus resides in the human upper respiratory tract. Carriage is a prerequisite for disease, and rates of carriage in children under 5 years old range from 20 to 90%⁹. Occasionally, asymptomatic carriage goes on to cause local infections such as otitis media or, more severely, invasive pneumonia and meningitis. More than 500,000 deaths per year linked to pneumococcus are estimated to occur globally^3,10. Penicillin was first used to treat pneumococcal disease in the 1930s, and it successfully reduced pneumococcal disease until the late 1960s when penicillin non-susceptibility was first noted. Multidrug-resistant strains were described soon after penicillin resistance^11,12. By 2019, 19% of deaths associated with AMR had pneumococcal aetiology¹³. In this context, vaccines are pivotal to disease control. A pneumococcal polysaccharide vaccine that included 23 serotypes was licensed in the USA in 1983. However, the absence of mucosal immunity has seen it replaced by pneumococcal conjugate vaccines (PCVs) except for in older adults and immunocompromised individuals¹⁴. PCVs (conjugated with toxin to stimulate mucosal immunity) target a small subset of the polysaccharide capsular serotypes, with the most common formulations including PCV7 and PCV13 (Pfizer)¹⁵ and PCV10 (GlaxoSmithKline)¹⁶. These target 7, 13 and 10 serotypes, respectively (with all the serotypes included within PCV7 and PCV10 also included within PCV13). In 2021, additional PCV formulations, PCV15 (Merck) and PCV20 (Pfizer), were licensed for use in the USA and Europe¹⁷. Pneumococcal vaccination is dynamic, and new vaccine compositions are frequently tested. Vaccine serotypes are often selected because of their high prevalence and AMR among disease isolates from infants and children. PCVs are now included in 76% of national immunization schedules, with different formulations in different countries. For example, in South Africa, PCV7 was implemented in 2009 and replaced by PCV13 in 2011, excluding PCV10 (ref. ¹⁸). Despite their success at reducing disease, their use has been linked to serotype replacement by NVTs in both invasive pneumococcal disease (IPD) and carriage^8,19,20,21. In South Africa, this has been characterized by increases in NVT serotypes 8 and 15B among IPD, and increases in NVT serotypes 16F, 24, 35B and 11A among carriage isolates (8, 15B and 11A are now included in PCV20)^20,22,23. Although there has been success in predicting the fitness of individual isolates based on the overall gene distribution in a population, quantitative measures of fitness linked to the serotype of individual isolates are lacking^24,25. This includes quantifying the time it takes after implementation for vaccines to affect the serotype composition in the country. Moreover, quantifying the serotype growth before and after vaccine implementation at different time points is needed. These are crucial knowledge gaps, as serotype distributions ultimately drive vaccine development and deployment strategies. In addition, vaccine implementation has resulted in reductions in AMR among both IPD and carriage isolates^20,22,26. However, it remains unclear whether these reductions will persist over time at the population level or whether AMR may rebound.

Mathematical models applied to geolocated pathogen genome sequence data are useful to disentangle the changing prevalence of different lineages. However, most phylogeographical models focus on the rate of pathogen flow between locations, which represents the overall effect of multiple transmission chains. Such models therefore consider a different ecological scale to the specific behaviours of infected people and the surrounding population at each transmission generation²⁷. The relationship between behaviours of individuals at each transmission step and the overall patterns of pathogen flow between locations are complex and nonlinear. Most existing phylogeographical approaches also struggle to account for changing levels of surveillance in both space and time. Here we develop mechanistic models that use the generation time distribution to estimate the number of transmission events that separate the most recent common ancestors (MRCAs) from each pair of tips in a time-resolved phylogeny. Taken together with measures of human mobility probabilities and human population distribution, we can infer mechanisms of pneumococcal migration at each transmission generation. We implemented this model with a focus on South Africa, where approximately 65% of children ≤5 years of age (35% across all age groups) carry the pneumococcus^28,29. We incorporate both uncertainty in the phylogenetic reconstructions and sampling uncertainty through a bootstrapping approach³⁰. We explore the robustness of our approach to highly biased observation processes using simulated data with known parameter values. Finally, we quantify the changing fitness of strains in response to vaccine introduction, including those containing AMR.

Quantifying spatial structure

In partnership with the South African National Institute for Communicable Disease and the Wits Vaccines and Infectious Diseases Analytics Research Unit, we sequenced the whole genomes of isolates from each of South Africa��s nine provinces between 2000 and 2014 (n = 6,910, 5,060 from individuals with invasive pneumococcal disease and 1,850 from carriage studies) (Fig. 1a–c). Despite the large number of sequences, this dataset only represents a very small proportion (≪0.1%) of the circulating pneumococcus over this time period. We identified 184 GPSCs with 69 different serotypes (31.9% NVT) (Fig. 1a,e and Supplementary Table 1). This diversity persisted across provinces, and the distribution of serotypes within GPSCs did not follow a distinct geographical structure (Fig. 1a). In silico predicted AMR was common (penicillin, 48.2%; erythromycin, 17.3%, clindamycin, 11.2%; and co-trimoxazole: 68.4%), with similar distributions for isolates from both carriage and disease (Fig. 1f, Extended Data Fig. 10, Supplementary Fig. 13 and Supplementary Table 3).

**Fig. 1: Descriptive summary of S. *pneumoniae* isolates.**

Taking the 9 most dominant GPSCs in turn (each comprising more than 50 sequences, 2,575 sequences in total), we built recombination-free, time-resolved phylogenetic trees to determine the divergence times between sequence pairs (Extended Data Fig. 1a–i and Supplementary Table 4). We included 1,157 genomes from 14 other countries in Africa and 2,944 from 31 countries outside Africa in our phylogenies. We compared the geographical distance spread per divergence time between pairs in South Africa and found a clear geographical structure. The geographical distance between pairs increased from a mean distance of 142 km (95% confidence interval (CI) = 54–207 km) for those separated by less than 2 years of evolutionary time to 297 km (95% CI = 274–323 km) for those separated by 10–20 years (Fig. 1d). We obtained consistent results when using only sequences that came from individuals with disease (Supplementary Fig. 1A and Supplementary Table 5) and across the different GPSCs (Supplementary Fig. 2). This result is consistent with largely common patterns of spread irrespective of which particular GPSC an individual is infected with. Despite high heterogeneity in GPSC composition within any province, overall, pairs of isolates that are from the same province had 1.27 (95% CI = 1.23–1.31) times the relative risk (RR) of being the same GPSC compared with pairs of isolates from distant provinces (>1,000 km). This RR fell to 1.09 (95% CI = 1.03–1.12) for pairs separated by 500–1,000 km (Fig. 2a and Supplementary Table 5). We obtained consistent results when we limited our analysis to only disease isolates and when we subsampled to have even numbers of sequences by province to mitigate sampling bias (Extended Data Fig. 2a,b and Supplementary Table 5). As there can be hundreds of years of diversity within a single GPSC, we refined the analysis by using different evolutionary windows of separation between pairs of isolates as determined from the phylogenetic trees. Pairs from the same province had 3.87 (95% CI = 3.01–5.1) times the RR of having a recent common ancestor (within 5 years) than distal pairs (>1,000 km apart) (Fig. 2b and Supplementary Table 5). However, as the evolutionary time between isolates increases the relative probability of being from the same province decreases, it is only after around 50 years that the pneumococcus seemed to be well mixed throughout the country (Fig. 2c–f and Supplementary Table 6). Furthermore, comparisons of the spatial location of closely related pairs showed that pneumococcus flow was dominated by within-country movement compared with either movement between South Africa and other African countries or non-African countries (Fig. 2c–e, Supplementary Fig. 3A and Supplementary Table 5). Recognizing that our samples span age groups, we stratified the analysis by the age difference between pairs. Genome pairs from individuals who were greater than 5 years apart in age took slightly longer to become mixed across South Africa (Supplementary Fig. 3B and Supplementary Table 7). These findings are consistent with a highly entrenched pathogen that moves slowly within a country and with slow cross-border transmission.

**Fig. 2: RR framework to determine geographical structure.**

Inferring migration using human mobility

To understand whether human mobility can explain the slow spread of the pneumococcus, we built a mechanistic model of geographical spread fit to the nine dominant GPSCs and the observed province in which our genome sequences were isolated. We used the generation time distribution, time from one person being infected to infecting the next person (estimated mean of 35 days, standard deviation of 35 days (gamma distribution)) to translate branch lengths to the number of generations between pairs of sequences^31,32 (Supplementary Fig. 4). Each transmission generation is an opportunity for pneumococcal mobility. We used directional human mobility probabilities between each of the 234 South African municipalities from Meta Data for Good³⁰ to infer the probable location of a single transmission event, allowing for mobility of both the infected individual and the surrounding susceptible population. As Meta data captures mobility over a single day, we adjusted the duration in any cell to consider total mobility over the infectious period. Furthermore, as Meta users may act differently to those involved in pneumococcal transmission, we incorporated a parameter that allows individuals to have a different probability of staying within their home municipality than Meta users (Fig. 3b). We then calculated the probability of pneumococcal movement between each pair of locations for each transmission generation. This approach integrates over all possible pathways linking two locations. We incorporated the probability of an isolate being sequenced at each geographical location and within each collection year to account for the number of isolates being sequenced differing by year and location. We fit the model in a Bayesian framework using Markov chain Monte Carlo (MCMC). We compared the performance of our models using the Meta data to a gravity model in which the probability of mobility is a function of the distance to a location and its population size (Extended Data Fig. 4). We also separately modelled mobility that depended on distance only.

**Fig. 3: Mechanisms of geographical migration.**

Both the gravity model and the model that relied on Meta data were able to recover the observed spatial spread in pneumococcus (Fig. 1d, red line). However, model fit was better for the Meta data model (difference in deviance information criterion (DIC) of 3.4). Both these models outperformed the distance-only model (Supplementary Table 8). The Meta data model showed that the relative proportion of the population carrying the pneumococcus per province at any time is strongly correlated with the population size in that province (R² = 0.97, P ≤ 0.01; Fig. 3a). This model enabled us to infer the true underlying spread of pneumococcus (that is, in which we accounted for the biased observation process and mobility in both the infected individual and the surrounding susceptible population) (Fig. 1d, dashed line). We estimated that among individuals involved in pneumococcal transmission, the daily probability of staying in their home municipality was 94.3% (95% CI = 93.8–95.0%) (Fig. 3b, black squares) compared with 99.8% (ranging from 86.6% in Mogale City, Gauteng, to 99.9% in Ba-Phalaborwa, Limpopo) for Meta users (Fig. 3b, blue points). When we incorporated the mobility of both infector and infectee and accounted for the mobility across the infectious period, we estimated that after a single transmission generation (35 days), 53.1% (95% CI = 46.6–58.2) of strains remained in their starting municipality, 22.8% (95% CI = 19.2–27.1%) were in a neighbouring municipality and a small minority were more than 500 km away (Fig. 3b, black points). As the number of transmission generations increased, the probability of reaching distal municipalities also increased (Fig. 3b). The size of the community seemed to be key to determining where lineages travel. After 1 year of sequential transmission, the probability of being in a municipality with a population size of >3 million people was 26.7 (95% CI = 19.8–40.10) times that of being in a randomly selected municipality. This result is consistent with most pathogen movement passing through urban centres (Fig. 3c and Supplementary Table 9).

The municipality in which a strain emerges also seemed important. After 1 year of sequential transmission, a new strain that first occurred in a rural municipality (population density of <50 people per km²) has travelled a median distance of 468.7 km (95% CI = 71.3–1,204.4 km), whereas in the same time window, a variant first occurring in an urban municipality (>500 people per km²) has travelled only 285.6 km (95% CI = 36.3–967.0 km). Furthermore, the variant that emerged in a rural municipality would have travelled to 1.53 times as many municipalities as the urban variant (Extended Data Fig. 3a–c). This result is corroborated by previous research that demonstrated high levels of in and out migration among individuals in rural settings owing to travel for work or education^33,34. On average, 1 year after emerging, transmission chains visited 4 (95% CI = 1–8) municipalities, and after 10 years they visited 20 (95% CI = 13–27) municipalities (Fig. 3d). Overall, these results show that the breadth of geographical spread is driven by a small number of long-range transmission events, with most transmissions remaining local. Incorporating our model into a branching epidemic, we found that after 10 years, transmission events are an average 465 km (95% CI = 456–472 km) from where they began (Supplementary Fig. 5A,B).

To test the performance of our model, we simulated transmission within and between districts using the Meta matrix adjusted by a known parameter that determines the probability of staying within one’s home district. We then fit our model using a subset of infections to re-estimate this parameter. Even when only a small minority of infections were sequenced, we recovered the true probability of staying within a district. Our estimates were even robust to an extremely biased observation scenario whereby data from only two locations were available. Using this framework, we also explored the effect of misspecification of the generation time distribution. Using a 50% shorter generation time led to a small overestimate in the probability of staying in one’s home location each day (96.7% versus 95.3%), whereas using a 50% longer generation time had the opposite effect (Extended Data Fig. 5).

Vaccine-induced fitness changes

Implementation of PCV7 in 2009 and PCV13 in 2011 was associated with a substantial disruption in the patterns of circulating serotypes, results that are consistent with what has been previously observed^8,20,35,36. However, the introduction of vaccines was also associated with a marked change in fitness. By 2014, serotypes included in PCV13 represented 33.2% of all isolates in our dataset, a reduction from 85.0% in the pre-vaccine era (Fig. 4a). These patterns were consistent across the nine provinces in South Africa (Supplementary Fig. 6). To quantify changes in fitness linked to the vaccines, we fitted models to the annual distribution of serotypes across 184 GPSCs from the full dataset, allowing for differential fitness in serotypes included in PCV7 (serotypes 4, 6B, 9V, 14, 19F, 18C and 23F), PCV13 (which includes additional serotypes 1, 3, 5, 6A, 7F and 19A), and those not included in the vaccine (NVT). This method tracks the proportion of all serotypes at the population level over time and quantifies the relative advantage of each of them following the implementation of vaccines. This simple formulation was able to recover the observed distribution of serotype proportions in each group, by year, across provinces (Fig. 4a–c and Supplementary Fig. 6). We note that the number of NVT and VT isolates we used in our model do not represent the underlying incidence of NVT and VT, as only a small proportion of all infections were detected and sequenced. However, as we focused on the relative abundance of NVT and VT strains per year, our approach is robust to changes in the absolute numbers of isolates sequenced.

**Fig. 4: Vaccine-induced fitness dynamics.**

Before the implementation of vaccines (2000–2008), NVTs had a relative fitness of 1.08 (95% CI = 1.06–1.09) compared with serotypes included in the vaccine (combining serotypes in PCV7 and PCV13 into VT). Following the implementation of PCV7 and PCV13, the fitness of the serotypes they target declined respectively compared with NVT serotypes (Fig. 4d and Supplementary Tables 10 and 11). When comparing serotype fitness before and after the introduction of PCVs, VTs had a relative fitness of 0.86 (95% CI = 0.78–0.92) and 0.76 (95% CI = 0.67–0.84) for PCV7 and PCV13 serotypes, respectively (Supplementary Table 11). Meanwhile, for the NVTs, vaccines were associated with a 1.25 (95% CI = 1.14–1.35) times increase in relative fitness from 2009 to 2014 compared with before the implementation of vaccines (Fig. 4e and Supplementary Tables 10 and 11). When we directly compared the fitness advantage of NVTs compared with VTs in the PCV era, NVTs had a relative fitness advantage of 1.68 (95% CI = 1.59–1.77), which is equivalent to a 1.05 (95% CI = 1.05–1.06) growth advantage (relative to VTs) at each transmission generation. In a sensitivity analysis, the results remained consistent for carriage and disease isolates, respectively, despite them having been sampled from different cohorts (Extended Data Fig. 6 and Supplementary Fig. 7). We also found consistent results across provinces (Supplementary Fig. 6). We additionally assessed whether there was a delay between vaccine implementation and resulting changes in strain fitness (Extended Data Fig. 7). The best fitting model assumed the change in fitness occurred in the same year as vaccine implementation.

Refining this model to look at the fitness of individual serotypes showed a wide range of fitness across serotypes. Our findings highlight that all strains are fundamentally different in underlying fitness (Extended Data Fig. 8), which means that NVTs will differ in their ability to alter their ecological niche following changes in vaccine formulation. This result needs to be taken into consideration in the development of new vaccine formulations. Among NVTs, serotypes 15A, 35B and 8 had the greatest fitness advantage after PCVs were used (Supplementary Figs. 8 and 9), a result concordant with what has been previously observed^20,22,37. Shifts in lineage fitness also resulted in changing patterns of spread. By incorporating our fitness estimates for NVT and VT strains after vaccination into our mobility model, we estimated that the number of affected municipalities from a strain of a NVT was 2.02 (95% CI = 1.81–2.25) times the number of affected municipalities from types included in the vaccine (Fig. 3e).

We next explored whether using the proportion of isolates that were VT versus NVT within each GPSC at the start of the study period and our fitness estimates could explain the subsequent dynamics of individual GPSCs. Simply using VT and NVT fitness estimates could explain 60% of the variance in individual GPSC prevalence at any time. Allowing for serotype-specific differences produced a small improvement, explaining 65% of the variance (Supplementary Fig. 10 and Supplementary Table 13). The unexplained variance reflects GPSC-specific fitness and is probably driven by negative-frequency dependent selection (NFDS)²⁴. This result highlights the predictive nature of the serotype composition of GPSCs in determining PCV-driven GPSC dynamics (Supplementary Figs. 11 and 12).

The serotypes included in the vaccines were prevalent in childhood disease and had high levels of AMR in the USA, where the vaccines were developed³⁸. The high levels of AMR in the VT strains was also present globally²⁶. In South Africa, similar to other countries, reductions in AMR have been noted since vaccine implementation; however, it remains unclear whether this trend will persist or whether AMR eventually rebounds²⁶. In South Africa, before vaccines, 63.6% of VT and 8.8% of NVT strains were resistant to penicillin. We found that there was a clear reduction in overall penicillin resistance following vaccine implementation, which was driven by reductions in the proportions of strains that are VT (Fig. 4f and Supplementary Table 12). The trends, although still present, were less clear in the other investigated antimicrobials (Extended Data Fig. 9). Owing to the relevance of penicillin as a first-line antimicrobial for pneumococcal disease and the high proportion of resistance in this population, we used our same modelling framework and were able to recover the observed proportions of strains that were resistant to penicillin over time. Before vaccines, among both NVT and VT strains, there was limited difference in the fitness between those that were penicillin-resistant and penicillin-susceptible. However, following implementation of vaccines, NVT-resistant strains were 1.30 (95% CI = 1.19–1.43) times as fit as penicillin-susceptible NVT strains (Fig. 4f,g and Supplementary Fig. 19). Conversely, resistance did not seem to have changed among VT penicillin-resistant strains (relative fitness of 0.97 (95% CI = 0.91–1.03)) (Fig. 4f,g, Supplementary Fig. 13A–C and Supplementary Table 12). Expansion of NVTs within typically VT-associated lineages is the most common mechanism for serotype replacement³⁶. As a result, the penicillin resistance associated with these newly expanded NVT lineages is able to persist in the population^19,36. Together with our quantification of growing penicillin resistance among NVTs following the use of PCVs, this result suggests that the overall reduction in penicillin resistance seen following vaccine implementation may not persist. Our data also highlight the nuanced effect that vaccines can have on patterns of AMR (Supplementary Fig. 13D–F). It is probable that next-generation higher valency PCVs may also lead to increased AMR prevalence in those serotypes not included in the vaccine. Changing patterns of antimicrobial use in this population may also result in shifts in resistance patterns. The widespread presence of penicillin resistance at the beginning of our data collection period would implicate expansion of resistance among existing lineages; however, we cannot exclude the emergence of some new resistance forms.

Limitations

We did not have complete carriage and invasive disease data across the entire time period. However, we performed sensitivity analyses to determine whether our results are robust to including carriage and invasive disease together. The human mobility data are Meta baseline data that were released by Meta owing to the SARS-CoV-2 pandemic in 2020. We used aggregated data across 17 months (January 2020 to June 2021) within a single mobility pattern matrix. As mobility was altered during this period and to address the possibility that Meta mobility data are different to the movement of individuals involved in pneumococcal transmission, we included an additional parameter that adjusts the human mobility data to account for more or less time being spent in home municipalities. As we were able to obtain good fits to the observed spread of pneumococcus, and these models outperformed standard gravity models, our findings highlight how imperfect mobility data can nevertheless be useful. Vaccination levels for PCV7 were reported to be 89% by 2022 (ref. ³⁹). We did not have the data to consider changes in coverage in time or across provinces. Other settings with different levels of coverage may observe different fitness effects from the implementation of the vaccine. Within the fitness model, as we were looking at relative proportions of strains with increasing proportions of resistance, there may still be decreased total burden of resistant disease if those strains carrying it remain low in prevalence.

Conclusion

Here we quantified and explained the movement of a persistent human pathogen for which the geographical course has been largely hidden by its diversity and endemicity. The pneumococcus has an affinity for urban centres through which it channels its wider geographical spread. Although it is characterized by slow transmission overall, the use of vaccines can substantially and rapidly change pneumococcal lineage ecology. Although vaccine-associated fitness dynamics have been previously described in the pneumococcus^8,36, they have not been directly quantified. Increasing proportions of NVTs in the disease isolates from the PCV era can be largely attributed to the decrease in number of VTs rather than the increasing prevalence of NVTs²². Vaccination has had a secondary effect on penicillin resistance, with a decrease in recent years in South Africa. Given the estimated growth advantage of penicillin-resistant NVT strains, we may see a reversal of this benefit; however, estimating the carrying capacity of this growth is beyond the scope of this model. Furthermore, we quantified pneumococcal geographical spread and the spatial impact of NVT expansion after vaccination. Our findings highlight how directly observed characterizations of human mobility using mobile phone data or Meta data can be used to obtain a mechanistic understanding of how pathogens spread within phylogeographic frameworks. This includes considering mobility of both infected individuals and the susceptible population and how we can adjust these datasets to account for systematic differences in behaviour between mobile phone data and Meta data and those involved in transmission. We note that basic gravity models also performed well, providing a useful alternative in settings in which mobile phone data options are not available. Our description of pneumococcal geographical spread provides new insight into the movement of emergent strains. In South Africa, the population density in the emergence location of a NVT strain may affect its speed of spread across the country and have implications for public health responses to emergent strains. Emergence in a highly mobile, peri-urban area may enable both rapid proximal geographical spread and less frequent distal seeding events, and thus more complete distribution across the country. The fitness model and human mobility model together provide frameworks to quantify and better understand the migratory and fitness dynamics of this globally endemic pathogen. The magnitude of South Africa and its provinces demonstrates that these frameworks may be applied to other large regions.

Methods

Data sources and processing

Pneumococcal sequence data and metadata

The genomes included in this study were collected as part of the Global Pneumococcal Sequencing project (GPS), which is a global genomic survey of S. pneumoniae⁴⁰. The invasive disease isolates included here were collected by the National Institute for Communicable Disease in South Africa from 2000 to 2014 (ref. ²²). In the initial phase of genome sequencing, approximately 300 invasive-disease isolates from each year (2005–2014) were selected with a specific target age breakdown (50% from <3 year olds, 25% from 3–5 year olds, 25% from >5 year olds). In the second phase of sequencing, approximately 200 disease isolates from 2000 to 2004 and approximately 100 invasive disease isolates from 2005 to 2010 from children <5 years old were randomly selected for sequencing²² (Supplementary Table 2). The carriage isolates were collected in Soweto, Gauteng (n = 736; collection years 2010, 2012 and 2013) and Agincourt, Mpumalanga (n = 1,114; collection years 2009, 2011 and 2013) by the Wits Vaccines and Infectious Diseases Analytics Research Unit. A random sample of 400 carriage isolates from each year were chosen for genome sequencing (Supplementary Table 2). We included both carriage and invasive-disease isolates and conducted sensitivity analyses throughout to confirm the methods were robust to both carriage and invasive disease individually. Invasive disease is defined as the bacterium being isolated from a typically sterile site. The majority of total isolates were from children aged <5 years (75.0%); 7.6% were from individuals aged 5–20 years and only 17.5% were from adults (>20 years) (Supplementary Table 7). These were distributed across the before and after PCV periods. We additionally included similar GPSCs from the GPS database (for context within the global population) for the RR analysis. We utilized metadata that included collection year and month, residence province of the patient, age of the patient, sampling site and clinical manifestation. The range of sampling sites included nasopharyngeal swabs (for carriage), blood, pleural fluid, cerebrospinal fluid, peritoneal fluid, pus and other joint fluid (for invasive disease).

Population data

We estimated the population for each municipality (n = 234) across South Africa using the population-size estimates from LandScan 2017 (refs. ^41,42) (Extended Data Fig. 11).

Mobility data

The mobility data used in this study were collected using Meta Data for Good Disaster maps from South Africa. These are initiated at the onset of a disaster—in this case, the SARS-CoV-2 pandemic—and track the geographical movement of Meta users⁴³. We used the baseline (adjusting to 2 weeks before) human mobility for each month from January 2020 to July 2021 (refs. ^43,44) to attain a mobility probability from and to each municipality. For each origin (home) municipality (n = 234), we determined the mean monthly number of Meta users that were in each destination municipality in South Africa. Location pairs with a value of zero were given a value of ten. We divided each cell by the total number of users across all destinations for that origin municipality. Each cell in the resultant original–destination matrix therefore represented the probability of being in each destination municipality given your home municipality. This is the probability of mobility from each municipality to each other municipality after 2019. Because we do not have mobility data from the exact years the genomes were sampled, we adjusted the diagonal of the mobility matrix using an estimated parameter. This allows people to stay more or less at home and mitigates the effect of non-year matched mobility data. We define the radius of gyration (R_i) for each municipality as follows:

$${R}_{i}=\sqrt{{\sum }_{i}{m}_{i}{{d}_{i}}^{2}}$$

(1)

Where d_i is distance to region i, and m_i is the probability of mobility to region i. The sum is across all municipalities (n = 234)⁴⁵ (Extended Data Fig. 11).

Generation time distribution

We used a simulation framework to estimate the overall generation time using the separate contributions of the carriage durations and the incubation period. This approach has previously been used for other pathogens⁴⁶.

We sampled 1,000 carriage durations from an exponential distribution with means that are inverse to the clearance rates estimated across serotypes in ref. ³¹ (clearance rate = 0.026 (95% CI = 0.025–0.028) episode per day) and in ref. ³² (clearance rate = 0.032 (95% CI = 0.030–0.034) episodes per day)³².

To sample the day of transmission, we randomly sampled a time point between zero and the time of clearance for each individual. We then separately sampled an incubation period using a uniform distribution of between 1 and 5 days. To account for longer carriages resulting in more opportunities for transmission, we sampled from the distribution of generation, with the probability of sampling each individual weighted by the total carriage duration. The total generation time is then the sum of the duration to transmission and the incubation period.

We repeated these steps 10,000 times and estimated the mean and standard deviation of this final distribution assuming that the generation time follows a gamma distribution with a mean of 35 and a standard deviation of 35 (exponential distribution with a rate of 1/0.096). By comparing the histogram of the distribution with the gamma distribution, this seems a reasonable assumption (Supplementary Fig. 4).

Sample culture and genome sequencing

The pneumococcal isolates were selectively cultured on BD Trypticase soy agar II with 5% sheep blood (Beckton Dickinson) and incubated overnight at 37 °C in 5% CO₂. Genomic DNA was then manually extracted using a modified QIAamp DNA Mini kit (Qiagen) protocol. As part of GPS, pneumococcal isolates were whole-genome sequenced on an Illumina HiSeq platform to produce paired-end reads with an average of 100–125 bp in length, and data were deposited into the European Nucleotide Database. Whole-genome sequence data were processed as previously described^36,47.

AMR

We performed predictive antimicrobial susceptibility profiling using the CDC-AMR pipeline for three classes of antimicrobials: β-lactams (penicillin; encoded by the genes pbp1A, pbp2B and pbp2X)^48,49; sulfonamides (co-trimoxazole; folA and folP); and macrolides (erythromycin and clindamycin; ermB and mefA)^50,51. This was done for 6,798 randomly selected isolates⁴⁰.

Constructing time-resolved phylogenetic trees

We selected the GPSCs for which we had genomes from each of South Africa’s nine provinces and for which we had a minimum of 50 sequences in total to build phylogenies, henceforth referred to as ‘dominant GPSCs’. There were nine dominant GPSCs: GPSC1, GPSC2, GPSC5, GPSC10, GPSC13, GPSC14, GPSC17, GPSC68 and GPSC79 (n = 2.575). Assembly was performed using Wellcome Sanger Institute pathogen informatics automated pipelines and is freely available for download from GitHub under an open-source licence, GNU GPL 3 (ref. ⁵²). For each sample, sequence reads were used to create multiple assemblies using VelvetOptimiser (v.2.2.5) and Velvet (v.1.2.10)⁵³. An assembly improvement step was applied to the assembly with the best N50 and contigs scaffolded using SSPACE (v.2.0)⁵⁴, and sequence gaps were filled using GapFiller (v.1.11)⁵⁵. Assembly quality control parameters included a minimum average sequencing depth of 20× and an assembly length of 1.9–2.3 Mb. Sequences with more than 15% heterozygous SNP sites were excluded.

We created reference genomes for each GPSC using ABACAS (v.1.3.1) to order the contigs from a representative of each GPSC mapped to S. pneumoniae (strain ATCC 700669/Spain 23F-1) (EMBL accession: FM211187)⁵⁶. Any contigs that did not align were concatenated to the end. We multiply mapped all genomes from each dominant GPSC against these references, respectively, using a custom mapping, variant calling and local realignment around indels pipeline (multiple_mappings_to_bam.py)⁵⁷ using bwa-MEM (v.0.7.17)⁵⁸ and samtools mpileup (v.1.6)⁵⁹. The minimum base quality for a base to be considered was 50. The minimum mapping quality for a SNP to be called was 20, with a minimum of 8 reads matching the SNP. We built trees masking recombination regions using Gubbins (v.2.4.1)⁶⁰ with the hybrid model that uses FastTree for the first iteration and RAxML subsequently⁶¹ and a GTR model. We converted branch length to time using BactDating (v.1.0) with a mixed gamma, relaxed clock model⁶². We compared concordance between BEAST (v.1.10.4)⁶³ with both strict and relaxed clocks, and a Bayesian skyline prior. As the results were concordant, we used BactDating owing to its shorter runtime (Supplementary Fig. 14).

RR framework

We used a RR framework to investigate the risk of genetic similarity across geographical distance⁶⁴. We compared the location (loc) and label (G) (that is, GPSC or genetic similarity) of pairs of sequences that were collected around the same time (t). This approach has been shown to be robust to substantial biases in timing and location of isolate collection⁶⁴. We first constructed pair-wise matrices comparing every isolate to every other isolate (n pairs = 6,910). In the numerator was the ratio of pairs that were the same GPSC, collected within a year of each other, from the same province, over the total number of pairs collected within a year of each other from the same province. The denominator was the ratio of pairs that were the same GPSC, collected within a year of each other, from distant provinces (>1,000 km apart) (L_ref), over the total number of pairs collected within a year of each other from distant provinces. Geographical distances were calculated based on the centroid coordinates of each province. To demonstrate the suitability of using centroid distances, we simulated a spatial transmission process for 1,000 separate chains in which at each generation, a daughter point is placed at a randomly located location 350 m in each of the x and y direction. This was repeated over 20 generations. We then identified the centroid of each case based on the closest coordinate rounded to the nearest kilometre. We then calculated the total distance covered for both the true distance and the centroid distances and found that the resulting distances were similar (Supplementary Fig. 15).

To quantify uncertainty, we used a bootstrapping approach whereby in each bootstrap iteration, we randomly sampled with replacement the isolates before recalculating the statistic. We report the 2.5 and 97.5 percentiles from the resulting distribution.

We also repeated the same analysis but used the time-resolved phylogenetic trees to interrogate pairs across increasing divergence times (breaking the GPSCs into higher resolution). For this, rather than matrices designating whether pairs were the same GPSC or different GPSCs, the divergence time between each pair was included. We only included the divergence times between like GPSCs (Fig. 2c–e).

$${{\rm{R}}{\rm{R}}}_{{\rm{l}}{\rm{o}}{\rm{c}}}(\,g1,g2)=\frac{\mathop{\sum }\limits_{i=1}^{n}\mathop{\sum }\limits_{j\ne i}^{n}I({{\rm{l}}{\rm{o}}{\rm{c}}}_{i}={{\rm{l}}{\rm{o}}{\rm{c}}}_{j}\cap {t}_{ij}\le 1\,{\rm{y}}{\rm{e}}{\rm{a}}{\rm{r}}\cap ({G}_{ij} > {g}_{1},{G}_{ij} < {g}_{2}))\,/\,\mathop{\sum }\limits_{i=1}^{n}\mathop{\sum }\limits_{j\ne i}^{n}I({{\rm{l}}{\rm{o}}{\rm{c}}}_{i}={{\rm{l}}{\rm{o}}{\rm{c}}}_{j}\cap {t}_{ij}\le 1\,{\rm{y}}{\rm{e}}{\rm{a}}{\rm{r}})}{\mathop{\sum }\limits_{i=1}^{n}\mathop{\sum }\limits_{j\ne i}^{n}I({{{\rm{L}}}_{{\rm{r}}{\rm{e}}{\rm{f}}}}_{i}={{{\rm{L}}}_{{\rm{r}}{\rm{e}}{\rm{f}}}}_{j}\cap {t}_{ij}\le 1\,{\rm{y}}{\rm{e}}{\rm{a}}{\rm{r}}\cap ({G}_{ij} > {g}_{1},{G}_{ij} < {g}_{2}))\,/\,\mathop{\sum }\limits_{i=1}^{n}\mathop{\sum }\limits_{j\ne i}^{n}I({{{\rm{L}}}_{{\rm{r}}{\rm{e}}{\rm{f}}}}_{i}={{{\rm{L}}}_{{\rm{r}}{\rm{e}}{\rm{f}}}}_{j}\cap {t}_{ij}\le 1\,{\rm{y}}{\rm{e}}{\rm{a}}{\rm{r}})}$$

(2)

We utilized the framework to compare a range of geographical distances, keeping the reference distance to pairs that were >1,000 km apart (Fig. 2).

To identify the divergence time at which pairs had an equal risk of being in the same province as distant provinces (time to homogenization across South Africa), we investigated the divergence time at which there was no increased risk of similarity within a province compared against distant provinces (RR = 1). We stratified distances in South Africa into groups of distances, including the 9 within provinces, 14 pairs that were <500 km apart, 16 pairs 500–1,000 km apart and 6 pairs of provinces >1,000 km apart. We repeated the framework across rolling 20-year time windows at 10-year intervals from 0 to 100 years (g1,g2) (Fig. 2f). We repeated this for pairs for which one was from South Africa and the other was from another country in Africa (Supplementary Fig. 3A). We sampled 300 sequences from each province with replacement to compensate for biased sampling (Extended Data Fig. 2a). Furthermore, to incorporate phylogenetic uncertainty into the statistical framework, we sampled 100 individual phylogenies from the BactDating posterior. We report the 2.5 and 97.5 percentiles from the resulting distribution. We also repeated the analyses only including pairs isolated from patients with pneumococcal disease (Extended Data Fig. 2b).

Probabilistic mobility model

Overall strategy

We extended a previously published mechanistic phylogeographic model⁶⁵ to estimate the mobility of the pneumococcus between pairs of municipalities at each transmission step. To infer the probable path of transmission between sequence pairs, we used the divergence time and the generation time distribution to estimate the number of transmission generations between pairs of sequences. Each generation is a possible transmission event and provides an opportunity for a mobility event.

The approach ultimately aims to estimate an origin destination matrix for a transmission step, whereby each cell represents the probability that the pneumococcus is now in location j after one transmission step given it was previously in location i. As the phylogenetic trees combined with the generation time provide an estimate of how many transmission steps separate pairs of samples, we can use repeated matrix multiplication to integrate over all possible pathways linking two locations (see below for more details). We incorporated the probability of sampling at each geographical location, within each collection year, by GPSC to account for our observation process.

Notation

We follow notation per a previous study⁶⁵. A pair of isolates, C_a and C_b, with sequences, Seq_a and Seq_b included in a phylogeny are found in locations, L_a and L_b, and the samples were taken in the years, T_a and T_b. The inferred MRCA between C_a and C_b is time T_m and located in L_m. The number of transmission generation from the MRCA to C_a is G_a and to C_b is G_b.

Model fitting

Single transmission generation

Considering a single transmission generation, the probability that persons i and j come into contact with each other given i lives in location a and j lives in location b can be written as follows:

$$\begin{array}{l}{\rm{P}}({\rm{person}}\,i\,{\rm{and}}\,j\,{\rm{come}}\,{\rm{into}}\,{\rm{contact}}| {L}_{i}=a,{L}_{j}=b)\\ =\mathop{\sum }\limits_{k}^{N}P({V}_{i}=k| {L}_{i}=a)\cdot P({V}_{j}=k| {L}_{j}=b)\cdot {B}_{k}\end{array}$$

(3)

Where $P\left({V}_{i}=k| {L}_{i}=a\right)$ is the probability that individual i, whose home location is in a, visits location k and $P\left({V}_{j}=k| {L}_{j}=b\right)$ is the probability that individual j whose home location is in b visits location k, and B_k is the location-specific probability of transmission for k.

At time ${\tau }$, one infector in i in location a is expected to transmit to this number of persons in location b:

$$\begin{array}{l}E({\rm{number}}\,{\rm{of}}\,{\rm{persons}}\,{\rm{from}}\,b\,{\rm{infected}}\,{\rm{at}}\,{\rm{time}}\,{\tau }| {L}_{i}=a)\\ =\mathop{\sum }\limits_{k}^{N}P({V}_{i}=k| {L}_{i}=a)\cdot P({V}_{j}=k| {L}_{j}=b)\cdot {B}_{k}\cdot {S}_{b,{\tau },{\rm{gpsc}}}\end{array}$$

(4)

Where ${S}_{b,{\tau },{\rm{gpsc}}}$ is the number of susceptible people in location b at time ${\tau }$ with some lineage = gpsc.

The total number of people infected by the infector is:

$$\begin{array}{l}E\,({\rm{number}}\;{\rm{of}}\;{\rm{persons}}\;{\rm{from}}\;{\rm{all}}\;{\rm{locations}}\;{\rm{infected}}\;{\rm{at}}\;{\rm{time}}\,\tau | {L}_{i}=a)\\ \,\,=\mathop{\sum }\limits_{m}^{N}\mathop{\sum }\limits_{k}^{N}P({V}_{i}=k| {L}_{i}=a)\cdot {P}({V}_{j}=k| {L}_{j}=m)\cdot {B}_{k}\cdot {S}_{b,\tau ,{\rm{gpsc}}}\end{array}$$

(5)

Conditional on transmission occurring, the probability that the infectee’s isolate is taken in location b is:

$$\begin{array}{l}{\delta }_{a,b,\tau ,{\rm{gpsc}}}=P({L}_{j}=b| {L}_{i}=a)\\ \,\,=\frac{{\sum }_{k}^{N}P({V}_{i}=k| {L}_{i}=a)\cdot P({V}_{j}=k| {L}_{j}=b)\cdot {B}_{k}\cdot {S}_{b,\tau ,{\rm{gpsc}}}}{{\sum }_{m}^{N}{\sum }_{k}^{N}P({V}_{i}=k| {L}_{i}=a)\cdot P({V}_{j}=k| {L}_{j}=m)\cdot {B}_{k}\cdot {S}_{b,\tau ,{\rm{gpsc}}}}\end{array}$$

(6)

We then created an NXN transmission matrix,${\Delta }_{{\tau },{\rm{gpsc}},{\rm{gen}}=1}$, with N being the total number of locations, containing the transmission probabilities, asymmetrically, between all pairs of locations at each point in time; ${{\delta }}_{a,b,{\tau },{\rm{gpsc}}}$ is element [a,b] of the matrix.

Human mobility characterization

We use Meta mobility data (MetMob), as described in the mobility data section above, to characterize mobility between the 234 municipalities of South Africa. We aggregated these to the province level (n = 9) to fit the model. Initially we extracted a 234 × 234 matrix that sets out the probability that an individual from municipality a visits location k; again, where k = any municipality.

MetMob comes from individuals using Meta; however, this may not be representative of the amount of time spent at home by those involved in pneumococcus transmission. The mean of the diagonal of the Meta Mobility matrix (234 × 234) is 0.989, implying, on average, 98.9% of Meta Users stay in their home municipality, H_i. To allow for individuals to spend more or less time at home than represented in the MetMob data, we incorporated a parameter to adjust the probability of being at home (${\theta }$). We adjusted the probability of staying home using a standard logistic function and restricted it with bounds of –0.04 and 0.6 to facilitate exploration of a sensible space. The adjustment allowed by the bounds limits the range of movement H_i–0.6 and H_i+0.04.

The probability of a person remaining in location a therefore becomes:

$$\begin{array}{l}P({V}_{i}=a| {H}_{i}=a)\\ =\,{\rm{MetMob}}[a,a]-\left(-0.04+\left(\frac{(\exp ({\theta }))}{(1+\exp ({\theta }))}\right)\cdot (0.6+0.04)\right)\end{array}$$

(7)

Where a value of greater than 1.0 is obtained for a specific municipality, this is replaced by a value of 0.999.

The estimates thus far are mobility per day, but we were interested in mobility across the infectious period. Therefore, we adjusted the diagonal to account for mobility each day within the infectious period:

$${\rm{MetMob}}\left[a,a\right]=1-{{\rm{MetMob}}\left[a,a\right]}^{G}$$

(8)

We rescaled the probabilities so that the sum of all mobility is equal to 1:

$$P({V}_{i}=a| {H}_{i}=a,k\ne a)={\rm{MetMob}}[a,k]/(1-P({V}_{i}=a| {H}_{i}=a))$$

(9)

Where MetMob[a,k] considers all mobility probabilities from the Meta Mobility data.

The sum of the movements to South African municipalities is equal to 1, which assumes that the analysis contains all possible movements of both the pneumococcus and people and implying no external introductions. The outcome of this is that some mobility may be missed, especially around the country borders.

Probability of the pneumococcus being in each location after G transmission generations

To determine the probability that location k contains the home location after G transmission generations, we used matrix multiplication, which integrates across all possible pathways connecting two locations.

$${\Delta }_{{\tau }={T}_{G},{\rm{gpsc}},{\rm{gen}}=G}=\mathop{\prod }\limits_{r=2}^{G}{\Delta }_{{\tau }={t}_{r-1},{\rm{gpsc}},{\rm{gen}}=r-1}\cdot {\Delta }_{{\tau }={t}_{r},{\rm{gpsc}},{\rm{gen}}=1}$$

(10)

Where t_r is the time of generation G_r.

Probability of observing a pair of cases in two specific locations

The probability that C_A has home location L_A and C_B has home location L_B is conditional on the sequences being observed in locations L_A and L_B at times T_A and T_B. We assumed that the location of two cases, L_i and L_j, is dependent on the location of their MRCA, L_m, and the number of transmission generations separating them from their MRCA, G_A and G_B.

The observations processes across locations are independent of each other, and each transmission event is independent of other transmission events. The probability of observing (Obs) a case at L_i at time T_A is not dependent on the number of generations to, or location of, the MRCA. We considered discretized space of the nine provinces of South Africa, resulting in the following equation:

$$\begin{array}{c}P({L}_{A},{L}_{B}| {{\rm{Obs}}}_{{L}_{B},{T}_{B}},{{\rm{Obs}}}_{{L}_{B},{T}_{B}},{{\rm{Seq}}}_{A},{{\rm{Seq}}}_{B},{T}_{A},{T}_{B})\\ =\frac{{\sum }_{{L}_{m}}{\sum }_{{G}_{A}}{\sum }_{{G}_{B}}P({{\rm{Obs}}}_{{L}_{A},{T}_{A}})P({{\rm{Obs}}}_{{L}_{B}{T}_{B}})P({L}_{A}| {L}_{m},{G}_{A})P({L}_{B}| {L}_{m},{G}_{B})P({L}_{m})P({G}_{A},{G}_{B}| {{\rm{Seq}}}_{A},{{\rm{Seq}}}_{B},{T}_{A},{T}_{B})}{{\sum }_{{L}_{i}}{\sum }_{{L}_{j}}{\sum }_{{L}_{m}}{\sum }_{{G}_{A}}{\sum }_{{G}_{B}}P({{\rm{Obs}}}_{{L}_{i},{T}_{A}})P({{\rm{Obs}}}_{{L}_{i}{T}_{B}})P({L}_{i}| {L}_{m},{G}_{A})P({L}_{j}| {L}_{m},{G}_{B})P({L}_{m})P({G}_{A},{G}_{B}| {{\rm{Seq}}}_{A},{{\rm{Seq}}}_{B},{T}_{A},{T}_{B})}\end{array}$$

(11)

Probability of G generations between MRCA and a sequenced isolate

Under the previous equation, $P({G}_{A},{G}_{B}| {{\rm{Seq}}}_{A},{{\rm{Seq}}}_{B},{T}_{A},{T}_{B})$ represents the generation time distribution.

We can extract the joint probability of C_A and C_B being separated from the MRCA by G_A and G_B transmission generations, respectively, using the above-derived generation time distribution and the time-resolved phylogenetic trees.

Assuming the generation time is gamma distributed and all transmission events are independent, the sum of the gamma distribution is also gamma distributed. Additionally, we can extract the evolutionary times E_A and E_B, separating C_A and C_B from the MRCA. As previously described⁶⁵, using equation (19), we can estimate the probability of g transmission events over many trees, allowing us to incorporate phylogenetic including evolutionary parameters from the tree structure.

We determined the probability for the number of generations from MRCA for each isolate for 1–1,000 generations, using the generation time derived above.

Location of the MRCA (P(L _m))

We estimated the probability that, on average, an MRCA is in each of the nine provinces in South Africa. We estimated parameters for each of the eight provinces, setting Western Cape aside as a reference, and dividing by the total across all nine to ensure that the sum of the probabilities is 1. This again assumes no external introductions.

Calculation of likelihood

We calculated the likelihood using all pairs of available sequenced S. pneumoniae as previously described⁶⁵. We accounted for the observation process by incorporating the probability of sampling in each location for isolates belonging to each GPSC annually.

Likelihood equation

We calculated the likelihood using all pairs of sequenced pneumococci as follows:

$$L\propto \mathop{\prod }\limits_{{\rm{province}}=1}^{9}\mathop{\prod }\limits_{i=1}^{{n}_{{\rm{gpsc}}}}\prod _{j\ne i}P({L}_{i},{L}_{j}| {{\rm{Obs}}}_{{L}_{i},{T}_{i}},{{\rm{Obs}}}_{{L}_{j},{T}_{j}},{{\rm{Seq}}}_{i},{{\rm{Seq}}}_{j},{T}_{i},{T}_{j})$$

(12)

Where n_gpsc are the number of sequences available from GPSC gpsc.

Hamiltonian MCMC

We used an MCMC approach to estimate our parameters using the package fmcmc (v.0.5-1) implemented in R⁶⁶. We estimated nine parameters: a parameter that adjusts the probability of staying in the home municipality compared with Meta mobility data; and eight parameters capturing the relative probability that the MRCA of a pair of individuals was in each of the other eight provinces compared with the province of Western Cape (the reference).

We only used pairs of sequences that were separated by less than 10 years of evolutionary time between them. After 10 years, there are limited spatial signals remaining, as the bacterium has had many opportunities to move. This approach was used to make the model computationally tractable. To incorporate phylogenetic uncertainty, we repeatedly refit the model using 50 randomly selected phylogenies from the BactDating posterior. We report the 2.5 and 97.5 percentiles from the resulting distribution. In a sensitivity analysis, we showed that increasing the model to 15 years resulted in similar estimates (Extended Data Fig. 12a).

Model performance

To assess the performance of our model, we used a simulation framework. We simulated 50,000 pairs of events, locations and the location of the MRCA between pairs, whereby the probability of mobility between each pair of locations was determined by the Meta mobility matrix adjusted by a parameter of known value of −2. We tested our ability to recapture this input parameter. To incorporate the biased observation process, we downsampled the simulated data based on the by-province sampling probabilities from our true data.

We used the downsampled data to fit our model with 20,000 steps of a MCMC with a jump step of 0.08 in 3 chains. We were able to recapture the downsampled data utilizing the human mobility framework and the estimated parameter for the probability of staying at home, accounting for the sampling probability per province. We then utilized the same human mobility framework and the estimated parameter, excluding the sampling probability, and were able to recapture the complete simulated data, including the input parameter (Extended Data Fig. 5a,b).

We repeated these simulations but downsampled with various biases. We determined how well the model performed when we only sampled two provinces. We also tested how far off our estimates would be if our generation time estimate was 50% higher or 50% lower than we had estimated given the Kenyan and Gambian data^31,32 (Extended Data Fig. 5c,d).

Model sensitivity analyses

We estimated our parameters only to include isolates from patients with invasive pneumococcal disease (Supplementary Fig. 1B).

To test the impact of a range of generation times on the parameter estimates, we also re-ran our MCMC with generation times of 15, 35 (as included in the model) and 55 days. To quantify uncertainty, we sampled posterior parameters, reporting the 2.5 and 97% percentiles (Extended Data Fig. 12b).

We confirmed that the chains converged for each of the nine parameters estimated (Supplementary Fig. 16).

Probability guided transmission simulations

Person-to-person transmission chains

We simulated person-to-person transmission chains seeded in a starting municipality weighted by the population size. We determined the RR of being in a specific municipality after 10 transmission generations (approximately 1 year) across 500,000 simulations.

We fixed the starting municipality to be rural (population density <50 km^–2) or urban (population density >500 km^–2) and repeated the above simulation. For 10,000 sequential simulations, we counted the number of unique municipalities affected and distance travelled at each transmission interval weighting by population size. We then determined the number of municipalities travelled to across all transmission chains (Extended Data Fig. 3).

Branching epidemic

We simulated a branching epidemic in which we drew the number of transmission events seeded by each event from a Poisson distribution around an effective reproductive number (R_eff) of 1 and amplitude of 0.15 over 100 iterations. We determined the mean distance from the starting municipality after 60 generations (5.8 years with a 35-day generation time) and the number of municipalities visited over that time. We calculated the uncertainty at the 95% CI of a normal distribution.

Gravity model

We compared the performance of the model that used Meta data with the performance of a simple gravity model whereby the probability of mobility is determined by the distance and human population size of the municipalities. We calculated the probability of mobility between locations i and j (GravMob_i,j) to be the log of the destination population size (popsize_j) raised to parameter β divided by the distance between locations, loc_i and loc_j (dist_i,j) raised to parameter γ.

$${{\rm{GravMob}}}_{i,j}=\frac{\log ({{{\rm{popsize}}}_{j}}^{{\beta }})}{{{{\rm{dist}}}_{i,j}}^{{\gamma }}}$$

(13)

We also tested a model including only distance and estimating an exponent adjustment parameter γ.

We calculated the DIC comparing the three models and found that the Meta mobility model (DIC = 12,290.97) was the best-performing model⁶⁷ when compared with the gravity model (DIC = 12,294.34). Both of these models performed better than distance alone (DIC = 12,424.32) (Extended Data Fig. 4).

All statistical analysis for the RR framework and the mobility model was performed in R (v.3.6.2)⁶⁸.

Population-level fitness model

Overall strategy

We developed logistic growth models to fit the changing prevalence of different serotypes and assess the impact of vaccine implementation, utilizing a method that has previously been implemented for the endemic bacterium Bordetella pertussis⁶⁹.

Vaccine-type model

We first binned serotypes into three groups: those not included in the vaccine (NVTs), serotypes included in PCV7 (4, 6B, 9V, 14, 18C, 19F and 23F), and additional serotypes included in PCV13 (1, 3, 5, 6A, 7F and 19A). We used the full data for this analysis. We computed the relative abundance f_i_,ref of each group of serotypes, i, compared with a reference type, ref.

$${f}_{i,{\rm{ref}}}\left(t\right)=\frac{{N}_{i}\left(t\right)}{{N}_{i}\left(t\right)+{N}_{{\rm{ref}}}\left(t\right)}$$

(14)

We chose to use NVTs as the reference. Varying the reference did not affect the model, as long as the reference samples span all years. We then used a simple logistic model, assuming a constant growth rate to capture the evolution of this abundance, at each time t.

$${f}_{i,{\rm{ref}}}\left(t\right)=\frac{1}{1+\left(\frac{1-{f}_{i,{\rm{ref}},0}}{{f}_{i,{\rm{ref}},0}}\right)\exp \left({-r}_{i,{\rm{ref}}}\cdot t\right)}$$

(15)

where r_i,ref is the growth rate of that abundance shared across all provinces, and f_i,ref,0 is the initial relative abundance of the serotype group i with respect to a chosen ref.

To control for the varying presence of all circulating serotypes through time, we present fitness as the average relative growth rate, $\bar{{r}_{i}}$, for each group with respect to a randomly selected group in the population:

$$\bar{{r}_{i}}=\mathop{\sum }\limits_{j\ne i}^{n}\bar{{f}_{j}}({r}_{i,{\rm{ref}}}-{r}_{j,{\rm{ref}}})$$

(16)

where n is the number of groups, and $\bar{{f}_{j}}$ is the average absolute frequency of the group in the period of time considered.

This average relative growth rate, $\bar{{r}_{i}}$, can be identified as the selection rate coefficient of the group in the population considered⁶⁹. The selection rate coefficient is a direct measure of the fitness advantage of emerging variants and is one of the best indicators as to whether a strain will increase in frequency during an outbreak^70,71.

We can further multiply the selection rate by the mean generation time (35 days) to obtain the selection coefficient per generation, which is the relative fitness advantage per transmission generation.

In the Article, we also present estimates of the relative fitness advantage of groups in particular time frames. We used the same computation as for the average relative growth rate, but tailored it to specific references. For example, to compute the relative fitness advantage of NVTs compared with VTs in the PCV era $\Delta \overline{\,{r}_{{\rm{NVT}},{\rm{VT}},{\rm{after}}{\rm{PCV}}}}$, we computed:

$$\Delta \overline{\,{r}_{{\rm{NVT}},{\rm{VT}},{\rm{after}}{\rm{PCV}}}}=\frac{1}{\overline{\,{f}_{{\rm{PCV}}7}}+\overline{\,{f}_{{\rm{PCV}}13}}}\sum _{i{\epsilon }({\rm{PCV}}7,{\rm{PCV}}13)}\overline{{f}_{i}}\,(\overline{\,{r}_{{\rm{NVT}},{\rm{after}}{\rm{PCV}}}}-\overline{\,{r}_{i,{\rm{after}}{\rm{PCV}}}})$$

(17)

To fit the model, we used Hamiltonian Monte Carlo as implemented in R-Stan⁷², with stan (v.2.26.1). We used a Poisson likelihood to fit the observed proportion of sequences that were of each category and the total number of isolates in that year as an offset. The model estimated the proportion of isolates that were of each type at the start of the dataset and the fitness parameters.

Exploration of vaccine effect

To investigate whether the serotypes fitness changed across after implementation of PCV7 and PCV13, we tested a range of models. We considered a model without any shift of fitness, a model with a single shift in 2009 and a model with two growth rates represented by a shift in fitness in both 2009 and 2011. The 2009 shift pertains only to those serotypes included in PCV7 and the 2011 shift pertains to those additional 6 serotypes in PCV13. Model comparison was done using the Watanabe–Akaike information criterion (WAIC) implemented in the loo package⁷³. Furthermore, we tested for a potential delay between implementation of PCVs and the change in fitness. Model comparison is presented in Extended Data Fig. 7.

The model performed best with an initial fitness switch in 2009 (the initial year of vaccine implementation) and another fitness switch in 2011 (the year of PCV13 implementation). We used this model in the main text.

Estimation of individual serotype fitness

Using the same framework, we also estimated the growth rate per serotype to capture whether the individual dynamics were concordant with, or deviated from, what was expected according to their respective group (NVT, PCV7 or PCV13) (Fig. 4a–c). The reference strain in this analysis was set to serotype 13, which is a NVT and for which samples span all years.

Exploration of predicted GPSC dynamics based on their serotype composition

We explored whether using the proportion of isolates that were VT versus NVT within each GPSC at the start of the study period and our fitness estimates could explain the subsequent dynamics of individual GPSCs.

To predict the dynamics of individual GPSCs based on their serotype composition, we used the same framework as described above, applied to each GPSC serotype group in our dataset. As the number of such groups is large (n = 340 GPSC serotype groups), we restricted the analysis to the GPSCs present at a minimum prevalence of 1%. This led to 26 GPSCs being considered for this analysis (representing 74.7% of the dataset), split in a total of 101 GPSC serotype groups. We then modelled the dynamics of each GPSC serotype group, estimating one starting frequency per group and using the previously estimated fitness parameters, either by VT or serotype. We also considered a model with no fitness parameters (relative fitness = 1). To assess the model performances, we computed the Akaike information criterion (AIC)⁷⁴ using the average likelihood of each model and the number of parameters used in each model. We also computed the predicted GPSC dynamics over time for each model by summing all the predicted GPSC serotype group dynamics for each GPSC. As a measure of goodness of fit we used the coefficient of determination of the observed versus predicted GPSC proportions each year:

$${R}^{2}=1-\frac{{\sum }_{i}{({\rm{o}}{\rm{b}}{\rm{s}}{\rm{e}}{\rm{r}}{\rm{v}}{\rm{e}}{\rm{d}}{\rm{p}}{\rm{r}}{\rm{o}}{\rm{p}}{\rm{o}}{\rm{r}}{\rm{t}}{\rm{i}}{\rm{o}}{\rm{n}}-{\rm{p}}{\rm{r}}{\rm{e}}{\rm{d}}{\rm{i}}{\rm{c}}{\rm{t}}{\rm{e}}{\rm{d}}{\rm{p}}{\rm{r}}{\rm{o}}{\rm{p}}{\rm{o}}{\rm{r}}{\rm{t}}{\rm{i}}{\rm{o}}{\rm{n}})}^{2}}{{\sum }_{i}{({\rm{o}}{\rm{b}}{\rm{s}}{\rm{e}}{\rm{r}}{\rm{v}}{\rm{e}}{\rm{d}}{\rm{p}}{\rm{r}}{\rm{o}}{\rm{p}}{\rm{o}}{\rm{r}}{\rm{t}}{\rm{i}}{\rm{o}}{\rm{n}}-{\rm{m}}{\rm{e}}{\rm{a}}{\rm{n}}{\rm{o}}{\rm{b}}{\rm{s}}{\rm{e}}{\rm{r}}{\rm{v}}{\rm{e}}{\rm{d}}{\rm{p}}{\rm{r}}{\rm{o}}{\rm{p}}{\rm{o}}{\rm{r}}{\rm{t}}{\rm{i}}{\rm{o}}{\rm{n}})}^{2}}$$

(18)

R² is also the proportion of variation explained by the variables considered in each model.

Our model estimated a constant fitness for each group considered (VTs or individual serotypes), assuming that if a group has the highest fitness, it will eventually replace the other groups. This assumption is meaningful for VTs and serotypes, as some are directly targeted by the vaccines, a strong selective force in the population. However, the fitness of each GPSC cannot be modelled with this simple assumption as it has been shown that their fitness is inherently multifactorial, as described in the NFDS model²⁴.

AMR

We then used the model to capture the decreasing proportion of penicillin resistance in the population. To do this, we incorporated the dynamics of VTs (PCV7 and PCV13) and NVTs in the population and the respective proportion of penicillin resistance within them. To keep the model tractable, we did not differentiate here between serotypes included in PCV7 or PCV13, instead we group them into VTs. We use the PCV7 implementation (2009) as the year of fitness shift. This parametrization marginally differed from the best model (next best) (Extended Data Fig. 7) and enabled us to keep the number of categories tractable. We also compared the single switch model to a model with no change in fitness at the time of vaccine implementation. This approach showed that the model with a switch was superior (Supplementary Fig. 17).

We used the same approach described above to model the proportion of VTs, f_VT, and NVTs, f_NVT. We then modelled the proportion of strains that were and were not penicillin resistant within each group, either VT, f_AMR|VT or NVT, f_AMR|NVT. We estimated the fitness of resistance in each.

We then fit the model to all four groups, f_VX,AMR (resistant NVT, susceptible NVT, resistant VT and susceptible VT).

$${f}_{{\rm{VX}},{\rm{AMR}}}(t)={{f}_{{\rm{VX}}}(t)\cdot f}_{{\rm{AMR| VX}}}(t)$$

(19)

Where f_VX is the proportion of either VT or NVT in the population and f_AMR|VX is the proportion of penicillin resistance within each of those groups. We then derived the proportion of penicillin resistance overall in the population by summing the proportion of VT resistant, f_VT,AMR, and NVT resistant, f_NVT,AMR, strains.

Estimating effect of fitness on migration

We included the calculated average relative growth rates in the simulated branching epidemic to calculate the mean number of municipalities visited, distance travelled and probability of being in the home municipality over 5 years. We report uncertainty at one standard deviation from the mean. We repeated this incorporating the post-PCV selection coefficients for NVTs and VTs and estimated the relative increase in the number of municipalities visited for NVT serotypes compared with PCV serotypes after 2 years of transmission.

Fitness model carrying capacity sensitivity

Our fitness model assumed constant fitness over time. This specifically means that if a serotype is found to be fitter than the rest of the population, we expect it to replace the whole population after some time. However, the currently best-supported model proposes that NFDS drives pneumococcus population dynamics²⁴, whereby lower frequency genes become more fit.

To test the effect of our assumption on our fitness estimates, we performed a sensitivity analysis. We introduced minimum and maximum carrying capacities in our logistic model. In equation (20), we let the relative frequency f_i_,ref go to a maximum of K_max and in equation (21), a minimum of K_min:

$${f}_{i,{\rm{ref}}}(t)=\frac{{K}_{\max }}{1+\left(\frac{{K}_{\max }-{f}_{i,{\rm{ref}},0}}{{f}_{i,{\rm{ref}},0}}\right)\exp ({-r}_{i,{\rm{ref}}}\cdot t)},\,{\rm{if}}\,{r}_{i,{\rm{ref}}} > 0$$

(20)

$${f}_{i,{\rm{ref}}}(t)=1-\frac{{1-K}_{\min }}{1+\left(\frac{({1-K}_{\min })-(1-{f}_{i,{\rm{ref}},0})}{(1-{f}_{i,{\rm{ref}},0})}\right)\exp ({r}_{i,{\rm{ref}}}\cdot t)},\,{\rm{if}}\,{r}_{i,{\rm{ref}}} < 0$$

(21)

with ${K}_{\min }\le {K}_{\max }$, and ${K}_{\min }\in \left[0,1\right],\,{K}_{\max }\in \left[0,1\right]$.

We considered a range of values for K_min (0.1, 0.05 and 0) and K_max (0.9, 0.95 and 1) (Supplementary Fig. 20). We tested the effect of this carrying capacity has on both the vaccine status model (Supplementary Fig. 18) and the AMR and vaccine status model (Supplementary Fig. 19). In each case, we compared the fitness estimates obtained. We found that the fitness estimates were robust to variable carrying capacities across those tested, which implied that our assumption does not affect this framework. However, it is important to note that this model is robust to set variable carrying capacities while being unable to estimate them, which is in contrast to NFDS. Our model assumes that there is a population replacement up to the carrying capacity and does not allow for fine-scale estimates of an equilibrium. This fitness model can be used as a quantitative descriptor of the growth of distinct groups (that is, penicillin resistance or VT serotypes in the context of a perturbation) but cannot describe the complex underlying fitness effects of changing frequencies of genes under NFDS²⁴.

All statistical analysis for the fitness model was performed in R (v.4.0.5)⁶⁸.

Ethics

The study was coordinated by the GPS (https://www.pneumogen.net/gps/), whose activities are approved by respective ethics committee in-country. Isolates sequenced in this study come from the National Institute for Communicable Diseases and the University of Witwatersrand carriage studies as previously published. These isolates were all collected as part of existing public health surveillance approved protocols in each country. No personally identifiable information was used as part of this study.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

All data and code for figures and analysis are accessible at GitHub (https://github.com/sophbel/geomig_evo_pneumo). All whole-genome sequences were deposited into the European Nucleotide Database and accession numbers are available in the GitHub repository and on FigShare (https://doi.org/10.6084/m9.figshare.24219214)⁷⁵. Associated metadata are available from the Microreact webserver (https://microreact.org/project/7wqgd2gbBBEeBLLPKonbaT-belman2024southafricapneumococcus), as well as from the GitHub repository and Global Pneumococcal Sequencing Project Monocle database (https://data.monocle.sanger.ac.uk/).

Code availability

All code and scripts for analysis and figures are available from GitHub (https://github.com/sophbel/geomig_evo_pneumo).

References

World Health Organization. The top 10 causes of death. WHO https://www.who.int/news-room/fact-sheets/detail/the-top-10-causes-of-death (2020).
Troeger, C. et al. Estimates of the global, regional, and national morbidity, mortality, and aetiologies of lower respiratory tract infections in 195 countries: a systematic analysis for the Global Burden of Disease Study 2015. Lancet Infect. Dis. 17, 1133–1161 (2017).
Article Google Scholar
Ikuta, K. S. et al. Global mortality associated with 33 bacterial pathogens in 2019: a systematic analysis for the Global Burden of Disease Study 2019. Lancet 400, 2221–2248 (2022).
Bender, R. G. et al. Global, regional, and national incidence and mortality burden of non-COVID-19 lower respiratory infections and aetiologies, 1990–2021: a systematic analysis from the Global Burden of Disease Study 2021. Lancet Infect. Dis. https://doi.org/10.1016/S1473-3099(24)00176-2 (2024).
Lees, J. A. et al. Fast and flexible bacterial genomic epidemiology with PopPUNK. Genome Res. 29, 304–316 (2019).
Article CAS PubMed PubMed Central Google Scholar
Ganaie, F. et al. A new pneumococcal capsule type, 10D, is the 100th serotype and has a large cps fragment from an oral Streptococcus. mBio 11, e00937-20 (2020).
Article PubMed PubMed Central Google Scholar
Ganaie, F. A. et al. Discovery and characterization of pneumococcal serogroup 36 capsule subtypes, serotypes 36 A and 36B. J. Clin. Microbiol. 61, e0002423 (2023).
Article PubMed Google Scholar
GPS Consortium, Global Pneumococcal Sequencing collection. Microb. Genomics https://www.microbiologyresearch.org/content/global-pneumococcal-sequencing-collection (2022).
Adegbola, R. A. et al. Carriage of Streptococcus pneumoniae and other respiratory bacterial pathogens in low and lower-middle income countries: a systematic review and meta-analysis. PLoS ONE 9, e103293 (2014).
Article ADS PubMed PubMed Central Google Scholar
Wahl, B. et al. Burden of Streptococcus pneumoniae and Haemophilus influenzae type b disease in children in the era of conjugate vaccines: global, regional, and national estimates for 2000-15. Lancet Glob. Health 6, e744–e757 (2018).
Article PubMed PubMed Central Google Scholar
Jacobs, M. R. et al. Emergence of multiply resistant pneumococci. N. Engl. J. Med. 299, 735–740 (1978).
Article CAS PubMed Google Scholar
Hume-Nixon, M. et al. Systematic review of the clinical outcomes of pneumonia with a penicillin-group resistant pneumococcus in respiratory and blood culture specimens in children in low- and middle-income countries. J. Glob. Health 12, 10004 (2022).
Article PubMed PubMed Central Google Scholar
Murray, C. J. et al. Global burden of bacterial antimicrobial resistance in 2019: a systematic analysis. Lancet 399, 629–655 (2022).
Article CAS Google Scholar
Musher, D. M., Anderson, R. & Feldman, C. The remarkable history of pneumococcal vaccination: an ongoing challenge. Pneumonia 14, 5 (2022).
Article PubMed PubMed Central Google Scholar
Pfizer. Prevenar 13. Pfizer https://www.pfizer.co.uk/products/prescription-medicines/prevenar-13 (2023).
GlaxoSmithKline. Synflorix^TM, GlaxoSmithKline’s pneumococcal vaccine, receives European authorisation. GSK https://www.gsk.com/en-gb/media/press-releases/synflorix-glaxosmithkline-s-pneumococcal-vaccine-receives-european-authorisation/ (2023).
Kobayashi, M. et al. Use of 15-valent pneumococcal conjugate vaccine and 20-valent pneumococcal conjugate vaccine among U.S. adults: updated recommendations of the Advisory Committee on Immunization Practices—United States, 2022. MMWR Morb. Mortal. Wkly Rep. 71, 109–117 (2022).
Article CAS PubMed PubMed Central Google Scholar
VIEW-hub (Johns Hopkins Bloomberg School of Public Health, International Vaccine Access Center (IVAC), www.view-hub.org (2022).
Ndlangisa, K. M. et al. The Global Pneumococcal Sequencing ConsortiumYR 2022, a Streptococcus pneumoniae lineage usually associated with pneumococcal conjugate vaccine (PCV) serotypes is the most common cause of serotype 35B invasive disease in South Africa, following routine use of PCV. Microb. Genom. 8, 000746 (2022).
CAS PubMed PubMed Central Google Scholar
Javaid, N. et al. Population genomics of pneumococcal carriage in South Africa following the introduction of the 13-valent pneumococcal conjugate vaccine (PCV13) immunization. Microb. Genom. 8, 000831 (2022).
CAS Google Scholar
Belman, S. et al. Genetic background of Cambodian pneumococcal carriage isolates following pneumococcal conjugate vaccine 13. Microb. Genom. 8, mgen000837 (2022).
PubMed PubMed Central Google Scholar
von Gottberg, A. et al. GERMS-SA Investigators, effects of vaccination on invasive pneumococcal disease in South Africa. N. Engl. J. Med. 371, 1889–1899 (2014).
Article Google Scholar
Huang, L. et al. Cost-effectiveness analysis of the South African Infant National Immunization Program for the prevention of pneumococcal disease. Infect. Dis. Ther. 12, 933–950 (2023).
Article PubMed PubMed Central Google Scholar
Corander, J. et al. Frequency-dependent selection in vaccine-associated pneumococcal population dynamics. Nat. Ecol. Evol. 1, 1950–1960 (2017).
Article PubMed PubMed Central Google Scholar
Azarian, T. et al. Frequency-dependent selection can forecast evolution in Streptococcus pneumoniae. PLoS Biol. 18, e3000878 (2020).
Article CAS PubMed PubMed Central Google Scholar
Andrejko, K., Ratnasiri, B., Hausdorff, W. P., Laxminarayan, R. & Lewnard, J. A. Antimicrobial resistance in paediatric Streptococcus pneumoniae isolates amid global implementation of pneumococcal conjugate vaccines: a systematic review and meta-regression analysis. Lancet Microbe 2, e450–e460 (2021).
Article CAS PubMed PubMed Central Google Scholar
Müller, N. F., Dudas, G. & Stadler, T. Inferring time-dependent migration and coalescence patterns from genetic sequence and predictor data in structured populations. Virus Evol. 5, vez030 (2019).
Article PubMed PubMed Central Google Scholar
Skosana, Z. et al. Non-vaccine serotype pneumococcal carriage in healthy infants in South Africa following introduction of the 13-valent pneumococcal conjugate vaccine. S. Afr. Med. J. 111, 143–148 (2021).
Article CAS PubMed Google Scholar
Madhi, S. A. et al. Residual colonization by vaccine serotypes in rural South Africa four years following initiation of pneumococcal conjugate vaccine immunization. Expert Rev. Vaccines 19, 383–393 (2020).
Article CAS PubMed Google Scholar
Movement Range Maps, Facebook Data for Good, https://dataforgood.fb.com/tools/movement-range-maps/ (Meta, 2022).
Chaguza, C. et al. Carriage dynamics of pneumococcal serotypes in naturally colonized infants in a rural African setting during the first year of life. Front. Pediatr. 8, 587730 (2021).
Article PubMed PubMed Central Google Scholar
Abdullahi, O. et al. Rates of acquisition and clearance of pneumococcal serotypes in the nasopharynges of children in Kilifi District, Kenya. J. Infect. Dis. 206, 1020–1029 (2012).
Article PubMed PubMed Central Google Scholar
Ginsburg, C. et al. Human capital on the move: education as a determinant of internal migration in selected INDEPTH surveillance populations in Africa. Demogr. Res. 34, 845–884 (2016).
Article PubMed PubMed Central Google Scholar
Collinson, M. A. Striving against adversity: the dynamics of migration, health and poverty in rural South Africa. Glob. Health Action https://doi.org/10.3402/gha.v3i0.5080 (2010).
Kleynhans, J. et al. Estimated impact of the pneumococcal conjugate vaccine on pneumonia mortality in South Africa, 1999 through 2016: an ecological modelling study. PLoS Med. 18, e1003537 (2021).
Article CAS PubMed PubMed Central Google Scholar
Gladstone, R. A. et al. International genomic definition of pneumococcal lineages, to contextualise disease, antibiotic resistance and vaccine impact. EBioMedicine 43, 338–346 (2019).
Article PubMed PubMed Central Google Scholar
Müller, A. et al. Streptococcus pneumoniae serotypes associated with death, South Africa, 2012–2018. Emerg. Infect. Dis. 28, 166–179 (2022).
Article PubMed PubMed Central Google Scholar
US Centers for Disease Control and Prevention. Pneumococcal Disease and Antibiotic Resistance. CDC https://www.cdc.gov/pneumococcal/php/drug-resistance/index.html (2022).
VIEW-hub—South Africa (Johns Hopkins Bloomberg School of Public Health, International Vaccine Access Center (IVAC)), https://view-hub.org/map/country/zaf (accessed September 2022).
GPS, L. McGee, S. D. Bentley, K. P. Klugman. GPS: Global Pneumococcal Sequencing Project, https://www.pneumogen.net/gps/ (GPS Project, 2022).
Mesev, V. in Remotely-Sensed Cities (ed. Mesev, V.) 301–314 (CRC Press, 2003).
ORNL LandScan Viewer, https://landscan.ornl.gov/ (Oak Ridge National Laboratory, 2022).
Maas, P. et al. Facebook Disaster Maps: Methodology. Meta Research https://research.facebook.com/blog/2017/6/facebook-disaster-maps-methodology/ (2017).
Maas, P. Facebook disaster maps: aggregate insights for crisis response & recovery. In Proc. 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (eds Teredesai, A. & Kumar. V.) 3173 (ACM, 2019).
Wesolowski, A. et al. Quantifying the impact of accessibility on preventive healthcare in Sub-Saharan Africa using mobile phone data. Epidemiology 26, 223–228 (2015).
Article PubMed PubMed Central Google Scholar
Salje, H. et al. Reconstruction of 60 years of chikungunya epidemiology in the Philippines demonstrates episodic and focal transmission. J. Infect. Dis. 213, 604–610 (2016).
Article PubMed Google Scholar
Page, A. J. et al. Robust high-throughput prokaryote de novo assembly and improvement pipeline for Illumina data. Microb. Genom. 2, e000083 (2016).
PubMed PubMed Central Google Scholar
Li, Y. et al. Penicillin-binding protein transpeptidase signatures for tracking and predicting β-lactam resistance levels in Streptococcus pneumoniae. mBio 7, e00756-16 (2016).
Article PubMed PubMed Central Google Scholar
Li, Y. et al. Validation of β-lactam minimum inhibitory concentration predictions for pneumococcal isolates with newly encountered penicillin binding protein (PBP) sequences. BMC Genomics 18, 621 (2017).
Article PubMed PubMed Central Google Scholar
Metcalf, B. J. et al. Using whole genome sequencing to identify resistance determinants and predict antimicrobial resistance phenotypes for year 2015 invasive pneumococcal disease isolates recovered in the United States. Clin. Microbiol. Infect. 22, 1002.e1–1002.e8 (2016).
Article CAS PubMed Google Scholar
Metcalf, B. J. BenJamesMetcalf/Sanger_SPN_Scripts-Ref. GitHub https://github.com/BenJamesMetcalf/Sanger_SPN_Scripts-Ref (2016).
Sanger Pathogen Informatics. sanger-pathogens/vr-codebase. GitHub https://github.com/sanger-pathogens/vr-codebase (2022).
Zerbino, D. R. & Birney, E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18, 821–829 (2008).
Article CAS PubMed PubMed Central Google Scholar
Boetzer, M., Henkel, C. V., Jansen, H. J., Butler, D. & Pirovano, W. Scaffolding pre-assembled contigs using SSPACE. Bioinformatics 27, 578–579 (2011).
Article CAS PubMed Google Scholar
Boetzer, M. & Pirovano, W. Toward almost closed genomes with GapFiller. Genome Biol. 13, R56 (2012).
Article PubMed PubMed Central Google Scholar
Assefa, S., Keane, T. M., Otto, T. D., Newbold, C. & Berriman, M. ABACAS: algorithm-based automatic contiguation of assembled sequences. Bioinformatics 25, 1968–1969 (2009).
Article CAS PubMed PubMed Central Google Scholar
Wellcome Sanger Institute. sanger-pathogens/bact-gen-scripts. GitHub https://github.com/sanger-pathogens/bact-gen-scripts (2023).
Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
Article CAS PubMed PubMed Central Google Scholar
Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
Article PubMed PubMed Central Google Scholar
Croucher, N. J. et al. Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences using Gubbins. Nucleic Acids Res. 43, e15 (2015).
Article PubMed Google Scholar
Stamatakis, A. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30, 1312–1313 (2014).
Article CAS PubMed PubMed Central Google Scholar
Didelot, X., Croucher, N. J., Bentley, S. D., Harris, S. R. & Wilson, D. J. Bayesian inference of ancestral dates on bacterial phylogenetic trees. Nucleic Acids Res. 46, e134 (2018).
Article PubMed PubMed Central Google Scholar
Drummond, A. J. & Bouckaert, R. R. Bayesian Evolutionary Analysis with BEAST (Cambridge Univ. Press, 2015).
Salje, H. et al. Dengue diversity across spatial and temporal scales: local structure and the effect of host population size. Science 355, 1302–1306 (2017).
Article ADS CAS PubMed PubMed Central Google Scholar
Salje, H. et al. Reconstructing unseen transmission events to infer dengue dynamics from viral sequences. Nat. Commun. 12, 1810 (2021).
Article ADS CAS PubMed PubMed Central Google Scholar
Yon, G. V. & Marjoram, P. fmcmc: a friendly MCMC framework. R package version 0.5-1 https://CRAN.R-project.org/package=fmcmc (2022).
Spiegelhalter, D. J., Best, N. G., Carlin, B. P. & Van Der Linde, A. Bayesian measures of model complexity and fit. J. R. Stat. Soc. Series B Stat. Methodol. 64, 583–639 (2002).
Article MathSciNet Google Scholar
The R Developmennt Core Team. R: A Language and Environment for Statistical Computing (R Foundation for Statistical Computing, 2022).
Lefrancq, N. et al. Global spatial dynamics and vaccine-induced fitness changes of Bordetella pertussis. Sci. Transl. Med. 14, eabn3253 (2022).
Article CAS PubMed Google Scholar
Day, T. & Gandon, S. Applying population-genetic models in theoretical evolutionary epidemiology. Ecol. Lett. 10, 876–888 (2007).
Article PubMed Google Scholar
Davies, N. G. et al. Estimated transmissibility and impact of SARS-CoV-2 lineage B.1.1.7 in England. Science 372, eabg3055 (2021).
Article CAS PubMed PubMed Central Google Scholar
Goodrich, B., Gabry, J., Ali, I. & Brilleman, S. rstanarm: Bayesian applied regression modeling via Stan. R package version 2.21.1 https://mc-stan.org/rstanarm/ (2020).
Vehtari, A. et al. loo: efficient leave-one-out cross validation and WAIC for Bayesian models v.0.1.0, https://doi.org/10.32614/CRAN.package.loo (2015).
Akaike, H. in Selected Papers of Hirotugu Akaike (eds Parzen, E. et al.) 199–213 (Springer, 1998).
Belman, S. Accession Numbers. FigShare https://doi.org/10.6084/m9.figshare.24219214 (2024).

Download references

Acknowledgements

This work was supported by the Wellcome Trust (grant number WT QQ2016-2021, reference 206194, to S.B., S.L. and S.D.B.); the Bill & Melinda Gates Foundation (under Investment ID INV-003570 to S.A.M., A.v.G., M.d.P., S.D. and S.N.); the NIH (grant number R01AI160780 to H.S.); and the European Research Council (grant number 804744 to H.S. and N.L.). For the purpose of Open Access, the author has applied a CC BY public copyright licence to any Author Accepted Manuscript version arising from this submission. We would like to thank all Global Pneumococcal Sequencing project partners; and M. O’Driscoll, S. Farr, and V. Carr for code review.

Author information

These authors jointly supervised this work: Stephen D. Bentley, Henrik Salje

Authors and Affiliations

Parasites and Microbes, Wellcome Sanger Institute, Hinxton, UK
Sophie Belman, Stephanie W. Lo, Kate Mellor & Stephen D. Bentley
Department of Genetics, University of Cambridge, Cambridge, UK
Sophie Belman, Noémie Lefrancq & Henrik Salje
Global Health Resilience, Earth Sciences Department, Barcelona Supercomputing Center, Barcelona, Spain
Sophie Belman
Division of Public Health Surveillance and Response, National Institute for Communicable Diseases of the National Health Laboratory Service, Johannesburg, South Africa
Susan Nzenze
South African Medical Research Council Vaccines and Infectious Diseases Analytics Research Unit, School of Pathology, Faculty of Health Sciences, University of the Witwatersrand, Johannesburg, South Africa
Sarah Downs & Shabir A. Madhi
Department of Clinical Microbiology and Infectious Diseases, School of Pathology, Faculty of Health Sciences, University of the Witwatersrand, Johannesburg, South Africa
Mignon du Plessis & Anne von Gottberg
Centre for Respiratory Diseases and Meningitis, National Institute for Communicable Diseases of the National Health Laboratory Service, Johannesburg, South Africa
Mignon du Plessis, Kedibone Ndlangisa, Linda De Gouveia, Mushal Ali, Nicole Wolter, Cebile Lekhuleni & Anne von Gottberg
Milner Centre for Evolution, Department of Life Sciences, University of Bath, Bath, UK
Stephanie W. Lo
National Center for Immunization and Respiratory Diseases, Centers for Disease Control and Prevention, Atlanta, GA, USA
Bernard Beall, Paulina A. Hawkins & Lesley McGee
Department of Science and Technology/National Research Foundation, South African Research Chair Initiative in Vaccine Preventable Diseases, Faculty of Health Sciences, University of the Witwatersrand, Johannesburg, South Africa
Shabir A. Madhi
Division of Medical Microbiology, Department of Pathology, Faculty of Health Sciences, University of Cape Town, Cape Town, South Africa
Anne von Gottberg
Servicio Antimicrobianos, National Reference Laboratory (NRL), Instituto Nacional de Enfermedades Infecciosas (INEI)-ANLIS ‘Dr Carlos G. Malbrán’, Buenos Aires, Argentina
Alejandra Corso & Paula Gagetti
International Centre for Diarrheal Diseases Research, Dhaka, Bangladesh
Abdullah W. Brooks
Child Health Research Foundation, Dhaka, Bangladesh
Md Hasanuzzaman, Samir K. Saha & Senjuti Saha
Department for Microbiology, Virology and Immunology, Belarusian State Medical University, Minsk, Belarus
Alexander Davydov & Leonid Titov
Center of Bacteriology, Institute Adolfo Lutz, São Paulo, Brazil
Samanta Cristine Grassi Almeida
Cambodia–Oxford Medical Research Unit, Angkor Hospital for Children, Siem Reap, Cambodia
Paul Turner
Peking University People’s Hospital, Beijing, China
Chunjiang Zhao & Hui Wang
Department of Microbiology, Faculty of Medicine, The Chinese University of Hong Kong–Prince of Wales Hospital, Hong Kong, China
Margaret Ip
Department of Microbiology and Carol Yu Centre for Infection, The University of Hong Kong–Queen Mary Hospital, Hong Kong, China
Pak Leung Ho & Pierra Law
Francis I. Proctor Foundation and Department of Ophthalmology, University of California San Francisco, San Francisco, CA, USA
Jeremy D. Keenan
Clinical Research Center, Centre Hospitalier Intercommunal de Créteil, Créteil, France
Robert Cohen
Université Paris Est, IMRB-GRC GEMINI, Créteil, France
Robert Cohen
Groupe de Pathologie Infectieuse Pédiatrique de la Société Française de Pédiatrie, Nice, France
Robert Cohen
National Reference Center for Pneumococci, Data and Research Department – DRIM, Centre Hospitalier Intercommunal de Créteil, Créteil, France
Emmanuelle Varon
Department of Medical Microbiology, University of Ghana Medical School, Accra, Ghana
Eric Sampane-Donkor
Department of Clinical Microbiology, Christian Medical College, Vellore, India
Balaji Veeraraghavan
Central Research Laboratory, Kempegowda Institute of Medical Sciences, Bangalore, India
Geetha Nagaraj, K. L. Ravikumar, J. Yuvaraj & Varun Shamanna Noga
The Shraga Segal Department of Microbiology, Immunology and Genetics, Faculty of Health Sciences, Ben-Gurion University of the Negev, Beer-Sheva, Israel
Rachel Benisty & Ron Dagan
Center for Global Health Research, Kenya Medical Research Institute (KEMRI), Kisumu, Kenya
Godfrey Bigogo
Division of Bacterial Diseases, National Center for Immunization and Respiratory Diseases, Centers for Disease Control and Prevention, Atlanta, GA, USA
Jennifer Verani
Malawi–Liverpool–Wellcome Trust Clinical Research Programme, University of Malawi College of Medicine, Blantyre, Malawi
Anmol Kiran, Jennifer Cornick, Maaike Alaerts, Brenda Kwambana-Adams, Ebenezer Foster-Nyarko, Ebrima Bojang, Martin Antonio & Peggy-Estelle Tientcheu
Department of Public Health and Epidemiology, College of Medicine and Health Sciences, Khalifa University, Abu Dhabi, UAE
Dean B. Everett
Infection Research Unit, Khalifa University, Abu Dhabi, UAE
Dean B. Everett
Faculty of Applied Sciences, UCSI University, Kuala Lumpur, Malaysia
Shamala Devi Sekaran
Faculty of Medicine and Institute for Life Sciences, University of Southampton, Southampton, UK
Stuart C. Clarke
Centro de Investigação em Saúde da Manhiça, Maputo, Moçambique
Benild Moiane, Betuel Sigauque & Helio Mucavele
Oxford Vaccine Group, Department of Paediatrics, University of Oxford, NIHR Oxford Biomedical Research Centre, Oxford, UK
Andrew J. Pollard
Faculty of Medicine and Health, University of Sydney, Sydney, New South Wales, Australia
Rama Kandasamy
Institute of Environmental Science and Research Limited, Kenepuru Science Centre, Porirua, New Zealand
Philip E. Carter
Division of Pediatric Infectious Diseases, University of Alabama at Birmingham (UAB), Birmingham, AL, USA
Stephen K. Obaro
International Foundation against Infectious Diseases in Nigeria (IFAIN), Abuja, Nigeria
Stephen K. Obaro
Wesfarmers Centre of Vaccines and Infectious Diseases, Telethon Kids Institute, The University of Western Australia, Nedlands, Western Australia, Australia
Deborah Lehmann
Infection and Immunity Unit, Papua New Guinea Institute of Medical Research, Goroka, Papua New Guinea
Rebecca Ford
Instituto de Medicina Tropical Alexander von Humboldt, Universidad Peruana Cayetano Heredia, Lima, Peru
Theresa J. Ochoa
National Medicines Institute, Warsaw, Poland
Anna Skoczynska, Ewa Sadowy, Waleria Hryniewicz & Weronika Puzia
Hamad Medical Corporation, Doha, Qatar
Sanjay Doiphode
G. N. Gabrichevsky Research Institute for Epidemiology and Microbiology, Moscow, Russia
Ekaterina Egorova, Elena Voropaeva & Yulia Urban
Department for Public Health Microbiology, National Laboratory of Health, Environment and Food, Ljubljana, Slovenia
Tamara Kastrin
Department of RDI Microbiology, Institut de Recerca Sant Joan de Deu, Hospital Sant Joan de Deu, Universitat Internacional de Catalunya and CIBER Epidemiology and Public Health (CIBERESP), Barcelona, Spain
Carmen Muñoz Almagro, Alba Redin Alonso & Desiree Henares
Faculty of Pharmacy, Siam University, Bangkok, Thailand
Somporn Srifuengfung
Centre for Epidemic Preparedness and Response, London School of Hygiene and Tropical Medicine, London, UK
Martin Antonio
Department of Infection Biology, Faculty of Infectious and Tropical Diseases, London School of Hygiene and Tropical Medicine, London, UK
Martin Antonio
Medical Research Council Unit The Gambia at London School of Hygiene and Tropical Medicine, Fajara, Banjul, The Gambia
Martin Antonio
Medical Affairs, Vaccines and Antivirals Pfizer, Paris, France
Jennifer Moïsi
Department of Para-Clinical Sciences, The University of the West Indies St Augustine Campus, St Augustine, Trinidad and Tobago
Michele Nurse-Lucas & Patrick E. Akpaka
Hacettepe University Faculty of Medicine, Department of Medical Microbiology, Ankara, Türkiye
Özgen Köseoglu Eser
Department of Infectious Disease Epidemiology, Faculty of Epidemiology and Population Health, London School of Hygiene and Tropical Medicine, London, UK
Anthony Scott
Centre for Genomic Pathogen Surveillance, Big Data Institute, Nuffield Department of Medicine, University of Oxford, Oxford, UK
David Aanensen
School of Public Health, Imperial College London, London, UK
Nicholas Croucher
European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, UK
John A. Lees
Department of Biostatistics, Faculty of Medicine, University of Oslo, Oslo, Norway
Rebecca A. Gladstone & Gerry Tonkin-Hill
Department of Epidemiology of Microbial Diseases, Yale School of Public Health, New Haven, CT, USA
Chrispin Chaguza
Institute of Microbiology and Infection, College of Medical and Dental Sciences, University of Birmingham, Birmingham, UK
David Cleary
Pneumonia and Pandemic Preparedness, Global Health Division, Bill & Melinda Gates Foundation, Seattle, WA, USA
Keith P. Klugman & Gail Rodgers
Department of Pediatrics, Division of Pediatric Infectious Diseases, School of Medicine, University of Utah, Salt Lake City, UT, USA
Anne J. Blaschke & Nicole L. Pershing

Authors

Sophie Belman
View author publications
You can also search for this author in PubMed Google Scholar
Noémie Lefrancq
View author publications
You can also search for this author in PubMed Google Scholar
Susan Nzenze
View author publications
You can also search for this author in PubMed Google Scholar
Sarah Downs
View author publications
You can also search for this author in PubMed Google Scholar
Mignon du Plessis
View author publications
You can also search for this author in PubMed Google Scholar
Stephanie W. Lo
View author publications
You can also search for this author in PubMed Google Scholar
Lesley McGee
View author publications
You can also search for this author in PubMed Google Scholar
Shabir A. Madhi
View author publications
You can also search for this author in PubMed Google Scholar
Anne von Gottberg
View author publications
You can also search for this author in PubMed Google Scholar
Stephen D. Bentley
View author publications
You can also search for this author in PubMed Google Scholar
Henrik Salje
View author publications
You can also search for this author in PubMed Google Scholar

Consortia

The Global Pneumococcal Sequencing Consortium

Alejandra Corso
, Paula Gagetti
, Abdullah W. Brooks
, Md Hasanuzzaman
, Samir K. Saha
, Senjuti Saha
, Alexander Davydov
, Leonid Titov
, Samanta Cristine Grassi Almeida
, Paul Turner
, Chunjiang Zhao
, Hui Wang
, Margaret Ip
, Pak Leung Ho
, Pierra Law
, Jeremy D. Keenan
, Robert Cohen
, Emmanuelle Varon
, Eric Sampane-Donkor
, Balaji Veeraraghavan
, Geetha Nagaraj
, K. L. Ravikumar
, J. Yuvaraj
, Varun Shamanna Noga
, Rachel Benisty
, Ron Dagan
, Godfrey Bigogo
, Jennifer Verani
, Anmol Kiran
, Dean B. Everett
, Jennifer Cornick
, Maaike Alaerts
, Shamala Devi Sekaran
, Stuart C. Clarke
, Benild Moiane
, Betuel Sigauque
, Helio Mucavele
, Andrew J. Pollard
, Rama Kandasamy
, Philip E. Carter
, Stephen K. Obaro
, Deborah Lehmann
, Rebecca Ford
, Theresa J. Ochoa
, Anna Skoczynska
, Ewa Sadowy
, Waleria Hryniewicz
, Weronika Puzia
, Sanjay Doiphode
, Ekaterina Egorova
, Elena Voropaeva
, Yulia Urban
, Tamara Kastrin
, Kedibone Ndlangisa
, Linda De Gouveia
, Mushal Ali
, Nicole Wolter
, Cebile Lekhuleni
, Carmen Muñoz Almagro
, Alba Redin Alonso
, Desiree Henares
, Somporn Srifuengfung
, Brenda Kwambana-Adams
, Ebenezer Foster-Nyarko
, Ebrima Bojang
, Martin Antonio
, Peggy-Estelle Tientcheu
, Jennifer Moïsi
, Michele Nurse-Lucas
, Patrick E. Akpaka
, Özgen Köseoglu Eser
, Anthony Scott
, David Aanensen
, Nicholas Croucher
, John A. Lees
, Rebecca A. Gladstone
, Gerry Tonkin-Hill
, Chrispin Chaguza
, David Cleary
, Kate Mellor
, Bernard Beall
, Keith P. Klugman
, Gail Rodgers
, Paulina A. Hawkins
, Anne J. Blaschke
& Nicole L. Pershing

Contributions

S.B., H.S. and S.D.B. conceived the study. S.N., S.D., A.v.G., S.A.M., L.M., M.d.P. and S.D.B. collected and contributed the genomes from South Africa, and the GPS Consortium contributed other global genomes. S.B. collated the genomes and ran bioinformatics pipelines. N.L. developed the fitness framework. S.B., N.L., H.S. and S.D.B. performed the analyses. S.B., A.v.G., M.d.P., S.D., S.A.M., S.W.L., S.D.B. and H.S. discussed results and helped to contextualize them within a South African context. S.B., H.S. and S.D.B. wrote the first draft of the manuscript. All authors reviewed the first draft of the manuscript with special attention by S.B., N.L., S.N., S.D., M.d.P., S.W.L., L.M., S.A.M., A.v.G., S.D.B., H.S., A.P., C.M.A., K.K., J.L., N.C., D.L., R.K., R.D., S.M., B.V. S.B., N.L., H.S. and S.D.B., discussed the results and contributed to manuscript revisions. Funding was acquired by S.D.B., H.S. and S.B.

Corresponding author

Correspondence to Sophie Belman.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature thanks Taj Azarian, C. Buddy Creech, Nick Ruktanonchai and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 Time Resolved Trees for Dominant GPSCs.

Trees are recombination masked, aligned to a reference for each GPSC. Time resolution was performed using BactDating. The dates are along the x-axis. (a) GPSC79, N = 102 (b) GPSC68, N = 97 (c) GPSC17, N = 531 (d) GPSC14, N = 521 (e) GPSC13, N = 611 (f) GPSC10, N = 718 (g) GPSC5, N = 841 (h) GPSC2, N = 1430 (i) GPSC1, N = 1943.

Extended Data Fig. 2 Risk ratio framework sensitivity analyses to determine geographic structure when sub-sampling and only including disease isolates.

Risk ratio of being the same GPSC within a province (blue), between different provinces over increasing distance (red), compared to geographically distant pairs (>1000 km) (reference) (top) and the risk ratio of having a tMRCA 0–5, 5–10, 10–20, or 20–200 years ago within South African provinces (blue), across larger distances within South Africa (red), from South Africa to other countries in Africa (green), and from South Africa to countries outside of Africa (purple) (South Africa; N = 6910). (a) Including all isolates from each province (left) compared with sub-sampling to 300 with replacement to compensate for biased sampling in each province. (b) Including only isolates sampled from patients with pneumococcal disease. All plots use a reference of pairs which are from distant provinces in South Africa (open triangle). Error bars represent 2.5 to 97.5 CIs.

Extended Data Fig. 3 Relative risk of a pneumococcal strain being in each of municipality after 1 year of transmission.

Sequential transmission chains starting in municipalities with (a) <50 people/km2, (b) 50–500 people/km2, or (c) >500 people/km2 across 100,000 samples.

Extended Data Fig. 4 Estimated mobility and proportion of infections.

(a) a gravity model with two parameters adjusting the destination population size (beta) and the distance between locations (gamma) (pink), (b) a distance model whereby the probability of mobility is a function of the distance between locations adjusted with parameter gamma, (c) is the Meta mobility data between locations with a parameter adjusting the probability of staying in the home location (main model) across distances. (d-f) are the estimated proportion of infections by each of these models compared to the population size.

Extended Data Fig. 5 Replicated simulations for model performance testing.

Testing model performance using replicated simulations. (a) Simulated the total epidemic (black dots), biased down-sampled data as per true proportion per province (red dots) (“Biased Down-sample” in c and d), model fit to down-sampled data (red line), removing the sampling probability the model recapturing the true epidemic (black dot) (b) Population size (x-axis) compared to the proportion of infections from the down-sampled data (black) compared to the truth from the overall simulated epidemic (purple). (c) The probability of being in the home municipality and (d) the recaptured parameter after inputting a parameter of −2 to adjust the diagonal of the mobility matrix after one transmission generation. Both c and d include values from left to right for sampling as per the true data proportions in each province (6.5% of total infections), down-sampling to fit on only 2 of the 9 provinces, and if our generation time estimate is 50% smaller than the truth, exactly right, or 50% larger. Error bars represent 2.5 to 97.5 percentiles.

Extended Data Fig. 6 Comparison of fitness model results with full data or disease only data.

(a-b) Results with the full data. (c-d) Results with the disease-only data. a and c present the model fits for the proportion of serotypes from non-vaccine type (NVT), PCV7 types, and additional PCV13 types not included in PCV7 from the years 2000 to 2014 in this study. Points represent data and line represent the model fit. b and d present the relative fitness estimates for all three groups of serotypes in each era. Pre-vaccine era is prior to 2009 for NVTs, prior to 2009 for PCV7 and prior to 2011 for PCV13. Post-vaccine era is post-2009 for NVTs, post-2009 for PCV7 and post-2011 for PCV13. Error bars represent 2.5 and 97.5 percentiles.

Extended Data Fig. 7 Fitness growth model testing year of switch and schematic.

(a) Testing year of fitness switch for the logistic growth -fitness model. Adjusting the year of the fitness switch in the model fitting to vaccine status. The difference to the best WAIC (2009 [PCV7 implementation] & 2011 [PCV13 implementation]) is on the y-axis where the year of fitness switch relative to 2009 & 2011 is on the x-axis. Further we test no fitness switch (ns; yellow) and the impact of including one fitness switch in 2009 (purple). The dark gray box highlights equivalent models (ΔWAIC ≤2) and light gray box highlights similar models (ΔWAIC ≤ 7). (b) Schematic denoting the fitness growth model parameterisation which accounts for the specific timing of the PCV impacting each group of serotypes (NVT in blue; PCV7 in green; PCV13 in red).

Extended Data Fig. 8 Serotype fitness estimates.

Fitness estimates pre- and post-PCV (y-axis) for each serotype (grey), superimposed by group including NVT (blue), PCV7 (green), and PCV13 (red). Pre-vaccine and post-vaccine refer to pre- and post- 2009 and 2011 for PCV7 and PCV13 respectively. Individual serotype fitness estimates can be found in Fig. S9.

Extended Data Fig. 9 Antimicrobial resistance summary.

The proportional trends in antimicrobial resistance overall (black) and within Vaccine type [VT] (in blue) and Non-Vaccine Type [NVT] (in red) serotypes for in-silico predicted (a) penicillin (b) erythromycin (c) co-trimoxazole and (d) clindamycin.

Extended Data Fig. 10 Data fits for model accounting for proportions and fitness over time in four groups.

(a) NVT-penicillin resistant (red), (b) NVT-penicillin susceptible (green), (c) VT-penicillin resistant (yellow) and (d) VT-penicillin susceptible (blue). The dashed lines indicate the year of PCV7 implementation and fitness switch model implemented. (e) Fitness estimates pre-PCV and post-PCV for each group and colored accordingly. e is on a log scale. This model uses a shift in fitness in 2009. Error bars represent 2.5 to 97.5 percentiles.

Extended Data Fig. 11 Data descriptions across the 234 municipalities of South Africa.

(a) Population density as estimated given the area of each municipality and the populations estimated by LandScan and (b) the radius of gyration for each municipality given the distance between each municipality at the centroid weighted by the human mobility data from Meta Data for Good.

Extended Data Fig. 12 Parameter adjustment sensitivity analysis.

(a) Fitting the mobility model to 15 years of evolutionary distance. Fitting the probabilistic mobility model to pairs of genomes which are 15 years divergent from their MRCA (black), compared against the 10 years used in the main model (red), and the data (blue). (b) Assessing the mean geographic distance per evolutionary time in years for the data (blue), including the sampling probability (red) for, (left) generation time of 15 days, (middle) generation time of 35 days, and (right) generation time of 55 days.

Supplementary information

Supplementary Information

This file contains Supplementary Figs. 1–20 and Supplementary Tables 1–13.

Reporting Summary

Peer Review File

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Belman, S., Lefrancq, N., Nzenze, S. et al. Geographical migration and fitness dynamics of Streptococcus pneumoniae. Nature (2024). https://doi.org/10.1038/s41586-024-07626-3

Download citation

Received: 31 May 2023
Accepted: 30 May 2024
Published: 03 July 2024
DOI: https://doi.org/10.1038/s41586-024-07626-3

Comments

By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.

Subjects

Abstract

Similar content being viewed by others

Main

Quantifying spatial structure

Inferring migration using human mobility

Vaccine-induced fitness changes

Limitations

Conclusion

Methods

Data sources and processing

Pneumococcal sequence data and metadata

Population data

Mobility data

Generation time distribution

Sample culture and genome sequencing

AMR

Constructing time-resolved phylogenetic trees

RR framework

Probabilistic mobility model

Overall strategy

Notation

Model fitting

Single transmission generation

Human mobility characterization

Probability of the pneumococcus being in each location after G transmission generations

Probability of observing a pair of cases in two specific locations

Probability of G generations between MRCA and a sequenced isolate

Location of the MRCA (P(L m))

Calculation of likelihood

Likelihood equation

Hamiltonian MCMC

Model performance

Model sensitivity analyses

Probability guided transmission simulations

Person-to-person transmission chains

Branching epidemic

Gravity model

Population-level fitness model

Overall strategy

Vaccine-type model

Exploration of vaccine effect

Estimation of individual serotype fitness

Exploration of predicted GPSC dynamics based on their serotype composition

AMR

Estimating effect of fitness on migration

Fitness model carrying capacity sensitivity

Ethics

Reporting summary

Data availability

Code availability

References

Acknowledgements

Author information

Authors and Affiliations

Consortia

The Global Pneumococcal Sequencing Consortium

Contributions

Corresponding author

Ethics declarations

Competing interests

Peer review

Peer review information

Additional information

Extended data figures and tables

Supplementary information

Rights and permissions

About this article

Cite this article

Share this article

Comments

Search

Quick links

Location of the MRCA (P(L _m))