Report by the COVID-19 Genomics UK (COG-UK) Consortium
Report 9: 25th June 2020
Please Note: This report is provided at the request of SAGE and includes information on the ongoing state of the research being carried out. It should not be considered formal or informal advice. The conclusions of the ongoing scientific studies may be subject to change as further evidence becomes available and as such any firm conclusions would be premature.
- COG-UK has sequenced more than 29K SARS-CoV-2 genomes during the past 15 weeks and the UK remains by far the single biggest producer of genome data having contributed ~53% of the global total.
- Updated analyses using multiple analytical approaches and the expanded COG-UK dataset are suggestive of an increase in epidemic growth rates of the D614G variant in the SARS-CoV-2 spike protein; however, the magnitude of the effect cannot be measured with confidence at the present time. Importantly, there is currently no evidence to indicate a difference in disease severity between the 614D and 614G genotypes.
- As part of investigations into the higher rates of COVID-19 in North Wales, a system developed to support C. difficile genomic surveillance and typing has been adapted for SARS-CoV-2 surveillance. The system integrates genomic and patient-level data to support the identification of hospital-associated clusters. Including community and hospital-associated cases also enables the identification of potential spread from hospitals into the community. Collectively the system enables staff to mount responses in real time, and to review data retrospectively to inform future practice.
Across 17 sequencing sites, the total number of SARS-CoV-2 genomes sequenced by COG-UK now stands at 29,593 (Figure 1a) which constitutes ~53% of the global total number of SARS-CoV-2 genomes sequenced (Figure 1b).
Data from COG-UK and other high income countries account for 92.9% of the ~56K SARS-CoV-2 genomes sequenced globally, with upper middle (4.4%), lower middle (2.4%) and low income (0.3%) countries accounting for the remainder. As such, even the extensive COG-UK dataset likely captures only a fraction of the genome variation in SARS-CoV-2 lineages worldwide. Increased genomic surveillance in middle and low income countries would enable insights from the UK epidemic to be placed in the context of the wider global pandemic.
While the availability of genome data is a significant limitation on the types of analysis that are feasible for nearly all other countries, the scale of the COG-UK dataset is now enabling key questions to be asked of the UK outbreak. For example, while preliminary analyses were not sufficiently powered to assess the impact of a mutation in the SARS-CoV-2 spike protein on viral transmission, the COG-UK dataset is now sufficient to reveal a measurable effect (See “Updated analysis of SARS-CoV-2 spike protein variant D614G in the UK: evaluating evidence for effects on transmission and pathogenicity”, below).
Highlighted findings with public health implications
- Use of the expanded COG-UK dataset suggests that SARS-CoV-2 lineages bearing either the spike protein 614D or 614G genotype may have slightly different epidemic growth rates. The magnitude of this effect cannot be determined with confidence at the present time; however, there is no evidence to support an association with disease severity.
Updated analysis of SARS-CoV-2 spike protein variant D614G in the UK: evaluating evidence for effects on transmission and pathogenicity
Erik Volz (Imperial College London), Tom Connor (Cardiff University), Andrew Rambaut (University of
One of the aims of COG-UK is to undertake surveillance and determine whether new mutations observed in the genomic dataset are relevant to any detectable changes in the behaviour of SARS-CoV-2. We have been monitoring the prevalence of a number of variants across the SARS-CoV-2 genome since early April.
One such variant is an amino acid change from aspartate (D) to glycine (G) at position 614 of the SARS-CoV-2 spike protein, which attaches to the ACE2 receptor on host cells and is critical for viral entry into the host cell. The frequency of SARS-CoV-2 genomes carrying the D614G variant has increased rapidly in both the UK and global datasets since its first observation in late February. D614G has been the focus of a growing number of non-peer reviewed reports since it was predicted that this change may impact the transmissibility of the virus. More recent laboratory work has suggested that D614G may affect binding to the ACE2 receptor by introducing structural instability into the spike, and that this could increase host cell entry. Sequence-based studies using global sequence data have also been proposed to be consistent with the possibility of increased transmissibility of lineages bearing the D614G variant.
However, the ability to draw conclusions from in vitro laboratory work about the impact of the D614G variant on transmission between patients and across a population is limited. Furthermore, geographic and sampling-approach biases inherent to global sequence databases risk any conclusions drawn being unreliable and affected by false signals. In addition, the ability to determine whether a genomic variant impacts upon transmissibility can be impacted by founder effects. Multiple introductions of a particular variant into a population can create a signal that looks like increased transmission potential but for which no difference in viral fitness exists. The COG-UK dataset is only now just beginning to reach sufficient scale to allow analyses to begin to resolve some of these issues.
A preliminary analysis of the D614G variant was included in COG-UK report #6. This update describes the use of several complementary analytical approaches to assess whether there is a detectable effect of the D614G variant on SARS-CoV-2 lineage growth/transmissibility. It should be noted that this work is ongoing, and some estimates presented in this report are subject to change. A final version of this analysis is expected to be complete in the coming weeks. Once that analysis is complete the results will be shared onwards to SAGE and groups including SPI-M, as well as published in the scientific literature.
Part 1 – Phylodynamic analysis of D614G in major UK SARS-CoV-2 lineages
For all UK SARS-CoV-2 lineages with more than 50 sampled genomes (as of 19th June 2020), rooted and dated phylogenetic trees were constructed. Epidemic growth rates for each lineage were estimated and statistically analysed. For detailed descriptions of the methodology, please refer to Appendix 1.
Of the 63 UK lineages analysed, spike proteins in 43 lineages had the 614G variant genotype and 20 lineages had the ancestral 614D genotype.
Initial epidemic growth rates are highly variable and Spike 614 genotype explains little variance in growth rates on its own (Figure 2). Overall, 614G lineages were introduced later, grew faster and peaked slightly later than 614D lineages.
The median importation date for 614D lineages was 18th February 2020 whereas for 614G lineages it was slightly later on 2nd March 2020.
All lineages show decreasing epidemic growth from March onward, consistent with the impact of lockdown.
The median initial epidemic growth rate for 614D lineages was 143 / year (IQR 110-197), which corresponds to R0 = 3.55 (IQR 2.95-4.5) assuming a 6.5 day serial interval.
By contrast, the median growth of 614G lineages was 175 / year (IQR 143 – 248), which corresponds to R0 = 4.12 (IQR 3.55-5.41).
Note that the estimates of R0 for these lineages are high in comparison to calculations based on all lineages; this is a consequence of the use of only lineages with >50 sequenced genomes in this analysis and means that this analysis focuses on only the most successful lineages.
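The report does not state how growth rates were converted to R0. The median values quoted above are, however, consistent with the simple linear approximation R0 = 1 + rT, where r is the growth rate per day and T is the 6.5 day serial interval. The sketch below assumes that relationship; the function name and the 365-day year are illustrative choices, not taken from the report.

```python
DAYS_PER_YEAR = 365.0  # assumption: growth rates per year are converted using a 365-day year


def r0_from_growth_rate(growth_per_year, serial_interval_days=6.5):
    """Linear approximation R0 = 1 + r * T, with r expressed per day.

    This formula is an assumption inferred from the reported numbers,
    not a method stated in the report.
    """
    r_per_day = growth_per_year / DAYS_PER_YEAR
    return 1.0 + r_per_day * serial_interval_days


# Median initial growth rates reported for each genotype (per year).
for label, rate in [("614D", 143.0), ("614G", 175.0)]:
    print(label, round(r0_from_growth_rate(rate), 2))
```

Under this assumption the medians reproduce the reported R0 values of 3.55 and 4.12; the IQR endpoints agree to within rounding.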
Statistical analyses estimate that 614G lineages grow at a rate 1.22 (range 0.96-1.39) times faster than 614D lineages (p = 0.085).
While no association between epidemic growth rate and time of first sample was observed, the time difference between the estimated origin of a lineage and the time of first sample was strongly associated with estimated epidemic growth rate (i.e. the longer the time elapsed between when a lineage emerged and when it was first detected, the lower the epidemic growth rate).
After accounting for a particular hospital-associated outbreak in Wales (described in COG-UK Report #2), there was little evidence to suggest that there were country-specific differences in epidemic growth rates of either 614D or 614G lineages.
These results are sensitive to the validity of underlying assumptions in methods used to estimate time-scaled phylogenies and effective population size. In order to accommodate large sample sizes, phylodynamic inference was carried out in two steps, first estimating a set of time-scaled phylogenies followed by phylodynamic inference with a set of fixed trees.
Assumptions underlying these methods with the potential to be violated include: 1) changes in epidemic growth rate occur as an autoregressive (AR(1)) process and epidemic growth rates tend to remain constant in the absence of strong phylodynamic signal; 2) evolution is modelled with a strict (time-invariant) molecular clock, whereas the rate of evolution may actually vary over time and between lineages. There may also be errors in our identification of lineages as independent introductions into the UK, since the precise location of lineages close to the root of each phylogeny can be difficult to ascertain.
See Appendix 1 for the full analysis.
Part 2 – Preliminary SEIR modelling of D614G in SARS-CoV-2 lineages
A SEIR (Susceptible, Exposed, Infectious, Recovered) model based on 200 SARS-CoV-2 genome sequences sampled within Greater London and 50 sequences from the wider UK was used to estimate infections through time for the 614D and 614G variants, and these estimates were compared with the actual sampling density for each type. For detailed descriptions of the methodology, please refer to Appendix 2.
The SEIR model predicted that 614D lineages outnumber 614G lineages until the second half of March, when the latter became more prevalent (Figure 3a).
This is also reflected in the sampling pattern of 1000 genome sequences from Greater London, which shows a similar shift in mid-March to 614G being more prevalent (Figure 3b).
The relative R0 of 614G lineages predicted by the coalescent SEIR model was estimated to be 1.1 (95% CI: 0.85-1.41) times that of the ancestral 614D lineages.
Resampling from the posterior following incorporation of information on when 614G and D lineages were sampled enabled revision of the estimate of the relative R0 of 614G to be 1.26 (95% CI: 0.99-1.58) times that of the ancestral 614D lineages.
This model-based phylodynamic analysis used only 200 sequences from London due to computational requirements of this approach. It may therefore be under-powered to detect a significant difference between D614G variants. This analysis may suffer from model-misspecification bias, such as if the SEIR dynamics are an inadequate description of the course of the epidemic up to April in London. Furthermore, migration is modelled as a continuous rate per lineage, but this is a simplification as human mobility varied a great deal over the period in question. Post-stratification of the posterior to account for sampling frequencies is based on the approximation that each sample is random and representative of the total set of S614 G and D lineages.
See Appendix 2 for the full analysis.
Part 3 – Preliminary analysis of the association between D614G and disease severity
A statistical analysis based on paired genotype and clinical variables from 1879 patients sampled between February 5 and March 17 2020 analysed the link between D614G and disease severity. For detailed descriptions of the methodology, please refer to Appendix 3.
Using survival for 28 days after diagnosis as the primary outcome, there were slightly reduced odds of death (OR: 0.77, CI: 0.61-0.97) for patients infected with 614G-carrying virus, but this result lacked statistical significance after controlling for sex (OR: 0.84, CI: 0.63-1.13) and age (OR: 1.4, CI: 1.04-1.88).
Complex sampling patterns may have introduced confounding effects into the analysis. For example, later samples have a higher likelihood of death; however, this may reflect bias in sampling of more severe cases in the genomic dataset.
As reported in a preliminary analysis of D614G (See COG-UK report #6), a significant association was observed between the age of patients and the probability of the virus carrying D614G, with decreasing odds of carrying the G variant as age increases. However, the basis for this association remains unclear and may also result from sampling bias.
This analysis of severity was based on a limited subset of genetic sequences which can be matched to clinical outcome and not a representative sample from all infected individuals. Sequences are preferentially sampled from hospitalized cases and are skewed towards more severe cases relative to the general population.
See Appendix 3 for the full analysis.
In COG-UK report #6 it was reported that preliminary analyses (based on a substantially smaller dataset) found no significant association between the D614G variant and changes in SARS-CoV-2 transmission potential.
From these additional analyses on a larger updated dataset, there is an indication that mutation of spike protein residue 614 may have an impact on the epidemic growth rate of SARS-CoV-2, with lineages bearing the 614G genotype growing marginally faster in the UK than lineages bearing the ancestral 614D genotype.
However, despite the updated dataset containing nearly 30K SARS-CoV-2 genomes, it is only just sufficient to discern that a difference in epidemic growth rate may be present and the magnitude of this effect cannot be determined with any confidence at this time.
There currently remains no statistically significant evidence for an association between either the 614D or 614G genotypes and an effect on disease severity.
Further investigation will be needed to determine the relevance (if any) to the progression of the SARS-CoV-2 pandemic in the UK and elsewhere.
Proposed next steps
While a difference in epidemic growth rate between the 614D and 614G genotypes is suggested, there is no evidence to support an association with disease severity at the present time. As such, continued monitoring and analysis of D614G is recommended, in particular in relation to SARS-CoV-2 transmission in the community setting.
Ongoing surveillance efforts by COG-UK will continue to monitor this and other variants occurring above a certain frequency, and investigate any that have been suggested to be linked with the changes in transmission or disease severity.
Integration of genomics and clinical metadata in real time to support outbreak management and response
Dr Noel Craine (Public Health Wales Microbiology), Dr Helen Adams (Betsi Cadwaladr University Health Board), Dr Matt Bull, Dr Nicole Pacchiarini, Dr Tom Connor (Public Health Wales Pathogen Genomics Unit)
As part of ongoing investigations into the higher rates of COVID-19 in North Wales, the Welsh COG-UK centre has been working to deploy enhanced tools to support Infection Prevention and Control in hospitals of North Wales. This work builds on an existing system developed to support the C. difficile genomic surveillance and typing service, which has been piloted in North Wales. This report outlines how contextualising hospital and community cases using genomics, combined with rich patient metadata, can be used to identify hospital-associated clusters and to assess whether community cases are potential hospital spill-overs. The system enables staff to mount responses in real time and to identify potential learning points for healthcare staff that will provide benefits in the future.
From the week commencing 26th April 2020 (CDC Epi Week 18), an increased incidence of COVID-19 was noted across four hospital wards in a hospital in North Wales, prompting outbreak investigations to be conducted locally. Alongside genomic data from the hospitals which has been generated since the start of the pandemic in Wales, genomic data from a community testing unit covering the local area were also beginning to come online at the same time. Combining these two datasets, we performed a genomic analysis to identify the UK Phylotypes of cases within the hospital and local community, while these genomic results were contextualised with other patient metadata by staff working with/within the hospital itself. This has enabled an initial assessment of the extent to which cases from the hospital are seeding the local community, as well as enabling an examination of cases within the hospital, to understand if they represented multiple outbreaks. SARS-CoV-2 cases in North Wales are distributed across
Focusing on cases from the hospital from Epi Week 18 onwards revealed that all cases from the four wards fell within a single UK Phylotype, whereas contemporaneous samples from the local community testing centre were distributed across the SARS-CoV-2 phylogenetic tree (Figure 5). The initial signature that had prompted this analysis was an uptick in cases in North Wales. However, in order to utilise the genomics data in real time, UK Phylotype assignments from COG-UK are now integrated with the within-health board tracking system that has been developed for use with C. difficile. When this information began to flow, it became clear that the identified phylogroup included a larger number of hospital-associated cases from North Wales (Figure 6). While cases from four of the wards that formed part of this cluster (alpha, beta, gamma, delta) were already independently under investigation as part of a hospital-associated outbreak, the genomics was able to propose a hypothesis (that these cases were linked) and also identified an additional set of wards (epsilon, zeta, eta, theta) that fell within the same UK Phylogroup, along with cases from other hospitals in North Wales.
Working from the genomics data and examining the data within the hospital revealed that 14/27 samples from Hospital 1 were within 0 SNPs of one another, forming a cluster with 7 identical community cases, and representing samples beyond the original wards that were the focus of the initial outbreak investigation(s). This provided a basis for investigation and follow up within the local hospital using a system that integrates genomic clustering results and patient data in real time (Figure 7). By integrating genomics with patient data relating to stay timelines and hospital movement, it was possible to provide the required context to identify the presence of within-hospital transmission in Hospital 1, associated with phylotype UK5_1.58, and also to clearly distinguish clusters with a hospital-transmission component from community clusters of disease (Figure 8).
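As an illustration of what grouping samples "within 0 SNPs" of one another involves, the following minimal sketch counts pairwise differences between aligned sequences and joins samples that fall within a threshold. It is not the system used in North Wales: the function names, the single-linkage rule and the toy sequences are all hypothetical.

```python
from itertools import combinations


def snp_distance(a, b):
    """Number of positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


def cluster_by_threshold(seqs, max_snps=0):
    """Single-linkage clusters: two samples are joined if within max_snps."""
    parent = {name: name for name in seqs}

    def find(x):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(seqs, 2):
        if snp_distance(seqs[a], seqs[b]) <= max_snps:
            parent[find(a)] = find(b)

    clusters = {}
    for name in seqs:
        clusters.setdefault(find(name), []).append(name)
    # Largest clusters first, members sorted by name for stable output.
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: (-len(c), c))


# Hypothetical toy alignment; real genomes are roughly 30,000 bases long.
toy = {
    "hospital_1": "ACGTACGT",
    "hospital_2": "ACGTACGT",
    "community_1": "ACGTACGT",
    "community_2": "ACGTTCGA",
}
clusters = cluster_by_threshold(toy, max_snps=0)
print(clusters)
```

In practice, cluster membership like this only becomes interpretable once it is joined to ward, stay-timeline and movement data, as described above.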
By being able to join the hospital data to genomics data derived from community testing samples, it was also possible to identify the fact that virtually all hospital-associated cases fell within one phylogroup, while the community cases fell across multiple phylogroups, with limited evidence of hospital associated infections being linked to these wider community clusters. Through the examination of the patient timelines and records, epidemiological links between patients (in the form of staff and patient timelines) could be examined and investigated in Hospital 1, with the genomic information providing a key tool for the grouping of cases to inform and enhance these outbreak investigations. In this case genomics provided information that enabled the identification/support for a set of linked cases that would not have been as apparent with epidemiological data
The integration of genomics data with other patient information provides a powerful tool for tracking and responding to hospital outbreaks, and demonstrates how genomics data can support the identification of complex outbreaks across multiple wards in a hospital. The availability of contemporaneous community testing samples was key in contextualising the hospital-related cases, and the system being used in North Wales provides a platform for the rapid examination of community and hospital associated clusters of infections. This benefits from the high sequencing rate of tested samples in Wales (approaching 40% in some areas).
Systems that integrate genomic summary information such as UK Phylotype with patient timelines and other information can provide powerful tools for infection prevention and control in hospitals, both in terms of outbreak response in real time, and in terms of looking back to understand the progression of outbreaks and to improve practice going forwards. These same systems can also act as powerful public health tools for understanding outbreaks in communities and on a larger scale.
Phylodynamic analysis of spike mutation 614G
This analysis considers 63 UK lineages with more than 50 sequences as of June 19, 2020. Among these, 20 lineages have the ancestral Spike 614D genotype.
- All UK lineages show evidence for decreasing epidemic growth from March through April 2020 (Figure A). There is high variance and wide confidence intervals for the initial growth rate at the time each lineage originated. The lineages are classified geographically with the criterion that >50% of samples come from England, Scotland or Wales. One lineage could not be classified as it did not have majority representation in any country. Spike 614 Genotype alone does not explain much variance in epidemic growth rates.
- D lineages were introduced earlier on average than G lineages. The median importation date of D lineages is February 18. The median importation date of G lineages is March 2.
- Figure A1 shows the median posterior epidemic growth rate over time and the median among all Spike 614 G and D lineages.
- The median initial epidemic growth rate among Spike 614 D lineages is 143 / year (IQR 110-197). With a 6.5 day serial interval this corresponds to R0 = 3.55 (IQR 2.95-4.5).
- The median epidemic growth rate among Spike 614G lineages is 175 / year (IQR 143 – 248). With a 6.5 day serial interval this corresponds to R0 = 4.12 (IQR 3.55-5.41).
- Figure A2 shows epidemic growth rates and effective population size for the largest Spike 614 G and D lineage. UK5 (G) is introduced later, grows slightly slower, and peaks slightly later.
- Figure A3 shows the initial epidemic growth rate for all lineages.
- In unweighted univariate analyses the difference in epidemic growth rate is significant (Mann-Whitney U test, p = 0.04373).
- In unweighted linear regression, Spike 614G lineages grow at a rate 1.25 times the rate of D lineages (p=0.056)
- Each epidemic growth rate is estimated with variable levels of precision and all epidemic growth rate estimates have very wide CIs for the initial time point. We therefore conduct a weighted linear regression of the initial epidemic growth rate on genotype. Weights are inversely proportional to the variance of each estimated epidemic growth rate.
- In weighted univariate analyses, Spike 614 G lineages grow at a rate 1.22 (0.96-1.39) times the rate of D lineages (p = 0.085).
- This association is potentially confounded by the different time of origin of these lineages.
- Because G lineages were on the whole introduced later than D lineages, and epidemic growth rates are high initially before declining, growth in G will tend to be greater at any given point in time.
- We find that the time difference between the estimated time of origin and the earliest sample time is strongly associated with estimated epidemic growth rates.
- This is expected since lineages which grow quickly will have greater hazard of having at least one sampled descendant.
- It can also reflect error in phylogeny estimation. If lineage TMRCAs are underestimated, it would lead to downward bias on estimated epidemic growth rates.
- In weighted univariate regressions, time from origin to first sample explains 30% of variance in epidemic growth rates (p < 10^-4).
- Results are little changed in multivariate regressions adjusting for time of first sample (p=0.0745)
- No association is observed between epidemic growth rates and time of first sample (p = 0.74).
- Country within the UK explains little variance in epidemic growth rates on its own (<10%). No country shows evidence for a difference in growth.
- We analysed all UK lineages with more than 50 sampled sequences as of June 19.
- Rooted and dated phylogenies were estimated by randomly resolving polytomies and using treedater 0.5.1. The mean clock rate of evolution was constrained to (0.00075,0.0015). Branch lengths were smoothed by enforcing a minimum number of substitutions per site on each branch and by sampling from the distribution estimated by treedater. This was carried out 20 times for each UK lineage.
- Epidemic growth rates were estimated using skygrowth 0.3.1 using MCMC and 1 million iterations for each time tree and using an Exponential(10^-4) prior for the smoothing parameter. The final results were produced by averaging across 20 time trees estimated for each lineage.
- Statistical analysis. All analyses were carried out in R 3.6.3. Mann Whitney U test was used in univariate comparisons. Effect of genotype on epidemic growth rate was estimated using weighted linear regression. We adjust for the size of each lineage, time of origin of each lineage, the time of the earliest sample in each lineage, and country with most samples in each lineage. Each observation is weighted by the reciprocal variance of the estimated epidemic growth rate.
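The inverse-variance weighting described above can be sketched as follows for the univariate case (growth rate regressed on a 0/1 genotype indicator). The analysis itself was carried out in R; this Python version, its function name and the toy numbers are hypothetical, and expressing the effect as a ratio of fitted values is an assumption about how a "times faster" figure might be derived.

```python
def weighted_simple_regression(x, y, variances):
    """Weighted least squares for y = intercept + slope * x.

    Each observation is weighted by the reciprocal of its variance,
    so imprecise growth rate estimates contribute less to the fit.
    """
    w = [1.0 / v for v in variances]
    total = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / total
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / total
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical toy data: growth rate per year, genotype (0 = 614D, 1 = 614G),
# and the variance of each growth rate estimate.
growth = [140.0, 150.0, 130.0, 180.0, 170.0, 175.0]
is_g = [0, 0, 0, 1, 1, 1]
variance = [400.0, 100.0, 400.0, 100.0, 400.0, 100.0]

intercept, slope = weighted_simple_regression(is_g, growth, variance)
# Assumption: relative growth expressed as fitted G value over fitted D value.
ratio = (intercept + slope) / intercept
print(intercept, slope, ratio)
```

With a single binary covariate the fitted values reduce to the variance-weighted mean growth rate in each genotype group; the adjusted analyses described above would add further covariates to the design.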
SEIR model for spike mutation 614G
This analysis was based on a structured coalescent model fitted to 100 Spike 614D and 100 Spike 614G sequences sampled from within Greater London and 50 sequences sampled from outside of London. The model accounted for bi-directional migration between London and the rest of the world.
- The SEIR model predicts that Spike 614 D lineages outnumber G until the second half of March (Figure B1).
- The ratio of reproductive numbers in the Spike 614 G to D clade is 1.1 (95% CI: 0.85-1.41).
- Sampling patterns show a similar shift from D to G in the second half of March (Figure B2) based on a random sample of 1000 sequences from London.
- The log odds ratio of sampling a Spike 614 G genotype can be computed from the relative size of the G and D sub-epidemics estimated with the SEIR model. This is shown in Figure B3 with the blue ribbon. The log odds can also be computed from the empirical distribution of sampling times. We computed a Gaussian kernel density estimator for G and D genotypes and computed the ratio of sample densities through time. This produces the log odds shown with the red line in Figure B3.
- Whereas the coalescent model was not explicitly making use of the sample time information to generate these estimates, we can resample the posterior using importance weights based on the likelihood of observing the actual sequence of sampling times given the relative size of the D and G epidemics estimated with the SEIR model. We derive a sequential Bernoulli likelihood for each posterior trajectory and resample accordingly. This gives a revised estimate for the ratio of the reproduction numbers in the G and D shown in Figure B4.
- After weighting and resampling from the posterior, we estimate the ratio of reproduction numbers (G to D) to be 1.26 (95% CI: 0.99-1.58).
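The reweighting step described above can be illustrated with a short sketch. It assumes that each posterior draw supplies, for every sampled sequence, the model-implied probability that the sample is 614G at its sampling time; the data layout, function names and toy values are hypothetical and are not taken from the PhyDyn analysis.

```python
import math
import random


def bernoulli_log_likelihood(prob_g_at_sample, is_g):
    """Log-likelihood of the observed genotype sequence under one posterior draw.

    prob_g_at_sample[i] is the probability (strictly between 0 and 1) that
    sample i is 614G; is_g[i] is True when sample i actually carries 614G.
    """
    return sum(math.log(p if g else 1.0 - p) for p, g in zip(prob_g_at_sample, is_g))


def importance_weights(log_likelihoods):
    """Normalised weights proportional to exp(log-likelihood), computed stably."""
    top = max(log_likelihoods)
    raw = [math.exp(ll - top) for ll in log_likelihoods]
    total = sum(raw)
    return [r / total for r in raw]


def resample(draws, weights, n, seed=0):
    """Draw n values with replacement, in proportion to the importance weights."""
    rng = random.Random(seed)
    return rng.choices(draws, weights=weights, k=n)


# Hypothetical toy posterior: three draws of the R0 ratio, each with its
# implied probability of sampling G at four sampling times.
ratio_draws = [1.3, 1.1, 0.9]
prob_g = [
    [0.2, 0.4, 0.6, 0.8],
    [0.5, 0.5, 0.5, 0.5],
    [0.8, 0.6, 0.4, 0.2],
]
observed_is_g = [False, False, True, True]

log_liks = [bernoulli_log_likelihood(p, observed_is_g) for p in prob_g]
weights = importance_weights(log_liks)
revised = resample(ratio_draws, weights, n=1000)
```

Draws whose trajectories better match the observed order of D and G samples receive more weight, which is how sample time information can shift the estimated ratio.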
The SEIR model assumed a 6.5 day serial interval. The estimated parameters included the initial number infected, the susceptible population size, and the reproduction number. The model included bidirectional migration to the region outside of London (both within the UK and internationally) at a constant rate per lineage. Infections outside of London were modelled as an exponential growth coalescent. Additional estimated parameters include the migration rate, and the size and rate parameters for the exponential growth coalescent.
This model was implemented in the BEAST2 PhyDyn package and is available here:
In order to make results comparable between D and G lineages, the molecular clock rate of evolution was fixed at a value estimated using all data in treedater 0.5.1. Nucleotide evolution was modelled as a strict clock HKY process. To fit the model we ran 20 MCMC chains for 20 million iterations, each using 4 coupled MCMC chains. Bespoke algorithms were used to exclude chains which failed to sample the target posterior.
KDE estimation of sample time densities used a bandwidth of 1.8 days and was implemented in R 3.6.3.
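A minimal sketch of the empirical log odds calculation is given below, using a Gaussian kernel with the 1.8 day bandwidth. The original was implemented in R; this Python version is illustrative only. Scaling each density by its sample count (so that the ratio reflects relative numbers of G and D samples per day) is an assumption, and the sample times are hypothetical.

```python
import math


def gaussian_kde(sample_times, bandwidth):
    """Return a function giving the Gaussian kernel density estimate at time t."""
    n = len(sample_times)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(t):
        return norm * sum(
            math.exp(-0.5 * ((t - s) / bandwidth) ** 2) for s in sample_times
        )

    return density


def log_odds_g(t, times_g, times_d, bandwidth=1.8):
    """Log ratio of count-scaled sample densities (G over D) at time t.

    Assumption: densities are multiplied by the number of samples of each
    genotype so the ratio tracks relative sampling intensity.
    """
    f_g = gaussian_kde(times_g, bandwidth)
    f_d = gaussian_kde(times_d, bandwidth)
    return math.log(len(times_g) * f_g(t)) - math.log(len(times_d) * f_d(t))


# Hypothetical sampling times in days since an arbitrary start date.
times_d = [0.0, 2.0, 4.0, 6.0, 8.0]
times_g = [10.0, 12.0, 14.0, 16.0, 18.0]

early = log_odds_g(4.0, times_g, times_d)
midpoint = log_odds_g(9.0, times_g, times_d)
late = log_odds_g(14.0, times_g, times_d)
print(early, midpoint, late)
```

With these toy times the log odds are negative early on, close to zero at the midpoint and positive later, mirroring a shift from D to G among samples.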
Spike mutation 614G and disease severity
Using survival for 28 days after diagnosis as the primary outcome we observe slightly reduced odds of death for patients infected by virus carrying the spike 614G mutation (Table 1). This result lacks statistical significance after controlling for other variables which have large influence on survival (sex and age). We observe associations between time of sampling and genotype (later samples more likely to have 614G) and later samples having higher odds of death and higher age. These complex sampling patterns may introduce confounding effects into the analysis. Odds of survival decrease for later samples, which may reflect prioritization of very severe cases for genetic sequencing as the epidemic grew in March and April.
In a univariate weighted linear regression we observe a strong and significant (p < 10^-5) association between the age of patients and the probability of spike 614G (Figure 1). This effect is contrary to the neutral expectation of increasing frequency which would result from observed increasing trends in the 614G prevalence and increasing age trend among sampled patients.
Decreasing odds of 614G with age may indicate a confounding relationship of disease severity and probability of sample inclusion (Figure 2). Most samples collected to date were collected from patients with severe disease. If disease severity mediates the effect between genotype and death and severe patients are preferentially sampled, the true influence of genotype on death may not be observed. Resolving this question will require community sampling which is not correlated with disease severity.
Data and methods
The analysis was based on paired genotype and clinical variables from 1879 patients sampled between February 5 and March 17 2020. Observations were excluded if: 1) they could not be matched to a genetic sequence; 2) multiple observations were available from the same patient, in which case only the most recent was retained; 3) the patient was missing the outcome variable (death within 28 days post-diagnosis).
Data were analysed in R using linear and logistic regression and chi square tests for contingency tables. Interaction effects were explored for all dependent variables but all lacked statistical significance. Profile likelihoods were used to compute confidence intervals.