Explainer: The COVID-19 genome

What studying the virus’s genomes can tell us about the pandemic

More than 20,000 genomes of SARS-CoV-2 viruses that have caused infection in people in the UK have been sequenced by the COVID-19 Genomics UK (COG-UK) Consortium to date. Analysing these and future sequences will help build the evidence for how genomic data can be used as a vital part of controlling the pandemic.

Information from the genome sequences will help track the spread of the coronavirus in the UK and support public health planning and clinical decision making.

In such a fast-moving situation, careful interpretation of information from genome sequences, together with additional data, is essential. Here, we reflect on what the genome data can, and can’t, tell us.

Viral genomes

The genomes of coronaviruses, like those of many other viruses, are made not of DNA, as in most organisms, but of RNA. The genome sequence of the SARS-CoV-2 virus was determined several months ago, after the virus was first detected in China[1]. It is small, at just under 30,000 letters, or bases (29 kilobases), with only 15 genes. Humans, by comparison, have around 20,000 genes in a genome of roughly 3 billion base pairs (about 3 gigabases). More about COVID-19 biology is available on the UKRI website.

Why sequence genomes of the SARS-CoV-2 virus?

Building genome-based trees to define transmission

Genomes mutate: letters in the genome sequence change as organisms replicate. Virus genomes usually mutate at a steady rate – HIV extremely rapidly, influenza more slowly, and coronaviruses more slowly still. Researchers can use the mutation rate as a molecular clock, because the number of genetic differences between two viruses is roughly proportional to the time since they last shared a common ancestor. The individual virus sequences can be placed back in time on a phylogenetic tree, much like a family tree, which shows how closely two or more SARS-CoV-2 viruses are related.
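As a rough illustration of the raw material for such a tree, the sketch below counts the letter differences between every pair of sequences in a tiny alignment. The sample names and ten-letter sequences are hypothetical, and real analyses use full-length genomes and dedicated phylogenetics software rather than a simple count like this.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Count positions at which two aligned, equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


# Hypothetical toy alignment (real genomes are about 30,000 bases long).
samples = {
    "A": "ACGTACGTAC",
    "B": "ACGTACGTAT",
    "C": "ACGAACCTAT",
}

# Pairwise difference counts, keyed by the pair of sample names.
distances = {
    (p, q): hamming(samples[p], samples[q])
    for p, q in combinations(sorted(samples), 2)
}

# The pair with the fewest differences is the most closely related pair.
closest_pair = min(distances, key=distances.get)
print(distances, closest_pair)
```

In this toy data, A and B differ by the fewest letters, so they would sit closest together on a tree.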

With a new virus, it is hard to initially define how fast the clock is ticking. The SARS-CoV-2 mutation rate was initially based on that of related viruses, though researchers now estimate it has a mutation rate of approximately 2.5 bases a month – slow in evolutionary terms.
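To make the clock idea concrete, here is a minimal sketch of the arithmetic, assuming a perfectly steady clock at the approximate rate quoted above. Two viruses that split from a common ancestor each accumulate changes independently, so the differences between them build up at about twice the per-lineage rate. The function name is hypothetical, and real estimates come from statistical models that allow for uncertainty.

```python
RATE_PER_MONTH = 2.5  # approximate per-lineage rate quoted in the text


def months_since_common_ancestor(differences: int,
                                 rate: float = RATE_PER_MONTH) -> float:
    """Estimate the time since two viruses shared an ancestor.

    Assumes a strict molecular clock: each of the two lineages gains
    `rate` changes per month, so differences ~= 2 * rate * months.
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    return differences / (2 * rate)


# Two genomes that differ at 10 positions.
print(months_since_common_ancestor(10))
```

Under these assumptions, two genomes that differ at 10 positions would have shared an ancestor about two months earlier.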

Together with the fact that the virus has a very recent common ancestor – in December 2019 – the slow mutation rate means that there is limited genomic diversity in the circulating viruses so far, although that will change over time as mutations accumulate. Despite this, it has been possible to trace the virus’s history, from the centre of the outbreak, to all corners of the world. Researchers are constantly refining and updating the picture as more evidence becomes available. To view global data for SARS-CoV-2 to date, visit

Local transmission

The same principles of building a phylogenetic tree can be used on a more local scale, too. The virus circulating in a particular area, be that a hospital, town, or region, may carry a particular genomic change. This change may differ from those in a virus that is multiplying and spreading in another area. If viruses from a third area are then sequenced, researchers can, in some cases, trace where they came from, based on their sequence.
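The idea can be sketched as a simple matching exercise, assuming each area's virus is summarised by a set of characteristic changes. The area names, mutation labels and the function below are all hypothetical, and the function deliberately returns no answer when the evidence is absent or ambiguous, reflecting the "in some cases" caveat.

```python
def likely_source(sample_mutations, area_markers):
    """Return the area whose marker mutations best match the sample.

    Returns None when no markers are shared or when two or more
    areas match equally well, since the source is then unclear.
    """
    scores = {
        area: len(sample_mutations & markers)
        for area, markers in area_markers.items()
    }
    best = max(scores.values(), default=0)
    winners = [area for area, score in scores.items() if score == best]
    if best == 0 or len(winners) != 1:
        return None
    return winners[0]


# Hypothetical marker changes seen in two areas.
area_markers = {
    "hospital": {"site_1000", "site_2000"},
    "town": {"site_3000"},
}

# A virus sampled in a third area.
new_sample = {"site_1000", "site_2000", "site_5000"}
print(likely_source(new_sample, area_markers))
```

Here the new sample carries both hospital markers and none of the town markers, so the hospital lineage is the more plausible source.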

COG-UK researchers envisage that we will soon be at this point in the UK, where they will have accumulated enough data to see ‘local’ mutations in the virus. If sequencing can be done in real time, this is important information for public health officials: outbreaks can be spotted and brought to a rapid close, and other interventions can be introduced to reduce the chances of this happening again.

Recent research led by Professor Ian Goodfellow and Dr Estée Török at the University of Cambridge assessed how useful genomic sequencing of the virus can be within a hospital. They assessed hundreds of virus sequences from Cambridge University Hospitals NHS Foundation Trust during March and April. Together with data about the movement of patients and staff, they were able to identify clusters of infections that were linked, and some that weren’t. This, in turn, helped inform infection control procedures. The genomic data provided evidence to support or refute transmission between potentially linked cases.

False connections

But caution is needed when interpreting such data. Sequences from two or more people could be the same through chance rather than because they are part of an outbreak. Other information, such as whether the people involved have been in direct contact or shared the same environment, is an essential part of the process when investigating possible outbreaks.

As a result, it is easier to rule out an outbreak when people with COVID-19 have viruses that are genetically distinct than it is to confirm an outbreak when the genomes are the same. Extensive spread of the virus means that identical genomes can be seen even in different countries, despite the lack of a direct epidemiological link.
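This asymmetry can be written down as a small decision rule. The sketch below is illustrative only: the function name, the labels it returns and the cut-off of two differences are all assumptions chosen for the example, not thresholds used by COG-UK.

```python
def classify_pair(differences: int, epidemiological_link: bool,
                  max_differences: int = 2) -> str:
    """Classify a pair of cases from genomic and contact evidence.

    `max_differences` is a hypothetical cut-off: above it, the two
    viruses are treated as too distinct for direct transmission.
    """
    if differences > max_differences:
        # Distinct genomes argue against a link on their own.
        return "linkage unlikely"
    if epidemiological_link:
        # Similar genomes plus known contact or a shared environment.
        return "linkage supported"
    # Similar or identical genomes alone could be chance.
    return "inconclusive"


print(classify_pair(5, True))
print(classify_pair(0, True))
print(classify_pair(0, False))
```

Note that identical genomes without supporting contact information are classed as inconclusive rather than as a confirmed link.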

An important pitfall is using too few sequences in an outbreak analysis. This can lead to false connections being made between genomes that turn out to be more distantly related once more sequences are added to the analysis. Genomic analysis is a dynamic process, and it depends on having the right number of sequences and the right sampling strategy.
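A toy example shows how sparse sampling can mislead. In the sketch below, each virus is summarised as a set of mutation labels and the closest relative is the one with the fewest unshared mutations. The case names and labels are hypothetical.

```python
def nearest(query, others):
    """Return the name of the sequence sharing the most mutations.

    Distance is the number of mutations present in one set but not
    the other (the size of the symmetric difference).
    """
    return min(others, key=lambda name: len(query ^ others[name]))


# Mutations carried by the case under investigation.
query = {"m1", "m2", "m3"}

# With only one other sequence, it is the closest relative by default.
sparse = {"ward_case": {"m1"}}

# Adding a sequence from wider sampling reveals a closer relative.
dense = {**sparse, "community_case": {"m1", "m2", "m3", "m4"}}

print(nearest(query, sparse))
print(nearest(query, dense))
```

With one comparison sequence, the ward case looks like the closest relative; once the community sequence is added, the apparent connection to the ward case weakens.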