7 Apr 2022

The crucial role of data linkage during the pandemic

Any public health response, and especially one as fast-moving and far-reaching as the COVID-19 pandemic, requires a range of stakeholders in different locations to share data and collaborate on its analysis to unlock insights. We spoke to Dr Ewan Harrison, a microbiologist at the Wellcome Sanger Institute and University of Cambridge and Deputy Director for COG-UK, for a practical explanation of how data is linked in a pandemic and wider healthcare setting, and why this is important.

“A genome tells you something about viral biology, but if you want to infer anything more, you need data linkage,” Dr Harrison notes. Sequencing and generating a viral genome can reveal which mutations the virus has at any one time, but to do anything further with this information, more context is needed. As Harrison explains, “you need to understand if the person the virus came from was ill, where they might have been infected, what comorbidities they had”. The answers to all these questions rely on linking different datasets together.

Data linkage, in turn, relies on a tenet central to COG-UK’s operations, namely data sharing, which the consortium has prioritised since its inception. For example, SARS-CoV-2 genomic sequencing data is uploaded to GISAID and the European Nucleotide Archive, and is published on the COG-UK website. Openly sharing genome-sequencing data from SARS-CoV-2 samples has allowed researchers to track how the virus is evolving and has become a hallmark of the pandemic. To illustrate how far the scientific community has come, Harrison contrasts this with data-sharing practices before COVID-19: “There was data sharing for some pathogens, mainly flu, but there wasn’t the same level of consistency as for COVID-19, it was a lot more ad-hoc”.

In practice, the sharing of viral genome data has to adhere to several key principles to enable scientists and other stakeholders to understand how genomic information links to other data. One of these principles is recording the date of sampling. “The most fundamental point is that if you want to share data in the public domain, it is critical to provide a sampling date”, says Harrison. This is essential because many of the methods used to analyse pathogen genome data depend on understanding the pathogen’s ‘molecular clock’, which can indicate how quickly mutations are accruing in a virus. Calibrating this clock requires the sampling date. Harrison adds: “You need to make sure that the sample is linked to the original specimen it came from – and this means you understand the relationship between the sampling and the patient.” Linking to the patient is critical because it allows epidemiological analyses to be carried out and informs contact tracing. This linkage should take place in secure trusted research environments, which enable different data types to be integrated and analysed while protecting patient privacy.

The significant role that data linkage plays is highlighted by the recent GenOMICC research findings. Over 20 human genetic loci (the specific physical locations of genes or other DNA sequences on a chromosome) were identified that seemed to be linked with COVID-19 severity. “The study’s findings are particularly important because they allow us to understand which parts of the immune system are associated with a specific outcome, such as severe disease, or to identify potential drug targets,” says Harrison. This is an important lesson to have come from COVID-19: “Host-pathogen genomics and clinical data are not isolated; we need to be able to bring them together.”

In a future pandemic, bringing these data together earlier may allow us to understand the disease much more quickly than before. However, standardised systems enabling such data linkage need to be in place first. In fact, they should be operating regardless of whether there is a pandemic. “Antimicrobial resistance is a big problem that could benefit from such surveillance, as with flu and RSV (respiratory syncytial virus),” says Harrison.

COVID-19 Genomics UK (COG-UK)

The COVID-19 Genomics UK (COG-UK) consortium works in partnership to harness the power of SARS-CoV-2 genomics in the fight against COVID-19.

Led by Professor Sharon Peacock of the University of Cambridge, COG-UK is made up of an innovative collaboration of NHS organisations, the four public health agencies of the UK, the Wellcome Sanger Institute and sixteen academic partners.

The COVID-19 pandemic, caused by SARS-CoV-2, represents a major threat to health. The COG-UK consortium was formed in March 2020 to deliver SARS-CoV-2 genome sequencing and analysis to inform public health policy and to support the establishment of a national pathogen sequencing service, with sequence data now predominantly generated by the Wellcome Sanger Institute and the Public Health Agencies.

SARS-CoV-2 genome sequencing and analysis play a key role in the COVID-19 public health response by enabling the identification, tracking and analysis of variants of concern, and by informing the design of vaccines and therapeutics. COG-UK works collaboratively to deliver world-class research on pathogen sequencing and analysis, maximise the value of genomic data by ensuring fair access and data linkage, and provide a training programme to enable equity in global sequencing.