Any public health response – and especially one as fast-moving and far-reaching as the COVID-19 pandemic – requires a range of stakeholders in different locations to share data and collaborate on its analysis to unlock insights. We spoke to Dr Ewan Harrison, a microbiologist at the Wellcome Sanger Institute and University of Cambridge and Deputy Director for COG-UK, for a practical explanation of how data is linked in a pandemic and in wider healthcare settings, and why this is important.
“A genome tells you something about viral biology, but if you want to infer anything more, you need data linkage,” Dr Harrison notes. Sequencing a viral genome can reveal which mutations the virus carries at any one time, but doing anything further with this information requires more context. As Harrison explains, “you need to understand if the person the virus came from was ill, where they might have been infected, what comorbidities they had”. The answers to all these questions rely on linking different datasets together.
Data linkage, in turn, relies on a tenet central to COG-UK’s operations, namely data sharing, which the consortium has prioritised since its inception. For example, SARS-CoV-2 genomic sequencing data is uploaded to GISAID and the European Nucleotide Archive, and is published on the COG-UK website. Openly sharing genome-sequencing data from SARS-CoV-2 samples has allowed researchers to track how the virus is evolving and has become a hallmark of the pandemic. To illustrate how far the scientific community has come, Harrison contrasts this with data-sharing practices before COVID-19: “There was data sharing for some pathogens, mainly flu, but there wasn’t the same level of consistency as for COVID-19, it was a lot more ad-hoc”.
In practice, the sharing of viral genome data has to adhere to several key principles to enable scientists and other stakeholders to understand how genomic information links to other data. One of these principles is recording the date of sampling. “The most fundamental point is that if you want to share data in the public domain, it is critical to provide a sampling date”, says Harrison. This is essential because many of the methods used to analyse pathogen genome data depend on understanding the pathogen’s ‘molecular clock’, which can indicate how quickly mutations are accruing in a virus. Calibrating this clock requires the sample date. Traceability matters too: “You need to make sure that the sample is linked to the original specimen it came from – and this means you understand the relationship between the sampling and the patient.” Linking to the patient is critical because it allows epidemiological analyses to be carried out and informs contact tracing. The linkage itself should take place in secure trusted research environments, which enable different data types to be integrated and analysed while protecting patient privacy.
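To make the role of the sampling date concrete, here is a minimal Python sketch of one simple way a clock rate can be estimated: fitting a straight line to mutation counts against sampling dates. It is an illustration only and not a description of COG-UK's methods. The function name `clock_rate` and the sample values are invented for this example, and real analyses use phylogenetic methods rather than a plain regression.

```python
from datetime import date


def clock_rate(samples):
    """Estimate mutations per year by least-squares regression.

    `samples` is a list of (sampling_date, mutation_count) pairs, where
    mutation_count is the number of differences from a reference genome.
    Hypothetical helper written for illustration only.
    """
    if len(samples) < 2:
        raise ValueError("need at least two dated samples")

    # Express each sampling date in years so the slope is per year.
    xs = [d.toordinal() / 365.25 for d, _ in samples]
    ys = [float(m) for _, m in samples]

    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        # Without a spread of dates there is nothing to calibrate against.
        raise ValueError("samples must span more than one date")

    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Invented data: (sampling date, mutations relative to a reference genome).
example = [
    (date(2020, 3, 1), 4),
    (date(2020, 6, 1), 10),
    (date(2020, 9, 1), 17),
    (date(2020, 12, 1), 22),
]
print(f"Estimated rate: {clock_rate(example):.1f} mutations per year")
```

If every sample carried the same date, or no date at all, the function would have nothing to fit, which is the practical reason a sampling date has to accompany each shared genome.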
The GenOMICC research findings are a recent example of the significant role that data linkage plays. The study identified over 20 human genetic loci (the specific physical locations of a gene or other DNA sequence on a chromosome) that appeared to be linked with COVID-19 severity. “The study’s findings are particularly important because they allow us to understand which parts of the immune system are associated with a specific outcome, such as severe disease, or to identify potential drug targets,” says Harrison. This is an important lesson to have come from COVID-19: “Host-pathogen genomics and clinical data are not isolated; we need to be able to bring them together.”
In a future pandemic, bringing these data together earlier may allow us to understand the disease much more quickly than before. However, standardised systems enabling such data linkage need to be in place first. In fact, they should be operating whether there is a pandemic or not. “Antimicrobial resistance is a big problem that could benefit from such surveillance, as with flu and RSV (respiratory syncytial virus),” says Harrison.