When the pandemic forced scientists to consider remote working strategies, many groups turned to online bioinformatic research. A team led by Dr. Christopher Buck at the US National Cancer Institute turned lockdown lemons into lemonade. They shifted from their usual work of discovering viruses directly in clinical specimens to a broader, more comprehensive search for viruses lurking in publicly available deep sequencing datasets. The remote-work project uncovered hundreds of previously unknown virus species associated with a wide range of animals. The discoveries shed new light on the deep evolutionary roots of viruses.
Everybody knows that animals hand genes down to their offspring. Although viruses also use the familiar mechanism of vertical inheritance, in rare cases where two different virus species happen to infect the same cell, they can also exchange genes horizontally. For example, currently circulating H5N1 and H3N2 flu viruses could theoretically exchange H and N segments to create dangerous new H5N2 or H3N1 strains. Viruses can also sometimes plunder genes from host cells. Although horizontal gene transfer is a well-established phenomenon for viruses whose genomes consist of multiple segments of RNA, it’s less clear whether viruses with DNA-based genomes, such as the cancer-causing viruses Dr. Buck’s team usually studies, can routinely swap genes like trading cards.
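The flu example above can be made concrete with a few lines of Python. This is only a toy sketch of segment reassortment: it treats each strain as an (H, N) pair and lists the combinations that are not already present among the parents. The function name `reassortants` is made up for illustration, and real reassortment involves all of the flu genome's segments, not just these two.

```python
from itertools import product

def reassortants(parents):
    """Return H/N combinations that differ from every parent strain."""
    h_types = sorted({h for h, _n in parents})
    n_types = sorted({n for _h, n in parents})
    parent_set = set(parents)
    # Every pairing of an available H with an available N, minus the parents.
    return [f"{h}{n}" for h, n in product(h_types, n_types)
            if (h, n) not in parent_set]

# The two co-circulating strains named in the text.
parents = [("H5", "N1"), ("H3", "N2")]
novel = reassortants(parents)
print(novel)  # ['H3N1', 'H5N2']
```

With two parent strains there are only two new combinations, but the number of possible mixtures grows quickly as more strains and more segments are considered.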
Traditionally, researchers classify living things based on vertical inheritance patterns; however, the system becomes murky when we consider horizontal gene transfer, which involves separating and recombining genetic sequences in a more flexible way. Dr. Buck invokes the metaphor of mythological creatures such as the ancient Egyptian sphinx, which has the head of a human and the body of a lion. An organism that inherited body parts horizontally can’t be considered simply a human, simply a lion, or simply a totally unrelated organism – it would instead need to be classified as a distinct category of human-lion. If you want to know what kinds of problems a sphinx might cause, it’s important to know that the head might speak and the body might claw. The same is true for horizontally acquired viral genes.
Dr. Buck’s team set out to search the Sequence Read Archive, an enormous petabyte-scale digital repository where biologists share raw sequencing data. They scanned a wide range of animal datasets for the sequences of so-called hallmark proteins unique to particular virus families of interest. Using the National Institutes of Health supercomputer, known as Biowulf, they performed the computationally intensive task of assembling the raw sequence fragments into full-length viral genomes.
Many of the assembled genomes turned out to have startling combinations of genes derived from different virus families. The researchers developed a new system for categorizing the hundreds of annotated virus genome maps they deposited into the more readily searchable curated sequence database called GenBank. One example is a group of viruses the team named adintoviruses, which unite adenovirus-like genes with retrovirus-like genes. The survey revealed a number of adintoviruses associated with humans and other primates. By comparing the gene content of the broad range of new viruses, the researchers were able to develop a hypothetical gene flow framework representing the possible genetic transfer events among all related members of the new group. The gene-centered view indicates that horizontal gene transfer has been a dominant force in viral evolution over million-year time scales.
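The idea of sorting viruses by gene content can be illustrated with a small Python sketch. It is not the team's categorization system: the genome names and gene labels below are hypothetical placeholders, and the family assignments simply restate the adintovirus and adomavirus descriptions in this article. The sketch flags a genome as chimeric when its genes resemble more than one virus family, and reports which families two genomes have in common.

```python
# Toy annotations: each gene is mapped to the virus family it resembles.
# Genome names and gene labels are placeholders, not real annotations.
genomes = {
    "adintovirus_example": {"gene_1": "adenovirus", "gene_2": "retrovirus"},
    "adomavirus_example": {"gene_1": "adenovirus", "gene_2": "polyomavirus"},
    "polyomavirus_example": {"gene_1": "polyomavirus", "gene_2": "polyomavirus"},
}

def family_mix(gene_to_family):
    """List the distinct virus families represented in one genome."""
    return sorted(set(gene_to_family.values()))

def is_chimeric(gene_to_family):
    """A genome is 'sphinx-like' if its genes resemble more than one family."""
    return len(family_mix(gene_to_family)) > 1

def shared_families(a, b):
    """Families represented in both genomes: a hint of possible gene flow."""
    return sorted(set(a.values()) & set(b.values()))

chimeras = [name for name, genes in genomes.items() if is_chimeric(genes)]
link = shared_families(genomes["adintovirus_example"],
                       genomes["adomavirus_example"])
print(chimeras)  # ['adintovirus_example', 'adomavirus_example']
print(link)      # ['adenovirus']
```

A shared family in this toy model only suggests a possible connection. Working out the direction and timing of a real gene transfer requires comparing the sequences themselves.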
The similarities between the viral gene sequences spanned an incredibly varied range of animals including fish, dogs, birds, flies, spiders, worms, and molluscs. One surprising find was a pair of exotic polyomavirus sequences in a dataset representing the clean room where NASA’s Phoenix Mars Lander was assembled. The team speculates that the sequences represent viruses that infect microscopic dust mites that inhabit human skin. The result highlights the problem that deep sequencing samples can easily be cross-contaminated – so it will be important to validate host animal associations using traditional wet-bench research. For example, the team is currently investigating the hypothesis that some humans may have antibodies reflecting past infections with adintoviruses.
Gathering the available data into an easily searchable form is time-consuming and labour-intensive. However, new AI technology should be useful for simplifying the process of figuring out which genes encode structurally similar proteins. Furthermore, advanced predictive algorithms can help to classify gene sequences that previously appeared to have no similarities with any known protein. The new AI technologies could also help scientists design vaccines against newly discovered viruses. For example, Dr. Buck’s team used AlphaFold – an AI tool that predicts protein structure – to infer a specific domain in the coat proteins of a sphinx-like group of viruses called adomaviruses, which unite features of adenoviruses and polyomaviruses. The new domain, which the team playfully named a halfsmoke fold (after a beloved Washington DC-area sausage), could be a target for vaccines that might protect endangered Japanese eels against the virus.
The research suggests that horizontal gene transfer among viruses is more common than previously thought and offers new insights into the evolutionary history of viruses. These emerging genetic links raise important questions, and certainly warrant further exploration. If researchers can investigate proposed gene loss and recombination events and establish the suspected plasticity of active virus components, they may be able to unlock further evolutionary revelations.
Some of the gene sequences present in the newly detected viruses resemble a gene class called oncogenes, which promote the growth of cancer cells. This knowledge may assist clinicians and researchers in conducting diagnostic procedures, confirming the prevalence of certain tumour types, identifying therapeutic targets, and informing vaccine and antiviral drug designs.
Since commencing this journey, the research team has built a library of novel protein-encoding gene sequences that will help them to identify more virus families. Furthermore, by developing a new algorithm, the team has been able to continue their intensive computational groundwork with ever-increasing accuracy. They are now closer to creating a basic searchable database with a more user-friendly interface, with exciting future possibilities.
In his signature offbeat way, Dr. Buck compares viruses to Lego bricks, explaining that the same building blocks may be used repeatedly to construct countless structures. Because he has generally limited his database searches to his favourite virus families, this latest study has explored less than 1% of available datasets. Buck implores other researchers to join the mission of trawling these dauntingly immense public databases. Human datasets, which are currently barely touched due to privacy issues, undoubtedly harbour a goldmine of weird and wonderful treasures.
The potential for revealing as-yet unknown associations between viruses and major diseases such as cancer and Alzheimer’s is huge, and using the methods described here would at least partially avoid the bureaucratic restrictions that so often impede valuable scientific research activities.