In a breakthrough that could guide the development of targeted vaccines, MIT researchers used natural language processing methods drawn from computational linguistics to analyze the viral protein sequences of influenza A, HIV, and SARS-CoV-2, identifying the regions of those viruses that are most vulnerable to mutation.
One of the greatest challenges to defeating influenza and HIV is their rapid rate of mutation, which allows them to evade the antibodies generated by a particular vaccine through a process known as “viral escape.”
The phenomenon occurs when a mutation enables the virus to change the shape of its surface proteins in a way that prevents antibodies from binding to them, but still leaves the proteins’ functionality intact.
Lead author Brian Hie and his co-authors, who include members of MIT’s departments of biological engineering and computational and systems biology, developed a new way to computationally model viral escape based on machine-learning models originally built to analyze natural language.
The techniques the MIT researchers adapted to the viral domain include constrained semantic change search (CSCS), which searches for mutations that preserve a virus's fitness while making it antigenically different, and bidirectional long short-term memory (BiLSTM), a neural language-model architecture they trained to learn the “grammar” of protein sequences and predict viral escape.
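To make the CSCS idea concrete, here is a minimal sketch of its rank-sum scoring over single-residue mutants. The language model is a deterministic toy stand-in (the researchers used a BiLSTM trained on large viral sequence corpora); the helper names `embed` and `token_prob` are hypothetical placeholders for the model's hidden-state embedding and conditional residue probability, and are not from the paper's code.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def embed(seq):
    # Toy "semantic embedding": one number per residue. A real model
    # would return the language model's hidden states for the sequence.
    return [AMINO_ACIDS.index(c) for c in seq]

def token_prob(seq, pos, aa):
    # Toy "grammaticality": probability of residue `aa` at position `pos`
    # given the sequence. A real model would output p(aa | context).
    dist = abs(AMINO_ACIDS.index(aa) - AMINO_ACIDS.index(seq[pos]))
    return 0.5 / (dist + 1)

def semantic_change(seq, mutant):
    # Distance between embeddings: how antigenically "different"
    # the mutant looks to the model.
    return math.dist(embed(seq), embed(mutant))

def cscs_rank(seq):
    """Rank every single-residue mutant by a CSCS-style rank sum:
    mutations that are both semantically changed (antigenically novel)
    and grammatical (fitness-preserving) score highest."""
    mutants = []
    for pos, orig in enumerate(seq):
        for aa in AMINO_ACIDS:
            if aa == orig:
                continue
            m = seq[:pos] + aa + seq[pos + 1:]
            mutants.append((m, semantic_change(seq, m), token_prob(seq, pos, aa)))
    score = [0] * len(mutants)
    for key in (1, 2):  # rank once by semantic change, once by grammaticality
        order = sorted(range(len(mutants)), key=lambda i: mutants[i][key])
        for rank, i in enumerate(order):
            score[i] += rank
    ranked = sorted(zip(mutants, score), key=lambda t: -t[1])
    return [(m, s) for (m, _, _), s in ranked]
```

Combining the two rankings by summing ranks, rather than multiplying raw scores, keeps the search robust to the very different scales of embedding distances and token probabilities; the highest-ranked mutants are the candidate escape mutations.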