Protein Sciences in Drug Discovery 2022

Predicting protein expressibility and solubility using protein language models

Thu 3 Nov, 03:05pm (30 mins)
Where: The Auditorium
Miss Hannah-Marie Martiny
Dr Henrik Nielsen

Abstract

Expressibility and solubility are often limiting factors in protein production. Existing strategies for optimizing production, such as adjusting the experimental setup and codon optimization, are time-consuming. Tools that predict expressibility and solubility for protein purification directly from the protein sequence are therefore needed. However, existing predictors are often built on biased datasets and tuned only for the expression host Escherichia coli, ignoring other industrially important hosts such as Bacillus subtilis. We have shown that deep learning protein language models learn statistical representations of protein sequences, which can then be used to select sequence candidates with high solubility and expression potential. In one study, we built a B. subtilis-specific tool to infer the likelihood of successful overexpression. Despite achieving only modest performance, the tool is able to prioritize protein sequences by extracting features related to expression. In a second study, we showed that several existing solubility predictors for E. coli were built on biased data and did not generalize well across multiple datasets. To address this, we introduced a new tool, NetSolP, which achieved state-of-the-art performance on curated existing datasets. Our work shows the potential of language models to accelerate protein production.