Predicting protein expressibility and solubility using protein language models

Thu 3 Nov, 03:05pm (30 mins)

Where:

The Auditorium

Session:

Accelerate protein production with the development of AI techniques

Speaker:

Miss Hannah-Marie Martiny

Speaker:

Dr Henrik Nielsen

Abstract

Expressibility and solubility are often limiting factors in protein production. Existing strategies to optimize production, such as adjusting the experimental setup or codon optimization, are time-consuming. Tools that predict expressibility and solubility for protein purification directly from the protein sequence are therefore needed. However, existing predictors are often built on biased datasets and tuned only for the expression host Escherichia coli, ignoring other industrially important hosts such as Bacillus subtilis. We have shown that deep learning protein language models can learn statistical representations of protein sequences, which can then be used to select sequence candidates with high solubility and expression potential. In one study, we built a B. subtilis-specific tool to infer the likelihood of successful overexpression. Despite achieving only modest performance values, the tool is able to prioritize protein sequences by extracting features related to expression. In a second study, we showed that several existing solubility predictors for E. coli were built on biased data and did not generalize well across multiple datasets. We therefore introduced a new tool, NetSolP, that achieved state-of-the-art performance on curated existing datasets. Our work shows the potential of language models to accelerate protein production.
