Protein Sciences in Drug Discovery 2022
Poster
6

Using Machine Learning to Predict Recombinant Protein Expression

Authors

A Sousa1; SK Ashenden1; A Afzal1; S Martinez Cuesta1; M Gancedo Rodrigo2; W Lee3; SK Talapatra2; D De Silva1; R Davies2; Y Wang1; I Barrett1; A Bornot1; L Holmberg Schiavone3
1 Data Sciences & Quantitative Biology, Discovery Sciences, AstraZeneca, Cambridge, UK;  2 Discovery Biology, Discovery Sciences, AstraZeneca, Cambridge, UK;  3 Discovery Biology, Discovery Sciences, AstraZeneca, Gothenburg, Sweden

Abstract

The production of recombinant proteins is critical across several stages of early drug discovery. The process is both costly and lengthy: an expression screening campaign involves multiple steps and takes a minimum of 6 weeks per construct. We are developing a machine learning platform that represents the primary sequence of a protein as physicochemical properties and structural features. It supports protein scientists by facilitating the design of protein constructs and by highlighting which yield class each sequence is expected to express at. The model is coupled to an in silico screening procedure that systematically designs and assesses thousands of constructs in a high-throughput manner. This method is currently being deployed in drug discovery projects and leads to the design of constructs that express at higher yield than those designed using human knowledge only. We will share our plans to improve and develop the techniques through integrative teamwork and additional resources: (1) predicting yield values instead of classes, aided by GelClick, an automated gel image analysis tool; (2) incorporating deep learning features for sequence representation; and (3) leveraging external datasets. Limited training data is a key blocker, so we are putting together a proof of concept and a pre-competitive consortium with academic and pharmaceutical industry partners to share data and models in collaboration with EMBL-EBI.
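
To make the idea of sequence featurisation and in silico construct screening concrete, the following is a minimal sketch and not the platform described in the abstract. It assumes that constructs are generated by trimming residues from either terminus of a parent sequence and that each construct is described by a few simple physicochemical descriptors. The function names (featurize, enumerate_constructs, screen), the choice of descriptors, and the trimming scheme are all illustrative assumptions. The hydropathy values are the published Kyte-Doolittle scale. The scoring function is supplied by the caller and stands in for a trained yield-class model, which is not reproduced here.

```python
# Kyte-Doolittle hydropathy scale (published values for the 20 standard amino acids).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def featurize(seq):
    """Return a few simple physicochemical descriptors for a protein sequence.

    Illustrative feature set only; the abstract does not list the actual features.
    """
    seq = seq.upper()
    n = len(seq)
    positive = sum(seq.count(a) for a in "KR")
    negative = sum(seq.count(a) for a in "DE")
    return {
        "length": n,
        "gravy": sum(KD[a] for a in seq) / n,        # mean hydropathy
        "net_charge": positive - negative,           # crude estimate, ignores His and termini
        "frac_aromatic": sum(seq.count(a) for a in "FWY") / n,
    }


def enumerate_constructs(seq, max_trim_n, max_trim_c, min_len):
    """Yield (start, end, subsequence) for every N-/C-terminal truncation.

    start and end are 1-based, inclusive residue positions in the parent sequence.
    """
    for i in range(max_trim_n + 1):
        for j in range(max_trim_c + 1):
            sub = seq[i:len(seq) - j]
            if len(sub) >= min_len:
                yield (i + 1, len(seq) - j, sub)


def screen(seq, score, max_trim_n=5, max_trim_c=5, min_len=20):
    """Rank all truncation constructs by a caller-supplied scoring function.

    `score` maps a feature dict to a number (higher is better); it is a
    placeholder for a trained model.
    """
    ranked = []
    for start, end, sub in enumerate_constructs(seq, max_trim_n, max_trim_c, min_len):
        ranked.append((score(featurize(sub)), start, end))
    ranked.sort(reverse=True)
    return ranked
```

A caller could, for example, pass score=lambda f: -f["gravy"] to rank more hydrophilic constructs first. That heuristic is purely a placeholder to exercise the code and makes no claim about what actually predicts expression yield.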