A central tenet of machine learning and AI is that more data yields "better", more predictive models. Of course, this simple assertion carries many caveats relating to data quality, consistency, applicability domain, training algorithm, etc. A particular challenge for machine learning and AI in drug discovery is that much of the data of interest has been generated within commercial organisations and is proprietary. In this talk I will discuss practical, real-world approaches to the challenge of data access and data sharing in drug discovery, with particular reference to the problem of predicting protein expression from sequence information.