Science, the hard way

Joe Greener - 19th February 2023

Does it ever feel like being a scientist largely consists of wading through reams and reams of questionable papers? Partly this is a consequence of more manuscripts being published than ever, and it's a general feature of life that most things are of limited value. I'd certainly prefer a deluge to a drought. With the recent explosion of machine learning (ML) papers in structural biology and computational chemistry, though, I can't help but feel that my field is at peak bullshit. The problem isn't that there's not excellent work being done. But as I scroll through the latest pre-prints, conference papers and Tweet threads I always end up asking myself the same question: how do I meaningfully assess the quality of the work in front of me?

The flavour of the month is using the latest innovation in neural networks (NNs) - equivariance, transformers, diffusion, the lot - and applying it to long-standing problems. This is all well and good, and has given stunning advances in fields such as protein structure prediction and protein design. The issue is that many studies test their model on datasets that are problematic: they do not reflect the real use case of the model, show overlap with the training data (Kapoor and Narayanan (2022)), or are quickly becoming overfit to by the field. Many of the datasets in use are reaching the status of MNIST in ML: good for demonstrating prototypes but not for assessing the state of the art. Which is an issue when everyone is claiming state of the art performance. Don't take it personally; yes, I probably am talking about your paper, but I am talking about everyone else's too.

The culprits

Machine learning potentials

The field of machine learning interatomic potentials (MLIPs), where quantum-level accuracy can be obtained at a fraction of the compute costs of density functional theory, is advancing at amazing speed. Typically these methods are trained and validated by energy or force matching to quantum datasets like QM9 or trajectories of specific systems. The problem is that matching this data is a necessary but not sufficient condition for a useful force field; people don't want to use MLIPs to predict instantaneous forces, they want to use them for simulations that reproduce experimental properties and allow prediction. Many of these MLIPs give exploding systems when used for simulation (Fu et al. (2022)), even on the same systems they were trained on. Encountering an unseen conformation leads to catastrophic failure. The emperor is wearing clothes, but they might not stay on for long.

I don't even think that stable simulation should be a strict requirement for publication - the field is developing, after all, and new ideas are valuable even when they are rough around the edges. When you read these papers, though, it seems that they have solved every problem. I would appreciate authors saying "look we match forces well but as soon as you go out of domain you are out of luck, if you run for 100 ps you will see a NaN". I learned more from this GitHub thread, which is just a discussion, than from many papers, because it provides insight into the limitations of the approach. Of course you can see why papers avoid this, since mentioning severe limitations is like showing a red rag to a bull in terms of peer review.

Standard datasets and metrics for simulation stability and macroscopic properties would be useful[1], as well as test datasets containing a considered mix of seen and unseen data so the field can assess the difficult problem of learning transferable potentials. As someone with more of a classical force field background I must express frustration at the number of papers that show poor classical force field performance on datasets that don't resemble how these force fields are actually used in practice. Yes, MLIPs will supplant classical force fields for many applications. Right now, though, they struggle to reach one nanosecond of simulation on a molecule not seen during training, and the field needs to be more honest about that if it wants to develop generic potentials.
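
To make the stability point concrete, here is a rough sketch of the kind of metric I have in mind: how much simulated time passes before coordinates go non-finite or atoms jump unphysically far between saved frames. The trajectory format and thresholds here are made up for illustration, not a community standard.

```python
import numpy as np

def time_to_failure(frames, frame_interval_ps, max_jump_nm=1.0):
    """Simulated time in ps before a trajectory becomes unphysical.

    `frames` is an iterable of (n_atoms, 3) coordinate arrays in nm and
    `frame_interval_ps` is the time between saved frames. The jump
    threshold is illustrative only.
    """
    previous = None
    n_frames = 0
    for i, coords in enumerate(frames):
        coords = np.asarray(coords)
        # Non-finite coordinates mean the integrator has blown up
        if not np.all(np.isfinite(coords)):
            return i * frame_interval_ps
        # A huge displacement between saved frames is a softer warning sign
        if previous is not None:
            jump = np.max(np.linalg.norm(coords - previous, axis=1))
            if jump > max_jump_nm:
                return i * frame_interval_ps
        previous = coords
        n_frames = i + 1
    return n_frames * frame_interval_ps  # survived the whole run
```

Reporting a number like this alongside force errors would say far more about whether an MLIP is usable for simulation than force matching alone.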

Protein-ligand binding

Protein-ligand binding has received a huge amount of attention from the ML community recently. The appeal is clear: if you can design a high-affinity small molecule binder for a protein then you can cure its associated disease. This isn't necessarily true, but it is appealing. Despite an armful of papers reporting state of the art results on the PDBBind dataset, often making clever use of graph neural networks and equivariance, these approaches have yet to result in a new drug. This itself is not surprising given the lengthy process of getting a drug to market. What is more suspect is that simpler predictors still perform well under comparison (Yu et al. (2023)), and are more likely to perform well on the novel targets that these methods are intended for due to a lower risk of overtraining. The sales pitch of ML methods is not that they will speed up medicinal chemistry on well-explored targets, after all, but that they will allow rapid virtual screening on new targets. In practice, though, most AI drug companies focus on well-established target classes (Jayatunga et al. (2022)). The field needs datasets to assess performance honestly, like a recent contribution in the related field of molecular optimisation (Gao et al. (2022)). Pharmaceutical companies could step up here since they have huge reserves of structural binding data that the wider community could use to have an AlphaFold moment, if only they could get access to it.

A number of ligand generation methods could do with more frankness about poor stereochemistry and clashing atoms too. If your method shows diverse binders but places some carbons too close and can't keep an aromatic ring flat, that is fine. I would rather read about it in your paper, though, than in the next paper from your competitor. Without more open discussion about the limitations of methods and data we are never going to address broader issues in drug discovery like toxicity prediction and how to avoid late-stage failures. As an outsider my understanding is that pharma companies are quite open when it comes to discussing drug targets; perhaps extending that openness to virtual screening and releasing non-commercially sensitive data would be to everyone's benefit. Who is going to go first though?
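
As an illustration of the kind of sanity check that could accompany generated ligands, here is a minimal sketch that counts non-bonded clashes and non-planar aromatic rings from raw coordinates. The inputs and cutoffs are hypothetical; a proper cheminformatics toolkit would do this more carefully.

```python
import numpy as np

def quick_geometry_checks(coords, bonds, aromatic_rings,
                          clash_cutoff=2.0, planarity_tol=0.1):
    """Flag obvious geometric problems in a generated ligand pose.

    `coords` is an (n_heavy_atoms, 3) array in Angstroms, `bonds` is a set
    of (i, j) index pairs and `aromatic_rings` is a list of index tuples.
    The cutoffs are illustrative rather than community standards.
    """
    coords = np.asarray(coords)
    n = len(coords)

    # Clash check: non-bonded heavy atoms closer than the cutoff
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    n_clashes = 0
    for i in range(n):
        for j in range(i + 1, n):
            if (i, j) not in bonds and (j, i) not in bonds \
                    and dists[i, j] < clash_cutoff:
                n_clashes += 1

    # Planarity check: deviation of ring atoms from their best-fit plane
    n_non_planar = 0
    for ring in aromatic_rings:
        ring_coords = coords[list(ring)]
        centred = ring_coords - ring_coords.mean(axis=0)
        normal = np.linalg.svd(centred)[2][-1]  # normal of the best-fit plane
        if np.max(np.abs(centred @ normal)) > planarity_tol:
            n_non_planar += 1

    return n_clashes, n_non_planar
```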

Protein structure prediction

Protein structure prediction has long suffered from questionable validation. I won't rehash the arguments at length here since they have been discussed by my previous lab (Jones (2019), Greener et al. (2022)) and known for decades (Chothia and Lesk (1986))[2]. The short version is that making a training/testing split based on sequence identity between proteins is insufficient, since homologous proteins can have effectively no detectable sequence similarity. If homologous proteins span your training and testing sets and you are predicting structural features, then you partly assess homology detection rather than the harder task of predicting structure. Honestly, the difference in accuracy is only a few percent, but that is a tricky difference for the field to grapple with - spending more effort to get a dataset that makes your model look worse is definitely doing it the hard way. This always held back secondary structure prediction (SSP), where every percent from 80% to 90% was argued over; one benefit of AlphaFold making SSP somewhat redundant is not having to watch inflated claims about 95%+ accuracy[3]. A recent study showed how ridiculous this can get by taking the hatchet to the whole field of sequence-based protein-protein interaction prediction (Bernett et al. (2023)). Once you account for data leakage, methods that claim 95%+ accuracy are no better than random.

The field could do with carefully-curated datasets that hold out whole folds based on structural classifications like CATH, SCOP or ECOD. These should define the training as well as the testing data, with recommendations on sequence searching so that people don't inadvertently pull in sequences homologous to the test set by profile drift. The boundary between template-based and de novo modelling has been blurred on the methods side now because the best methods excel at both, but it needn't be blurred on the validation side. An artificial holdout is surely a better way to assess performance on orphan proteins, too, than looking for an improbable set of non-tiny non-designed proteins for which we have structures but no related sequences (spoiler: it doesn't exist). Journals should take the lead here: an editor I have interacted with at Bioinformatics is aware of the issue, yet it still appears in many papers there, and it affects related fields like enzyme property prediction too (Kroll and Lercher (2023)).
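
A fold-level split is not hard to implement once each chain is labelled with a fold identifier. The sketch below assumes such a mapping (for example to CATH topology codes) is available, and holds out whole folds rather than splitting on pairwise sequence identity.

```python
import random

def split_by_fold(chain_to_fold, test_fraction=0.2, seed=0):
    """Split chains so that no fold appears in both training and test sets.

    `chain_to_fold` maps a chain ID to its fold identifier (e.g. a CATH
    C.A.T code). Holding out whole folds keeps remote homologues of the
    test set out of the training data.
    """
    folds = sorted(set(chain_to_fold.values()))
    random.Random(seed).shuffle(folds)
    n_test = max(1, int(len(folds) * test_fraction))
    test_folds = set(folds[:n_test])
    train = [c for c, f in chain_to_fold.items() if f not in test_folds]
    test = [c for c, f in chain_to_fold.items() if f in test_folds]
    return train, test
```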

Protein design

Protein design is an interesting outlier among the above fields when it comes to validation, because experimental testing is difficult but possible if you have the resources. It would be unfair on computational groups to require every design study to show a soluble protein, but without this it's very hard to assess success. I consider this field to be in its early stages in the wider scientific community since historically it has been the preserve of a select few labs. Metrics like the fraction of native residues recovered by fixed-backbone design have recently gone over 40%. This metric requires careful consideration though, since it is not clear what the field should be aiming for. One recent paper realises this, saying that "the most crucial and immediate concern is whether a high native sequence recovery is a good indicator for successful (structure-based) protein design" (Zheng et al. (2023)), but nonetheless uses that metric throughout.
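
For anyone new to the metric, native sequence recovery is just the fraction of positions at which a fixed-backbone design method reproduces the wild-type residue. A minimal calculation, assuming the sequences are already aligned and of equal length, looks like this:

```python
def sequence_recovery(native_seq, designed_seq):
    """Fraction of positions where the designed residue matches the native one.

    Assumes the two sequences correspond to the same backbone positions
    and are therefore the same length.
    """
    if len(native_seq) != len(designed_seq):
        raise ValueError("sequences must be aligned and of equal length")
    matches = sum(n == d for n, d in zip(native_seq, designed_seq))
    return matches / len(native_seq)

print(sequence_recovery("MKTAYIAK", "MATAYLAR"))  # 5 of 8 positions match -> 0.625
```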

In general proteins show considerable diversity outside the conserved core and active site, so studies claiming high native sequence recovery are immediately suspect for data leakage. A specific Angstrom-accurate backbone in a packed region may be tightly constrained in terms of sequence, but I doubt that is what these methods are discovering. That type of prediction also isn't particularly useful for design, because if you have such an accurate backbone you would also have the sequence. 40% seems in the right ball park for design methods to achieve, so the field will have to come up with new computational tests to keep developing from here (Castorina et al. (2023)). It will be interesting to see how success can be demonstrated without a department's worth of postdocs crystallising designs. Perhaps a community effort towards experimental validation? As it stands, though, I know which of the recent design with diffusion pre-prints is more convincing: the one with extensive experimental validation.

The hard way

The hard way involves putting in the extra work to honestly assess performance. This applies both at the author level and at the field level. You might not like what you find, and it might harm your ability to sell your method. The first and most obvious solution is to develop better datasets that are matched to the real world use case, have a defined training set and testing set (so people can't game them with data leakage), and contain a variety of easy and hard cases. Without naming names I can think of a few shoddy datasets that claimed to be "gold standard" in their own abstract, so avoid those phrases and let the best datasets show their worth over time.

The second solution is to encourage and promote disinterested benchmark studies that run existing methods without trying to shill a new one. When I read these studies I am impressed that some poor junior researcher had to install that much software and managed to resist the temptation to tack a half-baked overtrained NN on the end. You will always have my citation (Livesey and Marsh (2020)). Ironically, such studies pointing out limitations often advance a field more than the primary research itself. If all papers compared to sensible baselines - domain-relevant algorithms like transferring annotations with sequence searching, or traditional ML like random forests - there would be less need for these benchmark studies.
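
For what a "sensible baseline" might look like in practice, here is a sketch of a random forest fitted to simple hand-crafted features with scikit-learn. The features and data below are random placeholders rather than a recommendation for any particular task; the point is only how little code a traditional ML comparison requires.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Placeholder data: in a real comparison X would hold simple hand-crafted
# features and y the measured property, so the score below is meaningless
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = rng.normal(size=500)

baseline = RandomForestRegressor(n_estimators=200, random_state=0)
scores = cross_val_score(baseline, X, y, cv=5, scoring="r2")
print(f"Baseline R^2: {scores.mean():.2f} +/- {scores.std():.2f}")
```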

The third solution is to run blind community assessments where predictions are analysed independently on unseen data. Again, these are high-effort enterprises and the organisers deserve praise. Let's be honest, DeepMind wouldn't have dived so deep into protein structure prediction if there hadn't been a respected and challenging competition[4] like CASP to conclusively demonstrate success on. There is a downside to a competition gaining wider attention though, shown by DeepMind choosing not to enter CASP15. They presumably understood that you get one CASP to show a bright new idea before everyone adopts the method and adds small improvements with ensembling, sampling and multiple sequence alignment generation strategies. DeepMind released an updated model and CASP15 predictions after the assessment was over, which somewhat defeats the point. CASP15 was still interesting, of course, with blind prediction showing relatively poor performance from large language models despite bold claims in various papers beforehand. There are interesting ways to run these assessments that make use of passive data expansion rather than actively generating new data - two that spring to mind are the Critical Assessment of protein Function Annotation algorithms (CAFA), which collects predictions and assesses them against the annotations available at a later date, and CAMEO, which has registered methods predict structures for sequences immediately before the corresponding experimental structures are released into the Protein Data Bank.

The fourth solution is to get rid of the requirement that a paper has to demonstrate state of the art performance to be considered noteworthy. There have been a number of influential methods that don't work very well (Ingraham et al. (2019), AlQuraishi (2019)); it's time for the field to be more open to discussing this and to make space for such work. Freeing up authors to show "ball park" performance would hopefully encourage simpler methods and more discussion of limitations. Tweet threads from authors introducing papers are often enlightening because the relative informality of the medium encourages an honest discussion of shortcomings. Interactions between authors and peer reviewers have a much more defensive tone, even about the same content, and journals could be bolder in trying to change this norm. Twitter, Discord, Slack and the rest are turning science into one perennial, asynchronous conference and scientists should lean in to the honesty this can facilitate[5].

Finally, the golden rule of ML needs to be respected: it's all in the data. As is usually the case when biological data is on the scene, incorrect understanding or use of data is at the root of most issues. ML experts entering the field should spend at least as much time immersing themselves in the data as designing the network. It isn't always riveting, or the world would have more bioinformaticians. Flashy presentations at top-tier ML conferences should dive into the data and validation as much as explaining how equivariance gives them the killer inductive bias. In fact, a focus on the data can also improve performance. It is tempting to look at recent scaling laws from language models and assume that all biological problems can be solved with more parameters. Biology is a data-limited discipline though - protein sequences being a notable exception - and it is no coincidence that the most successful NNs in biology have emerged when the authors immerse themselves in the complexities of the data.

I do actually like machine learning

Please don't take this post as a complaint about the recent hype for ML in structural biology, or as a claim that there haven't been huge advances. There undeniably have, and there is a whole separate post to be written about remnants of the old guard denying the turning of the tide. It's just hard to tell what is good from what is not, and I have found myself assuming until proven otherwise that any new paper in front of me suffers from doing things the easy way.

[1] And non-trivial, given how the importance of properties like energy conservation is argued over in the molecular dynamics field. Average time to first NaN is the first bar to clear, but is too embarrassing a number to make it into many studies.
[2] Figure 2 of this paper should be shown to every ML expert first encountering protein structure data. I'm considering getting a tattoo.
[3] 95% is likely too high when using input data from sequence families, which show variation in secondary structure content.
[4] CASP is an assessment, but any assessment worth coming top of becomes a competition.
[5] This post would have to be heavily sanitised if it were to appear in a journal, for example.
Let me know what you think on this Twitter thread.

References
