The Impact of Protein Structure and Sequence Similarity on the Accuracy of Machine-Learning Scoring Functions for Binding Affinity Prediction

Li, Hongjian; Peng, Jiangjun; Leung, Yee; Prof. LEUNG Kwong Sak; Wong, Man-Hon; Lu, Gang; Ballester, Pedro J.

Please use this identifier to cite or link to this item: http://hdl.handle.net/20.500.11861/7426

DC Field	Value	Language
dc.contributor.author	Li, Hongjian	en_US
dc.contributor.author	Peng, Jiangjun	en_US
dc.contributor.author	Leung, Yee	en_US
dc.contributor.author	Prof. LEUNG Kwong Sak	en_US
dc.contributor.author	Wong, Man-Hon	en_US
dc.contributor.author	Lu, Gang	en_US
dc.contributor.author	Ballester, Pedro J.	en_US
dc.date.accessioned	2023-02-22T10:54:17Z	-
dc.date.available	2023-02-22T10:54:17Z	-
dc.date.issued	2018	-
dc.identifier.citation	Biomolecules, 2018, vol. 8(1), 12	en_US
dc.identifier.issn	2218-273X	-
dc.identifier.uri	http://hdl.handle.net/20.500.11861/7426	-
dc.description.abstract	It has recently been claimed that the outstanding performance of machine-learning scoring functions (SFs) is exclusively due to the presence of training complexes with highly similar proteins to those in the test set. Here, we revisit this question using 24 similarity-based training sets, a widely used test set, and four SFs. Three of these SFs employ machine learning instead of the classical linear regression approach of the fourth SF (X-Score, which has the best test set performance out of 16 classical SFs). We have found that random forest (RF)-based RF-Score-v3 outperforms X-Score even when 68% of the most similar proteins are removed from the training set. In addition, unlike X-Score, RF-Score-v3 is able to keep learning with an increasing training set size, becoming substantially more predictive than X-Score when the full 1105 complexes are used for training. These results show that machine-learning SFs owe a substantial part of their performance to training on complexes with dissimilar proteins to those in the test set, against what has been previously concluded using the same data. Given that a growing amount of structural and interaction data will be available from academic and industrial sources, this performance gap between machine-learning SFs and classical SFs is expected to enlarge in the future.	en_US
dc.language.iso	en	en_US
dc.relation.ispartof	Biomolecules	en_US
dc.title	The Impact of Protein Structure and Sequence Similarity on the Accuracy of Machine-Learning Scoring Functions for Binding Affinity Prediction	en_US
dc.type	Peer Reviewed Journal Article	en_US
dc.identifier.doi	10.3390/biom8010012	-
item.fulltext	No Fulltext	-
crisitem.author.dept	Department of Applied Data Science	-
Appears in Collections:	Applied Data Science - Publication
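
The similarity-based training sets mentioned in the abstract can be illustrated with a minimal sketch. It assumes each training complex already carries a precomputed similarity score to its closest test-set protein. The names `Complex`, `max_test_similarity`, and `build_training_sets`, the identifiers, and the cutoff values are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Complex:
    pdb_id: str                 # hypothetical identifier, for illustration only
    max_test_similarity: float  # assumed precomputed similarity to the closest test protein, in [0, 1]


def build_training_sets(complexes, cutoffs):
    """Map each cutoff to the complexes whose similarity to the closest
    test-set protein does not exceed that cutoff (nested training sets)."""
    return {
        cutoff: [c for c in complexes if c.max_test_similarity <= cutoff]
        for cutoff in sorted(cutoffs)
    }


# Toy data: four made-up training complexes with made-up similarity scores.
train = [
    Complex("aaaa", 0.95),
    Complex("bbbb", 0.60),
    Complex("cccc", 0.30),
    Complex("dddd", 0.10),
]

sets = build_training_sets(train, cutoffs=[0.3, 0.6, 1.0])
for cutoff, members in sets.items():
    print(cutoff, [m.pdb_id for m in members])
```

Each scoring function would then be trained on every set in turn and evaluated on the same fixed test set, so that the effect of removing the most similar proteins can be read off as the cutoff decreases.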

SCOPUS citations: 53 (checked on Jun 1, 2025)
Page view(s): 89 (last week: 0; checked on Jun 8, 2025)
