Please use this identifier to cite or link to this item: http://hdl.handle.net/20.500.11861/7426
DC Field | Value | Language
dc.contributor.author | Li, Hongjian | en_US
dc.contributor.author | Peng, Jiangjun | en_US
dc.contributor.author | Leung, Yee | en_US
dc.contributor.author | Prof. LEUNG Kwong Sak | en_US
dc.contributor.author | Wong, Man-Hon | en_US
dc.contributor.author | Lu, Gang | en_US
dc.contributor.author | Ballester, Pedro J. | en_US
dc.date.accessioned | 2023-02-22T10:54:17Z | -
dc.date.available | 2023-02-22T10:54:17Z | -
dc.date.issued | 2018 | -
dc.identifier.citation | Biomolecules, 2018, vol. 8(1), 12 | en_US
dc.identifier.uri | http://hdl.handle.net/20.500.11861/7426 | -
dc.description.abstract | It has recently been claimed that the outstanding performance of machine-learning scoring functions (SFs) is exclusively due to the presence of training complexes with highly similar proteins to those in the test set. Here, we revisit this question using 24 similarity-based training sets, a widely used test set, and four SFs. Three of these SFs employ machine learning instead of the classical linear regression approach of the fourth SF (X-Score which has the best test set performance out of 16 classical SFs). We have found that random forest (RF)-based RF-Score-v3 outperforms X-Score even when 68% of the most similar proteins are removed from the training set. In addition, unlike X-Score, RF-Score-v3 is able to keep learning with an increasing training set size, becoming substantially more predictive than X-Score when the full 1105 complexes are used for training. These results show that machine-learning SFs owe a substantial part of their performance to training on complexes with dissimilar proteins to those in the test set, against what has been previously concluded using the same data. Given that a growing amount of structural and interaction data will be available from academic and industrial sources, this performance gap between machine-learning SFs and classical SFs is expected to enlarge in the future. | en_US
dc.language.iso | en | en_US
dc.relation.ispartof | Biomolecules | en_US
dc.title | The Impact of Protein Structure and Sequence Similarity on the Accuracy of Machine-Learning Scoring Functions for Binding Affinity Prediction | en_US
dc.type | Peer Reviewed Journal Article | en_US
dc.identifier.doi | 10.3390/biom8010012 | -
item.fulltext | No Fulltext | -
crisitem.author.dept | Department of Applied Data Science | -
Appears in Collections: Applied Data Science - Publication

Scopus citations: 51 (checked on Nov 17, 2024)

Page view(s): 64; last week: 0 (checked on Nov 21, 2024)


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.