Substituting random forest for multiple linear regression improves binding affinity prediction of scoring functions: Cyscore as a case study

Li, Hongjian; Prof. LEUNG Kwong Sak; Wong, Man-Hon; Ballester, Pedro J

Please use this identifier to cite or link to this item: http://hdl.handle.net/20.500.11861/7494

DC Field	Value	Language
dc.contributor.author	Li, Hongjian	en_US
dc.contributor.author	Prof. LEUNG Kwong Sak	en_US
dc.contributor.author	Wong, Man-Hon	en_US
dc.contributor.author	Ballester, Pedro J	en_US
dc.date.accessioned	2023-03-16T03:06:48Z	-
dc.date.available	2023-03-16T03:06:48Z	-
dc.date.issued	2014	-
dc.identifier.citation	BMC Bioinformatics, 2014, vol.15 (291)	en_US
dc.identifier.uri	http://hdl.handle.net/20.500.11861/7494	-
dc.description.abstract	Background: State-of-the-art protein-ligand docking methods are generally limited by the traditionally low accuracy of their scoring functions, which are used to predict binding affinity and thus vital for discriminating between active and inactive compounds. Despite intensive research over the years, classical scoring functions have reached a plateau in their predictive performance. These assume a predetermined additive functional form for some sophisticated numerical features, and use standard multivariate linear regression (MLR) on experimental data to derive the coefficients. Results: In this study we show that such a simple functional form is detrimental for the prediction performance of a scoring function, and replacing linear regression by machine learning techniques like random forest (RF) can improve prediction performance. We investigate the conditions of applying RF under various contexts and find that given sufficient training samples RF manages to comprehensively capture the non-linearity between structural features and measured binding affinities. Incorporating more structural features and training with more samples can both boost RF performance. In addition, we analyze the importance of structural features to binding affinity prediction using the RF variable importance tool. Lastly, we use Cyscore, a top performing empirical scoring function, as a baseline for comparison study. Conclusions: Machine-learning scoring functions are fundamentally different from classical scoring functions because the former circumvents the fixed functional form relating structural features with binding affinities. RF, but not MLR, can effectively exploit more structural features and more training samples, leading to higher prediction performance. The future availability of more X-ray crystal structures will further widen the performance gap between RF-based and MLR-based scoring functions. This further stresses the importance of substituting RF for MLR in scoring function development.	en_US
dc.language.iso	en	en_US
dc.relation.ispartof	BMC Bioinformatics	en_US
dc.title	Substituting random forest for multiple linear regression improves binding affinity prediction of scoring functions: Cyscore as a case study	en_US
dc.type	Peer Reviewed Journal Article	en_US
dc.identifier.doi	10.1186/1471-2105-15-291	-
item.fulltext	No Fulltext	-
crisitem.author.dept	Department of Applied Data Science	-
Appears in Collections:	Applied Data Science - Publication

SCOPUS™ Citations: 83 (checked on Nov 17, 2024)
Page view(s): 33; last week: 0 (checked on Nov 21, 2024)
