Journal of Molecular Biology

Vol. 57 (2023), No. 1, pp. 136-145; DOI: 10.1134/S0026893323010089

Yu.V. Milchevskiy¹*, V.Yu. Milchevskaya¹,², Yu.V. Kravatsky¹,³

Method to Generate Complex Predictive Features for Machine Learning-Based Prediction of the Local Structure and Functions of Proteins
¹Engelhardt Institute of Molecular Biology, Russian Academy of Sciences, Moscow, 119991 Russia
²Institute of Medical Statistics and Bioinformatics, Faculty of Medicine, University of Cologne, Cologne, 50931 Germany
³Center for Precision Genome Editing and Genetic Technologies for Biomedicine, Engelhardt Institute of Molecular Biology, Russian Academy of Sciences, Moscow, 119991 Russia

*milch@eimb.ru
Received: 2022-06-10; Revised: 2022-07-31; Accepted: 2022-09-01

The prediction of protein structure and function from sequence has recently improved rapidly. This progress is primarily due to machine learning methods, many of which rely on the predictive features supplied to them. It is thus crucial to retrieve the information encoded in the amino acid sequence of a protein. Here we propose a method to generate a set of complex yet interpretable predictors, which helps reveal the factors that influence protein conformation. The method makes it possible to generate predictive features and test their significance both for a general description of protein structures and functions and for highly specific predictive tasks. Having generated an exhaustive set of predictors, we narrow it down to a smaller curated set of informative features using feature selection methods, which improves the performance of subsequent predictive modelling. We illustrate the efficiency of our methodology by applying it to local protein structure prediction, where the rate of correct prediction for DSSP Q3 (three-class classification) is 81.3%. The method is implemented in C++ for command-line use and can be run on any operating system. The source code is available on GitHub at https://github.com/Milchevskiy/protein-encoding-projects.

Keywords: local structure prediction, protein secondary structure prediction, protein function, protein sequence encoding, protein conformation, stepwise regression, stepwise discriminant analysis
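
For readers unfamiliar with the Q3 score quoted in the abstract, the following minimal C++ sketch shows how such a rate of correct prediction is commonly computed. It is an illustration only and is not taken from the authors' repository: the function names `reduce_dssp_to_q3` and `q3_accuracy` are hypothetical, and the reduction of the eight DSSP states to three classes (H, G, I to helix; E, B to strand; all others to coil) is one widely used convention assumed here, which may differ from the mapping used in the paper.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Reduce a DSSP eight-state code to three classes.
// ASSUMPTION: H, G, I -> 'H' (helix); E, B -> 'E' (strand);
// every other symbol (T, S, blank, ...) -> 'C' (coil).
char reduce_dssp_to_q3(char dssp_state) {
    switch (dssp_state) {
        case 'H': case 'G': case 'I':
            return 'H';
        case 'E': case 'B':
            return 'E';
        default:
            return 'C';
    }
}

// Q3 accuracy: the fraction of residues whose predicted three-state
// label equals the reduced DSSP label at the same position.
// predicted_q3   : one of 'H', 'E', 'C' per residue
// dssp_reference : one DSSP eight-state symbol per residue
double q3_accuracy(const std::string& predicted_q3,
                   const std::string& dssp_reference) {
    if (predicted_q3.size() != dssp_reference.size()) {
        throw std::invalid_argument("sequences must have equal length");
    }
    if (predicted_q3.empty()) {
        throw std::invalid_argument("sequences must not be empty");
    }
    std::size_t correct = 0;
    for (std::size_t i = 0; i < predicted_q3.size(); ++i) {
        if (predicted_q3[i] == reduce_dssp_to_q3(dssp_reference[i])) {
            ++correct;
        }
    }
    return static_cast<double>(correct) /
           static_cast<double>(predicted_q3.size());
}
```

Under these assumptions, calling `q3_accuracy("HHEC", "HHCC")` compares four residues, three of which agree, so the score is 0.75. A reported value of 81.3% corresponds to the same ratio computed over all residues of a test set.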