|
Vol 57(2023) N 1 p. 136-145; DOI 10.1134/S0026893323010089

Yu.V. Milchevskiy1*, V.Yu. Milchevskaya1,2, Yu.V. Kravatsky1,3

Method to Generate Complex Predictive Features for Machine Learning-Based Prediction of the Local Structure and Functions of Proteins

1Engelhardt Institute of Molecular Biology, Russian Academy of Sciences, Moscow, 119991 Russia
2Institute of Medical Statistics and Bioinformatics, Faculty of Medicine, University of Cologne, Cologne, 50931 Germany
3Center for Precision Genome Editing and Genetic Technologies for Biomedicine, Engelhardt Institute of Molecular Biology, Russian Academy of Sciences, Moscow, 119991 Russia

*milch@eimb.ru

Received - 2022-06-10; Revised - 2022-07-31; Accepted - 2022-09-01

Prediction of the structure and function of a protein from its sequence has recently improved rapidly. This is primarily due to the application of machine learning methods, many of which rely on the predictive features supplied to them. It is thus crucial to retrieve the information encoded in the amino acid sequence of a protein. Here we propose a method to generate a set of complex yet interpretable predictors, which helps reveal factors that influence protein conformation. The method makes it possible to generate predictive features and test them for significance, both in the context of a general description of protein structures and functions and in the context of highly specific predictive tasks. Having generated an exhaustive set of predictors, we narrow it down to a smaller curated set of informative features using feature selection methods, which increases the performance of subsequent predictive modelling. We illustrate the efficiency of our methodology by applying it to local protein structure prediction, where the rate of correct prediction for DSSP Q3 (three-class classification) is 81.3%. The method is implemented in C++ for command line use and can be run on any operating system.
The source code is released on GitHub at https://github.com/Milchevskiy/protein-encoding-projects.

Keywords: local structure prediction, protein secondary structure prediction, protein function, protein sequence encoding, protein conformation, stepwise regression, stepwise discriminant analysis |
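
For readers unfamiliar with the Q3 figure quoted in the abstract, the following minimal Python sketch shows how a three-class accuracy is typically computed from DSSP assignments. It is not taken from the authors' C++ code. The eight-to-three state reduction used here (H, G, I to helix; E, B to strand; everything else to coil) is one common convention and is an assumption, since the abstract does not state which mapping was used. The function names and the toy strings are hypothetical.

```python
# Assumed reduction of the eight DSSP states to three classes. This is a
# common convention; the mapping used in the paper is not given in the abstract.
DSSP_TO_Q3 = {
    "H": "H", "G": "H", "I": "H",  # helical states -> H
    "E": "E", "B": "E",            # strand and bridge states -> E
}


def reduce_to_three_states(dssp: str) -> str:
    """Map a DSSP string to H/E/C; any unmapped symbol becomes coil (C)."""
    return "".join(DSSP_TO_Q3.get(state, "C") for state in dssp)


def q3(predicted: str, observed: str) -> float:
    """Fraction of residues whose three-state class is predicted correctly."""
    if len(predicted) != len(observed):
        raise ValueError("sequences must have equal length")
    if not observed:
        raise ValueError("sequences must be non-empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


# Toy data invented for illustration only (not from the paper).
observed_dssp = "HHHGTT-EEB"
observed = reduce_to_three_states(observed_dssp)
predicted = "HHHHCCCEEC"
score = q3(predicted, observed)
print(f"Q3 = {score:.1%}")
```

On a real benchmark the same ratio would be accumulated over all residues of all test proteins, which is how a value such as 81.3% is usually reported.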