Abstract:
Proteins interact with their molecular partners, such as small molecules, nucleic acids, and other proteins, via their surfaces. Protein-protein interactions are likely to be influenced by the geometrical and physicochemical properties of the surfaces of the interacting proteins. To exploit this for protein-protein interface prediction, we extend BIPSPI, an XGBoost-based partner-specific protein interface predictor, with geometrical and chemical features extracted as surface patches by MaSIF, a framework for extracting meaningful features from protein surfaces. We construct a map from the surface-patch-level representation produced by MaSIF to the residue-pair representation used by BIPSPI. We show that adding internally sorted protein surface patches to BIPSPI’s existing residue-pair representation increases the predictor's mean ROC-AUC from 0.9153 to 0.9222 under 10-fold cross-validation on a subset of Docking Benchmark v5.5. We also evaluate the relative impact of the training features on the combined model's performance, measured as loss reduction over tree splits. We observe that sorting protein surface patches internally along the feature axes increases model performance and alters the relative impacts of the features. Furthermore, to reduce memory consumption when training with protein surface patches, we develop both principal component analysis-based and autoencoder-based approaches to patch compression. We observe that both methods perform competitively when trained with sorted patches but not with unsorted patches.
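As a rough illustration of the patch sorting and principal component analysis-based compression mentioned above, the following minimal sketch uses NumPy on random data. It assumes that each patch is an array of shape (points, features) and that "sorting internally along the feature axes" means sorting each feature column independently across the points of a patch; the abstract does not spell out either detail. The function names, array sizes, and number of components are hypothetical and are not taken from BIPSPI or MaSIF.

```python
import numpy as np

def sort_patch(patch):
    """Sort each feature column of one patch independently (assumed meaning of
    'internally sorted'); patch has shape (n_points, n_features)."""
    return np.sort(patch, axis=0)

def fit_pca(flat, n_components):
    """Fit PCA via SVD on flattened patches of shape (n_patches, n_values)."""
    mean = flat.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(flat - mean, full_matrices=False)
    return mean, vt[:n_components]

def compress(flat, mean, components):
    """Project flattened patches onto the retained principal directions."""
    return (flat - mean) @ components.T

# Hypothetical stand-in data: 50 patches, 20 points per patch, 5 features per point.
rng = np.random.default_rng(0)
patches = rng.normal(size=(50, 20, 5))

sorted_patches = np.stack([sort_patch(p) for p in patches])
flat = sorted_patches.reshape(len(sorted_patches), -1)  # (50, 100)

mean, components = fit_pca(flat, n_components=8)
codes = compress(flat, mean, components)
print(codes.shape)  # (50, 8)
```

Under this reading, sorting makes a patch's representation independent of the order in which its points are listed, which may be one reason a linear compression such as PCA behaves differently on sorted and unsorted patches.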