A new approach for feature selection in molecular systems


A recent study by Romina Wild and SISSA Professor Alessandro Laio, along with Felix Wodaczek, Vittorio Del Tatto, and Bingqing Cheng, published in the journal Nature Communications, introduces a new method for the automatic selection and balancing of features in molecular systems: Differentiable Information Imbalance (DII).

Feature selection is a crucial step in data analysis and machine learning, as it aims to identify the most relevant variables for describing a complex system. This process reduces model complexity, improving performance by eliminating redundant or irrelevant information. In molecular contexts, features can include variables such as distances between atoms, bond angles, or other chemical-physical properties that describe the structure and behaviour of a molecule.

However, feature selection poses several challenges: determining the optimal number of features, reconciling different units of measurement, and assessing the relative importance of the variables are all non-trivial problems. The DII method addresses them by evaluating the informational content of the candidate features and automatically assigning each one a weight. An optimization algorithm (gradient descent) adjusts these weights so that the resulting description of the system is both compact (low-dimensional) and easily interpretable.
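
To make the idea concrete, the snippet below gives a minimal, self-contained sketch of how a differentiable, soft-neighbour version of the information imbalance can be minimized by gradient descent over feature weights. It is not the authors' implementation (that lives in DADApy): the function name soft_information_imbalance, the use of PyTorch, the squared-distance simplification, the fixed softmax scale lam, and the toy data are all illustrative assumptions.

```python
import torch

def soft_information_imbalance(X, ranks_B, weights, lam=0.1):
    """Differentiable proxy for an information-imbalance-style loss.

    X       : (N, d) candidate features defining space A
    ranks_B : (N, N) matrix of neighbour ranks in the target space B
    weights : (d,)   feature weights to be optimised
    lam     : softmax scale; smaller values approach a hard nearest-neighbour rule
    """
    Xw = X * weights                                   # scale each feature by its weight
    diff = Xw.unsqueeze(1) - Xw.unsqueeze(0)           # (N, N, d) pairwise differences
    sq_dists = (diff ** 2).sum(-1)                     # squared distances in A (sketch-level simplification)
    sq_dists = sq_dists + torch.eye(X.shape[0]) * 1e9  # exclude self-pairs from the softmax
    c = torch.softmax(-sq_dists / lam, dim=1)          # soft nearest-neighbour weights in A
    N = X.shape[0]
    # Average target-space rank of the soft neighbours, normalised as in the information imbalance
    return (2.0 / N**2) * (c * ranks_B).sum()

# Toy usage: only the first two of five features carry information about the target space B
torch.manual_seed(0)
N, d = 200, 5
X = torch.randn(N, d)
d_B = torch.cdist(X[:, :2], X[:, :2])                  # "ground truth" distances in B
ranks_B = d_B.argsort(dim=1).argsort(dim=1).float()    # rank matrix in B

weights = torch.ones(d, requires_grad=True)
opt = torch.optim.Adam([weights], lr=0.05)
for step in range(300):
    opt.zero_grad()
    loss = soft_information_imbalance(X, ranks_B, weights)
    loss.backward()
    opt.step()

# Weights of the three uninformative features should shrink relative to the first two
print((weights.abs() / weights.abs().max()).detach())
```

In this sketch, lam plays the role of a temperature: as it approaches zero, the soft neighbour assignment reduces to picking the single nearest neighbour, and the loss approaches a hard, rank-based (non-differentiable) measure; keeping it finite is what makes gradient-based optimization of the weights possible.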

The study demonstrates the effectiveness of the DII method in two case studies: identifying collective variables that describe the conformations of biomolecules and selecting features for training machine-learning-based force fields. The method is implemented in the Python library DADApy. This work highlights the usefulness of DII in overcoming feature-selection challenges across various application areas.


Read the full paper: https://www.nature.com/articles/s41467-024-55449-7