Property Selection¶

This examples shows how to only compute a subset of properties for each sample with a given rascaline representation. In particular, we will use the SOAP power spectrum representation, and select the most significant features within a single block using farthest point sampling (FPS). We will run the calculation for all atomic systems stored in a file, the path to which should be given as the first command line argument.

This is useful if we are interested in the contribution of individual features to the result, or if we want to reduce the computational cost by using only part of the features for our model.

The first part of this example repeats the Computing SOAP features, so we suggest that you read it initially.

We will use the implementation of Farthest Point Sampling from scikit-matter, if you want to learn more have a look at the scikit-matter documentation.

You can obtain a testing dataset from our website.

import chemfiles
import numpy as np
from metatensor import Labels, MetatensorError, TensorBlock, TensorMap
from skmatter.feature_selection import FPS

from rascaline import SoapPowerSpectrum

First we load the dataset with chemfiles

with chemfiles.Trajectory("dataset.xyz") as trajectory:
    frames = [f for f in trajectory]

and define the hyper parameters of the representation

HYPER_PARAMETERS = {
    "cutoff": 5.0,
    "max_radial": 6,
    "max_angular": 4,
    "atomic_gaussian_width": 0.3,
    "center_atom_weight": 1.0,
    "radial_basis": {
        "Gto": {},
    },
    "cutoff_function": {
        "ShiftedCosine": {"width": 0.5},
    },
}

calculator = SoapPowerSpectrum(**HYPER_PARAMETERS)

descriptor = calculator.compute(frames)

The selections for feature can be a set of Labels, in which case the names of the labels must be a subset of the names of the properties produced by the calculator. You can see the default set of names with:

print("property names:", descriptor.property_names)

property names: ['l', 'n_1', 'n_2']

We can use a subset of these names to define a selection. In this case, only properties matching the labels in this selection will be used by rascaline (here, only properties with l = 0 will be used)

selection = Labels(
    names=["l"],
    values=np.array([[0]]),
)
selected_descriptor = calculator.compute(frames, selected_properties=selection)

selected_descriptor = selected_descriptor.keys_to_samples("center_type")
selected_descriptor = selected_descriptor.keys_to_properties(
    ["neighbor_1_type", "neighbor_2_type"]
)

properties = selected_descriptor.block().properties

We expect to get [0] as the list of l properties

print(f"we have the following angular components: {np.unique(properties['l'])}")

we have the following angular components: [0]

The previous selection method uses the same selection for all blocks. If you can to use different selection for different blocks, you should use a TensorMap to create your selection

selected_descriptor = calculator.compute(frames, selected_properties=selection)
descriptor_for_comparison = calculator.compute(
    frames, selected_properties=selected_descriptor
)

The descriptor had 180 properties stored in the first block, the selected_descriptor had 36. So descriptor_for_comparison will also have 36 properties.

print("shape of first block initially:", descriptor.block(0).values.shape)
print("shape of first block of reference:", selected_descriptor.block(0).values.shape)
print(
    "shape of first block after selection:",
    descriptor_for_comparison.block(0).values.shape,
)

shape of first block initially: (420, 180)
shape of first block of reference: (420, 36)
shape of first block after selection: (420, 36)

The TensorMap format allows us to select different features within each block, and then construct a general matrix of features. We can select the most significant features using FPS, which selects features based on the distance between them. The following code snippet selects the 10 most important features in each block, then constructs a TensorMap containing this selection, and calculates the final matrix of features for it.

def fps_feature_selection(descriptor, n_to_select):
    """
    Select ``n_to_select`` features block by block in the ``descriptor``, using
    Farthest Point Sampling to do the selection; and return a ``TensorMap`` with
    the right structure to be used as properties selection with rascaline calculators
    """
    blocks = []
    for block in descriptor:
        # create a separate FPS selector for each block
        fps = FPS(n_to_select=n_to_select)
        mask = fps.fit(block.values).get_support()
        selected_properties = Labels(
            names=block.properties.names,
            values=block.properties.values[mask],
        )
        # The only important data here is the properties, so we create empty
        # sets of samples and components.
        blocks.append(
            TensorBlock(
                values=np.empty((1, len(selected_properties))),
                samples=Labels.single(),
                components=[],
                properties=selected_properties,
            )
        )

    return TensorMap(descriptor.keys, blocks)

We can then apply this function to subselect according to the data contained in a descriptor

selection = fps_feature_selection(descriptor, n_to_select=10)

and use the selection with rascaline, potentially running the calculation on a different set of systems

selected_descriptor = calculator.compute(frames, selected_properties=selection)

Note that in this case it is no longer possible to have a single feature matrix, because each block will have its own properties.

try:
    selected_descriptor.keys_to_samples("center_type")
except MetatensorError as err:
    print(err)

invalid parameter: can not move keys to samples if the blocks have different property labels