Property SelectionΒΆ
This examples shows how to only compute a subset of properties for each sample with a given rascaline representation. In particular, we will use the SOAP power spectrum representation, and select the most significant features within a single block using farthest point sampling (FPS). We will run the calculation for all atomic systems stored in a file, the path to which should be given as the first command line argument.
This is useful if we are interested in the contribution of individual features to the result, or if we want to reduce the computational cost by using only part of the features for our model.
The first part of this example repeats the Computing SOAP features, so we suggest that you read it initially.
We will use the implementation of Farthest Point Sampling from scikit-matter, if you want to learn more have a look at the scikit-matter documentation.
You can obtain a testing dataset from our website
.
import chemfiles
import numpy as np
from metatensor import Labels, MetatensorError, TensorBlock, TensorMap
from skmatter.feature_selection import FPS
from rascaline import SoapPowerSpectrum
First we load the dataset with chemfiles
with chemfiles.Trajectory("dataset.xyz") as trajectory:
frames = [f for f in trajectory]
and define the hyper parameters of the representation
HYPER_PARAMETERS = {
"cutoff": 5.0,
"max_radial": 6,
"max_angular": 4,
"atomic_gaussian_width": 0.3,
"center_atom_weight": 1.0,
"radial_basis": {
"Gto": {},
},
"cutoff_function": {
"ShiftedCosine": {"width": 0.5},
},
}
calculator = SoapPowerSpectrum(**HYPER_PARAMETERS)
descriptor = calculator.compute(frames)
The selections for feature can be a set of Labels
, in which case the names
of the labels must be a subset of the names of the properties produced by the
calculator. You can see the default set of names with:
print("property names:", descriptor.property_names)
property names: ['l', 'n_1', 'n_2']
We can use a subset of these names to define a selection. In this case, only
properties matching the labels in this selection will be used by rascaline
(here, only properties with l = 0
will be used)
selection = Labels(
names=["l"],
values=np.array([[0]]),
)
selected_descriptor = calculator.compute(frames, selected_properties=selection)
selected_descriptor = selected_descriptor.keys_to_samples("center_type")
selected_descriptor = selected_descriptor.keys_to_properties(
["neighbor_1_type", "neighbor_2_type"]
)
properties = selected_descriptor.block().properties
We expect to get [0] as the list of l properties
print(f"we have the following angular components: {np.unique(properties['l'])}")
we have the following angular components: [0]
The previous selection method uses the same selection for all blocks. If you
can to use different selection for different blocks, you should use a
TensorMap
to create your selection
selected_descriptor = calculator.compute(frames, selected_properties=selection)
descriptor_for_comparison = calculator.compute(
frames, selected_properties=selected_descriptor
)
The descriptor had 180 properties stored in the first block, the
selected_descriptor had 36. So descriptor_for_comparison
will also have 36
properties.
print("shape of first block initially:", descriptor.block(0).values.shape)
print("shape of first block of reference:", selected_descriptor.block(0).values.shape)
print(
"shape of first block after selection:",
descriptor_for_comparison.block(0).values.shape,
)
shape of first block initially: (420, 180)
shape of first block of reference: (420, 36)
shape of first block after selection: (420, 36)
The TensorMap
format allows us to select different features within each
block, and then construct a general matrix of features. We can select the most
significant features using FPS, which selects features based on the distance
between them. The following code snippet selects the 10 most important
features in each block, then constructs a TensorMap containing this selection,
and calculates the final matrix of features for it.
def fps_feature_selection(descriptor, n_to_select):
"""
Select ``n_to_select`` features block by block in the ``descriptor``, using
Farthest Point Sampling to do the selection; and return a ``TensorMap`` with
the right structure to be used as properties selection with rascaline calculators
"""
blocks = []
for block in descriptor:
# create a separate FPS selector for each block
fps = FPS(n_to_select=n_to_select)
mask = fps.fit(block.values).get_support()
selected_properties = Labels(
names=block.properties.names,
values=block.properties.values[mask],
)
# The only important data here is the properties, so we create empty
# sets of samples and components.
blocks.append(
TensorBlock(
values=np.empty((1, len(selected_properties))),
samples=Labels.single(),
components=[],
properties=selected_properties,
)
)
return TensorMap(descriptor.keys, blocks)
We can then apply this function to subselect according to the data contained in a descriptor
selection = fps_feature_selection(descriptor, n_to_select=10)
and use the selection with rascaline, potentially running the calculation on a different set of systems
selected_descriptor = calculator.compute(frames, selected_properties=selection)
Note that in this case it is no longer possible to have a single feature matrix, because each block will have its own properties.
try:
selected_descriptor.keys_to_samples("center_type")
except MetatensorError as err:
print(err)
invalid parameter: can not move keys to samples if the blocks have different property labels
To be done
To be done
To be done