First descriptor computation

This is an introduction to the rascaline interface using a molecular crystals dataset using the Python interface. If you are interested in another programming language we recommend you first follow this tutorial and afterward take a look at the how-to guide on Computing SOAP features.

The dataset

The atomic configurations used in our documentation are a small subset of the ShiftML2 dataset containing molecular crystals. There are four crystals - one with each of the elements [hydrogen, carbon], [hydrogen, carbon, nitrogen, oxygen], [hydrogen, carbon, nitrogen], or [hydrogen, carbon, oxygen]. Each crystal has 10 structures, also denoted by frames, attributed to it. The first frame of each crystal structure is the geometry-optimized frame. The following 9 frames contain atoms that are slightly displaced from the geometry-optimized frame. You can obtain the dataset from our website.

We will start by importing all the required packages: the classic numpy; chemfiles to load data, and rascaline to compute representations. Afterward we will load the dataset using chemfiles.

import chemfiles
import numpy as np

from rascaline import SphericalExpansion


with chemfiles.Trajectory("dataset.xyz") as trajectory:
    frames = [f for f in trajectory]

print(f"The dataset contains {len(frames)} frames.")
The dataset contains 40 frames.

We will not explain here how to use chemfiles in detail, as we only use a few functions. Briefly, chemfiles.Trajectory loads structure data in a format rascaline can use. If you want to learn more about the possibilities take a look at the chemfiles documentation.

Let us now take a look at the first frame of the dataset.

frame0 = frames[0]

print(frame0)
Frame with 20 atoms

With frame0.atoms we get a list of the atoms that make up frame zero. The name attribute gives us the name of the specified atom.

elements, counts = np.unique([atom.name for atom in frame0.atoms], return_counts=True)

print(
    f"The first frame contains "
    f"{counts[0]} {elements[0]}-atoms, "
    f"{counts[1]} {elements[1]}-atoms, "
    f"{counts[2]} {elements[2]}-atoms and "
    f"{counts[3]} {elements[3]}-atoms."
)
The first frame contains 4 C-atoms, 8 H-atoms, 4 N-atoms and 4 O-atoms.

Calculate a descriptor

We will now calculate an atomic descriptor for this structure using the SOAP spherical expansion as introduced by Bartók, Kondor, and Csányi.

To do so we define below a set of parameters telling rascaline how the spherical expansion should be calculated. These parameters are also called hyper parameters since they are parameters of the representation, in opposition to parameters of machine learning models. Hyper parameters are a crucial part of calculating descriptors. Poorly selected hyper parameters will lead to a poor description of your dataset as discussed in the literature. The effect of changing some hyper parameters is discussed in a second tutorial.

HYPER_PARAMETERS = {
    "cutoff": 4.5,
    "max_radial": 9,
    "max_angular": 6,
    "atomic_gaussian_width": 0.3,
    "center_atom_weight": 1.0,
    "radial_basis": {"Gto": {"spline_accuracy": 1e-6}},
    "cutoff_function": {"ShiftedCosine": {"width": 0.5}},
    "radial_scaling": {"Willatt2018": {"scale": 2.0, "rate": 1.0, "exponent": 4}},
}

After we set the hyper parameters we initialize a rascaline.calculators.SphericalExpansion object with hyper parameters defined above and run the rascaline.calculators.CalculatorBase.compute() method.

calculator = SphericalExpansion(**HYPER_PARAMETERS)
descriptor0 = calculator.compute(frame0)
print(type(descriptor0))
<class 'metatensor.tensor.TensorMap'>

The descriptor format is a metatensor.TensorMap object. Metatensor is like numpy for storing representations of atomistic ML data. Extensive details on the metatensor are covered in the corresponding documentation.

We will now have a look at how the data is stored inside metatensor.TensorMap objects.

print(descriptor0)
TensorMap with 112 blocks
keys: o3_lambda  o3_sigma  center_type  neighbor_type
          0         1           1             1
          1         1           1             1
          2         1           1             1
          3         1           1             1
          4         1           1             1
          5         1           1             1
          6         1           1             1
          0         1           1             6
          1         1           1             6
          2         1           1             6
          3         1           1             6
          4         1           1             6
          5         1           1             6
          6         1           1             6
          0         1           1             7
          1         1           1             7
          2         1           1             7
          3         1           1             7
          4         1           1             7
          5         1           1             7
          6         1           1             7
          0         1           1             8
          1         1           1             8
          2         1           1             8
          3         1           1             8
          4         1           1             8
          5         1           1             8
          6         1           1             8
          0         1           6             1
          1         1           6             1
          2         1           6             1
          3         1           6             1
          4         1           6             1
          5         1           6             1
          6         1           6             1
          0         1           6             6
          1         1           6             6
          2         1           6             6
          3         1           6             6
          4         1           6             6
          5         1           6             6
          6         1           6             6
          0         1           6             7
          1         1           6             7
          2         1           6             7
          3         1           6             7
          4         1           6             7
          5         1           6             7
          6         1           6             7
          0         1           6             8
          1         1           6             8
          2         1           6             8
          3         1           6             8
          4         1           6             8
          5         1           6             8
          6         1           6             8
          0         1           7             1
          1         1           7             1
          2         1           7             1
          3         1           7             1
          4         1           7             1
          5         1           7             1
          6         1           7             1
          0         1           7             6
          1         1           7             6
          2         1           7             6
          3         1           7             6
          4         1           7             6
          5         1           7             6
          6         1           7             6
          0         1           7             7
          1         1           7             7
          2         1           7             7
          3         1           7             7
          4         1           7             7
          5         1           7             7
          6         1           7             7
          0         1           7             8
          1         1           7             8
          2         1           7             8
          3         1           7             8
          4         1           7             8
          5         1           7             8
          6         1           7             8
          0         1           8             1
          1         1           8             1
          2         1           8             1
          3         1           8             1
          4         1           8             1
          5         1           8             1
          6         1           8             1
          0         1           8             6
          1         1           8             6
          2         1           8             6
          3         1           8             6
          4         1           8             6
          5         1           8             6
          6         1           8             6
          0         1           8             7
          1         1           8             7
          2         1           8             7
          3         1           8             7
          4         1           8             7
          5         1           8             7
          6         1           8             7
          0         1           8             8
          1         1           8             8
          2         1           8             8
          3         1           8             8
          4         1           8             8
          5         1           8             8
          6         1           8             8

The metatensor.TensorMap is structured in several instances of an metatensor.TensorBlock. To distinguish the block each block is associated with a unique key. For the current example, we have one block for each angular channel labeled by o3_lambda, the central atom type center_type and neighbor atom type labeled by neighbor_type. Different atomic types are represented using their atomic number, e.g. 1 for hydrogen, 6 for carbon, etc. To summarize, this descriptor contains 112 blocks covering all combinations of the angular channels of the central and neighbor atom types in our dataset.

Let us take a look at the second block (at index 1) in detail. This block contains the descriptor for the \(l=1\) angular channel for hydrogen-hydrogen pairs.

LabelsEntry(o3_lambda=1, o3_sigma=1, center_type=1, neighbor_type=1)

The descriptor values

The values of the representation are stored as an array. Each entry in this array also has associated unique metadata as each block. For the spherical expansion calculator used in this tutorial the values have three dimensions which we can verify from the .shape attribute.

(8, 3, 9)

The descriptor values

The first dimension is denoted by the samples, the intermediate dimension by components, and the last dimension by properties. The “sample dimension” has a length of eight because we have eight hydrogen atoms in the first frame. We can reveal more detailed metadata information about the sample-dimension printing of the metatensor.TensorBlock.samples attribute of the block

Labels(
    system  atom
      0      12
      0      13
        ...
      0      18
      0      19
)

The result is an metatensor.TensorMap instance. It contains in total eight tuples each with two values. The tuple values are named as follows

['system', 'atom']

Meaning that the first entry of each tuple indicates the _structure_, which is 0 for all because we only computed the representation of a single frame. The second entry of each tuple refers to the index of the _center_ atom.

We can do a similar investigation for the second dimension: the metatensor.TensorBlock.components.

[Labels(
    o3_mu
     -1
      0
      1
)]

Here, the components are associated with the angular channels of the representation. The size of o3_mu is \(2l + 1\), where \(l\) is the current o3_lambda of the block. Here, its dimension is three because we are looking at the o3_lambda=1 block. You may have noticed that the return value of the last call is a list of metatensor.Labels and not a single Labels instance. The reason is that a block can have several component dimensions as we will see below for the gradients.

The last value represents the number of radial channels. For the metatensor.TensorBlock.properties dimension we find an object

Labels(
    n
    0
    1
     ...
    7
    8
)

containing a tuple of only one value ranging from 0 to 8. The name of this entry is

['n']

and denoting the radial channels. The range results from our choice of max_radial = 9 in the hyper parameters above.

After looking at the metadata we can investigate the actual data of the representation in more details

print(block.values[0, 0, :])
[-0.00102818  0.00860425 -0.02339817 -0.11691264 -0.04411467 -0.02857226
 -0.02109534 -0.00712789 -0.00230597]

By using [0, 0, :] we selected the first hydrogen and the first m channel. As you the output shows the values are floating point numbers between -1.0 and 1.0. Values in this range are reasonable and can be directly used as input for a machine learning algorithm.

Rascaline is also able to process more than one structure within one function call. You can process a whole dataset with

(420, 1, 9)

Now, the 0th block of the metatensor.TensorMap contains not eight but 420 entries in the first dimensions. This reflects the fact that in total we have 420 hydrogen atoms in the whole dataset.

If you want to use another calculator instead of rascaline.calculators.SphericalExpansion shown here check out the Reference guides section.

Computing gradients

Additionally, rascaline is also able to calculate gradients on top of the values. Gradients are useful for constructing an ML potential and running simulations. For example gradients of the representation with respect to atomic positions can be calculated by setting the gradients parameter of the rascaline.calculators.CalculatorBase.compute() method to ["positions"].

(52, 3, 1, 9)

The calculated descriptor contains the values and in each block the associated position gradients as an metatensor.block.Gradient instance. The actual values are stored in the data attribute. Similar to the features the gradient data also has associated metadata. But, compared to the values were we found three dimensions, and gradients have four. Again the first is called samples and the properties. The dimensions between the sample and property dimensions are denoted by components.

Looking at the shape in more detail we find that we have 52 samples, which is much more compared to features where we only have eight samples. This arises from the fact that we calculate the position gradient for each pair in the structure. For our selected block these are all hydrogen-hydrogen pairs. Naively one would come up with 8 * 8 = 64 samples, but rascaline already ignores pairs that are outside of the cutoff radius. Their position gradient is always zero. The metatensor.block.Gradient.samples attribute shows this in detail.

Labels(
    sample  system  atom
      0       0      12
      0       0      13
            ...
      7       0      18
      7       0      19
)

Note that we have a tuple of three with the names

['sample', 'system', 'atom']

In the above output of the Labels instance for example the (2, 0, 17) entry is missing indicating that this pair is outside of the cutoff.

Now looking at the metatensor.block.Gradient.components

[Labels(
    xyz
     0
     1
     2
), Labels(
    o3_mu
      0
)]

we find two of them. Besides the o3_mu component that is also present in the features position gradients also have a component indicating the direction of the gradient vector.

Finally, the metatensor.block.Gradient.properties dimension is the same as for the values

Labels(
    n
    0
    1
     ...
    7
    8
)

Rascaline can also calculate gradients with respect to the strain (i.e. the virial). For this, you have to add "strain" to the list parsed to the gradients parameter of the rascaline.calculators.CalculatorBase.compute() method. Strain gradients/virial are useful when computing the stress and the pressure.

If you want to know about the effect of changing hypers take a look at the next tutorial. If you want to solve an explicit problem our How-to guides might help you.

Gallery generated by Sphinx-Gallery