Note
Go to the end to download the full example code.
First descriptor computation¶
This is an introduction to the rascaline interface using a molecular crystals dataset using the Python interface. If you are interested in another programming language we recommend you first follow this tutorial and afterward take a look at the how-to guide on Computing SOAP features.
The dataset¶
The atomic configurations used in our documentation are a small subset of the
ShiftML2 dataset
containing molecular crystals. There are four crystals - one with each of the
elements [hydrogen, carbon], [hydrogen, carbon, nitrogen, oxygen], [hydrogen,
carbon, nitrogen], or [hydrogen, carbon, oxygen]. Each crystal has 10 structures,
also denoted by frames, attributed to it. The first frame of each crystal structure
is the geometry-optimized frame. The following 9 frames contain atoms that are
slightly displaced from the geometry-optimized frame. You can obtain the dataset
from our website
.
We will start by importing all the required packages: the classic numpy; chemfiles to load data, and rascaline to compute representations. Afterward we will load the dataset using chemfiles.
import chemfiles
import numpy as np
from rascaline import SphericalExpansion
with chemfiles.Trajectory("dataset.xyz") as trajectory:
frames = [f for f in trajectory]
print(f"The dataset contains {len(frames)} frames.")
The dataset contains 40 frames.
We will not explain here how to use chemfiles in detail, as we only use a few
functions. Briefly, chemfiles.Trajectory
loads structure data in a
format rascaline can use. If you want to learn more about the possibilities
take a look at the chemfiles documentation.
Let us now take a look at the first frame of the dataset.
Frame with 20 atoms
With frame0.atoms
we get a list of the atoms that make up frame zero.
The name
attribute gives us the name of the specified atom.
The first frame contains 4 C-atoms, 8 H-atoms, 4 N-atoms and 4 O-atoms.
Calculate a descriptor¶
We will now calculate an atomic descriptor for this structure using the SOAP spherical expansion as introduced by Bartók, Kondor, and Csányi.
To do so we define below a set of parameters telling rascaline how the spherical expansion should be calculated. These parameters are also called hyper parameters since they are parameters of the representation, in opposition to parameters of machine learning models. Hyper parameters are a crucial part of calculating descriptors. Poorly selected hyper parameters will lead to a poor description of your dataset as discussed in the literature. The effect of changing some hyper parameters is discussed in a second tutorial.
HYPER_PARAMETERS = {
"cutoff": 4.5,
"max_radial": 9,
"max_angular": 6,
"atomic_gaussian_width": 0.3,
"center_atom_weight": 1.0,
"radial_basis": {"Gto": {"spline_accuracy": 1e-6}},
"cutoff_function": {"ShiftedCosine": {"width": 0.5}},
"radial_scaling": {"Willatt2018": {"scale": 2.0, "rate": 1.0, "exponent": 4}},
}
After we set the hyper parameters we initialize a
rascaline.calculators.SphericalExpansion
object with hyper parameters
defined above and run the
rascaline.calculators.CalculatorBase.compute()
method.
calculator = SphericalExpansion(**HYPER_PARAMETERS)
descriptor0 = calculator.compute(frame0)
print(type(descriptor0))
<class 'metatensor.tensor.TensorMap'>
The descriptor format is a metatensor.TensorMap
object. Metatensor is
like numpy for storing representations of atomistic ML data. Extensive details
on the metatensor are covered in the corresponding documentation.
We will now have a look at how the data is stored inside
metatensor.TensorMap
objects.
print(descriptor0)
TensorMap with 112 blocks
keys: o3_lambda o3_sigma center_type neighbor_type
0 1 1 1
1 1 1 1
2 1 1 1
3 1 1 1
4 1 1 1
5 1 1 1
6 1 1 1
0 1 1 6
1 1 1 6
2 1 1 6
3 1 1 6
4 1 1 6
5 1 1 6
6 1 1 6
0 1 1 7
1 1 1 7
2 1 1 7
3 1 1 7
4 1 1 7
5 1 1 7
6 1 1 7
0 1 1 8
1 1 1 8
2 1 1 8
3 1 1 8
4 1 1 8
5 1 1 8
6 1 1 8
0 1 6 1
1 1 6 1
2 1 6 1
3 1 6 1
4 1 6 1
5 1 6 1
6 1 6 1
0 1 6 6
1 1 6 6
2 1 6 6
3 1 6 6
4 1 6 6
5 1 6 6
6 1 6 6
0 1 6 7
1 1 6 7
2 1 6 7
3 1 6 7
4 1 6 7
5 1 6 7
6 1 6 7
0 1 6 8
1 1 6 8
2 1 6 8
3 1 6 8
4 1 6 8
5 1 6 8
6 1 6 8
0 1 7 1
1 1 7 1
2 1 7 1
3 1 7 1
4 1 7 1
5 1 7 1
6 1 7 1
0 1 7 6
1 1 7 6
2 1 7 6
3 1 7 6
4 1 7 6
5 1 7 6
6 1 7 6
0 1 7 7
1 1 7 7
2 1 7 7
3 1 7 7
4 1 7 7
5 1 7 7
6 1 7 7
0 1 7 8
1 1 7 8
2 1 7 8
3 1 7 8
4 1 7 8
5 1 7 8
6 1 7 8
0 1 8 1
1 1 8 1
2 1 8 1
3 1 8 1
4 1 8 1
5 1 8 1
6 1 8 1
0 1 8 6
1 1 8 6
2 1 8 6
3 1 8 6
4 1 8 6
5 1 8 6
6 1 8 6
0 1 8 7
1 1 8 7
2 1 8 7
3 1 8 7
4 1 8 7
5 1 8 7
6 1 8 7
0 1 8 8
1 1 8 8
2 1 8 8
3 1 8 8
4 1 8 8
5 1 8 8
6 1 8 8
The metatensor.TensorMap
is structured in several instances of an
metatensor.TensorBlock
. To distinguish the block each block is associated
with a unique key. For the current example, we have one block for each angular channel
labeled by o3_lambda
, the central atom type center_type
and
neighbor atom type labeled by neighbor_type
. Different atomic types are
represented using their atomic number, e.g. 1 for hydrogen, 6 for carbon, etc. To
summarize, this descriptor contains 112 blocks covering all combinations of the
angular channels of the central and neighbor atom types in our dataset.
Let us take a look at the second block (at index 1) in detail. This block contains the descriptor for the \(l=1\) angular channel for hydrogen-hydrogen pairs.
block = descriptor0.block(1)
print(descriptor0.keys[1])
LabelsEntry(o3_lambda=1, o3_sigma=1, center_type=1, neighbor_type=1)
The descriptor values¶
The values of the representation are stored as an array. Each entry in this
array also has associated unique metadata as each block. For the spherical
expansion calculator used in this tutorial the values have three dimensions
which we can verify from the .shape
attribute.
print(block.values.shape)
(8, 3, 9)
The descriptor values¶
The first dimension is denoted by the samples, the intermediate dimension by
components, and the last dimension by properties. The “sample dimension”
has a length of eight because we have eight hydrogen atoms in the first frame.
We can reveal more detailed metadata information about the sample-dimension
printing of the metatensor.TensorBlock.samples
attribute of the
block
print(block.samples)
Labels(
system atom
0 12
0 13
...
0 18
0 19
)
The result is an metatensor.TensorMap
instance. It contains in total
eight tuples each with two values. The tuple values are named as follows
print(block.samples.names)
['system', 'atom']
Meaning that the first entry of each tuple indicates the _structure_, which is 0 for all because we only computed the representation of a single frame. The second entry of each tuple refers to the index of the _center_ atom.
We can do a similar investigation for the second dimension: the
metatensor.TensorBlock.components
.
print(block.components)
[Labels(
o3_mu
-1
0
1
)]
Here, the components are associated with the angular channels of the
representation. The size of o3_mu
is \(2l + 1\), where
\(l\) is the current o3_lambda
of the block. Here, its
dimension is three because we are looking at the o3_lambda=1
block. You may have noticed that the return value of the last call is a
list
of metatensor.Labels
and not a single Labels
instance. The reason is that a block can have several component dimensions as
we will see below for the gradients.
The last value represents the number of radial channels. For the
metatensor.TensorBlock.properties
dimension we find an object
print(block.properties)
Labels(
n
0
1
...
7
8
)
containing a tuple of only one value ranging from 0 to 8. The name of this entry is
print(block.properties.names)
['n']
and denoting the radial channels. The range results from our choice of
max_radial = 9
in the hyper parameters above.
After looking at the metadata we can investigate the actual data of the representation in more details
print(block.values[0, 0, :])
[-0.00102818 0.00860425 -0.02339817 -0.11691264 -0.04411467 -0.02857226
-0.02109534 -0.00712789 -0.00230597]
By using [0, 0, :]
we selected the first hydrogen and the first m
channel. As you the output shows the values are floating point numbers between
-1.0
and 1.0
. Values in this range are reasonable and can be directly
used as input for a machine learning algorithm.
Rascaline is also able to process more than one structure within one function call. You can process a whole dataset with
descriptor_full = calculator.compute(frames)
block_full = descriptor_full.block(0)
print(block_full.values.shape)
(420, 1, 9)
Now, the 0th block of the metatensor.TensorMap
contains not eight but
420 entries in the first dimensions. This reflects the fact that in total we
have 420 hydrogen atoms in the whole dataset.
If you want to use another calculator instead of
rascaline.calculators.SphericalExpansion
shown here check out the
Reference guides section.
Computing gradients¶
Additionally, rascaline is also able to calculate gradients on top of the
values. Gradients are useful for constructing an ML potential and running
simulations. For example gradients
of the representation with respect to
atomic positions can be calculated by setting the gradients
parameter of
the rascaline.calculators.CalculatorBase.compute()
method to
["positions"]
.
descriptor_gradients = calculator.compute(frame0, gradients=["positions"])
block_gradients = descriptor_gradients.block(0)
gradient_position = block_gradients.gradient("positions")
print(gradient_position.values.shape)
(52, 3, 1, 9)
The calculated descriptor contains the values and in each block the associated
position gradients as an metatensor.block.Gradient
instance. The
actual values are stored in the data
attribute. Similar to the features
the gradient data also has associated metadata. But, compared to the values
were we found three dimensions, and gradients have four. Again the first is
called samples and the properties. The dimensions between the sample and
property dimensions are denoted by components.
Looking at the shape in more detail we find that we have 52 samples, which is
much more compared to features where we only have eight samples. This arises
from the fact that we calculate the position gradient for each pair in the
structure. For our selected block these are all hydrogen-hydrogen pairs.
Naively one would come up with 8 * 8 = 64
samples, but rascaline already
ignores pairs that are outside of the cutoff radius. Their position gradient
is always zero. The metatensor.block.Gradient.samples
attribute shows
this in detail.
print(gradient_position.samples)
Labels(
sample system atom
0 0 12
0 0 13
...
7 0 18
7 0 19
)
Note that we have a tuple of three with the names
['sample', 'system', 'atom']
In the above output of the Labels instance for example the (2, 0, 17) entry is missing indicating that this pair is outside of the cutoff.
Now looking at the metatensor.block.Gradient.components
print(gradient_position.components)
[Labels(
xyz
0
1
2
), Labels(
o3_mu
0
)]
we find two of them. Besides the o3_mu component that is also present in the features position gradients also have a component indicating the direction of the gradient vector.
Finally, the metatensor.block.Gradient.properties
dimension is the same as for
the values
print(gradient_position.properties)
Labels(
n
0
1
...
7
8
)
Rascaline can also calculate gradients with respect to the strain (i.e. the virial).
For this, you have to add "strain"
to the list parsed to the gradients
parameter of the rascaline.calculators.CalculatorBase.compute()
method.
Strain gradients/virial are useful when computing the stress and the pressure.
If you want to know about the effect of changing hypers take a look at the next tutorial. If you want to solve an explicit problem our How-to guides might help you.