graph2mat.core.data.processing
Core of the data processing.
Managing sparse matrix data in conjunction with graphs is not trivial:
Matrices are sparse.
Matrices are in a basis which is centered around the points in the graph. Therefore elements of the matrix correspond to nodes or edges of the graph.
Each point might have more than one basis function, therefore the matrix is divided in blocks (not just single elements) that correspond to nodes or edges of the graph.
Different point types might have different basis size, which makes the different blocks in the matrix have different shapes.
The different block sizes and the sparsity of the matrices pose an extra challenge when batching examples for machine learning.
This module implements BasisMatrixData
, a class that
The tools in this submodule are agnostic to the machine learning framework of choice, and they are based purely on numpy, with the extra dependency on sisl to handle the sparse matrices. The sisl dependency could eventually be removed if needed.
Classes
| Version of BasisMatrixDataBase that stores data as numpy arrays. |
| Stores a graph with the preprocessed data for one or multiple configurations. |
| Data structure that contains all the parameters to interface the real world with the ML models. |
| Helper class to get attributes from the data object making sure they are numpy arrays. |
- class graph2mat.core.data.processing.BasisMatrixData(edge_index: ndarray | None = None, neigh_isc: ndarray | None = None, node_attrs: ndarray | None = None, positions: ndarray | None = None, shifts: ndarray | None = None, cell: ndarray | None = None, nsc: ndarray | None = None, point_labels: ndarray | None = None, edge_labels: ndarray | None = None, labels_point_filter: ndarray | None = None, labels_edge_filter: ndarray | None = None, point_types: ndarray | None = None, edge_types: ndarray | None = None, edge_type_nlabels: ndarray | None = None, data_processor: MatrixDataProcessor = None, metadata: Dict[str, Any] | None = None, already_basis: bool = False)[source]
Bases: BasisMatrixDataBase[ndarray]
Version of BasisMatrixDataBase that stores data as numpy arrays.
See also
BasisMatrixDataBase
The base class that actually implements all the processing.
- cell: ArrayType
Shape (3,3). Lattice vectors of the unit cell, in the convention specified by the data processor (e.g. spherical harmonics). IMPORTANT: This is not necessarily in cartesian coordinates.
- edge_index: ArrayType
Shape (2, n_edges). Array with point pairs (their index in the configuration) that form an edge.
- edge_labels: ArrayType
Shape (n_edge_labels,). The elements of the target matrix that correspond to interactions between different nodes. This is flattened to deal with the fact that each block might have different shape.
All values for a given block come consecutively and in row-major order.
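As an illustration of the flattened layout described above, here is a minimal numpy sketch. The block values and shapes are hypothetical (they assume two point types with 1 and 2 basis functions), and the code is not taken from graph2mat itself.

```python
import numpy as np

# Hypothetical blocks of a matrix. Shapes differ because the two point
# types are assumed to have 1 and 2 basis functions respectively.
blocks = [
    np.array([[1.0, 2.0]]),              # shape (1, 2)
    np.array([[3.0], [4.0]]),            # shape (2, 1)
    np.array([[5.0, 6.0], [7.0, 8.0]]),  # shape (2, 2)
]

# Flatten each block in row-major (C) order and concatenate, so that
# all values of a given block come consecutively.
flat_labels = np.concatenate([block.ravel(order="C") for block in blocks])

# Recover the blocks from the flat array, given the known block shapes.
shapes = [block.shape for block in blocks]
sizes = [rows * cols for rows, cols in shapes]
offsets = np.cumsum([0] + sizes)
recovered = [
    flat_labels[start:stop].reshape(shape)
    for start, stop, shape in zip(offsets[:-1], offsets[1:], shapes)
]
```

Because the block shapes follow from the point types at both ends of each edge, a single 1D array is enough to store blocks of mixed shapes.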
- edge_type_nlabels: ArrayType
Shape (n_edge_types,). Edge labels are sorted by edge type. This array contains the number of labels for each edge type.
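Since edge labels are sorted by edge type, the counts in this array are enough to split the flat label array into one chunk per edge type. A minimal sketch with made-up values (the numbers below are illustrative, not produced by graph2mat):

```python
import numpy as np

# Hypothetical flat edge labels, already sorted by edge type.
edge_labels = np.arange(7, dtype=float)

# Hypothetical number of labels for each of three edge types.
# The second type has no edges in this configuration.
edge_type_nlabels = np.array([2, 0, 5])

# Split positions are the cumulative counts, excluding the final total.
boundaries = np.cumsum(edge_type_nlabels)[:-1]
labels_by_type = np.split(edge_labels, boundaries)
```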
- edge_types: ArrayType
Shape (n_edges,). The type of each edge as defined by the basis table, i.e. a BasisTableWithEdges.
- metadata: Dict[str, Any]
Contains any extra metadata that might be useful for the model or to postprocess outputs, for example. It includes the data processor.
- neigh_isc: ArrayType
Shape (n_edges,). Array with the index of the supercell where the second point of each edge is located. This follows the conventions in sisl.
- node_attrs: ArrayType
Shape (n_points, n_node_feats). Inputs for each point in the configuration.
- nsc: ArrayType
Number of auxiliary cells required in each direction to account for all neighbor interactions.
- point_labels: ArrayType
Shape (n_point_labels,). The elements of the target matrix that correspond to interactions within the same node. This is flattened to deal with the fact that each block might have different shape.
All values for a given block come consecutively and in row-major order.
- point_types: ArrayType
Shape (n_points,). The type of each point (index in the basis table, i.e. a BasisTableWithEdges).
- positions: ArrayType
Shape (n_points, 3). Coordinates of each point in the configuration, in the convention specified by the data processor (e.g. spherical harmonics). IMPORTANT: This is not necessarily in cartesian coordinates.
- shifts: ArrayType
Shape (n_edges, 3). Shift of the second atom in each edge with respect to its image in the primary cell, in the convention specified by the data processor (e.g. spherical harmonics). IMPORTANT: This is not necessarily in cartesian coordinates.
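For intuition, a shift of this kind can be obtained from integer supercell offsets and the lattice vectors. The sketch below assumes plain cartesian coordinates and one lattice vector per row of cell; the name sc_shifts and its (n_edges, 3) layout are assumptions made for this example, and the data processor may apply a different coordinate convention.

```python
import numpy as np

# Assumed lattice vectors, one per row, shape (3, 3).
cell = np.array([
    [4.0, 0.0, 0.0],
    [0.0, 5.0, 0.0],
    [0.0, 0.0, 6.0],
])

# Hypothetical integer supercell offsets for three edges, shape (n_edges, 3).
sc_shifts = np.array([
    [0, 0, 0],    # neighbor in the primary cell
    [1, 0, 0],    # neighbor one cell along the first lattice vector
    [-1, 1, 0],   # neighbor displaced along two lattice vectors
])

# Each shift is a linear combination of the lattice vectors.
shifts = sc_shifts @ cell
```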
- class graph2mat.core.data.processing.MatrixDataProcessor(basis_table: BasisTableWithEdges, symmetric_matrix: bool = False, sub_point_matrix: bool = True, out_matrix: Literal['density_matrix', 'hamiltonian', 'energy_density_matrix', 'dynamical_matrix'] | None = None, node_attr_getters: List[Any] = <factory>)[source]
Bases:
object
Data structure that contains all the parameters to interface the real world with the ML models.
Contains all the objects and implements all the logic (using these objects but never modifying them) to convert:
A “real world” object (a structure, a path to a structure, a path to a run, etc.) into the inputs for the model.
The outputs of the model into a “real world” object (a matrix).
Ideally, any processing that requires the attributes of the data processor (basis_table, symmetric_matrix, etc.) should be implemented inside this class so that implementations are not sensitive to small changes like the name of the attributes.
Therefore, every model should have an associated MatrixDataProcessor object to ensure that the input is correctly preprocessed and the output is interpreted correctly.
This data processor is agnostic to the framework of the model (e.g. pytorch) and the processing is divided in small functions so that it can be easily reused.
- Parameters:
basis_table (graph2mat.core.data.table.BasisTableWithEdges) – Table containing all the basis information.
symmetric_matrix (bool) – Whether the matrix is symmetric or not.
sub_point_matrix (bool) – Whether the isolated point matrix is subtracted from the point labels. That would mean that the model is learning a delta with respect to the case where all points are isolated.
out_matrix (Literal['density_matrix', 'hamiltonian', 'energy_density_matrix', 'dynamical_matrix'] | None) – Type of matrix to output. If None, the matrix is output as a scipy CSR matrix.
- __init__(basis_table: BasisTableWithEdges, symmetric_matrix: bool = False, sub_point_matrix: bool = True, out_matrix: Literal['density_matrix', 'hamiltonian', 'energy_density_matrix', 'dynamical_matrix'] | None = None, node_attr_getters: List[Any] = <factory>) None
- add_basis_to_geometry(geometry: Geometry) Geometry [source]
Returns a copy of the geometry with the basis of this processor added to it.
It works by replacing an atom with atomic number Z in the geometry with the atom with the same Z in the basis table.
- basis_table: BasisTableWithEdges
- compute_metrics(output: dict, input: BasisMatrixData, metrics: Sequence['OrbitalMatrixMetric'] | None = None) dict [source]
Computes the metrics for a given output and input.
- Parameters:
output (dict) – Output of the model, as it comes out of it.
input (BasisMatrixData) – The input that was passed to the model.
metrics (Sequence[OrbitalMatrixMetric], optional) – Metrics to compute. If None, all known metrics are computed.
- Returns:
Dictionary where keys are the names of the metrics and values are their values.
- Return type:
dict
- property default_out_format
- get_cutoff(point_types: ndarray) float | ndarray [source]
Returns the cutoff radius.
- Parameters:
point_types (np.ndarray of shape (n_points,)) – Type of each point (index in the basis table) in the configuration.
- Returns:
The cutoff radius might be a single number if all points have the same cutoff radius, or an array with the cutoff radius of each point.
If each point has its own radius, one might find an edge between i and j if dist(i -> j) is smaller than cutoff_i + cutoff_j.
If the cutoff radius is a single number, edges are found if dist(i -> j) is smaller than cutoff.
- Return type:
float or np.ndarray of shape (n_points,)
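The two edge criteria above can be sketched with a brute-force neighbor search. The helper find_edges is hypothetical (it is not part of graph2mat) and ignores periodic images for simplicity.

```python
import numpy as np

def find_edges(positions, cutoff):
    """Return an array of shape (2, n_edges) with directed edges i -> j.

    Hypothetical brute-force search without periodic images,
    for illustration only.
    """
    n_points = len(positions)
    # Pairwise distances, shape (n_points, n_points).
    dists = np.linalg.norm(positions[:, None, :] - positions[None, :, :], axis=-1)
    cutoff = np.asarray(cutoff, dtype=float)
    if cutoff.ndim == 0:
        # Single radius: edge if dist(i -> j) < cutoff.
        max_dist = np.full((n_points, n_points), float(cutoff))
    else:
        # Per-point radii: edge if dist(i -> j) < cutoff_i + cutoff_j.
        max_dist = cutoff[:, None] + cutoff[None, :]
    connected = dists < max_dist
    np.fill_diagonal(connected, False)  # no self-edges in this sketch
    return np.stack(np.nonzero(connected))

# Three points on a line at x = 0, 1 and 3.
positions = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [3.0, 0.0, 0.0]])
edges_single = find_edges(positions, 1.5)
edges_per_point = find_edges(positions, np.array([0.4, 0.4, 1.8]))
```

With a single cutoff of 1.5 only points 0 and 1 are connected, while with per-point radii the larger radius of point 2 connects it to point 1 instead.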
- get_labels_from_types_and_edges(config: BasisConfiguration, point_types: ndarray, edge_index: ndarray, neigh_isc: ndarray) Tuple[ndarray | None, ndarray | None] [source]
Once point types and edges have been determined, one can call this function to get the labels of the matrix.
- Parameters:
config (BasisConfiguration) – The configuration from which the labels will be extracted.
point_types (np.ndarray of shape (n_points,)) – Type of each point (index in the basis table) in the configuration.
edge_index (np.ndarray of shape (2, n_edges)) – Array with point pairs (their index in the configuration) that form an edge.
neigh_isc (np.ndarray of shape (n_edges,)) – Array with the index of the supercell shift of the second point of each edge.
- Returns:
point_labels (np.ndarray of shape (n_point_labels, )) – Array with the flattened labels for each point block in the configuration.
edge_labels (np.ndarray of shape (n_edge_labels, )) – Array with the flattened labels for each edge block in the configuration.
- get_nlabels_per_edge_type(edge_types: ndarray) ndarray [source]
Returns the number of labels for each edge type in a given matrix.
It takes into account whether the matrix is symmetric or not (if it is, the number of labels for each edge type is divided by 2).
- Returns:
edge_type_nlabels – Number of labels required for each edge type.
- Return type:
np.ndarray of shape (n_edge_types,)
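A rough sketch of the counting described above. It assumes non-negative edge type indices, a hypothetical edge_block_size array with the number of elements in a block of each edge type, and that both directions of every edge are present so the halving is exact. The actual encoding of edge types in the basis table may differ.

```python
import numpy as np

def nlabels_per_edge_type(edge_types, edge_block_size, symmetric_matrix=False):
    """Hypothetical helper: number of labels needed for each edge type."""
    n_edge_types = len(edge_block_size)
    # Number of edges of each type (zero for types that do not appear).
    n_edges_per_type = np.bincount(edge_types, minlength=n_edge_types)
    nlabels = n_edges_per_type * edge_block_size
    if symmetric_matrix:
        # Only one direction of each edge needs to be stored.
        nlabels = nlabels // 2
    return nlabels

# Made-up example: 8 directed edges of 3 types, with block sizes 1, 2 and 4.
edge_types = np.array([0, 0, 1, 1, 1, 1, 2, 2])
edge_block_size = np.array([1, 2, 4])

full = nlabels_per_edge_type(edge_types, edge_block_size)
halved = nlabels_per_edge_type(edge_types, edge_block_size, symmetric_matrix=True)
```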
- get_node_attrs(config: BasisConfiguration) ndarray [source]
Returns the initial features of nodes.
- get_point_types(config: BasisConfiguration) ndarray [source]
Returns the type (index in the basis table) of each point in the configuration.
- labels_to(out_format: str, data: dict[str, ndarray], threshold: float | None = None, coords_cartesian: bool = False, **kwargs)[source]
- matrix_from_data(data: BasisMatrixData, predictions: Dict | None = None, threshold: float | None = None, is_batch: bool | None = None, out_format: str | None = None)[source]
Converts a BasisMatrixData object into a matrix.
It takes into account the matrix class associated to the data processor to return the corresponding matrix type.
It can also convert batches.
- Parameters:
data – The data to convert.
predictions – Predictions for the matrix labels, with the keys:
node_labels: matrix elements that belong to node blocks.
edge_labels: matrix elements that belong to edge blocks.
If None, the labels from the data object are used.
threshold – Elements with a value below this number will be considered 0.
is_batch – Whether the data is a batch or not. If None, it will be considered a batch if it is an instance of torch_geometric's Batch.
out_format – Format to output the matrix. If None, the default format of the data processor is used.
- Return type:
A matrix if data is not a batch, a tuple of matrices if it is a batch.
See also
yield_from_batch
The more explicit option for batches, which returns a generator.
graph2mat.Formats
Class containing all the available formats which can be passed to the out_format argument.
- one_hot_encode(point_types: ndarray) ndarray [source]
One hot encodes a vector of point types.
It takes into account the number of different point types in the basis table.
- Parameters:
point_types (np.ndarray of shape (n_points,)) – Array of point types (their index in the basis table).
- Returns:
One hot encoded array of point types.
- Return type:
np.ndarray of shape (n_points, n_classes)
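One common way to one-hot encode integer types in numpy is to index an identity matrix. The function name one_hot and the explicit n_classes argument are choices made for this sketch; in the processor the number of classes comes from the basis table.

```python
import numpy as np

def one_hot(point_types, n_classes):
    """Hypothetical helper: row i is the one-hot vector of point_types[i]."""
    return np.eye(n_classes, dtype=float)[point_types]

point_types = np.array([0, 2, 1, 0])
encoded = one_hot(point_types, n_classes=3)
```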
- out_matrix: Literal['density_matrix', 'hamiltonian', 'energy_density_matrix', 'dynamical_matrix'] | None = None
- static sort_edge_index(edge_index: ndarray, sc_shifts: ndarray, shifts: ndarray, edge_types: ndarray, isc_off: ndarray, inplace: bool = False) Tuple[ndarray, ndarray, ndarray, ndarray] [source]
Returns the sorted edge indices.
Edges are much easier to manipulate by the block producing routines if they are ordered properly.
This function orders edges so that both directions of the same edge come consecutively. It also always puts the (lowest point type, highest point type) interaction first, which is the one with a positive edge type.
For the unit cell, the connection in different directions is simple to understand, as it’s just a permutation of the points. I.e. edges (i, j) and (j, i) are the same connection in opposite directions. However, for connections between supercells (if there are periodic conditions), this condition is not enough. The supercell shift of one direction must be the negative of the other direction. I.e. only edges between (i, j, x, y, z) and (j, i, -x, -y, -z) are the same connection in opposite directions. It is also important to notice that in the supercell connections i and j can be the same index.
- Parameters:
edge_index (np.ndarray of shape (2, n_edges)) – Pair of point indices for each edge.
sc_shifts (np.ndarray of shape (3, n_edges)) – For each edge, the number of cell boundaries the edge crosses in each lattice direction.
shifts (np.ndarray of shape (3, n_edges)) – For each edge, the cartesian shift induced by sc_shifts.
edge_types (np.ndarray of shape (n_edges, )) – For each edge, its type as an integer.
isc_off (np.ndarray of shape (nsc_x, nsc_y, nsc_z)) – Array that maps from sc_shifts to a single supercell index.
inplace (bool, optional) – Whether the output should be placed in the input arrays, otherwise new arrays are created.
- Returns:
numpy arrays with the same shape as the inputs. If inplace=True, these are just the input arrays, now containing the outputs.
- Return type:
edge_index, sc_shifts, shifts, edge_types
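The matching rule between opposite directions, (i, j, x, y, z) with (j, i, -x, -y, -z), can be illustrated with a small lookup. The helper find_reverse_edges is hypothetical and only pairs edges. It does not reproduce the ordering performed by sort_edge_index.

```python
import numpy as np

def find_reverse_edges(edge_index, sc_shifts):
    """For each edge (i, j, shift), return the index of edge (j, i, -shift).

    Hypothetical helper. It assumes every edge has its opposite direction
    present. edge_index has shape (2, n_edges), sc_shifts has shape (3, n_edges).
    """
    # Lookup from (i, j, x, y, z) to the position of the edge.
    lookup = {
        (int(i), int(j), *map(int, shift)): k
        for k, ((i, j), shift) in enumerate(zip(edge_index.T, sc_shifts.T))
    }
    reverse = np.empty(edge_index.shape[1], dtype=int)
    for k, ((i, j), shift) in enumerate(zip(edge_index.T, sc_shifts.T)):
        reverse[k] = lookup[(int(j), int(i), *(-int(s) for s in shift))]
    return reverse

# Two edges inside the unit cell (0 -> 1 and 1 -> 0) and two edges that
# connect point 0 with its own periodic images along +x and -x.
edge_index = np.array([[0, 1, 0, 0],
                       [1, 0, 0, 0]])
sc_shifts = np.array([[0, 0, 1, -1],
                      [0, 0, 0, 0],
                      [0, 0, 0, 0]])

reverse = find_reverse_edges(edge_index, sc_shifts)
```

The last two edges show why permuting the indices alone is not enough: both are (0, 0), and only the sign of the supercell shift tells the two directions apart.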
- yield_from_batch(data: BasisMatrixData, predictions: Dict | None = None, threshold: float = 1e-08, as_matrix: bool = False, out_format: str | None = None) Generator [source]
Yields matrices from a batch.
It takes into account the matrix class associated to the data processor to return the corresponding matrix type.
- Parameters:
data – The batched data.
predictions – Predictions for the matrix labels, with the keys:
node_labels: matrix elements that belong to node blocks.
edge_labels: matrix elements that belong to edge blocks.
If None, the labels from the data object are used.
threshold – Elements with a value below this number will be considered 0.
as_matrix – Whether to return a matrix or a BasisMatrixData object.
See also
matrix_from_data
The method used to convert data to matrices, which can also be called with a batch.