Symmetries in kernel-based machine learning
Setting
We propose to learn a mapping \({\bf Q} \rightarrow O\) between the (featurized) representation of sample \(i\), \({\bf Q}_i\), and a target observable, \(O_i = O({\bf Q}_i)\). Examples of \(O\) for molecular systems could include a LUMO energy, a binding constant, or a solubility value.
The formalism outlined here follows [Scherer et al.].
Kernel machine learning 101
Training a kernel machine learning (ML) model is equivalent to solving the set of linear equations \[{\bf O} = \hat{K}{\bf \alpha},\] where the kernel function \(K_{ij} = K({\bf Q}_i, {\bf Q}_j) = {\rm Cov}(O_i, O_j)\) measures the similarity between samples \({\bf Q}_i\) and \({\bf Q}_j\) (via the covariance function), and \({\bf \alpha}\) is the (unknown) vector of weight coefficients. Because the equation is linear, we can invert the problem at hand, provided we also regularize it: \[ {\bf \alpha} = (\hat K + \lambda \mathbb{1})^{-1}{\bf O}, \] where \(\lambda\) sets the strength of the Tikhonov regularization and \(\mathbb{1}\) is the identity matrix. Both \(\hat K\) and \(\mathbb{1}\) are \(N \times N\) matrices, while \({\bf \alpha}\) and \({\bf O}\) are both vectors of length \(N\), with \(N\) the number of training samples. Determining the coefficients \({\bf \alpha}\) corresponds to training the ML model.
Once trained, we can use the model to predict new samples! Given a yet unseen configuration \({\bf Q}^*\), the predicted observable is given by \[ O({\bf Q}^*) = \sum_{i=1}^N \alpha_i K({\bf Q}_i, {\bf Q}^*). \tag{1} \] Eq. (1) takes on a particularly intuitive form: it is a linear expansion of kernel evaluations between each of the \(N\) training points and the new configuration \({\bf Q}^*\). Though linear in the coefficients, this expression is typically far more expressive than a straight-up linear regression because of the so-called kernel trick: the kernel function measures similarities in a high-dimensional (and implicit!) Hilbert space.
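To make this concrete, here is a minimal NumPy sketch of the train-and-predict procedure above, using a Gaussian kernel on fixed-size feature vectors. The synthetic data, the kernel width `sigma`, and the regularization strength `lambda_reg` are arbitrary illustrative choices, not values from [Scherer et al.].

```python
import numpy as np

def gaussian_kernel(Q1, Q2, sigma=1.0):
    """Gaussian (RBF) kernel between two sets of feature vectors."""
    # Squared Euclidean distances between all pairs of rows
    d2 = np.sum((Q1[:, None, :] - Q2[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma**2))

# Toy training set: N samples with D-dimensional (featurized) representations
rng = np.random.default_rng(0)
N, D = 50, 8
Q_train = rng.normal(size=(N, D))          # featurized samples Q_i
O_train = np.sin(Q_train).sum(axis=1)      # synthetic target observable O_i

# Training: solve (K + lambda * 1) alpha = O
lambda_reg = 1e-6
K = gaussian_kernel(Q_train, Q_train)
alpha = np.linalg.solve(K + lambda_reg * np.eye(N), O_train)

# Prediction for unseen configurations Q*, following Eq. (1)
Q_star = rng.normal(size=(5, D))
O_pred = gaussian_kernel(Q_star, Q_train) @ alpha
print(O_pred)
```

Solving the regularized linear system directly is numerically preferable to forming the matrix inverse explicitly, even though both correspond to the same equation.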
Noether's theorem
Emmy Noether tells us that to every differentiable symmetry there corresponds a conservation law. According to Wikipedia: “To every differentiable symmetry generated by local actions there corresponds a conserved current.”
Important examples:
- Translational invariance leads to conservation of linear momentum
- Rotational invariance leads to conservation of angular momentum
- Time invariance leads to conservation of energy
Rotation
Let’s start with rotations. An observable can respond to them in two ways: invariance and covariance.
Invariance
Invariance means that the target observable does not depend on the orientation of the sample. Intuitively, the net charge of a molecule (in vacuum) does not depend on its orientation. Making our ML model invariant to rotations simply means ignoring rotational degrees of freedom in the representation. For instance, working with internal coordinates (e.g., pairwise distances) will do just that.
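A quick numerical sanity check (a sketch with NumPy; the toy coordinates and the z-axis rotation are arbitrary): a representation built from sorted pairwise distances is unchanged when the molecule is rotated, so any kernel acting on it is automatically rotationally invariant.

```python
import numpy as np

def pairwise_distance_descriptor(coords):
    """Rotation-invariant representation: sorted pairwise distances."""
    diffs = coords[:, None, :] - coords[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    iu = np.triu_indices(len(coords), k=1)
    return np.sort(dists[iu])

rng = np.random.default_rng(1)
coords = rng.normal(size=(5, 3))            # toy "molecule" with 5 atoms

theta = 0.7                                  # rotation about the z-axis
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])

q_original = pairwise_distance_descriptor(coords)
q_rotated = pairwise_distance_descriptor(coords @ R.T)
print(np.allclose(q_original, q_rotated))    # True: descriptor is invariant
```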
Covariance
Covariance is a bit more interesting. Unlike the invariant case, the observable now rotates together with the input sample. For instance, the dipole moment of a molecule rotates with the molecule itself.
There are two main routes to covariant ML models:
- Preprocess all samples by rotating them into an (arbitrary but consistent) local axis system; train and predict in that local frame; rotate predictions back into the global frame;
- Pick a kernel that correctly orients its prediction based on the input sample.
Route 1 is relatively easy to implement and can be a good strategy for small molecules. For larger structures, however, it can become challenging to define a consistent local axis system.
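As a sketch of route 1, here is one possible way to define a local axis system from three reference atoms via Gram-Schmidt orthogonalization; the choice of reference atoms and the helper name `local_frame` are hypothetical, the only requirement being that the construction is applied consistently across samples.

```python
import numpy as np

def local_frame(coords, idx=(0, 1, 2)):
    """Build an orthonormal local axis system from three reference atoms.

    Atom idx[0] defines the origin, the direction to idx[1] the x-axis,
    and the plane spanned with idx[2] fixes the y-axis.
    """
    origin = coords[idx[0]]
    v1 = coords[idx[1]] - origin
    v2 = coords[idx[2]] - origin
    x = v1 / np.linalg.norm(v1)
    y = v2 - (v2 @ x) * x                 # Gram-Schmidt orthogonalization
    y /= np.linalg.norm(y)
    z = np.cross(x, y)                    # right-handed frame
    return origin, np.stack([x, y, z])    # rows are the local axes

# Express coordinates (and any covariant target, e.g., a dipole vector)
# in the local frame before training; rotate predictions back afterwards.
rng = np.random.default_rng(3)
coords = rng.normal(size=(6, 3))
origin, R_local = local_frame(coords)
coords_local = (coords - origin) @ R_local.T          # global -> local
dipole_global = R_local.T @ np.array([0.1, 0.2, 0.3])  # local -> global
```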
Let’s dive into route 2.
First of all, “pick a kernel” means that among the very large number of kernels one could choose, only a few have the mathematical properties we are after. To be more precise, we are looking for kernels that know a bit about geometry, i.e., that transform covariantly under the SO(3) rotation group (the group of rotations in 3D space). Because of the 3 dimensions, we will seek a matrix-valued kernel: each entry of the kernel matrix between any two samples will be of shape \(3 \times 3\).
For simplicity we will focus on two-body interactions. The representation for sample \(i\) will be \({\bf q}_i = {\bf r}_i\), where \({\bf r}\) is the interparticle vector.
Let’s first define a base (scalar) Gaussian kernel, \[ k_{\rm b}({\bf q}_m, {\bf q}_l) = \exp\left( - \frac{({\bf r}_m - {\bf r}_l)^2}{2\sigma ^2} \right), \] which only picks up distance information between the two samples.
A covariant matrix-valued kernel can be constructed by integrating \(k_{\rm b}\) over all rotation matrices, i.e., summing over all actions of the rotation group [Mehta]: \[ \hat \kappa_{\rm c}({\bf q}_m, {\bf q}_l) = \int {\rm d} \hat{\mathcal{R}} \, \hat{\mathcal{R}} k_{\rm b} ({\bf q}_m, \hat{\mathcal{R}}{\bf q}_l). \] Note how the rotation matrix is applied to one of the two samples of the kernel function. An insightful paper by [Glielmo et al.] offered an analytical solution for pairs: \[ \hat \kappa_{\rm c}({\bf q}_m, {\bf q}_l) = {\rm e}^{-\alpha_{ml}} \left( \frac{\cosh \gamma_{ml}}{\gamma_{ml}} - \frac{\sinh \gamma_{ml}}{\gamma_{ml}^2} \right) \hat{{\bf r}}_m \hat{{\bf r}}_l^{\rm T}, \tag{2} \] where \(\alpha_{ml} = (r_m^2 + r_l^2) / 4\sigma^2\), \(\gamma_{ml} = r_m r_l/2\sigma^2\), and \(\hat{{\bf r}}_m = {\bf r}_m / r_m\). What is important in Eq. (2) is the right-most part of the RHS: the tensor product between the normalized pairwise vectors is solely responsible for the covariance of \(\hat \kappa_{\rm c}({\bf q}_m, {\bf q}_l)\).
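Below is a small NumPy sketch of Eq. (2) as written above (the value of `sigma` and the pair vectors are arbitrary), together with a numerical check that rotating an input pair vector rotates the kernel block accordingly.

```python
import numpy as np

def covariant_pair_kernel(r_m, r_l, sigma=1.0):
    """Covariant pairwise kernel of Eq. (2): a 3x3 matrix-valued entry."""
    rm, rl = np.linalg.norm(r_m), np.linalg.norm(r_l)
    alpha = (rm**2 + rl**2) / (4.0 * sigma**2)
    gamma = rm * rl / (2.0 * sigma**2)
    scalar = np.exp(-alpha) * (np.cosh(gamma) / gamma - np.sinh(gamma) / gamma**2)
    return scalar * np.outer(r_m / rm, r_l / rl)

# Covariance check: rotating the first argument rotates the kernel block
theta = 0.9
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])
r_m = np.array([0.3, -1.2, 0.8])
r_l = np.array([1.1, 0.4, -0.5])

k_rotated = covariant_pair_kernel(R @ r_m, r_l)
k_then_rotate = R @ covariant_pair_kernel(r_m, r_l)
print(np.allclose(k_rotated, k_then_rotate))   # True
```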
Energy conservation
Let’s say we’re learning a force field. Energy conservation can take several forms, including:
- Curl of the force is zero: \(\nabla \times {\bf F} = 0\)
- Force derives from a potential: \({\bf F} = -\nabla E\)
Let’s select/construct a kernel that implements energy conservation! (See, among others, these important papers: [Macedo & Castro], [Chmiela et al.], [Bartok & Csanyi]). \[ \begin{align} \hat\kappa_{\rm h}({\bf q}_m, {\bf q}_l) &= {\rm Cov}\left( \frac{\partial E({\bf q}_m)}{\partial {\bf r}_m}, \frac{\partial E({\bf q}_l)}{\partial {\bf r}_l} \right) \\ &= \sum_{\alpha, \beta} \frac{\partial^2 k({\bf q}_m, {\bf q}_l)}{\partial q_{m, \alpha}\partial q_{l, \beta}} \left( \frac{\partial q_{m,\alpha}}{\partial{\bf r}_m} \right) \left( \frac{\partial q_{l,\beta}}{\partial{\bf r}_l} \right)^{\rm T}, \end{align} \] which is sometimes called the Hessian kernel, owing to the first factor on the RHS (the matrix of second derivatives of \(k\)). While you can work out the partial derivatives for a given kernel (e.g., the Gaussian kernel above), what is important here are once again the remaining factors: we recognize yet another tensor product! For pairwise interactions, covariance and energy conservation go hand in hand, and lead to matrix-valued kernels with similar characteristics.
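As an illustration (a sketch, not the implementation of the cited papers), consider the two-body case above where the representation is the pair vector itself, \({\bf q} = {\bf r}\), so the chain-rule factors reduce to identity matrices. For the Gaussian base kernel, the mixed second derivative then has a simple closed form, which the snippet below implements and verifies against finite differences.

```python
import numpy as np

def k_base(r_m, r_l, sigma=1.0):
    """Scalar Gaussian base kernel on pair vectors."""
    d = r_m - r_l
    return np.exp(-(d @ d) / (2.0 * sigma**2))

def hessian_kernel(r_m, r_l, sigma=1.0):
    """Closed-form mixed second derivative d^2 k_b / dr_m dr_l (3x3 block)."""
    d = r_m - r_l
    return k_base(r_m, r_l, sigma) / sigma**2 * (np.eye(3) - np.outer(d, d) / sigma**2)

# Finite-difference check of the closed form
r_m = np.array([0.4, -0.7, 1.1])
r_l = np.array([-0.2, 0.9, 0.3])
h = 1e-4
num = np.zeros((3, 3))
for a in range(3):
    for b in range(3):
        e_a, e_b = np.eye(3)[a], np.eye(3)[b]
        num[a, b] = (k_base(r_m + h * e_a, r_l + h * e_b)
                     - k_base(r_m + h * e_a, r_l - h * e_b)
                     - k_base(r_m - h * e_a, r_l + h * e_b)
                     + k_base(r_m - h * e_a, r_l - h * e_b)) / (4.0 * h**2)

print(np.allclose(num, hessian_kernel(r_m, r_l), atol=1e-6))   # True
```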
References
- [Scherer et al.] : https://dx.doi.org/10.1021/acs.jctc.9b01256
- [Mehta] : Mehta, M. L. Random Matrices; Elsevier: 2004; Vol. 142.
- [Glielmo et al.] : https://link.aps.org/doi/10.1103/PhysRevB.95.214302
- [Macedo & Castro] : https://www.yumpu.com/en/document/view/37810994/learning-divergence-free-and-curl-free-vector-fields-with-matrix-
- [Chmiela et al.] : https://dx.doi.org/10.1126/sciadv.1603015
- [Bartok & Csanyi] : https://doi.org/10.1002/qua.24927