Clustering¶
Once you have run your simulations, the first step in building an MSM is generally clustering your trajectories based on a parameter of interest. Commonly, you’ll want to look at the root mean square deviation (RMSD) between conformations or the Euclidean distance between feature vectors.
In enspara, this functionality is available at three levels of detail.
1. Apps. Clustering code is available in a command-line application that is capable of handling much of the bookkeeping necessary for more complex clustering operations.
2. Objects. Clustering code is wrapped into sklearn-style objects that offer simple API access to clustering algorithms and their parameters.
3. Functions. Clustering code is ultimately implemented as functions, which offer the highest degree of control over the function’s behavior, but also require the most work on the user’s part.
Clustering App¶
Clustering functionality is available in enspara in the script apps/cluster.py. Its help output explains at a granular level of detail what it is capable of, so here we will provide a high-level discussion of how it can be used.
When clustering, you will need to make a few important choices:
1. What type of data will you be clustering? The app accepts trajectories of coordinates as well as arrays of vectors.
2. Which clustering algorithm will you use? We currently implement k-centers and k-hybrid.
3. How “much” clustering will you do? Both k-centers and k-hybrid require the choice of k-centers stopping criteria, and k-hybrid additionally requires the choice of number of k-medoids refinements.
4. How will you compare frames to one another (i.e. what is your distance function)? Options include RMSD (for coordinates), as well as Euclidean and Manhattan distances (for feature vectors).
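As a concrete illustration of the last choice, the two feature-vector distance functions differ only in how they combine per-element differences; a minimal sketch with toy vectors (the values here are made up):

```python
import numpy as np

# Two toy per-frame feature vectors (e.g. per-residue values).
frame_a = np.array([1.0, 2.0, 3.0])
frame_b = np.array([2.0, 4.0, 1.0])

# Euclidean distance: square root of the sum of squared differences.
euclidean = np.sqrt(np.sum((frame_a - frame_b) ** 2))

# Manhattan (L1) distance: sum of absolute differences.
manhattan = np.sum(np.abs(frame_a - frame_b))

print(euclidean)  # 3.0
print(manhattan)  # 5.0
```

Manhattan distance weights each feature difference equally, while Euclidean distance emphasizes large differences in any single feature.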
A Simple Example¶
One thing enspara excels at is generating fine-grained state spaces by clustering using RMSD as the distance metric. This is very fast: it is not only thread-parallelized to use all cores on a single computer (hat tip to MDTraj’s blazing-fast RMSD calculations), but can also be parallelized across many computers with MPI.
In a simple case, such a clustering will look something like this:
python /home/username/enspara/enspara/apps/cluster.py \
--trajectories /path/to/input/trj1.xtc /path/to/input/trj2.xtc \
--topology /path/to/input/topology.top \
--algorithm khybrid \
--cluster-number 1000 \
--distances /path/to/output/distances.h5 \
--center-features /path/to/output/centers.pickle \
--assignments /path/to/output/assignments.h5
This will make 1000 clusters using the k-hybrid clustering algorithm based on all the atomic coordinates in trj1.xtc and trj2.xtc. Based on the clusters it discovers, it will generate three files:
1. Centers file (centers.pickle). This file, which is a python list of mdtraj.Trajectory objects, contains the atomic coordinates of each cluster center. If 1000 clusters are discovered, this list will have length 1000.
2. Assignments file (assignments.h5). This file assigns each frame in the input to a cluster center (even if subsampling is specified). If position (i, j) in this array has value n, then the jth frame of trajectory i above was found to belong to center n (found in the centers file).
3. Distances file (distances.h5). This file gives the distance between each frame (i, j) and the center it is assigned to (found in the assignments file).
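To make the relationship between the output files concrete, here is a toy sketch with hand-made arrays standing in for the contents of assignments.h5 and distances.h5 (the real files hold ragged arrays, and all values here are invented):

```python
import numpy as np

# Toy stand-ins: 2 trajectories x 4 frames each.
assignments = np.array([[0, 0, 2, 1],
                        [1, 2, 2, 0]])
distances = np.array([[0.1, 0.3, 0.2, 0.4],
                      [0.2, 0.1, 0.3, 0.2]])

# Frame (i, j) = (1, 2) belongs to center n = assignments[1, 2] ...
n = assignments[1, 2]
# ... and sits distances[1, 2] away from that center.
d = distances[1, 2]
print(n, d)  # 2 0.3
```

The center with index n is then the nth entry of the list stored in the centers file.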
Feature Clustering¶
Enspara can also operate on inputs that are “features” rather than coordinates. For example, we have published work that clusters based on the solvent accessibility of each sidechain, rather than on atomic positions. In that featurization, each frame is represented by a one-dimensional vector, and the distances between vectors are computed using some distance function, often the Euclidean or Manhattan distance (both of which have fast implementations in enspara).
In this case, your cluster.py invocation will look something like:
python /home/username/enspara/enspara/apps/cluster.py \
--features features.h5 \
--algorithm khybrid \
--cluster-radius 1.0 \
--cluster-distance euclidean \
--distances /path/to/output/distances.h5 \
--center-features /path/to/output/centers.pickle \
--assignments /path/to/output/assignments.h5
Here, clusters will be generated until the maximum distance of any frame to its cluster center is 1.0, using the Euclidean distance (the --cluster-number flag is also accepted in place of --cluster-radius). You can also specify a list of .npy files as the argument to --features.
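For context, a features input is just a two-dimensional array with one row per frame; a minimal sketch of writing such an array with numpy (the shape, values, and filename here are all hypothetical):

```python
import numpy as np

# Hypothetical featurization: 100 frames, each represented by a
# 20-dimensional vector (e.g. per-residue solvent accessibilities).
features = np.random.default_rng(0).random((100, 20))
np.save('features_trj1.npy', features)

loaded = np.load('features_trj1.npy')
print(loaded.shape)  # (100, 20)
```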
Subsampling and Reassignment¶
Sometimes, it is useful not to load every frame of your trajectories. This can
be necessary for large datasets, where the data exceeds the memory capacity of
the computer(s) being used for clustering, and often does not substantially
diminish the quality of the clustering. As a general rule of thumb, it is
usually safe to subsample such that frames are 1 ns apart. Thus, if frames have been saved every 10 ps, subsampling by a factor of 100 is usually safe. This can be achieved with the --subsample flag.
python /home/username/enspara/enspara/apps/cluster.py \
--trajectories /path/to/input/trj1.xtc /path/to/input/trj2.xtc \
--topology /path/to/input/topology.top \
--algorithm khybrid \
--subsample 10 \
--cluster-number 1000 \
--distances /path/to/output/distances.h5 \
--center-features /path/to/output/centers.pickle \
--assignments /path/to/output/assignments.h5
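The rule of thumb above is simple arithmetic; a quick sanity check (the save interval and trajectory sizes here are invented for illustration):

```python
# Frames saved every 10 ps; target spacing of ~1 ns between clustered frames.
save_interval_ps = 10
target_spacing_ps = 1000  # 1 ns

subsample = target_spacing_ps // save_interval_ps
print(subsample)  # 100

# With 2 trajectories of 50,000 frames each, clustering then sees:
frames_clustered = 2 * 50_000 // subsample
print(frames_clustered)  # 1000
```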
However, when clustering is produced with a subset of the data, it is still valuable to use all frames to build a Markov state model, because it improves the statistics in the transition counts matrix. Consequently, even when clustering uses some subset of frames, it is useful to assign every frame in the dataset to a cluster. This process is called “reassignment”.
By default, reassignment automatically occurs after clustering (it can be suppressed with --no-reassign). It sequentially loads subsets of the input data (the size of each subset depends on the size of main memory) and uses the cluster centers to determine cluster membership before purging the subset from memory and loading the next.
Notably, reassignment is embarrassingly parallel, whereas clustering is fundamentally less scalable. As a result, one can provide the --no-reassign flag to suppress this behavior and use the centers in some other script to do the reassignment (see the reassign.py app).
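The core of reassignment can be sketched as a nearest-center search over one chunk of frames; this is illustrative numpy operating on toy feature data, not enspara's implementation:

```python
import numpy as np

# Toy feature data: fixed cluster centers and one chunk of frames to reassign.
centers = np.array([[0.0, 0.0],
                    [10.0, 10.0]])
chunk = np.array([[1.0, 1.0],
                  [9.0, 9.5],
                  [0.5, -0.5]])

# Distance from every frame in the chunk to every center ...
dists = np.linalg.norm(chunk[:, None, :] - centers[None, :, :], axis=2)
# ... then each frame joins its nearest center. Chunks are independent
# of one another, so this step parallelizes trivially across machines.
assignments = dists.argmin(axis=1)
print(assignments)  # [0 1 0]
```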
Clustering Object¶
Rather than relying on a pre-built script to cluster data, there are also scikit-learn-style objects for the two major clustering algorithms we use, k-hybrid and k-centers. They are enspara.cluster.hybrid.KHybrid and enspara.cluster.kcenters.KCenters, respectively.
An example of a script that clusters data using these objects is:
import mdtraj as md
from enspara.cluster import KHybrid
from enspara.util.load import load_as_concatenated
top = md.load('path/to/trj_or_topology').top
# loads a giant trajectory in parallel into a single numpy array.
lengths, xyz = load_as_concatenated(
    ['path/to/trj1', 'path/to/trj2', ...],
    top=top,
    processes=8)
# configure a KHybrid (KCenters + KMedoids) clustering object
# to use RMSD and stop creating new clusters when the maximum
# distance to a cluster center reaches 0.25 nm (2.5 Å).
clustering = KHybrid(
    metric=md.rmsd,
    dist_cutoff=0.25)
# md.rmsd requires an md.Trajectory object, so wrap `xyz` in
# the topology.
clustering.fit(md.Trajectory(xyz=xyz, topology=top))
# the distances between each frame in `xyz` and the nearest cluster center
print(clustering.distances_)
# the cluster id for each frame in `xyz`
print(clustering.labels_)
# a list of the `xyz` frame index for each cluster center
print(clustering.center_indices_)
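Once fit, the labels_ array can be summarized directly; for example, per-cluster populations (a common first look before building the transition counts matrix) can be computed with numpy. The labels below are a toy stand-in for clustering.labels_:

```python
import numpy as np

# Toy stand-in for clustering.labels_ on an 8-frame dataset with 3 clusters.
labels = np.array([0, 2, 1, 0, 0, 2, 1, 0])

# Population of each cluster id: counts of frames assigned to it.
populations = np.bincount(labels)
print(populations)  # [4 2 2]
```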
Clustering Functions¶
Finally, for the finest-grained control over the clustering process, we implement functions that execute the clustering algorithms over given data, often with very detailed control over stopping conditions and calculations. They are enspara.cluster.hybrid.hybrid and enspara.cluster.kcenters.kcenters, respectively.
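To show the kind of loop these functions implement, here is a minimal, illustrative k-centers in numpy with a radius stopping criterion; it is a sketch of the algorithm's core idea, not enspara's implementation:

```python
import numpy as np

def kcenters_sketch(X, dist_cutoff):
    # Start with frame 0 as the first center.
    center_idxs = [0]
    dists = np.linalg.norm(X - X[0], axis=1)
    labels = np.zeros(len(X), dtype=int)
    # Keep promoting the farthest frame to a new center until every
    # frame is within dist_cutoff of some center.
    while dists.max() > dist_cutoff:
        new = int(dists.argmax())
        new_d = np.linalg.norm(X - X[new], axis=1)
        closer = new_d < dists
        labels[closer] = len(center_idxs)
        dists = np.where(closer, new_d, dists)
        center_idxs.append(new)
    return center_idxs, labels, dists

# Two tight pairs of points; a 1.0 cutoff should find one center per pair.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
idxs, labels, dists = kcenters_sketch(X, dist_cutoff=1.0)
print(idxs)    # [0, 3]
print(labels)  # [0 0 1 1]
```

A k-hybrid run would follow this with a fixed number of k-medoids sweeps that move each center to a more representative member of its cluster.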