pensa.clusters

pensa.clusters.clustering

find_closest_frames(data, points)

Finds the frames in a timeseries that are closest to given points.

The timeseries can be multidimensional and there can be an arbitrary number of points. Usually used to identify the frames closest to a cluster centroid, but can be used for any feature value.

Parameters:

data (float array) – Trajectory data [frames, frame_data]
points (list of float arrays) – Points for which the closest frames are sought. Each point must have the same dimension as frame_data.

Returns:

frames (list of int) – Indices of the frames closest to each point.
distances (list of float) – Distance from each point to its closest frame.
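The lookup described above can be illustrated with a small, dependency-free sketch. It is not the library's implementation: the helper name closest_frames_sketch is hypothetical, and the use of the Euclidean distance is an assumption, since the metric is not stated here.

```python
import math

def closest_frames_sketch(data, points):
    """For each point, return the index of the nearest frame and the distance to it.

    Hypothetical helper for illustration; assumes Euclidean distance.
    """
    frames, distances = [], []
    for point in points:
        # Distance from every frame to this point
        dists = [math.dist(frame, point) for frame in data]
        # Index of the smallest distance (first one wins on ties)
        idx = min(range(len(dists)), key=dists.__getitem__)
        frames.append(idx)
        distances.append(dists[idx])
    return frames, distances

# Four frames with two features each, and two query points (e.g. centroids)
data = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0], [5.0, 5.0]]
points = [[0.9, 1.2], [5.0, 4.6]]
frames, distances = closest_frames_sketch(data, points)
print(frames)
```

Each entry of frames and distances corresponds to the point at the same position in points, which matches the "one result per point" shape of the return values listed above.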

obtain_clusters(data, algorithm='kmeans', num_clusters=2, min_dist=12, max_iter=100, plot=True, saveas=None)

Clusters the provided data.

Parameters:

data (float array) – Trajectory data. Format: [frames, frame_data]
algorithm (string) – The algorithm to use for the clustering. Options: kmeans, rspace. Default: kmeans
num_clusters (int, optional) – Number of clusters for k-means clustering. Default: 2.
min_dist (float, optional) – Minimum distance for regspace clustering. Default: 12.
max_iter (int, optional) – Maximum number of iterations. Default: 100.
plot (bool, optional) – Create a plot. Default: True
saveas (str, optional) – Name of the file in which to save the plot. (only needed if “plot” is True)

Returns:

cidx (int array) – Cluster indices for each frame.
total_wss (float) – Within-sum-of-squares (WSS).
centroids (float array) – Centroids for all the clusters.
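To make the relationship between the three return values concrete, the following sketch computes centroids and a total WSS from data and cluster indices. It assumes the usual definition (centroid = mean of the cluster members, WSS = sum of squared Euclidean distances of all frames to their own centroid); the helper name total_wss_sketch is hypothetical and the library's internal computation may differ.

```python
def total_wss_sketch(data, cidx):
    """Return (total WSS, centroids) for frames grouped by cluster index.

    Hypothetical helper; assumes centroids are per-cluster means.
    """
    # Group the frames by their cluster index
    clusters = {}
    for frame, c in zip(data, cidx):
        clusters.setdefault(c, []).append(frame)

    centroids = {}
    wss = 0.0
    for c, members in clusters.items():
        # Centroid: mean of the members along every feature dimension
        centroid = [sum(col) / len(members) for col in zip(*members)]
        centroids[c] = centroid
        # Add the squared distance of every member to its centroid
        wss += sum((x - m) ** 2 for frame in members for x, m in zip(frame, centroid))
    return wss, centroids

data = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [12.0, 10.0]]
cidx = [0, 0, 1, 1]
wss, centroids = total_wss_sketch(data, cidx)
print(wss, centroids)
```

A lower WSS means the frames sit closer to their centroids, which is why this quantity is used to compare clusterings with different numbers of clusters.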

obtain_combined_clusters(data_a, data_b, label_a='Sim A', label_b='Sim B', start_frame=0, algorithm='kmeans', num_clusters=2, min_dist=12, max_iter=100, plot=True, saveas=None)

Clusters a combination of two data sets.

Parameters:

data_a (float array) – Trajectory data [frames, frame_data]
data_b (float array) – Trajectory data [frames, frame_data]
label_a (str, optional) – Label for the plot. Default: Sim A.
label_b (str, optional) – Label for the plot. Default: Sim B.
start_frame (int) – Frame from which the clustering data starts. Default: 0.
algorithm (string) – The algorithm to use for the clustering. Options: kmeans, rspace. Default: kmeans
num_clusters (int, optional) – Number of clusters for k-means clustering. Default: 2.
min_dist (float, optional) – Minimum distance for regspace clustering. Default: 12.
max_iter (int, optional) – Maximum number of iterations. Default: 100.
plot (bool, optional) – Create a plot. Default: True
saveas (str, optional) – Name of the file in which to save the plot. (only needed if “plot” is True)

Returns:

cidx (int array) – Cluster indices for each frame.
cond (int array) – Index of the simulation the data came from.
oidx (int array) – Index of each frame in the original simulation (taking into account the start_frame cutoff).
total_wss (float) – Within-sum-of-squares (WSS).
centroids (float array) – Centroids for all the clusters.
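The bookkeeping arrays cond and oidx can be pictured with the sketch below. It is only an illustration: the helper name is hypothetical, the labels 0 and 1 for the two simulations are an arbitrary choice (check the returned cond for the actual convention), and it assumes that the first row of each data set corresponds to frame start_frame of its simulation.

```python
def combine_with_bookkeeping(data_a, data_b, start_frame=0):
    """Concatenate two data sets and track where every frame came from.

    Hypothetical helper; label values and offset handling are assumptions.
    """
    combined = list(data_a) + list(data_b)
    # Which simulation each combined frame belongs to (0 = A, 1 = B here)
    cond = [0] * len(data_a) + [1] * len(data_b)
    # Frame index within the original simulation, shifted by the cutoff
    oidx = ([start_frame + i for i in range(len(data_a))]
            + [start_frame + i for i in range(len(data_b))])
    return combined, cond, oidx

data_a = [[0.1], [0.2], [0.3]]
data_b = [[1.1], [1.2]]
combined, cond, oidx = combine_with_bookkeeping(data_a, data_b, start_frame=10)
print(cond, oidx)
```

With arrays of this kind, a cluster index in cidx can be traced back to a specific frame of a specific simulation by reading cond and oidx at the same position.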

obtain_mult_combined_clusters(data, start_frame=0, algorithm='kmeans', num_clusters=2, min_dist=12, max_iter=100, plot=True, saveas=None, labels=None, colors=None)

Clusters a combination of multiple data sets.

Parameters:

data (list of float arrays) – List of trajectory data sets, each in the format [frames, frame_data]
start_frame (int) – Frame from which the clustering data starts. Default: 0.
algorithm (string) – The algorithm to use for the clustering. Options: kmeans, rspace. Default: kmeans
num_clusters (int, optional) – Number of clusters for k-means clustering. Default: 2.
min_dist (float, optional) – Minimum distance for regspace clustering. Default: 12.
max_iter (int, optional) – Maximum number of iterations. Default: 100.
plot (bool, optional) – Create a plot. Default: True
saveas (str, optional) – Name of the file in which to save the plot. (only needed if “plot” is True)
labels (list of str, optional) – Labels for the plot. Default: None.
colors (list of str, optional) – Colors for the plot. Default: None.

Returns:

cidx (int array) – Cluster indices for each frame.
cond (int array) – Index of the simulation the data came from.
oidx (int array) – Index of each frame in the original simulation (taking into account the start_frame cutoff).
total_wss (float) – Within-sum-of-squares (WSS).
centroids (float array) – Centroids for all the clusters.
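A common follow-up to combined clustering is to count how many frames of each simulation fall into each cluster. The sketch below does this from cidx and cond alone; the helper name cluster_populations and the toy index values are made up for illustration and are not part of the package.

```python
from collections import Counter

def cluster_populations(cidx, cond):
    """Count the frames of each simulation in each cluster.

    Hypothetical helper; returns {simulation: [count per cluster]}.
    """
    counts = Counter(zip(cond, cidx))
    sims = sorted(set(cond))
    clusters = sorted(set(cidx))
    # Counter returns 0 for (simulation, cluster) pairs that never occur
    return {s: [counts[(s, c)] for c in clusters] for s in sims}

# Toy indices for seven frames from three simulations (illustrative values)
cidx = [0, 0, 1, 1, 1, 0, 2]
cond = [0, 0, 0, 1, 1, 2, 2]
pops = cluster_populations(cidx, cond)
print(pops)
```

Comparing the rows of the result shows whether a cluster is shared between simulations or visited by only one of them.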

pensa.clusters.trajectory

write_cluster_traj(cluster_idx, top_file, trj_file, out_name, start_frame=0)

Writes a trajectory into a separate file for each cluster.

Parameters:

cluster_idx (int array) – Cluster index for each frame.
top_file (str) – Reference topology for the trajectory.
trj_file (str) – Trajectory file from which the frames are picked.
out_name (str) – Core part of the name of the output files.
start_frame (int, optional) – Frame from which to start reading the trajectory. Default: 0.
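The frame selection behind this function can be sketched without any file I/O. The helper below only groups trajectory frame numbers by cluster; it assumes that entry i of cluster_idx refers to trajectory frame start_frame + i, and it does not reproduce how the output files are named or written. The name frames_per_cluster is hypothetical.

```python
def frames_per_cluster(cluster_idx, start_frame=0):
    """Map every cluster index to the trajectory frames assigned to it.

    Hypothetical helper; assumes cluster_idx[i] belongs to frame start_frame + i.
    """
    groups = {}
    for i, c in enumerate(cluster_idx):
        groups.setdefault(c, []).append(start_frame + i)
    return groups

groups = frames_per_cluster([0, 1, 1, 0, 2], start_frame=100)
print(groups)
```

Each list in the result holds the frames that would end up together in the output trajectory of one cluster.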

pensa.clusters.wss

wss_over_number_of_clusters(data, algorithm='kmeans', max_iter=100, num_repeats=5, max_num_clusters=12, plot_file=None)

Calculates the within-sum-of-squares (WSS) for different numbers of clusters, averaged over several iterations.

Parameters:

data (float array) – Trajectory data [frames, frame_data]
algorithm (string) – The algorithm to use for the clustering. Options: kmeans, rspace. Default: kmeans
max_iter (int, optional) – Maximum number of iterations. Default: 100.
num_repeats (int, optional) – Number of times to run the clustering for each number of clusters. Default: 5.
max_num_clusters (int, optional) – Maximum number of clusters for k-means clustering. Default: 12.
plot_file (str, optional) – Name of the file to save the plot.

Returns:

all_wss (float array) – WSS values for each number of clusters (starting at 2).
std_wss (float array) – Standard deviations of the WSS.
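The repeat-and-average procedure can be sketched with a tiny pure-Python k-means. This is an illustration under stated assumptions, not the package's code: the function names are hypothetical, the clustering is a minimal Lloyd-style loop with random initial centroids, the population standard deviation is used, and the scan is assumed to run from 2 up to and including max_num_clusters.

```python
import math
import random
import statistics

def nearest(x, centroids):
    """Index of the centroid closest to frame x (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda c: math.dist(x, centroids[c]))

def kmeans_wss(data, k, max_iter=100, rng=random):
    """Minimal Lloyd-style k-means; returns the total within-sum-of-squares."""
    centroids = rng.sample(data, k)  # random frames as initial centroids
    for _ in range(max_iter):
        labels = [nearest(x, centroids) for x in data]
        updated = []
        for c in range(k):
            members = [x for x, lab in zip(data, labels) if lab == c]
            # Keep the old centroid if a cluster ends up empty
            updated.append([sum(col) / len(members) for col in zip(*members)]
                           if members else centroids[c])
        if updated == centroids:  # converged
            break
        centroids = updated
    return sum(math.dist(x, centroids[nearest(x, centroids)]) ** 2 for x in data)

def wss_over_k_sketch(data, max_num_clusters=4, num_repeats=5, seed=0):
    """Mean and standard deviation of the WSS for k = 2 .. max_num_clusters."""
    rng = random.Random(seed)
    all_wss, std_wss = [], []
    for k in range(2, max_num_clusters + 1):
        runs = [kmeans_wss(data, k, rng=rng) for _ in range(num_repeats)]
        all_wss.append(statistics.mean(runs))
        std_wss.append(statistics.pstdev(runs))
    return all_wss, std_wss

# Three small, well-separated groups of frames (illustrative values)
data = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3],
        [5.0, 5.1], [5.2, 4.9], [4.9, 5.0],
        [9.0, 0.2], [9.1, 0.0], [8.8, 0.1]]
all_wss, std_wss = wss_over_k_sketch(data, max_num_clusters=4, num_repeats=5, seed=0)
print(all_wss, std_wss)
```

Repeating the clustering matters because k-means depends on its random starting centroids; the standard deviation indicates how much the WSS varies between runs for a given number of clusters.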

wss_over_number_of_combined_clusters(data_a, data_b, label_a='Sim A', label_b='Sim B', start_frame=0, algorithm='kmeans', max_iter=100, num_repeats=5, max_num_clusters=12, plot_file=None)

Calculates the within-sum-of-squares (WSS) for different numbers of clusters on a combination of two data sets, averaged over several iterations.

Parameters:

data_a (float array) – Trajectory data [frames, frame_data]
data_b (float array) – Trajectory data [frames, frame_data]
label_a (str, optional) – Label for the plot. Default: Sim A.
label_b (str, optional) – Label for the plot. Default: Sim B.
start_frame (int, optional) – Frame from which the clustering data starts. Default: 0.
algorithm (string) – The algorithm to use for the clustering. Options: kmeans, rspace. Default: kmeans
max_iter (int, optional) – Maximum number of iterations. Default: 100.
num_repeats (int, optional) – Number of times to run the clustering for each number of clusters. Default: 5.
max_num_clusters (int, optional) – Maximum number of clusters for k-means clustering. Default: 12.
plot_file (str, optional) – Name of the file to save the plot.

Returns:

all_wss (float array) – WSS values for each number of clusters (starting at 2).
std_wss (float array) – Standard deviations of the WSS.
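Because the returned values start at two clusters, position i of all_wss belongs to i + 2 clusters. The sketch below uses that mapping to report how much the WSS drops with each additional cluster, one simple way to read such a curve. The helper name and the WSS numbers are made up for illustration.

```python
def relative_wss_drops(all_wss, first_k=2):
    """Fractional WSS decrease for each cluster count relative to the previous one.

    Hypothetical helper; all_wss[i] is taken to belong to first_k + i clusters.
    """
    drops = {}
    for i in range(1, len(all_wss)):
        k = first_k + i
        drops[k] = (all_wss[i - 1] - all_wss[i]) / all_wss[i - 1]
    return drops

# Illustrative WSS values for 2, 3, 4 and 5 clusters (not real results)
all_wss = [100.0, 40.0, 36.0, 35.0]
drops = relative_wss_drops(all_wss)
print(drops)
```

A large drop followed by small ones suggests that adding further clusters beyond that point explains little additional variance.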