pensa.clusters

pensa.clusters.clustering

find_closest_frames(data, points)

Finds the frames in a timeseries that are closest to given points.

The timeseries can be multidimensional and there can be an arbitrary number of points. Usually used to identify the frames closest to a cluster centroid, but can be used for any feature value.

Parameters:
  • data (float array) – Trajectory data [frames, frame_data]

  • points (list of float arrays) – Points to which the closest frames shall be found. Dimension must be that of frame_data.

Returns:

  • frames (list of int) – Indices of the frames closest to each point.

  • distances (list of float) – Distance from each point to its closest frame.
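
The search described above can be illustrated with a minimal sketch in plain Python. It is not the library's implementation: the name closest_frames_sketch is hypothetical, and the use of the Euclidean distance is an assumption, since the description does not name a metric.

```python
import math

def closest_frames_sketch(data, points):
    """For each point, return the index of the nearest frame and its distance.

    Assumption: 'closest' means smallest Euclidean distance.
    """
    frames, distances = [], []
    for point in points:
        # Distance from this point to every frame in the timeseries
        dists = [math.dist(frame, point) for frame in data]
        # Index of the frame with the smallest distance
        best = min(range(len(dists)), key=dists.__getitem__)
        frames.append(best)
        distances.append(dists[best])
    return frames, distances

# Four frames with two features each, and two query points
data = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0], [5.0, 5.0]]
points = [[0.9, 1.2], [4.6, 4.9]]
frames, distances = closest_frames_sketch(data, points)
print(frames)  # [1, 3]
```

In practice the points would typically be the centroids returned by one of the clustering functions, so that each returned frame serves as a representative structure for its cluster.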

obtain_clusters(data, algorithm='kmeans', num_clusters=2, min_dist=12, max_iter=100, plot=True, saveas=None)

Clusters the provided data.

Parameters:
  • data (float array) – Trajectory data. Format: [frames, frame_data]

  • algorithm (string) – The algorithm to use for the clustering. Options: kmeans, rspace. Default: kmeans

  • num_clusters (int, optional) – Number of clusters for k-means clustering. Default: 2.

  • min_dist (float, optional) – Minimum distance for regular-space (rspace) clustering. Default: 12.

  • max_iter (int, optional) – Maximum number of iterations. Default: 100.

  • plot (bool, optional) – Create a plot. Default: True

  • saveas (str, optional) – Name of the file in which to save the plot. (only needed if “plot” is True)

Returns:

  • cidx (int array) – Cluster indices for each frame.

  • total_wss (float) – Within-sum-of-squares (WSS).

  • centroids (float array) – Centroids for all the clusters.
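
To make the relation between the three return values concrete, the sketch below recomputes a WSS from cluster indices and centroids. It assumes the standard definition (sum of squared Euclidean distances from each frame to the centroid of its cluster); the function name total_wss_sketch and the toy data are hypothetical.

```python
def total_wss_sketch(data, cidx, centroids):
    """Sum of squared distances from each frame to its cluster centroid.

    Assumption: this is the standard WSS definition, not necessarily
    the exact computation used by the library.
    """
    wss = 0.0
    for frame, c in zip(data, cidx):
        wss += sum((x - m) ** 2 for x, m in zip(frame, centroids[c]))
    return wss

# Four frames in two clusters; every frame is 1.0 away from its centroid
data = [[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]]
cidx = [0, 0, 1, 1]
centroids = [[1.0, 0.0], [11.0, 0.0]]
wss = total_wss_sketch(data, cidx, centroids)
print(wss)  # 4.0
```

A lower WSS means that the frames lie closer to their centroids, which is why this value is used to compare clusterings with different numbers of clusters.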

obtain_combined_clusters(data_a, data_b, label_a='Sim A', label_b='Sim B', start_frame=0, algorithm='kmeans', num_clusters=2, min_dist=12, max_iter=100, plot=True, saveas=None)

Clusters a combination of two data sets.

Parameters:
  • data_a (float array) – Trajectory data [frames, frame_data]

  • data_b (float array) – Trajectory data [frames, frame_data]

  • label_a (str, optional) – Label for the plot. Default: Sim A.

  • label_b (str, optional) – Label for the plot. Default: Sim B.

  • start_frame (int) – Frame from which the clustering data starts. Default: 0.

  • algorithm (string) – The algorithm to use for the clustering. Options: kmeans, rspace. Default: kmeans

  • num_clusters (int, optional) – Number of clusters for k-means clustering. Default: 2.

  • min_dist (float, optional) – Minimum distance for regular-space (rspace) clustering. Default: 12.

  • max_iter (int, optional) – Maximum number of iterations. Default: 100.

  • plot (bool, optional) – Create a plot. Default: True

  • saveas (str, optional) – Name of the file in which to save the plot. (only needed if “plot” is True)

Returns:

  • cidx (int array) – Cluster indices for each frame.

  • cond (int array) – Index of the simulation the data came from.

  • oidx (int array) – Index of each frame in the original simulation (taking the start_frame cutoff into account).

  • total_wss (float) – Within-sum-of-squares (WSS).

  • centroids (float array) – Centroids for all the clusters.
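
The bookkeeping arrays cond and oidx can be pictured with the following sketch. It rests on two assumptions that the description does not spell out: start_frame is read as the original frame number of the first row of each data set, and the integers 0 and 1 used in cond for the two simulations are chosen only for illustration. The name combine_sketch is hypothetical.

```python
def combine_sketch(data_a, data_b, start_frame=0):
    """Concatenate two data sets and record the origin of every frame."""
    combined = list(data_a) + list(data_b)
    # cond: which simulation each combined frame came from
    # (assumed encoding: 0 for data_a, 1 for data_b)
    cond = [0] * len(data_a) + [1] * len(data_b)
    # oidx: frame number in the original simulation, assuming the first
    # row of each data set corresponds to frame `start_frame`
    oidx = ([start_frame + i for i in range(len(data_a))]
            + [start_frame + i for i in range(len(data_b))])
    return combined, cond, oidx

data_a = [[0.1], [0.2], [0.3]]
data_b = [[5.1], [5.2]]
combined, cond, oidx = combine_sketch(data_a, data_b, start_frame=100)
print(cond)  # [0, 0, 0, 1, 1]
print(oidx)  # [100, 101, 102, 100, 101]
```

Together with cidx, these arrays make it possible to trace any clustered frame back to its simulation and its position in the original trajectory.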

obtain_mult_combined_clusters(data, start_frame=0, algorithm='kmeans', num_clusters=2, min_dist=12, max_iter=100, plot=True, saveas=None, labels=None, colors=None)

Clusters a combination of multiple data sets.

Parameters:
  • data (list of float arrays) – Trajectory data [frames, frame_data]

  • start_frame (int) – Frame from which the clustering data starts. Default: 0.

  • algorithm (string) – The algorithm to use for the clustering. Options: kmeans, rspace. Default: kmeans

  • num_clusters (int, optional) – Number of clusters for k-means clustering. Default: 2.

  • min_dist (float, optional) – Minimum distance for regular-space (rspace) clustering. Default: 12.

  • max_iter (int, optional) – Maximum number of iterations. Default: 100.

  • plot (bool, optional) – Create a plot. Default: True

  • saveas (str, optional) – Name of the file in which to save the plot. (only needed if “plot” is True)

  • labels (list of str, optional) – Labels for the plot, one per data set. Default: None.

  • colors (list of str, optional) – Colors for the plot. Default: None.

Returns:

  • cidx (int array) – Cluster indices for each frame.

  • cond (int array) – Index of the simulation the data came from.

  • oidx (int array) – Index of each frame in the original simulation (taking the start_frame cutoff into account).

  • total_wss (float) – Within-sum-of-squares (WSS).

  • centroids (float array) – Centroids for all the clusters.
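
One possible way to use cidx and cond together is to count how many frames of each simulation fall into each cluster. The sketch below is only an illustration of that idea: the helper cluster_populations_sketch and the toy values are hypothetical and not part of the library.

```python
from collections import Counter

def cluster_populations_sketch(cidx, cond):
    """Count frames per cluster, separately for each simulation."""
    counts = Counter(zip(cond, cidx))
    sims = sorted(set(cond))
    clusters = sorted(set(cidx))
    # One list per simulation, with one entry per cluster
    return {s: [counts[(s, c)] for c in clusters] for s in sims}

# Eight frames from three simulations, assigned to three clusters
cidx = [0, 0, 1, 1, 1, 0, 2, 2]
cond = [0, 0, 0, 1, 1, 1, 2, 2]
pops = cluster_populations_sketch(cidx, cond)
print(pops)  # {0: [2, 1, 0], 1: [1, 2, 0], 2: [0, 0, 2]}
```

A cluster populated by only one simulation (cluster 2 in this toy example) indicates a state that the other simulations do not visit.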

pensa.clusters.trajectory

write_cluster_traj(cluster_idx, top_file, trj_file, out_name, start_frame=0)

Writes a trajectory into a separate file for each cluster.

Parameters:
  • cluster_idx (int array) – Cluster index for each frame.

  • top_file (str) – Reference topology for the trajectory.

  • trj_file (str) – Trajectory file from which the frames are picked.

  • out_name (str) – Core part of the name of the output files.

  • start_frame (int, optional) – Frame from which to start reading the trajectory.
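
The frame selection behind the per-cluster output can be sketched without any trajectory I/O. The sketch assumes that entry i of cluster_idx belongs to trajectory frame start_frame + i; the helper frames_per_cluster_sketch is hypothetical, and the actual file writing and output naming are left out.

```python
from collections import defaultdict

def frames_per_cluster_sketch(cluster_idx, start_frame=0):
    """Map each cluster index to the trajectory frames assigned to it.

    Assumption: cluster_idx[i] refers to trajectory frame start_frame + i.
    """
    groups = defaultdict(list)
    for i, c in enumerate(cluster_idx):
        groups[c].append(start_frame + i)
    return dict(groups)

cluster_idx = [0, 1, 1, 0, 2]
groups = frames_per_cluster_sketch(cluster_idx, start_frame=10)
print(groups)  # {0: [10, 13], 1: [11, 12], 2: [14]}
```

Each list of frame numbers corresponds to the content of one output trajectory.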

pensa.clusters.wss

wss_over_number_of_clusters(data, algorithm='kmeans', max_iter=100, num_repeats=5, max_num_clusters=12, plot_file=None)

Calculates the within-sum-of-squares (WSS) for different numbers of clusters, averaged over several iterations.

Parameters:
  • data (float array) – Trajectory data [frames, frame_data]

  • algorithm (string) – The algorithm to use for the clustering. Options: kmeans, rspace. Default: kmeans

  • max_iter (int, optional) – Maximum number of iterations. Default: 100.

  • num_repeats (int, optional) – Number of times to run the clustering for each number of clusters. Default: 5.

  • max_num_clusters (int, optional) – Maximum number of clusters for k-means clustering. Default: 12.

  • plot_file (str, optional) – Name of the file to save the plot.

Returns:

  • all_wss (float array) – WSS values for each number of clusters (starting at 2).

  • std_wss (float array) – Standard deviations of the WSS.
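
The averaging procedure can be sketched as follows. This is a simplified stand-in, not the library's code: it uses a plain Lloyd's k-means written in pure Python, assumes that the range of cluster numbers includes max_num_clusters, and uses the population standard deviation. All names ending in _sketch, the seed argument, and the synthetic data are hypothetical.

```python
import random
import statistics

def sq_dist(a, b):
    """Squared Euclidean distance between two frames."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_wss_sketch(data, k, max_iter, rng):
    """Plain Lloyd's k-means; returns the WSS of the final clustering."""
    centroids = rng.sample(data, k)  # random frames as initial centroids
    for _ in range(max_iter):
        # Assign every frame to its nearest centroid
        cidx = [min(range(k), key=lambda c: sq_dist(f, centroids[c]))
                for f in data]
        # Recompute centroids; keep the old one if a cluster is empty
        new = []
        for c in range(k):
            members = [f for f, i in zip(data, cidx) if i == c]
            if members:
                new.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new.append(centroids[c])
        if new == centroids:
            break
        centroids = new
    return sum(sq_dist(f, centroids[i]) for f, i in zip(data, cidx))

def wss_over_k_sketch(data, max_iter=100, num_repeats=5,
                      max_num_clusters=12, seed=0):
    """Mean and standard deviation of the WSS for k = 2 .. max_num_clusters."""
    rng = random.Random(seed)
    all_wss, std_wss = [], []
    # Assumption: the upper bound is inclusive
    for k in range(2, max_num_clusters + 1):
        rep = [kmeans_wss_sketch(data, k, max_iter, rng)
               for _ in range(num_repeats)]
        all_wss.append(statistics.mean(rep))
        std_wss.append(statistics.pstdev(rep))
    return all_wss, std_wss

# Synthetic data: two well-separated groups of 30 frames each
gen = random.Random(1)
data = ([[gen.gauss(0, 1), gen.gauss(0, 1)] for _ in range(30)]
        + [[gen.gauss(8, 1), gen.gauss(8, 1)] for _ in range(30)])
all_wss, std_wss = wss_over_k_sketch(data, max_num_clusters=5)
print(len(all_wss), len(std_wss))  # 4 4
```

Plotting the mean WSS against the number of clusters and looking for the point where the curve flattens is a common heuristic for choosing num_clusters.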

wss_over_number_of_combined_clusters(data_a, data_b, label_a='Sim A', label_b='Sim B', start_frame=0, algorithm='kmeans', max_iter=100, num_repeats=5, max_num_clusters=12, plot_file=None)

Calculates the within-sum-of-squares (WSS) for different numbers of clusters on the combination of two data sets, averaged over several iterations.

Parameters:
  • data_a (float array) – Trajectory data [frames, frame_data]

  • data_b (float array) – Trajectory data [frames, frame_data]

  • label_a (str, optional) – Label for the plot.

  • label_b (str, optional) – Label for the plot.

  • start_frame (int, optional) – Frame from which the clustering data starts.

  • algorithm (string) – The algorithm to use for the clustering. Options: kmeans, rspace. Default: kmeans

  • max_iter (int, optional) – Maximum number of iterations. Default: 100.

  • num_repeats (int, optional) – Number of times to run the clustering for each number of clusters. Default: 5.

  • max_num_clusters (int, optional) – Maximum number of clusters for k-means clustering. Default: 12.

  • plot_file (str, optional) – Name of the file to save the plot.

Returns:

  • all_wss (float array) – WSS values for each number of clusters (starting at 2).

  • std_wss (float array) – Standard deviations of the WSS.