pensa.dimensionality

pensa.dimensionality.pca

calculate_pca(data, dim=None)

Performs a scikit-learn PCA on the provided data.

Parameters:
  • data (float array) – Trajectory data [frames, frame_data].

  • dim (int, optional, default = None) – The number of dimensions (principal components) to project onto. None means all numerically available dimensions will be used.

Returns:

pca – Principal components information.

Return type:

PCA obj
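
A minimal NumPy sketch of what a PCA on data in the [frames, frame_data] layout computes. It is not PENSA's code (which, per the description above, delegates to scikit-learn); the helper name pca_sketch and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def pca_sketch(data, dim=None):
    """Hypothetical helper: PCA via SVD of the mean-centered data."""
    data = np.asarray(data, dtype=float)
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    eigenvalues = s**2 / (data.shape[0] - 1)  # variance along each axis
    if dim is not None:
        vt, eigenvalues = vt[:dim], eigenvalues[:dim]
    return vt, eigenvalues

rng = np.random.default_rng(0)
# 200 frames, 5 features with clearly different variances (made-up data)
data = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])
components, eigenvalues = pca_sketch(data)
print(components.shape)
print(eigenvalues)
```

With dim left at None, the sketch keeps every component the data supports, which is at most the smaller of the number of frames and the number of features.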

get_components_pca(data, num, pca=None, prefix='')

Projects a trajectory onto the first num eigenvectors of its PCA.

Parameters:
  • data (float array) – Trajectory data [frames, frame_data].

  • num (int) – Number of eigenvectors to project on.

  • pca (PCA obj, optional, default = None) – Information of pre-calculated PCA. Must be calculated for the same features (but not necessarily the same trajectory).

  • prefix (str, optional, default = '') – First part of the component names. Second part is “PC”+<PC number>

Returns:

  • comp_names (list) – Names/numbers of the components.

  • components (float array) – Component data [frames, components]
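
The two return values can be pictured with the following NumPy sketch. It does not call PENSA; the prefix "bb_" and the choice to number the names from 1 are assumptions of this example, not documented behavior.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(size=(50, 4))           # [frames, frame_data], made-up data
num, prefix = 2, "bb_"                    # hypothetical values

centered = data - data.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)

# Numbering the names from 1 is an assumption of this sketch.
comp_names = [f"{prefix}PC{i + 1}" for i in range(num)]
# One column per retained component: [frames, components].
components = centered @ vt[:num].T
print(comp_names, components.shape)
```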

pca_eigenvalues_plot(pca, num=12, plot_file=None)

Plots the highest eigenvalues over the number of the principal components.

Parameters:
  • pca (PCA obj) – Principal components information.

  • num (int, optional, default = 12) – Number of eigenvalues to plot.

  • plot_file (str, optional, default = None) – Path and name of the file to save the plot.

pca_features(tica, features, num, threshold, plot_file=None, add_labels=False)

project_on_pc(data, ev_idx, pca=None, dim=-1)

Projects a trajectory onto an eigenvector of its PCA, i.e., calculates the value along this component at each step of the trajectory (retains the order of the trajectory). Note that the eigenvector is indexed starting from zero.

Parameters:
  • data (float array) – Trajectory data [frames, frame_data].

  • ev_idx (int) – Index of the eigenvector to project on (starts with zero).

  • pca (PCA obj, optional, default = None) – Information of pre-calculated PCA. Must be calculated for the same features (but not necessarily the same trajectory).

  • dim (int, optional, default = -1) – The number of dimensions (principal components) to project onto. Only used if pca is not provided.

Returns:

projection – Value along the PC for each frame.

Return type:

float array
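
As an illustration of projecting onto a single eigenvector with a PCA computed on a different trajectory over the same features, here is a NumPy sketch. It is a stand-in for the idea, not PENSA's implementation; the variable names traj_ref and traj_new are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
scale = np.array([4.0, 1.0, 0.2])
traj_ref = rng.normal(size=(300, 3)) * scale   # used to compute the PCA
traj_new = rng.normal(size=(120, 3)) * scale   # same features, other frames

mean = traj_ref.mean(axis=0)
_, _, vt = np.linalg.svd(traj_ref - mean, full_matrices=False)

ev_idx = 0                                     # zero-based: first eigenvector
# Center with the reference mean, then take the dot product with the eigenvector.
projection = (traj_new - mean) @ vt[ev_idx]
print(projection.shape)
```

The result has one value per frame, in the original frame order.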

sort_mult_trajs_along_common_pc(data, top, trj, out_name, num_pc=3, start_frame=0)

Sort multiple trajectories along their most important common principal components. For each of the num_pc specified components, return a trajectory in which the frames from all original trajectories are ordered by their value along the respective components.

Parameters:
  • data (list of float arrays) – List of trajectory data arrays, each [frames, frame_data].

  • top (list of str) – Reference topology files.

  • trj (list of str) – Trajectories from which the frames are picked. trj[i] should be the trajectory that data[i] was computed from.

  • out_name (str) – Core part of the name of the output files.

  • num_pc (int, optional, default = 3) – Sort along the first num_pc principal components.

  • start_frame (int or list of int, default = 0) – Offset of the data with respect to the trajectories.

Returns:

  • sorted_proj (list) – sorted projections on each principal component

  • sorted_indices_data (list) – Sorted indices of the data array for each principal component

  • sorted_indices_traj (list) – Sorted indices of the coordinate frames for each principal component
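
The bookkeeping behind sorting frames from several trajectories can be sketched in plain Python. The projection values are made up, and both the ascending sort order and the reading of start_frame as one offset per trajectory are assumptions of this example.

```python
# Projection values per trajectory on one common component (made-up numbers).
proj_per_traj = [[0.3, -0.7, 1.1], [0.0, 2.0]]
start_frame = [5, 0]   # assumption: one offset per trajectory

# Flatten to (value, trajectory index, data row) records.
records = [
    (value, t, i)
    for t, proj in enumerate(proj_per_traj)
    for i, value in enumerate(proj)
]
records.sort(key=lambda r: r[0])   # ascending order is an assumption

sorted_proj = [value for value, _, _ in records]
# Frame to pick from each coordinate trajectory, after applying its offset.
picked = [(t, i + start_frame[t]) for _, t, i in records]
print(sorted_proj)
print(picked)
```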

sort_traj_along_pc(data, top, trj, out_name, pca=None, num_pc=3, start_frame=0)

Sort a trajectory along principal components. For each of the num_pc specified components, return a trajectory in which the frames are ordered by their value along the respective components.

Parameters:
  • data (float array) – Trajectory data [frames, frame_data].

  • top (str) – File name of the reference topology for the trajectory.

  • trj (str) – File name of the trajectory from which the frames are picked. Should be the trajectory that data was computed from.

  • out_name (str) – Core part of the name of the output files.

  • pca (PCA obj, optional, default = None) – Principal components information. If none is provided, it will be calculated.

  • num_pc (int, optional, default = 3) – Sort along the first num_pc principal components.

  • start_frame (int, optional, default = 0) – Offset of the data with respect to the trajectory.

Returns:

  • sorted_proj (list) – sorted projections on each principal component

  • sorted_indices_data (list) – Sorted indices of the data array for each principal component

  • sorted_indices_traj (list) – Sorted indices of the coordinate frames for each principal component
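
The relation between the three return values can be sketched in plain Python for a single component. The numbers are made up; the ascending order and the rule "data row i corresponds to trajectory frame i + start_frame" are assumptions of this example.

```python
# Projection of each data row onto one component (made-up values).
proj = [0.8, -1.2, 0.1, 2.5, -0.4]
start_frame = 10   # assumption: data row i is trajectory frame i + start_frame

# Indices of the data rows, ordered by increasing projection value.
sorted_indices_data = sorted(range(len(proj)), key=lambda i: proj[i])
sorted_proj = [proj[i] for i in sorted_indices_data]
# Matching frame indices in the coordinate trajectory.
sorted_indices_traj = [i + start_frame for i in sorted_indices_data]

print(sorted_proj)
print(sorted_indices_data)
print(sorted_indices_traj)
```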

sort_trajs_along_common_pc(data_a, data_b, top_a, top_b, trj_a, trj_b, out_name, num_pc=3, start_frame=0)

Sort two trajectories along their most important common principal components. For each of the num_pc specified components, return a trajectory in which the frames from both original trajectories are ordered by their value along the respective components.

Parameters:
  • data_a (float array) – Trajectory data [frames, frame_data].

  • data_b (float array) – Trajectory data [frames, frame_data].

  • top_a (str) – Reference topology for the first trajectory.

  • top_b (str) – Reference topology for the second trajectory.

  • trj_a (str) – First of the trajectories from which the frames are picked. Should be the trajectory that data_a was computed from.

  • trj_b (str) – Second of the trajectories from which the frames are picked. Should be the trajectory that data_b was computed from.

  • out_name (str) – Core part of the name of the output files.

  • num_pc (int, optional, default = 3) – Sort along the first num_pc principal components.

  • start_frame (int or list of int, default = 0) – Offset of the data with respect to the trajectories.

Returns:

  • sorted_proj (list) – sorted projections on each principal component

  • sorted_indices_data (list) – Sorted indices of the data array for each principal component

  • sorted_indices_traj (list) – Sorted indices of the coordinate frames for each principal component
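
One way to picture "common" principal components is a PCA on the concatenated data of both trajectories, followed by a joint sort. The NumPy sketch below rests on that assumption and uses synthetic data; it does not write any trajectory files.

```python
import numpy as np

rng = np.random.default_rng(3)
data_a = rng.normal(loc=-2.0, size=(40, 3))   # made-up trajectory data
data_b = rng.normal(loc=+2.0, size=(60, 3))

combined = np.concatenate([data_a, data_b])   # frames of a, then frames of b
origin = np.array([0] * len(data_a) + [1] * len(data_b))  # 0 = a, 1 = b

centered = combined - combined.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)

proj = centered @ vt[0]          # value along the first common PC
order = np.argsort(proj)         # frame order from lowest to highest value
sorted_proj = proj[order]
sorted_origin = origin[order]    # which trajectory each sorted frame came from
print(sorted_origin)
```

Because the two synthetic ensembles are well separated, the sorted frames group by trajectory of origin.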

pensa.dimensionality.tica

calculate_tica(data, dim=None, lag=10)

Performs time-lagged independent component analysis (TICA) on the provided data.

Parameters:
  • data (float array) – Trajectory data. Format: [frames, frame_data].

  • dim (int, optional, default = None) – The number of dimensions (independent components) to project onto. None means all numerically available dimensions will be used.

  • lag (int, optional, default = 10) – The lag time, in multiples of the input time step.

Returns:

tica – Time-lagged independent component information.

Return type:

TICA obj
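
To show what TICA computes and how lag enters, here is a textbook-style NumPy sketch: it whitens the data with the instantaneous covariance and diagonalizes the symmetrized time-lagged covariance. The helper tica_sketch is hypothetical and is not the estimator PENSA uses internally; the two-feature data set is synthetic.

```python
import numpy as np

def tica_sketch(data, lag=10, dim=None):
    """Hypothetical helper: TICA via whitening and a symmetric eigenproblem."""
    x = np.asarray(data, dtype=float)
    x = x - x.mean(axis=0)
    n = x.shape[0]
    c0 = x.T @ x / n                          # instantaneous covariance
    ct = x[:-lag].T @ x[lag:] / (n - lag)     # covariance at the lag time
    ct = 0.5 * (ct + ct.T)                    # symmetrize
    # Whiten with C0^(-1/2); assumes C0 has full rank.
    s, u = np.linalg.eigh(c0)
    w = u / np.sqrt(s)
    evals, v = np.linalg.eigh(w.T @ ct @ w)
    order = np.argsort(evals)[::-1]           # slowest components first
    evals, evecs = evals[order], (w @ v)[:, order]
    if dim is not None:
        evals, evecs = evals[:dim], evecs[:, :dim]
    return evals, evecs

rng = np.random.default_rng(4)
n_frames = 5000
slow = np.zeros(n_frames)
for t in range(1, n_frames):
    slow[t] = 0.99 * slow[t - 1] + rng.normal()   # slowly decorrelating signal
fast = 10.0 * rng.normal(size=n_frames)           # large variance, no memory
data = np.column_stack([fast, slow])              # [frames, frame_data]

eigenvalues, eigenvectors = tica_sketch(data, lag=10)
ic1 = (data - data.mean(axis=0)) @ eigenvectors[:, 0]
print(eigenvalues)
```

Unlike PCA, the leading component here follows the slow feature rather than the feature with the largest variance.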

get_components_tica(data, num, tica=None, lag=10, prefix='')

Projects a trajectory onto the first num eigenvectors of its tICA.

Parameters:
  • data (float array) – Trajectory data [frames, frame_data].

  • num (int) – Number of eigenvectors to project on.

  • tica (tICA obj, optional, default = None) – Information of pre-calculated tICA. Must be calculated for the same features (but not necessarily the same trajectory).

  • lag (int, optional, default = 10) – The lag time, in multiples of the input time step. Only used if tica is not provided.

  • prefix (str, optional, default = '') – First part of the component names. Second part is “IC”+<IC number>

Returns:

  • comp_names (list) – Names/numbers of the components.

  • components (float array) – Component data [frames, components]

project_on_tic(data, ev_idx, tica=None, dim=-1, lag=10)

Projects a trajectory onto an eigenvector of its TICA, i.e., calculates the value along this component at each step of the trajectory (retains the order of the trajectory). Note that the eigenvector is indexed starting from zero.

Parameters:
  • data (float array) – Trajectory data [frames, frame_data].

  • ev_idx (int) – Index of the eigenvector to project on (starts with zero).

  • tica (TICA obj, optional, default = None) – Information of pre-calculated TICA. Must be calculated for the same features (but not necessarily the same trajectory).

  • dim (int, optional, default = -1) – The number of dimensions (independent components) to project onto. Only used if tica is not provided.

  • lag (int, optional, default = 10) – The lag time, in multiples of the input time step. Only used if tica is not provided.

Returns:

projection – Value along the TIC for each frame.

Return type:

float array

sort_mult_trajs_along_common_tic(data, top, trj, out_name, num_ic=3, lag=10, start_frame=0)

Sort multiple trajectories along their most important common independent components. For each of the num_ic specified components, return a trajectory in which the frames from all original trajectories are ordered by their value along the respective components.

Parameters:
  • data (list of float arrays) – List of trajectory data arrays, each [frames, frame_data].

  • top (list of str) – Reference topology files.

  • trj (list of str) – Trajectories from which the frames are picked. trj[i] should be the trajectory that data[i] was computed from.

  • out_name (str) – Core part of the name of the output files.

  • num_ic (int, optional, default = 3) – Sort along the first num_ic independent components.

  • lag (int, optional, default = 10) – The lag time, in multiples of the input time step.

  • start_frame (int or list of int, default = 0) – Offset of the data with respect to the trajectories.

Returns:

  • sorted_proj (list) – sorted projections on each independent component

  • sorted_indices_data (list) – Sorted indices of the data array for each independent component

  • sorted_indices_traj (list) – Sorted indices of the coordinate frames for each independent component

sort_traj_along_tic(data, top, trj, out_name, tica=None, num_ic=3, lag=10, start_frame=0)

Sort a trajectory along independent components. For each of the num_ic specified components, return a trajectory in which the frames are ordered by their value along the respective components.

Parameters:
  • data (float array) – Trajectory data [frames, frame_data].

  • top (str) – File name of the reference topology for the trajectory.

  • trj (str) – File name of the trajectory from which the frames are picked. Should be the trajectory that data was computed from.

  • out_name (str) – Core part of the name of the output files.

  • tica (tICA obj, optional, default = None) – Time-lagged independent components information. If none is provided, it will be calculated.

  • num_ic (int, optional, default = 3) – Sort along the first num_ic independent components.

  • lag (int, optional, default = 10) – The lag time, in multiples of the input time step. Only used if tica is not provided.

  • start_frame (int, optional, default = 0) – Offset of the data with respect to the trajectory.

Returns:

  • sorted_proj (list) – Sorted projections on each independent component

  • sorted_indices_data (list) – Sorted indices of the data array for each independent component

  • sorted_indices_traj (list) – Sorted indices of the coordinate frames for each independent component

sort_trajs_along_common_tic(data_a, data_b, top_a, top_b, trj_a, trj_b, out_name, num_ic=3, lag=10, start_frame=0)

Sort two trajectories along their most important common time-lagged independent components. For each of the num_ic specified components, return a trajectory in which the frames from both original trajectories are ordered by their value along the respective components.

Parameters:
  • data_a (float array) – Trajectory data [frames, frame_data].

  • data_b (float array) – Trajectory data [frames, frame_data].

  • top_a (str) – Reference topology for the first trajectory.

  • top_b (str) – Reference topology for the second trajectory.

  • trj_a (str) – First of the trajectories from which the frames are picked. Should be the trajectory that data_a was computed from.

  • trj_b (str) – Second of the trajectories from which the frames are picked. Should be the trajectory that data_b was computed from.

  • out_name (str) – Core part of the name of the output files.

  • num_ic (int, optional, default = 3) – Sort along the first num_ic independent components.

  • lag (int, optional, default = 10) – The lag time, in multiples of the input time step.

  • start_frame (int, optional, default = 0) – Offset of the data with respect to the trajectories.

Returns:

  • sorted_proj (list) – Sorted projections on each independent component

  • sorted_indices_data (list) – Sorted indices of the data array for each independent component

  • sorted_indices_traj (list) – Sorted indices of the coordinate frames for each independent component

tica_eigenvalues_plot(tica, num=12, plot_file=None)

Plots the highest eigenvalues over the number of the time-lagged independent components.

Parameters:
  • tica (TICA obj) – Time-lagged independent components information.

  • num (int, optional, default = 12) – Number of eigenvalues to plot.

  • plot_file (str, optional, default = None) – Path and name of the file to save the plot.

tica_features(tica, features, num, threshold, plot_file=None, add_labels=False)

pensa.dimensionality.visualization

compare_mult_projections(data, ana, num=3, saveas=None, labels=None, colors=None)

Compare multiple datasets along the components of a PCA or tICA.

Parameters:
  • data (list of float arrays) – Data from multiple trajectories [frames, frame_data].

  • ana (PCA or tICA object) – Components analysis information.

  • num (int) – Number of components to plot.

  • saveas (str, optional) – Name of the output file.

  • labels (list of str, optional) – Labels for the datasets. If provided, it must have the same length as data.

Returns:

projections – Projections of the trajectories on each component.

Return type:

list of float arrays

compare_projections(data_a, data_b, ana, num=3, saveas=None, label_a=None, label_b=None)

Compare two datasets along the components of a PCA or tICA.

Parameters:
  • data_a (float array) – Trajectory data [frames, frame_data].

  • data_b (float array) – Trajectory data [frames, frame_data].

  • ana (PCA or tICA object) – Components analysis information.

  • num (int) – Number of components to plot.

  • saveas (str, optional) – Name of the output file.

  • label_a (str, optional) – Label for the first dataset.

  • label_b (str, optional) – Label for the second dataset.

Returns:

projections – Projections of the trajectory on each component.

Return type:

list of float arrays
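
The comparison amounts to projecting both datasets onto the same components and looking at the resulting distributions. The NumPy sketch below does this numerically instead of plotting; computing the components from the concatenated synthetic data is an assumption of the example, since the function itself receives a pre-calculated ana object.

```python
import numpy as np

rng = np.random.default_rng(5)
data_a = rng.normal(loc=0.0, size=(500, 4))   # made-up data
data_b = rng.normal(loc=0.0, size=(500, 4))
data_b[:, 0] += 3.0                           # the ensembles differ in feature 0

combined = np.concatenate([data_a, data_b])
mean = combined.mean(axis=0)
_, _, vt = np.linalg.svd(combined - mean, full_matrices=False)

num = 2
proj_a = (data_a - mean) @ vt[:num].T         # [frames, components]
proj_b = (data_b - mean) @ vt[:num].T

# Shared bin edges make the two histograms directly comparable.
edges = np.histogram_bin_edges(
    np.concatenate([proj_a[:, 0], proj_b[:, 0]]), bins=20
)
hist_a, _ = np.histogram(proj_a[:, 0], bins=edges, density=True)
hist_b, _ = np.histogram(proj_b[:, 0], bins=edges, density=True)
shift = abs(proj_a[:, 0].mean() - proj_b[:, 0].mean())
print(shift)
```

A large shift along the first component indicates that it separates the two ensembles.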

project_on_eigenvector_pca(data, ev_idx, ana)

Projects a trajectory onto an eigenvector of its PCA.

Parameters:
  • data (float array) – Trajectory data [frames, frame_data].

  • ev_idx (int) – Index of the eigenvector to project on (starts with zero).

  • ana (PCA obj) – Information of pre-calculated PCA. Must be calculated for the same features (but not necessarily the same trajectory).

Returns:

projection – Value along the PC for each frame.

Return type:

float array

project_on_eigenvector_tica(data, ev_idx, ana)

Projects a trajectory onto an eigenvector of its tICA.

Parameters:
  • data (float array) – Trajectory data [frames, frame_data].

  • ev_idx (int) – Index of the eigenvector to project on (starts with zero).

  • ana (tICA obj) – Information of pre-calculated tICA. Must be calculated for the same features (but not necessarily the same trajectory).

Returns:

projection – Value along the TIC for each frame.

Return type:

float array

sort_traj_along_projection(data, ana, top, trj, out_name, num_comp=3, start_frame=0)

Sort a trajectory along the components of a given PCA or tICA.

Parameters:
  • data (float array) – Trajectory data [frames, frame_data].

  • ana (PCA or tICA obj) – Components information.

  • top (str) – File name of the reference topology for the trajectory.

  • trj (str) – File name of the trajectory from which the frames are picked. Should be the trajectory that data was computed from.

  • out_name (str) – Core part of the name of the output files.

  • num_comp (int, optional) – Sort along the first num_comp components.

  • start_frame (int, optional) – Offset of the data with respect to the trajectory.

Returns:

  • sorted_proj (list) – sorted projections on each component

  • sorted_indices_data (list) – Sorted indices of the data array for each component

  • sorted_indices_traj (list) – Sorted indices of the coordinate frames for each component