pensa.dimensionality
pensa.dimensionality.pca
- calculate_pca(data, dim=None)
Performs a scikit-learn PCA on the provided data.
- Parameters:
data (float array) – Trajectory data [frames, frame_data].
dim (int, optional, default = None) – The number of dimensions (principal components) to project onto. None means all numerically available dimensions will be used.
- Returns:
pca – Principal components information.
- Return type:
PCA obj
- get_components_pca(data, num, pca=None, prefix='')
Projects a trajectory onto the first num eigenvectors of its PCA.
- Parameters:
data (float array) – Trajectory data [frames, frame_data].
num (int) – Number of eigenvectors to project on.
pca (PCA obj, optional, default = None) – Information of pre-calculated PCA. Must be calculated for the same features (but not necessarily the same trajectory).
prefix (str, optional, default = '') – First part of the component names. Second part is “PC”+<PC number>
- Returns:
comp_names (list) – Names/numbers of the components.
components (float array) – Component data [frames, components]
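The naming rule for comp_names can be sketched as follows. This is only an illustration, not PENSA's code: the helper component_names is hypothetical, and numbering the components from 1 is an assumption that the description above does not state.

```python
def component_names(num, prefix="", kind="PC"):
    """Build names such as 'bb-PC1' for the first `num` components.

    Hypothetical helper; the 1-based numbering is an assumption.
    """
    return [f"{prefix}{kind}{i + 1}" for i in range(num)]


names = component_names(3, prefix="bb-")
print(names)
```

Passing kind="IC" gives the analogous names for independent components.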
- pca_eigenvalues_plot(pca, num=12, plot_file=None)
Plots the highest eigenvalues over the number of the principal components.
- Parameters:
pca (PCA obj) – Principal components information.
num (int, optional, default = 12) – Number of eigenvalues to plot.
plot_file (str, optional, default = None) – Path and name of the file to save the plot.
- pca_features(tica, features, num, threshold, plot_file=None, add_labels=False)
- project_on_pc(data, ev_idx, pca=None, dim=-1)
Projects a trajectory onto an eigenvector of its PCA, i.e., calculates the value along this component at each step of the trajectory (retains the order of the trajectory). Note that the eigenvector is indexed starting from zero.
- Parameters:
data (float array) – Trajectory data [frames, frame_data].
ev_idx (int) – Index of the eigenvector to project on (starts with zero).
pca (PCA obj, optional, default = None) – Information of pre-calculated PCA. Must be calculated for the same features (but not necessarily the same trajectory).
dim (int, optional, default = -1) – The number of dimensions (principal components) to project onto. Only used if pca is not provided.
- Returns:
projection – Value along the PC for each frame.
- Return type:
float array
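As a rough sketch of what projecting onto one eigenvector means, the value along the component for each frame can be written as the dot product of that frame's feature vector with the eigenvector. The function project_frames and the toy numbers below are hypothetical, and plain Python lists stand in for the float arrays. Whether PENSA centers the data before projecting is not stated here, so this sketch does not.

```python
def project_frames(data, eigenvector):
    """Return one projection value per frame, in trajectory order.

    data: list of frames, each a list of feature values.
    eigenvector: list of weights, one per feature (hypothetical input).
    """
    return [sum(x * w for x, w in zip(frame, eigenvector)) for frame in data]


# Three frames with two features each (made-up numbers).
data = [[1.0, 0.0], [0.0, 2.0], [3.0, 4.0]]
# A unit-length direction in feature space.
eigenvector = [0.6, 0.8]

projection = project_frames(data, eigenvector)
print(projection)
```

The result has one entry per frame and keeps the frame order, which is what allows it to be plotted against time or used for sorting afterwards.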
- sort_mult_trajs_along_common_pc(data, top, trj, out_name, num_pc=3, start_frame=0)
Sort multiple trajectories along their most important common principal components. For each of the num_pc specified components, return a trajectory in which the frames from all original trajectories are ordered by their value along the respective components.
- Parameters:
data (list of float arrays) – List of trajectory data arrays, each [frames, frame_data].
top (list of str) – Reference topology files.
trj (list of str) – Trajectories from which the frames are picked. trj[i] should be the trajectory that data[i] was calculated from.
out_name (str) – Core part of the name of the output files.
num_pc (int, optional, default = 3) – Sort along the first num_pc principal components.
start_frame (int or list of int, default = 0) – Offset of the data with respect to the trajectories.
- Returns:
sorted_proj (list) – Sorted projections on each principal component
sorted_indices_data (list) – Sorted indices of the data array for each principal component
sorted_indices_traj (list) – Sorted indices of the coordinate frames for each principal component
- sort_traj_along_pc(data, top, trj, out_name, pca=None, num_pc=3, start_frame=0)
Sort a trajectory along principal components. For each of the num_pc specified components, return a trajectory in which the frames are ordered by their value along the respective components.
- Parameters:
data (float array) – Trajectory data [frames, frame_data].
top (str) – File name of the reference topology for the trajectory.
trj (str) – File name of the trajectory from which the frames are picked. Should be the trajectory that data was calculated from.
out_name (str) – Core part of the name of the output files.
pca (PCA obj, optional, default = None) – Principal components information. If none is provided, it will be calculated.
num_pc (int, optional, default = 3) – Sort along the first num_pc principal components.
start_frame (int, optional, default = 0) – Offset of the data with respect to the trajectory.
- Returns:
sorted_proj (list) – Sorted projections on each principal component
sorted_indices_data (list) – Sorted indices of the data array for each principal component
sorted_indices_traj (list) – Sorted indices of the coordinate frames for each principal component
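The relationship between the three return values can be sketched in a few lines. This is an illustration under assumptions, not PENSA's implementation: it assumes that frames are sorted in ascending order of their projection and that the trajectory index is the data index shifted by start_frame. The helper sort_along_projection is hypothetical.

```python
def sort_along_projection(projection, start_frame=0):
    """Sort frame indices by projection value (ascending, an assumption).

    Returns the sorted projection values, the indices into the data array,
    and the corresponding indices into the trajectory file.
    """
    indices_data = sorted(range(len(projection)), key=lambda i: projection[i])
    sorted_proj = [projection[i] for i in indices_data]
    # Data row i is assumed to correspond to trajectory frame i + start_frame.
    indices_traj = [i + start_frame for i in indices_data]
    return sorted_proj, indices_data, indices_traj


# Made-up projection of four frames; the data skipped the first 100 frames.
proj = [0.3, -1.2, 2.5, 0.0]
sorted_proj, idx_data, idx_traj = sort_along_projection(proj, start_frame=100)
print(sorted_proj, idx_data, idx_traj)
```

With start_frame=0 the two index lists coincide; they differ only when the feature data was read starting from a later frame than the trajectory file.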
- sort_trajs_along_common_pc(data_a, data_b, top_a, top_b, trj_a, trj_b, out_name, num_pc=3, start_frame=0)
Sort two trajectories along their most important common principal components. For each of the num_pc specified components, return a trajectory in which the frames from both original trajectories are ordered by their value along the respective components.
- Parameters:
data_a (float array) – Trajectory data [frames, frame_data].
data_b (float array) – Trajectory data [frames, frame_data].
top_a (str) – Reference topology for the first trajectory.
top_b (str) – Reference topology for the second trajectory.
trj_a (str) – First of the trajectories from which the frames are picked. Should be the trajectory that data_a was calculated from.
trj_b (str) – Second of the trajectories from which the frames are picked. Should be the trajectory that data_b was calculated from.
out_name (str) – Core part of the name of the output files.
num_pc (int, optional, default = 3) – Sort along the first num_pc principal components.
start_frame (int or list of int, default = 0) – Offset of the data with respect to the trajectories.
- Returns:
sorted_proj (list) – Sorted projections on each principal component
sorted_indices_data (list) – Sorted indices of the data array for each principal component
sorted_indices_traj (list) – Sorted indices of the coordinate frames for each principal component
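To illustrate what ordering the frames from both trajectories means, the sketch below pools two projections, sorts the pooled frames, and records which trajectory and frame each sorted entry came from. It assumes the projections onto a common component have already been computed. The helper sort_pooled and its (value, trajectory, frame) bookkeeping are hypothetical and do not reproduce PENSA's return format.

```python
def sort_pooled(proj_a, proj_b, start_frame=0):
    """Order the frames of two trajectories by their value on one component.

    Returns a list of (value, trajectory_label, trajectory_frame) tuples.
    """
    pooled = [(v, "a", i + start_frame) for i, v in enumerate(proj_a)]
    pooled += [(v, "b", i + start_frame) for i, v in enumerate(proj_b)]
    # Sort by projection value only; ties keep their pooled order.
    return sorted(pooled, key=lambda entry: entry[0])


# Made-up projections of two short trajectories on the same component.
proj_a = [0.5, -0.5, 1.5]
proj_b = [0.0, 2.0]
ordered = sort_pooled(proj_a, proj_b)
for value, label, frame in ordered:
    print(f"{value:5.1f}  trajectory {label}, frame {frame}")
```

Frames from the two trajectories interleave in the sorted output, which makes it easy to see where along a component one simulation condition dominates over the other.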
pensa.dimensionality.tica
- calculate_tica(data, dim=None, lag=10)
Performs time-lagged independent component analysis (TICA) on the provided data.
- Parameters:
data (float array) – Trajectory data. Format: [frames, frame_data].
dim (int, optional, default = None) – The number of dimensions (independent components) to project onto. None means all numerically available dimensions will be used.
lag (int, optional, default = 10) – The lag time, in multiples of the input time step.
- Returns:
tica – Time-lagged independent component information.
- Return type:
TICA obj
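The role of lag can be illustrated with a time-lagged autocorrelation, the quantity that TICA seeks to maximize along its components. The sketch below is plain Python and does not call PENSA; the function lagged_autocorrelation and the two synthetic signals are hypothetical.

```python
import math


def lagged_autocorrelation(signal, lag):
    """Normalized correlation between a signal and itself `lag` frames later.

    `lag` must be a positive number of frames, smaller than len(signal).
    """
    mean = sum(signal) / len(signal)
    centered = [x - mean for x in signal]
    head, tail = centered[:-lag], centered[lag:]
    numerator = sum(a * b for a, b in zip(head, tail))
    norm = math.sqrt(sum(a * a for a in head) * sum(b * b for b in tail))
    return numerator / norm


frames = range(1000)
slow = [math.sin(2 * math.pi * t / 200) for t in frames]   # period: 200 frames
faster = [math.sin(2 * math.pi * t / 40) for t in frames]  # period: 40 frames

lag = 10  # in frames, i.e. multiples of the input time step
slow_corr = lagged_autocorrelation(slow, lag)
faster_corr = lagged_autocorrelation(faster, lag)
print("slow:  ", round(slow_corr, 2))
print("faster:", round(faster_corr, 2))
```

At a lag of 10 frames the slowly varying signal remains strongly correlated with itself, whereas the faster one is almost uncorrelated. Choosing lag therefore determines which time scales count as slow when the independent components are ranked.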
- get_components_tica(data, num, tica=None, lag=10, prefix='')
Projects a trajectory onto the first num eigenvectors of its tICA.
- Parameters:
data (float array) – Trajectory data [frames, frame_data].
num (int) – Number of eigenvectors to project on.
tica (tICA obj, optional, default = None) – Information of pre-calculated tICA. Must be calculated for the same features (but not necessarily the same trajectory).
lag (int, optional, default = 10) – The lag time, in multiples of the input time step. Only used if tica is not provided.
prefix (str, optional, default = '') – First part of the component names. Second part is “IC”+<IC number>
- Returns:
comp_names (list) – Names/numbers of the components.
components (float array) – Component data [frames, components]
- project_on_tic(data, ev_idx, tica=None, dim=-1, lag=10)
Projects a trajectory onto an eigenvector of its TICA, i.e., calculates the value along this component at each step of the trajectory (retains the order of the trajectory). Note that the eigenvector is indexed starting from zero.
- Parameters:
data (float array) – Trajectory data [frames, frame_data].
ev_idx (int) – Index of the eigenvector to project on (starts with zero).
tica (TICA obj, optional, default = None) – Information of pre-calculated TICA. Must be calculated for the same features (but not necessarily the same trajectory).
dim (int, optional, default = -1) – The number of dimensions (independent components) to project onto. Only used if tica is not provided.
lag (int, optional, default = 10) – The lag time, in multiples of the input time step. Only used if tica is not provided.
- Returns:
projection – Value along the TIC for each frame.
- Return type:
float array
- sort_mult_trajs_along_common_tic(data, top, trj, out_name, num_ic=3, lag=10, start_frame=0)
Sort multiple trajectories along their most important common time-lagged independent components. For each of the num_ic specified components, return a trajectory in which the frames from all original trajectories are ordered by their value along the respective components.
- Parameters:
data (list of float arrays) – List of trajectory data arrays, each [frames, frame_data].
top (list of str) – Reference topology files.
trj (list of str) – Trajectories from which the frames are picked. trj[i] should be the trajectory that data[i] was calculated from.
out_name (str) – Core part of the name of the output files.
num_ic (int, optional, default = 3) – Sort along the first num_ic independent components.
lag (int, optional, default = 10) – The lag time, in multiples of the input time step.
start_frame (int or list of int, default = 0) – Offset of the data with respect to the trajectories.
- Returns:
sorted_proj (list) – Sorted projections on each independent component
sorted_indices_data (list) – Sorted indices of the data array for each independent component
sorted_indices_traj (list) – Sorted indices of the coordinate frames for each independent component
- sort_traj_along_tic(data, top, trj, out_name, tica=None, num_ic=3, lag=10, start_frame=0)
Sort a trajectory along time-lagged independent components. For each of the num_ic specified components, return a trajectory in which the frames are ordered by their value along the respective components.
- Parameters:
data (float array) – Trajectory data [frames, frame_data].
top (str) – File name of the reference topology for the trajectory.
trj (str) – File name of the trajectory from which the frames are picked. Should be the trajectory that data was calculated from.
out_name (str) – Core part of the name of the output files.
tica (tICA obj, optional, default = None) – Time-lagged independent components information. If none is provided, it will be calculated.
num_ic (int, optional, default = 3) – Sort along the first num_ic independent components.
lag (int, optional, default = 10) – The lag time, in multiples of the input time step. Only used if tica is not provided.
start_frame (int, optional, default = 0) – Offset of the data with respect to the trajectory.
- Returns:
sorted_proj (list) – Sorted projections on each independent component
sorted_indices_data (list) – Sorted indices of the data array for each independent component
sorted_indices_traj (list) – Sorted indices of the coordinate frames for each independent component
- sort_trajs_along_common_tic(data_a, data_b, top_a, top_b, trj_a, trj_b, out_name, num_ic=3, lag=10, start_frame=0)
Sort two trajectories along their most important common time-lagged independent components. For each of the num_ic specified components, return a trajectory in which the frames from both original trajectories are ordered by their value along the respective components.
- Parameters:
data_a (float array) – Trajectory data [frames, frame_data].
data_b (float array) – Trajectory data [frames, frame_data].
top_a (str) – Reference topology for the first trajectory.
top_b (str) – Reference topology for the second trajectory.
trj_a (str) – First of the trajectories from which the frames are picked. Should be the trajectory that data_a was calculated from.
trj_b (str) – Second of the trajectories from which the frames are picked. Should be the trajectory that data_b was calculated from.
out_name (str) – Core part of the name of the output files.
num_ic (int, optional, default = 3) – Sort along the first num_ic independent components.
lag (int, optional, default = 10) – The lag time, in multiples of the input time step.
start_frame (int, optional, default = 0) – Offset of the data with respect to the trajectories.
- Returns:
sorted_proj (list) – Sorted projections on each independent component
sorted_indices_data (list) – Sorted indices of the data array for each independent component
sorted_indices_traj (list) – Sorted indices of the coordinate frames for each independent component
- tica_eigenvalues_plot(tica, num=12, plot_file=None)
Plots the highest eigenvalues over the number of the time-lagged independent components.
- Parameters:
tica (TICA obj) – Time-lagged independent components information.
num (int, optional, default = 12) – Number of eigenvalues to plot.
plot_file (str, optional, default = None) – Path and name of the file to save the plot.
- tica_features(tica, features, num, threshold, plot_file=None, add_labels=False)
pensa.dimensionality.visualization
- compare_mult_projections(data, ana, num=3, saveas=None, labels=None, colors=None)
Compare multiple datasets along the components of a PCA or tICA.
- Parameters:
data (list of float arrays) – Data from multiple trajectories [frames, frame_data].
ana (PCA or tICA object) – Components analysis information.
num (int) – Number of components to plot.
saveas (str, optional) – Name of the output file.
labels (list of str, optional) – Labels for the datasets. If provided, it must have the same length as data.
- Returns:
projections – Projections of the trajectories on each component.
- Return type:
list of float arrays
- compare_projections(data_a, data_b, ana, num=3, saveas=None, label_a=None, label_b=None)
Compare two datasets along the components of a PCA or tICA.
- Parameters:
data_a (float array) – Trajectory data [frames, frame_data].
data_b (float array) – Trajectory data [frames, frame_data].
ana (PCA or tICA object) – Components analysis information.
num (int) – Number of components to plot.
saveas (str, optional) – Name of the output file.
label_a (str, optional) – Label for the first dataset.
label_b (str, optional) – Label for the second dataset.
- Returns:
projections – Projections of the trajectory on each component.
- Return type:
list of float arrays
- project_on_eigenvector_pca(data, ev_idx, ana)
Projects a trajectory onto an eigenvector of its PCA.
- Parameters:
data (float array) – Trajectory data [frames, frame_data].
ev_idx (int) – Index of the eigenvector to project on (starts with zero).
ana (PCA obj) – Information of pre-calculated PCA. Must be calculated for the same features (but not necessarily the same trajectory).
- Returns:
projection – Value along the PC for each frame.
- Return type:
float array
- project_on_eigenvector_tica(data, ev_idx, ana)
Projects a trajectory onto an eigenvector of its tICA.
- Parameters:
data (float array) – Trajectory data [frames, frame_data].
ev_idx (int) – Index of the eigenvector to project on (starts with zero).
ana (tICA obj) – Information of pre-calculated tICA. Must be calculated for the same features (but not necessarily the same trajectory).
- Returns:
projection – Value along the TIC for each frame.
- Return type:
float array
- sort_traj_along_projection(data, ana, top, trj, out_name, num_comp=3, start_frame=0)
Sort a trajectory along the given components of a PCA or tICA.
- Parameters:
data (float array) – Trajectory data [frames, frame_data].
ana (PCA or tICA obj) – Components information.
top (str) – File name of the reference topology for the trajectory.
trj (str) – File name of the trajectory from which the frames are picked. Should be the trajectory that data was calculated from.
out_name (str) – Core part of the name of the output files.
num_comp (int, optional) – Sort along the first num_comp components.
start_frame (int, optional) – Offset of the data with respect to the trajectory.
- Returns:
sorted_proj (list) – Sorted projections on each component
sorted_indices_data (list) – Sorted indices of the data array for each component
sorted_indices_traj (list) – Sorted indices of the coordinate frames for each component