pensa.comparison

pensa.comparison.metrics

average_jsd(features_a, features_b, all_data_a, all_data_b, bin_width=None, bin_num=10, verbose=True, override_name_check=False)
average_kld(features_a, features_b, all_data_a, all_data_b, bin_width=None, bin_num=10, verbose=True, override_name_check=False)
average_ksp(features_a, features_b, all_data_a, all_data_b, verbose=True, override_name_check=False)
average_kss(features_a, features_b, all_data_a, all_data_b, verbose=True, override_name_check=False)
average_ssi(features_a, features_b, all_data_a, all_data_b, torsions=None, pocket_occupancy=None, pbc=True, verbose=True, write_plots=None, override_name_check=False)
max_jsd(features_a, features_b, all_data_a, all_data_b, bin_width=None, bin_num=10, verbose=True, override_name_check=False)
max_kld(features_a, features_b, all_data_a, all_data_b, bin_width=None, bin_num=10, verbose=True, override_name_check=False)
max_ksp(features_a, features_b, all_data_a, all_data_b, verbose=True, override_name_check=False)
max_kss(features_a, features_b, all_data_a, all_data_b, verbose=True, override_name_check=False)
max_ssi(features_a, features_b, all_data_a, all_data_b, torsions=None, pocket_occupancy=None, pbc=True, verbose=True, write_plots=None, override_name_check=False)
min_ksp(features_a, features_b, all_data_a, all_data_b, verbose=True, override_name_check=False)
pca_sampling_efficiency(ref_data, test_data, num_pc=2)

Calculates the relative sampling efficiency of test data based on reference data.

Parameters:
  • ref_data (float array) – Trajectory data from the reference ensemble. Format: [frames, frame_data].

  • test_data (float array) – Trajectory data from the test ensemble. Format: [frames, frame_data].

  • num_pc (int) – Number of principal components used.

Returns:

pca_se – Sampling efficiency of test data based on reference data.

Return type:

float

pensa.comparison.projections

pca_feature_correlation(features, data, pca=None, num=3, threshold=0.1, plot_file=None, add_labels=False)

Calculates and plots the correlation between principal components and the underlying features. Prints all features with a correlation above the threshold.

Parameters:
  • features (list of str) – Names of the features for which the PCA was/is performed.

  • data (float array) – Trajectory data for the features. Format: [frames, frame_data]

  • pca (PCA obj, default = None) – The PCA of which to plot the features. If no PCA is provided, it is calculated from the trajectory.

  • num (float, default = 3) – Number of feature correlations to plot.

  • threshold (float, default = 0.1) – Features with a correlation above this will be printed.

  • plot_file (str, optional, default = None) – Path and name of the file to save the plot.

  • add_labels (bool, optional, default = False) – Add labels of the features to the x axis.

tica_feature_correlation(features, data, num=3, tica=None, threshold=0.1, plot_file=None, add_labels=False)

Prints relevant features and plots feature correlations.

Parameters:
  • features (list of str) – Features for which the TICA was performed.

  • num (float) – Number of feature correlations to plot.

  • threshold (float) – Features with a correlation above this will be printed.

  • plot_file (str, optional, default = None) – Path and name of the file to save the plot.

  • add_labels (bool, optional, default = False) – Add labels of the features to the x axis.

pensa.comparison.relative_entropy

relative_entropy_analysis(features_a, features_b, all_data_a, all_data_b, bin_width=None, bin_num=10, verbose=True, override_name_check=False)

Calculates the Jensen-Shannon distance and the Kullback-Leibler divergences for each feature from two ensembles.

Parameters:
  • features_a (list of str) – Feature names of the first ensemble. Can be obtained from features object via .describe().

  • features_b (list of str) – Feature names of the second ensemble. Can be obtained from features object via .describe(). Must be the same as features_a. Provided as a sanity check.

  • all_data_a (float array) – Trajectory data from the first ensemble. Format: [frames, frame_data].

  • all_data_b (float array) – Trajectory data from the second ensemble. Format: [frames, frame_data].

  • bin_width (float, default=None) – Bin width for the axis to compare the distributions on. If bin_width is None, bin_num (see below) bins are used and the width is determined from the common histogram.

  • bin_num (int, default=10) – Number of bins for the axis to compare the distributions on (only if bin_width=None).

  • verbose (bool, default=True) – Print intermediate results.

  • override_name_check (bool, default=False) – Only check number of features, not their names.

Returns:

  • data_names (list of str) – Feature names.

  • data_jsdist (float array) – Jensen-Shannon distance for each feature.

  • data_kld_ab (float array) – Kullback-Leibler divergences of data_a with respect to data_b.

  • data_kld_ba (float array) – Kullback-Leibler divergences of data_b with respect to data_a.
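
To make the quantities above concrete, here is a minimal standard-library sketch of the Jensen-Shannon distance and the Kullback-Leibler divergence for a single feature, computed on a common histogram. It is an illustration under stated assumptions, not PENSA's implementation: the helper names (common_edges, histogram, kld, js_distance), the natural logarithm, and the treatment of empty bins are all choices made for this example.

```python
import math

def common_edges(values_a, values_b, bin_num=10):
    """Equal-width bin edges spanning the joint range of both samples."""
    lo = min(min(values_a), min(values_b))
    hi = max(max(values_a), max(values_b))
    width = (hi - lo) / bin_num
    return [lo + i * width for i in range(bin_num + 1)]

def histogram(values, edges):
    """Normalized bin occupancies of `values` on equal-width edges."""
    num_bins = len(edges) - 1
    width = edges[1] - edges[0]
    counts = [0] * num_bins
    for v in values:
        # Clamp so that the maximum value falls into the last bin.
        idx = min(int((v - edges[0]) / width), num_bins - 1)
        counts[idx] += 1
    return [c / len(values) for c in counts]

def kld(p, q):
    """Kullback-Leibler divergence D(p || q) in nats (assumed unit)."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue          # empty bins of p contribute nothing
        if qi == 0:
            return math.inf   # p has weight where q has none
        total += pi * math.log(pi / qi)
    return total

def js_distance(p, q):
    """Jensen-Shannon distance: square root of the JS divergence."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return math.sqrt(0.5 * kld(p, m) + 0.5 * kld(q, m))

# Hypothetical values of one feature in two ensembles.
a = [0.0, 0.5, 1.0, 1.5]
b = [1.0, 1.5, 2.0, 2.5]
edges = common_edges(a, b, bin_num=5)
p, q = histogram(a, edges), histogram(b, edges)
print(js_distance(p, q), kld(p, q), kld(q, p))
```

Unlike the Kullback-Leibler divergence, the Jensen-Shannon distance is symmetric and stays finite even when the two histograms do not overlap, which is one reason to report both.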

pensa.comparison.statespecific

cossi_featens_analysis(features_a, features_b, features_c, features_d, all_data_a, all_data_b, all_data_c, all_data_d, discrete_states_ab, discrete_states_cd, max_thread_no=1, pbca=True, pbcb=True, h2oa=False, h2ob=False, verbose=True, override_name_check=False)

Calculates the State Specific Information (SSI) and Co-SSI statistics between two features and the ensemble condition.

Parameters:
  • features_a (list of str) – Feature names of the first ensemble.

  • features_b (list of str) – Feature names of the second ensemble. Must be the same as features_a. Provided as a sanity check.

  • features_c (list of str) – Feature names of the third ensemble.

  • features_d (list of str) – Feature names of the fourth ensemble. Must be the same as features_c. Provided as a sanity check.

  • all_data_a (float array) – Trajectory data from the first ensemble. Format: [frames, frame_data].

  • all_data_b (float array) – Trajectory data from the second ensemble. Format: [frames, frame_data].

  • all_data_c (float array) – Trajectory data from the third ensemble. Format: [frames, frame_data].

  • all_data_d (float array) – Trajectory data from the fourth ensemble. Format: [frames, frame_data].

  • discrete_states_ab (list of list) – List of state limits for each feature.

  • discrete_states_cd (list of list) – List of state limits for each feature.

  • max_thread_no (int, optional) – Maximum number of threads to use in the multi-threading. Default is 1.

  • pbca, pbcb (bool, optional) – If True, apply periodic boundary corrections to angular distribution inputs. The input for periodic correction must be in radians. The default is True.

  • h2oa, h2ob (bool, optional) – If True, apply periodic boundary corrections for spherical angles with different periodicities. The default is False.

  • verbose (bool, optional) – Print intermediate results. Default is True.

  • override_name_check (bool, optional) – Only check number of features, not their names. Default is False.

Returns:

  • data_names (list of str) – Feature names.

  • data_ssi (float array) – State Specific Information SSI statistics for each feature.

  • data_cossi (float array) – State Specific Information Co-SSI statistics for each feature.

ssi_ensemble_analysis(features_a, features_b, all_data_a, all_data_b, discrete_states_ab, max_thread_no=1, pbc=True, h2o=False, verbose=True, write_plots=False, override_name_check=False)

Calculates State Specific Information statistic for a feature across two ensembles.

Parameters:
  • features_a (list of str) – Feature names of the first ensemble.

  • features_b (list of str) – Feature names of the second ensemble. Must be the same as features_a. Provided as a sanity check.

  • all_data_a (float array) – Trajectory data from the first ensemble. Format: [frames, frame_data].

  • all_data_b (float array) – Trajectory data from the second ensemble. Format: [frames, frame_data].

  • discrete_states_ab (list of list) – List of state limits for each feature.

  • max_thread_no (int, optional) – Maximum number of threads to use in the multi-threading. Default is 1.

  • pbc (bool, optional) – If True, apply periodic boundary corrections to angular distribution inputs. The input for periodic correction must be in radians. The default is True.

  • h2o (bool, optional) – If True, apply periodic boundary corrections for spherical angles with different periodicities. The default is False.

  • verbose (bool, optional) – Print intermediate results. Default is True.

  • write_plots (bool, optional) – If true, visualise the states over the raw distribution. The default is False.

  • override_name_check (bool, optional) – Only check number of features, not their names. Default is False.

Returns:

  • data_names (list of str) – Feature names.

  • data_ssi (float array) – State Specific Information statistics for each feature.
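
As a rough illustration of the idea, the sketch below treats the statistic as the mutual information (in bits) between the discrete state of one feature and the ensemble a frame came from. This reading, the helper names (assign_states, entropy, ssi), and the interpretation of a feature's state limits as a sorted list of boundaries are assumptions made for this example; PENSA's own implementation may differ in detail.

```python
import math
from bisect import bisect_right
from collections import Counter

def assign_states(values, limits):
    """Index of the state whose limits enclose each value.

    `limits` is assumed to be sorted, e.g. [-pi, 0, pi] for two states.
    """
    inner = limits[1:-1]  # interior boundaries between neighboring states
    return [bisect_right(inner, v) for v in values]

def entropy(labels):
    """Shannon entropy in bits of a sequence of hashable labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def ssi(values_a, values_b, limits):
    """Mutual information between feature state and ensemble label."""
    states = assign_states(values_a + values_b, limits)
    ensemble = [0] * len(values_a) + [1] * len(values_b)
    joint = list(zip(states, ensemble))
    return entropy(states) + entropy(ensemble) - entropy(joint)

# Hypothetical torsion angles (radians) with two states split at 0.
limits = [-math.pi, 0.0, math.pi]
print(ssi([-1.0, -2.0], [1.0, 2.0], limits))  # states fully separate the ensembles
print(ssi([-1.0, 1.0], [-1.0, 1.0], limits))  # states carry no information
```

With two equally sized ensembles, this quantity ranges from 0 bits (identical state populations) to 1 bit (each state occurs in only one ensemble).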

ssi_feature_analysis(features_a, features_b, all_data_a, all_data_b, discrete_states_ab, max_thread_no=1, pbc=True, h2o=False, verbose=True, override_name_check=False)

Calculates State Specific Information statistic between two features across two ensembles.

Parameters:
  • features_a (list of str) – Feature names of the first ensemble.

  • features_b (list of str) – Feature names of the second ensemble. Must be the same as features_a. Provided as a sanity check.

  • all_data_a (float array) – Trajectory data from the first ensemble. Format: [frames, frame_data].

  • all_data_b (float array) – Trajectory data from the second ensemble. Format: [frames, frame_data].

  • discrete_states_ab (list of list) – List of state limits for each feature.

  • max_thread_no (int, optional) – Maximum number of threads to use in the multi-threading. Default is 1.

  • pbc (bool, optional) – If True, apply periodic boundary corrections to angular distribution inputs. The input for periodic correction must be in radians. The default is True.

  • h2o (bool, optional) – If True, apply periodic boundary corrections for spherical angles with different periodicities. The default is False.

  • verbose (bool, optional) – Print intermediate results. Default is True.

  • override_name_check (bool, optional) – Only check number of features, not their names. Default is False.

Returns:

  • data_names (list of str) – Feature names.

  • data_ssi (float array) – State Specific Information statistics for each feature.

pensa.comparison.statistics

feature_correlation(data_a, data_b)

Calculates the correlation matrix between two sets of features. The features are normalized before the correlation is calculated.

Parameters:
  • data_a (float array) – Trajectory data [frames, frame_data].

  • data_b (float array) – Trajectory data [frames, frame_data].

Returns:

corr – Correlation matrix [num. features a, num. features b]

Return type:

float array
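
The following standard-library sketch shows one way to obtain such a matrix: each feature column is shifted to zero mean and scaled to unit standard deviation, and the correlation is the average product of the normalized columns. The function names are hypothetical and the code is only meant to illustrate the shape of the result, [num. features a, num. features b]; it assumes both inputs have the same number of frames and that no feature is constant.

```python
import math

def normalize(column):
    """Shift a column to zero mean and scale it to unit standard deviation."""
    mean = sum(column) / len(column)
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / len(column))
    return [(x - mean) / std for x in column]

def feature_correlation_sketch(data_a, data_b):
    """Correlation matrix between the columns of two [frames, features] lists."""
    num_frames = len(data_a)
    cols_a = [normalize(list(col)) for col in zip(*data_a)]
    cols_b = [normalize(list(col)) for col in zip(*data_b)]
    return [
        [sum(x * y for x, y in zip(ca, cb)) / num_frames for cb in cols_b]
        for ca in cols_a
    ]

# Hypothetical data: 3 frames, 1 feature in a and 2 features in b.
data_a = [[1.0], [2.0], [3.0]]
data_b = [[2.0, 3.0], [4.0, 2.0], [6.0, 1.0]]
corr = feature_correlation_sketch(data_a, data_b)
print(corr)  # one row (feature of a), two columns (features of b)
```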

kolmogorov_smirnov_analysis(features_a, features_b, all_data_a, all_data_b, verbose=True, override_name_check=False)

Calculates Kolmogorov-Smirnov statistic for two distributions.

Parameters:
  • features_a (list of str) – Feature names of the first ensemble. Can be obtained from features object via .describe().

  • features_b (list of str) – Feature names of the second ensemble. Can be obtained from features object via .describe(). Must be the same as features_a. Provided as a sanity check.

  • all_data_a (float array) – Trajectory data from the first ensemble. Format: [frames, frame_data].

  • all_data_b (float array) – Trajectory data from the second ensemble. Format: [frames, frame_data].

  • verbose (bool, default=True) – Print intermediate results.

  • override_name_check (bool, default=False) – Only check number of features, not their names.

Returns:

  • data_names (list of str) – Feature names.

  • data_kss (float array) – Kolmogorov-Smirnov statistics for each feature.

  • data_ksp (float array) – Kolmogorov-Smirnov p-value for each feature.
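
For orientation, the two-sample Kolmogorov-Smirnov statistic of a single feature is the largest vertical distance between the empirical cumulative distribution functions of the two samples. The sketch below computes only this statistic with the standard library; the function name is hypothetical, the p-value is not computed, and the code is not taken from PENSA.

```python
def ks_statistic(sample_a, sample_b):
    """Maximum distance between the empirical CDFs of two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    d = 0.0
    # Evaluate both empirical CDFs at every observed value.
    for x in sorted(set(a) | set(b)):
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

# Hypothetical values of one feature in two ensembles.
print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))
```

A statistic of 0 means the empirical distributions coincide, and a statistic of 1 means the two samples do not overlap at all.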

mean_difference_analysis(features_a, features_b, all_data_a, all_data_b, verbose=True, override_name_check=False)

Compares the arithmetic means of two distance distributions.

Parameters:
  • features_a (list of str) – Feature names of the first ensemble. Can be obtained from features object via .describe().

  • features_b (list of str) – Feature names of the second ensemble. Can be obtained from features object via .describe(). Must be the same as features_a. Provided as a sanity check.

  • all_data_a (float array) – Trajectory data from the first ensemble. Format: [frames, frame_data].

  • all_data_b (float array) – Trajectory data from the second ensemble. Format: [frames, frame_data].

  • bin_width (float, default=0.001) – Bin width for the axis to compare the distributions on.

  • verbose (bool, default=True) – Print intermediate results.

  • override_name_check (bool, default=False) – Only check number of features, not their names.

Returns:

  • data_names (list of str) – Feature names.

  • data_avg (float array) – Joint average value for each feature.

  • data_diff (float array) – Difference of the averages for each feature.

pensa.comparison.uncertainty_analysis

relen_block_analysis(features_a, features_b, all_data_a, all_data_b, blockanlen=10000, cumdist=False, verbose=True)

Block analysis on the relative entropy metrics for each feature from two ensembles.

Parameters:
  • features_a (list of str) – Feature names of the first ensemble. Can be obtained from features object via .describe().

  • features_b (list of str) – Feature names of the second ensemble. Can be obtained from features object via .describe(). Must be the same as features_a. Provided as a sanity check.

  • all_data_a (float array) – Trajectory data from the first ensemble. Format: [frames, frame_data].

  • all_data_b (float array) – Trajectory data from the second ensemble. Format: [frames, frame_data].

  • blockanlen (int, optional) – Length of the blocks used in the block analysis. The trajectory is segmented into equal-size blocks of this length. The default is 10000.

  • cumdist (bool, optional) – If True, set the block analysis to a cumulative segmentation, increasing in length by the block length. The default is False.

  • verbose (bool, default=True) – Print intermediate results.

Returns:

relen_blocks – List of relative entropy analysis outputs for each block.

Return type:

list of lists
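
The difference between regular and cumulative segmentation can be sketched as follows. The helper block_slices is hypothetical and only illustrates which frame ranges each mode would cover; dropping a trailing incomplete block is an assumption of this example, not documented behavior.

```python
def block_slices(num_frames, block_len, cumulative=False):
    """Frame ranges (start, end) of the blocks, with `end` exclusive."""
    slices = []
    for end in range(block_len, num_frames + 1, block_len):
        # Cumulative blocks always start at the first frame.
        start = 0 if cumulative else end - block_len
        slices.append((start, end))
    return slices

print(block_slices(10, 3))                   # consecutive blocks of equal size
print(block_slices(10, 3, cumulative=True))  # blocks growing by one block length
```

Each range would then select the frames of both ensembles that enter one relative entropy analysis.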

relen_sem_analysis(relen_dat, write_plot=True, expfit=False, plot_dir='./SEM_plots', plot_prefix='')

Standard error analysis for the block averages.

Parameters:
  • relen_dat (list of lists) – List of relative entropy analysis outputs for each block.

  • write_plot (bool, optional) – If True, visualise the SEM analysis. Default is True.

  • expfit (bool, optional) – If True, apply an exponential fit to the SEM plot to predict the SEM value upon full convergence. Not yet fully accurate. The default is False.

  • plot_dir (str, optional) – Directory in which to save the plots (if write_plot is True).

Returns:

  • resrelenvals (list of lists) – JSD values for each block, for each residue type.

  • avresrelenvals (list of lists) – JSD values for each block, averaged by residue type.

  • avsemvals (list of lists) – SEM values averaged across each residue type.
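
As a reminder of the underlying quantity, the standard error of the mean (SEM) over the block values of one feature is the sample standard deviation divided by the square root of the number of blocks. The sketch below uses a hypothetical helper and hypothetical block values; how PENSA groups and averages values by residue type is not shown here.

```python
import math
import statistics

def standard_error(block_values):
    """Standard error of the mean of per-block values (needs >= 2 blocks)."""
    return statistics.stdev(block_values) / math.sqrt(len(block_values))

# Hypothetical JSD values of one feature in four blocks.
print(standard_error([0.20, 0.25, 0.22, 0.21]))
```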

ssi_block_analysis(features_a, features_b, all_data_a, all_data_b, blockanlen=10000, pbc=True, discretize='gaussian', group_feat=True, cumdist=False, verbose=True)

Block analysis on the State Specific Information statistic for each feature across two ensembles.

Parameters:
  • features_a (list of str) – Feature names of the first ensemble.

  • features_b (list of str) – Feature names of the second ensemble. Must be the same as features_a. Provided as a sanity check.

  • all_data_a (float array) – Trajectory data from the first ensemble. Format: [frames, frame_data].

  • all_data_b (float array) – Trajectory data from the second ensemble. Format: [frames, frame_data].

  • blockanlen (int, optional) – Length of the blocks used in the block analysis. The trajectory is segmented into equal-size blocks of this length. The default is 10000.

  • discretize (str, optional) – Method for state discretization. Options are ‘gaussian’, which defines state limits by gaussian intersects, and ‘partition_values’, which defines state limits by partitioning all values in the data. The default is ‘gaussian’.

  • pbc (bool, optional) – If True, apply periodic boundary corrections to angular distribution inputs. The input for periodic correction must be in radians. The default is True.

  • cumdist (bool, optional) – If True, set the block analysis to a cumulative segmentation, increasing in length by the block length. The default is False.

  • verbose (bool, default=True) – Print intermediate results.

Returns:

  • ssi_names (list) – Feature names of the ensembles.

  • ssi_blocks (list of lists) – State Specific Information statistics for each feature, for each block.

ssi_sem_analysis(ssi_namelist, ssi_blocks, write_plot=True, expfit=False, plot_dir='./SEM_plots', plot_prefix='')

Standard error analysis for the block averages.

Parameters:
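  • ssi_namelist (list) – Feature names of the ensembles, as returned by ssi_block_analysis.

  • ssi_blocks (list of lists) – State Specific Information statistics for each feature, for each block, as returned by ssi_block_analysis.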

  • write_plot (bool, optional) – If True, visualise the SEM analysis. Default is True.

  • expfit (bool, optional) – If True, apply an exponential fit to the SEM plot to predict the SEM value upon full convergence. Not yet fully accurate. The default is False.

  • plot_dir (str, optional) – Directory in which to save the plots (if write_plot is True).

Returns:

  • avsemvals (list of lists) – SEM values averaged across each residue type.

  • avresssivals (list of lists) – SSI values for each block, averaged by residue type.

  • resssivals (list of lists) – SSI values for each block, for each residue type.

pensa.comparison.visualization

distances_visualization(dist_names, dist_diff, plot_filename, vmin=None, vmax=None, verbose=True, cbar_label=None, tick_step=50)

Visualizes distance features for pairs of residues in a heatmap.

Parameters:
  • dist_names (str array) – Names of the distances in PyEMMA nomenclature (contain residue IDs at position [2] and [6] when separated by ‘ ‘).

  • dist_diff (float array) – Data for each distance feature.

  • plot_filename (str) – Name of the file for the plot.

  • vmin (float, optional, default = None) – Minimum value for the heatmap.

  • vmax (float, optional, default = None) – Maximum value for the heatmap.

  • verbose (bool, optional, default = True) – Print numbers of first and last residue.

  • cbar_label (str, optional, default = None) – Label for the color bar.

  • tick_step (int, optional, default = 50) – Step between two ticks on the plot axes.

Returns:

diff – Distance matrix.

Return type:

float array

pair_features_heatmap(feat_names, feat_diff, plot_filename, separator=' - ', num_drop_char=0, sort_by_pos=None, numerical_sort=False, vmin=None, vmax=None, symmetric=True, cbar_label=None)

Visualizes data per feature pair in a heatmap.

Parameters:
  • feat_names (str array) – Names of the features in PyEMMA nomenclature (contain residue IDs).

  • feat_diff (float array) – Data to be plotted for each residue-pair feature.

  • plot_filename (str) – Name of the file for the plot.

  • separator (str) – String that separates the two parts of the pair-type feature.

  • num_drop_char (int) – Number of characters to drop at the beginning of the feature name. Defaults to 0.

  • sort_by_pos (int) – Position in the name of the feature part of the quantity by which it is to be sorted. Assumes that the name is split by ‘ ‘ (single whitespace). Counting is 0-based. If None, the entire name of the feature part is sorted by numpy.unique(). Defaults to None.

  • numerical_sort (bool) – If true, the position defined by ‘sort_by_pos’ is assumed to be an integer. Defaults to False.

  • vmin (float, optional) – Minimum value for the heatmap.

  • vmax (float, optional) – Maximum value for the heatmap.

  • symmetric (bool, optional) – The matrix is symmetric and values provided only for the upper or lower triangle. Defaults to True.

  • cbar_label (str, optional) – Label for the color bar.

Returns:

diff – Matrix with the values of the difference/divergence.

Return type:

float array

residue_visualization(names, data, ref_filename, pdf_filename, pdb_filename, selection='max', y_label='max. JS dist. of BB torsions', offset=0)

Visualizes features per residue as plot and in PDB files. Assumes values from 0 to 1.

Parameters:
  • names (str array) – Names of the features in PyEMMA nomenclature (contain residue ID).

  • data (float array) – Data to project onto the structure.

  • ref_filename (str) – Name of the file for the reference structure.

  • pdf_filename (str) – Name of the PDF file to save the plot.

  • pdb_filename (str) – Name of the PDB file to save the structure with the values to visualize.

  • selection (str, default='max') – How to select the value to visualize for each residue from all its features. Options: ‘max’, ‘min’, ‘avg’.

  • y_label (str, default='max. JS dist. of BB torsions') – Label of the y axis of the plot.

  • offset (int, default=0) – Number to subtract from the residue numbers that are loaded from the reference file.

Returns:

  • vis_resids (int array) – Residue numbers.

  • vis_values (float array) – Values of the quantity to be visualized.
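
The selection step can be pictured as grouping the feature values by residue and reducing each group to a single number. The sketch below is a hypothetical helper, not PENSA code; in particular, it assumes that the residue ID is the last whitespace-separated token of each feature name, and the example names are made up for illustration.

```python
def select_per_residue(names, data, selection='max'):
    """Reduce all feature values of each residue to one value."""
    by_residue = {}
    for name, value in zip(names, data):
        resid = int(name.split(' ')[-1])  # assumed position of the residue ID
        by_residue.setdefault(resid, []).append(value)
    reducers = {
        'max': max,
        'min': min,
        'avg': lambda values: sum(values) / len(values),
    }
    reduce_values = reducers[selection]
    resids = sorted(by_residue)
    return resids, [reduce_values(by_residue[r]) for r in resids]

# Hypothetical backbone torsion features and their JS distances.
names = ['PHI 0 ALA 12', 'PSI 0 ALA 12', 'PHI 0 GLY 13']
data = [0.1, 0.4, 0.2]
print(select_per_residue(names, data, selection='max'))
```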

resnum_heatmap(feat_names, feat_diff, plot_filename, res1_pos=2, res2_pos=6, vmin=None, vmax=None, symmetric=True, verbose=False, cbar_label=None, tick_step=50)

Visualizes data per residue pair in a heatmap.

Parameters:
  • feat_names (str array) – Names of the features in PyEMMA nomenclature (contain residue IDs).

  • feat_diff (float array) – Data to be plotted for each residue-pair feature.

  • plot_filename (str) – Name of the file for the plot.

  • res1_pos (int, optional, default = 2) – Position of the 1st residue ID in the feature name when separated by ‘ ‘.

  • res2_pos (int, optional, default = 6) – Position of the 2nd residue ID in the feature name when separated by ‘ ‘.

  • vmin (float, optional, default = None) – Minimum value for the heatmap.

  • vmax (float, optional, default = None) – Maximum value for the heatmap.

  • symmetric (bool, optional, default = True) – The matrix is symmetric and values provided only for the upper or lower triangle. Defaults to True.

  • verbose (bool, optional, default = False) – Print numbers of first and last residue.

  • cbar_label (str, optional, default = None) – Label for the color bar.

  • tick_step (int, optional, default = 50) – Step between two ticks on the plot axes.

Returns:

diff – Matrix with the values of the difference/divergence.

Return type:

float array
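
To illustrate how res1_pos, res2_pos, and symmetric interact, the sketch below builds the residue-by-residue matrix from feature names split by a single whitespace. It is a simplified stand-in written for this example (no plotting, hypothetical function and feature names), not the code used by resnum_heatmap.

```python
def residue_pair_matrix(feat_names, feat_diff, res1_pos=2, res2_pos=6,
                        symmetric=True):
    """Place one value per residue pair into a square matrix."""
    pairs = []
    for name in feat_names:
        parts = name.split(' ')
        pairs.append((int(parts[res1_pos]), int(parts[res2_pos])))
    first = min(min(pair) for pair in pairs)
    last = max(max(pair) for pair in pairs)
    size = last - first + 1
    matrix = [[0.0] * size for _ in range(size)]
    for (res1, res2), value in zip(pairs, feat_diff):
        matrix[res1 - first][res2 - first] = value
        if symmetric:
            # Mirror the value into the other triangle.
            matrix[res2 - first][res1 - first] = value
    return matrix

# Hypothetical distance features with residue IDs at positions 2 and 6.
feat_names = ['DIST: ALA 1 CA - GLY 3 CA', 'DIST: SER 2 CA - GLY 3 CA']
feat_diff = [0.5, 0.2]
for row in residue_pair_matrix(feat_names, feat_diff):
    print(row)
```

Residue pairs without a feature keep the fill value 0.0 in this sketch.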