Ensemble Comparison =================== Preparation ----------- First we import the necessary modules. .. code:: python from pensa.comparison import * from pensa.features import * from pensa.statesinfo import * import numpy as np Then we load the structural features as described in the previous tutorial: .. code:: python sim_a_rec = read_structure_features( "traj/condition-a_receptor.gro", "traj/condition-a_receptor.xtc" ) sim_b_rec = read_structure_features( "traj/condition-b_receptor.gro", "traj/condition-b_receptor.xtc" ) sim_a_rec_feat, sim_a_rec_data = sim_a_rec sim_b_rec_feat, sim_b_rec_data = sim_b_rec Relative Entropy ---------------- Here we compare the two ensembles using measures for the relative entropy. To anser the question "How different are the two distributions of each feature?" PENSA provides discrete implementations of the Jensen-Shannon distance and the Kullback-Leibler divergences (both from distribution A to distribution B and from distribution B to distribution A, which are not identical). Their sensitivity can be adjusted via the number/spacing of the bins. This type of analysis works well with large datasets, for which even a fine spacing leaves enough samples in each relevant bin. You can as well calculate the Kolmogorov-Smirnov metric and the corresponding p value using the function ``kolmogorov_smirnov_analysis()``. In contrast to the binned JSD and KLD, the KS statistic is by design discrete and parameter-free. It is more suitable for small datasets, since its usual purpose is hypothesis testing and the comparison of empirical distributions. It tries to answer the question “Are the samples from different distributions?” Another possibility is to compare only the means and standard deviations of the distributions using ``mean_difference_analysis()``. We start with the backbone torsions, which we can select via ``'bb-torsions'``. To do the same analysis on sidechain torsions, replace ``'bb-torsions'`` with ``'sc-torsions'``. .. code:: python relen = relative_entropy_analysis( sim_a_rec_feat['bb-torsions'], sim_b_rec_feat['bb-torsions'], sim_a_rec_data['bb-torsions'], sim_b_rec_data['bb-torsions'], bin_num=10, verbose=False ) names_bbtors, jsd_bbtors, kld_ab_bbtors, kld_ba_bbtors = relen The above function also returns the Kullback-Leibler divergences of A with respect to B and vice versa. To find out where the ensembles differ the most, let's print out the most different features and the corresponding value. .. code:: python sf = sort_features(names_bbtors, jsd_bbtors) for f in sf[:12]: print(f[0], f[1]) To get an overview of how strongly the ensembles differ in which region, we can plot the maximum deviation of the features related to a certain residue. The following function also writes the maximum Jensen-Shannon distance per residue in the "B factor" field of a PDB file. .. code:: python ref_filename = "traj/condition-a_receptor.gro" out_filename = "receptor_bbtors-deviations" vis = residue_visualization( names_bbtors, jsd_bbtors, ref_filename, "plots/"+out_filename+"_jsd.pdf", "vispdb/"+out_filename+"_jsd.pdb", y_label='max. JS dist. of BB torsions' ) .. image:: images/JSD_pdb.png :height: 300px :align: center :alt: JSD pbd b-factor visualisaton file. .. image:: images/sc-jsd.png :height: 300px :align: center :alt: Jensen-Shannon Distance PDF output. Let's now save the resulting data in CSV files. .. code:: python np.savetxt( 'results/'+out_filename+'_relen.csv', np.array(relen).T, fmt='%s', delimiter=',', header='Name, JSD(A,B), KLD(A,B), KLD(B,A)' ) np.savetxt( 'results/'+out_filename+'_jsd.csv', np.array(vis).T, fmt='%s', delimiter=',', header='Residue, max. JSD(A,B)' ) State-Specific Information -------------------------- In addition, we can investigate differences in discrete conformational microstates within the torsion distributions by employing the State Specific Information (SSI) analysis in a similar manner. The prime example for this kind of analysis are protein sidechain torsions. The conformational microstates of each residue are multidimensional, incorporating all torsion angles in the definition of a residue's conformational space. This is why we first combine all torsions from the same residue to one multivariate feature. .. code:: python multivar_res_feat_a, multivar_res_data_a = get_multivar_res( sim_a_rec_feat['sc-torsions'], sim_a_rec_data['sc-torsions'] ) multivar_res_feat_b, multivar_res_data_b = get_multivar_res( sim_b_rec_feat['sc-torsions'], sim_b_rec_data['sc-torsions'] ) Then we determine the state boundaries. The distributions are decomposed into the individual Gaussians which fit the distribution, and conformational microstates are determined based on the Gaussian intersects. It is therefore necessary that each state is sampled sufficiently in order to accurately define the conformational states. .. code:: python discrete_states_ab = get_discrete_states( multivar_res_data_a, multivar_res_data_b ) Now we can run the main SSI comparison. .. code:: python resnames, ssi = ssi_ensemble_analysis( multivar_res_feat_a, multivar_res_feat_b, multivar_res_data_a, multivar_res_data_b, discrete_states_ab, verbose=False ) We can plot the results in the same way as we did for the backbone analysis. .. code:: python ref_filename = "traj/condition-a_receptor.gro" out_filename = "receptor_sctors-_ssi" vis = residue_visualization( resnames, ssi, ref_filename, "plots/"+out_filename+"_ssi.pdf", "vispdb/"+out_filename+"_ssi.pdb", y_label='max. SSI of SC torsions' ) .. image:: images/SSI_pdb.png :height: 300px :align: center :alt: SSI pbd b-factor visualisaton file. .. image:: images/sc-ssi.png :height: 300px :align: center :alt: State-Specific Information PDF output. Comparing Distances ------------------- Another common representation for the overall structure of a protein are the distances between the C-alpha atoms. We can perform the same kinds of analysis on them but will need a different approach to visualize them. Let's use the relative entropy again: .. code:: python relen = relative_entropy_analysis( sim_a_rec_feat['bb-distances'], sim_b_rec_feat['bb-distances'], sim_a_rec_data['bb-distances'], sim_b_rec_data['bb-distances'], bin_num=10, verbose=False ) names_bbdist, jsd_bbdist, kld_ab_bbdist, kld_ba_bbdist = relen We print the twelve distances with the highest deviations. .. code:: python sf = sort_features(names_bbdist, jsd_bbdist) for f in sf[:12]: print(f[0], f[1]) To visualize distances, we need a two-dimensional representation with the residues on each axis. We color each field with the value of the Jensen-Shannon distance (but could as well use Kullback-Leibler divergence, Kolmogorov-Smirnov statistic etc. instead). .. code:: python matrix = distances_visualization( names_bbdist, jsd_bbdist, "plots/receptor_jsd-bbdist.pdf", vmin = 0.0, vmax = 1.0, cbar_label='JSD' ) .. image:: images/bb-dists.png :height: 300px :align: center :alt: JSD distances pbf plot.