Tracing Information Flow
========================

Here we trace infromation flow throughout the receptor using SSI analysis tools. 
The SSI resolved in the ensemble comparison concludes which features couple their
torsion angles to the ensemble conditions, thereby receiving information of the ensemble switch. 
However, to trace information flow we need to ensure that features couple to 
each other as well as to the ensemble condition.

We can trace information flow by calculating the feature-feature SSI using 
the function ``ssi_feature_analysis()``.

The Co-SSI statistic, which tells how much the communication between two features 
depends on the ensemble condition, can be calculated using ``cossi_featens_analysis()``.

Features and States
-------------------

.. code:: python

    import os
    import numpy as np
    from pensa.features import *
    from pensa.statesinfo import *
    from pensa.comparison import *

First, load the structural features as described in the previous tutorial:

.. code:: python

    sim_a_rec = read_structure_features(
        "traj/condition-a_receptor.gro",
        "traj/condition-a_receptor.xtc"
    )
    sim_b_rec = read_structure_features(
        "traj/condition-b_receptor.gro",
        "traj/condition-b_receptor.xtc"
    )
    sim_a_rec_feat, sim_a_rec_data = sim_a_rec
    sim_b_rec_feat, sim_b_rec_data = sim_b_rec


Then combine all backbone torsions from one residue to one multivariate feature. 
You can also analyze sidechain torsions, just replace ``bb`` by ``sc``. Analysis of 
sidechain torsions takes much more time to run though so for this tutorial, we look
only at backbone torsions.

.. code:: python

    bbtors_res_feat_a, bbtors_res_data_a = get_multivar_res(
        sim_a_rec_feat['bb-torsions'], sim_a_rec_data['bb-torsions'] 
    )
    bbtors_res_feat_b, bbtors_res_data_b = get_multivar_res(
        sim_b_rec_feat['bb-torsions'], sim_b_rec_data['bb-torsions']
    )

From those residue-based multivariate features, we now determine the state boundaries.

.. code:: python

    bbtors_states = get_discrete_states(
        bbtors_res_data_a, bbtors_res_data_b
    )

Let's also featurize the water cavities in the system, using the grid we generated in
the preprocessing tutorial.

.. code:: python

    grid = "ab_grid_OH2_density.dx"
    water_feat_a, water_data_a = read_water_features(
        "traj/condition-a_water.gro", "traj/condition-a_water_aligned.xtc",
        top_waters = 5, atomgroup = "OH2", grid_input = grid
    )
    water_feat_b, water_data_b = read_water_features(
        "traj/condition-b_water.gro", "traj/condition-b_water.xtc",
        top_waters = 5, atomgroup = "OH2", grid_input = grid
    )

Just like we did for torsions, we can now determine the discrete states for the orientation
of the water molecules in the pockets.

.. code:: python

    water_states = get_discrete_states(
        water_data_a['WaterPocket_Distr'],
        water_data_b['WaterPocket_Distr'],
        discretize='gaussian', pbc=True
    )

Water occupancy (is water in the pocket or not?) is described as a binary feature with the 
values 0 or 1. We can thus define the state boundaries manually.

.. code:: python

    water_occup_states = [[[-0.1, 0.5, 1.1]]] * len(water_states)


Water Pockets
-------------

We start by comparing the occupancy between the two conditions, similar to what we did in the
comparison tutorial:

.. code:: python

    water_names, water_ssi = ssi_ensemble_analysis(
        water_feat_a['WaterPocket_OccupDistr'], water_feat_b['WaterPocket_OccupDistr'], 
        water_data_a['WaterPocket_OccupDistr'], water_data_b['WaterPocket_OccupDistr'],
        water_occup_states, verbose=True, h2o=False, pbc=False
    )

Except for water site O2, there are not many differences in occupancy, so let's have a closer 
look and compare the orientation of the water molecules between the two conditions: 

.. code:: python

    water_names, water_ssi = ssi_ensemble_analysis(
        water_feat_a['WaterPocket_Distr'], water_feat_b['WaterPocket_Distr'], 
        water_data_a['WaterPocket_Distr'], water_data_b['WaterPocket_Distr'],
        water_states, verbose=True, h2o=True
    )

Beyond comparing distributions, we can also quantify the amount of feature-feature communication
between the water sites. The corresponding function looks similar to the ensemble comarison above,
however it calculates the amount of shared information between each pair of residues.

.. code:: python

    water_pairs_names, water_pairs_ssi = ssi_feature_analysis(
        water_feat_a['WaterPocket_Distr'], water_feat_b['WaterPocket_Distr'], 
        water_data_a['WaterPocket_Distr'], water_data_b['WaterPocket_Distr'],
        water_states, verbose=True, h2o=True
    )

This function produces an array for the SSI between all features. This array is two-dimensional,
so we visualize its values using a heat map:

.. code:: python

    from pensa.comparison import resnum_heatmap, pair_features_heatmap

    pair_features_heatmap(
        water_pairs_names, water_pairs_ssi,
        "plots/water-pairs_ssi.pdf",
        vmin = 0.0, vmax = 1.,
        cbar_label='SSI',
        separator=' & '
    )


Water and Torsions Combined
---------------------------

Now we want to investigate information flow between more than one type of features.
To do so, we combine the water sites and the torsions:

.. code:: python

    all_feat_a = water_feat_a['WaterPocket_Distr'] + bbtors_res_feat_a
    all_feat_b = water_feat_b['WaterPocket_Distr'] + bbtors_res_feat_b
    all_data_a = np.array(list(water_data_a['WaterPocket_Distr']) + list(bbtors_res_data_a), dtype=object)
    all_data_b = np.array(list(water_data_b['WaterPocket_Distr']) + list(bbtors_res_data_b), dtype=object)
    all_states = water_states + bbtors_states

Note that we only use backbone torsions here to minimize computational effort.
Analysis of sidechain torsions (or of both combined) can often deliver more scientific insigts.

As we did for the water sites above, we now calculate the SSI for all combined feature-feature pairs.

.. code:: python

    all_pairs_names, all_pairs_ssi = ssi_feature_analysis(
        all_feat_a, all_feat_b, 
        all_data_a, all_data_b,
        all_states, verbose=True
    )

The number of pairs for an entire protein is enormous so we determine those with 
the highest SSI. To alleviate the computational effort for the sort function, we first 
filter the pairs by a threshold SSI of 0.5:

.. code:: python

    relevant = np.abs(all_pairs_ssi) > 0.5 
    not_self = np.array([name.split(' & ')[0] != name.split(' & ')[1] for name in all_pairs_names])
    relevant *= not_self
    argrelev = np.argwhere(relevant).flatten()
    all_relevant_pairs_names = [all_pairs_names[i] for i in argrelev]
    all_relevant_pairs_ssi = all_pairs_ssi[relevant]

Then we run the actual sorting by SSI.

.. code:: python

    sort_features(all_relevant_pairs_names, all_relevant_pairs_ssi)


The Co-SSI feature-feature-ensemble analysis is done in the same manner. 

.. code:: python

    all_new_pairs_names, all_new_pairs_ssi, all_new_pairs_cossi = cossi_featens_analysis(
        all_feat_a, all_feat_b, all_feat_a, all_feat_b,
        all_data_a, all_data_b, all_data_a, all_data_b,
        all_states, all_states, verbose=True
    )
                                             
The output of ``cossi_featens_analysis()`` produces an array for the SSI and the Co-SSI
between all features. These results can again be visualized in a 2D representation.