Featurization ============= The analysis is not performed on the coordinates directly but on features derived from these coordinates. PENSA already includes several feature readers, some of which we show here. You can also write your own following the general pattern. Note that all reader functions load the names of the features and their values separately. .. image:: images/Torsions.jpg :height: 300px :align: center :alt: Torsion angles. Basic Example ************* For example, we can read protein backbone torsions. .. code:: python from pensa.features import * bbtors_feat, bbtors_data = read_protein_backbone_torsions( "traj/condition-a_receptor.gro", "traj/condition-a_receptor.xtc", selection='all', first_frame=0, last_frame=None, step=1 ) The first part of the output always contains the names of the features and the second part contains the corresponding data. .. code:: python print(bbtors_feat[:3]) print(bbtors_data[:3]) Having a look at the shape of the loaded data, we see that the first dimension is the number of frames. The second dimension is the number of features. .. code:: python print("Number of torsions:", len(bbtors_feat)) print("Data format:", bbtors_data.shape) PENSA has two simple functions to store features in csv file ... .. code:: python write_csv_features(bbtors_feat, bbtors_data, 'features/bb-torsions_a.csv') ... and to load them again. .. code:: python bbtors_feat, bbtors_data = read_csv_features('features/bb-torsions_a.csv') Protein Structure Features ************************** Here we use the special function ``read_structure_features`` that loads three types of structure features at once: - backbone torsions: ``'bb-torsions'``, - backbone C-alpha distances: ``'bb-distances'``, and - sidechain torsions: ``'sc-torsions'``. It was modeled after the feature loader in PyEMMA. .. code:: python from pensa.features import * .. code:: python sim_a_rec = read_structure_features( "traj/condition-a_receptor.gro", "traj/condition-a_receptor.xtc" ) sim_a_rec_feat, sim_a_rec_data = sim_a_rec .. code:: python sim_b_rec = read_structure_features( "traj/condition-b_receptor.gro", "traj/condition-b_receptor.xtc" ) sim_b_rec_feat, sim_b_rec_data = sim_b_rec For this function, the feature names and the data both contain dictionaries with entries for each feature type. Let's loop through them to make sure that the number of features is the same for both simulations. This is a requirement for the further analysis. .. code:: python for k in sim_a_rec_data.keys(): print(k, sim_a_rec_data[k].shape) .. code:: python for k in sim_b_rec_data.keys(): print(k, sim_b_rec_data[k].shape) Now let's do the same only for the transmembrane region. .. code:: python sim_a_tmr = read_structure_features( "traj/condition-a_tm.gro", "traj/condition-a_tm.xtc" ) sim_b_tmr = read_structure_features( "traj/condition-b_tm.gro", "traj/condition-b_tm.xtc" ) sim_a_tmr_feat, sim_a_tmr_data = sim_a_tmr sim_b_tmr_feat, sim_b_tmr_data = sim_b_tmr for k in sim_a_rec_data.keys(): print(k, sim_a_rec_data[k].shape) for k in sim_b_rec_data.keys(): print(k, sim_b_rec_data[k].shape) Water Features ************** Water molecules are featurized from water density. Unlike residues which are fixed to a protein, a single water molecule can move throughout the entire simulation box, therefore featurizing a single water molecule does not make sense. Instead, it is the spatially conserved internal protein cavities in which water molecules occupy that are of interest. Water pocket featurization extracts a distribution that represents whether or not a specific protein cavity is occupied by a water molecule, and what that water molecule's orientation (polarisation) is. .. image:: images/WaterFeatures.jpg :width: 300px :align: center :alt: Water features derived from density. .. code:: python from pensa.features import * For the pdb visualisation, the trajectory needs to be fit to the first frame of the simulation so that the density and protein align with each other. Here we featurize the top 2 most probable water sites (top_waters = 2). Orientation of the waters (water_data - spherical coordinates [radians]) is a timeseries distribution. When water is not present at the site, the orientation is recorded as 10000.0 to represent an empty state. By specifying an name to write data out with in the argument - out_name, we can visualise the pocket occupancies on the reference structure in a pdb file with pocket occupancy saved as b_factors. You must specify the water model for writing out the grid. options include: SPC TIP3P TIP4P water .. code:: python struc = "traj/condition-a_water.gro" xtc = "traj/condition-a_water_aligned.xtc" water_feat, water_data = read_water_features( structure_input = struc, xtc_input = xtc, top_waters = 2, atomgroup = "OH2", write_grid_as="TIP3P", out_name = "features/11426_dyn_151_water" ) To featurize sites common to both ensembles, we obtain the density grid following the steps in the density section of the preprocessing tutorial. This way, sites are the same across both ensembles and can be compared. .. code:: python struc = "traj/condition-a_water.gro" xtc = "traj/condition-a_water_aligned.xtc" grid = "ab_grid_OH2_density.dx" water_feat, water_data = read_water_features( structure_input = struc, xtc_input = xtc, top_waters = 2, atomgroup = "OH2", grid_input = grid ) Single-Atom Features ******************** For single atoms we use a similar function which provides the same functionality but ignores orientations as atoms are considered spherically symmetric. .. code:: python from pensa.features import * Here we locate the sodium site which has the highest probability. The density is written (write=True) using the default density conversion "Angstrom^{-3}" in MDAnalysis. .. code:: python struc = "mor-data/11426_dyn_151.pdb" xtc = "mor-data/11423_trj_151.xtc" atom_feat, atom_data = read_atom_features( structure_input = struc, xtc_input = xtc, top_atoms = 1, atomgroup = "SOD", element = "Na", out_name = "features/11426_dyn_151_sodium" )