pensa.features
Methods to read and process features from coordinates.
pensa.features.atom_features
Methods to obtain a timeseries distribution for the atom/ion pockets’ occupancies.
Atom pockets are defined as spheres of radius 2.5 Å (based on the bond lengths between Na and O, and between Ca and O) centered on the probability density maxima of the atoms.
The methods here are based on the following paper:
Neil J. Thomson, Owen N. Vickery, Callum M. Ives, Ulrich Zachariae: Ion-water coupling controls class A GPCR signal transduction pathways.
- read_atom_features(structure_input, xtc_input, atomgroup, element, top_atoms=10, grid_input=None, write=None, out_name=None)
Featurize atom pockets for the top X most probable atoms (top_atoms).
- Parameters:
structure_input (str) – File name for the reference file (PDB or GRO format).
xtc_input (str) – File name for the trajectory (xtc format).
atomgroup (str) – Atomgroup selection to calculate the density for (atom name in structure_input).
element (TYPE) – DESCRIPTION.
top_atoms (int, optional) – Number of atoms to featurize. The default is 10.
grid_input (str, optional) – File name for the density grid input. The default is None, and a grid is automatically generated.
write (bool, optional) – If true, the following data will be written out: reference pdb with occupancies, atom distributions, atom data summary. The default is None.
out_name (str, optional) – Prefix for all written filenames. The default is None.
- Returns:
feature_names (list of str) – Names of all features
features_data (numpy array) – Data for all features
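As a rough illustration of the pocket definition above, the following sketch marks a pocket as occupied in a frame when any atom lies within 2.5 Å of the pocket center. The function name pocket_occupancy, the coordinates, and the data layout are hypothetical and not part of PENSA; locating the pocket centers from density maxima is omitted, and the center is simply assumed to be known.

```python
import math

POCKET_RADIUS = 2.5  # Angstrom, the pocket radius quoted in the text above


def pocket_occupancy(center, frames, radius=POCKET_RADIUS):
    """Return one 0/1 value per frame: 1 if any atom is inside the pocket.

    center: (x, y, z) of the pocket, assumed to be known already.
    frames: list of frames, each a list of (x, y, z) atom positions.
    """
    occupancy = []
    for positions in frames:
        inside = any(math.dist(center, pos) <= radius for pos in positions)
        occupancy.append(1 if inside else 0)
    return occupancy


# Toy trajectory with two ions and three frames (made-up coordinates).
center = (0.0, 0.0, 0.0)
frames = [
    [(1.0, 1.0, 1.0), (9.0, 9.0, 9.0)],  # first ion at about 1.73 A: occupied
    [(3.0, 0.0, 0.0), (9.0, 9.0, 9.0)],  # nearest ion at 3.0 A: empty
    [(8.0, 0.0, 0.0), (0.0, 2.5, 0.0)],  # second ion exactly at 2.5 A: occupied
]
occupancy = pocket_occupancy(center, frames)
```

The result is a binary time series with one entry per frame, which is the kind of occupancy distribution this module is described as producing.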
pensa.features.csv_features
- read_csv_features(csv_file)
Load features from a CSV file as produced by PENSA.
- Parameters:
csv_file (str) – File name for the input CSV file.
- Returns:
feature_names (list of str) – Names of the features.
features_data (numpy array) – Data for the features. Format: [frames, frame_data].
- read_drormd_features(csv_file)
Load features from a CSV file as produced by DrorMD.
- Parameters:
csv_file (str) – File name for the input CSV file.
- Returns:
feature_names (list of str) – Names of the features.
features_data (numpy array) – Data for the features. Format: [frames, frame_data].
- write_csv_features(feature_names, feature_data, csv_file)
Write features to a CSV file.
- Parameters:
feature_names (list of str) – Names of the features.
features_data (numpy array) – Data for the features. Format: [frames, frame_data].
csv_file (str) – File name for the output CSV file.
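The [frames, frame_data] layout used by these functions can be pictured with a small round trip through the standard csv module. This is only a sketch: the file layout assumed here (one header row with the feature names, then one row per frame) and the helper names write_features_csv and read_features_csv are assumptions for illustration, not the format PENSA actually writes.

```python
import csv
import io


def write_features_csv(feature_names, features_data, handle):
    """Write a header row of names followed by one row per frame."""
    writer = csv.writer(handle)
    writer.writerow(feature_names)
    writer.writerows(features_data)


def read_features_csv(handle):
    """Read names from the first row and float data from the remaining rows."""
    reader = csv.reader(handle)
    feature_names = next(reader)
    features_data = [[float(value) for value in row] for row in reader]
    return feature_names, features_data


# Three frames with two features each (made-up values).
names = ["feat_a", "feat_b"]
data = [[0.1, 1.5], [0.2, 1.4], [0.3, 1.6]]

buffer = io.StringIO()          # stands in for a file on disk
write_features_csv(names, data, buffer)
buffer.seek(0)
names_back, data_back = read_features_csv(buffer)
```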
pensa.features.hbond_features
- atg_to_names(atg)
- name_atom_features(u, atom_ids, feature_type='H-DON', naming='plain')
- name_pairs(u, all_pairs, pair_type='HBOND', naming='plain')
- read_h_bond_satisfaction(structure_input, xtc_input, fixed_group, dyn_group='all', naming='plain')
Find whether hydrogen-bond donors and acceptors in atom group 1 (fixed) are satisfied by partners in atom group 2 (dynamic).
- Parameters:
structure_input (str) – File name for the reference file (TPR format).
xtc_input (str) – File name for the trajectory (xtc format).
fixed_group (str) – Atomgroup selection to find bonding partners for.
dyn_group (str) – Atomgroup selection to find bonding partners within.
naming (str, default='plain') – Naming scheme for each atom in the feature names. Options: plain (neither chain nor segment ID included); chainid (include chain ID, only works if chains are defined); segid (include segment ID, only works if segments are defined).
- Returns:
feature_names (list of str) – Names of all H-bond donors and acceptors
features_data (numpy array) – Binary satisfaction data for all donors and acceptors
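The binary satisfaction data can be thought of as one 0/1 entry per frame and per donor or acceptor in the fixed group. The sketch below assumes that the hydrogen-bonded atom pairs of each frame have already been detected by some other means; the function name, the atom labels, and the input layout are hypothetical.

```python
def donor_acceptor_satisfaction(fixed_atoms, hbonds_per_frame):
    """Return one row per frame with a 0/1 entry per atom in fixed_atoms.

    fixed_atoms: identifiers of donors/acceptors in the fixed group.
    hbonds_per_frame: for each frame, a list of (atom_1, atom_2) pairs that
        are hydrogen-bonded in that frame (assumed to be detected elsewhere).
    """
    features_data = []
    for pairs in hbonds_per_frame:
        bonded = set()
        for atom_1, atom_2 in pairs:
            bonded.add(atom_1)
            bonded.add(atom_2)
        features_data.append([1 if atom in bonded else 0 for atom in fixed_atoms])
    return features_data


fixed_atoms = ["ASP52-OD1", "SER91-OG"]  # made-up atom labels
hbonds_per_frame = [
    [("ASP52-OD1", "WAT301-O")],         # only the first atom is bonded
    [],                                  # no hydrogen bonds in this frame
    [("SER91-OG", "WAT305-O"), ("ASP52-OD1", "WAT301-O")],
]
satisfaction = donor_acceptor_satisfaction(fixed_atoms, hbonds_per_frame)
```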
- read_h_bonds(structure_input, xtc_input, selection1, selection2, naming='plain')
Read hydrogen bonds between two atom groups.
- Parameters:
structure_input (str) – File name for the reference file (TPR format).
xtc_input (str) – File name for the trajectory (xtc format).
selection1 (str) – Atom group selection to find bonding partners for.
selection2 (str) – Atom group selection to find bonding partners within.
naming (str, default='plain') – Naming scheme for each atom in the feature names. Options: plain (neither chain nor segment ID included); chainid (include chain ID, only works if chains are defined); segid (include segment ID, only works if segments are defined).
- Returns:
feature_names (list of str) – Names of all bonds
features_data (numpy array) – Data for all bonds
- read_h_bonds_quickly(structure_input, xtc_input, fixed_group, dyn_group)
Find hydrogen bonding partners for atomgroup1 in atomgroup2.
- Parameters:
structure_input (str) – File name for the reference file (TPR format).
xtc_input (str) – File name for the trajectory (xtc format).
fixed_group (str) – Atomgroup selection to find bonding partners for.
dyn_group (str) – Atomgroup selection to find bonding partners within.
- Returns:
feature_names (list of str) – Names of all bonds
features_data (numpy array) – Data for all bonds
- read_water_site_h_bonds(structure_input, xtc_input, water_o_atom_name, biomol_sel='protein', site_IDs=None, grid_input=None, write_grid_as=None, out_name=None)
Find hydrogen bonds between waters occupying cavities and protein.
- Parameters:
structure_input (str) – File name for the reference file (TPR format).
xtc_input (str) – File name for the trajectory (xtc format).
water_o_atom_name (str) – Atom name to calculate the density for (usually water oxygen).
biomol_sel (str) – Selection string for the biomolecule that forms the cavity/site. The default is ‘protein’.
site_IDs (list, optional) – List of indices of the sites to investigate. If None, all sites are analyzed.
grid_input (str, optional) – File name for the density grid input. The default is None, and a grid is automatically generated.
write (bool, optional) – If true, the following data will be written out: reference pdb with occupancies, water distributions, water data summary. The default is None.
write_grid_as (str, optional) – If you choose to write out the grid, you must specify the water model to convert the density into. The default is None. Options are suggested if default.
out_name (str, optional) – Prefix for all written filenames. The default is None.
- Returns:
feature_names (list of str) – Names of all features
features_data (numpy array) – Data for all features
- read_water_site_h_bonds_quickly(structure_input, xtc_input, atomgroups, site_IDs, grid_input=None, write=None, write_grid_as=None, out_name=None)
Find hydrogen bonds between waters occupying cavities and protein.
- Parameters:
structure_input (str) – File name for the reference file (TPR format).
xtc_input (str) – File name for the trajectory (xtc format).
atomgroup (str) – Atomgroup selection to calculate the density for (atom name in structure_input).
site_IDs (list) – List of indices of the sites to investigate.
grid_input (str, optional) – File name for the density grid input. The default is None, and a grid is automatically generated.
write (bool, optional) – If true, the following data will be written out: reference pdb with occupancies, water distributions, water data summary. The default is None.
write_grid_as (str, optional) – If you choose to write out the grid, you must specify the water model to convert the density into. The default is None. Options are suggested if default.
out_name (str, optional) – Prefix for all written filenames. The default is None.
- Returns:
feature_names (list of str) – Names of all features
features_data (numpy array) – Data for all features
pensa.features.mda_combined
- read_structure_features(pdb, xtc, start_frame=0, step_width=1, cossin=False, features=['bb-torsions', 'sc-torsions', 'bb-distances'], resnum_offset=0)
Load the features. Currently implemented: bb-torsions, sc-torsions, bb-distances
- Parameters:
pdb (str) – File name for the reference file (PDB or GRO format).
xtc (str) – File name for the trajectory (xtc format).
start_frame (int, default=0) – First frame to return of the features. Already takes subsampling by stride>=1 into account.
step_width (int, default=1) – Subsampling step width when reading the frames.
cossin (bool, default=False) – Use cosine and sine for angles.
features (list of str, default=['bb-torsions', 'sc-torsions', 'bb-distances']) – Names of the features to be extracted.
resnum_offset (int, default=0) – Number to subtract from the residue numbers that are loaded from the reference file.
- Returns:
feature_names (dict of lists of str) – Names of all features
features_data (dict of numpy arrays) – Data for all features
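The cossin option replaces each angle by its cosine and sine, which removes the artificial jump at the periodic boundary. The sketch below shows the idea on plain lists; the function name to_cossin and the COS(...)/SIN(...) naming are hypothetical, and angles are assumed to be in radians.

```python
import math


def to_cossin(names, frames):
    """Replace every angle column by a cosine column and a sine column."""
    new_names = []
    for name in names:
        new_names.append("COS(%s)" % name)
        new_names.append("SIN(%s)" % name)
    new_frames = []
    for row in frames:
        new_row = []
        for angle in row:
            new_row.append(math.cos(angle))
            new_row.append(math.sin(angle))
        new_frames.append(new_row)
    return new_names, new_frames


# Two frames of one torsion that sit on opposite sides of the +/- pi boundary.
names = ["phi"]
frames = [[-math.pi + 0.01], [math.pi - 0.01]]
cs_names, cs_frames = to_cossin(names, frames)

# The raw angles differ by almost 2*pi, the transformed values by about 0.02.
raw_difference = abs(frames[0][0] - frames[1][0])
sin_difference = abs(cs_frames[0][1] - cs_frames[1][1])
```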
- sort_traj_along_combined_feature(feat, data, feature_name, feature_type, ref_name, trj_name, out_name, start_frame=0)
Sort a trajectory along one feature in a combined set.
- Parameters:
feat (list of str) – List with all feature names.
data (float array) – Feature values data from the simulation.
feature_name (str) – Name of the selected feature.
feature_type (str) – Type of the selected feature.
ref_name (string) – Reference topology for the trajectory.
trj_name (string) – Trajectory from which the frames are picked. Usually the same as the values are from.
out_name (string) – Name of the output files.
start_frame (int) – Offset of the data with respect to the trajectories.
- Returns:
d_sorted – Sorted data of the selected feature.
- Return type:
float array
pensa.features.mda_distances
- read_atom_group_distances(pdb, xtc, sel_a='protein', sel_b='resname LIG', first_frame=0, last_frame=None, step=1, naming='plain')
Load distances between all atom pairs between two selected groups.
- Parameters:
pdb (str) – File name for the reference file (PDB or GRO format).
xtc (str) – File name for the trajectory (xtc format).
sel_a (str, default='protein') – Selection string to choose atoms for the first group.
sel_b (str, default='resname LIG') – Selection string to choose atoms for the second group.
first_frame (int, default=0) – First frame to return of the features. Zero-based.
last_frame (int, default=None) – Last frame to return of the features. Zero-based.
step (int, default=1) – Subsampling step width when reading the frames.
naming (str, default='plain') – Naming scheme for each atom in the feature names. Options: plain (neither chain nor segment ID included); chainid (include chain ID, only works if chains are defined); segid (include segment ID, only works if segments are defined).
- Returns:
feature_names (list of str) – Names of all distances
features_data (numpy array) – Data for all distances [Å]
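To make the shape of the returned data concrete, the sketch below computes every distance between an atom of a first group and an atom of a second group, frame by frame. It works on plain coordinate lists rather than trajectory files; the function name group_distances and the "DIST: a - b" label format are hypothetical.

```python
import math


def group_distances(names_a, frames_a, names_b, frames_b):
    """All A-B pair distances per frame, with one feature name per pair."""
    feature_names = ["DIST: %s - %s" % (a, b) for a in names_a for b in names_b]
    features_data = []
    for pos_a, pos_b in zip(frames_a, frames_b):
        # Same nesting order as for the names: loop over A, then over B.
        features_data.append([math.dist(p, q) for p in pos_a for q in pos_b])
    return feature_names, features_data


# Two atoms in group A, one atom in group B, two frames (made-up coordinates).
names_a = ["CA1", "CA2"]
names_b = ["LIG"]
frames_a = [[(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)],
            [(0.0, 0.0, 0.0), (0.0, 4.0, 3.0)]]
frames_b = [[(0.0, 4.0, 0.0)],
            [(0.0, 4.0, 0.0)]]
dist_names, dist_data = group_distances(names_a, frames_a, names_b, frames_b)
```

With N atoms in the first group and M in the second, each frame yields N x M distances, so large selections grow quickly.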
- read_atom_self_distances(pdb, xtc, selection='all', first_frame=0, last_frame=None, step=1, naming='plain')
Load distances between all selected atoms.
- Parameters:
pdb (str) – File name for the reference file (PDB or GRO format).
xtc (str) – File name for the trajectory (xtc format).
selection (str) – Selection string to choose which atoms to include. Default: all.
first_frame (int, default=0) – First frame to return of the features. Zero-based.
last_frame (int, default=None) – Last frame to return of the features. Zero-based.
step (int, default=1) – Subsampling step width when reading the frames.
naming (str, default='plain') – Naming scheme for each atom in the feature names. Options: plain (neither chain nor segment ID included); chainid (include chain ID, only works if chains are defined); segid (include segment ID, only works if segments are defined).
- Returns:
feature_names (list of str) – Names of all distances
features_data (numpy array) – Data for all distances [Å]
- read_calpha_distances(pdb, xtc, first_frame=0, last_frame=None, step=1)
Load distances between all C-alpha atoms.
- Parameters:
pdb (str) – File name for the reference file (PDB or GRO format).
xtc (str) – File name for the trajectory (xtc format).
first_frame (int, default=0) – First frame to return of the features. Zero-based.
last_frame (int, default=None) – Last frame to return of the features. Zero-based.
step (int, default=1) – Subsampling step width when reading the frames.
- Returns:
feature_names (list of str) – Names of all C-alpha distances
features_data (numpy array) – Data for all C-alpha distances [Å]
- read_gpcr_calpha_distances(pdb, xtc, gpcr_name, res_dbnum, first_frame=0, last_frame=None, step=1)
Load distances between the C-alpha atoms of selected GPCR residues.
- Parameters:
pdb (str) – File name for the reference file (PDB or GRO format).
xtc (str) – File name for the trajectory (xtc format).
gpcr_name (str) – Name of the GPCR as in the GPCRdb.
res_dbnum (list) – Relative GPCR residue numbers.
first_frame (int, default=0) – First frame to return of the features. Zero-based.
last_frame (int, default=None) – Last frame to return of the features. Zero-based.
step (int, default=1) – Subsampling step width when reading the frames.
- Returns:
feature_names (list of str) – Names of all C-alpha distances.
feature_labels (list of str) – Labels containing GPCRdb numbering of the residues.
features_data (numpy array) – Data for all C-alpha distances [Å].
- select_gpcr_residues(gpcr_name, res_dbnum)
Gets sequential residue numbers for residues provided as GPCRdb numbers.
- Parameters:
gpcr_name (str) – Name of the GPCR as in the GPCRdb.
res_dbnum (list of str) – Relative GPCR residue numbers.
- Returns:
sel_resnum (list of int) – Sequential residue numbers.
sel_labels (list of str) – Labels containing GPCRdb numbering of the residues.
pensa.features.mda_torsions
- find_atom_by_name(res, at_name)
Find the index of the first atom of a certain name in a residue.
- Parameters:
res (Residue) – MDAnalysis residue object.
at_name (str) – Name of the requested atom.
- Returns:
index – Index of the first atom with name at_name or -1 (if none of the atoms has this name)
- Return type:
int
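The lookup described here, including the -1 fallback, can be sketched on a plain list of atom names. The real function takes an MDAnalysis residue object; the list-based stand-in below and its name first_index_by_name are hypothetical, and the returned value is a position in that list rather than an atom index in a structure.

```python
def first_index_by_name(atom_names, at_name):
    """Position of the first atom called at_name, or -1 if there is none."""
    for index, name in enumerate(atom_names):
        if name == at_name:
            return index
    return -1


residue_atoms = ["N", "CA", "C", "O", "CB"]  # atom names of one residue
ca_index = first_index_by_name(residue_atoms, "CA")
missing_index = first_index_by_name(residue_atoms, "P")
```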
- find_atom_indices_per_residue(pdb, at_names=["C4'", 'P', "C4'", 'P'], rel_res=[-1, 0, 0, 1], selection='all', verbose=False)
Find the indices of atoms with a certain name for each residue (and its neighbors).
- Parameters:
pdb (str) – File name for the reference file (PDB or GRO format).
at_names (list of str or list of list of str) – Names of the requested atoms or list of sets of names of requested atoms. If a list of lists is passed, all sub-lists must have the same length.
rel_res (list of int, default=[-1, 0, 0, 1]) – Residue number of each atom’s residue relative to the current residue.
selection (str, default = 'all') – MDAnalysis selection string
verbose (bool, default = False) – Print info for all residues.
- Returns:
feature_names (list of str) – Generic names of all torsions
features_data (numpy array) – Data for all torsions [Å]
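The interplay of at_names and rel_res can be illustrated as follows: for each residue i, the k-th atom is looked up by name at_names[k] in residue i + rel_res[k]. The sketch below uses dictionaries in place of a structure file and skips residues for which a neighbor or an atom is missing; that skipping rule, the function name, and the data layout are assumptions made for illustration.

```python
def quadruplets_per_residue(residues, at_names, rel_res):
    """Collect one list of atom indices per residue where all atoms exist.

    residues: list of dicts mapping atom name -> atom index, in chain order.
    """
    quadruplets = []
    for i in range(len(residues)):
        indices = []
        for name, offset in zip(at_names, rel_res):
            j = i + offset
            if j < 0 or j >= len(residues) or name not in residues[j]:
                break  # neighbor or atom missing: skip this residue
            indices.append(residues[j][name])
        else:
            # Only reached if the inner loop did not break.
            quadruplets.append(indices)
    return quadruplets


# Three nucleotides, each with a P and a C4' atom (made-up indices).
residues = [{"P": 0, "C4'": 1}, {"P": 2, "C4'": 3}, {"P": 4, "C4'": 5}]
at_names = ["C4'", "P", "C4'", "P"]   # default values from the signature
rel_res = [-1, 0, 0, 1]
quads = quadruplets_per_residue(residues, at_names, rel_res)
```

Only the middle residue has both a preceding and a following neighbor, so it is the only one that yields a complete quadruplet.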
- list_depth(a_list)
- read_nucleicacid_backbone_torsions(pdb, xtc, selection='all', first_frame=0, last_frame=None, step=1, naming='segindex', radians=False)
Load nucleic acid backbone torsions.
ALPHA (α): O3’(i-1)-P(i)-O5’(i)-C5’(i)
BETA (β): P(i)-O5’(i)-C5’(i)-C4’(i)
GAMMA (γ): O5’(i)-C5’(i)-C4’(i)-C3’(i)
DELTA (δ): C5’(i)-C4’(i)-C3’(i)-O3’(i)
EPSILON (ε): C4’(i)-C3’(i)-O3’(i)-P(i + 1)
ZETA (ζ): C3’(i)-O3’(i)-P(i + 1)-O5’(i + 1)
CHI (χ): O4’(i)-C1’(i)-N9(i)-C4(i) for purines or O4’(i)-C1’(i)-N1(i)-C2(i) for pyrimidines
- Parameters:
pdb (str) – File name for the reference file (PDB or GRO format).
xtc (str) – File name for the trajectory (xtc format).
selection (str, default='all') – MDAnalysis selection string to choose the atoms for the torsions.
first_frame (int, default=0) – First frame to return of the features. Zero-based.
last_frame (int, default=None) – Last frame to return of the features. Zero-based.
step (int, default=1) – Subsampling step width when reading the frames.
naming (str, default='segindex') – Naming scheme for each atom in the feature names. Options: plain (neither chain nor segment ID included); chainid (include chain ID, only works if chains are defined); segid (include segment ID, only works if segments are defined); segindex (include segment index, only works if segments are defined).
radians (bool, default=False) – Return torsions in radians instead of degrees.
- Returns:
feature_names (list of str) – Generic names of all torsions
features_data (numpy array) – Data for all torsions [degrees, or radians if radians=True]
- read_nucleicacid_pseudotorsions(pdb, xtc, selection='all', first_frame=0, last_frame=None, step=1, naming='segindex', radians=False)
Load nucleic acid pseudotorsions.
ETA (η): C4’(i-1)-P(i)-C4’(i)-P(i + 1)
THETA (θ): P(i)-C4’(i)-P(i + 1)-C4’(i + 1)
- Parameters:
pdb (str) – File name for the reference file (PDB or GRO format).
xtc (str) – File name for the trajectory (xtc format).
selection (str, default='all') – MDAnalysis selection string to choose the atoms for the torsions.
first_frame (int, default=0) – First frame to return of the features. Zero-based.
last_frame (int, default=None) – Last frame to return of the features. Zero-based.
step (int, default=1) – Subsampling step width when reading the frames.
naming (str, default='segindex') – Naming scheme for each atom in the feature names. Options: plain (neither chain nor segment ID included); chainid (include chain ID, only works if chains are defined); segid (include segment ID, only works if segments are defined); segindex (include segment index, only works if segments are defined).
radians (bool, default=False) – Return torsions in radians instead of degrees.
- Returns:
feature_names (list of str) – Generic names of all torsions
features_data (numpy array) – Data for all torsions [degrees, or radians if radians=True]
- read_protein_backbone_torsions(pdb, xtc, selection='all', first_frame=0, last_frame=None, step=1, naming='segindex', radians=False, include_omega=False)
Load protein backbone torsions.
PHI (φ): C(i-1)-N(i)-CA(i)-C(i)
PSI (ψ): N(i)-CA(i)-C(i)-N(i + 1)
OMEGA (ω): CA(i)-C(i)-N(i + 1)-CA(i + 1)
- Parameters:
pdb (str) – File name for the reference file (PDB or GRO format).
xtc (str) – File name for the trajectory (xtc format).
selection (str, default='all') – MDAnalysis selection string to choose the atoms for the torsions.
first_frame (int, default=0) – First frame to return of the features. Zero-based.
last_frame (int, default=None) – Last frame to return of the features. Zero-based.
step (int, default=1) – Subsampling step width when reading the frames.
naming (str, default='segindex') – Naming scheme for each atom in the feature names. Options: plain (neither chain nor segment ID included); chainid (include chain ID, only works if chains are defined); segid (include segment ID, only works if segments are defined); segindex (include segment index, only works if segments are defined).
radians (bool, default=False) – Return torsions in radians instead of degrees.
include_omega (bool, default=False) – Also load the OMEGA (ω) torsions.
- Returns:
feature_names (list of str) – Generic names of all torsions
features_data (numpy array) – Data for all torsions [degrees, or radians if radians=True]
- read_protein_sidechain_torsions(pdb, xtc, selection='all', first_frame=0, last_frame=None, step=1, naming='segindex', radians=False)
Load protein sidechain torsions.
- Parameters:
pdb (str) – File name for the reference file (PDB or GRO format).
xtc (str) – File name for the trajectory (xtc format).
selection (str, default='all') – MDAnalysis selection string to choose the atoms for the torsions.
first_frame (int, default=0) – First frame to return of the features. Zero-based.
last_frame (int, default=None) – Last frame to return of the features. Zero-based.
step (int, default=1) – Subsampling step width when reading the frames.
naming (str, default='segindex') – Naming scheme for each atom in the feature names. Options: plain (neither chain nor segment ID included); chainid (include chain ID, only works if chains are defined); segid (include segment ID, only works if segments are defined); segindex (include segment index, only works if segments are defined).
radians (bool, default=False) – Return torsions in radians instead of degrees.
- Returns:
feature_names (list of str) – Generic names of all torsions
features_data (numpy array) – Data for all torsions [degrees, or radians if radians=True]
- read_torsions(pdb, xtc, sel=[[0, 1, 2, 3], [1, 2, 3, 4]], first_frame=0, last_frame=None, step=1, naming=None)
Load torsions defined by quadruplets of atom indices.
- Parameters:
pdb (str) – File name for the reference file (PDB or GRO format).
xtc (str) – File name for the trajectory (xtc format).
sel (list, default=[[0, 1, 2, 3], [1, 2, 3, 4]]) – List of quadruplets with selection indices to choose atoms for the torsions.
first_frame (int, default=0) – First frame to return of the features. Zero-based.
last_frame (int, default=None) – Last frame to return of the features. Zero-based.
step (int, default=1) – Subsampling step width when reading the frames.
naming (str, default=None) – Naming scheme for each atom in the feature names. Options: plain (neither chain nor segment ID included); chainid (include chain ID, only works if chains are defined); segid (include segment ID, only works if segments are defined); segindex (include segment index, only works if segments are defined).
- Returns:
feature_names (list of str) – Generic names of all torsions
features_data (numpy array) – Data for all torsions
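For readers unfamiliar with how a torsion follows from an index quadruplet, the sketch below computes the signed dihedral angle of four points with the usual atan2 formula and applies it to each quadruplet in sel. It is a standalone illustration on coordinate tuples, not the code PENSA runs; the helper names and the toy coordinates are hypothetical.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3, radians=False):
    """Signed torsion angle defined by four points."""
    b0, b1, b2 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b0, b1), _cross(b1, b2)  # normals of the two planes
    x = _dot(n1, n2)
    y = _dot(_cross(n1, n2), b1) / math.sqrt(_dot(b1, b1))
    angle = math.atan2(y, x)
    return angle if radians else math.degrees(angle)


def torsions_from_quadruplets(frames, sel, radians=False):
    """One row per frame, one column per index quadruplet in sel."""
    return [[dihedral(*(pos[i] for i in quad), radians=radians) for quad in sel]
            for pos in frames]


# One frame with five atoms (made-up coordinates) and the two quadruplets
# that appear as the default value of sel in the signature above.
frame = [(0.0, 1.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0),
         (1.0, 0.0, 1.0), (2.0, 0.0, 1.0)]
angles = torsions_from_quadruplets([frame], [[0, 1, 2, 3], [1, 2, 3, 4]])
```

The first quadruplet describes a quarter turn and the second a trans arrangement, so the two columns come out at 90 degrees and at a magnitude of 180 degrees.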
pensa.features.processing
- correct_angle_periodicity(angle)
Correcting for the periodicity of angles [radians].
- Parameters:
angle (list) – Univariate data for an angle feature.
- Returns:
new_angle – Periodically corrected angle feature.
- Return type:
list
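The motivation for a periodicity correction is that a distribution centered near +/- pi is split into two artificial peaks. The sketch below uses a simple largest-gap heuristic to move the periodic boundary into the emptiest region of the data; this is a swapped-in technique chosen for brevity and is not claimed to be the procedure PENSA implements. The function name is hypothetical, and angles are assumed to lie in [-pi, pi].

```python
import math


def unwrap_at_largest_gap(angles):
    """Shift angles by 2*pi so that the data are not split at +/- pi."""
    if len(angles) < 2:
        return list(angles)
    two_pi = 2 * math.pi
    ordered = sorted(angles)
    # Start with the gap that wraps around the periodic boundary.
    best_gap = ordered[0] + two_pi - ordered[-1]
    cut = None  # None means the boundary is already in the largest gap
    for left, right in zip(ordered, ordered[1:]):
        if right - left > best_gap:
            best_gap = right - left
            cut = (left + right) / 2
    if cut is None:
        return list(angles)
    # Everything below the cut is moved up by one full period.
    return [a + two_pi if a < cut else a for a in angles]


split = [3.0, 3.1, -3.1, -3.0]   # one peak, split by the boundary
fixed = unwrap_at_largest_gap(split)
compact = unwrap_at_largest_gap([0.1, 0.2, -0.1])  # nothing to correct
```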
- correct_spher_angle_periodicity(two_angles)
Correcting for the periodicity of spherical angles [radians]. Waters featurized using PENSA and including discrete occupancy are handled.
- Parameters:
two_angles (list of psi and theta angles) – Bivariate data for spherical angles of water molecule.
- Returns:
new_angle – Periodically corrected angle feature.
- Return type:
list of psi and theta angles
- get_common_features_data(features_a, features_b, data_a, data_b)
Finds common features and corresponding data from two trajectories.
- Parameters:
features_a (list of str) – First set of features.
features_b (list of str) – Second set of features.
data_a (float array) – Data from first trajectory.
data_b (float array) – Data from second trajectory.
- Returns:
new_features_a, new_features_b (np array of str) – Common features between the two trajectories.
new_data_a, new_data_b (float array) – Data corresponding to common features between the two trajectories.
- get_feature_data(feat, data, feature_name)
Returns the timeseries of one particular feature.
- Parameters:
feat (list of str) – List with all feature names.
data (float array) – Feature values data from the simulation.
feature_name (str) – Name of the selected feature.
- Returns:
timeseries – Value of the feature for each frame.
- Return type:
float array
- get_feature_subset(feat, data, selection)
Returns a subset of selected features. Does not check whether the selected features are actually present in the input.
- Parameters:
feat (list of str) – List with all feature names.
data (float array) – Feature values data from the simulation.
selection (list of str) – Names of the selected features.
- Returns:
sub_feat (list of str) – List with all feature names of the subset.
sub_data (float array) – Feature values data of the subset.
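A subset selection of this kind amounts to picking columns by name. The sketch below works on nested lists and silently ignores requested names that are not present, which is one way to read the remark that the input is not checked; the function name feature_subset and that reading are assumptions.

```python
def feature_subset(feat, data, selection):
    """Keep the columns of data whose feature name is in selection."""
    wanted = set(selection)
    # Column order follows feat, not the order given in selection.
    columns = [i for i, name in enumerate(feat) if name in wanted]
    sub_feat = [feat[i] for i in columns]
    sub_data = [[row[i] for i in columns] for row in data]
    return sub_feat, sub_data


feat = ["a", "b", "c"]
data = [[1.0, 2.0, 3.0],
        [4.0, 5.0, 6.0]]   # two frames, three features
sub_feat, sub_data = feature_subset(feat, data, ["c", "a", "not-there"])
```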
- get_feature_timeseries(feat, data, feature_type, feature_name)
Returns the timeseries of one particular feature from a set with several feature types.
- Parameters:
feat (list of str) – List with all feature names.
data (float array) – Feature values data from the simulation.
feature_type (str) – Type of the selected feature (‘bb-torsions’, ‘bb-distances’, ‘sc-torsions’).
feature_name (str) – Name of the selected feature.
- Returns:
timeseries – Value of the feature for each frame.
- Return type:
float array
- get_multivar_res(feat, data)
Groups each timeseries of all features for one particular residue.
- Parameters:
feat (list of str) – List with all feature names.
data (float array) – Feature values data from the simulation.
- Returns:
sorted_names (list of str) – Names of all features
new_data (numpy array) – Data for all features
- get_multivar_res_timeseries(feat, data, feature_type, write=None, out_name=None)
Returns the timeseries of one particular feature.
- Parameters:
feat (list of str) – List with all feature names.
data (float array) – Feature values data from the simulation.
feature_type (str) – Type of the selected feature (‘bb-torsions’, ‘bb-distances’, ‘sc-torsions’).
write (bool, optional) – If true, write out the data into a directory titled with the feature_type str. The default is None.
out_name (str, optional) – Prefix for the written data. The default is None.
- Returns:
feature_names (list of str) – Names of all features
features_data (numpy array) – Data for all features
- match_sim_lengths(sim1, sim2)
Make two lists the same length by truncating the longer list to match.
- Parameters:
sim1 (list) – A one dimensional distribution of a specific feature.
sim2 (list) – A one dimensional distribution of a specific feature.
- Returns:
sim1 (list) – A one dimensional distribution of a specific feature.
sim2 (list) – A one dimensional distribution of a specific feature.
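Truncating to the shorter length is a one-line operation on lists, sketched here with a hypothetical helper name. Note that the frames at the end of the longer simulation are discarded.

```python
def truncate_to_shorter(sim1, sim2):
    """Cut both lists to the length of the shorter one."""
    n = min(len(sim1), len(sim2))
    return sim1[:n], sim2[:n]


short, long_cut = truncate_to_shorter([0.1, 0.2, 0.3], [1.0, 1.1, 1.2, 1.3, 1.4])
```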
- select_common_features(features_a, features_b, boolean=True)
Finds features in common between two trajectories.
- Parameters:
features_a (list of str) – First set of features.
features_b (list of str) – Second set of features.
boolean (bool) – Determines if returned array contains booleans or features.
- Returns:
common_a (np array of bool or str) – Common features taken from features_a.
common_b (np array of bool or str) – Common features taken from features_b.
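The two return modes controlled by boolean can be sketched as follows: either a mask per input list, or the matching names themselves. The version below returns plain lists instead of numpy arrays, and the function name common_features is hypothetical.

```python
def common_features(features_a, features_b, boolean=True):
    """Masks (or names) of the features that occur in both lists."""
    set_a, set_b = set(features_a), set(features_b)
    mask_a = [name in set_b for name in features_a]
    mask_b = [name in set_a for name in features_b]
    if boolean:
        return mask_a, mask_b
    common_a = [name for name, keep in zip(features_a, mask_a) if keep]
    common_b = [name for name, keep in zip(features_b, mask_b) if keep]
    return common_a, common_b


features_a = ["phi 10", "psi 10", "phi 11"]   # made-up feature names
features_b = ["psi 10", "phi 11", "psi 11"]
mask_a, mask_b = common_features(features_a, features_b)
names_a, names_b = common_features(features_a, features_b, boolean=False)
```

A boolean mask is convenient when the same selection has to be applied to the columns of the corresponding data array afterwards.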
- sort_distances_by_resnum(dist, data)
Sort distance features by the residue number.
- Parameters:
dist (list of str) – The list of distance features.
- Returns:
new_dist – The sorted list of distance features.
- Return type:
list of str
- sort_features(names, sortby)
Sorts features by a list of values.
- Parameters:
names (str array) – Array of feature names.
sortby (float array) – Array of the values to sort the names by.
- Returns:
sort – Array of sorted tuples with feature and value.
- Return type:
array of tuples [str, float]
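Sorting names by an accompanying value is typically used to rank features, for example by a relevance score. The sketch below returns (name, value) tuples in descending order; the sort direction is an assumption, since the description above does not state it, and the function name rank_features is hypothetical.

```python
def rank_features(names, sortby, descending=True):
    """Pair each name with its value and sort the pairs by value."""
    return sorted(zip(names, sortby), key=lambda pair: pair[1],
                  reverse=descending)


names = ["feat_a", "feat_b", "feat_c"]
scores = [0.2, 0.9, 0.5]            # made-up values to sort by
ranked = rank_features(names, scores)
```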
- sort_features_alphabetically(tors, data)
Sort torsion features alphabetically.
- Parameters:
tors (list of str) – The list of torsion features.
- Returns:
new_tors – The sorted list of torsion features.
- Return type:
list of str
- sort_sincos_torsions_by_resnum(tors, data)
Sort sin/cos of torsion features by the residue number.
- Parameters:
tors (list of str) – The list of torsion features.
- Returns:
new_tors – The sorted list of torsion features.
- Return type:
list of str
- sort_torsions_by_resnum(tors, data)
Sort torsion features by the residue number.
- Parameters:
tors (list of str) – The list of torsion features.
- Returns:
new_tors – The sorted list of torsion features.
- Return type:
list of str
- sort_traj_along_feature(feat, data, feature_name, ref_name, trj_name, out_name, start_frame=0, verbose=False)
Sort a trajectory along a feature.
- Parameters:
feat (list of str) – List with all feature names.
data (float array) – Feature values data from the simulation.
feature_name (str) – Name of the selected feature.
ref_name (string) – Reference topology for the trajectory.
trj_name (string) – Trajectory from which the frames are picked. Usually the same as the values are from.
out_name (string) – Name of the output files.
start_frame (int) – Offset of the data with respect to the trajectories.
- Returns:
d_sorted – Sorted data of the selected feature.
- Return type:
float array
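The core of sorting a trajectory along a feature is an argsort of the feature's time series, with start_frame accounting for an offset between data rows and trajectory frames. The sketch below only computes the frame order and the sorted values and does not write any trajectory file; the function name and the returned frame list are hypothetical.

```python
def frame_order_along_feature(feat, data, feature_name, start_frame=0):
    """Trajectory frame numbers ordered by increasing feature value."""
    column = feat.index(feature_name)
    values = [row[column] for row in data]
    order = sorted(range(len(values)), key=values.__getitem__)
    # Data row i corresponds to trajectory frame start_frame + i.
    frames = [start_frame + i for i in order]
    d_sorted = [values[i] for i in order]
    return frames, d_sorted


feat = ["dist_1", "dist_2"]
data = [[0.3, 5.0], [0.1, 6.0], [0.2, 4.0]]   # three frames, made-up values
frames, d_sorted = frame_order_along_feature(feat, data, "dist_1", start_frame=10)
```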
pensa.features.water_features
Methods to obtain a timeseries distribution for the water pockets, which represents a combination of the water occupancy (binary variable) and the water polarisation (continuous variable).
Water pockets are defined as spheres of radius 3.5 Å (based on hydrogen bond lengths) centered on the probability density maxima of waters. If two water molecules ever occupy the same pocket at the same time, the water that occupies the pocket most often is used to obtain the polarisation.
The methods here are based on the following paper:
Neil J. Thomson, Owen N. Vickery, Callum M. Ives, Ulrich Zachariae: Ion-water coupling controls class A GPCR signal transduction pathways.
- read_water_features(structure_input, xtc_input, atomgroup, top_waters=10, grid_input=None, write_grid_as=None, out_name=None)
Featurize water pockets for the top X most probable waters (top_waters).
- Parameters:
structure_input (str) – File name for the reference file (PDB or GRO format).
xtc_input (str) – File name for the trajectory (xtc format).
atomgroup (str) – Atomgroup selection to calculate the density for (atom name in structure_input).
top_waters (int, optional) – Number of waters to featurize. The default is 10.
grid_input (str, optional) – File name for the density grid input. The default is None, and a grid is automatically generated.
write_grid_as (str, optional) – If you choose to write out the grid, you must specify the water model to convert the density into. The default is None. Options are suggested if default.
out_name (str, optional) – Prefix for all written filenames. The default is None. If not None, the following data will be written out: reference pdb with occupancies, water distributions, water data summary.
- Returns:
feature_names (list of str) – Names of all features
features_data (numpy array) – Data for all features
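The rule for doubly occupied water pockets described at the top of this module can be sketched as follows: count how often each water is found in the pocket over the whole trajectory, then, in each frame, pick the most frequent of the waters present. The input layout (a list of water IDs per frame), the function name, and the use of None for empty frames are assumptions made for illustration.

```python
from collections import Counter


def pick_water_per_frame(waters_in_pocket):
    """Choose one water per frame, preferring the most frequent occupant."""
    counts = Counter(w for frame in waters_in_pocket for w in frame)
    chosen = []
    for frame in waters_in_pocket:
        if not frame:
            chosen.append(None)  # pocket is empty in this frame
        else:
            # With equal counts, max() keeps the first water in the list.
            chosen.append(max(frame, key=lambda w: counts[w]))
    return chosen


# Water IDs found inside one pocket in five frames (made-up IDs).
waters_in_pocket = [[7], [7, 12], [12], [], [7]]
chosen = pick_water_per_frame(waters_in_pocket)
occupancy = [0 if w is None else 1 for w in chosen]
```

The polarisation would then be evaluated for the chosen water in each occupied frame, while the occupancy series records whether the pocket held a water at all.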