pensa.preprocessing

pensa.preprocessing.coordinates

align_coordinates(top, pdb, trj_list, out_name, sel_string='all', start_frame=0)

Aligns selected coordinates from a trajectory file.

Parameters:

top (str) – File name for the topology. Can read all MDAnalysis-compatible topology formats.
pdb (str) – File name for the reference PDB file.
trj_list (list of str) – File names for the input trajectory. Can read all MDAnalysis-compatible trajectory formats.
out_name (str) – Core of the file names for the output files.
sel_string (str) – Selection string in MDAnalysis format. Defines on which atoms to align.
start_frame (int, optional) – First frame to read from the trajectory.

extract_coordinates(top, pdb, trj_list, out_name, sel_string, start_frame=0, rename_segments=None, residues_offset=0)

Extracts selected coordinates from a trajectory file.

Parameters:

top (str) – File name for topology. Can read all MDAnalysis-compatible topology formats.
pdb (str) – File name for the reference PDB file.
trj_list (str or list of str) – File name(s) for the input trajectory. Can read all MDAnalysis-compatible trajectory formats.
out_name (str) – Core of the file names for the output files.
sel_string (str) – Selection string in MDAnalysis format. Defines which atoms to extract.
start_frame (int, optional) – First frame to read from the trajectory.

extract_coordinates_combined(top, trj, sel_string, out_name, start_frame=0, verbose=False)

Extracts selected coordinates from several trajectory files.

Parameters:

top (list of str) – File names for the topologies. Can read all MDAnalysis-compatible topology formats.
trj (list of str) – File names for the input trajectories. Can read all MDAnalysis-compatible trajectory formats.
sel_string (str) – Selection string in MDAnalysis format. Defines which atoms to extract.
out_name (str) – Core of the file names for the output files.
start_frame (int, optional) – First frame to read from the trajectory.

merge_and_sort_coordinates(values, top_names, trj_names, out_name, start_frame=0, verbose=False)

Write multiple trajectories of coordinate frames into one trajectory, sorted along corresponding values.

Parameters:

values (list of float arrays) – Values along which to sort the trajectory.
top_names (list of str) – Topologies for the trajectories.
trj_names (list of str) – Trajectories from which the frames are picked. Usually the same ones the values come from.
out_name (str) – Name of the output trajectory (usual format is .xtc).
start_frame (int or list of int) – Offsets of the data with respect to the trajectories. Defaults to zero.

Returns:

data_sort (float array) – Sorted values of the input data.
sort_idx (float array) – Sorted indices of the values.
oidx_sort (float array) – Sorted indices of the trajectory.
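
The bookkeeping behind merging and sorting can be sketched in plain Python. This is a minimal illustration, not PENSA's implementation: the helper name merge_and_sort_indices is hypothetical, an integer start_frame is assumed to apply to every trajectory, and the sketch returns an explicit trajectory index per frame (which the documented return values do not list) to make the origin of each frame visible.

```python
def merge_and_sort_indices(values, start_frame=0):
    """Hypothetical helper: order frames from several trajectories by value.

    values: list of sequences of floats, one sequence per trajectory.
    start_frame: int (same offset for all) or list of int (one per trajectory).
    Returns the sorted values plus, for each sorted value, the index of the
    trajectory it came from and its frame index (offset included).
    """
    # Assumption: an int offset is broadcast to all trajectories.
    if isinstance(start_frame, int):
        start_frame = [start_frame] * len(values)
    if len(start_frame) != len(values):
        raise ValueError("need one start_frame per trajectory")

    # Tag every value with its origin: (value, trajectory index, frame index).
    tagged = [
        (val, ti, fi + start_frame[ti])
        for ti, traj_values in enumerate(values)
        for fi, val in enumerate(traj_values)
    ]
    # list.sort is stable, so equal values keep their original order.
    tagged.sort(key=lambda item: item[0])

    data_sort = [item[0] for item in tagged]
    traj_idx = [item[1] for item in tagged]
    frame_idx = [item[2] for item in tagged]
    return data_sort, traj_idx, frame_idx


data_sort, traj_idx, frame_idx = merge_and_sort_indices(
    [[0.3, 0.1], [0.2, 0.4]], start_frame=[0, 10]
)
print(data_sort)   # [0.1, 0.2, 0.3, 0.4]
print(traj_idx)    # [0, 1, 0, 1]
print(frame_idx)   # [1, 10, 0, 11]
```

Writing the merged trajectory would then amount to visiting the (trajectory, frame) pairs in this order.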

merge_coordinates(top_files, trj_files, out_name, segid=None)

Merge several trajectories of the same system or system part. All trajectories must be (at least) as long as the first one.

Parameters:

top_files (str[]) – List of input topology files.
trj_files (str[]) – List of input trajectory files.
out_name (str) – Name of the output files (without ending).
segid (str, optional) – Value to overwrite the segment ID. Defaults to None.

Returns:

univ – MDAnalysis universe of the merged system.

Return type:

obj

sort_coordinates(values, top_name, trj_name, out_name, start_frame=0, verbose=False)

Sort coordinate frames along corresponding values.

Parameters:

values (float array) – Values along which to sort the trajectory.
top_name (str) – Topology for the trajectory.
trj_name (str) – Trajectory from which the frames are picked. Usually the same one the values come from.
out_name (str) – Name of the output trajectory (usual format is .xtc).
start_frame (int) – Offset of the data with respect to the trajectories.

Returns:

data_sort (float array) – Sorted values of the input data.
sort_idx (float array) – Sorted indices of the values.
oidx_sort (float array) – Sorted indices of the trajectory.
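
How the three return values relate to each other can be illustrated without any trajectory files. The sketch below is an assumption-laden illustration, not the library code: the helper name sort_indices is hypothetical, and it assumes that values[0] corresponds to trajectory frame start_frame, so that the trajectory indices are the value indices shifted by that offset.

```python
def sort_indices(values, start_frame=0):
    """Hypothetical helper mirroring the three documented return values."""
    # Positions in `values`, ordered by ascending value (an argsort).
    sort_idx = sorted(range(len(values)), key=lambda i: values[i])
    # The values themselves in ascending order.
    data_sort = [values[i] for i in sort_idx]
    # Assumption: values[0] belongs to trajectory frame `start_frame`,
    # so trajectory indices are the value indices shifted by the offset.
    oidx_sort = [i + start_frame for i in sort_idx]
    return data_sort, sort_idx, oidx_sort


data_sort, sort_idx, oidx_sort = sort_indices([0.5, -1.0, 2.0, 0.0], start_frame=100)
print(data_sort)  # [-1.0, 0.0, 0.5, 2.0]
print(sort_idx)   # [1, 3, 0, 2]
print(oidx_sort)  # [101, 103, 100, 102]
```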

pensa.preprocessing.density

Methods to obtain a distribution for the water pockets that represents a combination of the water occupancy (binary variable) and the water polarisation (continuous variable).

A water molecule counts as occupying a water pocket if its oxygen atom lies within the pocket. If two water molecules ever occupy the same pocket at the same time, the polarisation of the molecule ID that occupies the pocket most often is used.
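
The occupancy rule can be made concrete with a small sketch. It is an illustration under stated assumptions rather than the module's code: the helper name pocket_distribution, the per-frame dictionary layout, and the placeholder value 10000.0 are all hypothetical, and "most often" is interpreted as the highest occupancy count over the whole trajectory.

```python
from collections import Counter


def pocket_distribution(frames, unocc_no):
    """Hypothetical helper: one polarisation value per frame for one pocket.

    frames: list with one dict per frame, mapping the ID of each water whose
            oxygen lies in the pocket to that water's polarisation value.
    unocc_no: placeholder written for frames in which the pocket is empty.
    """
    # Count in how many frames each water ID occupies the pocket.
    counts = Counter(water_id for frame in frames for water_id in frame)

    distr = []
    for frame in frames:
        if not frame:
            # Occupancy is binary: an empty pocket gets the placeholder.
            distr.append(unocc_no)
        else:
            # Assumption: with several waters present, pick the one that
            # occupies the pocket most often over the whole trajectory.
            # Ties fall to the first ID listed in the frame.
            chosen = max(frame, key=lambda water_id: counts[water_id])
            distr.append(frame[chosen])
    return distr


frames = [
    {7: 1.2},            # only water 7 in the pocket
    {},                  # pocket empty
    {7: 0.9, 42: 2.5},   # two waters at once; water 7 is the more frequent one
    {7: 1.0},
]
result = pocket_distribution(frames, unocc_no=10000.0)
print(result)  # [1.2, 10000.0, 0.9, 1.0]
```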

The methods here are based on the following paper:

Neil J. Thomson, Owen N. Vickery, Callum M. Ives, Ulrich Zachariae:

Ion-water coupling controls class A GPCR signal transduction pathways.

https://doi.org/10.1101/2020.08.28.271510

convert_to_occ(distr, unocc_no, water=True)

Convert a distribution of pocket angles and occupancies into just occupancies.

Parameters:

distr (list) – Distribution to convert.
unocc_no (float) – Value that represents unoccupied in the distribution.

Returns:

occ – Distribution representing pocket occupancy.

Return type:

list
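
As a rough illustration of the conversion, the sketch below maps every entry equal to the placeholder to 0 and everything else to 1. The helper name to_occupancy is hypothetical, entries are assumed to be scalars compared by exact equality, and the water flag is left out because its effect is not described here.

```python
def to_occupancy(distr, unocc_no):
    """Hypothetical helper: 1 where the pocket is occupied, 0 where it is not."""
    # Any entry equal to the placeholder counts as unoccupied.
    return [0 if value == unocc_no else 1 for value in distr]


occ = to_occupancy([1.2, 10000.0, 0.9, 10000.0], unocc_no=10000.0)
print(occ)  # [1, 0, 1, 0]
```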

data_out(filename, data)

Writes out lists of data.

Parameters:

filename (str) – Name for the written file.
data (list of lists) – Data to be written out.

dens_grid_pdb(structure_input, xtc_input, atomgroup, top_atoms=35, grid_input=None, write=False, write_grid_as=None, out_name=None)

Write out water pockets for the top X most probable atoms (top_atoms).

Parameters:

structure_input (str) – File name for the reference file (PDB or GRO format).
xtc_input (str) – File name for the trajectory (xtc format).
atomgroup (str) – Atomgroup selection to calculate the density for (atom name in structure_input).
top_atoms (int, optional) – Number of atoms to featurize. The default is 35.
grid_input (str, optional) – File name for the density grid input. The default is None, and a grid is automatically generated.
write (bool, optional) – If True, a reference pdb will be written out. The default is False.
write_grid_as (str, optional) – If you choose to write out the grid, you must specify the water model to convert the density into. The default is None, in which case the available options are suggested.
out_name (str, optional) – Prefix for all written filenames. The default is None.

Returns:

feature_names (list of str) – Names of all features.
features_data (numpy array) – Data for all features.

extract_aligned_coordinates(struc_a, xtc_a, struc_b, xtc_b, xtc_aligned=None, pdb_outname='alignment_ref.pdb')

Aligns a trajectory (a) on the average structure of another one (b).

Parameters:

struc_a (str) – File name for the reference file (PDB or GRO format).
xtc_a (str) – File name for the trajectory (xtc format).
struc_b (str) – File name for the reference file of the trajectory to be aligned to (PDB or GRO format).
xtc_b (str) – File name for the trajectory to be aligned to (xtc format).
xtc_aligned (str, default=None) – File name for the aligned trajectory. If none, it will be constructed from the
pdb_outname (str, default='alignment_ref.pdb') – File name for the average structure of the trajectory to be aligned to (PDB format).

extract_combined_grid(struc_a, xtc_a, struc_b, xtc_b, atomgroup, write_grid_as, out_name, prot_prox=True, use_memmap=False, memmap='combined_traj.mymemmap')

Writes out combined atomgroup density for both input simulations.

Parameters:

struc_a (str) – File name for the reference file (PDB or GRO format).
xtc_a (str) – File name for the trajectory (xtc format).
struc_b (str) – File name for the reference file (PDB or GRO format).
xtc_b (str) – File name for the trajectory (xtc format).
atomgroup (str) – Atomgroup selection to calculate the density for (atom name in structure_input).
write_grid_as (str) – The water model to convert the density into. Options are: SPC, TIP3P, TIP4P, water.
out_name (str) – Prefix for written filename.
prot_prox (bool, optional) – Select only waters within 3.5 Angstroms of the protein. The default is True.
use_memmap (bool, optional) – Uses numpy memmap to write out a pseudo-trajectory coordinate array. This is used for large trajectories to avoid memory errors with large python arrays. The default is False.
memmap (str, default='combined_traj.mymemmap') – The numpy memmap file for the combined pseudo-trajectory.

generate_grid(u, atomgroup, write_grid_as=None, out_name=None, prot_prox=True)

Obtain the grid for atomgroup density.

Parameters:

u (MDAnalysis universe) – Universe to obtain density grid.
atomgroup (str) – Atomgroup selection to calculate the density for (atom name in structure_input).
write_grid_as (str, optional) – If you choose to write out the grid, you must specify the water model to convert the density into. The default is None.
out_name (str, optional) – Prefix for all written filenames. The default is None.
prot_prox (bool, optional) – Select only waters within 3.5 Angstroms of the protein. The default is True.

Returns:

g – Density grid.

Return type:

grid

local_maxima_3D(data, order=1)

Detects local maxima in a 3D array to obtain coordinates for density maxima.

Parameters:

data (3d ndarray) – 3D array in which to detect local maxima.
order (int) – How many points on each side to use for the comparison.

Returns:

coords (ndarray) – Coordinates of the local maxima.
values (ndarray) – Values of the local maxima.
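
The idea of an order-based neighbourhood comparison can be sketched in pure Python. This is not the library's implementation (an array-based version would be far faster on real density grids): the function name local_maxima_3d_sketch is hypothetical, a point counts as a maximum only if it is strictly greater than every neighbour within order points along each axis, and neighbours outside the array are assumed to be ignored.

```python
from itertools import product


def local_maxima_3d_sketch(data, order=1):
    """Hypothetical pure-Python sketch of local-maximum detection.

    data: 3D nested list of numbers, indexed data[x][y][z].
    order: number of neighbouring points checked on each side along each axis.
    Returns the coordinates and values of all strict local maxima.
    """
    nx, ny, nz = len(data), len(data[0]), len(data[0][0])
    # All offsets inside the (2*order+1)^3 cube, except the centre itself.
    offsets = [
        off for off in product(range(-order, order + 1), repeat=3)
        if off != (0, 0, 0)
    ]
    coords, values = [], []
    for x, y, z in product(range(nx), range(ny), range(nz)):
        centre = data[x][y][z]
        is_max = True
        for dx, dy, dz in offsets:
            i, j, k = x + dx, y + dy, z + dz
            # Assumption: neighbours outside the array are simply ignored.
            if 0 <= i < nx and 0 <= j < ny and 0 <= k < nz:
                if data[i][j][k] >= centre:
                    is_max = False
                    break
        if is_max:
            coords.append((x, y, z))
            values.append(centre)
    return coords, values


# 3 x 3 x 3 grid of zeros with a single peak in the centre.
grid = [[[0.0] * 3 for _ in range(3)] for _ in range(3)]
grid[1][1][1] = 5.0
peaks = local_maxima_3d_sketch(grid)
print(peaks)  # ([(1, 1, 1)], [5.0])
```

A larger order widens the comparison window, so smaller peaks close to a larger one are no longer reported.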

write_atom_to_pdb(pdb_outname, atom_location, atom_ID, atomgroup)

Write a new atom to a reference structure to visualise conserved non-protein atom sites.

Parameters:

pdb_outname (str) – Filename of reference structure.
atom_location (array) – (x, y, z) coordinates of the atom location with respect to the reference structure.
atom_ID (str) – A unique ID for the atom.
atomgroup (str) – MDAnalysis atomgroup to describe the atom.

pensa.preprocessing.download

download_from_gpcrmd(filename, folder)

Downloads a file from GPCRmd.

Parameters:

filename (str) – Name of the file to download. Must be a file that is in GPCRmd.
folder (str) – Target directory. The directory is created if it does not exist.

get_transmem_from_uniprot(uniprot_id)

Retrieves transmembrane regions from UniProt (first and last residue each). This function requires internet access.

Parameters:

uniprot_id (str) – The UniProt ID of the protein.

Returns:

tm – List of all transmembrane regions, represented as tuples with first and last residue ID.

Return type:

list

pensa.preprocessing.selection

load_selection(sel_file, sel_base='')

Loads a selection from a selection file.

Parameters:

sel_file (str) – Name of the file with selections. Must contain two numbers on each line (first and last residue of this part).
sel_base (str) – The basis string for the selection. Defaults to an empty string.

Returns:

sel_string – A selection string that provides the residue numbers for MDAnalysis.

Return type:

str

range_to_string(a, b)

Provides a string with all integers in between two numbers.

Parameters:

a (int) – First number.
b (int) – Last number.

Returns:

string – String containing all int numbers from a to b.

Return type:

str
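
How the two selection helpers fit together can be sketched as follows. These are stand-ins written for illustration, not the library functions: the names range_to_string_sketch and load_selection_sketch are hypothetical, the range is assumed to include both end points and to be space-separated, and sel_base is assumed to be prepended verbatim before the MDAnalysis resid keyword.

```python
import os
import tempfile


def range_to_string_sketch(a, b):
    """Sketch: all integers from a to b, assumed inclusive, space-separated."""
    return " ".join(str(i) for i in range(a, b + 1))


def load_selection_sketch(sel_file, sel_base=""):
    """Sketch: build a 'resid ...' selection from first/last residue pairs."""
    parts = []
    with open(sel_file) as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            first, last = (int(x) for x in line.split())
            parts.append(range_to_string_sketch(first, last))
    # Assumption: sel_base is prepended verbatim before the 'resid' keyword.
    return sel_base + "resid " + " ".join(parts)


# Write a small selection file with two residue ranges and read it back.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("5 8\n20 22\n")
sel_string = load_selection_sketch(tmp.name, sel_base="protein and ")
os.remove(tmp.name)
print(sel_string)  # protein and resid 5 6 7 8 20 21 22
```

A string built this way could be passed as sel_string to the coordinate extraction functions documented above.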