Command-line p.b.s. calculator

General description

The command-line version of the p.b.s. calculator provides a CCP4-like interface. It reads the pdb files and processes them according to the keywords provided. The ultimate goal is to offer the same level of functionality as ProteinRanger in a convenient standalone form.

There are two ways to supply keywords and inputs to the program.
1. Inputs/keywords are given on the command line and/or in a script file.
2. Just execute the program and enter keywords manually. The calculation will not start until you enter the "END" keyword.

Keywords comply with the 4-letter standard and are not case sensitive (e.g. you can use FITM, fitm, FITMODE or even FitMode - the program will treat them all the same).
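
As a rough illustration of this rule, the sketch below reduces a keyword to its first four upper-cased characters. The function name normalize_keyword is hypothetical and is not taken from pbscalc; the sketch only assumes that, as in CCP4 programs, the first four characters are the significant ones.

```python
def normalize_keyword(word):
    """Reduce a keyword to a canonical 4-letter, upper-case form.

    Assumption: only the first four characters are significant.
    """
    return word.strip()[:4].upper()


# The four spellings mentioned in the text collapse to a single key.
spellings = ["FITM", "fitm", "FITMODE", "FitMode"]
print({normalize_keyword(w) for w in spellings})
```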

Installation

There is no installation per se yet; just download the tarball and run pbscalc from the command line as described below. You need python and scipy, and optionally superpose from CCP4 to run SSM alignment.

Input/output files

XYZIN1, XYZIN2

The input pdb files with structures to compare, in arbitrary order. These should be given on the command line, not in the script.

XYZOUT1, XYZOUT2 (optional)

The output pdb files will be identical to the input files except for the B-factor column, which will be reset to values controlled by the keyword BFACTORS.

Keywords

The allowed keywords are
MODE, FITM, CHRNAME, CHAINS, NBINS, NMAX, WCUT, OUTPUT, REFCYCLES, BFACTORS, BAVERAGE

MODE

Defines how the atoms will be matched for the p.b.s. calculation. Allowed values are
ALL (default) Match all the atoms based on atom ID
CALPHA Match all the C-alpha atoms based on residue ID. Residue types don't have to match.
PROTEIN Match all the protein atoms by atom ID.
PROTEIN_BACKBONE Match protein backbone atoms (CA, C, N, O).
BACKBONE This will match all the atoms with names matching those of protein and DNA/RNA backbone. This option will perform no residue name checking, so make sure you know what you are doing.
PROTEIN_SIDECHAINS Match protein side chain atoms (other than CA, C, N, O).
SIDECHAINS This will match all the atoms with names NOT matching those of protein and DNA/RNA backbone. This option will perform no residue name checking, so make sure you know what you are doing.
WATERS Waters will be matched based on their position, not resids. Specifically, after alignment, two waters are deemed to be the same water if they are shifted by less than 1 A.
CHAINS Atoms from different chains will be matched using the rest of atom IDs. The CHAINS keyword defines how chains are chosen.
PROTEIN_CHAINS Same as CHAINS, but making sure that only protein atoms are included.
SEQUENCE_ALIGNMENT Molecules will be pre-aligned and the match created using the secondary structure matching algorithm as implemented in SUPERPOSE. For this to work, you need to have CCP4 configured so that superpose can be invoked from the command line.
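
The position-based matching described for the WATERS mode might look roughly like the sketch below. It assumes the structures are already aligned and uses a greedy nearest-neighbour pairing in which each water is used at most once; the pairing strategy and the function name match_waters are assumptions made for illustration, and only the 1 A cutoff comes from the description above.

```python
import math


def match_waters(waters1, waters2, cutoff=1.0):
    """Pair waters from two aligned structures that lie closer than `cutoff`.

    Each water is an (x, y, z) tuple in angstroms. Returns a list of
    (index_in_waters1, index_in_waters2) pairs. Greedy pairing is an
    assumption of this sketch, not the documented pbscalc behaviour.
    """
    pairs = []
    used = set()
    for i, a in enumerate(waters1):
        best_j, best_d = None, cutoff
        for j, b in enumerate(waters2):
            if j in used:
                continue
            d = math.dist(a, b)
            # Strict comparison: the shift must be less than the cutoff.
            if d < best_d:
                best_j, best_d = j, d
        if best_j is not None:
            used.add(best_j)
            pairs.append((i, best_j))
    return pairs
```

Waters that have no partner within the cutoff are simply left out of the match.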



FITM

Defines what group of atoms will be used for structural alignment prior to the p.b.s. calculation. It takes the same list of options as MODE, and defaults to the MODE value if omitted. This keyword can be useful if the structural alignment is unstable when all the atoms are included. For instance, you may want to calculate the p.b.s. only for side chain atoms (MODE SIDECHAINS) but align molecules using C-alphas (thus FITM CALPHA). In practice it seems to have a surprisingly small effect on the calculated values, but it is crucial if you want to look at waters (which cannot be used for alignment, since a resid-based match fails unless the waters were renumbered).


CHRNAME NO|YES

Forces strict residue name matching. Defaults to NO, in which case atoms are matched based on their names, ignoring residue type. In most cases this seems like a good idea, since otherwise point mutations would be ignored in the analysis. Note that this option is crucial for the SEQUENCE_ALIGNMENT mode, since internally the residue IDs are changed to match the sequences but residue names remain the same. As a result, turning on strict residue name checking in this mode will yield strange results, since only the residues that are aligned and are of the same type will be matched.


CHAINS

This keyword is required when different chains are to be matched (MODE CHAINS|PROTEIN_CHAINS). The value should list the chains from the two input structures, separated by an underscore, in the order in which chains are to be matched. For instance, "CHAINS ABCD_BADC" means that chains from XYZIN1 and XYZIN2 will be matched in the following order: A->B, B->A, C->D, D->C.
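
The chain pairing encoded in such a value can be unpacked as in the following sketch. The function name parse_chains is hypothetical, and the requirement that both halves have the same length is an assumption rather than documented pbscalc behaviour.

```python
def parse_chains(value):
    """Turn a CHAINS value such as 'ABCD_BADC' into (chain1, chain2) pairs."""
    left, sep, right = value.partition("_")
    if not sep or not left or len(left) != len(right):
        raise ValueError("expected two equally long chain lists separated by '_'")
    # Chains are paired by position: first with first, second with second, ...
    return list(zip(left, right))


print(parse_chains("ABCD_BADC"))
# prints [('A', 'B'), ('B', 'A'), ('C', 'D'), ('D', 'C')]
```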


NBINS

Defines the number of bins used to calculate the distance distribution. Bins are placed unevenly to make sure that each includes roughly the same fraction of atoms. Defaults to 50; you may want to reduce this if the total number of matched atoms is small. This only affects the fit to the Maxwell-Boltzmann distribution and the results derived from it, not the p.b.s. and r.m.s.d.
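
One common way to obtain bins with roughly equal populations is to place the bin edges at evenly spaced quantiles of the sorted distances. The sketch below does this with linear interpolation; it illustrates the general idea and is not necessarily how pbscalc places its bins, and the function name equal_population_edges is hypothetical.

```python
def equal_population_edges(distances, nbins=50):
    """Return nbins + 1 edges so each bin holds roughly the same number of values."""
    data = sorted(distances)
    if nbins < 1 or len(data) < 2:
        raise ValueError("need at least one bin and two distances")
    last = len(data) - 1
    edges = []
    for k in range(nbins + 1):
        # Fractional position of the k-th quantile within the sorted data.
        pos = k * last / nbins
        lo = int(pos)
        hi = min(lo + 1, last)
        # Linear interpolation between neighbouring sorted values.
        edges.append(data[lo] + (pos - lo) * (data[hi] - data[lo]))
    return edges
```

With few matched atoms and many bins, neighbouring edges crowd together, which is one reason to lower NBINS for small matches.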


NMAX

Defines the maximum number of the atom groups allowed in multimodal fit. Defaults to 3.


WCUT

Defines the cutoff at which the program stops adding groups to the multimodal fit. Defaults to 95, which means that only as many groups as needed to account for 95% of all atoms will be used (but not more than NMAX).
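
The interplay between WCUT and NMAX can be pictured with the sketch below. It assumes the group populations are already known as fractions of all matched atoms and are listed in the order in which groups would be added; in the real program these come out of the fit itself, so this is only an illustration of the stopping rule, with hypothetical names.

```python
def groups_to_keep(weights, wcut=95.0, nmax=3):
    """Count the groups needed to cover `wcut` percent of atoms, capped at `nmax`.

    `weights` are group populations as fractions of all matched atoms
    (an assumption of this sketch).
    """
    covered = 0.0
    count = 0
    for w in weights:
        # Stop once enough atoms are covered or the group limit is reached.
        if count >= nmax or covered >= wcut:
            break
        covered += 100.0 * w
        count += 1
    return count
```

For instance, populations of 0.70, 0.27 and 0.03 need only two groups at the default cutoff, while four groups of 0.5, 0.2, 0.2 and 0.1 are capped at three by the default NMAX.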


OUTPUT

Defines what will be included in the output. Defaults to ALL, which will output r.m.s.d., p.b.s., spread obtained at different percentile cutoffs, distance distribution and its fit to single and multiple groups. To obtain only a specific item, the following values should be assigned to OUTPUT: RMSD, PBS, QUANTILE, DISTR, MULT, RESIDUES.


REFCYCLES

Number of alignment cycles prior to the p.b.s. calculation. The program aligns models using the Kabsch algorithm, but can also iteratively exclude atoms from the list used for this, leaving only those below the 60.8th percentile (corresponding to the underlying radial r.m.s.d. in the absence of outliers).
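
The 60.8% figure can be checked numerically under the assumption that, without outliers, atomic shifts follow a three-dimensional Maxwell-Boltzmann distribution. For a per-coordinate spread sigma, the radial r.m.s.d. is sqrt(3) * sigma, and the sketch below evaluates the fraction of shifts that fall below it. The function name maxwell_cdf is chosen for illustration only.

```python
import math


def maxwell_cdf(r, sigma=1.0):
    """Cumulative Maxwell-Boltzmann distribution for a 3D shift of length r.

    sigma is the standard deviation of each Cartesian component.
    """
    x = r / sigma
    return (math.erf(x / math.sqrt(2.0))
            - math.sqrt(2.0 / math.pi) * x * math.exp(-0.5 * x * x))


# Fraction of shifts shorter than the radial r.m.s.d. (sqrt(3) * sigma).
fraction_below_rms = maxwell_cdf(math.sqrt(3.0))
print(f"{100.0 * fraction_below_rms:.1f}%")
# prints 60.8%
```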


BFACTORS

Controls the values placed into the B-factor columns in XYZOUT1 and XYZOUT2. The default is PROBABILITY, which will calculate the probability that a particular atom belongs to a group as defined by fitting the multimodal distance distribution. For example, if there are two groups in total and the B-factor of a particular atom is 15, it has a 50/50 chance of being in either group, while values of 10 and 20 correspond to atoms definitely residing in groups 1 and 2, respectively. These values get murkier if there are more than two groups and they overlap significantly. The number is derived from the relative strengths of the probability densities of the individual groups at the distance equal to the shift of the particular atom between the two structures.
The alternative value is ZSCORE, in which case B-factors are simply calculated as the ratio of the atomic shift to the p.b.s.
All the atoms that are not part of the match have their B-factors reset to 0.
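
One encoding consistent with the 10/15/20 example is ten times the probability-weighted group number. The sketch below implements that reading under several assumptions that are not confirmed by this document: each group is a Maxwell-Boltzmann distribution with its own spread, group densities are scaled by the group populations, and an atom with zero density in every group gets 0. All function names are hypothetical.

```python
import math


def maxwell_pdf(r, sigma):
    """Maxwell-Boltzmann density for a shift r with per-coordinate spread sigma."""
    return (math.sqrt(2.0 / math.pi) * r * r / sigma ** 3
            * math.exp(-0.5 * (r / sigma) ** 2))


def probability_bfactor(shift, sigmas, weights):
    """Assumed encoding: 10 * sum over groups of (group number * membership probability)."""
    dens = [w * maxwell_pdf(shift, s) for s, w in zip(sigmas, weights)]
    total = sum(dens)
    if total == 0.0:
        # No group has any density at this shift (assumed fallback).
        return 0.0
    return 10.0 * sum((k + 1) * d / total for k, d in enumerate(dens))


def zscore_bfactor(shift, pbs):
    """ZSCORE variant: atomic shift divided by the p.b.s."""
    return shift / pbs
```

With two groups, an atom equally likely to belong to either comes out at 15, and atoms dominated by group 1 or group 2 approach 10 or 20, which matches the example in the text.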


BAVERAGE

If YES, B-factors in XYZOUT1/XYZOUT2 are replaced with their per-residue averages. Defaults to NO.


Examples

1. C-alphas are fitted and p.b.s. and related statistics calculated for the protein backbone atoms.
pbscalc xyzin1 apo.pdb xyzin2 holo.pdb << eof
FITM CALPHA
MODE PROTEIN_BACKBONE
END
eof
2. Whole models are used, with up to three groups for the multimodal distribution, and output pdb files are generated with B-factor columns replaced by the probability of a residue belonging to a particular group.
pbscalc xyzin1 apo.pdb xyzin2 holo.pdb xyzout1 apo1.pdb xyzout2 holo1.pdb << eof
NMAX 3
BFACTORS PROBABILITY
BAVERAGE YES
END
eof
3. Backbone atoms are used for fit and match, followed by two cycles of alignment correction, and only the p.b.s. is shown in the output. Coordinate files are generated with atomic shifts normalized by the p.b.s. as B-factors.
pbscalc xyzin1 apo.pdb xyzin2 holo.pdb xyzout1 apo1.pdb xyzout2 holo1.pdb << eof
MODE BACKBONE
REFCYCLES 2
OUTPUT PBS
BFACTORS ZSCORE
END
eof
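
For batch use, the same invocations can be driven from a Python script. The sketch below mirrors the first example by building the argument list and the keyword script that the shell would pass through the here-document. The helper names build_invocation and run_pbscalc are hypothetical, and the sketch assumes pbscalc is on the PATH and reads keywords from standard input exactly as in the shell examples above.

```python
import subprocess


def build_invocation(xyzin1, xyzin2, keywords, xyzout1=None, xyzout2=None):
    """Return (argv, script) for a pbscalc run.

    `keywords` is a list of (keyword, value) pairs; END is appended automatically.
    """
    argv = ["pbscalc", "xyzin1", xyzin1, "xyzin2", xyzin2]
    if xyzout1:
        argv += ["xyzout1", xyzout1]
    if xyzout2:
        argv += ["xyzout2", xyzout2]
    lines = [f"{key} {value}" for key, value in keywords]
    lines.append("END")
    return argv, "\n".join(lines) + "\n"


def run_pbscalc(argv, script):
    """Run pbscalc, feeding the keyword script on standard input."""
    return subprocess.run(argv, input=script, text=True,
                          capture_output=True, check=True)


# Equivalent of example 1: fit on C-alphas, statistics on the protein backbone.
argv, script = build_invocation(
    "apo.pdb", "holo.pdb",
    [("FITM", "CALPHA"), ("MODE", "PROTEIN_BACKBONE")],
)
```

Calling run_pbscalc(argv, script) would then start the calculation; its stdout attribute holds whatever the program prints.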