Command-line p.b.s. calculator
General description
The command-line version of the p.b.s. calculation provides a CCP4-like interface. It reads the pdb files and processes them according to the keywords provided. The ultimate goal is to keep the same level of functionality provided by ProteinRanger in a convenient standalone form. There are two ways to supply keywords and inputs to the program.
1. Inputs/keywords are given on the command line and/or in a script file.
2. Just execute the program and enter keywords manually. The calculation will not start until you enter the "END" keyword.
Keywords comply with the 4-letter standard and are not case sensitive (i.e. you can use FITM, fitm, FITMODE or even FitMode; the program will treat them all as the same keyword).
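The 4-letter rule can be pictured with a short sketch. This is not the program's actual parser; it assumes that only the first four characters of a keyword are significant and that case is ignored, which is how CCP4-style keywords are usually handled. The helper name normalize_keyword is hypothetical.

```python
# Hypothetical helper: assumes only the first four letters matter, ignoring case.
def normalize_keyword(word: str) -> str:
    """Reduce a keyword to its first four letters, upper-cased."""
    return word.strip().upper()[:4]

# Keyword names as listed in this document.
KEYWORDS = ["MODE", "FITM", "CHRNAME", "CHAINS", "NBINS", "NMAX",
            "WCUT", "OUTPUT", "REFCYCLES", "BFACTORS", "BAVERAGE"]

# All spellings from the text collapse to the same 4-letter form.
for spelling in ("FITM", "fitm", "FITMODE", "FitMode"):
    assert normalize_keyword(spelling) == "FITM"

# The documented keywords stay distinct after truncation (e.g. CHRN vs CHAI).
assert len({normalize_keyword(k) for k in KEYWORDS}) == len(KEYWORDS)
```

Under this assumption, BFACTORS and BAVERAGE remain distinguishable (BFAC vs BAVE), as do CHRNAME and CHAINS.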
Installation
There is no installation per se yet; just download the tarball and run pbscalc from the command line as described below. You need python and scipy, and optionally superpose from CCP4 to run SSM alignment.
Input/output files
XYZIN1, XYZIN2
The input pdb files with structures to compare, in arbitrary order. These should be given on the command line, not in the script.
XYZOUT1, XYZOUT2 (optional)
The output pdb files will be identical to the input files except for the B-factor column which will be reset to values controlled by the keyword BFACTORS.
Keywords
The allowed keywords are MODE, FITM, CHRNAME, CHAINS, NBINS, NMAX, WCUT, OUTPUT, REFCYCLES, BFACTORS, BAVERAGE.
MODE
Defines how the atoms will be matched for the p.b.s. calculation. Allowed values are
ALL (default) | Match all the atoms based on atom ID |
CALPHA | Match all the C-alpha atoms based on residue ID. Residue types don't have to match. |
PROTEIN | Match all the protein atoms by atom ID. |
PROTEIN_BACKBONE | Match protein backbone atoms (CA, C, N, O). |
BACKBONE | This will match all the atoms with names matching those of protein and DNA/RNA backbone. This option will perform no residue name checking, so make sure you know what you are doing. |
PROTEIN_SIDECHAINS | Match protein side chain atoms (other than CA, C, N, O). |
SIDECHAINS | This will match all the atoms with names NOT matching those of protein and DNA/RNA backbone. This option will perform no residue name checking, so make sure you know what you are doing. |
WATERS | Waters will be matched based on their position, not resids. Specifically, after alignment two waters are deemed to be the same water if they are shifted by less than 1 Å. |
CHAINS | Atoms from different chains will be matched using the rest of atom IDs. The CHAINS keyword defines how chains are chosen. |
PROTEIN_CHAINS | Same as CHAINS, but making sure that only protein atoms are included. |
SEQUENCE_ALIGNMENT | Molecules will be pre-aligned and the match created using the secondary structure matching (SSM) algorithm as implemented in SUPERPOSE. For this to work, you need to have CCP4 configured so that superpose can be invoked from the command line. |
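As a rough illustration of the WATERS rule in the table above, the sketch below pairs waters purely by position after alignment. It is an assumption about how such matching could be done (a greedy nearest-neighbour search with a 1 Å cutoff), not the program's actual code; match_waters and its arguments are hypothetical names.

```python
import math

# Hypothetical sketch: pair waters from two already-aligned models by position.
def match_waters(waters1, waters2, cutoff=1.0):
    """Return (i, j) index pairs of waters closer than `cutoff` (in Angstrom).

    Each water in waters2 is used at most once; the closest unused one wins.
    """
    pairs = []
    used = set()
    for i, w1 in enumerate(waters1):
        best_j, best_d = None, cutoff
        for j, w2 in enumerate(waters2):
            if j in used:
                continue
            d = math.dist(w1, w2)
            if d < best_d:  # strictly less than the cutoff, as in the text
                best_j, best_d = j, d
        if best_j is not None:
            used.add(best_j)
            pairs.append((i, best_j))
    return pairs

waters_a = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
waters_b = [(5.3, 0.0, 0.0), (0.0, 0.4, 0.0), (9.0, 9.0, 9.0)]
print(match_waters(waters_a, waters_b))  # [(0, 1), (1, 0)]
```

Because the match ignores residue numbers, it only makes sense once the models have been aligned on some other group of atoms.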
FITM
Defines which group of atoms will be used for structural alignment prior to the p.b.s. calculation. It takes the same list of options as MODE, and defaults to the MODE value if omitted. This keyword can be useful if structural alignment is unstable when all the atoms are included. For instance, you may want to calculate the p.b.s. only for side chain atoms (MODE SIDECHAINS) but align molecules using C-alphas (thus FITM CALPHA). In practice it seems to have a surprisingly small effect on the calculated values, but it is crucial if you want to look at waters (which cannot be used for alignment, since the resid-based match fails unless the waters were renumbered).
CHRNAME NO|YES
Forces strict residue name matching. Defaults to NO, in which case the atoms are matched based on their names, ignoring residue type. In most cases this seems like a good idea, since otherwise point mutations would be ignored in the analysis. Note that for the SEQUENCE_ALIGNMENT mode this option is crucial, since internally the residue IDs are changed to match the sequences but residue names remain the same. As a result, turning on strict residue name checking in this mode will yield strange results, since only the residues that are aligned and are of the same type will be matched.
CHAINS
This keyword is required when different chains are to be matched (MODE CHAINS|PROTEIN_CHAINS). The value should list the chains from the two input structures, separated by an underscore, in the order in which the chains are to be matched. For instance, "CHAINS ABCD_BADC" means that chains from XYZIN1 and XYZIN2 will be matched in the following order: A->B, B->A, C->D, D->C.
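The mapping encoded by a CHAINS value can be spelled out with a small sketch. It assumes single-character chain IDs and one underscore separating the two lists; parse_chains is a hypothetical helper, not a function exposed by pbscalc.

```python
# Hypothetical helper: split "ABCD_BADC" into ordered chain pairs.
def parse_chains(value: str):
    """Return [(chain_in_XYZIN1, chain_in_XYZIN2), ...] in matching order."""
    first, sep, second = value.partition("_")
    if not sep or not first or len(first) != len(second):
        raise ValueError("expected two equally long chain lists separated by '_'")
    return list(zip(first, second))

print(parse_chains("ABCD_BADC"))
# [('A', 'B'), ('B', 'A'), ('C', 'D'), ('D', 'C')]
```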
NBINS
Defines the number of bins used to calculate the distance distribution. Bins are placed unevenly to make sure that each includes roughly the same fraction of atoms. Defaults to 50; you may want to reduce this if the total number of matched atoms is small. This will only affect the fit to the Maxwell-Boltzmann distribution and related statistics, but not the p.b.s. and r.m.s.d.
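One way to obtain bins that each hold roughly the same fraction of atoms is to place the edges at evenly spaced ranks of the sorted distances. The sketch below shows that idea only; it is an assumption about the binning scheme, and equal_population_edges is a hypothetical name.

```python
# Hypothetical sketch: bin edges chosen so that each bin holds ~n/nbins atoms.
def equal_population_edges(distances, nbins=50):
    """Return nbins + 1 edges taken at evenly spaced ranks of the sorted data."""
    data = sorted(distances)
    n = len(data)
    if n < nbins:
        raise ValueError("fewer matched atoms than bins; reduce NBINS")
    edges = [data[0]]
    for k in range(1, nbins):
        edges.append(data[(k * n) // nbins])
    edges.append(data[-1])
    return edges

# Skewed toy data: the bins come out narrow where shifts are dense, wide elsewhere.
shifts = [i * i / 100.0 for i in range(1, 101)]
edges = equal_population_edges(shifts, nbins=5)
widths = [hi - lo for lo, hi in zip(edges, edges[1:])]
assert widths == sorted(widths)
```

With few matched atoms each bin would hold only a handful of values, which is one reason to lower NBINS in that case.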
NMAX
Defines the maximum number of the atom groups allowed in multimodal fit. Defaults to 3.
WCUT
Defines the cutoff to stop adding more groups to multimodal fit. Defaults to 95, which means that only as many groups as needed to account for 95% of all atoms will be used (but not more than NMAX).
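The interplay of WCUT and NMAX can be sketched as follows. This is only one reading of the description: it assumes the candidate groups already have known populations (in percent of matched atoms) and that the largest groups are taken first. The real program may refit the distribution each time a group is added; select_groups is a hypothetical name.

```python
# Hypothetical sketch: keep adding groups until WCUT percent of atoms is covered,
# but never use more than NMAX groups.
def select_groups(populations, wcut=95.0, nmax=3):
    """populations: percent of matched atoms in each candidate group."""
    chosen = []
    covered = 0.0
    for p in sorted(populations, reverse=True):
        if covered >= wcut or len(chosen) >= nmax:
            break
        chosen.append(p)
        covered += p
    return chosen

print(select_groups([97.0, 3.0]))                      # [97.0]
print(select_groups([40.0, 30.0, 20.0, 10.0], nmax=2))  # [40.0, 30.0]
```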
OUTPUT
Defines what will be included in the output. Defaults to ALL, which will output r.m.s.d., p.b.s., spread obtained at different percentile cutoffs, distance distribution and its fit to single and multiple groups. To obtain only a specific item, the following values should be assigned to OUTPUT: RMSD, PBS, QUANTILE, DISTR, MULT, RESIDUES.
REFCYCLES
Number of alignment cycles prior to the p.b.s. calculation. The program aligns models using the Kabsch algorithm, but can also iteratively exclude atoms from the list used for this, leaving only those below the 60.8th percentile (corresponding to the underlying radial r.m.s.d. in the absence of outliers).
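The 60.8% figure can be checked numerically if the atomic shifts are assumed to follow a Maxwell-Boltzmann distribution (the distribution this document fits to the distances): the fraction of shifts smaller than the r.m.s. value is then about 0.608. The sketch below verifies this and shows a possible trimming step; trim_for_refit is a hypothetical helper, not the program's code.

```python
import math

def maxwell_cdf(r, a=1.0):
    """Cumulative Maxwell-Boltzmann distribution with scale parameter a."""
    x = r / a
    return math.erf(x / math.sqrt(2)) - math.sqrt(2 / math.pi) * x * math.exp(-x * x / 2)

# For this distribution the r.m.s. value equals sqrt(3) * a.
fraction_below_rms = maxwell_cdf(math.sqrt(3.0))
print(round(100 * fraction_below_rms, 1))  # 60.8

# Hypothetical trimming step: keep the indices of the smallest ~60.8% of shifts.
def trim_for_refit(shifts, fraction=0.608):
    ranked = sorted(range(len(shifts)), key=shifts.__getitem__)
    keep = max(1, int(round(fraction * len(shifts))))
    return sorted(ranked[:keep])

print(trim_for_refit([0.1, 5.0, 0.2, 0.3, 0.4]))  # [0, 2, 3]
```

Each refinement cycle would then realign on the retained atoms only, so that outliers with large shifts stop pulling the superposition.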
BFACTORS
Controls the values placed into the B-factor columns in XYZOUT1 and XYZOUT2. The default is PROBABILITY, which calculates the probability that a particular atom belongs to a group as defined by fitting the multimodal distance distribution. For example, if there were two groups in total and the B-factor of a particular atom is 15, it has a 50/50 chance of being in either group, while 10 and 20 correspond to atoms definitely residing in groups 1 and 2, respectively. These values get murkier if there are more than two groups and they overlap significantly. This number is derived from the relative strengths of the probability distribution densities for individual groups at the distance equal to the shift of the particular atom between the two structures.
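The two-group example (10, 15, 20) is consistent with a B-factor equal to ten times the probability-weighted group index. The sketch below uses that formula together with weighted Maxwell-Boltzmann densities; both choices are assumptions made for illustration, since the exact expression is not given here, and probability_bfactor is a hypothetical name.

```python
import math

def maxwell_pdf(d, a):
    """Maxwell-Boltzmann probability density at distance d for scale a."""
    return math.sqrt(2 / math.pi) * d * d / a**3 * math.exp(-d * d / (2 * a * a))

# Assumed encoding: B = 10 * sum_k k * P(group k), with groups numbered from 1.
def probability_bfactor(shift, groups):
    """groups: list of (weight, scale) pairs from a multimodal fit."""
    densities = [w * maxwell_pdf(shift, a) for w, a in groups]
    total = sum(densities)
    if total == 0.0:
        raise ValueError("all group densities vanish at this shift")
    return 10.0 * sum((k + 1) * dens / total for k, dens in enumerate(densities))

groups = [(0.5, 0.2), (0.5, 2.0)]        # a tight group and a mobile group
print(probability_bfactor(0.1, groups))  # close to 10: almost surely group 1
print(probability_bfactor(5.0, groups))  # close to 20: almost surely group 2
```

With three or more overlapping groups a value such as 20 no longer identifies a single group unambiguously, which is the murkiness mentioned above.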
Alternative value is ZSCORE, in which case B-factors are simply calculated as the ratio of atomic shift to the p.b.s.
All the atoms that are not part of the match have their B-factors reset to 0.
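The ZSCORE option and the reset of unmatched atoms amount to the following arithmetic. This is a minimal sketch; zscore_bfactors and the matched flags are hypothetical names introduced for illustration.

```python
# Hypothetical sketch: B = shift / p.b.s. for matched atoms, 0 for the rest.
def zscore_bfactors(shifts, pbs, matched):
    """shifts: per-atom shift; matched: per-atom flag for membership in the match."""
    return [s / pbs if m else 0.0 for s, m in zip(shifts, matched)]

print(zscore_bfactors([0.5, 1.0, 3.0], pbs=0.5, matched=[True, True, False]))
# [1.0, 2.0, 0.0]
```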
BAVERAGE
If YES, B-factors in XYZOUT1/XYZOUT2 are corrected to represent average per residue. Defaults to NO.
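Per-residue averaging can be sketched as below. The sketch averages over every atom of a residue; whether the program excludes unmatched atoms (B-factor 0) from the average is not stated here, so treat that as an open assumption. average_per_residue is a hypothetical name.

```python
from collections import defaultdict

# Hypothetical sketch: replace each atom's B-factor by its residue average.
def average_per_residue(atoms):
    """atoms: list of (residue_key, bfactor); returns the averaged B-factors."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for residue, bfactor in atoms:
        totals[residue] += bfactor
        counts[residue] += 1
    return [totals[residue] / counts[residue] for residue, _ in atoms]

atoms = [(("A", 10), 10.0), (("A", 10), 20.0), (("A", 11), 12.0)]
print(average_per_residue(atoms))  # [15.0, 15.0, 12.0]
```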
Examples
1. C-alphas are fitted and p.b.s. and related statistics calculated for the protein backbone atoms.
pbscalc xyzin1 apo.pdb xyzin2 holo.pdb << eof
FITM CALPHA
MODE PROTEIN_BACKBONE
END
eof
2. Whole models are used, with up to three groups for the multimodal distribution, and output pdb files are
generated with B-factor columns replaced with the probability of a residue belonging to a particular group.
pbscalc xyzin1 apo.pdb xyzin2 holo.pdb xyzout1 apo1.pdb xyzout2 holo1.pdb << eof
NMAX 3
BFACTORS PROBABILITY
BAVERAGE YES
END
eof
3. Backbone atoms are used for fit and match, followed by two cycles of alignment correction, and
only the p.b.s. is shown in the output. Coordinate files are generated with atomic shifts
normalized by the p.b.s. as B-factors.
pbscalc xyzin1 apo.pdb xyzin2 holo.pdb xyzout1 apo1.pdb xyzout2 holo1.pdb << eof
REFCYCLES 2
OUTPUT PBS
BFACTORS ZSCORE
END
eof
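The same calls can be driven from Python instead of a shell here-document. The sketch below rebuilds the first example; it assumes pbscalc is on the PATH, reads keywords from standard input as in the examples above, and prints its results to standard output (the last point is an assumption). The helper names are hypothetical.

```python
import subprocess

# Hypothetical helper: assemble the argument list and the keyword script.
def build_pbscalc_call(files, keywords):
    argv = ["pbscalc"]
    for label, path in files.items():
        argv += [label, path]
    script = "".join(f"{key} {value}\n" for key, value in keywords.items()) + "END\n"
    return argv, script

# Hypothetical runner: feeds the script on stdin, like the "<< eof" form.
def run_pbscalc(files, keywords):
    argv, script = build_pbscalc_call(files, keywords)
    result = subprocess.run(argv, input=script, text=True,
                            capture_output=True, check=True)
    return result.stdout

argv, script = build_pbscalc_call(
    {"xyzin1": "apo.pdb", "xyzin2": "holo.pdb"},
    {"FITM": "CALPHA", "MODE": "PROTEIN_BACKBONE"},
)
print(argv)    # ['pbscalc', 'xyzin1', 'apo.pdb', 'xyzin2', 'holo.pdb']
print(script)  # FITM CALPHA / MODE PROTEIN_BACKBONE / END, one per line
```

Only build_pbscalc_call is exercised here; run_pbscalc needs a working pbscalc installation and real input files.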