How to calculate percentile based spread and other ProteinRanger features

Load pdb-files

Select the pdb file using the file chooser and click one of the two buttons with either the blue or the red arrow. The order in which the files are chosen doesn't matter. You can see which two files are currently loaded in the list (Model #1 and Model #2).

Process

To proceed with default parameters, just click this button

The analysis normally takes several seconds and upon conclusion this graph will appear showing the distance distribution.

You can play with buttons of the graph window to pan/zoom, switch to logarithmic scale, hide the multivariate distribution and copy the data to the clipboard for transfer to another application.

Results

Calculation results are listed in the bottom half of the ProteinRanger window.

r.m.s.d.

First you have the traditional r.m.s.d. For the example I tried for this tutorial (the variable domain of an antibody in apo-form and in complex with hapten) it is 0.77A - you can clearly see that this doesn't make much sense when you look at the distance distribution.

p.b.s.

The next listed value is the percentile based spread, or p.b.s. In this example it is 0.20A, which is much more reasonable given the distance distribution. Percentile based spread is roughly the 60th percentile of the distances, which for a Maxwell-Boltzmann distribution corresponds to the root-mean-square variation in atomic positions.
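To see where the "roughly 60th percentile" comes from, here is a minimal sketch (not ProteinRanger's own code). It evaluates the Maxwell-Boltzmann cumulative distribution at the root-mean-square distance and then compares a plain r.m.s.d. with a percentile taken at that fraction. The distances are made up for illustration, and the helper names and the interpolation rule are assumptions.

```python
import math

def maxwell_cdf(r, sigma):
    """Fraction of distances below r for a Maxwell-Boltzmann distribution
    with per-coordinate standard deviation sigma."""
    x = r / sigma
    return math.erf(x / math.sqrt(2.0)) - math.sqrt(2.0 / math.pi) * x * math.exp(-x * x / 2.0)

def percentile(values, fraction):
    """Linear-interpolation percentile of a non-empty list (fraction in 0..1)."""
    ordered = sorted(values)
    pos = fraction * (len(ordered) - 1)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])

# The r.m.s. distance of a Maxwell-Boltzmann distribution is sqrt(3) * sigma;
# the fraction of distances below it is just over 0.6.
RMS_FRACTION = maxwell_cdf(math.sqrt(3.0), 1.0)

# Hypothetical distances in angstroms: mostly small shifts plus two outliers.
distances = [0.10, 0.12, 0.15, 0.15, 0.18, 0.20, 0.22, 0.25, 2.0, 3.0]
rmsd = math.sqrt(sum(d * d for d in distances) / len(distances))
pbs = percentile(distances, RMS_FRACTION)

print("fraction below r.m.s.:", round(RMS_FRACTION, 3))
print("r.m.s.d.:", round(rmsd, 2), " percentile based value:", round(pbs, 2))
```

The two outliers dominate the r.m.s.d., while the percentile stays with the bulk of the distances.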

Single variance spread

This is obtained from the least squares fit of the distance distribution to a Maxwell-Boltzmann distribution. It is usually a little smaller than the p.b.s. (in this example it's 0.17A), perhaps because the p.b.s. is somewhat influenced by outliers (but to a much smaller degree than the r.m.s.d.).

Single variance completeness

This estimates the fraction of the distance distribution that the single variance distribution accounts for. Just an estimate, so don't take it too seriously (it is derived from fit parameters of the Maxwell-Boltzmann distribution).
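A rough sketch of how a single variance spread and completeness could be pulled out of a distance histogram. This is an illustration under stated assumptions, not the program's actual fitting code: the grid search over sigma, the bin width, the synthetic two-group data, and the conversion of sigma to an r.m.s. value (sqrt(3) times sigma) are all choices made for this example.

```python
import math
import random

def maxwell_pdf(r, sigma):
    """Maxwell-Boltzmann density of distances for per-coordinate sigma."""
    return math.sqrt(2.0 / math.pi) * r * r / sigma ** 3 * math.exp(-r * r / (2.0 * sigma * sigma))

def histogram_density(distances, bin_width, n_bins):
    """Bin centers and densities; distances past the last bin are left out of
    the bins but still count towards the normalization."""
    counts = [0] * n_bins
    for d in distances:
        index = int(d / bin_width)
        if index < n_bins:
            counts[index] += 1
    scale = len(distances) * bin_width
    centers = [(i + 0.5) * bin_width for i in range(n_bins)]
    return centers, [c / scale for c in counts]

def fit_single_maxwell(centers, density, sigma_grid):
    """Least squares fit of weight * maxwell_pdf(r, sigma) to the histogram.
    Sigma is scanned over a grid; the weight has a closed-form optimum."""
    best = None
    for sigma in sigma_grid:
        model = [maxwell_pdf(r, sigma) for r in centers]
        weight = sum(h * m for h, m in zip(density, model)) / sum(m * m for m in model)
        sse = sum((h - weight * m) ** 2 for h, m in zip(density, model))
        if best is None or sse < best[0]:
            best = (sse, sigma, weight)
    return best[1], best[2]

def random_distances(n, sigma, rng):
    """Lengths of n random 3D displacement vectors with per-coordinate sigma."""
    return [math.sqrt(sum(rng.gauss(0.0, sigma) ** 2 for _ in range(3))) for _ in range(n)]

# Synthetic data: 80% of atoms with sigma 0.10 A, 20% with sigma 0.50 A.
rng = random.Random(0)
distances = random_distances(800, 0.10, rng) + random_distances(200, 0.50, rng)

centers, density = histogram_density(distances, bin_width=0.02, n_bins=100)
sigma_grid = [0.05 + 0.001 * i for i in range(251)]
sigma, completeness = fit_single_maxwell(centers, density, sigma_grid)

print("fitted sigma:", round(sigma, 3))
print("r.m.s. equivalent (sqrt(3) * sigma):", round(math.sqrt(3.0) * sigma, 3))
print("completeness:", round(completeness, 2))
```

Because the Maxwell-Boltzmann density is normalized, the fitted weight is directly the fraction of the distribution that the single group accounts for.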

Estimated outlier shift

This is an even more loosely defined parameter. The assumptions are that (i) your distribution contains two groups of atoms which have sharply different underlying variation; (ii) the smaller variation amplitude and its fraction are correctly identified by the fit to a single Maxwell-Boltzmann distribution. The larger variation amplitude is then calculated from the known overall r.m.s.d., the relative fractions of the two groups of atoms, and the smaller variation amplitude. In this example the estimate comes out as 0.83A. As you can see, the overall r.m.s.d. under such circumstances is effectively defined by 20% of the atoms, so the r.m.s.d. overestimates the core variation roughly 5-fold. Most atoms in these two structures are shifted by ~0.15A (which is probably simply due to the limited precision of the structure determination and a little bit of the "global" structural change induced by the ligand binding). Some, however, shift by 0.8A, which may be considered the real conformational change.
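One way to picture this calculation is the usual two-group decomposition of the mean square displacement, rmsd^2 = f_core * s_core^2 + f_out * s_out^2, solved for s_out. The sketch below assumes exactly that relation; the function name and the input values are hypothetical, and the exact definition used by ProteinRanger may differ, so don't expect it to reproduce the numbers quoted in this tutorial.

```python
import math

def estimated_outlier_shift(rmsd, core_spread, core_fraction):
    """Solve rmsd^2 = f * core^2 + (1 - f) * outlier^2 for the outlier amplitude.
    All amplitudes are r.m.s. values in the same units (assumed relation)."""
    outlier_fraction = 1.0 - core_fraction
    if not 0.0 < outlier_fraction <= 1.0:
        raise ValueError("core_fraction must be in the range [0, 1)")
    residual = rmsd ** 2 - core_fraction * core_spread ** 2
    if residual < 0.0:
        raise ValueError("core group alone already exceeds the overall r.m.s.d.")
    return math.sqrt(residual / outlier_fraction)

# Hypothetical two-group case: 80% of atoms at 0.2 A, 20% at 1.0 A.
overall = math.sqrt(0.8 * 0.2 ** 2 + 0.2 * 1.0 ** 2)
print("overall r.m.s.d.:", round(overall, 3))
print("recovered outlier shift:", round(estimated_outlier_shift(overall, 0.2, 0.8), 3))
```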

Number of groups

ProteinRanger attempts to fit the distance distribution to multiple Maxwell-Boltzmann distributions. Of course, this is mathematically an ill-conditioned problem, so the results should be evaluated carefully. Under the hood it tries to fit the complete distribution using as many groups as needed (but less than the user-defined maximum, 5 by default) to account for at least 95% of all the atoms (this limit can be changed too). This parameter (you guessed right) says how many groups were used. In our example, two groups were enough to meet the target.

Multivariance spread

Lists the individual sigmas for the multiple groups. In this example the values are 0.16A and 0.41A (check the screenshot to see that the second group accounts for the "wing" of the overall distribution). Notice that the first is a little less than the "single variance spread" (which will be biased towards higher values as it tries to account for the outliers), and the second is much smaller than the "estimated outlier shift", once again emphasizing that the r.m.s.d. is heavily influenced by outliers.

Multivariance completeness

In our example, the two fractions are roughly 75% and 25%. Of course, these are approximate. You can probably say that in this case the second group of atoms is rather substantial (i.e. more than just a percent or two), but that is probably all you can say. The total may exceed 100%.

Number of outliers

This is the absolute number of atoms not accounted for by the multivariance fit. In this example it is only 6 atoms, indicating that the vast majority of things can be accounted for by assuming just two groups of atoms shifting by different amounts. This parameter is not too reliable. It is obtained by subtracting the sum of analytically evaluated areas under the individual weighted Maxwell-Boltzmann distributions from unity. Because fitting by multiple distributions is an ill-conditioned problem, so is this parameter. Sometimes it becomes negative, which the program will report.
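As a small arithmetic sketch: the area under a weighted, normalized Maxwell-Boltzmann distribution is just its weight, so the unaccounted fraction is one minus the sum of the group weights. Scaling that fraction by the number of matched atoms is an assumption here, and the atom count and weights below are made up.

```python
def number_of_outliers(n_atoms, group_weights):
    """Atoms left unexplained by the fit: n_atoms * (1 - sum of group weights).
    The result can be negative when the fitted weights add up to more than 1."""
    unexplained_fraction = 1.0 - sum(group_weights)
    return round(n_atoms * unexplained_fraction)

# Hypothetical fit: 1200 matched atoms, two groups.
print(number_of_outliers(1200, [0.745, 0.250]))
# Weights that overshoot 100% give a negative count.
print(number_of_outliers(1200, [0.760, 0.250]))
```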

Using only backbone for alignment

This is what the "Fit ..." checkboxes are for. By default the program uses Kabsch alignment with all atoms, but you can ask it to use only the backbone (or only the side chains, which sounds ridiculous but was easy to implement, so why not). In my experience it doesn't make much difference in most cases, but sometimes it will. In the example used here the change in r.m.s.d. and p.b.s. is in the thousandths of an angstrom.
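For the curious, a minimal sketch of a Kabsch superposition where the transform is derived from a subset of atoms and then applied to all of them. It uses NumPy and is not the program's own implementation; the coordinates and the choice of the first four rows as the "fit" subset are purely illustrative.

```python
import numpy as np

def kabsch_transform(mobile, target):
    """Rotation and centroids that best superpose mobile onto target (N x 3 each)."""
    mobile = np.asarray(mobile, dtype=float)
    target = np.asarray(target, dtype=float)
    mobile_center = mobile.mean(axis=0)
    target_center = target.mean(axis=0)
    covariance = (mobile - mobile_center).T @ (target - target_center)
    u, _, vt = np.linalg.svd(covariance)
    # Flip the last axis if needed so the result is a proper rotation.
    sign = 1.0 if np.linalg.det(vt.T @ u.T) >= 0.0 else -1.0
    rotation = vt.T @ np.diag([1.0, 1.0, sign]) @ u.T
    return rotation, mobile_center, target_center

def apply_transform(coords, rotation, mobile_center, target_center):
    """Apply a transform from kabsch_transform to any set of coordinates."""
    coords = np.asarray(coords, dtype=float)
    return (coords - mobile_center) @ rotation.T + target_center

def rmsd(a, b):
    """Root-mean-square distance between two matched coordinate sets."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.sqrt((diff ** 2).sum(axis=1).mean()))

# Made-up coordinates; pretend the first four rows are backbone atoms.
target = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [2.0, 1.4, 0.0],
                   [3.3, 1.6, 0.8], [2.1, 2.4, -1.0], [4.0, 0.5, 1.9]])
angle = 0.7
spin = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                 [np.sin(angle), np.cos(angle), 0.0],
                 [0.0, 0.0, 1.0]])
mobile = target @ spin.T + np.array([5.0, -2.0, 3.0])

fit_rows = slice(0, 4)  # fit on the "backbone" only
transform = kabsch_transform(mobile[fit_rows], target[fit_rows])
superposed = apply_transform(mobile, *transform)
print("r.m.s.d. before:", round(rmsd(mobile, target), 3))
print("r.m.s.d. after: ", round(rmsd(superposed, target), 6))
```

Keeping the fit subset separate from the atoms the transform is applied to mirrors the split between the "Fit ..." and "Match ..." options.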

Select matching atoms

By default, all atoms are matched. The "Match ..." checkboxes provide certain limited control over which atoms are matched. Currently, you can match only the backbone, only the side chains, or only the waters (this is distance based, so there is no need to renumber waters). You can also specify that only protein atoms are matched (the default). In the example, the r.m.s.d. for the backbone is reduced to 0.41A and the p.b.s. to 0.17A. Sometimes, for nearly identical structures, limiting the match to the backbone brings the r.m.s.d. close to the p.b.s., suggesting that the outliers are mostly disordered side chains.
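If you want to reproduce a backbone-only or side-chain-only selection outside the program, something along these lines would do. It is a sketch that reads the fixed-column ATOM records of the PDB format; the backbone atom set (N, CA, C, O), the function names, and the coordinates are assumptions for illustration and not necessarily what ProteinRanger uses internally.

```python
BACKBONE_NAMES = {"N", "CA", "C", "O"}  # assumed definition of "backbone"

def parse_atoms(pdb_text):
    """Read ATOM records (protein atoms) from PDB text using fixed columns."""
    atoms = []
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue  # skips HETATM records such as waters
        atoms.append({
            "name": line[12:16].strip(),
            "res_name": line[17:20].strip(),
            "chain": line[21],
            "res_seq": int(line[22:26]),
            "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
        })
    return atoms

def select_backbone(atoms):
    return [a for a in atoms if a["name"] in BACKBONE_NAMES]

def select_side_chains(atoms):
    return [a for a in atoms if a["name"] not in BACKBONE_NAMES]

# Made-up coordinates for one alanine and one water.
SAMPLE = """\
ATOM      1  N   ALA A   1      11.104   6.134  -6.504
ATOM      2  CA  ALA A   1      11.639   6.071  -5.147
ATOM      3  C   ALA A   1      10.500   5.800  -4.200
ATOM      4  O   ALA A   1       9.400   5.500  -4.600
ATOM      5  CB  ALA A   1      12.400   7.300  -4.800
HETATM    6  O   HOH A 101       5.000   5.000   5.000
"""

atoms = parse_atoms(SAMPLE)
print([a["name"] for a in select_backbone(atoms)])
print([a["name"] for a in select_side_chains(atoms)])
```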

Compare identical chains (e.g. NCS)

Use the processing option "match chains" for this. When a pdb file is imported, the list of its chains is placed in the text field. Modify the two lists to select which chains to match. For example, if the first text box says "AB" and the second "CD", then chain A from the first molecule will be matched to chain C of the second, B from the first to D from the second, etc. Note that you need to import the pdb file twice (red and blue arrow) to compare chains within the same structure.
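The pairing rule described above is positional, which can be sketched in a couple of lines. The function name is hypothetical, and rejecting lists of unequal length is an assumption; the tutorial does not say how the program handles that case.

```python
def pair_chains(first_chains, second_chains):
    """Pair chain IDs by position: 'AB' and 'CD' give A-C and B-D."""
    if len(first_chains) != len(second_chains):
        raise ValueError("both chain lists must have the same length")
    return list(zip(first_chains, second_chains))

print(pair_chains("AB", "CD"))
```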

Compare non-identical proteins

Use the SSM-superposition processing option. This uses CCP4's superpose to run the SSM matching and gets the sequence alignment from the log file. This option currently assumes that superpose is available from the command line, so make sure that you have configured CCP4.
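A quick way to check whether superpose can be found on your PATH before trying this option. The snippet only looks for the executable and does not run it; the function name is made up for this example.

```python
import shutil

def superpose_available(executable="superpose"):
    """True if the named program can be found on the current PATH."""
    return shutil.which(executable) is not None

if superpose_available():
    print("superpose found at", shutil.which("superpose"))
else:
    print("superpose not found; set up the CCP4 environment first")
```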

Shakerr