Wide-Search Molecular Replacement
Portal Application User Guide
When WS-MR is suitable
- You've got good data (<4 A)
- You've tried MR with lots of good candidate models
- a priori knowledge
- sequence similarity (PSI-BLAST search)
- protein not sequenced
- no a priori knowledge of expected fold
- You haven't found any good models to use for phasing
- Time to try a brute-force search: WS-MR
When WS-MR is not suitable
- Complexes containing significant DNA or RNA
- at least right now these will probably not work
- You haven't tried MR and just want a "quick fix"
- Very large or very small structures
- both are computationally difficult
- Low resolution (> 4.5 A)
- experience so far suggests these aren't going to be helped much
What you need
- Reflection data in MTZ file format
- Must have amplitude columns (e.g. FP, SIGFP)
- Doesn't work with intensities only (I, SIGI)
- Managed expectations
- WS-MR identifies good MR candidates in only about 1 in 4 cases
- We don't produce a fully phased structure, only a list of good MR candidates and their best placements as returned by Phaser
- Experience with Phaser to interpret results and re-run candidate models
What you will get
- Your job will take 20,000-50,000 computing hours
- Produces 300,000 files
- Attempts 100,000 single-domain MR trials using all SCOP domains
- Results are not automatically interpreted, but the best candidates are presented to you in a web-based scatter plot
- All results are available for your analysis
How to get started
WS-MR uses Phaser [McCoy, Grosse-Kunstleve, Adams, Winn, Storoni, Read; J. Appl. Cryst. (2007). 40, 658-674.] to perform molecular replacement on up to 100,000 protein domains from the SCOP protein database (see [Murzin A. G., Brenner S. E., Hubbard T., Chothia C. J. Mol. Biol. (1995) 247, 536-540.]) in an attempt to find high-quality templates suitable for phasing.
This is a computationally intensive operation (requiring up to 50,000 hours to complete), and one which should only be attempted after other efforts have failed. Results are typically available 2-4 days after initial submission. All searches are manually verified, however, and are submitted only with a suitable justification.
There are currently many jobs in the queue, and it may take up to two weeks before new jobs are run.
Citations, Publications, and other Documentation
Please cite one or both of the following in any publications which benefit from this system:
- Stokes-Rees I, Sliz P, Protein structure determination by exhaustive search of Protein Data Bank derived databases, Proc. Nat'l Academy of Sciences PNAS 2010: 1012095107v1-6.
- Stokes-Rees I, Sliz P, Compute and data management strategies for grid deployment of high throughput protein structure studies, IEEE Workshop on Many Task Computing on Grids and Supercomputers 2010 (MTAGS10), Seattle, November 2010
- Boon M, Grid scatters light on protein structures, International Science Grid This Week, December 15, 2010
- Curzan Morton C, The shape of things to come, Vector - Children's Hospital Boston Research and Innovation Blog, November 24, 2010
Your data and results are secure and can only be accessed by individuals with whom you share the access password. Internally we may review your data and results for the purposes of improving the WS-MR system. Before any future publication on our part, we will contact the owners of any data we wish to reference for prior approval.
To use WS-MR you must upload MTZ-format reflection data containing F and SIGF channels (amplitudes, not intensities), and indicate the channel names. You also provide the highest resolution available in the data plus the solvent fraction. The WS-MR server will then verify the data, create a search task, and submit it to run on the Open Science Grid [Pordes R, et al. (2007) J Phys: Conf Ser 78].
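Before uploading, it can save a round trip to confirm that the columns you intend to name really are an amplitude and its sigma. The sketch below is only an illustration of that check, not the server's own verification, which is not described here. It assumes you have already listed the column labels and their one-letter MTZ type codes (for example with a library such as gemmi); the function name and the example listing are hypothetical.

```python
# One-letter column type codes defined by the CCP4 MTZ format.
AMPLITUDE = "F"   # structure factor amplitude
SIGMA = "Q"       # standard deviation of another column
INTENSITY = "J"   # intensity (not accepted on its own)


def check_amplitude_columns(column_types, f_label, sigf_label):
    """Return a list of problems; an empty list means the pair looks usable.

    column_types maps each column label to its one-letter MTZ type code.
    """
    problems = []
    for label, expected in ((f_label, AMPLITUDE), (sigf_label, SIGMA)):
        actual = column_types.get(label)
        if actual is None:
            problems.append(f"column {label!r} not found")
        elif actual == INTENSITY:
            problems.append(f"column {label!r} holds intensities, not amplitudes")
        elif actual != expected:
            problems.append(
                f"column {label!r} has type {actual!r}, expected {expected!r}"
            )
    return problems


# Hypothetical column listing, for illustration only.
example = {"H": "H", "K": "H", "L": "H",
           "FP": "F", "SIGFP": "Q", "I": "J", "SIGI": "Q"}

print(check_amplitude_columns(example, "FP", "SIGFP"))  # no problems reported
print(check_amplitude_columns(example, "I", "SIGI"))    # flags the intensity column
```

If the second call reports a problem for your chosen labels, the file most likely holds intensities only and needs converting to amplitudes before submission.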
Results will be available in 2-4 days, although you can monitor progress using the links you will be sent by email. You can also set up a security password to keep the data private. This can be shared with colleagues who need to access the data.
Interpreting 100,000 molecular replacement results can be daunting. We are happy to help you do this -- please email us at the address above or using the link in your job email. Essentially you are looking for domains which score well (and we consider multiple scoring metrics), are distinct from 99% of the results which are not suitable, and which form a cluster of "similar" structures. The tables, graphs, and tools we provide you should help with this process.
Please describe your need to run WS-MR in the comment text box. What results have you had from conventional MR? Have you had particular difficulties with traditional phasing methods? Has the space group been confirmed, or should other enantiomorphs be tested? Please indicate the expected number of molecules/ASU. Please include data collection stats if available: resolution, Rsymm, % complete. The submitted information will help us to prioritize your jobs.
As the WS-MR process proceeds, live results are accumulated into a file inprogress.dat. Each row in the .dat file represents the scores for a single MR template that produced a result when run with Phaser against your MTZ. When the entire job set is done, this results file is augmented with SCOP class and some domain details which can help identify patterns in the top scoring results. A distinct group of top scoring results, or a fringe of high scoring (but perhaps not distinct) results all in the same SCOP class is a strong indicator of a good MR model.
The inprogress.dat files are preliminary and may contain shrapnel from concurrent data aggregation. They are also unsorted. Each time a status operation is executed a snapshot.dat file is generated. This performs some basic cleaning and sorting of the currently available data, but it still should not be considered definitive. Once the process is complete, the finalize process will sequentially collect all results, sort them, augment them with per-domain details, and produce two final results files: final.dat which contains the clean, sorted, final results, and final.augmented.dat which also contains additional "static" information about each domain.
The best way to "browse" your results is through an interactive multi-dimensional scatter graph. We recommend that Mac users try DataGraph. Excel, Gnumeric, or other spreadsheet packages may also be used, but may not accept data tables with 100k lines. In these cases, just use the top 1000 results. Our experience shows that the best way to discover strong MR candidates from the WS-MR results is to look at the following scatter graphs:
- LLG vs TFZ score - ignore any results with LLG less than 20
- LLG vs domain length - look for a cluster of results with the same domain length but above the "trend" line for the LLG trajectory, and with an LLG above 20.
In addition, adding domain coloring by SCOP class (or class group) can help highlight clusters of similar structures with strong results. This is often an indication of a good template model if the LLG and TFZ scores are also good.
Each time a snapshot is generated, you'll get a scatter graph of the LLG vs TFZ results collected to date.
Exploring specific MR candidates
You may find these two tools useful for examining the candidate models:
View RCSB and SCOP models by 6-character PDB code: http://portal.sbgrid.org/cgi/pdbview.py
Compare RCSB and SCOP models to first model in list: http://portal.sbgrid.org/cgi/tmalign.py
For both of these, a good "hint" is to click on the "copy link" link that appears after you've done a search, so you can share the results with others by URL. For "tmalign.py", if you are comparing fewer than 10 results, you can i) click on the "copy link" text; then ii) add &jmol=1 to the end of the URL. This will give you a JMol overlay. This is turned off by default since >10 JMol instances will crash many browsers. Yes, the UI for these needs some work, but so far they have been tools primarily for my own internal use...
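If you build these links in a script rather than by hand, the jmol=1 step can be done with the standard library. This is a small sketch only: the query string in the example is a made-up placeholder, since the real format of a copied link is not documented here, and the function name is hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_jmol(url):
    """Return url with jmol=1 set exactly once in its query string."""
    parts = urlsplit(url)
    # Drop any existing jmol parameter so it is not duplicated.
    query = [(key, value)
             for key, value in parse_qsl(parts.query, keep_blank_values=True)
             if key != "jmol"]
    query.append(("jmol", "1"))
    return urlunsplit(parts._replace(query=urlencode(query)))


# The "example=1" query is a placeholder, not the portal's real link format.
copied = "http://portal.sbgrid.org/cgi/tmalign.py?example=1"
print(with_jmol(copied))
```

Note that urlencode may percent-encode characters in the existing parameters; the resulting URL is equivalent, but it may not be character-for-character identical to the copied one. The limit of fewer than 10 compared results still applies.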
For some data sets, the promising results may have relatively low TFZ scores, but not "impossibly" low. One of our interesting findings is that TFZ scores as low as 4 or 5 can actually correspond to good MR candidates, and that a high LLG can draw these out. If you don't have any distinct cluster of solutions with jointly high LLG and high TFZ, instead look for high LLG, distinct from the rest of the results, and TFZ scores > 4. Cut and paste the top 20 or 40 results by LLG into:
http://portal.sbgrid.org/cgi/pdbview.py and see if they look approximately like what you expected.
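The selection just described (top results by LLG, keeping only TFZ > 4) can be sketched in a few lines. This assumes you have already loaded your results into a list of records with "id", "llg", and "tfz" keys; those key names, the function name, and the example rows are all hypothetical, and the layout of the .dat files is not assumed here. It does not judge whether a result is "distinct from the rest"; that still needs the scatter graph.

```python
def top_candidates_by_llg(results, n=20, min_tfz=4.0):
    """Return the IDs of the n highest-LLG results whose TFZ exceeds min_tfz."""
    kept = [r for r in results if r["tfz"] > min_tfz]
    kept.sort(key=lambda r: r["llg"], reverse=True)
    return [r["id"] for r in kept[:n]]


# Hypothetical rows for illustration only; real values come from your results.
rows = [
    {"id": "domA", "llg": 85.0, "tfz": 4.6},
    {"id": "domB", "llg": 90.0, "tfz": 3.1},  # high LLG but TFZ too low
    {"id": "domC", "llg": 42.0, "tfz": 5.2},
    {"id": "domD", "llg": 15.0, "tfz": 7.9},
]

# One ID per line, ready to paste into the viewer.
print("\n".join(top_candidates_by_llg(rows, n=2)))
```

With real data you would use n=20 or n=40, as suggested above.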
NOTE: If you are using WS-MR it should be in cases where you haven't been able to phase your data by conventional MR, selecting models from a priori knowledge of the unknown protein, or based on sequence homology. Our experience so far (limited to approximately 50 cases) is that only about 1 in 4 WS-MR searches returns promising candidate models. The rest of the time the results are inconclusive. We are working to improve the search mechanisms to attempt alternative structures.
An important thing to note is that small structures produce misleadingly high TFZ scores. We don't currently include the number of residues in the output table, but clearly we should, as this would help filter out uninterestingly small structures.
In our experience any results with LLG < 20 should be ignored, and possibly that cutoff should be as high as 30 or 40, depending on the structure under consideration. If you eliminate all data below that LLG cutoff and then re-sort by TFZ, you may well get an interesting result.
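As a rough sketch of this filter-then-re-sort step, the snippet below applies an LLG cutoff and ranks what remains by TFZ, trying the cutoffs 20, 30, and 40 mentioned above. The record keys ("id", "llg", "tfz"), the function name, and the example rows are hypothetical; how you load your own results is up to you.

```python
def filter_and_rank(results, llg_cutoff=20.0):
    """Drop results with LLG below llg_cutoff, then sort by TFZ, highest first."""
    kept = [r for r in results if r["llg"] >= llg_cutoff]
    return sorted(kept, key=lambda r: r["tfz"], reverse=True)


# Hypothetical rows for illustration only.
rows = [
    {"id": "domA", "llg": 85.0, "tfz": 4.6},
    {"id": "domB", "llg": 25.0, "tfz": 6.3},
    {"id": "domC", "llg": 35.0, "tfz": 5.2},
    {"id": "domD", "llg": 15.0, "tfz": 7.9},  # best TFZ, but LLG below 20
]

for cutoff in (20.0, 30.0, 40.0):
    ranked = filter_and_rank(rows, cutoff)
    print(cutoff, [r["id"] for r in ranked])
```

Comparing the rankings across cutoffs shows how sensitive the top of the list is to the choice; a candidate that stays near the top at every cutoff is a more comfortable pick.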
In summary, promising results can be identified three ways:
i) a distinct (and typically small) population of structures with both high LLG and high TFZ (top right of LLG vs TFZ scatter graph);
ii) structures which have the same SCOP SCCS class and show up repeatedly on the "fringe" of the overall population: on the "top" (high LLG) and/or "right" (high TFZ), taking into consideration the comment above that LLG < 20 should be ignored (and possibly higher);
iii) any other structures on the top and/or right "fringe" (as defined above).
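Point ii) can be approximated by counting how often each SCOP class appears among the highest-LLG and highest-TFZ results. The sketch below is one possible way to do that, not a validated scoring method. It assumes each record carries an SCCS string such as "f.4.3.1" under a hypothetical "sccs" key, and it defines the "fringe" crudely as the top few results on each axis after the LLG cutoff; the fringe size and all names are illustrative choices.

```python
from collections import Counter


def sccs_prefix(sccs, levels=3):
    """Truncate an SCCS string such as 'f.4.3.1' to its first `levels` fields."""
    return ".".join(sccs.split(".")[:levels])


def recurring_classes(results, fringe_size=50, llg_cutoff=20.0, levels=3):
    """Count SCCS prefixes among the high-LLG and high-TFZ fringes."""
    usable = [r for r in results if r["llg"] >= llg_cutoff]
    top_llg = sorted(usable, key=lambda r: r["llg"], reverse=True)[:fringe_size]
    top_tfz = sorted(usable, key=lambda r: r["tfz"], reverse=True)[:fringe_size]
    # A result on both fringes is counted once.
    fringe = {r["id"]: r for r in top_llg + top_tfz}
    counts = Counter(sccs_prefix(r["sccs"], levels) for r in fringe.values())
    return counts.most_common()


# Hypothetical rows for illustration only.
rows = [
    {"id": "d1", "llg": 80.0, "tfz": 6.0, "sccs": "f.4.3.1"},
    {"id": "d2", "llg": 75.0, "tfz": 4.5, "sccs": "f.4.3.2"},
    {"id": "d3", "llg": 30.0, "tfz": 7.0, "sccs": "f.4.3.1"},
    {"id": "d4", "llg": 60.0, "tfz": 5.0, "sccs": "a.1.1.2"},
    {"id": "d5", "llg": 10.0, "tfz": 9.0, "sccs": "b.1.1.1"},  # LLG below cutoff
]

for prefix, count in recurring_classes(rows, fringe_size=2):
    print(prefix, count)
```

A prefix that dominates the counts is a hint worth checking against the scatter graph, not a conclusion in itself.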
We have tried some different "combined" scoring and cut-off mechanisms, but so far don't have a systematically reliable way to extract this data, so we recommend the process described here. If you see a SCOP class showing up repeatedly, you can "browse" it via a URL such as:
If, for instance, f.4.3 (the porin transmembrane beta-barrels) comes up regularly and matches your expectation for the structure, this should give you confidence to pursue those structures first.