DOCK 5.1.0

10/15/2002

Demetri Moustakas

Kuntz Laboratory, UCSF

demer@francisco.compchem.ucsf.edu

Foreword

I would like to thank a lot of people for their help and support with this project. Members of the Kuntz and Kollman labs provided many fruitful discussions and advice during the initial design stages of the code, and helped me with debugging and validation of the DOCK 5 modules as they were being developed. I owe a debt of gratitude to group members past and present whose ideas, advice, and lively debates have shaped this project and helped make DOCK 5 what it is today. Specifically, I would like to acknowledge people who contributed significantly to the project. Fernando Martin designed and wrote the simplex minimizer and the optimization framework classes. Scott Pegg has worked on code and algorithm optimization, leading to significant speedup in the code. Geoff Skillman was instrumental in the early design stages; I owe him particular thanks for introducing me to the OELib, which became the basis for the DOCK 5 architecture. All my thanks to OpenEye Scientific Software: Matt Stahl, Ant Nicholls, Geoff Skillman, Mark McGann, Joe Corkery, and Roger Sayle, who answered countless questions and provided a wealth of chemistry, physics, and programming advice. Xiaoqin Zou provided me with the SDOCK source code, which was incorporated into the GB/SA scoring class. Jim Frazine provided me endless help with many matters related to Linux and SGI systems, and built and rebuilt test clusters to get the MPI code working. I have to thank my wife Katie for putting up with me these last few years through many late nights in front of the computer. And finally I would like to thank Tack, for allowing me the opportunity to work on DOCK, and for providing many wonderful experiences these past few years.

Introduction

This is the release of DOCK 5.1.0, the latest full release of DOCK built on the new C++ codebase. It contains all of the major functionality from DOCK 4, as well as a number of new functions. The next minor release will include MPI parallelization and an interface to the ZAP PB/SA library from OpenEye Scientific Software. All DOCK 5 licensees will be alerted to incremental releases via email, and notices will be posted on the DOCK web site as well.

This release contains ligand I/O, rigid orienting, anchor-first search, energy and contact scoring, GB/SA scoring, and simplex minimization. The main additions to this release are:

· Automated matching

· Internal energy (used in flexible docking)

· Scoring function hierarchy

· New minimizer termination criteria

· Bugs addressed in

o Minimizer & optimizer classes

o GB/SA scoring

o Anchor & Grow

o Rigid orienting

o Atom typing

This version of DOCK is written in C++, and each of the major DOCK functions has been implemented as a class. Most classes are designed for maximum ease of debugging and validation, and are continually being optimized for performance. The DOCK 5 developer’s manual will describe the API for each class in detail. The DOCK 5 code manual will describe in detail the data structures and algorithms used in each class. These additional manuals will be completed soon, and released with the first minor incremental release.

I would like to ask for feedback in several areas. Please report any bugs to demer@francisco.compchem.ucsf.edu. Additionally, please report any suggestions for new features, or new ways to combine or use the existing features. Thanks, and happy docking!

General Overview

The major features of DOCK 5.1.0 include rigid orienting of ligands to receptor spheres, AMBER energy scoring, GB/SA solvation scoring, contact scoring, internal non-bonded energy scoring, ligand flexibility, and both rigid and torsional simplex minimization. Each DOCK function is implemented as a C++ class, and molecules are represented by objects of a molecule class (based on the OELib’s OEMol) that are passed from one functional class to another. Much of the theory of the DOCK functions is described in the advanced section of the DOCK 4 manual. I recommend that users wanting to know more about the theory behind the algorithms refer to it.

Ligand File I/O

Currently, only MOL2 file I/O is supported. Ligands are read in from a single MOL2 database file. Atom and bond types are assigned using the DOCK 4 atom/bond typing parameter files (vdw.defn, flex.defn, flex_drive.tbl). There are several ligand output options, which write molecules to files whose names are formed using the ligand_outfile_prefix parameter:

Users can choose to write out orientations. This will create a file called outputprefix_orients.mol2, which contains the molecules after they have been rigidly oriented and optimized. If anchor & grow is being used, this option will write out only the anchor fragment. Every orientation generated is written out, so the file can become very large for big databases or high orientation counts.

Users can also write out conformers prior to final optimization. This will create a file called outputprefix_confs.mol2. Again, be aware that the number of molecules in the output file will be equal to the database size multiplied by the number of anchors per molecule, the number of orientations per anchor, and the number of conformers per cycle. This file can grow quite large, so only use it on small databases.
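As a rough illustration of how quickly that product grows, the following Python sketch multiplies the four factors for a hypothetical run. The database size and anchor count are made-up numbers; the orientation and conformer counts are the defaults listed in the parameter tables (max_orientations = 1000, number_confs_per_cycle = 25).

```python
def estimated_conformer_count(database_size, anchors_per_molecule,
                              orientations_per_anchor, conformers_per_cycle):
    """Estimate of the number of molecules written to the conformers file."""
    return (database_size * anchors_per_molecule
            * orientations_per_anchor * conformers_per_cycle)

# Hypothetical run: 1,000 ligands with 2 anchors each (illustrative values),
# using the documented defaults for orientations and conformers per cycle.
count = estimated_conformer_count(1000, 2, 1000, 25)
print(f"{count:,} molecules in outputprefix_confs.mol2")
```

Even this modest hypothetical database yields tens of millions of records, which is why the option is best kept for small test sets.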

DOCK will always write out a scored molecules output file, which contains the best scoring pose for each molecule in the database. This will create a file called outputprefix_scored.mol2. In DOCK 5.1.0, users can use molecule ranking, which writes out the top N molecules in the database in a file called outputprefix_ranked.mol2. This option disables the scored molecule output file by default, though users can override this and write out the best pose for each molecule as well.
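The difference between the scored and ranked outputs can be sketched in Python as follows. This is only an illustration, not the DOCK 5 ligand class: the function names are hypothetical, and it assumes that a lower score is better, as is the case for energy scores.

```python
import heapq

def best_pose_per_ligand(scored_poses):
    """Keep the best (lowest) score seen for each ligand name.

    scored_poses: iterable of (ligand_name, score) pairs, possibly with
    several poses per ligand. Lower-is-better is an assumption.
    """
    best = {}
    for name, score in scored_poses:
        if name not in best or score < best[name]:
            best[name] = score
    return best

def rank_ligands(scored_poses, max_ranked_ligands=500):
    """Return the top-N ligands, best first (the 'ranked' output)."""
    best = best_pose_per_ligand(scored_poses)
    return heapq.nsmallest(max_ranked_ligands, best.items(),
                           key=lambda item: item[1])

poses = [("lig_a", -10.0), ("lig_a", -12.0), ("lig_b", -5.0), ("lig_c", -20.0)]
print(best_pose_per_ligand(poses))   # one entry per ligand ("scored" output)
print(rank_ligands(poses, 2))        # only the top 2 ligands ("ranked" output)
```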

The ligand class also handles the MPI parallelization of DOCK over SMP and distributed clusters. When DOCK is compiled and run in parallel mode, a master processor distributes molecules to client processors, each of which performs the desired docking. Each client node returns its top-scoring molecules to the master node, which writes them to a file (default filename = output_mpi.mol2). Because of discrepancies between MPI implementations, it was not possible to easily use a command-line flag to enable or disable MPI. Instead, a #define statement in the dock.cpp file enables or disables MPI, so DOCK 5 compiles into either a single-processor or an MPI version. The code for this release is set to disable MPI while a bug related to parallel file access is worked out. MPI will be fully functional in the next incremental release.

Rigid Orienting

DOCK 5 uses receptor spheres and ligand heavy atom centers to rigidly orient ligands in the receptor. Cliques of receptor spheres and ligand centers are identified using the maximum subgraph clique detection algorithm from DOCK 4. All cliques that satisfy the matching parameters are generated in the matching step, and can be sorted or ordered prior to the loop where the program cycles through the orientations. This leaves open the possibility for the orientational sampling of the site to be directed by a function (e.g. uniform sphere sampling, uniform Cartesian sampling, spatially weighted, etc.). For details on the theory of sphere matching, please see the included DOCK 4 manual.

Both automated and manual matching are available in DOCK 5. The sphere/center matches are determined by two parameters:

1) The distance tolerance is the tolerance, in angstroms, within which a pair of spheres is considered equivalent to a pair of centers.

2) The distance minimum is the shortest distance allowed between two spheres (any sphere pair with a shorter distance is disregarded).
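The two parameters can be read as a simple pairwise test. The Python sketch below is one plausible interpretation, not the DOCK 5 matching code: the function name is hypothetical, and the default values are taken from the parameter table (0.25 and 2.0).

```python
import math

def pair_is_compatible(sphere_i, sphere_j, center_i, center_j,
                       distance_tolerance=0.25, distance_minimum=2.0):
    """Decide whether a sphere pair can be matched to a ligand center pair.

    All points are (x, y, z) tuples in angstroms.
    """
    sphere_distance = math.dist(sphere_i, sphere_j)
    center_distance = math.dist(center_i, center_j)
    # Sphere pairs closer than the distance minimum are disregarded.
    if sphere_distance < distance_minimum:
        return False
    # The two distances must agree to within the distance tolerance.
    return abs(sphere_distance - center_distance) <= distance_tolerance

# Spheres 3.0 A apart, centers 3.1 A apart: within the 0.25 A tolerance.
print(pair_is_compatible((0, 0, 0), (3, 0, 0), (1, 1, 1), (1, 1, 4.1)))
# Spheres only 1.5 A apart: rejected by the distance minimum.
print(pair_is_compatible((0, 0, 0), (1.5, 0, 0), (1, 1, 1), (1, 1, 2.5)))
```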

Manual matching will create as many matches as possible given the specified parameters, and sort the matches according to the RMS error between the spheres and centers in the match. The matches are provided as orientations until either max_orientations orientations have been generated, or the end of the match list is reached.

Automated matching will start with the default values for the distance tolerance and distance minimum. A list of matches will be generated, and if the number of matches is less than max_orientations, the distance tolerance is increased and the matching is repeated until there are at least max_orientations matches in the list. Then the list is sorted, and orientations are generated.
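The automated matching loop can be sketched in Python as shown here. This is an illustration under stated assumptions rather than the DOCK 5 implementation: the tolerance step size and the upper tolerance limit are not given in this manual and are invented for the example, the match generator is a toy stand-in passed in as a function, and sorting by RMS error is assumed to work as in manual matching.

```python
def automated_matching(generate_matches, max_orientations,
                       tolerance=0.25, tolerance_step=0.25, max_tolerance=5.0):
    """Loosen the distance tolerance until enough matches are found.

    generate_matches(tolerance) must return a list of (rms_error, match)
    tuples. tolerance_step and max_tolerance are assumed values.
    """
    matches = generate_matches(tolerance)
    while len(matches) < max_orientations and tolerance < max_tolerance:
        tolerance += tolerance_step
        matches = generate_matches(tolerance)
    matches = sorted(matches, key=lambda m: m[0])  # best RMS error first
    return matches[:max_orientations], tolerance

# Toy stand-in for clique detection: a candidate is accepted once its
# RMS error falls within the current tolerance.
candidates = [(0.9, "m4"), (0.1, "m1"), (0.6, "m3"), (0.4, "m2"), (1.3, "m5")]

def toy_generator(tolerance):
    return [c for c in candidates if c[0] <= tolerance]

selected, final_tolerance = automated_matching(toy_generator, 3)
print([name for _, name in selected], final_tolerance)
```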

Ligand Flexibility

Ligand flexibility in DOCK 5 uses the anchor-first search introduced in DOCK 4. Rotatable bonds (not contained in rings) are used to partition the molecule into rigid segments, from which all anchors that meet the criteria are selected, beginning with the largest anchor segment. If no segments meet the anchor criteria, the largest segment is selected as the only anchor. All anchor orientations (or the starting orientation only, if no orienting is selected) are used as starting configurations onto which the first flexible layer is appended and conformationally expanded. The total population of conformers is then reduced to the number specified in Nc (the number_confs_per_cycle parameter), and the process is repeated until the last layer is reached.
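The anchor selection rule and the pruning step can be sketched in Python as follows. The data layout (segments as name/size pairs, conformers as score/label pairs) and the function names are hypothetical, and pruning by lowest score is an assumption; the default values come from the parameter table.

```python
def select_anchors(segments, min_anchor_size=10):
    """Pick anchor segments, largest first.

    segments: list of (segment_name, heavy_atom_count) pairs.
    Falls back to the single largest segment when none is big enough.
    """
    ordered = sorted(segments, key=lambda seg: seg[1], reverse=True)
    anchors = [seg for seg in ordered if seg[1] >= min_anchor_size]
    return anchors if anchors else ordered[:1]

def prune_conformers(conformers, number_confs_per_cycle=25):
    """Keep only the best-scoring conformers for the next growth layer.

    conformers: list of (score, label) pairs; lower score assumed better.
    """
    return sorted(conformers, key=lambda conf: conf[0])[:number_confs_per_cycle]

print(select_anchors([("ring_a", 12), ("chain", 3), ("ring_b", 15)]))
print(select_anchors([("small_a", 4), ("small_b", 6)]))
print(prune_conformers([(-5.0, "c1"), (-9.0, "c2"), (-1.0, "c3")], 2))
```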

The conformer generator class now integrates score optimization in the anchor & grow algorithm. The anchors can be rigidly optimized, the final conformations can be either rigidly, torsionally, or completely optimized, and the partially grown conformers can be completely optimized. Additionally, a look-ahead heuristic designed to optimize the conformation-pruning step has been developed, and is currently being validated. It will be included (pending validation) in an incremental release. The anchor & grow steps use whichever scoring function the user selects as the primary scoring function. The final minimization step uses the secondary scoring function.

Scoring Functions

This release of DOCK 5 implements a hierarchical scoring function strategy. A master score class manages all scoring functions that DOCK uses. Any of the DOCK scoring functions can be selected as the primary and/or the secondary scoring function. The primary scoring function is used during the rigid minimization and anchor & grow steps, which typically make many calls to the scoring function. The secondary scoring function is used in the final minimization, scoring, and ranking of the molecules. If no secondary scoring function is selected, the primary scoring function is used as the secondary.
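The primary/secondary arrangement, including the fallback when no secondary function is chosen, can be pictured with a small Python sketch. The class and method names here are hypothetical and do not reflect the actual DOCK 5 master score class; the two scoring functions are toy stand-ins.

```python
class MasterScore:
    """Illustrative holder for a primary and a secondary scoring function."""

    def __init__(self, primary, secondary=None):
        self.primary = primary
        # With no secondary selected, the primary serves both roles.
        self.secondary = primary if secondary is None else secondary

    def search_score(self, pose):
        """Score used in the frequently called search steps."""
        return self.primary(pose)

    def final_score(self, pose):
        """Score used for final minimization, scoring, and ranking."""
        return self.secondary(pose)

# Toy scoring functions operating on a dictionary "pose".
def contact_score(pose):
    return -float(pose["contacts"])

def energy_score(pose):
    return pose["vdw"] + pose["es"]

pose = {"contacts": 12, "vdw": -20.5, "es": -3.5}
cheap_then_accurate = MasterScore(contact_score, energy_score)
print(cheap_then_accurate.search_score(pose), cheap_then_accurate.final_score(pose))
print(MasterScore(contact_score).final_score(pose))
```

Pairing a cheap primary function with a more expensive secondary one keeps the many search-time evaluations fast while still ranking on the more detailed score.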

This release contains intermolecular AMBER energy scoring (vdw + Coulombic terms only), contact scoring, and bump filtering as implemented in DOCK 4. It also contains GB/SA scoring, as implemented in SDOCK, by Dr. Xiaoqin Zou (ZouX@missouri.edu). The scoring functions currently only compute grid-based scores; continuum scoring for the AMBER energy score will be implemented in an incremental release. Scoring grids are created using the GRID program distributed with DOCK 4. Scoring grids for GB/SA require that the SDOCK accessory chemgrid be run. This program is included in the utilities/GBSA_Grids/ directory, for both Linux and SGI platforms. There is a README file in this directory with instructions on creating GB/SA grids.

One important note regarding the implementation of the scoring functions is that each one is a completely separate class, independent of the others. As a result, a path to the grid prefix must be supplied to each scoring function during parameter input, even when several functions use the same grids.

This release also includes an internal energy scoring function that is used during the anchor & grow flexible search. This function computes the Lennard-Jones and Coulombic energy between all ligand atom pairs, excluding all 1-2, 1-3, and 1-4 pairs. This energy is not included in the final reported score.
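A minimal Python sketch of such an internal energy follows. It is not the DOCK 5 code: the functional forms are assumptions (a generalized Lennard-Jones term with a single hypothetical well depth and minimum-energy distance for every pair, and a Coulomb term with a constant dielectric and the usual 332.06 kcal·Å/(mol·e²) conversion factor). Only the exponents (6 and 12), the dielectric default (4.0), and the exclusion of 1-2, 1-3, and 1-4 pairs come from this manual.

```python
import math
from collections import deque

def excluded_pairs(n_atoms, bonds, max_bonds=3):
    """Atom pairs separated by 1 to max_bonds bonds (1-2, 1-3, 1-4 pairs)."""
    neighbors = {i: set() for i in range(n_atoms)}
    for a, b in bonds:
        neighbors[a].add(b)
        neighbors[b].add(a)
    excluded = set()
    for start in range(n_atoms):
        depth = {start: 0}
        queue = deque([start])
        while queue:                      # breadth-first walk over bonds
            atom = queue.popleft()
            if depth[atom] == max_bonds:
                continue
            for nxt in neighbors[atom]:
                if nxt not in depth:
                    depth[nxt] = depth[atom] + 1
                    queue.append(nxt)
                    excluded.add((min(start, nxt), max(start, nxt)))
    return excluded

def internal_energy(coords, charges, bonds, well_depth=0.1, r_min=3.5,
                    att_exp=6, rep_exp=12, dielectric=4.0):
    """Sum Lennard-Jones and Coulomb terms over non-excluded atom pairs."""
    skip = excluded_pairs(len(coords), bonds)
    n, m = rep_exp, att_exp
    energy = 0.0
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if (i, j) in skip:
                continue
            r = math.dist(coords[i], coords[j])
            x = r_min / r
            # Generalized n-m form with a minimum of -well_depth at r_min.
            energy += well_depth * (m / (n - m) * x ** n - n / (n - m) * x ** m)
            energy += 332.06 * charges[i] * charges[j] / (dielectric * r)
    return energy

# Five-atom chain: only the 0-4 pair is more than three bonds apart.
chain_bonds = [(0, 1), (1, 2), (2, 3), (3, 4)]
chain_coords = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0), (3.5, 0, 0)]
print(internal_energy(chain_coords, [0.0] * 5, chain_bonds))
```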

Score Optimization

Score optimization is implemented using a simplex minimizer based on the DOCK 4 minimizer. Users can choose to minimize the rigid anchors, minimize during flexible growth, and minimize the final conformation. The anchor minimization is always done rigidly; also, if no flexible growth is being done, this step will minimize the entire molecule. The minimization during the flexible growth is a complete (torsions + rigid) minimization. The final minimization can be rigid, torsions only, or complete. There are two termination criteria that the simplex minimizer can use to end minimization before the maximum number of iterations has been reached. One is a window-based termination scheme that evaluates a window of steps in the minimization, and terminates the minimization when the largest difference between the energies in the window is within a user-specified tolerance. The other termination criterion is the scaled range termination scheme. This is the termination criterion used in DOCK 4, where the difference between the highest and lowest point in the simplex is compared to a tolerance specified by the user. When the simplex “shrinks” enough so that the highest and lowest points are within the tolerance, the minimizer terminates. Unlike the previous version of DOCK 5, the minimizer will optimize any scoring function that is used as the primary or secondary score.
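The two termination tests can be sketched in Python as follows. This is an illustration, not the minimizer's code: the function names are hypothetical, the defaults come from the parameter table, and the scaled range fraction is written as (hi - low)/max(hi, fsize), as given there. The table does not say how a non-positive denominator is handled, so the sketch raises an error in that case rather than guessing.

```python
def window_terminated(score_history, window_size=55, window_delta=1.0):
    """True when the last window_size scores span less than window_delta."""
    if len(score_history) < window_size:
        return False                      # not enough iterations yet
    window = score_history[-window_size:]
    return max(window) - min(window) < window_delta

def scaled_range_terminated(simplex_scores, fsize=0.0, tolerance=1.0):
    """True when (hi - low) / max(hi, fsize) drops below tolerance."""
    hi, low = max(simplex_scores), min(simplex_scores)
    denominator = max(hi, fsize)
    if denominator <= 0:
        # Assumption: this case is left undefined here, not documented.
        raise ValueError("scaled range undefined for non-positive max(hi, fsize)")
    return (hi - low) / denominator < tolerance

history = [10.0, 9.5, 9.4, 9.3]
print(window_terminated(history, window_size=3, window_delta=1.0))
print(scaled_range_terminated([10.0, 9.5, 9.0], fsize=0.0, tolerance=0.2))
print(scaled_range_terminated([10.0, 2.0], fsize=0.0, tolerance=0.2))
```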

User instructions

Installation Instructions

This DOCK 5 beta release has been built and tested on SGI, Linux (both AMD and Intel chips), and Windows 2000 (Intel chips) platforms. I have not included the Windows distribution in this release; however, I can provide it to any user who desires it, and it will be provided by default in all future beta releases. Binaries are included for IRIX and Linux platforms, along with makefiles for each platform. The binaries are located in the bin/ subdirectory. If the binaries work on your system, and you have no desire to recompile the program, feel free to skip the rest of this section. Otherwise I’ll assume you have either a good spirit of adventure, or the need to compile DOCK 5 on a system other than the ones listed above. In the event the latter is the case, please feel free to contact me regarding compilation problems/successes on different platforms.

The dock5 directory contains the following subdirectories:

REQUIRED_LIBRARIES/

bin/

demo/

docs/

mpich/

oelib/

parameters/

src/

utilities/

    accessories/

    grid/

    GBSA_Grids/

DOCK 5 is built upon two libraries. The first is the OELib, provided by OpenEye Scientific Software (www.eyesopen.com). The version of the OELib used by DOCK 5 is open source and free of charge. Redistribution is restricted to use allowed by the GNU General Public License, or through arrangement with OpenEye. The second required library is the MPICH library, provided freely by Argonne National Laboratory (http://www-unix.mcs.anl.gov/mpi/mpich/). The MPI library must be built in order to compile DOCK 5; however, it only needs to be installed and running on the system if the MPI features are to be used.

The directory REQUIRED_LIBRARIES/ contains tar.gz archives of both the oelib/ and the mpich/ install directories. The directories oelib/ and mpich/ contain the unpacked install directories for each library. If the libraries are built in these directories, then the provided makefiles should work with no modification. If the library locations are customized, then the makefile include and library paths will require modification. Since the libraries need to be built specifically for one computing platform, if you plan to compile DOCK 5 on multiple platforms, it is advisable to create one copy of the dock_v5.0b1 directory for each platform you wish to compile on. Above all else, make sure that the platform you are compiling DOCK 5 on is the same platform used to build the required libraries.

Building the OELib: (on both SGI & Linux platforms)

From the dock_v5.0b1 directory:

cd oelib

./configure

make

make install

Building MPICH: (on SGI platforms)

From the dock_v5.0b1 directory:

cd mpich/

./configure --with-arch=IRIXN32

make

Building MPICH: (on Linux platforms)

From the dock_v5.0b1 directory:

cd mpich/

./configure

make

Once the required libraries are built, change into the src/ directory. Two makefiles are provided (Makefile.sgi & Makefile.linux); they differ primarily in using the CC compiler on SGI platforms and the g++ compiler on Linux platforms.

Building DOCK 5: (all platforms)

From the dock_v5.0b1 directory:

cd src/

make -f Makefile.(sgi or linux) clean

make -f Makefile.(sgi or linux) dock

make -f Makefile.(sgi or linux) install

The install command will move an executable named dock5.sgi or dock5.linux into the bin/ directory, where it will be ready for use.

To build the utilities, simply change into the utilities/accessories directory, and type:

make all

Then change into the utilities/grid directory, and depending on whether you are using a Linux or SGI system, type either:

make -f Makefile.linux grid

or:

make -f Makefile.sgi grid

This will install all of the DOCK utilities (grid, sphgen, showsphere, etc.) into the bin directory. See the DOCK 4 manual for instructions on how to use these programs.

Running DOCK 5

DOCK 5 reads a parameter file containing field/value pairs similar to the DOCK 4 infile. The program is run as follows:

./dock5 -i parameter.in [-v1] [-v2]

If the parameter file exists, any parameter values found in it will be read, and the user will be prompted via stdin/stdout for any required parameters that are missing. An important note regarding MPI use is that the stdin/stdout interfaces are disabled across MPI; therefore, the parameter file must be complete in order to work properly. It is advisable to test the parameter file on a single-processor job prior to launching an MPI job. If an MPI job is launched with missing parameters, the job will wait indefinitely on user input for the missing parameters. The next beta release will determine whether the program is running as an MPI job, and return an error if any parameters are missing.
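One way to catch an incomplete parameter file before launching a job is a small pre-flight check such as the Python sketch below. It is not part of DOCK 5. It assumes that each line of the parameter file holds a parameter name followed by whitespace and its value, and the list of required names is purely illustrative, since the parameters DOCK actually requires depend on the options selected.

```python
def parse_parameter_file(text):
    """Parse 'name value' lines into a dictionary (assumed file layout)."""
    values = {}
    for line in text.splitlines():
        fields = line.split(None, 1)      # split on the first whitespace run
        if len(fields) == 2:
            values[fields[0]] = fields[1].strip()
    return values

def missing_parameters(text, required_names):
    """Return the required parameter names that the file does not set."""
    present = parse_parameter_file(text)
    return [name for name in required_names if name not in present]

example_file = """\
ligand_atom_file        database.mol2
ligand_outfile_prefix   output
orient_ligand           yes
"""

# Illustrative list only; the real set depends on the chosen options.
required = ["ligand_atom_file", "ligand_outfile_prefix",
            "orient_ligand", "receptor_site_file"]
print(missing_parameters(example_file, required))
```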

DOCK 5 outputs the job parameters to the screen at the start of the job, and prints summary information for each molecule processed. Additional summary information will be included in future releases. The -v1 flag turns on low-level verbosity. This will print out a histogram of sphere matching information, as well as other useful output that will be added in the future (minimization statistics, molecule statistics, etc.). The -v2 flag turns on high-level verbosity, printing details about the breakdown of the GB/SA terms, and in the future, atom type, bond type, and atom-by-atom breakdown of energy scores.

DOCK 5 Parameters

The DOCK 5 parameter parser requires that the values entered for a parameter exactly match one of the legal values if any legal values are specified. For example:

param_a [5] ():

param_b [5] (0 5 10):

Here param_a can be assigned any value, whereas param_b can only be assigned 0, 5, or 10. If no value is entered, both will default to a value of 5. Below are listed all DOCK 5 parameters, their default values, legal values, and a brief description of each. The parameters are listed in order of function. Also, for questions requiring a yes/no answer, please use the full word (yes or no) as opposed to y or n. It’s inconvenient, but it prevents problems with the parser in the long run.
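The prompt format and the exact-match rule can be illustrated with a short Python sketch. This is a hypothetical re-implementation for explanation only, not the DOCK 5 parser; the regular expression is based solely on the two example prompts shown above.

```python
import re

# name [default] (legal values):
PROMPT = re.compile(r"^\s*(\w+)\s*\[(.*?)\]\s*\((.*?)\)\s*:\s*$")

def parse_prompt(line):
    """Split a prompt into (name, default, list of legal values)."""
    match = PROMPT.match(line)
    if match is None:
        raise ValueError(f"not a parameter prompt: {line!r}")
    name, default, legal = match.groups()
    # Accept legal values separated by spaces or commas.
    return name, default, legal.replace(",", " ").split()

def resolve_value(entered, default, legal_values):
    """Apply the default, then require an exact match to a legal value."""
    value = entered.strip() or default
    if legal_values and value not in legal_values:
        raise ValueError(f"{value!r} is not one of {legal_values}")
    return value

name, default, legal = parse_prompt("param_b [5] (0 5 10):")
print(name, default, legal)
print(resolve_value("", default, legal))      # empty input falls back to 5
print(resolve_value("10", default, legal))
```

Under this exact-match rule, an answer of y to a yes/no prompt is rejected, which is why the full word is needed.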

Ligand I/O Parameters

Parameter Name	Default Value	Legal Values	Description
ligand_atom_file	database.mol2		The ligand input filename
ligand_outfile_prefix	output		The prefix that all output files will use
write_orientations	no	yes, no	Flag to write orientations
write_conformations	no	yes, no	Flag to write conformations
calculate_rmsd	no	yes, no	Flag to perform an RMSD calculation between the final molecule pose and its initial structure. This value is reported in the outfile_scored.mol2 file
rank_ligands	no	yes, no	Flag to enable a ligand top-score list. These ligands will be written to outfile_ranked.mol2, and outfile_scored.mol2 will be empty by default
max_ranked_ligands	500		The # of ligands to be stored in the top score list
scored_mol_output_override	no	yes, no	This flag causes all ligands to be written to outfile_scored.mol2, even when rank_ligands is true
max_send_queue_size	10		The maximum number of ligands sent in a workunit to an MPI client
max_recv_queue_size	10000		The maximum number of ligands returned in one message from an MPI client

Orient Ligand Parameters

Parameter Name	Default Value	Legal Values	Description
orient_ligand	no	yes, no	Flag to orient ligand to spheres
automated_matching	no	yes, no	Flag to perform automated matching instead of manual matching
distance_tolerence	0.25		The distance tolerance applied to each edge in a clique
distance_minimum	2.0		The minimum size for an edge in a clique
nodes_minimum	3		The minimum # of nodes in a clique
nodes_maximum	10		The maximum # of nodes in a clique
receptor_site_file	receptor.sph		The file containing the receptor spheres
max_orientations	1000		The maximum # of orientations that will be cycled through

Flexible Ligand Parameters

Parameter Name	Default Value	Legal Values	Description
flexible_ligand	no	yes, no	Flag to perform anchor first search
min_anchor_size	10		The minimum # of heavy atoms for an anchor segment
number_confs_per_cycle	25		The maximum number of conformations carried forward in the anchor & grow search

Scoring Ligand Parameters

Parameter Name	Default Value	Legal Values	Description
bump_filter	no	yes, no	Flag to perform bump filtering
bump_grid_prefix	grid		The prefix to the grid file(s) containing the desired bump grid
max_bumps	0		The maximum allowed # of bumps for a molecule to pass the filter
score_molecules	no	yes, no	Enables scoring of molecules
energy_score_primary	no	yes, no	Flag to perform energy scoring as the primary scoring function
energy_score_secondary	no	yes, no	Flag to perform energy scoring as the secondary scoring function
vdw_scale	1		Scalar multiplier of the vdw energy component
es_scale	1		Scalar multiplier of the electrostatic energy component
nrg_grid_prefix	grid		The prefix to the grid files containing the desired nrg grid
contact_score_primary	no	yes, no	Flag to perform contact scoring as the primary scoring function
contact_score_secondary	no	yes, no	Flag to perform contact scoring as the secondary scoring function
contact_cutoff_distance	4.5		The distance threshold defining a contact
contact_clash_overlap	0.75		Contact definition for use with intramolecular scoring
contact_clash_penalty	50		The penalty for each contact overlap made
cnt_grid_prefix	grid		The prefix to the grid files containing the desired cnt grid
gbsa_score_primary	no	yes, no	Toggles whether or not to use GB/SA scoring as the primary scoring function
gbsa_score_secondary	no	yes, no	Toggles whether or not to use GB/SA scoring as the secondary scoring function
gb_grid_prefix	gb_grid		The path to the pairwise GB grids
sa_grid_prefix	sa_grid		The path to the SA grids
screen_file	screen.in		GB parameter file for electrostatic screening. It is located in the parameters directory by default
solvent_dielectric	78.300003		The value for the solvent dielectric
vdw_grid_prefix	grid		The path to the dock4 nrg grids, used for the vdw portion of the GB/SA calculation

Score Optimization Parameters

Parameter Name	Default Value	Legal Values	Description
minimize_ligand	no	yes, no	Flag to perform score optimization
minimize_rigid_anchor	no	yes, no	Flag to perform rigid optimization of the anchor
minimize_layer_growth	no	yes, no	Flag to perform complete optimization during conformational search
minimize_final_pose	yes	yes, no	Flag to perform minimization of the final ligand pose
minimze_final_pose_rigid	no	yes, no	Flag to perform rigid minimization of the final pose
minimze_final_pose_rigid	no	yes, no	Flag to perform torsional minimization of the final pose
minimize_final_pose_complete	yes	yes, no	Flag to perform complete minimization of the final pose
minimizer_choice	0	0, 1	Chooses whether to use the Simplex (0) minimizer or none (1). This will allow other minimizers to be used in the future
initial_translation	1.0		Initial translation step size
initial_rotation	1.0		Initial rigid rotation step size
initial_torsion	10.0		Initial torsion angle step size
maximum_iterations	100		Maximum # of simplex iterations / cycle
maximum_function_calls	500		Maximum # of function calls / cycle
window_based_termination	no	yes, no	Flag to use the score window termination criteria
window_size	55		The width of the window (the # of iterations)
window_delta	1.0		The threshold energy for the scores in the window; when the highest score minus the lowest score is less than window_delta, the minimizer will terminate
scaled_range_termination	no	yes, no	Flag to use the scaled range termination criteria (the DOCK4 termination criteria)
scaled_range_fsize	0.0		The maximum score value to be considered in the scaled range calculation
scaled_range_tolerance	1.0		When the fraction (hi - low)/max(hi, fsize) is less than tolerance, the function terminates (where hi and low are the highest and lowest score values in the simplex)
multiple_simplex_cycles	no	yes, no	Flag to use multiple cycles of minimization
maximum_cycles	5		Maximum # of minimization cycles allowed
random_number_generator	0	0, 1	Choice of internal RNG (0) or system RNG (1)
random_number_seed	2002		Seed for RNG

Atom & Bond Typing Parameters

Parameter Name	Default Value	Legal Values	Description
atom_model	all	all, united	Choice of all atom or united atom models
vdw_defn_file	vdw.defn		File containing vdw parameters for atom types
flex_defn_file	flex.defn		File containing bond definition parameters
flex_drive_file	flex_drive.tbl		File containing conformational search parameters
calc_internal_energy	no	yes, no	Flag to calculate the internal energy (only used during the anchor & grow)
internal_energy_att_exp	6		L-J attractive exponent
internal_energy_rep_exp	12		L-J repulsive exponent
internal_energy_dielectric	4.0		Dielectric value for Coulombic calculation