|
0-bond paths: | C | O | N |
1-bond paths: | OC | C=C | CN |
2-bond paths: | OC=C | C=CN | |
3-bond paths: | OC=CN |
The list of patterns produced is exhaustive: Every pattern in the molecule, up to the pathlength limit, is generated. For all practical purposes, the number of patterns one might encounter by this exhaustive search is infinite, but the number produced for any particular molecule can be easily handled by a computer.
Because there is no pre-defined set of patterns, and because the number of possible patterns is so huge, it is not possible to assign a particular bit to each pattern as we did with structural keys. Instead, each pattern serves as a seed to a pseudo-random number generator (it is "hashed"), the output of which is a set of bits (typically 4 or 5 bits per pattern); the set of bits thus produced is added (with a logical OR) to the fingerprint.
Note that because each set of bits is produced by a pseudo-random generator, it is likely that sets will overlap. For example, suppose we are in the middle of generating a fingerprint, and it happens that 1/4 of the bits are already set. If the next pattern generates a set containing 5 bits, the probability that all 5 bits will be unique is (3/4)5, or about 24%. Likewise, the probability that all 5 bits will not be unique are (1/4)5, or about 0.1%.
In spite of the difference between the meaning of a fingerprint's bits and a structural key's bits, fingerprints share an important feature with structural keys: If a pattern is a substructure of a molecule, every bit that is set in the pattern's fingerprint will be set in the molecule's fingerprint. This means that, like structural keys, we can use simple boolean operations on fingerprints to screen molecules as we search a database, making a fingerprint comparison an extremely fast screen for substructure searching.
The best way to think of the bits of a fingerprint is as "shared" among an unknown but very large number of patterns. Each pattern generates its particular set of bits; so long as at least one of those bits is unique (not shared with any other pattern present in the molecule), we can tell if the pattern is present or not. A structural key indicates with certainty that a particular pattern is present or absent. Fingerprints are not so definite: if a fingerprint indicates a pattern is missing then it certainly is, but it can only indicate a pattern's presence with some probability. Although a fingerprint doesn't indicate with 100% certainty that a particular pattern is present, it contains far more patterns total than a structural key, the net result being that a fingerprint is a far better screen than a structural key in almost all situations.
Fingerprints have several advantages over structural keys:
The next evolutionary step in screening was the concept of folding a fingerprint to increase information density.
In the discussion above we mentioned the sparseness of a fingerprint, which is directly related to its information density. A fingerprint's information density can be thought of as the ratio how much information it actually holds to how much it could hold. As a practical definition, we measure the bit density, the ratio of "on" bits to the total number of bits (e.g. the bit density of "11000000" is 0.25).
Fingerprints for small molecules and "featureless" molecules (such as CH4 or C40H82) have less information in them than those for large or "rich" molecules. But the fingerprinting mechanisms discussed so far require a fixed fingerprint size for all molecules. If we choose to use small fingerprints, the fingerprint of large or complex molecules will be "black" - nearly all ones - and will not discriminate well (there is more information than the fingerprint can hold). On the other hand, if we use very large fingerprints, most molecules' fingerprints will be "white" - nearly all zeros - and will waste space. In both cases we have low information density; the "black" fingerprint because it is too dense and the "white" one because it is too sparse.
Ideally, we would like to choose a particular discriminatory power (e.g. "For a typical pattern, 98% of the database is screened out by the pattern's fingerprint") and compute the fingerprint density needed to achieve that discriminatory power on a case-by-case basis. Although we can not actually do this, the process of "folding" achieves nearly the same performance and size as the "ideal" case.
The folding process begins with a fixed fingerprint size that is quite large - large enough to accurately represent any molecule we expect to encounter. The fingerprint is then folded: we divide it into two equal halves then combine the two halves using a logical OR. The result is a shorter fingerprint with a higher bit density. We can repeatedly fold the fingerprint until the desired information density (called the minimum density) is reached or exceeded.
As long as two fingerprints are the same size (even if created with different sizes), they are compatible. To see why, consider the fingerprints of a pattern P and a molecule M. If the screen is initially positive (e.g. all bits in the P's fingerprints are also in M's) then the same will be true after folding. On the other hand, a negative screen (at least one bit in P's fingerprint is not in M's) might be converted to a positive screen after folding. But this is ok - converting "correct negative" to a "false positive" doesn't violate the rules of screening: a screen is only required to say P «in» M with 100% reliability. With each fold, we increase the chances of a false positive but save half of the space needed to store the fingerprint.
Fingerprint folding allows us to optimize the information density in a set of fingerprints, thus optimizing screening speed. Rather than choosing one fingerprint size for the entire database, we choose the size of each molecule's fingerprint individually, according to the complexity of the molecule and the desired success rate of the screening process. In most real databases, optimizing the information density greatly reduces the amount of data stored, and increases the screening speed correspondingly.
Although there is no reason a structural key can't reside in memory rather than on disk, they tend to be considerably larger than fingerprints, too large to fit a reasonable number of them into memory. This was especially true when structural keys were first developed - memory was several orders of magnitude more expensive than it is now, and most computers had precious little of it. Thus, structural-key screening has almost always been done as a disk-based screening technique.
Computer memories are roughly 105 times faster than disks. If one can read the bitmaps of a structural key or fingerprint into memory and search it there, the speed of the screen itself increases a corresponding amount. About the time fingerprints were developed, computer memory prices reached a point where the fingerprints from a relatively large database all could be loaded into memory.
With the advent of in-memory fingerprint-based screening, exploratory data analysis takes on a completely new aspect. One can quite literally "explore" a database, trying different searches, refining them, taking side-tracks to see where they would lead, and so on.
The evolution of the above-described screening techniques has now reached the point where the SMILES and fingerprints of all chemicals known in the world (roughly 10-15 million structures) can fit in the memory of a large workstation-class computer; it doesn't even take a "mainframe" computer, much less a supercomputer. An ordinary database of tens or hundreds of thousands of molecules can easily fit into the memory of today's run-of-the-mill workstations. Speed increases are correspondingly dramatic: an ordinary workstation can screen 100,000 to 1,000,000 structures per second using in-memory folded fingerprints.
The number of molecules that can be considered in a single search has also grown dramatically. It is now actually possible to search all known chemicals in the world in a reasonable time (that is, if you can get your hands on such a database). Most users, using a corporate or academic database of less than a million molecules, will be able to search their database in a few seconds - the time it takes to answer the question is now much shorter than the time it takes to think of the question, even on relatively large databases.
There are two distinct types of fingerprints which the Daylight system provides for reaction processing. First is the "normal" structural fingerprint, which is useful as a superstructure search screen. Second is the "difference" fingerprint, which is useful as a metric for the bond changes which occur during a reaction.
The structural reaction fingerprint is nothing more than the combination of the normal structural fingerprints for the reactant and product molecules within the reaction. The combination is the bitwise-OR of the following fingerprints:
Once the structural reaction fingerprint has been generated, all of the normal fingerprint operations (folding, similarity, substructure screen) apply.
The difference fingerprint is a new type of fingerprint specifically tailored for reaction processing. If one has a reaction which is stoichiometric (all atoms on the reactant side appear on the product side), then the difference in the fingerprint of the reactant molecules and the fingerprint of the product molecules will reflect the bond changes which occur during the reaction.
On the surface, one could perform an "exclusive-OR" operation between the reactant and product fingerprint. Unfortunately, since the molecule fingerprints do not keep track of the count of paths many of the relevant bonds get masked out during the "XOR" operation. What is required is to keep track of the count of each path in the reactant and product and then subtract the counts of a given path. If the difference in count is non-zero, then the path is used to set a bit in the difference fingerprint. If the difference in count is zero, then no bit is set for that path in the difference fingerprint.
For example, consider our well-worn Sn2 displacement reaction:
The paths generated for the molecules would be as follows:
Enumerated Fingerprint Paths: | ||
Path Length: | Reactant (count/path): | Product (count/path): |
0 | 1 I, 1 Na, 3 C, 1 Br | 1 I, 1 Na, 3 C, 1 Br |
1 | 1 C=C, 1 C-C, 1 C-Br | 1 C=C, 1 C-C, 1 C-I |
2 | 1 C=C-C, 1 C-C-Br | 1 C=C-C, 1 C-C-I |
3 | 1 C=C-C-Br | 1 C=C-C-I |
Difference in Path Counts: | |
Path Length: | Difference (count/path): |
0 | 0 I, 0 Na, 0 C, 0 Br |
1 | 0 C=C, 0 C-C, 1 C-Br, 1 C-I |
2 | 0 C=C-C, 1 C-C-Br, 1 C-C-I |
3 | 1 C=C-C-Br, 1 C=C-C-I |
After generating the difference in counts, we only use the six paths with non-zero differences to set bits in the difference fingerprint. These are the paths which walk through bonds that change during the reaction. By considering only these paths, we get a fingerprint which reflects the overall bond changes in the reaction.
Once generated, the difference fingerprint is a normal fingerprint and can be folded, used for similarity measurements and clustering. The difference fingerprint may not be used as a substructure screen for any types of searching, since it does not obey the strict subset relationship required for screening. In the above example, note that the "[Na+]" atom does not appear in any of the difference fingerprint paths. Thus, any search query which contained Sodium-containing paths would not pass the screening step, even if a valid query match.
One often wishes to know how similar one molecule is to another, independent of whether either is a substructure of the other. There are many ways one might choose to measure such similarity; for example a chemical approach might rank them according to the number of physical properties and reactions they share, whereas a mathematical approach might rank them according to their similarity in three dimensions.
The Daylight fingerprints described above effectively encode the substructures present in a molecule. It would not seem unreasonable that the proportion of substructures in common between two molecules should be a reasonable measure of similarity of the overall molecules. In mathematical terms this is a comparison of the bits in the fingerprints which are set on.
Note that the similarity measures are independent of the molecular feature descriptors. Fingerprints generated from structural keys or any other method can be used in the Daylight toolkits. The only requirements are that the size is a power of 2 and a multiple of 8. This ensures folding and translation to ascii representation works correctly. The current size limitations are 32 to 230
If we describe our molecules by the presence or absence of features, then the binary association coefficients or similarity measures are based on the four terms a, b, c, d shown in the two way table.
OBJECT B | ||||
---|---|---|---|---|
0 | 1 | Totals | ||
OBJECT A | 0 | d | b | b+ d |
1 | a | c | a + c = A | |
Totals | a + d | b + c = B | n |
Where:
a is the count of bits on in object A but not in object B.
b is the count of bits on in object B but not in object A.
c is the count of the bits on in both object A and object B.
d is the count of the bits off in both object A and object B.
In addition:
n = ( a + b + c + d )
A = ( a + c )
B = ( b + c )
Where:
n is the total number of bits on or off in objects A
or B.
A is the count of the bits on in object A.
B is the count of the bits on in object B.
N.B. This nomenclature differs from that used by others, in particular the Sheffield group. In their system the labels a and c are reversed.
Pre 4.5 releases of the Daylight software provided an example of each of
these classes.
As both numerator and denominator are functions of d, E belongs to the first class of association coefficients dependent on the double zeros.
As we do not take the square root, given mostly we are only interested in ranking ; it is simply a matching coefficient (Sokal, R.R., Michener, C.D., (1958) The University of Kansas Scientific Bulletin 38, 1409-1438). In reality the value returned for this coefficient is the complement of this, i.e. the Hamming distance, or the Total Difference Coefficient (Sneath, P.H.A., (1968) Journal of General Microbiology 54, 1-11)T is independent of d, i.e. T belongs to the second class of association coefficients defined above. It may be regarded as the proportion of the "on-bits" which are shared. See Tanimoto, T.T. (1957) IBM Internal Report 17th Nov see also Jaccard, P. (1901) Bulletin del la Société Vaudoisedes Sciences Naturelles 37, 241-272 .
N.B. These association coefficients are not necessarily true metrics. In particular those where the divisor depends on the particular pairwise comparison i.e. is not equal to n, may violate the triangle inequality. See Anderberg, MR (1973) Cluster Analysis For Applications, Academic Press p 117, for further discussion.
Over the years there has been much discussion as to which type of coefficient to use. In chemistry it has generally been thought that, as most descriptor features are absent in most molecules, i.e. the bit string descriptors such as the Daylight fingerprint contains mainly zeros, coefficients such as the Tanimoto are more appropriate. Further, given that the size of a Daylight fingerprint can be arbitrarily doubled, thereby adding mainly random off bits, any measure using matching off-bits, d, would be inappropriate. However this may not be the case for fixed width key based fingerprints. Daylight therefore offers access to both types of measure. The user must ensure that an appropriate one is chosen.
In version 4.5 Daylight extended the range of coefficients which could be used by introducing the asymmetric Tversky index. This allowed users to make use of directional similarity and harness the power of the concepts of prototypes in similarity searching.
As has been indicated above all of these indices are not monotonic, and as early as 1982 Hubalek (see Hubalek, Z. Biol. Rev. (1982) 57, 669-689) showed that the coefficients could be clustered on the ranking of a given set of objects.
Recently Holliday et al ( Holliday, JD., Hu, C-Y. and Willett, P. (2002) Combinatorial Chemistry and High Throughput Screening 5, 155-166 ) have shown that a whole range of similarity measures can be clustered on the ranking of chemical structures.
With the release of version 4.9 therefore, we have introduced a whole range of named similarity measures and additionally allowed users to construct their own, from the terms a, b, c, d as appropriate.
As of Release 4.51 the Tversky index (Tversky, A. Psychological
Reviews (1977)84 (4) 327-352), provides a more powerful method for
structural similarity searching. Its use and interpretation, however, are
not not as simple as the Tanimoto index. The Tversky comparison is
intrinsically asymmetric. As with Tanimoto the features present in two
objects are compared. In the Tversky approach we have the concept of a "prototype"
to which the objects or "variants" are compared. Note
this differs from the Tanimoto index in which the similarity between
two objects is estimated. This inherent asymmetry means that the Tversky
index is very definitely not a metric. The ratio model in which
the value is bounded (between 0 and 1) is defined as follows:
c/(α* a + β* b + c)
Setting the weighting of prototype features to the same value does not use the power of this index. Indeed, setting α = β= 1, produces the Tanimoto index. If α = β= 0.5 we get the Dice index. See below. The value of the index comes from setting the weighting of prototype and variant features asymmetrically, producing a similarity measure in a more-substructural or more-superstructural sense or reflecting the increased knowledge the user has about the prototype. Quite often one is looking for compounds like a known compound with appropriate properties.
Setting the weighting of prototype features to 100% (α=1) and variant features to 0% (β=0) means that only the prototype features are important, i.e., this produces a "superstucture-likeness" measure. In this case, a Tversky similarity value of 1.0 means that all prototype features are represented in the variant, 0.0 that none are.
Conversely, setting the weights to 0% prototype (α=0) / 100% variant (β=1) produces a "substructure-likeness" measure, where completely embedded structures have a 1.0 value and "near-substructures" have values near 1.0. Note: with no weight at all given to variant features, this measure is pretty sensitive to "noise" in Daylight fingerprints and settings of 90%/10% generally produce a more useful ranking.
Tversky measures where the two weightings add up to 100% (1.0) are of special interest. In XVMerlin the Tversky search query panel provides a "Sum 100%" checkbox which, when selected, forces the two weights to add up to 100%.
Advanced users may wish to experiment with Tversky indices where weightings are not limited to 100%. Weightings greater than 100% causes the distinguishing features a, b to be emphasized more than common features, c, which may be useful in analysis of diversity or dissimilarity.
With the 4.9 release, users are able to use their own similarity measure throughout the software, where before they have been restricted to the hardcoded measures described above. The measure is entered as a string representing a f( a, b, c, d) with all of the usual mathematical expressions available. There is no restriction on the form of the function, but to work as a measure of similarity it should fulfill certain requirements. Expressions such as:
(c + d) /( a + b + c )
2.0*(c +d)/( a + b + 2.0*c )
(c + d)/( 2.0* ( a + b ) + c )
are not useful, even though the user has attempted to alter the weights of the matched/unmatched pairs. It is nonsense to include in the numerator that which has been specifically excluded in the denominator (Anderberg, MR (1973) Cluster Analysis For Applications, Academic Press p. 89).
Users should also be cognizant of the restrictions imposed by the descriptors being compared. Daylight fingerprints can be arbitrarily doubled in size, as described above. The value of d, can be increased enormously without scaled increases in the information rich on-bits. It is not recommended therefore that indexes based on a function which uses the value of d, are used with Daylight type fingerprints. In the case of fingerprints based on structure keys there may be no such restrictions.
When doing comparisons users should also be aware of the range of possible values the index can take. Most have the range 0.0, 1.0 but this is not mandatory.
The most common useful indexes have been collected by Holliday et al (Holliday, JD., Hu, C-Y. and Willett, P. (2002) Combinatorial Chemistry and High Throughput Screening 5, 155-166) These are shown in the table, and can be referred to, by name, in applications and toolkits calls which allow user defined similarity functions.
Measure | Range | Formula |
Cosine | 0.0,1.0 | |
Dice | 0.0,1.0 | |
Euclid | 0.0,1.0 | |
Forbes | 0.0,∞ | |
Hamman | -1.0,1.0 | |
Jaccard | 0.0,1.0 | |
Kulczynski | 0.0,1.0 | |
Manhattan | 1.0,0.0 | |
Matching | 0.0,1.0 | |
Pearson | -1.0,1.0 | |
Rogers-Tanimoto | 0.0,1.0 | |
Russell-Rao | 0.0,1.0 | |
Simpson | 0.0,1.0 | |
Tanimoto | 0.0,1.0 | |
Yule | -1.0,1.0 |
Notes
- The Tanimoto and Jaccard indexes are the same.
- The Forbes index has no upper limit.
- The Manhattan index is a distance = 1.0 - Matching index
- The Kulczynski index is the mean of the individual substructure similarities
- The Simpson index is the best of the individual substructure similarities
- The Dice index is the ratio of the bits in common to the arithmetic mean of the number of on bits in the two items.
- The Cosine index is the ration of the bits in common to the geometric mean of the number of on bits in the two items.