Class ContactHistogramKernel
BaseKernel.BaseKernel --+
|
ContactHistogramKernel
Calculates a similarity score for molecules by comparing histograms of
their contact lengths. Similar to BondHistogramKernel, except that
rather than only tracking lengths between bonded atoms, this considers
the full contact map of distances between every pair of atoms, bonded or
not.
The histogram counts the atom pairs whose distance falls within each
successive interval of width "binWidth." For example, with a binWidth of
0.1, it counts all pairs with distance in [0.0,0.1), then all those with
distance in [0.1,0.2), then [0.2,0.3), etc. until all pairs are accounted for.
This yields a vector / histogram of counts. A dot product can then be
computed across these vectors from two molecules to essentially count up
the number of common pair distances in the two.
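The binning and dot product described above can be sketched as follows. This is a minimal illustration, not the class's actual code: the helper names build_histogram and dot_product are hypothetical, and atom pair types are ignored here so that only the distance binning is shown.

```python
import math
from collections import Counter


def build_histogram(distances, bin_width):
    """Count distances per bin index, storing only the non-empty bins."""
    return Counter(math.floor(d / bin_width) for d in distances)


def dot_product(hist1, hist2):
    """Dot product of two sparse histograms held as dictionaries."""
    return sum(count * hist2[key] for key, count in hist1.items() if key in hist2)


# Hypothetical pair distances for two small molecules.
hist_a = build_histogram([1.32, 1.38, 2.41], bin_width=0.1)
hist_b = build_histogram([1.35, 2.47, 3.05], bin_width=0.1)
score = dot_product(hist_a, hist_b)
```

Bins 13 and 24 are populated in both histograms, so only those two bins contribute to the score.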
Rather than a dot product, the similarity can instead be taken from
the vectors by calculating the similarity as e^(-d/T) where d is the
Euclidean distance (= RMSD from 0.0) between the two vectors to compare
and T is some temperature scaling factor. If T is not specified, then it
defaults to the largest d value found among the data set.
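The Euclidean comparison can be sketched as below, assuming the histograms are sparse dictionaries in which a missing key counts as 0. The function names are hypothetical and the scaling factor T is passed in explicitly.

```python
import math


def euclidean_distance(hist1, hist2):
    """Euclidean distance between two sparse vectors stored as dictionaries."""
    keys = set(hist1) | set(hist2)
    return math.sqrt(sum((hist1.get(k, 0.0) - hist2.get(k, 0.0)) ** 2 for k in keys))


def euclidean_similarity(hist1, hist2, scaling_factor):
    """Similarity e^(-d/T), which is 1.0 for identical histograms."""
    d = euclidean_distance(hist1, hist2)
    return math.exp(-d / scaling_factor)


h1 = {13: 2, 24: 1}
h2 = {13: 1, 24: 1, 30: 1}
same = euclidean_similarity(h1, h1, scaling_factor=1.0)
different = euclidean_similarity(h1, h2, scaling_factor=math.sqrt(2.0))
```

A larger T flattens the scores toward 1.0, while a smaller T penalizes differences more sharply.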
Note that such a feature vector / histogram would be very sparse,
mostly counts of 0, so a dictionary is built of only the non-zero count
values instead.
Only heavy (non-hydrogen) atoms will be considered. Atom pairs will
be considered different by their atom types. For example, a CO pair is
considered different from a CN pair (though the same as an OC pair).
Conceptually, you can think of a separate histogram being generated for
every different possible atom pair type.
An atom cannot pair with itself; otherwise the score would be
dominated by zero-distance pairs. Each pair should be counted only
once; the mirror case does not count. For example, count pair
(a,b) once, but pair (b,a) should not count a second time.
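One way to satisfy these pairing rules is to enumerate unordered combinations of the heavy atoms, as in this sketch. The (symbol, index) records are a hypothetical stand-in for real molecule atoms; the class itself works on OEMolBase objects.

```python
from itertools import combinations

# Hypothetical atom records: (element symbol, atom index).
atoms = [("C", 0), ("O", 1), ("H", 2), ("N", 3)]

# Keep only heavy (non-hydrogen) atoms.
heavy_atoms = [atom for atom in atoms if atom[0] != "H"]

# combinations() never pairs an atom with itself and yields each
# unordered pair exactly once, so (a, b) appears but (b, a) does not.
pairs = list(combinations(heavy_atoms, 2))
```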
|
__init__(self,
binWidth,
comparisonMode=Const.EUCLIDEAN,
normalize=False) |
|
|
|
setScalingFactor(self,
scalingFactor)
Set the scaling factor used for Euclidean distance measures. |
|
|
|
getScalingFactor(self)
If self.scalingFactor is defined (not None), just return
it. |
|
|
|
setAtomPairWeight(self,
atomSymbol1,
atomSymbol2,
weight)
When counting up the atom pair distances for the
histograms this method allows for different atom pair
types to contribute different weights in the count. |
|
|
|
parseAtomPairWeightSpecification(self,
specString)
Given a specification string for the atom pair weightings, parse
out the actual values and set them to the object's
atomPairWeightDict. |
|
|
|
similarity(self,
obj1,
obj2)
Primary abstract method that, given two objects, should return an
appropriate, non-negative similarity score between the two. |
|
|
|
buildFeatureDictionary(self,
mol)
Given an OEMolBase molecule object, calculate the (contact)
distance between every atom pair. |
|
|
|
acceptedAtomPair(self,
atom1,
atom2)
Screen out atom pairs not desired for calculation. |
|
|
|
buildFeatureKey(self,
mol,
atom1,
atom2)
Create a unique key based on the atom types and the distance
between them. |
|
|
|
atomDistance(self,
mol,
atom1,
atom2)
Returns the distance between two atoms. |
|
|
Inherited from BaseKernel.BaseKernel :
dictionaryDotProduct ,
dictionaryEuclideanDistanceSquared ,
ensureListCapacity ,
getFeatureDictionary ,
normalizeFeatureDictionary ,
outputMatrix ,
prepareFeatureDictionaryList
|
|
binWidth = -1.0
|
|
normalize = False
Whether to normalize the feature dictionaries
|
|
comparisonMode = <CHEM.DB.rdb.search.NameRxnPatternMatchingMod...
Comparison mode.
|
|
scalingFactor = <CHEM.DB.rdb.search.NameRxnPatternMatchingMode...
Scaling factor used to modify EUCLIDEAN distance measure
|
|
atomPairWeightDict = <CHEM.DB.rdb.search.NameRxnPatternMatchin...
Dictionary whose items are keyed by an object representing a pair
of atom types.
|
Inherited from BaseKernel.BaseKernel :
featureDictList ,
objIndex1 ,
objIndex2
|
setScalingFactor(self,
scalingFactor)
|
|
Set the scaling factor used for Euclidean distance measures. If no
value is set, will default to a calculation of the largest distance value
in the data set.
|
getScalingFactor(self)
|
|
If self.scalingFactor is defined (not None), just return it. If
not, calculate the Euclidean distance for every pair of feature
dictionaries in self.featureDictList and take the largest one. If
self.featureDictList is None or has fewer than 2 items, just use a value
of 1.0.
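The fallback logic described above might look like the following sketch. It assumes the feature dictionaries are plain Python dicts, and default_scaling_factor is a hypothetical free function rather than the actual method body.

```python
import math
from itertools import combinations


def euclidean_distance(dict1, dict2):
    """Euclidean distance between two sparse vectors stored as dictionaries."""
    keys = set(dict1) | set(dict2)
    return math.sqrt(sum((dict1.get(k, 0.0) - dict2.get(k, 0.0)) ** 2 for k in keys))


def default_scaling_factor(feature_dicts):
    """Largest pairwise distance in the data set, or 1.0 if it cannot be computed."""
    if feature_dicts is None or len(feature_dicts) < 2:
        return 1.0
    return max(euclidean_distance(a, b) for a, b in combinations(feature_dicts, 2))


factor = default_scaling_factor([{1: 3.0}, {1: 0.0}, {2: 4.0}])
```

In this sketch a data set of identical dictionaries would yield 0.0, which a caller would need to guard against before dividing by the factor.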
|
setAtomPairWeight(self,
atomSymbol1,
atomSymbol2,
weight)
|
|
When counting up the atom pair distances for the
histograms this method allows for different atom pair
types to contribute different weights in the count.
- atomSymbol1: String for the atomic symbol of one atom of the pair
(e.g. "C" for carbon, "Cl" for chlorine).
- atomSymbol2: Atomic symbol for the other atom of the pair.
- weight: Weight by which to multiply the histogram counts for any
atom pairs encountered whose types match the two symbols specified.
Any atom pairs encountered that are not given a special weight via
this method will be assumed to be weighted by 1.0 (unweighted).
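The weighting scheme can be illustrated with a dictionary keyed by a tuple of atomic numbers, with unspecified pairs falling back to 1.0. This is a sketch only: ATOMIC_NUMBERS is a small illustrative subset, and pair_key and pair_weight are hypothetical helpers.

```python
# Small illustrative subset of the periodic table (assumption for this sketch).
ATOMIC_NUMBERS = {"C": 6, "N": 7, "O": 8, "Cl": 17}


def pair_key(symbol1, symbol2):
    """Order-independent key: sorted tuple of the two atomic numbers."""
    return tuple(sorted((ATOMIC_NUMBERS[symbol1], ATOMIC_NUMBERS[symbol2])))


atom_pair_weights = {}
atom_pair_weights[pair_key("C", "Cl")] = 2.0  # carbon-chlorine pairs count double


def pair_weight(symbol1, symbol2):
    """Weight for a pair type; unspecified pairs are unweighted (1.0)."""
    return atom_pair_weights.get(pair_key(symbol1, symbol2), 1.0)
```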
|
parseAtomPairWeightSpecification(self,
specString)
|
|
Given a specification string for the atom pair weightings, parse out
the actual values and set them to the object's atomPairWeightDict.
Specification string is expected to be in the form
'atomSymbol1:atomSymbol2:weight,'. For example, to specify carbon-carbon
pairs as having half the default weight of 1.0 and carbon-oxygen pairs as
having twice the default weight, provide a specification string of
'C:C:0.5,C:O:2.0'
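A minimal parser for this specification format could look like the sketch below. It keys the result by sorted symbol pairs rather than atomic numbers to stay self-contained, and parse_atom_pair_weights is a hypothetical name, not the method itself.

```python
def parse_atom_pair_weights(spec_string):
    """Parse 'sym1:sym2:weight,...' into a dict keyed by the sorted symbol pair."""
    weights = {}
    for entry in spec_string.split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate a trailing comma
        symbol1, symbol2, weight = entry.split(":")
        weights[tuple(sorted((symbol1, symbol2)))] = float(weight)
    return weights


weights = parse_atom_pair_weights("C:C:0.5,C:O:2.0")
```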
|
similarity(self,
obj1,
obj2)
|
|
Primary abstract method that, given two objects, should return an
appropriate, non-negative similarity score between the two. It is up to
the implementing class to define what this is.
- Overrides:
BaseKernel.BaseKernel.similarity
- (inherited documentation)
|
buildFeatureDictionary(self,
mol)
|
|
Given an OEMolBase molecule object, calculate the (contact) distance
between every atom pair.
Create a dictionary keyed by the atom pair type (just based on the
combination of atoms) and the histogram bin index that the found pairs
should be placed in. The dictionary will have values equal to the number
of pairs with a distance that fits into that bin. The bin index is the
number of whole binWidths that fit into the pair distance, i.e.
floor(distance / binWidth).
For example, if binWidth = 0.1 then a CO atom pair at a distance of
1.32 will be placed under the CO bin index 13. Alternatively, you could
say that bin index 13 of the CO histogram contains a count for all CO
pairs with distance in [1.3,1.4).
Finally, normalize the feature dictionary as a vector to have a
"length" of 1.0, by dividing all elements by the magnitude of
the vector / dictionary.
- Overrides:
BaseKernel.BaseKernel.buildFeatureDictionary
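The bin index calculation and the final normalization step can be sketched as follows, assuming the feature dictionary is a plain dict of counts. The names bin_index and normalize are hypothetical helpers, and the counts are made-up values.

```python
import math


def bin_index(distance, bin_width):
    """Number of whole bin widths that fit into the distance."""
    return int(math.floor(distance / bin_width))


def normalize(feature_dict):
    """Scale the values so the dictionary, viewed as a vector, has length 1.0."""
    magnitude = math.sqrt(sum(v * v for v in feature_dict.values()))
    if magnitude == 0.0:
        return dict(feature_dict)  # nothing to scale in an all-zero vector
    return {key: value / magnitude for key, value in feature_dict.items()}


features = {
    ("C", "O", bin_index(1.32, 0.1)): 3.0,
    ("C", "C", bin_index(1.54, 0.1)): 4.0,
}
unit = normalize(features)
```

Note that with floating-point division a distance lying exactly on a bin boundary can land in the lower bin: 0.3 / 0.1 evaluates to slightly less than 3, so this sketch places it in bin 2.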
|
acceptedAtomPair(self,
atom1,
atom2)
|
|
Screen out atom pairs not desired for calculation. Presently just
exclude hydrogen (non-heavy) atoms.
|
buildFeatureKey(self,
mol,
atom1,
atom2)
|
|
Create a unique key based on the atom types and the distance between
them. Importantly, the order in which the atoms appear should not
matter. For example, a CO pair should be the same as an OC pair.
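An order-independent key can be obtained by sorting the two atom types before combining them with the bin index. The sketch below uses element symbols and a hypothetical build_feature_key function that takes the distance directly; the actual key layout used by the class is not shown in this page.

```python
import math


def build_feature_key(symbol1, symbol2, distance, bin_width):
    """Key of (sorted atom types, bin index), identical for CO and OC pairs."""
    first, second = sorted((symbol1, symbol2))
    return (first, second, math.floor(distance / bin_width))


key_co = build_feature_key("C", "O", 1.32, 0.1)
key_oc = build_feature_key("O", "C", 1.32, 0.1)
```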
|
atomDistance(self,
mol,
atom1,
atom2)
|
|
Returns the distance between two atoms. Requires a reference to the
parent molecule to access coordinates.
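The distance itself is the ordinary Euclidean distance between the two atoms' coordinates. The sketch below assumes the coordinates have already been read into a hypothetical dictionary of (x, y, z) tuples; the real method obtains them from the parent molecule.

```python
import math

# Hypothetical coordinate table standing in for the parent molecule:
# atom index -> (x, y, z).
coords = {0: (0.0, 0.0, 0.0), 1: (1.0, 2.0, 2.0)}


def atom_distance(coords, atom1, atom2):
    """Euclidean distance between two atoms given their 3D coordinates."""
    return math.dist(coords[atom1], coords[atom2])


d = atom_distance(coords, 0, 1)
```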
|
normalize
Whether to normalize the feature dictionaries
|
comparisonMode
Comparison mode. Choose from options in Const module
|
scalingFactor
Scaling factor used to modify EUCLIDEAN distance measure
|
atomPairWeightDict
Dictionary whose items are keyed by an object representing a pair of
atom types. (Tuple containing the atomic numbers of two atoms. For
example, (6,17) to represent the atom pair of Carbon and Chlorine). See
setAtomPairWeight(...) for more information.
|