Class ContactHistogramKernel

BaseKernel.BaseKernel --+
                        |
                       ContactHistogramKernel

Calculates a similarity score for molecules by comparing histograms of their contact lengths. Similar to BondHistogramKernel, except that rather than only tracking lengths between bonded atoms, this considers the full contact map of distances between every pair of atoms, bonded or not.

The histogram counts all atom pairs whose distance falls within each successive interval of width binWidth. For example, with a binWidth of 0.1, it counts all pairs with distance in [0.0,0.1), then all those with distance in [0.1,0.2), then [0.2,0.3), etc., until all pairs are accounted for. This yields a vector / histogram of counts. A dot product can then be computed across the vectors from two molecules to essentially count up the number of common pair distances in the two.
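The binning and dot product described above can be sketched roughly as follows. This is a minimal illustration, not the class's actual code: the function names distance_histogram and sparse_dot are hypothetical, and the input is assumed to be a plain list of distances for a single atom pair type.

```python
from collections import Counter


def distance_histogram(distances, bin_width):
    """Count distances per bin; bin k covers [k*bin_width, (k+1)*bin_width)."""
    return Counter(int(d // bin_width) for d in distances)


def sparse_dot(hist1, hist2):
    """Dot product over the bins that are non-zero in both histograms."""
    return sum(count * hist2[b] for b, count in hist1.items() if b in hist2)


# Hypothetical pair distances for two molecules
hist_a = distance_histogram([1.32, 1.35, 2.41], 0.1)  # bins 13, 13, 24
hist_b = distance_histogram([1.38, 2.47, 3.05], 0.1)  # bins 13, 24, 30
score = sparse_dot(hist_a, hist_b)  # 2*1 (bin 13) + 1*1 (bin 24)
```

Only bins present in both histograms contribute, so the score grows with the number of pair distances the two molecules share.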

Instead of a dot product, the similarity can be computed as e^(-d/T), where d is the Euclidean distance (= RMSD from 0.0) between the two vectors being compared and T is a temperature scaling factor. If T is not specified, it defaults to the largest d value found in the data set.
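As a rough sketch of the e^(-d/T) form, assuming the histograms are stored as dictionaries mapping a bin key to a count (the names euclidean_distance and euclidean_similarity are hypothetical, not the class's methods):

```python
import math


def euclidean_distance(hist1, hist2):
    """Euclidean distance between two sparse vectors; missing keys count as 0."""
    keys = set(hist1) | set(hist2)
    return math.sqrt(sum((hist1.get(k, 0) - hist2.get(k, 0)) ** 2 for k in keys))


def euclidean_similarity(hist1, hist2, temperature):
    """Similarity e^(-d/T); identical vectors score 1.0."""
    return math.exp(-euclidean_distance(hist1, hist2) / temperature)


hist_a = {13: 2, 24: 1}
hist_b = {13: 1, 24: 1, 30: 1}
d = euclidean_distance(hist_a, hist_b)  # sqrt(1 + 0 + 1)
```

A larger T flattens the scores toward 1.0, while a smaller T makes the similarity fall off faster with distance.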

Note that such a feature vector / histogram would be very sparse, mostly counts of 0, so a dictionary is built of only the non-zero count values instead.

Only heavy (non-hydrogen) atoms are considered. Atom pairs are distinguished by their atom types. For example, a CO pair is considered different from a CN pair (though the same as an OC pair). Conceptually, you can think of a separate histogram being generated for every possible atom pair type.

An atom cannot pair with itself; otherwise the score would be dominated by zero-distance pairs. Each pair is counted only once, so the mirror case does not count. For example, pair (a,b) is counted once, and pair (b,a) is not counted a second time.
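The pairing rules above (heavy atoms only, no self-pairs, no mirrored duplicates) correspond to taking unordered combinations. A minimal sketch, assuming a molecule is represented simply as a list of element symbols (a stand-in for the real molecule object) and using the hypothetical name heavy_atom_pairs:

```python
from itertools import combinations


def heavy_atom_pairs(symbols):
    """Return index pairs (i, j) with i < j over non-hydrogen atoms only."""
    heavy = [(i, s) for i, s in enumerate(symbols) if s != "H"]
    return [(i, j) for (i, _), (j, _) in combinations(heavy, 2)]


# Four atoms, one of them hydrogen: three heavy atoms give 3 unique pairs
pairs = heavy_atom_pairs(["C", "O", "H", "N"])
```

With n heavy atoms this yields n(n-1)/2 pairs, each appearing exactly once.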

Instance Methods

__init__(self, binWidth, comparisonMode=Const.EUCLIDEAN, normalize=False)

setScalingFactor(self, scalingFactor)
Set the scaling factor used for Euclidean distance measures.

getScalingFactor(self)
If self.scalingFactor is defined (not None), just return it.

setAtomPairWeight(self, atomSymbol1, atomSymbol2, weight)
When counting up the atom pair distances for the histograms this method allows for different atom pair types to contribute different weights in the count.

parseAtomPairWeightSpecification(self, specString)
Given a specification string for the atom pair weightings, parse out the actual values and set them to the object's atomPairWeightDict.

similarity(self, obj1, obj2)
Primary abstract method: given two objects, return an appropriate, non-negative similarity score between the two.

buildFeatureDictionary(self, mol)
Given an OEMolBase molecule object, calculate the (contact) distance between every atom pair.

acceptedAtomPair(self, atom1, atom2)
Screen out atom pairs not desired for calculation.

buildFeatureKey(self, mol, atom1, atom2)
Create a unique key based on the atom types and the distance between them.

atomDistance(self, mol, atom1, atom2)
Returns the distance between two atoms.

Inherited from BaseKernel.BaseKernel: dictionaryDotProduct, dictionaryEuclideanDistanceSquared, ensureListCapacity, getFeatureDictionary, normalizeFeatureDictionary, outputMatrix, prepareFeatureDictionaryList

Class Variables

binWidth = -1.0

normalize = False
Whether to normalize the feature dictionaries

comparisonMode = <CHEM.DB.rdb.search.NameRxnPatternMatchingMod...
Comparison mode.

scalingFactor = <CHEM.DB.rdb.search.NameRxnPatternMatchingMode...
Scaling factor used to modify EUCLIDEAN distance measure

atomPairWeightDict = <CHEM.DB.rdb.search.NameRxnPatternMatchin...
Dictionary whose items are keyed by an object representing a pair of atom types.

Inherited from BaseKernel.BaseKernel: featureDictList, objIndex1, objIndex2

Method Details

setScalingFactor(self, scalingFactor)

Set the scaling factor used for Euclidean distance measures. If no value is set, it defaults to the largest distance value calculated across the data set.

getScalingFactor(self)

If self.scalingFactor is defined (not None), just return it. Otherwise, calculate the Euclidean distance for every pair of feature dictionaries in self.featureDictList and take the largest one. If self.featureDictList is None or has fewer than 2 items, use a value of 1.0.
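The fallback logic described above might look roughly like this. It is a sketch under assumptions: feature dictionaries are plain dicts of counts, and the names euclidean_distance and default_scaling_factor are hypothetical rather than the class's own.

```python
import math
from itertools import combinations


def euclidean_distance(dict1, dict2):
    """Euclidean distance between two sparse vectors; missing keys count as 0."""
    keys = set(dict1) | set(dict2)
    return math.sqrt(sum((dict1.get(k, 0) - dict2.get(k, 0)) ** 2 for k in keys))


def default_scaling_factor(feature_dicts):
    """Largest pairwise distance, or 1.0 when fewer than two dictionaries exist."""
    if feature_dicts is None or len(feature_dicts) < 2:
        return 1.0
    return max(euclidean_distance(d1, d2) for d1, d2 in combinations(feature_dicts, 2))


# Pairwise distances here are 3.0, 4.0 and 5.0
factor = default_scaling_factor([{"a": 1}, {"a": 4}, {"a": 1, "b": 4}])
```

Note that in this sketch a data set of identical dictionaries would give a factor of 0.0, which would need guarding before it is used as a divisor.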

setAtomPairWeight(self, atomSymbol1, atomSymbol2, weight)

When counting up the atom pair distances for the
histograms this method allows for different atom pair 
types to contribute different weights in the count.

atomSymbol1 is the string for the atomic symbol of one atom 
            of the pair (e.g. "C" for carbon, "Cl" for chlorine)
atomSymbol2 is the atomic symbol for the other atom of the pair
weight      Weight to multiply the histogram counts for any
            atom pairs encountered whose types match the
            two symbols specified.

Any atom pairs encountered that are not specified with some
special weight via this method will be assumed
to be weighted by 1.0 (unweighted).
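The effect of such weights on the counts can be illustrated with a small sketch. It assumes the weight dictionary is keyed by a sorted tuple of atomic symbols for readability (the class itself documents keys based on atomic numbers), and weighted_pair_counts is a hypothetical helper name.

```python
from collections import defaultdict


def weighted_pair_counts(pair_types, weights):
    """Sum a weight per encountered pair; unspecified pair types weigh 1.0."""
    counts = defaultdict(float)
    for pair_type in pair_types:
        key = tuple(sorted(pair_type))  # ("O", "C") and ("C", "O") share a key
        counts[key] += weights.get(key, 1.0)
    return dict(counts)


counts = weighted_pair_counts(
    [("C", "O"), ("O", "C"), ("C", "N")],
    {("C", "O"): 2.0},  # carbon-oxygen pairs count double
)
```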

parseAtomPairWeightSpecification(self, specString)

Given a specification string for the atom pair weightings, parse out the actual values and set them to the object's atomPairWeightDict.

The specification string is expected to be a comma-separated list of entries of the form 'atomSymbol1:atomSymbol2:weight'. For example, to specify carbon-carbon pairs as having half the default weight of 1.0 and carbon-oxygen pairs as having twice the default weight, provide a specification string of 'C:C:0.5,C:O:2.0'.
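Parsing a string in this format could be sketched as below. The function name parse_atom_pair_weights and the use of sorted symbol tuples as keys are assumptions made for illustration; the actual method stores its results in atomPairWeightDict.

```python
def parse_atom_pair_weights(spec_string):
    """Parse 'sym1:sym2:weight' entries separated by commas into a dict."""
    weights = {}
    for entry in spec_string.split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate a trailing comma
        symbol1, symbol2, weight = entry.split(":")
        weights[tuple(sorted((symbol1, symbol2)))] = float(weight)
    return weights


weights = parse_atom_pair_weights("C:C:0.5,C:O:2.0")
# Pair types that were not specified fall back to the default weight of 1.0
cn_weight = weights.get(("C", "N"), 1.0)
```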

similarity(self, obj1, obj2)

Primary abstract method: given two objects, return an appropriate, non-negative similarity score between the two. It is up to the implementing class to define what this is.

Overrides: BaseKernel.BaseKernel.similarity: (inherited documentation)

buildFeatureDictionary(self, mol)

Given an OEMolBase molecule object, calculate the (contact) distance between every atom pair.

Create a dictionary keyed by the atom pair type (just based on the combination of atoms) and the histogram bin index that the found pairs should be placed in. The dictionary values equal the number of pairs with a distance that fits into that bin. The bin index is the number of whole times binWidth fits into the pair distance, i.e. floor(distance / binWidth).

For example, if binWidth = 0.1 then a CO atom pair at a distance of 1.32 will be placed under the CO bin index 13. Alternatively, you could say that bin index 13 of the CO histogram contains a count for all CO pairs with distance in [1.3,1.4).
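Putting the keying and binning together, a simplified version of this step might look like the following. It is only a sketch: atoms are assumed to be (symbol, (x, y, z)) tuples rather than atoms of an OEMolBase object, the key layout (symbol, symbol, bin index) is an assumption, and math.dist requires Python 3.8 or later.

```python
import math
from collections import Counter
from itertools import combinations


def build_feature_dictionary(atoms, bin_width):
    """Count heavy-atom pairs per (pair type, distance bin) key."""
    features = Counter()
    heavy = [(symbol, xyz) for symbol, xyz in atoms if symbol != "H"]
    for (s1, xyz1), (s2, xyz2) in combinations(heavy, 2):
        first, second = sorted((s1, s2))  # CO and OC map to the same key
        bin_index = int(math.dist(xyz1, xyz2) // bin_width)
        features[(first, second, bin_index)] += 1
    return dict(features)


# Hypothetical coordinates: a C and an O 1.32 apart, plus an ignored hydrogen
atoms = [("O", (1.32, 0.0, 0.0)), ("C", (0.0, 0.0, 0.0)), ("H", (0.0, 1.0, 0.0))]
features = build_feature_dictionary(atoms, 0.1)
```

Distances that land exactly on a bin boundary can fall on either side because of floating-point rounding, so boundary cases deserve care in a real implementation.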

Finally, normalize the feature dictionary as a vector to have a "length" of 1.0, by dividing all elements by the magnitude of the vector / dictionary.
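The normalization amounts to dividing every count by the vector's Euclidean magnitude. A minimal sketch, with the hypothetical name normalize_feature_dictionary and an assumed choice to leave an all-zero or empty dictionary unchanged:

```python
import math


def normalize_feature_dictionary(features):
    """Scale the values so the vector has a Euclidean length of 1.0."""
    magnitude = math.sqrt(sum(v * v for v in features.values()))
    if magnitude == 0.0:
        return dict(features)  # assumption: nothing to scale, return a copy
    return {key: value / magnitude for key, value in features.items()}


# Magnitude is sqrt(3^2 + 4^2) = 5
unit = normalize_feature_dictionary({("C", "O", 13): 3, ("C", "C", 15): 4})
```

After normalization, the dot product of a feature dictionary with itself is 1.0, so large molecules with many pairs do not dominate the scores simply by size.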

Overrides: BaseKernel.BaseKernel.buildFeatureDictionary

acceptedAtomPair(self, atom1, atom2)

Screen out atom pairs not desired for calculation. Presently just exclude hydrogen (non-heavy) atoms.

buildFeatureKey(self, mol, atom1, atom2)

Create a unique key based on the atom types and the distance between them. Importantly, the order in which the atoms appear should not matter. For example, a CO pair should be the same as an OC pair.

atomDistance(self, mol, atom1, atom2)

Returns the distance between two atoms. Requires a reference to the parent molecule to access coordinates.

Class Variable Details

normalize

Whether to normalize the feature dictionaries

Value:

False

comparisonMode

Comparison mode. Choose from options in Const module

Value:

None

scalingFactor

Scaling factor used to modify EUCLIDEAN distance measure

Value:

None

atomPairWeightDict

Dictionary whose items are keyed by an object representing a pair of atom types. (Tuple containing the atomic numbers of two atoms. For example, (6,17) to represent the atom pair of Carbon and Chlorine). See setAtomPairWeight(...) for more information.

Value:

None