Class FeatureDictWriter

object --+    
         |    
      dict --+
             |
            FeatureDictWriter

Utility class that encodes data feature vectors (represented as string:count feature dictionaries) into a plain text file format that can be re-read with the matching "FeatureDictReader" class. The basic strategy is to identify every feature encountered among the data items and assign each a unique index number. This class, which extends dict, stores the feature:index mappings and then prints each data item as its corresponding index:count values.

Example usage. Note that this may fail as a doctest, because the feature:index mapping order is arbitrary: it depends on the traversal order of the feature keys in the feature dictionaries.

>>> from cStringIO import StringIO
>>> from CHEM.Kernel.SpectrumKernel import SpectrumKernel
>>> from CHEM.Kernel.Util.FeatureDictWriter import FeatureDictWriter
>>> dataList = ["asdfsdfg","asdfasdfASDF","dfghDFGH"]
>>> outfile = StringIO()
>>> kernel = SpectrumKernel(1)
>>> featureEnum = FeatureDictWriter(outfile)
>>> # Determine and output all of the feature:index mappings
>>> for item in dataList:
...     featureDict = kernel.buildFeatureDictionary(item)
...     for feature in featureDict.iterkeys():
...         success = featureEnum.add(feature)
...
>>> # Output the feature dictionaries in index:count text format
>>> for item in dataList:
...     featureDict = kernel.buildFeatureDictionary(item)
...     featureEnum.update(featureDict, item)
...
>>> print >> outfile, "<BLANKLINE>",            # Hack.  doctest doesn't like blank lines in expected output
>>> print outfile.getvalue().replace("\t"," ")  # Doesn't like tabs either
# 0 a
# 1 s
# 2 d
# 3 g
# 4 f
# 5 A
# 6 F
# 7 S
# 8 D
# 9 h
# 10 G
# 11 H
asdfsdfg 0:1 1:2 2:2 3:1 4:2 
asdfasdfASDF 0:2 1:2 2:2 4:2 5:1 6:1 7:1 8:1 
dfghDFGH 2:1 3:1 4:1 6:1 8:1 9:1 10:1 11:1 
<BLANKLINE>

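The expected output above implies a simple line format: mapping lines start with "#" followed by an index and a feature, and data rows start with a description followed by index:count pairs. As a rough illustration only, a reader for that format might look like the sketch below. This is not the FeatureDictReader implementation, whose API is not shown here. The function name parse_feature_text is hypothetical, the fields are assumed to be whitespace-separated, and the sketch is standalone Python 3 rather than the Python 2 dialect of the doctest.

```python
def parse_feature_text(text):
    """Decode the text format back into (description, {feature: count}) rows.

    Hypothetical helper, not part of the CHEM package.
    """
    index_to_feature = {}
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if fields[0] == "#":
            # Mapping line: "# <index> <feature>"
            index_to_feature[int(fields[1])] = fields[2]
        else:
            # Data row: "<description> <index>:<count> ..."
            feature_dict = {}
            for pair in fields[1:]:
                index, count = pair.split(":")
                feature_dict[index_to_feature[int(index)]] = int(count)
            rows.append((fields[0], feature_dict))
    return rows


sample = "# 0 a\n# 1 s\n# 2 d\nasd 0:1 1:1 2:1 \nadd 0:1 2:2 \n"
rows = parse_feature_text(sample)
```

Because the first field decides how a line is interpreted, a reader along these lines relies on every mapping line appearing before the rows that reference it.
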
Instance Methods

__init__(self, outfile=<CHEM.DB.rdb.search.NameRxnPatternMatchingModel.SearchSentence...)
Constructor.

__getitem__(self, feature)
Override dictionary access method "dict[key]"

add(self, feature)
Should be called for every possible feature in the dataset before actually trying to write the data to the output file with the "update" method.

__setitem__(self, feature, index)
Override dictionary set method "dict[key] = value"

new_key(self, feature, index)
Output the given feature:index mapping.

update(self, featureDict, description)
Output a specific feature dictionary to the text format.

Inherited from dict: __cmp__, __contains__, __delitem__, __eq__, __ge__, __getattribute__, __gt__, __hash__, __iter__, __le__, __len__, __lt__, __ne__, __new__, __repr__, clear, copy, fromkeys, get, has_key, items, iteritems, iterkeys, itervalues, keys, pop, popitem, setdefault, values

Inherited from object: __delattr__, __reduce__, __reduce_ex__, __setattr__, __str__

Properties

Inherited from object: __class__

Method Details

__init__(self, outfile=<CHEM.DB.rdb.search.NameRxnPatternMatchingModel.SearchSentence...)
(Constructor)

Constructor. Pass it the output file to write to (a file object, not a filename).

Returns:

new empty dictionary

Overrides: dict.__init__

__getitem__(self, feature)
(Indexing operator)

Override dictionary access method "dict[key]"

Overrides: dict.__getitem__

add(self, feature)

Should be called for every possible feature in the dataset before actually trying to write the data to the output file with the "update" method. This way the object can first assign index numbers to every feature.

The return value indicates whether the feature is new to the writer.
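
The index assignment behind "add" can be sketched as follows. This is a simplified stand-in, not the class's actual code: the name FeatureIndex is hypothetical, it writes nothing to a file, and it assumes that a new feature receives the number of features seen so far as its index, which is consistent with the 0 to 11 numbering in the example output. It is standalone Python 3.

```python
class FeatureIndex(dict):
    """Hypothetical stand-in that only models the index assignment."""

    def add(self, feature):
        """Assign the next free index to an unseen feature.

        Returns True if the feature was new, False if it was already known.
        """
        if feature in self:
            return False
        # Assumption: indices run 0, 1, 2, ... in order of first sight
        self[feature] = len(self)
        return True


index = FeatureIndex()
results = [index.add(feature) for feature in "asda"]
```

Here the second "a" is reported as already known and keeps index 0.
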

__setitem__(self, feature, index)
(Index assignment operator)

Override dictionary set method "dict[key] = value"

Overrides: dict.__setitem__

new_key(self, feature, index)

Output the given feature:index mapping. Automatically invoked by calls to the "__setitem__" and "add" methods for newly encountered features.
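
One way to picture this hook is a dict subclass whose index assignment operator emits a mapping line the first time a key is set. The sketch below is an assumption about the mechanism, not the actual source: the class name MappingWriter is hypothetical, and the tab-separated "#", index, feature layout is inferred from the example output and its remark about tabs. It is standalone Python 3.

```python
import io


class MappingWriter(dict):
    """Hypothetical sketch: emit a mapping line the first time a key is set."""

    def __init__(self, outfile):
        dict.__init__(self)
        self.outfile = outfile

    def __setitem__(self, feature, index):
        is_new = feature not in self
        dict.__setitem__(self, feature, index)
        if is_new:
            self.new_key(feature, index)

    def new_key(self, feature, index):
        # Assumed layout: "#", index and feature separated by tabs
        self.outfile.write("#\t%d\t%s\n" % (index, feature))


out = io.StringIO()
writer = MappingWriter(out)
writer["a"] = 0
writer["s"] = 1
writer["a"] = 0  # already known, so no second mapping line is written
text = out.getvalue()
```
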

update(self, featureDict, description)

Output a specific feature dictionary to the text format.

Preferably, call the "add" method on every possible feature before calling this method, so that the feature:index mappings are all output at the beginning of the file instead of intermixed with the data. If that has not been done, this method will try to do so automatically. Either way, each feature:index mapping is guaranteed to appear before any data row references it.

The provided description is printed first on each data row. It is important that this description NOT:

be empty or
contain any whitespace or
equal the FEATURE_PREFIX "#"

Otherwise the later "decoding" steps will be confused.

Preferably the description should be some kind of data-identifying string, but uniqueness is not enforced.

Returns:

None

Overrides: dict.update
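
The row format and the description restrictions can be illustrated with a small standalone function. This is a sketch under stated assumptions, not the method's real body: write_row is a hypothetical name, the documentation does not say whether the real class checks the description (raising ValueError is an assumption made here), pairs are assumed to be space-separated and sorted by index as in the example output, and every feature is assumed to be indexed already. It is standalone Python 3.

```python
import io

FEATURE_PREFIX = "#"


def write_row(outfile, feature_index, feature_dict, description):
    """Hypothetical sketch of writing one data row as index:count pairs."""
    # The three documented restrictions, checked explicitly in this sketch
    if (not description
            or description == FEATURE_PREFIX
            or any(ch.isspace() for ch in description)):
        raise ValueError("unusable description: %r" % (description,))
    # Translate feature:count into index:count, ordered by index
    pairs = sorted((feature_index[f], count) for f, count in feature_dict.items())
    fields = [description] + ["%d:%d" % pair for pair in pairs]
    outfile.write(" ".join(fields) + "\n")


out = io.StringIO()
write_row(out, {"a": 0, "s": 1, "d": 2}, {"d": 2, "a": 1}, "add")
row = out.getvalue()
```

A description that is empty, contains whitespace, or equals "#" would make the first field of the row ambiguous for a reader that splits on whitespace, which is why the sketch rejects it.
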
