Package CHEM :: Package ML :: Module Util :: Class FeatureDictWriter
[hide private]
[frames] | no frames]

Class FeatureDictWriter



object --+    
         |    
      dict --+
             |
            FeatureDictWriter

Utility class to encode data feature vectors (represented as string:value feature dictionaries) into a plain text file format.

These should then be re-read using the matching "FeatureDictReader" class. The basic strategy is to identify every feature encountered among the data items and assign each a unique index number. This class (which extends the dict class) stores the feature:index mappings and then prints out each data item with the corresponding index:value pairs.

Example Usage: (Note this may have problems as a doctest since the feature:index mapping order is arbitrary based on the "random" traversal of feature keys through the feature dictionaries.
>>> from cStringIO import StringIO
>>> from CHEM.ML.features.SpectrumExtractor import SpectrumExtractor
>>> dataList = ["asdfsdfg","asdfasdfASDF","dfghDFGH"]
>>> outfile = StringIO()
>>> kernel = SpectrumExtractor();
>>> kernel.k = 1;
>>> featureEnum = FeatureDictWriter(outfile)
>>> # Determine and output all of the feature:index mappings
>>> for item in dataList:
...     featureDict = kernel(item)
...     for feature in featureDict.iterkeys():
...         success = featureEnum.add(feature)
...
>>> # Output the feature dictionaries in index:value text format
>>> for item in dataList:
...     featureDict = kernel(item)
...     featureEnum.update( featureDict, item )
>>> print >> outfile, "<BLANKLINE>",            # Hack.  doctest doesn't like blank lines in expected output
>>> print outfile.getvalue().replace("\t"," "); # Doesn't like tabs either
# 0 a
# 1 s
# 2 d
# 3 g
# 4 f
# 5 A
# 6 F
# 7 S
# 8 D
# 9 h
# 10 G
# 11 H
asdfsdfg UNKNOWN_ID 0:1 1:2 2:2 3:1 4:2 
asdfasdfASDF UNKNOWN_ID 0:2 1:2 2:2 4:2 5:1 6:1 7:1 8:1 
dfghDFGH UNKNOWN_ID 2:1 3:1 4:1 6:1 8:1 9:1 10:1 11:1 
<BLANKLINE>


Instance Methods [hide private]
 
__init__(self, outfile=None)
Constructor.
 
__getitem__(self, feature)
Override dictionary access method "dict[key]" to automatically add items for newly encountered keys.
 
add(self, feature)
Makes the writer aware of a feature that will have to be written for subsequent feature dictionaries.
 
__setitem__(self, feature, featureIndex)
Override dictionary set method "dict[key] = value" to automatically address newly encountered features.
 
new_key(self, feature, featureIndex)
Output the given feature:index mapping.
 
update(self, featureDict, description, nameID="UNKNOWN_ID")
Output a specific feature dictionary to text format.

Inherited from dict: __cmp__, __contains__, __delitem__, __eq__, __ge__, __getattribute__, __gt__, __hash__, __iter__, __le__, __len__, __lt__, __ne__, __new__, __repr__, clear, copy, fromkeys, get, has_key, items, iteritems, iterkeys, itervalues, keys, pop, popitem, setdefault, values

Inherited from object: __delattr__, __reduce__, __reduce_ex__, __setattr__, __str__

Static Methods [hide private]
 
formatFeature(feature)
Return a string representation of a feature suitable for storage in the file.
Properties [hide private]

Inherited from object: __class__

Method Details [hide private]

__init__(self, outfile=None)
(Constructor)

 
Constructor. Just pass it the output file (object, not filename) to write to.
Returns:
new empty dictionary

Overrides: dict.__init__

__getitem__(self, feature)
(Indexing operator)

 
Override dictionary access method "dict[key]" to automatically add items for newly encountered keys.
Overrides: dict.__getitem__

add(self, feature)

 

Makes the writer aware of a feature that will have to be written for subsequent feature dictionaries.

Should be called for every possible feature in the dataset before actually trying to write the data to the output file with the "update" method. This way the object can first assign index numbers to every feature.

Return value indicates whether the feature is new to the writer or not

__setitem__(self, feature, featureIndex)
(Index assignment operator)

 
Override dictionary set method "dict[key] = value" to automatically address newly encountered features.
Overrides: dict.__setitem__

formatFeature(feature)
Static Method

 

Return a string representation of a feature suitable for storage in the file.

Should be structured enough to be parseable back into object form by a respective FeatureDictReader.parseFeature method. By default will just use the "__str__" interface to format it.

For something more sophisticated, you should create your own FeatureDictWriter sub-class that overrides this method. You can then write a respective extension to the FeatureDictReader to parse the string back into an object.

new_key(self, feature, featureIndex)

 

Output the given feature:index mapping.

Automatically invoked by calls to the set and "add" methods for newly encountered features.

update(self, featureDict, description, nameID="UNKNOWN_ID")

 

Output a specific feature dictionary to text format.

It would be nice to call the "add" method on every possible feature before calling this method so that the feature:index mappings will all be output at the beginning of the file, instead of intermixed with the data. However, this method will automatically try doing so if it has not. Either way, guaranteed that each feature:index mapping will appear before they are ever referenced by a data row.

The provided object's description will be printed first for each. It is important that this description NOT:
  • be empty or
  • contain any whitespace or
  • equal the FEATURE_PREFIX "#"

Otherwise the "decoding" steps later will be confused.

Preferably this description should be some kind of data identifying string, but uniqueness is not enforced.
Returns:
None

Overrides: dict.update