Inheritance: object -> dict -> FeatureDictWriter
Utility class to encode data feature vectors (represented as string:value feature dictionaries) into a plain text file format.
Files written this way should be read back using the matching "FeatureDictReader" class. The basic strategy is to identify every feature encountered among the data items and assign each one a unique index number. This class (which extends the dict class) stores the feature:index mappings and then prints out each data item as the corresponding index:value pairs.
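The indexing strategy described above can be illustrated with a small stand-alone sketch. This is not the FeatureDictWriter implementation: the function names `assign_indices` and `encode_row` are hypothetical, the space-separated row layout is an assumption, and the code uses Python 3 syntax.

```python
def assign_indices(feature_dicts):
    """Map every feature seen in any dictionary to a unique index (first-seen order)."""
    index_of = {}
    for feature_dict in feature_dicts:
        for feature in feature_dict:
            # setdefault assigns a new index only the first time a feature appears
            index_of.setdefault(feature, len(index_of))
    return index_of


def encode_row(feature_dict, index_of):
    """Render one feature dictionary as index:value pairs, sorted by index."""
    pairs = sorted((index_of[f], v) for f, v in feature_dict.items())
    return " ".join(f"{i}:{v}" for i, v in pairs)


# Illustrative data, not taken from the library
rows = [{"a": 1, "s": 2}, {"s": 1, "d": 3}]
index_of = assign_indices(rows)
encoded = [encode_row(row, index_of) for row in rows]
```

Here `index_of` plays the role of the feature:index mapping that the class keeps in its dict storage, and `encoded` holds one index:value string per data item.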
Example Usage: (Note this may have problems as a doctest, since the feature:index mapping order is arbitrary, based on the "random" traversal of feature keys through the feature dictionaries.)

    >>> from cStringIO import StringIO
    >>> from CHEM.ML.features.SpectrumExtractor import SpectrumExtractor
    >>> dataList = ["asdfsdfg","asdfasdfASDF","dfghDFGH"]
    >>> outfile = StringIO()
    >>> kernel = SpectrumExtractor()
    >>> kernel.k = 1
    >>> featureEnum = FeatureDictWriter(outfile)
    >>> # Determine and output all of the feature:index mappings
    >>> for item in dataList:
    ...     featureDict = kernel(item)
    ...     for feature in featureDict.iterkeys():
    ...         success = featureEnum.add(feature)
    ...
    >>> # Output the feature dictionaries in index:value text format
    >>> for item in dataList:
    ...     featureDict = kernel(item)
    ...     featureEnum.update( featureDict, item )
    >>> print >> outfile, "<BLANKLINE>", # Hack. doctest doesn't like blank lines in expected output
    >>> print outfile.getvalue().replace("\t"," ") # Doesn't like tabs either
    # 0 a
    # 1 s
    # 2 d
    # 3 g
    # 4 f
    # 5 A
    # 6 F
    # 7 S
    # 8 D
    # 9 h
    # 10 G
    # 11 H
    asdfsdfg UNKNOWN_ID 0:1 1:2 2:2 3:1 4:2
    asdfasdfASDF UNKNOWN_ID 0:2 1:2 2:2 4:2 5:1 6:1 7:1 8:1
    dfghDFGH UNKNOWN_ID 2:1 3:1 4:1 6:1 8:1 9:1 10:1 11:1
    <BLANKLINE>
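To show how text in this layout could be decoded again, here is a rough sketch of a parser. It is not the real FeatureDictReader: the function name `parse_feature_text` is hypothetical, and the sketch assumes the whitespace-separated form shown in the expected output (mapping lines of the form `# index feature`, and data rows with exactly two leading columns followed by index:value pairs). The sample text is made up for illustration.

```python
def parse_feature_text(text):
    """Parse mapping lines and data rows from the whitespace-separated layout."""
    feature_of = {}  # index -> feature string
    rows = []        # (description, second column, {feature: value}) per data row
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        if tokens[0] == "#":
            # Mapping line: '# <index> <feature>'
            feature_of[int(tokens[1])] = tokens[2]
            continue
        # Data row. Assumption: two leading columns, then index:value pairs.
        description, second_column, *pairs = tokens
        features = {}
        for pair in pairs:
            index, value = pair.split(":")
            features[feature_of[int(index)]] = float(value)
        rows.append((description, second_column, features))
    return feature_of, rows


# Illustrative input in the same shape as the expected output
sample = """# 0 a
# 1 s
# 2 d
asd UNKNOWN_ID 0:1 1:1 2:1
ssd UNKNOWN_ID 1:2 2:1
"""
feature_of, rows = parse_feature_text(sample)
```

Because the sketch splits on whitespace, it would mis-parse features or descriptions that themselves contain spaces. It also relies on every mapping line appearing before the first row that references its index.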
Makes the writer aware of a feature that will have to be written for subsequent feature dictionaries. This "add" method should be called for every possible feature in the dataset before any data is written to the output file with the "update" method, so that the object can first assign an index number to every feature. The return value indicates whether the feature is new to the writer.
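The registration behaviour can be sketched with a small dict subclass. `SketchWriter` is a hypothetical stand-in, not the library class. The mapping-line format (`# index<TAB>feature`) and the choice of returning True for a new feature are assumptions made for illustration.

```python
import io


class SketchWriter(dict):
    """Hypothetical stand-in: stores feature -> index and logs each new mapping."""

    def __init__(self, outfile):
        super().__init__()
        self.outfile = outfile

    def add(self, feature):
        """Register a feature; return True only if it was not known before."""
        if feature in self:
            return False
        index = len(self)  # next unused index
        self[feature] = index
        # Write the mapping line as soon as the index is assigned
        self.outfile.write(f"# {index}\t{feature}\n")
        return True


out = io.StringIO()
writer = SketchWriter(out)
results = [writer.add(f) for f in ["a", "s", "a", "d"]]
```

The second `"a"` is already known, so it produces no new mapping line and does not consume an index.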
Return a string representation of a feature suitable for storage in the file. The string should be structured enough for a corresponding FeatureDictReader.parseFeature method to parse it back into object form. By default, the "__str__" interface is used to format the feature. For something more sophisticated, create your own FeatureDictWriter subclass that overrides this method, and then write a matching extension of FeatureDictReader to parse the string back into an object.
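As a sketch of that override pattern, the following uses stand-in classes rather than the real FeatureDictWriter and FeatureDictReader. The names `BaseFormatter`, `TupleFormatter`, `format_feature`, and `parse_tuple_feature` are all hypothetical, and the comma separator is an arbitrary choice for illustration.

```python
class BaseFormatter:
    """Stand-in for the default behaviour: format a feature with str()."""

    def format_feature(self, feature):
        return str(feature)


class TupleFormatter(BaseFormatter):
    """Override that stores tuple features as a compact, parseable string."""

    SEP = ","

    def format_feature(self, feature):
        return self.SEP.join(str(part) for part in feature)


def parse_tuple_feature(text, sep=","):
    """Matching parser: turn the stored string back into a tuple of strings."""
    return tuple(text.split(sep))


feature = ("C", "N", "O")
default_text = BaseFormatter().format_feature(feature)
encoded = TupleFormatter().format_feature(feature)
decoded = parse_tuple_feature(encoded)
```

The default `str()` form of a tuple contains spaces, quotes, and parentheses, which is awkward to parse back. The override produces a string that the matching parser can invert, provided that no part contains the separator.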
Output the given feature:index mapping. Automatically invoked by calls to the set and "add" methods for newly encountered features. |
Output a specific feature dictionary in text format. Ideally, call the "add" method on every possible feature before calling this method, so that all feature:index mappings are written at the beginning of the file rather than intermixed with the data. If that has not been done, this method registers any missing features automatically. Either way, each feature:index mapping is guaranteed to appear before it is referenced by a data row. The provided object's description is printed first on each row. It is important that this description NOT:
Otherwise the "decoding" steps later will be confused. Preferably this description should be some kind of data identifying string, but uniqueness is not enforced.
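The ordering guarantee can be sketched as follows. This is an illustration under assumptions, not the library code: `write_row` and `mappings_precede_references` are hypothetical helpers, and the tab-separated row layout and `# index<TAB>feature` mapping lines are assumed formats.

```python
import io


def write_row(outfile, index_of, feature_dict, description):
    """Write one data row, emitting mapping lines for unseen features first."""
    for feature in feature_dict:
        if feature not in index_of:
            index_of[feature] = len(index_of)
            outfile.write(f"# {index_of[feature]}\t{feature}\n")
    pairs = sorted((index_of[f], v) for f, v in feature_dict.items())
    body = "\t".join(f"{i}:{v}" for i, v in pairs)
    outfile.write(f"{description}\t{body}\n")


def mappings_precede_references(text):
    """Return True if every index used in a row was declared on an earlier '#' line."""
    declared = set()
    for line in text.splitlines():
        if line.startswith("#"):
            declared.add(int(line[1:].split()[0]))
        else:
            used = {int(p.split(":")[0]) for p in line.split("\t")[1:] if p}
            if not used <= declared:
                return False
    return True


out = io.StringIO()
index_of = {}
write_row(out, index_of, {"a": 1, "s": 2}, "item1")
write_row(out, index_of, {"s": 1, "d": 1}, "item2")  # "d" was never pre-registered
text = out.getvalue()
```

In this run the mapping line for `"d"` lands between the two data rows, so the mappings are intermixed with the data but still precede every reference to them.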
Generated by Epydoc 3.0beta1 on Thu Nov 8 17:49:32 2007 | http://epydoc.sourceforge.net |