Package CHEM :: Package CombiCDB :: Module CanSmiUniqueSet

Module CanSmiUniqueSet



Given a collection of molecules, convert them all to canonical SMILES 
format and output the contents to another file, but only those with unique 
canonical SMILES.  Separately reports the number of original and number of 
unique molecules to the application log.

Note:  This is a very inefficient program that requires huge amounts of
memory, because the unique molecules are collected in an in-memory
dictionary keyed by canonical SMILES.  You're much better off inserting all
the candidate molecules into a database, keyed by their canonical SMILES.
Database indexing is already optimized for exactly this kind of lookup.
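
The database alternative suggested in the note above could look roughly like the following. This is a minimal sketch, not part of the module: it uses SQLite from the Python standard library, the table and column names (molecule, can_smi, title) and the function name unique_by_key are hypothetical, and it assumes the canonical SMILES strings have already been computed elsewhere.

```python
import sqlite3

def unique_by_key(records):
    """Deduplicate (can_smi, title) pairs using a database primary key.

    Assumes each can_smi is already a canonical SMILES string; computing it
    is left to the cheminformatics toolkit and is not shown here.
    """
    conn = sqlite3.connect(":memory:")  # a file path would be used for large sets
    conn.execute("CREATE TABLE molecule (can_smi TEXT PRIMARY KEY, title TEXT)")
    # INSERT OR REPLACE keeps the last record seen for each key,
    # mirroring the "last one read has precedence" rule.
    conn.executemany(
        "INSERT OR REPLACE INTO molecule (can_smi, title) VALUES (?, ?)",
        records,
    )
    conn.commit()
    rows = conn.execute(
        "SELECT can_smi, title FROM molecule ORDER BY can_smi"
    ).fetchall()
    conn.close()
    return rows

records = [("CCO", "ethanol"), ("c1ccccc1", "benzene"), ("CCO", "ethyl alcohol")]
unique = unique_by_key(records)
```

With a primary key on the canonical SMILES column, the database index does the duplicate detection, so the full molecule set never has to sit in a Python dictionary.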

Input: 
- sourceFile:  Molecule file
    Can be any format that oemolistream understands, provided the file has
    an appropriately named extension, for example "molecules.smi" for SMILES
    format.  Can take stdin as the source by specifying the filename "-" or
    ".smi" or something similar.  See the oemolistream documentation for more
    information.

Output:
- uniqueFile:  Unique (by canonical SMILES) molecule file
    Output is the unique set of source molecules, where uniqueness is
    determined by their canonical SMILES.  Note that the order in which these
    are output will NOT correspond to the order in the source file,
    especially since some molecules may be removed.
    
    If the molecules read include additional information, such as titles / labels,
    coordinates, etc., this will also be output, following standard oemolostream
    behavior.  When several molecules share the same canonical SMILES, the
    last one read takes precedence for this output.
    
    One additional catch:  if a source molecule is actually several molecules
    (e.g. "CCO.c1ccccc1" represents ethanol and benzene), each component is
    separated out and gets its own canonical SMILES.  Theoretically, the resulting
    unique list could therefore contain more items than the source list.  Any
    supplementary information on such molecules (coordinates, etc.) cannot be
    depended on to exist after the separation.  Titles are carried over from the
    parent molecule, appended with ".X", where X is the index of the submolecule
    within the parent.
    
    Again, redirection to stdout is possible by specifying the filename "-" or
    ".smi" etc.  Note also that by specifying a different file suffix
    (e.g. ".mol2", ".sdf", etc.), the output file does not have to be in
    canonical SMILES format.  Canonical SMILES is only used to determine
    uniqueness.  If you do want canonical SMILES formatted output, be sure
    the filename is, or ends with, the extension ".cansmi".

- redundantFile:    Redundant (by canonical SMILES) molecule file
    Optional parameter.  If specified, copies of all of the molecules from the
    original sourceFile that did not make it into the uniqueFile (because they
    were redundant with another molecule in the sourceFile) will be output here.
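
The composite handling described for uniqueFile above can be sketched in plain Python. This is an illustration only, not the module's code: the function name split_composite is hypothetical, molecules are represented by plain strings rather than OEChem molecule objects, and the index X is assumed to be zero-based, which the documentation does not state.

```python
def split_composite(can_smi, title):
    """Split a dot-separated canonical SMILES into (smiles, title) pairs.

    Each component title is the parent title plus ".X", where X is the
    position of the component in the parent (zero-based here by assumption).
    """
    parts = can_smi.split(".")
    if len(parts) == 1:
        # Not a composite: keep the molecule and its title unchanged.
        return [(can_smi, title)]
    return [(part, "%s.%d" % (title, i)) for i, part in enumerate(parts)]

components = split_composite("CCO.c1ccccc1", "mix")
```

Because one source record can yield several components, the number of unique molecules can exceed the number of source records.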
    



Functions
 
main(sourceFilename, uniqueFilename, redundantFilename=None, separateComposites=True)
Command-line main method; opens the files with the respective names and delegates most of the work to "convertUnique".
 
convertUnique(sourceOEIS, uniqueOEOS, redundantOEOS=None, separateComposites=True)
Primary method, reads the source file to generate the unique molecule output file.
 
addMolecule(mol, canSmiDict, redundantOEOS, separateComposites)
Adds the molecule to the canSmiDict, keyed by its canonical SMILES.
 
usage()
Prints usage directions if the module is called incorrectly from the command line.
Function Details

convertUnique(sourceOEIS, uniqueOEOS, redundantOEOS=None, separateComposites=True)

 

Primary method, reads the source file to generate the unique molecule output file. See module documentation for more information.

Note: This method actually takes oemolistream and oemolostream objects, not filenames, to allow the caller to pass "virtual Files" for the purpose of testing and interfacing. Use the "main" method to have the module take care of opening files from filenames.
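
The benefit of accepting stream objects rather than filenames can be illustrated with a standard-library analogue. This sketch does not use OEChem: io.StringIO stands in for the oemolistream and oemolostream objects, the function name convert_unique is hypothetical, and each input line is assumed to already be a canonical SMILES string.

```python
import io

def convert_unique(source, unique_out):
    """Read one record per line from source and write the unique ones.

    Returns (number of records read, number of unique records).
    """
    seen = {}
    n_read = 0
    for line in source:
        key = line.strip()
        if not key:
            continue  # skip blank lines
        n_read += 1
        seen[key] = True
    for key in seen:
        unique_out.write(key + "\n")
    return n_read, len(seen)

# In-memory "virtual files": no files on disk are needed for a test.
source = io.StringIO("CCO\nc1ccccc1\nCCO\n")
unique_out = io.StringIO()
counts = convert_unique(source, unique_out)
```

A caller can inspect unique_out.getvalue() directly, which is what makes this style convenient for unit tests and for chaining with other code.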

addMolecule(mol, canSmiDict, redundantOEOS, separateComposites)

 
Adds the molecule to the canSmiDict, keyed by its canonical SMILES. If it replaces a molecule already in canSmiDict, write out the replaced one to redundantOEOS. If separateComposites is True, first split the molecule into several if its canonical SMILES contains any "." indicating multiple molecules.
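
The replace-and-report behavior described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the module's implementation: add_molecule is a hypothetical stand-in, the molecule is represented by its title string, the canonical SMILES is passed in rather than computed, the redundant output is a plain list instead of an oemolostream, and composite splitting is left out to keep the sketch short.

```python
def add_molecule(can_smi, mol, can_smi_dict, redundant=None):
    """Store mol under its canonical SMILES key.

    If the key is already present, the previously stored molecule is
    replaced and, when a redundant collector is given, recorded there.
    """
    if can_smi in can_smi_dict and redundant is not None:
        redundant.append(can_smi_dict[can_smi])
    can_smi_dict[can_smi] = mol  # the last molecule read wins

can_smi_dict = {}
redundant = []
for smi, title in [("CCO", "ethanol"), ("c1ccccc1", "benzene"), ("CCO", "ethyl alcohol")]:
    add_molecule(smi, title, can_smi_dict, redundant)
```

After the loop, can_smi_dict holds one entry per canonical SMILES and redundant holds the entries that were displaced.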