Module CanSmiUniqueSet
Given a collection of molecules, convert them all to canonical SMILES
format and write only those with unique canonical SMILES to another file.
The number of original molecules and the number of unique molecules are
reported separately to the application log.
Note: This program is very inefficient and requires large amounts of
memory, since the molecules are collected in an in-memory dictionary keyed
by canonical SMILES. You are much better off inserting all the candidate
molecules into a database by their canonical SMILES; database indexing is
already optimized for exactly this kind of lookup.
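The database approach suggested in the note can be sketched roughly as follows. This is only an illustration using Python's standard sqlite3 module, not part of this module: the table name, the column names and the load_unique helper are hypothetical, and the records are assumed to carry SMILES that are already canonical.

```python
import sqlite3

def load_unique(records, db_path=":memory:"):
    """Insert (canonical_smiles, title) pairs and return the open connection.

    Hypothetical helper: `records` is assumed to hold SMILES strings that
    have already been converted to canonical form.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS molecule ("
        " can_smi TEXT PRIMARY KEY,"  # the primary-key index enforces uniqueness
        " title TEXT)"
    )
    # INSERT OR REPLACE keeps the last record seen for each canonical SMILES.
    conn.executemany(
        "INSERT OR REPLACE INTO molecule (can_smi, title) VALUES (?, ?)",
        records,
    )
    conn.commit()
    return conn

records = [("CCO", "ethanol"), ("c1ccccc1", "benzene"), ("CCO", "ethanol copy")]
conn = load_unique(records)
n_unique = conn.execute("SELECT COUNT(*) FROM molecule").fetchone()[0]
print(len(records), "read,", n_unique, "unique")
```

Using INSERT OR REPLACE mirrors the "last one read has precedence" rule described for uniqueFile; a plain INSERT OR IGNORE would keep the first record instead.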
Input:
- sourceFile: Molecule file
Can be any format understood by oemolistream, provided the file has a
properly named extension, for example "molecules.smi" for SMILES format.
Can take stdin as the source by specifying the filename "-" or ".smi" or
something similar. See the oemolistream documentation for more information.
Output:
- uniqueFile: Unique (by canonical SMILES) molecule file
Output is the unique set of source molecules, where uniqueness is
determined by their canonical SMILES. Note that the order these are
output will NOT correspond to the order from the source file,
especially since some molecules may be removed.
If the molecules read include additional information, such as titles / labels
or coordinates, these will also be output, following standard oemolostream
behavior. Among molecules with the same canonical SMILES, the last one read
takes precedence for this output.
One additional catch: if a source molecule actually represents multiple molecules
(e.g. "CCO.c1ccccc1" represents ethanol and benzene), each component will be
separated out and given its own canonical SMILES. In theory, the resulting unique
list could therefore contain more items than the source list. Supplementary
information on such molecules (coordinates, etc.) cannot be depended
on to exist after the separation. Titles are carried over from the parent
molecule, with ".X" appended, where X is the index of the submolecule in the parent.
As with the source, output can be redirected to stdout by specifying the filename "-" or ".smi",
etc. Note also that by specifying a different file suffix
(e.g. ".mol2", ".sdf"), the output file does not have to be
in canonical SMILES format; canonical SMILES is only used to
determine uniqueness. If you do want canonical SMILES formatted
output, be sure the filename is, or ends with, the extension ".cansmi".
- redundantFile: Redundant (by canonical SMILES) molecule file
Optional parameter. If specified, copies of all of the molecules from the
original sourceFile that did not make it into the uniqueFile (because they
were redundant with another molecule in the sourceFile) will be output here.
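The composite handling described under uniqueFile (splitting on "." and appending ".X" to the parent title) might look roughly like the sketch below. It works on plain strings rather than OEChem molecule objects, the split_composite name is hypothetical, and the choice of 0 as the first index is an assumption, since the text above does not say where X starts.

```python
def split_composite(can_smi, title):
    """Split a canonical SMILES on "." and derive a title for each part.

    Returns a list of (component_smiles, component_title) pairs. A SMILES
    without "." comes back unchanged as a single pair.
    """
    parts = can_smi.split(".")
    if len(parts) == 1:
        return [(can_smi, title)]
    # Assumption: submolecule indices start at 0; the documentation only
    # says X is "the index in the parent".
    return [(part, "%s.%d" % (title, i)) for i, part in enumerate(parts)]

components = split_composite("CCO.c1ccccc1", "mix")
single = split_composite("CCO", "ethanol")
print(components)
print(single)
```

Because only the SMILES text and the title survive the split in this sketch, it also shows why coordinates and other supplementary data cannot be relied on afterwards.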
main(sourceFilename,
uniqueFilename,
redundantFilename=None,
separateComposites=True)
Command-line main method, opens files with respective names and
delegates most work to "convertUnique"
convertUnique(sourceOEIS,
uniqueOEOS,
redundantOEOS=None,
separateComposites=True)
Primary method, reads the source file to generate the unique
molecule output file.
addMolecule(mol,
canSmiDict,
redundantOEOS,
separateComposites)
Adds the molecule to the canSmiDict, keyed by its canonical
SMILES.
usage()
Prints usage directions if module called from command-line
incorrectly
convertUnique(sourceOEIS,
uniqueOEOS,
redundantOEOS=None,
separateComposites=True)
Primary method, reads the source file to generate the unique molecule
output file. See module documentation for more information.
Note: This method takes oemolistream and oemolostream
objects, not filenames, so that the caller can pass "virtual
files" for testing and interfacing. Use the
"main" method to have the module take care of opening files
from filenames.
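The benefit of accepting stream objects rather than filenames can be illustrated with ordinary Python text streams. The sketch below is not the OEChem-based convertUnique: io.StringIO objects stand in for oemolistream / oemolostream, the convert_unique_text name is hypothetical, and each input line is assumed to be a single, already-canonical SMILES string.

```python
import io

def convert_unique_text(source, unique_out, redundant_out=None):
    """Copy the unique lines of `source` to `unique_out`.

    Each non-empty line is assumed to be one canonical SMILES string.
    Repeated lines go to `redundant_out` when it is given.
    Returns (number_read, number_unique).
    """
    seen = {}
    n_read = 0
    for line in source:
        smi = line.strip()
        if not smi:
            continue
        n_read += 1
        if smi in seen and redundant_out is not None:
            redundant_out.write(smi + "\n")
        seen[smi] = smi
    for smi in seen:
        unique_out.write(smi + "\n")
    return n_read, len(seen)

# In-memory "virtual files": no files on disk are needed for a test.
source = io.StringIO("CCO\nc1ccccc1\nCCO\n")
unique, redundant = io.StringIO(), io.StringIO()
counts = convert_unique_text(source, unique, redundant)
print(counts)
```

The two returned counts correspond to the original and unique totals that the module reports to the application log.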
addMolecule(mol,
canSmiDict,
redundantOEOS,
separateComposites)
Adds the molecule to the canSmiDict, keyed by its canonical SMILES. If
it replaces a molecule already in canSmiDict, the replaced one is written
to redundantOEOS. If separateComposites is True, the molecule is first split
into its components if its canonical SMILES contains any ".",
which indicates multiple molecules.
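The replace-and-report behaviour described here can be sketched with a plain dictionary. This is a simplified stand-in, not the actual addMolecule: records are (canonical_smiles, title) tuples instead of OEChem molecules, a list takes the place of redundantOEOS, composite splitting is left out, and the add_molecule name is hypothetical.

```python
def add_molecule(record, can_smi_dict, redundant):
    """Store `record` under its canonical SMILES, displacing any older entry.

    `record` is a (canonical_smiles, title) pair. `redundant` is a list that
    collects displaced records, or None to discard them.
    """
    can_smi, _title = record
    replaced = can_smi_dict.get(can_smi)
    if replaced is not None and redundant is not None:
        # The older record is the one reported as redundant.
        redundant.append(replaced)
    can_smi_dict[can_smi] = record

can_smi_dict, redundant = {}, []
for rec in [("CCO", "first"), ("c1ccccc1", "benzene"), ("CCO", "second")]:
    add_molecule(rec, can_smi_dict, redundant)
print(can_smi_dict["CCO"], redundant)
```

Since the newer record always overwrites the older one, the dictionary ends up holding the last molecule read for each canonical SMILES, matching the precedence rule stated for uniqueFile.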