27.1.1 Biological Hierarchies

Consider, for example, the issues involved in modeling a hierarchy. In a hierarchical organization, records are linked together like a family tree, such that each record has only one owner; for example, an order is owned by only one customer. Indeed, hierarchical structures were widely used in the first mainframe database management systems in the early 1970s.

As a case study, we'll consider the classic hierarchy from biology, that of phylogeny. Most biologists are aware that all living organisms can be placed in a hierarchy of kingdom, genus and species. If it were truly as simple as this, it might be reasonable to organize the data as a three-level tree. The Biota bioinformatics database, for example, goes further, with kingdom, phylum, class, order, family, genus and species as seven different tables. But even this is only the tip of the iceberg: the NCBI taxonomy database, as used by GenBank, records the ranks

superkingdom, kingdom, subkingdom, superphylum, phylum, subphylum, superclass, class, subclass, infraclass, cohort, subcohort, superorder, order, suborder, infraorder, parvorder, superfamily, family, subfamily, tribe, subtribe, genus, subgenus, species group, species subgroup, species, subspecies, varietas and forma.

Hence, if NCBI used a top-down hierarchical data organization, looping over all the organisms in the database would require 30 nested loops, one per rank. The problem is exacerbated by the fact that the other major bioinformatics databases use different hierarchies, and that different subtrees and branches have different numbers of levels.

Clearly, modeling such a hierarchy explicitly (even without the ambiguity of organisms belonging to multiple leaves of the hierarchy) has serious limitations.
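One common way around the nested-loop problem is to store every taxon as a flat record that carries its rank as an ordinary attribute and refers to its parent. The following is a minimal sketch of that idea in plain Python, not a description of how NCBI or any other database is implemented. The `Taxon` class, the `species` and `lineage` helpers and the heavily abbreviated toy lineages are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Taxon:
    """Hypothetical flat record: the rank is data, not a nesting level."""
    name: str
    rank: str
    parent: Optional[str]  # name of the parent taxon, None for the root


# Abbreviated toy data (an assumption for illustration only).
# Most intermediate ranks are omitted, and the two branches
# deliberately have different depths.
TAXA: List[Taxon] = [
    Taxon("Eukaryota", "superkingdom", None),
    Taxon("Metazoa", "kingdom", "Eukaryota"),
    Taxon("Chordata", "phylum", "Metazoa"),
    Taxon("Mammalia", "class", "Chordata"),
    Taxon("Homo", "genus", "Mammalia"),
    Taxon("Homo sapiens", "species", "Homo"),
    Taxon("Viridiplantae", "kingdom", "Eukaryota"),
    Taxon("Arabidopsis", "genus", "Viridiplantae"),
    Taxon("Arabidopsis thaliana", "species", "Arabidopsis"),
]

BY_NAME: Dict[str, Taxon] = {t.name: t for t in TAXA}


def species(taxa: List[Taxon]) -> List[str]:
    """Visit every species with a single loop, whatever the tree depth."""
    return [t.name for t in taxa if t.rank == "species"]


def lineage(name: str, by_name: Dict[str, Taxon]) -> List[str]:
    """Walk parent links from a taxon up to the root."""
    path: List[str] = []
    current = by_name.get(name)
    while current is not None:
        path.append(current.name)
        current = by_name.get(current.parent) if current.parent else None
    return path


if __name__ == "__main__":
    print(species(TAXA))
    print(lineage("Arabidopsis thaliana", BY_NAME))
```

With this layout, adding a new rank or a branch that skips several ranks changes only the data, not the code that traverses it.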

More relevant to OEChem is the related problem of how to represent biomolecules. Once again, a naive structural biologist could be forgiven for assuming it's a simple matter of organizing atoms into residues, and residues into chains, for a simple three-level hierarchy. Indeed, this was a fundamental mistake made by the immensely popular RasMol molecular graphics program, which had exactly such a three-level structure. In reality, the organization of a PDB file also involves multiple NMR models, crystal-related symmetries, secondary and tertiary structural elements, folding domains, (active) sites, connected components and XPLOR segment IDs. It also involves the distinctions between proteins, nucleic acids, ligands and solvent, between backbone and side-chain atoms, and between heavy atoms and hydrogens, as well as categorization by ring system and cycle membership. Finally, let's not forget the alternate conformation indicators for each atom or residue!

Notice that this hierarchy also fails to be a "strict" hierarchy. A single syntactic chain may be split into multiple connected components, and multiple PDB chains may be covalently bonded into a single connected component. TER records normally serve as chain terminators, but in several PDB files they occur with a single residue. Most chains are either ATOM or HETATM, but peptidic inhibitors and post-translationally modified proteins are mixtures of both. A single strand of a beta-sheet is always formed from a single chain, but a beta-sheet may be formed from strands from multiple chains.
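The mismatch between syntactic chains and connected components can be illustrated with a small sketch in plain Python. It does not use the OEChem API; the atom indices, chain labels, bond list and the `connected_components` helper are hypothetical toy data and names chosen for illustration. Chain identifiers are stored as a per-atom attribute, while connected components are derived from the bonds with a union-find pass, so neither grouping has to nest inside the other.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical toy structure: atom index -> PDB chain identifier.
CHAIN_OF: Dict[int, str] = {0: "A", 1: "A", 2: "A", 3: "A", 4: "B", 5: "C"}

# Hypothetical bonds. Chain A has a break between atoms 1 and 2,
# and chains B and C are covalently linked through atoms 4 and 5.
BONDS: List[Tuple[int, int]] = [(0, 1), (2, 3), (4, 5)]


def connected_components(n_atoms: int,
                         bonds: List[Tuple[int, int]]) -> List[List[int]]:
    """Group atom indices into bonded components using union-find."""
    parent = list(range(n_atoms))

    def find(i: int) -> int:
        # Follow parent links to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in bonds:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups: Dict[int, List[int]] = defaultdict(list)
    for i in range(n_atoms):
        groups[find(i)].append(i)
    return sorted(groups.values())


COMPONENTS = connected_components(len(CHAIN_OF), BONDS)

# Chains present in each component: one component may span several chains.
CHAINS_PER_COMPONENT = [sorted({CHAIN_OF[i] for i in comp})
                        for comp in COMPONENTS]

# Number of components touching each chain: one chain may be split.
COMPONENTS_PER_CHAIN: Dict[str, int] = defaultdict(int)
for chains in CHAINS_PER_COMPONENT:
    for chain in chains:
        COMPONENTS_PER_CHAIN[chain] += 1

if __name__ == "__main__":
    print(COMPONENTS)
    print(CHAINS_PER_COMPONENT)
    print(dict(COMPONENTS_PER_CHAIN))
```

In this toy example chain A falls into two components while chains B and C share one, so a tree that places components under chains, or chains under components, cannot represent both facts at once.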