Strategic Directions in Computing Research

Working Group on Storage I/O Issues in Large-Scale Computing

Position statement

Dartmouth College, Department of Computer Science
6211 Sudikoff Laboratory, Hanover, NH 03755-3510, USA
thc@cs.dartmouth.edu, http://www.cs.dartmouth.edu/~thc

Michael T. Goodrich

The Johns Hopkins University, Department of Computer Science
Whiting School of Engineering, Baltimore, MD 21218
http://www.ics.uci.edu/~goodrich/

Abstract: We present the challenge of synthesizing a coherent model that combines the best aspects of the Parallel Disk Model and Bulk Synchronous Parallel models to develop and analyze algorithms that use parallel I/O, computation, and communication.
Categories and Subject Descriptors: B.3.2 [Memory Structures]: Design Styles - Mass storage (e.g., magnetic, optical), Primary memory; B.4.4 [Input/Output and Data Communications]: Performance Analysis and Design Aids - Formal models, Worst-case analysis; D.1.3 [Programming Techniques]: Concurrent Programming - Parallel programming; D.4.2 [Operating Systems]: Storage Management - Secondary Storage; D.4.4 [Operating Systems]: Communications Management - Input/Output; Message sending; Network communication; E.2 [Data Storage Representations]: Contiguous representations; E.5 [Files]: Sorting/searching; F.1.2 [Computation by Abstract Devices]: Modes of Computation - Parallelism and concurrency; F.2.2 [Analysis of Algorithms and Problem Complexity]: Nonnumerical Algorithms and Problems - Sorting and searching;
General Terms: Algorithms, Design, Languages, Performance, Theory.
Additional Key Words and Phrases: I/O, external memory, secondary memory, communication, disk drive, parallel disks, sorting, Parallel Disk Model, Bulk Synchronous Parallel Model.

A Bridging Model for Parallel Computation, Communication, and I/O

The past decade has seen the introduction of new and useful models for analyzing the computational and communication complexities of parallel algorithms, as well as useful models to measure I/O complexity. Yet no useful model measures all of computational, communication, and I/O complexity simultaneously.

Usefulness of a model implies two characteristics. First, the model should be realistic in the sense that its prediction for any algorithm should correspond to observed behavior of real systems. Second, the model should be simple enough to use and understand that one can design, analyze, and implement algorithms without having had extensive experience with the model.

We maintain that the Bulk Synchronous Parallel, or BSP, model [Valiant 1990] and LogP [Culler et al. 1993] models are useful for computational and communication complexities of parallel algorithms. The Parallel Disk Model, or PDM, [Vitter Shriver 1994] is useful for I/O complexity. The BSP and LogP models, however, ignore I/O, and the PDM does not account for computation or communication. Because we think of I/O as so much slower than computation or communication, the PDM apparently captures the most salient factor in the wall-clock time for algorithms that use parallel I/O.

What is apparent may not be the case, however.

Early experiences with algorithms implemented in the PDM indicate that although wall-clock time for a given algorithm follows the prediction of the model, the algorithms themselves are not I/O bound. Even with synchronous (i.e., blocking) I/O, the time spent waiting for I/O is typically less than 50% of the total wall-clock time. This behavior suggests that each parallel disk access gives rise to a given amount of computation and communication for a particular algorithm.

A sorting algorithm, for example, might repeatedly process "memoryloads" of data by performing a large parallel read, an in-memory sort across all processors, and a large parallel write. The time to perform the in-memory sorts might exceed the combined times of the parallel reads and writes, although it is roughly the same among the memoryloads. Typical algorithms developed for the PDM are similar to this hypothetical sorting algorithm in that they make repeated passes over the data and each pass repeatedly reads in a large amount of data, processes it, and writes out a large amount of data. Processing time (including communication) tends to be about the same each time for a given algorithm. Of course it varies from algorithm to algorithm.

If these early observations continue to hold as we gain more experience in implementing algorithms for the PDM, we will draw the conclusion that the PDM's predictive power is helpful (for analyzing I/O time) but limited (by omitting computation and communication).

Note, however, a striking similarity between the BSP model and typical PDM algorithms: bulk processing. In the BSP model, for example, communcation in a network of processors is considered to be the prime computational bottleneck; hence, a computation is organized as a sequence of rounds, where each round consists of each processor performing computations on its internal memory, followed by the sending and receiving of a limited number of messages. Rounds and communication in BSP algorithms are like memoryload processing and I/O, respectively, in PDM algorithms.

The challenge, therefore, is to synthesize a coherent model that combines the best aspects of the PDM and BSP models to develop and analyze algorithms that use parallel I/O, computation, and communication. Along with such a model, we need programming primitives to enable algorithm implementation under the model. These primitives must be portable and have performance matching the model's requirements.

We view developing such a model as a challenge because we believe that it will be difficult to simultaneously satisfy the two requirements that it be realistic yet easy to use. Our concern is that a realistic model may have so many parameters as to make it unusable. The PDM, without considering processing, has four parameters: problem size, memory size, disk block size, and disk count. The BSP model also has four parameters: problem size, processor count, latency of the network, and the "gap" time between consecutive messages in pipelined computations. Thus, some natural questions to ask include the following:

Can all these parameters be merged in some synthesis?
Is there a block size notion in BSP that might be consistent with the PDM'S block size?
What is the right set of at most four or five important parameters?

In summary, we think that it would be valuable to propose a bridging parallel computational model that incorporates computation, communication and I/O in an accurate and easy to use manner. We hope that discussions at the workshop will lead to such a model.

References

[Alpern et al. 1994]: Alpern, B., Carter, L., Feig, E., and Selker, T., 1994. The Uniform Memory Hierarchy Model of Computation, Algorithmica, 12:2/3, August and September 1994, pages 72-109.
[Cormen Goodrich 1996]: Cormen, T. H., and Goodrich, M. T., 1996. Position Statement, Strategic Directions in Computing Research: Working Group on Storage I/O Issues in Large-Scale Computing, Computing Surveys, 28A(4), December 1996, http://www.cs.jhu.edu/~goodrich/pubs/io.html.
[Cormen Hirschl 1996]: Cormen, T. H., and Hirschl, M., 1996. Early Experiences in Evaluating the Parallel Disk Model with the ViC* Implementation, Parallel Computing, to appear. Also available as Dartmouth College Computer Science Technical Report TR96-293 at ftp://ftp.cs.dartmouth.edu/TR/TR96-293.ps.Z
[Culler et al. 1993]: Culler, D., Karp, R., Patterson, D., Sahay, A., Schauser, K. E., Santos, E., Subramonian, R., and von Eicken, T. 1993. LogP: Towards a Realistic Model of Parallel Computation, Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, May 1993, pages 1-12.
[Valiant 1990]: Valiant, L. G., 1990. A Bridging Model for Parallel Computation, Communications of the ACM, 33:8, August 1990, pages 103-111.
[Vitter Shriver 1994a]: Vitter, J. S., and Shriver, E. A. M., 1994. Algorithms for Parallel Memory I: Two-level Memories, Algorithmica, 12:2/3, August and September 1994, pages 110-147.
[Vitter Shriver 1994b]: Vitter, J. S., and Shriver, E. A. M., 1994. Algorithms for Parallel Memory II: Hierarchical Multilevel Memories, Algorithmica, 12:2/3, August and September 1994, pages 148-169.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Publications Dept, ACM Inc., fax +1 (212) 869-0481, or permissions@acm.org.

Last modified: Mon Oct 21 19:18:04 EDT

Michael T. Goodrich