This is a history of how RAPA came to be, a recap of its main points, a criticism of some mistakes, a note or two on topics that appear to have fallen so far on deaf ears, a description of the development, marketing, and use of ARCHTEST, and some thoughts on where future developments of shared memory architectures may lead.
Early Efforts to Write Routines with Shared Operands
Dijkstra's article [dijk65] on
a critical section routine described the most ingenious bit of
programming I had ever seen. After reading it, I spent six months
working on a critical section routine with one small improvement:
operands could be only one bit in size. The goal was to eliminate the
need for a machine to exhibit atomicity in write operations in order
for the machine to successfully execute a critical section routine.
Once I had completed the routine, I thought I understood pretty much
everything there was to know about the problems of machines accessing
shared operands.
In the 70's there were efforts to write routines to add an element to a queue or to remove an element from a queue under the very special circumstances found in an operating system. Such routines would have enabled operating system functions to communicate with each other without incurring the expense of making time-consuming calls to other functions in the operating system. The special circumstances under which the queue handling routines were required to run were:
1. All of the changes to storage caused by one execution of either routine had to occur atomically. Otherwise, two executions might livelock, or worse, one might crash on the temporary results produced by the other.
2. The executions could not spin, waiting for a busy condition to clear. If the executions could spin, low priority executions could cause intolerable delays for high priority executions.
3. The executions had to run enabled for interrupts. The routines needed to be called from throughout the operating system, even from code which lacked the authority to disable interrupts.
So far as I know, no such routines were ever found.
A Model of Shared Memory Interactions
There was a lot of discussion about what one could reasonably
assume about the behavior of a uniprocessor when writing routines such
as these. When shared memory multiprocessors (SMMPs) were developed,
the discussion became louder and more confusing. I remember listening
to some people whom I knew to be very bright and knowledgeable and
finding it very hard to follow them. I finally realized that the
problem was not that they were speaking too far above me for me to
follow; the problem was that they were uncertain of what they
themselves understood. I thought to myself that here was a problem
just waiting to be solved, and it was one that I could figure out.
I began by listing the rules that I had called on in my efforts to write the critical section routine and the queuing routines. I skimmed a lot of articles to see what others were thinking.
The first step was to develop a model sufficiently complex to be realistic, but sufficiently simple to be tractable.
Operands could be either private or shared. An operand was shared if it were written into by at least one process and read by at least one other process. Shared operands were represented by A, B, C, .... Private operands were represented by Z, Y, X, ....
Computations, tests, branches, and function calls could all be viewed as operations on private data and so should be considered irrelevant from the standpoint of representing interactions on shared data. Assignment statements were sufficient to represent interactions on shared operands.
Each assignment statement could consist of one or more read operations and one write operation.
An execution represented the act of executing a program, where a program consisted of some number of processes. An execution consisted of statements of the initial and terminal values of the operands and a listing of the assignment statements executed by each process. Here is the execution for a single assignment statement, A = B + C;, occurring in process P1.
Initially, (A,B,C) = (0,1,2).

    P1
    L1: A = B + C;

Terminally, (A,B,C) = (3,1,2).
An execution can be decomposed into read events and write events. Each statement in an execution is labelled with its line number within its process in order to make it possible to uniquely identify each event. The execution above has the following events according to the model developed so far.
(P1,L1,R,B,1) (P1,L1,R,C,2) (P1,L1,W,A,3)
The components of an event are: the process performing the event, the line number of the statement within that process, the type of operation (R for read, W for write), the operand, and the value read or written.
In this model each process will see an operand change value at the same moment. Something more needs to be added to make it possible to represent processes seeing operands change value at different times.
Define each process Pi to have a separate store Si containing a copy of all of the operands in the system. Store Si represents, not an actual physical memory, but rather Pi's view of all of storage.
The events for the above execution now take the following form. The sixth event component represents the store containing the operand being read or written. Read events for Pi are always from Si, that is, a process always reads only from its own store. For a write operation a process writes separately into each store in the system. In a four process system the above write operation is represented by write events for S1, S2, S3, and S4.
(P1,L1,R,B,1,S1) (P1,L1,R,C,2,S1) (P1,L1,W,A,3,S1) (P1,L1,W,A,3,S2) (P1,L1,W,A,3,S3) (P1,L1,W,A,3,S4)
Within the framework of this model I was able to state all the rules of behavior I had read about or imagined. (This did not happen easily or quickly. Some of the rules were very similar in the words needed to describe them. Trying to massage a fuzzy perception into exactly the right words was, I remember thinking at the time, like trying to pick up an egg yolk with your elbows.)
There is a reason (explained further on) to create an exhaustive list of rules. For all practical purposes, however, there are only four rules: computation (CMP), uniprocessor order (UPO), program order (PO and its subrules), and coherence (CCi, i=1,2,3,4).
The rule of computation is fundamental. It prohibits executions such as the following:
Initially, A = 0.

    P1
    L1: A = 1;

Terminally, A = 17.
The rule of computation requires that for every terminal value there be a rational way for the operands to have acquired the terminal value, given the initial values and the assignment statements in the program. The rule of computation is the most obviously necessary of all the rules. Unfortunately, it is also the most difficult to state formally; I am afraid that the chapter in RAPA on the computation rules, coming so early in the book, must have put readers off.
The rule of computation is also the most fundamental rule in the following sense. If an execution is not required to obey the rule of computation, then the events of the execution can be put into an order in which all of the other rules are obeyed. This is not true of any other rule. Given the fundamental importance of the rule of computation, it has been a continuing surprise to me that other writers have not picked up on it and stated it as one of their assumptions when seeking to reason about machine behaviors.
The rules of uniprocessor order were defined by Kuck, et al. [kuck80]. Two successive reads of an operand can occur out of order, but all other pairs of read and write operations for the same operand must occur in the order defined by the program.
The rules of computation and of uniprocessor order apply to all operands, both private and shared. The rest of the rules apply only to shared operands.
The program order rule and its subrules are PO, which requires accesses of shared operands to occur in the order given by the program, together with its restrictions to particular pairs of operations: RR (a read followed by a read), WW (a write followed by a write), WR (a write followed by a read), and RW (a read followed by a write).
One of two (known, nontrivial) mistakes in RAPA was the failure to define program order completely. I defined PO, RR, and WW; I ignored WR and RW. An execution which failed either WR or RW was said simply to fail PO. A paper by Adve, et al. [adgh96] set me straight.
The rules of cache coherence describe the constraints that a multiprocessor system may be required to obey in making changed operand values visible to other processes in the system:
In RAPA the names for each of these rules were:
Note that the meaning of write atomicity as used here is quite different from what Dijkstra meant in describing critical sections. What Dijkstra meant was that if two processes simultaneously stored two different values into an operand, the result had to be either one of the values or the other; it could not be a combination of bits from each of the two values. In RAPA I called this operand atomicity.
The meaning of write atomicity here is that if one process changes the value of an operand, then all processes in the system will see the change in the value of the operand occur at the same instant. In RAPA I called this operation atomicity.
The list of rules was published in a technical report in 1981 [coll81]. Also in the TR were a large number of examples showing how certain executions could not occur on a machine which obeyed a given set of rules. At the time I did not fully understand why these examples were interesting or important.
A colleague dismissed the examples as being "toy programs", mere laboratory curiosities of no significance in the real world. This led me to understand that someone who disputes a fact is easily corrected or ignored, but someone who challenges an intuition or a prejudice can provoke an irritation that endures.
I condensed the TR into a paper which I submitted to the JACM. The one referee's report that I remember said something to the effect of: "What? Umptiump definitions and no theorems? Come on now!" It had seemed to me that just defining the rules was a valuable achievement.
Along about this time John Wait, an acquisitions editor at Prentice-Hall, wrote to me that Prentice-Hall was going to expand its line of books on computer architecture and that he had seen my TR and wondered if it might form the basis for a book. I wrote back that I would like to write a book, but that I did not then know enough about hardware to write a book on computer architecture. Shortly after that I enrolled in a computer science master's program at Syracuse University where I took as many hardware courses as the curriculum allowed.
A Second Try
Meanwhile the criticism of the JACM referee began to have its
effect: I asked myself what theorems could be proved within the model
thus far developed. Well, one thing that would be of some interest to
show would be that if two machines obeyed different rules, then there
was an execution that demonstrated that programmers could detect the
difference between the machines.
More formally, let machine M1 obey rules A and B, and let machine M2 obey rules A, B, and C. Construct, if possible, an execution with the following two characteristics:
1. The events of the execution can be put in a linear order which obeys rules A and B. This shows the execution could occur on machine M1.
2. The rules A, B, and C, applied to the events of the execution, enable one to prove that a circuit exists among the events. This shows that the execution could not occur on machine M2.
Such an execution was said to distinguish the two machines. If no such execution existed, the machines were said to be indistinguishable.
Now I understood the import of the executions in the TR. They were examples of how to distinguish machines with distinct architectures. I set to work to show how to distinguish as many significantly distinct sets of rules as I could. In most cases it was not too hard.
There was one case that was very, very hard. I could not find any execution to distinguish a machine that obeyed CC2 from a machine that obeyed CC1. Well, actually there was one execution, but it involved a rule so weird that no machine would ever obey it. Here are the details.
An execution obeys the rule of read synchronization (RS) if, as in the execution below, whenever P1 fetches A before it fetches B, P2 also fetches A before it fetches B.
Initially, (A,B,X,Y) = (1,2,0,0).

    P1              P2
    L1: X = A + B;  L1: Y = A + B;

Terminally, (A,B,X,Y) = (1,2,3,3).
In the presence of RS it is possible to distinguish two machines which are identical except that one obeys CC1 and the other obeys CC2. What about in the absence of RS? I spent six months on this problem, alternately trying to find a distinguishing execution and trying to find a proof that the two machines would always be indistinguishable.
(At one point my wife woke in the night with severe abdominal pains, the result of adhesions from surgery six months previously. The surgeon said to bring her to the hospital immediately. I grabbed a couple of magazines and my notes for what promised to be a long night. During the surgery I found that I could not sleep and I could not read, but I could, to my amazed relief, relax into the obsession of the moment and try to prove that CC1 was indistinguishable from CC2, oblivious to the crisis around me.)
The attempt to find a counter-example to the proposed theorem led to cases, and then subcases, and then sub-subcases, ..., down to five levels deep in some situations. Each case consisted of one possible path between two events. Looking at one of the deepest subcases (= longest paths), I saw a pattern emerge. To explain it, some notation is required. Let W1 and W2 be the sets of write events for two distinct statements, respectively. Let CC2(W1,W2) state that for the two write events in W1 and W2 for the same store Si, there is an arc labelled <cc2 from the event in W1 to the event in W2, for i = 1, 2, .... The rule of CC2 says that either CC2(W1,W2) is true or CC2(W2,W1) is true for the sets of write events, W1 and W2, in every pair of statements.
Let w1 be a write event in a path in a CC2 execution and let w2 be a second write event further along in the same path. The pattern that I saw showed that no matter what rules related adjacent events in the path from w1 to w2, it was always true that CC2(the write events in the statement containing w1, the write events in the statement containing w2). From this it was easy to show that a CC2 execution was indistinguishable from a CC1 execution (in the absence of RS).
A Program to Help with the Proof
One very tedious part of the proof of the theorem was to show
that the pattern was true for all paths between any two write events.
I wrote a Pascal program to search all paths until one of a half dozen
conditions was met. From a write event there could be 32 rules
connecting the event to a second event. In 26 of these cases one of
the conditions was met. In the other six cases the path had to be
extended to a third event. This involved 6 times 32 cases. In all of
these 192 subcases one of the conditions was met. Thus there were a
total of 224 cases. The program ended after searching a distance of
only two nodes from the root of the search tree.
Probably the right language for the program was Prolog, but I was only a beginning Prolog programmer and a more practiced Pascal programmer, so Pascal won out.
There were two other proofs of the same nature. For each there was a program to perform the tree search. I am in the process of rewriting the programs in C. If anyone is interested in seeing the results, please contact me.
Proving Real Machines to be Correct
Several times when I have talked about the work in RAPA, I have
been asked if I knew how to prove that a machine obeyed a given
architecture. It seems to me that the programs provide a very
tentative first step. A more realistic model would have to define
hundreds, maybe thousands, of components in the events and for the
rules between events.
I once asked Pat Gannon, a formidably competent designer, whether or not the current mainframe (I think it was the 3090) was CC2. He checked the design, and said that it wasn't. I said that then the machine did not achieve its architecture which required it to be CC1. He asked for my reasoning and I showed him the execution which detects a failure to be CC2 (Test T6). He pointed out that the window in which operands could arrive in a non-CC2 fashion was only two cycles long, far too short for the required sequence of events to take place. Therefore the machine would appear to be CC1, even though it was not strictly CC2.
This suggests an additional line of research: take the current model and extend it by adding durations to each event; identify the maximum size of possible windows in which events occur late. Prove that any distinguishing path would necessarily overflow the largest possible window.
[ngmg98] describes an approach to the problem of verifying models of machines, using some of the tests in ARCHTEST.
A second TR in 1984 presented what I then understood [coll84]. The toy programs were presented as proofs of the distinguishability of architectures. The theorem of the essential indistinguishability of a CC2 architecture from a CC1 architecture was stated, and a proof was given. Actually, at that time I had only the key insight needed to prove the theorem. It was not until some time later that I understood that I was missing an essential ingredient, specifically, the structure of nondeterminism.
After the TR came out, I sent a copy to Leslie Lamport and told him that I had gotten the idea that a CC2 system would appear to be CC1 from his article "Time, Clocks, and the Ordering of Events in a Distributed System" [lamp78]. He was kind enough to reply; after thanking me for the TR, he mentioned he was repeatedly astonished at the number of ideas people said they had seen in that particular paper, ideas he had never consciously held and so could certainly never have intended to impart.
The Reification of Nondeterminism
In 1986, having completed the computer science master's program, I
was privileged to spend a semester on campus at Syracuse University.
A seminar on parallelism gave me a chance to explain the ideas in the
two TR's. In the seminar I was confronted with Doug Troeger, Lockwood
Morris, and Roy Heimbach, men whose good will and intelligence I could
not question (as I had sometimes been wont to do with uncomprehending
listeners). When they did not get it, I was forced to acknowledge
that the fault for the failure to communicate was mine.
Out of the resulting frustration I developed what I think is the most important thing in RAPA that is not in the predecessor TR's, namely, the reification of nondeterminism.
Here is the conventional view of nondeterminism.
Let E be an execution of a program. E is composed of some finite number of read and write events, say n. If there is no a priori ordering of the events, then there are fact(n) possible orderings of the events in time. Let R1 be a rule that the events are required to obey. Then R1 induces a partial ordering P1 that the events are required to obey. R1 reduces the number of possible orderings of the events from fact(n) to only those orders which obey P1. If the events are further required to obey rule R2 which induces partial order P2, then R2 reduces even further the number of possible orderings.
The view of nondeterminism developed in RAPA is quite distinct.
As before, let E be an execution composed of n events, and let R1 be a rule that the events are required to obey.
Suppose E is the following execution.
Initially, A = 0.

    P1           P2           P3
    L1: A = 1;   L1: A = 2;   L1: A = 3;

Terminally, A = 3.
Obviously, the statement in P3 executed last. However, it cannot be known in what order P1 and P2 were executed. To represent this ambiguity, the rule of computation induces two partial orderings on the events of the execution. In the first partial ordering, P1 executed before P2; in the second, P2 before P1.
The point is that a rule can induce, not just one, but a (possibly quite large) number of partial orderings on the events of an execution.
At this point it will be convenient to switch to the terminology of graph theory.
Suppose then that E is any execution.
It takes a set of graphs to represent the fact that E obeys rule R1. (And each arc in the graphs is labelled "r1".) Call this set of graphs GS1 (for "graph set 1").
It takes another set of graphs to represent the fact that E obeys rule R2 (with each arc in this set of graphs labelled "r2"). Call this set of graphs GS2 (for "graph set 2").
What does it take to represent E as obeying both rules R1 and R2?
Every possibility under R1 must be paired with every possibility under R2. Proceed as follows. Take the first graph G1 from GS1. For each graph G2 in GS2 create a new graph containing the labelled arcs from both G1 and G2. Do this with all the graphs in GS1 and all the graphs in GS2. The result is called the product GS1*GS2 of GS1 and GS2. GS1*GS2 represents the execution E under rules R1 and R2.
Thus, the requirement that an execution obey an additional rule does not simplify the representation, it complexifies it.
With this representation of nondeterminism I was able to construct a formal proof that, in the absence of RS, if an execution is CC2, then it is CC1.
This is, I think, an elegant result. The complexity of the proof has, so far as I know, kept anyone from verifying its correctness. (However, Ken Nicholas, now at Compaq, thought the proof in the 1984 TR was correct, as far as it went.) Basically, there have been only two reactions. Some folks say, "Yup, seems reasonable." More flattering in its way was the reaction of Ron Smith, a very, very knowledgeable architect at IBM who always listened closely over the years when I talked to him about shared memory architectures. In one of our last conversations, he told me that he understood what I was saying, but that he could not believe that the theorem was true. I took his opinion as evidence that, in the eyes of an expert, the result is not trivial.
Here is the reason, alluded to above, that it is important to develop an exhaustive list of rules. It is not possible to say that a CC2 machine is always a CC1 machine; indeed, RS shows that this is not the case. The exhaustive list makes it possible to say that (in the absence of RS) two machines, one CC1 and the other CC2, obeying any combination of known rules, will be indistinguishable. The more rules that are known, the stronger the proof.
In the fall of 1987 I realized that I would never learn enough to
feel qualified to write a book, but that nonetheless I knew some
things about computer architecture that others could find useful.
Therefore I wrote to John Wait and said I was ready to proceed, if his
offer was still valid. Wait had moved on, but his successor agreed
to begin the process.
I titled the book Reasoning About Parallel Architectures (RAPA) [coll92]. Writing RAPA was the high point of my life. One of the fascinating aspects of the work was to see how I could lie to myself, how I could say that I had finally figured out how to understand a certain point and then return to that point some time later and discover that it still had not been properly dealt with.
I still had the phrase "toy programs" ringing in my ears. The problem with the toy programs was that for them to be effective, they would have to synchronize exactly, to force two or more processes to start running at exactly the same moment, so that possible pathological behavior of the machine could be revealed.
I had thought for years that there had to be long-running programs which would reveal architectural pathology, but I could not conceive of what form they would take. One day, tired of thinking about the problem, I picked up pencil and paper and wrote down one of the toy programs, and then began embellishing it with additional statements of the same form, but with distinct operand values. Voila! I couldn't believe it was so simple. The two processes did not need to be synchronized in their execution. So long as the execution of both processes overlapped at least partly, it was possible to detect the failure of a machine to obey one or another of the rules.
I talked about these programs in Chapter 1 of RAPA. In July of 1991 I had a few weeks of repose with nothing to write or to proofread. I asked myself what other tests could be developed. The result was Appendix D, tacked onto the book at the last minute.
Here is the second mistake in the book. Most of the tests dealt with only a handful of operands. I wanted some tests that would overflow the machine's capacity to cope easily, so I proposed that the tests be extended to operate on arrays of operands, not just a few individual operands. This turned out to be foolish. Each element of the arrays would be accessed only once; therefore, simultaneous interactions on shared data would never occur. See Errata for details.
RAPA appeared in 1992. I received a single, advance copy in March, with more promised in a couple of weeks. Edsger Dijkstra spoke at Marist College in Poughkeepsie a few days later. After his talk there was a reception. We spoke, and I offered him a copy of my book. He graciously accepted, without knowing the circumstances of the offer. I cannot imagine anyone I would rather have given it to.
In my last year with IBM I wrote all the Appendix D tests in assembly language and ran them on three IBM mainframes, an Amdahl machine, and a Hitachi machine. All were expected to obey the IBM architecture, which required write atomicity but which allowed reads to come before logically preceding writes. All passed the tests except the Hitachi machine, which was CC3, as revealed by Test T7.
The discovery won me an audience with the Poughkeepsie Lab Director. I pitched the tests and the results to him and a group of staff people. There were only a couple of questions, and so I congratulated myself on having successfully explained the results, but when I left the meeting, I saw most of the staff gathered around Ron Smith in the hallway, asking him the questions they had not been willing to ask in front of the boss.
Formation of Multiprocessor Diagnostics
After I retired from IBM in 1993, I set up a company,
Multiprocessor Diagnostics, to recreate the tests in C and to market
the program, which I called ARCHTEST. I have learned a lot since
then and have done things I would never have gotten to do had I
stayed at IBM.
Early assistance came from Dennis Eaton, Director of the InfoMall Section (Mid-Hudson) of the Northeast Parallel Architecture Center at Syracuse University, who was seeking to develop high-tech companies in the Mid-Hudson Valley. He was generous in supplying hardware, software, direction, and support to me and to a number of other start-ups.
One thing I learned was how to make cold calls on companies by phone. Once I reached the right group, I found engineers eager to try ARCHTEST. (In those days I offered a free evaluation license.) The scenario was always the same. I would ask the engineers at the start what architecture their machine obeyed. They would say it was cache coherent. I would ask what form. They would reply that they were not sure. Once the tests had been run, then, no matter what the tests showed, they always claimed that the results showed that the machine behaved exactly as they had intended all along.
There was something else that also always happened, something very distressing. The engineers were friendly and eager before the tests were run, and then they were cold and distant afterward. I finally realized what the problem was. When ARCHTEST found that a machine failed a particular test, it printed out a statement that the machine exhibited an "error". I revised the program to speak only of "strong" behavior and "relaxed" behavior (borrowing the terms from a paper by Adve, et al. [adgh96]), and I described ARCHTEST as a tool that could be used to find areas in machine design that could be made either stronger or more relaxed.
Since then I have sold licenses to several companies and have licensed four universities to use ARCHTEST in their research programs.
Release 5.4 of ARCHTEST now allows a user to bring shared operands in a test into a cache in either read-only or exclusive state, an idea suggested by [mntz96].
Plans are afoot to rewrite the proof programs in C, to investigate whether or not there are any critical section routines that would run on a relaxed architecture machine, and to create some new tests.
Last updated January 4, 2006.