Comments on an Intel Memory Model White Paper

A customer for ARCHTEST asked me to comment on the Intel(R) 64 Architecture Memory Ordering White Paper [INTEL07]. I found the document so interesting that I thought these observations might be of interest to a wider audience. Hence I am posting them here.

For an alternative analysis of the examples in the white paper see:

       Example      RAPA        ARCHTEST
                Theorem Page
        2.1       A7     154    http://www.mpdiag.com/analysis.html#t02
        2.2       A4     152    http://www.mpdiag.com/analysis.html#t04

        2.3.a     A5     152    http://www.mpdiag.com/analysis.html#t04
        2.3.b     A1     150    http://www.mpdiag.com/analysis.html#t08
 
        2.4                     http://www.mpdiag.com/analysis.html#t07
        2.5       A29    173    http://www.mpdiag.com/analysis.html#t05

        2.6       A36    180    http://www.mpdiag.com/analysis.html#t09
        2.7       A18    163    http://www.mpdiag.com/analysis.html#t06

Page 5.

This page is completely clear.

Page 6.

1. Instructions and memory accesses

In RAPA I identified two distinct meanings of the phrase "write atomic".

WA1. If two processors write to the same location in memory at the same time, the value of the result in memory will be the result written by one of the processors or by the other of the two processors. WA1 applies to a multiprocessing system without caches, where two processors can seek to modify the same operand in main storage at the same time.

Example. One processor writes '3'; the other writes '5'. The final result in memory will be either '3' or '5'; no other result is allowed.

WA2. If a processor writes to an operand, all copies of the operand will appear to change value at the same instant. Therefore, if two processors write to the same operand at the same time, all copies of the operand will appear to change to one of the values at the same instant, and then subsequently all copies will appear to change to the other value at the same (subsequent) instant. WA2 applies to a system with caches, where more than one copy of an operand can exist at a time.

The four statements at the top of the page simply state the conditions under which WA1 can be counted on to hold.

The second paragraph is about WA2. The statement starting with, "All locked instructions . . ." can be parsed a couple of ways. What the authors intend to convey is that a locked instruction consists of a sequence of this form:

load1, load2, ..., store1, store2, ...

Loads occur first; stores occur last. The entire sequence appears to occur instantaneously. No other such sequence or instruction appears to occur at the same time. (The document doesn't exactly say this, but that is what is intended. See Item 7 on page 7.)

Page 7.

2. Memory ordering for write-back (WB) memory

Line 1. "coherent write-back (WB) memory" Is there noncoherent write-back memory? I don't think so. Therefore, saying "coherent write-back (WB) memory" communicates no information; it only raises questions.

Line 2. "WB memory is . . . ." This is an interesting effort to define "WB memory" in terms of what a higher-level language programmer would understand. But all the subsequent examples are in terms of assembly language programs. Better, I think, to stay all the way with discussions in terms of assembly language. Maybe just call WB memory main storage.

Lines 8-9. Older and younger instructions. Nice terminology.

Lines 12-13. "if all processors agree on their order of execution"

Intuitively this is clear. But only until someone asks what it means for processors to AGREE on the order of execution of instructions. Processors are not sentient.

In the paragraph beginning "Intel 64 memory ordering . . . ":

Item 5. causality. This does not define a principle; it only gives it a name. Actually, two names: causality and transitive visibility. On Page 11 neither term is defined. Only an example is given of what they are not. This is very unsatisfying.

Item 6. This is the current, generally accepted definition of cache coherence. (See CC3 in http://www.mpdiag. com/analysis.html#rcc.)

Item 7. The effect of this is to say that only one locked instruction can occur at a time. In the terminology of RAPA locked instructions are statement atomic.

Page 10.

In Table 2.3.b the code for Processor 1 is irrelevant. The point is made simply with Processor 0. In RAPA the rule obeyed here was called uniprocessor order (UPO).

To complete the requirement that the machine obey UPO, the following two behaviors are required:

  Processor 0
  mov [_x],1
  mov [-x],2
  Must have x == 2.
  
  Processor 0
  Initially x == 0.
  mov r1,[_x]
  mov [-x],2
  Must have r1 == 0 and x == 2.
Example 2.4. The explanation refers to wanting a processor to be "self consistent". This principle is not defined; it is not included in the list of principles on page 7.

In my view self consistency is part of uniprocessor order. It requires that the values seen in M2 and M5 are equal to 1, and it has nothing to do with the values seen in M3 and M6.

This example is important to note since the analysis in RAPA shows that it explicitly allows a machine to visibly fail to be CC1.

Page 11.

Example 2.5. The phrase "causally precede" identifies another principle missing from the list of principles on page 7. It corresponds to a part of what in RAPA was called the compute rule. That is, roughly, that for every terminal value of an operand there must be a pattern of read and write events which allows that terminal value to be computed. The compute rule is implicitly involved in every example in this white paper, but it is only here that it is invoked.

This is one, single, explicit example. It is not a definition of a principle.

Further, it is an example of what transitive visibility (TV) is NOT. It does not tell a programmer how to reason about TV in any other context, except in so far as that context contains exactly the example given.

There may be reasons in the hardware that Example 2.5 cannot occur. All the runs of Test T5 in ARCHTEST have produced no evidence of Example 2.5 failing. So the engineers may have said that since we are certain it cannot happen, we will tell the programmers that they can count on it's not happening.

The overall result is that programmers will know that the machine is CC1 in Example 2.5 but only CC3 in Example 2.4. This leaves open the question of when, in what other cases, the machine is either CC3 or CC1. Saying the machine is TV does not answer this question.

Example 2.6. The white paper says that accesses are coherent. This is another principle which is not included in the list of principles on page 7. It is not defined anywhere in the document so it cannot imply anything. I think it is not necessary. Saying all stores to a given operand are visible in the same order to all processes achieves everything the authors want to achieve, namely, CC3. See http://www.mpdiag. com/analysis.html#rcc.

Page 12.

Example 2.7.

An interesting question is what behavior is allowed if the two locked stores are replaced by plain old store operations.

Case 1. These values are not allowed: r3 == 1, r4 == 0, r5 == 1, r6 == 0. Then the Example would require CC1 behavior.

Case 2. These values are allowed: r3 == 1, r4 == 0, r5 == 1, r6 == 0. Then the Example would allow CC3 behavior.

Example 2.4 allowed the machine to be CC3. Example 2.5 required it to be CC1. It is not possible for a programmer to deduce, based on the contents of the white paper, what the behavior of the machine will be in this case.

Summary

In the terminology of RAPA the white paper requires that a machine:

References

Site Map

Last updated June 22, 2008.