- å èšäº
- ARM ã POWER ã® Relaxed Memory Model
- Weak Ordering and Data-Race-Free Sequential Consistency ãžãšç¶ã
å èšäº
ARM ã POWER ã® Relaxed Memory Model
Now let's look at an even more relaxed memory model, the one found on ARM and POWER processors. At an implementation level, these two systems are different in many ways, but the guaranteed memory consistency model turns out to be roughly similar, and quite a bit weaker than x86-TSO or even x86-TLO+CC.
The conceptual model for ARM and POWER systems is that each processor reads from and writes to its own complete copy of memory, and each write propagates to the other processors independently, with reordering allowed as the writes propagate.
- ARM ã PORWER ã®ãããªããã«ãããã¡ã¢ãªã¢ãã«ãèŠãŠã¿ãã
- å®è£ ã¬ãã«ã§ã¯ããããã®2ã€ã®ã·ã¹ãã ã¯å€ãã®æ¹éãéãããã¡ã¢ãªäžè²«æ§ã¢ãã«ã®ä¿èšŒã¯ã ããã䌌éã£ãŠããŠããã㯠x86-TSO ã x86-TLO+CC ã§ãããããéåžžã«åŒ±ã
- ARM ãš POWER ã·ã¹ãã ã®ã³ã³ã»ããã¯ãããããã®ããã»ããµã¯èªèº«ã®ã¡ã¢ãªã®ã³ããŒã«èªã¿èŸŒã¿ãæžã蟌ã¿ãããããããã®æžã蟌ã¿ãä»ã®ããã»ããµã«ç¬ç«ããŠäŒæããæžã蟌ã¿ãäŒæããã«ã€ãäžŠã³æ¿ããèš±å¯ããã
Here, there is no total store order. Not depicted, each processor is also allowed to postpone a read until it needs the result: a read can be delayed until after a later write. In this relaxed model, the answer to every litmus test weâve seen so far is âyes, that really can happen.â
- TSO ã¯ãªããããããã®ããã»ããµãŒã¯çµæãå¿ èŠã«ãªããŸã§èªã¿èŸŒã¿ãé ãããããšãèš±ãããèªã¿èŸŒã¿ã¯åŸã®æžã蟌ã¿åŸãŸã§é ãããããšãã§ãã
- ãã® Relaxed ã¢ãã«ã§ã¯ãç§éããããŸã§ã¿ãŠãããã¹ãŠã®ãªããã¹ãã¹ãã®çã㯠âyesâ ãšãªã
For the original message passing litmus test, the reordering of writes by a single processor means that Thread 1's writes may not be observed by other threads in the same order:
- ãªããã¹ãã¹ããéãå ã®ã¡ãã»ãŒãžã«ã€ããŠãåäžã®ããã»ããµã«ããæžã蟌ã¿ã®äžŠã¹æ¿ãã¯ãã¹ã¬ãã 1 ã®æžã蟌ã¿ãä»ã®ã¹ã¬ããã«ãã£ãŠåãé åºã§èгå¯ãããªãå¯èœæ§ãããããšãæå³ãã
In the ARM/POWER model, we can think of thread 1 and thread 2 each having their own separate copy of memory, with writes propagating between the memories in any order whatsoever. If thread 1's memory sends the update ofÂy
 to thread 2 before sending the update ofÂx
, and if thread 2 executes between those two updates, it will indeed see the resultÂr1
Â=
Â1
,Âr2
Â=
Â0
.
- ARM/POWER ã¢ãã«ã«ãããŠãããããã¡ã¯ãããããåããããã¡ã¢ãªã³ããŒãæã¡ãã©ããªé çªã§ãã¡ã¢ãªéã§æžã蟌ã¿ãäŒæããã¹ã¬ãã 1 ãã¹ã¬ãã 2 ãèããããšãã§ãã
- ããã¹ã¬ãã 1 ã®ã¡ã¢ãªã
y
ã®ã¢ããããŒããx
ã®æŽæ°ãéãåã« ã¹ã¬ãã 2 ãžéä¿¡ããã¹ã¬ãã 2 ããããã® 2 ã€ã®æŽæ°ãå®è¡ããå Žåãr1
Â=
Â1
,Âr2
Â=
Â0
ã®çµæã¯ç¢ºå®ã«çºçããã
This result shows that the ARM/POWER memory model is weaker than TSO: it makes fewer requirements on the hardware. The ARM/POWER model still admits the kinds of reorderings that TSO does:
- ãã®çµæã¯ãARM/POWER ã¡ã¢ãªã¢ãã«ã TSO ãã匱ãããšã衚ããããã¯ããŒããŠã§ã¢ã®èŠæ±ã匱ããã
- ARM/POWER ã¢ãã«ã¯ãŸã ãTSO ãè¡ãäžŠã¹æ¿ãã®çš®é¡ãåŒãç¶ãèªãã
On ARM/POWER, the writes toÂx
 andÂy
 might be made to the local memories but not yet have propagated when the reads occur on the opposite threads.
Hereâs the litmus test that showed what it meant for x86 to have a total store order:
- ARM/POWER ã«ãããŠ
x
ãšy
ãžã®æžã蟌ã¿ã¯ããŒã«ã«ã¡ã¢ãªã«å¯ŸããŠè¡ãããããšãããããå察ã®ã¹ã¬ããã«èªã¿èŸŒã¿ãçºçããæã«ã¯ãŸã äŒæããŠããªã - ãã㯠TSO ãæã€ x86 ã«ãšã£ãŠäœãæå³ãããã衚ããªããã¹ãã¹ãã§ãã
On ARM/POWER, different threads may learn about different writes in different orders. They are not guaranteed to agree about a total order of writes reaching main memory, so Thread 3 can seeÂx
 change beforeÂy
 while Thread 4 seesÂy
 change beforeÂx
.
- ARM/POWER ã«ãããŠãéãã¹ã¬ããã¯ãéãé çªã«éãæžã蟌ã¿ãåŠç¿ãããããããªã
- ãããã¯ãã¡ã€ã³ã¡ã¢ãªã«å°éããæžã蟌ã¿ã®å šé åºã«ã€ããŠåæããããšãä¿èšŒãããªã
- ã€ãŸããã¹ã¬ãã 3 ã¯
y
ã®åã«x
ã®å€æŽã確èªã§ããã¹ã¬ãã 4 ã¯x
ã®åã«y
ã®å€æŽã確èªã§ãã
As another example, ARM/POWER systems have visible buffering or reordering of memory reads (loads), as demonstrated by this litmus test:
- ããäžã€ã®äŸãšããŠãARM/POWER ã·ã¹ãã ã¯ããããã¡ãªã³ã°ãã¡ã¢ãªèªã¿èŸŒã¿ã®äžŠã³æ¿ããæã€ããšããã®ãªããã¹ãã¹ãã«ãã£ãŠå®èšŒããã
Any sequentially consistent interleaving must start with either thread 1'sÂr1
Â=
Âx
 or thread 2'sÂr2
Â=
Ây
. That read must see a zero, making the outcomeÂr1
Â=
Â1
,Âr2
Â=
Â1
 impossible. In the ARM/POWER memory model, however, processors are allowed to delay reads until after writes later in the instruction stream, so thatÂy
Â=
Â1
 andÂx
Â=
Â1
 execute before the two reads.
- 鿬¡äžè²«æ§ã®ããã€ã³ã¿ãŒãªãŒãã¯ãã¹ã¬ãã 1 ã®
r1 = x
ãŸãã¯ã¹ã¬ãã 2 ã®r2 = y
ã§éå§ããå¿ èŠããã - ãã®èªã¿åãã¯å¿
ããŒãã衚瀺ãããçµæ
r1 = 1
ãr2 = 1
ã¯äžå¯èœã«ãªã - ãã ããARM/POWER ã¡ã¢ãªã¢ãã«ã®ããã»ããµã¯ãåœä»€ã®åŸåã®æžã蟌ã¿åŸãŸã§ã«èªã¿èŸŒã¿ãé
ãããããšãèš±ãããŠããã®ã§ã
y = 1
ãšx = 1
㯠2 ã€ã®èªã¿èŸŒã¿ã®åã«å®è¡ããã
Although both the ARM and POWER memory models allow this result, Maranget et al. reported (in 2012) being able to reproduce it empirically only on ARM systems, never on POWER. Here the divergence between model and reality comes into play just as it did when we examined Intel x86: hardware implementing a stronger model than technically guaranteed encourages dependence on the stronger behavior and means that future, weaker hardware will break programs, validly or not.
- ARM ã POWER ã¡ã¢ãªã¢ãã«ã¯ãã®çµæã蚱容ããããMaranget ãã® 2012 幎ã®è«æã¯ãARM ã·ã¹ãã ã§ã®ã¿çµéšçã«åçŸããPOWER ã§ã¯åçŸããªããšå ±åããã
- ããã§ãã€ã³ãã« x86 ã調ã¹ããšããšåãããã«ãã¢ãã«ãšçŸå®ã®éã®çžéãçãã
- æè¡çã«ä¿èšŒãããŠããããã匷åãªã¢ãã«ãå®è£ ããããŒããŠã§ã¢ã¯ããã匷åãªåäœãžã®äŸåãä¿é²ããå°æ¥ããã匱ãããŒããŠã§ã¢ãæå¹ãã©ããã«é¢ä¿ãªãããã°ã©ã ãç Žå£ããããšãæå³ãã
Like on TSO systems, ARM and POWER have barriers that we can insert into the examples above to force sequentially consistent behaviors. But the obvious question is whether ARM/POWER without barriers excludes any behavior at all. Can the answer to any litmus test ever be âno, that canât happen?â It can, when we focus on a single memory location.
Hereâs a litmus test for something that canât happen even on ARM and POWER:
- TSO ã·ã¹ãã ã®ããã«ãARM ãš POWER ã¯åŒ·å¶çã«é次äžè²«æ§ã®æ¯ãèããäžèšã®äŸã«å·®ã蟌ãããšãã§ããããªã¢ãæã€
- ãããæãããªåé¡ã¯ãããªã¢ã®ãªã ARM/POWER ãæ¯ãèããæé€ããã
- ãªããã¹ãã¹ãã«å¯ŸããçããšããŠãânoããããªããšèµ·ããåŸãªãâ ãšããããšããããïŒåäžã®ã¡ã¢ãªäœçœ®ã«çŠç¹ãåœãŠããšããããå¯èœãšãªã
- ãã㯠ARM / POWER ã§ãèµ·ããåŸãªããªããã¹ãã¹ãã§ãã
This litmus test is like the previous one, but now both threads are writing to a single variableÂx
 instead of two distinct variablesÂx
 andÂy
. Threads 1 and 2 write conflicting values 1 and 2 toÂx
, while Thread 3 and Thread 4 both readÂx
 twice. If Thread 3 seesÂx
Â=
Â1
 overwritten byÂx
Â=
Â2
, can Thread 4 see the opposite?
- ãã®ãªããã¹ãã¹ãã¯åã®ãšäŒŒãŠããããäž¡æ¹ã®ã¹ã¬ããã 2 ã€ã®ç°ãªã倿°
x
ãšy
ã§ã¯ãªããåäžã®å€æ°x
ã«æžã蟌ã¿ãè¡ã£ãŠãã - ã¹ã¬ãã 1 ãš 2 ã¯ç«¶åããå€ 1 ãš 2 ã
x
ã«æžã蟌ãããã¹ã¬ãã 3 ãšã¹ã¬ãã 4 ã¯äž¡æ¹ãšãx
ã 2 åèªã¿åã - ã¹ã¬ãã 3 ã
x = 1
ãx = 2
ã«ãã£ãŠäžæžããããããšã確èªããå Žåãã¹ã¬ãã 4 ã¯å察ã®ããšã確èªã§ãããïŒ
The answer is no, even on ARM/POWER: threads in the system must agree about a total order for the writes to a single memory location. That is, threads must agree which writes overwrite other writes. This property is called called coherence. Without the coherence property, processors either disagree about the final result of memory or else report a memory location flip-flopping from one value to another and back to the first. It would be very difficult to program such a system.
- ãã®çãã¯ARM/POWER ã§ã no ã§ãã
- ã·ã¹ãã ã®ã¹ã¬ããã¯ãåäžã®ã¡ã¢ãªãã±ãŒã·ã§ã³ã«æžã蟌ã¿ã®ããã«å šé åºã«ã€ããŠåæããå¿ èŠããã
- ãã®ããããã£ã¯ã³ããŒã¬ã³ã¹ãšåŒã°ãã
- ã³ããŒã¬ã³ã¹ããããã£ãªãã«ãããã»ããµã¯ãã¡ã¢ãªã®æçµçµæã«ã€ããŠæèŠãäžèŽããªãããããå€ããå¥ã®å€ã«ããªããããããããŠæåã®å€ã«æ»ãã¡ã¢ãªäœçœ®ãå ±åãã
- ãã®ãããªã·ã¹ãã ãããã°ã©ã ããããšã¯éåžžã«é£ããããšã§ãã
I'm purposely leaving out a lot of subtleties in the ARM and POWER weak memory models. For more detail, see any of Peter Sewell's papers on the topic. Also, ARMv8 strengthened the memory model by making it âmulticopy atomic,â but I won't take the space here to explain exactly what that means.
- ARM ããã³ POWER ã®åŒ±ãã¡ã¢ãªã¢ãã«ã§ã¯ãå€ãã®åŸ®åŠãªç¹ãæå³çã«çç¥ããŠãã
- ãã詳现ãã¿ãããã«ã¯ã Peter Sewell's papers on the topic ãªã©ãã¿ãŠãã ãã
- ARMv8 㯠 strengthened the memory model ã«ãããã¡ã¢ãªã¢ãã«ãããã«ãã³ã㌠ã¢ãããã¯ãã«ããããšã§åŒ·åãããããããäœãæå³ããã®ããæ£ç¢ºã«èª¬æããããã«ããã§ã¹ããŒã¹ãåããªã
There are two important points to take away. First, there is an incredible amount of subtlety here, the subject of well over a decade of academic research by very persistent, very smart people. I don't claim to understand anywhere near all of it myself. This is not something we should hope to explain to ordinary programmers, not something that we can hope to keep straight while debugging ordinary programs. Second, the gap between what is allowed and what is observed makes for unfortunate future surprises. If current hardware does not exhibit the full range of allowed behaviorsâespecially when it is difficult to reason about what is allowed in the first place!âthen inevitably programs will be written that accidentally depend on the more restricted behaviors of the actual hardware. If a new chip is less restricted in its behaviors, the fact that the new behavior breaking your program is technically allowed by the hardware memory modelâthat is, the bug is technically your faultâis of little consolation. This is no way to write programs.
- æŒãããŠããããéèŠãªãã€ã³ã㯠2 ã€
- ãŸããããã«ã¯ä¿¡ããããªãã»ã©ã®åŸ®åŠãªç¹ããããéåžžã«ç²ã匷ããéåžžã«é ã®è¯ã人ã ã«ãã 10 幎ãã¯ããã«è¶ ããåŠè¡ç ç©¶ã®ããŒãããã
- ç§ã¯ããã®ã»ãšãã©ãã¹ãŠãèªåã§çè§£ããŠãããšã¯äž»åŒµããªã
- ããã¯ãæ®éã®ããã°ã©ããŒã«èª¬æãããããšã§ã¯ãªããæ®éã®ããã°ã©ã ããããã°ããŠããéãæ£ãããä¿ã€ããšãæåŸ ã§ãããã®ã§ããªã
- 2 ã€ç®ã«ãèš±å¯ãããŠããããšãšèгå¯ãããŠããããšã®éã®ã®ã£ããã¯ãäžå¹žãªå°æ¥ã®é©ãããããã
- çŸåšã®ããŒããŠã§ã¢ãèš±å¯ãããŠãããã¹ãŠã®åäœã瀺ããŠããªãå ŽåïŒç¹ã«ãããããäœãèš±å¯ãããŠãããã倿ããã®ãé£ããå ŽåïŒ
- å¿ ç¶çã«ãå®éã®ããŒããŠã§ã¢ã®ããå¶éãããåäœã«èª€ã£ãŠäŸåããããã°ã©ã ãäœæããã
- æ°ãããããã®åäœã®å¶éãç·©åãããŠããå Žåãããã°ã©ã ãå£ãæ°ããåäœãããŒããŠã§ã¢ã¡ã¢ãªã¢ãã«ã«ãã£ãŠæè¡çã«èš±å¯ãããŠãããšããäºå®ïŒã€ãŸãããã°ã¯æè¡çã«ã¯ãŠãŒã¶ãŒã®è²¬ä»»ã§ããïŒã¯ãã»ãšãã©æ °ãã«ã¯ãªããªã
- ããã¯ããã°ã©ã ãæžãæ¹æ³ã§ã¯ãªã