SSE instructions: which CPUs can do atomic 16B memory operations?

Consider a single memory access (a single read or a single write, not read+write) SSE instruction on an x86 CPU. The instruction is accessing 16 bytes (128 bits) of memory and the accessed memory location is aligned to 16 bytes.

The document "Intel® 64 Architecture Memory Ordering White Paper" states that for "Instructions that read or write a quadword (8 bytes) whose address is aligned on an 8 byte boundary" the memory operation appears to execute as a single memory access regardless of memory type.

The question: Do there exist Intel/AMD/etc x86 CPUs which guarantee that reading or writing 16 bytes (128 bits) aligned to a 16 byte boundary executes as a single memory access? If so, which particular type of CPU is it (Core2/Atom/K8/Phenom/...)? If you provide an answer (yes/no) to this question, please also specify the method that was used to determine the answer: PDF document lookup, brute force testing, math proof, or whatever other method you used.

This question relates to problems such as http://research.swtch.com/2010/02/off-to-races.html

Update:

I created a simple test program in C that you can run on your computers. Please compile and run it on your Phenom, Athlon, Bobcat, Core2, Atom, Sandy Bridge or whatever SSE2-capable CPU you happen to have. Thanks.

// Compile with:
//   gcc -o a a.c -pthread -msse2 -std=c99 -Wall -O2
//
// Make sure you have at least two physical CPU cores or hyper-threading.

#include <pthread.h>
#include <emmintrin.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>   // abort()
#include <string.h>

typedef int v4si __attribute__ ((vector_size (16)));
volatile v4si x;

unsigned n1[16] __attribute__((aligned(64)));
unsigned n2[16] __attribute__((aligned(64)));

void* thread1(void *arg) {
        for (int i=0; i<100*1000*1000; i++) {
                int mask = _mm_movemask_ps((__m128)x);
                n1[mask]++;

                x = (v4si){0,0,0,0};
        }
        return NULL;
}

void* thread2(void *arg) {
        for (int i=0; i<100*1000*1000; i++) {
                int mask = _mm_movemask_ps((__m128)x);
                n2[mask]++;

                x = (v4si){-1,-1,-1,-1};
        }
        return NULL;
}

int main() {
        // Check memory alignment
        if ( (((uintptr_t)&x) & 0x0f) != 0 )
                abort();

        memset(n1, 0, sizeof(n1));
        memset(n2, 0, sizeof(n2));

        pthread_t t1, t2;
        pthread_create(&t1, NULL, thread1, NULL);
        pthread_create(&t2, NULL, thread2, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);

        for (unsigned i=0; i<16; i++) {
                for (int j=3; j>=0; j--)
                        printf("%d", (i>>j)&1);

                printf("  %10u %10u", n1[i], n2[i]);
                if(i>0 && i<0x0f) {
                        if(n1[i] || n2[i])
                                printf("  Not a single memory access!");
                }

                printf("\n");
        }

        return 0;
}

The CPU I have in my notebook is Core Duo (not Core2). This particular CPU fails the test: it implements 16-byte memory reads and writes with a granularity of 8 bytes. The output is:

0000    96905702      10512
0001           0          0
0010           0          0
0011          22      12924  Not a single memory access!
0100           0          0
0101           0          0
0110           0          0
0111           0          0
1000           0          0
1001           0          0
1010           0          0
1011           0          0
1100     3092557       1175  Not a single memory access!
1101           0          0
1110           0          0
1111        1719   99975389
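
To spell out why this output points at 8-byte granularity: each mask bit is the sign bit of one 32-bit lane of x, so any mask other than 0000 or 1111 means the load saw lanes from both writers. The helper below is only a rough sketch of that reasoning. tear_granularity is a name made up for this illustration and is not part of the test program.

```c
/* Hypothetical helper, not part of the original test program.
 * mask is the 4-bit movmskps result: bit k is the sign bit of 32-bit lane k.
 * Returns the largest chunk size in bytes (16, 8 or 4) such that every
 * aligned chunk of x looks like it came from a single writer. */
static int tear_granularity(unsigned mask)
{
    mask &= 0xFu;
    if (mask == 0x0u || mask == 0xFu)
        return 16;                  /* all four lanes agree: no tearing seen */

    unsigned lo = mask & 0x3u;      /* lanes 0-1, bytes 0..7  */
    unsigned hi = mask >> 2;        /* lanes 2-3, bytes 8..15 */
    if ((lo == 0x0u || lo == 0x3u) && (hi == 0x0u || hi == 0x3u))
        return 8;                   /* each 8-byte half is consistent */

    return 4;                       /* tearing inside an 8-byte half */
}
```

In the Core Duo output the only torn masks that occur are 0011 and 1100, and both of them classify as 8 here.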

In the Intel® 64 and IA-32 Architectures Developer's Manual: Vol. 3A, which nowadays contains the specifications of the memory ordering white paper you mention, it is said in section 8.1.1 that:

The Intel486 processor (and newer processors since) guarantees that the following basic memory operations will always be carried out atomically:

Reading or writing a byte.

Reading or writing a word aligned on a 16-bit boundary.

Reading or writing a doubleword aligned on a 32-bit boundary.

The Pentium processor (and newer processors since) guarantees that the following additional memory operations will always be carried out atomically:

Reading or writing a quadword aligned on a 64-bit boundary.

16-bit accesses to uncached memory locations that fit within a 32-bit data bus.

The P6 family processors (and newer processors since) guarantee that the following additional memory operation will always be carried out atomically:

Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache line.

Processors that enumerate support for Intel® AVX (by setting the feature flag CPUID.01H:ECX.AVX[bit 28]) guarantee that the 16-byte memory operations performed by the following instructions will always be carried out atomically:

MOVAPD, MOVAPS, and MOVDQA.

VMOVAPD, VMOVAPS, and VMOVDQA when encoded with VEX.128.

VMOVAPD, VMOVAPS, VMOVDQA32, and VMOVDQA64 when encoded with EVEX.128 and k0 (masking disabled).

(Note that these instructions require the linear addresses of their memory operands to be 16-byte aligned.)

Each of the writes x = (v4si){0,0,0,0} and x = (v4si){-1,-1,-1,-1} is probably compiled into a single 16-byte MOVAPS. The address of x is 16-byte aligned. On an Intel processor that supports AVX, these writes are guaranteed to be atomic. Otherwise, there is no such guarantee.

On AMD processors, AMD64 Architecture Programmer's Manual, Section 3.9.1.3 states that

Single load or store operations (from instructions that do just a single load or store) are naturally atomic on any AMD64 processor as long as they do not cross an aligned 8-byte boundary. Accesses up to eight bytes in size which do cross such a boundary may be performed atomically using certain instructions with a lock prefix, such as XCHG, CMPXCHG or CMPXCHG8B, as long as all such accesses are done using the same technique. (Note that misaligned locked accesses may be subject to heavy performance penalties.) CMPXCHG16B can be used to perform 16-byte atomic accesses in 64- bit mode (with certain alignment restrictions).

AMD processors thus do not guarantee that AVX instructions provide 16-byte atomicity.
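
Written as a predicate, the rule in that quote looks roughly like the sketch below. amd_naturally_atomic is a made-up name for illustration. It encodes only the "does not cross an aligned 8-byte boundary" condition from the quoted text and says nothing about memory type or about any later manual revision.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if a single load or store of `size` bytes
 * at `addr` stays inside one aligned 8-byte block, which is the condition
 * the quoted AMD text gives for natural atomicity. */
static int amd_naturally_atomic(uintptr_t addr, size_t size)
{
    if (size == 0 || size > 8)
        return 0;                   /* a 16-byte access always crosses */
    return (addr & 7u) + size <= 8;
}
```

A 16-byte access fails this test at every address, which is why the quoted wording gives no guarantee for MOVAPS or MOVDQA.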

On Intel processors that don't support AVX and on AMD processors, the CMPXCHG16B instruction with the LOCK prefix can be used. You can use the CPUID instruction to figure out if your processor supports CMPXCHG16B (the "CX16" feature bit).
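
A minimal sketch of that CPUID check, assuming GCC or Clang on x86 (it uses their <cpuid.h> header). The function names are made up for this example. AVX is bit 28 of ECX for leaf 1 as quoted above, and CX16 is bit 13 of the same register.

```c
#include <cpuid.h>   /* GCC/Clang wrapper around the CPUID instruction */

/* Hypothetical helper: returns one bit of CPUID.01H:ECX, or 0 if leaf 1
 * is not available. */
static int cpuid1_ecx_bit(unsigned bit)
{
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    return (int)((ecx >> bit) & 1u);
}

/* CPUID.01H:ECX bit 13: CMPXCHG16B is supported. */
static int cpu_has_cx16(void) { return cpuid1_ecx_bit(13); }

/* CPUID.01H:ECX bit 28: the processor enumerates AVX. */
static int cpu_has_avx(void)  { return cpuid1_ecx_bit(28); }
```

This only reports what the processor enumerates. Whether the OS has enabled AVX state is a separate question that this sketch does not check.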

EDIT: Test program results

(Test program modified to increase #iterations by a factor of 10)

On a Xeon X3450 (x86-64):

0000   999998139       1572
0001           0          0
0010           0          0
0011           0          0
0100           0          0
0101           0          0
0110           0          0
0111           0          0
1000           0          0
1001           0          0
1010           0          0
1011           0          0
1100           0          0
1101           0          0
1110           0          0
1111        1861  999998428

On a Xeon 5150 (32-bit):

0000   999243100     283087
0001           0          0
0010           0          0
0011           0          0
0100           0          0
0101           0          0
0110           0          0
0111           0          0
1000           0          0
1001           0          0
1010           0          0
1011           0          0
1100           0          0
1101           0          0
1110           0          0
1111      756900  999716913

On an Opteron 2435 (x86-64):

0000   999995893       1901
0001           0          0
0010           0          0
0011           0          0
0100           0          0
0101           0          0
0110           0          0
0111           0          0
1000           0          0
1001           0          0
1010           0          0
1011           0          0
1100           0          0
1101           0          0
1110           0          0
1111        4107  999998099

Note that the Intel Xeon X3450 and Xeon 5150 don't support AVX. The Opteron 2435 is an AMD processor.

Does this mean that Intel and/or AMD guarantee that 16 byte memory accesses are atomic on these machines? IMHO, it does not. It's not in the documentation as guaranteed architectural behavior, and thus one cannot know if on these particular processors 16 byte memory accesses really are atomic or whether the test program merely fails to trigger them for one reason or another. And thus relying on it is dangerous.

EDIT 2: How to make the test program fail

Ha! I managed to make the test program fail. On the same Opteron 2435 as above, with the same binary, but now running it via the "numactl" tool specifying that each thread runs on a separate socket, I got:

0000   999998634       5990
0001           0          0
0010           0          0
0011           0          0
0100           0          0
0101           0          0
0110           0          0
0111           0          0
1000           0          0
1001           0          0
1010           0          0
1011           0          0
1100           0          1  Not a single memory access!
1101           0          0
1110           0          0
1111        1366  999994009

So what does this imply? Well, the Opteron 2435 may, or may not, guarantee that 16-byte memory accesses are atomic for intra-socket accesses, but at least the cache coherency protocol running on the HyperTransport interconnect between the two sockets does not provide such a guarantee.

EDIT 3: ASM for the thread functions, on request of "GJ."

Here's the generated asm for the thread functions for the GCC 4.4 x86-64 version used on the Opteron 2435 system:


.globl thread2
        .type   thread2, @function
thread2:
.LFB537:
        .cfi_startproc
        movdqa  .LC3(%rip), %xmm1
        xorl    %eax, %eax
        .p2align 5,,24
        .p2align 3
.L11:
        movaps  x(%rip), %xmm0
        incl    %eax
        movaps  %xmm1, x(%rip)
        movmskps        %xmm0, %edx
        movslq  %edx, %rdx
        incl    n2(,%rdx,4)
        cmpl    $1000000000, %eax
        jne     .L11
        xorl    %eax, %eax
        ret
        .cfi_endproc
.LFE537:
        .size   thread2, .-thread2
        .p2align 5,,31
.globl thread1
        .type   thread1, @function
thread1:
.LFB536:
        .cfi_startproc
        pxor    %xmm1, %xmm1
        xorl    %eax, %eax
        .p2align 5,,24
        .p2align 3
.L15:
        movaps  x(%rip), %xmm0
        incl    %eax
        movaps  %xmm1, x(%rip)
        movmskps        %xmm0, %edx
        movslq  %edx, %rdx
        incl    n1(,%rdx,4)
        cmpl    $1000000000, %eax
        jne     .L15
        xorl    %eax, %eax
        ret
        .cfi_endproc

and for completeness, .LC3 which is the static data containing the (-1, -1, -1, -1) vector used by thread2:


.LC3:
        .long   -1
        .long   -1
        .long   -1
        .long   -1
        .ident  "GCC: (GNU) 4.4.4 20100726 (Red Hat 4.4.4-13)"
        .section        .note.GNU-stack,"",@progbits

Also note that this is AT&T ASM syntax, not the Intel syntax Windows programmers might be more familiar with. Finally, this is with march=native which makes GCC prefer MOVAPS; but it doesn't matter, if I use march=core2 it will use MOVDQA for storing to x, and I can still reproduce the failures.

The "AMD Architecture Programmer's Manual Volume 1: Application Programming" says in section 3.9.1: "CMPXCHG16B can be used to perform 16-byte atomic accesses in 64-bit mode (with certain alignment restrictions)."

However, there is no such comment about SSE instructions. In fact, there is a comment in 4.8.3 that the LOCK prefix "causes an invalid-opcode exception when used with 128-bit media instructions". It therefore seems pretty conclusive to me that the AMD processors do NOT guarantee atomic 128-bit accesses for SSE instructions, and the only way to do an atomic 128-bit access is to use CMPXCHG16B.

The "Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 3A: System Programming Guide, Part 1" says in 8.1.1 "An x87 instruction or an SSE instructions that accesses data larger than a quadword may be implemented using multiple memory accesses." This is pretty conclusive that 128-bit SSE instructions are not guaranteed atomic by the ISA. Volume 2A of the Intel docs says of CMPXCHG16B: "This instruction can be used with a LOCK prefix to allow the instruction to be executed atomically."
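
For illustration, here is a rough sketch of 16-byte atomic load and store built on LOCK CMPXCHG16B, assuming x86-64 and GCC/Clang extended inline asm. The type u128 and the names cas16, atomic_load_16 and atomic_store_16 are made up for this example. They are not the libatomic functions of similar name.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint64_t lo, hi; } __attribute__((aligned(16))) u128;

/* LOCK CMPXCHG16B compares RDX:RAX with *p. On a match it stores RCX:RBX
 * into *p, otherwise it loads *p into RDX:RAX. Returns nonzero on a match;
 * *expected always ends up holding the value that was in memory. */
static inline int cas16(u128 *p, u128 *expected, u128 desired)
{
    unsigned char ok;
    assert(((uintptr_t)p & 15u) == 0);   /* misaligned operand faults (#GP) */
    __asm__ __volatile__(
        "lock cmpxchg16b %[mem]\n\t"
        "setz %[ok]"
        : [mem] "+m" (*p), "+a" (expected->lo), "+d" (expected->hi),
          [ok] "=&r" (ok)
        : "b" (desired.lo), "c" (desired.hi)
        : "cc", "memory");
    return ok;
}

/* Atomic load: try to replace 0 with 0. Either *p was 0 and is rewritten
 * unchanged, or the compare fails and the current value is returned. */
static inline u128 atomic_load_16(u128 *p)
{
    u128 seen = { 0, 0 };
    cas16(p, &seen, seen);
    return seen;
}

/* Atomic store: retry until the compare succeeds against the latest value. */
static inline void atomic_store_16(u128 *p, u128 value)
{
    u128 seen = { 0, 0 };
    while (!cas16(p, &seen, value))
        ;   /* seen was refreshed with the current contents; try again */
}
```

Note that the load goes through a locked read-modify-write, so it needs a writable mapping and costs far more than a plain MOVDQA.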

Further, CPU manufacturers haven't published written guarantees of atomic 128b SSE operations for specific CPU models where that is the case.

Erik Rigtorp has done some experimental testing on recent Intel and AMD CPUs to look for tearing. Results at https://rigtorp.se/isatomic/. Keep in mind there's no documentation or guarantee about this behaviour, and IDK if it's possible for a custom many-socket machine using such CPUs to have less atomicity than the machines he tested on. But on current x86 CPUs (not K10), SIMD atomicity for aligned loads/stores simply scales with data-path width between the load/store units and L1d cache.

The x86 ISA only guarantees atomicity for things up to 8B, so that implementations are free to implement SSE / AVX support the way Pentium III / Pentium M / Core Duo do: internally data is handled in 64bit halves. A 128bit store is done as two 64bit stores. The data path to/from cache is only 64b wide in the Yonah microarchitecture (Core Duo). (Source: Agner Fog's microarch doc.)

More recent implementations do have wider data paths internally, and handle 128b instructions as a single op. Core 2 Duo (conroe/merom) was the first Intel P6-descended microarch with 128b data paths. (IDK about P4, but fortunately it's old enough to be totally irrelevant.)

This is why the OP finds that 128b ops are not atomic on Intel Core Duo (Yonah), but other posters find that they are atomic on later Intel designs, starting with Core 2 (Merom).

The diagrams on this Realworldtech writeup about Merom vs. Yonah show the 128bit path between ALU and L1 data-cache in Merom (and P4), while the low-power Yonah has a 64bit data path. The data path between L1 and L2 cache is 256b in all 3 designs.

The next jump in data path width came with Intel's Haswell, featuring 256b (32B) AVX/AVX2 loads/stores, and a 64Byte path between L1 and L2 cache. I expect that 256b loads/stores are atomic in Haswell, Broadwell, and Skylake, but I don't have one to test. I forget if Skylake again widened the paths in preparation for AVX512 in Skylake-EP (the server version), or if perhaps the initial implementation of AVX512 will be like SnB/IvB's AVX, and have 512b loads/stores occupy a load/store port for 2 cycles.

As janneb points out in his excellent experimental answer, the cache-coherency protocol between sockets in a multi-socket system might be narrower than what you get within a shared-last-level-cache CPU. There is no architectural requirement on atomicity for wide loads/stores, so designers are free to make them atomic within a socket but non-atomic across sockets if that's convenient. IDK how wide the inter-socket logical data path is for AMD's Bulldozer-family, or for Intel. (I say "logical", because even if the data is transferred in smaller chunks, it might not modify a cache line until it's fully received.)

Finding similar articles about AMD CPUs should allow drawing reasonable conclusions about whether 128b ops are atomic or not. Just checking instruction tables is some help:

K8 decodes movaps reg, [mem] to 2 m-ops, while K10 and bulldozer-family decode it to 1 m-op. AMD's low-power bobcat decodes it to 2 ops, while jaguar decodes 128b movaps to 1 m-op. (It supports AVX1 similar to bulldozer-family CPUs: 256b insns (even ALU ops) are split into two 128b ops. Intel SnB only splits 256b loads/stores, while having full-width ALUs.)

janneb's Opteron 2435 is a 6-core Istanbul CPU, which is part of the K10 family, so this single-m-op -> atomic conclusion appears accurate within a single socket.

Intel Silvermont does 128b loads/stores with a single uop, and a throughput of one per clock. This is the same as for integer loads/stores, so it's quite probably atomic.

There is actually a warning in the Intel Architecture Manual Vol 3A. Section 8.1.1 (May 2011), under the section of guaranteed atomic operations:

An x87 instruction or an SSE instructions that accesses data larger than a quadword may be implemented using multiple memory accesses. If such an instruction stores to memory, some of the accesses may complete (writing to memory) while another causes the operation to fault for architectural reasons (e.g. due an page-table entry that is marked “not present”). In this case, the effects of the completed accesses may be visible to software even though the overall instruction caused a fault. If TLB invalidation has been delayed (see Section 4.10.4.4), such page faults may occur even if all accesses are to the same page.

thus SSE instructions are not guaranteed to be atomic, even if the underlying architecture does use a single memory access (this is one reason why the memory fencing was introduced).

Combine that with this statement from the Intel Optimization Manual, Section 13.3 (April 2011)

AVX and FMA instructions do not introduce any new guaranteed atomic memory operations.

and the fact that none of the SIMD load or store operations guarantee atomicity, and we can conclude that Intel doesn't support any form of atomic SIMD (yet).

As an extra bit, if the memory is split along cache lines or page boundaries (when using things like movdqu which permit unaligned access), the following processors will not perform atomic accesses, regardless of alignment, but later processors will (again from the Intel Architecture Manual):

Intel Core 2 Duo, Intel® Atom™, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon, P6 family, Pentium, and Intel486 processors. The Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon, and P6 family processors

It looks like AMD will also specify in the next revision of their manual that aligned 16-byte loads and stores are atomic on their x86 processors which support AVX. (Source)

Apologies for late response!

We would update the AMD APM manuals in the next revision.

For all AMD architectures,

Processors that support AVX extend the atomicity for cacheable, naturally-aligned single loads or stores from a quadword to a double quadword.

which means all 128b instructions, even the *MOVDQU instructions, are atomic if they end up being naturally aligned.

Can we extend this patch to AMD processors as well. If not, I will plan to submit the patch for stage-1!

With this, the patch making libatomic use vmovdqa in their implementation of __atomic_load_16 and __atomic_store_16 not only on Intel processors with AVX but also on AMD processors with AVX has landed on the master branch.

A lot of answers have been posted so far, so a lot of information is already available (and, as a side effect, a lot of confusion too). I would like to cite facts from the Intel manual regarding hardware-guaranteed atomic operations ...

In Intel's latest processors of the Nehalem and Sandy Bridge families, reading or writing a quadword aligned on a 64-bit boundary is guaranteed to be atomic.

Even unaligned 2, 4 or 8 byte reads or writes are guaranteed to be atomic provided they are cached memory and fit in a cache line.
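
As a rough sketch, the "fits in a cache line" condition can be checked like this. fits_in_cache_line and CACHE_LINE_SIZE are names made up for this example, and the 64-byte line size is an assumption rather than something queried from the CPU. The check covers only the geometry, not whether the memory is cached.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64u   /* assumption: 64-byte cache lines */

/* Hypothetical helper: returns 1 if the `size` bytes starting at `addr`
 * all lie within a single cache line. */
static int fits_in_cache_line(uintptr_t addr, size_t size)
{
    if (size == 0 || size > CACHE_LINE_SIZE)
        return 0;
    return (addr % CACHE_LINE_SIZE) + size <= CACHE_LINE_SIZE;
}
```

The guarantee itself still stops at 8 bytes: a 16-byte access that passes this check is not covered by the statement above.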

Having said that, the test posted in this question passes on a Sandy Bridge based Intel i5 processor.

EDIT: In the last two days I have run several tests on my three PCs and I couldn't reproduce any memory error, so I can't say anything more precise. Maybe this memory error also depends on the OS.

EDIT: I program in Delphi and not in C, but I can read C. So I have translated the code; here are the thread procedures, where the main part is written in assembler:

procedure TThread1.Execute;
var
  n             :cardinal;
const
  ConstAll0     :array[0..3] of integer =(0,0,0,0);
begin
  for n := 0 to 100000000 do
    asm
      movdqa    xmm0, dqword [x]
      movmskps  eax, xmm0
      inc       dword ptr[n1 + eax *4]
      movdqu    xmm0, dqword [ConstAll0]
      movdqa    dqword [x], xmm0
    end;
end;

{ TThread2 }

procedure TThread2.Execute;
var
  n             :cardinal;
const
  ConstAll1     :array[0..3] of integer =(-1,-1,-1,-1);
begin
  for n := 0 to 100000000 do
    asm
      movdqa    xmm0, dqword [x]
      movmskps  eax, xmm0
      inc       dword ptr[n2 + eax *4]
      movdqu    xmm0, dqword [ConstAll1]
      movdqa    dqword [x], xmm0
    end;
end;

Result: no mistake on my quad core PC and no mistake on my dual core PC as expected!