I have this function, which uses SSE2 to add some values together. It's supposed to add lhs and rhs together and store the result back into lhs:
template<typename T>
void simdAdd(T *lhs,T *rhs)
{
    asm volatile("movups %0,%%xmm0"::"m"(lhs));
    asm volatile("movups %0,%%xmm1"::"m"(rhs));
    switch(sizeof(T))
    {
        case sizeof(uint8_t):
        asm volatile("paddb %%xmm0,%%xmm1":);
        break;
        case sizeof(uint16_t):
        asm volatile("paddw %%xmm0,%%xmm1":);
        break;
        case sizeof(float):
        asm volatile("addps %%xmm0,%%xmm1":);
        break;
        case sizeof(double):
        asm volatile("addpd %%xmm0,%%xmm1":);
        break;
        default:
        std::cout<<"error"<<std::endl;
        break;
    }
    asm volatile("movups %%xmm0,%0":"=m"(lhs));
}
and my code uses the function like this:
float *values=new float[4];
float *values2=new float[4];
values[0]=1.0f;
values[1]=2.0f;
values[2]=3.0f;
values[3]=4.0f;
values2[0]=1.0f;
values2[1]=2.0f;
values2[2]=3.0f;
values2[3]=4.0f;
simdAdd(values,values2);
for(uint32_t count=0;count<4;count++) std::cout<<values[count]<<std::endl;
However, this isn't working: when the code runs it outputs 1,2,3,4 instead of 2,4,6,8.
I've found that inline assembly support isn't reliable in most modern compilers (as in, the implementations are just plain buggy). You are generally better off using compiler intrinsics, which are declarations that look like C functions but actually compile to a specific opcode.
Intrinsics let you specify an exact sequence of opcodes, but leave the register allocation to the compiler. That's much more reliable than trying to move data between C variables and asm registers, which is where inline assemblers have always fallen down for me. It also lets the compiler schedule your instructions, which can give better performance if it works around pipeline hazards. For example, in this case you could do:
void simdAdd(float *lhs,float *rhs)
{
   _mm_storeu_ps( lhs, _mm_add_ps(_mm_loadu_ps( lhs ), _mm_loadu_ps( rhs )) );
}
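If the arrays can be longer than four floats, the same three intrinsics can be wrapped in a loop. This is only a sketch: the name simdAddN and the length parameter n are my own additions, not part of the question.

```cpp
#include <cstddef>
#include <xmmintrin.h>  // SSE: _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps

// Hypothetical helper (not from the question): lhs[i] += rhs[i] for i in [0, n).
void simdAddN(float *lhs, const float *rhs, std::size_t n)
{
    std::size_t i = 0;
    // Main loop: four floats per iteration, using unaligned loads and stores.
    for (; i + 4 <= n; i += 4)
        _mm_storeu_ps(lhs + i, _mm_add_ps(_mm_loadu_ps(lhs + i), _mm_loadu_ps(rhs + i)));
    // Scalar tail for the remaining 0 to 3 elements.
    for (; i < n; ++i)
        lhs[i] += rhs[i];
}
```

Calling simdAddN(values, values2, 4) on the arrays from the question should leave 2, 4, 6, 8 in values.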
In your case, anyway, you've two problems:
- The terrible GCC inline assembly syntax, which badly blurs the difference between pointers and values. Use *lhs and *rhs instead of just lhs and rhs; apparently the "=m" syntax means "implicitly use a pointer to this thing that I'm passing you instead of the thing itself."
- GCC uses source, destination operand order. addps stores its result in the second operand, so you need to output xmm1, not xmm0.
I've put a fixed example on codepad (to avoid cluttering up this answer, and to demonstrate that it works).
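For reference, here is a minimal float-only sketch of both fixes, independent of the codepad link. The name simdAddAsm is made up, and I've also taken the liberty of merging everything into one asm statement with xmm0 and xmm1 listed as clobbers, since nothing stops the compiler from reusing those registers between separate asm statements. It assumes GCC or Clang on x86.

```cpp
// Hypothetical float-only variant of the question's function (GCC/Clang, x86 only).
void simdAddAsm(float *lhs, const float *rhs)
{
    // View each pointer as a 4-float array so the "m" operands cover all 16 bytes.
    float (*dst)[4] = reinterpret_cast<float (*)[4]>(lhs);
    const float (*src)[4] = reinterpret_cast<const float (*)[4]>(rhs);

    asm volatile(
        "movups %0, %%xmm0\n\t"      // xmm0 = lhs[0..3]
        "movups %1, %%xmm1\n\t"      // xmm1 = rhs[0..3]
        "addps  %%xmm0, %%xmm1\n\t"  // AT&T order: xmm1 = xmm1 + xmm0
        "movups %%xmm1, %0"          // store the sum (held in xmm1) back to lhs
        : "+m"(*dst)                 // the pointed-to memory, read and written
        : "m"(*src)                  // the pointed-to memory, read only
        : "xmm0", "xmm1");           // tell the compiler which registers we trash
}
```

Note that the operands are the dereferenced arrays, not the pointers, and that the value stored back is xmm1.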
Couple things I see wrong here. Firstly, your statements that load up the XMM registers and store values back to your variable are wrong.
asm volatile("movups %0,%%xmm0"::"m"(lhs));
asm volatile("movups %0,%%xmm1"::"m"(rhs));
...
asm volatile("movups %%xmm0,%0":"=m"(lhs));
Should read
asm volatile("movups %0,%%xmm0"::"m"(*lhs));
asm volatile("movups %0,%%xmm1"::"m"(*rhs));
...
asm volatile("movups %%xmm0,%0":"=m"(*lhs));
Note the *'s. You were loading up and adding the pointer values, and then storing them back in a temporary which was used to pass the pointer argument (which consequently is forgotten without writing to memory when the function call returns).
Even with these fixes, this is generally not a good technique. I had written my own example with asm statements, but it was flawed because I forgot to account for the parameters possibly being unaligned. That gets very cumbersome with asm statements, and it is far easier and more readable with intrinsic functions. Just be careful to use the correct data types:
template<typename T>
void simdAdd(T *lhs,T *rhs)
{
    switch(sizeof(T))
    {
        case sizeof(uint8_t):
        {
          __m128i lh128;
          lh128 = _mm_add_epi8( _mm_loadu_si128( (__m128i *)lhs ),
                                _mm_loadu_si128( (__m128i *)rhs ) );
          _mm_storeu_si128( (__m128i *)lhs, lh128 );
        }
        break;
        case sizeof(uint16_t):
        {
          __m128i lh128;
          lh128 = _mm_add_epi16( _mm_loadu_si128( (__m128i *)lhs ),
                                 _mm_loadu_si128( (__m128i *)rhs ) );
          _mm_storeu_si128( (__m128i *)lhs, lh128 );
        }
        break;
        case sizeof(float):
        {
          __m128 lh128;
          lh128 = _mm_add_ps( _mm_loadu_ps( (float *)lhs ),
                              _mm_loadu_ps( (float *)rhs ) );
          _mm_storeu_ps( (float *)lhs, lh128 );
        }
        break;
        case sizeof(double):
        {
          __m128d lh128;
          lh128 = _mm_add_pd( _mm_loadu_pd( (double *)lhs ),
                              _mm_loadu_pd( (double *)rhs ) );
          _mm_storeu_pd( (double *)lhs, lh128 );
        }
        break;
        default:
        std::cout<<"error"<<std::endl;
        break;
    }
}
Be aware that the size of a type is not enough to tell you which type you were passed. Just because a template type has the same size as one of the basic types you check for doesn't mean it is that type, so I force the casts to cover this case in my example. This is generally unsafe unless you are certain the function will only ever be used with the types you have listed. For example, calling it with a float-sized integer will silently produce a wrong answer, and the compiler won't be able to warn you about it.
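One way around the sizeof problem is to dispatch on the type itself with ordinary overloads, so a float-sized integer gets an integer add and an unsupported type simply fails to compile. This is a sketch rather than a drop-in replacement: the names add128 and simdAddTyped are made up, and the uint32_t overload is my addition to show the float-sized integer case.

```cpp
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics (also pulls in the SSE float ones)

// Hypothetical helper add128: one overload per supported element type,
// each adding 16 bytes of rhs into lhs with unaligned loads and stores.
inline void add128(float *lhs, const float *rhs)
{
    _mm_storeu_ps(lhs, _mm_add_ps(_mm_loadu_ps(lhs), _mm_loadu_ps(rhs)));
}
inline void add128(double *lhs, const double *rhs)
{
    _mm_storeu_pd(lhs, _mm_add_pd(_mm_loadu_pd(lhs), _mm_loadu_pd(rhs)));
}
inline void add128(std::uint8_t *lhs, const std::uint8_t *rhs)
{
    __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i *>(lhs));
    __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i *>(rhs));
    _mm_storeu_si128(reinterpret_cast<__m128i *>(lhs), _mm_add_epi8(a, b));
}
inline void add128(std::uint16_t *lhs, const std::uint16_t *rhs)
{
    __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i *>(lhs));
    __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i *>(rhs));
    _mm_storeu_si128(reinterpret_cast<__m128i *>(lhs), _mm_add_epi16(a, b));
}
inline void add128(std::uint32_t *lhs, const std::uint32_t *rhs)
{
    __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i *>(lhs));
    __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i *>(rhs));
    _mm_storeu_si128(reinterpret_cast<__m128i *>(lhs), _mm_add_epi32(a, b));
}

// Overload resolution picks the instruction from the real type, not its size.
// A type with no matching overload is a compile-time error, not a runtime "error".
template<typename T>
void simdAddTyped(T *lhs, const T *rhs)
{
    add128(lhs, rhs);
}
```

With this, simdAddTyped on two uint32_t arrays uses paddd rather than addps, even though uint32_t and float have the same size.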