Problems with Qualcomm Scorpion dual-core ARM NEON code?

Developer https://www.devze.com 2023-04-09 02:06 Source: Internet

I am developing a native library for Android where I use ARM assembly optimizations and multithreading in order to get maximum performance on the dual-core ARM chipset MSM8660. While doing some measur

The single-threaded library with NEON optimizations is faster than the single-threaded library with ARMv6 optimizations (as expected).
The multi-threaded library with ARMv6 optimizations is faster than the single-threaded library with ARMv6 optimizations (as expected).
The multi-threaded library with NEON optimizations is slower than the single-threaded library with NEON optimizations (definitely not expected!).

I have searched all over the net for an explanation of this but have not found one so far. It almost seems as if all the cores share the same NEON pipeline or something like that, but all the schematics seem to indicate that each core should have its own NEON unit. Does anyone know why this is happening?

First of all, which library are you using?

You're correct, each core has its own NEON unit. It is, however, Qualcomm's own proprietary 'VeNum' unit, and not much information has been published about it. It was designed for the Cortex-A8-class Scorpion core in the 8x50 and was considerably better than ARM's own implementation of NEON SIMD. The good news is that they (Qualcomm) design their hardware to be compatible with the base reference design, so most code written for a Cortex-A8 will work just fine on Scorpion, albeit possibly with some performance hit due to different instruction timings.

If you're compiling your program with "softfp", you will have an overhead of approximately 20 cycles for every function call that takes floating-point arguments and/or uses the NEON unit. Under softfp, floating-point arguments are passed in ARM core registers, and transferring register data from the ARM core to the NEON unit and vice versa is quite slow; it can sometimes stall the core for many cycles while it waits for the pipeline to flush.
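One way to reduce that per-call cost is to move whole buffers across the call boundary instead of individual floats. The sketch below contrasts the two shapes in plain C. The names `scale_one` and `scale_buffer` are hypothetical and do not come from the asker's library, and the actual cycle counts on Scorpion are not measured here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-element helper. Under the softfp ABI, `x`, `gain` and the
 * return value travel in ARM core registers, so each non-inlined call pays
 * the core <-> FP/NEON transfer cost discussed above. */
float scale_one(float x, float gain)
{
    return x * gain;
}

/* Hypothetical batched variant. Only two pointers, a count and one float
 * cross the call boundary; the samples themselves are loaded from memory
 * into FP/NEON registers inside the loop, once per element, with no
 * per-element function call. */
void scale_buffer(float *dst, const float *src, size_t n, float gain)
{
    assert((dst != NULL && src != NULL) || n == 0);
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i] * gain;
}
```

Calling `scale_buffer(out, in, n, 2.0f)` once replaces `n` calls to `scale_one`, so any fixed per-call penalty is paid once per buffer rather than once per sample.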

Also, for a threaded program that uses the floating-point unit, the kernel has to save the FP registers during a context switch. That incurs an additional penalty for threads, since we already know that moving registers from NEON to ARM is slow and is known to stall the pipeline.

Additionally, many other factors can lead to this, such as poor optimization by the compiler, cache misses, not using Scorpion's dual-issue feature, bad instruction scheduling, and your thread being switched from one core to another repeatedly.
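The last factor, threads migrating between cores, can be tested by pinning each worker thread to one core. The following is a minimal sketch assuming a Linux-style kernel (as on Android) and a C library that exposes `sched_setaffinity` and the `CPU_*` macros under `_GNU_SOURCE`. The helper names `first_allowed_cpu` and `pin_current_thread` are hypothetical, and whether pinning helps on the MSM8660 is something only measurement can show.

```c
#define _GNU_SOURCE /* must come before any #include to expose the CPU_* macros */
#include <sched.h>

/* Returns the lowest-numbered CPU the calling thread is allowed to run on,
 * or -1 if the affinity mask cannot be read. */
int first_allowed_cpu(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu)
        if (CPU_ISSET(cpu, &set))
            return cpu;
    return -1;
}

/* Restricts the calling thread to a single CPU (pid 0 means "the caller").
 * Returns 0 on success and -1 on failure. */
int pin_current_thread(int cpu)
{
    cpu_set_t set;
    if (cpu < 0 || cpu >= CPU_SETSIZE)
        return -1;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);
}
```

Each worker would call `pin_current_thread` with a different core index at the start of its thread function, before doing any NEON work.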

It's probably because of cache misses. It's hard to tell without more information.

My guess would be that it is because of the extra cycle penalty involved in flushing the NEON pipeline. The NEON pipeline is behind the rest of the core, and so you see an extra cycle penalty for missed branches and so on.

If the threads have to synchronize quite often, or if you have a lot of locks, I think you are going to see big penalties with NEON.

The only way you are going to leverage NEON for an overall gain in performance with multi-threaded code is if the code is embarrassingly parallel and there is very little and infrequent communication between the threads.
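As an illustration of that shape, here is a minimal pthreads sketch in which every thread gets its own disjoint slice of the data and the only synchronization is the final join. The names `chunk_t`, `scale_chunk`, `scale_parallel` and `MAX_WORKERS` are hypothetical, and the scalar loop body merely stands in for wherever a NEON kernel would go.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_WORKERS 8 /* arbitrary cap for this sketch */

typedef struct {
    float *dst;
    const float *src;
    size_t count;
    float gain;
} chunk_t;

/* Worker: touches only its own slice, takes no locks, shares no state. */
static void *scale_chunk(void *arg)
{
    chunk_t *c = arg;
    for (size_t i = 0; i < c->count; ++i)
        c->dst[i] = c->src[i] * c->gain;
    return NULL;
}

/* Splits [0, n) into nthreads contiguous slices and processes them in
 * parallel. Returns 0 on success, -1 if a thread could not be created
 * (in which case dst is only partially written). */
int scale_parallel(float *dst, const float *src, size_t n, float gain,
                   int nthreads)
{
    pthread_t tid[MAX_WORKERS];
    chunk_t chunk[MAX_WORKERS];
    int started = 0;
    int rc = 0;

    assert(nthreads >= 1 && nthreads <= MAX_WORKERS);

    size_t base = n / (size_t)nthreads;
    size_t extra = n % (size_t)nthreads; /* first `extra` slices get one more */
    size_t offset = 0;

    for (int t = 0; t < nthreads; ++t) {
        size_t len = base + ((size_t)t < extra ? 1 : 0);
        chunk[t] = (chunk_t){ dst + offset, src + offset, len, gain };
        offset += len;
        if (pthread_create(&tid[t], NULL, scale_chunk, &chunk[t]) != 0) {
            rc = -1;
            break;
        }
        ++started;
    }

    /* The join is the only point where the threads synchronize. */
    for (int t = 0; t < started; ++t)
        pthread_join(tid[t], NULL);
    return rc;
}
```

On the dual-core part discussed here this would be called with `nthreads` set to 2; depending on the toolchain, the program may need to be built with `-pthread`.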