Notes on ARM SIMD
Introduction
Collection of some ARM SIMD related information which could come in handy. This is mostly going to be a live-document. As and when I continue to find new information, I'll be updating this page (mostly for my perusal in the future).
-
NEON
- SIMD instructions
- No predicates allowed
- Vector lengths fixed in the Arch
- Introduced in Armv7-A
- 32x 128b regs (V0-31)
- int, fp32, fp64 ops are supported
- All NEON intrinsics can be found here
-
SVE
- Generalized extensions for SIMD instructions
- VLA (Vector Length Agnostic)
- HW implementors are free to choose the width (128b to 2048b)
- Supports predicated registers.
- Optional extension in Armv8-A
svcntb()
(andsvcntw()
?) will tell you the vector width at runtime- 16x predicte regs (P0-15)
- P0-7 - preds for load/store/math
- P8-15 - preds for loop control
- 32x SVE regs which use the same lower 128b from NEON regs (Z0-31)
- FFR (First Fault Reg) for SW speculation
- Armv9-A introduces SVE2 as an extension to SVE instructions
- All SVE intrinsics can be found here
- features
- gather-load, scatter-store
- per-lane predication
- predicate-driven loop control
- vector partitioning for SW-managed speculation (FFR)
- extended FP and bitwise horizontal reductions
-
SVE2
- superset of SVE and NEON
- from Armv9-A
- regs are same from SVE + Scalable vector system control regs (ZCR_Elx, x = 1, 2, 3)
- ZCR_Elx.LEN defines the vector length for the current and lower exception levels
- features
- all of SVE
- replicates existing NEON instructions
- complex arithmetic
- crypto
- genomics
- etc
-
Prefer to use libaries that already use these instructions (eg: ARMPL), then, prefer to use intrinsics. Finally, if perf is very important, look at using inline assembly.
-
arm_neon.h and arm_sve.h (use the
__ARM_FEATURE_SVE
macro!) for using NEON and SVE instructions respectively -
use
restrict
keyword in order to guarantee compiler of non-aliasing of pointers. This can lead to, for eg, compiler usingldp
instructions. It'll in general emit lesser instructions. It can also aid compiler in autovectorization. -
Nice table on various memory latencies.
-
__builtin_prefetch()
helps with SW prefetching -
type-casts are almost always costlier than a simple copy (6 cycles on Neoverse N1 and 3 on N2)
-
not autovectorizable loops (usually!)
- non countable loops (Eg:
while
withbreak
) - no function calls in the loop
- no branches or if/else/switch statements in the loop (not true for iteration invariant conditionals)
- only inner loops are autovectorized
- data interdependent iterations
- non countable loops (Eg:
-
for simd codes prefer SoA vs AoS and keep the data as packed as possible (these all sound similar for GPU optimization as well!)
References
- https://developer.arm.com/-/media/Arm%20Developer%20Community/PDF/Learn%20the%20Architecture/102131_0100_01_SVE_and_Neon_coding_compared.pdf?revision=feaaf72e-a941-461c-bd92-0d960d0f8615
- https://learn.arm.com/learning-paths/servers-and-cloud-computing/sve/sve_basics/
- https://www.arm.com/developer-hub/servers-and-cloud-computing/arm-simd
- https://developer.arm.com/documentation/102340/0100
- https://www.stonybrook.edu/commcms/ookami/support/_docs/ARM_SVE_tutorial.pdf
- https://www.stonybrook.edu/commcms/ookami/support/_docs/5%20-%20Advanced%20SVE.pdf