Introduction

A collection of Arm SIMD notes that could come in handy. This is mostly going to be a living document: as I find new information, I'll update this page (mostly for my own future reference).

NEON
- SIMD instructions
- No predication support
- Vector length is fixed by the architecture
- Introduced in Armv7-A
- 32x 128b regs (V0-31) in AArch64
- int, fp32, fp64 ops are supported (fp64 vector ops only in AArch64)
- All NEON intrinsics can be found here
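As a rough illustration of fixed-width NEON code, here is a minimal sketch of an element-wise float add that processes four lanes per iteration. The function name `add_f32` is made up for this note, and the `__ARM_NEON` guard with a scalar fallback is an assumption so that the snippet also builds on non-Arm machines.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* dst[i] = a[i] + b[i] for i in [0, n). Hypothetical helper for illustration. */
void add_f32(float *dst, const float *a, const float *b, size_t n) {
    assert(dst != NULL && a != NULL && b != NULL);
    size_t i = 0;
#if defined(__ARM_NEON)
    /* Main loop: one 128b register holds 4 x fp32 lanes. */
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(dst + i, vaddq_f32(va, vb));
    }
#endif
    /* Scalar tail (NEON has no predication), or the whole loop without NEON. */
    for (; i < n; ++i) {
        dst[i] = a[i] + b[i];
    }
}
```

Note the explicit scalar tail: since the vector length is fixed and there are no predicates, leftover elements have to be handled separately.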
SVE
- Generalized extensions for SIMD instructions
- VLA (Vector Length Agnostic)
- HW implementors are free to choose the width (128b to 2048b, in multiples of 128b)
- Supports per-lane predication via predicate registers
- Optional extension in Armv8-A (from Armv8.2-A)
- svcntb() returns the vector length in bytes at runtime (svcntw() returns the number of 32b lanes)
- 16x predicate regs (P0-15)
  - P0-7 - governing preds for load/store/arithmetic
  - P8-15 - additional preds for loop management
- 32x scalable vector regs (Z0-31); the low 128b of each Zn is shared with the NEON reg Vn
- FFR (First Fault Reg) for SW speculation
- Armv9-A introduces SVE2 as an extension to SVE instructions
- All SVE intrinsics can be found here
- features
  - gather-load, scatter-store
  - per-lane predication
  - predicate-driven loop control
  - vector partitioning for SW-managed speculation (FFR)
  - extended FP and bitwise horizontal reductions
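To make the VLA and predicate-driven loop control ideas above a bit more concrete, here is a minimal sketch of the same element-wise add written with SVE intrinsics. The names `add_f32_vla` and `sve_vector_bytes` are made up for this note, and the scalar fallback under `#else` is an assumption so the snippet still builds when SVE is not enabled (for example without `-march=armv8-a+sve`).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

/* Vector length in bytes as seen at runtime, or 0 when built without SVE. */
size_t sve_vector_bytes(void) {
#if defined(__ARM_FEATURE_SVE)
    return (size_t)svcntb();
#else
    return 0;
#endif
}

/* dst[i] = a[i] + b[i] for i in [0, n). Hypothetical helper for illustration. */
void add_f32_vla(float *dst, const float *a, const float *b, int64_t n) {
    assert(n >= 0);
#if defined(__ARM_FEATURE_SVE)
    /* svcntw() = number of 32b lanes per vector, only known at runtime. */
    for (int64_t i = 0; i < n; i += (int64_t)svcntw()) {
        /* Predicate is true for lanes whose index i + lane is < n. */
        svbool_t pg = svwhilelt_b32_s64(i, n);
        svfloat32_t va = svld1_f32(pg, a + i);
        svfloat32_t vb = svld1_f32(pg, b + i);
        /* _x form: results in inactive lanes are "don't care". */
        svst1_f32(pg, dst + i, svadd_f32_x(pg, va, vb));
    }
#else
    for (int64_t i = 0; i < n; ++i) {
        dst[i] = a[i] + b[i];
    }
#endif
}
```

There is no separate tail loop here: the predicate from the `whilelt` intrinsic masks off the lanes past `n` in the last iteration, and the same binary should work on any vector width.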
SVE2
- superset of SVE and NEON
- from Armv9-A
- same regs as SVE + scalable vector system control regs (ZCR_ELx, x = 1, 2, 3)
- ZCR_ELx.LEN sets the vector length for the current and lower exception levels
- features
  - all of SVE
  - replicates existing NEON instructions
  - complex arithmetic
  - crypto
  - genomics
  - etc
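As a small worked example for the ZCR_ELx.LEN note above: LEN is a 4-bit field and the requested vector length is (LEN + 1) * 128 bits, which is how the 128b to 2048b range comes about. The helper name `sve_vl_bits_from_len` below is made up for this note, and the sketch only computes the requested length; as far as I understand, the hardware may pick a smaller length if it does not implement the requested one.

```c
#include <assert.h>

/* Requested SVE vector length in bits for a given ZCR_ELx.LEN value (0..15).
 * Hypothetical helper; it does not read any system register. */
unsigned sve_vl_bits_from_len(unsigned len_field) {
    assert(len_field <= 15u); /* LEN is a 4-bit field */
    return (len_field + 1u) * 128u;
}
```

For instance, LEN = 0 gives 128b, LEN = 3 gives 512b and LEN = 15 gives 2048b.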
Prefer libraries that already use these instructions (e.g. Arm Performance Libraries, ArmPL); failing that, prefer intrinsics. Finally, if perf is very important, look at using inline assembly.
Include arm_neon.h for NEON intrinsics and arm_sve.h for SVE intrinsics (guard SVE code with the __ARM_FEATURE_SVE macro!)
Use the restrict keyword to promise the compiler that pointers do not alias. This can, for example, let the compiler use ldp instructions, and it will in general emit fewer instructions. It can also help the compiler with autovectorization.
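A minimal sketch of what this looks like in practice, assuming C99 or later. The function name `scale_add` is made up for this note; whether the compiler actually emits ldp or vectorizes depends on the compiler and flags, so treat this as an illustration rather than a guarantee.

```c
#include <assert.h>
#include <stddef.h>

/* dst[i] = k * a[i] + b[i]. The restrict qualifiers promise that dst, a and b
 * never overlap, so the compiler need not reload a[] or b[] after each store
 * to dst[]. Hypothetical helper for illustration. */
void scale_add(float *restrict dst, const float *restrict a,
               const float *restrict b, float k, size_t n) {
    assert(dst != NULL && a != NULL && b != NULL);
    for (size_t i = 0; i < n; ++i) {
        dst[i] = k * a[i] + b[i];
    }
}
```

Keep in mind that restrict is a promise from the programmer: calling this with overlapping buffers is undefined behaviour.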
Nice table on various memory latencies.
__builtin_prefetch() helps with SW prefetching
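Here is a minimal sketch of SW prefetching on an indirect (gather-like) access pattern, where a hardware prefetcher is less likely to predict the addresses. `__builtin_prefetch` is a GCC/Clang builtin; the function name `gather_sum` and the look-ahead distance of 8 are assumptions picked for illustration and would need tuning on real hardware.

```c
#include <assert.h>
#include <stddef.h>

/* How many iterations ahead to prefetch (an untuned, assumed value). */
enum { PREFETCH_AHEAD = 8 };

/* Sums table[idx[i]] for i in [0, n). Hypothetical helper for illustration. */
long gather_sum(const int *table, const size_t *idx, size_t n) {
    assert(table != NULL && idx != NULL);
    long sum = 0;
    for (size_t i = 0; i < n; ++i) {
        if (i + PREFETCH_AHEAD < n) {
            /* Arguments: address, rw (0 = read), locality hint (0..3). */
            __builtin_prefetch(&table[idx[i + PREFETCH_AHEAD]], 0, 3);
        }
        sum += table[idx[i]];
    }
    return sum;
}
```

The prefetch is only a hint: it does not change the result, and a badly chosen distance can make things slower rather than faster.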
type-casts are almost always costlier than a simple copy (6 cycles on Neoverse N1 and 3 on N2)
Loops that are usually not autovectorizable:
- non-countable loops (e.g. while with break)
- loops containing function calls
- loops containing branches or if/else/switch statements (iteration-invariant conditionals are fine)
- outer loops (only inner loops are autovectorized)
- loops whose iterations depend on each other's data
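To illustrate a few of these cases, here is a minimal sketch with one loop that is a typical autovectorization candidate and two that usually are not. All three function names are made up for this note, and whether a given compiler vectorizes them depends on its version and flags (vectorization reports such as GCC's `-fopt-info-vec` can confirm).

```c
#include <assert.h>
#include <stddef.h>

/* Countable loop with independent iterations: a good candidate. */
void saxpy(float *restrict y, const float *restrict x, float a, size_t n) {
    assert(y != NULL && x != NULL);
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

/* Loop-carried dependency: v[i] needs the v[i - 1] computed just before it. */
void prefix_sum(float *v, size_t n) {
    for (size_t i = 1; i < n; ++i) {
        v[i] += v[i - 1];
    }
}

/* Non-countable loop: the trip count depends on the data (early exit). */
ptrdiff_t find_first(const int *v, size_t n, int key) {
    for (size_t i = 0; i < n; ++i) {
        if (v[i] == key) {
            return (ptrdiff_t)i;
        }
    }
    return -1;
}
```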
For SIMD code, prefer SoA (structure of arrays) over AoS (array of structures) and keep the data as tightly packed as possible (this all sounds similar to GPU optimization as well!)
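A minimal sketch of the two layouts, using a made-up 3D point type (`PointAoS`, `PointsSoA` and both function names are hypothetical). With AoS the x values sit 12 bytes apart, so a vector load also drags in y and z; with SoA the x values are contiguous and map directly onto vector lanes.

```c
#include <assert.h>
#include <stddef.h>

/* AoS: one struct per point; x, y, z are interleaved in memory. */
typedef struct {
    float x, y, z;
} PointAoS;

/* SoA: one contiguous array per field. */
typedef struct {
    const float *x;
    const float *y;
    const float *z;
} PointsSoA;

/* Strided access: consecutive x values are sizeof(PointAoS) bytes apart. */
float sum_x_aos(const PointAoS *pts, size_t n) {
    assert(pts != NULL);
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        sum += pts[i].x;
    }
    return sum;
}

/* Unit-stride access: consecutive x values are adjacent in memory. */
float sum_x_soa(const PointsSoA *pts, size_t n) {
    assert(pts != NULL && pts->x != NULL);
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        sum += pts->x[i];
    }
    return sum;
}
```

Both functions compute the same value; only the memory access pattern differs.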

References

https://developer.arm.com/-/media/Arm%20Developer%20Community/PDF/Learn%20the%20Architecture/102131_0100_01_SVE_and_Neon_coding_compared.pdf?revision=feaaf72e-a941-461c-bd92-0d960d0f8615
https://learn.arm.com/learning-paths/servers-and-cloud-computing/sve/sve_basics/
https://www.arm.com/developer-hub/servers-and-cloud-computing/arm-simd
https://developer.arm.com/documentation/102340/0100
https://www.stonybrook.edu/commcms/ookami/support/_docs/ARM_SVE_tutorial.pdf
https://www.stonybrook.edu/commcms/ookami/support/_docs/5%20-%20Advanced%20SVE.pdf