The SWAR (SIMD-within-a-register) numbers are strictly better than both the SIMD versions and the standard library baseline. Why is that? SIMD should be strictly faster when the machine supports it, since SWAR's maximum bitwidth is 64 while SIMD starts at 128 bits.
The Java SIMD API used here must not be producing actual SIMD machine code.
bluuewhale 22 hours ago [-]
Author here!
Thanks for the great point.
This is actually the main topic I'm working on for the next post.
It's understandable to expect SIMD to win purely because it's wider, but in practice the end-to-end cost matters more than raw vector length (VL).
With the Java Vector API, the equality compare can indeed be compiled down to real SIMD instructions, yet the overall path may still lose if turning a VectorMask into a scalar bitmask is expensive. The "best case" is a vector compare followed by a single instruction that packs the result into a bitmask; if the JIT doesn't hit that lowering, it can fall back to extra work such as materializing the mask and repacking it in scalar code. From what I can tell, they have been working on an intrinsic for VectorMask.toLong (https://bugs.openjdk.org/browse/JDK-8273949).
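To make that concrete, here's a minimal sketch (names are my own, not from the article) of the "best case" shape: one vector equality compare, then VectorMask.toLong to pack the lane results into a scalar bitmask. It assumes a JDK with the incubator Vector API enabled (--add-modules jdk.incubator.vector).

```java
import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorSpecies;

public class VectorProbe {
    static final VectorSpecies<Byte> SPECIES = ByteVector.SPECIES_128;

    /** Bitmask with bit i set where data[offset + i] == target (16 lanes). */
    static long matchBits(byte[] data, int offset, byte target) {
        ByteVector v = ByteVector.fromArray(SPECIES, data, offset);
        // The compare itself maps cleanly to SIMD; the cost question is
        // whether toLong() gets lowered to a single pack instruction
        // (e.g. PMOVMSKB on x86) or falls back to scalar repacking.
        return v.eq(target).toLong();
    }
}
```

Whether this wins or loses against SWAR hinges almost entirely on how that final toLong() is compiled, which is exactly the transition cost discussed above.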
Also, SWAR avoids that entire transition by staying in general-purpose registers and producing the bitmask directly with a small, predictable sequence of bit operations. For small fixed-size probes, that simplicity often outweighs SIMD's theoretical advantage, and on some CPUs heavier vector usage can even come with frequency effects that further narrow the gap. So my guess is that the more likely explanation isn't that the Vector API never uses SIMD, but that the overhead around the SIMD path cancels out its width advantage.
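For comparison, here's a sketch of the SWAR side (again, names are my own): the classic "has-zero-byte" trick that finds matching bytes in a 64-bit word without ever leaving general-purpose registers.

```java
public class SwarProbe {
    private static final long LO = 0x0101010101010101L; // 0x01 in every byte
    private static final long HI = 0x8080808080808080L; // 0x80 in every byte

    /** Word with 0x80 set in each byte lane where word's byte == target. */
    static long matchMask(long word, byte target) {
        long x = word ^ (LO * (target & 0xFFL)); // matching lanes become zero
        // Exact zero-byte detector: no false positives thanks to the & ~x term.
        return (x - LO) & ~x & HI;
    }

    /** Index (0-7, low byte first) of the first matching byte, or -1. */
    static int firstMatch(long word, byte target) {
        long m = matchMask(word, target);
        return m == 0 ? -1 : Long.numberOfTrailingZeros(m) >>> 3;
    }
}
```

The whole probe is a multiply, an XOR, a subtract, two ANDs, and a count-trailing-zeros: a handful of cheap GPR instructions with no vector-to-scalar handoff anywhere.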
I'll take a closer look at how it compiles down to machine code and share what I find.
P.S.
Benchmark results can vary a lot depending on the environment (OS, CPU and JDK/JIT version and flags), so it’s also possible the benchmark picture changes on a different setup.
yardstick 2 days ago [-]
The article wasn’t great at laying out the concepts at the start. As I understand it, the big idea is essentially a bloom filter as the first phase of a retrieval.
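If it helps other readers, the "bloom filter as a first phase" idea can be sketched roughly like this (a toy illustration with my own naming, not the article's implementation): a cheap probabilistic check that rules out most non-members before the expensive lookup runs.

```java
public class PrefilterSketch {
    private final long[] bits;
    private final int mask;

    PrefilterSketch(int log2Bits) {
        bits = new long[1 << Math.max(0, log2Bits - 6)];
        mask = (1 << log2Bits) - 1;
    }

    private int[] positions(int key) {
        int h1 = key * 0x9E3779B9;                    // two cheap hashes
        int h2 = Integer.rotateLeft(h1, 16) ^ key;
        return new int[] { h1 & mask, h2 & mask };
    }

    void add(int key) {
        for (int p : positions(key)) bits[p >>> 6] |= 1L << (p & 63);
    }

    /** false => definitely absent; true => maybe present, do the real lookup. */
    boolean mightContain(int key) {
        for (int p : positions(key))
            if ((bits[p >>> 6] & (1L << (p & 63))) == 0) return false;
        return true;
    }
}
```

The key property is that a "false" answer is definitive (no false negatives), so the slow path only runs for keys that might actually be present.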
bluuewhale 22 hours ago [-]
Thanks for the feedback.
You've nailed the core idea. I'll tweak the structure a bit to make the concepts clearer up front.
jbellis 2 days ago [-]
Now I'm curious about what fastutil's implementation is doing.
twoodfin 2 days ago [-]
[flagged]