Create a simple tool that tells me whether SIMD is working in my environment #421

@mikemccand

Accessing SIMD instructions is devilishly complex from the clouds of javaland:

  • Panama Vector API: One can use Panama's vector API, but it is quite abstract and might not do the right thing on your particular CPU and Java version. It is also still an incubating API (it was Panama's foreign function and memory API that was finalized in Java 22), but Lucene devs got impatient and began integrating it anyway (see the original "Plan B" issue from ~2 years ago).

  • Hotspot compiler's auto-vectorization: One can write the Java sources "just so" and hope Hotspot's auto-vectorization kicks in (Lucene's postings decode of blocks of 128 integers does this).

  • Native extensions: Finally, one can also take matters into one's own hands and create a native extension, using JNI or the foreign function support (also in Panama) to call C code from Java and explicitly embed the right SIMD instructions in the C code (or rely on gcc to properly compile the native extension) -- this PR takes that approach, but it's draft/experimental.

These optimizations are highly CPU arch dependent -- x86-64 (Intel, AMD) is usually the focus (SSE/AVX) since developers mostly build and run on these CPUs, while Arm (e.g. Amazon's Graviton CPUs, Apple's MX CPUs), which has a different SIMD instruction set (Neon), doesn't often get the attention it should (though it does get some -- thank you @rmuir!).
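
As a first sanity check on the arch question, here is a minimal Python sketch that reports which SIMD extensions the CPU advertises. It assumes Linux, where `/proc/cpuinfo` lists them on a `flags` line (x86-64) or a `Features` line (arm64). The function names and the short flag table are made up for illustration, and this only says what the CPU offers, not what the JVM actually emits (virtualization layers may also hide flags).

```python
import platform

# Small, non-exhaustive table of Linux cpuinfo flag names -> readable labels.
SIMD_FLAGS = {
    "sse2": "SSE2",
    "avx": "AVX",
    "avx2": "AVX2",
    "fma": "FMA",
    "avx512f": "AVX-512F",
    "asimd": "Neon (Advanced SIMD)",
    "sve": "SVE",
}


def parse_cpu_flags(cpuinfo_text):
    """Collect the tokens from every 'flags' (x86) or 'Features' (arm64) line."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() in ("flags", "features"):
            flags.update(value.split())
    return flags


def advertised_simd(cpuinfo_text):
    """Return the sorted labels of the known SIMD extensions the CPU advertises."""
    flags = parse_cpu_flags(cpuinfo_text)
    return sorted(label for flag, label in SIMD_FLAGS.items() if flag in flags)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        text = ""  # not Linux, or /proc is unavailable
    print(platform.machine(), advertised_simd(text) or "no known SIMD flags found")
```

If this prints no AVX or Neon entry at all, there is little point in looking at the JIT output; if it does, the JVM still has to be checked separately.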

There are problems and frustrations with all of the above paths, and Lucene, to varying degrees in various releases and on certain JDKs, tries to do the right thing. But even the right thing is brittle -- is Hotspot still auto-vectorizing correctly after my Java upgrade? Are Panama's vector extensions really using SIMD instructions, or emulating them in slower Java code? Have new SIMD instructions arrived (e.g. AVX-512) which could do our dot products faster?

Anyways, with all of that complicated backdrop... on to this actual issue:

Let's create a simple tool that you could run in your production environment, and that reports whether or not your world (Java version, Lucene version, CPU arch, OS, virtualization layers, ...) is in fact using the right SIMD instructions when running Lucene's KNN search. It's OK if this tool is kinda slow to run -- I see it as something like an integration test. Ideas:

  • Make a Python wrapper that runs knnPerfTest, but turns on -XX:+PrintAssembly, and then watches the output: when Hotspot says "I compiled dotProduct and here is the ASM code", grep that output for specific instructions. You'd first have to enumerate which instructions you would expect to see in this case (float or quantized vectors? which CPU arch?)

  • Or ... are there performance counters that we could ask the CPU to report on (perf or so?). Then run knnPerfTest, and gather those counters, and see if "enough" SIMD instructions were executed?

  • Or ... maybe from a single JVM invocation, there is some way to introspect (java reflection APIs maybe...) and see the ASM for functions and just look "yourself" from within the JVM that is running KnnGraphTester?

  • Other ideas....?
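
A rough sketch of the first idea (the Python wrapper that greps `-XX:+PrintAssembly` output), to make it concrete. Everything here is an assumption rather than a worked-out design: the function names are hypothetical, the "Compiled method (c2) ... Class::method" header shape differs between JDK versions, and instead of enumerating mnemonics it just looks for wide vector registers (`ymm`/`zmm` on x86-64, `vN.4s`-style arrangements on Arm). Note that `-XX:+PrintAssembly` needs `-XX:+UnlockDiagnosticVMOptions` and the hsdis disassembler library to produce readable output.

```python
import re
from collections import Counter

# Assumed header shape (varies by JDK version):
#   Compiled method (c2)  123  45  4  pkg.Class::method (42 bytes)
METHOD_HEADER = re.compile(r"Compiled method \((?:c1|c2)\).*?([\w.$/]+::[\w$<>]+)")

# Wide vector registers as a proxy for "SIMD was used". xmm is skipped on
# purpose because scalar float code also uses xmm registers.
WIDE_REGISTERS = [
    ("avx512", re.compile(r"\bzmm\d+\b")),
    ("avx256", re.compile(r"\bymm\d+\b")),
    ("neon", re.compile(r"\bv\d+\.(?:16b|8h|4s|2d)\b")),
]


def printassembly_command(java_args):
    """Prefix a java command line with the flags that enable PrintAssembly."""
    return ["java", "-XX:+UnlockDiagnosticVMOptions", "-XX:+PrintAssembly", *java_args]


def simd_usage_by_method(lines):
    """Map each compiled method to a count of lines touching wide registers."""
    usage = {}
    current = None
    for line in lines:
        header = METHOD_HEADER.search(line)
        if header:
            current = header.group(1)
            usage.setdefault(current, Counter())  # repeat compilations accumulate
            continue
        if current is None:
            continue
        lowered = line.lower()
        for label, pattern in WIDE_REGISTERS:
            if pattern.search(lowered):
                usage[current][label] += 1
                break
    return usage


def verdict(usage, method_substring):
    """Return 'simd', 'scalar', or 'not-compiled' for the matching methods."""
    matches = [c for name, c in usage.items() if method_substring in name]
    if not matches:
        # Also happens when the method was inlined into its caller.
        return "not-compiled"
    return "simd" if any(sum(c.values()) > 0 for c in matches) else "scalar"
```

To use it, one would run `printassembly_command([...])` with whatever arguments knnPerfTest normally passes (via `subprocess.run(..., capture_output=True, text=True)`), feed `stdout.splitlines()` to `simd_usage_by_method`, and then ask `verdict(usage, "dotProduct")`. Since Hotspot may inline `dotProduct` into a caller, a real tool would probably have to check the callers too.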

We badly need this tooling because the whole situation is so fraught. Ideally this tool would be available from Lucene, and even more ideally Lucene itself could introspect and print a warning that it is not optimized properly for the current Java/CPU/OS. But we can start with a rough first cut here and maybe iterate it to something more compact.

These optimizations are crazy powerful, but they are also insanely brittle, so we need to give Lucene users stronger testing that the opto is still working...
