
Conversation

@jerry-024
Contributor

Purpose

Introduce vector index support for Paimon using Apache Lucene 9.x to enable:

  • Approximate nearest neighbor (ANN) search for vector embeddings

Implementation

  • Uses Lucene's HNSW algorithm for efficient ANN search
  • Supports float32 and byte vectors
  • Supports EUCLIDEAN, COSINE, DOT_PRODUCT, and MAXIMUM_INNER_PRODUCT similarity metrics (see the Lucene sketch after this list)
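
For readers unfamiliar with these primitives, the sketch below shows the plain Lucene 9.x KNN API this module builds on (assuming Lucene 9.5+, where KnnFloatVectorField, KnnByteVectorField, and KnnFloatVectorQuery are available). It is an illustration of the underlying library, not Paimon's actual indexer code; class and field names are made up for the example.

import org.apache.lucene.document.Document;
import org.apache.lucene.document.KnnFloatVectorField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.KnnFloatVectorQuery;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class LuceneKnnSketch {
    public static void main(String[] args) throws Exception {
        try (Directory dir = new ByteBuffersDirectory();
                IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
            // Index a few float32 vectors; the similarity metric is fixed per field.
            // Byte vectors would use KnnByteVectorField instead.
            float[][] vectors = {{1f, 0f}, {0f, 1f}, {0.9f, 0.1f}};
            for (float[] v : vectors) {
                Document doc = new Document();
                doc.add(new KnnFloatVectorField("embedding", v, VectorSimilarityFunction.COSINE));
                writer.addDocument(doc);
            }
            writer.commit();

            // Approximate nearest neighbor search over the HNSW graph Lucene builds per segment.
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                ScoreDoc[] hits =
                        searcher.search(new KnnFloatVectorQuery("embedding", new float[] {1f, 0f}, 2), 2)
                                .scoreDocs;
                for (ScoreDoc hit : hits) {
                    System.out.println("doc=" + hit.doc + " score=" + hit.score);
                }
            }
        }
    }
}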

Tests

  • VectorGlobalIndexTest

API and Format

Documentation

* upstream/master:
  [test] Fix test SparkWriteITCase.testTruncatePartitionValueNull
  [orc] Limiting Memory Usage of OrcBulkWriter When Writing VectorizedRowBatch (apache#6590)
  [arrow] Improve customization capabilities for data type conversion. (apache#6695)
  [spark] Fix NPE in spark truncate null partitions
  [core] Exclude .class files from sources.jar (apache#6707)
  [core] DataEvolutionFileStoreScan should not filter files by read type when it contains no physical columns. (apache#6714)
  [spark] Update scalafmt version to 3.10.2 (apache#6709)
  [variant] Extract only required columns when reading shredded variants (apache#6720)
  [python] Fix read large volume of blob data (apache#6701)
  [flink] Support cdc source (apache#6606)
  [hive] fix splitting for bucket tables (apache#6594)
  [spark] Update spark build topology for global index (2) (apache#6703)
  [test][spark] Fix the flaky test setDefaultDatabase (apache#6696)
  [spark] Update global index build topology (apache#6700)
  [spark] Introduce global file index builder on spark (apache#6684)
  [python] Fix with_shard feature for blob data (apache#6691)
  [test][spark] Add alter with incompatible col type test case (apache#6689)
  [variant] Introduce withVariantAccess in ReadBuilder (apache#6685)
  [python] Fix file name prefix in postpone mode. (apache#6668)
* upstream/master:
  [core] format table: throw exception when get partition from file system (apache#6730)
  [core] Enrich static creation methods for GlobalIndexResult
  [spark] Fix duplicate column error when merging on _ROW_ID (apache#6727)
  [core] Remove Builder of SplitReadProvider.Context
  [core] Introduce bitmap64 for GlobalIndexResult and support with ranges filter push down (apache#6725)
  [spark] Push down partition filter when compactUnAwareBucketTable (apache#6663)
  [core] Support the new split description for reading data (apache#6711)
  [python] support paimon as ray datasource for distributed processing (apache#6686)
  [docs] Add missing PaimonSparkSessionExtensions to Spark configs (apache#6729)
  Revert "[flink] Flink batch job support specify partition with max_pt() and max_two_pt()" without review
  [flink] Flink batch job support specify partition with max_pt() and max_two_pt()
  [spark] Update the display of size and timing metrics (apache#6722)
  [core] Append commit should check bucket number if latest commit user is different (apache#6723)
  [core] ShardScanner should not keep partition parameter (apache#6724)
* upstream/master:
  [core] Upgrade LZ4 dependency to 1.8.1 (apache#6737)
  [core] Update the result of show global system tables (apache#6751)
  [python] Introduce lance file format support to pypaimon (apache#6746)
  [Python] Support DLF OSS endpoint override in RESTTokenFileIO (apache#6749)
  [python] fix pypaimon manifest entry incomplete identifier issue (apache#6748)
  [spark] Merge into supports _ROW_ID shortcut (apache#6745)
  [core] Fix BlockIterator may not move forward after calling `next()` (apache#6738)
@jerry-024 jerry-024 marked this pull request as draft December 8, 2025 06:41
@jerry-024 jerry-024 marked this pull request as ready for review December 8, 2025 10:04
<version>1.4-SNAPSHOT</version>
</parent>

<artifactId>paimon-vector-index</artifactId>
Contributor

Maybe paimon-lucene is better.

@jerry-024 jerry-024 changed the title from "Vector index" to "Support vector index using lucene" Dec 9, 2025
IndexReader reader = null;
boolean success = false;
try {
directory = deserializeDirectory(in);

Current situation: serializeDirectory / deserializeDirectory only write the number of files, then for each file the filename length, the filename, and the file length, followed directly by the raw bytes. There is no magic number, version number, or checksum.

Impact: if the serialization structure changes in the future, older readers will not be able to recognize it; if the stream is damaged it is hard to pinpoint which part is corrupted; and security and robustness are poor (prone to silent corruption).

Recommendation: add a magic number, a version number, and a CRC32 per file (or an overall checksum), and use standard DataOutputStream/DataInputStream to write ints/longs instead of manual ByteBuffer wrapping. This is more robust and makes backward compatibility easier to maintain via the version field.

For example:

private static final int MAGIC = 0x50414D4F; // 'PAMO' or any magic
private static final short VERSION = 1;

private void serializeDirectory(Directory directory, OutputStream out) throws IOException {
    DataOutputStream dos = new DataOutputStream(new BufferedOutputStream(out));
    dos.writeInt(MAGIC);
    dos.writeShort(VERSION);

    String[] files = directory.listAll();
    dos.writeInt(files.length);

    for (String fileName : files) {
        byte[] nameBytes = fileName.getBytes(StandardCharsets.UTF_8);
        dos.writeInt(nameBytes.length);
        dos.write(nameBytes);

        long fileLength = directory.fileLength(fileName);
        dos.writeLong(fileLength);

        try (IndexInput input = directory.openInput(fileName, IOContext.DEFAULT)) {
            byte[] buffer = new byte[32 * 1024];
            long remaining = fileLength;
            CRC32 crc = new CRC32();
            while (remaining > 0) {
                int toRead = (int) Math.min(buffer.length, remaining);
                input.readBytes(buffer, 0, toRead);
                dos.write(buffer, 0, toRead);
                crc.update(buffer, 0, toRead);
                remaining -= toRead;
            }
            dos.writeLong(crc.getValue()); // per-file checksum
        }
    }
    dos.flush();
}

And the corresponding deserializeDirectory function, which reads the magic number and version and verifies the per-file CRC:

private IndexMMapDirectory deserializeDirectory(SeekableInputStream in) throws IOException {
    DataInputStream dis = new DataInputStream(new BufferedInputStream(in));
    int magic = dis.readInt();
    if (magic != MAGIC) {
        throw new IOException("Invalid vector index file magic: " + Integer.toHexString(magic));
    }
    short version = dis.readShort();
    if (version != VERSION) {
        throw new IOException("Unsupported vector index version: " + version);
    }

    IndexMMapDirectory indexMMapDirectory = new IndexMMapDirectory();
    try {
        int numFiles = dis.readInt();
        byte[] buffer = new byte[32 * 1024];

        for (int i = 0; i < numFiles; i++) {
            int nameLength = dis.readInt();
            byte[] nameBytes = new byte[nameLength];
            dis.readFully(nameBytes);
            String fileName = new String(nameBytes, StandardCharsets.UTF_8);
            long fileLength = dis.readLong();

            // Stream the file data into the index output, computing the CRC while reading.
            try (IndexOutput output = indexMMapDirectory.directory().createOutput(fileName, IOContext.DEFAULT)) {
                long remaining = fileLength;
                CRC32 crc = new CRC32();
                while (remaining > 0) {
                    int toRead = (int) Math.min(buffer.length, remaining);
                    dis.readFully(buffer, 0, toRead);
                    output.writeBytes(buffer, 0, toRead);
                    crc.update(buffer, 0, toRead);
                    remaining -= toRead;
                }
                long expectedCrc = dis.readLong();
                if (crc.getValue() != expectedCrc) {
                    throw new IOException("CRC mismatch for file " + fileName);
                }
            }
        }
        return indexMMapDirectory;
    } catch (Exception e) {
        try {
            indexMMapDirectory.close();
        } catch (Exception ee) {
            // ignore or add suppressed
            e.addSuppressed(ee);
        }
        if (e instanceof IOException) {
            throw (IOException) e;
        } else {
            throw new IOException("Failed to deserialize directory", e);
        }
    }
}

Contributor

@kaori-seasons Please do not copy entire AI answers into the review comments; it costs the author a lot of effort to extract a simple point of view. Summarize them in a sentence or two, like "I think adding a MAGIC number would be better".

Contributor

+1 to @leaves12138. This is really a waste of everyone's time.

* upstream/master:
  [Python] Support read deletion vector pk table (apache#6766)
  [core] Fix retry in Consumer and SnapshotManager (apache#6780)
  [core] Introduce Like LeafFunction for Predicate (apache#6776)
  [core] Remote lookup file should overwrite old orphan file (apache#6769)
  [core] Make bucketed append table write initial lighter (apache#6741)
  [python] update requirements.txt for pyarrow and pylance (apache#6774)
  [docs] Fix error in blob.md docs (apache#6771)
  [core] Should prevent users from specifying partition columns as blob field (apache#6753)
  [test] add e2e test case for lance read write with python and java (apache#6765)
  [spark] keep the rowId unchanged when updating the row tracking and deletion vectors table. (apache#6756)
  [core] Remove useless classes and clean memory in FileBasedBloomFilter
  [core] Extract an SST File Format from LookupStore. (apache#6755)
  [python] fix pypaimon timestamp non free time zone issue (apache#6750)
  [doc] Add blob document (apache#6757)
  [lance] update lance version to 0.39.0 (apache#6758)
  [python] update data type of _VALUE_KIND to int8 (apache#6759)
@apache apache deleted a comment from kaori-seasons Dec 9, 2025
@apache apache deleted a comment from kaori-seasons Dec 9, 2025
@JingsongLi JingsongLi changed the title from "Support vector index using lucene" to "[lucene] Support vector index using lucene" Dec 11, 2025
push:
pull_request:
paths:
- 'paimon-lucene/**'
Contributor

Why not cover this in utitcase-jdk11?

Contributor

@leaves12138 leaves12138 left a comment

Thanks @jerry-024, I left some comments below.

private final MMapDirectory mmapDirectory;

public IndexMMapDirectory() throws IOException {
this.path = java.nio.file.Files.createTempDirectory("paimon-lucene-" + UUID.randomUUID());
Contributor

Please do not use the fully qualified java.nio.file.Files inside the class body; import it at the top of the file (see the consolidated sketch after the related comments below).

public void close() throws Exception {
mmapDirectory.close();
if (java.nio.file.Files.exists(path)) {
java.nio.file.Files.walk(path)
Contributor

ditto

mmapDirectory.close();
if (java.nio.file.Files.exists(path)) {
java.nio.file.Files.walk(path)
.sorted(java.util.Comparator.reverseOrder())
Contributor

ditto

Contributor

We don't need to sort

.forEach(
p -> {
try {
java.nio.file.Files.delete(p);
Contributor

ditto
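
To make the repeated "ditto" comments concrete, one possible shape of the class after applying them is sketched below, with the java.nio.file imports declared at the top of the file. The body is reconstructed from the snippets quoted in this review thread and is illustrative only, not the merged code.

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.UUID;
import java.util.stream.Stream;

import org.apache.lucene.store.MMapDirectory;

/** Illustrative sketch: a temp-directory-backed MMapDirectory with imports at the top. */
public class IndexMMapDirectory implements AutoCloseable {

    private final Path path;
    private final MMapDirectory mmapDirectory;

    public IndexMMapDirectory() throws IOException {
        this.path = Files.createTempDirectory("paimon-lucene-" + UUID.randomUUID());
        this.mmapDirectory = new MMapDirectory(path);
    }

    public MMapDirectory directory() {
        return mmapDirectory;
    }

    @Override
    public void close() throws Exception {
        mmapDirectory.close();
        if (Files.exists(path)) {
            // Walk the temp directory and delete entries, children before their parent directories.
            try (Stream<Path> paths = Files.walk(path)) {
                paths.sorted(Comparator.reverseOrder())
                        .forEach(
                                p -> {
                                    try {
                                        Files.delete(p);
                                    } catch (IOException e) {
                                        throw new UncheckedIOException(e);
                                    }
                                });
            }
        }
    }
}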

/** Factory for creating vector global indexers. */
public class VectorGlobalIndexerFactory implements GlobalIndexerFactory {

public static final String IDENTIFIER = "vector";
Contributor

lucene-hnw? "vector" is too wide.

Contributor Author

how about lucene-hnsw? (PS: HNSW = Hierarchical Navigable Small World)

// Implementation of FunctionVisitor methods
@Override
public GlobalIndexResult visitIsNotNull(FieldRef fieldRef) {
throw new UnsupportedOperationException(
Contributor

Do not throw an exception. Please return GlobalIndexResult.fromRange(<the range passed to you in GlobalIndexIOMeta>); see the sketch below.
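
A hedged sketch of this suggestion follows. GlobalIndexResult.fromRange is named in this thread, but its exact signature and the GlobalIndexIOMeta accessors used below are assumptions for illustration only.

// Illustrative only: for predicates the vector index cannot evaluate, fall back to
// "every row in the range this index covers" instead of failing the query.
@Override
public GlobalIndexResult visitIsNotNull(FieldRef fieldRef) {
    // `ioMeta` is a hypothetical field holding the GlobalIndexIOMeta handed to this visitor;
    // rowRangeStart()/rowRangeEnd() are assumed accessor names, not the real Paimon API.
    return GlobalIndexResult.fromRange(ioMeta.rowRangeStart(), ioMeta.rowRangeEnd());
}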

@jerry-024 jerry-024 force-pushed the vector_index branch 2 times, most recently from 0c5ab2a to 30efab8 Compare December 11, 2025 03:39
@jerry-024 jerry-024 force-pushed the vector_index branch 4 times, most recently from 4ab8246 to a33cf08 Compare December 11, 2025 08:07
Contributor

@JingsongLi JingsongLi left a comment

  1. All classes should start with Lucene.
  2. Identifier should be lucene-vector-knn?

@jerry-024 jerry-024 changed the title from "[lucene] Support vector index using lucene" to "[lucene] Support vector index using Lucene" Dec 11, 2025
@JingsongLi
Contributor

+1

@JingsongLi JingsongLi merged commit 5fb3d45 into apache:master Dec 11, 2025
23 checks passed
@jerry-024 jerry-024 deleted the vector_index branch December 12, 2025 01:19