PISA v1 Proposal
The purpose of this document is to propose a high-level architecture and format of PISA index and related structures.
This section describes the desired outcomes of the proposed design, in no particular order.
We aim to preserve the high-performance nature of the project. Although some concessions might be made to facilitate new functionality, the performance when not using that functionality should generally not decline. However, in certain cases, we may still decide to sacrifice some performance if it improves maintainability or user experience, though these would have to be considered on a case-by-case basis, and should be avoided if possible.
We want to preserve the existing functionality unless we deem some of it unnecessary. For example, we want to maintain the support for a range of different retrieval algorithms, index compression techniques, etc. It is acceptable that some new functionality excludes some of these or otherwise compromises performance for them, e.g., if we introduce index updates, it may be less efficient to do it for quantized indexes, but the current functionality of the quantized indexes should be maintained.
We want to make it easier to use by default, but still powerful and flexible if needed. For example, we want to introduce abstractions to manage an index without having to manually keep track of all individual files (index, lexicon, metadata, etc.), but we still want to maintain low-level access to those if needed for a power user.
Our code should facilitate building custom systems. A single component, e.g., inverted index or lexicon, should be decoupled enough to use on its own.
The project should be easy to extend with new functionality, which is especially important for academic purposes.
We currently suffer from many segfaults whenever someone makes a mistake, e.g., passes the wrong encoding value when running queries. This should be avoided by encoding these parameters in the index itself, so that checks can be made and we can gracefully shut down if two components are not compatible.
The v1 API should be stable and backwards-compatible.
Thus, we should also clearly delineate which parts of the code are part of the public API, e.g., by moving any "private" template code to a separate "private" namespace (such as `pisa::details`). This should be clearly documented.
We should clearly document all the components and public-facing APIs and commands.
This is a list of new functionality we would like to support in the future. Note that this does not mean these have to be implemented for v1, but rather we should have them in mind when designing.
We want a facility to update an existing index. This will likely mean implementing two pieces of functionality: index merging and querying multiple indexes at a time. Additionally, we may want to implement an in-memory dynamic index for new documents before they are written to the disk in the usual format. Similarly, we can think of document deletion, which most likely involves marking documents as deleted in the existing index, and "committing" the deletes by rebuilding the index structure without the marked documents. Furthermore, document modifications can be implemented by adding a new document and marking the old one for deletion.
This may be at least partially covered by index updates, or at least is related to it. A user that is not interested in index details should be able to feed documents to PISA and get all necessary structures instead of manually going through multiple stages.
We should be able to index documents straight to the compressed format, and use the compressed format in place of the uncompressed one when doing things like computing max-scores. We do not want to take away the ability to first invert and then use uncompressed index for operations like reordering documents, but rather make it an optional step.
We want the ability to run PISA as a server and expose APIs for index operations. We can aim at compatibility between REST API and CLI, but the server would allow us to do some additional in-memory caching, and delay writing some things to disk when doing many updates.
Searching across multiple indexed "fields". This is a down-the-road nice-to-have, but our architecture should at least facilitate adding it in the future, whether in PISA itself or in third-party implementations.
Although we want to cover many compression and retrieval algorithms, this comes at a high cost during compilation. For many (most?) use cases, however, a user wants a single encoding. Thus, it would be very beneficial if we could disable some of them at compile time, reducing compilation time and binary size.
Multi-stage ranking is something we want to address in the future. Therefore, we should make sure that we facilitate easy and efficient feature calculation.
Dense retrieval will require separate type of index and therefore will have little impact on the design of the index itself, but we should consider it when designing high-level functionality.
This is a work-in-progress list of definitions, subject to change.
Collection represents any data created from a given corpus. This includes any fields (at first, only one), shards, lexicons, and any other related data structures.
Stores document data, including non-indexed payload.
❓ Should forward index be sharded together with the inverted index? This makes little difference when all resides on the same machine, but impacts distributed systems where shards are on different machines. Note that some of the document-indexed data must be available at query time for scoring, so does it make sense to make forward index part of the overall "Index" structure?
Index stores searchable text data. A single index is parameterized by its encoding format and other properties, such as whether it is quantized or not.
A single static inverted index structure, along with auxiliary data, such as document lengths, max scores (or frequencies), etc.
❓ Should we follow the existing naming convention and call it "index segment" as opposed to "shard"? This may be a good idea as "shard" could be regarded as a set of multiple index segments for purposes of distributing over multiple machines.
A static inverted index containing all of the shard's postings.
Maps document IDs to lengths.
Maps terms to internal docIDs.
Maps terms to global statistics used in scoring, e.g., no. term occurrences or no. term postings.
An inverted index in the context of this document is a static inverted index structure over a single text field. It does not support adding or modifying documents -- this is something that would be done on a higher level. However, we may need a way of marking documents for deletion on a lower level because we can't simply filter them out at the end (we would end up with fewer documents than requested). That said, it is probably best to encode deletions in a separate structure and handle them when aggregating the top-k documents, as opposed to having them encoded directly in the index. This way, we may optimize retrieval in cases when no deletions are detected.
An inverted index is parameterized by the type of posting list encoding. Posting lists contain pairs of document IDs and an integer payload (frequency or quantized score).
The index interface must provide the following functionality:
- return size (number of terms),
- return number of documents,
- access a posting list at a given position (equivalent to the term ID).
Some additional functionality may be provided, such as iterating over all posting lists.
Note that we do not mention any merging capability here, as merging must be done at the segment level (which includes lexicons and document lengths).
We can express the index interface in terms of a C++ concept:
```cpp
template <typename T, typename Cursor>
concept InvertedIndex = PostingCursor<Cursor, std::uint32_t>
    && requires(T const i, std::uint32_t termid) {
        { i.operator[](termid) } -> std::same_as<Cursor>;
        { i.num_docs() } -> std::convertible_to<std::size_t>;
        { i.size() } -> std::convertible_to<std::size_t>;
    };
```
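For illustration, any type satisfying this concept can be used generically. The helper below is a hypothetical sketch, not part of the proposed API:

```cpp
#include <cstdint>
#include <iostream>

// Hypothetical usage sketch: iterate over all posting lists of any type
// satisfying InvertedIndex and print the length of each list.
template <typename Index>
void print_list_sizes(Index const& index) {
    for (std::uint32_t term = 0; term < index.size(); ++term) {
        auto cursor = index[term];  // a PostingCursor for this term
        std::cout << "term " << term << ": " << cursor.size() << " postings\n";
    }
}
```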
When accessing a posting list, we get a posting cursor, expressed here by the `PostingCursor` concept, which will be explained below. Notice that the cursor is parameterized by another type. This is because this concept is broader than just reading from an index structure. For example, we can apply a transformation for posting scoring, which produces a cursor of type `T` that satisfies `PostingCursor<T, float>`.
As mentioned above, a cursor is parameterized by its payload, which is always an integer for an inverted index representation, but can be different if we apply a transformation.
Any cursor implements the following functionality:
- return the number of postings in the list,
- return the document ID at the current position,
- return the payload value at the current position,
- move to the next position.
Here is a C++ concept representation:
```cpp
template <typename C, typename Payload>
concept PostingCursor = requires(C const& cursor, Payload payload) {
    { cursor.docid() } -> std::convertible_to<std::uint32_t>;
    { cursor.size() } noexcept -> std::convertible_to<std::size_t>;
} && requires(C cursor) {
    { cursor.value() } -> std::convertible_to<Payload>;
    cursor.next();
};
```
We can build upon this definition to define some specialized cursors. For example, for a posting list in which the postings are sorted in the increasing order of document IDs, we define an additional function `next_geq(docid)` that moves the cursor to the next position at which the document ID is at least `docid` (this operation doesn't make sense if the ordering is different). Thus, we can define `SortedPostingCursor` as follows:
```cpp
template <typename C, typename Payload>
concept SortedPostingCursor = PostingCursor<C, Payload>
    && requires(C cursor, std::uint32_t docid) {
        cursor.next_geq(docid);
    };
```
Any document-at-a-time (DAAT) retrieval algorithm will require this concept to be satisfied by the cursors, while term-at-a-time (TAAT) algorithms will have a relaxed requirement.
Similarly, we can define concepts for other cursors, e.g., for posting lists stored in score-at-a-time (SAAT) order.
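To make these interfaces concrete, here is a minimal in-memory sketch that satisfies both `PostingCursor<VectorCursor, std::uint32_t>` and `SortedPostingCursor`. The class name and the sentinel convention (an exhausted cursor reports a document ID equal to the number of documents) are assumptions for illustration, and a real encoding would implement `next_geq` with skip data rather than a linear scan:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical in-memory cursor over parallel vectors of document IDs and
// frequencies. Satisfies PostingCursor<VectorCursor, std::uint32_t> and, as
// long as m_docids is sorted, SortedPostingCursor as well.
class VectorCursor {
  public:
    VectorCursor(std::vector<std::uint32_t> docids,
                 std::vector<std::uint32_t> freqs,
                 std::uint32_t num_docs)
        : m_docids(std::move(docids)), m_freqs(std::move(freqs)), m_num_docs(num_docs) {}

    // Sentinel convention (an assumption): an exhausted cursor reports num_docs.
    [[nodiscard]] auto docid() const -> std::uint32_t {
        return m_pos < m_docids.size() ? m_docids[m_pos] : m_num_docs;
    }
    [[nodiscard]] auto value() const -> std::uint32_t { return m_freqs[m_pos]; }
    [[nodiscard]] auto size() const noexcept -> std::size_t { return m_docids.size(); }

    void next() { ++m_pos; }

    // Move to the first position with a document ID >= docid. A linear scan is
    // used here; real encodings would use skip pointers or block metadata.
    void next_geq(std::uint32_t docid) {
        while (m_pos < m_docids.size() && m_docids[m_pos] < docid) {
            ++m_pos;
        }
    }

  private:
    std::vector<std::uint32_t> m_docids;
    std::vector<std::uint32_t> m_freqs;
    std::uint32_t m_num_docs;
    std::size_t m_pos = 0;
};
```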
Two major operations on posting cursors are transformations and aggregations.
A transformation takes a cursor, or multiple cursors, and produces another cursor. An example of a transformation is a scoring cursor. Given a cursor and a scoring function (with some additional data used for scoring the particular term/list), it produces another cursor with a floating point payload. Other kinds of transformations are unions and intersections. These can be used for more complex retrieval algorithms (for example: https://dl.acm.org/doi/abs/10.1145/3488560.3498489).
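As an illustration of the scoring transformation, a wrapper of roughly the following shape could turn a frequency cursor into a cursor satisfying `PostingCursor<_, float>`. This is a sketch only; the `Scorer` callable and its `(docid, frequency) -> float` signature are assumptions, not the proposed API:

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>

// Hypothetical scoring transformation: wraps a frequency cursor and a scoring
// function into a cursor whose payload is a float score.
template <typename Cursor, typename Scorer>
class ScoredCursor {
  public:
    ScoredCursor(Cursor cursor, Scorer scorer)
        : m_cursor(std::move(cursor)), m_scorer(std::move(scorer)) {}

    [[nodiscard]] auto docid() const -> std::uint32_t { return m_cursor.docid(); }
    [[nodiscard]] auto size() const noexcept -> std::size_t { return m_cursor.size(); }
    [[nodiscard]] auto value() -> float { return m_scorer(m_cursor.docid(), m_cursor.value()); }

    void next() { m_cursor.next(); }

  private:
    Cursor m_cursor;
    Scorer m_scorer;
};
```

A union or intersection transformation would similarly wrap several cursors and expose the combined postings through the same cursor interface.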
To retrieve the final set of documents, we employ aggregations. They take one or more cursors and return a set of retrieved documents. For each algorithm, we define a collector. For example, if we want to retrieve all of the matching documents, we can define a collector that takes a list of cursors, transforms them into a union (or intersection), and collects them into a vector. This collector takes any type of cursor. Other collectors may constrain the type of cursors they accept. For example, the `MaxScore` or `BlockMaxWand` collectors will require special max-score and block-max-score (sorted) cursors, respectively.
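As a sketch of the simplest such aggregation, a collect-everything union over document-ID-sorted cursors could look as follows. The function name is hypothetical, and it assumes the sentinel convention used above (an exhausted cursor reports a document ID equal to `num_docs`):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical union aggregation: collects every document ID that appears in
// at least one of the given docid-sorted cursors. Assumes that an exhausted
// cursor reports a sentinel document ID equal to num_docs.
template <typename Cursor>
auto collect_union(std::vector<Cursor> cursors, std::uint32_t num_docs)
    -> std::vector<std::uint32_t> {
    std::vector<std::uint32_t> results;
    auto min_docid = [&]() {
        std::uint32_t min = num_docs;
        for (auto const& cursor : cursors) {
            min = std::min(min, cursor.docid());
        }
        return min;
    };
    for (auto docid = min_docid(); docid < num_docs; docid = min_docid()) {
        results.push_back(docid);
        for (auto& cursor : cursors) {
            if (cursor.docid() == docid) {
                cursor.next();  // advance every cursor positioned at this document
            }
        }
    }
    return results;
}
```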
❓ Does it make sense to use "aggregating cursors" that would allow for composing multiple segments? For example, if we run a query on `n` segments, we could create `n` cursors that additionally share an aggregate state (top-k heap), and then we define a collector over those `n` cursors. Can we do this? Is it preferable over running multiple queries over segments and then merging the results?
A single segment contains an inverted index as described above together with some additional data:
- document lexicon,
- document lengths,
- local term lexicon,
- term max payloads,
- block-max payload skip-lists (optional).
Note that payloads could be frequencies or scores. For dynamic indexes, it may be better to store frequencies because at least for some scoring functions the global term statistics will change over time. Scores should be stored when scores are quantized and precomputed.
This is a straightforward structure that maps document IDs to the document title (or a unique string ID that identifies a document across the entire collection). This could be a simple string array in memory, but it may be beneficial to store it in a format that can be used via memory mapping, such as our current `PayloadVector`.
```cpp
template <typename T, typename Payload>
concept Mapping = requires(T const map, std::uint32_t pos) {
    { map.operator[](pos) } -> std::convertible_to<Payload>;
    { map.size() } noexcept -> std::convertible_to<std::size_t>;
};
```
Local term lexicon stores mapping from terms to their IDs. It can be thought of as an extension of the document lexicon.
```cpp
template <typename T, typename Payload>
concept BidirectionalMapping = Mapping<T, Payload>
    && requires(T const map, Payload payload) {
        /** Get the position of the given payload. */
        { map.find(payload) } -> std::convertible_to<std::uint32_t>;
    };
```
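For illustration, resolving query terms against a lexicon satisfying `BidirectionalMapping<Lexicon, std::string_view>` might look like this (a hypothetical helper; handling of out-of-vocabulary terms is elided):

```cpp
#include <cstdint>
#include <string_view>
#include <vector>

// Hypothetical usage sketch: map query terms to term IDs through a lexicon
// satisfying BidirectionalMapping<Lexicon, std::string_view>.
template <typename Lexicon>
auto resolve_terms(Lexicon const& lexicon, std::vector<std::string_view> const& terms)
    -> std::vector<std::uint32_t> {
    std::vector<std::uint32_t> term_ids;
    for (auto term : terms) {
        term_ids.push_back(lexicon.find(term));  // out-of-vocabulary handling elided
    }
    return term_ids;
}
```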
From the interface point of view, a document lengths mapping `T` needs to satisfy `Mapping<T, std::uint32_t>` as defined above. However, it is important that its representation is efficient because it is used frequently during query processing.
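A minimal in-memory implementation satisfying `Mapping<DocumentLengths, std::uint32_t>` could look like the sketch below (the class name is hypothetical; the actual structure would likely be a memory-mapped array):

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical in-memory document lengths; satisfies
// Mapping<DocumentLengths, std::uint32_t>.
class DocumentLengths {
  public:
    explicit DocumentLengths(std::vector<std::uint32_t> lengths)
        : m_lengths(std::move(lengths)) {}

    [[nodiscard]] auto operator[](std::uint32_t docid) const -> std::uint32_t {
        return m_lengths[docid];
    }
    [[nodiscard]] auto size() const noexcept -> std::size_t { return m_lengths.size(); }

  private:
    std::vector<std::uint32_t> m_lengths;
};
```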
This is very similar to document lengths, only we map from term IDs to a payload value. However, this does not have to be as efficient because it's only accessed once per query term.
These are essentially posting lists containing IDs of the first documents in the blocks and the maximum payloads in these blocks.
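A possible in-memory shape of such a structure for a single term, purely to illustrate the data involved (the names and the use of `float` for the payload are assumptions):

```cpp
#include <cstdint>
#include <vector>

// Hypothetical block-max skip list entry: the ID of the first document in a
// block and the maximum payload within that block (a frequency or a score;
// stored as float here for simplicity).
struct BlockMaxEntry {
    std::uint32_t first_docid;
    float max_payload;
};

// One such list per term, with one entry per block of postings.
using BlockMaxList = std::vector<BlockMaxEntry>;
```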
A lookup table is a bidirectional mapping from an index, representing an internal ID, to a binary payload, such as a string. E.g., an `N`-element lookup table maps values `0...N-1` to their payloads. These tables are used for things like mapping terms to term IDs and document IDs to titles or URLs.
The format of a lookup table is designed to operate without having to parse the entire structure. Once the header is parsed, it is possible to operate directly on the binary format to access the data. In fact, a lookup table will typically be memory mapped. Therefore, it is possible to perform a lookup (or reverse lookup) without loading the entire structure into memory.
The header begins as follows:
```
+--------+--------+---------+
|  0x87  |  Ver.  |   ...   |
+--------+--------+---------+
```
The first byte is a constant identifier. When reading, we can verify whether this byte is correct to make sure we are using the correct type of data structure.
The second byte is equal to the version of the format.
The remainder of the format is defined for each version. The version is introduced in order to be able to update the format in the future while still being able to read old formats for backwards compatibility.
```
+--------+--------+--------+--------+--------+--------+--------+--------+
|  0x87  |  0x01  | Flags  |                    0x00                    |
+--------+--------+--------+--------+--------+--------+--------+--------+
|                                Length                                 |
+--------+--------+--------+--------+--------+--------+--------+--------+
|                                                                       |
|                                Offsets                                |
|                                                                       |
+-----------------------------------------------------------------------+
|                                                                       |
|                               Payloads                                |
|                                                                       |
+-----------------------------------------------------------------------+
```
Immediately after the version byte, we have the flags byte.
```
 MSB                         LSB
+---+---+---+---+---+---+---+---+
| 0 | 0 | 0 | 0 | 0 | W | E | S |
+---+---+---+---+---+---+---+---+
```
The first bit (`S`) indicates whether the payloads are sorted (1) or not (0). The second bit (`E`) indicates the endianness: little-endian (0) or big-endian (1). Note that currently we only use little-endian byte order, and thus `E=0`; we introduce the flag in case we want to allow big-endian encoding in the future. The next bit (`W`) defines the width of the offsets (see below): 32-bit (0) or 64-bit (1). In most use cases, the cumulative size of the payloads will be small enough to address with 32-bit offsets. For example, if we store words that are 16 bytes long on average, we can address over 200 million of them; for this many elements, reducing the width of the offsets would save us over 700 MB. Still, we want to support 64-bit addressing because some payloads may be much longer (e.g., URLs).
The rest of the bits in the flags byte are currently not used, but should be set to 0 to make sure that if more flags are introduced, we know what values to expect in the older iterations, and thus we can make sure to keep it backwards-compatible.
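A sketch of decoding the flags byte under this layout (the struct and function names are illustrative only):

```cpp
#include <cstdint>

// Hypothetical decoding of the flags byte; bit positions follow the layout
// above, counting from the least significant bit: S = bit 0, E = bit 1,
// W = bit 2. All higher bits must currently be zero.
struct Flags {
    bool sorted;        // S: payloads are sorted
    bool big_endian;    // E: 0 = little-endian, 1 = big-endian
    bool wide_offsets;  // W: 0 = 32-bit offsets, 1 = 64-bit offsets
};

constexpr auto parse_flags(std::uint8_t flags_byte) -> Flags {
    return Flags{
        .sorted = (flags_byte & 0b0000'0001) != 0,
        .big_endian = (flags_byte & 0b0000'0010) != 0,
        .wide_offsets = (flags_byte & 0b0000'0100) != 0,
    };
}
```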
The following 5 bytes are padding with values of 0. This is to help with byte alignment. When loaded to memory, it should be loaded with 8-byte alignment. When memory mapped, it should be already correctly aligned by the operating system (at least on Linux).
Following the padding, there is a 64-bit unsigned integer encoding the number of elements in the lexicon (`N`).
Given `N` and `W`, we can now calculate the byte range of all offsets, and thus the address offset for the start of the payloads. The offsets are `N+1` unsigned integers in the byte order given by `E`, each of the width determined by `W` (either 4 or 8 bytes). The offsets are associated with consecutive IDs from 0 to `N-1`; the last offset points at the first byte after the last payload. The offsets are relative to the beginning of the first payload, therefore the first offset will always be 0.
Payloads are arbitrary bytes, and must be interpreted by the software. Although the typical use case is strings, the payload can be any binary data.
The boundaries of the k-th payload are calculated by reading the k-th and (k+1)-th offsets.
If the payloads are sorted (`S=1`), we can find the ID of a given payload with a binary search.
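Putting the pieces together, a reader over the memory-mapped bytes might be sketched as follows. This is only an illustration of the layout described above: it assumes a little-endian host, elides error handling (magic byte and version checks, bounds checks), and the class name is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>
#include <string_view>

// Hypothetical view over a memory-mapped lookup table, following the layout
// above: 1-byte magic, 1-byte version, 1-byte flags, 5 bytes of padding,
// a 64-bit element count N, N+1 offsets, then the payload bytes.
class LookupTableView {
  public:
    explicit LookupTableView(std::uint8_t const* data) : m_data(data) {
        std::uint64_t count = 0;
        std::memcpy(&count, m_data + 8, sizeof(count));     // element count N
        m_size = count;
        m_sorted = (m_data[2] & 0b001) != 0;                // S flag
        m_offset_width = (m_data[2] & 0b100) != 0 ? 8 : 4;  // W flag
        m_payloads = m_data + 16 + (m_size + 1) * m_offset_width;
    }

    [[nodiscard]] auto size() const noexcept -> std::size_t { return m_size; }

    // Return the k-th payload as a view over the mapped bytes.
    [[nodiscard]] auto operator[](std::size_t k) const -> std::string_view {
        auto begin = read_offset(k);
        auto end = read_offset(k + 1);
        return {reinterpret_cast<char const*>(m_payloads + begin),
                static_cast<std::size_t>(end - begin)};
    }

    // Reverse lookup via binary search; only meaningful when S = 1.
    [[nodiscard]] auto find(std::string_view payload) const -> std::optional<std::size_t> {
        if (!m_sorted) {
            return std::nullopt;
        }
        std::size_t lo = 0;
        std::size_t hi = m_size;
        while (lo < hi) {
            auto mid = lo + (hi - lo) / 2;
            if ((*this)[mid] < payload) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        if (lo < m_size && (*this)[lo] == payload) {
            return lo;
        }
        return std::nullopt;
    }

  private:
    // Read the k-th offset (4 or 8 bytes, depending on W), assuming a
    // little-endian host so the partial memcpy fills the low-order bytes.
    [[nodiscard]] auto read_offset(std::size_t k) const -> std::uint64_t {
        std::uint64_t offset = 0;
        std::memcpy(&offset, m_data + 16 + k * m_offset_width, m_offset_width);
        return offset;
    }

    std::uint8_t const* m_data;
    std::uint8_t const* m_payloads = nullptr;
    std::uint64_t m_size = 0;
    std::size_t m_offset_width = 4;
    bool m_sorted = false;
};
```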