
PISA v1 Proposal


Introduction

The purpose of this document is to propose a high-level architecture and format for the PISA index and related structures.

Goals

This section describes the desired outcomes of the proposed design, in no particular order.

General

Performance

We aim to preserve the high-performance nature of the project. Although some concessions might be made to facilitate new functionality, performance when that functionality is not used should generally not decline. However, in certain cases, we may still decide to sacrifice some performance if it improves maintainability or user experience, though such trade-offs would have to be considered on a case-by-case basis and should be avoided if possible.

Functionality

We want to preserve the existing functionality unless we deem some of it unnecessary. For example, we want to maintain the support for a range of different retrieval algorithms, index compression techniques, etc. It is acceptable that some new functionality excludes some of these or otherwise compromises performance for them, e.g., if we introduce index updates, it may be less efficient to do it for quantized indexes, but the current functionality of the quantized indexes should be maintained.

Easy by default, powerful if needed

We want PISA to be easy to use by default, but still powerful and flexible when needed. For example, we want to introduce abstractions to manage an index without having to manually keep track of all the individual files (index, lexicon, metadata, etc.), while still maintaining low-level access to those files for power users who need it.

Composability

Our code should facilitate building custom systems. A single component, e.g., inverted index or lexicon, should be decoupled enough to use on its own.

Extensibility

The project should be easy to extend with new functionality, which is especially important for academic purposes.

Safety

We currently suffer from many segfaults whenever someone makes a mistake, e.g., passes the wrong encoding value when running queries. This should be avoided by encoding these parameters in the index itself, so that checks can be made and we can gracefully shut down if two components are not compatible.

Stability

The v1 API should be stable and backwards-compatible. Thus, we should also clearly delineate which parts of the code are part of the public API, e.g., by moving any "private" template code to a separate "private" namespace (such as pisa::details). This should be clearly documented.

Documentation

We should clearly document all the components and public-facing APIs and commands.

New Functionality

This is a list of new functionality we would like to support in the future. Note that this does not mean these have to be implemented for v1, but rather we should have them in mind when designing.

Index updates

We want a facility to update an existing index. This will likely mean implementing two features: index merging and querying multiple indexes at a time. Additionally, we may want to implement an in-memory dynamic index for new documents before they are written to disk in the usual format. Similarly, we can think of document deletion, which most likely involves marking documents as deleted in the existing index and "committing" the deletes by rebuilding the index structure without the marked documents. Furthermore, document modifications can be implemented by adding a new document and marking the old one for deletion.

Building index from collection

This is related to, and may be at least partially covered by, index updates. A user who is not interested in index details should be able to feed documents to PISA and get all the necessary structures instead of manually going through multiple stages.

Unification of compressed and uncompressed index formats

We should be able to index documents straight to the compressed format, and use the compressed format in place of the uncompressed one when doing things like computing max-scores. We do not want to take away the ability to first invert and then use the uncompressed index for operations like reordering documents, but rather make it an optional step.

Server mode

We want the ability to run PISA as a server and expose APIs for index operations. We can aim for compatibility between the REST API and the CLI, but the server would also allow us to do some additional in-memory caching and to delay writing some things to disk when performing many updates.

Faceted search

Searching across multiple indexed "fields". This is a down-the-road nice-to-have, but we should at least make sure our architecture facilitates it for future or third-party implementations.

Disabling features at compile time

Although we want to cover many compression and retrieval algorithms, this comes at a high cost during compilation. For many (most?) use cases, however, a user wants a single encoding. Thus, it would be very beneficial if we could disable some of them at compile time, reducing compilation time and binary size.

Reranking

Multi-stage ranking is something we want to address in the future. Therefore, we should make sure that we facilitate easy and efficient feature calculation.

Dense retrieval

Dense retrieval will require a separate type of index and therefore will have little impact on the design of the index itself, but we should consider it when designing high-level functionality.

Definitions

This is a work-in-progress list of definitions, subject to change.

Collection

A collection represents any data created from a given corpus. This includes any fields (at first, only one), shards, lexicons, and any other related data structures.

Forward Index

Stores document data, including non-indexed payload.

❓ Should the forward index be sharded together with the inverted index? This makes little difference when everything resides on the same machine, but it impacts distributed systems where shards are on different machines. Note that some of the document-indexed data must be available at query time for scoring, so does it make sense to make the forward index part of the overall "Index" structure?

Index

An index stores searchable text data. A single index is parameterized by its encoding format and other properties, such as whether or not it is quantized.

Shard (Index segment)

A single static inverted index structure, along with auxiliary data, such as document lengths, max scores (or frequencies), etc.

❓ Should we follow the existing naming convention and call it "index segment" as opposed to "shard"? This may be a good idea as "shard" could be regarded as a set of multiple index segments for purposes of distributing over multiple machines.

Posting Lists

A static inverted index containing all of the shard's postings.

Document lengths

Maps document IDs to lengths.

Local Term Lexicon

Maps terms to internal term IDs.

Global Term Lexicon

Maps terms to global statistics used in scoring, e.g., the number of term occurrences or the number of term postings.

Inverted Index

An inverted index, in the context of this document, is a static inverted index structure over a single text field. It does not support adding or modifying documents -- this is something that would be done at a higher level. However, we may need a way of marking documents for deletion at a lower level because we can't simply filter them out at the end (we would end up with fewer documents than requested). That said, it is probably best to encode that in a separate structure and handle it when aggregating top-k documents, as opposed to encoding it directly in the index. This way, we may optimize retrieval in cases when no deletions are detected.

An inverted index is parameterized by the type of posting list encoding. Posting lists contain pairs of document IDs and an integer payload (frequency or quantized score).

The index interface must provide the following functionality:

  • return size (number of terms),
  • return number of documents,
  • access a posting list at a given position (equivalent to the term ID).

Some additional functionality may be provided, such as iterating over all posting lists.

Note that we do not mention any merging capability here, as merging must be done at the segment level (which includes lexicons and document lengths).

We can express the index interface as a C++ concept:

template <typename T, typename Cursor>
concept InvertedIndex = PostingCursor<Cursor, std::uint32_t>
&& requires(T const i, std::uint32_t termid) {
    /** Access the posting list of the term with the given ID. */
    { i.operator[](termid) } -> std::same_as<Cursor>;
    /** Return the number of documents in the index. */
    { i.num_docs() } -> std::convertible_to<std::size_t>;
    /** Return the number of terms (posting lists) in the index. */
    { i.size() } -> std::convertible_to<std::size_t>;
};

When accessing a posting list, we get a posting cursor, expressed here by the PostingCursor concept, which will be explained below.

Notice that the cursor is parameterized by another type. This is because this concept is broader than just reading from an index structure. For example, we can apply a transformation for posting scoring, which produces a cursor of type T that satisfies PostingCursor<T, float>.

Posting Cursor

As mentioned above, a cursor is parameterized by its payload, which is always an integer for an inverted index representation, but can be different if we apply a transformation.

Any cursor implements the following functionality:

  • return the number of postings in the list,
  • return the document ID at the current position,
  • return the payload value at the current position,
  • move to the next position.

Here is a C++ concept representation:

template <typename C, typename Payload>
concept PostingCursor = requires(C const& cursor) {
    /** Return the document ID at the current position. */
    { cursor.docid() } -> std::convertible_to<std::uint32_t>;
    /** Return the number of postings in the list. */
    { cursor.size() } noexcept -> std::convertible_to<std::size_t>;
} && requires(C cursor) {
    /** Return the payload value at the current position. */
    { cursor.value() } -> std::convertible_to<Payload>;
    /** Move to the next position. */
    cursor.next();
};

We can build upon this definition to define specialized cursors. For example, for a posting list in which the postings are sorted in increasing order of document IDs, we define an additional function next_geq(d) that moves the cursor to the next position at which the document ID is at least d (this operation doesn't make sense if the ordering is different).

Thus, we can define SortedPostingCursor as follows:

template <typename C, typename Payload>
concept SortedPostingCursor = PostingCursor<C, Payload>
&& requires(C cursor, std::uint32_t docid) {
    /** Move to the next position at which the document ID is at least `docid`. */
    cursor.next_geq(docid);
};

Any document-at-a-time (DAAT) retrieval algorithm will require this concept to be satisfied by the cursors, while term-at-a-time (TAAT) algorithms will have a relaxed requirement.

Similarly, we can define concepts for other cursors, e.g., for posting lists stored in score-at-a-time (SAAT) order.
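
To make the interplay between the index and cursor concepts concrete, below is a minimal usage sketch that iterates over a posting list obtained from any type satisfying InvertedIndex. The function name and the exhaustion strategy (looping size() times) are assumptions of this example, not part of the proposal.

#include <cstddef>
#include <cstdint>
#include <iostream>

// Minimal usage sketch: `Index` is assumed to satisfy the InvertedIndex
// concept above, so `index[termid]` yields a posting cursor. Iterating by
// counting up to `size()` is an assumption made for this example.
template <typename Index>
void print_postings(Index const& index, std::uint32_t termid) {
    auto cursor = index[termid];
    std::size_t length = cursor.size();
    for (std::size_t position = 0; position < length; ++position) {
        std::cout << cursor.docid() << ' ' << cursor.value() << '\n';
        cursor.next();
    }
}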

Transformations & Aggregations

Two major operations on posting cursors are transformations and aggregations.

A transformation takes a cursor, or multiple cursors, and produces another cursor. An example of a transformation is a scoring cursor: given a cursor and a scoring function (with some additional data used for scoring the particular term/list), it produces another cursor with a floating-point payload. Other kinds of transformations are unions and intersections. These can be used for more complex retrieval algorithms (for example: https://dl.acm.org/doi/abs/10.1145/3488560.3498489).
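
As an illustration, a scoring transformation could be sketched roughly as follows. The wrapper name and the scorer's call signature (taking a document ID and a frequency) are assumptions made for this example, not part of the proposal.

#include <cstddef>
#include <cstdint>
#include <utility>

// Hypothetical sketch: wraps an integer-payload cursor and a scorer into a
// cursor with a float payload, so that the result satisfies
// PostingCursor<ScoredCursor<Cursor, Scorer>, float>.
template <typename Cursor, typename Scorer>
class ScoredCursor {
  public:
    ScoredCursor(Cursor cursor, Scorer scorer)
        : m_cursor(std::move(cursor)), m_scorer(std::move(scorer)) {}

    [[nodiscard]] auto docid() const -> std::uint32_t { return m_cursor.docid(); }
    [[nodiscard]] auto size() const noexcept -> std::size_t { return m_cursor.size(); }
    /** The payload is now a score rather than a frequency. */
    [[nodiscard]] auto value() -> float { return m_scorer(m_cursor.docid(), m_cursor.value()); }
    void next() { m_cursor.next(); }

  private:
    Cursor m_cursor;
    Scorer m_scorer;
};

If the wrapped cursor is sorted, forwarding next_geq in the same way would make the scored cursor satisfy SortedPostingCursor with a float payload as well.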

To retrieve the final set of documents, we employ aggregations. They take one or more cursors and return a set of retrieved documents. For each algorithm, we define a collector. For example, if we want to retrieve all of the matching documents, we can define a collector that takes a list of cursors, transforms them into a union (or intersection), and collects the results into a vector. This collector accepts any type of cursor. Other collectors may constrain the types of cursors they accept. For example, the MaxScore and BlockMaxWand collectors will require special max-score and block-max-score (sorted) cursors, respectively.
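
For instance, a collector that gathers all matching documents from a union of sorted cursors might be sketched as follows, assuming float-payload cursors (e.g., after a scoring transformation). The sentinel convention for exhausted cursors (docid() returning the maximum 32-bit value) is an assumption of this sketch; the proposal does not prescribe how exhaustion is signaled.

#include <algorithm>
#include <cstdint>
#include <limits>
#include <utility>
#include <vector>

// Sketch of an "all matching documents" collector over a union of sorted
// cursors. Exhausted cursors are assumed to report the sentinel docid.
template <typename Cursor>
auto collect_union(std::vector<Cursor> cursors) -> std::vector<std::pair<std::uint32_t, float>> {
    constexpr auto sentinel = std::numeric_limits<std::uint32_t>::max();
    std::vector<std::pair<std::uint32_t, float>> results;
    while (true) {
        // Find the smallest document ID among the cursors.
        std::uint32_t docid = sentinel;
        for (auto& cursor: cursors) {
            docid = std::min(docid, cursor.docid());
        }
        if (docid == sentinel) {
            break;  // all cursors are exhausted
        }
        // Accumulate the payloads of all cursors positioned at this document.
        float score = 0.0F;
        for (auto& cursor: cursors) {
            if (cursor.docid() == docid) {
                score += cursor.value();
                cursor.next();
            }
        }
        results.emplace_back(docid, score);
    }
    return results;
}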

❓ Does it make sense to use "aggregating cursors" that would allow for composing multiple segments? For example, if we run a query on n segments, we could create n cursors that additionally share an aggregate state (top-k heap), and then we define a collector over those n cursors. Can we do this? Is it preferable over running multiple queries over segments and then merging the results?

Index Segment

A single segment contains an inverted index as described above together with some additional data:

  • document lexicon,
  • document lengths,
  • local term lexicon,
  • term max payloads,
  • block-max payload skip-lists (optional).

Note that payloads could be frequencies or scores. For dynamic indexes, it may be better to store frequencies because at least for some scoring functions the global term statistics will change over time. Scores should be stored when scores are quantized and precomputed.

Document lexicon

This is a straightforward structure that maps document IDs to document titles (or unique string IDs that identify a document across the entire collection). This could be a simple string array in memory, but it may be beneficial to store it in a format that can be used via memory mapping, such as our current PayloadVector.

template <typename T, typename Payload>
concept Mapping = requires(T const map, std::uint32_t pos) {
    /** Get the payload at the given position. */
    { map.operator[](pos) } -> std::convertible_to<Payload>;
    /** Get the number of elements in the mapping. */
    { map.size() } noexcept -> std::convertible_to<std::size_t>;
};

Local term lexicon

The local term lexicon stores a mapping from terms to their IDs. It can be thought of as an extension of the document lexicon.

template <typename T, typename Payload>
concept BidirectionalMapping = Mapping<T, Payload> && requires(T const map, Payload payload) {
    /** Get the position of the given payload. */
    { map.find(payload) } -> std::convertible_to<std::uint32_t>;
};

Document lengths

From the interface point of view, a document lengths mapping T needs to satisfy Mapping<T, std::uint32_t> as defined above. However, it is important that its representation is efficient, because it is used frequently during query processing.

Term max payloads

This is very similar to document lengths, except that we map from term IDs to a payload value. However, it does not have to be as efficient, because it is only accessed once per query term.

Block-max payload skip-lists

These are essentially posting lists containing IDs of the first documents in the blocks and the maximum payloads in these blocks.
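
At the cursor level, this data could be exposed as a refinement of the sorted cursor concept, for example along the following lines; the member names are hypothetical and not part of the proposal.

template <typename C, typename Payload>
concept BlockMaxPostingCursor = SortedPostingCursor<C, Payload>
&& requires(C cursor) {
    /** Maximum payload within the current block. */
    { cursor.block_max_value() } -> std::convertible_to<Payload>;
    /** Document ID of the first posting in the current block. */
    { cursor.block_first_docid() } -> std::convertible_to<std::uint32_t>;
};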

Format Specs

Lookup table

A lookup table is a bidirectional mapping from an index, representing an internal ID, to a binary payload, such as a string. E.g., an N-element lookup table maps values 0...N-1 to their payloads. These tables are used for things like mapping terms to term IDs and document IDs to titles or URLs.

The format of a lookup table is designed to operate without having to parse the entire structure. Once the header is parsed, it is possible to operate directly on the binary format to access the data. In fact, a lookup table will typically be memory mapped. Therefore, it is possible to perform a lookup (or reverse lookup) without loading the entire structure into memory.

The header begins as follows:

+--------+--------+--------   -+
|  0x87  |  Ver.  |        ... |
+--------+--------+--------   -+

The first byte is a constant identifier. When reading, we can verify whether this byte is correct to make sure we are using the correct type of data structure.

The second byte is equal to the version of the format.

The remainder of the format is defined separately for each version. The version is introduced in order to be able to update the format in the future while still being able to read old formats for backwards compatibility.

v1

+--------+--------+--------+--------+--------+--------+--------+--------+
|  0x87  |  0x01  | Flags  |                    0x00                    |
+--------+--------+--------+--------+--------+--------+--------+--------+
|                                 Length                                |
+--------+--------+--------+--------+--------+--------+--------+--------+
|                                                                       |
|                                Offsets                                |
|                                                                       |
+-----------------------------------------------------------------------+
|                                                                       |
|                                Payloads                               |
|                                                                       |
+-----------------------------------------------------------------------+

Immediately after the version byte, we have the flags byte.

 MSB                         LSB
+---+---+---+---+---+---+---+---+
| 0 | 0 | 0 | 0 | 0 | W | E | S |
+---+---+---+---+---+---+---+---+

The first bit (S) indicates whether the payloads are sorted (1) or not (0). The second bit (E) indicates the endianness: little-endian (0) or big-endian (1). Note that currently we only use little-endian byte order, and thus E=0. We introduce the flag in case we want to allow big-endian encoding in the future. The next bit (W) defines the width of the offsets (see below): 32-bit (0) or 64-bit (1). In most use cases, the cumulative size of the payloads will be small enough to address with 32-bit offsets. For example, if we store words that are 16 bytes long on average, we can address over 200 million of them. For this many elements, reducing the width of the offsets would save us over 700 MB. Still, we want to support 64-bit addressing because some payloads may be much longer (e.g., URLs).

The rest of the bits in the flags byte are currently unused, but should be set to 0 so that, if more flags are introduced, we know what values to expect in older iterations of the format and can keep it backwards-compatible.

The following 5 bytes are padding with values of 0. This is to help with byte alignment. When loaded to memory, it should be loaded with 8-byte alignment. When memory mapped, it should be already correctly aligned by the operating system (at least on Linux).

Following the padding, there is a 64-bit unsigned integer encoding the number of elements in the lookup table (N).

Given N and W, we can now calculate the byte range of all offsets, and thus the address offset of the start of the payloads. The offsets are N+1 E-endian unsigned integers of the size determined by W (either 4 or 8 bytes). The offsets are associated with consecutive IDs from 0 to N-1; the last offset points to the first byte after the last payload. The offsets are relative to the beginning of the first payload; therefore, the first offset will always be 0.

Payloads are arbitrary bytes and must be interpreted by the software. Although the typical use case is strings, a payload can be any binary data.

The boundaries of the k-th payload are calculated by reading the k-th and (k+1)-th offsets.

If the payloads are sorted (S), we can find the ID of a given payload with a binary search.
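
Putting the above together, a reader of this format might look roughly like the following sketch, assuming a little-endian machine and a buffer obtained from a memory-mapped file; the names and the error handling are illustrative only.

#include <cstddef>
#include <cstdint>
#include <cstring>
#include <span>
#include <stdexcept>
#include <string_view>

// Illustrative view over the v1 lookup table layout described above.
struct LookupTableView {
    bool sorted;                          // S flag
    bool wide_offsets;                    // W flag: 64-bit offsets if true
    std::uint64_t size;                   // N, the number of payloads
    std::span<std::byte const> offsets;   // N+1 offsets, each 4 or 8 bytes wide
    std::span<std::byte const> payloads;  // concatenated payload bytes
};

inline auto parse_lookup_table(std::span<std::byte const> data) -> LookupTableView {
    if (data.size() < 16 || std::to_integer<std::uint8_t>(data[0]) != 0x87) {
        throw std::runtime_error("not a lookup table");
    }
    if (std::to_integer<std::uint8_t>(data[1]) != 0x01) {
        throw std::runtime_error("unsupported version");
    }
    auto flags = std::to_integer<std::uint8_t>(data[2]);
    bool sorted = (flags & 0b001) != 0;   // S is the least significant bit
    // E (bit 1) is the endianness flag; this sketch handles only E = 0.
    bool wide = (flags & 0b100) != 0;     // W selects 32-bit or 64-bit offsets
    std::uint64_t num_elements = 0;       // N follows the 5 padding bytes
    std::memcpy(&num_elements, data.data() + 8, sizeof(num_elements));
    std::size_t offset_width = wide ? 8 : 4;
    std::size_t offsets_bytes = (num_elements + 1) * offset_width;
    return LookupTableView{
        sorted, wide, num_elements,
        data.subspan(16, offsets_bytes),
        data.subspan(16 + offsets_bytes)};
}

// Boundaries of the k-th payload: read the k-th and (k+1)-th offsets.
inline auto payload_at(LookupTableView const& table, std::uint64_t k) -> std::string_view {
    std::size_t width = table.wide_offsets ? 8 : 4;
    auto read_offset = [&](std::uint64_t index) {
        std::uint64_t value = 0;
        std::memcpy(&value, table.offsets.data() + index * width, width);
        return value;
    };
    auto begin = read_offset(k);
    auto end = read_offset(k + 1);
    return std::string_view(
        reinterpret_cast<char const*>(table.payloads.data()) + begin, end - begin);
}

A reverse lookup on a sorted table then reduces to a binary search over payload_at.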