Search optimization and indexing based on datetime #405

Open
wants to merge 7 commits into
base: main
4 changes: 4 additions & 0 deletions .github/workflows/cicd.yml
@@ -28,6 +28,7 @@ jobs:
xpack.security.enabled: false
xpack.security.transport.ssl.enabled: false
ES_JAVA_OPTS: -Xms512m -Xmx1g
action.destructive_requires_name: false
ports:
- 9200:9200

@@ -44,6 +45,7 @@ jobs:
xpack.security.enabled: false
xpack.security.transport.ssl.enabled: false
ES_JAVA_OPTS: -Xms512m -Xmx1g
action.destructive_requires_name: false
ports:
- 9400:9400

@@ -60,6 +62,7 @@ jobs:
plugins.security.disabled: true
plugins.security.ssl.http.enabled: true
OPENSEARCH_JAVA_OPTS: -Xms512m -Xmx512m
action.destructive_requires_name: false
ports:
- 9202:9202

@@ -120,5 +123,6 @@ jobs:
ES_PORT: ${{ matrix.backend == 'elasticsearch7' && '9400' || matrix.backend == 'elasticsearch8' && '9200' || '9202' }}
ES_HOST: 172.17.0.1
ES_USE_SSL: false
DATABASE_REFRESH: true
ES_VERIFY_CERTS: false
BACKEND: ${{ matrix.backend == 'elasticsearch7' && 'elasticsearch' || matrix.backend == 'elasticsearch8' && 'elasticsearch' || 'opensearch' }}
26 changes: 26 additions & 0 deletions CHANGELOG.md
@@ -8,6 +8,32 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.

## [Unreleased]

### Added

- Added comprehensive index management system with dynamic selection and insertion strategies for improved performance and scalability [#405](https://github.com/stac-utils/stac-fastapi-elasticsearch-opensearch/pull/405)
- Added `ENABLE_DATETIME_INDEX_FILTERING` environment variable to enable datetime-based index selection using collection IDs. When enabled, the system creates indexes with UUID-based names and manages them through time-based aliases. Default is `false`. [#405](https://github.com/stac-utils/stac-fastapi-elasticsearch-opensearch/pull/405)
- Added `DATETIME_INDEX_MAX_SIZE_GB` environment variable to set maximum size limit in GB for datetime-based indexes. When an index exceeds this size, a new time-partitioned index will be created. Note: add +20% to target size due to ES/OS compression. Default is `25` GB. Only applies when `ENABLE_DATETIME_INDEX_FILTERING` is enabled. [#405](https://github.com/stac-utils/stac-fastapi-elasticsearch-opensearch/pull/405)
- Added index operations system with unified interface for both Elasticsearch and OpenSearch [#405](https://github.com/stac-utils/stac-fastapi-elasticsearch-opensearch/pull/405):
- `IndexOperations` class with common index creation and management methods
- UUID-based physical index naming: `{prefix}_{collection-id}_{uuid4}`
- Alias management: main collection alias, temporal aliases, and closed index aliases
- Automatic alias updates when indexes reach size limits
- Added datetime-based index selection strategies with caching support [#405](https://github.com/stac-utils/stac-fastapi-elasticsearch-opensearch/pull/405):
- `DatetimeBasedIndexSelector` for temporal filtering with intelligent caching
- `IndexCacheManager` with configurable TTL-based cache expiration (default 1 hour)
- `IndexAliasLoader` for alias management and cache refresh
- `UnfilteredIndexSelector` as fallback for returning all available indexes
- Added index insertion strategies with automatic partitioning [#405](https://github.com/stac-utils/stac-fastapi-elasticsearch-opensearch/pull/405):
- Simple insertion strategy (`SimpleIndexInserter`) for traditional single-index-per-collection approach
- Datetime-based insertion strategy (`DatetimeIndexInserter`) with time-based partitioning
- Automatic index size monitoring and splitting when limits exceeded
- Handling of chronologically early data and bulk operations
- Added index management utilities [#405](https://github.com/stac-utils/stac-fastapi-elasticsearch-opensearch/pull/405):
- `IndexSizeManager` for size monitoring and overflow handling with compression awareness
- `DatetimeIndexManager` for datetime-based index operations and validation
- Factory patterns (`IndexInsertionFactory`, `IndexSelectorFactory`) for strategy creation based on configuration


## [v6.1.0] - 2025-07-24

### Added
15 changes: 10 additions & 5 deletions Makefile
@@ -27,7 +27,7 @@ run_os = docker compose \
.PHONY: image-deploy-es
image-deploy-es:
docker build -f dockerfiles/Dockerfile.dev.es -t stac-fastapi-elasticsearch:latest .

.PHONY: image-deploy-os
image-deploy-os:
docker build -f dockerfiles/Dockerfile.dev.os -t stac-fastapi-opensearch:latest .
@@ -71,14 +71,19 @@ test-opensearch:
-$(run_os) /bin/bash -c 'export && ./scripts/wait-for-it-es.sh opensearch:9202 && cd stac_fastapi/tests/ && pytest'
docker compose down

-.PHONY: test
-test:
-	-$(run_es) /bin/bash -c 'export && ./scripts/wait-for-it-es.sh elasticsearch:9200 && cd stac_fastapi/tests/ && pytest --cov=stac_fastapi --cov-report=term-missing'
+.PHONY: test-datetime-filtering-es
+test-datetime-filtering-es:
+	-$(run_es) /bin/bash -c 'export ENABLE_DATETIME_INDEX_FILTERING=true && ./scripts/wait-for-it-es.sh elasticsearch:9200 && cd stac_fastapi/tests/ && pytest -s --cov=stac_fastapi --cov-report=term-missing -m datetime_filtering'
 	docker compose down

-	-$(run_os) /bin/bash -c 'export && ./scripts/wait-for-it-es.sh opensearch:9202 && cd stac_fastapi/tests/ && pytest --cov=stac_fastapi --cov-report=term-missing'
+.PHONY: test-datetime-filtering-os
+test-datetime-filtering-os:
+	-$(run_os) /bin/bash -c 'export ENABLE_DATETIME_INDEX_FILTERING=true && ./scripts/wait-for-it-es.sh opensearch:9202 && cd stac_fastapi/tests/ && pytest -s --cov=stac_fastapi --cov-report=term-missing -m datetime_filtering'
 	docker compose down

+.PHONY: test
+test: test-elasticsearch test-datetime-filtering-es test-opensearch test-datetime-filtering-os

.PHONY: run-database-es
run-database-es:
docker compose run --rm elasticsearch
76 changes: 75 additions & 1 deletion README.md
@@ -230,6 +230,81 @@ You can customize additional settings in your `.env` file:
> [!NOTE]
> The variables `ES_HOST`, `ES_PORT`, `ES_USE_SSL`, `ES_VERIFY_CERTS` and `ES_TIMEOUT` apply to both Elasticsearch and OpenSearch backends, so there is no need to rename the key names to `OS_` even if you're using OpenSearch.

# Datetime-Based Index Management

## Overview

SFEOS supports two indexing strategies for managing STAC items:

1. **Simple Indexing** (default) - One index per collection
2. **Datetime-Based Indexing** - Time-partitioned indexes with automatic management

The datetime-based indexing strategy is particularly useful for large temporal datasets. When a query includes a datetime parameter, the system searches only the indexes whose time range matches the request instead of every index in the collection. This can make searches **several times faster** and significantly **reduce database load**.

## When to Use

**Recommended for:**
- Systems with large collections containing millions of items
- Systems requiring high-performance temporal searching

**Pros:**
- Queries with a datetime filter can be several times faster
- Reduced database load - only relevant indexes are searched

**Cons:**
- Slightly longer item indexing time (automatic index management)
- Greater management complexity

## Configuration

### Enabling Datetime-Based Indexing

Enable datetime-based indexing by setting the following environment variable:

```bash
ENABLE_DATETIME_INDEX_FILTERING=true
```

### Related Configuration Variables

| Variable | Description | Default | Example |
|----------|-------------|---------|---------|
| `ENABLE_DATETIME_INDEX_FILTERING` | Enables time-based index partitioning | `false` | `true` |
| `DATETIME_INDEX_MAX_SIZE_GB` | Maximum size limit for datetime indexes (GB) - note: add +20% to target size due to ES/OS compression | `25` | `50` |
| `STAC_ITEMS_INDEX_PREFIX` | Prefix for item indexes | `items_` | `stac_items_` |
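
As a rough illustration of how these variables and their defaults fit together, the sketch below reads them from the environment. The variable names and defaults come from the table above; the function name `load_datetime_index_settings` and the set of accepted boolean spellings are assumptions made for this example and may differ from how SFEOS parses its settings.

```python
import os
from typing import Mapping, Optional


def load_datetime_index_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read the datetime-index settings, falling back to the documented defaults.

    Hypothetical helper for illustration only; not part of SFEOS.
    """
    env = os.environ if env is None else env
    # Assumption: "true", "1" and "yes" (any case) all enable the feature.
    raw_enabled = env.get("ENABLE_DATETIME_INDEX_FILTERING", "false")
    enabled = raw_enabled.strip().lower() in ("true", "1", "yes")
    return {
        "enabled": enabled,
        "max_size_gb": float(env.get("DATETIME_INDEX_MAX_SIZE_GB", "25")),
        "items_index_prefix": env.get("STAC_ITEMS_INDEX_PREFIX", "items_"),
    }
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to exercise in tests without touching the real environment.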

## How Datetime-Based Indexing Works

### Index and Alias Naming Convention

The system uses the following naming convention:

**Physical indexes:**
```
{ITEMS_INDEX_PREFIX}{collection-id}_{uuid4}
```

**Aliases:**
```
{ITEMS_INDEX_PREFIX}{collection-id} # Main collection alias
{ITEMS_INDEX_PREFIX}{collection-id}_{start-datetime} # Temporal alias
{ITEMS_INDEX_PREFIX}{collection-id}_{start-datetime}_{end-datetime} # Closed index alias
```

**Example:**

*Physical indexes:*
- `items_sentinel-2-l2a_a1b2c3d4-e5f6-7890-abcd-ef1234567890`

*Aliases:*
- `items_sentinel-2-l2a` - main collection alias
- `items_sentinel-2-l2a_2024-01-01` - active alias from January 1, 2024
- `items_sentinel-2-l2a_2024-01-01_2024-03-15` - closed index alias (reached size limit)
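
The naming convention above can be sketched in Python as follows. This is a minimal illustration rather than the SFEOS implementation: the function names are hypothetical, the aliases are assumed to carry plain `YYYY-MM-DD` dates as in the example, and the overlap rule used to pick aliases for a query is an assumption about how datetime-based selection could work.

```python
import uuid
from datetime import date
from typing import List, Optional, Tuple


def physical_index_name(prefix: str, collection_id: str) -> str:
    """Build a physical index name: {prefix}{collection-id}_{uuid4}."""
    return f"{prefix}{collection_id}_{uuid.uuid4()}"


def alias_range(
    alias: str, prefix: str, collection_id: str
) -> Optional[Tuple[date, Optional[date]]]:
    """Return (start, end) for a temporal alias, or None for any other name.

    An end of None marks the active alias, which has no end date yet.
    """
    base = f"{prefix}{collection_id}_"
    if not alias.startswith(base):
        return None  # main collection alias or an unrelated name
    parts = alias[len(base):].split("_")
    try:
        start = date.fromisoformat(parts[0])
        end = date.fromisoformat(parts[1]) if len(parts) > 1 else None
    except ValueError:
        return None  # e.g. a physical index name ending in a UUID
    return start, end


def select_aliases(
    aliases: List[str],
    prefix: str,
    collection_id: str,
    query_start: date,
    query_end: date,
) -> List[str]:
    """Keep only temporal aliases whose range overlaps the query interval."""
    selected = []
    for alias in aliases:
        rng = alias_range(alias, prefix, collection_id)
        if rng is None:
            continue
        start, end = rng
        # Assumed rule: two ranges overlap when each starts before the other ends.
        if start <= query_end and (end is None or end >= query_start):
            selected.append(alias)
    return selected
```

With the aliases `items_sentinel-2-l2a_2024-01-01_2024-03-15` (closed) and `items_sentinel-2-l2a_2024-03-15` (active), a query for February 2024 would select only the closed alias, so the active index is never touched.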

### Index Size Management

**Important - Data Compression:** Elasticsearch and OpenSearch compress stored data automatically, and the configured `DATETIME_INDEX_MAX_SIZE_GB` limit refers to the compressed size on disk. To account for compression overhead and metadata, set the limit about 20% above your target size (for example, `30` for a 25 GB target).
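
The sizing guidance can be expressed as a small calculation. The helpers below are hypothetical and only illustrate the arithmetic; the strict "greater than" comparison is an assumption based on the statement that a new index is created when an index exceeds the limit.

```python
def configured_limit_gb(target_gb: float, margin: float = 0.20) -> float:
    """Value to set for DATETIME_INDEX_MAX_SIZE_GB given a target size.

    Adds the recommended 20% margin on top of the target.
    """
    return target_gb * (1 + margin)


def needs_new_index(current_size_gb: float, limit_gb: float) -> bool:
    """Assumed rule: split once the index size exceeds the configured limit."""
    return current_size_gb > limit_gb
```

For a 50 GB target, `configured_limit_gb(50)` gives a limit of about 60 GB.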

## Interacting with the API

- **Creating a Collection**:
@@ -538,4 +613,3 @@ You can customize additional settings in your `.env` file:
- Ensures fair resource allocation among all clients

- **Examples**: Implementation examples are available in the [examples/rate_limit](examples/rate_limit) directory.

3 changes: 3 additions & 0 deletions compose.yml
@@ -21,6 +21,7 @@ services:
- ES_USE_SSL=false
- ES_VERIFY_CERTS=false
- BACKEND=elasticsearch
- DATABASE_REFRESH=true
ports:
- "8080:8080"
volumes:
@@ -72,6 +73,7 @@ services:
hostname: elasticsearch
environment:
ES_JAVA_OPTS: -Xms512m -Xmx1g
action.destructive_requires_name: false
volumes:
- ./elasticsearch/config/elasticsearch.yml:/usr/share/elasticsearch/config/elasticsearch.yml
- ./elasticsearch/snapshots:/usr/share/elasticsearch/snapshots
@@ -86,6 +88,7 @@ services:
- discovery.type=single-node
- plugins.security.disabled=true
- OPENSEARCH_JAVA_OPTS=-Xms512m -Xmx512m
- action.destructive_requires_name=false
volumes:
- ./opensearch/config/opensearch.yml:/usr/share/opensearch/config/opensearch.yml
- ./opensearch/snapshots:/usr/share/opensearch/snapshots
23 changes: 19 additions & 4 deletions stac_fastapi/core/stac_fastapi/core/core.py
@@ -37,6 +37,7 @@
BulkTransactionMethod,
Items,
)
from stac_fastapi.sfeos_helpers.database import return_date
from stac_fastapi.types import stac as stac_types
from stac_fastapi.types.conformance import BASE_CONFORMANCE_CLASSES
from stac_fastapi.types.core import AsyncBaseCoreClient
@@ -324,10 +325,16 @@ async def item_collection(
search=search, collection_ids=[collection_id]
)

-        if datetime:
+        try:
+            datetime_search = return_date(datetime)
             search = self.database.apply_datetime_filter(
-                search=search, interval=datetime
+                search=search, datetime_search=datetime_search
             )
+        except (ValueError, TypeError) as e:
+            # Handle invalid interval formats if return_date fails
+            msg = f"Invalid interval format: {datetime}, error: {e}"
+            logger.error(msg)
+            raise HTTPException(status_code=400, detail=msg)

if bbox:
bbox = [float(x) for x in bbox]
@@ -342,6 +349,7 @@ async def item_collection(
sort=None,
token=token,
collection_ids=[collection_id],
datetime_search=datetime_search,
Collaborator:

Is this needed? We apply the datetime_search to the search variable on line 331. If this is optional, could we omit it?

Contributor Author:

This argument is needed in this function so that the index containing the item can be found.

)

items = [
@@ -500,10 +508,16 @@ async def post_search(
search=search, collection_ids=search_request.collections
)

-        if search_request.datetime:
+        try:
+            datetime_search = return_date(search_request.datetime)
             search = self.database.apply_datetime_filter(
-                search=search, interval=search_request.datetime
+                search=search, datetime_search=datetime_search
             )
+        except (ValueError, TypeError) as e:
+            # Handle invalid interval formats if return_date fails
+            msg = f"Invalid interval format: {search_request.datetime}, error: {e}"
+            logger.error(msg)
+            raise HTTPException(status_code=400, detail=msg)

if search_request.bbox:
bbox = search_request.bbox
@@ -560,6 +574,7 @@ async def post_search(
token=search_request.token,
sort=sort,
collection_ids=search_request.collections,
datetime_search=datetime_search,
Collaborator:

Same here -- Is this needed? We apply the datetime_search to the search variable on line 513. If this is optional, could we omit it?

Contributor Author:

As above.

)

fields = (
1 change: 1 addition & 0 deletions stac_fastapi/core/stac_fastapi/core/datetime_utils.py
@@ -1,4 +1,5 @@
"""Utility functions to handle datetime parsing."""

from datetime import datetime, timezone

from stac_fastapi.types.rfc3339 import rfc3339_str_to_datetime
1 change: 1 addition & 0 deletions stac_fastapi/core/stac_fastapi/core/serializers.py
@@ -1,4 +1,5 @@
"""Serializers."""

import abc
from copy import deepcopy
from typing import Any, List, Optional
1 change: 1 addition & 0 deletions stac_fastapi/core/stac_fastapi/core/session.py
@@ -1,4 +1,5 @@
"""database session management."""

import logging

import attr