Commit 0b2300d

varunbharadwaj, kolchfa-aws, and natebower authored
[Pull-based Ingestion] Add experimental pull-based ingestion page (#9659)
* Add experimental pull-based ingestion page (Signed-off-by: Varun Bharadwaj)
* Update the page to address comments (Signed-off-by: Varun Bharadwaj)
* Doc review (Signed-off-by: Fanit Kolchina)
* Add additional info and address comments (Signed-off-by: Varun Bharadwaj)
* Minor rewording and template update (Signed-off-by: Fanit Kolchina)
* Typo fix (Signed-off-by: Fanit Kolchina)
* Add info about the param parameter (Signed-off-by: Fanit Kolchina)
* Update _api-reference/document-apis/pull-based-ingestion.md (Signed-off-by: kolchfa-aws)
* Apply suggestions from code review (Co-authored-by: Nathan Bower; Signed-off-by: kolchfa-aws)

Signed-off-by: Varun Bharadwaj, Fanit Kolchina, and kolchfa-aws
Co-authored-by: Fanit Kolchina, kolchfa-aws, and Nathan Bower
1 parent 8ecb2be commit 0b2300d

File tree

6 files changed: +348 −21 lines changed

.github/vale/styles/Vocab/OpenSearch/Products/accept.txt

Lines changed: 3 additions & 0 deletions

@@ -5,10 +5,12 @@ Amazon
 Amazon OpenSearch Serverless
 Amazon OpenSearch Service
 Amazon Bedrock
+Amazon Kinesis
 Amazon SageMaker
 AWS Secrets Manager
 Ansible
 Anthropic Claude
+Apache Kafka
 Auditbeat
 AWS Cloud
 Cohere Command
@@ -50,6 +52,7 @@ JSON Web Token
 Keycloak
 Kerberos
 Kibana
+Kinesis
 Kubernetes
 Lambda
 Langflow
Lines changed: 168 additions & 0 deletions

---
layout: default
title: Pull-based ingestion management
parent: Pull-based ingestion
grand_parent: Document APIs
has_children: true
nav_order: 10
---

# Pull-based ingestion management
**Introduced 3.0**
{: .label .label-purple }

This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, join the discussion on the [OpenSearch forum](https://forum.opensearch.org/).
{: .warning}

OpenSearch provides the following APIs to manage pull-based ingestion.

## Pause ingestion

Pauses ingestion for one or more indexes. When paused, OpenSearch stops consuming data from the streaming source for all shards in the specified indexes.

### Endpoint

```json
POST /<index>/ingestion/_pause
```

### Path parameters

The following table lists the available path parameters.

| Parameter | Data type | Required/Optional | Description |
| :--- | :--- | :--- | :--- |
| `index` | String | Required | The index to pause. Can be a comma-separated list of multiple index names. |

### Query parameters

The following table lists the available query parameters. All query parameters are optional.

| Parameter | Data type | Description |
| :--- | :--- | :--- |
| `cluster_manager_timeout` | Time units | The amount of time to wait for a connection to the cluster manager node. Default is `30s`. |
| `timeout` | Time units | The amount of time to wait for a response from the cluster. Default is `30s`. |

### Example request

```json
POST /my-index/ingestion/_pause
```
{% include copy-curl.html %}
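The pause, resume, and get-state endpoints differ only in the final path segment, and multiple indexes are passed as a comma-separated list in the path. The following hypothetical Python helper (an illustration, not part of OpenSearch or any official client) builds these request paths:

```python
def ingestion_endpoint(indexes, action):
    """Build the pull-based ingestion management path for one or more indexes.

    `indexes` is a list of index names; `action` is "pause", "resume", or "state".
    """
    if action not in ("pause", "resume", "state"):
        raise ValueError(f"unsupported action: {action}")
    # Multiple indexes are joined into a comma-separated list in the path.
    return f"/{','.join(indexes)}/ingestion/_{action}"

print(ingestion_endpoint(["my-index"], "pause"))           # /my-index/ingestion/_pause
print(ingestion_endpoint(["logs-a", "logs-b"], "resume"))  # /logs-a,logs-b/ingestion/_resume
```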
## Resume ingestion

Resumes ingestion for one or more indexes. When resumed, OpenSearch continues consuming data from the streaming source for all shards in the specified indexes.

### Endpoint

```json
POST /<index>/ingestion/_resume
```

### Path parameters

The following table lists the available path parameters.

| Parameter | Data type | Required/Optional | Description |
| :--- | :--- | :--- | :--- |
| `index` | String | Required | The index to resume ingestion for. Can be a comma-separated list of multiple index names. |

### Query parameters

The following table lists the available query parameters. All query parameters are optional.

| Parameter | Data type | Description |
| :--- | :--- | :--- |
| `cluster_manager_timeout` | Time units | The amount of time to wait for a connection to the cluster manager node. Default is `30s`. |
| `timeout` | Time units | The amount of time to wait for a response from the cluster. Default is `30s`. |

### Example request

```json
POST /my-index/ingestion/_resume
```
{% include copy-curl.html %}
## Get ingestion state

Returns the current ingestion state for one or more indexes. This API supports pagination.

### Endpoint

```json
GET /<index>/ingestion/_state
```

### Path parameters

The following table lists the available path parameters.

| Parameter | Data type | Required/Optional | Description |
| :--- | :--- | :--- | :--- |
| `index` | String | Required | The index for which to return the ingestion state. Can be a comma-separated list of multiple index names. |

### Query parameters

The following table lists the available query parameters. All query parameters are optional.

| Parameter | Data type | Description |
| :--- | :--- | :--- |
| `timeout` | Time units | The amount of time to wait for a response from the cluster. Default is `30s`. |
| `size` | Integer | The number of shard-level results to return per page. |
| `next_token` | String | The token used to retrieve the next page of results. |

### Example request

The following is a request with the default settings:

```json
GET /my-index/ingestion/_state
```
{% include copy-curl.html %}

The following example shows a request with a page size of 20:

```json
GET /my-index/ingestion/_state?size=20
```
{% include copy-curl.html %}

The following example shows a request with a next page token:

```json
GET /my-index/ingestion/_state?size=20&next_token=<next_page_token>
```
{% include copy-curl.html %}
### Example response

```json
{
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0,
    "failures": [
      {
        "shard": 0,
        "index": "my-index",
        "status": "INTERNAL_SERVER_ERROR",
        "reason": {
          "type": "timeout_exception",
          "reason": "error message"
        }
      }
    ]
  },
  "next_page_token": "page token if not on last page",
  "ingestion_state": {
    "indexName": [
      {
        "shard": 0,
        "poller_state": "POLLING",
        "error_policy": "DROP",
        "poller_paused": false
      }
    ]
  }
}
```
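Pagination with `next_page_token` can be driven by a simple loop. The following sketch (an illustration, not an official client) collects shard-level state page by page; `fetch_page` stands in for whatever HTTP call your client makes to `GET /<index>/ingestion/_state`:

```python
def collect_ingestion_state(fetch_page, page_size=20):
    """Gather shard-level ingestion state across all pages.

    `fetch_page(size, next_token)` must return the parsed JSON response of
    GET /<index>/ingestion/_state; pagination stops when no token is returned.
    """
    shards, token = [], None
    while True:
        page = fetch_page(size=page_size, next_token=token)
        # Each key under "ingestion_state" is an index name mapped to a
        # list of per-shard entries.
        for shard_list in page.get("ingestion_state", {}).values():
            shards.extend(shard_list)
        token = page.get("next_page_token")
        if not token:
            return shards

# Example with a stubbed two-page response:
pages = [
    {"ingestion_state": {"my-index": [{"shard": 0, "poller_state": "POLLING"}]},
     "next_page_token": "p2"},
    {"ingestion_state": {"my-index": [{"shard": 1, "poller_state": "POLLING"}]}},
]
fake_fetch = lambda size, next_token: pages[0] if next_token is None else pages[1]
print(len(collect_ingestion_state(fake_fetch)))  # 2
```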
Lines changed: 154 additions & 0 deletions

---
layout: default
title: Pull-based ingestion
parent: Document APIs
has_children: true
nav_order: 60
---

# Pull-based ingestion
**Introduced 3.0**
{: .label .label-purple }

This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, join the discussion on the [OpenSearch forum](https://forum.opensearch.org/).
{: .warning}

Pull-based ingestion enables OpenSearch to ingest data from streaming sources such as Apache Kafka or Amazon Kinesis. Unlike traditional ingestion methods, in which clients actively push data to OpenSearch through REST APIs, pull-based ingestion allows OpenSearch to control the data flow by retrieving data directly from streaming sources. This approach provides exactly-once ingestion semantics and native backpressure handling, helping prevent server overload during traffic spikes.

## Prerequisites

Before using pull-based ingestion, ensure that the following prerequisites are met:

* Install an ingestion plugin for your streaming source using the command `bin/opensearch-plugin install <plugin-name>`. For more information, see [Additional plugins]({{site.url}}{{site.baseurl}}/install-and-configure/additional-plugins/index/). OpenSearch supports the following ingestion plugins:
  - `ingestion-kafka`
  - `ingestion-kinesis`
* Enable [segment replication]({{site.url}}{{site.baseurl}}/tuning-your-cluster/availability-and-recovery/segment-replication/index/) with [remote-backed storage]({{site.url}}{{site.baseurl}}/tuning-your-cluster/availability-and-recovery/remote-store/index/). Pull-based ingestion is not compatible with document replication.
* Configure pull-based ingestion during [index creation](#creating-an-index-for-pull-based-ingestion). You cannot convert an existing push-based index to a pull-based one.
## Creating an index for pull-based ingestion

To ingest data from a streaming source, first create an index with pull-based ingestion settings. The following request creates an index that pulls data from a Kafka topic:

```json
PUT /my-index
{
  "settings": {
    "ingestion_source": {
      "type": "kafka",
      "pointer.init.reset": "earliest",
      "param": {
        "topic": "test",
        "bootstrap_servers": "localhost:49353"
      }
    },
    "index.number_of_shards": 1,
    "index.number_of_replicas": 1,
    "index": {
      "replication.type": "SEGMENT"
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text"
      },
      "age": {
        "type": "integer"
      }
    }
  }
}
```
{% include copy-curl.html %}
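The same settings body can be assembled programmatically before sending it to the create index API. The following Python sketch is illustrative only (the helper name and defaults are not part of any OpenSearch client); it builds the `settings` block shown above:

```python
import json

def ingestion_index_settings(source_type, params, shards=1, replicas=1,
                             pointer_init_reset="earliest"):
    """Build the settings body for a pull-based ingestion index.

    `params` holds the source-specific `param` block, for example `topic` and
    `bootstrap_servers` for Kafka, or stream/region/credentials for Kinesis.
    """
    return {
        "settings": {
            "ingestion_source": {
                "type": source_type,
                "pointer.init.reset": pointer_init_reset,
                "param": params,
            },
            "index.number_of_shards": shards,
            "index.number_of_replicas": replicas,
            # Pull-based ingestion requires segment replication.
            "index": {"replication.type": "SEGMENT"},
        }
    }

body = ingestion_index_settings(
    "kafka", {"topic": "test", "bootstrap_servers": "localhost:49353"})
print(json.dumps(body, indent=2))
```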
### Ingestion source parameters

The `ingestion_source` parameters control how OpenSearch pulls data from the streaming source. A _poll_ is an operation in which OpenSearch actively requests a batch of data from the streaming source. The following table lists all parameters that `ingestion_source` supports.

| Parameter | Description |
| :--- | :--- |
| `type` | The streaming source type. Required. Valid values are `kafka` or `kinesis`. |
| `pointer.init.reset` | Determines where to start reading from the stream. Optional. Valid values are `earliest`, `latest`, `rewind_by_offset`, `rewind_by_timestamp`, or `none`. See [Stream position](#stream-position). |
| `pointer.init.reset.value` | Required only for `rewind_by_offset` or `rewind_by_timestamp`. Specifies the offset value or timestamp in milliseconds. See [Stream position](#stream-position). |
| `error_strategy` | How to handle failed messages. Optional. Valid values are `DROP` (failed messages are skipped and ingestion continues) and `BLOCK` (when a message fails, ingestion stops). Default is `DROP`. We recommend using `DROP` for the current experimental release. |
| `max_batch_size` | The maximum number of records to retrieve in each poll operation. Optional. |
| `poll.timeout` | The maximum time to wait for data in each poll operation. Optional. |
| `num_processor_threads` | The number of threads for processing ingested data. Optional. Default is 1. |
| `param` | Source-specific configuration parameters. Required. <br>&ensp;&#x2022; The `ingestion-kafka` plugin requires:<br>&ensp;&ensp;- `topic`: The Kafka topic to consume from<br>&ensp;&ensp;- `bootstrap_servers`: The Kafka server addresses<br>&ensp;&ensp;Optionally, you can provide additional standard Kafka consumer parameters (such as `fetch.min.bytes`). These parameters are passed directly to the Kafka consumer. <br>&ensp;&#x2022; The `ingestion-kinesis` plugin requires:<br>&ensp;&ensp;- `stream`: The Kinesis stream name<br>&ensp;&ensp;- `region`: The AWS Region<br>&ensp;&ensp;- `access_key`: The AWS access key<br>&ensp;&ensp;- `secret_key`: The AWS secret key<br>&ensp;&ensp;Optionally, you can provide an `endpoint_override`. |
### Stream position

When creating an index, you can specify where OpenSearch should start reading from the stream by configuring the `pointer.init.reset` and `pointer.init.reset.value` settings in the `ingestion_source` parameter. OpenSearch will resume reading from the last committed position for existing indexes.

The following table provides the valid `pointer.init.reset` values and their corresponding `pointer.init.reset.value` values.

| `pointer.init.reset` | Starting ingestion point | `pointer.init.reset.value` |
| :--- | :--- | :--- |
| `earliest` | Beginning of stream | None |
| `latest` | Current end of stream | None |
| `rewind_by_offset` | Specific offset in the stream | A positive integer offset. Required. |
| `rewind_by_timestamp` | Specific point in time | A Unix timestamp in milliseconds. Required. <br> For Kafka streams, defaults to Kafka's `auto.offset.reset` policy if no messages are found for the given timestamp. |
| `none` | Last committed position for existing indexes | None |
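For example, to replay a Kafka topic from a specific point in time, the `ingestion_source` block might look as follows. This fragment is illustrative: the timestamp is a placeholder value, and the topic and server address are taken from the earlier example:

```json
"ingestion_source": {
  "type": "kafka",
  "pointer.init.reset": "rewind_by_timestamp",
  "pointer.init.reset.value": "1739459500000",
  "param": {
    "topic": "test",
    "bootstrap_servers": "localhost:49353"
  }
}
```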
### Stream partitioning

When using partitioned streams (such as Kafka topics or Kinesis shards), note the following relationships between stream partitions and OpenSearch shards:

- OpenSearch shards map one-to-one to stream partitions.
- The number of index shards must be greater than or equal to the number of stream partitions.
- Extra shards beyond the number of partitions remain empty.
- Documents must be sent to the same partition for successful updates.

When using pull-based ingestion, traditional REST API--based ingestion is disabled for the index.
{: .note}
### Updating the error policy

You can use the [Update Settings API]({{site.url}}{{site.baseurl}}/api-reference/index-apis/update-settings/) to dynamically update the error policy by setting `index.ingestion_source.error_strategy` to either `DROP` or `BLOCK`.

The following example demonstrates how to update the error policy:

```json
PUT /my-index/_settings
{
  "index.ingestion_source.error_strategy": "DROP"
}
```
{% include copy-curl.html %}
## Message format

To be correctly processed by OpenSearch, messages in the streaming source must have the following format:

```json
{"_id":"1", "_version":"1", "_source":{"name": "alice", "age": 30}, "_op_type": "index"}
{"_id":"2", "_version":"2", "_source":{"name": "alice", "age": 30}, "_op_type": "delete"}
```

Each data unit in the streaming source (Kafka message or Kinesis record) uses the following fields to specify how to create or modify an OpenSearch document.

| Field | Data type | Required | Description |
| :--- | :--- | :--- | :--- |
| `_id` | String | No | A unique identifier for a document. If not provided, OpenSearch auto-generates an ID. Required for document updates or deletions. |
| `_version` | Long | No | A document version number, which must be maintained externally. If provided, OpenSearch drops messages with versions earlier than the current document version. If not provided, no version checking occurs. |
| `_op_type` | String | No | The operation to perform. Valid values are:<br>- `index`: Creates a new document or updates an existing one<br>- `delete`: Soft deletes a document |
| `_source` | Object | Yes | The message payload containing the document data. |
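When producing records to the stream, each record's value is one JSON object in this shape. The following sketch serializes an index and a delete message; it is illustrative only (the helper is hypothetical and is not part of any OpenSearch or Kafka/Kinesis producer library):

```python
import json

def ingestion_message(source=None, doc_id=None, version=None, op_type="index"):
    """Serialize one streaming-source message in the format OpenSearch expects.

    `_id` is required for updates and deletes; `_version` enables external
    version checking; `_source` carries the document payload.
    """
    msg = {}
    if doc_id is not None:
        msg["_id"] = doc_id
    if version is not None:
        # Versions appear as strings in the documented message format.
        msg["_version"] = str(version)
    if source is not None:
        msg["_source"] = source
    msg["_op_type"] = op_type
    return json.dumps(msg)

print(ingestion_message({"name": "alice", "age": 30}, doc_id="1", version=1))
print(ingestion_message(doc_id="2", version=2, op_type="delete"))
```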
## Pull-based ingestion metrics

Pull-based ingestion provides metrics that can be used to monitor the ingestion process. The `polling_ingest_stats` metric is currently supported and is available at the shard level.

The following table lists the available `polling_ingest_stats` metrics.

| Metric | Description |
| :--- | :--- |
| `message_processor_stats.total_processed_count` | The total number of messages processed by the message processor. |
| `consumer_stats.total_polled_count` | The total number of messages polled from the stream consumer. |

To retrieve shard-level pull-based ingestion metrics, use the [Nodes Stats API]({{site.url}}{{site.baseurl}}/api-reference/nodes-apis/nodes-stats/):

```json
GET /_nodes/stats/indices?level=shards&pretty
```
{% include copy-curl.html %}
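The per-shard stats can then be aggregated client side. The following sketch sums processed counts across shard entries; the helper and the abbreviated response excerpt are hypothetical and only mirror the metric names documented above:

```python
def total_processed(shard_stats_list):
    """Sum message_processor_stats.total_processed_count across shard entries."""
    return sum(
        s.get("polling_ingest_stats", {})
         .get("message_processor_stats", {})
         .get("total_processed_count", 0)
        for s in shard_stats_list
    )

# Abbreviated, hypothetical shard entries from a nodes stats response:
shards = [
    {"polling_ingest_stats": {"message_processor_stats": {"total_processed_count": 120},
                              "consumer_stats": {"total_polled_count": 130}}},
    {"polling_ingest_stats": {"message_processor_stats": {"total_processed_count": 80},
                              "consumer_stats": {"total_polled_count": 90}}},
]
print(total_processed(shards))  # 200
```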

_api-reference/document-apis/reindex.md

Lines changed: 1 addition & 1 deletion

@@ -2,7 +2,7 @@
 layout: default
 title: Reindex document
 parent: Document APIs
-nav_order: 60
+nav_order: 17
 redirect_from:
   - /opensearch/reindex-data/
   - /opensearch/rest-api/document-apis/reindex/
