Add documentation for failure stores. #1368
Conversation
TBD on recipes. Most links are not complete and need updating from "???".
Co-authored-by: Lee Hinman <[email protected]>
@jbaiera this is such a superb doc! I've left a bunch of super small suggestions but overall it looks great to me! 🚀
Co-authored-by: David Kilfoyle <[email protected]>
Adding @slobodanadamovic as reviewer for the new roles in the reference docs. Is there anywhere else that we should expand with more failure store info for security beyond that reference?
:::{warning}
Documents redirected to the failure store in the event of a failed ingest pipeline will be stored in their original, unprocessed form. If an ingest pipeline normally redacts sensitive information from a document, then failed documents in their original, unprocessed form may contain sensitive information.

Furthermore, failed documents are likely to be structured differently than normal data in a data stream, and thus are not supported by [document level security](../../../deploy-manage/users-roles/cluster-or-deployment-auth/controlling-access-at-document-field-level.md#document-level-security) or [field level security](../../../deploy-manage/users-roles/cluster-or-deployment-auth/controlling-access-at-document-field-level.md#field-level-security).
This is actually not true. We decided to support FLS and DLS against the failure store. The main reason is that it would not be possible to prevent using FLS/DLS when users define implicit read access to the backing `.fs*` indices. It also felt wrong to prevent using FLS/DLS when users define explicit `read_failure_store` access to data streams with FLS/DLS restrictions. Our biggest concern was that users would expect the DLS/FLS to stop certain docs/fields from being visible, and it wouldn't.

Right now, if users include FLS/DLS when granting access to the failure store, we'll try to honour it. We should just make sure to highlight (which you already did) that these documents are structured differently, and because they may contain sensitive information, users should take extra care when defining access to them.
@@ -381,6 +384,8 @@ To learn how to assign privileges to a role, refer to [](/deploy-manage/users-ro

This privilege is not available in {{serverless-full}}.

`read_failure_store`
: Read-only access to actions performed on a data stream's failure store. Required for access to failure store data (count, explain, get, mget, get indexed scripts, more like this, multi percolate/search/termvector, percolate, scroll, clear_scroll, search, suggest, tv).
The `manage_failure_store` and `read_failure_store` privileges are special in that they only grant access to the failure store when it is accessed through data stream names using the `::failures` selector. We should try to highlight somehow that these privileges cannot be used to grant direct read/manage access to failure store backing indices (`.fs*`) or any other regular indices. Hence, they should only be granted on data streams that have the failure store enabled.
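To make that concrete, here is a minimal sketch of such a grant (the role name and data stream name are hypothetical); the privilege then only takes effect when the failure store is addressed through the `::failures` selector:

```console
PUT _security/role/failure_store_reader
{
  "indices": [
    {
      "names": [ "my-datastream" ],
      "privileges": [ "read_failure_store" ]
    }
  ]
}
```

A user with this role could search `my-datastream::failures`, but would get no direct read access to the backing `.fs*` indices.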
No, that's the only place we should document them. I left some comments. I think we should try to point out that the new privileges only grant access to the failure store when it is accessed using the `::failures` selector.
LGTM, I left some comments, but nothing major. This is a monumental amount of work and I think it's great, thanks Jimmy.
},
"_seq_no": 2,
"_primary_term": 1,
"failure_store": "used" // The document was sent to the failure store.
In this document, can you use the `<1>` annotation to put the captions at the bottom? I think it's easier, and we have (I think) special CSS syntax so that it doesn't show up in copy-and-pasting.
"@timestamp": "2025-05-09T06:24:48.381Z", | ||
"document": { | ||
"index": "my-datastream-ingest", | ||
"source": { // When an ingest pipeline fails, the document stored is what was originally sent to the cluster. |
Same here (and below) about using `<1>`, `<2>`, etc.
**Separate your failures beforehand.** As described in the previous [failure document source](./failure-store.md#use-failure-store-document-source) section, failure documents are structured differently depending on when the document failed during ingestion. We recommend separating documents by ingest pipeline failures and indexing failures at a minimum. Ingest pipeline failures often need to have the original pipeline re-run, while index failures should skip any pipelines. Further separating failures by index or specific failure type may also be beneficial.

**Perform a failure store rollover.** Consider rolling over the failure store before attempting to remediate failures. This creates a new failure index that will collect any new failures during the remediation process.
Can you link to the section you wrote about rolling over the failure store here?
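For reference, a failure-store-only rollover can be sketched with the `::failures` selector (the data stream name is hypothetical):

```console
POST my-datastream::failures/_rollover
```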
**Use an ingest pipeline to convert failure documents back into their original form.** Failure documents store failure information along with the document that failed ingestion. The first step in remediating documents should be to use an ingest pipeline to extract the original source from the failure document and discard any other information about the failure.
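A hedged sketch of such a pipeline, assuming the failure document layout shown earlier on this page (`document.source` holds the original body; the pipeline name and the exact set of fields to drop are illustrative):

```console
PUT _ingest/pipeline/extract-original-document
{
  "processors": [
    {
      "script": {
        "description": "Replace the failure document body with the original source",
        "source": "def original = ctx.remove('document'); ctx.remove('error'); ctx.remove('@timestamp'); ctx.putAll(original.source);"
      }
    }
  ]
}
```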
**Simulate first to avoid repeat failures.** If you must run a pipeline as part of your remediation process, it is best to simulate the pipeline against the failures first. This catches any unforeseen issues that might fail the document a second time. Remember, ingest pipeline failures capture the document before an ingest pipeline is applied to it, which can further complicate remediation when a failure document becomes nested inside a new failure.
Same here about linking to the simulate ingest API.
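For example, a remediation pipeline can be dry-run against a sample failure document with the simulate pipeline API (the pipeline name and document fields are illustrative, following the failure document excerpt above):

```console
POST _ingest/pipeline/my-remediation-pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "@timestamp": "2025-05-09T06:24:48.381Z",
        "document": {
          "index": "my-datastream-ingest",
          "source": { "message": "example failed document" }
        },
        "error": { "type": "illegal_argument_exception" }
      }
    }
  ]
}
```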
1. The failures have been remediated.

:::{tip}
Since the failure store is enabled on this data stream, it would be wise to check it for any further failures from the reindexing process. Failures that happen at this point in the process may end up as nested failures in the failure store. Remediating nested failures can quickly become a hassle, as the original document gets nested multiple levels deep in the failure document. For this reason, we suggest remediating data during a quiet period, when no other failures are likely to arise. Furthermore, rolling over the failure store before executing the remediation makes it easier to discard any new nested failures and operate only on the original failure documents.
I think it would also be good to mention that reindexing the failed documents into the data stream does not remove the failures, so that needs to be done after the reindexing.
---
applies_to:
  stack: ga 8.19.0
  serverless: ga 9.1.0
Does Serverless have versions for the docs?
A failure store is a secondary set of indices inside a data stream, dedicated to storing failed documents. A failed document is any document that, without the failure store enabled, would cause an ingest pipeline exception or that has a structure that conflicts with a data stream's mappings. In the absence of the failure store, a failed document would cause the indexing operation to fail, with an error message returned in the operation response.

When a data stream's failure store is enabled, these failures are instead captured in a separate index and persisted to be analysed later. Clients receive a successful response with a flag indicating the failure was redirected. Failure stores do not capture failures caused by backpressure or document version conflicts. These failures are always returned as-is since they warrant specific action by the client.
I think it would be worth separating out the "Failure stores do not capture failures caused by backpressure or document version conflicts. These failures are always returned as-is since they warrant specific action by the client." section into a "note" or "important" block, so it doesn't get missed.
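As a sketch of that redirect flag in practice (the data stream name and document are hypothetical; `op_type=create` is required when writing to a data stream):

```console
POST my-datastream/_doc?op_type=create
{
  "@timestamp": "2025-05-09T06:24:48.381Z",
  "message": "hello world"
}
```

If the document fails ingestion, the response still reports success and includes `"failure_store": "used"`, as shown in the response excerpt earlier in this conversation.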
You can specify in a data stream's [index template](../templates.md) if it should enable the failure store when it is first created.

:::{note}
Unlike the `settings` and `mappings` fields on an [index template](../templates.md), which are repeatedly applied to new data stream write indices on rollover, the `data_stream_options` section of a template is applied to a data stream only once, when the data stream is first created. To configure existing data streams, use the [put data stream options API](indices-put-data-stream-options).
Will the `indices-put-data-stream-options` link work? Or do we need to fully-qualify it with the URL?
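For illustration, a sketch of an index template that enables the failure store at data stream creation (the template name and index pattern are hypothetical; the placement of `data_stream_options` follows the note under review):

```console
PUT _index_template/my-datastream-template
{
  "index_patterns": [ "my-datastream*" ],
  "data_stream": {},
  "template": {
    "data_stream_options": {
      "failure_store": {
        "enabled": true
      }
    }
  }
}
```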
`` ``` ``
`` ```console ``
I think these two lines need to be removed to make this one continuous console block, since we use `<2>` below.
### Searching failures [use-failure-store-searching]

Once you have accumulated some failures, they can be searched much like a regular index.
```diff
-Once you have accumulated some failures, they can be searched much like a regular index.
+Once you have accumulated some failures, the failure store can be searched much like a regular data stream.
```
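For instance, a sketch of such a search using the `::failures` selector (the data stream name is hypothetical):

```console
POST my-datastream::failures/_search
{
  "query": {
    "match_all": {}
  }
}
```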
@dakrone Any chance we might get this before FF for 8.19? We've done some work in the Kibana UI and it would be great if we could add a link to the docs.
hey @jbaiera! just adding a couple of comments to address the metadata of these pages / improve wayfinding between the pages. also approving to make sure that you're unblocked. let me know if you need a hand with anything.
---
applies_to:
  stack: ga 8.19.0
  serverless: ga 9.1.0
---
to add to @dakrone's comments, we need to tweak this a little:
- 8.19 docs for this feature should be written in the old docs system, so we should just mark this as 9.1+ in this system
- when we have the 8.19 docs, we can use a `mapped_pages` tag to create a link from the old system to the new system
- serverless is unversioned, so no need to specify a version #
- we added a products tag since this PR was opened, which will be used to filter search results
```diff
 ---
-applies_to:
-  stack: ga 8.19.0
-  serverless: ga 9.1.0
+mapped_pages:
+  - (8.19 docs)
+applies_to:
+  stack: ga 9.1
+  serverless: ga
+products:
+  - id: elasticsearch
+  - id: elastic-stack
+  - id: cloud-serverless
 ---
```
@@ -0,0 +1,1154 @@

# Failure store recipes and use cases [failure-store-recipes]
This page also needs front matter. The front matter helps people who land on the page from Google to understand if the content applies to them. This page could also have a `mapped_pages` tag if you're creating an equivalent page in the 8.19 docs; that can always be added in a subsequent PR as well.
```diff
+---
+mapped_pages:
+  - (8.19 docs)
+applies_to:
+  stack: ga 9.1
+  serverless: ga
+products:
+  - id: elasticsearch
+  - id: elastic-stack
+  - id: cloud-serverless
+---
 # Failure store recipes and use cases [failure-store-recipes]
```
My two cents: I don't love the word "recipes" here; it's not great for an international audience. Perhaps "Using failure stores to address ingestion issues" or something.
A failure store is a secondary set of indices inside a data stream, dedicated to storing failed documents. A failed document is any document that, without the failure store enabled, would cause an ingest pipeline exception or that has a structure that conflicts with a data stream's mappings. In the absence of the failure store, a failed document would cause the indexing operation to fail, with an error message returned in the operation response.

When a data stream's failure store is enabled, these failures are instead captured in a separate index and persisted to be analysed later. Clients receive a successful response with a flag indicating the failure was redirected. Failure stores do not capture failures caused by backpressure or document version conflicts. These failures are always returned as-is since they warrant specific action by the client.
try to link from this overview page to your recipes page - otherwise, people might not know to look for it. we should try not to rely on the sidebar.
I suggest introducing it in the overview here - we can explain the scope of this page, and then explain that there is another page and what they can find on it.

```diff
 When a data stream's failure store is enabled, these failures are instead captured in a separate index and persisted to be analysed later. Clients receive a successful response with a flag indicating the failure was redirected. Failure stores do not capture failures caused by backpressure or document version conflicts. These failures are always returned as-is since they warrant specific action by the client.
+
+On this page, you'll learn how to set up, use, and manage a failure store, as well as the structure of failure store documents.
+For examples of how to use failure stores to identify and fix errors in ingest pipelines and your data, refer to [](/manage-data/data-store/data-streams/failure-store-recipes.md).
```
## Using a failure store [use-failure-store]

The failure store is meant to ease the burden of detecting and handling failures when ingesting data to {{es}}. Clients are less likely to encounter unrecoverable failures when writing documents, and developers are more easily able to troubleshoot faulty pipelines and mappings.
```diff
 The failure store is meant to ease the burden of detecting and handling failures when ingesting data to {{es}}. Clients are less likely to encounter unrecoverable failures when writing documents, and developers are more easily able to troubleshoot faulty pipelines and mappings.
+For examples of how to use failure stores to identify and fix errors in ingest pipelines and your data, refer to [](/manage-data/data-store/data-streams/failure-store-recipes.md).
```
to fix the failing test
You can specify in a data stream's [index template](../templates.md) if it should enable the failure store when it is first created.

:::{note}
Unlike the `settings` and `mappings` fields on an [index template](../templates.md) which are repeatedly applied to new data stream write indices on rollover, the `data_stream_options` section of a template is applied to a data stream only once when the data stream is first created. To configure existing data streams, use the put [data stream options API](indices-put-data-stream-options).
```diff
-Unlike the `settings` and `mappings` fields on an [index template](../templates.md) which are repeatedly applied to new data stream write indices on rollover, the `data_stream_options` section of a template is applied to a data stream only once when the data stream is first created. To configure existing data streams, use the put [data stream options API](indices-put-data-stream-options).
+Unlike the `settings` and `mappings` fields on an [index template](../templates.md) which are repeatedly applied to new data stream write indices on rollover, the `data_stream_options` section of a template is applied to a data stream only once when the data stream is first created. To configure existing data streams, use the put [data stream options API](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-indices-put-data-stream-options).
```
wonder if we need links to the serverless versions of these APIs too
https://www.elastic.co/docs/api/doc/elasticsearch-serverless/operation/operation-indices-put-data-stream-options
Adds a new section to the documentation to explain new failure store functionality.
Preview:
https://docs-v3-preview.elastic.dev/elastic/docs-content/pull/1368/manage-data/data-store/data-streams/failure-store
This PR relies on links to updates in: