
Naively increase the meta field char limit 50->500 #131478


Conversation

Member

@seanstory seanstory commented Jul 17, 2025

WIP. I want to validate that naively bumping this limit doesn't cause any significant issues.

At the same time, I'll be evaluating the potential impact this could have on LLM understanding of mappings, if we can add much longer field-level "descriptions".
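
For reference, the change is to a field-level `meta` value-length validation along these lines (an illustrative sketch; the constant name and message text approximate, rather than quote, the real mapper code):

```java
public final class FieldMetaLimitSketch {
    // The limit this PR bumps; it was previously 50.
    private static final int META_VALUE_MAX_LENGTH = 500;

    static void validateMetaValue(String field, String value) {
        if (value.length() > META_VALUE_MAX_LENGTH) {
            throw new IllegalArgumentException(
                "[meta] values can't be longer than [" + META_VALUE_MAX_LENGTH
                    + "] chars, but got [" + value.length() + "] for field [" + field + "]"
            );
        }
    }
}
```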

Contributor

github-actions bot commented Jul 17, 2025

🔍 Preview links for changed docs

@seanstory
Member Author

buildkite test this

Samiul-TheSoccerFan and others added 26 commits July 17, 2025 16:57
…ces (elastic#131251)

* Refactoring inference services to accept context

* fix linting issues

* adding mock cluster service to fix IT test

* refactoring to remove duplication in constructors

* remove unnecessary blank line

* refactor to have uniform constructor call

* refactor to have uniform constructor call for sagemaker

* fix linting issues

* fix failed unit tests

---------

Co-authored-by: Elastic Machine <[email protected]>
This PR adds the missing ignore_unavailable, allow_no_indices and
expand_wildcards query parameters.
We were missing a check for whether a field exists when populating the dimension
attributes. This issue occurs when a field exists in some, but not all,
target indices.
The corresponding issue elastic#116781 has already been fixed.
… work (elastic#131505)

Disable entitlements for DirectIOIT; the suite requires delegation to
work.

On main the suite is skipped (direct IO is disabled by default), but
this blocks backports.
…t {p0=downsample-with-security/10_basic/Downsample index} elastic#131513
Removes YAML tests for the `/_cluster/allocation/explain` API. The tests
 passed alternate values to the API, for example the string "true"
 for fields expecting a boolean value. While the API explicitly
 supports this, the tests were not the correct place to exercise that
 behaviour, and they caused the API specification to fail
 validation.

Relates elastic#127028
…ate states (elastic#129633)

Continuation of elastic#127148

When data nodes send the STATS intermediate states to the coordinator, it aggregates them.
Now, however, the TopN groups sent by a data node may not be acceptable to the coordinator (because it already has better values), so it will discard such values.

However, the engine wasn't handling intermediate groups with nulls (TopNBlockHash uses nulls to discard unused groups).
See https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/esql/compute/src/main/java/org/elasticsearch/compute/aggregation/GroupingAggregator.java#L47

_This code isn't connected with the query yet, so there's no bug in production_
Add verification that the optimizers do not modify the number of attributes or the attribute data types.
We add special handling for LOOKUP JOIN, by checking `EsQueryExec esQueryExec && esQueryExec.indexMode() == LOOKUP`, and more special handling for `ProjectAwayColumns.ALL_FIELDS_PROJECTED`.

Closes elastic#125576
This adds support for splitting `Page`s of large values when loading
from single-segment, non-descending hits. This is the hottest code path as
it's how we load data for aggregation. So! We had to make very very very
sure this doesn't slow down the fast path of loading doc values.

Caveat - this only defends against loading large values via the
row-by-row load mechanism that we use for stored fields and _source.
That covers the most common kinds of large values - mostly `text` and
geo fields. If we need to split further on doc values, we'll have to
invent something for them specifically. For now, just row-by-row.

This works by flipping the order in which we load row-by-row and
column-at-a-time values. Previously we loaded all column-at-a-time
values first because that was simpler. Then we loaded all of the
row-by-row values. Now we save the column-at-a-time values and instead
load row-by-row until the `Page`'s estimated size is larger than a "jumbo"
size which defaults to a megabyte.

Once we load enough rows that we estimate the page is "jumbo", we then
stop loading rows. The Page will look like this:

```
| txt1 | int | txt2 | long | double |
|------|-----|------|------|--------|
| XXXX |     | XXXX |      |        |
| XXXX |     | XXXX |      |        |
| XXXX |     | XXXX |      |        |
| XXXX |     | XXXX |      |        |
| XXXX |     | XXXX |      |        |
| XXXX |     | XXXX |      |        | <-- after loading this row
|      |     |      |      |        |     we crossed to "jumbo" size
|      |     |      |      |        |
|      |     |      |      |        |
|      |     |      |      |        | <-- these rows are entirely empty
|      |     |      |      |        |
|      |     |      |      |        |
```

Then we chop the page to the last row:
```
| txt1 | int | txt2 | long | double |
|------|-----|------|------|--------|
| XXXX |     | XXXX |      |        |
| XXXX |     | XXXX |      |        |
| XXXX |     | XXXX |      |        |
| XXXX |     | XXXX |      |        |
| XXXX |     | XXXX |      |        |
| XXXX |     | XXXX |      |        |
```

Then fill in the column-at-a-time columns:
```
| txt1 | int | txt2 | long | double |
|------|-----|------|------|--------|
| XXXX |   1 | XXXX |   11 |    1.0 |
| XXXX |   2 | XXXX |   22 |   -2.0 |
| XXXX |   3 | XXXX |   33 |    1e9 |
| XXXX |   4 | XXXX |   44 |    913 |
| XXXX |   5 | XXXX |   55 | 0.1234 |
| XXXX |   6 | XXXX |   66 | 3.1415 |
```

And then we return *that* `Page`. On the next `Driver` iteration we
start from where we left off.
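
A minimal, self-contained sketch of that reordering (all names here are illustrative stand-ins, not the real ESQL compute classes):

```java
import java.util.List;

class JumboSplitSketch {
    interface RowReader {          // stored-field / _source style loading
        long readRow(int doc);     // appends one value, returns estimated bytes added
    }

    interface ColumnReader {       // doc-values style loading
        void readColumn(int[] docs, int count);
    }

    static final long JUMBO_BYTES = 1024 * 1024; // default "jumbo" size: one megabyte

    /** Returns the number of rows loaded; the next Driver iteration resumes there. */
    static int load(int[] docs, List<RowReader> rowByRow, List<ColumnReader> columnAtATime) {
        long estimatedBytes = 0;
        int rows = 0;
        // Row-by-row values first, stopping once the page estimate goes "jumbo".
        while (rows < docs.length && estimatedBytes < JUMBO_BYTES) {
            for (RowReader reader : rowByRow) {
                estimatedBytes += reader.readRow(docs[rows]);
            }
            rows++;
        }
        // The page is now chopped to `rows` rows; fill in the column-at-a-time
        // columns for exactly those rows.
        for (ColumnReader reader : columnAtATime) {
            reader.readColumn(docs, rows);
        }
        return rows;
    }
}
```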
Fix elastic#129372

Due to how remote ENRICH is
[planned](https://github.com/elastic/elasticsearch/blob/32e50d0d94e27ee559d24bf9d5463ba6e64d1788/x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/planner/mapper/Mapper.java#L93),
it interacts in special ways with pipeline breakers, in particular LIMIT
and TopN; when these are encountered upstream from a remote ENRICH,
these nodes are copied and executed a second time after the remote
ENRICH.

We'd like to allow remote ENRICH after LOOKUP JOIN, but that forces the
lookup to be remote as well; this has its own interactions with pipeline
breakers: in particular, LIMITs and TopNs cannot just be duplicated
after LOOKUP JOIN, as LOOKUP JOIN may add new rows.

For now, let's just forbid any usage of remote ENRICH after LOOKUP
JOINs; remote ENRICH is mostly relevant for CCS, and LOOKUP JOIN doesn't
support that in 9.1/8.19, anyway.

There is separate work that enables remote LOOKUP JOINs on remote
clusters and adds the correct validations; we can later build support
for remote ENRICH + LOOKUP JOIN on top of that. (C.f. my comment
[here](elastic#129372 (comment))
and my draft elastic#131286 for
enabling this.)
…cellationViaTimeoutWithAllowPartialResultsSetToFalse elastic#131248
* Fix msearch rest-api-spec

* Add YAML tests for added parameters
RLIKE LIST did not manage to make it into 9.1.
In this PR, we modify the documentation to make it clear that it will be available in 9.2, not 9.1.
elasticsearchmachine and others added 28 commits July 18, 2025 23:44
…#130495)

This change removes RemoteClusterService.getRemoteClusterNames() since
getRegisteredRemoteClusterNames() provides the same functionality.
The comment in getRegisteredRemoteClusterNames() was removed since
it is no longer accurate after the change in PR elastic#47891.
This PR changes the test to simply wait for the expected master on every node
instead of selectively waiting on one non-master node and one master node.
The latter is problematic since it uses an API that is not suitable when the
cluster is changing masters.

Relates: elastic#127213
…t {p0=search/40_indices_boost/Indices boost with alias} elastic#131598
An exception here should be impossible, but we don't assert that, nor do
we emit a log message to prove it didn't happen in a production
environment. This commit adds the missing log and assert.
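
The usual idiom for this looks roughly as follows (a sketch, not the exact code added by the commit):

```java
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

class ImpossibleExceptionSketch {
    private static final Logger logger = LogManager.getLogger(ImpossibleExceptionSketch.class);

    void onSupposedlyImpossibleException(Exception e) {
        // In production, where JVM assertions are disabled, the log line is the
        // only evidence; in tests the assert fails loudly.
        logger.error("unexpected exception", e);
        assert false : e;
    }
}
```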
Clarifies in its documentation that `BlobContainer#getRegister` offers
only read-after-write semantics rather than full linearizability, and
adds comments to its callers justifying why this is still safe.
In order to better understand the infrequent failures in elastic#129445, and
in the hope of reproducing the issue more reliably, this PR adds logging
of which shards and nodes documents end up on in
FieldSortIT#testSortMixedFieldTypes, and increases the log level for
this test suite in the o.e.search packages to TRACE and in some
action.search packages to DEBUG, to better understand where exceptions
are thrown and to better trace how resources are released after that.

Relates to elastic#129445
Spell out that total memory needs to account for multiple nodes and
other processes, and that the OOM killer might react if you ignore this
guidance.
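
An illustrative example (numbers mine, not from the docs): two nodes with 8 GB heaps on a 32 GB host already commit 16 GB to heaps alone, leaving at most 16 GB for off-heap overhead, the filesystem cache, and every other process on the machine; oversubscribe that and the kernel's OOM killer may pick off a node.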
…lastic#131419)

* update `kibana_system` to grant it access to `.chat-*` system index

* fix unit test
`SampleOperator.Status` wasn't declared as a NamedWriteable by the plugin, leading to serialization errors when `SAMPLE` is used with `profile: true`.

It leads to an `IllegalArgumentException: Unknown NamedWriteable [org.elasticsearch.compute.operator.Operator$Status][sample]`

Profiles will be tested in elastic#131474, which is currently failing because of this bug.
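
The fix is the standard registration pattern; a sketch, with the category and name taken from the error message above (the exact plugin wiring may differ):

```java
import java.util.List;

import org.elasticsearch.common.io.stream.NamedWriteableRegistry;
import org.elasticsearch.compute.operator.Operator;
import org.elasticsearch.compute.operator.SampleOperator;

// In the plugin: declare SampleOperator.Status under the Operator.Status
// category with the name "sample", so it can cross the wire in profiles.
public List<NamedWriteableRegistry.Entry> getNamedWriteables() {
    return List.of(
        new NamedWriteableRegistry.Entry(Operator.Status.class, SampleOperator.Status.NAME, SampleOperator.Status::new)
    );
}
```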
…thub.com:seanstory/elasticsearch into seanstory/increase-mapping-field-meta-char-limit

❌ Author of the following commits did not sign a Contributor Agreement:
e0c1a9b, a4f345b

Please read and sign the above-mentioned agreement if you want to contribute to this project

@seanstory seanstory closed this Jul 21, 2025