[Bug]: Standalone SIGABRT: tantivy OpenReadError (FileDoesNotExist("meta.json")) #39585

Closed
1 task done
ThreadDao opened this issue Jan 24, 2025 · 11 comments
Labels
kind/bug Issues or changes related a bug triage/accepted Indicates an issue or PR is ready to be actively worked on.

@ThreadDao
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: master-ci
- Deployment mode(standalone or cluster):
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2): go sdk
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

The Go SDK CI test run for a PR failed:

[2025-01-24T10:45:39Z INFO  tantivy::directory::file_watcher] Meta file "/var/lib/milvus/data/indexnode/index_files/455531513553853686/1/meta.json" was modified
[2025-01-24T10:45:39Z INFO  tantivy::directory::file_watcher] Meta file "/var/lib/milvus/data/indexnode/index_files/455531513553853646/1/meta.json" was modified
[2025-01-24T10:45:39Z ERROR tantivy::reader] Error while loading searcher after commit was detected. OpenReadError(FileDoesNotExist("meta.json"))
terminate called after throwing an instance of 'milvus::SegcoreError'
  what():   => remove local directory:/var/lib/milvus/data/indexnode/index_files/455531513553853646/1/ failed, error: Directory not empty, files: /var/lib/milvus/data/indexnode/index_files/455531513553853646/1/.tantivy-meta.lock at /workspace/source/internal/core/src/storage/LocalChunkManager.cpp:228

SIGABRT: abort
PC=0x7f2df9e419fc m=662 sigcode=18446744073709551610
signal arrived during cgo execution

goroutine 245337 gp=0xc001dbbdc0 m=662 mp=0xc0209df808 [syscall, locked to thread]:
non-Go function
pc=0x7f2df9e419fc
non-Go function
pc=0x7f2df9ded475
non-Go function
pc=0x7f2df9dd37f2
non-Go function
pc=0x7f2df9c21b9d
non-Go function
pc=0x7f2df9c2d20b

  • server pods:
ms-39579-3-go-pr-etcd-0                                     1/1     Running            0               15m     10.104.29.87    4am-node35   <none>           <none>

2025-01-24T10:47:45Z {container="step-check-status"} ms-39579-3-go-pr-milvus-standalone-5ccf8b647b-dnpqk         1/1     Running            3 (2m ago)      15m     10.104.31.192   4am-node34   <none>           <none>

2025-01-24T10:47:45Z {container="step-check-status"} ms-39579-3-go-pr-minio-85454bb9f-znjpp                      1/1     Running            0               15m     10.104.29.85    4am-node35   <none>           <none>

Loki logs

Expected Behavior

No response

Steps To Reproduce

Milvus Log

No response

Anything else?

No response

@ThreadDao ThreadDao added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 24, 2025
@ThreadDao ThreadDao added this to the 2.6.0 milestone Jan 24, 2025
@xiaofan-luan
Collaborator

It seems to be a file-not-found error?

@xiaofan-luan
Collaborator

/assign @SpadeA-Tang
please help on it

@yanliang567 yanliang567 added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 26, 2025
@SpadeA-Tang
Contributor

It seems there's a race condition between LocalChunkManager::RemoveDir and Tantivy's InnerIndexReader::reload.
Tantivy has a watcher on "meta.json", and when the file changes, Tantivy reloads the index reader.
So, my conjecture is:
t1: Tantivy: meta.json is changed -> the file watcher detects it, so an index reader reload is triggered.
t2: Segcore: LocalChunkManager::RemoveDir is called concurrently.
t3: Tantivy: for the index reader reload, it acquires ".tantivy-meta.lock" (by creating a file named ".tantivy-meta.lock"). By this time, "meta.json" has already been deleted by LocalChunkManager, so it logs:
[2025-01-24T10:45:39Z ERROR tantivy::reader] Error while loading searcher after commit was detected. OpenReadError(FileDoesNotExist("meta.json"))
t4: Segcore: returns the error "Directory not empty", I guess because of the concurrent file creation (".tantivy-meta.lock"), which triggers the panic.

If my conjecture is right, the question becomes why LocalChunkManager removes the directory while the index reader is still alive.
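
The t1 to t4 interleaving above can be replayed deterministically in a small Python sketch. It does not reproduce the Milvus or Tantivy code: `remove_dir_non_atomic`, `reader_reload`, and the `before_rmdir` hook are hypothetical stand-ins, and the only assumption is that the removal deletes the directory's entries first and the directory itself afterwards, leaving a window in which another thread can create a file.

```python
import errno
import os
import shutil
import tempfile

LOCK_NAME = ".tantivy-meta.lock"  # lock file name as it appears in the error message


def remove_dir_non_atomic(path, before_rmdir=None):
    """Hypothetical remove-dir routine: delete the entries, then the directory.

    `before_rmdir` is a test hook that runs in the window between the two
    steps, standing in for whatever another thread does at that moment.
    """
    for name in os.listdir(path):
        os.remove(os.path.join(path, name))
    if before_rmdir is not None:
        before_rmdir()
    os.rmdir(path)  # fails if a file appeared after the entries were deleted


def simulate_race():
    """Replay the conjectured interleaving and return the observed errors in order."""
    index_dir = tempfile.mkdtemp(prefix="index_files_")
    meta = os.path.join(index_dir, "meta.json")
    lock = os.path.join(index_dir, LOCK_NAME)
    with open(meta, "w") as f:
        f.write("{}")

    events = []

    def reader_reload():
        # t3: the reload takes the lock by creating the lock file ...
        open(lock, "w").close()
        # ... and only then looks for meta.json, which is already gone.
        if not os.path.exists(meta):
            events.append("FileDoesNotExist(meta.json)")

    try:
        # t2: directory removal runs, with the reload landing in the window.
        remove_dir_non_atomic(index_dir, before_rmdir=reader_reload)
    except OSError as exc:
        # t4: POSIX allows either ENOTEMPTY or EEXIST for a non-empty directory.
        if exc.errno in (errno.ENOTEMPTY, errno.EEXIST):
            events.append("Directory not empty")
        else:
            raise
    finally:
        shutil.rmtree(index_dir, ignore_errors=True)
    return events


if __name__ == "__main__":
    print(simulate_race())
```

On a POSIX system the returned list should contain the missing-meta error followed by the non-empty-directory error, the same order as in the log. The sketch only shows that the conjectured ordering is sufficient to produce both messages; it does not prove that this is what happened in the failing run.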

@xiaofan-luan
Collaborator

@SpadeA-Tang
maybe related to #39471

@SpadeA-Tang
Contributor

What's the commit hash of the cluster that panicked? @ThreadDao

@SpadeA-Tang
Contributor

I just noticed that the panic occurred after the fix in #39471. I think the root cause should be similar to that one.

@ThreadDao
Contributor Author

@SpadeA-Tang I will try again after PR #39253 is merged.

@ThreadDao
Contributor Author

It seems that the issue has not reappeared.

@jaime0815
Contributor

jaime0815 commented Feb 24, 2025

I saw the same error on the branch https://github.com/milvus-io/milvus/commits/hotfix-2.5.4


2025-02-24 10:00:46.900 | GitCommit: 234f5b5e8
2025-02-24 10:13:25.194 | [2025-02-24T10:13:25Z ERROR tantivy::reader] Error while loading searcher after commit was detected. LockFailure(IoError(Os { code: 2, kind: NotFound, message: "No such file or directory" }), None)
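
One plausible reading of this `LockFailure` (an assumption, not something the log confirms) is that the index directory had already been removed when the reload tried to create its lock file. OS error code 2 is `ENOENT`, which is what creating a file inside a missing directory returns. A minimal Python sketch, where `try_acquire_lock` is a hypothetical helper and not Tantivy's API:

```python
import errno
import os
import shutil
import tempfile


def try_acquire_lock(index_dir, lock_name=".tantivy-meta.lock"):
    """Hypothetical lock acquisition: create the lock file exclusively.

    Returns 0 on success, or the OS error number on failure.
    """
    lock_path = os.path.join(index_dir, lock_name)
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except OSError as exc:
        return exc.errno
    os.close(fd)
    os.remove(lock_path)  # release the lock again
    return 0


index_dir = tempfile.mkdtemp(prefix="index_files_")
ok_code = try_acquire_lock(index_dir)    # directory exists: the lock is taken
shutil.rmtree(index_dir)                 # simulate the directory being removed
gone_code = try_acquire_lock(index_dir)  # directory is gone: creation fails
print(ok_code, gone_code, errno.errorcode.get(gone_code))
```

If this reading is right, it would be the same race as in the January report, with the removal finishing slightly earlier relative to the reload.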

@jaime0815 jaime0815 reopened this Feb 24, 2025
@jaime0815
Contributor

@sunby Do we also need to update the tantivy version on the 2.5 branch?

@SpadeA-Tang
Contributor

@sunby Do we also need to update the tantivy version on the 2.5 branch?

The commit milvus-io/tantivy@cd8486a should fix this.
