Closed
Labels: check_for_bug (Needs to be reproduced), wait for feedback (Waiting for the user feedback)
Description
Describe the bug
I run FSCrawler continuously. If I stop it and then restart it, it proceeds to delete and reindex documents that are already indexed. Specifically, the number of indexed documents doesn't change, but it appears to delete them and then add them again even though no new files were added. To be clear, I just stopped the Docker container/Elasticsearch and restarted it.
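One way to confirm that documents are really being rewritten, rather than just logged as such, is to compare each document's `_seq_no` and `_primary_term` before and after the restart. Elasticsearch returns these fields on search hits when the request body sets `"seq_no_primary_term": true`. The sketch below is only an illustration: the helper names `snapshot_from_hits` and `diff_snapshots` are hypothetical, and it assumes the hits have already been fetched and saved from the job's index.

```python
from typing import Dict, List, Tuple

# Hypothetical snapshot type: document _id -> (_seq_no, _primary_term)
Snapshot = Dict[str, Tuple[int, int]]


def snapshot_from_hits(hits: List[dict]) -> Snapshot:
    """Build a snapshot from search hits that include _seq_no and _primary_term."""
    return {h["_id"]: (h["_seq_no"], h["_primary_term"]) for h in hits}


def diff_snapshots(before: Snapshot, after: Snapshot) -> Dict[str, List[str]]:
    """Report which document ids were rewritten, removed, or added between snapshots."""
    common = before.keys() & after.keys()
    return {
        # Same id, but a different (_seq_no, _primary_term): the document was written again.
        "rewritten": sorted(i for i in common if before[i] != after[i]),
        "removed": sorted(before.keys() - after.keys()),
        "added": sorted(after.keys() - before.keys()),
    }


if __name__ == "__main__":
    # Made-up sample hits, standing in for the results of two searches.
    before = snapshot_from_hits([
        {"_id": "a.pdf", "_seq_no": 1, "_primary_term": 1},
        {"_id": "b.pdf", "_seq_no": 2, "_primary_term": 1},
    ])
    after = snapshot_from_hits([
        {"_id": "a.pdf", "_seq_no": 1, "_primary_term": 1},
        {"_id": "b.pdf", "_seq_no": 9, "_primary_term": 1},
    ])
    print(diff_snapshots(before, after))
```

If every id shows up under "rewritten" after a restart with no file changes, that would match the behavior described above. An empty result would suggest the documents were not actually touched.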
Job Settings
---
name: "job_name"
fs:
  #url: "/mnt/cloud/cases"
  url: "/tmp/es"
  update_rate: "15m"
  excludes:
  - "*/~*"
  json_support: false
  filename_as_id: true
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: true
  attributes_support: false
  raw_metadata: false
  xml_support: false
  index_folders: true
  lang_detect: false
  continue_on_error: true
  ocr:
    #path: "/usr/bin/"
    #data_path: "/usr/share/tesseract-ocr/5/tessdata/"
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
  follow_symlinks: false
elasticsearch:
  pipeline: "fscrawler-copy"
  nodes:
  - url: "https://192.168.1.199:9200"
  # - url: "https://192.168.1.196:9200"
  # - url: "https://192.168.1.198:9200"
  # - url: "https://192.168.1.200:9200"
  # - url: "https://192.168.1.201:9200"
  username: "elastic"
  password: "Dynaco123$"
  bulk_size: 100
  ssl_verification: false
Logs
14:16:30,482 INFO [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [364.4mb/5.8gb=6.07%], RAM [7.2gb/23.4gb=30.92%], Swap [22.3gb/22.3gb=100.0%].
14:16:30,816 INFO [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
14:16:30,817 INFO [f.p.e.c.f.FsCrawlerImpl] FS crawler started in watch mode. It will run unless you stop it with CTRL+C.
14:16:30,942 WARN [f.p.e.c.f.c.ElasticsearchClient] We are not doing SSL verification. It's not recommended for production.
14:16:31,702 INFO [f.p.e.c.f.c.ElasticsearchClient] Elasticsearch Client connected to a node running version 8.13.2
14:16:31,711 WARN [f.p.e.c.f.c.ElasticsearchClient] We are not doing SSL verification. It's not recommended for production.
14:16:31,827 INFO [f.p.e.c.f.c.ElasticsearchClient] Elasticsearch Client connected to a node running version 8.13.2
14:16:31,855 INFO [f.p.e.c.f.FsParserAbstract] FS crawler started for [job_name] for [/tmp/es] every [15m]
14:16:32,038 INFO [f.p.e.c.f.t.TikaInstance] OCR is enabled. This might slowdown the process.
Expected behavior
I wouldn't expect any reindexing, since no new documents were added to the folder.
Versions:
- OS: Debian 12
- FSCrawler: 2.10-SNAPSHOT (Docker)
Attachment
If the bug is related to a given file, please share this file so we can reuse it in tests
to reproduce the problem and maybe use it in our integration tests.