Upon restart, Fscrawler deletes and reindexes even though no new files are added. #1941

@ScottCov

Describe the bug

I run FSCrawler continuously. If I stop it and then restart it, it deletes and reindexes documents that are already indexed. The number of indexed documents doesn't change, but it appears to delete them and then add them again even though there are no new files. To be clear, I just stopped the Docker container/Elasticsearch and restarted it.
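
One way to confirm that documents are really being rewritten, rather than only logged as such, is to compare each document's `_seq_no` before and after a restart. Elasticsearch assigns a new sequence number on every write to a document, so an unchanged `_seq_no` means the document was not touched. The sketch below assumes two snapshots have already been collected as `_id` to `_seq_no` mappings (for example from search hits requested with `"seq_no_primary_term": true`); the function name and the sample values are hypothetical.

```python
def rewritten_docs(before, after):
    """Return the IDs whose _seq_no differs between two snapshots.

    Both arguments map a document _id to its _seq_no (an int).
    A changed _seq_no means the document was written again.
    """
    return sorted(
        doc_id
        for doc_id, seq_no in after.items()
        if doc_id in before and seq_no != before[doc_id]
    )


# Hypothetical snapshots taken before and after restarting the crawler.
# With filename_as_id: true, the _id is the file name.
before = {"a.pdf": 10, "b.pdf": 11, "c.pdf": 12}
after = {"a.pdf": 13, "b.pdf": 11, "c.pdf": 14}

print(rewritten_docs(before, after))
```

An empty result after a restart would indicate that nothing was actually reindexed.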

Job Settings

---
name: "job_name"
fs:
  #url: "/mnt/cloud/cases"
  url: "/tmp/es"
  update_rate: "15m"
  excludes:
  - "*/~*"
  json_support: false
  filename_as_id: true
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false 
  index_content: true
  attributes_support: false
  raw_metadata: false
  xml_support: false
  index_folders: true
  lang_detect: false
  continue_on_error: true
  ocr:
    #path: "/usr/bin/"
    #data_path: "/usr/share/tesseract-ocr/5/tessdata/"
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
  follow_symlinks: false
elasticsearch:
  pipeline: "fscrawler-copy"
  nodes:
  - url: "https://192.168.1.199:9200"
 # - url: "https://192.168.1.196:9200"
 # - url: "https://192.168.1.198:9200"
 # - url: "https://192.168.1.200:9200"
 # - url: "https://192.168.1.201:9200"
  username: "elastic"
  password: "Dynaco123$"
  bulk_size: 100
  ssl_verification: false


Logs

14:16:30,482 INFO  [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [364.4mb/5.8gb=6.07%], RAM [7.2gb/23.4gb=30.92%], Swap [22.3gb/22.3gb=100.0%].
14:16:30,816 INFO  [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
14:16:30,817 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler started in watch mode. It will run unless you stop it with CTRL+C.
14:16:30,942 WARN  [f.p.e.c.f.c.ElasticsearchClient] We are not doing SSL verification. It's not recommended for production.
14:16:31,702 INFO  [f.p.e.c.f.c.ElasticsearchClient] Elasticsearch Client connected to a node running version 8.13.2
14:16:31,711 WARN  [f.p.e.c.f.c.ElasticsearchClient] We are not doing SSL verification. It's not recommended for production.
14:16:31,827 INFO  [f.p.e.c.f.c.ElasticsearchClient] Elasticsearch Client connected to a node running version 8.13.2
14:16:31,855 INFO  [f.p.e.c.f.FsParserAbstract] FS crawler started for [job_name] for [/tmp/es] every [15m]
14:16:32,038 INFO  [f.p.e.c.f.t.TikaInstance] OCR is enabled. This might slowdown the process.

Expected behavior

I wouldn't expect any need to reindex, as no new documents were added to the folder.
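
The expectation can be sketched as follows: if the time of the last completed scan survives a restart, only files modified after it should be picked up. This is only an illustration of the expected behavior, not FSCrawler's actual implementation; the function name, paths, and timestamps are hypothetical.

```python
from datetime import datetime, timezone


def files_to_index(files, last_run):
    """Return the paths whose modification time is later than last_run.

    files maps a path to its last-modified datetime.
    """
    return sorted(path for path, mtime in files.items() if mtime > last_run)


# Hypothetical last-run time that is assumed to persist across a restart.
last_run = datetime(2024, 5, 1, 14, 0, tzinfo=timezone.utc)

# Hypothetical folder contents: one old file, one modified after last_run.
files = {
    "/tmp/es/old.pdf": datetime(2024, 4, 30, 9, 0, tzinfo=timezone.utc),
    "/tmp/es/new.pdf": datetime(2024, 5, 1, 14, 5, tzinfo=timezone.utc),
}

print(files_to_index(files, last_run))
```

With no files modified since the last run, the result would be empty and nothing would need to be reindexed.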

Versions:

  • OS: Debian 12
  • FSCrawler: 2.10-SNAPSHOT (Docker)
