[Bug]: Images from markdown file are not removed #17397

Open
TomskDiver opened this issue Dec 31, 2024 · 1 comment · May be fixed by #17429
Labels
bug (Something isn't working), triage (Issue needs to be triaged/prioritized)

Comments

@TomskDiver

Bug Description

Images in a Markdown file are not removed because of an error in the regex pattern pattern = r"!{1}\[\[(.*)\]\]": https://github.com/run-llama/llama_index/blob/main/llama-index-integrations/readers/llama-index-readers-file/llama_index/readers/file/markdown/base.py#L79

That pattern only matches the ![[...]] form, but the Markdown syntax for images is ![alt_text](path_to_image_file) (see https://www.markdownguide.org/basic-syntax/#images).

A correct regex may be !\[.*\]\(.*\)
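To make the mismatch concrete, here is a minimal sketch that applies both patterns to a sample string. The sample text is hypothetical (the actual contents of test.md are not shown in this issue), and the snippet uses only the standard re module rather than the reader itself.

```python
import re

# Pattern quoted in this issue: it only matches the "![[...]]" form.
CURRENT_PATTERN = r"!{1}\[\[(.*)\]\]"
# Pattern proposed in this issue for standard Markdown images.
PROPOSED_PATTERN = r"!\[.*\]\(.*\)"

# Hypothetical sample; not the reporter's real test file.
sample = "![alt_text](path_to_image_file)\n**Test header**"

# The current pattern finds nothing, so the text comes back unchanged.
unchanged = re.sub(CURRENT_PATTERN, "", sample)
# The proposed pattern removes the image and leaves the rest.
cleaned = re.sub(PROPOSED_PATTERN, "", sample)

print(repr(unchanged))
print(repr(cleaned))
```

With this sample, the current pattern leaves the image markup in place, while the proposed one strips it and keeps the header line.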

FYI @hursh-desai, @jerry

Version

0.12.8

Steps to Reproduce

Test file: test.md

from llama_index.readers.file import MarkdownReader
document = MarkdownReader().load_data("test.md")
print(document[0])

Output:

Doc ID: 5f87ffec-c634-40ce-a615-ce7cd5b453a3
Text: !alt_text

Expected output:

Doc ID: c4e7cb46-8989-4fd3-83db-2b20d43a9300
Text: **Test header**
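One plausible explanation for the stray "!alt_text" in the actual output: if the image step matches nothing and a later step unwraps hyperlinks of the form [text](url), the image markup loses its brackets and path but keeps the leading "!". The sketch below reproduces that with plain re calls. The hyperlink pattern is an assumption for illustration and is not quoted anywhere in this issue.

```python
import re

# Image pattern quoted in this issue.
IMAGE_PATTERN = r"!{1}\[\[(.*)\]\]"
# ASSUMED hyperlink-unwrapping pattern (hypothetical, for illustration only).
HYPERLINK_PATTERN = r"\[(.*?)\]\((.*?)\)"

text = "![alt_text](path_to_image_file)"

# Step 1: the image pattern does not match, so nothing is removed.
after_images = re.sub(IMAGE_PATTERN, "", text)
# Step 2: the link pattern treats "[alt_text](...)" as a link and keeps its text.
after_links = re.sub(HYPERLINK_PATTERN, r"\1", after_images)

print(after_links)
```

Under that assumption, the result matches the "Text: !alt_text" line shown above.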

Relevant Logs/Tracebacks

No response

TomskDiver added the bug and triage labels on Dec 31, 2024

dosubot bot commented Dec 31, 2024

@TomskDiver Hello there! 👋 I'm here to help you with bug fixes, questions, and contributions.

The regex pattern in the remove_images function of the MarkdownReader class should indeed be corrected to properly remove images from Markdown files. The current pattern r"!{1}\[\[(.*)\]\]" is incorrect. The correct pattern to match the standard Markdown image syntax ![alt_text](path_to_image_file) is:

pattern = r"!\[.*?\]\(.*?\)"

This pattern will accurately capture and remove images from the Markdown content, ensuring that the output is as expected [1].
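As a rough check of why the non-greedy quantifiers matter, the following sketch compares a greedy variant with the non-greedy pattern above on a line that holds two images. The sample line is made up for illustration, and the code only uses the standard re module.

```python
import re

# Greedy variant: ".*" can run across several images on one line.
GREEDY = r"!\[.*\]\(.*\)"
# Non-greedy variant: each match stops at the first closing bracket/paren.
NON_GREEDY = r"!\[.*?\]\(.*?\)"

# Hypothetical line with text between two images.
line = "![a](a.png) keep this text ![b](b.png)"

greedy_result = re.sub(GREEDY, "", line)
non_greedy_result = re.sub(NON_GREEDY, "", line)

print(repr(greedy_result))
print(repr(non_greedy_result))
```

On this input the greedy pattern swallows the text between the two images, while the non-greedy one removes only the images. Neither variant handles image paths that themselves contain a closing parenthesis, so this is a sketch of the common case rather than a full Markdown parser.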

Riddhimaan-Senapati added a commit to Riddhimaan-Senapati/llama_index that referenced this issue Jan 5, 2025