Conversation

@priyankaforu
Contributor

@priyankaforu priyankaforu commented Oct 10, 2025

Contributor checklist


Description

Fix:

The enwiki ".ndjson" output was always produced without any data. This happened because the parser never read the text inside the file's elements: for example, in "<page> <title> Heading For The Article </title> </page>" the title text was not being read and passed back through callbacks to build the array of articles/words that we need. So I have added the missing callback triggers for reading the text inside those tags.

I also added filtering to omit redirect pages that only link to other pages, as well as namespace pages.
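
For readers following along, here is a minimal sketch of the kind of SAX callbacks and filtering described above. It is not the code from this PR: the `PageHandler` class, the `parse_pages` helper, and the sample XML are hypothetical, and it assumes the dump marks redirects with a `<redirect>` element and namespaces with an `<ns>` element (treating `0` as the article namespace).

```python
import xml.sax


class PageHandler(xml.sax.ContentHandler):
    """Hypothetical handler that collects (title, text) pairs from <page> elements."""

    def __init__(self):
        super().__init__()
        self.pages = []            # collected (title, text) tuples
        self._tag = None           # tag whose text is currently being buffered
        self._buffer = []          # character chunks for the current tag
        self._values = {}          # values gathered for the current page
        self._is_redirect = False  # set when the page contains <redirect>

    def startElement(self, name, attrs):
        if name == "page":
            self._values = {}
            self._is_redirect = False
        elif name == "redirect":
            self._is_redirect = True
        elif name in ("title", "ns", "text"):
            self._tag = name
            self._buffer = []

    def characters(self, content):
        # SAX may deliver one text node in several chunks, so accumulate them.
        if self._tag is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if self._tag is not None and name == self._tag:
            self._values[name] = "".join(self._buffer).strip()
            self._tag = None
        elif name == "page":
            # Assumption: namespace "0" holds regular articles.
            is_article = self._values.get("ns", "0") == "0"
            if is_article and not self._is_redirect:
                self.pages.append(
                    (self._values.get("title", ""), self._values.get("text", ""))
                )


def parse_pages(xml_bytes):
    """Parse a dump fragment and return the non-redirect article pages."""
    handler = PageHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.pages


# Made-up sample data: one article, one redirect, one talk-namespace page.
SAMPLE = b"""<mediawiki>
  <page><title>Heading For The Article</title><ns>0</ns>
    <revision><text>Some article words.</text></revision></page>
  <page><title>Old Name</title><ns>0</ns><redirect title="Heading For The Article" />
    <revision><text>#REDIRECT [[Heading For The Article]]</text></revision></page>
  <page><title>Talk:Heading</title><ns>1</ns>
    <revision><text>Discussion.</text></revision></page>
</mediawiki>"""

if __name__ == "__main__":
    print(parse_pages(SAMPLE))
```

Without the `characters` callback the handler would still see every tag open and close but would never capture the text between them, which matches the empty-output symptom described above. In this sketch only the first sample page is kept; the redirect and the talk page are filtered out.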

I have also verified it by downloading 2 enwiki_dump files and trying to parse them into words 👇

@axif0 @andrewtavis @DeleMike Please have a look and let me know if any corrections are needed or if my understanding of the issue is wrong.

image

#641

@github-actions

Thank you for the pull request! ❤️

The Scribe-Data team will do our best to address your contribution as soon as we can. If you're not already a member of our public Matrix community, please consider joining! We'd suggest that you use the Element client as well as Element X for a mobile app, and definitely join the General and Data rooms once you're in. Also consider attending our bi-weekly Saturday dev syncs. It'd be great to meet you 😊

@github-actions

Maintainer Checklist

The following is a checklist for maintainers to make sure this process goes as well as possible. Feel free to address the points below yourself in further commits if you realize that actions are needed :)

  • The linting and formatting workflow within the PR checks do not indicate new errors in the files changed

  • The CHANGELOG has been updated with a description of the changes for the upcoming release and the corresponding issue (if necessary)

@andrewtavis andrewtavis added the hacktoberfest-accepted Accepted as a part of Hacktoberfest label Oct 11, 2025
@andrewtavis
Member

Let's review this in the call we have later, all :)

@DeleMike
Collaborator

I will review today. Well done! ✨

Collaborator

@DeleMike DeleMike left a comment

All looks okay @priyankaforu!

I would love to test it locally. I did not see the steps to reproduce the issue you had in #641, so it is hard for me to have the complete experience like you describe in the issue.

Could you add this, please?

@DeleMike DeleMike linked an issue Oct 15, 2025 that may be closed by this pull request
2 tasks
@priyankaforu
Contributor Author

priyankaforu commented Oct 15, 2025

> All looks okay @priyankaforu!
>
> I would love to test it locally. I did not see the steps to reproduce the issue you had in #641, so it is hard for me to have the complete experience like you describe in the issue.
>
> Could you add this, please?

Hey @DeleMike, did you try downloading autosuggestions from wiki data? For autosuggestions, the issue is that the parser is not parsing the characters inside the tags. To fix that, we need to let the SAX parser read the characters between the tags and store them in the array. I just added the missing methods there :) If this is still unclear, I can explain it better by connecting with you directly, whenever you have time.

@DeleMike
Copy link
Collaborator

Okay, thanks. I will connect w/ you in Matrix.

Collaborator

@DeleMike DeleMike left a comment

Great work @priyankaforu on this! And thanks for the sync. I totally see it all in action!

One suggestion:

  • We need a little bit more logs when parsing starts.
  • Hopefully, #646 solves this problem.

Labels

hacktoberfest-accepted Accepted as a part of Hacktoberfest

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Missing Sax Parsing Logic For Xml Files in extract_wiki.py file

3 participants