This repository was archived by the owner on Mar 12, 2025. It is now read-only.

Top words for early 2022 are all Vietnamese? #78

Open
pgulley opened this issue Jun 24, 2024 · 3 comments
Assignees
Labels
bug Something isn't working question Further information is requested
Milestone

Comments

@pgulley
Member

pgulley commented Jun 24, 2024

image

@philbudne Noted, while investigating the status of re-indexing data from 2022, that the top terms for a query from 2022-01-01 to 2022-12-31 seem to be entirely populated with Vietnamese words, despite Vietnamese not being among the top 10 languages represented!

@pgulley pgulley added bug Something isn't working question Further information is requested labels Jun 24, 2024
@pgulley pgulley self-assigned this Jun 26, 2024
@pgulley
Member Author

pgulley commented Jul 3, 2024

The lack of Vietnamese stopwords might be part of the issue, but it probably doesn't fully explain this.

@pgulley pgulley added this to the July milestone Jul 3, 2024
@pgulley pgulley moved this from Todo to Investigating in Ingest + Index Infrastructure Jul 3, 2024
@pgulley pgulley modified the milestones: 2 - July, 3 - August Jul 31, 2024
@pgulley pgulley modified the milestones: 3 - August, 4 - September Aug 28, 2024
@philbudne
Contributor

It may just be because I run queries against all stories when looking at progress running historical backfills, and we have some REALLY spammy .vn sources!!
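
One way to check the spammy-source hypothesis would be to look at how story URLs in the sample are distributed by top-level domain. The following is a minimal sketch, not the project's tooling: the helper name `tld_share` and the example URLs are hypothetical, and it assumes story URLs are available as plain strings.

```python
from collections import Counter
from urllib.parse import urlparse


def tld_share(urls):
    """Return the fraction of stories per top-level domain (last hostname label)."""
    counts = Counter()
    for url in urls:
        host = urlparse(url).hostname or ""
        # Take the last dot-separated label, e.g. "vn" for "news.example.vn"
        counts[host.rsplit(".", 1)[-1]] += 1
    total = sum(counts.values())
    return {tld: c / total for tld, c in counts.items()} if total else {}


# Hypothetical sample; real story URLs would come from the index.
sample_urls = [
    "https://news.example.vn/a",
    "https://news.example.vn/b",
    "https://spam.example.vn/c",
    "https://www.example.com/d",
]
shares = tld_share(sample_urls)
```

If `.vn` accounts for a large share of an all-stories query but not of a curated collection, that would point to source volume rather than an indexing bug.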

@philbudne
Contributor

philbudne commented Nov 3, 2024

Was traipsing thru mc-providers and noticed that there is no vi_stop_words.txt file in https://github.com/mediacloud/mc-providers/tree/main/mc_providers/language/
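
To illustrate why a missing stopword list could matter, here is a minimal sketch of top-term counting with per-language stopword filtering. It is not the mc-providers implementation: the function `top_terms`, the in-memory stopword sets, and the sample documents are all hypothetical, and whitespace splitting is a simplification (it splits Vietnamese into syllables rather than words).

```python
from collections import Counter

# Hypothetical stopword sets keyed by language code. There is deliberately
# no "vi" entry, mirroring the missing vi_stop_words.txt.
STOPWORDS = {
    "en": {"the", "of", "and", "a", "to", "in"},
}


def top_terms(docs, stopwords, n=3):
    """docs: iterable of (language_code, text). Return the n most common kept terms."""
    counts = Counter()
    for lang, text in docs:
        stop = stopwords.get(lang, set())  # unknown language: nothing is filtered
        counts.update(w for w in text.lower().split() if w not in stop)
    return [term for term, _ in counts.most_common(n)]


docs = [
    ("en", "the election and the economy"),
    ("en", "the economy of the region"),
    ("vi", "và của là kinh tế"),
    ("vi", "và của là bầu cử"),
    ("vi", "và của là thể thao"),
]

# Without a Vietnamese list, function words dominate the result.
without_vi = top_terms(docs, STOPWORDS)

# With a (hypothetical, tiny) Vietnamese list, they are filtered out.
with_vi = top_terms(docs, {**STOPWORDS, "vi": {"và", "của", "là"}}, n=1)
```

In this toy sample the unfiltered Vietnamese function words outrank every English content term, even though English stories are filtered normally. It does not by itself explain why Vietnamese would dominate when it is not among the top 10 languages.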

Projects
Status: Investigating
Development

No branches or pull requests

2 participants