Articles from the future??? #103
NOTE: there could also be articles (now in the "past") that were from the future when indexed.
Picked a few at random, and it appears to be a date format / language issue:
https://aktien-portal.at/shownews.html?id=75405
https://belonging.berkeley.edu/e-newsletter-archive
https://www.tss-tv.co.jp/apply/regular/20230112.html
I wrote a small script to pass through all the shared links and attempt to extract the
Perhaps a list like this of test cases or real failures should be posted as an issue on https://github.com/adbar/htmldate?
So, I guess the question is why these stories are being /indexed/ with a future date if the indexer is meant to be stopping that. Are these from some reindexing batch, perhaps?
A while back I wrote an ES query for the Media Cloud news index that tallied articles by publication year, and saw some oddness:
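A per-year tally like that can be sketched as an Elasticsearch `date_histogram` aggregation. This is only an illustration: the field name `publication_date` and the helper name `year_tally_query` are assumptions, not taken from the Media Cloud index or code.

```python
def year_tally_query(date_field: str = "publication_date") -> dict:
    """Build a search body that counts documents per publication year.

    `date_field` is a hypothetical field name; substitute the real one.
    """
    return {
        "size": 0,  # return only aggregation buckets, no individual hits
        "aggs": {
            "by_year": {
                "date_histogram": {
                    "field": date_field,
                    "calendar_interval": "year",  # one bucket per calendar year
                    "format": "yyyy",             # bucket keys rendered as 4-digit years
                }
            }
        },
    }
```

The resulting dict would be sent as the body of a search request; buckets for years later than the current one are the "oddness" to look for.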
mcmetadata is supposed to reject anything a month in the future!
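For reference, a guard of that kind might look like the sketch below. It is not mcmetadata's actual code: the function name, the 30-day reading of "a month", and the choice to treat naive datetimes as UTC are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumption: "a month" is approximated as 30 days.
MAX_FUTURE = timedelta(days=30)


def is_too_far_in_future(pub_date: datetime, now: Optional[datetime] = None) -> bool:
    """Return True if pub_date lies more than MAX_FUTURE after `now`."""
    if now is None:
        now = datetime.now(timezone.utc)
    if pub_date.tzinfo is None:
        # Assumption: naive dates are interpreted as UTC.
        pub_date = pub_date.replace(tzinfo=timezone.utc)
    return pub_date > now + MAX_FUTURE
```

If a check like this runs only at extraction time, any path that writes to the index without going through it (a reindexing batch, for example) could still let future dates in.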
So today I wrote a query to retrieve anything with a publication date from 2025-07-01 through 9999-12-31, and here is what it found:
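A range query of that shape might look like the following sketch. The field name `publication_date` and the helper name `future_date_query` are hypothetical; only the two date bounds come from the description above.

```python
def future_date_query(
    start: str = "2025-07-01",
    end: str = "9999-12-31",
    date_field: str = "publication_date",  # assumed field name
) -> dict:
    """Build a search body matching documents dated between start and end, inclusive."""
    return {
        "query": {
            "range": {
                date_field: {
                    "gte": start,  # inclusive lower bound
                    "lte": end,    # inclusive upper bound
                }
            }
        }
    }
```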