GoogleDriveReader support file extensions #17620
Merged
Description
Previously, files in Google Drive folders that were not Google Sheets/Docs/Presentations had their extensions stripped, because the downloaded file was named after the Google Drive file ID, which has no extension. As a result, other file formats (like PDFs) could not be read properly: the "file_extractor" capability was never used, and everything was read as raw bytes. That works OK for .txt or .md docs, but for a PDF it generates a bunch of garbage.
This change appends the file extension when one is available; otherwise it appends an empty string, which shouldn't create any issues.
It's also possible in Google Drive to remove a file's extension or to upload a file without one. In that case this change has no effect.
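For illustration, the naming rule described above can be sketched roughly as follows. This is a minimal sketch, not the code in this PR: `local_filename` and its parameters are hypothetical names, and it assumes the original Drive file name is available alongside the file ID.

```python
from pathlib import Path


def local_filename(file_id: str, drive_name: str) -> str:
    """Hypothetical helper: name the downloaded file after its Drive ID,
    keeping whatever extension the original Drive file name carries."""
    # Path.suffix is "" when the name has no extension, so files without
    # one keep the bare file ID as their name.
    extension = Path(drive_name).suffix
    return f"{file_id}{extension}"
```

With this rule, a file named `report.pdf` would be saved as `<file_id>.pdf`, while a file named `notes` would still be saved under the bare file ID.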
Note: depending on how you use this reader, this could be a breaking change, since some files that were previously parsed as bytes/text will now be parsed by the file_extractor. That said, it seems like this may be the expected behavior already.
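To make the behavior change concrete, here is a simplified stand-in for extension-based dispatch. It is a sketch under stated assumptions, not the llama-index implementation: `parse_file`, `read_as_text`, and the dict-of-callables shape are all hypothetical and only mimic the idea of a `file_extractor` keyed by extension.

```python
from pathlib import Path
from typing import Callable, Dict


def read_as_text(data: bytes) -> str:
    """Fallback: decode raw bytes as text (fine for .txt/.md, garbage for PDFs)."""
    return data.decode("utf-8", errors="replace")


def parse_file(
    filename: str,
    data: bytes,
    file_extractor: Dict[str, Callable[[bytes], str]],
) -> str:
    """Hypothetical dispatcher: pick an extractor by file extension,
    falling back to plain text decoding when none matches."""
    extension = Path(filename).suffix.lower()
    extractor = file_extractor.get(extension, read_as_text)
    return extractor(data)
```

Under this sketch, a file saved as `<file_id>` with no extension always hits the text fallback, while the same bytes saved as `<file_id>.pdf` are routed to whatever extractor is registered for `.pdf`. That routing difference is the potential breaking change.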
Fixes # (issue)
Version Bump?
Did I bump the version in the pyproject.toml file of the package I am updating? (Except for the llama-index-core package)
Type of Change
Depending on usage, I suppose it could be any of these... 😄
How Has This Been Tested?
Your pull-request will likely not be merged unless it is covered by some form of impactful unit testing.
I also ran a script against my Google Drive, where I had uploaded some PDFs:
Suggested Checklist:
make format; make lint to appease the lint gods