-
Notifications
You must be signed in to change notification settings - Fork 114
Improve file path validation #1110
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Improve and enforce stricter validation and normalization of user-provided file paths, along with comprehensive tests for edge cases.
- Enhance
validate_path
to normalize paths, strip redundant segments, and enforce that paths are relative without..
- Add parameterized unit tests covering valid, empty, and invalid path inputs
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
File | Description |
---|---|
tests/unit/lib/test_file.py | Add test_path_validation with parameterized cases for path rules |
src/datachain/lib/file.py | Update validate_path to use PurePath , os.path.normpath , and explicit checks |
Comments suppressed due to low confidence (2)
tests/unit/lib/test_file.py:372
- [nitpick] The parameter name
raises
is ambiguous; it holds an expected error message rather than an exception type. Consider renaming it toerror_match
orexpected_error
for clarity.
@pytest.mark.parametrize(
"path,expected,raises",
src/datachain/lib/file.py:244
- The function uses
os.path.normpath
butos
is not imported, which will cause aNameError
. Addimport os
at the top of the file.
normalized_path = os.path.normpath(normalized_path)
src/datachain/lib/file.py
Outdated
normalized_path = PurePath(path).as_posix() | ||
normalized_path = os.path.normpath(normalized_path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] Using PurePath
followed by os.path.normpath
may lead to OS-dependent normalization (e.g., backslashes on Windows). Consider using PurePosixPath
exclusively or clearly documenting that normalization follows the current OS conventions.
normalized_path = PurePath(path).as_posix() | |
normalized_path = os.path.normpath(normalized_path) | |
normalized_path = PurePosixPath(path) |
Copilot uses AI. Check for mistakes.
@@ -236,8 +236,23 @@ def validate_location(cls, v): | |||
|
|||
@field_validator("path", mode="before") | |||
@classmethod | |||
def validate_path(cls, path): | |||
return Path(path).as_posix() if path else "" | |||
def validate_path(cls, path: str) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] This validator would benefit from a docstring summarizing its rules (empty → ''
, strip ./
, enforce relativity, disallow ..
) to aid future maintainers.
def validate_path(cls, path: str) -> str: | |
def validate_path(cls, path: str) -> str: | |
""" | |
Validate the 'path' field to ensure it adheres to specific rules: | |
- Convert empty paths to an empty string (''). | |
- Normalize paths by stripping './'. | |
- Ensure paths are relative (disallow absolute paths starting with '/'). | |
- Disallow paths containing '..' to prevent directory traversal. | |
""" |
Copilot uses AI. Check for mistakes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can it be expensive? are we sure it won't be doing some syscalls underneath all these normalize, as_posix - etc
I'm worried it can slow down some bulk massive operations - like listing creation for some very rarely needed use case
In this PR I am also replacing pathlib.Path with pathlib.PurePath, which main difference is it does not resolve files and does not make any real filesystem operations, pure strings manipulations only. https://docs.python.org/3/library/pathlib.html
So it should not make any syscalls.
Very good point IMO. |
anything we can do to do a very basic quick test? e.g. starts with . or .. or /? etc ... something really quick that would tell if we need to do a more complicated test? |
|
I am more worrying about modifying file path (see tests in this PR). Is this OK to modify path (e.g. |
tbh, i think that's fine. why not? File is meant to be an actual precise object more or less / vs some random paths |
Also it looks like we should allow absolute paths, it is how it works now for I am going to remove the check for absolute path, but ideally we should also check source and allow absolute path only if Other option is to move root directory (even |
My concern is only about user having normalized path in the dataset, even if they are creating |
a completely alternative option - allow any path at all, don't validate validate in prefetch instead (our code that deals with these files) let users also deal with any custom logic that would like to put there wdyt? |
One more additional sanity check we can do is to check if |
Yes, this is what my concern was about when I was talking about "user experience".
I think this looks like a good option. Should I add something like |
yes, kinda use your validation logic but also we need to put safeguards in UDFs, right? For local paths. E.g. do we even need prefetch if file is local? |
Deploying datachain-documentation with
|
Latest commit: |
f3bc04a
|
Status: | ✅ Deploy successful! |
Preview URL: | https://11ab24ab.datachain-documentation.pages.dev |
Branch Preview URL: | https://better-file-path-validation.datachain-documentation.pages.dev |
We don't need prefetch or cache (? different disks/file systems?) if file is local, but I think this is a subject for separate PR? |
yes, unless it is just easier to do that vs other types of safeguards (e.g. deny any local operations for prefetch / cache for now). in reality it might be needed though in local mode - e.g. slow NAS mounted on some volume |
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #1110 +/- ##
==========================================
+ Coverage 88.65% 88.67% +0.01%
==========================================
Files 148 148
Lines 12853 12870 +17
Branches 1810 1814 +4
==========================================
+ Hits 11395 11412 +17
Misses 1033 1033
Partials 425 425
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
Yes, exactly. This is still local file, but cache/prefetch might be useful in this case. User can control this with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What will users see when they run UDF with some bad files after this change?
They will see an exception:
There is another issue, not directly related: #1125 |
Validate
File.path
on usage (caching, download). More validation cases in test.