Improve file path validation #1110

Open: wants to merge 5 commits into base: main

Conversation

dreadatour
Contributor

@dreadatour dreadatour commented May 20, 2025

Validate File.path on usage (caching, download). More validation cases in test.

@dreadatour dreadatour self-assigned this May 20, 2025

@Copilot Copilot AI left a comment

Pull Request Overview

Improve and enforce stricter validation and normalization of user-provided file paths, along with comprehensive tests for edge cases.

  • Enhance validate_path to normalize paths, strip redundant segments, and enforce that paths are relative without ..
  • Add parameterized unit tests covering valid, empty, and invalid path inputs

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| tests/unit/lib/test_file.py | Add test_path_validation with parameterized cases for path rules |
| src/datachain/lib/file.py | Update validate_path to use PurePath, os.path.normpath, and explicit checks |
Comments suppressed due to low confidence (2)

tests/unit/lib/test_file.py:372

  • [nitpick] The parameter name raises is ambiguous; it holds an expected error message rather than an exception type. Consider renaming it to error_match or expected_error for clarity.
@pytest.mark.parametrize(
    "path,expected,raises",

src/datachain/lib/file.py:244

  • The function uses os.path.normpath but os is not imported, which will cause a NameError. Add import os at the top of the file.
normalized_path = os.path.normpath(normalized_path)

Comment on lines 243 to 244
normalized_path = PurePath(path).as_posix()
normalized_path = os.path.normpath(normalized_path)

Copilot AI May 20, 2025

[nitpick] Using PurePath followed by os.path.normpath may lead to OS-dependent normalization (e.g., backslashes on Windows). Consider using PurePosixPath exclusively or clearly documenting that normalization follows the current OS conventions.

Suggested change
- normalized_path = PurePath(path).as_posix()
- normalized_path = os.path.normpath(normalized_path)
+ normalized_path = PurePosixPath(path)

Copilot uses AI. Check for mistakes.
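
To make the OS-dependence concern concrete, here is a small sketch (not code from this PR). It uses the standard library's `posixpath` and `ntpath` modules directly, which are what `os.path` resolves to on POSIX and Windows respectively, so both behaviours can be compared on any machine. The sample paths are made up.

```python
import ntpath
import posixpath
from pathlib import PurePosixPath, PureWindowsPath

raw = "dir/./sub/../file.ext"  # made-up sample path

# os.path is posixpath on POSIX systems and ntpath on Windows, so
# os.path.normpath returns different separators depending on the platform.
posix_result = posixpath.normpath(raw)
windows_result = ntpath.normpath(raw)

# PurePath is PurePosixPath or PureWindowsPath depending on the OS.
# Only the Windows flavour treats a backslash as a separator.
as_posix_on_posix = PurePosixPath("dir\\file.ext").as_posix()
as_posix_on_windows = PureWindowsPath("dir\\file.ext").as_posix()

print(posix_result)
print(windows_result)
print(as_posix_on_posix)
print(as_posix_on_windows)
```

Using the explicit POSIX flavours (`PurePosixPath`, `posixpath`) is one way to get the same result on every platform.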

@@ -236,8 +236,23 @@ def validate_location(cls, v):

     @field_validator("path", mode="before")
     @classmethod
-    def validate_path(cls, path):
-        return Path(path).as_posix() if path else ""
+    def validate_path(cls, path: str) -> str:

Copilot AI May 20, 2025

[nitpick] This validator would benefit from a docstring summarizing its rules (empty → '', strip ./, enforce relativity, disallow ..) to aid future maintainers.

Suggested change
- def validate_path(cls, path: str) -> str:
+ def validate_path(cls, path: str) -> str:
+     """
+     Validate the 'path' field to ensure it adheres to specific rules:
+     - Convert empty paths to an empty string ('').
+     - Normalize paths by stripping './'.
+     - Ensure paths are relative (disallow absolute paths starting with '/').
+     - Disallow paths containing '..' to prevent directory traversal.
+     """

Copilot uses AI. Check for mistakes.
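
The four rules in the suggested docstring could be sketched roughly as follows. This is a minimal standalone illustration, not the validator from this PR: the function name `validate_relative_path` is hypothetical, it raises a plain `ValueError`, it only rejects `..` segments that survive normalization, and mapping a bare `.` to an empty string is an assumption.

```python
import posixpath


def validate_relative_path(path: str) -> str:
    """Hypothetical helper applying the rules from the docstring above."""
    if not path:
        # Rule 1: empty input becomes an empty string.
        return ""
    if path.startswith("/"):
        # Rule 3: absolute paths are rejected.
        raise ValueError("path must be relative")
    # Rule 2: collapse './', duplicate slashes and inner 'dir/..' pairs.
    normalized = posixpath.normpath(path)
    if normalized == ".":
        # Assumption: a path that normalizes to '.' is treated as empty.
        return ""
    if normalized == ".." or normalized.startswith("../"):
        # Rule 4: a '..' that survives normalization escapes the root.
        raise ValueError("path must not contain '..'")
    return normalized


print(validate_relative_path("./dir/file.ext"))
print(validate_relative_path("dir/../file.ext"))
```

Because it relies on `posixpath` rather than `os.path`, the result does not depend on the platform it runs on.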

Member

@shcheklein shcheklein left a comment

can it be expensive? are we sure it won't be making syscalls underneath all these calls (normalize, as_posix, etc.)?

I'm worried it can slow down massive bulk operations, like listing creation, for the sake of a very rarely needed use case

@dreadatour
Contributor Author

can it be expensive? are we sure it won't be making syscalls underneath all these calls (normalize, as_posix, etc.)?

In this PR I am also replacing pathlib.Path with pathlib.PurePath. The main difference is that PurePath does not resolve files or perform any real filesystem operations; it only does string manipulation.

https://docs.python.org/3/library/pathlib.html

Path classes are divided between pure paths, which provide purely computational operations without I/O, and concrete paths, which inherit from pure paths but also provide I/O operations.

So it should not make any syscalls.
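
As a quick illustration of the point quoted from the docs, the sketch below (not part of the PR) checks which I/O methods are present on the pure and concrete classes and normalizes a path that does not need to exist. The sample path is made up.

```python
from pathlib import Path, PurePath, PurePosixPath

# Methods that touch the filesystem live on the concrete Path class only.
io_methods = ("exists", "stat", "open", "resolve")
pure_has_io = [name for name in io_methods if hasattr(PurePath, name)]
concrete_has_io = [name for name in io_methods if hasattr(Path, name)]

# A made-up path that does not exist; PurePosixPath never looks it up.
p = PurePosixPath("no/such/./dir/../file.ext")

print(pure_has_io)
print(concrete_has_io)
print(p.as_posix())
```

Note that pure paths drop `.` segments but leave `..` in place, which is presumably why the PR pairs `PurePath` with `normpath`.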

I'm worried it can slow down massive bulk operations, like listing creation, for the sake of a very rarely needed use case

Very good point IMO.

@shcheklein
Member

Very good point IMO.

is there anything we can do as a very basic quick test? e.g. does it start with ., .. or /? something really quick that would tell us whether we need to run the more complicated check?

@dreadatour
Contributor Author

is there anything we can do as a very basic quick test? e.g. does it start with ., .. or /? something really quick that would tell us whether we need to run the more complicated check?

$ python -m timeit -s "from datachain import File" "File.validate_path('./foo/../bar/../file.ext')"
100000 loops, best of 5: 3.14 usec per loop
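
For comparison, the quick pre-check suggested above might look like the sketch below. It is only an illustration under assumptions: `needs_normalization` and `normalize_fast` are hypothetical names, and the slow path is plain `posixpath.normpath` rather than the PR's validator. The pre-check may over-report (it flags every absolute path, for example), but it should never skip a path that `normpath` would change.

```python
import posixpath


def needs_normalization(path: str) -> bool:
    """Cheap string checks for paths that posixpath.normpath could change."""
    return (
        path in (".", "..")
        or path.startswith(("/", "./", "../"))
        or path.endswith(("/", "/.", "/.."))
        or "//" in path
        or "/./" in path
        or "/../" in path
    )


def normalize_fast(path: str) -> str:
    """Return already-clean paths untouched; normalize only suspicious ones."""
    if not path or not needs_normalization(path):
        return path
    return posixpath.normpath(path)


print(normalize_fast("dir/file.ext"))
print(normalize_fast("./foo/../bar/../file.ext"))
```

Whether this is worth the extra code depends on how much faster the string checks are than the full normalization on real listings, which would need to be measured.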

@dreadatour
Contributor Author

I am more worried about modifying the file path (see the tests in this PR).

Is it OK to modify the path (e.g. dir/../file.ext -> file.ext), or should we raise an exception if the path is not normalized?
I'd prefer the second option, to be honest, but it can break the user experience. At the same time, modifying the path is also not good for the user experience.

@shcheklein
Member

Is it OK to modify the path

tbh, I think that's fine. why not? File is meant to be a more or less precise object, as opposed to some random path

@dreadatour
Contributor Author

Also, it looks like we should allow absolute paths, since that is how it works now for the file:// source.

I am going to remove the check for absolute paths, but ideally we should also check source: allow an absolute path only if source.startswith("file://") and disallow it in all other cases.

The other option is to move the root directory (even /) into the source field and always use relative paths in the path field for all sources, but this is not backward-compatible and would require migrations.
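
The "ideal" source-aware check described above could be sketched like this. It is a hypothetical standalone function, not code from the PR: the name `check_absolute_path`, the `ValueError`, and the example sources are all assumptions made for illustration.

```python
def check_absolute_path(source: str, path: str) -> None:
    """Hypothetical check: allow absolute paths only for file:// sources."""
    if path.startswith("/") and not source.startswith("file://"):
        raise ValueError(
            f"absolute path {path!r} is not allowed for source {source!r}"
        )


# Made-up examples: a local source may carry an absolute path,
# and a bucket source with a relative path is also accepted.
check_absolute_path("file://", "/home/user/file.ext")
check_absolute_path("s3://bucket", "dir/file.ext")
```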

@dreadatour
Contributor Author

Is it OK to modify the path

tbh, I think that's fine. why not? File is meant to be a more or less precise object, as opposed to some random path

My only concern is that users will end up with normalized paths in the dataset even if they create the File signal with some random, weird paths. On the other hand, you're right: this is "an actual precise object", and normalizing the path looks like the only option.

@shcheklein
Member

a completely alternative option - allow any path at all, don't validate

validate in prefetch instead (our code that deals with these files)

let users also deal with any custom logic that they would like to put there

wdyt?

@dreadatour
Contributor Author

One more sanity check we can do is to verify that the path does not end with a trailing slash (/), which represents a directory, whereas the File model is for file objects.

@dreadatour
Contributor Author

a completely alternative option - allow any path at all, don't validate

validate in prefetch instead (our code that deals with these files)

let users also deal with any custom logic that they would like to put there

Yes, this is what my concern was about when I was talking about "user experience".

wdyt?

I think this looks like a good option.

Should I add something like a normpath method/property to the File model and check that we use this new normalized path everywhere we work with physical files in our codebase?

@shcheklein
Member

Should I add something like a normpath method/property to the File model and check that we use this new normalized path everywhere we work with physical files in our codebase?

yes, kind of reuse your validation logic

but we also need to put safeguards in UDFs, right? For local paths. E.g. do we even need prefetch if the file is local?
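
The "store any path, validate on use" idea discussed above could be sketched as follows. Everything here is hypothetical and only illustrates the approach: `FileSketch`, `PathError`, and `normalized_path` are made-up names, not the datachain `File` model or its API, and the trailing-slash rule is taken from the sanity check proposed earlier in the thread.

```python
import posixpath
from dataclasses import dataclass


class PathError(ValueError):
    """Hypothetical error type used only in this sketch."""


@dataclass
class FileSketch:
    source: str
    path: str  # stored exactly as the user provided it, never rewritten

    def normalized_path(self) -> str:
        """Validate and normalize the path only when the file is used."""
        if not self.path:
            raise PathError("path must not be empty")
        if self.path.endswith("/"):
            raise PathError("path must not be a directory")
        normalized = posixpath.normpath(self.path)
        if normalized in (".", ".."):
            raise PathError("path must not be a directory")
        if normalized.startswith("../"):
            raise PathError("path must not contain '..'")
        return normalized


f = FileSketch(source="s3://bucket", path="dir/../file.ext")  # made-up values
print(f.path)
print(f.normalized_path())
```

With this shape, the dataset keeps the user's original string, and only the code paths that touch physical files (caching, download) would call the normalizing method.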

cloudflare-workers-and-pages bot commented May 20, 2025

Deploying datachain-documentation with Cloudflare Pages

Latest commit: f3bc04a
Status: ✅  Deploy successful!
Preview URL: https://11ab24ab.datachain-documentation.pages.dev
Branch Preview URL: https://better-file-path-validation.datachain-documentation.pages.dev

@dreadatour dreadatour requested a review from shcheklein May 20, 2025 16:41
@dreadatour
Contributor Author

E.g. do we even need prefetch if the file is local?

We don't need prefetch or cache (different disks/file systems?) if the file is local, but I think this is a subject for a separate PR?

@shcheklein
Member

if the file is local, but I think this is a subject for a separate PR?

yes, unless it is just easier to do that than to add other types of safeguards (e.g. deny any local operations for prefetch / cache for now).

in reality, though, it might be needed in local mode, e.g. for a slow NAS mounted on some volume

codecov bot commented May 21, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 88.67%. Comparing base (648cf44) to head (f3bc04a).

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1110      +/-   ##
==========================================
+ Coverage   88.65%   88.67%   +0.01%     
==========================================
  Files         148      148              
  Lines       12853    12870      +17     
  Branches     1810     1814       +4     
==========================================
+ Hits        11395    11412      +17     
  Misses       1033     1033              
  Partials      425      425              
| Flag | Coverage Δ |
| --- | --- |
| datachain | 88.60% <100.00%> (+0.01%) ⬆️ |

| Files with missing lines | Coverage Δ |
| --- | --- |
| src/datachain/cache.py | 88.75% <100.00%> (ø) |
| src/datachain/client/fsspec.py | 92.66% <100.00%> (+0.02%) ⬆️ |
| src/datachain/client/local.py | 97.33% <100.00%> (ø) |
| src/datachain/lib/arrow.py | 98.77% <100.00%> (ø) |
| src/datachain/lib/file.py | 92.03% <100.00%> (+0.33%) ⬆️ |
| src/datachain/lib/tar.py | 100.00% <ø> (ø) |
| src/datachain/lib/webdataset.py | 94.44% <100.00%> (ø) |

@dreadatour
Contributor Author

in reality, though, it might be needed in local mode, e.g. for a slow NAS mounted on some volume

Yes, exactly. This is still a local file, but cache/prefetch might be useful in this case. The user can control this with settings, so maybe we should leave it as is for now.

Member

@0x2b3bfa0 0x2b3bfa0 left a comment

Looks good to me. 😅

Member

@shcheklein shcheklein left a comment

What will users see when they run UDF with some bad files after this change?

@dreadatour
Contributor Author

What will users see when they run UDF with some bad files after this change?

They will see an exception:

datachain.lib.file.FileError: Error in file gs://datachain-test-vlad/.: path must not be a directory

There is another issue, not directly related: #1125
