Improve file path validation #1110

dreadatour · 2025-05-20T02:44:57Z

Validate File.path on usage (caching, download). More validation cases in test.

Copilot

Pull Request Overview

Improve and enforce stricter validation and normalization of user-provided file paths, along with comprehensive tests for edge cases.

Enhance `validate_path` to normalize paths, strip redundant segments, and enforce that paths are relative without `..`
Add parameterized unit tests covering valid, empty, and invalid path inputs

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File	Description
tests/unit/lib/test_file.py	Add `test_path_validation` with parameterized cases for path rules
src/datachain/lib/file.py	Update `validate_path` to use `PurePath`, `os.path.normpath`, and explicit checks

Comments suppressed due to low confidence (2)

tests/unit/lib/test_file.py:372

[nitpick] The parameter name `raises` is ambiguous; it holds an expected error message rather than an exception type. Consider renaming it to `error_match` or `expected_error` for clarity.

@pytest.mark.parametrize(
    "path,expected,raises",

src/datachain/lib/file.py:244

The function uses `os.path.normpath` but `os` is not imported, which will cause a `NameError`. Add `import os` at the top of the file.

normalized_path = os.path.normpath(normalized_path)

Copilot · 2025-05-20T02:46:17Z

src/datachain/lib/file.py

+        normalized_path = PurePath(path).as_posix()
+        normalized_path = os.path.normpath(normalized_path)


[nitpick] Using `PurePath` followed by `os.path.normpath` may lead to OS-dependent normalization (e.g., backslashes on Windows). Consider using `PurePosixPath` exclusively or clearly documenting that normalization follows the current OS conventions.

Suggested change

-        normalized_path = PurePath(path).as_posix()
-        normalized_path = os.path.normpath(normalized_path)
+        normalized_path = PurePosixPath(path)
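To make the OS-dependence concern concrete, here is a small standalone sketch (not code from this PR). It calls `posixpath` and `ntpath` directly, which are the modules `os.path` resolves to on POSIX and Windows respectively. It also shows that pure paths drop `.` segments but never collapse `..`, so `PurePosixPath(path)` on its own is not a drop-in replacement for the `normpath` call.

```python
import ntpath
import posixpath
from pathlib import PurePosixPath, PureWindowsPath

raw = "./foo/../bar/../file.ext"

# POSIX normalization collapses "." and ".." segments.
posix_norm = posixpath.normpath(raw)

# Windows normalization also rewrites "/" to "\\".
windows_norm = ntpath.normpath("dir/sub/../file.ext")

# Pure paths drop "." but keep ".." segments untouched.
pure_posix = PurePosixPath(raw).as_posix()

# The same string is interpreted differently by each flavour.
windows_as_posix = PureWindowsPath("dir\\file.ext").as_posix()
posix_backslash = PurePosixPath("dir\\file.ext").as_posix()

print(posix_norm, windows_norm, pure_posix, windows_as_posix, posix_backslash)
```

Using the flavour-specific modules explicitly keeps the result independent of the machine the code runs on.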

Copilot · 2025-05-20T02:46:17Z

src/datachain/lib/file.py

@@ -236,8 +236,23 @@ def validate_location(cls, v):

    @field_validator("path", mode="before")
    @classmethod
-    def validate_path(cls, path):
-        return Path(path).as_posix() if path else ""
+    def validate_path(cls, path: str) -> str:


[nitpick] This validator would benefit from a docstring summarizing its rules (empty → `''`, strip `./`, enforce relativity, disallow `..`) to aid future maintainers.

Suggested change

-    def validate_path(cls, path: str) -> str:
+    def validate_path(cls, path: str) -> str:
+        """
+        Validate the 'path' field to ensure it adheres to specific rules:
+        - Convert empty paths to an empty string ('').
+        - Normalize paths by stripping './'.
+        - Ensure paths are relative (disallow absolute paths starting with '/').
+        - Disallow paths containing '..' to prevent directory traversal.
+        """
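The four rules in the suggested docstring can be sketched as a standalone function. This is only an illustration of the rules as listed, not the validator from this PR; the name `validate_relative_path` and the error messages are made up for the example.

```python
from pathlib import PurePosixPath


def validate_relative_path(path: str) -> str:
    """Hypothetical helper applying the four rules from the docstring."""
    # Rule 1: empty (or None) input becomes an empty string.
    if not path:
        return ""

    pure = PurePosixPath(path)

    # Rule 3: absolute paths are rejected.
    if pure.is_absolute():
        raise ValueError(f"path must be relative: {path!r}")

    # Rule 4: any '..' segment is rejected rather than collapsed.
    if ".." in pure.parts:
        raise ValueError(f"path must not contain '..': {path!r}")

    # Rule 2: PurePosixPath already drops './' and duplicate slashes.
    return pure.as_posix()


print(validate_relative_path("./dir//file.ext"))
```

Rejecting `..` instead of collapsing it is one possible reading of the rules; the discussion further down the thread weighs normalizing against raising.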

shcheklein

can it be expensive? are we sure it won't be doing some syscalls underneath all these normalize, as_posix - etc

I'm worried it can slow down some bulk massive operations - like listing creation for some very rarely needed use case

dreadatour · 2025-05-20T02:53:50Z

can it be expensive? are we sure it won't be doing some syscalls underneath all these normalize, as_posix - etc

In this PR I am also replacing pathlib.Path with pathlib.PurePath. The main difference is that PurePath does not resolve files or perform any real filesystem operations; it only manipulates strings.

https://docs.python.org/3/library/pathlib.html

Path classes are divided between pure paths, which provide purely computational operations without I/O, and concrete paths, which inherit from pure paths but also provide I/O operations.

So it should not make any syscalls.
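A quick way to see the split described in the docs is to check which I/O methods exist on each class. This is a standalone sketch; the list of method names is just a sample chosen for illustration.

```python
from pathlib import Path, PurePath, PurePosixPath

# A sample of methods that touch the filesystem.
IO_METHODS = ("exists", "stat", "resolve", "open", "iterdir")

# Pure paths expose none of them; concrete paths expose all of them.
pure_has_io = [name for name in IO_METHODS if hasattr(PurePath, name)]
concrete_has_io = [name for name in IO_METHODS if hasattr(Path, name)]

# String-only manipulation works even though the path does not exist.
result = PurePosixPath("no/such/dir/./file.ext").as_posix()

print(pure_has_io, concrete_has_io, result)
```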

I'm worried it can slow down some bulk massive operations - like listing creation for some very rarely needed use case

Very good point IMO.

shcheklein · 2025-05-20T03:12:16Z

Very good point IMO.

anything we can do to do a very basic quick test? e.g. starts with . or .. or /? etc ... something really quick that would tell if we need to do a more complicated test?

dreadatour · 2025-05-20T03:24:01Z

anything we can do to do a very basic quick test? e.g. starts with . or .. or /? etc ... something really quick that would tell if we need to do a more complicated test?

$ python -m timeit -s "from datachain import File" "File.validate_path('./foo/../bar/../file.ext')"
100000 loops, best of 5: 3.14 usec per loop
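The quick pre-check idea could look roughly like the sketch below: a few cheap string tests decide whether the full normalization is needed at all. The function names `needs_full_check` and `normalize_path` are hypothetical, and whether the extra branch actually pays off compared to the ~3 µs measured above would need to be benchmarked.

```python
import posixpath


def needs_full_check(path: str) -> bool:
    """Cheap test for paths that may contain '.', '..', '//' or a trailing '/'."""
    return (
        path.startswith((".", "/"))
        or "/." in path  # covers '/./' and '/../' (and, harmlessly, '/.hidden')
        or "//" in path
        or path.endswith("/")
    )


def normalize_path(path: str) -> str:
    """Return the path unchanged on the fast path, normalized otherwise."""
    if not path:
        return ""
    if not needs_full_check(path):
        return path
    return posixpath.normpath(path)


print(normalize_path("dir/file.ext"))
print(normalize_path("./foo/../bar/../file.ext"))
```

The pre-check is deliberately over-inclusive: a false positive such as `dir/.hidden` only costs one `normpath` call and still returns the same string.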

dreadatour · 2025-05-20T03:28:38Z

I am more worried about modifying the file path (see tests in this PR).

Is it OK to modify the path (e.g. dir/../file.ext -> file.ext), or should we raise an exception if the path is not normalized?
I'd prefer the second option, to be honest, but it can break the user experience. At the same time, modifying the path is also not good for the user experience.

shcheklein · 2025-05-20T03:31:07Z

Is this OK to modify path

tbh, i think that's fine. why not? File is meant to be an actual precise object more or less / vs some random paths

dreadatour · 2025-05-20T03:34:25Z

Also, it looks like we should allow absolute paths; this is how it works now for the file:// source.

I am going to remove the check for absolute paths, but ideally we should also check source and allow an absolute path only if source.startswith("file://"), disallowing it in all other cases.

Another option is to move the root directory (even /) into the source field and always use relative paths in the path field for all sources, but this is not backward compatible and would require migrations.
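The source-dependent check mentioned above might be sketched as follows. This is not part of the PR (the absolute-path check is being removed here); `check_path_for_source` is a hypothetical helper, and the sample sources are placeholders.

```python
from pathlib import PurePosixPath


def check_path_for_source(source: str, path: str) -> None:
    """Allow absolute paths only for local 'file://' sources (assumed rule)."""
    if PurePosixPath(path).is_absolute() and not source.startswith("file://"):
        raise ValueError(
            f"absolute path {path!r} is not allowed for source {source!r}"
        )


# Absolute path with a local source passes silently.
check_path_for_source("file://", "/data/file.ext")

# Relative path with a cloud source passes silently.
check_path_for_source("s3://bucket", "dir/file.ext")
```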

dreadatour · 2025-05-20T03:37:08Z

Is this OK to modify path

tbh, i think that's fine. why not? File is meant to be an actual precise object more or less / vs some random paths

My only concern is that users will end up with normalized paths in the dataset even if they create a File signal with some random weird paths. On the other hand, you're right: this is "an actual precise object", and normalizing the path looks like the only option.

shcheklein · 2025-05-20T03:39:28Z

a completely alternative option - allow any path at all, don't validate

validate in prefetch instead (our code that deals with these files)

let users also deal with any custom logic that would like to put there

wdyt?

dreadatour · 2025-05-20T03:39:56Z

One more sanity check we can do is to verify that the path does not end with a trailing slash (/), which represents a directory, whereas the File model is for file objects.

dreadatour · 2025-05-20T03:42:31Z

a completely alternative option - allow any path at all, don't validate

validate in prefetch instead (our code that deals with these files)

let users also deal with any custom logic that would like to put there

Yes, this is what my concern was about when I was talking about "user experience".

wdyt?

I think this looks like a good option.

Should I add something like normpath method/property to the File model and check if we are using this new normalized path everywhere we are working with physical files in our codebase?
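A rough sketch of the "validate on usage" idea: the stored path is left exactly as the user provided it, and a separate accessor normalizes and checks it right before any physical file operation. The names `FileSketch`, `FileErrorSketch` and `get_path_normalized` are hypothetical and only illustrate the approach; the exact set of checks is an assumption.

```python
import posixpath
from dataclasses import dataclass


class FileErrorSketch(Exception):
    """Stand-in error type for this example."""


@dataclass
class FileSketch:
    source: str
    path: str  # stored as-is, never modified

    def get_path_normalized(self) -> str:
        """Normalize and validate the path at the point of use."""
        if not self.path:
            raise FileErrorSketch("path must not be empty")
        if self.path.endswith("/"):
            raise FileErrorSketch("path must not be a directory")

        normalized = posixpath.normpath(self.path)
        if normalized in (".", ".."):
            raise FileErrorSketch("path must not be a directory")
        if normalized.startswith("../"):
            raise FileErrorSketch("path must not contain '..'")
        return normalized


f = FileSketch(source="s3://bucket", path="dir/../file.ext")
print(f.path, "->", f.get_path_normalized())
```

With this split, datasets keep whatever path the user supplied, while caching and download code would call the normalizing accessor instead of reading `path` directly.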

shcheklein · 2025-05-20T03:50:13Z

Should I add something like normpath method/property to the File model and check if we are using this new normalized path everywhere we are working with physical files in our codebase?

yes, kinda use your validation logic

but also we need to put safeguards in UDFs, right? For local paths. E.g. do we even need prefetch if file is local?

cloudflare-workers-and-pages · 2025-05-20T16:40:43Z

Deploying datachain-documentation with Cloudflare Pages

Latest commit:	`f3bc04a`
Status:	✅ Deploy successful!
Preview URL:	https://11ab24ab.datachain-documentation.pages.dev
Branch Preview URL:	https://better-file-path-validation.datachain-documentation.pages.dev

dreadatour · 2025-05-20T16:43:09Z

E.g. do we even need prefetch if file is local?

We don't need prefetch or cache (? different disks/file systems?) if file is local, but I think this is a subject for separate PR?

shcheklein · 2025-05-20T16:45:03Z

if file is local, but I think this is a subject for separate PR?

yes, unless it is just easier to do that vs other types of safeguards (e.g. deny any local operations for prefetch / cache for now).

in reality it might be needed though in local mode - e.g. slow NAS mounted on some volume

codecov · 2025-05-21T04:47:29Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 88.67%. Comparing base (648cf44) to head (f3bc04a).

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1110      +/-   ##
==========================================
+ Coverage   88.65%   88.67%   +0.01%     
==========================================
  Files         148      148              
  Lines       12853    12870      +17     
  Branches     1810     1814       +4     
==========================================
+ Hits        11395    11412      +17     
  Misses       1033     1033              
  Partials      425      425

Flag	Coverage Δ
datachain	`88.60% <100.00%> (+0.01%)`	⬆️

Files with missing lines	Coverage Δ
src/datachain/cache.py	`88.75% <100.00%> (ø)`
src/datachain/client/fsspec.py	`92.66% <100.00%> (+0.02%)`	⬆️
src/datachain/client/local.py	`97.33% <100.00%> (ø)`
src/datachain/lib/arrow.py	`98.77% <100.00%> (ø)`
src/datachain/lib/file.py	`92.03% <100.00%> (+0.33%)`	⬆️
src/datachain/lib/tar.py	`100.00% <ø> (ø)`
src/datachain/lib/webdataset.py	`94.44% <100.00%> (ø)`

dreadatour · 2025-05-21T05:59:31Z

in reality it might be needed though in local mode - e.g. slow NAS mounted on some volume

Yes, exactly. This is still a local file, but cache/prefetch might be useful in this case. Users can control this with settings, so maybe we should leave it as is for now.

0x2b3bfa0

Looks good to me. 😅

shcheklein

What will users see when they run UDF with some bad files after this change?

dreadatour · 2025-05-29T15:23:40Z

What will users see when they run UDF with some bad files after this change?

They will see an exception:

datachain.lib.file.FileError: Error in file gs://datachain-test-vlad/.: path must not be a directory

There is another issue, not directly related: #1125

Improve file path validation

22a3c66

dreadatour requested review from shcheklein, 0x2b3bfa0 and Copilot May 20, 2025 02:44

dreadatour self-assigned this May 20, 2025

Copilot AI reviewed May 20, 2025

shcheklein reviewed May 20, 2025

Validate File.path on usage + add tests

038766d

dreadatour requested a review from shcheklein May 20, 2025 16:41

dreadatour added 2 commits May 21, 2025 11:42

Fix tests

f418a2e

Merge branch 'main' into better-file-path-validation

c922000

0x2b3bfa0 approved these changes May 21, 2025

shcheklein reviewed May 21, 2025

Merge branch 'main' into better-file-path-validation

f3bc04a
