Fast native file system path abstraction #1

Closed
10 of 12 tasks
ForNeVeR opened this issue Apr 20, 2022 · 10 comments

Comments

@ForNeVeR
Owner

ForNeVeR commented Apr 20, 2022

So, we need an abstraction over a file system path which:

  • is able to parse a file system path on the local file system
  • guarantees that the path is absolute (or relative, or either)
  • is either a struct or a value object (immutable)
  • is normalized (i.e. no double or trailing separators, automatic stripping of UNC prefix on Windows)
  • provides a set of operators to append either a string or a relative path
  • should be fast (i.e. virtual methods should be avoided wherever we can live without them)
  • provides a set of basic operations:
    • canonicalize,
    • concatenate,
    • check if a file or a directory,
    • convert to a string.
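
To make the "normalized" requirement above concrete, here is a minimal sketch of separator normalization. It assumes '/' as the only separator, and the `PathNormalizationSketch` name is hypothetical. It also allocates a `StringBuilder`, so it only illustrates the invariant and not the allocation-free implementation the list asks for.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration; not part of any existing library.
public static class PathNormalizationSketch
{
    // Collapses runs of '/' into one separator and drops a trailing '/',
    // while keeping a lone root "/" intact.
    public static string Normalize(string path)
    {
        var builder = new StringBuilder(path.Length);
        foreach (var c in path)
        {
            var isDuplicateSeparator =
                c == '/' && builder.Length > 0 && builder[builder.Length - 1] == '/';
            if (!isDuplicateSeparator)
                builder.Append(c);
        }

        // Strip the trailing separator unless the whole path is just the root.
        if (builder.Length > 1 && builder[builder.Length - 1] == '/')
            builder.Length--;

        return builder.ToString();
    }
}
```

With this sketch, `Normalize("/usr//local/")` yields `/usr/local`, and `Normalize("/")` stays `/`.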

Open questions:

  • interning? (may be useful for normalized paths)
    • decided for now out of scope
  • definition of canonicalization? (to me, it's obvious we should convert paths to the right case on case-insensitive file systems, but file systems may behave their own way)
    • this is called normalization and now documented
@ForNeVeR ForNeVeR mentioned this issue Apr 20, 2022
9 tasks
@PaGrom

PaGrom commented Apr 20, 2022

Do we need to parse non-local file system paths? For example, parsing C:\Users on Unix.
What will the abstract root be? On Unix the root is always the same, but not on Windows. We probably need several abstract roots.
Do we need to compare relative paths from different file systems? How will we compare case-sensitive and case-insensitive paths?

I guess we have to come up with some examples.

@ForNeVeR
Owner Author

In my opinion, parsing non-local file system paths is a thing for #2.

But you've raised a good question, actually. I guess we'll need different "codecs" (or something else) for different file systems anyway, and if we're having it – then why not make them user-visible?

@PaGrom

PaGrom commented Apr 25, 2022

OK, for now we can forbid comparing paths that use different codecs.
Now, about parsing. How do we determine whether a path is absolute or relative if it starts with a slash on a Unix system? I think the user has to pass some enum value manually. But we don't need that for Windows paths, I guess. So our API can't be the same for all platforms.

Therefore I suggest developing support for a single platform (Windows?) at first.

@ForNeVeR
Owner Author

How do we determine whether a path is absolute or relative if it starts with a slash on a Unix system?

I would say that such a path is always an absolute one. Do you think there are cases when it's not?

Therefore I suggest developing support for a single platform (Windows?) at first.

Yep, I think we could start from that.

@PaGrom

PaGrom commented Apr 29, 2022

I would say that such a path is always an absolute one. Do you think there are cases when it's not?

Simple mistake like:

var path = "/dev/random";
var rootPath = "/dev";
var relativePath = path.Replace(rootPath, ""); // getting '/random'. Relative path but starts from slash

So, will we support wild-card paths?
C:\Users\*\**\*.* and so on.

Will we support \..\ segments? We can normalize them easily, but need to have information about them in our abstraction.

What will we do with restricted characters, such as those from Path.GetInvalidFileNameChars? I guess throwing an exception is enough.

There are DOS path specifiers (\\.\ and \\?\). Do we need to parse them too?

How to store parsed data in memory? I think the best way is a collection of ReadOnlySpan<char> to reduce memory allocations.

How to iterate through path's segments? There is no full tree or graph, so a simple 'flat' collection (Array or LinkedList) is enough.

@ForNeVeR
Owner Author

ForNeVeR commented May 1, 2022

Simple mistake like:

var path = "/dev/random";
var rootPath = "/dev";
var relativePath = path.Replace(rootPath, ""); // getting '/random'. Relative path but starts from slash

Please note that the default behavior of path-combining functions such as Path.Combine is to short-circuit on absolute paths. For example, Path.Combine(@"C:\Windows", @"C:\Users") returns @"C:\Users" (and the same happens on Unix, which is more questionable, I agree).

I would like to preserve this behavior by default.
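
The short-circuit behavior described above can be checked with a small snippet. It uses Unix-style paths because a path starting with a separator is treated as rooted by `Path.IsPathRooted` on both Windows and Unix; the `CombineShortCircuitDemo` class is only a hypothetical wrapper for the two calls.

```csharp
using System;
using System.IO;

public static class CombineShortCircuitDemo
{
    public static void Run()
    {
        // The second argument is rooted, so the first one is discarded.
        Console.WriteLine(Path.Combine("/dev", "/random")); // prints "/random"

        // The second argument is not rooted, so it is appended to the first one
        // using the platform's directory separator.
        Console.WriteLine(Path.Combine("/dev", "random"));
    }
}
```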

But (and that's a big but!) the main point of the library is to make paths more strongly typed, to avoid such issues. A prototype API I imagine looks like that:

// this is always absolute; will assert in the constructor
struct AbsolutePath
{
  // autodetect path kind, behave as Path.Combine if passed an absolute path
  public LocalPath operator / (string relativeOrAbsolute);
  
  // note no necessity of any kind of "autodetection"
  public LocalPath operator / (RelativePath relative);
}

// in this struct, we may allow to pass paths such as `/random` and convert them into `random`, perhaps not by default
// but via a ctor overload which will allow to pass various flags such as "TreatUnixRootedPathAsRelative"
struct RelativePath {}

This is all debatable, of course.
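
For illustration, here is one way the prototype could be turned into compilable C#, where operators have to be declared as static members with two operands. Everything beyond the type names and the `/` operator is an assumption: the `Value` property is hypothetical, `Path.IsPathRooted` is used as a simplified stand-in for a real "is absolute" check (on Windows, rooted and fully qualified are not the same thing), and the operators return `AbsolutePath` instead of the `LocalPath` type from the prototype, which is not defined in this thread.

```csharp
using System;
using System.IO;

// Hypothetical sketch: a relative path that rejects rooted input.
public readonly struct RelativePath
{
    public string Value { get; }

    public RelativePath(string value)
    {
        if (Path.IsPathRooted(value))
            throw new ArgumentException($"Path \"{value}\" is not relative.", nameof(value));
        Value = value;
    }

    public override string ToString() => Value;
}

// Hypothetical sketch: an absolute path that asserts in the constructor.
public readonly struct AbsolutePath
{
    public string Value { get; }

    public AbsolutePath(string value)
    {
        // Simplification: "rooted" is treated as "absolute" here.
        if (!Path.IsPathRooted(value))
            throw new ArgumentException($"Path \"{value}\" is not absolute.", nameof(value));
        Value = value;
    }

    // No autodetection is needed: the right operand is relative by construction.
    public static AbsolutePath operator /(AbsolutePath left, RelativePath right) =>
        new AbsolutePath(Path.Combine(left.Value, right.Value));

    // Behaves like Path.Combine: a rooted right operand replaces the left one.
    public static AbsolutePath operator /(AbsolutePath left, string relativeOrAbsolute) =>
        new AbsolutePath(Path.Combine(left.Value, relativeOrAbsolute));

    public override string ToString() => Value;
}
```

In this sketch, `new AbsolutePath("/dev") / "/random"` short-circuits to `/random`, while `new RelativePath("/random")` throws, so the mistake from the earlier comment surfaces at construction time instead of silently changing the result.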

So, will we support wild-card paths? C:\Users\*\**\*.* and so on.

As of now, this is not one of the main points of the library, but I have nothing against implementing that in the near future.

Will we support \..\ segments? We can normalize them easily, but need to have information about them in our abstraction.

I believe that we should normalize paths by default, but we may discuss.

What will we do with restricted characters, such as those from Path.GetInvalidFileNameChars? I guess throwing an exception is enough.

I think yes, just throw an exception from the path constructor.

There are DOS path specifiers (\\.\ and \\?\). Do we need to parse them too?

We should do something about them (and network paths, too), but it may be another "codec".

\\?\ I have plans to utilize automatically in certain cases (when converting paths to strings for WinAPI).

How to store parsed data in memory? I think the best way is a collection of ReadOnlySpan<char> to reduce memory allocations.

That's a good question. I'm not sure about ReadOnlySpan: it is a stack-only type, right? I don't think we should aim for that. Maybe ReadOnlyMemory is enough?
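
As a sketch of the difference: `ReadOnlySpan<char>` is a ref struct and cannot be stored in a field of an ordinary struct or class, while `ReadOnlyMemory<char>` can be stored and still hands out a span on demand. The `PathSegment` and `PathSegmentDemo` names below are hypothetical, and '/' is assumed to be the only separator.

```csharp
using System;

// Hypothetical segment type: keeps a slice of the original string without copying it.
public readonly struct PathSegment
{
    private readonly ReadOnlyMemory<char> _text;

    public PathSegment(ReadOnlyMemory<char> text) => _text = text;

    // A span view for allocation-free processing.
    public ReadOnlySpan<char> Span => _text.Span;

    // The string-based view allocates a new string for the slice.
    public override string ToString() => _text.ToString();
}

public static class PathSegmentDemo
{
    // Returns the part after the last '/' as a slice of the original string.
    public static PathSegment LastSegment(string path)
    {
        var index = path.LastIndexOf('/');
        return new PathSegment(path.AsMemory(index + 1));
    }
}
```

Here `PathSegmentDemo.LastSegment("/dev/random").ToString()` gives `random`, and only the `ToString` call allocates.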

In ReSharper, there's an interning system for their FileSystemPath which reduces allocations a lot. We could think about such a system, too.

How to iterate through path's segments?

I guess, in this case, we can have an API like AbsolutePath.ForEach(Action<ReadOnlySpan<char>>).
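
A minimal sketch of such a callback-based enumeration follows. Note that, at least before C# 13, a ref struct such as `ReadOnlySpan<char>` cannot be used as a generic type argument, so `Action<ReadOnlySpan<char>>` would not compile there; the sketch uses a custom delegate instead. The `SegmentAction` and `PathSegments` names are hypothetical, the method takes a plain string rather than an `AbsolutePath`, and '/' is assumed to be the only separator.

```csharp
using System;

// Custom delegate type, since ReadOnlySpan<char> may not be usable
// as a generic type argument on older compilers.
public delegate void SegmentAction(ReadOnlySpan<char> segment);

public static class PathSegments
{
    // Invokes the action for every non-empty segment between '/' separators
    // without allocating substrings.
    public static void ForEach(string path, SegmentAction action)
    {
        var remaining = path.AsSpan();
        while (!remaining.IsEmpty)
        {
            var separatorIndex = remaining.IndexOf('/');
            var segment = separatorIndex < 0
                ? remaining
                : remaining.Slice(0, separatorIndex);

            if (!segment.IsEmpty)
                action(segment);

            // Continue after the separator, or stop if there was none.
            remaining = separatorIndex < 0
                ? ReadOnlySpan<char>.Empty
                : remaining.Slice(separatorIndex + 1);
        }
    }
}
```

For example, `PathSegments.ForEach("/usr/local/bin", s => Console.WriteLine(s.ToString()))` visits `usr`, `local`, and `bin` in order.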

@PaGrom

PaGrom commented May 5, 2022

I would like to preserve this behavior by default.

To preserve it, we can use such functions internally, can't we?

public LocalPath operator / (string relativeOrAbsolute);

Wow, I like it! Using the / operator is a great idea!

I believe that we should normalize paths by default, but we may discuss.

Yep, I tried to imagine a use case for storing info about '..' but actually couldn't. So OK, let's normalize.

We should do something about them (and network paths, too), but it may be another "codec".

How do you suggest choosing a specific codec? Implicitly or explicitly?

Maybe ReadOnlyMemory is enough?

Yes, agree with you.

In ReSharper, there's an interning system for their FileSystemPath which reduces allocations a lot. We could think about such a system, too.

How does it work? Could you show docs or something, please?

I guess, in this case, we can have an API like AbsolutePath.ForEach(Action<ReadOnlySpan<char>>).

Yep, or implement IEnumerable.

How to parse a raw path? Should we use regex? I think a simple 'Split' is not enough (because of \\, \\?\, \\.\). Or we can implement our own state machine (the way regex works) or something like that.

@ForNeVeR
Owner Author

ForNeVeR commented May 7, 2022

I would like to preserve this behavior by default.

To preserve it, we can use such functions internally, can't we?

We can use them, of course (though this probably won't be in line with the "zero-alloc" approach), or we can reimplement them on our own. In my opinion, we were only discussing the behavior here, not the implementation.

We should do something about them (and network paths, too), but it may be another "codec".

How do you suggest choosing a specific codec? Implicitly or explicitly?

This is something open for discussion. I have in mind the implementation of so-called "interaction contexts" from ReSharper (where each path gets its own "interaction context" and will parse the paths accordingly), but this isn't set in stone.

In ReSharper, there's an interning system for their FileSystemPath which reduces allocations a lot. We could think about such a system, too.

How does it work? Could you show docs or something, please?

This isn't documented (and is thus subject to change), but basically it works like this: there's a static interning cache, where the key is the path passed to a method FileSystemPath.Parse("foo", InternStrategy.{Intern/DoNotIntern/TRY_GET_INTERNED_BUT_DO_NOT_INTERN}).

So, this is only optimized for cases when you pass the same path to FileSystemPath.Parse a lot.

Note that the keys in the cache are before canonicalization.

I'm not saying we should do something like this, but this is a possibility.
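
To make the idea easier to discuss, here is a rough sketch of an interning cache keyed by the raw string, before any canonicalization. This is not ReSharper's implementation: the `ParsedPath` type, the `Normalized` property, the enum member names, and the placeholder normalization are all assumptions made for illustration.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical strategy names modeled on the description above.
public enum InternStrategy
{
    Intern,
    DoNotIntern,
    TryGetInternedButDoNotIntern
}

// Hypothetical path type with a static interning cache.
public sealed class ParsedPath
{
    // Keys are the raw strings as passed to Parse, before normalization.
    private static readonly ConcurrentDictionary<string, ParsedPath> Cache =
        new ConcurrentDictionary<string, ParsedPath>();

    public string Normalized { get; }

    private ParsedPath(string normalized) => Normalized = normalized;

    public static ParsedPath Parse(string raw, InternStrategy strategy)
    {
        switch (strategy)
        {
            case InternStrategy.Intern:
                // Reuse the cached instance or create and remember a new one.
                return Cache.GetOrAdd(raw, r => new ParsedPath(Normalize(r)));
            case InternStrategy.TryGetInternedButDoNotIntern:
                // Reuse the cached instance if present, but never grow the cache.
                return Cache.TryGetValue(raw, out var cached)
                    ? cached
                    : new ParsedPath(Normalize(raw));
            default:
                return new ParsedPath(Normalize(raw));
        }
    }

    // Placeholder normalization: strips a single trailing '/' only.
    private static string Normalize(string raw) =>
        raw.Length > 1 && raw.EndsWith("/", StringComparison.Ordinal)
            ? raw.Substring(0, raw.Length - 1)
            : raw;
}
```

Because the keys are raw strings, `"/a"` and `"/a/"` end up as two separate cache entries in this sketch even though they normalize to the same path, which is why the cache only helps when the same string is parsed repeatedly.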

How to parse a raw path? Should we use regex? I think a simple 'Split' is not enough (because of \\, \\?\, \\.\). Or we can implement our own state machine (the way regex works) or something like that.

In any case, this is a very simple routine with linear complexity (until we start considering various weird path parameters like partial case-sensitivity). So, any simple implementation would work, provided it does no unnecessary allocations.
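
As an example of how small such a routine can be, the sketch below classifies the leading prefix of a Windows path with a few ordinal comparisons and no allocations. The enum and its member names are hypothetical, and the sketch deliberately ignores further cases such as `\\?\UNC\`.

```csharp
using System;

// Hypothetical classification of the leading part of a Windows path.
public enum WindowsPathPrefix
{
    None,           // e.g. C:\Users or a relative path
    Unc,            // \\server\share
    Device,         // \\.\
    ExtendedLength  // \\?\
}

public static class PrefixParser
{
    public static WindowsPathPrefix Classify(ReadOnlySpan<char> path)
    {
        // The longer prefixes are checked first because they also start with "\\".
        if (path.StartsWith(@"\\?\".AsSpan(), StringComparison.Ordinal))
            return WindowsPathPrefix.ExtendedLength;
        if (path.StartsWith(@"\\.\".AsSpan(), StringComparison.Ordinal))
            return WindowsPathPrefix.Device;
        if (path.StartsWith(@"\\".AsSpan(), StringComparison.Ordinal))
            return WindowsPathPrefix.Unc;
        return WindowsPathPrefix.None;
    }
}
```

The rest of the path could then be scanned separator by separator in the same linear fashion, starting after the detected prefix.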

I think, by default our path should store a canonicalized path string inside of itself, and send parts of it when requested by APIs that enumerate its components. Whether we should add anything to work with Memory<char> I'm not sure yet. Maybe default to Memory and add string-based overloads?

@PaGrom

PaGrom commented May 10, 2022

This isn't documented (and is thus subject to change), but basically it works like this: there's a static interning cache, where the key is the path passed to a method FileSystemPath.Parse("foo", InternStrategy.{Intern/DoNotIntern/TRY_GET_INTERNED_BUT_DO_NOT_INTERN}).

So, this is only optimized for cases when you pass the same path to FileSystemPath.Parse a lot.

Note that the keys in the cache are before canonicalization.

I'm not saying we should do something like this, but this is a possibility.

Great idea, but I don't understand how it should be implemented. I don't think it is a first-priority task. We should create an issue and discuss it later.

Maybe default to Memory and add string-based overloads?

Yep, string-based overloads with Memory<char>.ToString calls I guess.

@ForNeVeR
Owner Author

I am closing this issue as mostly implemented, and extracting the remaining parts to a set of separate, more focused issues.

ForNeVeR pushed a commit that referenced this issue May 3, 2024
(#42) IPath: introduce a generic interface

2 participants