Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: rename to_parquet to write_parquet #6909

Open
1 task done
bingbong-sempai opened this issue Aug 20, 2023 · 13 comments
Open
1 task done

feat: rename to_parquet to write_parquet #6909

bingbong-sempai opened this issue Aug 20, 2023 · 13 comments
Assignees
Labels
feature Features or general enhancements needs love Issues that need concerted focus and effort to address holistically

Comments

@bingbong-sempai
Copy link

Is your feature request related to a problem?

No response

Describe the solution you'd like

I know pairing read_ and to_ functions follows Pandas' style but I think read_ and write_ makes more sense (it's also what polars uses).
For example, read_parquet and write_parquet for reading/writing to disk, from_pandas and to_pandas for reading/writing to memory.

What version of ibis are you running?

6.1.0

What backend(s) are you using, if any?

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
@bingbong-sempai bingbong-sempai added the feature Features or general enhancements label Aug 20, 2023
@lostmygithubaccount
Copy link
Member

my personal opinion is not strictly opposed. though I don't like having a lot of aliases for methods, perhaps this is a good case (and thus wouldn't need to be a breaking change)

@bingbong-sempai
Copy link
Author

yeah i agree that too many aliases is a bad thing. it's just that pairing read_ and write_ (same with from_ and to_) feels like the "correct" thing to do.

@gforsyth
Copy link
Member

I agree that there's a nice consistency in using read_ and write_ for file-based things and from_ and to_ for memory-based things.
It's also going to break a lot of users. I'm not sure that's worth it. It also potentially leaves users thinking that we don't support writing to various filetypes because they're expecting it under to_ and not tab-completing out to write_.

There's also the issue that depending on the backend, we're sometimes uploading a file to, say, Snowflake, which isn't a cheap operation, vs. creating a view based only on parquet metadata, with DuckDB, which IS cheap.

Polars also has a scan_parquet method to distinguish this kind of behavior, but if we follow that convention we'd end up with a different API for a given backend, because one would only support read_ and the other would only support scan_.

Overall, I'm inclined to leave things as they are.

I am a -1 on adding an alias.

@bingbong-sempai
Copy link
Author

that's unfortunate, i was hoping ibis would be able to make API changes that pandas couldn't

@bingbong-sempai
Copy link
Author

bingbong-sempai commented Aug 22, 2023

an option can be to alias read_ to from_ instead?
from_ can cover both read and scan operations.
that way io could pair from_ and to_.

@gforsyth
Copy link
Member

Hey @bingbong-sempai -- I think we're going to stick with our current convention. Most users are probably coming from pandas and there are other libraries that follow the convention. There's also a benefit in seeing all of the targets for commands like read_ in one block when using tab-completion.
Hopefully you find other parts of Ibis to enjoy and this API choice isn't a blocker for you.

Going to close this out -- thanks!

@lostmygithubaccount lostmygithubaccount closed this as not planned Won't fix, can't repro, duplicate, stale Aug 23, 2023
@lostmygithubaccount
Copy link
Member

just to add on a bit -- write makes sense for files (CSV, parquet) but less sense for other things Ibis can output (PyArrow tables, PyArrow batches, pandas, torch, etc.)

@bingbong-sempai
Copy link
Author

Just to add, both Polars and DuckDB use write_parquet and not to_parquet.

@lostmygithubaccount
Copy link
Member

lostmygithubaccount commented Apr 25, 2024

not sure how we feel about it in general as the Ibis project, but I'm personally not opposed (similar to my earlier comment) to adding aliases for common spellings of existing APIs -- e.g. a write_parquet method that perhaps we don't even put on the website (or do and note it's just an alias of to_parquet)

I'll reopen this so it shows in our issue triage and we can discuss

@kszucs
Copy link
Member

kszucs commented Apr 26, 2024

My preference is to use write_<parquet|csv|...> which triggers computation. I would like to have lazy counterparts eventually which could use to_<format> prefix but no strong opinions there. I think methods triggering actions should be verbs.

@jcrist
Copy link
Member

jcrist commented Apr 26, 2024

I would like to have lazy counterparts eventually which could use to_<format> prefix but no strong opinions there.

Note that we already have to_pandas/to_pyarrow/to_polars which run eagerly and return an in memory result. I think those names are good, but mixing to_* methods that don't run eagerly in with them would be confusing IMO.

I'm not opposed to a convention of:

  • write_*: eager methods that compute and write the output somewhere
  • to_*: eager methods that compute and return some in-memory object

That said, I don't find the current convention of putting both of these under to_* bad. A user can always do expr.to_<TAB> to see all the things they can convert an expression into.

The only thing I'm against is having any aliases where both expr.to_parquet and expr.write_parquet exist and do the same thing. As a user I would find that confusing and would spend a bunch of time trying to figure out if/how the expr.to_* and expr.write_* methods differed. (to be clear, short lived aliases for deprecation/migration are fine).

@gforsyth gforsyth added the needs love Issues that need concerted focus and effort to address holistically label Apr 26, 2024
@kszucs kszucs self-assigned this Apr 26, 2024
@NickCrews
Copy link
Contributor

I think the fraction of data formats is really small for which in-memory and on-disk format really makes sense. Eg I could see both a write_json() and a to_json(), but I don't see how an in-memory representation of a CSV makes sense, you always are going to write this to disk. I think this is the case for most of our formats (parquet, delta, csv). Therefore, since for most formats we would either have a to_format() or a write_format() (and not both), I think it is not going to be obvious enough what the distinction is between the two, and mostly it's just going to appear inconsistent to users.

I think (no evidence) that most users when they see to_csv() they understand that something is getting written to disk and so some computation is happening. Being consistent with duckdb and Polars would be nice, but I'd rather be internally consistent and simple.

I think my preferred solution would be to_format(..., path: str | Path | None = NOT_SET, ...), where if the user explicitly passes a path of None, then the result gets returned in-memory (if possible for that format). The default should be a NOT_SET sentinel value so that if they forget to pass something explicitly they get an error.

@NickCrews
Copy link
Contributor

NickCrews commented Jan 19, 2025

I vote no adding alias write_ methods. Keep our code simple.

I vote on closing this out as not planned for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Features or general enhancements needs love Issues that need concerted focus and effort to address holistically
Projects
Status: backlog
Development

No branches or pull requests

6 participants