duckplyr 1.0.0 #724
Merged
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,233 @@ | ||
--- | ||
output: hugodown::hugo_document | ||
|
||
slug: duckplyr-1-1-0 | ||
title: duckplyr fully joins the tidyverse! | ||
date: 2025-06-19 | ||
author: Kirill Müller and Maëlle Salmon | ||
description: > | ||
duckplyr 1.1.0 is on CRAN! | ||
A drop-in replacement for dplyr, powered by DuckDB for speed. | ||
It is the most dplyr-like of dplyr backends. | ||
|
||
photo: | ||
url: https://www.pexels.com/photo/a-mallard-duck-on-water-6918877/ | ||
author: Kiril Gruev | ||
|
||
# one of: "deep-dive", "learn", "package", "programming", "roundup", or "other" | ||
categories: [package] | ||
tags: | ||
- duckplyr | ||
- dplyr | ||
- tidyverse | ||
--- | ||
|
||
```{r include = FALSE} | ||
options( | ||
pillar.min_title_chars = 20, | ||
pillar.max_footer_lines = 7, | ||
pillar.bold = TRUE | ||
) | ||
options(conflicts.policy = list(warn = FALSE)) | ||
library(conflicted) | ||
conflict_prefer("filter", "dplyr", quiet = TRUE) | ||
``` | ||
|
||
|
||
We're well chuffed to announce the release of [duckplyr](https://duckplyr.tidyverse.org) 1.1.0. | ||
This is a dplyr backend powered by [DuckDB](https://duckdb.org/), a fast in-memory analytical database system[^duckdb]. | ||
duckplyr uses the power of DuckDB for impressive performance where it can, and seamlessly falls back to R where it can't. | ||
You can install it from CRAN with: | ||
|
||
[^duckdb]: If you haven't heard of it yet, watch [Hannes Mühleisen's keynote at posit::conf(2024)](https://www.youtube.com/watch?v=GELhdezYmP0&feature=youtu.be). | ||
|
||
```{r, eval = FALSE} | ||
install.packages("duckplyr") | ||
``` | ||
|
||
This article shows how duckplyr can be used instead of dplyr, explains how you can help improve the package, and shares a selection of further resources. | ||
|
||
## A drop-in replacement for dplyr | ||
|
||
Imagine you have to wrangle a huge dataset, like this one from the [TPC-H benchmark](https://duckdb.org/2024/04/02/duckplyr.html#benchmark-tpc-h-q1), a famous database benchmarking dataset. | ||
|
||
```{r} | ||
lineitem_tbl <- duckdb:::sql( | ||
"INSTALL tpch; LOAD tpch; CALL dbgen(sf=1); FROM lineitem;" | ||
) | ||
lineitem_tbl <- tibble::as_tibble(lineitem_tbl) | ||
dplyr::glimpse(lineitem_tbl) | ||
``` | ||
|
||
To work with this in duckplyr instead of dplyr, all you need to do is load duckplyr: | ||
|
||
```{r} | ||
library(duckplyr) | ||
``` | ||
|
||
Now we can express the well-known (at least in the database community!) "TPC-H benchmark query 1" in dplyr syntax and execute it in DuckDB via duckplyr. | ||
|
||
```{r} | ||
tpch_dplyr <- function(lineitem) { | ||
lineitem |> | ||
filter(l_shipdate <= !!as.Date("1998-09-02")) |> | ||
summarise( | ||
sum_qty = sum(l_quantity), | ||
sum_base_price = sum(l_extendedprice), | ||
sum_disc_price = sum(l_extendedprice * (1 - l_discount)), | ||
sum_charge = sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)), | ||
avg_qty = mean(l_quantity), | ||
avg_price = mean(l_extendedprice), | ||
avg_disc = mean(l_discount), | ||
count_order = n(), | ||
.by = c(l_returnflag, l_linestatus) | ||
) |> | ||
arrange(l_returnflag, l_linestatus) | ||
} | ||
|
||
tpch_dplyr(lineitem_tbl) | ||
``` | ||
|
||
Like other dplyr backends such as dtplyr and dbplyr, duckplyr gives you higher performance without learning a different syntax. | ||
Unlike other dplyr backends, duckplyr does not require you to change existing code or learn specific idiosyncrasies. | ||
Not only is the syntax the same, the semantics are too! | ||
If an operation cannot be carried out with DuckDB, it is automatically outsourced to dplyr. | ||
Over time, we expect fewer and fewer fallbacks to dplyr to be needed. | ||
|
||
## How to use duckplyr | ||
|
||
There are two ways to use duckplyr: | ||
|
||
- As above, you can `library(duckplyr)`, and replace all existing dplyr methods. This is safe because duckplyr is guaranteed to give exactly the same results as dplyr, unlike other backends. | ||
|
||
- Create individual "duck frames" using _conversion functions_ like `duckplyr::duckdb_tibble()` or `duckplyr::as_duckdb_tibble()`, or _ingestion functions_ like `duckplyr::read_csv_duckdb()`. | ||
|
||
Here's an example of the second form: | ||
|
||
```{r} | ||
out <- lineitem_tbl |> | ||
duckplyr::as_duckdb_tibble() |> | ||
tpch_dplyr() | ||
|
||
out | ||
``` | ||
|
||
Note that the resulting object is indistinguishable from a regular tibble, except for the additional class. | ||
|
||
```{r} | ||
typeof(out) | ||
class(out) | ||
out$count_order | ||
``` | ||
|
||
Operations not yet supported by duckplyr are automatically outsourced to dplyr. | ||
For instance, filtering on grouped data is not supported, but it still works thanks to the fallback mechanism. | ||
By default, the fallback is silent, but you can make it visible by setting an environment variable. | ||
This is useful if you want to better understand what's making your code slow. | ||
|
||
```{r} | ||
Sys.setenv(DUCKPLYR_FALLBACK_INFO = TRUE) | ||
|
||
lineitem_tbl |> | ||
duckplyr::as_duckdb_tibble() |> | ||
filter(l_quantity == max(l_quantity), .by = c(l_returnflag, l_linestatus)) | ||
``` | ||
|
||
You can also directly use DuckDB functions with the `dd$` qualifier. | ||
Functions with this prefix will not be translated at all and passed through directly to DuckDB. | ||
For example, the following code uses DuckDB's internal implementation of [Levenshtein distance](https://duckdb.org/docs/stable/sql/functions/text.html#editdist3s1-s2): | ||
|
||
```{r} | ||
tibble(a = "dbplyr", b = "duckplyr") |> | ||
mutate(c = dd$levenshtein(a, b)) | ||
``` | ||
|
||
See `vignette("duckdb")` for more information on these features. | ||
|
||
If you're working with dbplyr too, you can use `as_tbl()` to convert a duckplyr tibble to a dbplyr lazy table. | ||
This allows you to seamlessly interact with existing code that might use inline SQL or other dbplyr functionality. | ||
With `as_duckdb_tibble()`, you can convert a dbplyr lazy table to a duckplyr tibble. | ||
Both operations work without intermediate materialization. | ||
|
||
## Benchmark | ||
|
||
duckplyr is often much faster than dplyr. | ||
The comparison below is done in a fresh R session where dplyr is attached but duckplyr is not. | ||
|
||
```{r include = FALSE} | ||
# Undo the effect of library(duckplyr) | ||
methods_restore() | ||
``` | ||
|
||
We use `tpch_dplyr()` as defined above to run the query with dplyr. | ||
The function that runs it with duckplyr only wraps the input data in a duck frame and forwards it to the dplyr function. | ||
The `collect()` at the end is required only for this benchmark to ensure fairness.[^collect] | ||
|
||
[^collect]: If omitted, the results would be unchanged but the measurements would be wrong. The computation would then be triggered by the check. See `vignette("prudence")` for details. | ||
|
||
```{r} | ||
tpch_duckplyr <- function(lineitem) { | ||
lineitem |> | ||
duckplyr::as_duckdb_tibble() |> | ||
tpch_dplyr() |> | ||
collect() | ||
} | ||
``` | ||
|
||
And now we compare the two: | ||
|
||
```{r} | ||
bench::mark( | ||
tpch_dplyr(lineitem_tbl), | ||
tpch_duckplyr(lineitem_tbl), | ||
check = ~ all.equal(.x, .y, tolerance = 1e-10) | ||
) | ||
``` | ||
|
||
In this example, duckplyr is a lot faster than dplyr. | ||
It also appears to use much less memory, but this is misleading: DuckDB manages the memory, not R, so the memory usage is not visible to `bench::mark()`. | ||
|
||
## Out-of-memory data | ||
|
||
As well as improved speed with in-memory datasets, duckplyr makes it easy to work with datasets that are too big to fit in memory. | ||
In this case, you want: | ||
|
||
1. To work with data stored in modern formats designed for large data (e.g. Parquet). | ||
1. To be able to store large intermediate results on disk, keeping them out of memory. | ||
1. Fast computation! | ||
|
||
duckplyr provides each of these features: | ||
|
||
1. You can read data from disk with functions like `read_parquet_duckdb()`. | ||
1. You can save intermediate results to disk with `compute_parquet()` and `compute_csv()`. | ||
1. duckplyr takes advantage of DuckDB's query planner which considers your entire pipeline holistically to figure out the most efficient way to get the data you need. | ||
|
||
See `vignette("large")` for a walkthrough and more details. | ||
|
||
## Help us improve duckplyr! | ||
|
||
Our goals for future development of duckplyr include: | ||
|
||
- Enabling users to provide [custom translations](https://github.com/tidyverse/duckplyr/issues/158) of dplyr functionality; | ||
- Making it easier to contribute code to duckplyr; | ||
- Supporting more dplyr and tidyr functionality natively in DuckDB. | ||
|
||
You can help! | ||
|
||
- Please report any issues, especially regarding unknown incompatibilities. See `vignette("limits")`. | ||
- Contribute to the codebase after reading duckplyr's [contributing guide](https://duckplyr.tidyverse.org/CONTRIBUTING.html). | ||
- Turn on telemetry to help us hear about the most frequent fallbacks so we can prioritize working on the corresponding missing dplyr translation. See `vignette("telemetry")` and `duckplyr::fallback_sitrep()`. | ||
|
||
## Additional resources | ||
|
||
Eager to learn more about duckplyr, beyond trying it out yourself? | ||
The duckplyr website features several [articles](https://duckplyr.tidyverse.org/articles/). | ||
Furthermore, the blog post ["duckplyr: dplyr Powered by DuckDB"](https://duckdb.org/2024/04/02/duckplyr.html) by Hannes Mühleisen provides some context on duckplyr including its inner workings, as also seen in a [section](https://blog.r-hub.io/2025/02/13/lazy-meanings/#duckplyr-lazy-evaluation-and-prudence) of the R-hub blog post ["Lazy introduction to laziness in R"](https://blog.r-hub.io/2025/02/13/lazy-meanings/) by Maëlle Salmon, Athanasia Mo Mowinckel and Hannah Frick. | ||
|
||
## Acknowledgements | ||
|
||
A big thanks to all folks who filed issues, created PRs and generally helped to improve duckplyr and its workhorse [duckdb](https://r.duckdb.org/)! | ||
|
||
[@adamschwing](https://github.com/adamschwing), [@alejandrohagan](https://github.com/alejandrohagan), [@andreranza](https://github.com/andreranza), [@apalacio9502](https://github.com/apalacio9502), [@apsteinmetz](https://github.com/apsteinmetz), [@barracuda156](https://github.com/barracuda156), [@beniaminogreen](https://github.com/beniaminogreen), [@bob-rietveld](https://github.com/bob-rietveld), [@brichards920](https://github.com/brichards920), [@cboettig](https://github.com/cboettig), [@davidjayjackson](https://github.com/davidjayjackson), [@DavisVaughan](https://github.com/DavisVaughan), [@Ed2uiz](https://github.com/Ed2uiz), [@eitsupi](https://github.com/eitsupi), [@era127](https://github.com/era127), [@etiennebacher](https://github.com/etiennebacher), [@eutwt](https://github.com/eutwt), [@fmichonneau](https://github.com/fmichonneau), [@hadley](https://github.com/hadley), [@hannes](https://github.com/hannes), [@hawkfish](https://github.com/hawkfish), [@IndrajeetPatil](https://github.com/IndrajeetPatil), [@JanSulavik](https://github.com/JanSulavik), [@JavOrraca](https://github.com/JavOrraca), [@jeroen](https://github.com/jeroen), [@jhk0530](https://github.com/jhk0530), [@joakimlinde](https://github.com/joakimlinde), [@JosiahParry](https://github.com/JosiahParry), [@kevbaer](https://github.com/kevbaer), [@larry77](https://github.com/larry77), [@lnkuiper](https://github.com/lnkuiper), [@lorenzwalthert](https://github.com/lorenzwalthert), [@lschneiderbauer](https://github.com/lschneiderbauer), [@luisDVA](https://github.com/luisDVA), [@math-mcshane](https://github.com/math-mcshane), [@meersel](https://github.com/meersel), [@multimeric](https://github.com/multimeric), [@mytarmail](https://github.com/mytarmail), [@nicki-dese](https://github.com/nicki-dese), [@PMassicotte](https://github.com/PMassicotte), [@prasundutta87](https://github.com/prasundutta87), [@rafapereirabr](https://github.com/rafapereirabr), [@Robinlovelace](https://github.com/Robinlovelace), 
[@romainfrancois](https://github.com/romainfrancois), [@sparrow925](https://github.com/sparrow925), [@stefanlinner](https://github.com/stefanlinner), [@szarnyasg](https://github.com/szarnyasg), [@thomasp85](https://github.com/thomasp85), [@TimTaylor](https://github.com/TimTaylor), [@Tmonster](https://github.com/Tmonster), [@toppyy](https://github.com/toppyy), [@wibeasley](https://github.com/wibeasley), [@yjunechoe](https://github.com/yjunechoe), [@ywhcuhk](https://github.com/ywhcuhk), [@zhjx19](https://github.com/zhjx19), [@ablack3](https://github.com/ablack3), [@actuarial-lonewolf](https://github.com/actuarial-lonewolf), [@ajdamico](https://github.com/ajdamico), [@amirmazmi](https://github.com/amirmazmi), [@anderson461123](https://github.com/anderson461123), [@andrewGhazi](https://github.com/andrewGhazi), [@Antonov548](https://github.com/Antonov548), [@appiehappie999](https://github.com/appiehappie999), [@ArthurAndrews](https://github.com/ArthurAndrews), [@arthurgailes](https://github.com/arthurgailes), [@babaknaimi](https://github.com/babaknaimi), [@bcaradima](https://github.com/bcaradima), [@bdforbes](https://github.com/bdforbes), [@bergest](https://github.com/bergest), [@bill-ash](https://github.com/bill-ash), [@BorgeJorge](https://github.com/BorgeJorge), [@brianmsm](https://github.com/brianmsm), [@chainsawriot](https://github.com/chainsawriot), [@ckarnes](https://github.com/ckarnes), [@clementlefevre](https://github.com/clementlefevre), [@cregouby](https://github.com/cregouby), [@cy-james-lee](https://github.com/cy-james-lee), [@daranzolin](https://github.com/daranzolin), [@david-cortes](https://github.com/david-cortes), [@DavZim](https://github.com/DavZim), [@denis-or](https://github.com/denis-or), [@developertest1234](https://github.com/developertest1234), [@dicorynia](https://github.com/dicorynia), [@dsolito](https://github.com/dsolito), [@e-kotov](https://github.com/e-kotov), [@EAVWing](https://github.com/EAVWing), 
[@eddelbuettel](https://github.com/eddelbuettel), [@edward-burn](https://github.com/edward-burn), [@elefeint](https://github.com/elefeint), [@eli-daniels](https://github.com/eli-daniels), [@elysabethpc](https://github.com/elysabethpc), [@erikvona](https://github.com/erikvona), [@florisvdh](https://github.com/florisvdh), [@gaborcsardi](https://github.com/gaborcsardi), [@ggrothendieck](https://github.com/ggrothendieck), [@hdmm3](https://github.com/hdmm3), [@hope-data-science](https://github.com/hope-data-science), [@IoannaNika](https://github.com/IoannaNika), [@jabrown-aepenergy](https://github.com/jabrown-aepenergy), [@JamesLMacAulay](https://github.com/JamesLMacAulay), [@jangorecki](https://github.com/jangorecki), [@javierlenzi](https://github.com/javierlenzi), [@Joe-Heffer-Shef](https://github.com/Joe-Heffer-Shef), [@kalibera](https://github.com/kalibera), [@lboller-pwbm](https://github.com/lboller-pwbm), [@lgaborini](https://github.com/lgaborini), [@m-muecke](https://github.com/m-muecke), [@meztez](https://github.com/meztez), [@mgirlich](https://github.com/mgirlich), [@mtmorgan](https://github.com/mtmorgan), [@nassuphis](https://github.com/nassuphis), [@nbc](https://github.com/nbc), [@olivroy](https://github.com/olivroy), [@pdet](https://github.com/pdet), [@phdjsep](https://github.com/phdjsep), [@pierre-lamarche](https://github.com/pierre-lamarche), [@r2evans](https://github.com/r2evans), [@ran-codes](https://github.com/ran-codes), [@rplsmn](https://github.com/rplsmn), [@Saarialho](https://github.com/Saarialho), [@SimonCoulombe](https://github.com/SimonCoulombe), [@tau31](https://github.com/tau31), [@thohan88](https://github.com/thohan88), [@ThomasSoeiro](https://github.com/ThomasSoeiro), [@timothygmitchell](https://github.com/timothygmitchell), [@vincentarelbundock](https://github.com/vincentarelbundock), [@VincentGuyader](https://github.com/VincentGuyader), [@wlangera](https://github.com/wlangera), [@xbasics](https://github.com/xbasics), 
[@xiaodaigh](https://github.com/xiaodaigh), [@xtimbeau](https://github.com/xtimbeau), [@yng-me](https://github.com/yng-me), [@Yousuf28](https://github.com/Yousuf28), [@yutannihilation](https://github.com/yutannihilation), and [@zcatav](https://github.com/zcatav) | ||
|
||
Special thanks to Joe Thorley ([@joethorley](https://github.com/joethorley)) for help with choosing the right words. |
Just curious why is the date injected? If possible it'd be better not to use `!!` in a large audience blog post as it tends to confuse users, and in this case it's especially confusing because it doesn't seem necessary? Correct me if I'm wrong, I'm not familiar with duckplyr :)

Thanks. In this particular case, it might be necessary indeed, I'm not sure we have a translation for `as.Date()`. I prefer `!!` over `.env` in my code, I think I see the downsides and the potential for confusion. I don't have a strong opinion here, we can use whatever works for the blog post.
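
For reference, the spellings discussed here can be compared side by side. This is only a minimal sketch in plain dplyr, assuming a tiny made-up table (`lineitem_small` and `cutoff` are hypothetical names, not from the post). It shows that the three forms select the same rows in dplyr. It does not check which of them duckplyr translates to DuckDB without a fallback.

```r
library(dplyr)

# Hypothetical three-row stand-in for the lineitem table, for illustration only
lineitem_small <- tibble(
  l_shipdate = as.Date(c("1998-08-01", "1998-09-02", "1998-10-15")),
  l_quantity = c(10, 20, 30)
)

# 1. Injection with !!: as.Date() is evaluated in R first,
#    so the filter expression only contains a constant Date value
by_injection <- lineitem_small |>
  filter(l_shipdate <= !!as.Date("1998-09-02"))

# 2. The .env pronoun: the cutoff is computed beforehand and
#    looked up explicitly in the calling environment
cutoff <- as.Date("1998-09-02")
by_env <- lineitem_small |>
  filter(l_shipdate <= .env$cutoff)

# 3. A plain variable: works as long as no column is also called `cutoff`
by_variable <- lineitem_small |>
  filter(l_shipdate <= cutoff)

# All three spellings give the same result in plain dplyr
stopifnot(
  identical(by_injection, by_env),
  identical(by_env, by_variable)
)
```

The second and third forms avoid `!!` at the cost of one extra assignment. Whether they behave the same way on a duck frame would need to be checked with `DUCKPLYR_FALLBACK_INFO` turned on.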