9 changes: 8 additions & 1 deletion man/data.table.Rd
@@ -111,7 +111,7 @@ data.table(\dots, keep.rownames=FALSE, check.names=FALSE, key=NULL, stringsAsFac
\item or of the form \code{startcol:endcol}: e.g., \code{DT[, sum(a), by=x:z]}
}

\emph{Advanced:} When \code{i} is a \code{list} (or \code{data.frame} or \code{data.table}), \code{DT[i, j, by=.EACHI]} evaluates \code{j} for the groups in \code{DT} that each row in \code{i} joins to. That is, you can join (in \code{i}) and aggregate (in \code{j}) simultaneously. We call this \emph{grouping by each i}. See \href{https://stackoverflow.com/a/27004566/559784}{this StackOverflow answer} for a more detailed explanation until we \href{https://github.com/Rdatatable/data.table/issues/944}{roll out vignettes}.
\emph{Advanced:} When \code{i} is a \code{list} (or \code{data.frame} or \code{data.table}), \code{DT[i, j, by=.EACHI]} evaluates \code{j} for the groups in \code{DT} that each row in \code{i} joins to. That is, you can join (in \code{i}) and aggregate (in \code{j}) simultaneously. We call this \emph{grouping by each i}. Note that for rows in \code{i} with no match, the group of matching rows in \code{x} is empty. Special symbols that operate on rows (e.g., \code{.I} or \code{.N}) will therefore evaluate to \code{0} for such groups. This differs from selecting a column from \code{x} (e.g., \code{x$col}), which results in \code{NA} as governed by the \code{nomatch} argument. See \href{https://stackoverflow.com/a/27004566/559784}{this StackOverflow answer} for a more detailed explanation until we \href{https://github.com/Rdatatable/data.table/issues/944}{roll out vignettes}.

\emph{Advanced:} In the \code{X[Y, j]} form of grouping, the \code{j} expression sees variables in \code{X} first, then \code{Y}. We call this \emph{join inherited scope}. If the variable is not in \code{X} or \code{Y} then the calling frame is searched, its calling frame, and so on in the usual way up to and including the global environment.}

@@ -320,6 +320,13 @@ DT[!"a", sum(v), by=.EACHI, on="x"] # same, but using subsets-as-joins
DT[c("b","c"), sum(v), by=.EACHI, on="x"] # same
DT[c("b","c"), sum(v), by=.EACHI, on=.(x)] # same, using on=.()

# Why .I is 0 for non-matching rows with by=.EACHI:
d1 = data.table(v = c("A", "B", "C", "A", "C"), val = 1:5)
d2 = data.table(v = c("D", "A", "G", "C"))
# Selecting the column 'val' returns NA for non-matches, per nomatch=NA
d1[d2, on = .(v), .(val), by = .EACHI]
# The special symbol .I evaluates to 0 for non-matches
d1[d2, on = .(v), .I, by = .EACHI]

# joins as subsets
X = data.table(x=c("c","b"), v=8:7, foo=c(4,2))
X
36 changes: 36 additions & 0 deletions vignettes/datatable-joins.Rmd
@@ -259,6 +259,42 @@ dt2 = ProductReceived[
identical(dt1, dt2)
```

##### Understanding `j` Evaluation with `by=.EACHI` for Non-Matches

A common point of confusion arises when using special symbols like `.I` in `j` with `by=.EACHI`. The behavior for non-matching rows differs from what you might expect when selecting a regular column.

Let's illustrate with a simple example:
```{r by-eachi-special-symbols}
d1 = data.table(v = c("A", "B", "C", "A", "C"), i_col = 1:5)
d2 = data.table(v = c("D", "A", "G", "C"))
```

*Case 1: Selecting a regular column*

When we select a column from `x` (`d1`), non-matching rows from `i` (`d2`) result in `NA`. This is the standard behavior governed by `nomatch = NA`.
```{r}
d1[d2, on = .(v), .(i_col), by = .EACHI]
```

For the rows `D` and `G` in `d2`, there is no matching row in `d1`, so the value for `i_col` is missing (`NA`).

*Case 2: Evaluating the special symbol `.I`*

However, when we use the special symbol `.I`, non-matching rows evaluate to `0`.
```{r}
d1[d2, on = .(v), .I, by = .EACHI]
```

The difference comes down to what `j` is asked to do in each case:
- In Case 1, we are performing a value lookup. A failed lookup results in a missing value (`NA`).
- In Case 2, we are performing an evaluation. The symbol `.I` is defined as "the row indices in `x` for the current group". For non-matching rows like `D`, the group of matching rows in `d1` is empty. The set of indices for an empty group is `integer(0)`. `data.table` represents this zero-length result as a single `0` in the output.

Review comment (Member): where does this info come from?

This logic is consistent with other special symbols like `.N` (the number of rows in a group), which also correctly evaluates to `0` for non-matching groups.

```{r}
d1[d2, on = .(v), .N, by = .EACHI]
```

#### 3.1.4. Joining based on several columns

So far we have joined `data.table`s on a single column, but the package can also join tables on several columns at once.