Rdatatable · venom1204 · Aug 21, 2025 · ben-schwen · Aug 22, 2025
@@ -111,7 +111,7 @@ data.table(\dots, keep.rownames=FALSE, check.names=FALSE, key=NULL, stringsAsFac
         \item or of the form \code{startcol:endcol}: e.g., \code{DT[, sum(a), by=x:z]}
     }
 
-    \emph{Advanced:} When \code{i} is a \code{list} (or \code{data.frame} or \code{data.table}), \code{DT[i, j, by=.EACHI]} evaluates \code{j} for the groups in \code{DT} that each row in \code{i} joins to. That is, you can join (in \code{i}) and aggregate (in \code{j}) simultaneously. We call this \emph{grouping by each i}. See \href{https://stackoverflow.com/a/27004566/559784}{this StackOverflow answer} for a more detailed explanation until we \href{https://github.com/Rdatatable/data.table/issues/944}{roll out vignettes}.
+    \emph{Advanced:} When \code{i} is a \code{list} (or \code{data.frame} or \code{data.table}), \code{DT[i, j, by=.EACHI]} evaluates \code{j} for the groups in \code{DT} that each row in \code{i} joins to. That is, you can join (in \code{i}) and aggregate (in \code{j}) simultaneously. We call this \emph{grouping by each i}. Note that for rows in \code{i} with no match, the group of matching rows in \code{x} is empty. Special symbols that operate on rows (e.g., \code{.I} or \code{.N}) will therefore evaluate to \code{0} for such groups. This differs from selecting a column from \code{x} (e.g., \code{x$col}), which results in \code{NA} as governed by the \code{nomatch} argument. See \href{https://stackoverflow.com/a/27004566/559784}{this StackOverflow answer} for a more detailed explanation until we \href{https://github.com/Rdatatable/data.table/issues/944}{roll out vignettes}.
 
     \emph{Advanced:} In the \code{X[Y, j]} form of grouping, the \code{j} expression sees variables in \code{X} first, then \code{Y}. We call this \emph{join inherited scope}. If the variable is not in \code{X} or \code{Y} then the calling frame is searched, its calling frame, and so on in the usual way up to and including the global environment.}
 
@@ -320,6 +320,13 @@ DT[!"a", sum(v), by=.EACHI, on="x"]         # same, but using subsets-as-joins
 DT[c("b","c"), sum(v), by=.EACHI, on="x"]   # same
 DT[c("b","c"), sum(v), by=.EACHI, on=.(x)]  # same, using on=.()
 
+# by=.EACHI with non-matching rows: .I is 0 because the matched group is empty
+d1 = data.table(v = c("A", "B", "C", "A", "C"), val = 1:5)
+d2 = data.table(v = c("D", "A", "G", "C"))
+# selecting the column 'val' returns NA for non-matches, per nomatch=NA
+d1[d2, on = .(v), .(val), by = .EACHI]
+d1[d2, on = .(v), .I, by = .EACHI]
+
 # joins as subsets
 X = data.table(x=c("c","b"), v=8:7, foo=c(4,2))
 X

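For anyone checking the new example locally, here is a minimal sketch of the empty-group behavior using `.N`, whose value for unmatched rows follows directly from the group being empty. It reuses the `d1` and `d2` tables from the added example. The names `res` and `matched` are only illustrative, and the `nomatch = NULL` call assumes a `data.table` version that accepts `NULL` there (older releases used `nomatch = 0`).

```r
library(data.table)

# Same toy tables as in the added Rd example
d1 = data.table(v = c("A", "B", "C", "A", "C"), val = 1:5)
d2 = data.table(v = c("D", "A", "G", "C"))

# One output row per row of d2; .N counts the rows of d1 matched by that row
res = d1[d2, on = .(v), .N, by = .EACHI]
res$N   # 0 2 0 2: "D" and "G" join to empty groups

# Assumption: nomatch = NULL drops the unmatched rows of d2,
# so no empty groups are left for j to be evaluated on
matched = d1[d2, on = .(v), .N, by = .EACHI, nomatch = NULL]
matched
```

If the zeros are unwanted in practice, filtering on `N > 0` after the join would be another option; which one reads better depends on the surrounding code.
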
@@ -259,6 +259,42 @@ dt2 = ProductReceived[
 identical(dt1, dt2)
 ```
 
+##### Understanding `j` Evaluation with `by=.EACHI` for Non-Matches
+
+A common point of confusion arises when using special symbols like `.I` in `j` with `by=.EACHI`. The behavior for non-matching rows differs from what you might expect when selecting a regular column.
+
+Let's illustrate with a simple example:
+```{r by-eachi-special-symbols}
+d1 = data.table(v = c("A", "B", "C", "A", "C"), i_col = 1:5)
+d2 = data.table(v = c("D", "A", "G", "C"))
+```
+
+*Case 1: Selecting a regular column*
+
+When we select a column from `x` (`d1`), non-matching rows from `i` (`d2`) result in `NA`. This is the standard behavior governed by `nomatch = NA`.
+```{r}
+d1[d2, on = .(v), .(i_col), by = .EACHI]
+```
+
+For the rows `D` and `G` in `d2`, there is no matching row in `d1`, so the value of `i_col` is missing (`NA`).
+
+*Case 2: Evaluating the special symbol `.I`*
+
+However, when we use the special symbol `.I`, non-matching rows evaluate to `0`.
+```{r}
+d1[d2, on = .(v), .I, by = .EACHI]
+```
+
+The difference comes from what `j` does in each case:
+
+- In Case 1, `j` performs a value lookup. A failed lookup results in a missing value (`NA`).
+- In Case 2, `j` performs an evaluation. The symbol `.I` is defined as "the row indices in `x` for the current group". For non-matching rows like `D`, the group of matching rows in `d1` is empty, so the set of indices is `integer(0)`. `data.table` represents this zero-length result as a single `0` in the output.
+
+This logic is consistent with other special symbols like `.N` (the number of rows in a group), which also correctly evaluates to `0` for non-matching groups.
+
+```{r}
+d1[d2, on = .(v), .N, by = .EACHI]
+```
+
 #### 3.1.4. Joining based on several columns
 
 So far we have just joined `data.table`s based on 1 column, but it's important to know that the package can join tables matching several columns.