Fix index printing by adding index info to header (#6806) #6816

Mukulyadav2004 · 2025-02-14T17:08:08Z

Problem:
Currently, when options(datatable.show.indices = TRUE) is set, print.data.table() tries to add index info to toprint. However, toprint may then have a different number of columns than abbs, leading to the error:

Error in rbind(abbs, toprint) : number of columns of result is not a multiple of vector length (arg 1)

Fix:
Instead of modifying toprint, this PR adds the index information directly to header, ensuring a cleaner and safer display.

Changes:
Extracts index names from the "index" attribute.
Formats them (removes __ and replaces _ with , ).
Creates an "Indices: ..." header string.
Appends this to header.

File changed: R/print.data.table.R

aitap

The guide that you and your colleagues are following has some advice that is counter-productive. Where does it come from? Perhaps it should be changed?

Putting the issue number that you would like to fix into the pull request title doesn't help: GitHub ignores it. Instead you need to put the number into the text (i.e. the comment) of the pull request. Then GitHub will link the two together.

Let's try to stick to the code formatting style used by this project: when closing an if block and following it with an else block, put both braces on the same line as the word else:

if (whatever) {
  # some code
} else { # <-- like here
  # some more code
}

It's usually best to minimise the changes you're introducing to a code base. This is both easier to review and has less chance of introducing bugs. Since R is mostly ambivalent about spaces, let's not introduce extra space at the end of some but not all of the lines. On the Files tab of this pull request, you can see "lint-r" complaining about extra spaces. Could you please remove them?

But the main problem is the suggested solution. print(x) shouldn't have to change variables above it in the function call stack. The main issue here is that abbs is constructed from classes1(x) without taking into account that cbind(x[...,], index_dt) may be printed instead of just a subset of x. There must be a simpler way of either (1) extracting the classes from both data.tables or (2) padding abbs with a string for each index.
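The length mismatch described above can be reproduced without data.table. The following is a minimal base R sketch, assuming a table with two data columns plus one index column; the matrix contents and the "<index>" padding string are illustrative stand-ins, not the project's actual code. It shows the warning and option (2), padding abbs with one string per index column.

```r
# abbs has one entry per column of x, but toprint carries an extra index column
abbs <- c("<int>", "<int>")
toprint <- matrix(c("1", "2", "4", "5", "1", "2"), nrow = 2,
                  dimnames = list(c("1:", "2:"), c("a", "b", "index:a")))

# Capture the warning message instead of letting it print
msg <- tryCatch(rbind(abbs, toprint), warning = function(w) conditionMessage(w))

# Option (2): pad abbs with one string per index column
abbs_padded <- c(abbs, rep("<index>", ncol(toprint) - length(abbs)))
padded <- rbind(abbs_padded, toprint)  # lengths agree, so rbind() stays quiet
```

Here msg holds the text of the rbind() warning, while padded is a 3-row matrix whose first row is the padded abbreviation vector.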

aitap · 2025-02-14T20:03:37Z

R/print.data.table.R

-      setnames(index_dt, print_names)
+    indices <- names(attr(x, "index", exact = TRUE))
+    if (length(indices)) {
+      cleaned_indices <- gsub("^__|_", ", ", indices) 


gsub with _ as a complete subexpression will match _ anywhere, including in the middle of the column name. Why does the code match it here?

aitap · 2025-02-14T20:04:52Z

R/print.data.table.R

+    indices <- names(attr(x, "index", exact = TRUE))
+    if (length(indices)) {
+      cleaned_indices <- gsub("^__|_", ", ", indices) 
+      cleaned_indices <- sub(", $", "", cleaned_indices) 


There's probably a way to fix the code above so that it doesn't create a separator at the end of the string in the first place.
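One way to avoid both problems (matching underscores inside column names, and cleaning up stray separators afterwards) is to split on the literal "__" separator instead of substituting every "_". This is only a sketch in base R, assuming index names of the form "__col1__col2"; the index name below is hypothetical, and a column whose own name contains "__" would still be ambiguous.

```r
idx <- "__my_col__b"   # hypothetical index on columns my_col and b

# The reviewed pattern: the bare "_" alternative also hits the underscore inside my_col
naive <- gsub("^__|_", ", ", idx)

# Alternative: drop the leading "__", then split on the literal "__" separator
cols <- strsplit(sub("^__", "", idx), "__", fixed = TRUE)[[1]]
label <- toString(cols)   # "my_col, b"
```

Because the separator is only ever placed between elements by toString(), there is nothing to trim at either end of the string.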

aitap · 2025-02-14T20:07:53Z

R/print.data.table.R

+      if (exists("header", envir = parent.frame(), inherits = FALSE)) {
+        # Match data.table's existing header handling
+        assign("header", c(get("header", envir = parent.frame()), header), 
+               envir = parent.frame())


What is this code intended to do? Where is the header variable used? Use of parent.frame() in this context is almost certainly a mistake because parent.frame() is the environment of the function that calls print(x). Why is the exists() check needed? Why assign()? In R, if doesn't create a new lexical scope (unlike a function call), so your changes to the variables will be retained outside the block.
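The two scoping points in this comment can be illustrated with a small base R sketch. The function names (build_header, leaky, caller) are made up for the illustration and do not exist in data.table.

```r
# Plain assignment inside `if` changes the variable in the enclosing function body
build_header <- function(has_index) {
  header <- "Key: <a>"
  if (has_index) {
    header <- c(header, "Indices: <b>")
  }
  header
}

# assign(..., envir = parent.frame()) writes into the *caller's* environment instead
leaky <- function() {
  assign("header", "written by leaky()", envir = parent.frame())
  invisible(NULL)
}
caller <- function() {
  leaky()
  get("header", inherits = FALSE)   # the variable now exists in caller()'s frame
}
```

build_header(TRUE) returns both strings without any assign() or exists() call, whereas leaky() silently creates a variable in whichever function happens to call it.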

Thank you, @aitap, for your valuable feedback.
I now understand the issues you pointed out and will make the necessary changes. Initially, I declared header because an error occurred when an index existed, and since the index is metadata, I created header to store it. My intent was to update header dynamically when printing a data.table, ensuring that index information is included.
However, I did not know that if statements do not create a new lexical scope until you pointed it out. As a result, header was being modified in the parent environment instead of being managed locally within print.data.table().

Is there a place in the code where the header variable is used after being created by this block?

Mukulyadav2004 · 2025-02-18T05:36:02Z

Hi @aitap
As you mentioned, in the modified data.table, where we add columns from both x and index_dt via cbind, abbs was initially constructed only from x and not index_dt. I have made the necessary changes to address this.

However, after implementing these modifications, three tests from tests.Rraw are failing because the expected output differs from the observed output. Specifically, in these tests, the expected output does not include an extra column for index_dt. The failing tests are as follows:

DT2 <- data.table(a = 1:3, b = 4:6)
setindexv(DT2, c("b","a"))
test(1775.2, print(DT2, print.keys = TRUE),
     output=c("Index: <b__a>", "   a b", "1: 1 4", "2: 2 5", "3: 3 6"))

setindexv(DT2, "b")
test(1775.3, print(DT2, print.keys = TRUE),
     output=c("Indices: <b__a>, <b>", "   a b", "1: 1 4", "2: 2 5", "3: 3 6"))

setkey(DT2, a)
setindexv(DT2, c("b","a"))
test(1775.4, print(DT2, print.keys = TRUE),
     output=c("Key: <a>", "Indices: <b__a>, <b>", "   a b", "1: 1 4", "2: 2 5", "3: 3 6"))

Could you please clarify this?

aitap · 2025-02-18T07:06:59Z

Hi @Mukulyadav2004, could you please push your changes to this pull request? It's very hard to diagnose code knowing only that it produces unwanted output, without seeing the changes to the output code. The expected output in tests 1775.2, 1775.3, 1775.4 does not contain the index columns because show.indices= is not set, only print.keys= is.

aitap

Thank you for making the improvements! You are approaching a good solution.

aitap · 2025-02-19T11:45:02Z

R/print.data.table.R

+      if (exists("header", envir = parent.frame(), inherits = FALSE)) {
+        # Match data.table's existing header handling
+        assign("header", c(get("header", envir = parent.frame()), header), 
+               envir = parent.frame())


Is there a place in the code where the header variable is used after being created by this block?

aitap · 2025-02-19T11:47:40Z

R/print.data.table.R

@@ -101,8 +110,23 @@ print.data.table = function(x, topn=getOption("datatable.print.topn"),
      IDate = "<IDat>", integer64 = "<i64>", raw = "<raw>",
      expression = "<expr>", ordered = "<ord>")
    classes = classes1(x)
+    col_names <- colnames(toprint)
+    classes <- sapply(col_names, function(col_name) {
+      if (grepl("^index:", col_name)) {


This check is not very reliable. A user can create a column whose name starts with "index:", for example data.table(`index:foo`='bar'). Moreover, grepl('^static-string') can usually be replaced by startsWith(), which avoids the cost of having to compile a regular expression.
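A short base R sketch of both points, under the assumption that index columns are always appended after the data columns. The column names and n_data_cols are hypothetical; the second name stands for a user column that merely looks like an index.

```r
nms <- c("a", "index:foo", "index:a")   # 2nd is a user column, 3rd an index column
n_data_cols <- 2L                       # hypothetical: columns that come from x itself

# startsWith() gives the same answer as the anchored regular expression
same <- identical(startsWith(nms, "index:"), grepl("^index:", nms))

by_name     <- startsWith(nms, "index:")      # misclassifies the user column
by_position <- seq_along(nms) > n_data_cols   # relies only on where columns were added
```

Tracking the position (or count) of the appended columns sidesteps any dependence on what the user chose to call their own columns.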

Thank you for your feedback. I have reviewed the code, and currently, the header variable is being assigned within this block. However, I do not see any explicit usage of header after its creation. I believe the header might be printed due to the following code snippet:

if (print.keys) {
  if (!is.null(ky <- key(x)))
    catf("Key: <%s>\n", toString(ky))
  if (!is.null(ixs <- indices(x)))
    cat(sprintf(
      ngettext(length(ixs), "Index: %s\n", "Indices: %s\n"),
      paste0("<", ixs, ">", collapse = ", ")
    ))
}

However, I am not entirely certain, and I would appreciate any clarification if I have misunderstood.

Let me rephrase that. When you wrote the code that introduced the header variable, where did you intend it to be used? I'm asking this question because there is no other use of this variable in R/print.data.table.R, and no uses of variables named header in other *.R files are applicable.

If you are using a machine learning model to write the code or the comments for you, please stop doing that, because it's doing us both a disservice. Relying on machine output deprives you of learning opportunities and prevents you from accumulating skill. (If machine learning models start writing good code by themselves, what use will there be for human programmers like you and me?) Having to review code that superficially looks like it would work but ultimately proves useless wastes maintainer time and morale.

aitap · 2025-02-19T11:48:43Z

R/print.data.table.R

    abbs = unname(class_abb[classes])
+    abbs[classes == "index"] <- "<index>"


What if an R package provides an S3 class called "index"? There are more than 20000 CRAN packages, plus some more Bioconductor packages, plus a lot of packages only published on GitHub or even not published at all.

To avoid such conflicts, I will ensure that only actual index columns are modified by explicitly tracking them using indices(x). This way, we only modify columns that are confirmed to be indices, rather than relying on a generic "index" class check.

aitap · 2025-02-19T11:49:50Z

R/print.data.table.R

@@ -101,8 +110,23 @@ print.data.table = function(x, topn=getOption("datatable.print.topn"),
      IDate = "<IDat>", integer64 = "<i64>", raw = "<raw>",
      expression = "<expr>", ordered = "<ord>")
    classes = classes1(x)
+    col_names <- colnames(toprint)
+    classes <- sapply(col_names, function(col_name) {


Instead of overwriting classes by re-creating it anew, can you append the index markers to the variable classes created two lines above?

aitap · 2025-02-19T11:51:44Z

R/print.data.table.R

    if ( length(idx <- which(is.na(abbs))) ) abbs[idx] = paste0("<", classes[idx], ">")
+    stopifnot(length(abbs) == ncol(toprint))


Defensive programming is good, but it's better to add a test to inst/tests/tests.Rraw to verify that printing a data.table with indices does not raise any warnings due to invalid rbind().

Mukulyadav2004 · 2025-02-20T07:13:23Z

Hi @aitap,
I have incorporated all the changes you suggested, but I am still encountering 6 test failures out of 11,760 tests. The failing tests are: 2264.1, 2264.2, 2264.3, 2264.4, 2264.7, and 2264.8, all from inst/tests/tests.Rraw.
The issue seems to be that the observed output does not match the expected output.

Expected Output:

    grp1 grp2 grp3 index1:grp1__grp3 index2:grp3__grp1
 1:   77   61   53                 3                 5
 2:   80   66   37                 8                 4
 3:   27   42    8                 5                 3
 4:   66   37    7                 4                 7
 5:   38   69    5                 6                 2
 6:   72   89   69                 1                10
 7:   86   52   16                 2                 1
 8:   28   35   62                10                 8
 9:   95   82   80                 7                 6
10:   83   64   41                 9                 9

Observed Output:

    grp1 grp2 grp3
 1:   77   61   53
 2:   80   66   37
 3:   27   42    8
 4:   66   37    7
 5:   38   69    5
 6:   72   89   69
 7:   86   52   16
 8:   28   35   62
 9:   95   82   80
10:   83   64   41

All failures appear to be of the same nature: the index columns are missing in the observed output. Could you please provide guidance on what might be causing this and how best to resolve it?

aitap · 2025-02-20T10:20:46Z

inst/tests/tests.Rraw

@@ -18784,6 +18784,11 @@ ans = c(
 "10:   83   64   41                 9                 9")
 # test where topn isn't necessary
 test(2264.8, print(DT, show.indices=TRUE), output=ans)
+# printing does not fail when indices are present
+test(2264.9, {
+  suppressWarnings( print(DT, show.indices=TRUE) )


Instead of suppressing warnings, test that they don't occur. Try capturing the output from a print(DT, show.indices=TRUE) with the warning fixed and call test(2264.9, print(DT, show.indices=TRUE), output = c(...example output goes here...)).

aitap · 2025-02-20T10:30:18Z

The current patch both removes the code that creates index_dt and then seems to set show.indices to FALSE, which prevents an error due to the index_dt being undefined. Try providing the variables that are needed by the show.indices == TRUE branches in print.data.table and then preserving the show.indices argument being true.

codecov · 2025-02-20T23:38:51Z

Codecov Report

Attention: Patch coverage is 77.77778% with 4 lines in your changes missing coverage. Please review.

Project coverage is 98.62%. Comparing base (4b3a081) to head (3acfeb7).
Report is 1 commits behind head on master.

Files with missing lines	Patch %	Lines
R/print.data.table.R	77.77%	4 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #6816      +/-   ##
==========================================
- Coverage   98.64%   98.62%   -0.03%     
==========================================
  Files          79       79              
  Lines       14642    14654      +12     
==========================================
+ Hits        14444    14452       +8     
- Misses        198      202       +4


Mukulyadav2004 · 2025-02-21T00:01:43Z

Hi @aitap,

I have incorporated all the changes as per your suggestions, and this time, all tests from inst/tests/tests.Rraw have passed successfully. However, the following checks have failed:
atime performance tests / comment (pull_request)
code-quality / lint-r (pull_request)

Could you please advise on how to proceed with these?

MichaelChirico · 2025-02-21T00:05:21Z

for lintr, scroll through the Files tab, you will see line annotations telling you what's wrong
same for codecov. you need to add new test(s) which will execute the highlighted line(s).
for atime, that's a known issue, you can ignore

Mukulyadav2004 · 2025-02-21T15:35:30Z

I have added tests to 'inst/tests/tests.Rraw' for the lines that were identified by Codecov, but they still do not seem to be recognized, as I am encountering the same error as before. Could you please advise on how to resolve this issue?

MichaelChirico · 2025-02-21T17:08:55Z

R/print.data.table.R

    # The issue is distinguishing "> DT" (after a previous := in a function) from "> DT[,foo:=1]". To print.data.table(), there
    # is no difference. Now from R 3.2.0 a side effect of the very welcome and requested change to avoid silent deep copy is that
    # there is now no longer a difference between > DT and > print(DT). So decided that DT[] is now needed to guarantee print; simpler.
    # This applies just at the prompt. Inside functions, print(DT) will of course print.
    # Other options investigated (could revisit): Cstack_info(), .Last.value gets set first before autoprint, history(), sys.status(),
    #   topenv(), inspecting next statement in caller, using clock() at C level to timeout suppression after some number of cycles
    SYS = sys.calls()
-    if (identical(SYS[[1L]][[1L]], print) || # this is what auto-print looks like, i.e. '> DT' and '> DT[, a:=b]' in the terminal; see #3029.


why remove these comments?

I noticed these changes were made while re-checking the code. I will revert them to maintain consistency.

MichaelChirico · 2025-02-21T17:09:02Z

R/print.data.table.R

@@ -22,23 +22,23 @@ print.data.table = function(x, topn=getOption("datatable.print.topn"),
  if (col.names == "none" && class)
    warningf("Column classes will be suppressed when col.names is 'none'")
  if (!shouldPrint(x)) {
-    #  := in [.data.table sets .global$print=address(x) to suppress the next print i.e., like <- does. See FAQ 2.22 and README item in v1.9.5
+#  := in [.data.table sets .global$print=address(x) to suppress the next print i.e., like <- does. See FAQ 2.22 and README item in v1.9.5


why this change?

MichaelChirico · 2025-02-21T17:09:11Z

R/print.data.table.R

      return(invisible(x))
    }
  }
  if (!is.numeric(nrows)) nrows = 100L
  if (!is.infinite(nrows)) nrows = as.integer(nrows)
-  if (nrows <= 0L) return(invisible(x))   # ability to turn off printing
+  if (nrows <= 0L) return(invisible(x))  # ability to turn off printing


why this change?

MichaelChirico · 2025-02-21T17:09:21Z

R/print.data.table.R

@@ -57,8 +57,8 @@ print.data.table = function(x, topn=getOption("datatable.print.topn"),
      catf("Null data.%s (0 rows and 0 cols)\n", class)  # See FAQ 2.5 and NEWS item in v1.8.9
    } else {
      catf("Empty data.%s (%d rows and %d cols)", class, NROW(x), NCOL(x))
-      if (length(x)>0L) cat(": ",paste(head(names(x),6L),collapse=","),if(length(x)>6L)"...",sep="") # notranslate
-      cat("\n") # notranslate
+      if (length(x)>0L) cat(": ",paste(head(names(x),6L),collapse=","),if(length(x)>6L)"...",sep="")  # notranslate


why this change?

MichaelChirico · 2025-02-21T17:10:27Z

inst/tests/tests.Rraw

+# Test for covering classes[col_name] <- "unknown"
+DT = data.table(A = 1:3, B = 4:6, C = 7:9)
+if ("D" %in% colnames(DT)) DT[, D := NULL]
+test(2306.4, DT, data.table(A = 1:3, B = 4:6, C = 7:9))


missing final newline

MichaelChirico · 2025-02-21T17:11:55Z

I have added tests to 'inst/tests/tests.Rraw' for the lines that were identified by Codecov, but they still do not seem to be recognized, as I am encountering the same error as before. Could you please advise on how to resolve this issue?

You are writing code about data.table's print() method, but none of your tests involve printing (just input/output equality).

You want to use test(output=) to ensure printing is invoked and you're actually testing the behavior you're changing.

Mukulyadav2004 · 2025-02-22T16:39:05Z

Hi @MichaelChirico
Apologies for the interruption. Could you kindly review the test I've written? Despite my changes to include printing, it still fails the same checks. I would greatly appreciate your guidance on how to correct it.

aitap

Better, but the changes to unrelated code need to be reverted, and it's better to reuse the classes1 logic instead of reimplementing it.

aitap · 2025-02-23T13:52:46Z

R/print.data.table.R

@@ -101,7 +101,22 @@ print.data.table = function(x, topn=getOption("datatable.print.topn"),
      IDate = "<IDat>", integer64 = "<i64>", raw = "<raw>",
      expression = "<expr>", ordered = "<ord>")
    classes = classes1(x)


Here's an idea. The code uses classes1(x) to obtain the class names of the column. It's already mostly the right answer, and it handles things like non-length-1 class vectors. The problem here is that toprint, which is what will get printed, may be produced from not just x, but cbind(x, indices(x)), and so may have a different number of columns.

Instead of manually reimplementing classes1 below, it should be enough to either append <index> to classes (length(indices(x)) times) if show.indices is true, or capture classes1(toprint) before it is formatted.
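The second alternative (capturing the classes of the combined object before it is formatted) can be sketched in base R. This is only an illustration under stated assumptions: first_class() is a hypothetical stand-in for data.table's internal classes1(), plain data.frames stand in for x and the index table, and the column names are made up.

```r
# Stand-in for classes1(): the first class of each column
first_class <- function(df) {
  vapply(df, function(col) class(col)[1L], character(1), USE.NAMES = FALSE)
}

x <- data.frame(a = 1:3, b = c(1.5, 2.5, 3.5))
index_dt <- data.frame(`index:a` = 3:1, check.names = FALSE)

combined <- cbind(x, index_dt)
classes <- first_class(combined)          # taken before formatting turns columns into strings
toprint <- as.matrix(format(combined))    # character matrix with the same number of columns
```

Since classes is computed from the same object that becomes toprint, its length matches ncol(toprint) by construction, whether or not index columns are shown.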

aitap · 2025-02-23T13:57:43Z

inst/tests/tests.Rraw

+# Test for covering classes[col_name] <- "index"
+DT <- data.table(A = 1:3, B = 4:6)
+setindex(DT, A)
+test(2306.1, {print(DT); capture.output(print(DT))}, 


Instead of using capture.output(...) in tests, put the expected output into the output= argument of the test() function. This will automatically skip the output test when running with translation enabled instead of failing it. You also don't have to manually call print() when using the output= argument; the test() function calls print() for you when you ask it to test the output. There are other useful arguments described in help(test, data.table).

aitap · 2025-02-23T14:04:12Z

inst/tests/tests.Rraw

+     c("   A B C",
+       "1: 1 4 7",
+       "2: 2 5 8",
+       "3: 3 6 9"))


In order to exercise the index-printing code, make sure to create an index using setindex(...) and use print(..., show.indices=TRUE) or test(..., output = ..., options = c(datatable.show.indices=TRUE)). In order to reproduce the warning, make sure to create more than one column, otherwise rbind() silently recycles the length-1 vector abbs.
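The recycling behaviour behind this advice can be checked in base R. In this sketch, warned() is a hypothetical helper and the matrices are placeholders for the formatted table, not real data.table output.

```r
# TRUE if evaluating expr raises a warning, FALSE otherwise
warned <- function(expr) {
  tryCatch({ expr; FALSE }, warning = function(w) TRUE)
}

# One data column plus one index column: the length-1 abbs is recycled silently
one_col <- warned(rbind("<int>", matrix("1", nrow = 2, ncol = 2)))

# Two data columns plus one index column: 3 is not a multiple of 2, so rbind() warns
two_cols <- warned(rbind(c("<int>", "<int>"), matrix("1", nrow = 2, ncol = 3)))
```

A regression test built on a single-column table would therefore pass even with the bug present, which is why at least two data columns are needed.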

aitap · 2025-02-23T14:06:45Z

R/print.data.table.R

+        cls <- class(x[[col_name]])
+        if (is.list(cls)) cls <- unlist(cls)
+        if (length(cls) == 0) cls <- "unknown"


I don't think R will allow making the class vector a list. Also, class() will return the name of the primitive type of the value if the class attribute is set empty or removed altogether, so both tests here are impossible.
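The second point, that class() falls back to an implicit class when the attribute is missing, can be confirmed with a few base R values. The object x and the class name "myclass" are illustrative.

```r
x <- structure(1:3, class = "myclass")
attr(x, "class") <- NULL          # remove the class attribute altogether

# class() still answers with the implicit class in every case
implicit <- list(
  class(x),                       # integer vector without a class attribute
  class(unclass(factor("a"))),    # factor stripped of its class
  class(NULL),
  class(sum)
)
```

Every element of implicit is a non-empty character vector, so a length(cls) == 0 branch cannot be reached through class().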

aitap · 2025-02-23T14:08:18Z

R/print.data.table.R

+        if (length(cls) == 0) cls <- "unknown"
+        classes[col_name] <- cls[1]
+      } else {
+        classes[col_name] <- "unknown"


This is likely to be impossible due to the way toprint is constructed from x.

Mukulyadav2004 added 5 commits February 14, 2025 14:31

add header to match keystyle

a68392f

add header to match keystyle

f8527a3

maybe final

d8c091e

new one

f463b76

add header string

558d67a

Mukulyadav2004 requested a review from MichaelChirico as a code owner February 14, 2025 17:08

aitap requested changes Feb 14, 2025

aitap marked this pull request as draft February 14, 2025 20:30

Mukulyadav2004 requested a review from aitap February 15, 2025 11:22

Mukulyadav2004 and others added 2 commits February 18, 2025 19:38

Merge branch 'master' into new_branch

76ac2de

proper row alignment

ff4746e

aitap requested changes Feb 19, 2025

Mukulyadav2004 and others added 2 commits February 20, 2025 09:40

modified changes

aca5688

Merge branch 'master' into new_branch

87ffd7f

Mukulyadav2004 requested a review from aitap February 20, 2025 08:18

aitap reviewed Feb 20, 2025

providing needed variables

cca0282

Mukulyadav2004 and others added 6 commits February 21, 2025 13:24

add changes

b5cee16

added test

4ac709c

add tests

3aaa7a8

adding tests

a67e0ff

remove warning meassage

7096877

Merge branch 'master' into new_branch

552f75d

MichaelChirico reviewed Feb 21, 2025

Mukulyadav2004 and others added 2 commits February 22, 2025 10:40

tests

3b610d3

Merge branch 'master' into new_branch

3acfeb7

Mukulyadav2004 requested review from aitap and MichaelChirico February 23, 2025 13:05

aitap requested changes Feb 23, 2025

Mukulyadav2004 added 3 commits February 24, 2025 13:12

changes

0b26fd9

ensure toprint

9c4cffb

ensure col_names

3044358
