Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nested ':=' reference assignment fails #6768

Open
AntonNM opened this issue Jan 29, 2025 · 12 comments · May be fixed by #6789
Open

nested ':=' reference assignment fails #6768

AntonNM opened this issue Jan 29, 2025 · 12 comments · May be fixed by #6789
Labels
Milestone

Comments

@AntonNM
Copy link

AntonNM commented Jan 29, 2025

Issues arise when a function is called with the reference assignment operator ':=', which in-turn modifies the dt by reference as well. Below is the example running on the latest dev version. The result of the outer function overwrites the inner function in the incorrect column named 'se', the column named new is left empty.

> library(data.table)
data.table 1.16.99 IN DEVELOPMENT built 2025-01-27 04:18:35 UTC; root using 7 threads (see ?getDTthreads).  Latest news: r-datatable.com

library(data.table)
options(datatable.verbose = TRUE)
dt = data.table(a=1:10)
inner <- function(dt){
    dt[,se:=1:10]
}
outer <- function(dt){
    inner(dt);
    return (11:20)
}
dt[,new:=outer(dt)]
print(dt)
Detected that j uses these columns: <none>
Detected that j uses these columns: <none>
Assigning to all 10 rows
RHS_list_of_columns == false
RHS for item 1 has been duplicated because MAYBE_REFERENCED==1 MAYBE_SHARED==1, but then is being plonked. length(values)==10; length(cols)==1)
Assigning to all 10 rows
RHS_list_of_columns == false
RHS for item 1 has been duplicated because MAYBE_REFERENCED==1 MAYBE_SHARED==1, but then is being plonked. length(values)==10; length(cols)==1)
        a    se    new
    <int> <int> <NULL>
 1:     1    11   NULL
 2:     2    12   NULL
 3:     3    13   NULL
 4:     4    14   NULL
 5:     5    15   NULL
 6:     6    16   NULL
 7:     7    17   NULL
 8:     8    18   NULL
 9:     9    19   NULL
10:    10    20   NULL
@AntonNM
Copy link
Author

AntonNM commented Jan 29, 2025

There is a workaround of returning a DT containing the results of the inner and outer function and assigning them all at once, however this structure was more desirable.

@MichaelChirico
Copy link
Member

could you confirm (1) whether the issue is present on CRAN (2) whether updating to the current version of data.table fixes the issue? There are some recent related commits that might fix it

@AntonNM
Copy link
Author

AntonNM commented Jan 29, 2025

Thank you, the latest CRAN does indeed resolve this issue!

@MichaelChirico
Copy link
Member

Great! Please also check on the current devel version, this is probably due to #6551

@AntonNM
Copy link
Author

AntonNM commented Jan 31, 2025

Updating to the current dev version 32dfd8d (1.16.99) re-introduces this issue.

I also pulled, and tested the commits for #6551 & #6542 (the commit directly preceding #6551) and the issue is present in both.

Perhaps it was introduced in an earlier commit?

@MichaelChirico MichaelChirico added this to the 1.17.0 milestone Jan 31, 2025
@MichaelChirico MichaelChirico modified the milestones: 1.17.0, 1.18.0 Jan 31, 2025
@MichaelChirico
Copy link
Member

I see the same behavior on 1.16.0, 1.16.4 and current master:

Detected that j uses these columns: <none>
Detected that j uses these columns: <none>
Assigning to all 10 rows
RHS_list_of_columns == false
RHS for item 1 has been duplicated because MAYBE_REFERENCED==1 MAYBE_SHARED==1, but then is being plonked. length(values)==10; length(cols)==1)
Assigning to all 10 rows
RHS_list_of_columns == false
RHS for item 1 has been duplicated because MAYBE_REFERENCED==1 MAYBE_SHARED==1, but then is being plonked. length(values)==10; length(cols)==1)
        a    se    new
    <int> <int> <NULL>
 1:     1    11   NULL
 2:     2    12   NULL
 3:     3    13   NULL
 4:     4    14   NULL
 5:     5    15   NULL
 6:     6    16   NULL
 7:     7    17   NULL
 8:     8    18   NULL
 9:     9    19   NULL
10:    10    20   NULL

It's definitely odd that you've managed to create a column that's just NULL -- the resulting dt is not a proper data.table.

Could you double check & report which version produces the correct result?

@AntonNM
Copy link
Author

AntonNM commented Feb 1, 2025

I went back as far as 1.13.0 and was unable to find a version that produces the correct result, however I believe I found the root of the confusion:

library(data.table)
options(datatable.verbose = T)

dt = data.table(a=1:10)

inner <- function(dt){
    dt[,se:=1:10]
}
outer <- function(dt){
    inner(dt);
    return (11:20)
}

dt[,new:=outer(dt)]
print(dt)
dt[,new:=outer(dt)]
print(dt)

The second occurrence of the assignment statement produces the correct result. I believe this is what occurred when I first tested the latest CRAN version.

AntonNM added a commit to AntonNM/data.table that referenced this issue Feb 2, 2025
*Followed TODO: by mattdowle from resolution to '2-space indentation Rdatatable#2420'

*Added tests for jsub that modify DT by-reference

*Added test case for interger vector indexing
@tdhock
Copy link
Member

tdhock commented Feb 10, 2025

I'm surprised you would expect this to work at all. Shouldn't it be outer(dt) instead of dt[,new:=outer(dt)] ?

@AntonNM
Copy link
Author

AntonNM commented Feb 10, 2025

I would expect dt[, new:=outer(dt)] to create 2 new columns one called 'new' with the return value of outer, (11:20) and another "side-effect" column, called 'se', with the values of 1:10 as created in the inner function.

Calling only outer(dt) would creates the column 'se' correctly, however the return value of outer() is lost.

library(data.table)
options(datatable.verbose = F)

dt =  data.table(a=1:10)

inner <- function(dt){
    dt[,se:=1:10]
}

outer <- function(dt){

    inner(dt);

    return (11:20)
}

outer(dt)

print(dt)

 [1] 11 12 13 14 15 16 17 18 19 20 #lost return value of outer
        a    se
    <int> <int>
 1:     1     1
 2:     2     2
 3:     3     3
 4:     4     4
 5:     5     5
 6:     6     6
 7:     7     7
 8:     8     8
 9:     9     9
10:    10    10

@AntonNM
Copy link
Author

AntonNM commented Feb 10, 2025

The result I would expect would be the following:

        a    se   new
    <int> <int> <int>
 1:     1     1    11
 2:     2     2    12
 3:     3     3    13
 4:     4     4    14
 5:     5     5    15
 6:     6     6    16
 7:     7     7    17
 8:     8     8    18
 9:     9     9    19
10:    10    10    20

@TysonStanley
Copy link
Member

Thanks for more detail. If just running outer(dt) where would the returned vector get placed into the data.table? The output makes sense to me. It is running inner(dt) and then returning the vector you asked for. I don't see anywhere that it should be added as a column.

@AntonNM
Copy link
Author

AntonNM commented Feb 18, 2025

The original example included an assignment to the column 'se' and 'new'

1.) column se should be created with values 1-10 here

2.) the outer function returns the values 11-20 which should be assigned to new

3.) the return value of outer is assigned to new

The original example:

library(data.table)
options(datatable.verbose = TRUE)
dt = data.table(a=1:10)
inner <- function(dt){
    dt[,se:=1:10] #(1) 
}
outer <- function(dt){
    inner(dt);
    return (11:20) #(2)
}
dt[,new:=outer(dt)] # (3)
print(dt)

The results of the above snippet were:

Detected that j uses these columns: <none>
Detected that j uses these columns: <none>
Assigning to all 10 rows
RHS_list_of_columns == false
RHS for item 1 has been duplicated because MAYBE_REFERENCED==1 MAYBE_SHARED==1, but then is being plonked. length(values)==10; length(cols)==1)
Assigning to all 10 rows
RHS_list_of_columns == false
RHS for item 1 has been duplicated because MAYBE_REFERENCED==1 MAYBE_SHARED==1, but then is being plonked. length(values)==10; length(cols)==1)
        a    se    new
    <int> <int> <NULL>
 1:     1    11   NULL
 2:     2    12   NULL
 3:     3    13   NULL
 4:     4    14   NULL
 5:     5    15   NULL
 6:     6    16   NULL
 7:     7    17   NULL
 8:     8    18   NULL
 9:     9    19   NULL
10:    10    20   NULL

The results I expected were:

        a    se   new
    <int> <int> <int>
 1:     1     1    11
 2:     2     2    12
 3:     3     3    13
 4:     4     4    14
 5:     5     5    15
 6:     6     6    16
 7:     7     7    17
 8:     8     8    18
 9:     9     9    19
10:    10    10    20

I found that this occurs because the evaluation of the left-hand side at points (#1, & #3) both occur at the point when dt only contains the column 'a'. This leads to both new columns being assign the following available integer index, i.e '2'

Hopefully this was a more complete explanation of the bug. Please let me know is there are any other questions, confusion or concerns with the PR!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
4 participants