I'm a huge fan of the R data.table package. One of the features I use most often is its reference semantics. In short, this means that I can modify a data.table (modify/add/delete columns) without making a copy. Here is an example:

do_something <- function(dt)
{
  dt[, new_column := 42]
  message("dt address inside: ", data.table::address(dt))
}

dt <- data.table::data.table(some_column = letters[1:3])
message("dt address outside: ", data.table::address(dt))
# dt address outside: 000001c05fba22d8

do_something(dt)
# dt address inside: 000001c05fba22d8

str(dt)
# Classes ‘data.table’ and 'data.frame':	3 obs. of  2 variables:
#   $ some_column: chr  "a" "b" "c"
#   $ new_column : num  42 42 42
# - attr(*, ".internal.selfref")=<externalptr>

Notice how you don't need to explicitly return the data.table, as it was modified in place. This also means that if you want to keep your input data.table unchanged, you have to explicitly pass a copy:

do_something <- function(dt)
{
  dt[, new_column := 42]
  message("dt address inside: ", data.table::address(dt))
}

dt <- data.table::data.table(some_column = letters[1:3])
message("dt address outside: ", data.table::address(dt))
# dt address outside: 000001c062128a08

# ----> make sure you pass a copy, not a reference
do_something(data.table::copy(dt))
# dt address inside: 000001c061967b00

str(dt)
# Classes ‘data.table’ and 'data.frame':	3 obs. of  1 variable:
#   $ some_column: chr  "a" "b" "c"
# - attr(*, ".internal.selfref")=<externalptr>

Some time ago a colleague showed me a strange case in which passing by reference seemed to have stopped working. As I could not reproduce the problem and the colleague just wanted to get on with his work, he modified the function in question to return the data.table, and that was that.

Then a few days ago another colleague had another strange problem. I had a look at it one evening and did not understand a thing, but again it looked as if passing a data.table by reference did not work. I was lost. I remembered having seen the problem before, and I also knew the workaround, but I still had no clue about the cause...

Fast forward to the next day: in a discussion with my colleague, we tried to understand what was different now, why something that usually works did not work anymore... (Notice the difference between somebody who just wants to go on and somebody who wants/needs to understand?) And he went: 'well, we have an RData file with a data.table, and we try to add a new column to that table, without making a copy...' And then it hit me: 'but this should be easily reproducible...'

Indeed, it is. Just have a look. Let's save our data.table to a file:


dt <- data.table::data.table(some_column = letters[1:3])
save(dt, file = "/tmp/problem.RData")

and I want to create a new column like before:


load("/tmp/problem.RData")
message("dt address outside: ", data.table::address(dt))
# dt address outside: 0x56228bc87e28

do_something <- function(dt)
{
  dt[, new_column := 42]
  message("dt address inside: ", data.table::address(dt))
}

do_something(dt)
# dt address inside: 0x56229052fde0

str(dt)
# Classes ‘data.table’ and 'data.frame':	3 obs. of  1 variable:
#   $ some_column: chr  "a" "b" "c"
# - attr(*, ".internal.selfref")=<externalptr>

See the "problem"? My new column is not there as I naively expected.

At first I was just confused. Experience has taught me that data.table is a package I can rely on, and now this... In German there is a saying, "das Problem sitzt meistens vor dem Rechner", which loosely translated means that most of the time it is the user who causes the problem. This was true in this case, too...

Searching in the data.table FAQs I found this: "*.RDS and *.RData are file types which can store in-memory R objects on disk efficiently. However, storing data.table into the binary file loses its column over-allocation. This isn’t a big deal – your data.table will be copied in memory on the next by reference operation and throw a warning. Therefore it is recommended to call setalloccol() on each data.table loaded with readRDS() or load() calls."

To find out more about column over-allocation, see the help page for data.table::setalloccol().
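To make the over-allocation a bit more tangible, here is a minimal sketch (my own illustration, not taken from the FAQ) that compares length() with data.table::truelength() before and after a round trip through a binary file. I use saveRDS()/readRDS() and a temporary file here; the variable names are arbitrary. The exact numbers depend on your data.table version and the datatable.alloccol option, so I do not show any output.

```r
library(data.table)

dt <- data.table(some_column = letters[1:3])

# A freshly created data.table has spare column slots,
# so truelength() is larger than length()
tl_fresh <- truelength(dt)
message("fresh:       length = ", length(dt), ", truelength = ", tl_fresh)

# Round trip through a binary file
rds_file <- tempfile(fileext = ".rds")
saveRDS(dt, rds_file)
restored <- readRDS(rds_file)

# The spare slots are not stored in the file
tl_restored <- truelength(restored)
message("restored:    length = ", length(restored), ", truelength = ", tl_restored)

# setalloccol() over-allocates again
setalloccol(restored)
tl_realloc <- truelength(restored)
message("reallocated: length = ", length(restored), ", truelength = ", tl_realloc)
```

If the restored truelength is smaller than the fresh one, there is no room to add a column in place, which would explain the copy we saw above.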

Now we are getting somewhere...


load("/tmp/problem.RData")

# ---> make sure we reallocate in advance
data.table::setalloccol(dt)

message("dt address outside: ", data.table::address(dt))
# dt address outside: 0x562292694760

do_something <- function(dt)
{
  dt[, new_column := 42]
  message("dt address inside: ", data.table::address(dt))
}

do_something(dt)
# dt address inside: 0x562292694760

str(dt)
# Classes ‘data.table’ and 'data.frame':	3 obs. of  2 variables:
#   $ some_column: chr  "a" "b" "c"
#   $ new_column : num  42 42 42
# - attr(*, ".internal.selfref")=<externalptr>

In our case, we had multiple data.tables in the RData file, so I wondered if there is a way to call data.table::setalloccol dynamically on all of them. I only came up with this:


objects <- ls()
for (i_name in objects) {
  x <- get(i_name)

  if (inherits(x = x, what = "data.table"))
  {
    code <- glue::glue("data.table::setalloccol({i_name})")
    eval(parse(text = code))
  }
}

This seems to work, although I'm not happy with the eval/parse idiom. If you know a nicer way, please let me know.
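One possible way around eval/parse, as a sketch only: load() invisibly returns the names of the objects it restored, and setalloccol() returns the re-allocated data.table, so the result can be stored back under the same name with assign(). The helper name load_and_realloc, its arguments, and the example tables are made up for this illustration.

```r
library(data.table)

# Hypothetical helper: load an RData file into `envir` and
# re-allocate every data.table that came out of it
load_and_realloc <- function(file, envir = parent.frame()) {
  # load() returns the names of the objects it restored
  loaded_names <- load(file, envir = envir)

  for (i_name in loaded_names) {
    x <- get(i_name, envir = envir)
    if (is.data.table(x)) {
      # setalloccol() returns the over-allocated data.table;
      # store it back under its original name
      assign(i_name, setalloccol(x), envir = envir)
    }
  }
  invisible(loaded_names)
}

# Usage with a temporary file holding two data.tables
rdata_file <- tempfile(fileext = ".RData")
dt_a <- data.table(some_column = letters[1:3])
dt_b <- data.table(other_column = 1:3)
save(dt_a, dt_b, file = rdata_file)
rm(dt_a, dt_b)

load_and_realloc(rdata_file)

# Adding a column inside a function should now modify dt_a in place
add_column <- function(dt) {
  dt[, new_column := 42]
  invisible(NULL)
}
add_column(dt_a)
names(dt_a)
```

Using the names returned by load() instead of ls() also means that only the freshly loaded objects are touched, not everything that happens to be in the workspace.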

By the way, if you are a data.table contributor, thanks a lot for the package. (I now know where to look for the source of the problem in the first place :))

xkcd "Tech Support": https://xkcd.com/806/

 
