forked from Tazinho/Advanced-R-Solutions
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path15-Memory.Rmd
221 lines (165 loc) · 7.8 KB
/
15-Memory.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
# Memory
## Object size
1. __<span style="color:red">Q</span>__: Repeat the analysis above for numeric, logical, and complex vectors.
__<span style="color:green">A</span>__:
**numeric:**
```{r}
library(pryr)
sizes <- sapply(0:50, function(n) object_size(vector("numeric", n)))
plot(0:50, sizes, xlab = "Length", ylab = "Size (bytes)",
type = "s")
plot(0:50, sizes - 40, xlab = "Length",
ylab = "Bytes excluding overhead", type = "n")
abline(h = 0, col = "grey80")
abline(h = c(8, 16, 32, 48, 64, 128), col = "grey80")
abline(a = 0, b = 4, col = "grey90", lwd = 4)
lines(sizes - 40, type = "s")
x <- numeric(1e6)
object_size(x)
y <- list(x, x, x)
object_size(y)
```
**logical:**
```{r}
sizes <- sapply(0:50, function(n) object_size(vector("logical", n)))
plot(0:50, sizes, xlab = "Length", ylab = "Size (bytes)",
type = "s")
plot(0:50, sizes - 40, xlab = "Length",
ylab = "Bytes excluding overhead", type = "n")
abline(h = 0, col = "grey80")
abline(h = c(8, 16, 32, 48, 64, 128), col = "grey80")
abline(a = 0, b = 4, col = "grey90", lwd = 4)
lines(sizes - 40, type = "s")
x <- logical(1e6)
object_size(x)
y <- list(x, x, x)
object_size(y)
```
**complex:**
```{r}
sizes <- sapply(0:50, function(n) object_size(vector("complex", n)))
plot(0:50, sizes, xlab = "Length", ylab = "Size (bytes)",
type = "s")
plot(0:50, sizes - 40, xlab = "Length",
ylab = "Bytes excluding overhead", type = "n")
abline(h = 0, col = "grey80")
abline(h = c(8, 16, 32, 48, 64, 128), col = "grey80")
abline(a = 0, b = 4, col = "grey90", lwd = 4)
lines(sizes - 40, type = "s")
x <- complex(1e6)
object_size(x)
y <- list(x, x, x)
object_size(y)
```
1. __<span style="color:red">Q</span>__: If a data frame has one million rows, and three variables (two numeric, and
one integer), how much space will it take up? Work it out from theory,
then verify your work by creating a data frame and measuring its size.
__<span style="color:green">A</span>__: From the textbook we now that
* an integer`s size is 40 byte plus 4 byte per allocated entry,
* a numerics`s size is 40 byte plus 8 byte per allocated entry.
So we can calculate the size via:
`object_size(df) = 1 * (40 + 4 * 1.000.000) + 2 * (40 + 8 * 1.000.000)
= 20.000.120` bytes.
And test this via:
```{r}
df <- data.frame(int1 = integer(1000000),
num1 = numeric(1000000),
num2 = numeric(1000000))
as.integer(object_size(df))
```
Note that we observe a small difference, because we didn't include the costs for creating the `data.frame()` in our previous calculations.
1. __<span style="color:red">Q</span>__: Compare the sizes of the elements in the following two lists. Each
contains basically the same data, but one contains vectors of small
strings while the other contains a single long string.
```{r}
vec <- lapply(0:50, function(i) c("ba", rep("na", i)))
str <- lapply(vec, paste0, collapse = "")
```
__<span style="color:green">A</span>__:
```{r}
vec <- lapply(0:50, function(i) c("ba", rep("na", i)))
str <- lapply(vec, paste0, collapse = "")
object_size(vec)
object_size(str)
object_size(vec, str)
```
1. __<span style="color:red">Q</span>__: Which takes up more memory: a factor (`x`) or the equivalent character
vector (`as.character(x)`)? Why?
__<span style="color:green">A</span>__: To be exact: it depends on the length of unique elements in relation to the overall length of the vector.
* In case of a long vector with only a few levels, the character takes approximately twice the memory of a factor:
```{r}
object_size(rep(letters[1:20], 1000))
object_size(factor(rep(letters[1:20], 1000)))
```
That is, because a character allocates 8 bytes per entry (if the entry has less than 8 signs, otherwise roughly one byte per sign) and a factor equals an integer (allocates only 4 byte per entry) with a character vector attribute that contains the levels (unique elements) of the vector:
```{r}
object_size(1:20000)
object_size(letters[1:20])
```
* Of course the factor will allocate more memory, if all entries are unique:
```{r}
object_size(letters[1:20])
object_size(factor(letters[1:20]))
```
1. __<span style="color:red">Q</span>__: Explain the difference in size between `1:5` and `list(1:5)`.
__<span style="color:green">A</span>__: An empty list needs 40 bytes. For each entry 8 bytes are added (note that for short lists memory is allocated in chunks, similar as explained for integers in the textbook). We can see this via:
```{r}
object_size(vector("list",0))
object_size(vector("list",1))
```
Since `c(1:5)` needs 48 bytes, `list(1:5)` takes 96. So in general the cost for saving atomics within a list is 40 bytes for the list plus 8 bytes per atomic/listentry.
## Memory profiling with lineprof
1. __<span style="color:red">Q</span>__: When the input is a list, we can make a more efficient `as.data.frame()`
by using special knowledge. A data frame is a list with class `data.frame`
and `row.names` attribute. `row.names` is either a character vector or
vector of sequential integers, stored in a special format created by
`.set_row_names()`. This leads to an alternative `as.data.frame()`:
```{r}
to_df <- function(x) {
class(x) <- "data.frame"
attr(x, "row.names") <- .set_row_names(length(x[[1]]))
x
}
```
What impact does this function have on `read_delim()`? What are the
downsides of this function?
1. __<span style="color:red">Q</span>__: Line profile the following function with `torture = TRUE`. What is
surprising? Read the source code of `rm()` to figure out what's going on.
```{r}
f <- function(n = 1e5) {
x <- rep(1, n)
rm(x)
}
```
## Modification in place
1. __<span style="color:red">Q</span>__: The code below makes one duplication. Where does it occur and why?
(Hint: look at `refs(y)`.)
```{r}
x <- data.frame(matrix(runif(100 * 1e4), ncol = 100))
medians <- vapply(x, median, numeric(1))
y <- as.list(x)
for(i in seq_along(medians)) {
y[[i]] <- y[[i]] - medians[i]
}
```
__<span style="color:green">A</span>__: It occurs in the first iteration of the for loop. `refs(y)` is 2 before the for loop, because `y` is created via `as.list()`, which is not a primitive and so sets the refs counter up to two. Therefore **R** makes a copy, `refs(y)` becomes 1 and the following modifications will occur in place. The following code illustrates this behaviour, when run in the **R**Gui. (Note that `refs()` will always return 2, when run in RStudio, as stated in the textbook. Note also, that you could detect this behaviour with the `tracemem()` function).
```{r, eval = FALSE}
library(pryr)
rm(list = ls(all = TRUE))
x <- data.frame(matrix(runif(100 * 1e4), ncol = 4))
medians <- vapply(x, median, numeric(1))
y <- as.list(x)
is.primitive(as.list)
# [1] FALSE
for(i in seq_along(medians)) {
print(c(address(y), refs(y)))
y[[i]] <- y[[i]] - medians[i]
}
# [1] "0x46c4a98" "2"
# [1] "0x11de30c8" "1"
# [1] "0x11de30c8" "1"
# [1] "0x11de30c8" "1"
```
1. __<span style="color:red">Q</span>__: The implementation of `as.data.frame()` in the previous section has one
big downside. What is it and how could you avoid it?
[long-vectors]: http://cran.r-project.org/doc/manuals/R-ints.html#Long-vectors