# Curate demographic table {#curate-demog-table}
In this chapter, we curate a demographic table containing the following information:
* the date of birth
* sex
* age at UKB study initial assessment
* age at UKB study repeat assessment
* date of UKB study initial assessment
* date of UKB study repeat assessment
* censored date
The censored date is defined as the earliest of the administrative censoring date, the date of last contact (lost to follow-up) and the date of death.
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, eval=F)
```
Load packages.
```{r, message = F}
library(tidyverse)
library(lubridate)
```
Load formatted raw demographic data.
```{r}
demog <- readRDS("generated_data/demog_UKB.RDS")
```
Define the date of birth and sex.
```{r}
demog <- demog %>%
  rename(YOB = f.34.0.0,      # year of birth
         MOB = f.52.0.0) %>%  # month of birth
  # No day is supplied, so make_date() defaults to the first of the month
  mutate(DOB = lubridate::make_date(YOB, MOB),
         SEX = as.character(f.31.0.0))
```
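As a small illustration of the step above, the sketch below builds dates of birth from made-up year and month values. Because no day is passed, `make_date()` uses its default `day = 1`. The second half covers a hypothetical case in which the month arrives as English month names in place of integers; whether that applies depends on how the raw data were formatted upstream, so treat it as an assumption.

```r
library(lubridate)

# Made-up values, not UKB data
yob <- c(1950, 1965)
mob_int <- c(3, 7)

# With no day supplied, make_date() uses day = 1
dob <- make_date(yob, mob_int)

# Hypothetical case: month stored as a factor of English month names
mob_name <- factor(c("March", "July"))
mob_from_name <- match(as.character(mob_name), month.name)
dob_from_name <- make_date(yob, mob_from_name)
```

Both routes give the first day of March 1950 and of July 1965.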
Define the dates of the UKB initial and repeat assessments.
```{r}
demog <- demog %>% rename(date_init = f.53.0.0, date_repeat = f.53.1.0)
```
We define the administrative censoring date (study end date) based on the [inpatient record origin](https://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=40022). These dates are updated periodically. The most recent censoring dates can be found [here](https://biobank.ndph.ox.ac.uk/ukb/exinfo.cgi?src=Data_providers_and_dates#:~:text=Censoring%20dates&text=The%20censoring%20date%20is%20the,day%20of%20the%20previous%20month.) in the "Showcase censoring date" field of the table under the "Hospital inpatient data" section. The censoring dates below are taken from the page as accessed on Feb 22, 2022:
* Patient Episode Database for Wales (PEDW): Feb 28 2018
* Scottish Morbidity Record (SMR): Jul 31 2021
* Hospital Episode Statistics for England (HES): Sep 30 2021
PEDW, SMR and HES are hospital admission keys (fields `f.40022.0.0`, `f.40022.0.1` and `f.40022.0.2` in `demog`) which map to certain administrative censoring dates.
Define a dictionary which maps hospital admission keys to administrative censoring dates.
```{r}
censor_dates <- c(PEDW = as.Date("2018-02-28"),
                  SMR  = as.Date("2021-07-31"),
                  HES  = as.Date("2021-09-30"))
```
We take the administrative censoring date as the minimum of these three mapped dates for each subject.
```{r}
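demog <- demog %>%
  # as.character() ensures lookup by key name even if the fields are factors
  mutate(date_admin_censored = as.Date(pmin(censor_dates[as.character(f.40022.0.0)],
                                            censor_dates[as.character(f.40022.0.1)],
                                            censor_dates[as.character(f.40022.0.2)],
                                            na.rm = TRUE)))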
```
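To make the lookup-and-minimum step concrete, here is a minimal sketch on made-up keys, using two key columns instead of three for brevity. It also shows a point worth checking in the real data: when a named vector is indexed by a factor, R uses the factor's integer codes rather than its labels, so the result depends on the level order. The toy vectors (`key_a`, `key_b`, `key_fct`) are hypothetical.

```r
censor_dates <- c(PEDW = as.Date("2018-02-28"),
                  SMR  = as.Date("2021-07-31"),
                  HES  = as.Date("2021-09-30"))

# Made-up keys for three hypothetical subjects
key_a <- c("HES", "SMR", NA)
key_b <- c("PEDW", NA, NA)

# Earliest mapped date per subject; NA only when every key is missing
toy_censor <- unname(pmin(censor_dates[key_a], censor_dates[key_b], na.rm = TRUE))

# Pitfall: a factor indexes by its integer codes, not by its labels
key_fct  <- factor(c("HES", "PEDW"), levels = c("HES", "SMR", "PEDW"))
by_code  <- unname(censor_dates[key_fct])                # positions 1 and 3
by_label <- unname(censor_dates[as.character(key_fct)])  # names "HES" and "PEDW"
```

Here `toy_censor` is the PEDW date for the first subject, the SMR date for the second, and `NA` for the third, while `by_code` and `by_label` disagree because the factor levels are not in the same order as `censor_dates`.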
Note that some subjects have an unknown administrative censoring date (i.e., no inpatient record).
```{r}
demog %>% filter(is.na(date_admin_censored)) %>% select(f.eid) %>% nrow()
```
For these subjects with missing hospital admission keys, we infer the administrative censoring date from [data field 54](https://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=54) of the UKB assessment center data. This field records the city in which each participant attended an assessment or imaging visit, which indicates where we would expect their inpatient records to originate.
Define a mapping from country to administrative censoring date.
```{r}
country_to_censor_date_mapping <-
  c(wal  = as.Date("2018-02-28"),
    scot = as.Date("2021-07-31"),
    eng  = as.Date("2021-09-30"))
```
Load the data containing field 54. Note that subject IDs are masked as `Inf` for privacy reasons.
```{r}
bd <- readRDS("generated_data/assessment_center_UKB.RDS")
bd %>% head() %>% mutate(f.eid = Inf)
```
The field `f.54.0.0` contains codes indicating the city where the initial assessment was taken. We see that there is only one subject that is missing this value.
```{r}
rmid <- bd %>% filter(is.na(f.54.0.0))
rmid %>% mutate(f.eid = Inf)
```
This subject does exist in the demographic table, but all of the fields are missing except for the participant's ID.
```{r, message=F}
demog %>% right_join(rmid) %>% as_vector() %>% .[-1] %>% is.na %>% all
```
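The same "everything except the ID is missing" check can be written column by column in base R, which avoids coercing columns of different types into a single vector. The sketch below uses a made-up table and a hypothetical `ids_to_check` vector; it is an alternative formulation and not the check used above.

```r
# Toy table: subject 3 has an ID but no other data (made-up values)
toy <- data.frame(
  f.eid = c(1, 2, 3),
  YOB   = c(1950, 1962, NA),
  SEX   = c("Female", "Male", NA),
  stringsAsFactors = FALSE
)
ids_to_check <- 3

# Keep the selected rows and every column except the ID
rows <- toy[toy$f.eid %in% ids_to_check, setdiff(names(toy), "f.eid"), drop = FALSE]

# TRUE only if every non-ID column is entirely NA for the selected rows
all_missing <- all(vapply(rows, function(col) all(is.na(col)), logical(1)))
```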
Remove this subject from the demographic table.
```{r, message=F}
demog <- demog %>% anti_join(rmid)
```
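A minimal sketch of the removal on made-up data follows. It names the join key explicitly with `by = "f.eid"`, since `anti_join()` otherwise matches on every column the two tables share, and it checks that the row count drops by exactly the number of removed subjects. The toy tables are hypothetical.

```r
library(dplyr)

# Made-up stand-ins for demog and rmid
toy_demog <- tibble(f.eid = c(1, 2, 3), SEX = c("Female", "Male", NA))
toy_rmid  <- tibble(f.eid = 3, f.54.0.0 = NA_real_)

n_before <- nrow(toy_demog)

# Drop every subject that appears in toy_rmid, matching on the ID only
toy_demog <- toy_demog %>% anti_join(toy_rmid, by = "f.eid")

# The table should shrink by exactly the number of removed subjects
stopifnot(nrow(toy_demog) == n_before - nrow(toy_rmid))
```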
Now we use the values in the field `f.54.0.0` to determine the city, and in turn the country, in which each participant attended the initial assessment. First, load the [mapping file](https://biobank.ndph.ox.ac.uk/showcase/coding.cgi?id=10) from city code to city name.
```{r, message=F}
code_to_city_mapping <- read_tsv("raw_data/f.54.0.0_coding.tsv")
```
Second, define the mapping from city name to country name.
```{r}
city_to_country_map <-
  c(Glasgow = "scot",
    Edinburgh = "scot",
    Newcastle = "eng",
    Middlesborough = "eng",
    Leeds = "eng",
    Sheffield = "eng",
    Bury = "eng",
    Liverpool = "eng",
    Manchester = "eng",
    "Stockport (pilot)" = "eng",
    Stoke = "eng",
    Nottingham = "eng",
    Birmingham = "eng",
    Oxford = "eng",
    Reading = "eng",
    Hounslow = "eng",
    "Central London" = "eng",
    Croydon = "eng",
    Bristol = "eng",
    Wrexham = "wal",
    Swansea = "wal",
    Cardiff = "wal",
    Barts = "eng") # hospital in England
```
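Because the lookup is by city name, any name in the coding file that is spelled differently from the keys above would silently produce an `NA` country. A quick check along the following lines can surface such gaps. The objects here are abbreviated, made-up stand-ins, and the numeric codes are placeholders, not real UKB codes.

```r
# Abbreviated, made-up stand-ins for the real objects
toy_city_to_country <- c(Glasgow = "scot", Leeds = "eng", Cardiff = "wal")
toy_coding <- data.frame(
  coding  = c(1, 2, 3, 4),  # placeholder codes
  meaning = c("Glasgow", "Leeds", "Cardiff", "Barts"),
  stringsAsFactors = FALSE
)

# City names present in the coding table but absent from the mapping
unmapped <- setdiff(toy_coding$meaning, names(toy_city_to_country))
```

In this toy case `unmapped` contains `"Barts"`, because that entry was deliberately left out of the abbreviated mapping.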
```{r, echo=F, results = 'hide'}
#Memory management, not echo
demog <- demog %>% select(f.eid, date_admin_censored, f.191.0.0, date_init, DOB, date_repeat, SEX)
gc()
```
Finally, using the mappings defined above, we fill in the missing administrative censoring dates.
```{r, message=F}
demog <- demog %>%
  left_join(bd %>% select(f.eid, `f.54.0.0`), by = "f.eid") %>%
  rename(coding = `f.54.0.0`) %>%
  left_join(code_to_city_mapping, by = "coding") %>%
  mutate(country = city_to_country_map[meaning]) %>%
  # Keep the record-based date where available; otherwise use the country-based date
  mutate(date_admin_censored = if_else(!is.na(date_admin_censored),
                                       date_admin_censored,
                                       country_to_censor_date_mapping[country]))
# Drop the names carried over from the lookup vector
attr(demog$date_admin_censored, "names") <- NULL
```
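The fill-in logic can be traced on a small made-up table. The sketch below goes from city name to country to fallback date, then keeps the existing date wherever one is present. It uses `coalesce()` in place of `if_else()`, which should be equivalent for this purpose, and abbreviated hypothetical lookup vectors.

```r
library(dplyr)

# Abbreviated, hypothetical lookup vectors
country_dates <- c(wal  = as.Date("2018-02-28"),
                   scot = as.Date("2021-07-31"),
                   eng  = as.Date("2021-09-30"))
city_country <- c(Glasgow = "scot", Cardiff = "wal")

# Made-up subjects: only the first already has a censoring date
toy <- tibble(
  f.eid = c(1, 2, 3),
  date_admin_censored = as.Date(c("2021-09-30", NA, NA)),
  meaning = c("Glasgow", "Cardiff", "Glasgow")
)

toy <- toy %>%
  mutate(country  = unname(city_country[meaning]),
         fallback = unname(country_dates[country]),
         # Existing dates win; the country-based date only fills the gaps
         date_admin_censored = coalesce(date_admin_censored, fallback))
```

Subject 1 keeps its original date even though its assessment city maps to Scotland, while subjects 2 and 3 receive the Welsh and Scottish dates respectively.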
Next, we load the reformatted raw ICD table, which contains each subject's date of death.
```{r}
ICD <- readRDS("generated_data/ICD_UKB.RDS")
```
Define:
- date of death
- date of lost-to-follow-up
- age at study initiation date
- age at second-visit date
- date censored
```{r}
date_death_tab <- ICD %>% select(f.eid, f.40000.0.0) %>% arrange(f.eid) %>% data.frame()
demog <- demog %>% left_join(date_death_tab, by = "f.eid") %>%
  rename(date_death = "f.40000.0.0") %>%
  rename(date_lost_fu = "f.191.0.0")
demog <- demog %>%
  mutate(age_init = decimal_date(date_init) - decimal_date(DOB),
         age_repeat = decimal_date(date_repeat) - decimal_date(DOB)) %>%
  mutate(date_censored = pmin(date_admin_censored, date_lost_fu, date_death, na.rm = TRUE))
```
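A worked example on made-up dates may help in reading the two derived quantities. Age is the difference between two decimal years, and the censored date is the row-wise earliest of the three candidate dates, ignoring missing ones. The toy data frame below is hypothetical.

```r
library(lubridate)

# Made-up subjects: the first dies during follow-up, the second is lost to follow-up
toy <- data.frame(
  DOB                 = as.Date(c("1950-01-01", "1960-01-01")),
  date_init           = as.Date(c("2010-01-01", "2008-01-01")),
  date_admin_censored = as.Date(c("2021-09-30", "2021-07-31")),
  date_lost_fu        = as.Date(c(NA, "2015-06-15")),
  date_death          = as.Date(c("2019-03-10", NA))
)

# Age in years as a difference of decimal dates
toy$age_init <- decimal_date(toy$date_init) - decimal_date(toy$DOB)

# Earliest available date per subject; NAs are skipped
toy$date_censored <- pmin(toy$date_admin_censored, toy$date_lost_fu,
                          toy$date_death, na.rm = TRUE)
```

The first subject is 60 at the initial assessment and is censored at death; the second is 48 and is censored at the lost-to-follow-up date.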
Select the columns to keep from the demographic table.
```{r}
demog_sel <- demog %>% select(f.eid, DOB, SEX,
                              age_init, age_repeat,
                              date_init, date_repeat,
                              date_censored)
```
Save demographic table. Note that this table includes subjects whose genetic and reported sex do not match.
```{r}
saveRDS(demog_sel,"generated_data/pre_demog_sel.RDS")
```
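Before relying on the saved file, a couple of lightweight checks can be useful. The sketch below, on a made-up table written to a temporary file, checks that there is one row per subject (an assumption about what later steps expect, not something stated here) and that date columns are still of class `Date` after an RDS round trip.

```r
# Made-up stand-in for demog_sel
toy_sel <- data.frame(
  f.eid = c(1, 2, 3),
  date_censored = as.Date(c("2019-03-10", "2015-06-15", "2021-09-30"))
)

# Assumed requirement: one row per subject
stopifnot(!anyDuplicated(toy_sel$f.eid))

# Write to a temporary file and read it back
tmp <- tempfile(fileext = ".RDS")
saveRDS(toy_sel, tmp)
reloaded <- readRDS(tmp)

# Date columns keep their class through the round trip
stopifnot(inherits(reloaded$date_censored, "Date"))
```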