---
title: "Processing Recipes"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
The most recent version of this file and data can be obtained with
```
git clone https://github.com/PeterClaussenSDSU/JoyOfCooking2018.git
```
# List Recipes
The first step in creating a recipe database is to process the individual recipe files. The file structure was chosen to allow a group of individuals to contribute, and to allow individual recipes to be added to the whole independently of any coordinating files or tables. These files will be stored in:
```{r}
path.to.recipes <- "~/github/JoyOfCooking2018/RecipeTables"
```
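Since the path above assumes a particular clone location, we can optionally check that the directory exists before going further:
```{r,eval=FALSE}
#optional sanity check: stop with a clear message if the recipe
#directory is not where we expect it
if(!dir.exists(path.to.recipes)) {
  stop("Recipe directory not found: ", path.to.recipes)
}
```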
We'll make no assumptions about file types. The expected suffix is `.tab`, but we won't assume that all files conform. If we could guarantee that all files follow a specific format, we might select them with a pattern (a regular expression):
```{r,eval=FALSE}
recipe.files <- list.files(path = path.to.recipes, pattern = '\\.tab$')
```
Instead, we'll attempt to read all files in the table directory.
```{r}
recipe.files <- list.files(path = path.to.recipes)
length(recipe.files)
```
The original specification called for three data columns, with two more (`Year` and `Recipe`) to be added by parsing file names. A fourth data column, `NDB_No`, was added later, but we won't assume all files conform.
```{r}
expected.columns <- c("Amount","Measure","Ingredient")
added.columns <- c("NDB_No","Year","Recipe")
```
For processing, we'll use a single organizing structure: a list of the tables read from file. We'll use the file names as indexes.
```{r}
recipe.list <- vector("list", length(recipe.files))
names(recipe.list) <- recipe.files
```
## Utility functions
### Minimum headers
Does a list of names include the minimum required header names?
```{r}
# returns true if the list of names has the minimum required headers
minimum.headers <- function(names) {
  for(name in expected.columns) {
    if(!(name %in% names)) {
      #found a missing identifier
      return(FALSE)
    }
  }
  #all were found
  return(TRUE)
}
```
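Equivalently, the same test can be written in one line with `all()`; the loop above simply makes the intent explicit.
```{r,eval=FALSE}
#compact equivalent of the loop version
minimum.headers <- function(names) {
  all(expected.columns %in% names)
}
```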
**To do** A complementary function could be implemented to use partial matching and return the matched names, so the result can be used to map a table's columns to the expected column order. It should return an empty list when matches are not made.
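One possible sketch of that function, using `charmatch` for unique partial matches (the name `match.headers` is only a suggestion, not part of the current workflow):
```{r,eval=FALSE}
#sketch: return the actual header names that partially match the expected
#columns, in the order of expected.columns, so the result can be used to
#select and reorder a table's columns. Returns an empty vector if any
#expected column is missing or the match is ambiguous.
match.headers <- function(names) {
  idx <- charmatch(expected.columns, names)
  if(any(is.na(idx)) || any(idx == 0)) {
    return(character(0))
  }
  return(names[idx])
}
```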
### Parse File Name
Later, we'll need to parse the file name into a recipe name and a year. This is a placeholder for that function.
```{r}
#place holder for file name processing function
parse.filename <- function(name) {
  return(list(Year=0,Recipe=name))
}
```
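As an illustration only, if the file names happened to encode a four-digit year just before the extension (say, `Brownies.1936.tab` — a hypothetical convention that should be checked against the actual files in `RecipeTables`), the real implementation might look something like:
```{r,eval=FALSE}
#hypothetical sketch; assumes names of the form <Recipe>.<YYYY>.tab
parse.filename <- function(name) {
  base <- sub('\\.tab$', '', name)
  year <- sub('^.*\\.([0-9]{4})$', '\\1', base)
  if(grepl('^[0-9]{4}$', year)) {
    recipe <- sub('\\.[0-9]{4}$', '', base)
    return(list(Year=as.integer(year), Recipe=recipe))
  }
  #fall back to the placeholder behavior
  return(list(Year=0, Recipe=name))
}
```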
# Read Files
We'll try to read each file by first guessing the delimiter. For the time being, I'm assuming the files are either tab-delimited or space-delimited text.
We'll assume the first row is a header row, and that it contains only single words, so we can use `strsplit` to tokenize the row into names. If tabs don't work, we'll try splitting on spaces.
For the files whose first lines don't conform, we print the file name and header line for further inspection.
```{r}
conforming.files <- 0
nonconforming.files <- c()
for(recipe in recipe.files) {
  file.name <- paste(path.to.recipes,'/',recipe,sep='')
  current <- NULL
  #read the first line
  connection <- file(file.name)
  first.line <- readLines(connection,n=1)
  close(connection) #R won't be happy if we don't close the file
  #if the first line can be split by tabs, then use tabs as delimiter
  first.tokens <- strsplit(first.line,split='\t')[[1]]
  #some files will be written with quoted text.
  #The various `read` functions will strip the quotes,
  #but here we have to do this manually.
  first.tokens <- gsub('\"','', first.tokens)
  if(minimum.headers(first.tokens)) {
    #read as tab-delimited
    recipe.list[[recipe]] <- read.table(file.name,
                                        sep='\t',
                                        stringsAsFactors = FALSE,
                                        header=TRUE)
    conforming.files <- conforming.files+1
  } else {
    #second guess is spaces.
    first.tokens <- strsplit(first.line,split=' ')[[1]]
    first.tokens <- gsub('\"','', first.tokens)
    if(minimum.headers(first.tokens)) {
      recipe.list[[recipe]] <- read.table(file.name,
                                          header=TRUE,
                                          stringsAsFactors = FALSE)
      conforming.files <- conforming.files+1
    } else {
      #save a list for later
      nonconforming.files <- c(nonconforming.files,recipe)
      print(recipe)
      print(first.line)
    }
  }
}
conforming.files
```
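It is also useful to see how many files were set aside as non-conforming:
```{r}
length(nonconforming.files)
```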
**To do** Determine how to partially match data columns to read non-conforming files.
# Combining Recipes
Our final table will have six columns. We'll create an empty data frame to build from, and keep an empty copy to test the join for each individual recipe table.
```{r}
join.names <- c(expected.columns, added.columns)
join.frame <- data.frame(matrix(vector(), nrow=0, ncol=length(join.names),
                                dimnames=list(c(), join.names)),
                         stringsAsFactors=FALSE)
test.frame <- join.frame
```
## Fix column names
There are some simple inconsistencies in the column names that prevent merging. A simple one to correct is `NBD_No` versus `NDB_No`.
```{r}
for(idx in seq_along(recipe.files)) {
  file.name <- recipe.files[idx]
  tbl <- recipe.list[[file.name]]
  if('NBD_No' %in% names(tbl)) {
    nbd.idx <- which(names(tbl)=='NBD_No')
    names(tbl)[nbd.idx] <- added.columns[1]
    recipe.list[[file.name]] <- tbl
  }
}
```
**To do** Partial matches for other misspellings.
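One way to generalize the fix above (a sketch only, with hypothetical misspellings) is a small rename map applied to every table before joining:
```{r,eval=FALSE}
#hypothetical rename map; extend as new header variants are discovered
rename.map <- c(NBD_No = "NDB_No", Ingredients = "Ingredient")
for(file.name in recipe.files) {
  tbl <- recipe.list[[file.name]]
  hits <- names(tbl) %in% names(rename.map)
  if(any(hits)) {
    names(tbl)[hits] <- rename.map[names(tbl)[hits]]
    recipe.list[[file.name]] <- tbl
  }
}
```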
Some recipes already have `Year` and `Recipe` columns. If we assume these contain correct information, we can join them directly. If not, we parse the file name and add the `Year` and `Recipe` columns using the previously defined utility function.
Some may not have `NDB_No`. We can add a placeholder column and attempt to match ingredients to NDB numbers later.
```{r}
conforming.tables <- 0
nonconforming.tables <- c()
for(idx in seq_along(recipe.files)) {
  file.name <- recipe.files[idx]
  if(!(file.name %in% nonconforming.files)) {
    tbl <- recipe.list[[file.name]]
    if(length(names(tbl)) == 3) {
      #We tested against Ingredient, Amount and Measure, so append
      #an empty NDB_No. We will attempt to look those up later.
      #If we sort the data by ingredient, and multiple recipes have the
      #same ingredients, we may be able to save some searches.
      tbl$NDB_No <- NA
    }
    if(length(names(tbl)) == 4) {
      #we assume, for now, that the table is missing Year and Recipe,
      #so parse the file name for this information.
      ids <- parse.filename(file.name)
      tbl$Year <- ids$Year
      tbl$Recipe <- ids$Recipe
    }
    test.tbl <- NULL
    #put the join in a try block. This will allow us to
    #process all files without having to stop and fix errors.
    tryCatch(test.tbl <- rbind(test.frame, tbl[,join.names]),
             error=function(e) {
               print(file.name)
               print(head(tbl))
             })
    #if we can bind to the empty table, we can merge this table with
    #the rest
    if(!is.null(test.tbl)) {
      join.frame <- rbind(join.frame,tbl[,join.names])
      conforming.tables <- conforming.tables + 1
    } else {
      nonconforming.tables <- c(nonconforming.tables,file.name)
    }
  }
}
conforming.tables
```
```{r}
summary(join.frame)
```
**To do**
Iterate over non-conforming tables, determine which can be fixed programmatically.
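A possible first pass is simply to inspect the column names of the tables that failed the join, to see which could be repaired by renaming or reordering columns:
```{r,eval=FALSE}
#sketch: list the headers of the tables that could not be joined
for(file.name in nonconforming.tables) {
  print(file.name)
  print(names(recipe.list[[file.name]]))
}
```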