Commit 6ed3d21

Updated and new lecture notes
Raphael Gottardo committed
1 parent 41da99f, commit 6ed3d21

14 files changed, +1300 -374 lines changed

Diff for: Advanced_data_manipulation.Rmd

+8 lines changed
@@ -669,6 +669,14 @@ DT
 DT[,Name:=NULL]
 ```

+## Listing all tables
+
+With data.table you can always list the tables that you've created, which will also return basic information on this tables including size, keys, nrows, etc.
+
+```{r}
+tables()
+```
+

 ## Bonuses: fread
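The `tables()` chunk added above can be tried outside the lecture notes. The following is a minimal sketch, assuming the `data.table` package is installed; the table names `small_dt` and `keyed_dt` are hypothetical and do not come from the commit.

```r
library(data.table)

# Hypothetical example tables; names and contents are illustrative only
small_dt <- data.table(x = 1:3, y = letters[1:3])
keyed_dt <- data.table(id = c(3L, 1L, 2L), value = c(0.3, 0.1, 0.2))
setkey(keyed_dt, id)  # a keyed table reports its key in the KEY column

# tables() prints a summary of the data.tables in the calling environment
# and invisibly returns the same information as a data.table
info <- tables()
info$NAME
```

Capturing the return value is optional; it is shown here only because it makes the summary easy to inspect programmatically.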

Diff for: Advanced_data_manipulation.html

+38 -58 lines changed
@@ -950,6 +950,22 @@ <h3><code>.GRP</code></h3>
 ## 2: 3
 ## 3: 4</pre>

+</article></slide><slide class=''><hgroup><h2>Listing all tables</h2></hgroup><article id="listing-all-tables" class="smaller ">
+
+<p>With data.table you can always list the tables that you&#39;ve created, which will also return basic information on this tables including size, keys, nrows, etc.</p>
+
+<pre class = 'prettyprint lang-r'>tables()</pre>
+
+<pre >## NAME NROW NCOL MB COLS KEY
+## [1,] big_dt 1,000,000 3 20 x,y,z
+## [2,] dt 3 3 1 x,y,z
+## [3,] DT 3 2 1 Name,Salary
+## [4,] DT1 5 4 1 x,y,z,newcol x
+## [5,] DT2 1 3 1 x,y,w x
+## [6,] tmp1 456,976 3 32 x,y,z x
+## [7,] tmp2 456,976 3 32 x,y,z x
+## Total: 88MB</pre>
+
 </article></slide><slide class=''><hgroup><h2>Bonuses: fread</h2></hgroup><article id="bonuses-fread" class="smaller ">

 <p><code>data.table</code> also comes with <code>fread</code>, a file reader much, much better than <code>read.table</code> or <code>read.csv</code>:</p>
@@ -965,8 +981,8 @@ <h3><code>.GRP</code></h3>

 <pre >## Unit: milliseconds
 ## expr min lq mean median uq max neval
-## fread 310.5437 310.5437 310.5437 310.5437 310.5437 310.5437 1
-## r.t 7050.3093 7050.3093 7050.3093 7050.3093 7050.3093 7050.3093 1</pre>
+## fread 331.6005 331.6005 331.6005 331.6005 331.6005 331.6005 1
+## r.t 7447.6770 7447.6770 7447.6770 7447.6770 7447.6770 7447.6770 1</pre>

 <pre class = 'prettyprint lang-r'>unlink(file)</pre>

@@ -984,12 +1000,12 @@ <h3><code>.GRP</code></h3>
 <pre class = 'prettyprint lang-r'>microbenchmark(DT = rbindlist(dfs), DF = do.call(rbind, dfs), times = 5)</pre>

 <pre >## Unit: milliseconds
-## expr min lq mean median uq max
-## DT 5.981805 8.551772 8.997102 9.64301 9.767579 11.04134
-## DF 709.929230 843.444154 869.686238 884.76882 954.812659 955.47633
-## neval cld
-## 5 a
-## 5 b</pre>
+## expr min lq mean median uq max neval
+## DT 10.02123 10.46328 10.7023 10.62325 10.65706 11.7467 5
+## DF 787.99593 984.21012 1088.7448 1060.43489 1142.37993 1468.7029 5
+## cld
+## a
+## b</pre>

 </article></slide><slide class=''><hgroup><h2>Summary</h2></hgroup><article id="summary" class="smaller ">

@@ -1044,44 +1060,8 @@ <h3><code>.GRP</code></h3>
 require(org.Hs.eg.db) || biocLite(&quot;org.Hs.eg.db&quot;)</pre>

 <pre class = 'prettyprint lang-r'># Now we can use the org.Hs.eg.db to load a database
-library(org.Hs.eg.db)</pre>
-
-<pre >## Loading required package: AnnotationDbi
-## Loading required package: BiocGenerics
-## Loading required package: parallel
-##
-## Attaching package: &#39;BiocGenerics&#39;
-##
-## The following objects are masked from &#39;package:parallel&#39;:
-##
-## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
-## clusterExport, clusterMap, parApply, parCapply, parLapply,
-## parLapplyLB, parRapply, parSapply, parSapplyLB
-##
-## The following object is masked from &#39;package:stats&#39;:
-##
-## xtabs
-##
-## The following objects are masked from &#39;package:base&#39;:
-##
-## anyDuplicated, append, as.data.frame, as.vector, cbind,
-## colnames, do.call, duplicated, eval, evalq, Filter, Find, get,
-## intersect, is.unsorted, lapply, Map, mapply, match, mget,
-## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
-## rbind, Reduce, rep.int, rownames, sapply, setdiff, sort,
-## table, tapply, union, unique, unlist
-##
-## Loading required package: Biobase
-## Welcome to Bioconductor
-##
-## Vignettes contain introductory material; view with
-## &#39;browseVignettes()&#39;. To cite Bioconductor, see
-## &#39;citation(&quot;Biobase&quot;)&#39;, and for packages &#39;citation(&quot;pkgname&quot;)&#39;.
-##
-## Loading required package: GenomeInfoDb
-## Loading required package: DBI</pre>
-
-<pre class = 'prettyprint lang-r'># Create a connection
+library(org.Hs.eg.db)
+# Create a connection
 Hs_con &lt;- org.Hs.eg_dbconn()</pre>

 </article></slide><slide class=''><hgroup><h2>Using RSQLite</h2></hgroup><article id="using-rsqlite-1" class="smaller ">
@@ -1120,16 +1100,16 @@ <h3><code>.GRP</code></h3>

 <pre class = 'prettyprint lang-r'>gc()</pre>

-<pre >## used (Mb) gc trigger (Mb) max used (Mb)
-## Ncells 1141630 61.0 2320208 124.0 1628720 87.0
-## Vcells 1261314 9.7 3648212 27.9 1986142 15.2</pre>
+<pre >## used (Mb) gc trigger (Mb) max used (Mb)
+## Ncells 1840419 98.3 4643607 248.0 4276538 228.4
+## Vcells 15889273 121.3 47075374 359.2 47071241 359.2</pre>

 <pre class = 'prettyprint lang-r'>alias &lt;- dbGetQuery(Hs_con, &quot;SELECT * FROM alias;&quot;)
 gc()</pre>

-<pre >## used (Mb) gc trigger (Mb) max used (Mb)
-## Ncells 1245119 66.5 3288291 175.7 1628720 87.0
-## Vcells 1625283 12.4 3648212 27.9 2370850 18.1</pre>
+<pre >## used (Mb) gc trigger (Mb) max used (Mb)
+## Ncells 1943826 103.9 4643607 248.0 4276538 228.4
+## Vcells 16187585 123.6 47075374 359.2 47071241 359.2</pre>

 <pre class = 'prettyprint lang-r'>gene_info &lt;- dbGetQuery(Hs_con, &quot;SELECT * FROM gene_info;&quot;)
 chromosomes &lt;- dbGetQuery(Hs_con, &quot;SELECT * FROM chromosomes;&quot;)</pre>
@@ -1139,16 +1119,16 @@ <h3><code>.GRP</code></h3>
 <pre class = 'prettyprint lang-r'>CD154_df &lt;- dbGetQuery(Hs_con, &quot;SELECT * FROM alias a JOIN gene_info g ON g._id = a._id WHERE a.alias_symbol LIKE &#39;CD154&#39;;&quot;)
 gc()</pre>

-<pre >## used (Mb) gc trigger (Mb) max used (Mb)
-## Ncells 1290972 69.0 3288291 175.7 1628720 87
-## Vcells 2112441 16.2 5187496 39.6 3004340 23</pre>
+<pre >## used (Mb) gc trigger (Mb) max used (Mb)
+## Ncells 1989679 106.3 4643607 248.0 4276538 228.4
+## Vcells 16674744 127.3 47075374 359.2 47071241 359.2</pre>

 <pre class = 'prettyprint lang-r'>CD40LG_alias_df &lt;- dbGetQuery(Hs_con, &quot;SELECT * FROM alias a JOIN gene_info g ON g._id = a._id WHERE g.symbol LIKE &#39;CD40LG&#39;;&quot;)
 gc()</pre>

-<pre >## used (Mb) gc trigger (Mb) max used (Mb)
-## Ncells 1290989 69.0 3288291 175.7 1628720 87
-## Vcells 2112537 16.2 5187496 39.6 3004340 23</pre>
+<pre >## used (Mb) gc trigger (Mb) max used (Mb)
+## Ncells 1989696 106.3 4643607 248.0 4276538 228.4
+## Vcells 16674841 127.3 47075374 359.2 47071241 359.2</pre>

 </article></slide><slide class=''><hgroup><h2>Some SQL Commands</h2></hgroup><article id="some-sql-commands" class="smaller ">
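The pattern in the hunks above, calling `gc()` around a `dbGetQuery()` join to see how much memory the result takes, can be reproduced without `org.Hs.eg.db`. This is a rough sketch assuming the `DBI` and `RSQLite` packages are installed; the in-memory database and its two toy tables are hypothetical stand-ins for the real annotation schema, not data from the commit.

```r
library(DBI)
library(RSQLite)

# Hypothetical toy tables standing in for the org.Hs.eg.db schema
con <- dbConnect(RSQLite::SQLite(), ":memory:")
gene_info <- data.frame(id = 1:2, symbol = c("CD40LG", "TP53"),
                        stringsAsFactors = FALSE)
names(gene_info)[1] <- "_id"
alias <- data.frame(id = c(1L, 1L, 2L),
                    alias_symbol = c("CD154", "CD40L", "P53"),
                    stringsAsFactors = FALSE)
names(alias)[1] <- "_id"
dbWriteTable(con, "gene_info", gene_info)
dbWriteTable(con, "alias", alias)

# Column 2 of gc() is the memory in use (Mb) for Ncells and Vcells
mem_before <- gc()[, 2]
CD154_df <- dbGetQuery(con, "SELECT a.alias_symbol, g.symbol FROM alias a JOIN gene_info g ON g._id = a._id WHERE a.alias_symbol LIKE 'CD154';")
mem_after <- gc()[, 2]

# Change in memory in use (Mb) across the query
mem_after - mem_before
dbDisconnect(con)
```

Selecting only the needed columns, rather than `SELECT *`, keeps the result small and avoids two `_id` columns in the joined output.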

Diff for: Advanced_data_manipulation.md

+41 -60 lines changed
@@ -1102,6 +1102,27 @@ DT[, `:=`(Name, NULL)]
 ## 3: 4
 ```

+## Listing all tables
+
+With data.table you can always list the tables that you've created, which will also return basic information on this tables including size, keys, nrows, etc.
+
+
+```r
+tables()
+```
+
+```
+## NAME NROW NCOL MB COLS KEY
+## [1,] big_dt 1,000,000 3 20 x,y,z
+## [2,] dt 3 3 1 x,y,z
+## [3,] DT 3 2 1 Name,Salary
+## [4,] DT1 5 4 1 x,y,z,newcol x
+## [5,] DT2 1 3 1 x,y,w x
+## [6,] tmp1 456,976 3 32 x,y,z x
+## [7,] tmp2 456,976 3 32 x,y,z x
+## Total: 88MB
+```
+

 ## Bonuses: fread

@@ -1123,8 +1144,8 @@ microbenchmark(fread = fread(file), r.t = read.table(file, header = TRUE, sep =
 ```
 ## Unit: milliseconds
 ## expr min lq mean median uq max neval
-## fread 310.5437 310.5437 310.5437 310.5437 310.5437 310.5437 1
-## r.t 7050.3093 7050.3093 7050.3093 7050.3093 7050.3093 7050.3093 1
+## fread 331.6005 331.6005 331.6005 331.6005 331.6005 331.6005 1
+## r.t 7447.6770 7447.6770 7447.6770 7447.6770 7447.6770 7447.6770 1
 ```

 ```r
@@ -1153,12 +1174,12 @@ microbenchmark(DT = rbindlist(dfs), DF = do.call(rbind, dfs), times = 5)

 ```
 ## Unit: milliseconds
-## expr min lq mean median uq max
-## DT 5.981805 8.551772 8.997102 9.64301 9.767579 11.04134
-## DF 709.929230 843.444154 869.686238 884.76882 954.812659 955.47633
-## neval cld
-## 5 a
-## 5 b
+## expr min lq mean median uq max neval
+## DT 10.02123 10.46328 10.7023 10.62325 10.65706 11.7467 5
+## DF 787.99593 984.21012 1088.7448 1060.43489 1142.37993 1468.7029 5
+## cld
+## a
+## b
 ```

 ## Summary
@@ -1222,46 +1243,6 @@ require(org.Hs.eg.db) || biocLite("org.Hs.eg.db")
 ```r
 # Now we can use the org.Hs.eg.db to load a database
 library(org.Hs.eg.db)
-```
-
-```
-## Loading required package: AnnotationDbi
-## Loading required package: BiocGenerics
-## Loading required package: parallel
-##
-## Attaching package: 'BiocGenerics'
-##
-## The following objects are masked from 'package:parallel':
-##
-## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
-## clusterExport, clusterMap, parApply, parCapply, parLapply,
-## parLapplyLB, parRapply, parSapply, parSapplyLB
-##
-## The following object is masked from 'package:stats':
-##
-## xtabs
-##
-## The following objects are masked from 'package:base':
-##
-## anyDuplicated, append, as.data.frame, as.vector, cbind,
-## colnames, do.call, duplicated, eval, evalq, Filter, Find, get,
-## intersect, is.unsorted, lapply, Map, mapply, match, mget,
-## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
-## rbind, Reduce, rep.int, rownames, sapply, setdiff, sort,
-## table, tapply, union, unique, unlist
-##
-## Loading required package: Biobase
-## Welcome to Bioconductor
-##
-## Vignettes contain introductory material; view with
-## 'browseVignettes()'. To cite Bioconductor, see
-## 'citation("Biobase")', and for packages 'citation("pkgname")'.
-##
-## Loading required package: GenomeInfoDb
-## Loading required package: DBI
-```
-
-```r
 # Create a connection
 Hs_con <- org.Hs.eg_dbconn()
 ```
@@ -1327,9 +1308,9 @@ gc()
 ```

 ```
-## used (Mb) gc trigger (Mb) max used (Mb)
-## Ncells 1141630 61.0 2320208 124.0 1628720 87.0
-## Vcells 1261314 9.7 3648212 27.9 1986142 15.2
+## used (Mb) gc trigger (Mb) max used (Mb)
+## Ncells 1840419 98.3 4643607 248.0 4276538 228.4
+## Vcells 15889273 121.3 47075374 359.2 47071241 359.2
 ```

 ```r
@@ -1338,9 +1319,9 @@ gc()
 ```

 ```
-## used (Mb) gc trigger (Mb) max used (Mb)
-## Ncells 1245119 66.5 3288291 175.7 1628720 87.0
-## Vcells 1625283 12.4 3648212 27.9 2370850 18.1
+## used (Mb) gc trigger (Mb) max used (Mb)
+## Ncells 1943826 103.9 4643607 248.0 4276538 228.4
+## Vcells 16187585 123.6 47075374 359.2 47071241 359.2
 ```

 ```r
@@ -1357,9 +1338,9 @@ gc()
 ```

 ```
-## used (Mb) gc trigger (Mb) max used (Mb)
-## Ncells 1290972 69.0 3288291 175.7 1628720 87
-## Vcells 2112441 16.2 5187496 39.6 3004340 23
+## used (Mb) gc trigger (Mb) max used (Mb)
+## Ncells 1989679 106.3 4643607 248.0 4276538 228.4
+## Vcells 16674744 127.3 47075374 359.2 47071241 359.2
 ```

 ```r
@@ -1368,9 +1349,9 @@ gc()
 ```

 ```
-## used (Mb) gc trigger (Mb) max used (Mb)
-## Ncells 1290989 69.0 3288291 175.7 1628720 87
-## Vcells 2112537 16.2 5187496 39.6 3004340 23
+## used (Mb) gc trigger (Mb) max used (Mb)
+## Ncells 1989696 106.3 4643607 248.0 4276538 228.4
+## Vcells 16674841 127.3 47075374 359.2 47071241 359.2
 ```

 ## Some SQL Commands
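
The regenerated benchmark output in this file compares `rbindlist` with `do.call(rbind, ...)` and `fread` with `read.table`. The sketch below shows the same two comparisons on a small scale. It assumes `data.table` is installed and swaps `microbenchmark` for base `system.time` to avoid an extra dependency. The list `dfs` is rebuilt here with made-up contents, since its construction is not part of the diff.

```r
library(data.table)

set.seed(1)
# Hypothetical input: 50 small data frames with identical columns
dfs <- lapply(1:50, function(i) data.frame(id = i, x = rnorm(100)))

# Row-bind with data.table and with base R, timing each once
t_dt <- system.time(dt <- rbindlist(dfs))
t_df <- system.time(df <- do.call(rbind, dfs))
rbind(rbindlist = t_dt, do.call_rbind = t_df)

# Round trip through a temporary CSV file with both readers
file <- tempfile(fileext = ".csv")
write.csv(df, file, row.names = FALSE)
dt_read <- fread(file)
df_read <- read.table(file, header = TRUE, sep = ",")
unlink(file)
```

Timings on inputs this small are noisy and will not reproduce the ratios shown in the diff; the point is only that both pairs of calls return data of the same shape.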

Diff for: Bioconductor_intro.Rmd

+7 -5 lines changed
@@ -2,7 +2,7 @@
 title: 'Bioinformatics for Big Omics Data: Introduction to Bioconductor
 '
 author: "Raphael Gottardo"
-date: "January 15, 2014"
+date: "January 21, 2014"
 output:
   ioslides_presentation:
     fig_caption: yes
@@ -169,11 +169,12 @@ head(res)
 ## Finding specific data

 From previous table:
-bioc_package = bioconductor package
-hu6800 = Affymetrix HuGeneFL Genome Array annotation data (chip hu6800)
-rgu34a = Affymetrix Rat Genome U34 Set annotation data (chip rgu34a)

-title = data set title or study title
+- bioc_package = bioconductor package
+- hu6800 = Affymetrix HuGeneFL Genome Array annotation data (chip hu6800)
+- rgu34a = Affymetrix Rat Genome U34 Set annotation data (chip rgu34a)
+- title = data set title or study title
+
 For example BM_CD34-1a = bone marrow flow-sorted CD34+ cells (>95% purity) and has GSM sample number GSM575.

 ## Getting the data we want
@@ -239,6 +240,7 @@ library(Biobase)
 showMethods(class="eSet")
 ```
 in particular, the following methods are rather convenient:
+
 - assayData(obj); assayData(obj) `<-` value: access or assign assayData
 - phenoData(obj); phenoData(obj) `<-` value: access or assign phenoData
 - experimentData(obj); experimentData(obj) `<-` value: access or assign experimentData
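
The accessor methods listed in this hunk can be exercised on a small object. What follows is a minimal sketch assuming the Bioconductor package `Biobase` is installed; the expression matrix, sample names, and `group` column are made up for illustration and are not part of the lecture notes.

```r
library(Biobase)

set.seed(1)
# Hypothetical 4-gene by 3-sample expression matrix
exprs_mat <- matrix(rnorm(12), nrow = 4,
                    dimnames = list(paste0("gene", 1:4), paste0("s", 1:3)))

# Sample annotation; row names must match the matrix column names
pheno <- AnnotatedDataFrame(
  data.frame(group = c("a", "b", "a"), row.names = colnames(exprs_mat))
)

eset <- ExpressionSet(assayData = exprs_mat, phenoData = pheno)

assayData(eset)         # container holding the expression matrix
exprs(eset)             # the matrix itself
pData(phenoData(eset))  # sample annotation as a data.frame
experimentData(eset)    # experiment-level metadata (empty here)
```

`ExpressionSet` is used because it is a concrete subclass of `eSet`, so the same accessors apply to it.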
