Commit eeb8298

Merge pull request #70 from JuliaML/cl/cora2
improve Cora + add PubMed and CiteSeer
2 parents aff7d26 + d40696a commit eeb8298

19 files changed: +424 -46 lines changed

.github/workflows/Documenter.yml

Lines changed: 2 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -10,6 +10,8 @@ on:
1010
jobs:
1111
build:
1212
runs-on: ubuntu-latest
13+
env:
14+
PYTHON: ""
1315
steps:
1416
- uses: actions/checkout@v2
1517
- uses: julia-actions/setup-julia@latest

.github/workflows/UnitTest.yml

Lines changed: 2 additions & 1 deletion
@@ -18,7 +18,8 @@ jobs:
1818
matrix:
1919
julia-version: ['1.0', '1', 'nightly']
2020
os: [ubuntu-latest, windows-latest, macOS-latest]
21-
21+
env:
22+
PYTHON: ""
2223
steps:
2324
- uses: actions/[email protected]
2425
- name: "Set up Julia"

Project.toml

Lines changed: 3 additions & 1 deletion
@@ -1,6 +1,6 @@
11
name = "MLDatasets"
22
uuid = "eb30cadb-4394-5ae3-aed4-317e484a6458"
3-
version = "0.5.8"
3+
version = "0.5.9"
44

55
[deps]
66
BinDeps = "9e28174c-4ba2-5203-b857-d8d62c4213ee"
@@ -10,6 +10,7 @@ DelimitedFiles = "8bb1440f-4735-579b-a4ab-409b98df4dab"
1010
FixedPointNumbers = "53c48c17-4a7d-5ca2-90c5-79b7896eea93"
1111
GZip = "92fee26a-97fe-5a0c-ad85-20a5f3185b63"
1212
MAT = "23992714-dd62-5051-b70f-ba57cb901cac"
13+
PyCall = "438e738f-606a-5dbb-bf0a-cddfbfd45ab0"
1314
Requires = "ae029012-a4dd-5104-9daa-d747884805df"
1415

1516
[compat]
@@ -20,6 +21,7 @@ FixedPointNumbers = "0.3, 0.4, 0.5, 0.6, 0.7, 0.8"
2021
GZip = "0.5"
2122
ImageCore = "0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8"
2223
MAT = "0.7, 0.8, 0.9, 0.10"
24+
PyCall = "1"
2325
Requires = "1"
2426
julia = "1"
2527

README.md

Lines changed: 3 additions & 1 deletion
@@ -20,7 +20,7 @@ Each dataset has its own dedicated sub-module.
2020
Find below a list of available datasets and links to their documentation.
2121

2222
#### Vision
23-
- [CIFAR10](https://juliaml.github.io/MLDatasets.jl/latest/datasets/CIFAR100/)
23+
- [CIFAR10](https://juliaml.github.io/MLDatasets.jl/latest/datasets/CIFAR10/)
2424
- [CIFAR100](https://juliaml.github.io/MLDatasets.jl/latest/datasets/CIFAR100/)
2525
- [EMNIST](https://juliaml.github.io/MLDatasets.jl/latest/datasets/EMNIST/)
2626
- [FashionMNIST](https://juliaml.github.io/MLDatasets.jl/latest/datasets/FashionMNIST/)
@@ -38,7 +38,9 @@ Find below a list of available datasets and links to their documentation.
3838
- [UD_English](https://juliaml.github.io/MLDatasets.jl/latest/datasets/UD_English/)
3939

4040
#### Graphs
41+
- [CiteSeer](https://juliaml.github.io/MLDatasets.jl/latest/datasets/CiteSeer/)
4142
- [Cora](https://juliaml.github.io/MLDatasets.jl/latest/datasets/Cora/)
43+
- [PubMed](https://juliaml.github.io/MLDatasets.jl/latest/datasets/PubMed/)
4244

4345

4446

docs/make.jl

Lines changed: 4 additions & 0 deletions
@@ -19,6 +19,7 @@ makedocs(
1919
),
2020

2121
authors = "Hiroyuki Shindo, Christof Stocker",
22+
# TODO: automate `pages` creation
2223
pages = Any[
2324
"Home" => "index.md",
2425
"Available Datasets" => Any[
@@ -40,10 +41,13 @@ makedocs(
4041
],
4142

4243
"Graphs" => Any[
44+
"CiteSeer" => "datasets/CiteSeer.md",
4345
"Cora" => "datasets/Cora.md",
46+
"PubMed" => "datasets/PubMed.md",
4447
],
4548

4649
],
50+
"Utils" => "utils.md",
4751
"LICENSE.md",
4852
],
4953
strict = true

docs/src/datasets/CiteSeer.md

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
1+
# CiteSeer
2+
3+
```@docs
4+
CiteSeer
5+
```
6+
7+
## API reference
8+
9+
```@docs
10+
CiteSeer.dataset
11+
```

docs/src/datasets/Cora.md

Lines changed: 1 addition & 1 deletion
@@ -7,5 +7,5 @@ Cora
77
## API reference
88

99
```@docs
10-
Cora.alldata
10+
Cora.dataset
1111
```

docs/src/datasets/PubMed.md

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
1+
# PubMed
2+
3+
```@docs
4+
PubMed
5+
```
6+
7+
## API reference
8+
9+
```@docs
10+
PubMed.dataset
11+
```

docs/src/utils.md

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
1+
# Utils
2+
3+
```@docs
4+
MLDatasets.read_planetoid_data
5+
```

src/CiteSeer/CiteSeer.jl

Lines changed: 79 additions & 0 deletions
@@ -0,0 +1,79 @@
1+
export CiteSeer
2+
3+
4+
"""
5+
CiteSeer
6+
7+
The CiteSeer citation network dataset from Ref. [1].
8+
Nodes represent documents and edges represent citation links.
9+
The dataset is designed for the node classification task.
10+
The task is to predict the category of a given paper.
11+
The dataset is retrieved from Ref. [2].
12+
13+
## Interface
14+
15+
- [`CiteSeer.dataset`](@ref)
16+
17+
## References
18+
19+
[1]: [Deep Gaussian Embedding of Graphs: Unsupervised Inductive Learning via Ranking](https://arxiv.org/abs/1707.03815)
20+
[2]: [Planetoid](https://github.com/kimiyoung/planetoid)
21+
"""
22+
module CiteSeer
23+
24+
using DataDeps
25+
using ..MLDatasets: datafile, read_planetoid_data
26+
using DelimitedFiles: readdlm
27+
28+
using PyCall
29+
30+
const DEPNAME = "CiteSeer"
31+
const LINK = "https://github.com/kimiyoung/planetoid/raw/master/data"
32+
const DOCS = "https://github.com/kimiyoung/planetoid"
33+
const DATA = "ind.citeseer." .* ["x", "y", "tx", "allx", "ty", "ally", "graph", "test.index"]
34+
35+
function __init__()
36+
register(DataDep(
37+
DEPNAME,
38+
"""
39+
Dataset: The $DEPNAME dataset.
40+
Website: $DOCS
41+
""",
42+
map(x -> "$LINK/$x", DATA),
43+
"7f7ec4df97215c573eee316de35754d89382011dfd9fb2b954a4a491057e3eb3", # if checksum omitted, will be generated by DataDeps
44+
# post_fetch_method = unpack
45+
))
46+
end
47+
48+
"""
49+
dataset(; dir=nothing, reverse_edges=true)
50+
51+
Retrieve the CiteSeer dataset. The output is a named tuple with fields
52+
```julia-repl
53+
julia> keys(CiteSeer.dataset())
54+
(:node_features, :node_labels, :adjacency_list, :train_indices, :val_indices, :test_indices, :num_classes, :num_nodes, :num_edges, :directed)
55+
```
56+
57+
In particular, `adjacency_list` is a vector of vectors,
58+
where `adjacency_list[i]` will contain the neighbors of node `i`
59+
through outgoing edges.
60+
61+
If `reverse_edges=true`, the graph will contain
62+
the reverse of each edge and the graph will be undirected.
63+
64+
See also [`CiteSeer`](@ref).
65+
66+
## Usage Examples
67+
68+
```julia
69+
using MLDatasets: CiteSeer
70+
data = CiteSeer.dataset()
71+
train_labels = data.node_labels[data.train_indices]
72+
```
73+
"""
74+
dataset(; dir=nothing, reverse_edges=true) =
75+
read_planetoid_data(DEPNAME, dir=dir, reverse_edges=reverse_edges)
76+
77+
78+
end #module
79+
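
Note on `reverse_edges`: the docstring in `src/CiteSeer/CiteSeer.jl` says that `adjacency_list[i]` holds the neighbors of node `i` through outgoing edges, and that `reverse_edges=true` adds the reverse of each edge so the graph becomes undirected. The sketch below illustrates that idea on a toy adjacency list. The helper `add_reverse_edges` is hypothetical and is not part of MLDatasets; how `read_planetoid_data` actually does this is not shown in this part of the diff.

```julia
# Hypothetical helper (not part of MLDatasets): return a copy of the
# adjacency list in which every edge i -> j also appears as j -> i.
function add_reverse_edges(adj::Vector{Vector{Int}})
    # Copy each neighbor list so the input is left untouched.
    sym = [copy(neighbors) for neighbors in adj]
    for (i, neighbors) in enumerate(adj)
        for j in neighbors
            # Add the reverse edge j -> i unless it is already present.
            i in sym[j] || push!(sym[j], i)
        end
    end
    return sym
end

# Toy directed graph with edges 1 -> 2, 1 -> 3 and 3 -> 2.
adj = [[2, 3], Int[], [2]]
undirected = add_reverse_edges(adj)
# undirected == [[2, 3], [1, 3], [2, 1]]
```

Going by the signature in the diff, calling `CiteSeer.dataset(reverse_edges=false)` should instead leave the edges as they are stored in the Planetoid files.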
