Commit

Persian News Corpus (solve mistake!!) (#42)
* add README.md

* restore the main readme

* update data/corpus readme
maryambiabani authored and sehsanm committed Dec 31, 2018
1 parent 6e68fb2 commit d6ca522
Showing 2 changed files with 18 additions and 2 deletions.
15 changes: 13 additions & 2 deletions README.md
@@ -1,2 +1,13 @@
Persian news corpus contains more than 120 million sentences from tnews.
you can download corpus from [here](https://sbuacir-my.sharepoint.com/personal/se_mahmoudi_sbu_ac_ir/Documents/Forms/All.aspx?slrid=5cbcb09e%2D9091%2D7000%2Db143%2D92a4031b9417&RootFolder=%2Fpersonal%2Fse%5Fmahmoudi%5Fsbu%5Fac%5Fir%2FDocuments%2Fsbunlp&FolderCTID=0x01200065B78F960C7F3B4E9E0BBD567D049028)
# embedding-benchmark
Word embedding benchmark project by Shahid Beheshti University NLP Lab

Please read [Our Wiki Page](https://github.com/sehsanm/embedding-benchmark/wiki) for more information

Folder structure:
* data/corpus This must be empty, as the code will download the corpus from an external repository into it.
* data/analogy Contains the analogy dataset(s)
* data/wordsim Contains the word similarity dataset(s)
* data/categories Contains the categories dataset(s)
* code This folder contains the code used to run all evaluation-related tasks and the utilities to download the corpus files
* scripts This folder contains cleansing/crawling scripts and any other one-off activities that need to be done.

5 changes: 5 additions & 0 deletions data/corpus/README.md
@@ -16,3 +16,8 @@ You can download the corpus using this [LINK](https://sbuacir-my.sharepoint.com/
irBlogs is a standard Persian weblogs collection that is suitable for studying Persian social networks and evaluation of graph mining and blog retrieval algorithms.

You can find the collection [here](http://dbrg.ut.ac.ir/irblogs/)

## Persian News Corpus
Persian News Corpus contains more than 120 million sentences from tnews.

You can download the corpus from [here](https://sbuacir-my.sharepoint.com/personal/se_mahmoudi_sbu_ac_ir/Documents/Forms/All.aspx?slrid=5cbcb09e%2D9091%2D7000%2Db143%2D92a4031b9417&RootFolder=%2Fpersonal%2Fse%5Fmahmoudi%5Fsbu%5Fac%5Fir%2FDocuments%2Fsbunlp&FolderCTID=0x01200065B78F960C7F3B4E9E0BBD567D049028)
