pre-process ArXivQA using ar5iv and pandoc
Current workflow: ar5iv --> HTML --> gfm
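The first arrow (ar5iv --> HTML) is just an HTTP fetch of the ar5iv rendering of each paper. A minimal sketch is below; the URL pattern is an assumption here, and this is not the actual `convert.py` logic.

```python
# Minimal sketch of the ar5iv --> HTML step (not the actual convert.py).
# Assumes the ar5iv URL pattern https://ar5iv.labs.arxiv.org/html/<arxiv_id>.
import pathlib
import urllib.request

def fetch_ar5iv_html(arxiv_id: str, out_dir: str = "ar5iv") -> pathlib.Path:
    """Download the ar5iv-rendered HTML for one paper and save it to disk."""
    url = f"https://ar5iv.labs.arxiv.org/html/{arxiv_id}"
    html = urllib.request.urlopen(url, timeout=30).read().decode("utf-8")
    out_path = pathlib.Path(out_dir) / f"{arxiv_id}.html"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(html, encoding="utf-8")
    return out_path

if __name__ == "__main__":
    fetch_ar5iv_html("2106.14834")  # any valid arXiv id
```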
You can download the resulting tarball here (PW: BLMPqh6mfLkekAQ3ufVDGj7M).
md5 hash values:
- ar5iv_v4_textonly.tar.gz: 5f01788d7e9e4d33b29279fa00574dac
- arxivQA_v4.json: e267227c3891eddea4c670f4fbf2102d
- arxivQA_v4_tex.json: 1d5ff485b9603bc32d5b5f36a0a0bb3d
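To check a download against the hashes above:

```python
# Verify downloads against the md5 values listed above.
import hashlib

EXPECTED = {
    "ar5iv_v4_textonly.tar.gz": "5f01788d7e9e4d33b29279fa00574dac",
    "arxivQA_v4.json": "e267227c3891eddea4c670f4fbf2102d",
    "arxivQA_v4_tex.json": "1d5ff485b9603bc32d5b5f36a0a0bb3d",
}

def md5sum(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file so the large tarball does not need to fit in memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

for name, expected in EXPECTED.items():
    assert md5sum(name) == expected, f"checksum mismatch for {name}"
```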
merged article + Q&A into one json (was Q&A only)
removed links, images, tables, references, appendix, and the redundant tail (a toy regex sketch follows the statistics below)
- 4,602 papers
- 44,545,951 tokens for clean papers
- 11,425,548 tokens for Q&A
- 55,971,499 tokens total (clean papers + Q&A)
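As mentioned above, a toy version of the markdown cleaning could look like the snippet below; the actual rules live in `data_clean.py` and also handle references, the appendix, and the redundant tail.

```python
# Illustrative only: a few regexes in the spirit of the cleaning step,
# not the actual data_clean.py rules.
import re

def strip_markdown_noise(md: str) -> str:
    md = re.sub(r"!\[[^\]]*\]\([^)]*\)", "", md)          # drop images
    md = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", md)      # links -> keep anchor text
    md = "\n".join(line for line in md.splitlines()
                   if not line.lstrip().startswith("|"))  # drop gfm table rows
    return md
```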
requires...
- docker (for pandoc)
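The HTML --> gfm step runs pandoc inside Docker. A sketch of that call using the public `pandoc/core` image is below; the exact image and flags used by `convert.py` may differ.

```python
# Sketch of the HTML --> gfm conversion via dockerized pandoc.
# Uses the public pandoc/core image; not necessarily the exact setup in convert.py.
import pathlib
import subprocess

def html_to_gfm(html_path: str) -> pathlib.Path:
    src = pathlib.Path(html_path).resolve()
    dst = src.with_suffix(".md")
    subprocess.run(
        ["docker", "run", "--rm",
         "-v", f"{src.parent}:/data",
         "pandoc/core",
         f"/data/{src.name}", "-f", "html", "-t", "gfm",
         "-o", f"/data/{dst.name}"],
        check=True,
    )
    return dst
```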
Directory layout:

* ArXivQA
* ar5iv
* arxivQA_script
  * convert.py
  * data_clean.py
  * ...
# execute in order
python3 arxivqa_get_ids.py # update `paper_ids.json`
python3 convert.py --start_index 0 --end_index 3 --url_to_html
python3 convert.py --start_index 0 --end_index 3 --html_to_md
python3 data_clean.py # e.g., deduplication
python3 merge_qa.py # add Q&A from ArXivQA, only for clean dataset
python3 aggregate.py # aggregate everything into one large `json` file
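`--start_index` / `--end_index` select a slice of `paper_ids.json`, so a full conversion run can be chunked roughly as below (the chunk size is a placeholder, and `--end_index` is assumed to be exclusive).

```python
# Sketch: run convert.py over all papers in chunks (values are placeholders).
import subprocess

TOTAL_PAPERS = 4602   # number of papers in paper_ids.json
CHUNK = 500

for start in range(0, TOTAL_PAPERS, CHUNK):
    end = min(start + CHUNK, TOTAL_PAPERS)
    for stage in ("--url_to_html", "--html_to_md"):
        subprocess.run(
            ["python3", "convert.py",
             "--start_index", str(start), "--end_index", str(end), stage],
            check=True,
        )
```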
Resulting directory structure:

* ar5iv
  * (yymm)
    * (id)
      * assets
      * (yymm.id).html # original HTML
      * (yymm.id).md # converted markdown
      * (yymm.id).json # Q&A converted from ArXivQA (only for clean dataset)
    * ...
  * ...
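For reference, collecting every converted markdown file from this tree (roughly the flavor of `aggregate.py`, which additionally merges the per-paper Q&A json) can be sketched as:

```python
# Sketch: walk the ar5iv output tree and collect converted markdown files.
# Illustrative only; aggregate.py also merges the per-paper Q&A json.
import json
import pathlib

def collect_markdown(root: str = "ar5iv") -> dict[str, str]:
    papers = {}
    for md_path in pathlib.Path(root).glob("*/*/*.md"):  # (yymm)/(id)/(yymm.id).md
        papers[md_path.stem] = md_path.read_text(encoding="utf-8")
    return papers

if __name__ == "__main__":
    data = collect_markdown()
    pathlib.Path("aggregated.json").write_text(
        json.dumps(data, ensure_ascii=False), encoding="utf-8")
```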