same file name for all parts of one post JSON in the tar file #7

Open
sfelixwu opened this issue Feb 8, 2016 · 4 comments

sfelixwu commented Feb 8, 2016

This question is mainly for Fredrik. I found a potential issue with the tar files generated under /data2/sincere/crawled/phar: many entries within the same tar file share exactly the same file name, which means that if I use "tar xvf" to extract it, I only get the last part of each post. The information is there, but I am unsure (1) how to extract it correctly and (2) why multiple files have the same name (intentional or not). Here is an example that I put in Dropbox. It contains only two posts, but each post has multiple parts (all with the same name):

https://www.dropbox.com/sh/m2vp2v6vdhcyab5/AAB7dVRztjkxK5GSwEmzKg4ca?dl=0

I figured out a way to extract the original JSON. I ran "tar xOvf ../Samsung_Mobile-114219621960016.tar 114219621960016_2012-12-14T13_114219621960016_375711795852975.json > a1.json"

I moved both a1.json and a2.json to the same Dropbox folder as well. The tar file contains about 20-21 files belonging to only TWO posts (i.e., each post was broken into multiple files), so a1.json (and likewise a2.json) is the single concatenated file for ONE post. I had to use tar's O option to redirect the output to stdout; otherwise each part overwrites the previous one during extraction because they all share the same file name.
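The `tar xOvf ... > a1.json` workaround above handles one member name at a time. A minimal Python sketch of the same idea for every name in an archive follows. It assumes, as the manual extraction does, that the parts of a post appear in archive order and that plain byte concatenation reproduces the original JSON. The function name `concat_duplicate_members` is hypothetical and not part of the crawler.

```python
import tarfile


def concat_duplicate_members(tar_source):
    """Return {member_name: bytes}, joining same-named members in archive order.

    `tar_source` may be a filesystem path or a seekable binary file object.
    """
    if hasattr(tar_source, "read"):
        archive = tarfile.open(fileobj=tar_source, mode="r:*")
    else:
        archive = tarfile.open(name=tar_source, mode="r:*")

    parts = {}
    with archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue
            # Passing the TarInfo object (not just the name) reads this exact
            # entry, so earlier entries with the same name are not skipped.
            data = archive.extractfile(member).read()
            parts.setdefault(member.name, []).append(data)

    # Join the pieces of each name in the order they were stored.
    return {name: b"".join(chunks) for name, chunks in parts.items()}
```

Each value in the returned mapping can then be written to its own file, which avoids the overwrite that happens when `tar xvf` extracts same-named entries one after another.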

Update: @fredrike, now I think we have a bigger problem. Once a tar file grows large (~128 MB), it gets gzipped, but in the gzipped file only *** ONE *** of the files sharing a name is kept! This means that potentially most of the JSONs under the gz directory contain only a small fraction of a complete post. I will post the compressed file to the folder as well so you can see. Very bad news for our JSONs today, especially the bigger ones. I believe we still need to develop a program to regenerate the JSONs for all posts from SINCERE.dB after all.
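One way to gauge how many archives are affected is to count repeated member names in a plain tar and in its gzipped counterpart and compare the two. The sketch below is only an illustration under that assumption; the helper name `duplicate_member_counts` is hypothetical, and it relies on Python's `tarfile` opening both `.tar` and `.tar.gz` files with mode `"r:*"`.

```python
import tarfile
from collections import Counter


def duplicate_member_counts(path):
    """Map each repeated member name in a tar or tar.gz archive to its count."""
    with tarfile.open(path, mode="r:*") as archive:
        counts = Counter(m.name for m in archive.getmembers() if m.isfile())
    # Keep only names that occur more than once.
    return {name: n for name, n in counts.items() if n > 1}
```

If a name shows up many times in the uncompressed tar but the gzipped archive reports no repeats for it, that would be consistent with the loss described above.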

fredrik commented Feb 9, 2016

Hello @sfelixwu 👋

fredrike added a commit that referenced this issue Feb 9, 2016

fredrike commented Feb 9, 2016

The last part is mainly related to this bug in PHP: https://bugs.php.net/bug.php?id=71555

fredrike self-assigned this Feb 9, 2016

fredrike commented Feb 9, 2016

@fredrik, actually @sfelixwu was referring to me 😃


fredrik commented Feb 10, 2016

@fredrike: Yeah, I figured as much 😄
