same file name for all parts of one post JSON in the tar file #7
Comments
Hello @sfelixwu 👋
fredrike added a commit that referenced this issue, Feb 9, 2016
…ables compression due to phar issue: https://bugs.php.net/bug.php?id=71555
fredrike added a commit that referenced this issue, Feb 9, 2016
…json files with the same name, related to #7
The last part is mainly related to this bug in PHP: https://bugs.php.net/bug.php?id=71555
@fredrike: Yeah, I figured as much 😄
This question is mainly for Fredrik. I found a potential issue with the tar files generated under /data2/sincere/crawled/phar: many files within the same tar file share exactly the same file name, which means that if I use "tar xvf" to extract the archive, I only get the last part of a post. The information is there, but I am unsure (1) how to extract it correctly and (2) why multiple files have the same name (intentional or not). Here is an example that I put in Dropbox. There are only two posts, but each post has multiple parts, all with the same name:
https://www.dropbox.com/sh/m2vp2v6vdhcyab5/AAB7dVRztjkxK5GSwEmzKg4ca?dl=0
I figured out a way to extract the original JSON. I ran "tar xOvf ../Samsung_Mobile-114219621960016.tar 114219621960016_2012-12-14T13_114219621960016_375711795852975.json > a1.json".
I moved both a1.json and a2.json to the same Dropbox folder as well. The tar file contains about 20-21 files belonging to only TWO posts (i.e., each post was broken into multiple files), so a1.json (and likewise a2.json) is the single concatenated file for ONE post. I had to use tar's O option to redirect the output to stdout; otherwise, each part overwrites the previous one during extraction because they share the same file name.
Update: @fredrike, now I think we have a bigger problem. Once a tar file grows large (~128 MB), it gets gzipped, but in the gzipped file only ONE of the files sharing a name is kept! This means that potentially most of the JSONs under the gz directory contain only a small fraction of a complete post. I will post the compressed file to the folder as well so you can see. Very bad news for our JSONs today, especially the bigger ones. I believe we still need to develop a program to regenerate the JSONs for all posts from SINCERE.dB after all.
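For anyone who wants to do the same thing for every post in a tar file at once, here is a rough Python sketch of the O-option workaround. It assumes the parts of a post appear in the archive in the right order and only need to be concatenated; the function name `concat_same_named_members` is made up for illustration and is not part of the crawler.

```python
import tarfile


def concat_same_named_members(tar_path):
    """Return {member name: bytes}, joining same-named members in archive order.

    Assumption: parts of one post are stored in order, so plain concatenation
    reproduces the original JSON (this mirrors what "tar xO" prints).
    """
    parts = {}
    with tarfile.open(tar_path, "r:*") as tar:
        # Iterating the archive yields every member, including duplicates.
        for member in tar:
            if not member.isfile():
                continue
            # Passing the TarInfo object (not the name) reads this exact
            # member; looking up by name would only find the last duplicate.
            data = tar.extractfile(member).read()
            parts.setdefault(member.name, []).append(data)
    return {name: b"".join(chunks) for name, chunks in parts.items()}
```

Each value in the returned dictionary can then be written to its own .json file, which avoids the overwrite that a plain "tar xvf" causes.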
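To get a sense of how many archives are affected, a quick check like the following might help. It is only a sketch: it assumes the files are plain or gzipped tar archives that Python's tarfile module can read, and `count_members_by_name` is a hypothetical helper name. Comparing the counts for a tar file against its gzipped counterpart would show which posts lost parts.

```python
import tarfile
from collections import Counter


def count_members_by_name(archive_path):
    """Count how many regular-file members share each name in an archive.

    "r:*" lets tarfile detect compression (e.g. gzip) automatically.
    """
    with tarfile.open(archive_path, "r:*") as tar:
        return Counter(m.name for m in tar if m.isfile())
```

A name with a count above 1 is a post split into several same-named parts; a name that had many parts in the tar file but only one in the gzipped file points to lost data.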