LSUN download Data #2748
Thanks a lot for the PR @AniTho! I have a few comments on how to make it more robust. Furthermore, the linter is failing with
./torchvision/datasets/lsun.py:97:1: W293 blank line contains whitespace
Could you fix that?
@@ -82,6 +82,7 @@ def __init__(
        # for each class, create an LSUNClassDataset
        self.dbs = []
        for c in self.classes:
            self._download(c)
The download should be optional. Other datasets have a download=True flag in the constructor that indicates whether the download should happen or not.
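A rough sketch of what such a flag could look like. LSUNSketch, its fetched attribute, and the stubbed _download are hypothetical names for illustration only, not the torchvision class; the default of False is an assumption that mirrors datasets such as MNIST.

```python
class LSUNSketch:
    """Hypothetical stand-in for the dataset class, only to show the flag."""

    def __init__(self, root, classes, download=False):
        self.root = root
        self.classes = list(classes)
        self.fetched = []  # illustration only: classes that _download was called for

        for c in self.classes:
            # Only touch the network when the caller explicitly asks for it.
            if download:
                self._download(c)

    def _download(self, c):
        # Stub: the real method would download and extract the archive for c.
        self.fetched.append(c)
```

With this shape, constructing the dataset without download=True never triggers a download, so existing users with data already on disk are unaffected.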
Okay I will make the changes.
Thanks for your support and guidance.
if not(os.path.isfile(os.path.join(self.root, file_name))):
download_url(url, self.root, file_name)
print("Extracting File")
extract_archive(file_path, self.root, True)
print("Done!!!")
You can make this simpler by using
vision/torchvision/datasets/utils.py
Lines 242 to 249 in 588f7ae
def download_and_extract_archive(
    url: str,
    download_root: str,
    extract_root: Optional[str] = None,
    filename: Optional[str] = None,
    md5: Optional[str] = None,
    remove_finished: bool = False,
) -> None:
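For illustration, the download step could shrink to something like the sketch below. The URL prefix is the one from the PR's _download; BASE_URL, archive_name, and download_class are hypothetical names, and the md5 value is left for the caller to supply. The torchvision import is done lazily so that the small name helper can be used without it.

```python
BASE_URL = "http://dl.yf.io/lsun/scenes/"  # prefix taken from the PR's _download


def archive_name(c):
    """Archive file name for a class such as 'bedroom_train' (example value)."""
    return c + "_lmdb.zip"


def download_class(root, c, md5=None):
    """Download and extract one class archive into root (hypothetical helper)."""
    # Lazy import: only needed when a download is actually requested.
    from torchvision.datasets.utils import download_and_extract_archive

    download_and_extract_archive(
        BASE_URL + archive_name(c),
        download_root=root,
        filename=archive_name(c),
        md5=md5,
        remove_finished=True,  # delete the .zip once it has been extracted
    )
```

Passing remove_finished=True covers the "delete zip file" step from the PR description, so the explicit download_url / extract_archive pair and the print calls are no longer needed.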
@@ -93,6 +94,16 @@ def __init__(
            self.indices.append(count)

        self.length = count

    def _download(self, c):
        url = 'http://dl.yf.io/lsun/scenes/{}'.format(c + '_lmdb.zip')
Before the download you could check if the lmdb file is present and if its MD5 checks out with
vision/torchvision/datasets/utils.py
Lines 38 to 43 in 588f7ae
def check_integrity(fpath: str, md5: Optional[str] = None) -> bool:
    if not os.path.isfile(fpath):
        return False
    if md5 is None:
        return True
    return check_md5(fpath, md5)
If that is the case you don't need to download anything.
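The skip logic could be sketched as follows. This is a standard-library approximation of the same check, not the torchvision helper itself; md5_of, ensure_file, and the fetch callback are hypothetical names introduced for the example.

```python
import hashlib
import os


def md5_of(path, chunk_size=1024 * 1024):
    """Hash the file in chunks so large archives are not read into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def ensure_file(path, md5, fetch):
    """Call fetch(path) only if path is missing or its MD5 does not match.

    If md5 is None, the mere presence of the file is accepted.
    Returns True if fetch was called, False if the existing file was kept.
    """
    if os.path.isfile(path) and (md5 is None or md5_of(path) == md5):
        return False
    fetch(path)
    return True
```

In the dataset, fetch would be whatever performs the actual download and extraction; the point is only that it runs when the integrity check fails.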
@pmeier Do I have to find md5 hash, if yes how? otherwise I can just directly check if the file exists or not
> Do I have to find md5 hash [...] otherwise I can just directly check if the file exists or not
Although not terribly common, download errors happen. Without a checksum, you have no way to tell whether the file is corrupt. Thus we prefer to have an MD5 checksum for most files. In your case that would mean for the .zip and .lmdb files of every class.
> if yes how?
You can download and extract all archives with the download logic you already have. Afterwards you can use a variety of tools. Depending on your setup you might have md5sum already installed. If you don't want additional software, you can use torchvision.datasets.utils.calculate_md5
vision/torchvision/datasets/utils.py
Line 26 in de90862
def calculate_md5(fpath: str, chunk_size: int = 1024 * 1024) -> str:
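As a rough sketch of how the checksums could be collected in one go: the function below hashes every regular file directly under a directory using only hashlib. md5_table is a hypothetical name, and the sketch assumes the files of interest sit directly in that directory; if an extracted LMDB turns out to be a directory itself, the function would need to descend into it.

```python
import hashlib
import os


def md5_table(root, chunk_size=1024 * 1024):
    """Map each regular file directly under root to its MD5 hex digest."""
    table = {}
    for name in sorted(os.listdir(root)):
        path = os.path.join(root, name)
        if not os.path.isfile(path):
            continue  # skip sub-directories; this sketch does not recurse
        digest = hashlib.md5()
        with open(path, "rb") as f:
            # Read in chunks so multi-gigabyte archives fit in memory.
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        table[name] = digest.hexdigest()
    return table


# Possible usage (path is a placeholder):
# for name, md5 in md5_table("path/to/lsun").items():
#     print(md5, name)
```

The printed pairs can then be pasted into the dataset definition and passed as the md5 argument when downloading.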
Codecov Report
@@ Coverage Diff @@
## master #2748 +/- ##
==========================================
+ Coverage 73.05% 73.12% +0.07%
==========================================
Files 96 96
Lines 8298 8317 +19
Branches 1291 1293 +2
==========================================
+ Hits 6062 6082 +20
Misses 1838 1838
+ Partials 398 397 -1
Closing this since it is stale and the prototype version of LSUN in #5390 includes downloads.
Added a function to download the LSUN data from the site, extract it, and then delete the zip file.