Description
Short description
The multi_news dataset fails to download because the source Google Drive URL it relies on appears to be broken or inaccessible. The download process errors out when trying to get a GDrive confirmation link.
Environment information
Operating System: Linux (via Google Colab/Jupyter notebook)
Python version: 3.11.13
tensorflow-datasets version: 4.9.9
tensorflow version: 2.18.0
Does the issue still exist with the last tfds-nightly package (pip install --upgrade tfds-nightly)?
Yes, the issue persists as it is related to a broken source URL, not a library code issue.
Reproduction instructions
The bug can be reproduced with the following minimal code snippet.
import tensorflow_datasets as tfds
# This command will fail during the download and preparation phase.
tf_dataset = tfds.load('multi_news', split='train')
Error Log
Here is the full stack trace produced by the code above:
WARNING:absl:Variant folder /root/TensorFlow_datasets/multi_news/1.0.0 has no dataset_info.json
Downloading and preparing dataset Unknown size (download: Unknown size, generated: Unknown size, total: Unknown size) to /root/TensorFlow_datasets/multi_news/1.0.0...
Dl Completed... 0% 0/1 [00:02<?, ? url/s]
Dl Size... 0/0 [00:02<?, ? MB/s]
Extraction completed... 0/0 [00:02<?, ? file/s]
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
...
/usr/local/lib/python3.11/dist-packages/tensorflow_datasets/core/download/downloader.py in _process_gdrive_confirmation(original_url, contents)
149 if not form:
--> 150 raise ValueError(
151 f'Failed to obtain confirmation link for Gdrive URL {original_url!r}.'
152 )
ValueError: Failed to obtain confirmation link for Gdrive URL 'https://drive.google.com/uc?export=download&id=1vMYY2WMrp1OZf+9exGtn5ptJ5exlvwJ0c'.
Expected behavior
The tfds.load() command should successfully download all the necessary source files, prepare the dataset, and return a tf.data.Dataset object without raising an exception.
Additional context
The root cause appears to be the invalid Google Drive URL hardcoded in the dataset's configuration file. The URL that fails is: https://drive.google.com/uc?export=download&id=1vMYY2WMrp1OZf+9exGtn5ptJ5exlvwJ0c.
This link needs to be updated to a valid source for the dataset to be usable again.
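To help confirm that the failure comes from the link itself rather than from tfds, the URL can be checked outside the library. The following is a minimal sketch using only the Python standard library. The helper names gdrive_file_id and probe are hypothetical and are not part of tensorflow_datasets, and the User-Agent header is an assumption about what Google Drive will accept.

```python
import urllib.error
import urllib.request
from urllib.parse import urlsplit

# URL copied verbatim from the error message in this report.
URL = (
    "https://drive.google.com/uc?export=download"
    "&id=1vMYY2WMrp1OZf+9exGtn5ptJ5exlvwJ0c"
)


def gdrive_file_id(url):
    """Return the raw `id` query parameter of the URL, or None if absent.

    The query string is split by hand so that characters such as '+'
    are returned exactly as written instead of being decoded.
    """
    for part in urlsplit(url).query.split("&"):
        key, _, value = part.partition("=")
        if key == "id":
            return value
    return None


def probe(url, timeout=10):
    """Fetch the URL once and return (HTTP status, Content-Type header).

    Hypothetical helper: it needs network access and does not call tfds.
    """
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.headers.get("Content-Type")
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses still carry a status and headers worth reporting.
        return err.code, err.headers.get("Content-Type")
```

Running probe(URL) in the same Colab session and attaching the returned status and content type to this issue would show whether Google Drive returns an error for this file or serves a page that tfds cannot parse.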