[multi_news] Dataset download fails due to broken Google Drive link #11100

@Awesome075

Short description
The multi_news dataset fails to download because the source Google Drive URL it relies on appears to be broken or inaccessible. The download process errors out when trying to get a GDrive confirmation link.

Environment information

Operating System: Linux (via Google Colab/Jupyter notebook)

Python version: 3.11.13

tensorflow-datasets version: 4.9.9
tensorflow version: 2.18.0

Does the issue still exist with the last tfds-nightly package (pip install --upgrade tfds-nightly)?
Yes. The issue persists because it is caused by a broken source URL, not by the library code.

Reproduction instructions
The bug can be reproduced with the following minimal code snippet.

import tensorflow_datasets as tfds

# This command will fail during the download and preparation phase.
tf_dataset = tfds.load('multi_news', split='train')

Error Log
Here is the full stack trace produced by the code above:

WARNING:absl:Variant folder /root/tensorflow_datasets/multi_news/1.0.0 has no dataset_info.json
Downloading and preparing dataset Unknown size (download: Unknown size, generated: Unknown size, total: Unknown size) to /root/tensorflow_datasets/multi_news/1.0.0...
Dl Completed... 0% 0/1 [00:02<?, ? url/s]
Dl Size... 0/0 [00:02<?, ? MB/s]
Extraction completed... 0/0 [00:02<?, ? file/s]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
...
/usr/local/lib/python3.11/dist-packages/tensorflow_datasets/core/download/downloader.py in _process_gdrive_confirmation(original_url, contents)
149         if not form:
--> 150             raise ValueError(
151                 f'Failed to obtain confirmation link for Gdrive URL {original_url!r}.'
152             )

ValueError: Failed to obtain confirmation link for Gdrive URL 'https://drive.google.com/uc?export=download&id=1vMYY2WMrp1OZf+9exGtn5ptJ5exlvwJ0c'.
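
For anyone triaging several failing datasets at once, the offending URL can be pulled out of the exception text. This is only a minimal sketch: `extract_gdrive_url` is a hypothetical helper, not part of tensorflow_datasets, and it assumes the message keeps the exact wording shown in the trace above.

```python
import re

# Pattern assumed from the error message in the trace above; the wording
# may differ in other tensorflow_datasets versions.
_GDRIVE_ERROR = re.compile(
    r"Failed to obtain confirmation link for Gdrive URL '([^']+)'"
)


def extract_gdrive_url(message):
    """Return the Google Drive URL quoted in the error message, or None."""
    match = _GDRIVE_ERROR.search(message)
    return match.group(1) if match else None
```

It could be called as `extract_gdrive_url(str(err))` inside an `except ValueError as err:` block wrapped around `tfds.load`.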

Expected behavior
The tfds.load() command should download all the necessary source files, prepare the dataset, and return a tf.data.Dataset object without raising an exception.

Additional context

The root cause appears to be the invalid Google Drive URL hardcoded in the dataset's configuration file. The URL that fails is: https://drive.google.com/uc?export=download&id=1vMYY2WMrp1OZf+9exGtn5ptJ5exlvwJ0c. This link needs to be updated to a valid source for the dataset to be usable again.
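
To check the link independently of tensorflow_datasets, something like the following could be used. It is a rough sketch using only the standard library: `classify_response` and `probe` are hypothetical helper names, and the classification is a heuristic of my own, not the logic TFDS applies in `_process_gdrive_confirmation`.

```python
import urllib.error
import urllib.request


def classify_response(status, content_type):
    """Roughly classify a download response (heuristic, not TFDS logic)."""
    if status >= 400:
        return "broken"
    if content_type.lower().startswith("text/html"):
        # An HTML body means a web page was served instead of the file
        # bytes, e.g. a confirmation or error page.
        return "html_page"
    return "file"


def probe(url, timeout=10.0):
    """Fetch `url` and classify the result. Performs a real network request."""
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            content_type = response.headers.get("Content-Type", "")
            return classify_response(response.status, content_type)
    except urllib.error.HTTPError as err:
        # HTTPError must be caught before the more general OSError.
        return classify_response(err.code, "")
    except OSError:
        # Covers URLError (DNS failure, refused connection) and timeouts.
        return "unreachable"


# Example (needs network access):
# print(probe("https://drive.google.com/uc?export=download&id=1vMYY2WMrp1OZf+9exGtn5ptJ5exlvwJ0c"))
```

A result of "file" would suggest the link itself works and the problem lies elsewhere; "broken", "unreachable", or "html_page" would be consistent with the failure reported here, although an HTML page alone does not prove the file is gone.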
