Error when downloading from Hugging Face #1

Open
conkeur opened this issue Jan 18, 2024 · 2 comments

conkeur commented Jan 18, 2024

Code used for the download:

from datasets import load_dataset
dataset_name = "Papersnake/people_daily_news"
dataset = load_dataset(dataset_name,cache_dir=r'xxx/')

Error message:

An error occurred while generating the dataset

All the data files must have the same columns, but at some point there are 2 missing columns ({'author', 'page'})

This happened while the json dataset builder was generating data using

..\downloads\d434406d0e80132d996bc6796817699b81390d86744e10acda0ec2ea71fead71

Please either edit the data files to have matching columns, or separate them into different configurations (see docs at https://hf.co/docs/hub/datasets-manual-configuration#multiple-configurations)
Traceback (most recent call last):
  File "_pydevd_bundle/pydevd_cython.pyx", line 546, in _pydevd_bundle.pydevd_cython.PyDBFrame._handle_exception
  File "C:\Program Files\Python39\lib\linecache.py", line 26, in getline
    def getline(filename, lineno, module_globals=None):
  File "C:\Program Files\Python39\lib\linecache.py", line 36, in getlines
    def getlines(filename, module_globals=None):
  File "C:\Program Files\Python39\lib\linecache.py", line 80, in updatecache
    def updatecache(filename, module_globals=None):
  File "C:\Program Files\Python39\lib\codecs.py", line 319, in decode
    def decode(self, input, final=False):
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb6 in position 41: invalid start byte
0.03s - Error on build_exception_info_response.
Traceback (most recent call last):
  File "c:\program files\microsoft visual studio\2022\community\common7\ide\extensions\microsoft\python\core\debugpy\_vendored\pydevd\_pydevd_bundle\pydevd_comm.py", line 1404, in build_exception_info_response
    def build_exception_info_response(dbg, thread_id, request_seq, set_additional_thread_info, iter_visible_frames_info, max_frames):
  File "C:\Program Files\Python39\lib\linecache.py", line 26, in getline
    def getline(filename, lineno, module_globals=None):
  File "C:\Program Files\Python39\lib\linecache.py", line 36, in getlines
    def getlines(filename, module_globals=None):
  File "C:\Program Files\Python39\lib\linecache.py", line 80, in updatecache
    def updatecache(filename, module_globals=None):
  File "C:\Program Files\Python39\lib\codecs.py", line 319, in decode
    def decode(self, input, final=False):
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb6 in position 41: invalid start byte

I opened the corresponding file, and this is its content:
{"url": "hf://datasets/Papersnake/people_daily_news@e61323bc7692312d907fc2d154b4ffc4290ce496/2004.jsonl.gz", "etag": null}
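The error above says that some data files lack the `author` and `page` columns. One way to see which files disagree is to compare the keys in each file of a local copy of the dataset. The following is a minimal sketch, not part of the original report. It assumes the files are gzip-compressed, UTF-8 JSON Lines named `*.jsonl.gz` (as the `2004.jsonl.gz` URL suggests) and sit in one directory; the function names `columns_per_file` and `missing_columns` are hypothetical.

```python
import gzip
import json
from pathlib import Path


def columns_per_file(data_dir):
    """Map each *.jsonl.gz file name to the set of keys seen in its records."""
    result = {}
    for path in sorted(Path(data_dir).glob("*.jsonl.gz")):
        keys = set()
        # Assumption: one JSON object per non-empty line, UTF-8 encoded.
        with gzip.open(path, "rt", encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    keys.update(json.loads(line).keys())
        result[path.name] = keys
    return result


def missing_columns(columns_by_file):
    """For each file, return the columns that other files have but it lacks."""
    all_columns = set()
    for keys in columns_by_file.values():
        all_columns |= keys
    return {name: all_columns - keys for name, keys in columns_by_file.items()}
```

Calling `missing_columns(columns_per_file("people_daily_news"))` on a local copy would list, per file, the columns that are absent; the directory name here is only an example.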

prnake (Owner) commented Jan 19, 2024

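The jsonl files for different years are not guaranteed to share the same format. I recommend downloading the data and processing it manually. For example, after installing git lfs, run git clone https://huggingface.co/datasets/Papersnake/people_daily_news to download the data.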
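As an illustration of the manual processing suggested here, the sketch below reads every `*.jsonl.gz` file from a local clone and pads each record so that all rows share the same columns. It is one possible approach under stated assumptions, not the maintainer's own script: the files are assumed to be gzip-compressed, UTF-8 JSON Lines in a single directory, and the names `iter_records` and `load_normalized` are hypothetical.

```python
import gzip
import json
from pathlib import Path


def iter_records(path):
    """Yield one dict per non-empty line of a gzip-compressed JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


def load_normalized(data_dir, fill_value=None):
    """Load all *.jsonl.gz files and give every record the same set of keys."""
    paths = sorted(Path(data_dir).glob("*.jsonl.gz"))

    # First pass: collect the union of column names across all files.
    columns = set()
    for path in paths:
        for record in iter_records(path):
            columns.update(record.keys())
    ordered = sorted(columns)

    # Second pass: fill columns a record lacks (e.g. author, page) with fill_value.
    rows = []
    for path in paths:
        for record in iter_records(path):
            rows.append({name: record.get(name, fill_value) for name in ordered})
    return rows
```

This version keeps all rows in memory, which may be too much for the full dataset; processing one year at a time would avoid that. If the `datasets` library is installed, the resulting list can be passed to `datasets.Dataset.from_list(rows)`.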

conkeur (Author) commented Jan 19, 2024

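> The jsonl files for different years are not guaranteed to share the same format. I recommend downloading the data and processing it manually. For example, after installing git lfs, run git clone https://huggingface.co/datasets/Papersnake/people_daily_news to download the data.

OK, I'll give it a try.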

Repository owner deleted a comment from m4s1t4 Mar 4, 2024