-
[two screenshots] Set these two places as shown in the screenshots; by default this example trains with two GPUs.
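The screenshots have not survived, but the two settings are presumably the GPU list and the ngpu value used to launch training, which default to a multi-GPU setup. As a minimal sketch (assuming a working paddlepaddle-gpu install), you can check how many GPUs Paddle actually sees before choosing ngpu:

```python
# Minimal sketch, assuming a working paddlepaddle-gpu install:
# check how many GPUs Paddle can see before picking the ngpu value
# used to launch the example's train.py.
import paddle

visible = paddle.device.cuda.device_count()
print(f"Paddle sees {visible} GPU(s); ngpu must not exceed this value.")
```

On a single-GPU machine, anything larger than 1 leads to the Invalid CUDAPlace errors reported below.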
-
I followed README.md step by step, so I'm sure the dataset paths are correct. The failure happens at the second step under Get Started, Model Training. My Paddle version is paddlepaddle-gpu==2.5.0 with cudatoolkit=11.6. The error output is below; I don't understand where it goes wrong.
E0719 10:46:00.151664 387175 place.cc:347] Invalid CUDAPlace(1), must inside [0, 1), because GPU number on your machine is 1
E0719 10:46:00.158169 387177 place.cc:347] Invalid CUDAPlace(3), must inside [0, 1), because GPU number on your machine is 1
I0719 10:46:00.170012 387174 tcp_utils.cc:181] The server starts to listen on IP_ANY:38913
I0719 10:46:00.170167 387174 tcp_utils.cc:130] Successfully connected to 127.0.0.1:38913
E0719 10:46:00.173277 387180 place.cc:347] Invalid CUDAPlace(6), must inside [0, 1), because GPU number on your machine is 1
E0719 10:46:00.177418 387176 place.cc:347] Invalid CUDAPlace(2), must inside [0, 1), because GPU number on your machine is 1
E0719 10:46:00.179487 387179 place.cc:347] Invalid CUDAPlace(5), must inside [0, 1), because GPU number on your machine is 1
C++ Traceback (most recent call last):
0 paddle::distributed::TCPStore::TCPStore(std::string, unsigned short, bool, unsigned long, int)
1 paddle::distributed::TCPStore::waitWorkers()
2 paddle::distributed::TCPStore::get(std::string const&)
3 paddle::distributed::TCPStore::wait(std::string const&)
4 void paddle::distributed::tcputils::receive_bytes<paddle::distributed::ReplyType>(int, paddle::distributed::ReplyType*, unsigned long)
Error Message Summary:
FatalError: `Termination signal` is detected by the operating system.
[TimeInfo: *** Aborted at 1689734760 (unix time) try "date -d @1689734760" if you are using GNU date ***]
[SignalInfo: *** SIGTERM (@0x3e80005e7f8) received by PID 387174 (TID 0x7f1a1be7d180) from PID 387064 ***]
E0719 10:46:00.199708 387181 place.cc:347] Invalid CUDAPlace(7), must inside [0, 1), because GPU number on your machine is 1
C++ Traceback (most recent call last):
No stack trace in paddle, may be caused by external reasons.
Error Message Summary:
FatalError: `Termination signal` is detected by the operating system.
[TimeInfo: *** Aborted at 1689734760 (unix time) try "date -d @1689734760" if you are using GNU date ***]
[SignalInfo: *** SIGTERM (@0x3e80005e7f8) received by PID 387178 (TID 0x7fd5c10ac180) from PID 387064 ***]
Traceback (most recent call last):
File "/home/iie/PaddleSpeech/paddlespeech/t2s/exps/ernie_sat/train.py", line 203, in
main()
File "/home/iie/PaddleSpeech/paddlespeech/t2s/exps/ernie_sat/train.py", line 197, in main
dist.spawn(train_sp, (args, config), nprocs=args.ngpu)
File "/home/iie/anaconda3/envs/paddlespeech_env/lib/python3.10/site-packages/paddle/distributed/spawn.py", line 606, in spawn
while not context.join():
File "/home/iie/anaconda3/envs/paddlespeech_env/lib/python3.10/site-packages/paddle/distributed/spawn.py", line 413, in join
self._throw_exception(error_index)
File "/home/iie/anaconda3/envs/paddlespeech_env/lib/python3.10/site-packages/paddle/distributed/spawn.py", line 423, in _throw_exception
raise Exception("Process %d terminated with exit code %d." %
Exception: Process 1 terminated with exit code 255.
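For reference, the traceback shows train.py launching one worker per args.ngpu via dist.spawn, and the place.cc errors show each worker being pinned to CUDAPlace(rank). Any rank at or above the number of physical GPUs (here 1) fails, and the launcher then terminates the remaining workers with SIGTERM. A hedged sketch of that relationship (train_sp and requested are illustrative stand-ins, not the PaddleSpeech code):

```python
# Illustrative sketch only, not the PaddleSpeech implementation:
# nprocs passed to paddle.distributed.spawn must not exceed the number
# of GPUs the machine actually has, or ranks >= device_count fail with
# "Invalid CUDAPlace".
import paddle
import paddle.distributed as dist


def train_sp(args, config):
    # stand-in for the real per-rank training entry point in train.py
    print("rank", dist.get_rank(), "running on", paddle.device.get_device())


if __name__ == "__main__":
    requested = 8                                  # e.g. args.ngpu from the launch command
    available = paddle.device.cuda.device_count()  # 1 on this machine; assumes at least one GPU
    dist.spawn(train_sp, args=(None, None), nprocs=min(requested, available))
```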