LGSVL crash after ~2hrs of training #1705

Open
ido1Shapira opened this issue Aug 22, 2021 · 15 comments
Labels
help wanted (Extra attention is needed)

Comments

@ido1Shapira

ido1Shapira commented Aug 22, 2021

Hey,
I am using SVL Simulator to train an RL agent with the LGSVL Gym library.

After about 200 episodes, the simulator crashes. This has happened several times, and I can't train the agent properly because of it.

My environment:
NVIDIA Corporation GP102 [TITAN X]
31.3 GiB RAM
Intel® Core™ i9-10900K CPU @ 3.70GHz × 20
Ubuntu 20.04

Let me know if there is a log I can provide.

@hadiTab
Contributor

hadiTab commented Aug 23, 2021

Which version of the simulator are you using?
The log is located at ~/.config/unity3d/LGElectronics/SVLSimulator/Player.log

@ido1Shapira
Author

ido1Shapira commented Aug 23, 2021

The version we are using is:
svlsimulator-linux64-2021.2

The log is below:
Player.log

@EricBoiseLGSVL
Contributor

This may be a memory leak. Do you have any more info on the tests you are running?
How many egos, how many npcs, and or peds?
What api commands are you using?

EricBoiseLGSVL added the "question" label (Further information is requested) on Aug 24, 2021
@EricBoiseLGSVL
Contributor

EricBoiseLGSVL commented Aug 24, 2021

@revati-naik this might relate to the api reset issue

@ido1Shapira
Author

@EricBoiseLGSVL - I am using only one agent, without any NPCs or pedestrians. As for the API commands, I am simply training an agent to steer and throttle with an RL algorithm, and I mainly use the on_collision callback to reset the environment.

I also upgraded to svlsimulator-linux64-2021.2.2 and still have the same issue.
Here is the log file from the crash on the new version:
Player.log

Can you elaborate on the memory leak issue?

@EricBoiseLGSVL
Contributor

Ok, I see the error in this log. Looks like the connection is lost and throws a socket exception. @heeen Would this issue be resolved with the new web socket changes we have for 2021.3?

[CONN] GET api/v1/vehicles/30387e1f-4d7f-4b6e-98aa-d721a16c4e75
[CONN] HTTP 200 OK
[ANMGR] Initializing with TestReportId:<null>
Fallback handler could not load library /home/ido/Desktop/Reducing-Stress-reactions-to-Autonomous-Vehicles-Passengers/svlsimulator-linux64-2021.2.2/simulator_Data/Mono/libKernel32
Fallback handler could not load library /home/ido/Desktop/Reducing-Stress-reactions-to-Autonomous-Vehicles-Passengers/svlsimulator-linux64-2021.2.2/simulator_Data/Mono/libKernel32.so
Fallback handler could not load library /home/ido/Desktop/Reducing-Stress-reactions-to-Autonomous-Vehicles-Passengers/svlsimulator-linux64-2021.2.2/simulator_Data/Mono/Kernel32
Fallback handler could not load library /home/ido/Desktop/Reducing-Stress-reactions-to-Autonomous-Vehicles-Passengers/svlsimulator-linux64-2021.2.2/simulator_Data/Mono/libKernel32
Fallback handler could not load library /home/ido/Desktop/Reducing-Stress-reactions-to-Autonomous-Vehicles-Passengers/svlsimulator-linux64-2021.2.2/simulator_Data/Mono/libKernel32.so
Fallback handler could not load library /home/ido/Desktop/Reducing-Stress-reactions-to-Autonomous-Vehicles-Passengers/svlsimulator-linux64-2021.2.2/simulator_Data/Mono/libKernel32
8/24/2021 9:13:40 AM|Fatal|<>c__DisplayClass174_0.<startReceiving>b__2:0|System.IO.IOException: Unable to read data from the transport connection: Connection reset by peer. ---> System.Net.Sockets.SocketException: Connection reset by peer
                             at System.Net.Sockets.Socket.EndReceive (System.IAsyncResult asyncResult) [0x00012] in <aa976c2104104b7ca9e1785715722c9d>:0 
                             at System.Net.Sockets.NetworkStream.EndRead (System.IAsyncResult asyncResult) [0x00057] in <aa976c2104104b7ca9e1785715722c9d>:0 
                              --- End of inner exception stack trace ---
                             at System.Net.Sockets.NetworkStream.EndRead (System.IAsyncResult asyncResult) [0x0009b] in <aa976c2104104b7ca9e1785715722c9d>:0 
                             at WebSocketSharp.Ext+<>c__DisplayClass59_0.<ReadBytesAsync>b__0 (System.IAsyncResult ar) [0x00002] in <84d0020558774b6c88b3203b756a09ec>:0 
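
For anyone digging through a long Player.log, a small helper along these lines can pull out the fatal entries. It is only a sketch: it assumes fatal lines follow the `timestamp|Fatal|source|message` layout seen in the excerpt above, and `fatal_entries` is a made-up helper name, not part of the simulator or its API.

```python
from pathlib import Path

# Log location mentioned earlier in this thread; adjust if your install differs.
LOG_PATH = Path.home() / ".config/unity3d/LGElectronics/SVLSimulator/Player.log"


def fatal_entries(lines):
    """Return (timestamp, message) pairs for lines shaped like
    '<timestamp>|Fatal|<source>|<message>' (assumed layout)."""
    entries = []
    for line in lines:
        # Split into at most 4 fields so '|' inside the message is preserved.
        parts = line.rstrip("\n").split("|", 3)
        if len(parts) == 4 and parts[1] == "Fatal":
            entries.append((parts[0], parts[3]))
    return entries


if __name__ == "__main__" and LOG_PATH.exists():
    with LOG_PATH.open(errors="replace") as fh:
        for timestamp, message in fatal_entries(fh):
            print(timestamp, message)
```

Stack-trace lines that follow a fatal entry are skipped here, so open the log at the reported timestamp to see the full trace.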

EricBoiseLGSVL added the "help wanted" label (Extra attention is needed) and removed the "question" label (Further information is requested) on Aug 25, 2021
@ido1Shapira
Author

@EricBoiseLGSVL Any news on this?
And do you have an estimated release date for 2021.3?

@EricBoiseLGSVL
Contributor

We are in code freeze this week. :) We need rounds of testing and then docs. It should be out in a few weeks.

@yeehsienq

yeehsienq commented Nov 16, 2021

Hi, I'm training an RL agent in the Highway101GLE environment, and there is a Segmentation fault after a few thousand episodes of training. I also suspect that the reason for this is a memory leak. I'm currently using 2021.3, and I also get this SocketException. I've attached my log file as well.
Player.log

My environment:
Ubuntu 18.04.5 LTS
Processor: Intel® Core™ i7-3770 CPU @ 3.40GHz × 8
Graphics: NVIDIA GeForce RTX 3060/PCIe/SSE2

Any help is appreciated and let me know if you need more information!

@EricBoiseLGSVL
Contributor

It looks like the socket is closed when trying to send the test report.

[ANMGR] Initializing with TestReportId:d09f1269-58b3-48f3-a4a2-bbde6ab05990
16/11/2021 4:12:08 PM|Fatal|<>c__DisplayClass174_0.<startReceiving>b__2|WebSocketSharp.WebSocketException: The header part of a frame could not be read.
                              at WebSocketSharp.WebSocketFrame.processHeader (System.Byte[] header) [0x00010] in <fc1f0434ffca46a29e99654e48881787>:0 
                              at WebSocketSharp.WebSocketFrame+<>c__DisplayClass73_0.<readHeaderAsync>b__0 (System.Byte[] bytes) [0x00000] in <fc1f0434ffca46a29e99654e48881787>:0 
                              at WebSocketSharp.Ext+<>c__DisplayClass57_0.<ReadBytesAsync>b__0 (System.IAsyncResult ar) [0x00078] in <fc1f0434ffca46a29e99654e48881787>:0 

Can you disable the test report feature and see if you still get the crash?

@EricBoiseLGSVL
Contributor

It also seems you are loading the cyber bridge multiple times. Can you post the sensors you use and any changes you have made to the source? How are you resetting the simulation? Please step through how you are running simulations, because memory leaks are the hardest to troubleshoot. It may even be an issue with the Unity engine.
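
One rough way to check for a leak is to record the simulator's resident memory once per episode and see whether it keeps climbing. The sketch below assumes Linux (it reads `/proc/<pid>/status`) and that you know the simulator's process ID; `rss_kib` and `growth_per_episode` are hypothetical helper names. It only sees host RAM, so a leak in GPU memory would not show up here.

```python
def rss_kib(pid):
    """Resident set size of a process in KiB, read from /proc (Linux only)."""
    with open(f"/proc/{pid}/status") as fh:
        for line in fh:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    return None  # no VmRSS line, e.g. for kernel threads


def growth_per_episode(samples):
    """Least-squares slope of memory samples taken once per episode.

    A slope that stays clearly positive over many episodes hints at a leak;
    a slope near zero suggests memory use is stable.
    """
    n = len(samples)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den
```

In a training script you would append `rss_kib(simulator_pid)` to a list after every reset and print `growth_per_episode(...)` every few hundred episodes.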

@yeehsienq

yeehsienq commented Nov 17, 2021

I still get the segmentation error when the test report feature is disabled.
Player.log

I'm actually not using the cyber bridge, but just using the Python API. But if loading the cyber bridge is causing the problem, how can we disable that? For sensors, I currently use one main RGB camera and 3 depth sensor cameras.

In my training script, I'll reset the simulation by calling env.reset(), which sets up the ego and npc vehicles, and resets collision and done flags, either after a fixed number of steps have passed, or when the vehicle experiences a collision.
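
Roughly, the reset logic looks like the sketch below. `StubEnv` is a hypothetical stand-in so the loop can run without the simulator; the real environment's observations, rewards, and `info` contents differ, and the four-value `step()` return is the classic Gym convention, assumed here.

```python
class StubEnv:
    """Hypothetical stand-in for the simulator environment (not the real LGSVL Gym env)."""

    def __init__(self, collide_at):
        self.collide_at = collide_at
        self.steps = 0
        self.resets = 0

    def reset(self):
        # The real reset would set up the ego and NPC vehicles and clear the flags.
        self.steps = 0
        self.resets += 1
        return 0.0  # dummy observation

    def step(self, action):
        self.steps += 1
        done = self.steps >= self.collide_at  # pretend a collision happened
        return float(self.steps), 0.0, done, {"collision": done}


def run_episode(env, policy, max_steps):
    """Run one episode; stop on `done` or after `max_steps`. Returns steps taken."""
    obs = env.reset()
    for step in range(1, max_steps + 1):
        obs, reward, done, info = env.step(policy(obs))
        if done:
            return step
    return max_steps


if __name__ == "__main__":
    env = StubEnv(collide_at=5)
    for episode in range(3):
        steps = run_episode(env, policy=lambda obs: 0, max_steps=100)
        print(f"episode {episode}: {steps} steps")
```

Every episode begins with exactly one `reset()` call, whether the previous one ended by collision or by hitting the step limit.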

@EricBoiseLGSVL
Contributor

If you are not using the bridge, then yes, use a vehicle that has no bridge. This might be the issue.

@yeehsienq

May I ask, how do we remove the bridge from the vehicle?

@EricBoiseLGSVL
Contributor

In the vehicle configuration in the web UI.
[image]
