Encountering problems when running darknet on RPI3B+ #106
Comments
This looks like a VC4CC compiler bug/missing support. Also, since you are using a source build of VC4CC, what is your LLVM/clang version you built VC4CC against? This can be queried e.g. by |
Thanks for your quick reply! I will post the two files as follows. I cat the /tmp/vc4cl-source-xyz.cl:
As you see, the dumped code lies in one line and is difficult to read. Is it right? Or how can I make the file more readable? Besides, I tried to cat
How should I transfer the code? As for the LLVM/Clang version, what do you mean by
|
The I'm not sure whether I have time today to debug this issue though. |
The LLVM/Clang version after typing
I have packed the It's okay if you have no time today. You can help me with this problem when you are available. Thanks! |
- Detect more stack-allocated memory addresses, see doe300/VC4CL#106 - Reduce usage of unaligned VPM access
Hi @doe300, I saw you made some commits. Does that mean the problem is solved? |
With the changes (not yet on |
I have tested the 8 lines. The memory error disappeared, but I got another problem: the program gets stuck at that point. I'm not sure whether it is an infinite loop or something else. I have attached the .cl and .bc files to this comment |
Can you check what threads are running (and how "active" are they) when the program is "stuck"? This can be best checked e.g. with the |
I think the thread name is I also observed the change of htop and I guess the procedure is like this. First, there are four concurrent BTW I can also see other threads, such as |
Which command are you using? I am currently trying the example from https://pjreddie.com/darknet/yolo/ and it seems to compile just fine, although I do run into a SEGFAULT afterwards... |
The darknet is this one https://github.com/sowson/darknet. The command to compile this darknet follows https://iblog.isowa.io/2020/04/29/darknet-in-opencl-on-beagleboard-ai/ :
Then you'd better edit
to
Then in the build directory, the command to run is |
Seems I can't reproduce your issue on my setup. On my Raspberry Pi it compiles for several minutes, then prints some layer information and fails with being unable to open a file. I am currently running a large set of regression tests, if none of these fail, I will merge the |
That's weird. Do you use an RPI 3B+ too? I think printing some layer information is good and means you didn't encounter my problem. Also, did you make the changes to darknet? eg |
I tested on a new system and it worked! But I feel that the setup phase is pretty slow. I think this phase is what you meant by "compiles for several minutes"? I was wondering why it needs to compile every time I start the program. |
Yes, the compilation takes some time, which is a combination of probably running in Debug build, the compiler itself not being too fast and the weak hardware.
It compiles every time because neither the application (darknet) nor the OpenCL implementation (VC4CL) caches the compiled binaries. I thought about adding that, but since VC4CL is still in active development, it didn't make too much sense, for me at least, and would just introduce consistency issues. |
I do not quite understand what you mean by Debug build. Do you mean that VC4CL or darknet is compiled in debug mode? I checked the configuration and found that neither of them is built in debug mode. BTW, have you compared the computation speed of VC4CL and the CPU? For example, for matrix multiplication, how much faster is VC4CL than the pure CPU? |
VC4C(L), if not set otherwise explicitly (e.g. by specifying the
No, not extensively, I do have very few comparisons where the results differ greatly (sometimes CPU is faster, sometimes VC4CL). I expect that it will highly depend on the code being executed (e.g. how easily it can be parallelized, how much memory it accesses, etc.). |
Today I went deeper into the project but I got another problem. I prepared a code base at https://github.com/ziqi-zhang/darknet-vc4cl so that you will not encounter the missing file problem. To build this repo, you can directly run One problem is that I got I checked the passed arguments of Another problem I occasionally encountered is
Do you have any idea how to avoid such an error? Besides, for the CL compilation time, do you have any suggestion to work around it? Compiling the CL code is pretty time-consuming and I have to wait for minutes even if I make a small modification to the code... |
There is your problem. VC4CL only supports a local size of up to 12, since the VC4 GPU has only 12 cores. Although OpenCL clients should check what the implementation supports, a lot of applications don't do that and assume some minimum value and thus don't work on VC4CL. Depending on how the kernel is written, it might work without specifying a fixed local size, so that VC4CL can decide on one...
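A minimal sketch of the check described above, written in Python purely for illustration (darknet itself is C). It assumes the client has already obtained the maximum supported local size, for example via `clGetKernelWorkGroupInfo` with `CL_KERNEL_WORK_GROUP_SIZE` or `clGetDeviceInfo` with `CL_DEVICE_MAX_WORK_GROUP_SIZE`, and that the global size must be evenly divisible by the local size. The function name `pick_local_size` is hypothetical, and the default of 12 is simply the limit mentioned in this thread.

```python
def pick_local_size(global_size: int, max_local_size: int = 12) -> int:
    """Return the largest local size <= max_local_size that evenly divides global_size.

    max_local_size is assumed to come from the OpenCL implementation;
    12 is the VC4CL limit mentioned in this thread.
    """
    if global_size < 1 or max_local_size < 1:
        raise ValueError("sizes must be positive")
    # Walk downwards from the largest allowed value until one divides evenly.
    for candidate in range(min(global_size, max_local_size), 0, -1):
        if global_size % candidate == 0:
            return candidate
    return 1  # not reachable in practice, since 1 divides every positive size


if __name__ == "__main__":
    # 3072 work-items with a limit of 12 gives a local size of 12,
    # i.e. 3072 / 12 = 256 work-groups.
    print(pick_local_size(3072))
```

Passing no local size at all and letting the implementation decide, as suggested above, avoids this calculation entirely when the kernel does not depend on a fixed work-group size.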
Sadly no, not at the moment at least. The register allocation of VC4C is non-deterministic, so if a kernel has a certain complexity, then register allocation will succeed sometimes and fail other times...
If darknet compiles all kernels from the same place, then maybe adding caching there would be the easiest solution. E.g. create a hash of kernel source code + options, check in some folder if a file with that hash exists. If not, compile and on success store with that hash in the folder. |
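The caching idea above could look roughly like the following sketch. It is written in Python for brevity, not as darknet code. All names (`cache_key`, `load_or_compile`, `compile_fn`) are hypothetical, and `compile_fn` stands in for whatever the application does to build a program and retrieve its binary (in an OpenCL C client, presumably `clBuildProgram` followed by `clGetProgramInfo` with `CL_PROGRAM_BINARIES`).

```python
import hashlib
from pathlib import Path
from typing import Callable


def cache_key(source: str, options: str) -> str:
    """Hash kernel source and build options into one hex string."""
    h = hashlib.sha256()
    h.update(source.encode("utf-8"))
    h.update(b"\0")  # separator so ("ab", "") and ("a", "b") hash differently
    h.update(options.encode("utf-8"))
    return h.hexdigest()


def load_or_compile(
    source: str,
    options: str,
    cache_dir: str,
    compile_fn: Callable[[str, str], bytes],  # hypothetical stand-in for the real build step
) -> bytes:
    """Return the cached binary if present, otherwise compile and store it."""
    directory = Path(cache_dir)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / (cache_key(source, options) + ".bin")
    if path.exists():
        return path.read_bytes()
    # compile_fn is expected to raise on failure, so nothing is stored then.
    binary = compile_fn(source, options)
    tmp = path.with_suffix(".tmp")
    tmp.write_bytes(binary)
    tmp.replace(path)  # rename last, so a half-written file is never picked up
    return binary
```

Since the compiler is still changing, it may also be worth mixing a compiler or library version string into the hash so that stale binaries are not reused after an update; that is an assumption on top of the suggestion above, not something the thread prescribes.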
Thanks! I reduced the local size and it worked. But I have another problem: the program gets stuck at some CL interface functions. The first place is pull_network_output, which pulls an array from OpenCL, and the program gets stuck at this clEnqueueReadBuffer. After I commented out the Do you have any idea how to solve these problems? |
Looks like some event gets stuck/takes long and the successive events have to wait. Depending on how the kernel is executed, this could mean that the kernel execution hangs. If you run the program with the |
Thanks to your suggestion, I changed the code so that the kernel code is cached. Loading and building the cached program is much faster than compiling the source. Now I don't need to wait for a long time to debug the program. I ran the program with
Then I waited for several minutes and it output another line:
Then I waited for 20+ minutes and it did not output anything else. I also tried other code. It seems that whenever I call
|
So the first kernel seems to take very long for its 3072/12 = 256 work-groups. Unless a single work-item does a huge amount of calculations, they should probably finish in less than a minute. The second kernel runs 131072 work-groups, which definitely takes some time. I don't know whether 20+ minutes is okay or not...
Without having looked at the darknet code in detail, I would assume it schedules the kernels for execution and then sets their events as dependencies for the To check whether the kernel executions finish successfully or time out (in which case the GPU gets into some weird state), you could also enable the I suspect that the code generated for these kernels is somehow either wrong and hangs the QPUs or is far too slow and thus the kernel executions time out. If the QPUs are hung, then everything after that will also just hang until they are reset... |
I ran the program with However, when I ran the code without I ran the code again with a simple network with only one layer and recorded the log. As I expected, the code also got stuck. Here is the log.log. Still, I didn't see any |
So I had another quick look at the last log you provided (similar values can also be seen in the first log). Given that the duration in line 694 looks suspiciously close (~99.99%) to the timeout in line 692, it might be that the execution times out internally, but it is somehow still reported as success...
If that is the case, it most likely means that the VC4C compiler generates some wrong code which causes the QPUs to run infinitely/hang. |
@doe300 hi, I also always encounter
It seems that sometimes I cannot fetch the dependency libraries. However, I can get the libraries by So I'd like to clone the libraries manually. What changes should I make so that I can disable the fetch function in |
The only thing that I can think of right now is to download the git project as an archive (e.g. via GitHub website) and then replace the line https://github.com/doe300/VC4C/blob/master/cmake/spirv-headers.cmake#L8 with something like
|
Hi @doe300, I have solved the cache problem. As for the hanging problem, how can I find the wrong code? Can I upload the generated code here so that you can help me look at it? |
That is very tricky, especially for rather complex code as I would assume darknet produces. If you give me the generated binary code and the VC4CL debug logs, I can try to run it in the software emulator with some default parameter values and hope that the reason for the hang/long execution time becomes obvious. If the behaviour depends on the actual input buffers etc., then it becomes a lot more complicated... |
Hi,
I tried to run darknet on RPI 3B+ with VC4CL. I can compile the code but encountered
Normalizer: Invalid local type for memory area: (g) f32* %arrayidx29.sink75
when running the program. The darknet version is https://github.com/sowson/darknet. The console output is as follows: I found that the code causing this problem is the function called
opencl_load_buffer
, which I list below: Do you know how I can fix this problem? Is it because the function invokes some unimplemented library functions?
Thanks!