
XDMA data corruption issue (0xFFFFFFFF every other read) was fixed yet was not mentioned anywhere #317

Open
dmitrym1 opened this issue Jan 17, 2025 · 5 comments


dmitrym1 commented Jan 17, 2025

TL;DR: if you hit the same issue, upgrade to Vivado 2019.1 or newer. And if you are a Xilinx/AMD employee, please document fixes like this properly.

I was using Vivado 2018.2 and the corresponding XDMA IP connected to an iMX8. I had to use @alonbl's patch set, yet I still faced issues #311 and #314. But there was one more issue that I couldn't explain and couldn't find anything about. I have AXI peripherals connected to the AXI DMA port and some peripherals connected to the AXI Lite port for register access. After my app has been working fine for a while, I start getting 0xFFFFFFFF instead of data on every other read. The kernel module sees its register reads corrupted the same way, which leads to #314 and other issues, slowing everything down and eventually crashing the kernel. Unloading the kernel module does not help; the problem persists until the system is restarted.

Debugging the kernel module led me to the ioread32() function, which already receives the corrupted data, so the problem lies further down, in the XDMA IP itself. A look at the Xilinx/AMD website revealed no support tickets and no design advisories, and my IP core version (4.1) is listed as the latest. I don't know how much more time I would have spent on this bug if I had not accidentally looked into the XDMA changelog shipped with Vivado 2022.1. And there it is:

2019.1:
 * Version 4.1 (Rev. 3)
 ...
 * Bug Fix: Fixed back to back reads failure for 7series Gen2 DMA.
  ...

So first of all, the version is not just 4.1, it is 4.1.X, which is not what the official publicly available documentation says.
Second, I don't know for certain that this bug fix covers the issue I described above, because Xilinx shared nothing about it. How is a developer supposed to know that there was an issue and that it was fixed? So I'm doing Xilinx's work for them, sharing as much info as I can for those who face the same issue and google for a solution.

Example log; note the fields that contain 0xffffffff:

[  139.737742] xdma:engine_service: Engine was not running!!! Clearing status
[  150.049815] xdma:xdma_xfer_submit: xfer 0x00000000ed2983a6,576, s 0x1 timed out, ep 0xe240.
[  150.049829] xdma:engine_reg_dump: 0-C2H0-MM: ioread32(0x00000000ae7e6564) = 0x1fc10006 (id).
[  150.093194] xdma:engine_reg_dump: 0-C2H0-MM: ioread32(0x000000003af4d3f7) = 0xffffffff (status).                       <= BUG
[  150.093204] xdma:engine_reg_dump: 0-C2H0-MM: ioread32(0x0000000069a83cc0) = 0x00000000 (control)
[  150.136564] xdma:engine_reg_dump: 0-C2H0-MM: ioread32(0x00000000ef7bce09) = 0x00f83e1f (first_desc_lo)
[  150.179928] xdma:engine_reg_dump: 0-C2H0-MM: ioread32(0x00000000f04f09bc) = 0xffffffff (first_desc_hi).                       <= BUG
[  150.179939] xdma:engine_reg_dump: 0-C2H0-MM: ioread32(0x000000003a241afc) = 0x00000000 (first_desc_adjacent).
[  150.223303] xdma:engine_reg_dump: 0-C2H0-MM: ioread32(0x0000000068406d0f) = 0xffffffff (completed_desc_count)..                       <= BUG
[  150.223315] xdma:engine_reg_dump: 0-C2H0-MM: ioread32(0x000000002c7973e8) = 0x00f83e1e (interrupt_enable_mask)
[  150.266686] xdma:engine_status_dump: SG engine 0-C2H0-MM status: 0xffffffff: BUSY,DESC_STOPPED,DESC_COMPL,ALIGN_MISMATCH MAGIC_STOPPED INVALID_LEN IDLE_STOPPED,R:DECODE_ERR SLAVE_ERR,DESC_ERR:UNSUPP_REQ COMPL_ABORT PARITY HEADER_EP UNEXP_COMPL
[  150.266698] xdma:transfer_abort: abort transfer 0x00000000ed2983a6, desc 1, engine desc queued 0.
@jason77-wang

@dmitrym1 Where can I find @alonbl's patch set? Thanks.

And in my case the driver prints these error logs periodically; do you know what they mean?

Nov 30 03:00:26 ubuntu kernel: xdma:xdma_xfer_submit: xfer 0x0000000090ad39ee,4, s 0x1 timed out, ep 0xa8008080.
Nov 30 03:00:26 ubuntu kernel: xdma:engine_reg_dump: 0-C2H0-MM: ioread32(0x0000000009f88c41) = 0xffffffff (id).
Nov 30 03:00:26 ubuntu kernel: xdma:engine_reg_dump: 0-C2H0-MM: engine id missing, 0xfff00000 exp. & 0xfff00000 = 0x1fc00000
Nov 30 03:00:26 ubuntu kernel: xdma:engine_status_read: Failed to dump register
Nov 30 03:00:26 ubuntu kernel: xdma:xdma_xfer_submit: Failed to read engine status


alonbl commented Jan 19, 2025

> @dmitrym1 Where can I find @alonbl's patch set? Thanks.

#240

@dmitrym1 (Author)

Hi @jason77-wang. Your log shows a failed transaction. The driver reports it as a timeout, but since you got 0xffffffff from ioread32 I'd say it's a communication problem. There are various reasons you could get this result. I've seen the same log and had the same issue in my application, and it turned out to be an XDMA IP bug. In my case a few dozen restarts of my software reliably triggered the issue; otherwise it would reproduce by itself after a few days of continuous operation. Once it gets into that state, it stays there until I restart the whole system. I updated Vivado to 2020.1, upgraded the IP cores, and that fixed the problem. The XDMA changelog says the problem should be fixed since 2019.1, so you could try updating too and see if that fixes the issue. If it does not, then unfortunately I won't be able to help you any further.

@jason77-wang

> @dmitrym1 Where can I find @alonbl's patch set? Thanks.
>
> #240

Thanks.

@jason77-wang

> I've updated Vivado to 2020.1 and upgraded IP cores, and this fixed the problem. The changelog for XDMA says the problem should be fixed since 2019.1. So you could try to update too and see if this fixes the issue. [...]

Okay, got it. Thanks.


3 participants