
XDMA data corruption issue (0xFFFFFFFF every other read) was fixed yet was not mentioned anywhere #317

Open
dmitrym1 opened this issue Jan 17, 2025 · 5 comments


dmitrym1 commented Jan 17, 2025

TL;DR: if you hit the same issue, upgrade to Vivado 2019.1 or newer. And if you are a Xilinx/AMD employee, please document fixes like this properly.

I was using Vivado 2018.2 and the corresponding XDMA IP connected to an iMX8. I had to use @alonbl's patch set, yet I still faced issues #311 and #314. But there was one more issue that I couldn't explain and couldn't find anything about. I have AXI peripherals connected to the AXI DMA port and some peripherals connected to the AXI Lite port for register access. After my app has been working fine for a while, I start getting 0xFFFFFFFF instead of data on every other read. The kernel module sees its register reads corrupted the same way, which leads to #314 and other issues, slowing everything down and eventually crashing the kernel. Unloading the kernel module does not help; the problem persists until the system is restarted.

Debugging the kernel module led me to the ioread32() function, which already receives the corrupted data, so the problem lies further down, in the XDMA IP itself. A look at the Xilinx/AMD website revealed no support tickets and no design advisories, and my IP core version (4.1) is listed as the latest. I don't know how much more time I would have spent on this bug if I had not accidentally looked into the XDMA changelog shipped with Vivado 2022.1. And there it is:

2019.1:
 * Version 4.1 (Rev. 3)
 ...
 * Bug Fix: Fixed back to back reads failure for 7series Gen2 DMA.
  ...

So first of all, the version is not just 4.1, it is 4.1.X, which is not what the official publicly available documentation says.
Second, I don't know for certain that this bug fix covers the issue I described above, because Xilinx shared nothing about it. How is a developer supposed to know that there was an issue and that it was fixed? So I'm doing Xilinx's work for them, sharing as much info as I can for those who face the same issue and google for a solution.

Example log; note the fields that contain 0xffffffff:

[  139.737742] xdma:engine_service: Engine was not running!!! Clearing status
[  150.049815] xdma:xdma_xfer_submit: xfer 0x00000000ed2983a6,576, s 0x1 timed out, ep 0xe240.
[  150.049829] xdma:engine_reg_dump: 0-C2H0-MM: ioread32(0x00000000ae7e6564) = 0x1fc10006 (id).
[  150.093194] xdma:engine_reg_dump: 0-C2H0-MM: ioread32(0x000000003af4d3f7) = 0xffffffff (status).                       <= BUG
[  150.093204] xdma:engine_reg_dump: 0-C2H0-MM: ioread32(0x0000000069a83cc0) = 0x00000000 (control)
[  150.136564] xdma:engine_reg_dump: 0-C2H0-MM: ioread32(0x00000000ef7bce09) = 0x00f83e1f (first_desc_lo)
[  150.179928] xdma:engine_reg_dump: 0-C2H0-MM: ioread32(0x00000000f04f09bc) = 0xffffffff (first_desc_hi).                       <= BUG
[  150.179939] xdma:engine_reg_dump: 0-C2H0-MM: ioread32(0x000000003a241afc) = 0x00000000 (first_desc_adjacent).
[  150.223303] xdma:engine_reg_dump: 0-C2H0-MM: ioread32(0x0000000068406d0f) = 0xffffffff (completed_desc_count)..                       <= BUG
[  150.223315] xdma:engine_reg_dump: 0-C2H0-MM: ioread32(0x000000002c7973e8) = 0x00f83e1e (interrupt_enable_mask)
[  150.266686] xdma:engine_status_dump: SG engine 0-C2H0-MM status: 0xffffffff: BUSY,DESC_STOPPED,DESC_COMPL,ALIGN_MISMATCH MAGIC_STOPPED INVALID_LEN IDLE_STOPPED,R:DECODE_ERR SLAVE_ERR,DESC_ERR:UNSUPP_REQ COMPL_ABORT PARITY HEADER_EP UNEXP_COMPL
[  150.266698] xdma:transfer_abort: abort transfer 0x00000000ed2983a6, desc 1, engine desc queued 0.
@jason77-wang

@dmitrym1 Where can I find @alonbl's patch set? Thanks.

And in my case the driver prints these error logs periodically; do you know what they mean?

Nov 30 03:00:26 ubuntu kernel: xdma:xdma_xfer_submit: xfer 0x0000000090ad39ee,4, s 0x1 timed out, ep 0xa8008080.
Nov 30 03:00:26 ubuntu kernel: xdma:engine_reg_dump: 0-C2H0-MM: ioread32(0x0000000009f88c41) = 0xffffffff (id).
Nov 30 03:00:26 ubuntu kernel: xdma:engine_reg_dump: 0-C2H0-MM: engine id missing, 0xfff00000 exp. & 0xfff00000 = 0x1fc00000
Nov 30 03:00:26 ubuntu kernel: xdma:engine_status_read: Failed to dump register
Nov 30 03:00:26 ubuntu kernel: xdma:xdma_xfer_submit: Failed to read engine status


alonbl commented Jan 19, 2025

> @dmitrym1 Where can I find @alonbl's patch set? Thanks.

#240

@dmitrym1 (Author)

Hi @jason77-wang. Your log shows a failed transaction. The driver reports it as a timeout, but since you got 0xffffffff from ioread32 I'd say it's a communication problem. There are various reasons you could get this result. I've seen the same log and had the same issue in my application, and it turned out to be an XDMA IP bug. In my case a few dozen restarts of my software reliably triggered the issue; otherwise it would reproduce by itself after a few days of continuous operation. Once it gets into that state, it stays there until I restart the whole system. I updated Vivado to 2020.1, upgraded the IP cores, and that fixed the problem. The XDMA changelog says the problem should be fixed since 2019.1, so you could try updating too and see if that fixes the issue. If it does not, then unfortunately I won't be able to help you any further.

@jason77-wang

> @dmitrym1 Where can I find @alonbl's patch set? Thanks.
>
> #240

Thanks.

@jason77-wang

> I've updated Vivado to 2020.1 and upgraded IP cores, and this fixed the problem. The changelog for XDMA says the problem should be fixed since 2019.1. So you could try to update too and see if this fixes the issue. [...]

Okay, got it. Thanks.


3 participants