[Arch] Added Artix-7-like Devices on the Xilinx 7 Series Capture #3139

Merged

AlexandreSinger
Contributor

For testing Analytical Placement, we currently need fixed-sized devices. To get the 7-Series architecture capture through the AP flow, I selected fixed device sizes that matched the Artix-7 device family in resource utilization.

The fixed layouts were obtained by performing the following process:

  1. The device was auto-sized to the FPGA with the closest number of CLB
    slices to the target device.

  2. Extra RAMs, DSPs, and IOs were dropped out evenly across the device.

The RAMs and DSPs usually needed to be dropped out, and their numbers were often cut in half relative to their auto-sized resource counts.

Most of the IOs needed to be dropped out: the number of IOs in the auto-sized devices was around 10x that of their Artix-7 equivalents (with matching CLB resources).
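
As a rough illustration of step 2, the sketch below picks which columns of a resource to keep so that the survivors stay evenly spread across the device. The function name `keep_evenly` and the column counts are hypothetical, and this is not necessarily how the layouts in this PR were produced; it is only one way to express "dropped out evenly".

```python
def keep_evenly(total: int, keep: int) -> list[int]:
    """Return `keep` indices out of range(total), spaced as evenly as possible."""
    if not 0 <= keep <= total:
        raise ValueError("keep must be between 0 and total")
    # Integer arithmetic avoids float rounding: index i maps to floor(i * total / keep).
    return [(i * total) // keep for i in range(keep)]


# Hypothetical example: the auto-sized device has 10 DSP columns, the target needs 5.
kept = keep_evenly(10, 5)
dropped = [c for c in range(10) if c not in kept]
print(kept)     # [0, 2, 4, 6, 8]
print(dropped)  # [1, 3, 5, 7, 9]
```

Halving a resource this way removes every other column, which matches the observation that RAM and DSP counts were often cut in half.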

These devices likely do not model the layout of the Artix-7 devices faithfully; however, my goal was to simulate the resource utilization of the Artix-7 device family on our Xilinx 7-Series capture.

For those interested, here are the VTR benchmarks mapped to the smallest Artix-7 device that they can fit on and their resource utilization:

| Circuit | Minimum Device | Device Utilization | Limiting Resource Utilization | Limiting Resource |
|---|---|---|---|---|
| arm_core.v | XC7A200T | 0.06 | 0.62 | io |
| bgm.v | XC7A75T | 0.45 | 0.95 | io |
| blob_merge.v | XC7A12T | 0.71 | 0.94 | CLB |
| boundtop.v | XC7A200T | 0.01 | 0.77 | io |
| ch_intrinsics.v | XC7A15T | 0.06 | 0.89 | io |
| diffeq1.v | XC7A75T | 0.02 | 0.85 | io |
| diffeq2.v | Yosys crash | - | - | - |
| LU32PEEng.v | XC7A200T | 0.52 | 0.61 | CLB |
| LU8PEEng.v | XC7A50T | 0.61 | 0.87 | io |
| mcml.v | XC7A200T | 0.68 | 0.79 | CLB |
| mkDelayWorker32B.v | Did not fit | - | - | io (Needs 1059) |
| mkPktMerge.v | XC7A200T | 0.01 | 0.93 | io |
| mkSMAdapter4B.v | XC7A200T | 0.01 | 0.79 | io |
| or1200.v | Did not fit | - | - | io (Needs 747) |
| raygentop.v | Did not fit | - | - | io (Needs 541) |
| sha.v | XC7A12T | 0.15 | 0.49 | io |
| spree.v | XC7A12T | 0.08 | 0.51 | io |
| stereovision0.v | XC7A200T | 0.06 | 0.73 | io |
| stereovision1.v | XC7A75T | 0.2 | 0.89 | DSP |
| stereovision2.v | XC7A200T | 0.21 | 0.66 | io |
| stereovision3.v | XC7A12T | 0.02 | 0.09 | io |
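
The "minimum device" and "limiting resource" columns can be thought of as the result of a simple search: walk the devices from smallest to largest and stop at the first one where no resource exceeds its capacity. The sketch below illustrates that idea only. The device names and capacities are made-up placeholders, not Artix-7 data sheet values, and `smallest_fit` is a hypothetical helper rather than anything in VTR.

```python
from typing import Optional

# Hypothetical device capacities, smallest first. These numbers are invented
# for illustration and are NOT the Artix-7 data sheet values.
DEVICES = [
    ("small",  {"io": 100, "clb": 1000,  "dsp": 20,  "bram": 10}),
    ("medium", {"io": 200, "clb": 4000,  "dsp": 60,  "bram": 40}),
    ("large",  {"io": 400, "clb": 16000, "dsp": 300, "bram": 150}),
]


def utilization(used: dict, avail: dict) -> dict:
    """Per-resource utilization = used / available (missing resources count as 0)."""
    return {res: used.get(res, 0) / avail[res] for res in avail}


def smallest_fit(used: dict) -> Optional[tuple]:
    """Return (device, limiting resource, its utilization) or None if nothing fits."""
    for name, avail in DEVICES:
        util = utilization(used, avail)
        limiting = max(util, key=util.get)  # resource closest to (or over) capacity
        if util[limiting] <= 1.0:
            return name, limiting, round(util[limiting], 2)
    return None


# A circuit with little logic but many IOs is pushed to a larger device by IO alone.
print(smallest_fit({"io": 190, "clb": 500, "dsp": 0, "bram": 2}))  # ('medium', 'io', 0.95)
print(smallest_fit({"io": 500}))                                   # None ("did not fit")
```

This mirrors the pattern in the table, where many circuits land on a large device with low overall utilization because IO is the limiting resource.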

Of specific interest to me are the 8 largest VTR benchmark circuits, which I have pulled out below:

| Circuit | Minimum Device | Device Utilization | Limiting Resource Utilization | Limiting Resource |
|---|---|---|---|---|
| arm_core.v | XC7A200T | 0.06 | 0.62 | io |
| stereovision0.v | XC7A200T | 0.06 | 0.73 | io |
| LU8PEEng.v | XC7A50T | 0.61 | 0.87 | io |
| bgm.v | XC7A75T | 0.45 | 0.95 | io |
| stereovision1.v | XC7A75T | 0.2 | 0.89 | DSP |
| stereovision2.v | XC7A200T | 0.21 | 0.66 | io |
| LU32PEEng.v | XC7A200T | 0.52 | 0.61 | CLB |
| mcml.v | XC7A200T | 0.68 | 0.79 | CLB |

@vaughnbetz
Contributor

Does the architecture capture have 2 DSP slices per DSP tile? I thought 7-series DSPs were pretty much one multiplier with some extra circuitry.

@AlexandreSinger
Contributor Author

AlexandreSinger commented Jun 15, 2025

> Does the architecture capture have 2 DSP slices per DSP tile? I thought 7-series DSPs were pretty much one multiplier with some extra circuitry.

Based on what I am seeing, each DSP tile implements 2 DSP slices.

My AP mass report says that each logical DSP block contains two 25x18 multiplier pbs:

image

I also verified my mass report by looking at the architecture file:
image

The data sheet just uses the term "DSP48E1 Slices" and a footnote that says "Each DSP slice contains a pre-adder, a 25 x 18 multiplier, an adder, and an accumulator".

Edit: For the devices that I created, I was careful to ensure that the number of slices in the architecture matched the data sheet (not the number of blocks)
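
Because the data sheet counts slices while the architecture capture counts tiles, matching a data sheet figure means dividing by the number of primitives per tile. Here is a small sketch, assuming 2 DSP slices per DSP tile and 8 IO sub-tiles per IO tile as discussed in this thread. The helper name `tiles_for` is made up, and rounding down is an assumption chosen because the diff excerpt in this PR lists 250 IOs as 31 IO tiles.

```python
def tiles_for(primitives: int, per_tile: int) -> int:
    """Number of whole tiles whose combined capacity does not exceed `primitives`."""
    if per_tile <= 0:
        raise ValueError("per_tile must be positive")
    return primitives // per_tile


print(tiles_for(250, 8))  # 31 IO tiles for 250 IOs (31 * 8 = 248 IOs actually provided)
print(tiles_for(100, 2))  # 50 DSP tiles for 100 DSP slices (100 is an arbitrary example)
```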

@vaughnbetz
Contributor

You're right, that does look like 2 multipliers per block. @WhiteNinjaZ : we can leave this, but what's the motivation for putting 2 multipliers in a DSP block instead of 1 for this architecture? I would have thought one would be a closer fit to the 7-series.

@WhiteNinjaZ
Contributor

WhiteNinjaZ commented Jun 16, 2025

@vaughnbetz and @AlexandreSinger you're right, the Xilinx documentation only talks about individual DSP slices. However, looking at the actual hardware, a single 7-series DSP tile is made up of two DSP48E1 slices, with some specialized interconnect running between the two of them as well as interfacing with the global routing. The DSP tile spans between 4 and 5 CLBs. Here is an image from the device manager in Vivado showing this:
image
Note that, as discussed in the VTR_9 paper, the specialized interconnect for the DSPs has not been implemented yet.

@WhiteNinjaZ
Contributor

WhiteNinjaZ commented Jun 16, 2025

@AlexandreSinger just so I can do some quick checks and calculations, where did you pull your target resource counts from for the actual 7-series? I assume from each resource's documentation page (e.g. Table 1-1 for CLBs)? Specifically, I am interested in where you pulled the IO counts from.

Each device has several sub-packages and the number of IOs in each package can vary widely, although the number of LUTs and FFs will stay the same. For instance, the XC7A200T device can have anywhere from 484 to 1156 IOs depending on the package.

@AlexandreSinger
Contributor Author

Hi @WhiteNinjaZ I used the following document: https://docs.amd.com/v/u/en-US/ds180_7Series_Overview

Specifically I used Table 4:

image

If I am misinterpreting the IO column please let me know! I was very worried about the IO count, but I assumed the table was very clear that the number of IOs available is the "Max User IO". I recognize that this does not include some of the IOs (such as GTP transceivers), but for general circuits that we are testing on the device, I thought that number lined up the most. For the VTR circuits for example, I thought their inputs would line up the best with the User IO.

@vaughnbetz
Contributor

Thanks for the DSP clarification @WhiteNinjaZ

@WhiteNinjaZ
Contributor

WhiteNinjaZ commented Jun 17, 2025

@AlexandreSinger Sounds good, that table is a good place to pull the counts from. When I said 1156 IOs for the XC7A200T, I was counting all IOs, including GTPs etc. The numbers in that table are technically the maximum number of IOBs in the device, which is correct. Most of your resource counts do look good. The IO on the 7-series is pretty limited, unfortunately. mkDelayWorker32B, or1200, and raygentop are all too big to fit on any Artix-7 device when it comes to IO. You would have to go up to the Virtex devices to get enough IO (XC7V2000T is the only device where everything will fit).

I should have time on Thursday to take a detailed look at the changes to the architecture. But in the meantime, as a sanity check, here are my resource usages from Vivado on several of the above designs:

| Design | Device and Package | IO (%) | LUT (%) | FF (%) | DSP (%) | BRAM (%) |
|---|---|---|---|---|---|---|
| arm_core.v | xc7a200tffv1156-1 | 62 | 4 | 1 | - | 5 |
| bgm.v | xc7a75tfgg676-1 | 96 | 26 | 5 | 12 | - |
| boundtop.v | xc7a200tffv1156-1 | 66 | 0.17 | 0.08 | - | - |
| diffeq1.v | xc7a75tfgg676-1 | 86 | 1 | 1 | - | - |
| LU32PEEng.v | xc7a200tffv1156-1 | 43.2 | 36.84 | 5.84 | 8.65 | 41.1 |
| LU8PEEng.v | xc7a50tfgg484-1 | 86 | 48 | 8 | 13 | 56 |
| stereovision0.v | xc7a200tffv1156-1 | 73 | 2 | 4 | - | - |
| stereovision1.v | xc7a75tfgg676-1 | 93 | 31 | 12 | - | - |
| stereovision2.v | xc7a200tffv1156-1 | 66 | 6 | 5 | 33 | - |
| stereovision3.v | xc7a12tcsg325-1 | 27 | 0.68 | 0.62 | - | - |

All utilizations are percentages. Results are post-synthesis only, not post-implementation; from what I have seen on the VTR benchmarks, post-synthesis and post-implementation results are almost always identical in Vivado. Each result was run on the device you specified, with the package that provides the maximum number of IOs.

It would be good to get the usage for all of the resources on your end for comparison so we aren't only looking at the limiting resource. For now, though, all your results look good with the exception of LU32PEEng, boundtop, and stereovision3. For LU32PEEng and boundtop, the IO usage in Vivado is about 20% less. This could be because Vivado is ripping out some logic, as its synthesizer is quite a bit more aggressive at removing logic than Yosys. What really confuses me is stereovision3, which uses about 20% less IO in VTR than it does in Vivado.

I should also note that Vivado does not use DSP resources in stereovision1 while VTR does, but this is consistent with what we have seen in the past and seems to be due to differences between the synthesizers.

@vaughnbetz
Contributor

Maybe VTR removes unused IOs while Vivado doesn't? (Quartus is conservative about removing IOs in case they are placeholders for a planned board design; Vivado is likely similar).

@AlexandreSinger
Contributor Author

AlexandreSinger commented Jun 18, 2025

Hi @WhiteNinjaZ sorry for the delay! Thank you so much for your response. I appreciate the numbers from Vivado! I agree that the Artix-7 device family has somewhat limited IO resources. I chose to go with this device since its other resource counts provided a good spread along the 8 largest VTR benchmarks.

Here are the device utilizations I am seeing on VTR master on the devices I posted above. These are the device utilizations reported by VPR, where I believe the utilization is computed as a ratio of the required physical tile area over the available physical tile area on the device for each resource tile type:

| Design | Device | io | CLB | memory | DSP |
|---|---|---|---|---|---|
| arm_core | XC7A200T | 0.62 | 0.07 | 0.07 | 0.00 |
| bgm | XC7A75T | 0.95 | 0.55 | 0.00 | 0.12 |
| LU32PEEng | XC7A200T | 0.43 | 0.61 | 0.42 | 0.09 |
| LU8PEEng | XC7A50T | 0.87 | 0.72 | 0.60 | 0.13 |
| stereovision0 | XC7A200T | 0.73 | 0.08 | 0.00 | 0.00 |
| stereovision1 | XC7A75T | 0.86 | 0.20 | 0.00 | 0.89 |
| stereovision2 | XC7A200T | 0.66 | 0.21 | 0.00 | 0.63 |
| mcml | XC7A200T | 0.78 | 0.79 | 0.44 | 0.22 |

I also took the liberty of computing the relative error between the Vivado results you posted above and the VTR results here (relative error = abs(Vivado - VTR) / Vivado ):

| Design | io relative error | DSP relative error | BRAM relative error |
|---|---|---|---|
| arm_core | 0.00 | 0.00 | 0.40 |
| bgm | 0.01 | 0.00 | 0.00 |
| LU32PEEng | 0.00 | 0.04 | 0.02 |
| LU8PEEng | 0.01 | 0.00 | 0.07 |
| stereovision0 | 0.00 | 0.00 | 0.00 |
| stereovision1 | 0.08 | 0.00 | 0.00 |
| stereovision2 | 0.00 | 0.91 | 0.00 |
| mcml | - | - | - |
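
For anyone who wants to reproduce these numbers, here is a minimal sketch of the relative-error formula above. Both inputs are taken as percentages (the VPR fractions multiplied by 100). Treating a "-" entry in the Vivado table as 0, and reporting 0 error when both tools report 0, is my reading of how the table was filled in, not something stated explicitly.

```python
def relative_error(vivado: float, vtr: float) -> float:
    """relative error = abs(Vivado - VTR) / Vivado, both given in percent."""
    if vivado == 0:
        if vtr == 0:
            return 0.0  # assumption: both tools report no usage, so no error
        raise ValueError("relative error is undefined when Vivado reports 0")
    return abs(vivado - vtr) / vivado


# Values taken from the Vivado and VPR tables in this thread.
print(round(relative_error(5, 7), 2))    # arm_core BRAM      -> 0.4
print(round(relative_error(33, 63), 2))  # stereovision2 DSP  -> 0.91
print(round(relative_error(93, 86), 2))  # stereovision1 io   -> 0.08
```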

I left out the CLB relative error since it's not easy for me to get the device utilization in terms of LUTs and FFs. It's possible, just not as convenient as the device utilization of the tiles.

Overall, I am quite happy with the relative error. The IOs especially are extremely close to the Vivado results (at most 8% error). stereovision2 on the VTR side uses far more DSPs than Vivado, but I am not sure if that is a result of the device resources or the packer / synthesis. BRAMs also look quite good. arm_core has 40% error; however, Vivado used 5% and VTR used 7%, so I do not see that as too outlandish.

NOTE: I only put the top 8 circuits above since I had that data on hand. If you would like me to collect the others do let me know!

@AlexandreSinger
Contributor Author

@WhiteNinjaZ @vaughnbetz I started looking into changing the number of io sub-tiles per IO tile; however, I am finding that the ratios come out a bit strange. It would take a good amount of work to change the devices I have now to account for this. It would also require us to regenerate and re-collect the data for the tests that did not use fixed devices. @WhiteNinjaZ, I think you said you were looking to explore the device layout further in the future, so this would end up being redone anyway.

Are people ok with leaving the devices and architecture capture as it is now (with 8 io sub-tiles per IO tile)?

@vaughnbetz
Contributor

We can defer it but I think it would be good if you made this change eventually @WhiteNinjaZ . It's a bit odd to have clumps of 8 IOs in some places and no IOs in other places.

Contributor

@WhiteNinjaZ WhiteNinjaZ left a comment

@AlexandreSinger I've gone through the changes and all your numbers look good. I did leave a few optional comments, but you can decide if you want to do those or not or if you want to leave them as an action item for me in the future. I will definitely cut the IO down in the future.

<col type="io" startx="0" starty="1" incry="8" priority="100" />
<col type="io" startx="36" starty="1" incry="8" priority="100" />
<row type="io" startx="1" starty="0" incrx="8" priority="100" />
<row type="io" startx="1" starty="36" incrx="9" priority="100" />
Contributor

Since we are using so few IOs, it might be good to try constraining the IO to one side of the device. This is technically more accurate to the 7-series (IO is constrained to the left side of the chip, the right side, or both).

- 50 BRAM blocks | 50 BRAM tiles
- 250 IOs | 31 IO tiles
-->
<fixed_layout name="XC7A35T-like" width="58" height="58">
Contributor

This is another optional change that I can make in the future, but just so you are aware: the 7-series devices are not square but rectangular. The aspect ratio (W/H) is about 0.25 on average across the different chips. If you are only going for accurate resource counts on the device, then it doesn't really matter, but it is worth mentioning.
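
To give a feel for what that aspect ratio would mean for the layout above, here is a rough sketch that reshapes a square grid into a rectangle with roughly the same tile count. `rect_dims` is a hypothetical helper, and it ignores the fact that VTR resources are placed in columns, so the real resource counts would shift and need re-tuning.

```python
import math


def rect_dims(width: int, height: int, ratio: float) -> tuple[int, int]:
    """Return (W, H) with W/H close to `ratio` and about width * height tiles."""
    area = width * height
    w = max(1, round(math.sqrt(area * ratio)))  # from W = ratio * H and W * H = area
    h = max(1, round(area / w))
    return w, h


# The 58x58 XC7A35T-like layout reshaped to W/H = 0.25.
print(rect_dims(58, 58, 0.25))  # (29, 116)
```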

Based on feedback for the Artix-7-like devices, I put the IOs on the
sides of the devices instead of all around. This helps reduce the
"clumps" of IOs some and is slightly more accurate.
@AlexandreSinger
Contributor Author

@vaughnbetz @WhiteNinjaZ Thanks for the feedback! I have updated the devices to put the IOs on the left and right sides of the device. This should help with the density of the IOs and does not require me to rewrite too much. Ideally we should use the accurate positions of the IOs, but for now this should be ok, at least for my needs. I look forward to seeing the final device layouts for Artix-7!

@AlexandreSinger AlexandreSinger merged commit 31cbe66 into verilog-to-routing:master Jun 22, 2025
33 checks passed
@AlexandreSinger AlexandreSinger deleted the feature-arch-artix-7 branch June 22, 2025 13:09