[Arch] Added Artix-7-like Devices on the Xilinx 7 Series Capture #3139

AlexandreSinger · 2025-06-15T12:29:43Z

For testing Analytical Placement, we currently need fixed-sized devices. To get the 7-Series architecture capture through the AP flow, I selected fixed device sizes that matched the Artix-7 device family in resource utilization.

The fixed layouts were obtained by performing the following process:

1) The device was auto-sized to the FPGA with the closest number of CLB slices to the target device.
2) Extra RAMs, DSPs, and IOs were dropped out evenly across the device.

The RAMs and DSPs usually needed to be dropped out, and their numbers were often cut in half relative to their auto-sized resource counts.

Most of the IOs needed to be dropped out: the number of IOs in the auto-sized devices was around 10x that of the equivalent Artix-7 device (with matching CLB resources).

These devices likely do not model the layout of the Artix-7 devices faithfully; however, my goal was to simulate the resource utilization of the Artix-7 device family on our Xilinx 7-Series capture.
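As an illustration of the drop-out step, here is a minimal sketch that assumes "dropping out evenly" means keeping an evenly spaced subset of the auto-sized columns of one resource type. The function name `evenly_spaced_subset` and the column positions are hypothetical, and this is not necessarily how the layouts in this PR were produced.

```python
def evenly_spaced_subset(positions, keep):
    """Return `keep` items from `positions`, spread as evenly as possible."""
    if keep < 0 or keep > len(positions):
        raise ValueError("keep must be between 0 and len(positions)")
    if keep == 0:
        return []
    n = len(positions)
    # Split the list into `keep` equal-width bins and take the item at the
    # centre of each bin, so the survivors are spread across the device.
    return [positions[(2 * i + 1) * n // (2 * keep)] for i in range(keep)]


# Hypothetical example: 8 auto-sized DSP columns, cut in half.
dsp_columns = [6, 14, 22, 30, 38, 46, 54, 62]
print(evenly_spaced_subset(dsp_columns, 4))
```

The x-positions of the kept columns could then be written into the fixed layout, with the dropped columns falling back to the default fill tile.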

For those interested, here are the VTR benchmarks mapped to the smallest Artix-7 device that they can fit on and their resource utilization:

Circuit	Minimum Device	Device Utilization	Limiting Resource Utilization	Limiting Resource
arm_core.v	XC7A200T	0.06	0.62	io
bgm.v	XC7A75T	0.45	0.95	io
blob_merge.v	XC7A12T	0.71	0.94	CLB
boundtop.v	XC7A200T	0.01	0.77	io
ch_intrinsics.v	XC7A15T	0.06	0.89	io
diffeq1.v	XC7A75T	0.02	0.85	io
diffeq2.v	Yosys crash	-	-	-
LU32PEEng.v	XC7A200T	0.52	0.61	CLB
LU8PEEng.v	XC7A50T	0.61	0.87	io
mcml.v	XC7A200T	0.68	0.79	CLB
mkDelayWorker32B.v	Did not fit	-	-	io (Needs 1059)
mkPktMerge.v	XC7A200T	0.01	0.93	io
mkSMAdapter4B.v	XC7A200T	0.01	0.79	io
or1200.v	Did not fit	-	-	io (Needs 747)
raygentop.v	Did not fit	-	-	io (Needs 541)
sha.v	XC7A12T	0.15	0.49	io
spree.v	XC7A12T	0.08	0.51	io
stereovision0.v	XC7A200T	0.06	0.73	io
stereovision1.v	XC7A75T	0.2	0.89	DSP
stereovision2.v	XC7A200T	0.21	0.66	io
stereovision3.v	XC7A12T	0.02	0.09	io
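The "Minimum Device" and "Limiting Resource" columns above can be thought of as the output of a simple search over a device list ordered from smallest to largest. The sketch below assumes a table of per-device capacities; the device names and capacities are hypothetical placeholders, not data-sheet values.

```python
def smallest_fitting_device(required, devices):
    """Return (name, limiting_resource, utilization) for the first device that fits.

    `devices` must be ordered from smallest to largest. Returns None if the
    circuit does not fit on any device.
    """
    for name, capacity in devices:
        if all(required[r] <= capacity.get(r, 0) for r in required):
            # Utilization per resource type; the largest one is the limiter.
            util = {r: required[r] / capacity[r] for r in required if required[r] > 0}
            if not util:
                return name, None, 0.0
            limiting = max(util, key=util.get)
            return name, limiting, round(util[limiting], 2)
    return None


# Hypothetical capacities (placeholders, NOT data-sheet values).
devices = [
    ("small", {"io": 150, "clb": 1000, "dsp": 20, "memory": 20}),
    ("medium", {"io": 300, "clb": 6000, "dsp": 90, "memory": 100}),
    ("large", {"io": 500, "clb": 16000, "dsp": 370, "memory": 365}),
]

print(smallest_fitting_device({"io": 280, "clb": 400}, devices))
print(smallest_fitting_device({"io": 541, "clb": 400}, devices))
```

The second call returns `None`, which corresponds to the "Did not fit" rows: the IO requirement exceeds the largest device in the list.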

Of specific interest to me are the 8 largest VTR benchmark circuits, which I have pulled out below:

Circuit	Minimum Device	Device Utilization	Limiting Resource Utilization	Limiting Resource
arm_core.v	XC7A200T	0.06	0.62	io
stereovision0.v	XC7A200T	0.06	0.73	io
LU8PEEng.v	XC7A50T	0.61	0.87	io
bgm.v	XC7A75T	0.45	0.95	io
stereovision1.v	XC7A75T	0.2	0.89	DSP
stereovision2.v	XC7A200T	0.21	0.66	io
LU32PEEng.v	XC7A200T	0.52	0.61	CLB
mcml.v	XC7A200T	0.68	0.79	CLB

vaughnbetz · 2025-06-15T20:48:38Z

Does the architecture capture have 2 DSP slices per DSP tile? I thought 7-series DSPs were pretty much one multiplier with some extra circuitry.

AlexandreSinger · 2025-06-15T21:03:21Z

> Does the architecture capture have 2 DSP slices per DSP tile? I thought 7-series DSPs were pretty much one multiplier with some extra circuitry.

Based on what I am seeing, each DSP tile implements 2 DSP slices.

My AP mass report says that each logical DSP block contains two 25x18 multiplier pbs:

I also verified my mass report by looking at the architecture file:

The data sheet just uses the term "DSP48E1 Slices", with a footnote that says "Each DSP slice contains a pre-adder, a 25 x 18 multiplier, an adder, and an accumulator".

Edit: For the devices that I created, I was careful to ensure that the number of slices in the architecture matched the data sheet (not the number of blocks)
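Because the data sheet counts DSP slices while the architecture counts DSP blocks, converting between the two is a division by the number of slices per block. A small sketch, assuming two slices per block as observed above; the function name and the slice counts in the example are illustrative only.

```python
SLICES_PER_DSP_BLOCK = 2  # as observed in the architecture capture above


def dsp_blocks_needed(dsp_slices, slices_per_block=SLICES_PER_DSP_BLOCK):
    """Number of DSP blocks required to provide at least `dsp_slices` slices."""
    if dsp_slices < 0:
        raise ValueError("dsp_slices must be non-negative")
    # Integer ceiling division: an odd slice count still needs a whole block.
    return -(-dsp_slices // slices_per_block)


# Illustrative slice counts, not taken from the data sheet.
print(dsp_blocks_needed(90))
print(dsp_blocks_needed(45))
```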

vaughnbetz · 2025-06-15T21:35:49Z

You're right, that does look like 2 multipliers per block. @WhiteNinjaZ : we can leave this, but what's the motivation for putting 2 multipliers in a DSP block instead of 1 for this architecture? I would have thought one would be a closer fit to the 7-series.

WhiteNinjaZ · 2025-06-16T16:58:30Z

@vaughnbetz and @AlexandreSinger you're right, the Xilinx documentation only talks about individual DSP slices. However, looking at the actual hardware, a single 7-series DSP tile is made up of two DSP48E1 slices, with some specialized interconnect running between the two of them as well as interfacing with the global routing. The DSP tile spans between 4 and 5 CLBs. Here is an image from the device manager in Vivado showing this:

Note that, as discussed in the VTR_9 paper, the specialized interconnect for the DSPs has not been implemented yet.

WhiteNinjaZ · 2025-06-16T17:48:20Z

@AlexandreSinger just so I can do some quick checks and calculations, where did you pull your target resource counts from for the actual 7-series? I assume from each resource's documentation page (i.e. Table 1-1 for CLB)? Specifically, I am interested in where you pulled the IO counts from.

Each device has several sub-packages and the number of IOs in each package can vary widely, although the number of LUTs and FFs will stay the same. For instance, the XC7A200T device can have anywhere from 484 to 1156 IOs depending on the package.

AlexandreSinger · 2025-06-16T18:50:03Z

Hi @WhiteNinjaZ I used the following document: https://docs.amd.com/v/u/en-US/ds180_7Series_Overview

Specifically I used Table 4:

If I am misinterpreting the IO column please let me know! I was very worried about the IO count, but I assumed the table was very clear that the number of IOs available is the "Max User IO". I recognize that this does not include some of the IOs (such as GTP transceivers), but for general circuits that we are testing on the device, I thought that number lined up the most. For the VTR circuits for example, I thought their inputs would line up the best with the User IO.

vaughnbetz · 2025-06-16T19:46:21Z

Thanks for the DSP clarification @WhiteNinjaZ

WhiteNinjaZ · 2025-06-17T16:16:59Z

@AlexandreSinger Sounds good, that table is a good spot to pull the counts from. When I said 1156 IO for the XC7A200T, I was counting all IOs, including GTPs etc. The numbers in that table are technically the maximum number of IOBs in the device, which is the correct count to use. Most of your resource counts look good. The IO on the 7-series is pretty limited, unfortunately: mkDelayWorker32B, or1200, and raygentop are all too big to fit on any 7-series Artix device when it comes to IO. You would have to go up to the Virtex devices to get enough IO (XC7V2000T is the only device where everything will fit).

I should have time on Thursday to take a detailed look at the changes to the architecture. In the meantime, as a sanity check, here are my resource usages from Vivado on several of the above designs:

Design	Device and Package	IO	LUT	FF	DSP	BRAM
arm_core.v	xc7a200tffv1156-1	62	4	1	-	5
bgm.v	xc7a75tfgg676-1	96	26	5	12	-
boundtop.v	xc7a200tffv1156-1	66	0.17	0.08	-	-
diffeq1.v	xc7a75tfgg676-1	86	1	1	-	-
LU32PEEng.v	xc7a200tffv1156-1	43.2	36.84	5.84	8.65	41.1
LU8PEEng.v	xc7a50tfgg484-1	86	48	8	13	56
stereovision0.v	xc7a200tffv1156-1	73	2	4	-	-
stereovision1.v	xc7a75tfgg676-1	93	31	12	-	-
stereovision2.v	xc7a200tffv1156-1	66	6	5	33	-
stereovision3.v	xc7a12tcsg325-1	27	0.68	0.62	-	-

All utilizations are percentages. Results are post-synthesis only, not post-implementation; from what I have seen on the VTR benchmarks, post-synthesis and post-implementation results are almost always identical in Vivado. Each design was run on the device you specified, using the package that provides the maximum number of IOs. It would be good to get the usage for all of the resources on your end for comparison, so we aren't only looking at the limiting resource. For now, though, all your results look good with the exception of LU32PEEng, boundtop, and stereovision3. For LU32PEEng and boundtop, the IO usage in Vivado is about 20% less. This could be because Vivado is ripping out some logic, as its synthesizer is quite a bit more aggressive at removing logic than Yosys. What really confuses me is stereovision3, which uses about 20% less IO in VTR than it does in Vivado.

I should also note that Vivado does not use DSP resources in stereovision1 while VTR does, but this is consistent with what we have seen in the past and seems to be due to differences between the synthesizers.

vaughnbetz · 2025-06-17T16:47:43Z

Maybe VTR removes unused IOs while Vivado doesn't? (Quartus is conservative about removing IOs in case they are placeholders for a planned board design; Vivado is likely similar).

AlexandreSinger · 2025-06-18T01:09:34Z

Hi @WhiteNinjaZ sorry for the delay! Thank you so much for your response. I appreciate the numbers from Vivado! I agree that the Artix-7 device family has somewhat limited IO resources. I chose to go with this device family since its other resource counts provided a good spread across the 8 largest VTR benchmarks.

Here are the device utilizations I am seeing on VTR master on the devices I posted above. These are the device utilizations reported by VPR, where I believe the utilization is computed as a ratio of the required physical tile area over the available physical tile area on the device for each resource tile type:

Design	Device	io	CLB	memory	DSP
arm_core	XC7A200T	0.62	0.07	0.07	0.00
bgm	XC7A75T	0.95	0.55	0.00	0.12
LU32PEEng	XC7A200T	0.43	0.61	0.42	0.09
LU8PEEng	XC7A50T	0.87	0.72	0.60	0.13
stereovision0	XC7A200T	0.73	0.08	0.00	0.00
stereovision1	XC7A75T	0.86	0.20	0.00	0.89
stereovision2	XC7A200T	0.66	0.21	0.00	0.63
mcml	XC7A200T	0.78	0.79	0.44	0.22
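Given per-tile-type utilizations like those in the table above, the limiting resource is the type with the highest ratio. A minimal sketch that reads a few rows of the table (values copied from above); the helper name `limiting_resource` is hypothetical.

```python
# Per-type utilizations copied from the VPR table above.
vpr_utilization = {
    "bgm": {"io": 0.95, "CLB": 0.55, "memory": 0.00, "DSP": 0.12},
    "LU32PEEng": {"io": 0.43, "CLB": 0.61, "memory": 0.42, "DSP": 0.09},
    "stereovision1": {"io": 0.86, "CLB": 0.20, "memory": 0.00, "DSP": 0.89},
    "mcml": {"io": 0.78, "CLB": 0.79, "memory": 0.44, "DSP": 0.22},
}


def limiting_resource(util_by_type):
    """Return (resource, utilization) for the most heavily used tile type."""
    resource = max(util_by_type, key=util_by_type.get)
    return resource, util_by_type[resource]


for design, util in vpr_utilization.items():
    resource, value = limiting_resource(util)
    print(f"{design:15s} {resource:7s} {value:.2f}")
```

For mcml the io and CLB utilizations are within 0.01 of each other, so small changes to the device could flip which resource is reported as limiting.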

I also took the liberty of computing the relative error between the Vivado results you posted above and the VTR results here (relative error = abs(Vivado - VTR) / Vivado ):

Design	io relative error	DSP relative error	BRAM relative error
arm_core	0.00	0.00	0.40
bgm	0.01	0.00	0.00
LU32PEEng	0.00	0.04	0.02
LU8PEEng	0.01	0.00	0.07
stereovision0	0.00	0.00	0.00
stereovision1	0.08	0.00	0.00
stereovision2	0.00	0.91	0.00
mcml	-	-	-
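The relative-error formula above can be sketched as follows. Vivado reports percentages while VPR reports fractions, so the Vivado values are divided by 100 first. When the Vivado value is zero the ratio is undefined; returning `None` in that case is a choice made for this sketch, not something taken from the table.

```python
def relative_error(vivado, vtr):
    """abs(vivado - vtr) / vivado, or None when the Vivado value is zero."""
    if vivado == 0:
        return None  # undefined: there is no reference value to normalize by
    return abs(vivado - vtr) / vivado


# (Vivado %, VTR fraction) pairs taken from the two tables above.
samples = {
    ("arm_core", "BRAM"): (5, 0.07),
    ("stereovision1", "io"): (93, 0.86),
    ("stereovision2", "DSP"): (33, 0.63),
}

for (design, resource), (vivado_pct, vtr_frac) in samples.items():
    err = relative_error(vivado_pct / 100, vtr_frac)
    print(f"{design} {resource}: {err:.2f}")
```

Rounded to two decimals, these three cases give 0.40, 0.08, and 0.91, matching the corresponding entries in the table.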

I left out the CLB relative error since it's not easy for me to get the device utilization in terms of LUTs and FFs. It's possible, just not as convenient as the device utilization of the tiles.

Overall, I am quite happy with the relative error. The IO results especially are extremely close to the Vivado results (at most 8% error). stereovision2 on the VTR side uses far more DSPs than Vivado, but I am not sure whether that is a result of the device resources or of the packer / synthesis. BRAMs also look quite good: arm_core has 40% error, but Vivado used 5% and VTR used 7%, so I do not see that as being too outlandish.

NOTE: I only put the top 8 circuits above since I had that data on hand. If you would like me to collect the others do let me know!

AlexandreSinger · 2025-06-19T19:11:12Z

@WhiteNinjaZ @vaughnbetz I started looking into changing the number of io sub-tiles per IO tile; however, I am finding that the ratios are a bit strange. It would take a good amount of work to change the devices I have now to account for this. It would also require us to regenerate and re-collect the data for the tests that did not use fixed devices. @WhiteNinjaZ, I think you said you were looking to explore the device layout further in the future, so this would end up being redone anyway.

Are people ok with leaving the devices and architecture capture as it is now (with 8 io sub-tiles per IO tile)?

vaughnbetz · 2025-06-19T23:15:48Z

We can defer it but I think it would be good if you made this change eventually @WhiteNinjaZ . It's a bit odd to have clumps of 8 IOs in some places and no IOs in other places.

WhiteNinjaZ

@AlexandreSinger I've gone through the changes and all your numbers look good. I did leave a few optional comments, but you can decide if you want to do those or not or if you want to leave them as an action item for me in the future. I will definitely cut the IO down in the future.

WhiteNinjaZ · 2025-06-20T04:35:38Z

vtr_flow/arch/xilinx/7series_BRAM_DSP_carry.xml

+      <col type="io" startx="0" starty="1" incry="8" priority="100" />
+      <col type="io" startx="36" starty="1" incry="8" priority="100" />
+      <row type="io" startx="1" starty="0" incrx="8" priority="100" />
+      <row type="io" startx="1" starty="36" incrx="9" priority="100" />


Since we are using so few IOs, it might be good to try constraining the IO to one side of the device. This is technically more accurate to the 7-series, where IO is constrained to the left side of the chip, the right side, or both.
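To illustrate the suggestion, here is a sketch that emits only the side `<col>` elements for a given device width, leaving out the top and bottom `<row>` elements. The attribute names and values are taken from the diff above; the helper name `side_io_columns` is hypothetical, and the width of 37 is inferred from `startx="36"` rather than stated in the diff.

```python
def side_io_columns(width, incry=8, priority=100, sides=("left", "right")):
    """Return <col> elements placing IO tiles on the chosen device sides."""
    startx = {"left": 0, "right": width - 1}
    return [
        f'<col type="io" startx="{startx[side]}" starty="1" '
        f'incry="{incry}" priority="{priority}" />'
        for side in sides
    ]


# Width 37 is an assumption inferred from startx="36" in the diff above.
for line in side_io_columns(width=37):
    print(line)
```

Passing `sides=("left",)` would restrict the IO to a single side.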


WhiteNinjaZ · 2025-06-20T04:41:29Z

vtr_flow/arch/xilinx/7series_BRAM_DSP_carry.xml

+                - 50 BRAM blocks   | 50 BRAM tiles
+                - 250 IOs          | 31 IO tiles
+    -->
+    <fixed_layout name="XC7A35T-like" width="58" height="58">


This is another optional change, and I can make it in the future, but just to be aware: the 7-series devices are not square but rectangular. The aspect ratio (W/H) is about 0.25 on average across the different chip sets. If you are only going for accurate resource counts on the device then it doesn't really matter, but it is worth mentioning.
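For a rough sense of what a 0.25 aspect ratio implies, the sketch below reshapes a square grid into a rectangle with about the same total tile area. The helper name `rectangular_dims` is hypothetical, and the calculation only preserves the total tile count; the per-type column layout would still need to be redone.

```python
import math


def rectangular_dims(width, height, aspect=0.25):
    """Return (w, h) with w / h close to `aspect` and at least the same tile area."""
    area = width * height
    new_w = max(1, round(math.sqrt(area * aspect)))
    # Integer ceiling division so the reshaped grid is never smaller in area.
    new_h = -(-area // new_w)
    return new_w, new_h


# The 58 x 58 XC7A35T-like layout from the diff above.
print(rectangular_dims(58, 58))
```

For the 58 x 58 layout this gives 29 x 116, which has exactly the same number of grid locations.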


AlexandreSinger · 2025-06-22T00:34:24Z

@vaughnbetz @WhiteNinjaZ Thanks for the feedback! I have updated the device to put the IOs on the left and right side of the device. This should help with the density of the IOs and does not require me to rewrite too much. Ideally we should use the accurate positions of the IO; but for now this should be ok at least for my needs. I look forward to seeing the final devices layouts for Artix-7!

AlexandreSinger requested review from vaughnbetz, amin1377 and WhiteNinjaZ June 15, 2025 12:29

AlexandreSinger force-pushed the feature-arch-artix-7 branch from 8f48fb9 to 61d06ff Compare June 15, 2025 12:30

WhiteNinjaZ reviewed Jun 20, 2025


AlexandreSinger added 2 commits June 21, 2025 19:05

[Arch] Updated Artix-7-like Devices IO Placement

43edba0

Based on feedback for the Artix-7-like devices, I put the IOs on the sides of the devices instead of all around. This helps reduce the "clumps" of IOs some and is slightly more accurate.

AlexandreSinger force-pushed the feature-arch-artix-7 branch from 61d06ff to 43edba0 Compare June 22, 2025 00:31

AlexandreSinger merged commit 31cbe66 into verilog-to-routing:master Jun 22, 2025
33 checks passed

AlexandreSinger deleted the feature-arch-artix-7 branch June 22, 2025 13:09
