[Arch] Added Artix-7-like Devices on the Xilinx 7 Series Capture #3139
8f48fb9 to 61d06ff
Does the architecture capture have 2 DSP slices per DSP tile? I thought 7-series DSPs were pretty much one multiplier with some extra circuitry.
You're right, that does look like 2 multipliers per block. @WhiteNinjaZ: we can leave this, but what's the motivation for putting 2 multipliers in a DSP block instead of 1 for this architecture? I would have thought one would be a closer fit to the 7-series.
@vaughnbetz and @AlexandreSinger you're right that the Xilinx documentation only talks about individual DSP slices. However, looking at the actual hardware, a single 7-series DSP tile is made up of two DSP48E1 slices, with some specialized interconnect running between the two of them as well as interfacing with the global routing. The DSP tile spans between 4 and 5 CLBs. Here is an image from the device manager in Vivado showing this:
@AlexandreSinger just so I can do some quick checks and calculations, where did you pull your target resource counts for the actual 7-series from? I assume from each resource's documentation page (i.e. Table 1-1 for CLB)? Specifically, I am interested in where you pulled the IO counts from. Each device has several sub-packages, and the number of IOs in each package can vary widely, although the number of LUTs and FFs stays the same. For instance, the XC7A200T device can have anywhere from 484 to 1156 IOs depending on the package.
Hi @WhiteNinjaZ, I used the following document: https://docs.amd.com/v/u/en-US/ds180_7Series_Overview Specifically, I used Table 4. If I am misinterpreting the IO column please let me know! I was very worried about the IO count, but I took the table to be clear that the number of IOs available is the "Max User IO". I recognize that this does not include some of the IOs (such as GTP transceivers), but for the general circuits that we are testing on the device, I thought that number lined up best. For the VTR circuits, for example, I thought their inputs would line up best with the User IO.
Thanks for the DSP clarification @WhiteNinjaZ.
@AlexandreSinger Sounds good, that table is a good spot to pull the counts from. When I said 1156 IO for the XC7A200T I was including all IOs, including GTPs etc. The numbers in that table are technically the max number of IOBs in the device, which is correct. And most of your resource counts do look good. The IO on the 7-series is pretty limited, unfortunately. mkDelayWorker32B, or1200, and raygentop are all too big to fit on any 7-series Artix device when it comes to IO. You would have to go up to the Virtex devices to get enough IO (XC7V2000T is the only device where everything will fit). I should have time on Thursday to take a detailed look at the changes to the architecture. But in the meantime, as a sanity check, here are my resource usages from Vivado on several of the above designs:
All utilizations are percentages. Results are post-synthesis only, not post-implementation. From what I have seen on the VTR benchmarks, post-synthesis and post-implementation results are almost always identical in Vivado. Each result is run on the device you specified and the package that provides the max number of IO. It would be good to get the usage for all of the resources on your end for comparison so we aren't only looking at the limiting resource. For now, though, all your results look good with the exception of LU32PEEng, boundtop, and stereovision3. In LU32PEEng and boundtop the IO usage in Vivado is about 20% less. This could be because Vivado is ripping out some logic, as its synthesizer is quite a bit more aggressive at removing logic than Yosys. What really confuses me is stereovision3, which uses about 20% less IO in VTR than it does in Vivado. I should also note that Vivado does not use DSP resources in stereo1 while VTR does, but this is consistent with what we have seen in the past and seems to be due to differences in synthesizers.
Maybe VTR removes unused IOs while Vivado doesn't? (Quartus is conservative about removing IOs in case they are placeholders for a planned board design; Vivado is likely similar.)
Hi @WhiteNinjaZ, sorry for the delay! Thank you so much for your response. I appreciate the numbers from Vivado! I agree that the Artix-7 device family has somewhat limited IO resources. I chose to go with this device family since its other resource counts provide a good spread across the 8 largest VTR benchmarks. Here are the device utilizations I am seeing on VTR master on the devices I posted above. These are the device utilizations reported by VPR, where I believe the utilization is computed as the ratio of the required physical tile area to the available physical tile area on the device for each resource tile type:
I also took the liberty of computing the relative error between the Vivado results you posted above and the VTR results here (relative error = abs(Vivado - VTR) / Vivado ):
I left out the CLB relative error since it's not easy for me to get the device utilization in terms of LUTs and FFs. It's possible, just not as convenient as the device utilization of the tiles. Overall, I am quite happy with the relative error. The IOs especially are extremely close to the Vivado results (up to 8% error). Stereovision2 on the VTR side uses way more DSPs than Vivado, but I am not sure if that is a result of the device resources or the packer / synthesis. BRAMs also look quite good. arm_core has 40% error; however, since Vivado used 5% and VTR used 7%, I do not see that as being too outlandish. NOTE: I only put the top 8 circuits above since I had that data on hand. If you would like me to collect the others, do let me know!
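As a quick check of the relative error definition used above, here is a minimal Python sketch. The function name `relative_error` is hypothetical and is not part of VTR or Vivado; the only data used is the arm_core BRAM utilization quoted in this comment (Vivado 5%, VTR 7%).

```python
def relative_error(vivado: float, vtr: float) -> float:
    """Relative error of the VTR utilization, using Vivado as the reference."""
    if vivado == 0:
        # The ratio is undefined when the reference tool uses none of the resource.
        raise ValueError("Vivado utilization is zero; relative error is undefined")
    return abs(vivado - vtr) / vivado


# arm_core BRAM utilization quoted above: Vivado 5%, VTR 7%.
arm_core_bram_error = relative_error(5.0, 7.0)
print(f"arm_core BRAM relative error: {arm_core_bram_error:.0%}")
```

Because the error is normalized by the Vivado value, a small absolute gap on a lightly used resource (2 percentage points here) shows up as a large relative error.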
@WhiteNinjaZ @vaughnbetz I started looking into changing the number of IO sub-tiles per IO tile; however, I am finding that the ratios come out a bit strange. It would take a good amount of work to change the devices I have now to account for this. It would also require us to regenerate and re-collect the data for the tests that did not use fixed devices. @WhiteNinjaZ, I think you said you were looking to explore the device layout further in the future, so this would end up being redone anyway. Are people ok with leaving the devices and architecture capture as they are now (with 8 IO sub-tiles per IO tile)?
We can defer it but I think it would be good if you made this change eventually @WhiteNinjaZ. It's a bit odd to have clumps of 8 IOs in some places and no IOs in other places.
@AlexandreSinger I've gone through the changes and all your numbers look good. I did leave a few optional comments, but you can decide if you want to do those or not or if you want to leave them as an action item for me in the future. I will definitely cut the IO down in the future.
<col type="io" startx="0" starty="1" incry="8" priority="100" />
<col type="io" startx="36" starty="1" incry="8" priority="100" />
<row type="io" startx="1" starty="0" incrx="8" priority="100" />
<row type="io" startx="1" starty="36" incrx="9" priority="100" />
Since we are using so few IOs, it might be good to try constraining the IO to just one side of the device. This is technically more accurate to the 7-series (IO constrained to the left of the chip, the right, or both).
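To make the spacing in the `<col>` and `<row>` lines under review concrete, here is a rough Python sketch that enumerates the locations they would produce. It rests on several assumptions that are not stated in this thread: a 37x37 grid (inferred only from `startx="36"` and `starty="36"`), IO tiles that are 1x1, and a column or row that is filled from its start coordinate to the far edge in steps of `incry` / `incrx`. The helper names are hypothetical, not VPR functions.

```python
from typing import List, Tuple


def col_positions(startx: int, starty: int, incry: int, height: int) -> List[Tuple[int, int]]:
    """(x, y) locations for a column of 1x1 tiles, one tile every `incry` rows."""
    return [(startx, y) for y in range(starty, height, incry)]


def row_positions(starty: int, startx: int, incrx: int, width: int) -> List[Tuple[int, int]]:
    """(x, y) locations for a row of 1x1 tiles, one tile every `incrx` columns."""
    return [(x, starty) for x in range(startx, width, incrx)]


GRID = 37  # ASSUMPTION: grid size is not given in the diff snippet

left = col_positions(startx=0, starty=1, incry=8, height=GRID)
right = col_positions(startx=36, starty=1, incry=8, height=GRID)
bottom = row_positions(starty=0, startx=1, incrx=8, width=GRID)
top = row_positions(starty=36, startx=1, incrx=9, width=GRID)

for name, tiles in [("left", left), ("right", right), ("bottom", bottom), ("top", top)]:
    print(name, len(tiles), tiles)
```

Under these assumptions the row with `incrx="9"` ends up with one fewer IO tile than the sides that use an increment of 8, which is one way the IO tile count can be tuned per edge.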
- 50 BRAM blocks | 50 BRAM tiles
- 250 IOs | 31 IO tiles
-->
<fixed_layout name="XC7A35T-like" width="58" height="58">
This is another optional item, and I can make this change in the future, but just to be aware: the 7-series devices are not square but rectangular. The width-to-height ratio (W/H) is on average about 0.25 across the different chip sets. If you are only going for accurate resource counts on the device then it doesn't really matter, but it is worth mentioning.
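For a sense of scale, here is a small Python sketch of what a W/H ratio of about 0.25 would mean for the 58x58 "XC7A35T-like" layout if the total tile count were kept roughly constant. The helper `rectangular_dims` is hypothetical, and the sketch only preserves grid area; it says nothing about how the CLB, BRAM, DSP, and IO columns would have to be rearranged.

```python
import math
from typing import Tuple


def rectangular_dims(width: int, height: int, ratio: float) -> Tuple[int, int]:
    """Return (new_width, new_height) with new_width / new_height close to `ratio`
    and at least as many tiles as the original width x height grid."""
    area = width * height
    new_width = max(1, round(math.sqrt(area * ratio)))
    new_height = -(-area // new_width)  # integer ceiling division
    return new_width, new_height


new_w, new_h = rectangular_dims(58, 58, ratio=0.25)
print(new_w, new_h, new_w / new_h)
```

For the 58x58 grid this works out to 29 tiles wide by 116 tiles high, which has the same 3364 tiles as the square layout.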
For testing Analytical Placement, we currently need fixed-sized devices. To get the 7-Series architecture capture through the AP flow, I selected fixed device sizes that matched the Artix-7 device family in resource utilization.
The fixed layouts were obtained by performing the following process:
1) The device was auto-sized to the FPGA with the closest number of CLB slices to the target device.
2) Extra RAMs, DSPs, and IOs were dropped out evenly across the device.
The RAMs and DSPs usually needed to be dropped out, and their numbers were often cut in half relative to their auto-sized resource counts. Most of the IOs needed to be dropped out: the number of IOs in the auto-sized devices was around 10x that of their Artix-7 equivalents (with matching CLB resources).
These devices likely do not model the layout of the Artix-7 devices faithfully; however, my goal was to simulate the resource utilization of the Artix-7 device family on our Xilinx 7-Series capture.
Based on feedback for the Artix-7-like devices, I put the IOs on the sides of the devices instead of all around. This helps reduce the "clumps" of IOs somewhat and is slightly more accurate.
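The "dropped out evenly" step can be illustrated with a short Python sketch. This is only one possible way to spread the surviving tiles and is an assumption on my part, not a description of how the layouts in this PR were actually produced; the function name `evenly_spaced_keep` is hypothetical.

```python
from typing import List


def evenly_spaced_keep(total: int, keep: int) -> List[int]:
    """Choose `keep` of `total` slot indices (0-based), spread as evenly as possible.

    Slots not returned are the ones that would be "dropped out".
    """
    if not 0 <= keep <= total:
        raise ValueError("keep must be between 0 and total")
    # Integer scaling keeps the indices distinct and increasing when keep <= total.
    return [(i * total) // keep for i in range(keep)]


# Example: an auto-sized device with 10 DSP slots in a column, cut in half.
kept = evenly_spaced_keep(total=10, keep=5)
dropped = [i for i in range(10) if i not in kept]
print("kept:", kept)
print("dropped:", dropped)
```

Halving a resource this way keeps every other slot, which mirrors the "cut in half" case described for RAMs and DSPs; the much larger IO reduction would simply use a smaller `keep`.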
61d06ff to 43edba0
@vaughnbetz @WhiteNinjaZ Thanks for the feedback! I have updated the devices to put the IOs on the left and right sides of the device. This should help with the density of the IOs and does not require me to rewrite too much. Ideally we should use the accurate positions of the IOs, but for now this should be ok, at least for my needs. I look forward to seeing the final device layouts for Artix-7!
For those interested, here are the VTR benchmarks mapped to the smallest Artix-7 device that they can fit on and their resource utilization:
Of specific interest to me are the 8 largest VTR benchmark circuits, which I have pulled out below: