**University of Pennsylvania, CIS 565: GPU Programming and Architecture**
Project 1 - Flocking
====================

* Nischal K N
* Tested on: Windows 10, i7-2670QM @ 2.20GHz 8GB, GTX 540 2GB (Personal)

### SUMMARY

This project implements a flocking simulation based on the [Reynolds Boids algorithm](http://www.red3d.com/cwr/boids/) on a GPU using CUDA and OpenGL. It includes two levels of optimization: a uniform grid, and a uniform grid with semi-coherent memory access. Profiling results for the various levels of implementation are included in this README.
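The three Reynolds steering rules (cohesion, separation, alignment) can be sketched roughly as below. This is an illustrative single-boid sketch, not the project's actual kernel code; the function and variable names are hypothetical, and the rule radii and weights follow the table under PARAMETERS.

```cuda
#include <glm/glm.hpp>  // GLM vector types, as commonly used with OpenGL interop

// Illustrative sketch: brute-force velocity change for one boid.
__device__ glm::vec3 computeVelocityChange(int self, int N,
                                           const glm::vec3 *pos,
                                           const glm::vec3 *vel) {
  glm::vec3 center(0.f), separate(0.f), align(0.f);
  int cohesionNeighbors = 0, alignNeighbors = 0;
  for (int i = 0; i < N; i++) {
    if (i == self) continue;
    float d = glm::distance(pos[i], pos[self]);
    if (d < 5.0f) { center += pos[i]; cohesionNeighbors++; }  // cohesion
    if (d < 3.0f) { separate -= (pos[i] - pos[self]); }       // separation
    if (d < 5.0f) { align += vel[i]; alignNeighbors++; }      // alignment
  }
  glm::vec3 dv(0.f);
  if (cohesionNeighbors > 0)
    dv += (center / (float)cohesionNeighbors - pos[self]) * 0.01f;
  dv += separate * 0.1f;
  if (alignNeighbors > 0)
    dv += (align / (float)alignNeighbors) * 0.1f;
  return dv;
}
```

The uniform-grid optimizations replace the O(N) loop over all boids with a search over neighboring grid cells only.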

### OUTPUT

Sample output of uniform grid with semi-coherent memory access with 5000 particles.

![](images/output.gif)

### SETUP and COMPILING

* To set up the CUDA development environment, use [this guide](https://github.com/nischalkn/Project0-CUDA-Getting-Started/blob/master/INSTRUCTION.md#part-1-setting-up-your-development-environment)
* Once the environment is set up, build and run the project following [these instructions](https://github.com/nischalkn/Project0-CUDA-Getting-Started/blob/master/INSTRUCTION.md#part-3-build--run)

### PARAMETERS
* `VISUALIZE` - `1` or `0`, Turn graphical output on/off
* `UNIFORM_GRID` - `1` or `0`, Turn on/off uniform grid with scattered search optimization
* `COHERENT_GRID` - `1` or `0`, Turn on/off uniform grid with coherent search optimization
* `N_FOR_VIS` - Number of boids/particles
* `maxSpeed` - Maximum speed of the particles
* `blockSize` - Number of threads per block
* Boid motion parameters:

| Rule | Distance | Scale |
|:-------------:|:---------:|:---------:|
| Cohesion | 5.0f | 0.01f |
| Separation | 3.0f | 0.1f |
| Alignment | 5.0f | 0.1f |
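These switches and constants might be set as compile-time definitions along the following lines. This is a hypothetical sketch: the `rule*` names, the `blockSize` default of 128, and the `maxSpeed` value are assumptions, not taken from this project's source.

```cuda
// Toggle implementations at compile time (1 = on, 0 = off)
#define VISUALIZE     1
#define UNIFORM_GRID  1
#define COHERENT_GRID 1

#define N_FOR_VIS 5000        // number of boids
#define blockSize 128         // threads per CUDA block (assumed default)

// Boid motion parameters (radii and weights from the table above)
#define rule1Distance 5.0f    // cohesion radius
#define rule1Scale    0.01f
#define rule2Distance 3.0f    // separation radius
#define rule2Scale    0.1f
#define rule3Distance 5.0f    // alignment radius
#define rule3Scale    0.1f
#define maxSpeed      1.0f    // assumed value
```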

### PERFORMANCE ANALYSIS
Performance of each implementation was measured as the time taken for computation, using CUDA event timers, across different particle counts and block sizes.
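Timing a kernel with CUDA events follows a standard pattern; a minimal sketch (the kernel name and launch arguments are placeholders):

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
stepSimulation<<<fullBlocksPerGrid, blockSize>>>(/* ... */);  // placeholder kernel
cudaEventRecord(stop);

cudaEventSynchronize(stop);              // wait until the kernel has finished
float ms = 0.f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);
```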

The graph below shows the average computation time for different numbers of particles across the different levels of optimization. The time taken for 5,000 steps was recorded in each case and averaged. As the number of particles increases, the time to compute positions and velocities grows rapidly, since the number of particles affecting each particle also increases. This growth is very prominent in the naive implementation and much less so in the optimized searches. The graph also shows that the scattered search is faster than the coherent search below roughly 2,000 particles; beyond that, the coherent search performs better. Only at high particle counts does the benefit of sequential memory access outweigh the additional work of rearranging the velocity and position buffers.

![](images/particles.png)
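Both grid optimizations hinge on binning boids into uniform grid cells so that only nearby cells need to be searched. A sketch of the cell-index computation (all names here are illustrative, not necessarily the project's):

```cuda
// Map 3D cell coordinates to a flattened 1D index (x changes fastest).
__device__ int gridIndex3Dto1D(int x, int y, int z, int gridResolution) {
  return x + y * gridResolution + z * gridResolution * gridResolution;
}

// One thread per boid: compute its grid cell index. gridMin is the minimum
// corner of the simulation box; inverseCellWidth = 1 / cellWidth.
__global__ void kernComputeIndices(int N, int gridResolution,
                                   glm::vec3 gridMin, float inverseCellWidth,
                                   const glm::vec3 *pos,
                                   int *particleGridIndices,
                                   int *particleArrayIndices) {
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  if (index >= N) return;
  glm::vec3 cell = glm::floor((pos[index] - gridMin) * inverseCellWidth);
  particleGridIndices[index] =
      gridIndex3Dto1D((int)cell.x, (int)cell.y, (int)cell.z, gridResolution);
  particleArrayIndices[index] = index;  // sorted by cell index afterwards
}
```

Sorting boids by cell index gives the scattered-search layout; the coherent variant additionally reshuffles the position and velocity buffers into that order, which is the extra work discussed above.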

The second analysis measured the performance of the different optimizations across block sizes, using a constant particle count of 10,000 in all cases. The average time to execute 5,000 steps was recorded in each case. Larger block sizes show faster computation times. The poor performance at small block sizes can be attributed to the GPU's limited caching ability: small blocks cause constant switching of memory contexts, resulting in heavy memory transfer, blocked threads, and longer execution times. Increasing the block size beyond a point also produces a slight increase in execution time, as clock cycles are spent retiring dead blocks and freeing enough resources for new ones.

![](images/blocks.png)