@@ -9,7 +9,7 @@ As the Transformers architecture scaled well in Natural Language Processing, the
To summarize, in a Vision Transformer, images are reorganized as 2D grids of patches, and the model is trained on those patches.

The main idea is shown in the picture below:
![Vision Transformer](https://huggingface.co/datasets/hf-vision/course-assets/blob/main/Screenshot%20from%202024-12-27%2014-25-49.png)
![Vision Transformer](https://cdn-lfs.hf.co/repos/bf/2a/bf2a50c0acddc20a4963bc4c9bfd57f4a0a887faf272014a120b3c17331f0aa6/81cd51c440e2a1813e583f63fedb503557c20b2045b44e03080cce119b8c8bdc?response-content-disposition=inline%3B+filename*%3DUTF-8%27%27Screenshot%252520from%2525202024-12-27%25252014-25-49.png%3B+filename%3D%22Screenshot%2520from%25202024-12-27%252014-25-49.png%22%3B&response-content-type=image%2Fpng&Expires=1746645354&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTc0NjY0NTM1NH19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy5oZi5jby9yZXBvcy9iZi8yYS9iZjJhNTBjMGFjZGRjMjBhNDk2M2JjNGM5YmZkNTdmNGEwYTg4N2ZhZjI3MjAxNGExMjBiM2MxNzMzMWYwYWE2LzgxY2Q1MWM0NDBlMmExODEzZTU4M2Y2M2ZlZGI1MDM1NTdjMjBiMjA0NWI0NGUwMzA4MGNjZTExOWI4YzhiZGM%7EcmVzcG9uc2UtY29udGVudC1kaXNwb3NpdGlvbj0qJnJlc3BvbnNlLWNvbnRlbnQtdHlwZT0qIn1dfQ__&Signature=uGeFIZ1%7E1PG-wskGgLC%7Ev5HUADQBT4%7EesKTlSBIuk2VrKZAl1linwb5hfWtW2trWRoKjM3%7Ek05QessmOzMcBkfqCzj7IdCWsNW0b1OUE7lD5HVxIsgkaQMhPKimIpMf5whyCs1xNTsOXIQq9m4PJIbwtysby35a3K61n8e2po11Mm89MEJRKe-5ZzfXxavCatiysaLR2PGQfunhD43eiKh2N%7E41O80konEcjIg0d8iyGEDHuHjJF7ZFSX8SnARcxBaeJ93e07RSqMIiCzccnn98s0qeSJzAOjN5LDD6r1OZ933xpZHiyGCXf9dYnXQlFM%7EGgzvXjCKR2hqnocqVPHQ__&Key-Pair-Id=K3RPWS32NSSJCE)
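
To make the patch idea concrete, here is a minimal NumPy sketch of how an image could be cut into a 2D grid of flattened patches. The function name `image_to_patches`, the 224x224 input, and the patch size of 16 are illustrative assumptions, not values taken from this chapter, and a real Vision Transformer would also apply a learned linear projection and add position embeddings to these patches.

```python
import numpy as np


def image_to_patches(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into a 2D grid of flattened patches.

    Hypothetical helper for illustration only.
    Returns an array of shape (H // P, W // P, P * P * C).
    """
    height, width, channels = image.shape
    if height % patch_size or width % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    rows, cols = height // patch_size, width // patch_size
    # Split both spatial axes: (rows, P, cols, P, C).
    grid = image.reshape(rows, patch_size, cols, patch_size, channels)
    # Bring the two grid axes to the front: (rows, cols, P, P, C).
    grid = grid.transpose(0, 2, 1, 3, 4)
    # Flatten the pixels of each patch into one vector.
    return grid.reshape(rows, cols, patch_size * patch_size * channels)


# Assumed example input: a 224x224 RGB image and 16x16 patches.
image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
patches = image_to_patches(image, patch_size=16)
print(patches.shape)  # (14, 14, 768)

# The model consumes the grid as a sequence of patch vectors.
tokens = patches.reshape(-1, patches.shape[-1])
print(tokens.shape)  # (196, 768)
```

Each entry `patches[i, j]` holds the pixels of the patch in row `i`, column `j` of the grid, so the 2D layout of the image is preserved until the grid is flattened into a sequence.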

But there is a catch! Convolutional Neural Networks (CNNs) are designed with an assumption that is missing in the Vision Transformer (VT). This assumption is based on how we, as humans, perceive objects in images. It is described in the following section.
