We introduce HaGRID (HAnd Gesture Recognition Image Dataset), a large image dataset for hand gesture recognition (HGR) systems. You can use it for image classification or object detection tasks. The dataset makes it possible to build HGR systems for video conferencing services (Zoom, Skype, Discord, Jazz, etc.), home automation systems, the automotive sector, and more.
HaGRID is 716 GB in size and contains 552,992 Full HD (1920 × 1080) RGB images divided into 18 gesture classes. In addition, some images carry the no_gesture class when a second, free hand is in the frame; this extra class contains 123,589 samples. The data were split by subject user_id into training (92%) and testing (8%) sets, with 509,323 images for training and 43,669 images for testing.
The dataset contains 34,730 unique persons and at least as many unique scenes. The subjects are between 18 and 65 years old. The dataset was collected mainly indoors with considerable variation in lighting, including artificial and natural light. It also includes images taken in extreme conditions, such as subjects facing a window or standing with their back to one. The subjects showed gestures at a distance of 0.5 to 4 meters from the camera.
Example of sample and its annotation:
For more information, see our arXiv paper HaGRID - HAnd Gesture Recognition Image Dataset.
Clone the repository and install the required Python packages:
git clone https://github.com/hukenovs/hagrid.git
# or mirror link:
cd hagrid
# Create virtual env by conda or venv
conda create -n gestures python=3.9 -y
conda activate gestures
# Install requirements
pip install -r requirements.txt

docker build -t gestures .
docker run -it -d -v $PWD:/gesture-classifier gestures

We split the train dataset into 18 archives, one per gesture, because of the large size of the data. Download and unzip them from the following links:
| Gesture | Size | Gesture | Size |
|---|---|---|---|
| call | 39.1 GB | peace | 38.6 GB |
| dislike | 38.7 GB | peace_inverted | 38.6 GB |
| fist | 38.0 GB | rock | 38.9 GB |
| four | 40.5 GB | stop | 38.3 GB |
| like | 38.3 GB | stop_inverted | 40.2 GB |
| mute | 39.5 GB | three | 39.4 GB |
| ok | 39.0 GB | three2 | 38.5 GB |
| one | 39.9 GB | two_up | 41.2 GB |
| palm | 39.3 GB | two_up_inverted | 39.2 GB |
train_val annotations: ann_train_val
| Test | Archives | Size |
|---|---|---|
| images | test | 60.4 GB |
| annotations | ann_test | 27.3 MB |
Subsample has 100 items per gesture.
| Subsample | Archives | Size |
|---|---|---|
| images | subsample | 2.5 GB |
| annotations | ann_subsample | 1.2 MB |
Alternatively, download the data with the Python script:
python download.py --save_path <PATH_TO_SAVE> \
--train \
--test \
--subset \
--annotations \
--dataset

Run the command with the --subset key to download only the small subset (100 images per class). Use --train to download the trainval set or --test to download the test set. Add --annotations to download the annotations for the selected stage and --dataset to download the images.
usage: download.py [-h] [--train] [--test] [--subset] [-a] [-d] [-t TARGETS [TARGETS ...]] [-p SAVE_PATH]
Download dataset...
optional arguments:
-h, --help show this help message and exit
--train Download trainval set
--test Download test set
--subset Download subset with 100 items of each gesture
-a, --annotations Download annotations
-d, --dataset Download dataset
-t TARGETS [TARGETS ...], --targets TARGETS [TARGETS ...]
Target(s) for downloading train set
-p SAVE_PATH, --save_path SAVE_PATH
Save path

We provide several pre-trained baseline models with classic backbone architectures and two output heads: one for gesture classification and one for leading-hand classification.
| Classifiers | F1 Gestures | F1 Leading hand |
|---|---|---|
| ResNet18 | 98.80 | 98.80 |
| ResNet152 | 99.04 | 98.92 |
| ResNeXt50 | 98.95 | 98.87 |
| ResNeXt101 | 99.16 | 98.71 |
| MobileNetV3_small | 96.50 | 97.31 |
| MobileNetV3_large | 98.03 | 97.99 |
| Vitb32 | 98.35 | 98.63 |
| Lenet | 84.58 | 91.16 |
We also provide several models for hand detection.
| Detector | mAP |
|---|---|
| SSDLiteMobileNetV3Large | 71.49 |
| SSDLiteMobileNetV3Small | 53.38 |
| FRCNNMobilenetV3LargeFPN | 78.05 |
| YoloV7Tiny | 81.1 |
However, if you only need a single gesture per frame, you can use the pre-trained full-frame classifiers instead of detectors.
To use the full-frame models, set the configuration parameter full_frame: True and remove the no_gesture class.
| Full Frame Classifiers | F1 Gestures |
|---|---|
| ResNet18 | 93.51 |
| ResNet152 | 94.49 |
| ResNeXt50 | 95.20 |
| ResNeXt101 | 95.67 |
| MobileNetV3_small | 87.09 |
| MobileNetV3_large | 90.96 |
You can use the downloaded pre-trained models; otherwise, select a classifier and training parameters in default.yaml.
To train the model, execute the following command:
python -m classifier.run --command 'train' --path_to_config <PATH>
python -m detector.run --command 'train' --path_to_config <PATH>

At every step, the current loss, learning rate, and other values are logged to TensorBoard.
To view all saved metrics and parameters, run the following command (it serves a web page at localhost:6006):
tensorboard --logdir=experiments

Test your model by running the following command:
python -m classifier.run --command 'test' --path_to_config <PATH>
python -m detector.run --command 'test' --path_to_config <PATH>

python demo.py -p <PATH_TO_CONFIG> --landmarks
python demo_ff.py -p <PATH_TO_CONFIG> --landmarks

The annotations consist of hand bounding boxes in COCO format [top left X position, top left Y position, width, height] with gesture labels. They also include 21 landmarks per hand in [x, y] format, given in relative image coordinates, the leading-hand markup (left or right, for the hand showing the gesture), and leading_conf, the confidence of the leading_hand annotation. We provide a user_id field that allows you to split the train / val dataset yourself.
"0534147c-4548-4ab4-9a8c-f297b43e8ffb": {
  "bboxes": [
    [0.38038597, 0.74085361, 0.08349486, 0.09142549],
    [0.67322755, 0.37933984, 0.06350809, 0.09187757]
  ],
  "landmarks": [
    [
      [0.39917091, 0.74502739],
      [0.42500172, 0.74984396],
      ...
    ],
    [
      [0.70590734, 0.46012364],
      [0.69208878, 0.45407018],
      ...
    ]
  ],
  "labels": [
    "no_gesture",
    "one"
  ],
  "leading_hand": "left",
  "leading_conf": 1.0,
  "user_id": "bb138d5db200f29385f..."
}

- Key - image name without extension
- Bboxes - list of normalized bboxes [top left X pos, top left Y pos, width, height]
- Labels - list of class labels, e.g. like, stop, no_gesture
- Landmarks - list of normalized hand landmarks [x, y]
- Leading hand - right or left, for the hand showing the gesture
- Leading conf - leading confidence for leading_hand
- User ID - subject id (useful for splitting the data into train / val subsets).
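Because train and test were separated by subject, a custom train / val split would normally keep each user_id on one side only. The following is a minimal sketch of that idea, assuming the annotations are loaded as a dictionary in the layout shown above; the function name `split_by_user`, the `val_fraction` parameter, and the toy data are illustrative and not part of the repository.

```python
import random
from collections import defaultdict


def split_by_user(annotations, val_fraction=0.1, seed=0):
    """Split image keys into train/val lists so that no subject is in both."""
    # Group image keys by the subject who appears in them.
    images_by_user = defaultdict(list)
    for image_key, ann in annotations.items():
        images_by_user[ann["user_id"]].append(image_key)

    # Shuffle subjects reproducibly and hold out a fraction of them.
    # Note: the fraction applies to subjects, not to images.
    users = sorted(images_by_user)
    random.Random(seed).shuffle(users)
    n_val = max(1, round(len(users) * val_fraction))
    val_users, train_users = users[:n_val], users[n_val:]

    train_keys = [k for u in train_users for k in images_by_user[u]]
    val_keys = [k for u in val_users for k in images_by_user[u]]
    return train_keys, val_keys


# Toy annotations (keys and user ids are made up for illustration).
toy_annotations = {
    "img_1": {"user_id": "user_a", "labels": ["like"]},
    "img_2": {"user_id": "user_a", "labels": ["stop"]},
    "img_3": {"user_id": "user_b", "labels": ["one"]},
    "img_4": {"user_id": "user_c", "labels": ["palm"]},
    "img_5": {"user_id": "user_d", "labels": ["fist"]},
}
train_keys, val_keys = split_by_user(toy_annotations, val_fraction=0.25)
```

Splitting on subjects rather than on images keeps the same person from appearing in both subsets, which would otherwise inflate validation scores.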
| Object | Train + Val | Test | Total |
|---|---|---|---|
| gesture | ~ 28 300 | ~ 2 400 | 30 629 |
| no gesture | 112 740 | 10 849 | 123 589 |
| total boxes | 622 063 | 54 518 | 676 581 |
We annotate 21 hand keypoints using the open-source MediaPipe framework. Because the markup is automatic, some entries in landmarks may be empty lists.
| Object | Train + Val | Test | Total |
|---|---|---|---|
| leading hand | 503 872 | 43 167 | 547 039 |
| not leading hand | 98 766 | 9 243 | 108 009 |
| total landmarks | 602 638 | 52 410 | 655 048 |
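Since automatically generated landmarks can be empty, code that consumes them may want to handle that case explicitly. Here is a minimal sketch, assuming the layout from the annotation example (one landmark list per hand, each point a normalized [x, y] pair, and an empty list when no keypoints were found); the function name `landmarks_to_pixels` and the sample values are illustrative only.

```python
def landmarks_to_pixels(annotation, width, height):
    """Return one list of pixel-space [x, y] points per hand.

    Hands whose landmark list is empty are kept as empty lists so that the
    result stays aligned with the input order.
    """
    return [
        [[x * width, y * height] for x, y in hand_points]
        for hand_points in annotation["landmarks"]
    ]


# Made-up annotation: the first hand has two points, the second has none.
example = {
    "landmarks": [
        [[0.5, 0.25], [0.25, 0.5]],
        [],
    ],
}
pixels = landmarks_to_pixels(example, width=1920, height=1080)

# Keep only the hands that actually have keypoints.
hands_with_landmarks = [points for points in pixels if points]
```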
Yolo
We provide a script to convert annotations to YOLO format. To convert annotations, run the following command:
python -m converters.hagrid_to_yolo --path_to_config <PATH>

After conversion, change the original definition of img2label_paths to:
def img2label_paths(img_paths):
    img_paths = list(img_paths)
    # Define label paths as a function of image paths
    if "subsample" in img_paths[0]:
        return [x.replace("subsample", "subsample_labels").replace(".jpg", ".txt") for x in img_paths]
    elif "train_val" in img_paths[0]:
        return [x.replace("train_val", "train_val_labels").replace(".jpg", ".txt") for x in img_paths]
    elif "test" in img_paths[0]:
        return [x.replace("test", "test_labels").replace(".jpg", ".txt") for x in img_paths]

Coco
We also provide a script to convert annotations to COCO format. To convert annotations, run the following command:
python -m converters.hagrid_to_coco --path_to_config <PATH>
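For orientation, the two target formats use different box conventions: YOLO labels store a normalized box center with width and height, while COCO annotations store the top-left corner with width and height in absolute pixels. The sketch below illustrates only this coordinate change for a single HaGRID box; it is not the repository's converter code, and the function names and sample values are hypothetical.

```python
def hagrid_box_to_yolo(box):
    """Normalized [x_top_left, y_top_left, w, h] -> normalized [x_center, y_center, w, h]."""
    x, y, w, h = box
    return [x + w / 2, y + h / 2, w, h]


def hagrid_box_to_coco(box, width, height):
    """Normalized [x_top_left, y_top_left, w, h] -> absolute-pixel [x_top_left, y_top_left, w, h]."""
    x, y, w, h = box
    return [x * width, y * height, w * width, h * height]


# Made-up box on a Full HD image.
box = [0.25, 0.5, 0.5, 0.25]
yolo_box = hagrid_box_to_yolo(box)
coco_box = hagrid_box_to_coco(box, width=1920, height=1080)
```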
This work is licensed under a variant of Creative Commons Attribution-ShareAlike 4.0 International License.
Please see the specific license.
You can cite the paper using the following BibTeX entry:
@article{hagrid,
title={HaGRID - HAnd Gesture Recognition Image Dataset},
author={Kapitanov, Alexander and Makhlyarchuk, Andrey and Kvanchiani, Karina},
journal={arXiv preprint arXiv:2206.08219},
year={2022}
}




