Hand RPS Recognition – AAI / Computer Vision Project
Deep Learning for Detecting Hand Gestures in Rock, Paper, Scissors

Introduction
Hand gesture recognition is a rapidly evolving field in computer vision and human–computer interaction. One of the classic examples is the “Rock–Paper–Scissors” (RPS) game: simple gestures, clear classes, but still rich enough for exploring data collection, augmentation, model training, and real-time inference.
In this project, I present a practical implementation of hand RPS recognition, available on GitHub at https://github.com/Riel0303ru/hand-rps-recognition. This article walks through the motivation, dataset preparation, model architecture, training process, and deployment considerations.
Motivation
Why build a hand RPS recognition system?
It’s a manageable multi-class classification problem (rock vs paper vs scissors) that still involves real-world challenges: varying lighting, hand shapes, camera angle, background clutter.
It makes a good playground for exploring convolutional neural networks (CNNs), transfer learning, and data augmentation.
It has real practical use cases: gesture-based controls, human–robot interaction, game interfaces, accessibility tools for users with motor impairments.
Project Overview
Dataset & Preprocessing
The repository contains a dataset of hand gestures for rock, paper, and scissors categories (see link above).
Pre-processing steps typically include:
Resizing images to a fixed size (e.g., 128×128 or 224×224).
Normalizing pixel values (scaling to [0, 1] or subtracting the dataset mean).
Data augmentation: flips, rotations, brightness/contrast changes, cropping, and random background noise, all to improve generalization.
Splitting into training, validation (and optionally test) sets.
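The steps above can be sketched with NumPy alone. This is a minimal illustration and not the repository's actual pipeline: the function names, the nearest-neighbour resize, the brightness range, and the 70/15/15 split are all assumptions chosen for the example.

```python
import numpy as np

def resize_nearest(img, size):
    """Nearest-neighbour resize of an HxWxC image to size x size."""
    h, w = img.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return img[rows][:, cols]

def normalize(img):
    """Scale uint8 pixel values to floats in [0, 1]."""
    return img.astype(np.float32) / 255.0

def augment(img, rng):
    """Random horizontal flip plus a brightness shift, clipped to [0, 1]."""
    if rng.random() < 0.5:
        img = img[:, ::-1]
    shift = rng.uniform(-0.2, 0.2)  # assumed brightness range
    return np.clip(img + shift, 0.0, 1.0)

def split_indices(n, rng, train=0.7, val=0.15):
    """Shuffle indices 0..n-1 and cut them into train/val/test parts."""
    idx = rng.permutation(n)
    n_train = int(n * train)
    n_val = int(n * val)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

rng = np.random.default_rng(0)
# Random pixels stand in for one captured hand image.
raw = rng.integers(0, 256, size=(300, 400, 3), dtype=np.uint8)
x = augment(normalize(resize_nearest(raw, 128)), rng)
train_idx, val_idx, test_idx = split_indices(100, rng)
```

Augmentation is applied only to training images in practice; validation and test images are just resized and normalized so that the reported metrics reflect unmodified inputs.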
Model Architecture
A typical architecture for this kind of task might be:
Input: 224×224×3 image (RGB)
Convolution Block 1: conv(32 filters, 3×3) → ReLU → BatchNorm → MaxPool
Convolution Block 2: conv(64 filters, 3×3) → ReLU → BatchNorm → MaxPool
Convolution Block 3: conv(128 filters, 3×3) → ReLU → BatchNorm → MaxPool
Flatten → Dense(256) → ReLU → Dropout(0.5)
Output Layer: Dense(3) → Softmax (classes: rock, paper, scissors)
Alternatively, use transfer learning with a pretrained backbone (e.g., MobileNetV2, EfficientNetB0) and fine-tune it on your hand gesture dataset; this often reaches higher accuracy with less data than training from scratch.
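For the custom CNN listed above (not the transfer-learning variant), it helps to work out the feature-map sizes and parameter counts by hand. The sketch below does that arithmetic under assumptions the list does not state: "same" padding for every convolution, 2×2 max-pooling with stride 2, bias terms on every layer, and two trainable parameters per channel for BatchNorm.

```python
def conv_params(kernel, c_in, c_out):
    """Weights plus biases of a kernel x kernel convolution."""
    return kernel * kernel * c_in * c_out + c_out

def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

size, channels = 224, 3      # input: 224x224 RGB
total = 0
for filters in (32, 64, 128):
    total += conv_params(3, channels, filters)
    total += 2 * filters     # BatchNorm scale and shift (trainable)
    size //= 2               # assumed 2x2 max-pool halves height and width
    channels = filters

flat = size * size * channels          # length of the flattened feature vector
dense1 = dense_params(flat, 256)
total += dense1 + dense_params(256, 3)

print(f"feature map before Flatten: {size}x{size}x{channels}")
print(f"flattened length: {flat}")
print(f"Dense(256) parameters: {dense1}")
print(f"total trainable parameters: {total}")
```

Under these assumptions almost all of the weights sit in the Dense(256) layer that follows Flatten, which is worth keeping in mind when comparing this network with a compact pretrained backbone.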
Learning Algorithm & Training
The core learning algorithm is supervised classification with cross-entropy loss. Here’s the general training loop:
Forward pass: input image → model → class probabilities.
Compute loss:
$$\text{Loss} = -\sum_{c=1}^{3} y_c \log(p_c)$$
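Here $y_c$ is the ground-truth one-hot label and $p_c$ is the predicted probability for class $c$.
Back-propagation: compute the gradient of the loss with respect to every weight.
Weight update: apply the gradients with an optimizer (Adam, or SGD with momentum).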
Monitor validation accuracy and loss; apply early stopping or learning-rate scheduling.
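As a worked example of the loss in the loop above, the following sketch computes softmax probabilities and the cross-entropy for a single image using only the standard library. The logit values are made up for illustration, and the small `eps` term is an assumption added to avoid taking the log of zero.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(y, p, eps=1e-12):
    """Loss = -sum_c y_c * log(p_c) for a one-hot label y."""
    return -sum(yc * math.log(pc + eps) for yc, pc in zip(y, p))

logits = [2.0, 0.5, -1.0]   # hypothetical scores for rock, paper, scissors
y = [1, 0, 0]               # one-hot label: rock
p = softmax(logits)
loss = cross_entropy(y, p)

# For softmax followed by cross-entropy, the gradient of the loss
# with respect to the logits is simply p - y.
grad = [pc - yc for pc, yc in zip(p, y)]

print("probabilities:", [round(v, 3) for v in p])
print("loss:", round(loss, 4))
```

Because the label is one-hot, only the term for the true class survives in the sum, so the loss reduces to the negative log of the probability assigned to the correct gesture.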
Key hyper-parameters to tune: initial learning rate (e.g., 1e-3), batch size (e.g., 32), number of epochs (e.g., 50–100), image size, augmentation mix.
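Early stopping and learning-rate scheduling can be sketched independently of any framework. The function below is a simplified illustration: the patience values, the halving factor, and the list of validation losses are assumptions, and a real run would obtain each loss by evaluating the model after an epoch.

```python
def run_schedule(val_losses, lr=1e-3, lr_patience=2, stop_patience=4, factor=0.5):
    """Walk through per-epoch validation losses.

    Multiplies the learning rate by `factor` every `lr_patience` epochs
    without improvement and stops after `stop_patience` such epochs.
    Returns (best_epoch, last_epoch_run, final_lr).
    """
    best_loss = float("inf")
    best_epoch = -1
    epoch = -1
    stale = 0  # consecutive epochs without a new best loss
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, stale = loss, epoch, 0
        else:
            stale += 1
            if stale % lr_patience == 0:
                lr *= factor
            if stale >= stop_patience:
                break
    return best_epoch, epoch, lr

# Hypothetical validation losses, one per epoch.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.54, 0.60, 0.40]
best_epoch, stopped_epoch, final_lr = run_schedule(val_losses)
print(best_epoch, stopped_epoch, final_lr)
```

In a real training script the weights from `best_epoch` would be saved as a checkpoint and restored once the loop stops.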
Deployment & Real-Time Inference
Once the model is trained and validated, deploy it for real-time use:
Capture camera frames via OpenCV, optionally locating the hand with the MediaPipe hand tracker.
Pre-process each frame: crop hand region, resize to model input, normalize.
Pass through model → get predicted class.
Overlay result on video feed (e.g., display “Rock”, “Paper”, or “Scissors”).
Optionally integrate into UI/UX, game interface, or robotics control.
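The per-frame part of this pipeline (crop, resize, normalize, predict) might look like the sketch below. It uses NumPy only, so camera capture and on-screen overlay are left out; the bounding box, the class index order, and `dummy_model` are placeholders for illustration and do not come from the repository.

```python
import numpy as np

CLASSES = ("Rock", "Paper", "Scissors")  # assumed index order of the model output

def preprocess(frame, box, size=224):
    """Crop the hand box (x, y, w, h), resize by nearest neighbour, scale to [0, 1]."""
    x, y, w, h = box
    hand = frame[y:y + h, x:x + w]
    rows = np.arange(size) * hand.shape[0] // size
    cols = np.arange(size) * hand.shape[1] // size
    return hand[rows][:, cols].astype(np.float32) / 255.0

def classify(frame, box, model):
    """Return (label, confidence); `model` maps a batch to class probabilities."""
    batch = preprocess(frame, box)[np.newaxis]  # add the batch dimension
    probs = model(batch)[0]
    k = int(np.argmax(probs))
    return CLASSES[k], float(probs[k])

def dummy_model(batch):
    """Stand-in for the trained network: always favours the second class."""
    probs = np.array([0.1, 0.7, 0.2], dtype=np.float32)
    return np.tile(probs, (len(batch), 1))

# A black image stands in for one captured camera frame.
frame = np.zeros((480, 640, 3), dtype=np.uint8)
label, conf = classify(frame, box=(200, 120, 240, 240), model=dummy_model)
print(label, round(conf, 2))
```

In a live setup the frame would come from the camera and the box from a hand detector, and the returned label would be drawn onto the frame before it is displayed.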
Why the Algorithms Matter
Data augmentation helps avoid over-fitting by simulating variations in hand appearance and environment.
Transfer learning leverages large pretrained models to extract robust features (edges, textures, shapes) and fine-tunes them for your specific gesture classes.
Softmax classification inherently gives you class probabilities, which allows thresholding or ensemble techniques for higher reliability.
Dropout and BatchNorm improve generalization and stability of training.
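One way to use the softmax probabilities for reliability is to ignore low-confidence frames and take a majority vote over the most recent predictions. The class below is a hypothetical helper, not part of the project; the window size and the 0.6 threshold are arbitrary choices for the example.

```python
from collections import Counter, deque

class GestureSmoother:
    """Majority vote over recent frames, ignoring low-confidence predictions."""

    def __init__(self, window=5, threshold=0.6):
        self.threshold = threshold
        self.recent = deque(maxlen=window)

    def update(self, label, confidence):
        """Record one prediction; return the majority label, or None if unsure."""
        self.recent.append(label if confidence >= self.threshold else None)
        votes = Counter(v for v in self.recent if v is not None)
        if not votes:
            return None
        winner, count = votes.most_common(1)[0]
        # Require a strict majority of the frames currently in the window.
        return winner if count > len(self.recent) / 2 else None

smoother = GestureSmoother(window=3)
outputs = [
    smoother.update("rock", 0.90),
    smoother.update("paper", 0.40),   # below threshold, counted as "unsure"
    smoother.update("rock", 0.80),
    smoother.update("paper", 0.95),
]
print(outputs)
```

A larger window gives steadier output at the cost of reacting more slowly when the player changes gesture.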
Results & Metrics


You’ll want to report metrics like:
Final validation accuracy (%).
Precision, recall, and F1-score for each class (rock, paper, scissors).
Real-time inference latency (ms per frame).
Failure cases: e.g., ambiguous hand shape, blur, occlusion.
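The per-class metrics and the latency figure listed above can be computed with a few lines of standard-library Python. The labels below are toy data invented for the example, not results from this project, and `mean_latency_ms` is a hypothetical helper that times any callable.

```python
import time

def per_class_metrics(y_true, y_pred, classes):
    """Return {class: (precision, recall, f1)} from two label sequences."""
    metrics = {}
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[c] = (precision, recall, f1)
    return metrics

def mean_latency_ms(fn, runs=20):
    """Average wall-clock time of fn() in milliseconds."""
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) * 1000.0 / runs

# Toy labels for illustration only.
y_true = ["rock", "rock", "paper", "paper", "scissors", "scissors"]
y_pred = ["rock", "paper", "paper", "paper", "scissors", "rock"]
report = per_class_metrics(y_true, y_pred, ("rock", "paper", "scissors"))
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

for name, (prec, rec, f1) in report.items():
    print(f"{name:9s} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
print(f"accuracy={accuracy:.2f}")
```

To measure inference latency, pass a function that runs one full pre-process and predict step to `mean_latency_ms`, ideally after a few warm-up calls.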
Lessons Learned & Challenges
Hand shape variation: different users, different skin tones, accessories (rings, watches) change appearance.
Lighting & background: strong shadows or clutter make segmentation harder.
Real-time constraints: keeping latency under 30 ms per frame for a smooth UX.
Dataset bias: often many “paper” samples but fewer “scissors”; balancing classes matters.
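When collecting more data is not an option, one common way to handle imbalance is to weight each class inversely to its frequency in the loss. The sketch below uses the heuristic weight = N / (K × n_c), where N is the number of samples, K the number of classes, and n_c the count for class c; the sample counts are hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * counts[c]) for c in counts}

# Hypothetical imbalanced label list.
labels = ["paper"] * 50 + ["rock"] * 30 + ["scissors"] * 20
weights = balanced_class_weights(labels)

for name, w in sorted(weights.items()):
    print(f"{name:9s} weight={w:.3f}")
```

Rare classes receive weights above 1 and common classes below 1, so each class contributes the same total weight to the loss. Oversampling the rare class or augmenting it more heavily are alternatives with a similar goal.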
Future Work
Expand classes: e.g., “Lizard”, “Spock” (extended RPS game) or dynamic gesture sequences.
Use MediaPipe/Hand-Landmarks to localize hand and feed only hand region to model → reduce noise.
Deploy to mobile/web using TensorFlow.js or ONNX with WebAssembly for browser-based recognition.
Real-world application: accessibility interface, virtual-reality gesture control, game-integration.
How to Use / Get Started
Clone the project:
git clone https://github.com/Riel0303ru/hand-rps-recognition
Install dependencies (see README).
Capture dataset or use provided set.
Train model or use pretrained weights.
Run inference script or GUI to test live recognition.
Modify model/parameters as needed for custom data.
Conclusion
Hand RPS recognition is a fun but meaningful project bridging computer vision, deep learning, and real-time inference UX. With the code provided and the concepts outlined above, you’re well equipped to build, experiment, and extend this system. Feel free to reuse, modify, and share — that’s the power of open-source.





