

Introduction

Computer vision is a field of artificial intelligence that enables the derivation of information from visual inputs such as images and videos; in effect, it enables computers to see, observe, and understand. In this work, two related projects were developed side by side: a pose recognition system that allows a player to control Tetris with their body, and a computer vision remote control system for a smart couch.


Software used

Python was employed as the primary programming language. The OpenCV and TensorFlow APIs were used for real-time image analysis. MediaPipe's convolutional neural networks were used for hand and body landmark detection, and a Keras feed-forward neural network was employed to classify the detected landmarks. The MQTT protocol was used to transfer data between the neural network and the ESP32.


Hardware used

On the hardware side, the neural network ran on a laptop connected to a camera that supplied the real-time video feed. An ESP32 microcontroller received the classification results wirelessly over MQTT and translated them into hardware outputs for the smart couch system by digitally writing to its pins.


Artificial Neural Networks (ANNs)

An artificial neural network is a computing system composed of collections of units called nodes (neurons), designed to work like the human brain. Nodes are organized into layers that work together; every ANN has three main types of layer: an input layer, hidden layers, and an output layer. Each connection between nodes has an associated weight representing the strength of that connection, and the input arriving along a connection is multiplied by the connection's weight. The weighted sum of a node's incoming inputs is then passed through an activation function that transforms it; the transformation depends on the type of activation function used. This process is repeated layer by layer until the output layer is reached, where a classification is made. Training a neural network is about finding the optimal weights for the connections between these nodes.
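
As a minimal sketch of the computation a single node performs (bias terms are omitted for brevity, and the numbers are purely illustrative):

    import numpy as np

    def node_output(inputs, weights):
        """One node: weighted sum of incoming inputs, passed through ReLU."""
        weighted_sum = np.dot(inputs, weights)  # each input times its connection weight
        return max(0.0, weighted_sum)           # ReLU zeroes out negative sums

    # Two inputs, two connection weights: 0.2*0.8 + 0.5*(-0.3) ≈ 0.01
    print(node_output(np.array([0.2, 0.5]), np.array([0.8, -0.3])))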


MediaPipe's Convolutional Neural Network (CNN) API

A CNN is a type of ANN that specializes in picking out patterns and is especially used for image analysis. What distinguishes a CNN from a generic ANN is its use of hidden layers called convolutional layers, which transform the received input and pass the output to the next layer. In each convolutional layer, specific filters are specified to detect patterns within the input values, such as edges, shapes, colors, and textures; this is done by sliding (convolving) small filter kernels across the input. There are, for example, layers that can detect and filter out the edges in an image. The deeper the network, the more sophisticated this detection becomes, until the network can detect complex features such as the landmarks of hands and poses. CNNs are, however, very complex and difficult to train, so in this work MediaPipe's pre-built detection APIs were used.
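
As an illustration, here is a minimal sketch of pulling hand landmarks from a webcam feed with MediaPipe's Python solutions API (the options shown, such as max_num_hands and the confidence threshold, are assumptions rather than this project's actual settings):

    import cv2
    import mediapipe as mp

    mp_hands = mp.solutions.hands

    cap = cv2.VideoCapture(0)  # default webcam
    with mp_hands.Hands(max_num_hands=1, min_detection_confidence=0.7) as hands:
        while cap.isOpened():
            ok, frame = cap.read()
            if not ok:
                break
            # MediaPipe expects RGB; OpenCV delivers BGR
            results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if results.multi_hand_landmarks:
                # 21 landmarks per hand, each with x and y normalized to the frame
                points = [(lm.x, lm.y) for lm in results.multi_hand_landmarks[0].landmark]
                print(points[0])  # landmark 0 is the wrist
    cap.release()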


Processing + Feed-Forward Neural Network

The landmark data obtained from MediaPipe's CNN is fed into a much simpler feed-forward neural network for classification. However, this data is not fed in as raw pixel coordinates: a model trained that way would recognize the same hand placed at different positions relative to the camera as completely separate hands. The coordinate values are therefore made relative to a constant reference point on the pose/hand (for hand gesture classification, the wrist, landmark 0; for pose classification, the midpoint between the two hip landmarks) and then normalized to floating-point values between 0 and 1, so that the distance from the camera does not matter either. From there, the data was fed into the feed-forward network with a corresponding number of inputs (42 for the hand, 99 for the pose), built from dense and dropout hidden layers with ReLU activation functions.
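
The following is a sketch of one common normalization scheme (scaling by the largest coordinate magnitude is an assumption here; the exact scaling used in this work may differ):

    import numpy as np

    def normalize_landmarks(landmarks):
        """Translate landmarks so the reference point is the origin, then rescale.

        landmarks: (N, 2) array of coordinates whose index 0 is the reference
        point (the wrist for hands; for poses the hip midpoint would be used).
        Returns a flat vector, e.g. 21 hand points x 2 coordinates -> 42 inputs.
        """
        pts = np.asarray(landmarks, dtype=np.float32)
        pts = pts - pts[0]                  # position relative to the reference point
        max_abs = float(np.abs(pts).max())
        if max_abs > 0:
            pts = pts / max_abs             # scale away the distance from the camera
        return pts.flatten()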


Training Process

Training a model involves optimizing the weights of the connections between nodes within the model, to find the weights that most accurately map each input to the correct output.

In this work, the feed-forward neural network that classifies the processed landmark data was trained on thousands of data inputs: the pose detection model on ~4900 data points of 99 normalized landmark coordinates each, and the hand classification model on ~8800 data points of 42 landmark coordinates each.


To train the model, data was first collected into a CSV file. This dataset was divided into a train-test split, where 75% of the data is used for training while the other 25% is used for testing. Dividing the dataset this way ensures that the model is not overfitting to only the data in the training set and is able to generalize, making predictions on data it was not trained on. The training samples are then fed forward through the model one by one; the model's predictions are compared against the correct labels, and the weights are adjusted to reduce the error.
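
A minimal sketch of this pipeline with scikit-learn and Keras (the CSV filename, layer sizes, class count, and training hyperparameters are illustrative assumptions, not this project's actual values):

    import numpy as np
    import tensorflow as tf
    from sklearn.model_selection import train_test_split

    NUM_CLASSES = 5  # assumed number of hand gestures

    # Assumed CSV layout: column 0 = class label, columns 1-42 = normalized landmarks
    data = np.loadtxt("hand_gestures.csv", delimiter=",")
    X, y = data[:, 1:], data[:, 0].astype(int)

    # 75% of the data trains the model; the held-out 25% tests generalization
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=0.75, random_state=42)

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation="relu", input_shape=(42,)),
        tf.keras.layers.Dropout(0.4),
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(X_train, y_train, epochs=100, batch_size=64,
              validation_data=(X_test, y_test))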


Application + MQTT

For the remote control application, the neural network, connected to a camera, runs on a laptop and communicates wirelessly over the internet via the MQTT (Message Queuing Telemetry Transport) messaging protocol. Depending on the hand gesture detected, the code publishes different payloads to the ESP32, which translates those payloads into hardware outputs by digitally writing to its pins.


MQTT is a publish/subscribe messaging architecture for devices with limited bandwidth, like the ESP32. It is a lightweight protocol that runs over TCP/IP. In MQTT, clients publish or subscribe to topics on an MQTT broker, and in a publish/subscribe architecture multiple clients can subscribe and publish to the same topic. In this work, the computer running the neural network program publishes to an MQTT topic; the ESP32, subscribed to that topic, detects the various payloads (the content of each publish) and performs the corresponding actions.
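
On the laptop side, the publishing step could look like this sketch using the paho-mqtt client (the broker address, topic name, and gesture-to-payload mapping are hypothetical):

    import paho.mqtt.client as mqtt

    BROKER = "test.mosquitto.org"   # hypothetical public broker
    TOPIC = "smartcouch/commands"   # hypothetical topic the ESP32 subscribes to

    # Hypothetical mapping from a classified gesture to the payload it publishes
    PAYLOADS = {"thumbs_up": "recline", "fist": "stop", "open_palm": "sit_up"}

    # paho-mqtt 1.x style; 2.x additionally requires a CallbackAPIVersion argument
    client = mqtt.Client()
    client.connect(BROKER, 1883)
    client.loop_start()  # handle network traffic on a background thread

    def on_gesture(label: str) -> None:
        """Publish the payload for a recognized gesture for the ESP32 to act on."""
        if label in PAYLOADS:
            client.publish(TOPIC, payload=PAYLOADS[label], qos=1)

    on_gesture("thumbs_up")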


Conclusion, Application and Future Outlook

Computer vision and machine learning are a rapidly growing field of study. The potential of machine learning remains unquantifiable and could completely revolutionize modern-day society over the next few decades, or even years; with the rise of more advanced technologies such as deepfakes and ChatGPT, the future is very uncertain. This project was a step forward, a massive learning experience, and a wonderful opportunity to explore machine learning technology alongside computer vision. The computer vision techniques demonstrated in this project likewise have endless possibilities for application in fields such as augmented and virtual reality, or more specifically in projects like sports coaching and posture-correction apps; sign language translation is another application that is surely to come. The smart couch system in particular could be applied in hospitals, where patients are unable to move from their beds, or in remote human-machine interfaces. The current system could be improved by taking motion history into account to create dynamic gesture models, rather than the static ones deployed in this work.

In this work, Jin Wan set out to implement artificial intelligence.

Hand Gesture Remote Control Development

2022
