Capstone Project

Group 2022-12 Status completed
Title An American Sign Language/Speech to Text Conversation Mediator
Supervisor H. Rivaz, T. Fevens (CSSE)
Description American Sign Language (ASL) enables deaf and mute people to communicate using hand gestures and actions that represent English words and letters. As of today, only a fraction of deaf people are able to communicate through sign language. This is due to infrequent interaction with deaf people and a lack of teaching resources for the language. The project aims to raise awareness of the difficulties that deaf and/or mute people face without a reliable or widespread communication system. To achieve this, an application will be developed that facilitates the learning of ASL, both for those who need it and for those who do not. It will also translate ASL to text and speech to text during a video call, allowing for a seamless conversation between a deaf person and a hearing person. The project will translate ASL to text using an action recognition neural network, developed using Tensorflow, that determines which gesture is performed. The neural network will be trained and tested on a train-test split, produced with Sklearn, of multiple videos for each sign (each sign representing one word) so that it can categorize and recognize the signs. A holistic model is created using Mediapipe to detect the placement of the hands and extract the points of interest. Together with Mediapipe, OpenCV will be used to capture the landmarks on the hands and face of the person performing the action, for both the training videos and the live feed. These data points will be extracted from the pre-recorded sign language video dataset and will in turn be used to build and test the neural network architecture. The web application will serve as a text conversation service in which the signer's actions and the speaker's speech are translated into text and presented as a text conversation. This allows each party to understand what the other is communicating in a hands-off manner (no texting required).
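The landmark extraction step described above can be illustrated with a minimal sketch of how one video frame might be turned into a fixed-length feature vector. This is not the project's actual code. It assumes the default landmark counts of the Mediapipe holistic model (468 face points without iris refinement and 21 points per hand, each with x, y, z coordinates), and the names `flatten_part` and `frame_to_features` are hypothetical. The Mediapipe and OpenCV calls themselves are left out so that the sketch runs on plain Python lists.

```python
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]  # one landmark as (x, y, z)

# Assumed landmark counts for the Mediapipe holistic model (default settings).
FACE_POINTS = 468
HAND_POINTS = 21
FRAME_FEATURES = (FACE_POINTS + 2 * HAND_POINTS) * 3  # 1530 values per frame


def flatten_part(points: Optional[Sequence[Point]], expected: int) -> List[float]:
    """Flatten one body part to [x0, y0, z0, x1, ...].

    A part that was not detected in the frame (None) is zero-filled so that
    every frame yields a vector of the same length.
    """
    if points is None:
        return [0.0] * (expected * 3)
    if len(points) != expected:
        raise ValueError(f"expected {expected} landmarks, got {len(points)}")
    return [coord for point in points for coord in point]


def frame_to_features(
    face: Optional[Sequence[Point]],
    left_hand: Optional[Sequence[Point]],
    right_hand: Optional[Sequence[Point]],
) -> List[float]:
    """Concatenate face, left-hand and right-hand landmarks into one vector."""
    return (
        flatten_part(face, FACE_POINTS)
        + flatten_part(left_hand, HAND_POINTS)
        + flatten_part(right_hand, HAND_POINTS)
    )
```

Stacking these vectors for consecutive frames of a video would give one training sequence per signed word. Zero-filling undetected hands is only one possible design choice; it keeps the input shape constant, at the cost of the network having to learn that an all-zero hand means "not visible".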
It will also enable a learning tool feature, which will prompt a user to sign a word and score the user based on the action. The neural network will determine the word associated with the action and display it on screen. Sentences can be formed from multiple continuous actions. Both the model and the web application will be hosted and deployed on a Google Cloud Platform (GCP) Virtual Machine instance. The external Speech-to-Text service offered by Google Cloud will be used to convert speech to text. The web application will run in a web browser on a laptop, with each party requiring their own.
The deliverables consist of:
1) Data collection and extraction:
   a) Acquisition of a dataset of varied words signed in ASL, with multiple variations of each, to be processed and used for training.
   b) Development of software to process the training videos into useful data for a neural network to build on, using OpenCV and Mediapipe to extract the ASL gesture landmark points on the hands and face.
2) A neural network using PyTorch capable of quickly and accurately determining the words being signed by a user in real time from a video feed.
3) A real-time web application capable of:
   a) Facilitating a one-on-one conversation through a video and audio feed with socket programming.
   b) Processing the video feed on a cloud instance running the neural network model, which determines the signed words and responds with the textual equivalent. The response is fed back to the feed as text.
   c) Providing a learning tool for ASL that prompts a user to sign a specific word/phrase. A video feed will capture the gesture, and the application will then score the correctness of the gesture against the existing model.
4) Hosting of the software application and neural network model on Google Cloud Platform instances, enabling the use of several AI and Machine Learning APIs.
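The description does not say how continuous actions become sentences or how the learning tool computes its score. As a rough sketch under stated assumptions, suppose the model emits one (word, confidence) prediction per window of frames, plus a dictionary of per-word probabilities for a learning-tool attempt. The function names, the 0.8 threshold and the 0 to 100 scale below are all hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def build_sentence(
    predictions: Iterable[Tuple[str, float]], threshold: float = 0.8
) -> List[str]:
    """Collapse a stream of per-window predictions into a list of words.

    Predictions below the confidence threshold are ignored, and a word is
    not appended again while it is still the most recently accepted word.
    """
    words: List[str] = []
    for word, confidence in predictions:
        if confidence < threshold:
            continue  # too uncertain to show in the conversation
        if words and words[-1] == word:
            continue  # same sign still being held or re-detected
        words.append(word)
    return words


def score_attempt(probabilities: Dict[str, float], prompted_word: str) -> int:
    """Score a learning-tool attempt on an assumed 0 to 100 scale.

    The score is the probability the model assigns to the prompted word;
    a word the model does not know scores 0.
    """
    return round(100 * probabilities.get(prompted_word, 0.0))


stream = [("hello", 0.95), ("hello", 0.90), ("thanks", 0.40), ("you", 0.85), ("you", 0.99)]
print(" ".join(build_sentence(stream)))  # hello you
```

One limitation of this simple rule is that a word signed twice in a row is reported only once, even if low-confidence windows separate the two signs. A real system would likely need an explicit pause or "no sign" class to tell the two cases apart.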
Student Requirement ● Knowledge of full-stack development for the web application ● Relevant coursework (completed or currently enrolled in): COEN 366, COEN 424, COMP 472 ● Experience with neural networks (machine learning) using Python for gesture recognition
Tools ● Technologies for AI: OpenCV, Mediapipe, PyTorch, Sklearn (Python) ● Technologies for Web Application: MongoDB, Express, React, Node.js, GCP ● Dataset: 2,000 words, comprising approximately 19,000 videos of signed words
Number of Students 6
Students N. Harris, N. Kawwas, M. Sklivas, A. Turkman, A. Mirza, T. Elango
Comments:
Links: