Date Approved

8-2-2024

Graduate Degree Type

Thesis

Degree Name

Engineering (M.S.E.)

Degree Program

School of Engineering

First Advisor

Dr. Chirag Parikh

Second Advisor

Dr. Denton Bobeldyk

Third Advisor

Dr. Samhita Rhodes

Academic Year

2023/2024

Abstract

Spoken Keyword Spotting (KWS) has steadily remained one of the most studied and widely deployed technologies in human-facing artificially intelligent systems, enabling them to detect specific keywords in utterances. Modern machine learning models, such as variants of deep neural networks, have significantly improved the performance and accuracy of these systems over earlier, more rudimentary techniques. However, they often demand substantial computational resources, use large parameter spaces, and introduce latencies that limit their real-time applicability and offline use. These speed and memory requirements pose a significant obstacle in a field where faster and more efficient KWS methods dominate and better meet industry demands.

To address these challenges, this thesis presents an improved method of accomplishing the KWS task using a lightweight and efficient 1-D Convolutional Neural Network (CNN) operating on 2-D feature maps of Mel-Frequency Cepstral Coefficients (MFCCs). The model was trained on the Google Speech Commands V2 dataset, and model compression techniques such as quantization and pruning were applied to facilitate deployment onto hardware. Inference latency was further minimized through hardware acceleration by deploying the KWS model onto a Field Programmable Gate Array (FPGA) with an open-source toolset called hls4ml. The resulting model was evaluated and compared to state-of-the-art models in the literature, along with comparisons of its inference latency on different computing platforms. Finally, an application was developed to demonstrate the model running entirely on the FPGA, classifying live speech in real time.
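The core idea of a 1-D CNN over MFCC feature maps is that the MFCC coefficients are treated as input channels and the convolution slides only along the time axis. The sketch below illustrates this with plain NumPy; the shapes (40 coefficients, 98 frames, 16 filters, kernel width 3) are illustrative assumptions, not the thesis's actual architecture.

```python
import numpy as np

# Hypothetical MFCC feature map: 40 coefficients x 98 time frames.
# For a 1-D CNN, the 40 coefficients act as input channels.
rng = np.random.default_rng(0)
mfcc = rng.standard_normal((40, 98))          # (channels, time)

def conv1d_relu(x, w, b):
    """Valid 1-D convolution along the time axis, followed by ReLU.
    x: (in_ch, T), w: (out_ch, in_ch, k), b: (out_ch,)"""
    out_ch, in_ch, k = w.shape
    T = x.shape[1] - k + 1
    y = np.empty((out_ch, T))
    for o in range(out_ch):
        for t in range(T):
            # Each output sample mixes all channels over a k-frame window.
            y[o, t] = np.sum(w[o] * x[:, t:t + k]) + b[o]
    return np.maximum(y, 0.0)                 # ReLU activation

w = rng.standard_normal((16, 40, 3)) * 0.1    # 16 filters, kernel width 3
b = np.zeros(16)
feat = conv1d_relu(mfcc, w, b)
print(feat.shape)  # (16, 96)
```

Because the kernel spans every coefficient at once, each filter needs only in_ch x k weights, which is one reason 1-D CNNs keep the parameter count small compared with 2-D convolutions over the same feature map.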

The developed KWS model achieved near state-of-the-art performance with far fewer parameters and a simpler architecture than comparable models in the reviewed literature. A top-one classification accuracy of 91.48% was achieved with a 30.36KB baseline model using 32-bit parameters. The baseline model was optimized and compressed to almost 50% sparsity using 12-bit weights and activations. This compressed configuration exhibited negligible performance degradation, maintaining a top-one accuracy of 90.16% while occupying just 11.38KB of memory. These results demonstrate that 1-D CNNs can accurately perform the KWS task with small parameter spaces and simple architectures. By deploying the optimized model onto FPGA hardware and running batches of samples through it, average inference latencies of less than 373µs per inference were achieved, indicating the usefulness of FPGAs in accelerating KWS models.
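The two compression steps reported above, magnitude pruning to roughly 50% sparsity and fixed-point quantization to 12 bits, can be sketched as follows. This is a minimal NumPy illustration of the general techniques, not the thesis's actual training-time pipeline, and the symmetric per-tensor quantization scheme is an assumption.

```python
import numpy as np

rng = np.random.default_rng(1)
weights = rng.standard_normal(1000)           # stand-in for a trained layer

def magnitude_prune(w, sparsity=0.5):
    """Zero out the smallest-magnitude weights until the target sparsity."""
    k = int(sparsity * w.size)
    threshold = np.sort(np.abs(w))[k]         # k-th smallest magnitude
    return np.where(np.abs(w) < threshold, 0.0, w)

def quantize(w, bits=12):
    """Uniform symmetric quantization: map weights onto a 12-bit integer
    grid, then scale back to floating point for simulation."""
    scale = np.max(np.abs(w)) / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

pruned = magnitude_prune(weights, sparsity=0.5)
compressed = quantize(pruned, bits=12)
print(float(np.mean(compressed == 0.0)))      # ~0.5 sparsity survives
```

Pruned zeros are preserved exactly by the quantizer (round(0) is 0), which is why the sparsity achieved during pruning carries through to the deployed fixed-point model.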
