Speech Emotion Recognition, live in your browser
Click “Start microphone”, speak for a few seconds, and watch the emotion classifier update in real time. The model runs entirely on your device using ONNX Runtime Web and Pyodide.
Read this first
This live demo uses a different, lighter model than the research system in the case study. Specifically, it swaps Inception-ResNet-v2 for RepViT-M1.0, a much smaller ViT-inspired mobile backbone that fits in a browser, and its classifier head was trained on a smaller subset of the data so the entire pipeline can be downloaded and run on a phone.
As a result, predictions here are noticeably less accurate than the production model. Treat this page as a demonstration of the approach (mic → spectrogram → vision model → fused features → classifier), not as a measure of the maximum achievable accuracy.
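For the curious, here is a minimal sketch of that pipeline in plain Python. In the live demo this logic is split between Pyodide and ONNX Runtime Web; the model filenames, tensor names, and the 3×224×224 image layout below are illustrative assumptions, not the demo's exact values.

```python
# Minimal sketch: log-mel spectrogram -> RepViT features -> MLP head
# -> 8-class softmax. Filenames and tensor names are assumptions.
import numpy as np
import onnxruntime as ort

EMOTIONS = ["Neutral", "Calm", "Happy", "Sad",
            "Angry", "Fearful", "Disgust", "Surprised"]

backbone = ort.InferenceSession("repvit_m1_0.onnx")  # hypothetical filename
head = ort.InferenceSession("mlp_head.onnx")         # hypothetical filename

def classify_clip(spectrogram: np.ndarray) -> np.ndarray:
    """Map a (224, 224) log-mel spectrogram to softmax probabilities."""
    # Tile the single-channel spectrogram to RGB and add a batch axis,
    # matching an assumed 1x3x224x224 input (normalization omitted).
    img = np.stack([spectrogram] * 3)[None].astype(np.float32)
    feats = backbone.run(None, {"input": img})[0]       # (1, 448) features
    logits = head.run(None, {"features": feats})[0][0]  # (8,) logits
    exp = np.exp(logits - logits.max())                 # stable softmax
    return exp / exp.sum()
```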
First load is large (~30 MB of model weights and Python runtime); subsequent loads are cached.
How the demo differs
Engineered for browsers, not for accuracy.
Research model
Inception-ResNet-v2 backbone
98,304-dimensional deep features from a large pretrained vision network. The classifier head was trained server-side on the full RAVDESS, CREMA-D, and SAVEE datasets.
Demo model
RepViT-M1.0 backbone
448-dimensional features from a much smaller ViT-inspired backbone designed for mobile. Smaller MLP head trained on a subset of the data so everything fits in a browser download.
Research runtime
Server-side, GPU
Python pipeline running on a server with NumPy / SciPy / TensorFlow. Audio is uploaded, processed, and classified remotely.
Demo runtime
Client-side, browser
Pyodide runs the librosa-compatible NumPy preprocessing in WebAssembly; ONNX Runtime Web executes the RepViT backbone and MLP head on the CPU. Mic audio never leaves your device.
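As a rough illustration of what that preprocessing looks like, here is a log-mel spectrogram in pure NumPy. The FFT size, hop length, mel count, and sample rate are placeholder values; the demo's exact parameters may differ.

```python
# Sketch of librosa-style preprocessing in pure NumPy: Hann-windowed
# STFT, triangular mel filterbank, log compression. Parameters are
# illustrative assumptions, not the demo's exact values.
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(audio, sr=16000, n_fft=1024, hop=256, n_mels=128):
    # Short-time Fourier transform via strided frames and a Hann window.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(audio) - n_fft) // hop
    frames = np.lib.stride_tricks.as_strided(
        audio,
        shape=(n_frames, n_fft),
        strides=(audio.strides[0] * hop, audio.strides[0]),
    )
    power = np.abs(np.fft.rfft(frames * window, axis=1)) ** 2  # (frames, bins)

    # Triangular mel filterbank, evenly spaced on the mel scale.
    mel_pts = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * mel_pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        if mid > lo:
            fb[i, lo:mid] = (np.arange(lo, mid) - lo) / (mid - lo)
        if hi > mid:
            fb[i, mid:hi] = (hi - np.arange(mid, hi)) / (hi - mid)

    mel = power @ fb.T            # (frames, n_mels)
    return np.log(mel + 1e-10).T  # (n_mels, frames), log scale
```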
What you should see
Live spectrogram, top emotion, confidence bars.
The 3D scrolling spectrogram at the top visualizes the live frequency content of your microphone. The big label below it is the model's current top prediction; the percentage is the softmax confidence on that class. The bars at the bottom show probabilities for all eight classes (Neutral, Calm, Happy, Sad, Angry, Fearful, Disgust, Surprised), updated every 1.5 seconds on a rolling 3-second window.
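Concretely, that cadence is a rolling buffer: keep the most recent 3 seconds of samples and re-run the classifier after every 1.5 seconds of new audio. This sketch builds on classify_clip and log_mel_spectrogram from the sketches above; the sample rate and the render UI hook are assumptions.

```python
# Rolling 3-second window, re-classified every 1.5 seconds of new audio.
import numpy as np
from collections import deque

SR = 16000                           # assumed microphone sample rate
ring = deque(maxlen=int(3.0 * SR))   # rolling 3-second window of samples
hop_samples = int(1.5 * SR)          # re-classify every 1.5 s
pending = 0                          # new samples since the last update

def on_audio_chunk(chunk: np.ndarray) -> None:
    """Feed mic samples; run the classifier once per 1.5 s of new audio."""
    global pending
    ring.extend(chunk)
    pending += len(chunk)
    if pending >= hop_samples and len(ring) == ring.maxlen:
        pending = 0
        window = np.asarray(ring, dtype=np.float32)
        probs = classify_clip(log_mel_spectrogram(window))  # sketches above
        render(probs)  # hypothetical UI hook: top label + confidence bars
```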
Below a small RMS energy threshold the demo treats the input as silence and shows Neutral. Below a 30% confidence threshold it shows “Listening…” rather than guessing. These are pragmatic choices for a noisy real-world environment, not part of the underlying model.
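Both gates amount to a few lines on top of the classifier output. In this sketch the RMS floor is an assumed value (the page doesn't state it); the 30% confidence floor and the Neutral/“Listening…” behavior come from the description above, and EMOTIONS is the class list from the first sketch.

```python
# Display gating: silence -> Neutral, low confidence -> "Listening…".
import numpy as np

SILENCE_RMS = 0.01     # assumed RMS energy floor for "silence"
MIN_CONFIDENCE = 0.30  # below 30% top-class confidence, don't guess

def display_label(window: np.ndarray, probs: np.ndarray) -> str:
    rms = float(np.sqrt(np.mean(window ** 2)))
    if rms < SILENCE_RMS:
        return "Neutral"         # silence gate
    top = int(np.argmax(probs))
    if probs[top] < MIN_CONFIDENCE:
        return "Listening…"      # low-confidence gate
    return EMOTIONS[top]
```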