Building an AI Voice Assistant on NVIDIA Jetson – Voice Activation & Speech-to-Text
As voice interfaces become more prevalent in smart homes and robotics, developers are increasingly seeking low-latency, private, and offline AI voice assistants. In this tutorial, we’ll show you how to build a fully local voice assistant on the NVIDIA Jetson platform. Unlike cloud-based services, this assistant runs entirely on edge hardware, offering real-time performance, enhanced data privacy, and no internet dependency. Here we’ll focus on how to:
- Capture microphone input
- Activate with a hotword (wake word)
- Convert speech to text using Whisper, an open-source speech recognition model
Whether you’re building smart home devices, service robots, or edge AI prototypes, this guide will help you deploy a powerful voice pipeline using Jetson Orin and open-source tools.
🛠️ Hardware You’ll Need
To follow along, you’ll need:
- NVIDIA Jetson edge AI computer (e.g., reComputer Super with Orin NX 16GB or similar)
- Microphone array (like ReSpeaker 4-Mic USB Array)
- Speakers (USB or HDMI output)


All the processing will be done entirely offline, directly on the Jetson device.
🎙️ Step 1: Capture Voice Input & Detect Wake Word
We’ll use a lightweight C++ implementation to handle microphone input and hotword detection.
🔧 Install Dependencies
Run the following commands on your Jetson:
sudo apt install nlohmann-json3-dev libcurl4-openssl-dev mpg123
git clone https://github.com/jjjadand/record-activate.git
🔧 Configure Microphone Settings
Open respeaker.cpp and adjust these parameters based on your mic and use case:
#define SAMPLE_RATE 44100
#define CHANNELS 2
#define RECORD_MS 20000
#define SILENCE_MS 4000
#define ENERGY_VOICE 2000
#define DEVICE_NAME "plughw:2,0" // Use 'arecord -l' to get this
- ENERGY_VOICE defines how sensitive the voice detection is
- Change DEVICE_NAME based on your mic’s hardware ID (e.g., card 2, device 0 in the arecord -l output maps to plughw:2,0)
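To make the thresholds concrete, here is a minimal sketch of how energy-based voice detection is commonly implemented: average the amplitude of each incoming audio frame, compare it against ENERGY_VOICE, and stop once SILENCE_MS of consecutive quiet frames have accumulated. This only illustrates the idea and is not the actual respeaker.cpp logic; frame_energy, keep_recording, and frame_ms are hypothetical names.
// Illustrative sketch of energy-based voice activity detection (not the
// actual respeaker.cpp code). A frame of 16-bit PCM samples counts as
// "voice" when its mean absolute amplitude exceeds ENERGY_VOICE; recording
// stops after SILENCE_MS of consecutive quiet frames (in the real tool,
// RECORD_MS would additionally cap the total length).
#include <cstdint>
#include <cstdlib>
#include <vector>

#define SILENCE_MS 4000    // mirrors the config above
#define ENERGY_VOICE 2000  // mirrors the config above

// Mean absolute amplitude of one frame of interleaved 16-bit samples.
static long frame_energy(const std::vector<int16_t>& frame) {
    if (frame.empty()) return 0;
    long sum = 0;
    for (int16_t s : frame) sum += std::abs(static_cast<long>(s));
    return sum / static_cast<long>(frame.size());
}

// Returns true while recording should continue. silence_ms accumulates
// across calls and is reset whenever a frame is loud enough to be voice.
static bool keep_recording(const std::vector<int16_t>& frame,
                           int frame_ms, int& silence_ms) {
    if (frame_energy(frame) > ENERGY_VOICE) {
        silence_ms = 0;            // speech detected, reset the silence timer
    } else {
        silence_ms += frame_ms;    // quiet frame, count toward the cutoff
    }
    return silence_ms < SILENCE_MS;
}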
🏗️ Build the Recorder
cd record-activate/build
cmake .. && make
This will generate two binaries: one to detect voice (record_lite) and another to send audio to Whisper (wav2text).
🧠 Step 2: Run Whisper Locally for Speech-to-Text
We’ll use whisper.cpp, a lightweight and fast C++ implementation of OpenAI’s Whisper model.
🔽 Download & Compile
git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
sh ./models/download-ggml-model.sh base.en
cmake -B build && cmake --build build -j
🎯 Optional: Quantize the Model
To reduce memory usage:
./build/bin/quantize models/ggml-base.en.bin models/ggml-base.en-q5_0.bin q5_0
This creates a smaller, faster version of the English Whisper model.
🔁 Step 3: Connect Voice Input with Whisper Transcription
🔌 Start the Whisper Server
./build/bin/whisper-server -m models/ggml-base.en-q5_0.bin -t 8
This runs a local HTTP server that accepts .wav files and returns transcribed text.
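To show what that handoff looks like in practice, below is a minimal sketch of a client along the lines of wav2text: it POSTs a .wav file to the server and prints the text field of the JSON reply. It uses the libcurl and nlohmann-json packages installed in Step 1, assumes whisper-server’s default /inference endpoint on port 8080, and is not the actual wav2text source.
// Minimal sketch: send a .wav file to a local whisper-server instance and
// print the transcribed text. Assumes the default /inference endpoint on
// port 8080; the real wav2text binary may differ.
#include <curl/curl.h>
#include <nlohmann/json.hpp>
#include <iostream>
#include <string>

// libcurl write callback: append the response body to a std::string.
static size_t collect(char* data, size_t size, size_t nmemb, void* out) {
    static_cast<std::string*>(out)->append(data, size * nmemb);
    return size * nmemb;
}

int main(int argc, char** argv) {
    if (argc < 2) { std::cerr << "usage: wav2text_demo <file.wav>\n"; return 1; }

    CURL* curl = curl_easy_init();
    if (!curl) return 1;

    // Build a multipart/form-data request with the audio file attached.
    curl_mime* form = curl_mime_init(curl);
    curl_mimepart* part = curl_mime_addpart(form);
    curl_mime_name(part, "file");
    curl_mime_filedata(part, argv[1]);
    part = curl_mime_addpart(form);
    curl_mime_name(part, "response_format");
    curl_mime_data(part, "json", CURL_ZERO_TERMINATED);

    std::string response;
    curl_easy_setopt(curl, CURLOPT_URL, "http://127.0.0.1:8080/inference");
    curl_easy_setopt(curl, CURLOPT_MIMEPOST, form);
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, collect);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &response);

    CURLcode rc = curl_easy_perform(curl);
    if (rc == CURLE_OK) {
        // The server returns JSON with the transcription in the "text" field.
        auto j = nlohmann::json::parse(response, nullptr, /*allow_exceptions=*/false);
        if (j.contains("text")) std::cout << j["text"].get<std::string>() << "\n";
        else                    std::cout << response << "\n";
    } else {
        std::cerr << "request failed: " << curl_easy_strerror(rc) << "\n";
    }

    curl_mime_free(form);
    curl_easy_cleanup(curl);
    return rc == CURLE_OK ? 0 : 1;
}
If you want to try it, compile with g++ -o wav2text_demo wav2text_demo.cpp -lcurl (nlohmann-json is header-only) and pass it any recorded .wav file while the server is running.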
🗣️ Start the Voice Assistant Pipeline
Back in the record-activate folder:
pasuspender -- sudo ./wav2text
pasuspender -- sudo ./record_lite
- When the hotword is detected, the system captures voice input and sends it to Whisper
- Once transcribed, the assistant plays an activation sound (activate.mp3)
- You can then use the transcribed command to trigger AI actions or send it to a local language model
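As a sketch of that last step, the snippet below plays the activation sound through mpg123 (installed in Step 1) and routes the transcription through a trivial keyword check; play_activation_sound, handle_command, and the "light" keyword are illustrative placeholders for whatever actions or local LLM call you plug in.
// Illustrative only: play the activation sound and route the transcribed
// command. Replace handle_command() with your own actions or an LLM call.
#include <cstdlib>
#include <iostream>
#include <string>

static void play_activation_sound() {
    // mpg123 was installed with the other dependencies in Step 1.
    std::system("mpg123 -q activate.mp3");
}

static void handle_command(const std::string& text) {
    // Hypothetical keyword dispatch; swap in real device control or an LLM.
    if (text.find("light") != std::string::npos) {
        std::cout << "-> would toggle the lights\n";
    } else {
        std::cout << "-> forwarding to language model: " << text << "\n";
    }
}

int main() {
    play_activation_sound();
    std::string transcription = "turn on the light";  // e.g. returned by Whisper
    handle_command(transcription);
    return 0;
}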
✅ What’s Working Now
At this point, you’ve built a local voice assistant that:
- Listens for a hotword
- Captures your voice
- Converts it to text in real time
- Works completely offline, with no cloud dependencies
🧭 Next Steps
- Running a quantized LLM to understand and respond to commands
- Adding text-to-speech (TTS) output
- Controlling smart home devices or APIs using voice