Speech-to-Text

This example demonstrates a Speech-to-Text (STT) setup using the Whisper extension's STT node. The graph includes a Voice Activity Detector (VAD) from the SileroVAD extension, which monitors the audio input and triggers the STT node to transcribe when voice activity ends. The microphone signal enters the graph through inputNode, is mixed down to mono, and is then split by a bus splitter so that the VAD and the STT node both receive the same audio. The entire graph is run by an audio engine with microphone access, so the audio is processed in real time.

{
  "type": "RealTimeGraphRenderer",
  "config": {
    "microphoneEnabled": true,
    "graph": {
      "config": {
        "sampleRate": 16000,
        "bufferSize": 512
      },
      "nodes": [
        {
          "id": "multiChannelToMonoNode",
          "type": "MultiChannelToMono"
        },
        {
          "id": "busSplitterNode",
          "type": "BusSplitter"
        },
        {
          "id": "vadNode",
          "type": "SileroVAD.SileroVAD"
        },
        {
          "id": "sttNode",
          "type": "Whisper.WhisperSTT",
          "config": {
            "initializeModel": true,
            "useGPU": true
          }
        }
      ],
      "connections": [
        {
          "sourceNode": "inputNode",
          "destinationNode": "multiChannelToMonoNode"
        },
        {
          "sourceNode": "multiChannelToMonoNode",
          "destinationNode": "busSplitterNode"
        },
        {
          "sourceNode": "busSplitterNode",
          "destinationNode": "vadNode"
        },
        {
          "sourceNode": "busSplitterNode",
          "destinationNode": "sttNode"
        },
        {
          "sourceNode": "vadNode.end",
          "destinationNode": "sttNode.transcribe"
        }
      ]
    }
  }
}

To receive the transcribed text, subscribe to the sttNode's transcription event. The registered callback is invoked with the transcribed text each time a transcription is produced.

SwitchboardV3::addEventListener("sttNode", "transcription", [](const std::any& data) {
    // The transcription event delivers the recognized text as a std::string.
    const auto text = std::any_cast<std::string>(data);
    std::cout << "STT node transcribed: " << text << std::endl;
});
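
Note that std::any_cast throws std::bad_any_cast if the payload is not a std::string. If you want the listener to fail gracefully on an unexpected payload, a defensive variant of the snippet above (a sketch, not an additional API) can use the pointer form of std::any_cast, which returns nullptr instead of throwing:

SwitchboardV3::addEventListener("sttNode", "transcription", [](const std::any& data) {
    // Pointer form of std::any_cast: returns nullptr if the payload is not a
    // std::string, instead of throwing std::bad_any_cast.
    if (const auto* text = std::any_cast<std::string>(&data)) {
        std::cout << "STT node transcribed: " << *text << std::endl;
    }
});

Because the event payload is delivered as a std::any, the cast type must match the type the event actually emits; for the transcription event that is a std::string, as shown above.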