Foundry Local

Part 9: Voice Transcription with Whisper and Foundry Local

Goal: Use the OpenAI Whisper model running locally through Foundry Local to transcribe audio files - completely on-device, no cloud required.

Overview

Foundry Local is not just for text generation; it also supports speech-to-text models. In this lab you will use the OpenAI Whisper Medium model to transcribe audio files entirely on your machine. This is ideal for scenarios like transcribing Zava customer service calls, product review recordings, or workshop planning sessions where audio data must never leave your device.


Learning Objectives

By the end of this lab you will be able to:

  • Understand the Whisper speech-to-text model and its capabilities
  • Download and run the Whisper model using Foundry Local
  • Transcribe audio files using the Foundry Local SDK in Python, JavaScript, and C#
  • Build a simple transcription service that runs entirely on-device
  • Understand the differences between chat/text models and audio models in Foundry Local

Prerequisites

Requirement | Details
Foundry Local CLI | Version 0.8.101 or above (Whisper models are available from v0.8.101 onwards)
OS | Windows 10/11 (x64 or ARM64)
Language runtime | Python 3.9+ and/or Node.js 18+ and/or .NET 9 SDK (Download .NET)
Completed | Part 1: Getting Started, Part 2: Foundry Local SDK Deep Dive, and Part 3: SDKs and APIs

Note: Whisper models must be downloaded via the SDK (not the CLI). The CLI does not support the audio transcription endpoint. Check your version with:

foundry --version

Concept: How Whisper Works with Foundry Local

The OpenAI Whisper model is a general-purpose speech recognition model trained on a large dataset of diverse audio. When running through Foundry Local:

  • The recommended CPU variant runs entirely on your CPU - no GPU required (a CUDA variant is also available for NVIDIA GPUs)
  • Audio never leaves your device - complete privacy
  • The Foundry Local SDK handles model download and cache management
  • JavaScript and C# provide a built-in AudioClient in the SDK that handles the entire transcription pipeline — no manual ONNX setup required
  • Python uses the SDK for model management and ONNX Runtime for direct inference against the encoder/decoder ONNX models

How the Pipeline Works (JavaScript and C#) — SDK AudioClient

  1. Foundry Local SDK downloads and caches the Whisper model
  2. model.createAudioClient() (JS) or model.GetAudioClientAsync() (C#) creates an AudioClient
  3. audioClient.transcribe(path) (JS) or audioClient.TranscribeAudioAsync(path) (C#) handles the full pipeline internally — audio preprocessing, encoder, decoder, and token decoding
  4. The AudioClient exposes a settings.language property (set to "en" for English) to guide accurate transcription

How the Pipeline Works (Python) — ONNX Runtime

  1. Foundry Local SDK downloads and caches the Whisper ONNX model files
  2. Audio preprocessing converts WAV audio into a mel spectrogram (80 mel bins x 3000 frames) - sketched in the example after this list
  3. Encoder processes the mel spectrogram and produces hidden states plus cross-attention key/value tensors
  4. Decoder runs autoregressively, generating one token at a time until it produces an end-of-text token
  5. Tokeniser decodes the output token IDs back into readable text
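
To make step 2 concrete, here is a minimal sketch of the feature-extraction stage using the Hugging Face WhisperFeatureExtractor. It is an illustration only: the model directory is a placeholder, and Exercise 3 shows how to resolve the real cached path.

import librosa
from transformers import WhisperFeatureExtractor

# Whisper expects 16 kHz mono audio; librosa resamples on load
audio, _ = librosa.load("sample.wav", sr=16000)

# "path/to/model_dir" is a placeholder for the cached Whisper model directory
fe = WhisperFeatureExtractor.from_pretrained("path/to/model_dir")
features = fe(audio, sampling_rate=16000, return_tensors="np")
print(features.input_features.shape)  # (1, 80, 3000) - padded/truncated to 30 seconds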

Whisper Model Variants

Alias | Model ID | Device | Size | Description
whisper-medium | openai-whisper-medium-cuda-gpu:1 | GPU | 1.53 GB | GPU-accelerated (CUDA)
whisper-medium | openai-whisper-medium-generic-cpu:1 | CPU | 3.05 GB | CPU-optimised (recommended for most devices)

Note: Unlike chat models, which appear in the default model listing, Whisper models are categorised under the automatic-speech-recognition task. Use foundry model info whisper-medium to see details.


Lab Exercises

Exercise 0 - Get Sample Audio Files

This lab includes pre-built WAV files based on Zava DIY product scenarios. Generate them with the included script:

# From the repo root - create and activate a .venv first
python -m venv .venv

# Windows (PowerShell):
.venv\Scripts\Activate.ps1
# macOS:
source .venv/bin/activate

pip install openai
python samples/audio/generate_samples.py

This creates six WAV files in samples/audio/:

File | Scenario
zava-customer-inquiry.wav | Customer asking about the Zava ProGrip Cordless Drill
zava-product-review.wav | Customer reviewing the Zava UltraSmooth Interior Paint
zava-support-call.wav | Support call about the Zava TitanLock Tool Chest
zava-project-planning.wav | DIYer planning a deck with Zava EcoBoard Composite Decking
zava-workshop-setup.wav | Walkthrough of a workshop using all five Zava products
zava-full-project-walkthrough.wav | Extended garage renovation walkthrough using all Zava products (~4 min, for long-audio testing)

Tip: You can also use your own WAV/MP3/M4A files, or record yourself with Windows Voice Recorder.


Exercise 1 - Download the Whisper Model Using the SDK

Due to CLI incompatibilities with Whisper models in newer Foundry Local versions, use the SDK to download and load the model. Choose your language:

🐍 Python

Install the SDK:

pip install foundry-local-sdk
from foundry_local import FoundryLocalManager

alias = "whisper-medium"

# Start the service
manager = FoundryLocalManager()
manager.start_service()

# Check catalog info
info = manager.get_model_info(alias)
print(f"Model: {info.id}")
print(f"Task:  {info.task}")

# Check if already cached
cached = manager.list_cached_models()
is_cached = any(m.id == info.id for m in cached) if info else False

if is_cached:
    print("Whisper model already downloaded.")
else:
    print("Downloading Whisper model (this may take several minutes)...")
    manager.download_model(alias)
    print("Download complete.")

# Load the model into memory
manager.load_model(alias)
print(f"Whisper model loaded. Endpoint: {manager.endpoint}")

Save as download_whisper.py and run:

python download_whisper.py
📘 JavaScript

Install the SDK:

npm install foundry-local-sdk
import { FoundryLocalManager } from "foundry-local-sdk";

const alias = "whisper-medium";

// Create manager and start the service
FoundryLocalManager.create({ appName: "WhisperDemo" });
const manager = FoundryLocalManager.instance;
await manager.startWebService();

// Get model from catalogue
const catalog = manager.catalog;
const model = await catalog.getModel(alias);
console.log(`Model: ${model.id}`);

if (model.isCached) {
  console.log("Whisper model already downloaded.");
} else {
  console.log("Downloading Whisper model (this may take several minutes)...");
  await model.download();
  console.log("Download complete.");
}

// Load the model into memory
await model.load();
console.log(`Whisper model loaded. Service URL: ${manager.urls[0]}`);

Save as download-whisper.mjs and run:

node download-whisper.mjs
💜 C#

Install the SDK:

dotnet add package Microsoft.AI.Foundry.Local
using Microsoft.AI.Foundry.Local;
using Microsoft.Extensions.Logging.Abstractions;

var alias = "whisper-medium";

// Start the service
Console.WriteLine("Starting Foundry Local service...");
await FoundryLocalManager.CreateAsync(
    new Configuration
    {
        AppName = "FoundryLocalSamples",
        Web = new Configuration.WebService { Urls = "http://127.0.0.1:0" }
    }, NullLogger.Instance, default);
var manager = FoundryLocalManager.Instance;
await manager.StartWebServiceAsync(default);

// Get model from catalog
var catalog = await manager.GetCatalogAsync(default);
var model = await catalog.GetModelAsync(alias, default);
Console.WriteLine($"Model: {model.Id}");

// Check if already cached
var isCached = await model.IsCachedAsync(default);

if (isCached)
{
    Console.WriteLine("Whisper model already downloaded.");
}
else
{
    Console.WriteLine("Downloading Whisper model (this may take several minutes)...");
    await model.DownloadAsync(null, default);
    Console.WriteLine("Download complete.");
}

// Load the model into memory
await model.LoadAsync(default);
Console.WriteLine($"Whisper model loaded: {model.Id}");

Why SDK instead of CLI? The Foundry Local CLI does not support downloading or serving Whisper models directly. The SDK provides a reliable way to download and manage audio models programmatically. The JavaScript and C# SDKs include a built-in AudioClient that handles the entire transcription pipeline. Python uses ONNX Runtime for direct inference against the cached model files.


Exercise 2 - Understand the Whisper SDK

Whisper transcription uses different approaches depending on the language. JavaScript and C# provide a built-in AudioClient in the Foundry Local SDK that handles the full pipeline (audio preprocessing, encoder, decoder, token decoding) in a single method call. Python uses the Foundry Local SDK for model management and ONNX Runtime for direct inference against the encoder/decoder ONNX models.

Component | Python | JavaScript | C#
SDK packages | foundry-local-sdk, onnxruntime, transformers, librosa | foundry-local-sdk | Microsoft.AI.Foundry.Local
Model management | FoundryLocalManager(alias) | FoundryLocalManager.create() + catalog.getModel() | FoundryLocalManager.CreateAsync() + catalog
Feature extraction | WhisperFeatureExtractor + librosa | Handled by SDK AudioClient | Handled by SDK AudioClient
Inference | ort.InferenceSession (encoder + decoder) | audioClient.transcribe() | audioClient.TranscribeAudioAsync()
Token decoding | WhisperTokenizer | Handled by SDK AudioClient | Handled by SDK AudioClient
Language setting | Set via forced_ids in decoder tokens | audioClient.settings.language = "en" | audioClient.Settings.Language = "en"
Input | WAV file path | WAV file path | WAV file path
Output | Decoded text string | result.text | result.Text

Important: Always set the language on the AudioClient (e.g. "en" for English). Without an explicit language setting, the model may produce garbled output as it attempts to auto-detect the language.

SDK Patterns: Python uses FoundryLocalManager(alias) to bootstrap, then get_cache_location() to find the ONNX model files. JavaScript and C# use the SDK’s built-in AudioClient — obtained via model.createAudioClient() (JS) or model.GetAudioClientAsync() (C#) — which handles the entire transcription pipeline. See Part 2: Foundry Local SDK Deep Dive for full details.


Exercise 3 - Build a Simple Transcription App

Choose your language track and build a minimal application that transcribes an audio file.

Supported audio formats: WAV, MP3, M4A. For best results, use WAV files with 16kHz sample rate.
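
If your recording is not already a 16 kHz WAV file, you can normalise it first. The following is a minimal conversion sketch, assuming librosa and soundfile are installed (pip install librosa soundfile) and that the codecs needed to decode MP3/M4A are available on your system; the file names are placeholders:

import sys
import librosa
import soundfile as sf

src = sys.argv[1]                                   # e.g. recording.m4a (placeholder)
dst = sys.argv[2] if len(sys.argv) > 2 else "converted-16k.wav"

audio, _ = librosa.load(src, sr=16000, mono=True)   # decode and resample to 16 kHz mono
sf.write(dst, audio, 16000)                          # write a PCM WAV file
print(f"Wrote {dst} ({len(audio) / 16000:.1f}s)")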

Python Track

Setup

cd python
python -m venv venv

# Activate the virtual environment:
# Windows (PowerShell):
venv\Scripts\Activate.ps1
# macOS:
source venv/bin/activate

pip install foundry-local-sdk onnxruntime transformers librosa

Transcription Code

Create a file foundry-local-whisper.py:

import sys
import os
import numpy as np
import onnxruntime as ort
import librosa
from transformers import WhisperFeatureExtractor, WhisperTokenizer
from foundry_local import FoundryLocalManager

model_alias = "whisper-medium"
audio_file = sys.argv[1] if len(sys.argv) > 1 else "sample.wav"

if not os.path.exists(audio_file):
    print(f"Audio file not found: {audio_file}")
    sys.exit(1)

# Step 1: Bootstrap - starts service, downloads, and loads the model
print(f"Initialising Foundry Local with model: {model_alias}...")
manager = FoundryLocalManager(model_alias)
model_info = manager.get_model_info(model_alias)
cache_location = manager.get_cache_location()

# Build path to the cached ONNX model files
model_dir = os.path.join(
    cache_location, "Microsoft",
    model_info.id.replace(":", "-"), "cpu-fp32"
)

# Step 2: Load ONNX sessions and feature extractor
encoder = ort.InferenceSession(
    os.path.join(model_dir, "whisper-medium_encoder_fp32.onnx"),
    providers=["CPUExecutionProvider"]
)
decoder = ort.InferenceSession(
    os.path.join(model_dir, "whisper-medium_decoder_fp32.onnx"),
    providers=["CPUExecutionProvider"]
)
fe = WhisperFeatureExtractor.from_pretrained(model_dir)
tokenizer = WhisperTokenizer.from_pretrained(model_dir)

# Step 3: Extract mel spectrogram features
audio, _ = librosa.load(audio_file, sr=16000)
features = fe(audio, sampling_rate=16000, return_tensors="np")
input_features = features.input_features.astype(np.float32)

# Step 4: Run encoder
enc_out = encoder.run(None, {"audio_features": input_features})
# First output is hidden states; remaining are cross-attention KV pairs
cross_kv = {
    f"past_key_cross_{i}": enc_out[1 + 2 * i]
    for i in range(24)
}
cross_kv.update({
    f"past_value_cross_{i}": enc_out[2 + 2 * i]
    for i in range(24)
})

# Step 5: Autoregressive decoding
initial_tokens = [50258, 50259, 50359, 50363]  # sot, en, transcribe, notimestamps
input_ids = np.array([initial_tokens], dtype=np.int32)

# Empty self-attention KV cache
self_kv = {}
for i in range(24):
    self_kv[f"past_key_self_{i}"] = np.zeros((1, 16, 0, 64), dtype=np.float32)
    self_kv[f"past_value_self_{i}"] = np.zeros((1, 16, 0, 64), dtype=np.float32)

generated = []
for _ in range(448):
    feeds = {"input_ids": input_ids, **cross_kv, **self_kv}
    outputs = decoder.run(None, feeds)
    logits = outputs[0]
    next_token = int(np.argmax(logits[0, -1, :]))

    if next_token == 50257:  # end of text
        break
    generated.append(next_token)

    # Update self-attention KV cache
    for i in range(24):
        self_kv[f"past_key_self_{i}"] = outputs[1 + 2 * i]
        self_kv[f"past_value_self_{i}"] = outputs[2 + 2 * i]
    input_ids = np.array([[next_token]], dtype=np.int32)

print(tokenizer.decode(generated, skip_special_tokens=True))

Run it

# Transcribe a Zava product scenario
python foundry-local-whisper.py ../samples/audio/zava-customer-inquiry.wav

# Or try others:
python foundry-local-whisper.py ../samples/audio/zava-product-review.wav
python foundry-local-whisper.py ../samples/audio/zava-workshop-setup.wav

Key Python Points

Method | Purpose
FoundryLocalManager(alias) | Bootstrap: start service, download, and load the model
manager.get_cache_location() | Get the path to cached ONNX model files
WhisperFeatureExtractor.from_pretrained() | Load the mel spectrogram feature extractor
ort.InferenceSession() | Create ONNX Runtime sessions for encoder and decoder
tokenizer.decode() | Convert output token IDs back to text

JavaScript Track

Setup

cd javascript
npm install foundry-local-sdk onnxruntime-node

Transcription Code

Create a file foundry-local-whisper.mjs:

import { FoundryLocalManager } from "foundry-local-sdk";
import fs from "node:fs";

const modelAlias = "whisper-medium";
const audioFile = process.argv[2] || "sample.wav";

if (!fs.existsSync(audioFile)) {
  console.error(`Audio file not found: ${audioFile}`);
  process.exit(1);
}

// Step 1: Bootstrap - create manager, start service, and load the model
console.log(`Initialising Foundry Local with model: ${modelAlias}...`);
FoundryLocalManager.create({ appName: "WhisperDemo" });
const manager = FoundryLocalManager.instance;
await manager.startWebService();

const catalog = manager.catalog;
const model = await catalog.getModel(modelAlias);

if (!model.isCached) {
  console.log("Downloading Whisper model...");
  await model.download();
}
await model.load();

// Step 2: Create an audio client and transcribe
const audioClient = model.createAudioClient();
audioClient.settings.language = "en";

console.log(`Transcribing: ${audioFile}`);
const result = await audioClient.transcribe(audioFile);

console.log("\n--- Transcription ---");
console.log(result.text);
console.log("---------------------");

// Cleanup
await model.unload();

Note: The Foundry Local SDK provides a built-in AudioClient via model.createAudioClient() that handles the entire ONNX inference pipeline internally — no onnxruntime-node import needed. Always set audioClient.settings.language = "en" to ensure accurate English transcription.

Run it

# Transcribe a Zava product scenario
node foundry-local-whisper.mjs ../samples/audio/zava-customer-inquiry.wav

# Or try others:
node foundry-local-whisper.mjs ../samples/audio/zava-support-call.wav
node foundry-local-whisper.mjs ../samples/audio/zava-project-planning.wav

Key JavaScript Points

Method | Purpose
FoundryLocalManager.create({ appName }) | Create the manager singleton
await catalog.getModel(alias) | Get a model from the catalogue
model.download() / model.load() | Download and load the Whisper model
model.createAudioClient() | Create an audio client for transcription
audioClient.settings.language = "en" | Set the transcription language (required for accurate output)
audioClient.transcribe(path) | Transcribe an audio file, returns { text, duration }

C# Track

Setup

mkdir whisper-demo
cd whisper-demo
dotnet new console --framework net9.0
dotnet add package Microsoft.AI.Foundry.Local

Note: The C# track uses the Microsoft.AI.Foundry.Local package which provides a built-in AudioClient via model.GetAudioClientAsync(). This handles the full transcription pipeline in-process — no separate ONNX Runtime setup needed.

Transcription Code

Replace the contents of Program.cs:

using Microsoft.AI.Foundry.Local;
using Microsoft.Extensions.Logging.Abstractions;

// --- Configuration ---
var modelAlias = "whisper-medium";
var audioFile = args.Length > 0 ? args[0] : "sample.wav";

if (!File.Exists(audioFile))
{
    Console.WriteLine($"Audio file not found: {audioFile}");
    Console.WriteLine("Usage: dotnet run <path-to-audio-file>");
    return;
}

// --- Step 1: Initialize Foundry Local ---
Console.WriteLine("Initializing Foundry Local...");
await FoundryLocalManager.CreateAsync(
    new Configuration
    {
        AppName = "WhisperDemo",
        Web = new Configuration.WebService { Urls = "http://127.0.0.1:0" }
    }, NullLogger.Instance, default);
var manager = FoundryLocalManager.Instance;
await manager.StartWebServiceAsync(default);

// --- Step 2: Load the Whisper model ---
Console.WriteLine($"Loading model: {modelAlias}...");
var catalog = await manager.GetCatalogAsync(default);
var model = await catalog.GetModelAsync(modelAlias, default);

// Download if needed
var isCached = await model.IsCachedAsync(default);
if (!isCached)
{
    Console.WriteLine("Downloading model...");
    await model.DownloadAsync(null, default);
}

// Load model into memory
Console.WriteLine("Loading model into memory...");
await model.LoadAsync(default);

// --- Step 3: Transcribe audio ---
Console.WriteLine($"Transcribing: {audioFile}");

var audioClient = await model.GetAudioClientAsync();
audioClient.Settings.Language = "en";

var result = await audioClient.TranscribeAudioAsync(audioFile);

Console.WriteLine("\n--- Transcription ---");
Console.WriteLine(result.Text);
Console.WriteLine("---------------------");

Run it

# Transcribe a Zava product scenario
dotnet run -- ..\samples\audio\zava-customer-inquiry.wav

# Or try others:
dotnet run -- ..\samples\audio\zava-product-review.wav
dotnet run -- ..\samples\audio\zava-workshop-setup.wav

Key C# Points

Method | Purpose
FoundryLocalManager.CreateAsync(config) | Initialise Foundry Local with configuration
catalog.GetModelAsync(alias) | Get model from catalog
model.DownloadAsync() | Download the Whisper model
model.GetAudioClientAsync() | Get the AudioClient (not ChatClient!)
audioClient.Settings.Language = "en" | Set the transcription language (required for accurate output)
audioClient.TranscribeAudioAsync(path) | Transcribe an audio file
result.Text | The transcribed text

C# vs Python/JS: The C# SDK provides a built-in AudioClient for in-process transcription via model.GetAudioClientAsync(), similar to the JavaScript SDK. Python uses ONNX Runtime directly for inference against the cached encoder/decoder models.


Exercise 4 - Batch Transcribe All Zava Samples

Now that you have a working transcription app, transcribe the Zava sample files and compare the results.

Python Track

The full sample python/foundry-local-whisper.py already supports batch transcription. When run without arguments, it transcribes all zava-*.wav files in samples/audio/:

cd python
python foundry-local-whisper.py

The sample uses FoundryLocalManager(alias) to bootstrap, then runs the encoder and decoder ONNX sessions for each file.
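
If you want to drive the batch from your own minimal script instead, one simple approach is to invoke the Exercise 3 single-file script once per sample. This is a sketch assuming you run it from the python/ folder and the sample files exist; unlike the full sample, it reloads the model for every file:

import subprocess
from pathlib import Path

# Loop over the generated Zava samples and transcribe each one with the Exercise 3 script
for wav in sorted(Path("../samples/audio").glob("zava-*.wav")):
    print(f"\n=== {wav.name} ===")
    subprocess.run(["python", "foundry-local-whisper.py", str(wav)], check=True)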

JavaScript Track

The full sample javascript/foundry-local-whisper.mjs already supports batch transcription. When run without arguments, it transcribes all zava-*.wav files in samples/audio/:

cd javascript
node foundry-local-whisper.mjs

The sample uses FoundryLocalManager.create() and catalog.getModel(alias) to initialise the SDK, then uses the AudioClient (with settings.language = "en") to transcribe each file.

C# Track

The full sample csharp/WhisperTranscription.cs already supports batch transcription. When run without a specific file argument, it transcribes all zava-*.wav files in samples/audio/:

cd csharp
dotnet run whisper

The sample uses FoundryLocalManager.CreateAsync() and the SDK’s AudioClient (with Settings.Language = "en") for in-process transcription.

What to look for: Compare the transcription output against the original text in samples/audio/generate_samples.py. How accurately does Whisper capture product names like "Zava ProGrip" and technical terms like "brushless motor" or "composite decking"?
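
One simple way to quantify this comparison is word error rate (WER). Here is a minimal sketch, assuming the jiwer package is installed (pip install jiwer); the reference and hypothesis strings below are illustrative placeholders, not actual sample output:

import jiwer

reference  = "Welcome to Zava Home Improvement. I'd like to learn more about the ProGrip Cordless Drill."
hypothesis = "Welcome to Zava home improvement. I'd like to learn more about the Pro Grip cordless drill."

# Lower-case both sides so capitalisation differences are not counted as errors
print(f"Word error rate: {jiwer.wer(reference.lower(), hypothesis.lower()):.2%}")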


Exercise 5 - Understand the Key Code Patterns

Study how Whisper transcription differs from chat completions across all three languages:

Python - Key Differences from Chat
# Chat completion (Parts 2-6):
client = openai.OpenAI(base_url=manager.endpoint, api_key=manager.api_key)
stream = client.chat.completions.create(
    model=model_id,
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
)

# Audio transcription (This Part):
# Uses ONNX Runtime directly instead of the OpenAI client
encoder = ort.InferenceSession(encoder_path, providers=["CPUExecutionProvider"])
decoder = ort.InferenceSession(decoder_path, providers=["CPUExecutionProvider"])

audio, _ = librosa.load("audio.wav", sr=16000)
features = feature_extractor(audio, sampling_rate=16000, return_tensors="np")
enc_out = encoder.run(None, {"audio_features": features.input_features})
# ... autoregressive decoder loop ...
print(tokenizer.decode(generated_tokens))

Key insight: Chat models use the OpenAI-compatible API via manager.endpoint. Whisper uses the SDK to locate the cached ONNX model files, then runs inference directly with ONNX Runtime.

JavaScript - Key Differences from Chat
// Chat completion (Parts 2-6):
const client = new OpenAI({ baseURL: manager.urls[0] + "/v1", apiKey: "foundry-local" });
const stream = await client.chat.completions.create({
  model: model.id,
  messages: [{ role: "user", content: "Hello" }],
  stream: true,
});

// Audio transcription (This Part):
// Uses the SDK's built-in AudioClient
const audioClient = model.createAudioClient();
audioClient.settings.language = "en"; // Always set language for best results
const result = await audioClient.transcribe("audio.wav");
console.log(result.text);

Key insight: Chat models use the OpenAI-compatible API via manager.urls[0] + "/v1". Whisper transcription uses the SDK’s AudioClient, obtained from model.createAudioClient(). Set settings.language to avoid garbled output from auto-detection.

C# - Key Differences from Chat

The C# approach uses the SDK’s built-in AudioClient for in-process transcription:

Model initialisation:

// 1. Create the manager with configuration
await FoundryLocalManager.CreateAsync(
    new Configuration
    {
        AppName = "WhisperDemo",
        Web = new Configuration.WebService { Urls = "http://127.0.0.1:0" }
    }, NullLogger.Instance, default);
var manager = FoundryLocalManager.Instance;
await manager.StartWebServiceAsync(default);

// 2. Get model from catalog, download, and load
var catalog = await manager.GetCatalogAsync(default);
var model = await catalog.GetModelAsync("whisper-medium", default);
await model.DownloadAsync(null, default);
await model.LoadAsync(default);

Transcription:

// Get the audio client (not a chat client!)
var audioClient = await model.GetAudioClientAsync();
audioClient.Settings.Language = "en"; // Always set language for best results

// Transcribe - returns an object with a .Text property
var response = await audioClient.TranscribeAudioAsync(filePath);
Console.WriteLine(response.Text);

Key insight: C# uses FoundryLocalManager.CreateAsync() and gets an AudioClient directly — no ONNX Runtime setup needed. Set Settings.Language to avoid garbled output from auto-detection.

Summary: Python uses the Foundry Local SDK for model management and ONNX Runtime for direct inference against the encoder/decoder models. JavaScript and C# both use the SDK’s built-in AudioClient for streamlined transcription — create the client, set the language, and call transcribe() / TranscribeAudioAsync(). Always set the language property on the AudioClient for accurate results.


Exercise 6 - Experiment

Try these modifications to deepen your understanding:

  1. Try different audio files - record yourself speaking using Windows Voice Recorder, save as WAV, and transcribe it

  2. Compare model variants - if you have an NVIDIA GPU, try the CUDA variant:

    foundry model download whisper-medium --device GPU

    Compare the transcription speed against the CPU variant. (If your CLI version cannot download Whisper models - see the note under Prerequisites - download the CUDA variant through the SDK instead.)

  3. Add output formatting - the JSON response can include:

    {
      "text": "Welcome to Zava Home Improvement. I'd like to learn more about the ProGrip Cordless Drill.",
      "language": "en",
      "duration": 10.5
    }
  4. Build a REST API - wrap your transcription code in a web server (a minimal FastAPI sketch follows this list):

    Language | Framework | Example
    Python | FastAPI | @app.post("/v1/audio/transcriptions") with UploadFile
    JavaScript | Express.js | app.post("/v1/audio/transcriptions") with multer
    C# | ASP.NET Minimal API | app.MapPost("/v1/audio/transcriptions") with IFormFile
  5. Multi-turn with transcription - combine Whisper with a chat agent from Part 4: transcribe audio first, then pass the text to an agent for analysis or summarisation.
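
For item 4, here is a minimal FastAPI sketch. It assumes a hypothetical transcribe_file(path) helper wrapping your Exercise 3 code and requires fastapi, uvicorn, and python-multipart; the endpoint path mirrors the OpenAI transcription API, but the rest is an illustration rather than part of the lab samples:

import tempfile
from fastapi import FastAPI, UploadFile
from my_whisper import transcribe_file  # hypothetical module wrapping the Exercise 3 code

app = FastAPI()

@app.post("/v1/audio/transcriptions")
async def transcribe(file: UploadFile):
    # Save the upload to a temporary WAV file, then run the local Whisper pipeline on it
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
        tmp.write(await file.read())
        tmp_path = tmp.name
    return {"text": transcribe_file(tmp_path)}

Run it with uvicorn app:app --port 8000 and POST an audio file as multipart form data to test.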


SDK Audio API Reference

JavaScript AudioClient:

  • model.createAudioClient() — creates an AudioClient instance
  • audioClient.settings.language — set the transcription language (e.g. "en")
  • audioClient.settings.temperature — control randomness (optional)
  • audioClient.transcribe(filePath) — transcribe a file, returns { text, duration }
  • audioClient.transcribeStreaming(filePath, callback) — stream transcription chunks via callback

C# AudioClient:

  • await model.GetAudioClientAsync() — creates an OpenAIAudioClient instance
  • audioClient.Settings.Language — set the transcription language (e.g. "en")
  • audioClient.Settings.Temperature — control randomness (optional)
  • await audioClient.TranscribeAudioAsync(filePath) — transcribe a file, returns object with .Text
  • audioClient.TranscribeAudioStreamingAsync(filePath) — returns IAsyncEnumerable of transcription chunks

Tip: Always set the language property before transcribing. Without it, the Whisper model attempts auto-detection, which can produce garbled output (a single replacement character instead of text).


Comparison: Chat Models vs. Whisper

Aspect | Chat Models (Parts 3-7) | Whisper - Python | Whisper - JS / C#
Task type | chat | automatic-speech-recognition | automatic-speech-recognition
Input | Text messages (JSON) | Audio files (WAV/MP3/M4A) | Audio files (WAV/MP3/M4A)
Output | Generated text (streamed) | Transcribed text (complete) | Transcribed text (complete)
SDK package | openai + foundry-local-sdk | foundry-local-sdk + onnxruntime | foundry-local-sdk (JS) / Microsoft.AI.Foundry.Local (C#)
API method | client.chat.completions.create() | ONNX Runtime direct | audioClient.transcribe() (JS) / audioClient.TranscribeAudioAsync() (C#)
Language setting | N/A | Decoder prompt tokens | audioClient.settings.language (JS) / audioClient.Settings.Language (C#)
Streaming | Yes | No | transcribeStreaming() (JS) / TranscribeAudioStreamingAsync() (C#)
Privacy benefit | Code/data stays local | Audio data stays local | Audio data stays local

Key Takeaways

Concept | What You Learned
Whisper on-device | Speech-to-text runs entirely locally, ideal for transcribing Zava customer calls and product reviews on-device
SDK AudioClient | JavaScript and C# SDKs provide a built-in AudioClient that handles the full transcription pipeline in a single call
Language setting | Always set the AudioClient language (e.g. "en") — without it, auto-detection may produce garbled output
Python | Uses foundry-local-sdk for model management + onnxruntime + transformers + librosa for direct ONNX inference
JavaScript | Uses foundry-local-sdk with model.createAudioClient() — set settings.language, then call transcribe()
C# | Uses Microsoft.AI.Foundry.Local with model.GetAudioClientAsync() — set Settings.Language, then call TranscribeAudioAsync()
Streaming support | JS and C# SDKs also offer transcribeStreaming() / TranscribeAudioStreamingAsync() for chunk-by-chunk output
CPU-optimised | The CPU variant (3.05 GB) works on any Windows device without a GPU
Privacy-first | Perfect for keeping Zava customer interactions and proprietary product data on-device

Resources

Resource | Link
Foundry Local docs | Microsoft Learn - Foundry Local
Foundry Local SDK Reference | Microsoft Learn - SDK Reference
OpenAI Whisper model | github.com/openai/whisper
Foundry Local website | foundrylocal.ai

Next Step

Continue to Part 10: Using Custom or Hugging Face Models to compile your own models from Hugging Face and run them through Foundry Local.