Audio & Voice (STT/TTS)
Transcribe audio into text for LLM requests (policy-driven), and optionally synthesize audio output via capability plugins.
Overview
AbstractCore treats audio as a first-class media type, but audio handling is intentionally policy-driven: by default, attaching audio to an LLM request fails loudly unless you opt into an explicit fallback.
- Audio input (speech-to-text): set audio_policy="speech_to_text" to inject a transcript into the prompt
- Auto policy: audio_policy="auto" uses native audio when supported, otherwise STT when configured (see the sketch after this list)
- Audio output (text-to-speech): use llm.voice.tts(...) or generate_with_outputs(...) (optional)
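As a quick sketch of the auto policy (the model name and audio file path are placeholders, not requirements): with audio_policy="auto", AbstractCore passes the audio natively when the provider/model supports it and falls back to the configured speech-to-text path otherwise.
from abstractcore import create_llm

# "auto" prefers native audio input; if the model cannot accept audio,
# the configured speech-to-text fallback is used instead.
llm = create_llm("openai", model="gpt-4o-mini")
resp = llm.generate(
    "What decisions were made in this recording?",
    media=["./call.wav"],
    audio_policy="auto",
)
print(resp.content)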
Capability backends are provided by optional plugins (for example abstractvoice) and are discovered lazily at runtime.
This keeps pip install abstractcore lightweight.
Install
# Core + capability plugin (recommended)
pip install abstractcore abstractvoice
# Plus any provider you use
pip install "abstractcore[openai]" # or "abstractcore[anthropic]" / "abstractcore[huggingface]" / "abstractcore[mlx]" / "abstractcore[vllm]"
# If you also want images/PDF/Office docs
pip install "abstractcore[media]"
Speech-to-text in generate()
Attach an audio file via media=[...] and opt into the STT fallback policy.
AbstractCore transcribes the audio (via the audio capability plugin), removes the audio from the provider-native media path,
and injects a short transcript block into your prompt. This works with any provider/model because the transcript becomes plain text.
from abstractcore import create_llm
llm = create_llm("openai", model="gpt-4o-mini")
resp = llm.generate(
"Summarize the key decisions from this call.",
media=["./call.wav"],
audio_policy="speech_to_text",
audio_language="en", # optional (also supports stt_language=...)
)
print(resp.content)
print(resp.metadata.get("media_enrichment"))
Transparency metadata (media_enrichment)
When AbstractCore converts a non-text input (image/audio/video) into injected text context, it records what happened in
response.metadata.media_enrichment (for example: policy used, modality, and injected transcript size).
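Continuing from the generate() example above, a minimal sketch of checking that record; its exact structure is version-dependent, so treat it as an opaque value rather than a guaranteed schema:
# Absent when no image/audio/video was converted into injected text.
enrichment = resp.metadata.get("media_enrichment")
if enrichment:
    print("Media was converted into injected text context:", enrichment)
else:
    print("No media enrichment was applied to this request.")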
Default policy via centralized config
If you attach audio frequently, you can set a default policy in ~/.abstractcore/config/abstractcore.json
(see Centralized Configuration).
{
"audio": {
"strategy": "speech_to_text",
"stt_language": "en"
}
}
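With that default in place, the per-call audio_policy argument can usually be omitted; a minimal sketch, assuming the centralized default is picked up when no explicit policy is passed (model name and file path are placeholders):
from abstractcore import create_llm

# The audio defaults from abstractcore.json apply, so no audio_policy= is passed here.
llm = create_llm("openai", model="gpt-4o-mini")
resp = llm.generate("Summarize this call.", media=["./call.wav"])
print(resp.content)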
Direct transcription (no LLM call)
You can also use the capability surface directly:
from abstractcore import create_llm
core = create_llm("openai", model="gpt-4o-mini")
text = core.audio.transcribe("./call.wav", language="en")
print(text)
Text-to-speech output (optional)
If a voice backend is available (for example via abstractvoice), you can synthesize audio from text:
from abstractcore import create_llm
core = create_llm("openai", model="gpt-4o-mini")
wav_bytes = core.voice.tts("Hello from AbstractCore.", format="wav")
print(len(wav_bytes))
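The return value is raw audio bytes, so persisting it is plain Python file I/O (the output path below is arbitrary):
# Write the synthesized audio to disk.
with open("hello.wav", "wb") as f:
    f.write(wav_bytes)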
Or use the convenience wrapper generate_with_outputs(...), which generates the text and then runs TTS as an explicit second step:
from abstractcore import create_llm
core = create_llm("openai", model="gpt-4o-mini")
result = core.generate_with_outputs(
"Write a 1-sentence welcome message.",
outputs={"tts": {"format": "wav"}},
)
print(result.response.content)
wav_bytes = result.outputs["tts"]
HTTP server: OpenAI-compatible audio endpoints
The AbstractCore server exposes OpenAI-compatible audio endpoints backed by capability plugins:
- POST /v1/audio/transcriptions (multipart form; speech-to-text)
- POST /v1/audio/speech (JSON; text-to-speech)
pip install "abstractcore[server]" abstractvoice
python -m abstractcore.server.app
# STT (requires: pip install abstractvoice; python-multipart is included in abstractcore[server])
curl -X POST http://localhost:8000/v1/audio/transcriptions \
-F file=@./call.wav \
-F language=en
# TTS (requires: pip install abstractvoice on the server)
curl -X POST http://localhost:8000/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{"input":"Hello from AbstractCore","format":"wav"}' \
--output out.wav
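The same calls from Python, mirroring the curl commands above with the requests library; the server address and file paths are placeholders, and the transcription response is assumed to be JSON as in the OpenAI API:
import requests

BASE = "http://localhost:8000"

# Speech-to-text: multipart form upload (same fields as the curl -F flags above).
with open("./call.wav", "rb") as f:
    stt = requests.post(
        f"{BASE}/v1/audio/transcriptions",
        files={"file": f},
        data={"language": "en"},
    )
print(stt.json())

# Text-to-speech: JSON body in, audio bytes out.
tts = requests.post(
    f"{BASE}/v1/audio/speech",
    json={"input": "Hello from AbstractCore", "format": "wav"},
)
with open("out.wav", "wb") as f:
    f.write(tts.content)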
See HTTP Server Guide for full setup and deployment notes.