Audio & Voice (STT/TTS)

Transcribe audio into text for LLM requests (policy-driven), and optionally synthesize audio output via capability plugins.

Overview

AbstractCore treats audio as a first-class media type, but it is intentionally policy-driven: by default, attaching audio to an LLM request will fail loudly unless you opt into an explicit fallback.

  • Audio input (speech-to-text): set audio_policy="speech_to_text" to inject a transcript into the prompt
  • Auto policy: audio_policy="auto" uses native audio when the provider/model supports it, otherwise falls back to STT when a backend is configured (see the sketch below)
  • Audio output (text-to-speech): use llm.voice.tts(...) or generate_with_outputs(...) (optional)

Capability backends are provided by optional plugins (for example abstractvoice) and are discovered lazily at runtime. This keeps pip install abstractcore lightweight.

Install

# Core + capability plugin (recommended)
pip install abstractcore abstractvoice

# Plus any provider you use
pip install "abstractcore[openai]"       # or "abstractcore[anthropic]" / "abstractcore[huggingface]" / "abstractcore[mlx]" / "abstractcore[vllm]"

# If you also want images/PDF/Office docs
pip install "abstractcore[media]"

Speech-to-text in generate()

Attach an audio file via media=[...] and opt into the STT fallback policy. AbstractCore transcribes the audio (via the audio capability plugin), removes the audio from the provider-native media path, and injects a short transcript block into your prompt. This works with any provider/model because the transcript becomes plain text.

from abstractcore import create_llm

llm = create_llm("openai", model="gpt-4o-mini")

resp = llm.generate(
    "Summarize the key decisions from this call.",
    media=["./call.wav"],
    audio_policy="speech_to_text",
    audio_language="en",  # optional (also supports stt_language=...)
)

print(resp.content)
print(resp.metadata.get("media_enrichment"))
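
The audio_policy="auto" variant follows the same call shape: native audio when the model supports it, STT fallback otherwise. A minimal sketch (assumes an STT backend such as abstractvoice is installed for the fallback case):

from abstractcore import create_llm

llm = create_llm("openai", model="gpt-4o-mini")

# "auto": use the provider's native audio path if the model supports it,
# otherwise transcribe via the installed STT capability plugin
resp = llm.generate(
    "What is the caller asking for?",
    media=["./voicemail.wav"],
    audio_policy="auto",
)
print(resp.content)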

Transparency metadata (media_enrichment)

When AbstractCore converts a non-text input (image/audio/video) into injected text context, it records what happened in response.metadata.media_enrichment (for example: policy used, modality, and injected transcript size).
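
A defensive way to inspect it (the exact field names are version-dependent; the shape in the comment below is hypothetical, for illustration only):

enrichment = resp.metadata.get("media_enrichment")
if enrichment:
    # Hypothetical example shape -- check real responses for the actual keys:
    # {"modality": "audio", "policy": "speech_to_text", "transcript_chars": 1834}
    print(enrichment)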

Default policy via centralized config

If you attach audio frequently, you can set a default policy in ~/.abstractcore/config/abstractcore.json (see Centralized Configuration).

{
  "audio": {
    "strategy": "speech_to_text",
    "stt_language": "en"
  }
}
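
With that default in place, per-call policy arguments become unnecessary (a minimal sketch, assuming the config above is loaded):

from abstractcore import create_llm

llm = create_llm("openai", model="gpt-4o-mini")

# No audio_policy needed: the configured default
# ("speech_to_text", stt_language "en") applies
resp = llm.generate("List any action items.", media=["./standup.wav"])
print(resp.content)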

Direct transcription (no LLM call)

You can also use the capability surface directly:

from abstractcore import create_llm

core = create_llm("openai", model="gpt-4o-mini")
text = core.audio.transcribe("./call.wav", language="en")
print(text)
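
Because transcribe() returns plain text, batching is just a loop (a sketch; assumes core from the snippet above and a ./calls directory of WAV files):

from pathlib import Path

# Transcribe every recording and save a .txt next to each .wav
for wav in sorted(Path("./calls").glob("*.wav")):
    text = core.audio.transcribe(str(wav), language="en")
    wav.with_suffix(".txt").write_text(text)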

Text-to-speech output (optional)

If a voice backend is available (for example via abstractvoice), you can synthesize audio from text:

from abstractcore import create_llm

core = create_llm("openai", model="gpt-4o-mini")
wav_bytes = core.voice.tts("Hello from AbstractCore.", format="wav")
print(len(wav_bytes))
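
The returned value is raw audio bytes, so saving it is a plain file write (continuing the snippet above):

with open("hello.wav", "wb") as f:
    f.write(wav_bytes)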

Or use the convenience wrapper, which generates text and then runs TTS as an explicit second step:

from abstractcore import create_llm

core = create_llm("openai", model="gpt-4o-mini")
result = core.generate_with_outputs(
    "Write a 1-sentence welcome message.",
    outputs={"tts": {"format": "wav"}},
)

print(result.response.content)
wav_bytes = result.outputs["tts"]
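
The two capability surfaces compose into a simple voice round trip (a sketch using only the calls shown above; file names are placeholders):

from abstractcore import create_llm

core = create_llm("openai", model="gpt-4o-mini")

# Speech in -> text answer -> speech out
question = core.audio.transcribe("./question.wav", language="en")
answer = core.generate(f"Answer briefly: {question}")
with open("answer.wav", "wb") as f:
    f.write(core.voice.tts(answer.content, format="wav"))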

HTTP server: OpenAI-compatible audio endpoints

The AbstractCore server exposes OpenAI-compatible audio endpoints backed by capability plugins:

  • POST /v1/audio/transcriptions (multipart form; speech-to-text)
  • POST /v1/audio/speech (JSON; text-to-speech)
pip install "abstractcore[server]" abstractvoice
python -m abstractcore.server.app
# STT (requires: pip install abstractvoice; python-multipart is included in abstractcore[server])
curl -X POST http://localhost:8000/v1/audio/transcriptions \
  -F file=@./call.wav \
  -F language=en

# TTS (requires: pip install abstractvoice on the server)
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input":"Hello from AbstractCore","format":"wav"}' \
  --output out.wav
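
The same endpoints from Python, mirroring the curl calls above (a sketch using requests; the transcription response body may vary by version, so inspect it yourself):

import requests

# STT: multipart form upload, as in the curl example
with open("call.wav", "rb") as f:
    r = requests.post(
        "http://localhost:8000/v1/audio/transcriptions",
        files={"file": f},
        data={"language": "en"},
    )
print(r.json())

# TTS: JSON body; the response bytes are the synthesized audio
r = requests.post(
    "http://localhost:8000/v1/audio/speech",
    json={"input": "Hello from AbstractCore", "format": "wav"},
)
with open("out.wav", "wb") as f:
    f.write(r.content)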

See HTTP Server Guide for full setup and deployment notes.

Related Documentation

  • Media Handling: files, media types, and fallbacks
  • Centralized Configuration: set default audio policy and language
  • HTTP Server Guide: OpenAI-compatible endpoints