Audio & Voice (STT/TTS)
Transcribe audio into text for LLM requests (policy-driven), and optionally synthesize audio output via capability plugins.
Overview
AbstractCore treats audio as a first-class media type, but audio handling is intentionally policy-driven: by default, attaching audio to an LLM request fails loudly unless you opt into an explicit fallback.
- Audio input (speech-to-text): set audio_policy="speech_to_text" to inject a transcript into the prompt
- Auto policy: audio_policy="auto" uses native audio when supported, otherwise STT when configured (see the sketch after this list)
- Audio output (text-to-speech): use llm.voice.tts(...) or generate_with_outputs(...) (optional)
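As a quick sketch of the auto policy (the model name and audio file path are placeholders, not requirements): with audio_policy="auto", AbstractCore passes the audio natively when the provider/model supports it and falls back to the configured speech-to-text path otherwise.
from abstractcore import create_llm

# "auto" prefers native audio input; if the model cannot accept audio,
# the configured speech-to-text fallback is used instead.
llm = create_llm("openai", model="gpt-4o-mini")
resp = llm.generate(
    "What decisions were made in this recording?",
    media=["./call.wav"],
    audio_policy="auto",
)
print(resp.content)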
Capability backends are provided by optional plugins (for example abstractvoice) and are discovered lazily at runtime.
This keeps pip install abstractcore lightweight.
Install
# Core + capability plugin (recommended)
pip install abstractcore abstractvoice
# Plus any provider you use
pip install "abstractcore[openai]" # or "abstractcore[anthropic]" / "abstractcore[huggingface]" / "abstractcore[mlx]" / "abstractcore[vllm]"
# If you also want images/PDF/Office docs
pip install "abstractcore[media]"
Speech-to-text in generate()
Attach an audio file via media=[...] and opt into the STT fallback policy.
AbstractCore transcribes the audio (via the audio capability plugin), removes the audio from the provider-native media path,
and injects a short transcript block into your prompt. This works with any provider/model because the transcript becomes plain text.
from abstractcore import create_llm
llm = create_llm("openai", model="gpt-4o-mini")
resp = llm.generate(
"Summarize the key decisions from this call.",
media=["./call.wav"],
audio_policy="speech_to_text",
audio_language="en", # optional (also supports stt_language=...)
)
print(resp.content)
print(resp.metadata.get("media_enrichment"))
Transparency metadata (media_enrichment)
When AbstractCore converts a non-text input (image/audio/video) into injected text context, it records what happened in
response.metadata.media_enrichment (for example: policy used, modality, and injected transcript size).
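Continuing from the generate() example above, a minimal sketch of checking that record; its exact structure is version-dependent, so treat it as an opaque value rather than a guaranteed schema:
# Absent when no image/audio/video was converted into injected text.
enrichment = resp.metadata.get("media_enrichment")
if enrichment:
    print("Media was converted into injected text context:", enrichment)
else:
    print("No media enrichment was applied to this request.")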
Default policy via centralized config
If you attach audio frequently, you can set a default policy in ~/.abstractcore/config/abstractcore.json
(see Centralized Configuration).
{
"audio": {
"strategy": "speech_to_text",
"stt_language": "en"
}
}
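With that default in place, the per-call audio_policy argument can usually be omitted; a minimal sketch, assuming the centralized default is picked up when no explicit policy is passed (model name and file path are placeholders):
from abstractcore import create_llm

# The audio defaults from abstractcore.json apply, so no audio_policy= is passed here.
llm = create_llm("openai", model="gpt-4o-mini")
resp = llm.generate("Summarize this call.", media=["./call.wav"])
print(resp.content)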
Direct transcription (no LLM call)
You can also use the capability surface directly:
from abstractcore import create_llm
core = create_llm("openai", model="gpt-4o-mini")
text = core.audio.transcribe("./call.wav", language="en")
print(text)
Text-to-speech output (optional)
If a voice backend is available (for example via abstractvoice), you can synthesize audio from text:
from abstractcore import create_llm
core = create_llm("openai", model="gpt-4o-mini")
wav_bytes = core.voice.tts("Hello from AbstractCore.", format="wav")
print(len(wav_bytes))
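The return value is raw audio bytes, so persisting it is plain Python file I/O (the output path below is arbitrary):
# Write the synthesized audio to disk.
with open("hello.wav", "wb") as f:
    f.write(wav_bytes)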
Or use the convenience wrapper generate_with_outputs(...), which generates the text and then runs TTS as an explicit second step:
from abstractcore import create_llm
core = create_llm("openai", model="gpt-4o-mini")
result = core.generate_with_outputs(
"Write a 1-sentence welcome message.",
outputs={"tts": {"format": "wav"}},
)
print(result.response.content)
wav_bytes = result.outputs["tts"]
HTTP server: OpenAI-compatible audio endpoints
The AbstractCore server exposes OpenAI-compatible audio endpoints backed by capability plugins:
- POST /v1/audio/transcriptions (multipart form; speech-to-text)
- POST /v1/audio/speech (JSON; text-to-speech)
pip install "abstractcore[server]" abstractvoice
python -m abstractcore.server.app
# STT (requires: pip install abstractvoice; python-multipart is included in abstractcore[server])
curl -X POST http://localhost:8000/v1/audio/transcriptions \
-F file=@./call.wav \
-F language=en
# TTS (requires: pip install abstractvoice on the server)
curl -X POST http://localhost:8000/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{"input":"Hello from AbstractCore","format":"wav"}' \
--output out.wav
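The same calls from Python, mirroring the curl commands above with the requests library; the server address and file paths are placeholders, and the transcription response is assumed to be JSON as in the OpenAI API:
import requests

BASE = "http://localhost:8000"

# Speech-to-text: multipart form upload (same fields as the curl -F flags above).
with open("./call.wav", "rb") as f:
    stt = requests.post(
        f"{BASE}/v1/audio/transcriptions",
        files={"file": f},
        data={"language": "en"},
    )
print(stt.json())

# Text-to-speech: JSON body in, audio bytes out.
tts = requests.post(
    f"{BASE}/v1/audio/speech",
    json={"input": "Hello from AbstractCore", "format": "wav"},
)
with open("out.wav", "wb") as f:
    f.write(tts.content)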
See HTTP Server Guide for full setup and deployment notes.