Media Handling System
AbstractCore provides a production-ready unified media handling system that enables seamless file attachment and processing across all LLM providers and models. The system automatically processes images, documents, and other media files using the same simple API, with intelligent provider-specific formatting and graceful fallback handling.
Table of Contents
- Key Benefits
- Quick Start
- How It Works Behind the Scenes
- Supported File Types
- Provider Compatibility
- Usage Examples
- Advanced Features
- Recommended Practices
- Model-Specific Examples
- Installation
- Troubleshooting
- API Reference
- Next Steps
Key Benefits
- Universal API: The same media=[] parameter works across all providers (OpenAI, Anthropic, Ollama, LMStudio, etc.)
- Intelligent Processing: Automatic file type detection with specialized processors for each format
- Provider Adaptation: Automatic formatting for each provider's API requirements (OpenAI content parts, Anthropic Messages API blocks, plain-text embedding for local models, etc.)
- Robust Fallback: Graceful degradation when advanced processing fails, always providing meaningful results
- CLI Integration: Simple @filename syntax in the CLI for instant file attachment
- Production Quality: Comprehensive error handling, logging, and performance optimization
- Cross-Format Support: Images, PDFs, Office documents, CSV/TSV, and text files all work seamlessly
Quick Start
from abstractcore import create_llm
# Works with any provider - just change the provider name
llm = create_llm("openai", model="gpt-4o", api_key="your-key")
response = llm.generate(
"What's in this image and document?",
media=["photo.jpg", "report.pdf"]
)
print(response.content)
# Same code works with Anthropic
llm = create_llm("anthropic", model="claude-3.5-sonnet", api_key="your-key")
response = llm.generate(
"Analyze these materials",
media=["chart.png", "data.csv", "presentation.ppt"]
)
# Or with local models
llm = create_llm("ollama", model="qwen2.5vl:7b")
response = llm.generate(
"Describe this image",
media=["screenshot.png"]
)
How It Works Behind the Scenes
AbstractCore's media system uses a multi-layer architecture that processes each supported file type and formats the result correctly for each LLM provider:
1. File Attachment Processing
CLI Integration (@filename syntax):
# User types: "Analyze this @report.pdf and @chart.png"
# MessagePreprocessor extracts files and cleans text:
clean_text = "Analyze this and" # File references removed
media_files = ["report.pdf", "chart.png"] # Extracted file paths
Python API:
# Direct media parameter usage
llm.generate("Analyze these files", media=["report.pdf", "chart.png"])
2. Intelligent File Processing Pipeline
AutoMediaHandler Coordination:
# 1. Detect file types automatically
MediaType.IMAGE -> ImageProcessor (PIL-based)
MediaType.DOCUMENT -> PDFProcessor (PyMuPDF4LLM) or OfficeProcessor (Unstructured)
MediaType.TEXT -> TextProcessor (pandas for CSV/TSV)
# 2. Process each file with specialized processor
pdf_content = PDFProcessor.process("report.pdf") # → Markdown text
image_content = ImageProcessor.process("chart.png") # → Base64 + metadata
Graceful Fallback System:
try:
# Advanced processing (PyMuPDF4LLM, Unstructured)
content = advanced_processor.process(file)
except Exception:
# Always falls back to basic processing
content = basic_text_extraction(file) # Never fails
3. Provider-Specific Formatting
The same processed content gets formatted differently for each provider:
OpenAI Format (JSON):
{
"role": "user",
"content": [
{"type": "text", "text": "Analyze these files"},
{"type": "image_url", "image_url": {"url": "data:image/png;base64,iVBORw0..."}},
{"type": "text", "text": "PDF Content: # Report Title\n\nExecutive Summary..."}
]
}
Anthropic Format (Messages API):
{
"role": "user",
"content": [
{"type": "text", "text": "Analyze these files"},
{"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": "iVBORw0..."}},
{"type": "text", "text": "PDF Content: # Report Title\n\nExecutive Summary..."}
]
}
Local Models (Text Embedding):
# For local models without native multimodal support
combined_prompt = """
Analyze these files:
Image Analysis: [A business chart showing quarterly revenue trends...]
PDF Content: # Report Title
Executive Summary...
"""
4. Cross-Provider Workflow
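The same prompt and media list run unchanged across providers; only the provider-specific formatting step differs. A minimal sketch (the model names and file names are illustrative):
from abstractcore import create_llm

files = ["report.pdf", "chart.png"]
prompt = "Summarize the report and relate it to the chart"

for provider, model in [("openai", "gpt-4o"),
                        ("anthropic", "claude-3.5-sonnet"),
                        ("ollama", "qwen2.5vl:7b")]:
    llm = create_llm(provider, model=model)
    # Identical call; AbstractCore formats the media for each provider's API.
    response = llm.generate(prompt, media=files)
    print(provider, "->", response.content)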
5. Error Handling & Resilience
Multi-Level Fallback Strategy:
1. Advanced Processing: Try specialized libraries (PyMuPDF4LLM, Unstructured)
2. Basic Processing: Fall back to simple text extraction
3. Metadata Only: If all else fails, provide file metadata
4. Graceful Degradation: Best-effort results with clear errors (no silent semantic changes)
Example of Robust Error Handling:
try:
# Try advanced PDF processing with PyMuPDF4LLM
content = pdf_processor.extract_with_formatting(file)
except PDFProcessingError:
try:
# Fall back to basic text extraction
content = pdf_processor.extract_basic_text(file)
except Exception:
# Ultimate fallback - provide metadata
content = f"PDF file: {file.name} ({file.size} bytes)"
# Result: Callers get a best-effort output or a clear error message (no silent truncation).
Supported File Types
Images (Vision Models)
- Formats: PNG, JPEG, GIF, WEBP, BMP, TIFF
- Automatic: Optimization, resizing, format conversion
- Features: EXIF handling, quality optimization for vision models
Documents
- Text Files: TXT, MD, CSV, TSV, JSON with intelligent parsing and data analysis
- PDF: Text extraction with PyMuPDF4LLM (when installed), with best-effort structure preservation
- Office: DOCX, XLSX, PPTX via Unstructured (when installed), with best-effort extraction
- Word: section/paragraph extraction
- Excel: sheet-by-sheet extraction
- PowerPoint: slide-by-slide extraction
Audio (policy-driven; optional STT fallback)
- Formats: common audio/* types (WAV, MP3, M4A, …) as attachments via media=[...]
- Default behavior: audio_policy="native_only" (fails loudly unless the model supports native audio input)
- Speech-to-text: audio_policy="speech_to_text" runs STT via the capability plugin layer (llm.audio.transcribe(...); typically install abstractvoice) and injects a transcript into the main request (see the sketch after the requirements below)
- Auto: audio_policy="auto" uses native audio when supported, otherwise STT when configured, otherwise errors
- Reserved: audio_policy="caption" is not configured in v0 (must error; non-speech audio analysis needs an explicit capability)
Transparency:
- When STT fallback is used, GenerateResponse.metadata.media_enrichment[] records what was injected and which backend was used.
Requirements:
- Native audio requires an audio-capable model.
- STT fallback requires installing an STT capability plugin (typically pip install abstractvoice) and using audio_policy="auto"/"speech_to_text" (or setting a default via abstractcore --set-audio-strategy ...).
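A minimal sketch of the speech-to-text path, assuming a recording named meeting.mp3 (hypothetical file), the abstractvoice STT plugin installed, and audio_policy accepted as a generate() keyword (a default can also be set via abstractcore --set-audio-strategy ...):
from abstractcore import create_llm

llm = create_llm("openai", model="gpt-4o")
response = llm.generate(
    "Summarize the key decisions from this call",
    media=["meeting.mp3"],
    audio_policy="speech_to_text",  # transcribe via the STT plugin, then answer from the transcript
)
print(response.content)
# If the STT fallback ran, metadata.media_enrichment[] records the injected
# transcript and which backend produced it.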
Video (policy-driven; native or frames fallback)
- Formats: common video/* types as attachments via media=[...]
- Default behavior: video_policy="auto" (native video when supported; otherwise sample frames and route through the vision pipeline; see the sketch after the requirements below)
- Budgets: frame count and downscale are explicit and logged (see abstractcore/providers/base.py)
Requirements:
- Frame sampling fallback requires ffmpeg/ffprobe available on PATH.
- For the sampled-frame path, you also need image/vision handling: either a vision-capable main model or configured vision fallback, and (for local frame attachments) pip install "abstractcore[media]" so Pillow-based image processing is available.
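A minimal sketch of the video path, assuming a clip named demo.mp4 (hypothetical file), ffmpeg/ffprobe on PATH for the sampled-frame fallback, and video_policy accepted as a generate() keyword; the model below is vision-capable, so sampled frames can be analyzed directly:
from abstractcore import create_llm

llm = create_llm("ollama", model="qwen2.5vl:7b")
response = llm.generate(
    "Describe what happens in this clip",
    media=["demo.mp4"],
    video_policy="auto",  # native video when supported, otherwise sampled frames via the vision pipeline
)
print(response.content)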
Processing Features
- Intelligent Detection: Automatic file type recognition and processor selection
- Content Optimization: Format-specific processing optimized for LLM consumption
- Robust Fallback: Graceful degradation ensures users always get meaningful results
- Performance Optimized: Lazy loading and efficient memory usage
- Testing status: Coverage varies by provider and modality; see the test suite under tests/media_handling/
Token Estimation & No Truncation Policy
AbstractCore processors do not silently truncate content. This design decision ensures:
- No data loss: Full file content is always preserved
- User control: Callers decide how to handle large files (summarize, chunk, error)
- Model flexibility: Works correctly across models with different context limits (8K to 200K+)
Token estimation is automatically added to MediaContent.metadata:
result = processor.process_file("data.csv")
print(result.media_content.metadata['estimated_tokens']) # e.g., 1500
print(result.media_content.metadata['content_length']) # e.g., 6000 chars
Handlers use this for validation:
handler = OpenAIMediaHandler()
tokens = handler.estimate_tokens_for_media(media_content)
# Uses metadata['estimated_tokens'] if available, falls back to heuristic
For large files that exceed model context limits, use BasicSummarizer or implement custom chunking at the application layer.
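A minimal application-layer sketch, using only the process_file API and the estimated_tokens metadata shown above; the 8,000-token budget and character-based chunk size are illustrative assumptions, not library defaults:
from abstractcore import create_llm
from abstractcore.media import process_file

llm = create_llm("openai", model="gpt-4o")
result = process_file("large_report.pdf")  # hypothetical file

if result.success and result.media_content.metadata.get("estimated_tokens", 0) > 8000:
    # Too large for the target context: summarize chunk by chunk, then merge.
    text = result.media_content.content
    chunk_size = 24_000  # characters per chunk (very roughly 6K tokens)
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    partials = [llm.generate("Summarize this excerpt:\n\n" + c).content for c in chunks]
    response = llm.generate("Merge these partial summaries into one:\n\n" + "\n\n".join(partials))
else:
    # Small enough: attach the file directly.
    response = llm.generate("Summarize this document", media=["large_report.pdf"])

print(response.content)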
Provider Compatibility
Vision-Enabled Providers
| Provider | Vision Models | Image Support | Document Support |
|---|---|---|---|
| OpenAI | GPT-4o, GPT-4 Turbo with Vision | Supported: Multi-image | Supported: All formats |
| Anthropic | Claude 3.5 Sonnet, Claude 4 series | Supported: Up to 20 images | Supported: All formats |
| Ollama | qwen2.5vl:7b, gemma3:4b, llama3.2-vision:11b | Supported: Single image | Supported: All formats |
| LMStudio | qwen2.5-vl-7b, gemma-3n-e4b, magistral-small-2509 | Supported: Multiple images | Supported: All formats |
Text-Only Providers
All providers support document processing even without vision capabilities:
| Provider | Document Processing | Text Extraction |
|---|---|---|
| HuggingFace | Supported: All formats | Supported: Embedded in prompt |
| MLX | Supported: All formats | Supported: Embedded in prompt |
| Any Provider | Supported: Automatic fallback | Supported: Text extraction |
⚠️ Model Compatibility Notes (Updated: 2025-10-17)
Some newer vision models may not be immediately available due to rapid development:
LMStudio Limitations:
- qwen3-vl models (8B, 30B) - Not yet supported in LMStudio
- Use qwen2.5-vl-7b as a proven alternative
HuggingFace Limitations:
- Qwen3-VL models - Require newer transformers architecture
- Install latest transformers: pip install --upgrade transformers
- Or use bleeding edge: pip install git+https://github.com/huggingface/transformers.git
Recommended Stable Models (2025-10-17):
- LMStudio: qwen/qwen2.5-vl-7b, google/gemma-3n-e4b, mistralai/magistral-small-2509
- Ollama: qwen2.5vl:7b, gemma3:4b, llama3.2-vision:11b
- OpenAI: gpt-4o, gpt-4-turbo-with-vision
- Anthropic: claude-3.5-sonnet, claude-4-series
Usage Examples
Vision Analysis
from abstractcore import create_llm
# Analyze images with any vision model
llm = create_llm("openai", model="gpt-4o")
# Single image analysis
response = llm.generate(
"What's happening in this image?",
media=["photo.jpg"]
)
# Multiple images comparison
response = llm.generate(
"Compare these two charts and explain the trends",
media=["chart1.png", "chart2.png"]
)
# Mixed media analysis
response = llm.generate(
"Summarize the report and relate it to what you see in the image",
media=["financial_report.pdf", "stock_chart.png"]
)
Document Processing
# PDF analysis
response = llm.generate(
"Summarize the key findings from this research paper",
media=["research_paper.pdf"]
)
# Office document processing
response = llm.generate(
"Create a summary of this presentation and spreadsheet",
media=["quarterly_results.ppt", "financial_data.xlsx"]
)
# CSV data analysis
response = llm.generate(
"What patterns do you see in this sales data?",
media=["sales_data.csv"]
)
CLI Usage
These examples work in the AbstractCore CLI when abstractcore[media] is installed and your selected provider/model supports the requested media (or you have configured fallbacks):
# PDF Analysis - Working
python -m abstractcore.utils.cli --prompt "What is this document about? @report.pdf"
# Office Documents - Working
python -m abstractcore.utils.cli --prompt "Summarize this presentation @slides.pptx"
python -m abstractcore.utils.cli --prompt "What data is in @spreadsheet.xlsx"
python -m abstractcore.utils.cli --prompt "Analyze this document @contract.docx"
# Data Files - Working
python -m abstractcore.utils.cli --prompt "What patterns are in @sales_data.csv"
python -m abstractcore.utils.cli --prompt "Analyze this data @metrics.tsv"
# Images - Working
python -m abstractcore.utils.cli --prompt "What's in this image? @screenshot.png"
# Mixed Media - Working
python -m abstractcore.utils.cli --prompt "Compare @chart.png and @data.csv and explain trends"
Cross-provider semantics (what’s consistent)
# AbstractCore exposes a single `media=[...]` parameter across providers, but behavior
# depends on provider/model capabilities and your media policies.
# Documents (PDF/Office/text/CSV/TSV/...) are extracted to text/metadata and injected into the request.
# This generally works across providers because the final payload is text.
media_files = ["report.pdf", "data.xlsx"]
prompt = "Analyze these documents and provide insights"
# OpenAI
openai_llm = create_llm("openai", model="gpt-4o")
openai_response = openai_llm.generate(prompt, media=media_files)
# Anthropic
anthropic_llm = create_llm("anthropic", model="claude-haiku-4-5")
anthropic_response = anthropic_llm.generate(prompt, media=media_files)
# Image/audio/video inputs are policy-driven and require native support or explicit fallbacks.
# See: docs/vision-capabilities.md and docs/media-handling-system.md (policies + fallbacks).
Streaming with Media
# Real-time streaming responses with media
llm = create_llm("openai", model="gpt-4o") # requires: pip install "abstractcore[openai]"
for chunk in llm.generate(
"Describe this image in detail",
media=["complex_diagram.png"],
stream=True
):
print(chunk.content or "", end="", flush=True)
Advanced Features
Maximum Resolution Optimization (NEW)
AbstractCore automatically optimizes image resolution for each model's maximum capability, ensuring optimal vision results:
from abstractcore import create_llm
# Images are automatically optimized for each model's maximum resolution
llm = create_llm("openai", model="gpt-4o")
response = llm.generate(
"Analyze this image in detail",
media=["photo.jpg"] # Auto-resized to 4096x4096 for GPT-4o
)
# Different model, different optimization
llm = create_llm("ollama", model="qwen2.5vl:7b")
response = llm.generate(
"What's in this image?",
media=["photo.jpg"] # Auto-resized to 3584x3584 for qwen2.5vl
)
Model-Specific Resolution Limits:
- GPT-4o: Up to 4096x4096 pixels
- Claude 3.5 Sonnet: Up to 1568x1568 pixels
- qwen2.5vl:7b: Up to 3584x3584 pixels
- gemma3:4b: Up to 896x896 pixels
- llama3.2-vision:11b: Up to 560x560 pixels
Benefits:
- Better Accuracy: Higher resolution means more detail for the model to analyze
- Automatic: No manual configuration required
- Provider-Aware: Adapts to each provider's optimal settings
- Quality Optimization: Increased JPEG quality (90%) to preserve detail during compression
Capability Detection
The system automatically detects model capabilities and adapts accordingly:
from abstractcore.media.capabilities import is_vision_model, supports_images
# Check if a model supports vision
if is_vision_model("gpt-4o"):
print("This model can process images")
if supports_images("claude-3.5-sonnet"):
print("This model supports image analysis")
# Text-only model + image input is policy-driven
llm = create_llm("openai", model="gpt-4") # text-only example
response = llm.generate(
"Analyze this image",
media=["photo.jpg"], # Errors unless vision fallback is configured; see below.
)
Vision fallback (optional; config-driven)
AbstractCore includes an optional vision fallback that enables text-only models to process images using a transparent two-stage pipeline (caption → inject short observations).
How Vision Fallback Works
When vision fallback is configured and you use a text-only model with images, AbstractCore:
- Detects Model Limitations: Identifies when a text-only model receives an image
- Uses Vision Fallback: Employs a configured vision model to analyze the image
- Provides Description: Passes the image description to the text-only model
- Returns Results: Your text model answers using the injected observations (recorded in metadata.media_enrichment[])
Example
Configure a vision captioner once:
abstractcore --set-vision-provider lmstudio qwen/qwen3-vl-4b
Then use any text model with images:
from abstractcore import create_llm
llm = create_llm("lmstudio", model="qwen/qwen3-next-80b") # text-only
resp = llm.generate("What's in this image?", media=["whale_photo.jpg"])
print(resp.content)
Behind the Scenes
What actually happens (transparent to the user):
1. Stage 1: The configured vision captioner (qwen/qwen3-vl-4b in this example) analyzes whale_photo.jpg → detailed description
2. Stage 2: qwen/qwen3-next-80b (text-only) processes the description + user question → final analysis
Configuration Commands
# Check current status
abstractcore --status
# Download local caption models (optional)
abstractcore --download-vision-model # BLIP base (990MB)
abstractcore --download-vision-model vit-gpt2 # ViT-GPT2 (500MB, CPU-friendly)
abstractcore --download-vision-model git-base # GIT base (400MB, smallest)
# Use an existing vision-capable model as the fallback captioner
abstractcore --set-vision-provider ollama qwen2.5vl:7b
abstractcore --set-vision-provider lmstudio qwen/qwen3-vl-4b
abstractcore --set-vision-provider openai gpt-4o
abstractcore --set-vision-provider anthropic claude-sonnet-4-5
# Interactive setup
abstractcore --config
# Advanced: Fallback chains
abstractcore --add-vision-fallback ollama qwen2.5vl:7b
abstractcore --add-vision-fallback openai gpt-4o
Benefits of Vision Fallback
- Universal Compatibility: Any text-only model can now process images
- Cost Optimization: Use cheaper text models for reasoning, vision models only for description
- Transparent Operation: Users don't need to change their code
- Flexible Configuration: Local models, cloud APIs, or hybrid setups
- Offline-First: Works without internet after downloading local models
- Automatic Fallback: Graceful degradation when vision not configured
Supported Vision Models
Local Models (Downloaded):
- BLIP Base: 990MB, high quality, CPU/GPU compatible
- ViT-GPT2: 500MB, CPU-friendly, good performance
- GIT Base: 400MB, smallest size, basic quality
Provider Models:
- Ollama: qwen2.5vl:7b, llama3.2-vision:11b, gemma3:4b
- LMStudio: qwen/qwen2.5-vl-7b, google/gemma-3n-e4b
- OpenAI: gpt-4o, gpt-4-turbo-with-vision
- Anthropic: claude-3.5-sonnet, claude-4-series
Custom Processing Options
# Advanced image processing
from abstractcore.media.processors import ImageProcessor
processor = ImageProcessor(
optimize_for_vision=True,
max_dimension=1024,
quality=85
)
# Advanced PDF processing
from abstractcore.media.processors import PDFProcessor
pdf_processor = PDFProcessor(
extract_images=True,
markdown_output=True,
preserve_tables=True
)
Direct Media Processing
# Process files directly (without LLM)
from abstractcore.media import process_file
# Process any supported file
result = process_file("document.pdf")
if result.success:
print(f"Content: {result.media_content.content}")
print(f"Type: {result.media_content.media_type}")
print(f"Metadata: {result.media_content.metadata}")
Recommended Practices
File Size and Limits
# Check model-specific limits
from abstractcore.media.capabilities import get_media_capabilities
caps = get_media_capabilities("gpt-4o")
print(f"Max images per message: {caps.max_images}")
print(f"Supported formats: {caps.supported_formats}")
Error Handling
try:
response = llm.generate(
"Analyze this file",
media=["large_document.pdf"]
)
except Exception as e:
print(f"Media processing error: {e}")
# Fallback to text-only processing
response = llm.generate("Analyze the uploaded document content")
Performance Tips
# For large documents, consider chunking
from abstractcore.media.processors import PDFProcessor
processor = PDFProcessor(chunk_size=8000) # Process in chunks
# For multiple images, process in batches
image_files = ["img1.jpg", "img2.jpg", "img3.jpg"]
for batch in [image_files[i:i+3] for i in range(0, len(image_files), 3)]:
response = llm.generate("Analyze these images", media=batch)
Model-Specific Examples
OpenAI GPT-4o
# Multi-image analysis with high detail
llm = create_llm("openai", model="gpt-4o")
response = llm.generate(
"Compare these architectural photos and identify the styles",
media=["building1.jpg", "building2.jpg", "building3.jpg"]
)
Anthropic Claude 3.5 Sonnet
# Document analysis with specialized prompts
llm = create_llm("anthropic", model="claude-3.5-sonnet")
response = llm.generate(
"Provide a comprehensive analysis of this research paper",
media=["academic_paper.pdf"]
)
Local Vision Models
# Ollama with qwen2.5-vl
ollama_llm = create_llm("ollama", model="qwen2.5vl:7b")
response = ollama_llm.generate(
"What objects do you see in this image?",
media=["scene.jpg"]
)
# LMStudio with qwen2.5-vl
lmstudio_llm = create_llm("lmstudio", model="qwen/qwen2.5-vl-7b")
response = lmstudio_llm.generate(
"Describe this chart and its trends",
media=["business_chart.png"]
)
# Ollama with Llama 3.2 Vision
llama_llm = create_llm("ollama", model="llama3.2-vision:11b")
response = llama_llm.generate(
"Analyze this document layout",
media=["document.jpg"]
)
Installation
Basic Installation
# Core media handling (images, text, basic documents)
pip install "abstractcore[media]"
Full Installation
# Media features (PDF + Office docs) are covered by `abstractcore[media]`.
# If you want the full framework install (providers + tools + server + docs), pick one:
pip install "abstractcore[all-apple]" # macOS/Apple Silicon (includes MLX, excludes vLLM)
pip install "abstractcore[all-non-mlx]" # Linux/Windows/Intel Mac (excludes MLX and vLLM)
pip install "abstractcore[all-gpu]" # Linux NVIDIA GPU (includes vLLM, excludes MLX)
Advanced: If you prefer to install only the pieces you need (instead of abstractcore[media]),
these are the main libraries AbstractCore uses:
- Pillow (images)
- pymupdf4llm + pymupdf-layout (PDF extraction)
- unstructured[docx,pptx,xlsx,odt,rtf] (Office docs)
- pandas (tabular helpers)
Troubleshooting
Common Issues
Media not processed:
# Check if media dependencies are installed
try:
response = llm.generate("Test", media=["test.jpg"])
except ImportError as e:
print(f"Missing dependency: {e}")
print('Install with: pip install "abstractcore[media]"')
Vision model not detecting images:
# Verify model capabilities
from abstractcore.media.capabilities import is_vision_model
if not is_vision_model("your-model"):
print("This model doesn't support vision")
print("Try: gpt-4o, claude-3.5-sonnet, qwen2.5vl:7b, or llama3.2-vision:11b")
Large file processing:
# For large files, check size limits
import os
file_size = os.path.getsize("large_file.pdf")
if file_size > 10 * 1024 * 1024: # 10MB
print("File may be too large for some providers")
Validation
# Test your installation
python validate_media_system.py
# Run comprehensive tests
python -m pytest tests/media_handling/ -v
API Reference
Core Functions
# Main generation with media
llm.generate(prompt, media=files, **kwargs)
# Direct file processing
from abstractcore.media import process_file
result = process_file(file_path)
# Capability detection
from abstractcore.media.capabilities import (
is_vision_model,
supports_images,
get_media_capabilities
)
Media Types
from abstractcore.media.types import MediaType, ContentFormat
# MediaType.IMAGE, MediaType.DOCUMENT, MediaType.TEXT
# ContentFormat.BASE64, ContentFormat.TEXT, ContentFormat.BINARY
Processors
from abstractcore.media.processors import (
ImageProcessor, # Images with PIL
TextProcessor, # Text, CSV, JSON with pandas
PDFProcessor, # PDFs with PyMuPDF4LLM
OfficeProcessor # DOCX, XLSX, PPT with unstructured
)
Next Steps
- Getting Started Guide - Complete AbstractCore tutorial
- API Reference - Full Python API documentation
- Glyph + Vision Example - End-to-end document analysis with a vision model
- Supported Formats Utility - Inspect available processors and supported formats
The media handling system makes AbstractCore multimodal while maintaining the same "write once, run everywhere" philosophy. Focus on your application logic while AbstractCore handles the complexity of different provider APIs and media formats.