Vision Capabilities
Seamless image analysis across all AI providers with automatic optimization and intelligent fallback for text-only models.
Overview
This page describes vision as an input modality in AbstractCore (images + video understanding), and clarifies how it relates to
vision fallback (caption → inject short observations) and generative vision (image/video creation), which lives in
the optional `abstractvision` capability plugin.
Key Features
- Cross-Provider Consistency - Same code works identically across cloud and local providers
- Automatic Optimization - Images automatically resized for each model's maximum capability
- Vision Fallback - Text-only models can process images through transparent two-stage pipeline
- Multi-Image Support - Analyze and compare multiple images simultaneously
- Format Flexibility - PNG, JPEG, GIF, WEBP, BMP, TIFF all supported
This page focuses on vision input. For generative vision output, see Capabilities and Server.
1) Image/video input modalities (owned by AbstractCore)
Attach images (and videos) to an LLM call using `media=[...]`:

```python
from abstractcore import create_llm

llm = create_llm("openai", model="gpt-4o-mini")  # example; pick a vision-capable model you have access to

# Image input
resp = llm.generate("What is in this image?", media=["photo.jpg"])
print(resp.content)

# Video input (policy-driven)
resp = llm.generate(
    "Summarize what happens in this clip.",
    media=["clip.mp4"],
    video_policy="auto",  # native when supported; otherwise sample frames
)
print(resp.content)
```
2) Vision fallback for text-only models (optional; config-driven)
When a user attaches an image to a text-only model, AbstractCore can optionally run a configured vision backend to produce short grounded observations
and inject them into the main request. This is explicit (config-driven) and transparent via `response.metadata.media_enrichment[]`.
See Media Handling for policy details and enrichment metadata.
3) Generative vision output is not part of AbstractCore’s default install
Creating/editing images and videos is a deterministic capability provided by `abstractvision`. You can integrate it in two ways:
- Capability plugin (library mode): install `abstractvision` and use `llm.vision.*`
- HTTP interop: run the server and enable `/v1/images/*` endpoints delegated to `abstractvision`

See Server for the optional `/v1/images/generations` and `/v1/images/edits` endpoints.
Supported Providers and Models
Cloud Providers
- OpenAI: GPT-4o, GPT-4o mini (multiple images)
- Anthropic: Claude Haiku 4.5, Claude Sonnet 4.5 (vision-capable models)
Local Providers
- Ollama: qwen2.5vl:7b, llama3.2-vision:11b, gemma3:4b
- LMStudio: qwen/qwen2.5-vl-7b, google/gemma-3n-e4b
- HuggingFace: Qwen2.5-VL variants, LLaVA models
- MLX: Vision models via MLX framework
Image Formats
PNG, JPEG, GIF, WEBP, BMP, TIFF with automatic optimization
Basic Vision Analysis
```python
from abstractcore import create_llm

# Works with any vision-capable provider
llm = create_llm("openai", model="gpt-4o")

# Single image analysis
response = llm.generate(
    "What objects do you see in this image?",
    media=["photo.jpg"]
)

# Multiple images comparison
response = llm.generate(
    "Compare these architectural styles and identify differences",
    media=["building1.jpg", "building2.jpg", "building3.jpg"]
)
```
Cross-Provider Consistency
The same code works across all providers:
```python
# Same code works across all providers
image_files = ["chart.png", "document.pdf"]
prompt = "Analyze the data in these files"

# All work identically
openai_response = create_llm("openai", model="gpt-4o").generate(prompt, media=image_files)
anthropic_response = create_llm("anthropic", model="claude-haiku-4-5").generate(prompt, media=image_files)
ollama_response = create_llm("ollama", model="qwen2.5vl:7b").generate(prompt, media=image_files)
```
Vision Fallback System
The Vision Fallback System enables text-only models to process images through a transparent two-stage pipeline. This is particularly useful when you want to use a powerful text model that doesn't have native vision capabilities.
Configuration (One-Time Setup)
```shell
# Option 1: Download local vision model (recommended)
abstractcore --download-vision-model

# Option 2: Use existing Ollama model
abstractcore --set-vision-provider ollama qwen2.5vl:7b

# Option 3: Use cloud API
abstractcore --set-vision-provider openai gpt-4o

# Disable vision fallback
abstractcore --disable-vision
```
Using Vision Fallback
```python
# After configuration, text-only models can process images seamlessly
text_llm = create_llm("lmstudio", model="qwen/qwen3-next-80b")  # No native vision

response = text_llm.generate(
    "What's happening in this image?",
    media=["complex_scene.jpg"]
)
# Works transparently: vision model analyzes image → text model processes description
```
How Vision Fallback Works
1. You send a request to a text-only model with an image
2. AbstractCore detects the model lacks native vision capabilities
3. The image is automatically sent to your configured vision model
4. The vision model generates a detailed description
5. The description is passed to your text model along with your original prompt
6. You receive a response combining vision analysis with text processing
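The steps above amount to a two-stage composition. The sketch below illustrates the flow; `caption_fn` and `text_fn` are stand-ins for the configured vision backend and your text model, not AbstractCore API:

```python
def vision_fallback(prompt, image, caption_fn, text_fn):
    """Two-stage sketch: caption the image, then answer with the text model.

    caption_fn / text_fn are illustrative stand-ins, not AbstractCore API.
    """
    caption = caption_fn(image)  # stage 1: vision model describes the image
    combined = f"{prompt}\n\n[Image observation: {caption}]"
    return text_fn(combined)     # stage 2: text model answers with the injected context

# Stub example with plain functions in place of models
reply = vision_fallback(
    "What's happening?",
    "scene.jpg",
    caption_fn=lambda img: "a dog chasing a ball",
    text_fn=lambda p: f"Answering based on: {p.splitlines()[-1]}",
)
print(reply)  # Answering based on: [Image observation: a dog chasing a ball]
```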
No silent drops
If you attach images to a text-only model and vision fallback isn't configured, AbstractCore won't silently ignore your files.
It injects a small placeholder (for example `[Image 1: ...]`) and records what happened in `response.metadata.media_enrichment`.
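If you want to detect programmatically that a placeholder was injected rather than a real caption, you can inspect that metadata. The field names below (`media_enrichment`, `status`) are assumptions for illustration; check `response.metadata` from a real call for the exact schema:

```python
def was_placeholder_injected(metadata: dict) -> bool:
    """True if any attached media item fell back to a placeholder.

    The 'media_enrichment' / 'status' field names are illustrative assumptions.
    """
    return any(
        item.get("status") == "placeholder"
        for item in metadata.get("media_enrichment", [])
    )

# Hypothetical metadata from a text-only model with no fallback configured
meta = {"media_enrichment": [{"file": "photo.jpg", "status": "placeholder"}]}
print(was_placeholder_injected(meta))  # True
```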
Automatic Resolution Optimization
AbstractCore automatically optimizes images for each model's maximum capability:
```python
# Images automatically optimized per model
llm = create_llm("openai", model="gpt-4o")
response = llm.generate("Analyze this", media=["photo.jpg"])  # Auto-resized to 4096×4096

llm = create_llm("ollama", model="qwen2.5vl:7b")
response = llm.generate("Analyze this", media=["photo.jpg"])  # Auto-resized to 3584×3584
```
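The resizing step boils down to an aspect-ratio-preserving fit within the model's maximum side length. The helper below sketches that arithmetic only; it is not AbstractCore's internal code:

```python
def fit_within(width: int, height: int, max_side: int) -> tuple:
    """Scale (width, height) down to fit within max_side x max_side,
    preserving aspect ratio. Images already small enough are untouched."""
    scale = min(1.0, max_side / max(width, height))
    return (round(width * scale), round(height * scale))

print(fit_within(8000, 6000, 4096))  # (4096, 3072)
print(fit_within(800, 600, 4096))    # (800, 600) -- no upscaling
```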
Structured Vision Analysis
```python
import json

# Get structured responses with specific requirements
llm = create_llm("openai", model="gpt-4o")
response = llm.generate("""
Analyze this image and provide:
- objects: list of objects detected
- colors: dominant colors
- setting: location/environment
- activities: what's happening
Format as JSON.
""", media=["scene.jpg"])

analysis = json.loads(response.content)
```
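In practice, models sometimes wrap JSON in markdown fences even when asked for raw JSON, which makes a bare `json.loads` fail. A small defensive parser (not part of AbstractCore) is worth having:

```python
import json
import re

def parse_json_reply(text: str):
    """Parse JSON from a model reply, tolerating ```json fences around it."""
    match = re.search(r"```(?:json)?\s*(.*?)\s*```", text, re.DOTALL)
    payload = match.group(1) if match else text
    return json.loads(payload)

print(parse_json_reply('```json\n{"objects": ["tree"]}\n```'))  # {'objects': ['tree']}
print(parse_json_reply('{"objects": ["tree"]}'))                # {'objects': ['tree']}
```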
Multi-Image Analysis
Comparison Tasks
```python
# Compare multiple images
llm = create_llm("anthropic", model="claude-haiku-4-5")

response = llm.generate(
    "Compare these three architectural designs and identify common elements",
    media=["design_a.jpg", "design_b.jpg", "design_c.jpg"]
)
```
Sequential Analysis
```python
# Analyze sequence of images
response = llm.generate(
    "Describe the progression shown in these time-lapse images",
    media=["hour1.jpg", "hour2.jpg", "hour3.jpg", "hour4.jpg"]
)
```
Common Use Cases
Document OCR and Analysis
```python
# Extract text from images
response = llm.generate(
    "Extract all text from this image and organize it",
    media=["receipt.jpg"]
)

# Handwriting recognition
response = llm.generate(
    "Transcribe the handwritten notes in this image",
    media=["notes.jpg"]
)
```
Chart and Graph Analysis
```python
# Analyze visualizations
response = llm.generate(
    "What trends do you see in this chart? Provide the data points.",
    media=["sales_chart.png"]
)

# Compare visualizations
response = llm.generate(
    "Compare the trends shown in these two charts",
    media=["chart1.png", "chart2.png"]
)
```
Quality Control and Inspection
```python
# Defect detection
response = llm.generate(
    "Identify any defects or anomalies in this product image",
    media=["product_photo.jpg"]
)

# Comparison with standard
response = llm.generate(
    "Compare this sample with the reference and note any differences",
    media=["sample.jpg", "reference.jpg"]
)
```
Scene Understanding
```python
# Detailed scene analysis
response = llm.generate(
    """Analyze this scene and provide:
    - Objects present
    - Activities occurring
    - Environmental conditions
    - Any safety concerns""",
    media=["scene.jpg"]
)
```
CLI Integration
Vision capabilities work seamlessly with the CLI using @filename syntax:
```shell
# Basic image analysis
python -m abstractcore.utils.cli --prompt "Describe this image @photo.jpg"

# Multiple images
python -m abstractcore.utils.cli --prompt "Compare @before.jpg and @after.jpg"

# Mixed media
python -m abstractcore.utils.cli --prompt "Verify the chart @chart.png matches @data.csv"
```
Best Practices
- Image Quality - Use high-quality images for better analysis accuracy
- Clear Prompts - Be specific about what you want to extract from images
- Resolution - Let AbstractCore handle optimization; don't pre-resize
- Multiple Images - Limit to 10-15 images per request for best performance
- Vision Fallback - Use local vision models for privacy-sensitive images
- Format - JPEG for photos, PNG for screenshots/diagrams
Error Handling
AbstractCore provides robust error handling for vision tasks:
- Unsupported Format - Clear error with supported format list
- Image Too Large - Automatic resizing with warning
- Corrupted Image - Graceful error with fallback attempt
- Vision Model Unavailable - Falls back to text-only with clear message
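A cheap pre-flight extension check can surface unsupported formats before any request is sent. The helper below is illustrative, not an AbstractCore API; the extension set mirrors the formats listed above:

```python
from pathlib import Path

# Mirrors the supported formats listed in this page
SUPPORTED = {".png", ".jpg", ".jpeg", ".gif", ".webp", ".bmp", ".tiff", ".tif"}

def check_media(paths):
    """Split file paths into (supported, unsupported) by extension."""
    ok, bad = [], []
    for p in paths:
        (ok if Path(p).suffix.lower() in SUPPORTED else bad).append(p)
    return ok, bad

print(check_media(["photo.jpg", "scan.heic"]))  # (['photo.jpg'], ['scan.heic'])
```

Running this before `llm.generate(...)` lets you raise a clear local error (or drop the offending file) instead of waiting for a provider-side rejection.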