Vision Capabilities
Seamless image analysis across all AI providers with automatic optimization and intelligent fallback for text-only models.
Overview
AbstractCore provides comprehensive vision capabilities for analyzing images across multiple AI providers and models. It automatically handles image optimization, provider-specific formatting, and intelligent fallback for models without native vision support.
Key Features
- Cross-Provider Consistency - Same code works identically across cloud and local providers
- Automatic Optimization - Images automatically resized for each model's maximum capability
- Vision Fallback - Text-only models can process images through transparent two-stage pipeline
- Multi-Image Support - Analyze and compare multiple images simultaneously
- Format Flexibility - PNG, JPEG, GIF, WEBP, BMP, TIFF all supported
Supported Providers and Models
Cloud Providers
- OpenAI: GPT-4o, GPT-4 Turbo Vision (multiple images, up to 4096×4096)
- Anthropic: Claude 3.5 Sonnet, Claude 3 Haiku (up to 20 images, 1568×1568)
Local Providers
- Ollama: qwen2.5vl:7b, llama3.2-vision:11b, gemma3:4b
- LMStudio: qwen/qwen2.5-vl-7b, google/gemma-3n-e4b
- HuggingFace: Qwen2.5-VL variants, LLaVA models
- MLX: Vision models via MLX framework
Image Formats
PNG, JPEG, GIF, WEBP, BMP, and TIFF are all supported, with automatic optimization applied to each image.
Basic Vision Analysis
from abstractcore import create_llm
# Works with any vision-capable provider
llm = create_llm("openai", model="gpt-4o")
# Single image analysis
response = llm.generate(
    "What objects do you see in this image?",
    media=["photo.jpg"]
)
# Multiple images comparison
response = llm.generate(
    "Compare these architectural styles and identify differences",
    media=["building1.jpg", "building2.jpg", "building3.jpg"]
)
Cross-Provider Consistency
The same code works across all providers:
# Same code works across all providers
media_files = ["chart.png", "document.pdf"]
prompt = "Analyze the data in these files"
# All work identically
openai_response = create_llm("openai", model="gpt-4o").generate(prompt, media=media_files)
anthropic_response = create_llm("anthropic", model="claude-3.5-sonnet").generate(prompt, media=media_files)
ollama_response = create_llm("ollama", model="qwen2.5vl:7b").generate(prompt, media=media_files)
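The same pattern also lends itself to a simple loop. The sketch below reuses the provider and model names from the example above and relies only on create_llm and generate; nothing else is assumed.
from abstractcore import create_llm
media_files = ["chart.png", "document.pdf"]
prompt = "Analyze the data in these files"
# Run the same prompt against several providers in a loop
provider_models = [
    ("openai", "gpt-4o"),
    ("anthropic", "claude-3.5-sonnet"),
    ("ollama", "qwen2.5vl:7b"),
]
results = {}
for provider, model in provider_models:
    llm = create_llm(provider, model=model)
    results[provider] = llm.generate(prompt, media=media_files).content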
Vision Fallback System
The Vision Fallback System enables text-only models to process images through a transparent two-stage pipeline. This is particularly useful when you want to use a powerful text model that doesn't have native vision capabilities.
Configuration (One-Time Setup)
# Option 1: Download local vision model (recommended)
abstractcore --download-vision-model
# Option 2: Use existing Ollama model
abstractcore --set-vision-caption qwen2.5vl:7b
# Option 3: Use cloud API
abstractcore --set-vision-provider openai --model gpt-4o
# Disable vision fallback
abstractcore --disable-vision
Using Vision Fallback
# After configuration, text-only models can process images seamlessly
text_llm = create_llm("lmstudio", model="qwen/qwen3-next-80b") # No native vision
response = text_llm.generate(
    "What's happening in this image?",
    media=["complex_scene.jpg"]
)
# Works transparently: vision model analyzes image → text model processes description
How Vision Fallback Works
1. You send a request to a text-only model with an image.
2. AbstractCore detects the model lacks native vision capabilities.
3. The image is automatically sent to your configured vision model.
4. The vision model generates a detailed description.
5. The description is passed to your text model along with your original prompt.
6. You receive a response combining vision analysis with text processing.
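Conceptually, the fallback behaves like the two explicit calls below. This is an illustration of the pipeline rather than AbstractCore's internal implementation, and the model names are simply the ones used earlier; in practice a single generate call on the text model is all you write.
# Illustration of the two-stage fallback (handled automatically by AbstractCore)
from abstractcore import create_llm
vision_llm = create_llm("ollama", model="qwen2.5vl:7b")          # configured caption model
text_llm = create_llm("lmstudio", model="qwen/qwen3-next-80b")   # text-only model
# Stage 1: the vision model produces a detailed description of the image
description = vision_llm.generate(
    "Describe this image in detail.",
    media=["complex_scene.jpg"]
).content
# Stage 2: the text model answers the original prompt using that description
response = text_llm.generate(
    f"Image description:\n{description}\n\nWhat's happening in this image?"
)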
Automatic Resolution Optimization
AbstractCore automatically optimizes images for each model's maximum capability:
# Images automatically optimized per model
llm = create_llm("openai", model="gpt-4o")
response = llm.generate("Analyze this", media=["photo.jpg"]) # Auto-resized to 4096×4096
llm = create_llm("ollama", model="qwen2.5vl:7b")
response = llm.generate("Analyze this", media=["photo.jpg"]) # Auto-resized to 3584×3584
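As an illustration of what "auto-resized to 4096×4096" means, the snippet below caps the longest side of an image using Pillow while preserving aspect ratio. This is not AbstractCore's internal code, and you normally do not need it, since optimization happens automatically.
from PIL import Image
MAX_SIDE = 4096  # example limit, matching GPT-4o above
img = Image.open("photo.jpg")
if max(img.size) > MAX_SIDE:
    img.thumbnail((MAX_SIDE, MAX_SIDE))  # resizes in place, keeps aspect ratio
img.save("photo_resized.jpg")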
Structured Vision Analysis
# Get structured responses with specific requirements
llm = create_llm("openai", model="gpt-4o")
response = llm.generate("""
Analyze this image and provide:
- objects: list of objects detected
- colors: dominant colors
- setting: location/environment
- activities: what's happening
Format as JSON.
""", media=["scene.jpg"])
import json
analysis = json.loads(response.content)
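Models occasionally wrap JSON output in markdown fences, in which case a direct json.loads fails. The parse_json_reply helper below is a hypothetical, defensive alternative that strips such fences before parsing; it is not part of AbstractCore.
import json
def parse_json_reply(text: str) -> dict:
    # Remove optional ```json ... ``` fences before parsing
    cleaned = text.strip()
    if cleaned.startswith("```"):
        cleaned = cleaned.strip("`").strip()
        if cleaned.lower().startswith("json"):
            cleaned = cleaned[4:]
    return json.loads(cleaned)
analysis = parse_json_reply(response.content)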
Multi-Image Analysis
Comparison Tasks
# Compare multiple images
llm = create_llm("anthropic", model="claude-3.5-sonnet")
response = llm.generate(
    "Compare these three architectural designs and identify common elements",
    media=["design_a.jpg", "design_b.jpg", "design_c.jpg"]
)
Sequential Analysis
# Analyze sequence of images
response = llm.generate(
    "Describe the progression shown in these time-lapse images",
    media=["hour1.jpg", "hour2.jpg", "hour3.jpg", "hour4.jpg"]
)
Common Use Cases
Document OCR and Analysis
# Extract text from images
response = llm.generate(
    "Extract all text from this image and organize it",
    media=["receipt.jpg"]
)
# Handwriting recognition
response = llm.generate(
    "Transcribe the handwritten notes in this image",
    media=["notes.jpg"]
)
Chart and Graph Analysis
# Analyze visualizations
response = llm.generate(
    "What trends do you see in this chart? Provide the data points.",
    media=["sales_chart.png"]
)
# Compare visualizations
response = llm.generate(
    "Compare the trends shown in these two charts",
    media=["chart1.png", "chart2.png"]
)
Quality Control and Inspection
# Defect detection
response = llm.generate(
    "Identify any defects or anomalies in this product image",
    media=["product_photo.jpg"]
)
# Comparison with standard
response = llm.generate(
    "Compare this sample with the reference and note any differences",
    media=["sample.jpg", "reference.jpg"]
)
Scene Understanding
# Detailed scene analysis
response = llm.generate(
    """Analyze this scene and provide:
- Objects present
- Activities occurring
- Environmental conditions
- Any safety concerns""",
    media=["scene.jpg"]
)
CLI Integration
Vision capabilities work seamlessly with the CLI using @filename syntax:
# Basic image analysis
python -m abstractcore.utils.cli --prompt "Describe this image @photo.jpg"
# Multiple images
python -m abstractcore.utils.cli --prompt "Compare @before.jpg and @after.jpg"
# Mixed media
python -m abstractcore.utils.cli --prompt "Verify the chart @chart.png matches @data.csv"
Best Practices
- Image Quality - Use high-quality images for better analysis accuracy
- Clear Prompts - Be specific about what you want to extract from images
- Resolution - Let AbstractCore handle optimization; don't pre-resize
- Multiple Images - Limit to 10-15 images per request for best performance (see the batching sketch after this list)
- Vision Fallback - Use local vision models for privacy-sensitive images
- Format - JPEG for photos, PNG for screenshots/diagrams
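Following the multiple-images guideline above, a simple way to handle large image sets is to split them into batches of at most 10 per request. The file names, batch size, and prompt below are purely illustrative.
from abstractcore import create_llm
llm = create_llm("openai", model="gpt-4o")
all_images = [f"frame_{i:03d}.jpg" for i in range(40)]  # illustrative file list
batch_size = 10
summaries = []
for start in range(0, len(all_images), batch_size):
    batch = all_images[start:start + batch_size]
    response = llm.generate(
        "Summarize what these images show.",
        media=batch
    )
    summaries.append(response.content)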
Error Handling
AbstractCore provides robust error handling for vision tasks:
- Unsupported Format - Clear error with supported format list
- Image Too Large - Automatic resizing with warning
- Corrupted Image - Graceful error with fallback attempt
- Vision Model Unavailable - Falls back to text-only with clear message
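A defensive calling pattern looks like the sketch below. The specific exception classes AbstractCore raises are not listed here, so this example catches broadly and inspects the message; adapt it to the exceptions your version actually exposes.
from abstractcore import create_llm
llm = create_llm("openai", model="gpt-4o")
try:
    response = llm.generate("Describe this image", media=["scan.tiff"])
    print(response.content)
except Exception as exc:
    # Messages cover cases such as unsupported formats or an unavailable
    # vision model; handle, log, or re-raise as appropriate.
    print(f"Vision request failed: {exc}")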