Vision Capabilities

Seamless image analysis across all AI providers with automatic optimization and intelligent fallback for text-only models.

Overview

This page describes vision as an input modality in AbstractCore (image and video understanding) and clarifies how it relates to vision fallback (caption → inject short observations) and to generative vision (image/video creation), which lives in the optional abstractvision capability plugin.

Key Features

  • Cross-Provider Consistency - Same code works identically across cloud and local providers
  • Automatic Optimization - Images automatically resized to each model's maximum supported resolution
  • Vision Fallback - Text-only models can process images through a transparent two-stage pipeline
  • Multi-Image Support - Analyze and compare multiple images simultaneously
  • Format Flexibility - PNG, JPEG, GIF, WEBP, BMP, TIFF all supported

This page focuses on vision input. For generative vision output, see Capabilities and Server.

1) Image/video input modalities (owned by AbstractCore)

Attach images (and videos) to an LLM call using media=[...]:

from abstractcore import create_llm

llm = create_llm("openai", model="gpt-4o-mini")  # example; pick a vision-capable model you have access to

# Image input
resp = llm.generate("What is in this image?", media=["photo.jpg"])
print(resp.content)

# Video input (policy-driven)
resp = llm.generate(
    "Summarize what happens in this clip.",
    media=["clip.mp4"],
    video_policy="auto",  # native when supported; otherwise sample frames
)
print(resp.content)

2) Vision fallback for text-only models (optional; config-driven)

When a user attaches an image to a text-only model, AbstractCore can optionally run a configured vision backend to produce short, grounded observations and inject them into the main request. This behavior is explicit (enabled via configuration) and transparent: each enrichment is recorded in response.metadata.media_enrichment[].

See Media Handling for policy details and enrichment metadata.

3) Generative vision output is not part of AbstractCore’s default install

Creating/editing images and videos is a deterministic capability provided by abstractvision. You can integrate it in two ways:

  • Capability plugin (library mode): install abstractvision and use llm.vision.*
  • HTTP interop: run the server and enable /v1/images/* endpoints delegated to abstractvision

See Server for the optional /v1/images/generations and /v1/images/edits endpoints.
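If you use the HTTP interop route, the request body for /v1/images/generations presumably follows the familiar OpenAI-compatible images shape. A minimal sketch of such a payload (the field names "model", "prompt", "n", and "size" are assumptions based on that convention; verify against the Server documentation):

```python
import json

# Hypothetical payload for POST /v1/images/generations
# (OpenAI-compatible shape assumed; check the Server docs for the exact schema).
payload = {
    "model": "your-image-model",  # placeholder model id
    "prompt": "A watercolor sketch of a lighthouse at dusk",
    "n": 1,                       # number of images to generate
    "size": "1024x1024",
}
body = json.dumps(payload)
print(body)
```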

Supported Providers and Models

Cloud Providers

  • OpenAI: GPT-4o, GPT-4o mini (multiple images)
  • Anthropic: Claude Haiku 4.5, Claude Sonnet 4.5 (vision-capable models)

Local Providers

  • Ollama: qwen2.5vl:7b, llama3.2-vision:11b, gemma3:4b
  • LMStudio: qwen/qwen2.5-vl-7b, google/gemma-3n-e4b
  • HuggingFace: Qwen2.5-VL variants, LLaVA models
  • MLX: Vision models via MLX framework

Image Formats

PNG, JPEG, GIF, WEBP, BMP, TIFF with automatic optimization

Basic Vision Analysis

from abstractcore import create_llm

# Works with any vision-capable provider
llm = create_llm("openai", model="gpt-4o")

# Single image analysis
response = llm.generate(
    "What objects do you see in this image?",
    media=["photo.jpg"]
)

# Multiple images comparison
response = llm.generate(
    "Compare these architectural styles and identify differences",
    media=["building1.jpg", "building2.jpg", "building3.jpg"]
)

Cross-Provider Consistency

The same code works across all providers:

media_files = ["chart.png", "document.pdf"]  # images and documents can be mixed
prompt = "Analyze the data in these files"

# All work identically
openai_response = create_llm("openai", model="gpt-4o").generate(prompt, media=media_files)
anthropic_response = create_llm("anthropic", model="claude-haiku-4-5").generate(prompt, media=media_files)
ollama_response = create_llm("ollama", model="qwen2.5vl:7b").generate(prompt, media=media_files)

Vision Fallback System

The Vision Fallback System enables text-only models to process images through a transparent two-stage pipeline. This is particularly useful when you want to use a powerful text model that doesn't have native vision capabilities.

Configuration (One-Time Setup)

# Option 1: Download local vision model (recommended)
abstractcore --download-vision-model

# Option 2: Use existing Ollama model
abstractcore --set-vision-provider ollama qwen2.5vl:7b

# Option 3: Use cloud API
abstractcore --set-vision-provider openai gpt-4o

# Disable vision fallback
abstractcore --disable-vision

Using Vision Fallback

# After configuration, text-only models can process images seamlessly
text_llm = create_llm("lmstudio", model="qwen/qwen3-next-80b")  # No native vision

response = text_llm.generate(
    "What's happening in this image?",
    media=["complex_scene.jpg"]
)
# Works transparently: vision model analyzes image → text model processes description

How Vision Fallback Works

  1. You send a request to a text-only model with an image
  2. AbstractCore detects the model lacks native vision capabilities
  3. The image is automatically sent to your configured vision model
  4. The vision model generates a detailed description
  5. The description is passed to your text model along with your original prompt
  6. You receive a response combining vision analysis with text processing
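The injection in step 5 can be sketched as plain string composition. This is an illustration, not AbstractCore's internal code; the exact wording it injects is an assumption, and only the [Image N: ...] placeholder style is documented on this page:

```python
def inject_observations(prompt: str, observations: list[str]) -> str:
    """Prepend numbered image observations to the user's prompt,
    mirroring step 5 of the fallback pipeline (format assumed)."""
    notes = "\n".join(
        f"[Image {i}: {obs}]" for i, obs in enumerate(observations, start=1)
    )
    return f"{notes}\n\n{prompt}"

combined = inject_observations(
    "What's happening here?",
    ["A red bicycle leaning against a brick wall."],
)
print(combined)
```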

No silent drops

If you attach images to a text-only model and vision fallback isn't configured, AbstractCore won't silently ignore your files. It injects a small placeholder (for example [Image 1: ...]) and records what happened in response.metadata.media_enrichment.
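You can inspect the enrichment records after a call to see how each attachment was handled. The record fields below ("file", "status") are illustrative assumptions; only the response.metadata.media_enrichment key is documented here:

```python
def summarize_enrichment(metadata: dict) -> list[str]:
    """Report what happened to each attachment. The per-record field
    names ("file", "status") are assumptions for illustration."""
    lines = []
    for record in metadata.get("media_enrichment", []):
        lines.append(f"{record.get('file', '?')}: {record.get('status', 'unknown')}")
    return lines

# Example with a hand-built metadata dict (shape assumed):
meta = {"media_enrichment": [{"file": "photo.jpg", "status": "placeholder_injected"}]}
print(summarize_enrichment(meta))
```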

Automatic Resolution Optimization

AbstractCore automatically resizes images to each model's maximum supported resolution:

# Images automatically optimized per model
llm = create_llm("openai", model="gpt-4o")
response = llm.generate("Analyze this", media=["photo.jpg"])  # Auto-resized to 4096×4096

llm = create_llm("ollama", model="qwen2.5vl:7b")
response = llm.generate("Analyze this", media=["photo.jpg"])  # Auto-resized to 3584×3584

Structured Vision Analysis

# Get structured responses with specific requirements
llm = create_llm("openai", model="gpt-4o")

response = llm.generate("""
Analyze this image and provide:
- objects: list of objects detected
- colors: dominant colors
- setting: location/environment
- activities: what's happening

Format as JSON.
""", media=["scene.jpg"])

import json
analysis = json.loads(response.content)
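Models sometimes wrap JSON replies in a markdown fence, which would make a bare json.loads fail. A defensive parse (a generic helper, not part of the AbstractCore API) handles both cases:

```python
import json
import re

def parse_json_reply(text: str) -> dict:
    """Strip an optional ```json ... ``` fence before parsing.
    Generic helper, not an AbstractCore API."""
    match = re.search(r"```(?:json)?\s*(.*?)\s*```", text, re.DOTALL)
    payload = match.group(1) if match else text
    return json.loads(payload)

print(parse_json_reply('```json\n{"objects": ["tree"]}\n```'))
```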

Multi-Image Analysis

Comparison Tasks

# Compare multiple images
llm = create_llm("anthropic", model="claude-haiku-4-5")

response = llm.generate(
    "Compare these three architectural designs and identify common elements",
    media=["design_a.jpg", "design_b.jpg", "design_c.jpg"]
)

Sequential Analysis

# Analyze sequence of images
response = llm.generate(
    "Describe the progression shown in these time-lapse images",
    media=["hour1.jpg", "hour2.jpg", "hour3.jpg", "hour4.jpg"]
)

Common Use Cases

Document OCR and Analysis

# Extract text from images
response = llm.generate(
    "Extract all text from this image and organize it",
    media=["receipt.jpg"]
)

# Handwriting recognition
response = llm.generate(
    "Transcribe the handwritten notes in this image",
    media=["notes.jpg"]
)

Chart and Graph Analysis

# Analyze visualizations
response = llm.generate(
    "What trends do you see in this chart? Provide the data points.",
    media=["sales_chart.png"]
)

# Compare visualizations
response = llm.generate(
    "Compare the trends shown in these two charts",
    media=["chart1.png", "chart2.png"]
)

Quality Control and Inspection

# Defect detection
response = llm.generate(
    "Identify any defects or anomalies in this product image",
    media=["product_photo.jpg"]
)

# Comparison with standard
response = llm.generate(
    "Compare this sample with the reference and note any differences",
    media=["sample.jpg", "reference.jpg"]
)

Scene Understanding

# Detailed scene analysis
response = llm.generate(
    """Analyze this scene and provide:
    - Objects present
    - Activities occurring
    - Environmental conditions
    - Any safety concerns""",
    media=["scene.jpg"]
)

CLI Integration

Vision capabilities work seamlessly with the CLI using @filename syntax:

# Basic image analysis
python -m abstractcore.utils.cli --prompt "Describe this image @photo.jpg"

# Multiple images
python -m abstractcore.utils.cli --prompt "Compare @before.jpg and @after.jpg"

# Mixed media
python -m abstractcore.utils.cli --prompt "Verify the chart @chart.png matches @data.csv"

Best Practices

  • Image Quality - Use high-quality images for better analysis accuracy
  • Clear Prompts - Be specific about what you want to extract from images
  • Resolution - Let AbstractCore handle optimization; don't pre-resize
  • Multiple Images - Limit to 10-15 images per request for best performance
  • Vision Fallback - Use local vision models for privacy-sensitive images
  • Format - JPEG for photos, PNG for screenshots/diagrams

Error Handling

AbstractCore provides robust error handling for vision tasks:

  • Unsupported Format - Clear error with supported format list
  • Image Too Large - Automatic resizing with warning
  • Corrupted Image - Graceful error with fallback attempt
  • Vision Model Unavailable - Falls back to text-only with clear message
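The "unsupported format" case can also be checked locally before sending a request. A sketch (the extension list comes from this page; the helper itself is not an AbstractCore API):

```python
from pathlib import Path

# Formats listed on this page; extensions are assumed mappings.
SUPPORTED = {".png", ".jpg", ".jpeg", ".gif", ".webp", ".bmp", ".tif", ".tiff"}

def check_image_format(path: str) -> None:
    """Raise early with the supported-format list, mirroring the kind
    of error AbstractCore reports for unsupported formats."""
    ext = Path(path).suffix.lower()
    if ext not in SUPPORTED:
        raise ValueError(
            f"Unsupported image format {ext!r}; supported: {sorted(SUPPORTED)}"
        )

check_image_format("photo.jpg")  # passes silently
```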

Related Documentation

  • Getting Started - First steps with AbstractCore
  • Media Handling - Universal file attachment
  • Centralized Configuration - Configure vision fallback
  • Internal CLI - CLI with vision support
  • HTTP Server - REST API with vision support
  • API Reference - Complete Python API