Vision Capabilities

Seamless image analysis across all AI providers with automatic optimization and intelligent fallback for text-only models.

Overview

AbstractCore provides vision capabilities that work the same way across multiple AI providers and models. The system automatically handles image optimization, provider-specific formatting, and fallback for models without native vision support.

Key Features

  • Cross-Provider Consistency - The same code works identically across cloud and local providers
  • Automatic Optimization - Images are automatically resized to each model's maximum supported resolution
  • Vision Fallback - Text-only models can process images through a transparent two-stage pipeline
  • Multi-Image Support - Analyze and compare multiple images in a single request
  • Format Flexibility - PNG, JPEG, GIF, WEBP, BMP, TIFF are all supported

Supported Providers and Models

Cloud Providers

  • OpenAI: GPT-4o, GPT-4 Turbo Vision (multiple images, up to 4096×4096)
  • Anthropic: Claude 3.5 Sonnet, Claude 3 Haiku (up to 20 images, 1568×1568)

Local Providers

  • Ollama: qwen2.5vl:7b, llama3.2-vision:11b, gemma3:4b
  • LMStudio: qwen/qwen2.5-vl-7b, google/gemma-3n-e4b
  • HuggingFace: Qwen2.5-VL variants, LLaVA models
  • MLX: Vision models via MLX framework

Image Formats

PNG, JPEG, GIF, WEBP, BMP, and TIFF are all supported, with automatic optimization.
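
Every format is passed the same way through the media parameter. For example (the model shown is just one of the supported vision models):

from abstractcore import create_llm

# Mixed formats in a single request; AbstractCore converts and resizes as needed
llm = create_llm("ollama", model="qwen2.5vl:7b")
response = llm.generate(
    "Summarize the content of these images",
    media=["screenshot.png", "photo.jpeg", "diagram.webp"]
)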

Basic Vision Analysis

from abstractcore import create_llm

# Works with any vision-capable provider
llm = create_llm("openai", model="gpt-4o")

# Single image analysis
response = llm.generate(
    "What objects do you see in this image?",
    media=["photo.jpg"]
)

# Multiple images comparison
response = llm.generate(
    "Compare these architectural styles and identify differences",
    media=["building1.jpg", "building2.jpg", "building3.jpg"]
)

Cross-Provider Consistency

The same code works across all providers:

# Same prompt and files passed to every provider
media_files = ["chart.png", "document.pdf"]
prompt = "Analyze the data in these files"

# All work identically
openai_response = create_llm("openai", model="gpt-4o").generate(prompt, media=media_files)
anthropic_response = create_llm("anthropic", model="claude-3.5-sonnet").generate(prompt, media=media_files)
ollama_response = create_llm("ollama", model="qwen2.5vl:7b").generate(prompt, media=media_files)

Vision Fallback System

The Vision Fallback System enables text-only models to process images through a transparent two-stage pipeline. This is particularly useful when you want to use a powerful text model that doesn't have native vision capabilities.

Configuration (One-Time Setup)

# Option 1: Download local vision model (recommended)
abstractcore --download-vision-model

# Option 2: Use existing Ollama model
abstractcore --set-vision-caption qwen2.5vl:7b

# Option 3: Use cloud API
abstractcore --set-vision-provider openai --model gpt-4o

# Disable vision fallback
abstractcore --disable-vision

Using Vision Fallback

# After configuration, text-only models can process images seamlessly
text_llm = create_llm("lmstudio", model="qwen/qwen3-next-80b")  # No native vision

response = text_llm.generate(
    "What's happening in this image?",
    media=["complex_scene.jpg"]
)
# Works transparently: vision model analyzes image → text model processes description

How Vision Fallback Works

  1. You send a request to a text-only model with an image
  2. AbstractCore detects the model lacks native vision capabilities
  3. The image is automatically sent to your configured vision model
  4. The vision model generates a detailed description
  5. The description is passed to your text model along with your original prompt
  6. You receive a response combining vision analysis with text processing
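
Conceptually, the two stages are equivalent to running them yourself. The sketch below is for illustration only; AbstractCore performs these steps internally, and the model names are just examples:

from abstractcore import create_llm

# Stage 1: a vision-capable model turns the image into a detailed description
vision_llm = create_llm("ollama", model="qwen2.5vl:7b")
description = vision_llm.generate(
    "Describe this image in detail.",
    media=["complex_scene.jpg"]
).content

# Stage 2: the text-only model answers the original prompt using that description
text_llm = create_llm("lmstudio", model="qwen/qwen3-next-80b")
response = text_llm.generate(
    f"Image description:\n{description}\n\nWhat's happening in this image?"
)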

Automatic Resolution Optimization

AbstractCore automatically resizes images to each model's maximum supported resolution:

# Images automatically optimized per model
llm = create_llm("openai", model="gpt-4o")
response = llm.generate("Analyze this", media=["photo.jpg"])  # Auto-resized to 4096×4096

llm = create_llm("ollama", model="qwen2.5vl:7b")
response = llm.generate("Analyze this", media=["photo.jpg"])  # Auto-resized to 3584×3584

Structured Vision Analysis

# Get structured responses with specific requirements
llm = create_llm("openai", model="gpt-4o")

response = llm.generate("""
Analyze this image and provide:
- objects: list of objects detected
- colors: dominant colors
- setting: location/environment
- activities: what's happening

Format as JSON.
""", media=["scene.jpg"])

import json
analysis = json.loads(response.content)
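
Note that json.loads expects raw JSON, and some models wrap their answer in markdown code fences. A small defensive parser (plain Python, no AbstractCore-specific API) makes this step more robust:

import json

def parse_json_response(text: str) -> dict:
    """Strip optional markdown code fences before parsing JSON."""
    cleaned = text.strip()
    if cleaned.startswith("```"):
        cleaned = cleaned.split("\n", 1)[1]      # drop the opening fence and language tag
        cleaned = cleaned.rsplit("```", 1)[0]    # drop the closing fence
    return json.loads(cleaned)

analysis = parse_json_response(response.content)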

Multi-Image Analysis

Comparison Tasks

# Compare multiple images
llm = create_llm("anthropic", model="claude-3.5-sonnet")

response = llm.generate(
    "Compare these three architectural designs and identify common elements",
    media=["design_a.jpg", "design_b.jpg", "design_c.jpg"]
)

Sequential Analysis

# Analyze sequence of images
response = llm.generate(
    "Describe the progression shown in these time-lapse images",
    media=["hour1.jpg", "hour2.jpg", "hour3.jpg", "hour4.jpg"]
)

Common Use Cases

Document OCR and Analysis

# Extract text from images
response = llm.generate(
    "Extract all text from this image and organize it",
    media=["receipt.jpg"]
)

# Handwriting recognition
response = llm.generate(
    "Transcribe the handwritten notes in this image",
    media=["notes.jpg"]
)

Chart and Graph Analysis

# Analyze visualizations
response = llm.generate(
    "What trends do you see in this chart? Provide the data points.",
    media=["sales_chart.png"]
)

# Compare visualizations
response = llm.generate(
    "Compare the trends shown in these two charts",
    media=["chart1.png", "chart2.png"]
)

Quality Control and Inspection

# Defect detection
response = llm.generate(
    "Identify any defects or anomalies in this product image",
    media=["product_photo.jpg"]
)

# Comparison with standard
response = llm.generate(
    "Compare this sample with the reference and note any differences",
    media=["sample.jpg", "reference.jpg"]
)

Scene Understanding

# Detailed scene analysis
response = llm.generate(
    """Analyze this scene and provide:
    - Objects present
    - Activities occurring
    - Environmental conditions
    - Any safety concerns""",
    media=["scene.jpg"]
)

CLI Integration

Vision capabilities work seamlessly with the CLI using @filename syntax:

# Basic image analysis
python -m abstractcore.utils.cli --prompt "Describe this image @photo.jpg"

# Multiple images
python -m abstractcore.utils.cli --prompt "Compare @before.jpg and @after.jpg"

# Mixed media
python -m abstractcore.utils.cli --prompt "Verify the chart @chart.png matches @data.csv"

Best Practices

  • Image Quality - Use high-quality images for better analysis accuracy
  • Clear Prompts - Be specific about what you want to extract from images
  • Resolution - Let AbstractCore handle optimization; don't pre-resize
  • Multiple Images - Limit to 10-15 images per request for best performance (see the batching sketch after this list)
  • Vision Fallback - Use local vision models for privacy-sensitive images
  • Format - JPEG for photos, PNG for screenshots/diagrams
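
For image sets larger than that, one simple approach is to batch the request yourself. This is a plain-Python sketch; the chunk size of 10 and the frame filenames are just examples:

# Process a large image set in chunks to stay within a sensible per-request limit
images = [f"frame_{i:03d}.jpg" for i in range(40)]

summaries = []
for start in range(0, len(images), 10):
    batch = images[start:start + 10]
    response = llm.generate(
        "Describe the key events in these frames",
        media=batch
    )
    summaries.append(response.content)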

Error Handling

AbstractCore provides robust error handling for vision tasks:

  • Unsupported Format - Clear error with a list of supported formats
  • Image Too Large - Automatic resizing with a warning
  • Corrupted Image - Graceful error with a fallback attempt
  • Vision Model Unavailable - Falls back to text-only with a clear message
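
The exact exception classes depend on your AbstractCore version, so the sketch below simply catches broadly and inspects the message. It is a defensive pattern, not the library's documented error API:

try:
    response = llm.generate(
        "Describe this image",
        media=["photo.tiff"]
    )
except Exception as exc:
    # The message indicates the cause: unsupported format, corrupted file,
    # oversized image, or an unavailable vision model
    print(f"Vision request failed: {exc}")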

Related Documentation

  • Getting Started - First steps with AbstractCore
  • Media Handling - Universal file attachment
  • Centralized Configuration - Configure vision fallback
  • Internal CLI - CLI with vision support
  • HTTP Server - REST API with vision support
  • API Reference - Complete Python API