Vision Capabilities

Seamless image analysis across all AI providers with automatic optimization and intelligent fallback for text-only models.

Overview

This page describes vision as an input modality in AbstractCore (image and video understanding) and clarifies how it relates to vision fallback (caption → inject short observations) and to generative vision (image/video creation), which lives in the optional abstractvision capability plugin.

Key Features

  • Cross-Provider Consistency - Same code works identically across cloud and local providers
  • Automatic Optimization - Images automatically resized to each model's maximum supported resolution
  • Vision Fallback - Text-only models can process images through a transparent two-stage pipeline
  • Multi-Image Support - Analyze and compare multiple images simultaneously
  • Format Flexibility - PNG, JPEG, GIF, WEBP, BMP, TIFF all supported

This page focuses on vision input. For generative vision output, see Capabilities and Server.

1) Image/video input modalities (owned by AbstractCore)

Attach images (and videos) to an LLM call using media=[...]:

from abstractcore import create_llm

llm = create_llm("openai", model="gpt-4o-mini")  # example; pick a vision-capable model you have access to

# Image input
resp = llm.generate("What is in this image?", media=["photo.jpg"])
print(resp.content)

# Video input (policy-driven)
resp = llm.generate(
    "Summarize what happens in this clip.",
    media=["clip.mp4"],
    video_policy="auto",  # native when supported; otherwise sample frames
)
print(resp.content)

2) Vision fallback for text-only models (optional; config-driven)

When a user attaches an image to a text-only model, AbstractCore can optionally run a configured vision backend to produce short, grounded observations and inject them into the main request. This behavior is explicit (enabled via configuration) and transparent: each enrichment is recorded in response.metadata.media_enrichment[].

See Media Handling for policy details and enrichment metadata.

3) Generative vision output is not part of AbstractCore’s default install

Creating/editing images and videos is a deterministic capability provided by abstractvision. You can integrate it in two ways:

  • Capability plugin (library mode): install abstractvision and use llm.vision.*
  • HTTP interop: run the server and enable /v1/images/* endpoints delegated to abstractvision

See Server for the optional /v1/images/generations and /v1/images/edits endpoints.
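If you use the HTTP interop route, the request body for /v1/images/generations presumably follows the familiar OpenAI-compatible images shape. A minimal sketch of such a payload (the field names "model", "prompt", "n", and "size" are assumptions based on that convention; verify against the Server documentation):

```python
import json

# Hypothetical payload for POST /v1/images/generations
# (OpenAI-compatible shape assumed; check the Server docs for the exact schema).
payload = {
    "model": "your-image-model",  # placeholder model id
    "prompt": "A watercolor sketch of a lighthouse at dusk",
    "n": 1,                       # number of images to generate
    "size": "1024x1024",
}
body = json.dumps(payload)
print(body)
```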

Supported Providers and Models

Cloud Providers

  • OpenAI: GPT-4o, GPT-4o mini (multiple images)
  • Anthropic: Claude Haiku 4.5, Claude Sonnet 4.5 (vision-capable models)

Local Providers

  • Ollama: qwen2.5vl:7b, llama3.2-vision:11b, gemma3:4b
  • LMStudio: qwen/qwen2.5-vl-7b, google/gemma-3n-e4b
  • HuggingFace: Qwen2.5-VL variants, LLaVA models
  • MLX: Vision models via MLX framework

Image Formats

PNG, JPEG, GIF, WEBP, BMP, TIFF with automatic optimization

Basic Vision Analysis

from abstractcore import create_llm

# Works with any vision-capable provider
llm = create_llm("openai", model="gpt-4o")

# Single image analysis
response = llm.generate(
    "What objects do you see in this image?",
    media=["photo.jpg"]
)

# Multiple images comparison
response = llm.generate(
    "Compare these architectural styles and identify differences",
    media=["building1.jpg", "building2.jpg", "building3.jpg"]
)

Cross-Provider Consistency

The same code works across all providers:

media_files = ["chart.png", "document.pdf"]  # images and documents can be mixed
prompt = "Analyze the data in these files"

# All work identically
openai_response = create_llm("openai", model="gpt-4o").generate(prompt, media=media_files)
anthropic_response = create_llm("anthropic", model="claude-haiku-4-5").generate(prompt, media=media_files)
ollama_response = create_llm("ollama", model="qwen2.5vl:7b").generate(prompt, media=media_files)

Vision Fallback System

The Vision Fallback System enables text-only models to process images through a transparent two-stage pipeline. This is particularly useful when you want to use a powerful text model that doesn't have native vision capabilities.

Configuration (One-Time Setup)

# Option 1: Download local vision model (recommended)
abstractcore --download-vision-model

# Option 2: Use existing Ollama model
abstractcore --set-vision-provider ollama qwen2.5vl:7b

# Option 3: Use cloud API
abstractcore --set-vision-provider openai gpt-4o

# Disable vision fallback
abstractcore --disable-vision

Using Vision Fallback

# After configuration, text-only models can process images seamlessly
text_llm = create_llm("lmstudio", model="qwen/qwen3-next-80b")  # No native vision

response = text_llm.generate(
    "What's happening in this image?",
    media=["complex_scene.jpg"]
)
# Works transparently: vision model analyzes image → text model processes description

How Vision Fallback Works

  1. You send a request to a text-only model with an image
  2. AbstractCore detects the model lacks native vision capabilities
  3. The image is automatically sent to your configured vision model
  4. The vision model generates a detailed description
  5. The description is passed to your text model along with your original prompt
  6. You receive a response combining vision analysis with text processing
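The injection in step 5 can be sketched as plain string composition. This is an illustration, not AbstractCore's internal code; the exact wording it injects is an assumption, and only the [Image N: ...] placeholder style is documented on this page:

```python
def inject_observations(prompt: str, observations: list[str]) -> str:
    """Prepend numbered image observations to the user's prompt,
    mirroring step 5 of the fallback pipeline (format assumed)."""
    notes = "\n".join(
        f"[Image {i}: {obs}]" for i, obs in enumerate(observations, start=1)
    )
    return f"{notes}\n\n{prompt}"

combined = inject_observations(
    "What's happening here?",
    ["A red bicycle leaning against a brick wall."],
)
print(combined)
```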

No silent drops

If you attach images to a text-only model and vision fallback isn't configured, AbstractCore won't silently ignore your files. It injects a small placeholder (for example [Image 1: ...]) and records what happened in response.metadata.media_enrichment.
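You can inspect the enrichment records after a call to see how each attachment was handled. The record fields below ("file", "status") are illustrative assumptions; only the response.metadata.media_enrichment key is documented here:

```python
def summarize_enrichment(metadata: dict) -> list[str]:
    """Report what happened to each attachment. The per-record field
    names ("file", "status") are assumptions for illustration."""
    lines = []
    for record in metadata.get("media_enrichment", []):
        lines.append(f"{record.get('file', '?')}: {record.get('status', 'unknown')}")
    return lines

# Example with a hand-built metadata dict (shape assumed):
meta = {"media_enrichment": [{"file": "photo.jpg", "status": "placeholder_injected"}]}
print(summarize_enrichment(meta))
```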

Automatic Resolution Optimization

AbstractCore automatically resizes images to each model's maximum supported resolution:

# Images automatically optimized per model
llm = create_llm("openai", model="gpt-4o")
response = llm.generate("Analyze this", media=["photo.jpg"])  # Auto-resized to 4096×4096

llm = create_llm("ollama", model="qwen2.5vl:7b")
response = llm.generate("Analyze this", media=["photo.jpg"])  # Auto-resized to 3584×3584

Structured Vision Analysis

# Get structured responses with specific requirements
llm = create_llm("openai", model="gpt-4o")

response = llm.generate("""
Analyze this image and provide:
- objects: list of objects detected
- colors: dominant colors
- setting: location/environment
- activities: what's happening

Format as JSON.
""", media=["scene.jpg"])

import json
analysis = json.loads(response.content)
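Models sometimes wrap JSON replies in a markdown fence, which would make a bare json.loads fail. A defensive parse (a generic helper, not part of the AbstractCore API) handles both cases:

```python
import json
import re

def parse_json_reply(text: str) -> dict:
    """Strip an optional ```json ... ``` fence before parsing.
    Generic helper, not an AbstractCore API."""
    match = re.search(r"```(?:json)?\s*(.*?)\s*```", text, re.DOTALL)
    payload = match.group(1) if match else text
    return json.loads(payload)

print(parse_json_reply('```json\n{"objects": ["tree"]}\n```'))
```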

Multi-Image Analysis

Comparison Tasks

# Compare multiple images
llm = create_llm("anthropic", model="claude-haiku-4-5")

response = llm.generate(
    "Compare these three architectural designs and identify common elements",
    media=["design_a.jpg", "design_b.jpg", "design_c.jpg"]
)

Sequential Analysis

# Analyze sequence of images
response = llm.generate(
    "Describe the progression shown in these time-lapse images",
    media=["hour1.jpg", "hour2.jpg", "hour3.jpg", "hour4.jpg"]
)

Common Use Cases

Document OCR and Analysis

# Extract text from images
response = llm.generate(
    "Extract all text from this image and organize it",
    media=["receipt.jpg"]
)

# Handwriting recognition
response = llm.generate(
    "Transcribe the handwritten notes in this image",
    media=["notes.jpg"]
)

Chart and Graph Analysis

# Analyze visualizations
response = llm.generate(
    "What trends do you see in this chart? Provide the data points.",
    media=["sales_chart.png"]
)

# Compare visualizations
response = llm.generate(
    "Compare the trends shown in these two charts",
    media=["chart1.png", "chart2.png"]
)

Quality Control and Inspection

# Defect detection
response = llm.generate(
    "Identify any defects or anomalies in this product image",
    media=["product_photo.jpg"]
)

# Comparison with standard
response = llm.generate(
    "Compare this sample with the reference and note any differences",
    media=["sample.jpg", "reference.jpg"]
)

Scene Understanding

# Detailed scene analysis
response = llm.generate(
    """Analyze this scene and provide:
    - Objects present
    - Activities occurring
    - Environmental conditions
    - Any safety concerns""",
    media=["scene.jpg"]
)

CLI Integration

Vision capabilities work seamlessly with the CLI using @filename syntax:

# Basic image analysis
python -m abstractcore.utils.cli --prompt "Describe this image @photo.jpg"

# Multiple images
python -m abstractcore.utils.cli --prompt "Compare @before.jpg and @after.jpg"

# Mixed media
python -m abstractcore.utils.cli --prompt "Verify the chart @chart.png matches @data.csv"

Best Practices

  • Image Quality - Use high-quality images for better analysis accuracy
  • Clear Prompts - Be specific about what you want to extract from images
  • Resolution - Let AbstractCore handle optimization; don't pre-resize
  • Multiple Images - Limit to 10-15 images per request for best performance
  • Vision Fallback - Use local vision models for privacy-sensitive images
  • Format - JPEG for photos, PNG for screenshots/diagrams

Error Handling

AbstractCore provides robust error handling for vision tasks:

  • Unsupported Format - Clear error with supported format list
  • Image Too Large - Automatic resizing with warning
  • Corrupted Image - Graceful error with fallback attempt
  • Vision Model Unavailable - Falls back to text-only with clear message
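The "unsupported format" case can also be checked locally before sending a request. A sketch (the extension list comes from this page; the helper itself is not an AbstractCore API):

```python
from pathlib import Path

# Formats listed on this page; extensions are assumed mappings.
SUPPORTED = {".png", ".jpg", ".jpeg", ".gif", ".webp", ".bmp", ".tif", ".tiff"}

def check_image_format(path: str) -> None:
    """Raise early with the supported-format list, mirroring the kind
    of error AbstractCore reports for unsupported formats."""
    ext = Path(path).suffix.lower()
    if ext not in SUPPORTED:
        raise ValueError(
            f"Unsupported image format {ext!r}; supported: {sorted(SUPPORTED)}"
        )

check_image_format("photo.jpg")  # passes silently
```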

Related Documentation

  • Getting Started - First steps with AbstractCore
  • Media Handling - Universal file attachment
  • Centralized Configuration - Configure vision fallback
  • Internal CLI - CLI with vision support
  • HTTP Server - REST API with vision support
  • API Reference - Complete Python API