Fallbacks
This document describes best-effort fallbacks AbstractCore uses when a provider/runtime does not expose (or does not reliably honor) a model’s native control surface.
Table of Contents
- Qwen3 / Qwen3.5: thinking ("reasoning") toggle

The goal of a fallback is:
- Keep the public API stable (e.g. thinking="none|low|medium|high")
- Prefer backend-native knobs when they exist
- Avoid “system prompt injection” where possible
- Be explicit about trade-offs and when behavior is only best-effort
Qwen3 / Qwen3.5: thinking (“reasoning”) toggle
What upstream Qwen recommends
Qwen3’s official docs describe two ways to switch between thinking and non-thinking modes:
1) Stateless hard switch (recommended for reliability)
Append a final assistant message containing only:
```text
<think>

</think>
```
This is stateless (applies to a single turn) and “strictly prevents” the model from generating thinking content.
2) Stateful soft switch
Add /no_think or /think to a user (or system) message. The model follows the most recent instruction across turns.
Reference: Qwen docs “Thinking & Non-Thinking Mode”.
https://qwen.readthedocs.io/en/stable/inference/transformers.html#thinking-non-thinking-mode
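Concretely, the two switches differ only in where the control lives. A minimal sketch of the message shapes (OpenAI-style message dicts; the prompts are illustrative, not AbstractCore internals):

```python
# 1) Stateless hard switch: append a final assistant "prefill" message
#    containing only an empty think block. Applies to this turn only.
hard_switch_messages = [
    {"role": "user", "content": "Summarize this report in two sentences."},
    {"role": "assistant", "content": "<think>\n\n</think>"},
]

# 2) Stateful soft switch: put /no_think (or /think) in a user message.
#    The model follows the most recent such instruction across turns.
soft_switch_messages = [
    {"role": "user", "content": "/no_think Summarize this report in two sentences."},
]
```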
AbstractCore strategy
AbstractCore implements a layered approach for thinking=... on Qwen3/Qwen3.5:
1) Backend-native knob (preferred)
When the serving stack supports template kwargs, we send:
- chat_template_kwargs.enable_thinking = true|false
- a compatibility alias: enableThinking = true|false
This is the “clean” approach because it aligns with Qwen’s chat templates and avoids injecting control tokens into the conversation.
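As an illustration, a request body carrying the native knob might look like the following (the endpoint shape, model name, and prompt are assumptions; the kwarg names follow Qwen's chat template as described above):

```python
import json

# Illustrative OpenAI-compatible request body. "enable_thinking" is the
# Qwen chat-template kwarg; "enableThinking" is the camelCase alias sent
# for runtimes that expect it.
payload = {
    "model": "qwen3-8b",  # hypothetical model name
    "messages": [{"role": "user", "content": "What is 17 * 23?"}],
    "chat_template_kwargs": {
        "enable_thinking": False,
        "enableThinking": False,
    },
}
print(json.dumps(payload, indent=2))
```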
2) Robust fallback for thinking="off"/"none" (LM Studio + llama.cpp/GGUF)
Some local runtimes either:
- do not expose template kwargs via API (e.g. llama-cpp-python today), or
- may ignore chat_template_kwargs for some model formats (observed in some LM Studio builds)
In those cases, AbstractCore uses Qwen’s stateless hard switch by appending a final assistant “prefill” message containing the empty think block:
- Implemented in abstractcore/providers/base.py (Qwen hard-switch marker injection).
- Used for Qwen3/Qwen3.5 on:
  - LMStudioProvider
  - HuggingFaceProvider when model_type == "gguf" (llama-cpp-python)
Note: this fallback adds an extra assistant turn in the outbound request only. Callers should not persist that marker message as part of the canonical chat history.
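A minimal sketch of that contract (the helper name is hypothetical; the real logic lives in abstractcore/providers/base.py): the marker is appended to a copy of the history for the outbound request only, so the caller's canonical history stays clean.

```python
# Qwen stateless hard-switch marker: an assistant message whose content is
# only an empty think block.
QWEN_HARD_SWITCH_MARKER = {"role": "assistant", "content": "<think>\n\n</think>"}


def build_outbound_messages(history, thinking="medium"):
    """Return the message list to send; never mutates the caller's history."""
    outbound = list(history)
    if thinking in ("off", "none"):
        # Append the marker to the outbound copy only.
        outbound.append(dict(QWEN_HARD_SWITCH_MARKER))
    return outbound


# The canonical chat history is unchanged; only the outbound copy carries
# the extra assistant turn.
history = [{"role": "user", "content": "Hello"}]
outbound = build_outbound_messages(history, thinking="none")
```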
3) Why we do not rely on /no_think as the primary switch
/no_think is a “soft” instruction and can be unreliable when:
- The instruction is not placed in a position the model “sees” as authoritative
- The serving stack rewrites prompts or inserts additional wrapper text
- The runtime ignores or alters the chat template behavior
The assistant-prefill hard switch is stateless and robust, and matches Qwen’s own documented method.