Providers

Swival supports local, hosted, and API-based model providers:

LM Studio — local inference
llama.cpp — local llama.cpp server
HuggingFace Inference API — hosted inference
OpenRouter — multi-provider access through a single API
Generic (OpenAI-compatible) — any OpenAI-compatible server
Apple Foundation Models — experimental, local Apple Foundation Models server
Google Gemini API — Google's models via API key
Gemini Enterprise Agent Platform — Gemini through Google Cloud (formerly Vertex AI)
ChatGPT Plus/Pro — OpenAI models via your existing subscription
AWS Bedrock — models hosted on AWS
Command (External Program) — shell out to an external program

Switching Between Providers

If you use more than one provider regularly, define named profiles in your config instead of retyping flags each time. See Profiles for the full syntax.

[profiles.local]
provider = "lmstudio"
model = "qwen3-coder-next"

[profiles.gpt5]
provider = "chatgpt"
model = "gpt-5.5"
reasoning_effort = "high"

Then switch with --profile at startup, or /profile mid-session:

swival --profile local "quick task"
swival --profile gpt5 "hard task"

swival> /profile gpt5

Profiles switch the whole stack at once. To change only the model within the current provider, use the /model command in the REPL: with no argument it opens an interactive picker over the models the provider can serve right now, /model NAME switches directly, and favorites you star in the picker are remembered across sessions. Bedrock, GEAP, and the command provider have no model listing, but /model ID still works. See Usage for the details.

The sections below document each provider's flags, authentication, and behavior in detail.

LM Studio

LM Studio is the default provider and usually requires no flags when the local server is already running with a loaded model.

At startup, Swival calls http://127.0.0.1:1234/api/v1/models unless you override --base-url. It looks for the first model entry with type: "llm" and a non-empty loaded_instances array, then extracts the model identifier and current context length from that payload. If no loaded model is found, Swival exits and asks you to load a model or pass --model explicitly.

If LM Studio is running on another host or port, set --base-url.

swival --base-url http://192.168.1.100:1234 "task"

If you want to bypass auto-discovery, pass --model.

swival --model "qwen3-coder-next" "task"

If you pass --max-context-tokens, Swival may reload the model through LM Studio's /api/v1/models/load endpoint.

swival --max-context-tokens 131072 "task"

If the requested value already matches the loaded context length, no reload happens.

When a reload is required, it can take noticeable time depending on model size and hardware.

HuggingFace Inference API

For HuggingFace, --model is required and must be in org/model format. Authentication comes from HF_TOKEN by default or --api-key if you pass one explicitly. If you don't know which model to pick, the /model command in the REPL (and the first-run wizard) opens an explorer over the models currently served by inference providers, with per-provider context and pricing details plus each model's aggregate tool-support status; see Usage.

export HF_TOKEN=hf_your_token_here
swival --provider huggingface --model zai-org/GLM-5.2 "task"

Serverless HuggingFace endpoints often expose smaller context windows than local deployments, so long multi-turn coding sessions can hit context pressure sooner. When --max-context-tokens is not set, Swival looks the window up in LiteLLM's model registry; set the flag explicitly if your model is not listed there.

For dedicated endpoints, keep the same model identifier and pass your endpoint URL and key.

swival --provider huggingface \
    --model zai-org/GLM-5.2 \
    --base-url https://xyz.endpoints.huggingface.cloud \
    --api-key hf_your_key \
    "task"

HuggingFace Inference Endpoints

HuggingFace Inference Endpoints let you deploy any supported model on dedicated infrastructure. Create an endpoint from the HuggingFace UI, then point Swival at it with --base-url.

swival --provider huggingface \
    --model Qwen/Qwen3.5-35B-A3B \
    --base-url https://tfg1ghx03o7xuv5p.us-east-1.aws.endpoints.huggingface.cloud

Most inference endpoints use vLLM as the serving backend. For tool calling to work, you must add the following to the Container Arguments field in your endpoint's configuration on HuggingFace:

--enable-auto-tool-choice --tool-call-parser qwen3_xml

The --tool-call-parser value depends on the model you deploy. For Qwen models use qwen3_xml, for other model families check the vLLM tool calling documentation for the correct parser name. Without these arguments, the endpoint will not return structured tool calls and Swival will not be able to use its tools.

For recently released models, the default vLLM version configured in Inference Endpoints may not support them yet. If you hit errors during model loading, set the Engine URI in your endpoint configuration to vllm/vllm-openai:latest to use a more recent build.

Some models served through vLLM leak internal reasoning markers like <think> tags into their responses. If you see these in the output, enable --sanitize-thinking to strip them. See Thinking Tag Sanitization for details.

Dedicated endpoints usually let you use the full deployed model context window rather than tighter serverless limits.

OpenRouter

For OpenRouter, --model is required and authentication comes from OPENROUTER_API_KEY or --api-key.

export OPENROUTER_API_KEY=sk_or_your_token_here
swival --provider openrouter --model z-ai/glm-5.2 "task"

If you use an OpenRouter-compatible custom endpoint, set --base-url.

swival --provider openrouter \
    --model z-ai/glm-5.2 \
    --base-url https://custom.openrouter.endpoint \
    --api-key sk_or_key \
    "task"

OpenRouter models vary widely in context limits. When --max-context-tokens is not set, Swival looks the window up in LiteLLM's model registry. If the model is missing from the registry, or you want a different limit, set the value explicitly:

swival --provider openrouter --model z-ai/glm-5.2 \
    --max-context-tokens 131072 "task"

Pass bare model identifiers like z-ai/glm-5.2. If you accidentally include a provider prefix (e.g. openrouter/z-ai/glm-5.2), Swival detects and corrects the double prefix.

llama.cpp

llama.cpp runs GGUF models locally through llama-server, which exposes an OpenAI-compatible API. Start the server first, then point Swival at it.

Start llama-server with a model from HuggingFace:

llama-server \
    --reasoning auto \
    --fit on \
    -hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL \
    --temp 1.0 --top-p 0.95 --top-k 64

--reasoning auto lets the model use chain-of-thought when it helps. --fit on automatically sizes the context window and batch parameters to fit in available memory. The -hf flag downloads the model directly from HuggingFace on first run and caches it locally.

Once the server is listening (default port 8080), connect Swival:

swival --provider llamacpp "task"

Swival auto-detects the model name from the server, so --model is not required. It also auto-detects the context window: when --max-context-tokens is not set, it reads n_ctx from the server's /props endpoint, falling back to the OpenAI-compatible /v1/models payload. The default base URL is http://127.0.0.1:8080. To override either:

swival --provider llamacpp \
    --base-url http://192.168.1.10:9090 \
    --model my-model \
    "task"

Generic (OpenAI-compatible)

The generic provider works with any server that exposes an OpenAI-compatible chat completions endpoint. This covers mlx_lm.server, ollama, vLLM, LocalAI, text-generation-webui, DeepSeek API, and similar tools.

Both --model and --base-url are required. Pass the server's root URL without /v1 — Swival appends it automatically. If your URL already ends in a version segment, that's fine too: /v1, or a provider that versions differently such as Z.AI's /api/paas/v4, is left untouched.

# mlx_lm.server
swival --provider generic \
    --base-url http://127.0.0.1:8080 \
    --model mlx-community/Qwen3-Coder-480B-A35B-4bit \
    "task"

# ollama
swival --provider generic \
    --base-url http://127.0.0.1:11434 \
    --model qwen3:32b \
    "task"

# DeepSeek API
export DEEPSEEK_API_KEY=sk-...
swival --provider generic \
    --base-url https://api.deepseek.com \
    --model deepseek-chat \
    --api-key "$DEEPSEEK_API_KEY" \
    --max-output-tokens 8192 \
    "task"

# Z.AI — note the /v4 endpoint, which Swival keeps as-is
export ZAI_API_KEY=...
swival --provider generic \
    --base-url https://api.z.ai/api/paas/v4 \
    --model glm-5.2 \
    --api-key "$ZAI_API_KEY" \
    "task"

No API key is required for most local servers. If your server needs one, pass --api-key or set OPENAI_API_KEY.

export OPENAI_API_KEY=sk-...
swival --provider generic \
    --base-url https://my-server.example.com \
    --model my-model \
    "task"

There is no model auto-discovery, so --model is always required. The context window, however, is auto-detected: when --max-context-tokens is not set, Swival probes the server's /v1/models payload for a context-length field (max_model_len, max_context_length, context_length, or context_window). Set --max-context-tokens manually if your server does not advertise the window.

Some providers gate access based on the User-Agent header. Use --user-agent to set it explicitly. For example, Kimi's coding API requires a KimiCLI user agent:

export KIMI_API_KEY=sk-kimi-...
swival --provider generic \
    --base-url https://api.kimi.com/coding/v1 \
    --model kimi-for-coding \
    --api-key "$KIMI_API_KEY" \
    --user-agent "KimiCLI/Swival" \
    "task"

Apple Foundation Models

The applefm provider talks to a local Apple Foundation Models server that exposes an OpenAI-compatible endpoint at http://127.0.0.1:1976/v1. It is experimental, and tool support is limited — Swival sanitizes the tool schemas it sends, dropping any the Foundation Models runtime cannot handle.

swival --provider applefm "task"

Two models are available. Private Cloud Compute (pcc) is the default and the only one with enough room to do real work — roughly 32K tokens. The on-device system model tops out near 4K tokens and is experimental at best; select it with --model system. Swival defaults to pcc when you do not name a model.

The server does not report its context window, so Swival falls back to measured defaults (about 32K for pcc, about 4K for system). Override either with --max-context-tokens, and point at a non-default server with --base-url.

Google Gemini API

The google provider connects to Google's Gemini API through its OpenAI-compatible endpoint (/v1beta/openai).

--model is required. Authentication comes from --api-key, GEMINI_API_KEY, or OPENAI_API_KEY.

export GEMINI_API_KEY=...
swival --provider google \
    --model gemini-3-flash \
    "task"

When --max-context-tokens is not set, Swival auto-detects the context window from the model's known limits. If detection fails, context length is unknown and compaction may not trigger at the right time — set --max-context-tokens explicitly if you hit issues.

--base-url overrides the default endpoint if you need a custom one.

Gemini Enterprise Agent Platform

The geap provider connects to Google's Gemini models through Vertex AI / Gemini Enterprise Agent Platform. Unlike the google provider which uses a public API key, geap uses Google Cloud project credentials and is designed for enterprise setups.

--provider vertexai is accepted as an alias for --provider geap.

--model, --gcp-project, and --location are required. Authentication uses Google Application Default Credentials — no API key.

gcloud auth application-default login
swival --provider geap \
    --gcp-project my-gcp-project \
    --location us-central1 \
    --model gemini-3.1-pro \
    "task"

For service accounts, set GOOGLE_APPLICATION_CREDENTIALS instead of running the gcloud login:

export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json
swival --provider geap \
    --gcp-project my-gcp-project \
    --location us-central1 \
    --model gemini-3.1-pro \
    "task"

--gcp-project can also be set via the GOOGLE_CLOUD_PROJECT environment variable:

export GOOGLE_CLOUD_PROJECT=my-gcp-project
swival --provider geap --location us-central1 --model gemini-3.1-pro "task"

Pass bare model names like gemini-3.1-pro. Do not include a vertex_ai/ prefix; Swival adds it automatically and rejects prefixed names with a clear error.

When --max-context-tokens is not set, Swival looks the window up in LiteLLM's model registry under the vertex_ai/ prefix. Set the flag explicitly if your model is not listed there.

In config:

provider = "geap"
model = "gemini-3.1-pro"
project = "my-gcp-project"
location = "us-central1"

ChatGPT Plus/Pro

The chatgpt provider lets you use OpenAI models through your existing ChatGPT Plus or ChatGPT Pro subscription, without needing a separate API key.

Authentication uses an OAuth device-code flow — on first use, Swival prints a device code and a verification URL to your terminal. Open the URL, enter the code, and authorize with your ChatGPT account. The resulting tokens are cached locally and refreshed automatically on subsequent runs.

If you need to pass an API key explicitly (for example, when using --self-review which passes credentials via environment variables), set CHATGPT_API_KEY or use --api-key.

--model is required. There is no default model.

swival --provider chatgpt --model gpt-5.5 "task"

On the first run, you will see a device-code prompt with a URL and a code to enter in your browser. Once you complete the flow, the OAuth tokens are stored locally and refreshed automatically. To remove the cached tokens and force the device-code flow on the next run, use swival --logout.

Supported model names may change over time. Check OpenAI's documentation for the current model list and naming conventions.

When --max-context-tokens is not set, Swival looks the window up in LiteLLM's model registry. Set the flag explicitly if your model is not listed there.

Two environment variables are available for advanced use. CHATGPT_TOKEN_DIR overrides the default token storage directory, and CHATGPT_AUTH_FILE overrides the token filename or path. Use --base-url to override the API base URL.

export CHATGPT_TOKEN_DIR=/path/to/tokens
swival --provider chatgpt --model gpt-5.5 "task"
swival --logout

The --top-p, --seed, and tool_choice parameters are not supported by the ChatGPT Plus/Pro backend. Swival drops them automatically when using this provider.

Models like gpt-5.5 support tunable reasoning effort. Use --reasoning-effort to control how much the model thinks before responding:

swival --provider chatgpt --model gpt-5.5 --reasoning-effort high "task"

You can also combine reasoning effort with priority processing (service_tier: "priority") for faster responses without sacrificing quality. See the Service Tiers section.

No other configuration is needed.

AWS Bedrock

The bedrock provider connects to AWS Bedrock. It uses your existing AWS credentials — --api-key is not supported.

--model is required. Use the Bedrock model ID.

export AWS_ACCESS_KEY_ID=AKIA...
export AWS_SECRET_ACCESS_KEY=...
export AWS_REGION_NAME=us-east-2
swival --provider bedrock --model global.anthropic.claude-opus-4-6-v1 "task"

Swival picks up credentials from environment variables, ~/.aws/credentials, ~/.aws/config, IAM roles, and SSO. To use a named profile, pass --aws-profile or set AWS_PROFILE.

swival --provider bedrock --model global.anthropic.claude-opus-4-6-v1 \
    --aws-profile bedrock "task"

Use --base-url to set the region, or pass a custom endpoint URL.

swival --provider bedrock --model global.anthropic.claude-opus-4-6-v1 \
    --base-url us-east-2 "task"

Note: the region env var is AWS_REGION_NAME, not AWS_DEFAULT_REGION.

When --max-context-tokens is not set, Swival reads the window from LiteLLM's model registry for the Bedrock model ID. Set the flag explicitly if your model is not listed there.

For cross-region inference, use the regional prefix in the model ID (e.g. us., eu., apac.).

swival --provider bedrock --model us.anthropic.claude-opus-4-6-v1 "task"

Extra Provider Parameters

Some models and servers accept parameters that go beyond the standard OpenAI API. Use --extra-body to pass them through. The value is a JSON object that gets forwarded directly to the API call.

For example, Qwen models served through vLLM can disable internal thinking mode:

swival --provider generic \
    --base-url http://127.0.0.1:8000 \
    --model Qwen/Qwen3.5-35B-A3B \
    --extra-body '{"chat_template_kwargs": {"enable_thinking": false}}' \
    "task"

You can also set this in config so you don't repeat it every time:

provider = "generic"
base_url = "http://127.0.0.1:8000"
model = "Qwen/Qwen3.5-35B-A3B"
extra_body = { chat_template_kwargs = { enable_thinking = false } }

The dictionary is forwarded as extra_body to the provider's API. Refer to your model or server documentation for supported parameters.

When using vLLM as the inference backend, models may leak internal reasoning markers like <think> tags into their output even with thinking disabled. Use --sanitize-thinking to strip them. See Thinking Tag Sanitization for details.

For reasoning effort specifically, Swival provides a dedicated --reasoning-effort flag instead of requiring extra_body. See Customization for details.

Command (External Program)

The command provider shells out to an external program instead of calling an API. This is useful when you want to wrap an existing CLI tool — such as codex exec --skip-git-repo-check --full-auto --ephemeral, ollama run, or a custom script — as Swival's backend.

The conversation transcript is written to the program's stdin, and the program's stdout is read back as the model response. --model holds the command string, which is split with shlex. --base-url and --api-key are ignored.

Tool calling is supported when command execution is enabled (the default, or via --commands). The external program can request tool execution by emitting <swival:call> XML blocks in its output.

Swival parses these, dispatches the tool calls, appends results to the transcript, and re-invokes the command. This loop continues (up to 20 rounds) until the program responds without tool calls. With --commands none, the native run_command and run_shell_command tools are not exposed, but the program can still call MCP, A2A, and skill tools through the same XML block protocol when those are configured.

swival --provider command --model "codex exec --skip-git-repo-check --full-auto --ephemeral" "task"

Or in config:

provider = "command"
model = "codex exec --skip-git-repo-check --full-auto --ephemeral"

Because the external program handles all model routing, there is no auto-discovery. If you set --max-context-tokens, Swival will apply output clamping and graduated compaction as usual; otherwise context management is left to the command itself.

Prompt Caching

For providers that support explicit cache annotations, Swival automatically marks the system message as cacheable each turn. This avoids re-processing the system prompt and tool schemas on every call, which typically saves 30–60% of input token costs in long sessions.

Provider	Caching mechanism	Notes
Anthropic (via OpenRouter)	Explicit `cache_control` injected by Swival	System message cached; tool schemas not cached in Phase 1
Google Gemini	Explicit `cache_control` injected by Swival	Via `openrouter/google/...` or native `google` provider
AWS Bedrock	Explicit `cache_control` injected by Swival	Supported for Anthropic models on Bedrock
OpenAI / Deepseek	Automatic (provider-side)	No annotation needed; prompts >1024 tokens cached automatically
LM Studio	None	Local inference, no server-side cache
Vertex AI (`geap`/`vertexai`)	None	Excluded: Vertex AI rejects cached content when tools or system instructions are in the same request
Generic with custom base_url	Best effort	Annotation injected only when LiteLLM recognizes the model as cache-capable

Cache annotation is applied automatically when the model is known to support it (Swival defers to LiteLLM's supports_prompt_caching check). It is injected for every provider except LM Studio and Vertex AI; if the call succeeds with a provider that ignores the annotation, the extra field is silently dropped.

When diagnostics are enabled (the default unless --quiet is set), Swival prints cache hit and write stats to stderr after each turn:

Prompt cache: 4821 tokens cached
Prompt cache: 6103 tokens written to cache

To opt out of explicit cache annotations, pass --no-prompt-cache. This only suppresses the injection; providers that cache automatically are unaffected.