AI Gateway

Barbacane ships an OpenAI-compatible AI gateway built from one dispatcher and four middlewares. This page is a quickstart — it walks through the minimum viable configuration, the three protocol surfaces, and the layering of policy concerns. For the full reference of each component, follow the cross-links.

What you get

| Surface | Endpoint | Purpose |
|---|---|---|
| Chat Completions | POST /v1/chat/completions | OpenAI Chat Completions; Anthropic translated to/from Messages |
| Responses API (stateless) | POST /v1/responses | OpenAI Responses; synthetic resp_<uuid-v7> ids; previous_response_id returns 400 |
| Model catalog | GET /v1/models | Aggregated catalog across every unique provider declared in the config |

All three are bound to the same ai-proxy dispatcher. The dispatcher routes a request to a provider by glob-matching the client-supplied model field — the gateway never declares its own (ADR-0030 §0).
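
As a concrete illustration, only the model field drives routing. The body below is an ordinary Chat Completions request (rendered as YAML here for readability; on the wire it is JSON) and would match the claude-* route in the default table further down the page:

model: claude-sonnet-4-20250514
messages:
  - role: user
    content: "Summarise this design doc in three bullets."
# Only `model` is glob-matched against the routes; no other body field affects provider selection.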

Quickstart — drop-in spec fragment

The simplest way to bring up the full gateway is to drop the shipped spec fragment into your project’s specs/ folder:

mkdir -p specs/
cp /path/to/barbacane/schemas/ai-gateway.yaml specs/ai-gateway.yaml

Multi-file spec discovery picks it up at compile time alongside your tenant spec. The fragment declares the three operations bound to ai-proxy with a YAML anchor for the dispatcher config and reads provider credentials from environment variables via env:// references:

export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...
export OLLAMA_BASE_URL=http://localhost:11434  # optional; shown value is the default
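
For orientation, the sketch below shows the general shape such a fragment can take: one dispatcher config defined under a YAML anchor, reused by all three operations, with credentials pulled in through env:// references. The x-barbacane-dispatcher extension key and the provider field names (api_key, base_url) are assumptions made for this sketch; the shipped schemas/ai-gateway.yaml is the authoritative form.

# Sketch only: the extension key and provider field names are assumptions,
# not copied from the shipped fragment.
x-ai-proxy-config: &ai_proxy_config      # defined once, reused via a YAML anchor
  providers:
    openai:
      api_key: env://OPENAI_API_KEY      # credential resolved from the environment
    anthropic:
      api_key: env://ANTHROPIC_API_KEY
    ollama:
      base_url: env://OLLAMA_BASE_URL

paths:
  /v1/chat/completions:
    post:
      x-barbacane-dispatcher: { name: ai-proxy, config: *ai_proxy_config }
  /v1/responses:
    post:
      x-barbacane-dispatcher: { name: ai-proxy, config: *ai_proxy_config }
  /v1/models:
    get:
      x-barbacane-dispatcher: { name: ai-proxy, config: *ai_proxy_config }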

Default routing in the fragment:

| Glob | Provider |
|---|---|
| claude-* | Anthropic |
| gpt-* | OpenAI |
| o[1-4]* | OpenAI (reasoning series) |
| * | Ollama (catch-all — see caveat) |

Catch-all caveat. The fragment ships with pattern: "*" → Ollama as the last route. This is convenient for local dev but in production it means typos in model (e.g. gtp-4o) silently route to Ollama instead of returning a clean 400 no_route. Drop the catch-all from your copy of the fragment if you want strict validation.
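
Translated into config, the default table corresponds roughly to the sketch below, continuing the assumed field names from the earlier sketch; your copy of the fragment is the source of truth. The last route is the catch-all described above, and the allow list shows the dispatcher-side per-target gating covered later on this page:

routes:
  - pattern: "claude-*"
    provider: anthropic
  - pattern: "gpt-*"
    provider: openai
  - pattern: "o[1-4]*"        # reasoning series
    provider: openai
  - pattern: "*"              # catch-all; delete this route for strict 400 no_route behaviour
    provider: ollama

providers:
  openai:
    allow: ["gpt-4o", "gpt-4o-mini", "o[1-4]*"]   # per-target allow list, illustrative values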

To customise (Azure target, restricted catalog, named tiers), copy the fragment into your specs/ folder and edit your copy — it’s a regular OpenAPI document.

Layering policy on top

The dispatcher owns provider routing and catalog policy. Layer middlewares on the same operation for content and cost policy:

Order of evaluation. Middlewares run before the dispatcher, in declaration order. A cel on_match.deny short-circuits the chain — the dispatcher never sees the request — so cel body-gating fires before routes and per-target allow/deny. Use this when the dispatcher’s static lists can’t express the rule (e.g. it depends on JWT claims); otherwise prefer allow/deny so the policy applies on every resolution path.

# Stack on top of the operations declared in schemas/ai-gateway.yaml
paths:
  /v1/chat/completions:
    post:
      x-barbacane-middlewares:
        - name: jwt-auth
          config:
            issuer: "https://auth.example.com"

        # Per-tier model gating using request body + claims (cel body_json)
        - name: cel
          config:
            expression: >
              request.body_json.model.startsWith('gpt-4')
              && request.claims.tier != 'premium'
            on_match:
              deny:
                status: 403
                code: model_not_permitted_for_tier
                message: "gpt-4* is restricted to the premium tier"

        # Tier-driven profile selection for the AI middlewares
        - name: cel
          config:
            expression: "request.claims.tier == 'premium'"
            on_match:
              set_context:
                ai.policy: premium

        - name: ai-prompt-guard
          config:
            default_profile: standard
            profiles:
              standard: { max_messages: 50, blocked_patterns: ["(?i)ignore previous instructions"] }
              premium:  { max_messages: 200 }

        - name: ai-token-limit
          config:
            default_profile: standard
            partition_key: "header:x-auth-sub"
            profiles:
              standard: { quota: 10000,  window: 60 }
              premium:  { quota: 100000, window: 60 }

        - name: ai-cost-tracker
          config:
            prices:
              openai/gpt-4o:                      { prompt: 0.0025, completion: 0.01 }
              anthropic/claude-sonnet-4-20250514: { prompt: 0.003,  completion: 0.015 }
              ollama/mistral:                     { prompt: 0.0,    completion: 0.0 }

Where each concern lives

| Decision | Place | Mechanism |
|---|---|---|
| Which provider serves a model | dispatcher | routes glob |
| Which models a target may serve | dispatcher | per-target allow / deny lists |
| Which target a request goes to (caller-driven) | upstream cel | set_context: { ai.target: ... } |
| Per-tier “this caller can’t use that model” | upstream cel | on_match.deny on request.body_json.model |
| Prompt validation, token budgets, response redaction | AI middlewares | ai-policy profile selection |
| Per-call cost in USD | ai-cost-tracker | reads ai.provider / ai.model / ai.prompt_tokens / ai.completion_tokens set by the dispatcher |
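
Caller-driven target selection (third row above) follows the same pattern as the tier example earlier: an upstream cel middleware writes ai.target into the request context and the dispatcher resolves against it, with its allow/deny lists still enforced. The claim and target names in this sketch are hypothetical:

- name: cel
  config:
    expression: "request.claims.org == 'acme-eu'"   # hypothetical claim
    on_match:
      set_context:
        ai.target: azure-eu                         # hypothetical target declared in the dispatcher config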

The dispatcher’s allow/deny is enforced on every resolution path — context-driven dispatch included — so a cel misconfig cannot leak a denied model. Reach for cel + body_json only when the rule depends on caller attributes the dispatcher doesn’t see (claims, headers, time-of-day) or when a custom error code is needed.

Operating notes

  • Stateless Responses API. previous_response_id returns 400 previous_response_id_not_supported. store: true is permissive but emits Warning: 299 and increments barbacane_plugin_ai_proxy_responses_store_downgrades_total. Stateful storage is on the roadmap.
  • /v1/models partial failures. A single flaky upstream returns 200 OK with partial: true and a warnings array rather than a 5xx. Discovery clients should handle the partial case rather than retrying against the aggregator. The per-provider timeout is models_timeout_ms (default 5000), distinct from the LLM timeout, so one hung provider doesn’t block discovery; a tuning sketch follows this list.
  • Streaming. SSE chat-completion streams pass through unchanged. Streamed Responses on OpenAI passthrough do not rewrite the in-event id — true SSE re-encoding is deferred. For strict synthetic-id enforcement, drop "stream": true.
  • Ollama Responses. Returns 400 responses_not_supported_for_provider — Ollama’s OpenAI-compat surface is Chat Completions only.
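
Lowering the catalog timeout makes /v1/models snappier when a provider is slow. Where models_timeout_ms sits inside the dispatcher config is an assumption in this sketch; check your copy of the fragment for its actual location:

# Sketch: assumes models_timeout_ms lives at the top level of the ai-proxy config.
x-ai-proxy-config: &ai_proxy_config
  models_timeout_ms: 2000     # per-provider catalog timeout; the default is 5000
  # ...providers and routes as above...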

Going deeper