GitHub - antoinezambelli/forge: A Python framework for self-hosted LLM tool-calling and multi-step agentic workflows

G github.com ↗

▲ 676 points • 250 comments • by zambelli • 5d ago • HN discussion ↗

Pangram verdict · v3.3

We believe that this document is fully AI-generated

93 %

AI likelihood · overall

0% human-written 100% AI-generated

SEGMENTS · HUMAN 0 of 5

SEGMENTS · AI 5 of 5

WORD COUNT 1,073

PEAK AI % 100% · §1

Analyzed

May 19

backend: pangram/v3.3

Segments scanned

5 windows

avg 215 words each

Distribution

0 / 100%

human / AI fraction

Verdict

Pangram v3.3

Article text · 1,073 words · 5 segments analyzed

Human AI-generated

§1 AI · 100%

A reliability layer for self-hosted LLM tool-calling. Forge lifts an 8B local model to the top of its class on multi-step agentic workflows through guardrails (rescue parsing, retry nudges, step enforcement) and context management (VRAM-aware budgets, tiered compaction). The current top self-hosted config (Ministral-3 8B Instruct Q8 on llama-server) scores 86.5% across forge's 26-scenario eval suite — and 76% on the hardest tier. Three ways to use it:

WorkflowRunner — Define tools, pick a backend, run structured agent loops. Forge manages the full lifecycle: system prompts, tool execution, context compaction, and guardrails. SlotWorker adds priority-queued access to a shared inference slot with auto-preemption — for multi-agent architectures where specialist workflows share a GPU slot. Best when you're building on forge directly.

Guardrails middleware — Use forge's reliability stack (composable middleware) inside your own orchestration loop. You control the loop; forge validates responses, rescues malformed tool calls, and enforces required steps.

Proxy server — Drop-in OpenAI-compatible proxy (python -m forge.proxy) that sits between any client (opencode, Continue, aider, etc.) and a local model server. Applies guardrails transparently — the client thinks it's talking to a smarter model.

Supports Ollama, llama-server (llama.cpp), Llamafile, and Anthropic as backends. Requirements

Python 3.12+ A running LLM backend (see below)

Install pip install forge-guardrails # core only pip install "forge-guardrails[anthropic]" # + Anthropic client For development: git clone https://github.com/antoinezambelli/forge.git cd forge pip install -e ".[dev]" Backend setup (pick one) llama-server (recommended — top 10 eval configs all run on llama-server): # Install from https://github.com/ggml-org/llama.cpp/releases

§2 AI · 100%

llama-server -m path/to/Ministral-3-8B-Instruct-2512-Q8_0.gguf --jinja -ngl 999 --port 8080 Ollama (alternative — easier setup, slightly weaker on harder workloads): # Install from https://ollama.com/download ollama pull ministral-3:8b-instruct-2512-q4_K_M Anthropic (API, no local GPU needed): pip install -e ".[anthropic]" export ANTHROPIC_API_KEY=sk-... See Backend Setup for full instructions and Model Guide for which model fits your hardware. Quick Start import asyncio from pydantic import BaseModel, Field from forge import ( Workflow, ToolDef, ToolSpec, WorkflowRunner, OllamaClient, ContextManager, TieredCompact, )

def get_weather(city: str) -> str: return f"72°F and sunny in {city}"

class GetWeatherParams(BaseModel): city: str = Field(description="City name")

workflow = Workflow( name="weather", description="Look up weather for a city.", tools={ "get_weather": ToolDef( spec=ToolSpec( name="get_weather", description="Get current weather", parameters=GetWeatherParams, ), callable=get_weather, ), }, required_steps=[], terminal_tool="get_weather", system_prompt_template="You are a helpful assistant. Use the available tools to answer the user.", )

async def main(): client = OllamaClient(model="ministral-3:8b-instruct-2512-q4_K_M", recommended_sampling=True) ctx = ContextManager(strategy=TieredCompact(keep_recent=2), budget_tokens=8192) runner = WorkflowRunner(client=client, context_manager=ctx) await runner.run(workflow, "What's the weather in Paris?")

asyncio.run(main()) For multi-step workflows, multi-turn conversations, and backend auto-management, see the User Guide. If you're building a long-running session (CLI, chat server, voice assistant), see the long-running session advisory for important guidance on filtering transient messages.

§3 AI · 100%

Proxy Server Drop-in replacement for a local model server. Point any OpenAI-compatible client at the proxy and get forge's guardrails for free. # External mode — you manage llama-server, forge proxies it python -m forge.proxy --backend-url http://localhost:8080 --port 8081

# Managed mode — forge starts llama-server and the proxy together python -m forge.proxy --backend llamaserver --gguf path/to/model.gguf --port 8081 Then configure your client to use http://localhost:8081/v1 as the API base URL. Note: The proxy automatically injects a synthetic respond tool when tools are present in the request. The model calls respond(message="...") instead of producing bare text, keeping it in tool-calling mode where forge's full guardrail stack applies. The respond call is stripped from the outbound response — the client sees a normal text response (finish_reason: "stop") and never knows the tool exists. This is essential for small local models (~8B), which cannot be trusted to choose correctly between text and tool calls — guiding them to a tool is a must. See ADR-013 for the full analysis. Backends

Backend Best for Native FC?

Ollama Easiest setup, model management built-in Yes

llama-server Best performance, full control Yes (with --jinja)

Llamafile Single binary, zero dependencies No (prompt-injected)

Anthropic Frontier baseline, hybrid workflows Yes

See Backend Setup for installation and Model Guide for which model to pick. Running Tests python -m pytest tests/ -v --tb=short python -m pytest tests/ --cov=forge --cov-report=term-missing Eval Harness 26 scenarios measuring how reliably a model + backend combo navigates multi-step tool-calling workflows — split into an OG-18 baseline tier and an 8-scenario advanced_reasoning tier for top-end separation.

§4 AI · 100%

See Eval Guide for full CLI reference. # llama-server (start in another terminal first; see Eval Guide) python -m tests.eval.eval_runner --backend llamafile --llamafile-mode prompt --gguf "path/to/Ministral-3-8B-Instruct-2512-Q8_0.gguf" --runs 10 --stream --verbose

# Batch eval (JSONL output, automatic resume) python -m tests.eval.batch_eval --config all --runs 50

# Reports (ASCII table, HTML dashboard, markdown views) python -m tests.eval.report eval_results.jsonl Project Structure src/forge/ __init__.py # Public API exports errors.py # ForgeError hierarchy server.py # setup_backend(), ServerManager, BudgetMode core/ messages.py # Message, MessageRole, MessageType, MessageMeta workflow.py # ToolSpec, ToolDef, ToolCall, TextResponse, Workflow inference.py # run_inference() — shared front half (compact, fold, validate, retry) runner.py # WorkflowRunner — the agentic loop slot_worker.py # SlotWorker — priority-queued slot access steps.py # StepTracker guardrails/ nudge.py # Nudge dataclass response_validator.py # ResponseValidator, ValidationResult step_enforcer.py # StepEnforcer, StepCheck error_tracker.py # ErrorTracker clients/ base.py # ChunkType, StreamChunk, LLMClient protocol ollama.py # OllamaClient (native FC) llamafile.py # LlamafileClient (native FC or prompt-injected) anthropic.py # AnthropicClient (frontier baseline) context/ manager.py # ContextManager, CompactEvent strategies.py # CompactStrategy, NoCompact, TieredCompact, SlidingWindowCompact hardware.py # HardwareProfile, detect_hardware() prompts/ templates.py # Tool prompt builders (prompt-injected path) nudges.py # Retry and step-enforcement nudge templates tools/

§5 AI · 100%

respond.py # Synthetic respond tool (respond_tool(), respond_spec()) proxy/ proxy.py # ProxyServer — programmatic start/stop API server.py # Raw asyncio HTTP server, SSE streaming handler.py # Request handler — bridge between HTTP and run_inference convert.py # OpenAI messages ↔ forge Messages conversion tests/ unit/ # 865 deterministic tests — no LLM backend required eval/ # Eval harness — model qualification against real backends

Documentation

User Guide — Usage patterns, multi-turn, context management, guardrails, slot worker, long-running session advisory Model Guide — Which model and backend for your hardware Backend Setup — Backend installation and server setup Eval Guide — Eval harness CLI reference, batch eval Architecture — Full design document Workflow Internals — Workflow design and runner internals Contributing — How to set up, test, and add new backends or scenarios

Paper The forge guardrail framework and ablation study are published as:

Zambelli, A. Forge: A Reliability Layer for Self-Hosted LLM Tool-Calling. https://doi.org/10.1145/3786335.3813193

A pre-publication preprint is also available at docs/forge_ieee_preprint.pdf — kept as a historical artifact. Cite the published version above; the DOI link may not resolve immediately depending on the publisher's release timing. License MIT — Copyright (c) 2025-2026 Antoine Zambelli