skein ecosystem
The AI stack
that answers
to you.
An inference proxy that hot-swaps models. An orchestrator that connects the fleet. A coding agent that finds your backends on your LAN and shows you exactly what fits in VRAM. All running on your hardware, none of it phoning home.
the ecosystem
- llama-skein
- Inference proxy. Hot-swaps models on demand behind one OpenAI-compatible endpoint. Reports VRAM and KV-cache. Advertises itself over mDNS so nothing needs a hardcoded IP.
- skein
- The orchestrator. Routes requests across the fleet, manages sessions, and gives you a single place to see everything that's running.
- opencode-skein
- AI coding agent, forked from opencode. Finds llama-skein on your LAN automatically, shows live VRAM bars, picks context sizes from real hardware data, and runs long sessions unattended.
skein is the central node. It discovers llama-skein instances on your network, routes requests to the right backend, manages fleet-wide observability, and gives opencode-skein a single consistent place to talk to.
Without skein you have two good tools. With skein you have an ecosystem. The orchestration layer is what turns a collection of running processes into something you can actually operate and reason about.
- Fleet-wide request routing
- mDNS-based backend discovery
- Session and context management
- Real-time observability across nodes
github.com/androidand/skeinAn inference proxy that sits in front of llama.cpp (and any other OpenAI-compatible engine) and exposes a single unified API with on-demand model hot-swapping, hardware and VRAM reporting, and LAN discovery. A single static Go binary with a YAML config.
Define your models in YAML. Point them at llama.cpp, vLLM, tabbyAPI, or stable-diffusion.cpp. llama-skein detects which model a request wants, loads it, serves it, and unloads the previous one — no restart, no manual intervention. It also advertises itself over mDNS so opencode-skein finds it on your LAN without a hardcoded IP.
- Hot-swap on demand, no restart
- OpenAI + Anthropic API compatible
- VRAM & KV-cache reporting via
/api/hardware - mDNS LAN advertising (zero-config discovery)
- CPU/MoE offload recommendations
- Metal · CUDA · ROCm · Vulkan · CPU
- TTL auto-unload, model aliases, swap matrices
- llama.cpp · vLLM · tabbyAPI · Whisper · more
make clean all && llama-skein --config config.yamlA maintained fork of opencode that makes local LLM backends first-class: it finds them on your network, shows you exactly what fits in VRAM, and keeps long agent sessions running unattended.
The live VRAM bar shows model weights vs. KV cache vs. free headroom — so you see when you're about to blow past memory before the model does. The context-size picker recommends a size from real hardware data and flags presets that would exceed your memory. /loop scheduling lets you set an agent session going and come back when it's done.
- Auto-discovers llama-skein via mDNS
- Live VRAM/RAM bars (weights + KV cache)
- Hardware-aware context-size picker
- /loop scheduling — run unattended
- Intelligent auto-reply (AI-to-AI)
- Loop/repetition detection + intervention
- Build agent (full access) + Plan agent (read-only)
- Upstream-safe: fork tooling proves no feature is lost on sync
curl -fsSL https://raw.githubusercontent.com/androidand/opencode/dev/install | bashwhy local-first
Six things that matter when AI infrastructure is yours.
- Zero cloud costs
- llama-skein serves completions from your hardware. Nobody invoices you per token. Run ten thousand completions today.
- Total privacy
- Your code goes to opencode-skein, which routes to llama-skein, which runs on your machine. The chain is entirely yours.
- VRAM transparency
- opencode-skein shows you live model weights vs. KV cache vs. free headroom. No more mysterious OOMs mid-session.
- Zero-config discovery
- llama-skein advertises over mDNS. opencode-skein probes your LAN and offers backends straight in /connect. No IPs, no config files.
- Runs unattended
- opencode-skein's /loop scheduling, intelligent auto-reply, and loop-detection let long agent sessions run without babysitting.
- Any engine
- llama-skein fronts llama.cpp, vLLM, tabbyAPI, stable-diffusion.cpp, Whisper — anything OpenAI-compatible. Swap the backend, keep the clients.
live manifest data
Each repo publishes a skein.json at its root. This section is built from those files — the source of truth lives with the code.
opencode, tuned for people who run their own models.
A maintained fork of opencode that makes local LLM backends first-class: it finds them on your network, shows you exactly what fits in VRAM, and keeps long agent sessions running unattended.
curl -fsSL https://raw.githubusercontent.com/androidand/opencode/dev/install | bash- Local provider auto-discoverystable
Finds Ollama, LM Studio and llama-swap (llama-skein) backends on your LAN via mDNS + port probing and offers them straight in /connect — no manual baseURL juggling.
- Context window usage in the sidebarstable
Live context-window bar: tokens used vs. the model's limit, a percentage, a token breakdown, throughput (tokens/sec), and cost for cloud providers.
- Live VRAM/RAM usage barsstable
For local models, a stacked memory bar showing model weights vs. KV cache against free VRAM/RAM — so you can see headroom before you blow past it.
- Hardware-aware context-size pickerstable
An interactive dialog that recommends a context size from real KV-cache rate + free VRAM, with presets flagged when they'd exceed memory, and applies it to the local model.
- /loop schedulingstable
Run a prompt or command on a schedule or until done — ralph-style iterate-until-complete, or interval/cron based, with list/cancel/pause/resume.
- Intelligent auto-replybeta
Keeps long sessions moving by auto-answering input prompts — static phrases, AI-to-AI continuation, or an external webhook/CLI hook.
- Loop / repetition detectionbeta
Detects when the agent is stuck repeating itself (string-similarity over a time window) and intervenes to break the loop.
- Themed local-backend loading screensstable
Bridges your active TUI theme to the local backend via an X-Loading-Theme header (e.g. pip-boy → vault-boy) for matching loading visuals.
- Self-maintaining fork toolingstable
A manifest-driven workflow (fork:verify, sync:check) that proves no fork feature is lost on upstream syncs, plus a fork-owned updater + release pipeline so the official updater can never overwrite the skein binary.
Hot-swap local models behind one OpenAI-compatible API, with hardware-aware budgeting.
An inference proxy that sits in front of llama.cpp (and other OpenAI-compatible engines) and exposes a single unified API with on-demand model hot-swapping, hardware/VRAM reporting, and LAN discovery. A single static Go binary, built for the skein ecosystem.
- On-demand model hot-swappingstable
Detects the requested model per call and loads/unloads/swaps the running model automatically — no manual restart to change models.
- Unified OpenAI-compatible APIstable
Pass-through for /v1/chat/completions, /v1/completions, /v1/embeddings, /v1/models and audio/image endpoints, so existing OpenAI clients work unchanged.
- Hardware & resource reportingstable
Reports GPU/VRAM usage, unified/system memory, loaded-model size and KV-cache estimates via /api/hardware — the data behind opencode-skein's live memory bars.
- CPU/MoE offload recommendationsbeta
Analyzes available VRAM and recommends offloading compute or mixture-of-experts layers to CPU for large MoE models (/api/models/offload).
- Config-driven model lifecyclestable
YAML config with TTL auto-unload, model aliases, per-model env vars, and grouping/swap matrices for concurrent multi-model execution.
- mDNS LAN advertisingstable
Advertises the instance over mDNS/zeroconf so opencode-skein and the skein supervisor discover and connect to it without hardcoded IPs.
- Control-plane model & capability APIstable
Skein-specific endpoints to query/patch model state, set a default model, set per-model context size (triggers a reload), and detect hardware backends (Metal/ROCm/CUDA/Vulkan).
- Any OpenAI-compatible enginestable
Front llama.cpp, vLLM, tabbyAPI, stable-diffusion.cpp or Whisper; supports container backends with graceful SIGTERM shutdown.
public release in progress
Start with
opencode-skein.
opencode-skein is available now. Install it, point it at an Ollama or LM Studio instance, and it works immediately — no skein or llama-skein required yet. Add them when you want the full fleet experience.
curl -fsSL https://raw.githubusercontent.com/androidand/opencode/dev/install | bash