Creating Recipes
Learn how to define, test, and publish custom model deployment blueprints.
A Bloc recipe is a YAML configuration manifest that describes how to run a local AI model using llama.cpp (or other engines) on target hardware. It is not the weights themselves; it is the configuration recipe telling the Bloc CLI exactly how to download, optimize, and serve a model.
The Two-Layer Design
Every Bloc recipe YAML has two distinct layers serving different purposes:
- Registry Metadata (Hub Website): Read by the Bloc Hub website to power search filters (VRAM, platform, model size), registry cards, and recipe pages.
- Engine Configuration (CLI Execution): Passed verbatim to the Bloc CLI. The CLI translates these structured fields into engine arguments (e.g. llama-server flags) and environment setups.
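As a rough orientation, the skeleton below groups the top-level keys by layer. The grouping follows the layer comments in the template at the end of this page; the empty mappings are placeholders, not a working recipe.

```yaml
schema: "bloc/v1"

# Layer 1: registry metadata (read by the Bloc Hub website)
metadata: {}       # name, description, tags, author_notes
model: {}          # source, download location, quantization
engine: {}         # engine name, runtime, pinned version
hardware: {}       # min_vram, target_platform, gpu_count

# Layer 2: engine configuration (translated into engine flags by the CLI)
engine_config: {}

# Optional: system hooks run before the engine launches
pre_run: {}
```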
Full Schema Reference
schema (required)
Tells the Bloc CLI which parser version to use for compatibility.
schema: "bloc/v1"
extends (optional)
Inherit fields from another recipe and override only what differs. Great for multi-VRAM variants of the same model.
extends: "arnav/qwen3-base"
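A minimal sketch of a 12GB variant that inherits from the base recipe named above. It assumes child fields are merged over the parent key by key; the exact merge rules are not spelled out here, so check the resolved recipe before publishing. The variant name and override values are illustrative.

```yaml
schema: "bloc/v1"
extends: "arnav/qwen3-base"

metadata:
  name: "qwen3-30b-moe-12gb"   # hypothetical variant name
  description: "Qwen3 30B MoE tuned for 12GB CUDA cards."

hardware:
  min_vram: "12GB"

engine_config:
  n_cpu_moe: 32    # illustrative override: less expert offload than the base
  ctx_size: 16384  # illustrative override
```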
metadata (required)
metadata:
  name: "qwen3-30b-moe-8gb-cpu-offload"
  description: "What model, what hardware, what result."
  tags: [moe, long-context, 8gb, cuda, reasoning]
  author_notes: |
    Multi-line developer notes: trade-offs, experiments, and tips.
- name: Unique per-user lowercase identifier (letters, numbers, hyphens). Combined with your username, it forms the unique ID: {username}/{name}.
- description: Brief summary displayed on cards.
- tags: Array of keywords used for search filters (e.g., moe, cuda, metal, reasoning).
model (required)
Describes the GGUF model file or Hugging Face repository download source. You must specify either download_url (for a GGUF file) or hf_repo (for full repository downloads, as used by vLLM).
model:
  source: "huggingface:Qwen/Qwen3-30B-A3B"
  # For GGUF/llama.cpp downloads:
  gguf_repo: "huggingface:bartowski/Qwen_Qwen3-30B-A3B-GGUF"
  file: "Qwen_Qwen3-30B-A3B-Q4_K_M.gguf"
  download_url: "https://huggingface.co/bartowski/Qwen_Qwen3-30B-A3B-GGUF/resolve/main/Qwen_Qwen3-30B-A3B-Q4_K_M.gguf"
  # For full HF repository downloads (e.g. vLLM):
  # hf_repo: "Qwen/Qwen3-30B-A3B"
  quantization: "Q4_K_M"
  size_gb: 17.2
  parameters: "30B"
  architecture: "MoE"
- source: The upstream base model on Hugging Face (huggingface:org/repo).
- hf_repo: Hugging Face repo path (org/repo) for full model repository downloads. Validated against a strict security regex.
- file: Precise GGUF filename for deterministic download caching.
- download_url: Direct HTTPS GGUF download link.
engine (required)
engine:
  name: "llama.cpp"                  # options: "llama.cpp", "vllm"
  runtime: "native"                  # options: "native", "docker"
  version: "0.9.0"                   # pinned engine version (semver format)
  image: "vllm/vllm-openai:v0.9.0"   # Docker image tag (docker runtime only)
  tested_commit: "b5350"             # (llama.cpp only)
- runtime: How to run the engine. Defaults to native. Use docker to launch in containerized mode.
- version: A validated semver string (e.g., 0.9.0) specifying the pinned version.
- image: Validated Docker image tag format (e.g., vllm/vllm-openai:v0.9.0).
- tested_commit: The short git hash of llama.cpp that this recipe was verified against. The CLI checks this against the user's binary to alert them of version shifts.
hardware (required)
Specifies minimum recommendations for running the configuration.
hardware:
  min_vram: "8GB"           # options: 4GB, 8GB, 12GB, 16GB, 24GB, Unified (Mac)
  target_platform: "cuda"   # options: cuda, metal, rocm, cpu, vulkan
  gpu_count: 1
  recommended_ram: "32GB"
Engine Configuration (engine_config)
These fields map to engine-specific CLI flags during deployment execution. Fields set to null are ignored.
[!IMPORTANT] Cross-Engine Validation: To prevent confusing behavior, llama.cpp-only flags (gpu_layers, n_cpu_moe, batch_size, ubatch_size) cannot be defined in a vllm engine recipe, and vLLM-only flags cannot be used in a llama.cpp recipe.
[!NOTE] The Explicit Field Rule: Always define options explicitly rather than relying on default values, ensuring your recipes remain compatible if engines update their defaults.
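To illustrate the cross-engine rule, the fragment below mixes the two engines and would be expected to fail validation. The wording of the resulting error is not documented here, so none is shown.

```yaml
engine:
  name: "vllm"
  runtime: "docker"
  version: "0.9.0"
  image: "vllm/vllm-openai:v0.9.0"

engine_config:
  tensor_parallel_size: 1        # vLLM field: allowed
  gpu_memory_utilization: 0.90   # vLLM field: allowed
  gpu_layers: 99                 # llama.cpp-only field: not allowed with vllm
```

Removing gpu_layers, or switching the engine name to "llama.cpp" and dropping the vLLM fields, resolves the conflict.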
Context & Batching
- ctx_size: Size of the KV cache/context window. Memory scales linearly with this.
- batch_size: Parallelism of prompt processing (-b). High numbers speed up prompt prefill but increase peak memory. (llama.cpp only)
- ubatch_size: Micro-batch size (-ub). Halving this (e.g., from 512 to 256) can save up to ~1.4GB VRAM during prefill on tight setups. (llama.cpp only)
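A small fragment for a memory-constrained llama.cpp setup, using the values mentioned above. The numbers are starting points rather than tuned recommendations.

```yaml
engine_config:
  ctx_size: 8192     # context window; KV cache memory grows linearly with it
  batch_size: 512    # -b: prompt-processing batch
  ubatch_size: 256   # -ub: halved from 512 to lower peak prefill memory
```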
GPU Offloading
- gpu_layers: Number of layers loaded into VRAM (-ngl). Use 99 for full GPU offload. (llama.cpp only)
- split_mode: Multi-GPU splitting strategy (layer, row, or none). (llama.cpp only)
- tensor_split: Ratio used to split weights across multiple GPUs (e.g. 58,42). (llama.cpp only)
- flash_attn: Reduces VRAM usage, with savings that grow with context depth. Highly recommended on compatible hardware. (llama.cpp only)
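A sketch of a two-GPU llama.cpp layout built from the fields above. Writing tensor_split as a quoted string is an assumption; check the schema for the accepted form (string or list).

```yaml
hardware:
  gpu_count: 2

engine_config:
  gpu_layers: 99          # offload all layers
  split_mode: "layer"     # distribute whole layers across GPUs
  tensor_split: "58,42"   # assumed string form: 58% on GPU 0, 42% on GPU 1
  main_gpu: 0
  flash_attn: true
```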
Mixture-of-Experts (MoE)
- n_cpu_moe: Number of layers whose MoE expert weights are kept in system RAM instead of VRAM. (llama.cpp only)
  - 99 = All experts on CPU (maximum context, lowest VRAM usage, ~10% slower speed).
  - 32 = Partial expert offload (faster inference, uses more VRAM).
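For contrast with full offload, a partial-offload fragment might look like the following. The context size is illustrative; the right n_cpu_moe value depends on how much VRAM is left after the non-expert weights and KV cache are loaded.

```yaml
engine_config:
  gpu_layers: 99     # keep attention and shared weights on the GPU
  n_cpu_moe: 32      # partial expert offload: faster, but uses more VRAM than 99
  ctx_size: 32768    # illustrative value
  flash_attn: true
```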
KV Cache Quantization
- cache_type_k / cache_type_v: Quantization format for the key and value caches. (llama.cpp only)
  - f16: Default, unquantized 16-bit, highest memory.
  - q8_0: Near-perfect quality, ~50% reduction in KV cache VRAM (recommended default).
  - q4_0: Aggressive VRAM saving (great for deep chat context; may cause minor reasoning issues).
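The rough estimate in the comments below shows where the savings come from. The model dimensions (48 layers, 4 KV heads, head size 128) are assumed for illustration only; the bits-per-value figures are those of the q8_0 and q4_0 block formats (8.5 and 4.5 bits per value, including scales).

```yaml
engine_config:
  ctx_size: 8192
  flash_attn: true
  # KV cache size = 2 (K and V) * layers * kv_heads * head_size * ctx_size * bytes per value
  #   f16  : 2 * 48 * 4 * 128 * 8192 * 2 bytes = 768 MiB
  #   q8_0 : 768 MiB * 8.5 / 16              = 408 MiB
  #   q4_0 : 768 MiB * 4.5 / 16              = 216 MiB
  cache_type_k: "q8_0"
  cache_type_v: "q8_0"
```

Because the cache grows linearly with ctx_size, doubling the context doubles each of these figures.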
Speculative Decoding
- spec_type: Speculative decoding type (draft, or draft-mtp for native multi-token prediction). (llama.cpp only)
- spec_draft_model: Path or HF identifier for the draft model. (llama.cpp only)
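A minimal fragment enabling draft-model speculation. The draft model identifier is a hypothetical placeholder; substitute a small model that is compatible with your main model.

```yaml
engine_config:
  spec_type: "draft"
  spec_draft_model: "your-org/your-draft-model-GGUF"   # hypothetical placeholder
```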
vLLM-Specific Parameters
- tensor_parallel_size: Number of GPUs to partition the model across using Tensor Parallelism.
- gpu_memory_utilization: Fraction of GPU memory allocated for model weights and KV cache (0.0 to 1.0).
- max_model_len: Maximum sequence length capacity.
- dtype: Model weights precision data type (e.g., auto, float16, bfloat16).
- kv_cache_dtype: Key-value cache precision format.
- quantization: Specify model quantization type (e.g. awq, gptq, squeezellm).
- enable_expert_parallel: Enable parallel MoE expert processing on multi-GPU setups.
- tokenizer_mode: Tokenizer behavior override (e.g. auto, slow).
- tool_call_parser: Parser to interpret model-generated tool calls (e.g. llama3, mistral).
- reasoning_parser: Parser for model reasoning/thinking tokens (e.g. deepseek_r1).
- trust_remote_code: Must be explicitly set to true to execute custom model-specific code. This triggers an interactive user safety confirmation gate on the CLI.
- speculative_model: Pointer to the draft model for speculative decoding.
- num_speculative_tokens: Number of speculative tokens to sample.
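A sketch of a two-GPU vLLM configuration using only the fields listed above. The values are illustrative, and matching tensor_parallel_size to hardware.gpu_count is an assumed convention rather than a documented requirement.

```yaml
engine:
  name: "vllm"
  runtime: "docker"
  version: "0.9.0"
  image: "vllm/vllm-openai:v0.9.0"

hardware:
  gpu_count: 2

engine_config:
  tensor_parallel_size: 2        # assumed to match gpu_count
  gpu_memory_utilization: 0.90
  max_model_len: 8192
  dtype: "bfloat16"
  trust_remote_code: false       # set explicitly, per the Explicit Field Rule
```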
System Settings
- threads: Number of physical CPU cores to use. Do not count logical (hyper-threaded) cores; oversubscribing can cause CPU lockups.
- mlock: Locks weights in physical RAM to prevent OS swapping/latency spikes.
- jinja: Enable Jinja2 chat templates (required for modern chat formatting like Qwen/Llama).
- port: Port to serve on.
[!WARNING] Ports under 1024 (privileged) and port 0 (dynamic binding) are blocked for safety. Only ports between 1024 and 65535 are allowed.
- extra_args: Escape hatch array for custom/unsupported flags:
extra_args:
  - "--cuda-graphs"
[!WARNING] Flags passed to extra_args are validated against a strict security allowlist. Dangerous flags (like --api-key or overriding host addresses) are rejected.
Pre-Run Hooks (pre_run)
Optional system-level operations executed before the engine launches. The CLI will request confirmation from the user before executing commands.
pre_run:
  env:
    CUDA_VISIBLE_DEVICES: "0,1"
  commands:
    - "sudo nvidia-smi -pl 250"   # power limits GPU for efficiency
Common Configuration Patterns
Single GPU Budget Setup (8GB VRAM)
hardware:
  min_vram: "8GB"
  target_platform: "cuda"
engine_config:
  gpu_layers: 28
  ctx_size: 8192
  flash_attn: true
  cache_type_k: "q8_0"
  cache_type_v: "q8_0"
  threads: 8
Apple Silicon (Metal Unified Memory)
hardware:
  min_vram: "Unified"
  target_platform: "metal"
engine_config:
  gpu_layers: 99
  ctx_size: 32768
  flash_attn: true
  cache_type_k: "q8_0"
  cache_type_v: "q8_0"
  threads: 6   # matched to physical performance cores
  mlock: false
  jinja: true
MoE Offloading (Large Model, Low VRAM)
hardware:
  min_vram: "8GB"
  recommended_ram: "32GB"
engine_config:
  gpu_layers: 99
  n_cpu_moe: 99   # moves experts to system RAM
  ctx_size: 262144
  flash_attn: true
  cache_type_k: "q4_0"
  cache_type_v: "q4_0"
  jinja: true
vLLM Single-GPU Setup (Containerized)
model:
  source: "huggingface:meta-llama/Meta-Llama-3-8B-Instruct"
  hf_repo: "meta-llama/Meta-Llama-3-8B-Instruct"
  parameters: "8B"
  architecture: "Dense"
engine:
  name: "vllm"
  runtime: "docker"
  version: "0.9.0"
  image: "vllm/vllm-openai:v0.9.0"
hardware:
  min_vram: "16GB"
  target_platform: "cuda"
  gpu_count: 1
engine_config:
  tensor_parallel_size: 1
  gpu_memory_utilization: 0.90
  max_model_len: 8192
  trust_remote_code: false
  host: "127.0.0.1"
  port: 8080
YAML Template Reference
Below is a complete, fully annotated template blueprint you can copy to create your own recipes. It covers both engine branches and all validation parameters under the bloc/v1 schema.
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# BLOC RECIPE BLUEPRINT · schema bloc/v1
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
schema: "bloc/v1"
extends: null
# ─── LAYER 1: Registry Metadata (Parsed & Indexed by Hub) ──────
metadata:
  name: "your-recipe-name"   # Lowercase letters, numbers, and hyphens only
  description: "A short, one-sentence card summary."
  tags:
    - "cuda"   # cuda | metal | rocm | cpu | vulkan
    - "8gb"    # VRAM target
    - "reasoning"
  author_notes: |
    Multi-line developer details, benchmarks, and target hardware.
model:
  source: "huggingface:org/repo"
  # For llama.cpp (GGUF):
  gguf_repo: "huggingface:quantizer/repo-gguf"
  file: "model-Q4_K_M.gguf"
  download_url: "https://huggingface.co/..."
  # For vLLM (Alternative):
  # hf_repo: "org/repo"
  quantization: "Q4_K_M"
  size_gb: 4.8
  parameters: "7B"
  architecture: "Dense"
engine:
  name: "llama.cpp"        # "llama.cpp" or "vllm"
  runtime: "native"        # "native" or "docker"
  version: "0.9.0"         # pinned engine version (semver format)
  image: null              # Docker image tag (required if runtime is docker)
  tested_commit: "b5350"   # (llama.cpp only)
hardware:
  min_vram: "8GB"           # 4GB | 8GB | 12GB | 16GB | 24GB | Unified
  target_platform: "cuda"   # cuda | metal | rocm | cpu | vulkan
  gpu_count: 1
  recommended_ram: "16GB"
# ─── LAYER 2: Engine Config (Translated directly to backend CLI flags)
engine_config:
  # [ Server (Shared) ]
  host: "127.0.0.1"
  port: 8080             # Must be between 1024 and 65535
  n_parallel: 1          # -np (llama.cpp) or --max-num-seqs (vLLM)
  # ─── llama.cpp-specific configurations ───
  ctx_size: 8192         # -c
  gpu_layers: 99         # -ngl (99 = offload all)
  flash_attn: true       # -fa
  mlock: false           # --mlock
  mmap: true             # Set false for --no-mmap
  split_mode: null       # --split-mode: null | "none" | "layer" | "row"
  tensor_split: null     # -ts
  main_gpu: 0            # -mg
  threads: 8             # -t
  batch_size: 512        # -b
  ubatch_size: 256       # -ub
  cache_type_k: "q8_0"   # -ctk
  cache_type_v: "q8_0"   # -ctv
  jinja: true            # --jinja
  # ─── vLLM-specific configurations ───
  # tensor_parallel_size: 1
  # gpu_memory_utilization: 0.90
  # max_model_len: 8192
  # trust_remote_code: false
  # [ Escape Hatch (Allowlisted flags only) ]
  extra_args: []
# ─── LAYER 3: Pre-Run System Hooks ─────────────────────────────
pre_run:
  env: {}
  commands: []