Creating Recipes

Learn how to define, test, and publish custom model deployment blueprints.

A Bloc recipe is a YAML manifest that describes how to run a local AI model with llama.cpp (or another supported engine) on target hardware. A recipe does not contain the model weights; it tells the Bloc CLI exactly how to download, optimize, and serve a model.


The Two-Layer Design

Every Bloc recipe YAML has two distinct layers serving different purposes:

  1. Registry Metadata (Hub Website): Read by the Bloc Hub website to power search filters (VRAM, platform, model size), registry cards, and recipe pages.
  2. Engine Configuration (CLI Execution): Passed verbatim to the Bloc CLI. The CLI translates these structured fields into engine arguments (e.g. llama-server flags) and environment setups.
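
As a rough orientation, the skeleton below marks which top-level keys belong to which layer. The grouping follows the annotated template at the end of this page; the empty bodies are placeholders, not valid values.

```yaml
schema: "bloc/v1"

# Layer 1: registry metadata, indexed by the Hub website
metadata: {}     # name, description, tags, author_notes
model: {}        # source, download location, quantization, size
engine: {}       # engine name, runtime, pinned version
hardware: {}     # min_vram, target_platform, gpu_count

# Layer 2: engine configuration, translated by the CLI into engine flags
engine_config: {}
```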

Full Schema Reference

schema (required)

Tells the Bloc CLI which parser version to use for compatibility.

schema: "bloc/v1"

extends (optional)

Inherit fields from another recipe and override only what differs. Great for multi-VRAM variants of the same model.

extends: "arnav/qwen3-base"
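
A sketch of a 12GB variant that inherits from the base recipe above and overrides only what differs. The variant name and the overridden values are hypothetical, and the exact merge behavior (for example, whether nested sections merge key by key) is an assumption that this page does not confirm.

```yaml
schema: "bloc/v1"
extends: "arnav/qwen3-base"   # inherit everything from the base recipe

metadata:
  name: "qwen3-base-12gb"     # hypothetical variant name
  description: "Same model as the base recipe, tuned for 12GB VRAM."

hardware:
  min_vram: "12GB"

engine_config:
  n_cpu_moe: 32               # assumed override for the larger card
  ctx_size: 32768             # illustrative value
```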

metadata (required)

metadata:
  name: "qwen3-30b-moe-8gb-cpu-offload"
  description: "What model, what hardware, what result."
  tags: [moe, long-context, 8gb, cuda, reasoning]
  author_notes: |
    Multi-line developer notes: trade-offs, experiments, and tips.

  • name: Lowercase identifier (letters, numbers, hyphens), unique per user. Combined with your username, it forms the recipe ID: {username}/{name}.
  • description: Brief summary displayed on registry cards.
  • tags: Array of strings used for search filters (e.g., moe, cuda, metal, reasoning).

model (required)

Describes where the model is downloaded from: a single GGUF file or a full Hugging Face repository. You must specify either download_url (for GGUF) or hf_repo (for full repository downloads, e.g. for vLLM).

model:
  source: "huggingface:Qwen/Qwen3-30B-A3B"
  
  # For GGUF/llama.cpp downloads:
  gguf_repo: "huggingface:bartowski/Qwen_Qwen3-30B-A3B-GGUF"
  file: "Qwen_Qwen3-30B-A3B-Q4_K_M.gguf"
  download_url: "https://huggingface.co/bartowski/Qwen_Qwen3-30B-A3B-GGUF/resolve/main/Qwen_Qwen3-30B-A3B-Q4_K_M.gguf"
  
  # For full HF repository downloads (e.g. vLLM):
  # hf_repo: "Qwen/Qwen3-30B-A3B"
  
  quantization: "Q4_K_M"
  size_gb: 17.2
  parameters: "30B"
  architecture: "MoE"

  • source: The upstream base model on Hugging Face (huggingface:org/repo).
  • hf_repo: Hugging Face repo path (org/repo) for full model repository downloads. Validated against a strict security regex.
  • file: Exact GGUF filename, used for deterministic download caching.
  • download_url: Direct HTTPS download link to the GGUF file.

engine (required)

engine:
  name: "llama.cpp" # options: "llama.cpp", "vllm"
  runtime: "native" # options: "native", "docker"
  version: "0.9.0"  # pinned engine version (semver format)
  image: "vllm/vllm-openai:v0.9.0" # Docker image tag (docker runtime only)
  tested_commit: "b5350" # (llama.cpp only)

  • runtime: How to run the engine. Defaults to native. Use docker to launch the engine in a container.
  • version: Pinned engine version, validated as a semver string (e.g., 0.9.0).
  • image: Docker image tag, validated against the image tag format (e.g., vllm/vllm-openai:v0.9.0). Required when runtime is docker.
  • tested_commit: The llama.cpp build this recipe was verified against, given as a build tag (e.g. b5350) or short git hash. The CLI compares it with the user's binary and warns about a mismatch.

hardware (required)

Specifies the minimum and recommended hardware for running the configuration.

hardware:
  min_vram: "8GB" # options: 4GB, 8GB, 12GB, 16GB, 24GB, Unified (Mac)
  target_platform: "cuda" # options: cuda, metal, rocm, cpu, vulkan
  gpu_count: 1
  recommended_ram: "32GB"

Engine Configuration (engine_config)

These fields map to engine-specific CLI flags during deployment execution. Fields set to null are ignored.

[!IMPORTANT] Cross-Engine Validation: To prevent confusing behavior, llama.cpp-only flags (gpu_layers, n_cpu_moe, batch_size, ubatch_size) cannot be defined in a vllm engine recipe, and vLLM-only flags cannot be used in a llama.cpp recipe.
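
To make the cross-engine rule concrete, the fragment below mixes a vllm engine with a llama.cpp-only field. Based on the rule above, it should fail validation; the exact error message is not documented here.

```yaml
engine:
  name: "vllm"
  runtime: "docker"
  version: "0.9.0"
  image: "vllm/vllm-openai:v0.9.0"

engine_config:
  tensor_parallel_size: 1
  max_model_len: 8192
  gpu_layers: 99   # llama.cpp-only field: expected to be rejected in a vllm recipe
```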

[!NOTE] The Explicit Field Rule: Always set options explicitly rather than relying on engine defaults, so that your recipes keep behaving the same way if an engine changes its defaults.

Context & Batching

  • ctx_size: Context window size in tokens (-c). KV cache memory scales linearly with this value.
  • batch_size: Logical batch size for prompt processing (-b). Higher values speed up prompt prefill but increase peak memory. (llama.cpp only)
  • ubatch_size: Micro-batch size (-ub). Halving this (e.g., from 512 to 256) can save up to ~1.4GB VRAM during prefill on tight setups. (llama.cpp only)
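
A sketch of a memory-constrained batching setup that applies the micro-batch reduction described above. The values are illustrative starting points, not measured recommendations.

```yaml
engine_config:
  ctx_size: 8192      # -c: context window in tokens
  batch_size: 512     # -b: batch size for prompt prefill
  ubatch_size: 256    # -ub: halved from 512 to lower peak VRAM during prefill
```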

GPU Offloading

  • gpu_layers: Number of layers loaded into VRAM (-ngl). Use 99 for full GPU offload. (llama.cpp only)
  • split_mode: Multi-GPU splitting strategy (layer, row, or none). (llama.cpp only)
  • tensor_split: Proportion of the model assigned to each GPU, as a comma-separated list (e.g. 58,42) (-ts). (llama.cpp only)
  • flash_attn: Enables Flash Attention (-fa), which reduces attention memory use; the savings grow with context length. Highly recommended on compatible hardware. (llama.cpp only)
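
For a machine with two GPUs of unequal size, the offloading fields might be combined as follows. The string form of tensor_split and the 58/42 ratio are assumptions based on the example value above; adjust the ratio to the relative VRAM of your cards.

```yaml
hardware:
  target_platform: "cuda"
  gpu_count: 2

engine_config:
  gpu_layers: 99          # -ngl: offload every layer
  split_mode: "layer"     # distribute whole layers across the GPUs
  tensor_split: "58,42"   # assumed format: share of the model per GPU
  main_gpu: 0             # -mg
  flash_attn: true
```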

Mixture-of-Experts (MoE)

  • n_cpu_moe: Number of layers whose MoE expert weights are kept in system RAM instead of VRAM. (llama.cpp only)
    • 99 = Expert weights of all layers on CPU (maximum context, lowest VRAM usage, ~10% slower speed).
    • 32 = Partial offload: expert weights of the first 32 layers on CPU (faster inference, uses more VRAM).
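
The configuration pattern later on this page shows full expert offload; as a contrast, here is a sketch of the partial setting. The context size is an illustrative guess for a card with some VRAM to spare.

```yaml
engine_config:
  gpu_layers: 99     # non-expert weights stay on the GPU
  n_cpu_moe: 32      # partial expert offload to system RAM
  ctx_size: 32768    # illustrative; lower it if VRAM runs out
  flash_attn: true
```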

KV Cache Quantization

  • cache_type_k / cache_type_v: Storage format for the keys and values in the KV cache (-ctk / -ctv). (llama.cpp only)
    • f16: Engine default, unquantized 16-bit, highest memory.
    • q8_0: Near-lossless quality, roughly halves KV cache memory (recommended starting point).
    • q4_0: Aggressive memory saving (great for deep chat context; may cause minor reasoning issues).
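
To see where the savings come from, here is a back-of-the-envelope estimate written as comments. The model dimensions are hypothetical (32 layers, 8 KV heads, head dimension 128), and the per-element sizes follow llama.cpp's block formats (q8_0 stores 32 values in 34 bytes, q4_0 stores 32 values in 18 bytes). Real usage will differ somewhat.

```yaml
# KV cache bytes ≈ 2 (K and V) × layers × kv_heads × head_dim × ctx_size × bytes_per_element
#
# Hypothetical model: 32 layers, 8 KV heads, head_dim 128, ctx_size 8192
#   elements = 2 × 32 × 8 × 128 × 8192 = 536,870,912
#   f16  (2 bytes/element)       ≈ 1024 MiB
#   q8_0 (1.0625 bytes/element)  ≈  544 MiB
#   q4_0 (0.5625 bytes/element)  ≈  288 MiB
engine_config:
  ctx_size: 8192
  flash_attn: true
  cache_type_k: "q8_0"
  cache_type_v: "q8_0"
```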

Speculative Decoding

  • spec_type: Speculative decoding type (draft or draft-mtp for native multi-token prediction). (llama.cpp only)
  • spec_draft_model: Path or HF identifier for the draft model. (llama.cpp only)
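
A minimal sketch of draft-model speculative decoding in a llama.cpp recipe. The draft model path is a placeholder; a small model that shares the main model's tokenizer is the usual choice.

```yaml
engine_config:
  spec_type: "draft"
  spec_draft_model: "/path/to/draft-model.gguf"   # placeholder path
```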

vLLM-Specific Parameters

  • tensor_parallel_size: Number of GPUs to partition the model across using Tensor Parallelism.
  • gpu_memory_utilization: Fraction of GPU memory allocated for model weights and KV cache (0.0 to 1.0).
  • max_model_len: Maximum sequence length (prompt plus generated tokens) the engine will handle.
  • dtype: Model weights precision data type (e.g., auto, float16, bfloat16).
  • kv_cache_dtype: Key-value cache precision format.
  • quantization: Specify model quantization type (e.g. awq, gptq, squeezellm).
  • enable_expert_parallel: Enable parallel MoE expert processing on multi-GPU setups.
  • tokenizer_mode: Tokenizer behavior override (e.g. auto, slow).
  • tool_call_parser: Parser to interpret model-generated tool calls (e.g. llama3, mistral).
  • reasoning_parser: Parser for model reasoning/thinking tokens (e.g. deepseek_r1).
  • trust_remote_code: Must be explicitly set to true to execute custom model-specific code. This triggers an interactive user safety confirmation gate on the CLI.
  • speculative_model: Pointer to the draft model for speculative decoding.
  • num_speculative_tokens: Number of tokens the draft model proposes per decoding step.
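
Putting several of these fields together, a two-GPU vLLM configuration might look like the sketch below. The values are illustrative, and the parser name is taken from the example above; check it against what your vLLM version supports.

```yaml
hardware:
  target_platform: "cuda"
  gpu_count: 2

engine_config:
  tensor_parallel_size: 2        # one model shard per GPU
  gpu_memory_utilization: 0.90
  max_model_len: 8192
  dtype: "bfloat16"
  reasoning_parser: "deepseek_r1"
  trust_remote_code: false       # true would trigger the CLI confirmation gate
```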

System Settings

  • threads: Number of CPU threads to use (-t). Set this to the number of physical cores; counting logical (hyper-threaded) cores can cause contention and CPU lockups.
  • mlock: Locks model weights in physical RAM so the OS cannot swap them out, avoiding latency spikes.
  • jinja: Enables Jinja chat templates (--jinja), required for modern chat formatting such as Qwen/Llama.
  • port: Port to serve on.

    [!WARNING] Ports under 1024 (privileged) and port 0 (dynamic binding) are blocked for safety. Only ports between 1024 and 65535 are allowed.

  • extra_args: Escape hatch array for custom/unsupported flags:
    extra_args:
      - "--cuda-graphs"

    [!WARNING] Flags passed to extra_args are validated against a strict security allowlist. Dangerous flags (like --api-key or overriding host addresses) are rejected.
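
The port rule above in concrete terms. The commented-out lines show values that should be rejected according to the warning; the remaining values are illustrative.

```yaml
engine_config:
  threads: 8        # physical cores only
  mlock: false
  jinja: true
  port: 8080        # accepted: within 1024-65535
  # port: 80        # rejected: privileged port (below 1024)
  # port: 0         # rejected: dynamic binding
  extra_args: []    # leave empty unless an allowlisted flag is needed
```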


Pre-Run Hooks (pre_run)

Optional system-level operations executed before the engine launches. The CLI will request confirmation from the user before executing commands.

pre_run:
  env:
    CUDA_VISIBLE_DEVICES: "0,1"
  commands:
    - "sudo nvidia-smi -pl 250" # caps GPU power draw at 250 W for efficiency
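
If a recipe only needs environment variables, the hook can leave out shell commands entirely, as in this sketch. The empty commands list mirrors the template at the end of this page; whether the CLI still asks for confirmation in that case is not stated here.

```yaml
pre_run:
  env:
    CUDA_VISIBLE_DEVICES: "0"   # expose only the first GPU to the engine
  commands: []
```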

Common Configuration Patterns

Single GPU Budget Setup (8GB VRAM)

hardware:
  min_vram: "8GB"
  target_platform: "cuda"
engine_config:
  gpu_layers: 28
  ctx_size: 8192
  flash_attn: true
  cache_type_k: "q8_0"
  cache_type_v: "q8_0"
  threads: 8

Apple Silicon (Metal Unified Memory)

hardware:
  min_vram: "Unified"
  target_platform: "metal"
engine_config:
  gpu_layers: 99
  ctx_size: 32768
  flash_attn: true
  cache_type_k: "q8_0"
  cache_type_v: "q8_0"
  threads: 6 # matched to physical performance cores
  mlock: false
  jinja: true

MoE Offloading (Large Model, Low VRAM)

hardware:
  min_vram: "8GB"
  recommended_ram: "32GB"
engine_config:
  gpu_layers: 99
  n_cpu_moe: 99 # moves experts to system RAM
  ctx_size: 262144
  flash_attn: true
  cache_type_k: "q4_0"
  cache_type_v: "q4_0"
  jinja: true

vLLM Single-GPU Setup (Containerized)

model:
  source: "huggingface:meta-llama/Meta-Llama-3-8B-Instruct"
  hf_repo: "meta-llama/Meta-Llama-3-8B-Instruct"
  parameters: "8B"
  architecture: "Dense"
engine:
  name: "vllm"
  runtime: "docker"
  version: "0.9.0"
  image: "vllm/vllm-openai:v0.9.0"
hardware:
  min_vram: "16GB"
  target_platform: "cuda"
  gpu_count: 1
engine_config:
  tensor_parallel_size: 1
  gpu_memory_utilization: 0.90
  max_model_len: 8192
  trust_remote_code: false
  host: "127.0.0.1"
  port: 8080

YAML Template Reference

Below is an annotated template you can copy to create your own recipes. It covers both engine branches under the bloc/v1 schema.

# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
#  BLOC RECIPE BLUEPRINT  ·  schema bloc/v1
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

schema: "bloc/v1"
extends: null

# ─── LAYER 1: Registry Metadata (Parsed & Indexed by Hub) ──────
metadata:
  name: "your-recipe-name"     # Lowercase letters, numbers, and hyphens only
  description: "A short, one-sentence card summary."
  tags:
    - "cuda"                   # cuda | metal | rocm | cpu | vulkan
    - "8gb"                    # VRAM target
    - "reasoning"
  author_notes: |
    Multi-line developer details, benchmarks, and target hardware.

model:
  source: "huggingface:org/repo"
  
  # For llama.cpp (GGUF):
  gguf_repo: "huggingface:quantizer/repo-gguf"
  file: "model-Q4_K_M.gguf"
  download_url: "https://huggingface.co/..."
  
  # For vLLM (Alternative):
  # hf_repo: "org/repo"

  quantization: "Q4_K_M"
  size_gb: 4.8
  parameters: "7B"
  architecture: "Dense"

engine:
  name: "llama.cpp"           # "llama.cpp" or "vllm"
  runtime: "native"           # "native" or "docker"
  version: "0.9.0"            # pinned engine version (semver format)
  image: null                 # Docker image tag (required if runtime is docker)
  tested_commit: "b5350"      # (llama.cpp only)

hardware:
  min_vram: "8GB"             # 4GB | 8GB | 12GB | 16GB | 24GB | Unified
  target_platform: "cuda"     # cuda | metal | rocm | cpu | vulkan
  gpu_count: 1
  recommended_ram: "16GB"

# ─── LAYER 2: Engine Config (Translated directly to backend CLI flags)
engine_config:
  # [ Server (Shared) ]
  host: "127.0.0.1"
  port: 8080                  # Must be between 1024 and 65535
  n_parallel: 1               # -np (llama.cpp) or --max-num-seqs (vLLM)

  # ─── llama.cpp-specific configurations ───
  ctx_size: 8192              # -c
  gpu_layers: 99              # -ngl (99 = offload all)
  flash_attn: true            # -fa
  mlock: false                # --mlock
  mmap: true                  # Set false for --no-mmap
  split_mode: null            # --split-mode  null | "none" | "layer" | "row"
  tensor_split: null          # -ts
  main_gpu: 0                 # -mg
  threads: 8                  # -t
  batch_size: 512             # -b
  ubatch_size: 256            # -ub
  cache_type_k: "q8_0"        # -ctk
  cache_type_v: "q8_0"        # -ctv
  jinja: true                 # --jinja

  # ─── vLLM-specific configurations ───
  # tensor_parallel_size: 1
  # gpu_memory_utilization: 0.90
  # max_model_len: 8192
  # trust_remote_code: false

  # [ Escape Hatch (Allowlisted flags only) ]
  extra_args: []

# ─── Pre-Run System Hooks (optional, outside the two layers) ───
pre_run:
  env: {}
  commands: []