@arnav080 (Arnav Gautam)

METAL Unified VRAM

arnav080/gemma-4-12b-q4-k-m-m1-16gb-balanced

Gemma 4 12B optimized for Apple M1 16GB. Reduced KV cache and prompt processing overhead for faster TTFT.

Base Model: huggingface:unsloth/gemma-4-12b-it

0 runs

METAL 8GB VRAM

arnav080/gemma-4-12b-Q4_K_XL-M116GB

Gemma 4 12B recipe for local execution using llama.cpp, based on the Unsloth guide.

Base Model: huggingface:unsloth/gemma-4-12b-it

0 runs

CUDA 48GB VRAM

arnav080/ornstein3.6-27b-dual-3090-mtp

Dual RTX 3090 setup running Ornstein3.6-27B dense model with Tensor Parallelism, 65k context, FP16 KV cache, and multi-speculative decoding (MTP + ngram-mod + ngram-map).

Base Model: huggingface:saber-labs/Ornstein3.6-27B-MTP-NSC-ACE-SABER

0 runs

CPU 0GB VRAM

arnav080/llama-qwen-0.5b

Lightweight Qwen 0.5B model using llama.cpp for fast testing

Base Model: huggingface:Qwen/Qwen2.5-0.5B-Instruct

0 runs

CUDA 8GB VRAM

arnav080/ornstein3.6-27b-4060-q3km-bb

Base Model: huggingface:saber-labs/Ornstein3.6-27B-MTP-NSC-ACE-SABER

0 runs

CUDA 48GB VRAM

arnav080/ornstein3.6-27b-dual-3090-mtp-JDT

Base Model: huggingface:saber-labs/Ornstein3.6-27B-MTP-NSC-ACE-SABER

0 runs

CUDA 192GB VRAM

arnav080/step-3-7-flash-nvfp4

Step 3.7 Flash MoE (198B) quantized to 4-bit (NVFP4) optimized for dual RTX 6000 GPUs

Base Model: huggingface:stepfun-ai/Step-3.7-Flash-NVFP4

0 runs

CUDA 192GB VRAM

arnav080/minimax-m2-7-nvfp4

MiniMax M2.7 MoE (230B) quantized to 4-bit (NVFP4) optimized for dual RTX 6000 GPUs

Base Model: huggingface:demon-zombie/MiniMax-M2.7-NVFP4

0 runs

CUDA 192GB VRAM

arnav080/deepseek-v4-flash

DeepSeek V4-Flash MoE (284B) production profile tuned for 2x RTX 6000 GPUs

Base Model: huggingface:deepseek-ai/DeepSeek-V4-Flash

0 runs

CUDA 192GB VRAM

arnav080/step-3-7-flash-nvfp4-hikari

Step 3.7 Flash MoE (198B) quantized to 4-bit (NVFP4) optimized for dual RTX 6000 GPUs

Base Model: huggingface:stepfun-ai/Step-3.7-Flash-NVFP4

0 runs

CUDA 192GB VRAM

arnav080/minimax-m2-7-nvfp4-hikari

MiniMax M2.7 MoE (230B) quantized to 4-bit (NVFP4) optimized for dual RTX 6000 GPUs

Base Model: huggingface:demon-zombie/MiniMax-M2.7-NVFP4

0 runs

CUDA 192GB VRAM

arnav080/deepseek-v4-flash-hikari

DeepSeek V4-Flash MoE (284B) production profile tuned for 2x RTX 6000 GPUs

Base Model: huggingface:deepseek-ai/DeepSeek-V4-Flash

0 runs

CUDA 8GB VRAM

arnav080/qwen3.6-35b-moe-tq3-rtx4060ti-no-checkpoint-witch

Highly optimized 8GB GPU recipe for Qwen 3.6 35B MoE, disabling context checkpoints to double deep-context throughput (~30 tok/s at 26k).

Base Model: huggingface:Qwen/Qwen3.6-35B-A3B-Instruct

0 runs

CUDA 16GB VRAM

arnav080/qwen3.6-35b-moe-rtx5080-128k-LeftCDev

Fully GPU-accelerated Qwen 3.6 35B MoE recipe optimized for 16GB VRAM GPUs, achieving 150 tok/s and 128k context.

Base Model: huggingface:Qwen/Qwen3.6-35B-A3B-Instruct

0 runs

CUDA 320GB VRAM

arnav080/step-3-7-flash-bf16-multi-gpu

Unquantized BF16 Step-3.7-Flash multimodal recipe optimized for multi-GPU server clusters (e.g., 4x A100/H100 80GB).

Base Model: huggingface:stepfun-ai/Step-3.7-Flash

0 runs

CUDA 128GB VRAM

arnav080/step-3.7-flash-multimodal-128gb-SudoSU

Multimodal Step-3.7-Flash config for ultra-large hardware (128GB VRAM / GB10) featuring f16 vision support and 65k context.

Base Model: huggingface:stepfun-ai/Step-3.7-Flash

0 runs

CUDA 12GB VRAM

arnav080/qwen3.6-35b-moe-turboquant-tq3-AJKV

Optimized Qwen 3.6 35B MoE recipe using the experimental llama.cpp-tq3 (TurboQuant) engine, TQ3_4S quant, and hybrid offload.

Base Model: huggingface:Qwen/Qwen3.6-35B-A3B-Instruct

0 runs

CUDA 8GB VRAM

arnav080/qwen3.6-35b-moe-4060ti-above_spec

Optimized Qwen 3.6 35B MoE config for 8GB GPUs, reaching 200k context via CPU expert offload and q8_0 KV cache.

Base Model: huggingface:Qwen/Qwen3.6-35B-A3B-Instruct

0 runs

CUDA Unified Mac VRAM

arnav080/qwen3.6-27b-iq3-k-r4-5060ti-above_spec

Dense Qwen 3.6 27B on a single RTX 5060 Ti 16GB running completely on GPU with R4 quantization and 139k deep context.

Base Model: huggingface:Qwen/Qwen3.6-27B-Instruct

0 runs

CUDA 8GB VRAM

arnav080/qwen3.6-35b-moe-4060ti-iq4-above_spec

Qwen 3.6 35B MoE on RTX 4060 Ti 8GB using the ik_llama.cpp fork for high-quality IQ4_K_R4 quantization and 262k context.

Base Model: huggingface:Qwen/Qwen3.6-35B-A3B-Instruct

0 runs

CUDA 24GB VRAM

arnav080/qwen3.6-27b-dual-gpu-mtp-above_spec

Optimized Qwen 3.6 27B dual-GPU recipe with native MTP speculative decoding & 100k context.

Base Model: huggingface:Qwen/Qwen3.6-27B-Instruct

0 runs

CUDA 16GB VRAM

arnav080/qwen3-30b-dual-gpu-speculative-100k

Qwen3-30B split across two GPUs with native MTP speculative decoding. The tensor split is 58/42 to account for MTP pinning the draft head and KV cache growth on GPU1. The key finding is -ub 256: it halves the per-prefill compute reserve (~1.8 GB), freeing ~1.4 GB on the memory-constrained card. Honest context ceiling: 102400 (Config D ran cleanly from 0 to 94k).

Base Model: huggingface:Qwen/Qwen3-30B-A3B

0 runs
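
For the qwen3-30b-dual-gpu-speculative-100k recipe above, the settings named in its description can be sketched as a llama.cpp server command line. This is only an illustration, not the recipe's actual configuration: the model path and the --n-gpu-layers value are hypothetical, the MTP speculative-decoding options are left out because the description does not name them, and the long flag names are the ones llama.cpp uses for tensor split, micro-batch size, and context size.

```python
import shlex

def build_llama_server_args(model_path, ctx_size=102400,
                            tensor_split=(58, 42), ubatch=256):
    """Assemble an argv list for llama-server from the values in the description."""
    split = ",".join(str(part) for part in tensor_split)
    return [
        "llama-server",
        "--model", model_path,         # hypothetical path, not given in the listing
        "--n-gpu-layers", "99",        # assumption: offload every layer to the GPUs
        "--ctx-size", str(ctx_size),   # the stated context ceiling
        "--tensor-split", split,       # 58/42 split between GPU0 and GPU1
        "--ubatch-size", str(ubatch),  # long form of -ub
    ]

args = build_llama_server_args("models/qwen3-30b.gguf")
print(shlex.join(args))
```

Keeping the values as function parameters makes it easy to try a different split or micro-batch size when the second card has a different amount of free memory.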

CUDA 8GB VRAM

arnav080/qwen3-30b-moe-8gb-cpu-offload

Qwen3-30B-A3B (MoE, 3B active params) on a single RTX 4060 Ti 8GB. Offloads all 41 MoE expert blocks to CPU RAM. Achieves 49 t/s at full 262k native context with 3.2 GB VRAM to spare. Trades ~10% speed for 4x the context vs partial GPU expert loading.

Base Model: huggingface:Qwen/Qwen3-30B-A3B

0 runs
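
The CPU expert offload described for qwen3-30b-moe-8gb-cpu-offload can be sketched the same way. This is a guess at the shape of such a command, not the recipe's real settings: it assumes a recent llama.cpp build that supports the --cpu-moe option (older builds typically reach the same effect with a tensor override pattern), it takes "262k" to mean 262144 tokens, and the model path is hypothetical.

```python
def expert_offload_args(model_path, ctx_size=262144):
    """Assemble an argv list that keeps MoE expert weights in CPU RAM."""
    return [
        "llama-server",
        "--model", model_path,        # hypothetical path, not given in the listing
        "--n-gpu-layers", "99",       # assumption: all non-expert weights stay on the GPU
        "--cpu-moe",                  # keep every MoE expert tensor in system RAM
        "--ctx-size", str(ctx_size),  # assumption: 262k read as 262144 tokens
    ]

offload_args = expert_offload_args("models/qwen3-30b-a3b.gguf")
print(" ".join(offload_args))
```

With the experts in system RAM, most of the GPU memory is left for attention weights and the KV cache, which is the trade the description makes between speed and context length.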