Back to registry
Arnav Gautam

Arnav Gautam

Handle:@arnav080
ContributorIndia

Building and learning about local ai

CUDA8GB VRAM

arnav080/ornstein3.6-27b-4060-q3km-bb

>

Base Model: huggingface:saber-labs/Ornstein3.6-27B-MTP-NSC-ACE-SABER
0 runs
CUDA384GB VRAM

arnav080/kimi-k2-6-nvfp4-0xS

Kimi-K2.6 519B (NVFP4) served via SGLang at 256k context with reasoning and tool-call parsers

Base Model: huggingface:0xSero/Kimi-K2.6-519B-NVFP4
0 runs
CUDA48GB VRAM

arnav080/ornstein3.6-27b-dual-3090-mtp-JDT

>

Base Model: huggingface:saber-labs/Ornstein3.6-27B-MTP-NSC-ACE-SABER
0 runs
CUDA192GB VRAM

arnav080/step-3-7-flash-nvfp4

Step 3.7 Flash MoE (198B) quantized to 4-bit (NVFP4) optimized for dual RTX 6000 GPUs

Base Model: huggingface:stepfun-ai/Step-3.7-Flash-NVFP4
0 runs
CUDA192GB VRAM

arnav080/minimax-m2-7-nvfp4

MiniMax M2.7 MoE (230B) quantized to 4-bit (NVFP4) optimized for dual RTX 6000 GPUs

Base Model: huggingface:demon-zombie/MiniMax-M2.7-NVFP4
0 runs
CUDA192GB VRAM

arnav080/deepseek-v4-flash

DeepSeek V4-Flash MoE (284B) production profile tuned for 2x RTX 6000 GPUs

Base Model: huggingface:deepseek-ai/DeepSeek-V4-Flash
0 runs
CUDA192GB VRAM

arnav080/step-3-7-flash-nvfp4-hikari

Step 3.7 Flash MoE (198B) quantized to 4-bit (NVFP4) optimized for dual RTX 6000 GPUs

Base Model: huggingface:stepfun-ai/Step-3.7-Flash-NVFP4
0 runs
CUDA192GB VRAM

arnav080/minimax-m2-7-nvfp4-hikari

MiniMax M2.7 MoE (230B) quantized to 4-bit (NVFP4) optimized for dual RTX 6000 GPUs

Base Model: huggingface:demon-zombie/MiniMax-M2.7-NVFP4
0 runs
CUDA192GB VRAM

arnav080/deepseek-v4-flash-hikari

DeepSeek V4-Flash MoE (284B) production profile tuned for 2x RTX 6000 GPUs

Base Model: huggingface:deepseek-ai/DeepSeek-V4-Flash
0 runs
CUDA8GB VRAM

arnav080/qwen3.6-35b-moe-tq3-rtx4060ti-no-checkpoint-witch

Highly optimized 8GB GPU recipe for Qwen 3.6 35B MoE, disabling context checkpoints to double deep-context throughput (~30 tok/s at 26k).

Base Model: huggingface:Qwen/Qwen3.6-35B-A3B-Instruct
0 runs
CUDA16GB VRAM

arnav080/qwen3.6-35b-moe-rtx5080-128k-LeftCDev

Fully GPU-accelerated Qwen 3.6 35B MoE recipe optimized for 16GB VRAM GPUs, achieving 150 tok/s and 128k context.

Base Model: huggingface:Qwen/Qwen3.6-35B-A3B-Instruct
0 runs
CUDA320GB VRAM

arnav080/step-3-7-flash-bf16-multi-gpu

Unquantized BF16 Step-3.7-Flash multimodal recipe optimized for multi-GPU server clusters (e.g., 4x A100/H100 80GB).

Base Model: huggingface:stepfun-ai/Step-3.7-Flash
0 runs
CUDA128GB VRAM

arnav080/step-3.7-flash-multimodal-128gb-SudoSU

Multimodal Step-3.7-Flash config for ultra-large hardware (128GB VRAM / GB10) featuring f16 vision support and 65k context.

Base Model: huggingface:stepfun-ai/Step-3.7-Flash
0 runs
CUDA12GB VRAM

arnav080/qwen3.6-35b-moe-turboquant-tq3-AJKV

Optimized Qwen 3.6 35B MoE recipe using the experimental llama.cpp-tq3 (TurboQuant) engine, TQ3_4S quant, and hybrid offload.

Base Model: huggingface:Qwen/Qwen3.6-35B-A3B-Instruct
0 runs
CUDA8GB VRAM

arnav080/qwen3.6-35b-moe-4060ti-above_spec

Optimized Qwen 3.6 35B MoE config for 8GB GPUs, reaching 200k context via CPU expert offload and q8_0 KV cache.

Base Model: huggingface:Qwen/Qwen3.6-35B-A3B-Instruct
0 runs
CUDAUnified Mac VRAM

arnav080/qwen3.6-27b-iq3-k-r4-5060ti-above_spec

Dense Qwen 3.6 27B on a single RTX 5060 Ti 16GB running completely on GPU with R4 quantization and 139k deep context.

Base Model: huggingface:Qwen/Qwen3.6-27B-Instruct
0 runs
CUDA8GB VRAM

arnav080/qwen3.6-35b-moe-4060ti-iq4-above_spec

Qwen 3.6 35B MoE on RTX 4060 Ti 8GB using the ik_llama.cpp fork for high-quality IQ4_K_R4 quantization and 262k context.

Base Model: huggingface:Qwen/Qwen3.6-35B-A3B-Instruct
0 runs
CUDA24GB VRAM

arnav080/qwen3.6-27b-dual-gpu-mtp-above_spec

Optimized Qwen 3.6 27B dual-GPU recipe with native MTP speculative decoding & 100k context.

Base Model: huggingface:Qwen/Qwen3.6-27B-Instruct
0 runs
CUDA16GB VRAM

arnav080/qwen3-30b-dual-gpu-speculative-100k

Qwen3-30B across two GPUs with native MTP speculative decoding. Tensor split 58/42 to account for MTP pinning the draft head + KV growth on GPU1. -ub 256 is the key finding: halves per-prefill compute reserve (~1.8 GB), freeing ~1.4 GB on the choking card. Honest context ceiling: 102400 (Config D ran 0→94k clean).

Base Model: huggingface:Qwen/Qwen3-30B-A3B
0 runs
CUDA8GB VRAM

arnav080/qwen3-30b-moe-8gb-cpu-offload

Qwen3-30B-A3B (MoE, 3B active params) on a single RTX 4060 Ti 8GB. Offloads all 41 MoE expert blocks to CPU RAM. Achieves 49 t/s at full 262k native context with 3.2 GB VRAM to spare. Trades ~10% speed for 4x the context vs partial GPU expert loading.

Base Model: huggingface:Qwen/Qwen3-30B-A3B
0 runs