Community Configurations
Real-world GPU inference setups shared by the community. Find inspiration, compare approaches, and share your own.
Showing 22 configurations
Llama 3.1 8B INT4 on L40S — Ultra Cheap
by cost_optimizer · 2/28/2025
Minimal cost deployment for high-volume simple queries. INT4 quantization keeps it on a single affordable GPU.
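A minimal launch sketch for a card like this, assuming a vLLM server (the listing doesn't name the serving engine) and a placeholder INT4 checkpoint ID. INT4 weights for an 8B model are roughly 8 B params × 0.5 bytes ≈ 4 GB, comfortably inside the L40S's 48 GB.

```shell
# Hypothetical single-GPU launch; the repo ID is a placeholder for any
# GPTQ or AWQ INT4 build of Llama 3.1 8B.
vllm serve your-org/Llama-3.1-8B-Instruct-GPTQ-INT4 \
  --quantization gptq \
  --gpu-memory-utilization 0.90
```

Leaving ~10% of GPU memory unclaimed keeps room for the CUDA context and activation scratch space.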
Enterprise DeepSeek V3 on 8x H200
by enterprise_ml_lead · 3/19/2025
Maximum throughput enterprise deployment of DeepSeek V3 MoE. H200 HBM3e memory eliminates bottlenecks for the 671B model.
DeepSeek V3 MoE on 4x H100 — Efficiency
by moe_fan · 3/14/2025
Mixture-of-experts architecture activates only 37B of 671B params. Surprisingly affordable at scale.
Embedding Service with BGE-Large
by search_engineer · 3/13/2025
High-throughput embedding generation for semantic search and vector databases. Tiny model, massive batch sizes.
DeepSeek R1 on 8x H100 — Reasoning King
by ml_deployer · 3/10/2025
Full-precision reasoning model for complex multi-step tasks. High throughput with tensor parallelism.
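The tensor parallelism mentioned here can be sketched in one flag, again assuming vLLM as the serving engine: `--tensor-parallel-size` shards each weight matrix across the GPUs so all eight H100s hold one model replica.

```shell
# Hypothetical launch sharding DeepSeek R1 across all 8 H100s.
vllm serve deepseek-ai/DeepSeek-R1 \
  --tensor-parallel-size 8
```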
Budget Qwen 7B Chat on RTX 4090
by home_lab_hero · 3/20/2025
Consumer-grade deployment for personal or small-team chat applications. The RTX 4090 delivers surprisingly strong inference performance.

High-Throughput Mistral 7B on H100
by scale_engineer · 3/6/2025
Maximize requests per second for Mistral 7B. H100 with continuous batching handles massive concurrent load.
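Continuous batching is the default scheduler in vLLM, so a setup like this mostly comes down to raising the concurrency ceiling. A sketch under that assumption (the listing doesn't specify the engine):

```shell
# Hypothetical high-concurrency launch; continuous batching is on by default,
# --max-num-seqs raises the number of requests batched per step.
vllm serve mistralai/Mistral-7B-Instruct-v0.3 \
  --max-num-seqs 512 \
  --gpu-memory-utilization 0.95
```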
RAG Pipeline with Llama 70B FP8
by rag_architect · 3/16/2025
Retrieval-augmented generation pipeline optimized for long-context document Q&A. FP8 keeps 70B on 2 GPUs with room for KV cache.
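The sizing claim checks out: at FP8 (1 byte/param) the 70B weights take ~70 GB, so two 80 GB GPUs leave ~90 GB for KV cache. A sketch assuming vLLM and a placeholder FP8 checkpoint ID:

```shell
# Hypothetical 2-GPU FP8 launch; the repo ID is a placeholder for any
# FP8-quantized Llama 70B build. FP8 KV cache stretches long-context headroom.
vllm serve your-org/Llama-3.1-70B-Instruct-FP8 \
  --tensor-parallel-size 2 \
  --quantization fp8 \
  --kv-cache-dtype fp8 \
  --max-model-len 32768
```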
Llama 70B on H200 — Next Gen Perf
by hw_reviewer · 3/18/2025
H200's extra HBM3e memory and bandwidth deliver 40% more throughput than H100 for large models.
Vision Pipeline with Llama 3.2 Vision 90B
by vision_ml_eng · 3/7/2025
Multimodal vision-language model for image understanding, OCR, and visual QA. Requires multi-GPU for the 90B variant.
Qwen 7B FP8 — Fastest Budget Option
by startup_cto · 3/12/2025
Lightning-fast small model for simple tasks. FP8 on H100 maximizes throughput per dollar.
Code Assistant with StarCoder2 15B
by devtools_startup · 3/17/2025
Optimized for IDE-integrated code completion and generation. Low latency on A100 with high batch throughput.
Llama 405B on 8x H100 — Maximum Quality
by quality_first · 3/2/2025
The largest open model for tasks where quality is paramount. Full BF16 for zero quality loss.
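One caveat worth a sizing check: at BF16 (2 bytes/param) the 405B weights alone are ~810 GB, which exceeds the 8 × 80 GB = 640 GB on a single H100 node, so a weights-only BF16 fit needs a second node, while FP8 fits on one. The arithmetic (1 GB per billion params per byte of precision):

```shell
# Weights-only memory in GB for Llama 405B.
echo "BF16: $((405 * 2)) GB"   # vs 8 x 80 = 640 GB on one H100 node
echo "FP8:  $((405 * 1)) GB"
```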
Cost-Optimized Mixtral 8x7B MoE
by cost_hawk · 3/4/2025
Mixtral MoE on A100 with AWQ quantization. Only 12.9B active params gives dense-like speed at a fraction of the cost.
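A sketch of this setup, assuming vLLM as the engine; the repo ID is an assumption (substitute whichever AWQ checkpoint you trust). The ~24 GB of 4-bit weights fit a single 80 GB A100 with ample KV-cache room.

```shell
# Hypothetical single-A100 launch of an AWQ-quantized Mixtral build.
vllm serve TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ \
  --quantization awq \
  --gpu-memory-utilization 0.90
```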
Gemma 2 27B on 1x A100 — Code Assistant
by dev_tools_team · 3/1/2025
Optimized for code generation and review tasks. Great quality-to-cost ratio for developer tools.
Edge Deployment Phi-4 on L4
by edge_ml_team · 3/9/2025
Microsoft Phi-4 on NVIDIA L4 for cost-efficient edge inference. Great quality-to-size ratio for constrained deployments.
Phi-3 Mini on A10G — Edge Deployment
by edge_deployer · 2/20/2025
Tiny but capable model for edge and on-device inference. A10G keeps costs minimal.
Budget Llama 70B on 2x A100
by gpu_enthusiast · 3/15/2025
Cost-effective setup for chat applications with strong quality. BF16 precision balances speed and accuracy.
Multilingual Aya-23 on A100
by global_ops · 3/11/2025
Cohere Aya-23 for 23-language customer support. Excellent multilingual performance with efficient serving.
Mixtral 8x7B on 2x A100 — MoE Value
by mixtral_veteran · 1/15/2025
Original MoE value champion. Still competitive for many general-purpose tasks.
Mistral Large on 4x A100 — Balanced
by enterprise_arch · 3/8/2025
Good balance of cost and capability for enterprise use cases. Handles complex instructions well.
Qwen 72B AWQ on 2x H100 — Multilingual
by global_support · 3/5/2025
AWQ quantization for multilingual customer support. Excellent CJK language performance.