
Community Configurations

Real-world GPU inference setups shared by the community. Find inspiration, compare approaches, and share your own.

Showing 22 configurations
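Each card reports Cost/M (dollars per million generated tokens), sustained throughput, and gross margin. As a rough sanity check on how these relate, raw GPU cost per million tokens follows directly from the hourly rate and throughput. The $1.00/hr L40S rate below is a hypothetical assumption; the listed Cost/M figures also bake in utilization gaps, serving overhead, and provider pricing.

```python
def cost_per_million_tokens(gpu_hourly_usd: float, tokens_per_sec: float) -> float:
    """Raw GPU cost to generate 1M tokens, ignoring utilization and overhead."""
    tokens_per_hour = tokens_per_sec * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000

# Hypothetical $1.00/hr L40S at 3,200 tok/s comes to ~$0.087/M -- well under
# the listed $0.35/M, which leaves room for idle time, overhead, and margin.
print(round(cost_per_million_tokens(1.00, 3200), 3))  # 0.087
```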

Llama 3.1 8B INT4 on L40S — Ultra Cheap

by cost_optimizer · 2/28/2025

Minimal cost deployment for high-volume simple queries. INT4 quantization keeps it on a single affordable GPU.

Llama 3.1 8B on L40S
Cost/M: $0.35 · Tokens/s: 3,200 · Margin: 85%
Tags: ultra-budget, high-volume, llama
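A quick back-of-the-envelope shows why INT4 fits this model on a single L40S (48 GB): weight memory scales linearly with bits per weight, and the rest of the card is free for KV cache. This is a sketch of weight memory only, ignoring activation and framework overhead.

```python
def weight_memory_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB; excludes KV cache and activations."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

# Llama 3.1 8B: ~16 GB at BF16, only ~4 GB at INT4 --
# comfortable on a 48 GB L40S with ample room for KV cache.
print(weight_memory_gb(8, 16), weight_memory_gb(8, 4))  # 16.0 4.0
```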

Enterprise DeepSeek V3 on 8x H200

by enterprise_ml_lead · 3/19/2025

Maximum-throughput enterprise deployment of the DeepSeek V3 MoE. H200's HBM3e capacity and bandwidth relieve memory bottlenecks for the 671B-parameter model.

DeepSeek V3
Cost/M: $1.90 · Tokens/s: 680 · Margin: 71%
Tags: enterprise, moe, h200, deepseek, premium

DeepSeek V3 MoE on 4x H100 — Efficiency

by moe_fan · 3/14/2025

Mixture-of-experts architecture activates only 37B of 671B params. Surprisingly affordable at scale.

DeepSeek V3
Cost/M: $2.80 · Tokens/s: 410 · Margin: 64%
Tags: moe, efficient, deepseek
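The affordability claim follows from the active-parameter ratio: per-token compute scales with the parameters actually activated, not the total. A quick check of the figures quoted above:

```python
active_b, total_b = 37, 671          # DeepSeek V3: active vs. total params (B)
fraction = active_b / total_b
# Only ~5.5% of weights are exercised per token, so per-token compute
# is closer to a 37B dense model than to a 671B one.
print(f"{fraction:.1%}")             # 5.5%
```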

Embedding Service with BGE-Large

by search_engineer · 3/13/2025

High-throughput embedding generation for semantic search and vector databases. Tiny model, massive batch sizes.

BGE Large EN v1.5 on A10G
Cost/M: $0.05 · Tokens/s: 15,000 · Margin: 95%
Tags: embedding, search, ultra-budget, high-volume

DeepSeek R1 on 8x H100 — Reasoning King

by ml_deployer · 3/10/2025

Full-precision reasoning model for complex multi-step tasks. High throughput with tensor parallelism.

DeepSeek R1
Cost/M: $12.50 · Tokens/s: 280 · Margin: 62%
Tags: reasoning, premium, deepseek

Budget Qwen 7B Chat on RTX 4090

by home_lab_hero · 3/20/2025

Consumer-grade deployment for personal or small team chat applications. RTX 4090 delivers surprising inference performance.

Qwen 2.5 7B on RTX 4090
Cost/M: $0.45 · Tokens/s: 2,400 · Margin: 82%
Tags: budget, consumer, chat, qwen

High-Throughput Mistral 7B on H100

by scale_engineer · 3/6/2025

Maximize requests per second for Mistral 7B. H100 with continuous batching handles massive concurrent load.

Cost/M: $0.65 · Tokens/s: 4,200 · Margin: 84%
Tags: high-throughput, fast, mistral, scale

RAG Pipeline with Llama 70B FP8

by rag_architect · 3/16/2025

Retrieval-augmented generation pipeline optimized for long-context document Q&A. FP8 keeps 70B on 2 GPUs with room for KV cache.

Llama 3.1 70B
Cost/M: $3.50 · Tokens/s: 520 · Margin: 56%
Tags: rag, long-context, enterprise, llama
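The "room for KV cache" claim checks out arithmetically: FP8 stores one byte per weight, so a 70B model needs roughly 70 GB. Assuming two 80 GB cards (the card does not name the GPU model), that leaves substantial headroom for long-context KV cache:

```python
params_b = 70                 # Llama 3.1 70B
fp8_weights_gb = params_b * 1.0   # 1 byte per weight at FP8
capacity_gb = 2 * 80              # assumed: two 80 GB GPUs
headroom_gb = capacity_gb - fp8_weights_gb
# ~70 GB of weights across 160 GB of HBM leaves ~90 GB
# for KV cache and activations -- generous for long-context RAG.
print(fp8_weights_gb, headroom_gb)  # 70.0 90.0
```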

Llama 70B on H200 — Next Gen Perf

by hw_reviewer · 3/18/2025

H200's extra HBM3e capacity and bandwidth deliver roughly 40% more throughput than H100 for large models.

Llama 3.1 70B
Cost/M: $3.10 · Tokens/s: 620 · Margin: 58%
Tags: performance, h200, llama

Vision Pipeline with Llama 3.2 Vision 90B

by vision_ml_eng · 3/7/2025

Multimodal vision-language model for image understanding, OCR, and visual QA. Requires multi-GPU for the 90B variant.

Llama 3.2 90B Vision
Cost/M: $8.50 · Tokens/s: 210 · Margin: 44%
Tags: vision, multimodal, llama, premium

Qwen 7B FP8 — Fastest Budget Option

by startup_cto · 3/12/2025

Lightning-fast small model for simple tasks. FP8 on H100 maximizes throughput per dollar.

Qwen 2.5 7B
Cost/M: $0.85 · Tokens/s: 2,800 · Margin: 78%
Tags: budget, fast, qwen

Code Assistant with StarCoder2 15B

by devtools_startup · 3/17/2025

Optimized for IDE-integrated code completion and generation. Low latency on A100 with high batch throughput.

StarCoder2 15B on A100 80GB SXM
Cost/M: $1.20 · Tokens/s: 1,800 · Margin: 72%
Tags: code, developer-tools, starcoder, low-latency

Llama 405B on 8x H100 — Maximum Quality

by quality_first · 3/2/2025

The largest open model for tasks where quality is paramount. Full BF16 for zero quality loss.

Llama 3.1 405B
Cost/M: $18.50 · Tokens/s: 180 · Margin: 41%
Tags: premium, max-quality, llama

Cost-Optimized Mixtral 8x7B MoE

by cost_hawk · 3/4/2025

Mixtral MoE on A100 with AWQ quantization. Only 12.9B active params give dense-like speed at a fraction of the cost.

A100 80GB SXM
Cost/M: $1.10 · Tokens/s: 980 · Margin: 73%
Tags: moe, cost-optimized, mixtral, quantized
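Why 4-bit AWQ makes a single A100 viable here: Mixtral 8x7B totals roughly 46.7B parameters (12.9B active per token, per Mistral's published figures), so 4-bit weights come to about 23 GB, comfortably inside 80 GB. A sketch of the arithmetic:

```python
total_b, active_b = 46.7, 12.9   # Mixtral 8x7B: total vs. active params (B)
awq_weights_gb = total_b * 0.5   # 4-bit AWQ ~ 0.5 byte per weight
# ~23.4 GB of weights fits a single 80 GB A100, while only
# ~28% of the weights are touched per token -- dense-like speed.
print(awq_weights_gb)            # 23.35
print(f"{active_b / total_b:.0%}")  # 28%
```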

Gemma 2 27B on 1x A100 — Code Assistant

by dev_tools_team · 3/1/2025

Optimized for code generation and review tasks. Great quality-to-cost ratio for developer tools.

Gemma 2 27B on A100 80GB SXM
Cost/M: $2.10 · Tokens/s: 680 · Margin: 61%
Tags: code, developer-tools, gemma

Edge Deployment Phi-4 on L4

by edge_ml_team · 3/9/2025

Microsoft Phi-4 on NVIDIA L4 for cost-efficient edge inference. Great quality-to-size ratio for constrained deployments.

Phi-4 on L4
Cost/M: $0.18 · Tokens/s: 3,800 · Margin: 88%
Tags: edge, small, phi, cost-efficient

Phi-3 Mini on A10G — Edge Deployment

by edge_deployer · 2/20/2025

Tiny but capable model for edge and on-device inference. A10G keeps costs minimal.

A10G
Cost/M: $0.22 · Tokens/s: 4,100 · Margin: 90%
Tags: edge, tiny, phi

Budget Llama 70B on 2x A100

by gpu_enthusiast · 3/15/2025

Cost-effective setup for chat applications with strong quality. BF16 precision balances speed and accuracy.

Llama 3.1 70B on A100 80GB SXM
Cost/M: $4.20 · Tokens/s: 450 · Margin: 47%
Tags: budget, chat, llama

Multilingual Aya-23 on A100

by global_ops · 3/11/2025

Cohere Aya-23 for 23-language customer support. Excellent multilingual performance with efficient serving.

Aya 23 35B on A100 80GB SXM
Cost/M: $2.80 · Tokens/s: 620 · Margin: 58%
Tags: multilingual, support, cohere, aya

Mixtral 8x7B on 2x A100 — MoE Value

by mixtral_veteran · 1/15/2025

The original MoE value champion, and still competitive for many general-purpose tasks.

A100 80GB SXM
Cost/M: $1.90 · Tokens/s: 750 · Margin: 66%
Tags: moe, value, mistral

Mistral Large on 4x A100 — Balanced

by enterprise_arch · 3/8/2025

Good balance of cost and capability for enterprise use cases. Handles complex instructions well.

A100 80GB SXM
Cost/M: $7.80 · Tokens/s: 380 · Margin: 52%
Tags: enterprise, balanced, mistral

Qwen 72B AWQ on 2x H100 — Multilingual

by global_support · 3/5/2025

AWQ quantization for multilingual customer support. Excellent CJK language performance.

Qwen 2.5 72B
Cost/M: $3.90 · Tokens/s: 520 · Margin: 55%
Tags: multilingual, support, qwen