Community Configurations
Real-world GPU inference setups shared by the community. Find inspiration, compare approaches, and share your own.
Showing 22 configurations
Llama 3.1 8B INT4 on L40S — Ultra Cheap
by cost_optimizer · 2/28/2025
Minimal cost deployment for high-volume simple queries. INT4 quantization keeps it on a single affordable GPU.
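A minimal launch sketch for a card like this, assuming a vLLM server (the listing doesn't name the serving engine) and a placeholder INT4 checkpoint ID. INT4 weights for an 8B model are roughly 8 B params × 0.5 bytes ≈ 4 GB, comfortably inside the L40S's 48 GB.

```shell
# Hypothetical single-GPU launch; the repo ID is a placeholder for any
# GPTQ or AWQ INT4 build of Llama 3.1 8B.
vllm serve your-org/Llama-3.1-8B-Instruct-GPTQ-INT4 \
  --quantization gptq \
  --gpu-memory-utilization 0.90
```

Leaving ~10% of GPU memory unclaimed keeps room for the CUDA context and activation scratch space.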
Enterprise DeepSeek V3 on 8x H200
by enterprise_ml_lead · 3/19/2025
Maximum throughput enterprise deployment of DeepSeek V3 MoE. H200 HBM3e memory eliminates bottlenecks for the 671B model.
DeepSeek V3 MoE on 4x H100 — Efficiency
by moe_fan · 3/14/2025
Mixture-of-experts architecture activates only 37B of 671B params. Surprisingly affordable at scale.
Embedding Service with BGE-Large
by search_engineer · 3/13/2025
High-throughput embedding generation for semantic search and vector databases. Tiny model, massive batch sizes.
DeepSeek R1 on 8x H100 — Reasoning King
by ml_deployer · 3/10/2025
Full-precision reasoning model for complex multi-step tasks. High throughput with tensor parallelism.
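The tensor parallelism mentioned here can be sketched in one flag, again assuming vLLM as the serving engine: `--tensor-parallel-size` shards each weight matrix across the GPUs so all eight H100s hold one model replica.

```shell
# Hypothetical launch sharding DeepSeek R1 across all 8 H100s.
vllm serve deepseek-ai/DeepSeek-R1 \
  --tensor-parallel-size 8
```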
Budget Qwen 7B Chat on RTX 4090
by home_lab_hero · 3/20/2025
Consumer-grade deployment for personal or small-team chat applications. The RTX 4090 delivers surprisingly strong inference performance.

High-Throughput Mistral 7B on H100
by scale_engineer · 3/6/2025
Maximize requests per second for Mistral 7B. H100 with continuous batching handles massive concurrent load.
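Continuous batching is the default scheduler in vLLM, so a setup like this mostly comes down to raising the concurrency ceiling. A sketch under that assumption (the listing doesn't specify the engine):

```shell
# Hypothetical high-concurrency launch; continuous batching is on by default,
# --max-num-seqs raises the number of requests batched per step.
vllm serve mistralai/Mistral-7B-Instruct-v0.3 \
  --max-num-seqs 512 \
  --gpu-memory-utilization 0.95
```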
RAG Pipeline with Llama 70B FP8
by rag_architect · 3/16/2025
Retrieval-augmented generation pipeline optimized for long-context document Q&A. FP8 keeps 70B on 2 GPUs with room for KV cache.
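The sizing claim checks out: at FP8 (1 byte/param) the 70B weights take ~70 GB, so two 80 GB GPUs leave ~90 GB for KV cache. A sketch assuming vLLM and a placeholder FP8 checkpoint ID:

```shell
# Hypothetical 2-GPU FP8 launch; the repo ID is a placeholder for any
# FP8-quantized Llama 70B build. FP8 KV cache stretches long-context headroom.
vllm serve your-org/Llama-3.1-70B-Instruct-FP8 \
  --tensor-parallel-size 2 \
  --quantization fp8 \
  --kv-cache-dtype fp8 \
  --max-model-len 32768
```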
Llama 70B on H200 — Next Gen Perf
by hw_reviewer · 3/18/2025
H200's extra HBM3e memory and bandwidth deliver 40% more throughput than H100 for large models.
Vision Pipeline with Llama 3.2 Vision 90B
by vision_ml_eng · 3/7/2025
Multimodal vision-language model for image understanding, OCR, and visual QA. Requires multi-GPU for the 90B variant.
Qwen 7B FP8 — Fastest Budget Option
by startup_cto · 3/12/2025
Lightning-fast small model for simple tasks. FP8 on H100 maximizes throughput per dollar.
Code Assistant with StarCoder2 15B
by devtools_startup · 3/17/2025
Optimized for IDE-integrated code completion and generation. Low latency on A100 with high batch throughput.
Llama 405B on 8x H100 — Maximum Quality
by quality_first · 3/2/2025
The largest open model for tasks where quality is paramount. Full BF16 for zero quality loss.
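One caveat worth a sizing check: at BF16 (2 bytes/param) the 405B weights alone are ~810 GB, which exceeds the 8 × 80 GB = 640 GB on a single H100 node, so a weights-only BF16 fit needs a second node, while FP8 fits on one. The arithmetic (1 GB per billion params per byte of precision):

```shell
# Weights-only memory in GB for Llama 405B.
echo "BF16: $((405 * 2)) GB"   # vs 8 x 80 = 640 GB on one H100 node
echo "FP8:  $((405 * 1)) GB"
```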
Cost-Optimized Mixtral 8x7B MoE
by cost_hawk · 3/4/2025
Mixtral MoE on A100 with AWQ quantization. Only 12.9B active params gives dense-like speed at a fraction of the cost.
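A sketch of this setup, assuming vLLM as the engine; the repo ID is an assumption (substitute whichever AWQ checkpoint you trust). The ~24 GB of 4-bit weights fit a single 80 GB A100 with ample KV-cache room.

```shell
# Hypothetical single-A100 launch of an AWQ-quantized Mixtral build.
vllm serve TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ \
  --quantization awq \
  --gpu-memory-utilization 0.90
```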
Gemma 2 27B on 1x A100 — Code Assistant
by dev_tools_team · 3/1/2025
Optimized for code generation and review tasks. Great quality-to-cost ratio for developer tools.
Edge Deployment Phi-4 on L4
by edge_ml_team · 3/9/2025
Microsoft Phi-4 on NVIDIA L4 for cost-efficient edge inference. Great quality-to-size ratio for constrained deployments.
Phi-3 Mini on A10G — Edge Deployment
by edge_deployer · 2/20/2025
Tiny but capable model for edge and on-device inference. A10G keeps costs minimal.
Budget Llama 70B on 2x A100
by gpu_enthusiast · 3/15/2025
Cost-effective setup for chat applications with strong quality. BF16 precision balances speed and accuracy.
Multilingual Aya-23 on A100
by global_ops · 3/11/2025
Cohere Aya-23 for 23-language customer support. Excellent multilingual performance with efficient serving.
Mixtral 8x7B on 2x A100 — MoE Value
by mixtral_veteran · 1/15/2025
Original MoE value champion. Still competitive for many general-purpose tasks.
Mistral Large on 4x A100 — Balanced
by enterprise_arch · 3/8/2025
Good balance of cost and capability for enterprise use cases. Handles complex instructions well.
Qwen 72B AWQ on 2x H100 — Multilingual
by global_support · 3/5/2025
AWQ quantization for multilingual customer support. Excellent CJK language performance.