
Best GPU for Local LLMs (2026): VRAM Tier List & Buyer's Guide

Don't overpay for AI hardware. Discover the best GPUs for local LLM inference based on VRAM-per-dollar, from used RTX 3090s to the new RTX 5090 endgame.
Figure: VRAM quantization chart for reference, showing tiers from 8GB through 24GB and what you can run at each.

The 2026 Reality: VRAM is No Longer a Luxury

Most GPUs can run AI models. Very few can run them well enough to actually use every day.

If your model doesn't fit in VRAM, performance collapses hard, no matter how powerful the card is.

🧠 Fast answer:

  • 💰 Best value GPU: RTX 3090 (24GB) - cheapest way into serious usable daily driver LLMs
  • ⚖️ Best balanced build: RTX 4060 Ti 16GB - efficient, modern, stable
  • 🚀 Best overall performance: RTX 4090 / RTX 5090 - top-tier inference speed but sadly, top tier costs too
  • 🧪 Best budget entry: Intel Arc B580 12GB - basic 7B models only

📊 What actually matters (in order):

  1. VRAM capacity (hard limit - if you exceed it, everything breaks)
  2. Model size (7B → 70B+)
  3. Quantization level (Q4, Q5, Q8)
  4. Memory bandwidth (affects speed, not capability)

⚡ Quick decision guide:

  • Want ChatGPT-like experience locally? → 24GB VRAM minimum
  • Want to experiment cheaply? → 12–16GB is enough
  • Want 70B models? → 24GB single GPU OR dual GPU setup
  • Want zero compromises? → 32GB+ (5090 / workstation / Mac Studio)

🧮 Not sure what fits your setup?

Use the planner below to instantly check:

  • what models will run
  • how fast they'll run
  • cheapest GPU that works

👉 Jump to the GPU Planner 🔽


The "Golden Rule" of local AI hasn't changed, but the goalposts have moved. In 2024, 8GB was an entry point (as shown by my own use of an old GTX 1650). In 2026, 16GB of VRAM is basically the usable minimum. Newer LLMs are far more capable, but that capability comes with a higher barrier to entry.

Core Lab 🏆 Best GPU by Model Size (2026 Quick Guide)

Target / Use Case | Recommended Hardware | Tier / Value Prop
7B Models (Entry Level) | RTX 3060 12GB | Best 12GB VRAM entry point
13B Models (Mid-Tier) | RTX 4060 Ti 16GB | Efficient 16GB "Sweet Spot"
34B Models (Advanced) | RTX 3090 / 7900 XTX | 24GB VRAM for high-logic inference
70B Models (Frontier) | Dual 3090s / RTX 5090 | 48GB+ VRAM for massive models
Best Budget Start | Intel Arc B580 | Modern features for under
Best for Ollama | RTX 3090 (Used) | The "Gold Standard" community workhorse
Best for Linux | AMD RX 7900 XTX | Native ROCm support & Open Source
Best Quiet System | Mac Studio | Silent performance via Unified Memory

LLM GPU Interactive Tools & Selectors ⚡

📊 GPU Comparison for Local LLMs

Real-world LLM performance depends more on VRAM than raw compute.

GPU | VRAM | 7B Performance | Best Use Case | Value Rating
RTX 5090 | 32GB | 🔥 75-100 tok/s | 70B / multi-model | 💸 Premium
RTX 4090 | 24GB | ⚡ 35-50 tok/s | Best all-around | 🔥 Best high-end
RTX 3090 | 24GB | 15-25 tok/s | Budget 24GB king | 💰 BEST VALUE
4060 Ti 16GB | 16GB | 12-18 tok/s | Entry 13B | 👍 Solid
RX 7900 XTX | 24GB | 10-15 tok/s | VRAM heavy, slower stack | ⚠️ Mixed
Arc B580 | 12GB | 8-14 tok/s | Budget builds | 💰 Budget king
Tesla P40 | 24GB | 5-10 tok/s | Ultra cheap VRAM | 🧪 Niche

📡 A detailed breakdown of the GPUs and their use cases follows further down in the post!


🧠 The Golden Rule of 2026: VRAM Over Everything

Why VRAM Still Dictates Your "Intelligence"

If your model doesn't fit in your Video RAM, it spills over to your system RAM (DDR4/5), and your performance drops from 50 tokens/sec to 2 tokens/sec. It's the difference between a conversation and reading a telegram through a straw.

The "Quick Math" for 2026: To run a model at Q4_K_M (the standard high-quality compression):

  • 8B Model: Needs ~6GB VRAM
  • 34B Model: Needs ~20GB VRAM
  • 70B Model: Needs ~42GB VRAM
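The quick math above can be roughly reproduced in a few lines of Python. This is a back-of-the-envelope sketch, not a benchmark: the figure of about 4.85 bits per weight for Q4_K_M is an approximation assumed here, and the helper name `weights_size_gb` is made up for illustration. It counts the weights only, so the real VRAM need is higher once context is added (which is presumably why the 8B figure above lands nearer 6GB).

```python
# Rough weight-size estimate for a quantized model.
# Assumption: Q4_K_M averages about 4.85 bits per weight (approximate).
BITS_PER_WEIGHT_Q4_K_M = 4.85


def weights_size_gb(params_billions, bits_per_weight=BITS_PER_WEIGHT_Q4_K_M):
    """Return the approximate size of the weights in GB (1 GB = 1e9 bytes)."""
    total_bits = params_billions * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9


for size in (8, 34, 70):
    print(f"{size}B model: ~{weights_size_gb(size):.1f} GB of weights")
```

Swap in 16 bits per weight to see why FP16 is out of reach for big models on consumer cards.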
📡 Note: If you are choosing the Dual 3090 or Mac Studio path, ensure your case, cooling and power infrastructure is verified. See my Core Lab Hardware Guide for hardware advice.

Quantization Explained: How to Fit 70B Models on Consumer Gear

We no longer use FP16 for local inference; it's wasteful. GGUF (for example IQ4_XS) and EXL2 are the 2026 standards. This works because models have become much better through more granular training and optimization, so compression now behaves more like a scalpel and less like a sledgehammer!

  • Q4_K_M: The "Gold Standard." Minimal logic loss.
  • Q2_K: "The Lobotomy." Only used to cram a 70B model onto a single 24GB card.

In summary:

  • VRAM (Video RAM): The most critical factor. LLMs are massive, and their parameters need to fit into VRAM. More VRAM = larger models.
  • Precision (FP16, FP32, BF16): Lower precision (FP16, BF16) allows you to fit larger models into a given VRAM amount, but can sometimes impact accuracy. FP32 offers the highest accuracy but requires more VRAM.
  • CUDA Cores/Tensor Cores: Impact performance. Tensor Cores are specifically designed for accelerating matrix operations common in LLMs.
  • Memory Bandwidth: How quickly data can be transferred to and from VRAM.

What are Tokens?

For those new to Large Language Models (LLMs), understanding the term "token" is crucial. Tokens are essentially the building blocks of text that LLMs process. They're not always equivalent to words; a single word can be split into multiple tokens, and a token can sometimes represent punctuation or even parts of a word.

Think of it like this: an LLM doesn't "read" words; it processes sequences of tokens. The number of tokens in a piece of text directly impacts the computational resources required to process it. Most LLMs have a token limit - the maximum number of tokens they can handle at once. Exceeding this limit can lead to errors or truncated responses. Some models have generous limits, though: Gemma 3 can handle 128k tokens, so it can keep a lot in memory!

Note: Tokens/sec estimates are highly variable and depend on factors like model size, quantization, prompt length, and inference settings. These are rough estimates for a 7B model in FP16.

📉 What is Quantization? (Q4, Q5, Q8 Explained)

The Short Version: Quantization is "MP3 compression" for AI models. It shrinks the file size massively with almost zero loss in intelligence, allowing you to run huge models on consumer hardware.

The Deep Dive: Standard LLMs are trained in FP16 (16-bit Floating Point). This means every single number (parameter) in the model's brain takes up 16 bits of space.

  • The Problem: A 70 Billion parameter model at FP16 requires ~140GB of VRAM. You would need an Enterprise server to run it.

Enter Quantization: We "round down" those high-precision numbers. Instead of 16-bit (3.14159265), we chop it down to 4-bit (3.14).

  • The Result (Q4): That same 70B model now only needs ~40GB of VRAM. Suddenly, you can run it on dual 3090s or a Mac Studio!
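To make the "rounding down" idea concrete, here is a toy 4-bit quantizer in Python. It is only a sketch of the principle: it maps a handful of made-up weights onto 16 levels using a simple min/max scale. Real GGUF quantization types work block by block with more elaborate schemes, and the function names here are hypothetical.

```python
def quantize_4bit(weights):
    """Map floats to 16 integer levels (0..15) plus a scale and an offset."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 15 or 1.0  # 15 steps between 16 levels; avoid zero scale
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize_4bit(codes, scale, lo):
    """Rebuild approximate floats from the 4-bit codes."""
    return [lo + c * scale for c in codes]


# Made-up example weights, purely for illustration.
weights = [0.12, -0.53, 0.98, -0.07, 0.44, -0.91, 0.30, 0.05]
codes, scale, lo = quantize_4bit(weights)
restored = dequantize_4bit(codes, scale, lo)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("4-bit codes:", codes)
print("restored:   ", [round(r, 3) for r in restored])
print("max error:  ", round(max_error, 4))
```

Each weight now needs 4 bits instead of 16, and the rounding error stays within half a step of the scale.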

The "Sweet Spot" Cheat Sheet

When downloading models (GGUF format), you will see these tags. Here is what they mean for your hardware:

Quantization | Quality | VRAM Usage | Verdict
Q8 / FP16 | Lossless | Massive | Avoid. Unless you are doing scientific research, you cannot tell the difference. Wastes VRAM.
Q6_K | Near Perfect | High | Good for smaller models (8B) where you have VRAM to spare.
Q4_K_M | The Standard | Low | The Gold Standard. This is the "MP3 @ 320kbps" of AI. It is 99% as smart as the original but uses half the VRAM. Start here.
Q2 / Q3 | Degraded | Lowest | "Brain Damage" territory. The model starts to hallucinate and lose coherence. Only use if you are desperate for VRAM.


🟢 NVIDIA: The CUDA Standard (Tiered Recommendations)

NVIDIA remains the easiest path due to the sheer dominance of the CUDA ecosystem. If you want "plug and play," this is it. Painfully so, given 2026 prices.

GPU | VRAM | 2026 Verdict
RTX 5090 | 32GB | The new "Endgame" card. 32GB allows for 70B models at high-q.
RTX 4090 | 24GB | The high-end value king. Still beats almost everything in speed.
RTX 5080 | 16GB | Fast, but the 16GB limit is frustrating for 2026's mid-tier models.
RTX 3090 (Used) | 24GB | The Core Lab Recommendation. Best bang-for-buck for 24GB VRAM.

The 24GB+ Club: RTX 5090, 4090, and the 3090 Value King (The Best for 70B+ Models)

NVIDIA GeForce RTX 5090 32GB

ASUS ROG Astral GeForce RTX™ 5090 OC Edition Graphics Card, NVIDIA (PCIe® 5.0, 32GB GDDR7
  • VRAM: 32GB
  • Precision: FP16, FP32, FP8, BF16
  • Tensor Cores: Yes (5th Gen)
  • Typical LLM Compatibility: The new undisputed champion for consumers. 32GB of VRAM and blazing-fast GDDR7 memory can handle massive 70B+ models at high quantization, complex fine-tuning, and multi-modal models with ease.
  • Performance: The fastest consumer card on the market, period. Expect 60-90+ tokens/sec on 7B models.
  • Price Range: $1,999+ USD / $2,700+ CAD

NVIDIA GeForce RTX 4090 24GB

  • VRAM: 24GB
  • Precision: FP16, FP32, BF16
  • Tensor Cores: Yes (4th Gen)
  • Typical LLM Compatibility: The previous king and still an absolute beast for 70B models. Its 24GB of fast VRAM remains in high demand for its capability and speed.
  • Performance: Top-tier performance, second only to the 5090. Expect 35-50+ tokens/sec.
  • Price Range: $1,600 - $2,000 USD / $2,150 - $2,700 CAD (Prices remain very high)

NVIDIA GeForce RTX 3090 24GB (Used)

  • VRAM: 24GB
  • Precision: FP16, FP32
  • Tensor Cores: Yes (3rd Gen)
  • Typical LLM Compatibility: The best price-to-VRAM workhorse. Offers the same 24GB as a 4090, making it capable of running 70B models. The best value for getting into the 24GB club.
  • Performance: Excellent, though a clear step behind the 4090. Expect 15-25 tokens/sec.
  • Price Range: $700 - $900 USD / $940 - $1,210 CAD (Used market only)

🚀 Multi-GPU Secrets: How to Build a 48GB VRAM Monster with Dual 3090s

What's better than 24GB of VRAM? 48GB. If you are serious about running the massive "Frontier Models" (like Llama-3-70B at a high-quality quantization, or Command-R+), a single card just won't cut it. As a "Blue Team" guy, I love efficiency. For the price of one new 5090, you can snag two used 3090s. With 48GB of VRAM, you can run a 70B model at near-lossless precision.

The secret weapon of the community is buying:

2X (Yes TWO) used RTX 3090s.

  • Cost: ~$1,600 USD (Total)
  • VRAM: 48GB
  • Performance: Slower than a 5090, but with enough VRAM for models a single card cannot load.
  • The Magic: You don't even need NVLink (though it helps). Software like llama.cpp and Ollama automatically split the model across both cards over your PCIe slots. This setup rivals enterprise workstations costing $10,000+.
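As a simplified picture of what "splitting the model across both cards" means, the sketch below hands out a model's layers in proportion to each card's VRAM. The helper `split_layers` is hypothetical and is not how llama.cpp or Ollama decide placement internally; the 80-layer count is an assumption matching a Llama-3-70B-sized model.

```python
def split_layers(n_layers, vram_gb):
    """Assign layers to GPUs in proportion to each card's VRAM."""
    total = sum(vram_gb)
    shares = [n_layers * v / total for v in vram_gb]
    counts = [int(s) for s in shares]  # round down first
    leftover = n_layers - sum(counts)
    # Hand remaining layers to the cards with the largest fractional share.
    by_fraction = sorted(
        range(len(vram_gb)),
        key=lambda i: shares[i] - counts[i],
        reverse=True,
    )
    for i in by_fraction[:leftover]:
        counts[i] += 1
    return counts


print(split_layers(80, [24, 24]))  # two matching 24GB cards
print(split_layers(80, [24, 12]))  # mismatched cards get uneven shares
```

With matching cards the split is even, which is one reason a pair of identical 3090s is the simple choice.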

If you'd like to explore this option, check out the use case of building an "AIO Server/NAS" to support it.

NAS vs Server: What's the Difference? (2026 Guide for Homelabs)
NAS vs server isn't just semantics - it's the difference between smooth performance and constant frustration. Learn when to use each and how to build the perfect homelab in 2026.

The 16GB Sweet Spot: RTX 5080 vs. 4060 Ti (The Sweet Spot for 34B Models)

NVIDIA GeForce RTX 5080 16GB

  • VRAM: 16GB
  • Precision: FP16, FP32, FP8, BF16
  • Tensor Cores: Yes (5th Gen)
  • Typical LLM Compatibility: A performance monster for 16GB. Ideal for 34B models at high quantization or 70B models at low-q. Its speed also makes it a potent fine-tuning card for smaller models.
  • Performance: Exceptionally fast, massively outperforming the 4080. Expect 40-55 tokens/sec.
  • Price Range: $1,199+ USD / $1,650+ CAD

NVIDIA GeForce RTX 4080 16GB

  • VRAM: 16GB
  • Precision: FP16, FP32, BF16
  • Tensor Cores: Yes (4th Gen)
  • Typical LLM Compatibility: A high-performance 16GB card. Easily handles 34B models and can run 70B models at low quantization.
  • Performance: High-end, a significant step up from the 4060 Ti. Expect 25-35 tokens/sec.
  • Price Range: $800 - $1,000 USD / $1,075 - $1,350 CAD

NVIDIA GeForce RTX 4060 Ti 16GB

  • VRAM: 16GB
  • Precision: FP16, FP32, BF16
  • Tensor Cores: Yes (4th Gen)
  • Typical LLM Compatibility: The best new budget VRAM card. Its 16GB VRAM is its key feature, making 34B models comfortable and 70B models (low-q) possible.
  • Performance: Good, but limited by its 128-bit bus. A great "VRAM-first, speed-second" choice. Expect 12-18 tokens/sec.
  • Price Range: $400 - $500 USD / $540 - $670 CAD

The 12GB VRAM Club (Great for 13B Models)

NVIDIA GeForce RTX 4070 Ti 12GB

  • VRAM: 12GB
  • Precision: FP16, FP32, BF16
  • Tensor Cores: Yes (4th Gen)
  • Typical LLM Compatibility: A very fast 12GB card. Excellent for 13B models and can run 34B models at very low quantization (q3/q4).
  • Performance: Faster than a 3060, but VRAM is the limit. Expect 18-28 tokens/sec.
  • Price Range: $550 - $700 USD / $740 - $940 CAD

NVIDIA GeForce RTX 3060 12GB

Figure: Core Lab's own server with the side panel off, showing the ASUS RTX 3060 12GB GPU and other internals.
TESTED: This is the GPU I use for my LLMs! I've run all kinds of models on this little beast and generated pics, text, video, etc. Solid choice for price/performance.
  • VRAM: 12GB
  • Precision: FP16, FP32
  • Tensor Cores: Yes (3rd Gen)
  • Typical LLM Compatibility: The long-time budget VRAM king for 12GB. Perfect for running 13B models and experimenting with 34B models (low-q). Works quickest with 7-8B models of course.
  • Performance: A solid entry-level workhorse. Expect 7-12 tokens/sec.
  • Price Range: $250 - $350 USD / $330 - $465 CAD

The 8GB VRAM Club (Entry-Level 7B Models)

NVIDIA GeForce RTX 3050 8GB

  • VRAM: 8GB
  • Precision: FP16, FP32
  • Tensor Cores: Yes (3rd Gen)
  • Typical LLM Compatibility: The bare minimum for entry. Can comfortably run 7B models.
  • Performance: Entry-level. Expect slower generation speeds. 5-10 tokens/sec.
  • Price Range: $200 - $250 USD / $270 - $330 CAD

🔴 AMD & ROCm: The High-VRAM Value Disruptors

In 2026, AMD is no longer the "broken driver" underdog. ROCm 7.x is stable and supported natively by Ollama and LM Studio. For years, the answer was simple: if you weren't using NVIDIA's CUDA, you were wasting your time. Not anymore!

AMD (ROCm): The Power User's Choice

AMD's compute platform, ROCm, has made massive progress. Thanks to community efforts (like llama.cpp and Oobabooga's text-generation-webui) and AMD's own driver improvements, running LLMs on modern AMD cards is now completely viable, especially on Linux. AMD basically doubled down on Linux support!

  • The Hardware: The Radeon RX 7900 XTX (24GB) and RX 7900 XT (20GB) are now direct competitors to the 3090 and 4090. They offer huge VRAM pools for a competitive price.
  • The 2026 Verdict: The RX 7900 XTX is still the undisputed champion for AMD-based AI. It's the "budget 3090" alternative. With ROCm 7.x stability, this card is a monster for running 70B models on Linux.
  • Why it's a "Blue Team" pick: It offers 24GB of VRAM for hundreds less than a 4090/5090, making it the most pragmatic way to get high-parameter intelligence without the "NVIDIA Tax."

AMD Radeon RX 7900 XTX: 24GB of VRAM for under $800

XFX Speedster MERC310 AMD Radeon RX 7900XTX Black Gaming Graphics Card with 24GB GDDR6, AMD RDNA 3
  • VRAM: 24GB
  • Precision: FP16, FP32
  • Typical LLM Compatibility: AMD's 24GB VRAM king. A direct competitor to the RTX 3090 for VRAM capacity, allowing it to run 70B models.
  • Estimated Tokens/sec: 7-12
  • Price Range: $750 - $900 USD / $1,000 - $1,210 CAD

AMD Radeon RX 9070 XT 16GB

  • VRAM: 16GB
  • Precision: FP16, FP32
  • Typical LLM Compatibility: The new RDNA 4 performance card. Its 16GB VRAM and 2nd Gen AI accelerators make it an excellent choice for 34B models on a Linux-based system.
  • Estimated Tokens/sec: 8-14
  • Price Range: $599+ USD / $800+ CAD

AMD Radeon RX 9060 XT 16GB

  • VRAM: 16GB
  • Precision: FP16, FP32
  • Typical LLM Compatibility: A direct competitor to the RTX 4060 Ti 16GB. This is AMD's "budget VRAM" option, perfect for 34B models.
  • Estimated Tokens/sec: 6-10
  • Price Range: $349+ USD / $470+ CAD

AMD Radeon RX 6900 XT 16GB (Used)

  • VRAM: 16GB
  • Precision: FP16, FP32
  • Typical LLM Compatibility: A great used 16GB option. Can handle 34B models (low-q). Lacks the new AI accelerators, so performance relies on raw compute.
  • Estimated Tokens/sec: 5-9
  • Price Range: $350 - $500 USD / $470 - $670 CAD (Used market)

AMD Radeon RX 6800 XT 16GB (Used)

  • VRAM: 16GB
  • Precision: FP16, FP32
  • Typical LLM Compatibility: Can handle 13B models well and 34B models at low quantization. A solid value on the used market.
  • Estimated Tokens/sec: 4-8
  • Price Range: $300 - $450 USD / $400 - $600 CAD (Used market)

As you can see, there's a lot to consider, and you can run an LLM on almost any recent GPU. Much of this depends on your budget, but if you've got an "old" GPU lying around, you can probably put it to use. I started playing with 4B LLMs via Ollama on a GTX 1660 6GB!


🔵 Intel Battlemage: The New Budget Entry Point

Intel is the new player on the block, and for a long time, they were a non-starter for AI. However, thanks to massive community and Intel-led software efforts (like OpenVINO, SYCL support in llama.cpp, and specific extensions for text-generation-webui), the flagship Arc card has become a fascinating budget option for tinkerers who are willing to get their hands dirty.

Intel Arc Comparison: Battlemage (Xe2) vs. Alchemist (Xe)


Feature | Arc A380 (Legacy/Entry) | Arc B580 (Battlemage Mid) | Arc B770 (Battlemage High)
VRAM | 6GB GDDR6 | 12GB GDDR6 | 16GB GDDR6
Architecture | Xe (Alchemist) | Xe2 (Battlemage) | Xe2 (Battlemage)
Precision | FP16, FP32 | FP8, BF16 (Native) | FP8, BF16 (Native)
70B Support | No (System RAM only) | Poor (Requires 3-bit/GGUF) | Possible (4-bit/GGUF)
Optimal Model | TinyLlama (1.1B) / Phi-3 | 7B - 8B Models (Llama 3) | 12B - 14B Models (Mistral)
Price (Est.) | ~$100 USD | ~$250 USD | ~$399 - $449 USD

Intel Arc A770 16GB

  • VRAM: 16GB
  • Precision: FP32, FP16, INT8
  • Typical LLM Compatibility: This is the ultimate budget VRAM card for hobbyists. Its 16GB of VRAM is its entire selling point, allowing it to comfortably run quantized 34B models or even 70B models at very low quantization (q2/q3).
  • The Catch: Performance is not its strong suit, and software support is experimental. This is NOT plug-and-play like NVIDIA. You must be comfortable using Linux, updating to the latest drivers, and using specific software like llama.cpp (with SYCL) or text-generation-webui (with the XPU extension). It is the definition of a "tinkerer's" card, but it's the cheapest 16GB of VRAM you can get, period.
  • Estimated Tokens/sec: 3-7 (Highly variable and software-dependent)
  • Price Range: $200 - $300 (Used market)
  • Note: This card is the direct gaming/consumer version of the Arc Pro B70, which features a massive 32GB VRAM buffer for workstation AI tasks. If you can find the "Pro" variant, it is the best value for 70B models outside of an RTX 3090 / RX 7900 XTX.

Intel Arc B580 12GB (Battlemage)

  • Typical LLM Compatibility: The 12GB buffer is a bit of a "middle child." It is fantastic for 7B-8B parameter models with room for a large 8k-16k context window. It struggles with 14B models unless they are quantized down to 4-bit.
  • Estimated Tokens/sec: 10 - 15 t/s (for 8B models).
  • VRAM Limitation: Users have reported "Out of Memory" (OOM) errors on 30B+ models during the "prefill" stage if the prompt exceeds 2,000 tokens, so keep your context lengths modest on this card.
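Context length costs VRAM because every token in the window keeps keys and values in the KV cache. The sketch below estimates that cost under stated assumptions: a Llama-3-8B-style layout (32 layers, 8 KV heads, head dimension 128) with an FP16 cache. Those architecture numbers and the helper name `kv_cache_gib` are assumptions for illustration; other models and cache quantization settings will differ.

```python
def kv_cache_gib(context_tokens, n_layers=32, n_kv_heads=8,
                 head_dim=128, bytes_per_value=2):
    """Estimate KV cache size in GiB for a given context length."""
    # Factor of 2: both keys and values are stored for every layer.
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return context_tokens * bytes_per_token / 2**30


for ctx in (2048, 8192, 16384):
    print(f"{ctx:>6} tokens -> {kv_cache_gib(ctx):.2f} GiB of KV cache")
```

Add this figure to the size of the model file to judge whether a given context window still fits on a 12GB card.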

Intel Arc A380 6GB (Alchemist)

  • Typical LLM Compatibility: This is strictly an "entry-level" or "headless" AI card. At 6GB, it cannot run the most popular 7B/8B models at 4-bit precision comfortably (which usually require ~5.5GB plus context). It is best used for summarization tasks with tiny models like Phi-3 (3.8B) or as a dedicated encoder for AV1 video.
  • Estimated Tokens/sec: 3 - 5 t/s (for 3.8B models).
  • Fact: While it's a "budget beast" for video, it lacks the Xe2 architectural jumps, making it significantly slower for LLM inference than its Battlemage successors.

🏚️ The "Scrap Lab" Special: Enterprise E-Waste for AI

For the technical masochists (my people!), retired data center cards are 2026's best-kept secret. If you aren't afraid of 3D printing fan shrouds and hacking driver configs, you can get 24GB of VRAM for the price of a budget gaming card.

💵 NVIDIA RTX A6000 (48GB): Coming down in price as companies upgrade to Blackwell. Relatively speaking, that is... it still hurts.

These are retired data-center cards. They are Headless (no HDMI/DisplayPort) and Passively Cooled (they rely on screaming loud server fans).

Tesla P40 & P100: 24GB VRAM for the Price of a Dinner

GPU Model | VRAM | Architecture | The "Catch" | Est. Price (Used) | Best For...
NVIDIA Tesla P40 | 24GB (GDDR5) | Pascal | No FP16 Speed: It is slow at "training" precision. However, for running GGUF/Quantized models in llama.cpp, it uses FP32 math which works fine. Cooling: You MUST strap a blower fan to it. | $150 - $200 USD | The absolute cheapest way to run 70B Models.
NVIDIA Tesla P100 | 16GB (HBM2) | Pascal | Capacity: 16GB is awkward. It has incredibly fast memory (HBM2), but holds less data than the P40. | $130 - $180 USD | Fast inference on medium (34B) models.

⚠️ Cooling and Powering Legacy Data Center GPUs at Home:

  1. Cooling: These cards will overheat and die in a normal PC case. You must buy or 3D print a fan shroud that forces air directly through the card.
  2. Video Output: They have no ports. You need a CPU with integrated graphics (iGPU) or a second cheap GPU just to see your monitor.
  3. Power: They often use EPS (CPU) power connectors, not standard PCIe plugs. You likely need a custom adapter cable.
  4. The "Driver Dance": You often need to enable "Above 4G Decoding" in BIOS and potentially hack Windows drivers to get them working alongside a GeForce card (WDDM vs TCC mode). Linux is highly recommended here.

📊 Expanded GPU Comparison Table (April 2026 Edition)

GPU Model | VRAM | Precision | Tensor/AI Cores? | Power (W) | Tokens/sec (7B) | Price (USD) | Price (CAD)
NVIDIA (The Kings)
RTX 5090 | 32 GB | FP8, FP16, FP32 | ✅ (5th Gen) | 575W | 75 - 100+ | $1,999+ | $2,800+
RTX 5080 | 16 GB | FP8, FP16, FP32 | ✅ (5th Gen) | 360W | 45 - 60 | $1,199+ | $1,650+
RTX 4090 | 24 GB | FP16, FP32, BF16 | ✅ (4th Gen) | 450W | 35 - 50 | $1,500 - $1,800 | $2,100 - $2,500
RTX 3090 (Used) | 24 GB | FP16, FP32 | ✅ (3rd Gen) | 350W | 15 - 25 | $650 - $800 | $900 - $1,100
RTX 4060 Ti 16GB | 16 GB | FP16, FP32 | ✅ (4th Gen) | 160W | 12 - 18 | $380 - $450 | $520 - $620
AMD (The Value)
RX 9070 XT | 16 GB | FP16, FP32 | ✅ (2nd Gen) | 300W | 15 - 22 | $599+ | $820+
RX 7900 XTX | 24 GB | FP16, FP32 | ✅ (1st Gen) | 355W | 10 - 15 | $700 - $850 | $950 - $1,150
Intel (The Budget)
Arc B770 (Battlemage) | 16 GB | FP16, XMX | ✅ (Xe2 XMX) | 240W | 10 - 18 | $349 | $480
Arc B580 | 12 GB | FP16, XMX | ✅ (Xe2 XMX) | 190W | 8 - 14 | $249 | $340
Specialty (Used)
Tesla P40 | 24 GB | FP32 Only | ❌ (Pascal) | 250W | 5 - 10 | $150 | $210
  • Intel Battlemage (B770/B580): The B580 (12GB) is now the absolute budget king for entry-level LLMs, beating the RTX 3060 in price/performance. The B770 (16GB) replaces the A770 as the mid-range "tinkerer" choice.
  • AMD RDNA 4 Boost: The RX 9070 XT performs better in AI than previous AMD cards because RDNA 4 doubled the AI throughput per compute unit. Its token estimate is very close to, or a match for, the RTX 4070 class.
  • Tesla P40 Added: Massive price gap ($150 vs $650 for 24GB VRAM).
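Since this guide ranks cards by VRAM-per-dollar, here is a small Python sketch that computes dollars per GB of VRAM. It uses the low end of each USD price range from the table above; those prices move constantly, so treat the output as a snapshot of the table rather than market data. The helper name `dollars_per_gb` is made up for illustration.

```python
# (name, VRAM in GB, low-end USD price taken from the comparison table)
CARDS = [
    ("RTX 5090", 32, 1999),
    ("RTX 5080", 16, 1199),
    ("RTX 4090", 24, 1500),
    ("RTX 3090 (Used)", 24, 650),
    ("RTX 4060 Ti 16GB", 16, 380),
    ("RX 9070 XT", 16, 599),
    ("RX 7900 XTX", 24, 700),
    ("Arc B770", 16, 349),
    ("Arc B580", 12, 249),
    ("Tesla P40", 24, 150),
]


def dollars_per_gb(card):
    """Return the USD cost of one GB of VRAM for a (name, GB, USD) tuple."""
    _name, vram_gb, price_usd = card
    return price_usd / vram_gb


# Cheapest VRAM first.
ranked = sorted(CARDS, key=dollars_per_gb)
for name, vram_gb, price_usd in ranked:
    print(f"{name:<18} {vram_gb:>2} GB  ${price_usd / vram_gb:6.2f} per GB")
```

This ranks capacity per dollar only; it says nothing about speed, cooling, or driver effort.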

🍎 How to Run LLMs on Apple Mac Studio: The "Unified Memory" Cheat Code - Apple Silicon

If you are allergic to fan noise and Linux terminal troubleshooting, the Mac Studio (M4 Ultra) is the answer.

  • 192GB Unified Memory: You can run Llama-4-405B (quantized) on a desktop.
  • Efficiency: It draws less power than a single RTX 5090 while providing 6x the VRAM.

Apple's M-Series chips use Unified Memory Architecture (UMA). This means the system RAM is the VRAM.

  • Mac Studio (M1/M2 Ultra) with 128GB RAM: This allows you to load massive 120B+ parameter models that simply crash on an RTX 4090.
  • The Trade-off: Speed. An M2 Ultra generates text at ~15-20 tokens/sec, while an RTX 4090 might hit 50+. But if you need to run the biggest models possible, Apple Silicon is currently the only consumer way to get 100GB+ of VRAM.

💡 Core Lab Buying Tip: You don't need to buy these new. The M1 Ultra is still a beast for LLMs. Look for "Amazon Renewed" units to save $500+.

Apple Mac Studio M1 Ultra

128GB Unified RAM, Renewed,


Best For: Running 120B+ Models (Command-R+, Llama-3-405B-Q3)

VRAM: 128GB Unified

Status: The "Quiet Giant" of Local AI.


Summary: Which GPU Should You Buy?

  • The "Money is No Object" Build: NVIDIA RTX 5090 (32GB).
  • The "Serious Researcher" Build: Dual Used RTX 3090s (48GB Total).
  • The "Best for Most People" Build: NVIDIA RTX 4060 Ti (16GB) or Intel Arc B580.
  • The "Local GPT-4 Rival" Build: Mac Studio with 128GB+ RAM.

🛠️ Build Checklist: Your First Local AI Rig

Before you hit "Order" or start tearing apart your current server, run through this tactical checklist to ensure your hardware can actually handle the heat of local inference.

1. The VRAM Verification

  • [ ] Identify your "Target Intelligence": Are you running 8B (entry), 34B (mid-tier), or 70B+ (pro) models?
  • [ ] The 20% Overhead Rule: Calculate your model size and add 20% for KV Cache and "Context Window" room. If a model needs 13GB to load, you need a 16GB card to actually talk to it (13GB x 1.2 = 15.6GB).
  • [ ] Multi-GPU Plan: If using two cards (e.g., dual 3090s), ensure your motherboard supports x8/x8 PCIe lane splitting. Running a second card at x4 will bottleneck your ingest speeds.
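The 20% overhead rule from this checklist is easy to turn into a quick check. The sketch below picks the smallest VRAM tier that fits a model file plus overhead. The tier list mirrors the card sizes discussed in this guide (with 48GB standing in for a dual-3090 setup), and the helper name `smallest_card_gb` is hypothetical.

```python
CARD_TIERS_GB = (8, 12, 16, 24, 32, 48)  # VRAM tiers discussed in this guide


def smallest_card_gb(model_file_gb, overhead=0.20, tiers=CARD_TIERS_GB):
    """Return the smallest tier that fits the model plus overhead, or None."""
    needed = model_file_gb * (1 + overhead)
    for tier in tiers:
        if needed <= tier:
            return tier
    return None  # nothing in the list is big enough


for size in (6, 13, 42):
    print(f"{size} GB model file -> card tier: {smallest_card_gb(size)}")
```

A `None` result means the model needs a bigger setup than any tier listed, or a lower quantization.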

2. Power & Infrastructure

  • [ ] PSU Headroom: 2026 high-end cards are thirsty. Ensure you have an ATX 3.1 compliant power supply with at least 1000W if running a 5090 or dual-GPU setup.
  • [ ] The "Thermal Defence" Plan: AI inference keeps a GPU at 100% load for long periods. Do you have at least two intake and two exhaust fans? For "Scrapyard" enterprise cards, is your 3D-printed fan shroud ready?
  • [ ] 12V-2x6 Cable Check: If using NVIDIA 40/50 series, ensure you are using native cables, with no sketchy adapters that might melt during a 4-hour fine-tuning session.

3. OS & Driver Hardening

  • [ ] Linux First: While Windows is catching up, Ubuntu 24.04 LTS or Debian 13 remains the gold standard for stability and driver support (especially for AMD ROCm).
  • [ ] Kernel Check: Ensure your kernel version is compatible with the latest NVIDIA 570+ or ROCm 7.x drivers.
  • [ ] Docker Ready: Install the nvidia-container-toolkit so you can run Ollama or LocalAI in isolated containers without "polluting" your host OS.

4. Maintenance & Monitoring

  • [ ] Install nvtop or btop: You need a way to see VRAM usage in real-time. If you hit 99% usage, the system will swap to system RAM and performance will tank.
  • [ ] Automated Pruning: Set up a cron job or script (like our Ghost Janitor logic!) to prune old, unused model weights. These files are 5GB-50GB each and will eat your SSD alive.
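Both maintenance items can be scripted. The Python sketch below is one possible approach, not the Ghost Janitor script mentioned above: `read_vram_usage` shells out to `nvidia-smi` (NVIDIA cards only), and `stale_files` lists model files that have not been modified recently. It uses modification time as a stand-in for "unused" and only reports candidates; it never deletes anything. All function names are made up for illustration.

```python
import os
import subprocess
import time


def parse_vram_usage(csv_text):
    """Parse 'used, total' MiB lines into (used, total, percent) per GPU."""
    usage = []
    for line in csv_text.strip().splitlines():
        used, total = (int(x) for x in line.split(","))
        usage.append((used, total, round(100 * used / total, 1)))
    return usage


def read_vram_usage():
    """Query nvidia-smi for memory use; requires the NVIDIA driver."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_vram_usage(out)


def stale_files(folder, max_age_days=30, now=None):
    """List files under folder not modified within max_age_days."""
    cutoff = (now or time.time()) - max_age_days * 86400
    stale = []
    for root, _dirs, files in os.walk(folder):
        for name in files:
            path = os.path.join(root, name)
            if os.path.getmtime(path) < cutoff:
                stale.append(path)
    return sorted(stale)
```

Point `stale_files` at wherever your model weights live and review the list by hand before removing anything.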

Ready to Build or Advance? 🤖

📡 You've picked your hardware. Now it's time for the fun part! My complete technical guide will walk you through setting up all the software, from Docker to Ollama, to get your local LLM running.
Unleash Local AI: Running Ollama and Big-AGI with Docker Compose
Introduction The rise of Large Language Models (LLMs) has been incredible, but running them locally can be a challenge. Ollama simplifies the process of downloading and running LLMs, while Big-AGI provides a framework for building autonomous AI agents. Combining these two tools unlocks powerful possibilities for local experimentation and development.

If you still need more or better hardware, check out my Homelab Hardware Guide, which has plenty of prebuilt systems to consider.

👉 If you're still exploring what AI tools you can actually use once your local LLM is running, Your Tech Compass has a practical guide to the best AI tools for getting real results, useful context before you dive into the deep end of self-hosting.

Ask AI: The Best AI Tools to Get Answers, Help, and Results Fast - Your Tech Compass
Stop wasting time on Google. Ask AI instead; our guide covers the 6 best AI tools, what each does best, and how to ask smarter questions.

Ask AI: For the best AI tools for getting real results.


🙋‍♂️ Local AI Architect FAQ

Q: Is 16GB of VRAM enough for 2026-era models?

A: Yes, but it is now the "Mid-Tier" baseline. A 16GB card (like the RTX 5080, RTX 4060 Ti, or the new AMD RX 9070 XT) is the perfect home for 34B parameter models using Q4_K_M quantization. It also allows you to run "Small" models (8B) at maximum precision (Q8/FP16) for high-accuracy tasks like Python coding.

Q: Why should I buy a used RTX 3090 instead of a new RTX 4070 Ti?

A: In local AI, VRAM is non-negotiable. The 4070 Ti has faster clock speeds, but its 12GB limit means it will physically fail to load a 70B model. The 3090's 24GB pool allows you to run much larger, more "intelligent" models that the 4070 Ti simply cannot touch. If the model doesn't fit in VRAM, speed is irrelevant.

Q: How much faster is the RTX 5090 compared to the 4090 for inference?

A: While raw compute is up, the real winner is the GDDR7 memory bandwidth. AI inference is often "memory-bound," meaning the speed at which data moves between the VRAM and the cores is the bottleneck. The 5090 processes tokens significantly faster, especially during "Prompt Ingest" on massive context windows (128k+ tokens).
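The "memory-bound" point can be expressed as a simple ceiling: generating one token means reading roughly every weight once, so tokens per second cannot exceed memory bandwidth divided by model size. The sketch below uses hypothetical round numbers, not measurements of any card, and real throughput will land below this ceiling. The helper name `max_tokens_per_sec` is made up for illustration.

```python
def max_tokens_per_sec(bandwidth_gb_s, model_size_gb):
    """Rule-of-thumb ceiling: each generated token reads every weight once."""
    return bandwidth_gb_s / model_size_gb


# Hypothetical bandwidth figures for illustration only.
for bandwidth in (500, 1000, 2000):
    ceiling = max_tokens_per_sec(bandwidth, 5)
    print(f"{bandwidth} GB/s with a 5 GB model -> at most ~{ceiling:.0f} tok/s")
```

Doubling bandwidth doubles the ceiling, which is why a memory upgrade matters more for generation speed than extra raw compute.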

Q: Can I mix NVIDIA and AMD GPUs in the same homelab server?

A: You can, but you shouldn't. Most inference engines (Ollama, llama.cpp, vLLM) expect a unified backend: either CUDA or ROCm. Mixing brands creates "Driver Hell" and often results in the system only recognizing one card or crashing during VRAM offloading. Stick to one ecosystem per build.

Q: What is the "Unified Memory" advantage for Macs?

A: Unlike PCs, where VRAM is soldered to the GPU, Apple Silicon uses Unified Memory. If you have a Mac Studio with 128GB of RAM, the GPU can use nearly all of it. This allows you to run massive 400B+ models that would require $20,000 worth of enterprise NVIDIA cards to fit on a PC.

Q: Does PCIe 5.0 matter for running local LLMs?

A: Not really. Once a model is loaded into your GPU, the PCIe bus sits mostly idle while the GPU does the heavy lifting internally. Don't spend extra on a PCIe 5.0 motherboard just for AI; put that money toward a card with more VRAM instead.