Find Your Perfect Local AI Model

Compare system requirements and performance benchmarks, and get recommendations tailored to your hardware.

Showing 20 of 20 models
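Choosing a model mostly comes down to fitting its memory footprint to your machine: run on GPU if the model fits in VRAM, otherwise fall back to system RAM and the CPU. A minimal sketch of that filtering logic, using a few entries from the table below (the selection rule is an illustrative assumption, not this site's actual recommendation code):

```python
# Illustrative hardware filter. The requirements mirror three
# entries from the model list; the fit rule (VRAM for GPU use,
# else RAM for CPU use) is a simplifying assumption.
MODELS = [
    {"name": "llama3.3", "ram_gb": 48, "vram_gb": 44},
    {"name": "mistral",  "ram_gb": 8,  "vram_gb": 6},
    {"name": "llama3.2", "ram_gb": 4,  "vram_gb": 3},
]

def recommend(ram_gb: float, vram_gb: float) -> list[str]:
    """Return models whose requirements fit the given hardware."""
    fits = lambda m: m["vram_gb"] <= vram_gb or m["ram_gb"] <= ram_gb
    return [m["name"] for m in MODELS if fits(m)]

# A 16GB-RAM laptop with an 8GB GPU can run the two smaller models:
print(recommend(ram_gb=16, vram_gb=8))  # ['mistral', 'llama3.2']
```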

Code Llama

Meta · 34B

Meta's specialized coding model based on Llama 2. Excellent for code completion and generation.

Category: coding · Tags: coding, programming, infill
Size: 20GB
RAM: 26GB
VRAM: 22GB
Context: 16K
Performance (tokens/sec): CPU 5 · Mid GPU 28 · High GPU 65
ollama pull codellama
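The download sizes throughout this list are much smaller than full-precision weights because Ollama pulls 4-bit (Q4) quantizations by default. A back-of-the-envelope estimate of file size from parameter count (the ~4.8 bits/weight figure is a fitted assumption covering quantization plus metadata overhead, not an official number):

```python
def approx_q4_size_gb(params_billion: float,
                      bits_per_weight: float = 4.8) -> float:
    """Rough download size of a 4-bit-quantized model.

    bits_per_weight ~4.5-5 is an assumed average for Q4-family
    quantizations including overhead; treat results as estimates.
    """
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(round(approx_q4_size_gb(34), 1))  # ~20GB, matching Code Llama 34B
print(round(approx_q4_size_gb(7), 1))   # ~4GB, close to Mistral 7B's 4.1GB
```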

Command R

Cohere · 35B

Cohere's model optimized for RAG and tool use. Excellent at following complex instructions.

Category: general · Tags: rag, tool-use, enterprise
Size: 20GB
RAM: 26GB
VRAM: 22GB
Context: 128K
Performance (tokens/sec): CPU 5 · Mid GPU 26 · High GPU 60
ollama pull command-r

DeepSeek R1

DeepSeek · 70B

State-of-the-art reasoning model with chain-of-thought capabilities. Excels at math, coding, and complex reasoning.

Category: reasoning · Tags: reasoning, math, coding
Size: 43GB
RAM: 48GB
VRAM: 44GB
Context: 64K
Performance (tokens/sec): CPU 2.5 · Mid GPU 14 · High GPU 42
ollama pull deepseek-r1

DeepSeek R1 Distill (Qwen)

DeepSeek · 32B

Distilled version of R1 based on Qwen. Great reasoning in a smaller package.

Category: reasoning · Tags: reasoning, distilled, efficient
Size: 19GB
RAM: 24GB
VRAM: 20GB
Context: 32K
Performance (tokens/sec): CPU 6 · Mid GPU 28 · High GPU 65
ollama pull deepseek-r1-distill-qwen

Gemma 2

Google · 27B

Google's open model built from Gemini research. Available in 2B, 9B, and 27B sizes.

Category: general · Tags: google, efficient, instruct
Size: 16GB
RAM: 20GB
VRAM: 18GB
Context: 8K
Performance (tokens/sec): CPU 7 · Mid GPU 32 · High GPU 75
ollama pull gemma2

Llama 3.2

Meta · 3B

Efficient smaller models from Meta, perfect for on-device deployment. Available in 1B and 3B sizes.

Category: small · Tags: lightweight, fast, mobile
Size: 2GB
RAM: 4GB
VRAM: 3GB
Context: 128K
Performance (tokens/sec): CPU 35 · Mid GPU 120 · High GPU 200
ollama pull llama3.2

Llama 3.2 Vision

Meta · 11B

Multimodal model that can understand images and text. Available in 11B and 90B variants.

Category: vision · Tags: multimodal, image-understanding, vision
Size: 6.5GB
RAM: 10GB
VRAM: 8GB
Context: 128K
Performance (tokens/sec): CPU 12 · Mid GPU 45 · High GPU 90
ollama pull llama3.2-vision

Llama 3.3

Meta · 70B

Meta's latest and most capable open model. Excellent for general tasks, coding, and reasoning with 128K context.

Category: general · Tags: chat, instruct, multilingual
Size: 43GB
RAM: 48GB
VRAM: 44GB
Context: 128K
Performance (tokens/sec): CPU 3 · Mid GPU 15 · High GPU 45
ollama pull llama3.3

LLaVA

LLaVA Team · 13B

Visual instruction-tuned model combining CLIP vision with Llama. Great for image understanding.

Category: vision · Tags: vision, multimodal, image-understanding
Size: 8GB
RAM: 12GB
VRAM: 10GB
Context: 4K
Performance (tokens/sec): CPU 10 · Mid GPU 40 · High GPU 85
ollama pull llava

Mistral 7B

Mistral AI · 7B

Highly efficient 7B model that punches above its weight. Great balance of speed and capability.

Category: general · Tags: efficient, fast, instruct
Size: 4.1GB
RAM: 8GB
VRAM: 6GB
Context: 32K
Performance (tokens/sec): CPU 20 · Mid GPU 80 · High GPU 150
ollama pull mistral

Mistral Small

Mistral AI · 24B

Latest Mistral model optimized for efficiency. Enterprise-grade quality in a compact size.

Category: general · Tags: enterprise, efficient, function-calling
Size: 14GB
RAM: 18GB
VRAM: 16GB
Context: 32K
Performance (tokens/sec): CPU 8 · Mid GPU 35 · High GPU 80
ollama pull mistral-small

Mixtral 8x7B

Mistral AI · 8x7B (47B)

Mixture of Experts model that activates only 2 experts per token. Fast inference with high quality.

Category: general · Tags: moe, efficient, multilingual
Size: 26GB
RAM: 32GB
VRAM: 28GB
Context: 32K
Performance (tokens/sec): CPU 5 · Mid GPU 25 · High GPU 60
ollama pull mixtral
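The "8x7B (47B)" naming can be confusing: the eight expert feed-forward stacks are not eight independent 7B models, and since only two experts are routed per token, far fewer parameters are active on each step. Mistral AI's published figures are roughly 46.7B total and 12.9B active; the shared/expert split in this sketch is back-solved from those two numbers, not taken from the model card:

```python
# Why "8x7B" is ~47B total but only ~13B active per token.
# Published Mixtral figures: ~46.7B total, ~12.9B active.
TOTAL_B, ACTIVE_B = 46.7, 12.9
N_EXPERTS, TOP_K = 8, 2

# total = shared + 8*e and active = shared + 2*e, so e = (total - active) / 6.
expert_b = (TOTAL_B - ACTIVE_B) / (N_EXPERTS - TOP_K)
shared_b = TOTAL_B - N_EXPERTS * expert_b  # attention, embeddings, router

print(f"per-expert params ~ {expert_b:.1f}B, always-active shared ~ {shared_b:.1f}B")
# Only TOP_K of the N_EXPERTS expert stacks run for any given token:
print(f"active per token ~ {shared_b + TOP_K * expert_b:.1f}B of {TOTAL_B}B")
```

This is why Mixtral's tokens/sec figures look closer to a ~13B dense model than a 47B one, while its memory requirements still reflect the full 47B of weights.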

Mxbai Embed Large

Mixedbread AI · 335M

State-of-the-art embedding model. Top performance on MTEB benchmarks.

Category: embedding · Tags: embedding, rag, retrieval
Size: 0.67GB
RAM: 2GB
VRAM: 1GB
Context: 1K
Performance (tokens/sec): CPU 300 · Mid GPU 1500 · High GPU 4000
ollama pull mxbai-embed-large

Nomic Embed Text

Nomic AI · 137M

High-quality text embedding model. Perfect for RAG, semantic search, and similarity matching.

Category: embedding · Tags: embedding, rag, search
Size: 0.27GB
RAM: 1GB
VRAM: 0.5GB
Context: 8K
Performance (tokens/sec): CPU 500 · Mid GPU 2000 · High GPU 5000
ollama pull nomic-embed-text
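Embedding models like these don't generate text; they map text to vectors, and retrieval or semantic search then ranks documents by cosine similarity to the query vector. A self-contained sketch with made-up 3-dimensional vectors (real embeddings from models like nomic-embed-text have hundreds of dimensions):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy 3-d "embeddings" -- purely illustrative values.
query = [0.9, 0.1, 0.0]
docs = {"cats": [0.8, 0.2, 0.1], "stocks": [0.1, 0.0, 0.9]}

# Rank documents by similarity to the query vector.
best = max(docs, key=lambda d: cosine(query, docs[d]))
print(best)  # 'cats' -- its vector points nearly the same direction
```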

Orca Mini

Pankaj Mathur · 7B

Compact model with strong reasoning. Good balance between size and capability.

Category: small · Tags: reasoning, efficient, instruct
Size: 4GB
RAM: 8GB
VRAM: 6GB
Context: 4K
Performance (tokens/sec): CPU 22 · Mid GPU 85 · High GPU 160
ollama pull orca-mini

Phi-3

Microsoft · 14B

Microsoft's small language model. Surprisingly capable for its size, great for resource-constrained environments.

Category: small · Tags: efficient, small, reasoning
Size: 8.2GB
RAM: 12GB
VRAM: 10GB
Context: 128K
Performance (tokens/sec): CPU 15 · Mid GPU 55 · High GPU 110
ollama pull phi3

Qwen 2.5

Alibaba · 72B

Alibaba's flagship model with excellent multilingual support. Available from 0.5B to 72B.

Category: general · Tags: multilingual, chinese, chat
Size: 44GB
RAM: 50GB
VRAM: 46GB
Context: 128K
Performance (tokens/sec): CPU 2.8 · Mid GPU 14 · High GPU 40
ollama pull qwen2.5

Qwen 2.5 Coder

Alibaba · 32B

Specialized coding model with excellent code completion and generation. Supports 92 programming languages.

Category: coding · Tags: coding, programming, code-completion
Size: 19GB
RAM: 24GB
VRAM: 20GB
Context: 128K
Performance (tokens/sec): CPU 6 · Mid GPU 30 · High GPU 70
ollama pull qwen2.5-coder

StarCoder 2

BigCode · 15B

Code LLM trained on The Stack v2. Excellent for code completion across 600+ programming languages.

Category: coding · Tags: coding, code-completion, programming
Size: 9GB
RAM: 14GB
VRAM: 11GB
Context: 16K
Performance (tokens/sec): CPU 12 · Mid GPU 50 · High GPU 100
ollama pull starcoder2

TinyLlama

TinyLlama · 1.1B

Ultra-compact 1.1B model. Perfect for testing, edge devices, and resource-limited environments.

Category: small · Tags: tiny, fast, edge
Size: 0.64GB
RAM: 2GB
VRAM: 1GB
Context: 2K
Performance (tokens/sec): CPU 60 · Mid GPU 200 · High GPU 350
ollama pull tinyllama
馃 OllamaModels.com

Find the best local AI models for your hardware. Not affiliated with Ollama.