torchao
Package for applying ao techniques to GPU models
Description
TorchAO
PyTorch-Native Training-to-Serving Model Optimization
- Pre-train Llama-3.1-70B 1.5x faster with float8 training
- Recover 67% of quantized accuracy degradation on Gemma3-4B with QAT
- Quantize Llama-3-8B to int4 for 1.89x faster inference with 58% less memory
📣 Latest News
- [Oct 25] QAT is now integrated into Unsloth for both full and LoRA fine-tuning! Try it out using this notebook.
- [Oct 25] MXFP8 MoE training prototype achieved ~1.45x speedup for MoE layer in Llama4 Scout, and ~1.25x speedup for MoE layer in DeepSeekV3 671b - with comparable numerics to bfloat16! Check out the docs to try it out.
- [Sept 25] MXFP8 training achieved 1.28x speedup on Crusoe B200 cluster with virtually identical loss curve to bfloat16!
- [Sept 19] TorchAO Quantized Model and Quantization Recipes Now Available on Huggingface Hub!
- [Jun 25] Our TorchAO paper was accepted to CodeML @ ICML 2025!
- [May 25] QAT is now integrated into Axolotl for fine-tuning (docs)!
- [Apr 25] Float8 rowwise training yielded 1.34-1.43x training speedup at 2k H100 GPU scale
- [Apr 25] TorchAO is added as a quantization backend to vLLM (docs)!
- [Mar 25] Our 2:4 Sparsity paper was accepted to SLLM @ ICLR 2025!
- [Jan 25] Our integration with GemLite and SGLang yielded 1.1-2x faster inference with int4 and float8 quantization across different batch sizes and tensor parallel sizes
- [Jan 25] We added 1-8 bit ARM CPU kernels for linear and embedding ops
- [Nov 24] We achieved 1.43-1.51x faster pre-training on Llama-3.1-70B and 405B using float8 training
- [Oct 24] TorchAO is added as a quantization backend to HF Transformers!
- [Sep 24] We officially launched TorchAO. Check out our blog here!
- [Jul 24] QAT recovered up to 96% accuracy degradation from quantization on Llama-3-8B
- [Jun 24] Semi-structured 2:4 sparsity achieved 1.1x inference speedup and 1.3x training speedup on the SAM and ViT models respectively
- [Jun 24] Block sparsity achieved 1.46x training speedup on the ViT model with <2% drop in accuracy
🌅 Overview
TorchAO is an easy-to-use quantization library for native PyTorch. It works out of the box with torch.compile() and FSDP2 across most HuggingFace PyTorch models.
For a detailed overview of stable and prototype workflows for different hardware and dtypes, see the Workflows documentation.
Check out our docs for more details!
🚀 Quick Start
First, install TorchAO. We recommend installing the latest stable version:
pip install torchao
Quantize your model weights to int4!
import torch
from torchao.quantization import Int4WeightOnlyConfig, quantize_
# `model` is a torch.nn.Module that you have already loaded
if torch.cuda.is_available():
    # quantize on CUDA
    quantize_(model, Int4WeightOnlyConfig(group_size=32, int4_packing_format="tile_packed_to_4d", int4_choose_qparams_algorithm="hqq"))
elif torch.xpu.is_available():
    # quantize on XPU
    quantize_(model, Int4WeightOnlyConfig(group_size=32, int4_packing_format="plain_int32"))
See our quick start guide for more details.
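To make the `group_size=32` argument above more concrete, here is a minimal pure-Python sketch of group-wise 4-bit quantization. It is an illustration only, not TorchAO's implementation: it assumes a simple min/max affine scheme, whereas the configuration above selects HQQ for choosing quantization parameters and a hardware-specific packing format. The helper names `quantize_int4_groupwise` and `dequantize_int4_groupwise` are hypothetical.

```python
from typing import List, Tuple


def quantize_int4_groupwise(
    weights: List[float], group_size: int = 32
) -> Tuple[List[int], List[float], List[float]]:
    """Map each group of weights to integers in [0, 15].

    Assumption: plain min/max affine quantization, one (scale, minimum)
    pair per group. This is a teaching sketch, not TorchAO's kernel.
    """
    if len(weights) % group_size != 0:
        raise ValueError("len(weights) must be a multiple of group_size")
    quantized: List[int] = []
    scales: List[float] = []
    minimums: List[float] = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        lo, hi = min(group), max(group)
        # 4 bits give 16 levels (0..15); fall back to 1.0 for a constant group.
        scale = (hi - lo) / 15 or 1.0
        scales.append(scale)
        minimums.append(lo)
        quantized.extend(min(15, max(0, round((w - lo) / scale))) for w in group)
    return quantized, scales, minimums


def dequantize_int4_groupwise(
    quantized: List[int],
    scales: List[float],
    minimums: List[float],
    group_size: int = 32,
) -> List[float]:
    """Reconstruct approximate weights from the 4-bit codes."""
    return [
        q * scales[i // group_size] + minimums[i // group_size]
        for i, q in enumerate(quantized)
    ]
```

In this scheme, a smaller group size means each scale covers fewer weights, which typically lowers reconstruction error at the cost of storing more scales.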
🛠 Installation
To install the latest stable version:
pip install torchao
<details>
<summary>Other installation options</summary>
# Nightly
pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu128
# Different CUDA versions
pip install torchao --index-url https://download.pytorch.org/whl/cu126 # CUDA 12.6
pip install torchao --index-url https://download.pytorch.org/whl/cu129 # CUDA 12.9
pip install torchao --index-url https://download.pytorch.org/whl/xpu # XPU
pip install torchao --index-url https://download.pytorch.org/whl/cpu # CPU only
# For developers
# Note: the `--no-build-isolation` flag is required.
USE_CUDA=1 pip install -e . --no-build-isolation
USE_XPU=1 pip install -e . --no-build-isolation
USE_CPP=0 pip install -e . --no-build-isolation
</details>
Please see the torchao compatibility table for version requirements for dependencies.
🔎 Inference
TorchAO delivers substantial performance gains with minimal code changes:
- Int4 weight-only: 1.73x speedup with 65% less memory for Gemma3-12b-it on H100 with slight impact on accuracy
- Float8 dynamic quantization: 1.5-1.6x speedup on gemma-3-27b-it, and 1.54x and 1.27x speedups on Flux.1-Dev* and CogVideoX-5b, respectively, on H100 with preserved quality
- Int8 activation quantization and int4 weight quantization: quantized Qwen3-4B runs at 14.8 tokens/s using 3379 MB of memory on iPhone 15 Pro through ExecuTorch
- Int4 + 2:4 Sparsity: 2.37x throughput with 67.7% memory reduction on Llama-3-8B
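For readers unfamiliar with the 2:4 pattern mentioned in the last item, the sketch below shows what a 2:4 semi-structured sparse row looks like: in every contiguous group of four values, two are zero. It assumes magnitude-based selection (keep the two largest absolute values), which is a common choice but not necessarily how TorchAO picks which weights to prune. The function name `prune_2_4` is hypothetical.

```python
from typing import List


def prune_2_4(row: List[float]) -> List[float]:
    """Zero the two smallest-magnitude values in every group of four.

    Assumption: magnitude-based pruning, used here only to illustrate
    the 2:4 layout. It is not taken from TorchAO's code.
    """
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    pruned: List[float] = []
    for start in range(0, len(row), 4):
        block = row[start:start + 4]
        # Indices of the two entries with the largest absolute value.
        keep = sorted(range(4), key=lambda i: abs(block[i]), reverse=True)[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(block))
    return pruned
```

Because exactly half of each group is zero, the nonzero values and their positions can be stored compactly, which is what lets sparse kernels reduce memory use and computation.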
The following is our recommended flow for quantization and deployment:
from transformers import TorchAoConfig, AutoModelForCausalLM
from torchao.quantization import Float8DynamicActivationFloat8WeightConfig, PerRow
# Create quantization configuration
quantization_config = TorchAoConfig(quant_type=Float8DynamicActivationFloat8WeightConfig(granularity=PerRow()))
# Load and automatically quantize
quantized_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-32B",
    dtype="auto",
    device_map="auto",
    quantization_config=quantization_config
)
If the flow above does not work for your model, use the quantize_ API described in the quick start guide instead.
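The `PerRow()` granularity in the configuration above means that each row gets its own scale rather than sharing one scale across the whole tensor. The pure-Python sketch below illustrates why that can matter. It assumes the float8 e4m3fn format, whose largest finite value is 448, and it only computes scales; the actual rounding to float8 is omitted. The helper names are hypothetical and this is not TorchAO's implementation.

```python
from typing import List

# Largest finite value representable in float8 e4m3fn (assumed target format).
FLOAT8_E4M3_MAX = 448.0


def per_row_scales(matrix: List[List[float]]) -> List[float]:
    """One scale per row, so each row's largest magnitude maps to the float8 max."""
    return [max(abs(v) for v in row) / FLOAT8_E4M3_MAX or 1.0 for row in matrix]


def per_tensor_scale(matrix: List[List[float]]) -> float:
    """A single scale shared by every element of the tensor."""
    return max(abs(v) for row in matrix for v in row) / FLOAT8_E4M3_MAX or 1.0


def scale_rows(matrix: List[List[float]], scales: List[float]) -> List[List[float]]:
    """Divide each row by its scale; real code would then cast to float8."""
    return [[v / s for v in row] for row, s in zip(matrix, scales)]


if __name__ == "__main__":
    activations = [[0.5, -1.0], [100.0, 200.0]]
    by_row = scale_rows(activations, per_row_scales(activations))
    shared = per_tensor_scale(activations)
    by_tensor = scale_rows(activations, [shared] * len(activations))
    print("per-row:   ", by_row)
    print("per-tensor:", by_tensor)
```

With a shared scale, the small-magnitude first row is squeezed into a narrow slice of the float8 range (its largest value lands near 2.24), while per-row scaling lets every row use the full range up to 448.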
Serving with vLLM on a single H100 machine:
# Server
VLLM_DISABLE_COMPILE_CACHE=1 vllm serve pytorch/Qwen3-32B-FP8 --tokenizer Qwen/Qwen3-32B -O3
# Client
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "pytorch/Qwen3-32B-FP8",
  "messages": [
    {"role": "user", "content": "Give me a short introduction to large language models."}
  ],
  "temperature": 0.6,
  "top_p": 0.95,
  "top_k": 20,
  "max_tokens": 32768
}'
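The same request can be sent from Python. The sketch below uses only the standard library and mirrors the curl payload above. It assumes the server started by the command above is listening on `localhost:8000` and that the response follows the usual chat-completions shape (`choices[0].message.content`). The helper names `build_chat_request` and `chat` are hypothetical.

```python
import json
import urllib.request


def build_chat_request(
    prompt: str,
    base_url: str = "http://localhost:8000",
    model: str = "pytorch/Qwen3-32B-FP8",
) -> urllib.request.Request:
    """Build (but do not send) a POST request matching the curl example."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.6,
        "top_p": 0.95,
        "top_k": 20,
        "max_tokens": 32768,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, **kwargs: str) -> str:
    """Send the request and return the assistant's reply text.

    Assumption: the response body has the chat-completions layout,
    with the reply at choices[0].message.content.
    """
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Calling `chat("Give me a short introduction to large language models.")` only works while the vLLM server is running; `build_chat_request` can be inspected offline.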
For diffusion models, you can quantize using Hugging Face diffusers:
import torch
from diffusers import DiffusionPipeline, PipelineQuantizationConfig, TorchAoConfig
from torchao.quantization import Int8WeightOnlyConfig
pipeline_quant_config = PipelineQuantizationConfig(
    quant_mapping={"transformer": TorchAoConfig(Int8WeightOnlyConfig(group_size=128))}
)
pipeline = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16,
    device_map="cuda"
)
We also support deployment to edge devices through ExecuTorch; for more details, see the quantization and serving guide. Pre-quantized models are available here.
🚅 Training
Quantization-Aware Training
Post-training quantization can result in a fast and compact model, but may also lead to accuracy degradation. We recommend exploring Quantization-Aware Training (QAT) to overcome this limitation, especially for lower bit-width dtypes such as int4. In collaboration with TorchTune, we've developed a QAT recipe that demonstrates significant accuracy improvemen