TorchAO

</div>

PyTorch-Native Training-to-Serving Model Optimization

Pre-train Llama-3.1-70B 1.5x faster with float8 training
Recover 67% of quantized accuracy degradation on Gemma3-4B with QAT
Quantize Llama-3-8B to int4 for 1.89x faster inference with 58% less memory

</div>

📣 Latest News

[Oct 25] QAT is now integrated into Unsloth for both full and LoRA fine-tuning! Try it out using this notebook.
[Oct 25] MXFP8 MoE training prototype achieved ~1.45x speedup for MoE layer in Llama4 Scout, and ~1.25x speedup for MoE layer in DeepSeekV3 671b - with comparable numerics to bfloat16! Check out the docs to try it out.
[Sept 25] MXFP8 training achieved 1.28x speedup on Crusoe B200 cluster with virtually identical loss curve to bfloat16!
[Sept 19] TorchAO Quantized Model and Quantization Recipes Now Available on Huggingface Hub!
[Jun 25] Our TorchAO paper was accepted to CodeML @ ICML 2025!

<details> <summary>Older news</summary>

[May 25] QAT is now integrated into Axolotl for fine-tuning (docs)!
[Apr 25] Float8 rowwise training yielded 1.34-1.43x training speedup at 2k H100 GPU scale
[Apr 25] TorchAO is added as a quantization backend to vLLM (docs)!
[Mar 25] Our 2:4 Sparsity paper was accepted to SLLM @ ICLR 2025!
[Jan 25] Our integration with GemLite and SGLang yielded 1.1-2x faster inference with int4 and float8 quantization across different batch sizes and tensor parallel sizes
[Jan 25] We added 1-8 bit ARM CPU kernels for linear and embedding ops
[Nov 24] We achieved 1.43-1.51x faster pre-training on Llama-3.1-70B and 405B using float8 training
[Oct 24] TorchAO is added as a quantization backend to HF Transformers!
[Sep 24] We officially launched TorchAO. Check out our blog here!
[Jul 24] QAT recovered up to 96% accuracy degradation from quantization on Llama-3-8B
[Jun 24] Semi-structured 2:4 sparsity achieved 1.1x inference speedup and 1.3x training speedup on the SAM and ViT models respectively
[Jun 24] Block sparsity achieved 1.46x training speedup on the ViT model with <2% drop in accuracy

</details>

🌅 Overview

TorchAO is an easy-to-use quantization library for native PyTorch. TorchAO works out of the box with torch.compile() and FSDP2 across most Hugging Face PyTorch models.

For a detailed overview of stable and prototype workflows for different hardware and dtypes, see the Workflows documentation.

Check out our docs for more details!

🚀 Quick Start

First, install TorchAO. We recommend installing the latest stable version:

pip install torchao

Quantize your model weights to int4!

import torch
from torchao.quantization import Int4WeightOnlyConfig, quantize_

# `model` is assumed to be an existing torch.nn.Module on the target device.
if torch.cuda.is_available():
    # Quantize on CUDA
    quantize_(model, Int4WeightOnlyConfig(group_size=32, int4_packing_format="tile_packed_to_4d", int4_choose_qparams_algorithm="hqq"))
elif torch.xpu.is_available():
    # Quantize on XPU
    quantize_(model, Int4WeightOnlyConfig(group_size=32, int4_packing_format="plain_int32"))

See our quick start guide for more details.
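To make `group_size=32` concrete, here is a minimal pure-Python sketch of group-wise asymmetric int4 quantization. It is an illustration only, not TorchAO's kernels or packing formats: the function names `quantize_int4_groups` and `dequantize_int4_groups` are hypothetical, and the min/max rounding scheme is a simplifying assumption.

```python
def quantize_int4_groups(weights, group_size=32):
    """Quantize a flat list of floats to 4-bit codes, one (scale, zero) pair per group."""
    if len(weights) % group_size != 0:
        raise ValueError("len(weights) must be a multiple of group_size")
    codes, params = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        lo, hi = min(group), max(group)
        # Map [lo, hi] onto the 16 levels 0..15; guard against a constant group.
        scale = (hi - lo) / 15 if hi > lo else 1.0
        codes.extend(min(15, max(0, round((w - lo) / scale))) for w in group)
        params.append((scale, lo))
    return codes, params


def dequantize_int4_groups(codes, params, group_size=32):
    """Reconstruct approximate floats from 4-bit codes and per-group (scale, zero)."""
    return [params[i // group_size][1] + params[i // group_size][0] * c
            for i, c in enumerate(codes)]


# Toy data: 64 pseudo-random weights in [-1, 1], i.e. two groups of 32.
weights = [((i * 37) % 101) / 50.0 - 1.0 for i in range(64)]
codes, params = quantize_int4_groups(weights)
restored = dequantize_int4_groups(codes, params)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

In this sketch the reconstruction error of each weight is bounded by half of its group's scale, so smaller groups track the local value range more closely at the cost of storing more scale and zero-point metadata.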

🛠 Installation

To install the latest stable version:

pip install torchao

<details> <summary>Other installation options</summary>

# Nightly
pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu128

# Different CUDA versions
pip install torchao --index-url https://download.pytorch.org/whl/cu126  # CUDA 12.6
pip install torchao --index-url https://download.pytorch.org/whl/cu129  # CUDA 12.9
pip install torchao --index-url https://download.pytorch.org/whl/xpu    # XPU
pip install torchao --index-url https://download.pytorch.org/whl/cpu    # CPU only

# For developers
# Note: the `--no-build-isolation` flag is required.
USE_CUDA=1 pip install -e . --no-build-isolation
USE_XPU=1 pip install -e . --no-build-isolation
USE_CPP=0 pip install -e . --no-build-isolation

</details>

Please see the torchao compatibility table for version requirements for dependencies.
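Before consulting the table, it can help to know which versions are actually installed. A minimal sketch using only the standard library follows; the helper name `installed_version` is hypothetical and not part of TorchAO.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Print the versions to compare against the compatibility table.
for name in ("torch", "torchao"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```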

🔎 Inference

TorchAO delivers substantial performance gains with minimal code changes:

Int4 weight-only: 1.73x speedup with 65% less memory for Gemma3-12b-it on H100 with slight impact on accuracy
Float8 dynamic quantization: 1.5-1.6x speedup on gemma-3-27b-it and 1.54x and 1.27x speedup on Flux.1-Dev* and CogVideoX-5b respectively on H100 with preserved quality
Int8 activation quantization and int4 weight quantization: Quantized Qwen3-4B running at 14.8 tokens/s with 3379 MB memory usage on iPhone 15 Pro through ExecuTorch
Int4 + 2:4 Sparsity: 2.37x throughput with 67.7% memory reduction on Llama-3-8B
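
For readers unfamiliar with the 2:4 pattern mentioned in the last item, here is a minimal pure-Python sketch: in every contiguous group of four values, at most two are nonzero. Choosing which values to keep by magnitude is an assumption made for illustration, and `prune_2_of_4` is a hypothetical helper, not a TorchAO API.

```python
def prune_2_of_4(row):
    """Zero the two smallest-magnitude values in every contiguous group of four."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    pruned = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]
sparse_row = prune_2_of_4(row)
# sparse_row == [0.9, 0.0, 0.0, -0.7, 0.0, 0.3, -0.4, 0.0]
```

Because exactly half of each group is zero, only the kept values and their positions need to be stored, which is what makes the pattern amenable to hardware acceleration.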

The following is our recommended flow for quantization and deployment:

from transformers import TorchAoConfig, AutoModelForCausalLM
from torchao.quantization import Float8DynamicActivationFloat8WeightConfig, PerRow

# Create quantization configuration
quantization_config = TorchAoConfig(quant_type=Float8DynamicActivationFloat8WeightConfig(granularity=PerRow()))

# Load and automatically quantize
quantized_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-32B",
    dtype="auto",
    device_map="auto",
    quantization_config=quantization_config
)

If the above doesn't work, use the quantize_ API described in the quick start guide instead.
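
As background on the `PerRow()` granularity used in the configuration above, the sketch below shows the general idea of per-row scaling for float8: each row gets its own scale so that its largest magnitude maps to the top of the float8 range. This is a simplified pure-Python illustration under stated assumptions, not TorchAO's implementation; the helper names are hypothetical, and the final cast to float8 is omitted.

```python
FLOAT8_E4M3_MAX = 448.0  # largest finite value in the float8 e4m3fn format


def per_row_scales(matrix, eps=1e-12):
    """One scale per row, chosen so the row's largest magnitude maps to FLOAT8_E4M3_MAX."""
    return [max(max(abs(v) for v in row), eps) / FLOAT8_E4M3_MAX for row in matrix]


def scale_rows(matrix, scales):
    """Divide each row by its scale; the result fits the float8 range before casting."""
    return [[v / s for v in row] for row, s in zip(matrix, scales)]


# Two rows with very different dynamic ranges.
activations = [[0.5, -2.0, 1.0], [100.0, 25.0, -50.0]]
scales = per_row_scales(activations)
scaled = scale_rows(activations, scales)
```

With a single scale for the whole tensor, the first row would be squeezed into a small fraction of the representable range; giving each row its own scale avoids that.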

Serving with vLLM on a 1xH100 machine:

# Server
VLLM_DISABLE_COMPILE_CACHE=1 vllm serve pytorch/Qwen3-32B-FP8 --tokenizer Qwen/Qwen3-32B -O3

# Client
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "pytorch/Qwen3-32B-FP8",
  "messages": [
    {"role": "user", "content": "Give me a short introduction to large language models."}
  ],
  "temperature": 0.6,
  "top_p": 0.95,
  "top_k": 20,
  "max_tokens": 32768
}'
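
The same client request can be built from Python with only the standard library. This is a sketch that mirrors the curl command above; the helper name `build_chat_request` is hypothetical, and the request is constructed but not sent, so it runs without a server.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8000"):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "model": "pytorch/Qwen3-32B-FP8",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.6,
        "top_p": 0.95,
        "top_k": 20,
        "max_tokens": 32768,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request("Give me a short introduction to large language models.")
# With the server running, send it with:
#   with urllib.request.urlopen(request) as response:
#       reply = json.loads(response.read())
```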

For diffusion models, you can quantize using Hugging Face diffusers:

import torch
from diffusers import DiffusionPipeline, PipelineQuantizationConfig, TorchAoConfig
from torchao.quantization import Int8WeightOnlyConfig

pipeline_quant_config = PipelineQuantizationConfig(
    quant_mapping={"transformer": TorchAoConfig(Int8WeightOnlyConfig(group_size=128))}
)
pipeline = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16,
    device_map="cuda"
)

We also support deployment to edge devices through ExecuTorch; for more details, see the quantization and serving guide. We also release pre-quantized models here.

🚅 Training

Quantization-Aware Training

Post-training quantization can result in a fast and compact model, but may also lead to accuracy degradation. We recommend exploring Quantization-Aware Training (QAT) to overcome this limitation, especially for lower bit-width dtypes such as int4. In collaboration with TorchTune, we've developed a QAT recipe that demonstrates significant accuracy improvemen
