torchao

Package for applying ao techniques to GPU models

Rank: #2847 | Downloads: 1,891,661 (30 days) | Stars: 2,709 | Forks: 440

Description

<div align="center">

TorchAO

</div>

PyTorch-Native Training-to-Serving Model Optimization

  • Pre-train Llama-3.1-70B 1.5x faster with float8 training
  • Recover 67% of quantized accuracy degradation on Gemma3-4B with QAT
  • Quantize Llama-3-8B to int4 for 1.89x faster inference with 58% less memory
<div align="center">

Latest News | Overview | Quick Start | Installation | Integrations | Inference | Training | Videos | Citation

</div>

📣 Latest News

<details> <summary>Older news</summary> </details>

🌅 Overview

TorchAO is an easy-to-use quantization library for native PyTorch. TorchAO works out of the box with torch.compile() and FSDP2 across most HuggingFace PyTorch models.

For a detailed overview of stable and prototype workflows for different hardware and dtypes, see the Workflows documentation.

Check out our docs for more details!

🚀 Quick Start

First, install TorchAO. We recommend installing the latest stable version:

pip install torchao

Quantize your model weights to int4!

import torch
from torchao.quantization import Int4WeightOnlyConfig, quantize_

# `model` is assumed to be an existing torch.nn.Module (not defined in this snippet)
if torch.cuda.is_available():
    # quantize on CUDA
    quantize_(model, Int4WeightOnlyConfig(group_size=32, int4_packing_format="tile_packed_to_4d", int4_choose_qparams_algorithm="hqq"))
elif torch.xpu.is_available():
    # quantize on XPU
    quantize_(model, Int4WeightOnlyConfig(group_size=32, int4_packing_format="plain_int32"))

See our quick start guide for more details.
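
To make the `group_size` parameter concrete, here is a minimal pure-Python sketch of group-wise 4-bit quantization. It is a conceptual illustration only: the helper names are hypothetical, the asymmetric min/max scheme is an assumption, and it does not reproduce TorchAO's packing formats or the HQQ parameter selection.

```python
def quantize_group_int4(values):
    """Asymmetric 4-bit quantization of one group: map [min, max] onto codes 0..15."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15 or 1.0  # fall back to 1.0 when the group is constant
    codes = [min(15, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantize_group_int4(codes, scale, lo):
    """Reconstruct approximate floats from 4-bit codes."""
    return [lo + c * scale for c in codes]


def quantize_weights(row, group_size=32):
    """Quantize a flat list of weights one group at a time.

    Each group stores its own (scale, offset), so a smaller group_size
    tracks the local value range more closely at the cost of more metadata.
    """
    if len(row) % group_size != 0:
        raise ValueError("row length must be a multiple of group_size")
    return [
        quantize_group_int4(row[i:i + group_size])
        for i in range(0, len(row), group_size)
    ]
```

Within each group, the reconstruction error of any weight is bounded by half of that group's scale, which is why narrower groups generally preserve accuracy better.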

🛠 Installation

To install the latest stable version:

pip install torchao
<details> <summary>Other installation options</summary>
# Nightly
pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu128

# Different CUDA versions
pip install torchao --index-url https://download.pytorch.org/whl/cu126  # CUDA 12.6
pip install torchao --index-url https://download.pytorch.org/whl/cu129  # CUDA 12.9
pip install torchao --index-url https://download.pytorch.org/whl/xpu    # XPU
pip install torchao --index-url https://download.pytorch.org/whl/cpu    # CPU only

# For developers
# Note: the `--no-build-isolation` flag is required.
USE_CUDA=1 pip install -e . --no-build-isolation
USE_XPU=1 pip install -e . --no-build-isolation
USE_CPP=0 pip install -e . --no-build-isolation
</details>

Please see the torchao compatibility table for version requirements for dependencies.
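
Before consulting the table, it helps to know which versions are actually installed. The following is a small sketch using only the Python standard library; `installed_version` is a hypothetical helper name, not part of TorchAO.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report the versions to compare against the compatibility table.
for name in ("torch", "torchao"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```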

🔎 Inference

TorchAO delivers substantial performance gains with minimal code changes.

The following is our recommended flow for quantization and deployment:

from transformers import TorchAoConfig, AutoModelForCausalLM
from torchao.quantization import Float8DynamicActivationFloat8WeightConfig, PerRow

# Create quantization configuration
quantization_config = TorchAoConfig(quant_type=Float8DynamicActivationFloat8WeightConfig(granularity=PerRow()))

# Load and automatically quantize
quantized_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-32B",
    dtype="auto",
    device_map="auto",
    quantization_config=quantization_config
)

If the flow above doesn't work for your model, use the quantize_ API described in the quick start guide instead.
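
The `PerRow()` granularity used in the configuration above means that each row gets its own scale instead of one scale being shared by the whole tensor. The pure-Python sketch below only illustrates how such scales could be derived. It assumes the float8 e4m3 format, whose largest finite value is 448, and the helper names are hypothetical. It is not TorchAO's implementation and performs no actual float8 rounding.

```python
E4M3_MAX = 448.0  # largest finite value representable in float8 e4m3


def per_tensor_scale(matrix, eps=1e-12):
    """One scale for the whole tensor, derived from its largest magnitude."""
    amax = max(abs(v) for row in matrix for v in row)
    return max(amax, eps) / E4M3_MAX


def per_row_scales(matrix, eps=1e-12):
    """One scale per row, so a single large row does not dominate the others."""
    return [max(max(abs(v) for v in row), eps) / E4M3_MAX for row in matrix]


weights = [[1.0, -448.0], [0.5, 0.25]]
print(per_tensor_scale(weights))
print(per_row_scales(weights))
```

In this toy matrix the second row gets a much smaller scale than the per-tensor one, so its small values would use more of the float8 range after division by the scale.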

Serving with vLLM on a 1xH100 machine:

# Server
VLLM_DISABLE_COMPILE_CACHE=1 vllm serve pytorch/Qwen3-32B-FP8 --tokenizer Qwen/Qwen3-32B -O3
# Client
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "pytorch/Qwen3-32B-FP8",
  "messages": [
    {"role": "user", "content": "Give me a short introduction to large language models."}
  ],
  "temperature": 0.6,
  "top_p": 0.95,
  "top_k": 20,
  "max_tokens": 32768
}'
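
The same request can be sent from Python. The sketch below uses only the standard library and assumes that the server started above is listening on localhost:8000 and returns an OpenAI-style response body. `build_chat_request` and `send_chat_request` are hypothetical helper names.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8000",
                       model="pytorch/Qwen3-32B-FP8"):
    """Build a POST request mirroring the curl command above."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.6,
        "top_p": 0.95,
        "top_k": 20,
        "max_tokens": 32768,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=600):
    """Send the request and return the text of the first choice.

    Assumes an OpenAI-style response: choices[0].message.content.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(send_chat_request(build_chat_request("Give me a short introduction to large language models.")))` prints the model's reply.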

For diffusion models, you can quantize using Hugging Face diffusers:

import torch
from diffusers import DiffusionPipeline, PipelineQuantizationConfig, TorchAoConfig
from torchao.quantization import Int8WeightOnlyConfig

pipeline_quant_config = PipelineQuantizationConfig(
    quant_mapping={"transformer": TorchAoConfig(Int8WeightOnlyConfig(group_size=128))}
)
pipeline = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16,
    device_map="cuda"
)

We also support deployment to edge devices through ExecuTorch. For more detail, see the quantization and serving guide. We also release pre-quantized models here.

🚅 Training

Quantization-Aware Training

Post-training quantization can result in a fast and compact model, but may also lead to accuracy degradation. We recommend exploring Quantization-Aware Training (QAT) to overcome this limitation, especially for lower bit-width dtypes such as int4. In collaboration with TorchTune, we've developed a QAT recipe that demonstrates significant accuracy improvemen