nvidia-cusparselt-cu12
NVIDIA cuSPARSELt
###################################################################################
cuSPARSELt: A High-Performance CUDA Library for Sparse Matrix-Matrix Multiplication
###################################################################################

**NVIDIA cuSPARSELt** is a high-performance CUDA library dedicated to general matrix-matrix operations in which at least one operand is a structured sparse matrix with a 50% sparsity ratio:

.. math::

   D = \text{Activation}(\alpha \, op(A) \cdot op(B) + \beta \, op(C) + \text{bias})

where :math:`op(A)/op(B)` refers to in-place operations such as transpose/non-transpose, and :math:`\alpha, \beta` are scalars or vectors.
The *cuSPARSELt APIs* allow flexibility in the algorithm/operation selection, epilogue, and matrix characteristics, including memory layout, alignment, and data types.
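As a concrete illustration of the operation above, the following pure-Python sketch (not the cuSPARSELt API; all names and the choice of ReLU as the activation are illustrative) prunes `A` to the 2:4 structured-sparsity pattern behind the 50% sparsity ratio (at most two non-zeros in every group of four contiguous values along a row) and then evaluates `D = Activation(alpha * A @ B + beta * C + bias)`:

```python
def prune_2_4(row):
    """Keep the two largest-magnitude values in each group of four (2:4 sparsity)."""
    out = []
    for i in range(0, len(row), 4):
        group = row[i:i + 4]
        keep = sorted(range(len(group)), key=lambda j: abs(group[j]))[-2:]
        out.extend(v if j in keep else 0.0 for j, v in enumerate(group))
    return out

def matmul(A, B):
    """Plain dense matrix product on nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def spmma(A, B, C, bias, alpha=1.0, beta=1.0):
    """D = ReLU(alpha * A @ B + beta * C + bias), with A pruned to 2:4 sparsity."""
    A = [prune_2_4(row) for row in A]   # structured sparsity on the A operand
    AB = matmul(A, B)
    return [[max(0.0, alpha * ab + beta * c + bias[i])   # ReLU epilogue
             for ab, c in zip(AB[i], C[i])]
            for i in range(len(AB))]
```

In the real library the pruning and compression steps are separate API calls and the activation/bias belong to the configurable epilogue; this sketch only mirrors the arithmetic of the formula.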
**Download:** `developer.nvidia.com/cusparselt/downloads <https://developer.nvidia.com/cusparselt/downloads>`_

**Provide Feedback:** `Math-Libs-Feedback@nvidia.com <mailto:Math-Libs-Feedback@nvidia.com?subject=cuSPARSELt-Feedback>`_

**Examples:**
`cuSPARSELt Example 1 <https://github.com/NVIDIA/CUDALibrarySamples/tree/master/cuSPARSELt/matmul>`_,
`cuSPARSELt Example 2 <https://github.com/NVIDIA/CUDALibrarySamples/tree/master/cuSPARSELt/matmul_advanced>`_

**Blog posts:**

- `Exploiting NVIDIA Ampere Structured Sparsity with cuSPARSELt <https://developer.nvidia.com/blog/exploiting-ampere-structured-sparsity-with-cusparselt/>`_
- `Structured Sparsity in the NVIDIA Ampere Architecture and Applications in Search Engines <https://developer.nvidia.com/blog/structured-sparsity-in-the-nvidia-ampere-architecture-and-applications-in-search-engines/>`__
- `Making the Most of Structured Sparsity in the NVIDIA Ampere Architecture <https://www.nvidia.com/en-us/on-demand/session/gtcspring21-s31552/>`__

================================================================================
Key Features
================================================================================
* *NVIDIA Sparse MMA tensor core* support
* Mixed-precision computation support:

+-------------+---------+----------+---------+--------------------------------------------------+----------------------------------------------------+
| Input A/B   | Input C | Output D | Compute | Block scaled                                     | Supported SM arch                                  |
+=============+=========+==========+=========+==================================================+====================================================+
| `FP32`      | `FP32`  | `FP32`   | `FP32`  | No                                               | `8.0, 8.6, 8.7, 9.0, 10.0, 10.1, 11.0, 12.0, 12.1` |
+-------------+---------+----------+---------+--------------------------------------------------+----------------------------------------------------+
| `BF16`      | `BF16`  | `BF16`   | `FP32`  | No                                               | `8.0, 8.6, 8.7, 9.0, 10.0, 10.1, 11.0, 12.0, 12.1` |
+-------------+---------+----------+---------+--------------------------------------------------+----------------------------------------------------+
| `FP16`      | `FP16`  | `FP16`   | `FP32`  | No                                               | `8.0, 8.6, 8.7, 9.0, 10.0, 10.1, 11.0, 12.0, 12.1` |
+-------------+---------+----------+---------+--------------------------------------------------+----------------------------------------------------+
| `FP16`      | `FP16`  | `FP16`   | `FP16`  | No                                               | `9.0`                                              |
+-------------+---------+----------+---------+--------------------------------------------------+----------------------------------------------------+
| `INT8`      | `INT8`  | `INT8`   | `INT32` | No                                               | `8.0, 8.6, 8.7, 9.0, 10.0, 10.1, 11.0, 12.0, 12.1` |
+-------------+---------+----------+---------+--------------------------------------------------+----------------------------------------------------+
| `INT8`      | `INT32` | `INT32`  | `INT32` | No                                               | `8.0, 8.6, 8.7, 9.0, 10.0, 10.1, 11.0, 12.0, 12.1` |
+-------------+---------+----------+---------+--------------------------------------------------+----------------------------------------------------+
| `INT8`      | `FP16`  | `FP16`   | `INT32` | No                                               | `8.0, 8.6, 8.7, 9.0, 10.0, 10.1, 11.0, 12.0, 12.1` |
+-------------+---------+----------+---------+--------------------------------------------------+----------------------------------------------------+
| `INT8`      | `BF16`  | `BF16`   | `INT32` | No                                               | `8.0, 8.6, 8.7, 9.0, 10.0, 10.1, 11.0, 12.0, 12.1` |
+-------------+---------+----------+---------+--------------------------------------------------+----------------------------------------------------+
| `E4M3`      | `FP16`  | `E4M3`   | `FP32`  | No                                               | `9.0, 10.0, 10.1, 11.0, 12.0, 12.1`                |
+-------------+---------+----------+---------+--------------------------------------------------+----------------------------------------------------+
| `E4M3`      | `BF16`  | `E4M3`   | `FP32`  | No                                               | `9.0, 10.0, 10.1, 11.0, 12.0, 12.1`                |
+-------------+---------+----------+---------+--------------------------------------------------+----------------------------------------------------+
| `E4M3`      | `FP16`  | `FP16`   | `FP32`  | No                                               | `9.0, 10.0, 10.1, 11.0, 12.0, 12.1`                |
+-------------+---------+----------+---------+--------------------------------------------------+----------------------------------------------------+
| `E4M3`      | `BF16`  | `BF16`   | `FP32`  | No                                               | `9.0, 10.0, 10.1, 11.0, 12.0, 12.1`                |
+-------------+---------+----------+---------+--------------------------------------------------+----------------------------------------------------+
| `E4M3`      | `FP32`  | `FP32`   | `FP32`  | No                                               | `9.0, 10.0, 10.1, 11.0, 12.0, 12.1`                |
+-------------+---------+----------+---------+--------------------------------------------------+----------------------------------------------------+
| `E5M2`      | `FP16`  | `E5M2`   | `FP32`  | No                                               | `9.0, 10.0, 10.1, 11.0, 12.0, 12.1`                |
+-------------+---------+----------+---------+--------------------------------------------------+----------------------------------------------------+
| `E5M2`      | `BF16`  | `E5M2`   | `FP32`  | No                                               | `9.0, 10.0, 10.1, 11.0, 12.0, 12.1`                |
+-------------+---------+----------+---------+--------------------------------------------------+----------------------------------------------------+
| `E5M2`      | `FP16`  | `FP16`   | `FP32`  | No                                               | `9.0, 10.0, 10.1, 11.0, 12.0, 12.1`                |
+-------------+---------+----------+---------+--------------------------------------------------+----------------------------------------------------+
| `E5M2`      | `BF16`  | `BF16`   | `FP32`  | No                                               | `9.0, 10.0, 10.1, 11.0, 12.0, 12.1`                |
+-------------+---------+----------+---------+--------------------------------------------------+----------------------------------------------------+
| `E5M2`      | `FP32`  | `FP32`   | `FP32`  | No                                               | `9.0, 10.0, 10.1, 11.0, 12.0, 12.1`                |
+-------------+---------+----------+---------+--------------------------------------------------+----------------------------------------------------+
| `E4M3`      | `FP16`  | `E4M3`   | `FP32`  | A/B/D_OUT_SCALE = `VEC64_UE8M0`, D_SCALE = `32F` | `10.0, 10.1, 11.0, 12.0, 12.1`                     |
+-------------+---------+----------+---------+--------------------------------------------------+----------------------------------------------------+
| `E4M3`      | `BF16`  | `E4M3`   | `FP32`  | A/B/D_OUT_SCALE = `VEC64_UE8M0`, D_SCALE = `32F` | `10.0, 10.1, 11.0, 12.0, 12.1`                     |
+-------------+---------+----------+---------+--------------------------------------------------+----------------------------------------------------+
| `E4M3`      | `FP16`  | `FP16`   | `FP32`  | A/B_SCALE = `VEC64_UE8M0`                        | `10.0, 10.1, 11.0, 12.0, 12.1`                     |
+-------------+---------+----------+---------+--------------------------------------------------+----------------------------------------------------+
| `E4M3`      | `BF16`  | `BF16`   | `FP32`  | A/B_SCALE = `VEC64_UE8M0`                        | `10.0, 10.1, 11.0, 12.0, 12.1`                     |
+-------------+---------+----------+---------+--------------------------------------------------+----------------------------------------------------+
| `E4M3`      | `FP32`  | `FP32`   | `FP32`  | A/B_SCALE = `VEC64_UE8M0`                        | `10.0, 10.1, 11.0, 12.0, 12.1`                     |
+-------------+---------+----------+---------+--------------------------------------------------+----------------------------------------------------+
| `E2M1`      | `FP16`  | `E2M1`   | `FP32`  | A/B/D_SCALE = `VEC32_UE4M3`, D_SCALE = `32F`     | `10.0, 10.1, 11.0, 12.0, 12.1`                     |
+-------------+---------+----------+---------+--------------------------------------------------+----------------------------------------------------+
| `E2M1`      | `BF16`  | `E2M1`   | `FP32`  | A/B/D_SCALE = `VEC32_UE4M3`, D_SCALE = `32F`     | `10.0, 10.1, 11.0, 12.0, 12.1`                     |
+-------------+---------+----------+---------+--------------------------------------------------+----------------------------------------------------+
| `E2M1`      | `FP16`  | `FP16`   | `FP32`  | A/B_SCALE = `VEC32_UE4M3`                        | `10.0, 10.1, 11.0, 12.0, 12.1`                     |
+-------------+---------+----------+---------+--------------------------------------------------+----------------------------------------------------+
| `E2M1`      | `BF16`  | `BF16`   | `FP32`  | A/B_SCALE = `VEC32_UE4M3`                        | `10.0, 10.1, 11.0, 12.0, 12.1`                     |
+-------------+---------+----------+---------+--------------------------------------------------+----------------------------------------------------+
| `E2M1`      | `FP32`  | `FP32`   | `FP32`  | A/B_SCALE = `VEC32_UE4M3`                        | `10.0, 10.1, 11.0, 12.0, 12.1`                     |
+-------------+---------+----------+---------+--------------------------------------------------+----------------------------------------------------+
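
The `VEC64_UE8M0` and `VEC32_UE4M3` entries in the block-scaled rows denote one shared scale factor per 64- or 32-element vector of the operand. The sketch below illustrates the UE8M0 case under the assumption (from the common microscaling convention, not from this table) that UE8M0 is an 8-bit exponent-only format, i.e. the scale is a power of two chosen so the scaled block fits the `E4M3` range (largest finite value 448). The helper names are hypothetical and are not part of the cuSPARSELt API:

```python
import math

E4M3_MAX = 448.0   # largest finite FP8 E4M3 value
BLOCK = 64         # VEC64: one scale per 64 contiguous elements

def ue8m0_scale(block):
    """Smallest power-of-two s such that max|x| / s <= E4M3_MAX (UE8M0-style)."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0
    return 2.0 ** math.ceil(math.log2(amax / E4M3_MAX))

def block_scales(values):
    """One UE8M0-style power-of-two scale per 64-element block of `values`."""
    return [ue8m0_scale(values[i:i + BLOCK])
            for i in range(0, len(values), BLOCK)]
```

In the library, these scale vectors are supplied as separate device buffers alongside the FP8/FP4 operands; here they are only computed to show what "one scale per block" means.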