[FEAT] [ROCm] Add AITER int8 scaled gemm kernel #15433

Merged
11 commits merged into vllm-project:main on Mar 29, 2025

Conversation

@tjtanaa (Contributor) commented Mar 25, 2025

Description

This PR integrates the AITER int8 scaled GEMM kernel, focusing on models generated in the compressed-tensors format with llm-compressor.
To use this feature, set VLLM_ROCM_USE_AITER=1. (VLLM_ROCM_USE_AITER_LINEAR defaults to 1, so the linear path is enabled by default whenever AITER is in use.)
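A minimal usage sketch, assuming the standard `vllm.LLM` entry point and the 4×MI300X setup used below; the environment variables must be set before vLLM is imported:

```python
import os

# Master switch for AITER kernels on ROCm.
os.environ["VLLM_ROCM_USE_AITER"] = "1"
# VLLM_ROCM_USE_AITER_LINEAR already defaults to "1", so the scaled-GEMM
# linear path is picked up automatically once AITER is enabled.

from vllm import LLM

llm = LLM(
    model="neuralmagic/Meta-Llama-3.1-405B-Instruct-quantized.w8a8",
    tensor_parallel_size=4,
)
print(llm.generate("Hello, my name is")[0].outputs[0].text)
```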

Performance Gain

Experiment setup:
GPU: 4 * MI300X
Model: neuralmagic/Meta-Llama-3.1-405B-Instruct-quantized.w8a8

| Input Token Length | Output Token Length | Kernel Used | Throughput Gain over Triton |
|---|---|---|---|
| 128 | 128 | AITER | 20.2% |
| 128 | 2048 | AITER | 22.3% |
| 2048 | 128 | AITER | 13.6% |
| 2048 | 2048 | AITER | 10.9% |

Accuracy Comparison: Triton Scaled MM vs AITER Scaled GEMM Kernel

| Kernel | Filter | Exact Match | Stderr |
|---|---|---|---|
| Triton | flexible-extract | 0.9477 | 0.0061 |
| Triton | strict-match | 0.9439 | 0.0063 |
| AITER | flexible-extract | 0.9477 | 0.0061 |
| AITER | strict-match | 0.9431 | 0.0064 |

Model: neuralmagic/Meta-Llama-3.1-405B-Instruct-quantized.w8a8 (5-shot, TP=4)

Unit tests

  • tests/quantization/test_compressed_tensors.py


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to catch errors quickly. You can run other CI tests on top of it by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
@tjtanaa tjtanaa marked this pull request as ready for review March 25, 2025 13:27
@ProExpertProg (Contributor)

I will do a deeper review later but could you please use the ScaledMMKernel abstraction for this?

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
@tjtanaa tjtanaa requested a review from tlrmchlsmth as a code owner March 26, 2025 06:16
@tjtanaa (Contributor Author) commented Mar 26, 2025

@ProExpertProg
I have implemented an AiterScaledMMLinearKernel subclass.
It is now ready for review.
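For readers unfamiliar with the abstraction, a rough skeleton of what such a subclass looks like; module paths, the base-class name, and the capability value are assumptions, and only `can_implement`'s signature is taken from the diff quoted later in this conversation:

```python
from typing import Optional, Tuple

from vllm.platforms import current_platform

# Base class and config assumed to live next to the other ScaledMM kernels.
from .ScaledMMLinearKernel import (ScaledMMLinearKernel,
                                   ScaledMMLinearLayerConfig)


class AiterScaledMMLinearKernel(ScaledMMLinearKernel):

    @classmethod
    def get_min_capability(cls) -> int:
        # Illustrative value; the real kernel targets ROCm (MI300-class) GPUs.
        return 90

    @classmethod
    def can_implement(
            cls, c: ScaledMMLinearLayerConfig) -> Tuple[bool, Optional[str]]:
        # Report (supported?, reason-if-not); the AITER path needs ROCm and
        # the `aiter` package (availability check discussed further down).
        if not current_platform.is_rocm():
            return (False,
                    "AiterScaledMMLinearKernel is only supported on ROCm.")
        return (True, None)
```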


mergify bot commented Mar 26, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @tjtanaa.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 26, 2025
@ProExpertProg (Contributor)

Please resolve the merge conflicts, thanks!

Comment on lines 33 to 40
        # try import aiter
        try:
            pass
        except Exception:
            return (
                False,
                "AiterScaledMMLinearKernel requires `aiter` which is not " +
                "installed supported on ROCm.")
Member

It seems you forgot to import aiter here

Contributor

Yeah agreed this is missing the import

Contributor Author

Thank you. It seems ruff removed the `import aiter`. I have annotated this line so ruff will not change it into `pass`.
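For context, one conventional way to keep ruff from rewriting an availability probe into `pass` is a `# noqa: F401` marker; a sketch with a hypothetical helper name:

```python
from typing import Optional, Tuple


def _aiter_importable() -> Tuple[bool, Optional[str]]:
    try:
        # The unused import is intentional: it only probes that the AITER
        # wheel is installed. `# noqa: F401` keeps ruff from removing it.
        import aiter  # noqa: F401
    except Exception:
        return (False,
                "AiterScaledMMLinearKernel requires `aiter`, which is not "
                "installed (only supported on ROCm).")
    return (True, None)
```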

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
@mergify mergify bot removed the needs-rebase label Mar 26, 2025
@ProExpertProg (Contributor) left a comment

A few initial comments!

Comment on lines 310 to 312
assert qkv_proj.weight.dtype is (torch.float8_e4m3fnuz
                                 if current_platform.is_rocm()
                                 else torch.float8_e4m3fn)
Contributor

Use current_platform.fp8_dtype()

Contributor Author

Done
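The resolved assertion presumably collapses to a single line using the platform helper; a small runnable sketch (`qkv_proj` belongs to the quoted test and is not redefined here):

```python
import torch
from vllm.platforms import current_platform

# fp8_dtype() resolves to float8_e4m3fnuz on ROCm and float8_e4m3fn elsewhere,
# so the test assertion no longer needs an explicit is_rocm() branch, e.g.:
#     assert qkv_proj.weight.dtype is current_platform.fp8_dtype()
assert current_platform.fp8_dtype() in (torch.float8_e4m3fnuz,
                                        torch.float8_e4m3fn)
```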

Comment on lines 148 to 151
# assert scale_a.shape == torch.Size([1, 1]) or scale_a.shape == torch.Size(
# [M, 1])
# assert scale_b.shape == torch.Size([1, 1]) or scale_b.shape == torch.Size(
# [N, 1])
Contributor

Please clean up comments

"triton_scaled_mm")
triton_scaled_mm = triton_scaled_mm_module.triton_scaled_mm
return triton_scaled_mm(a, b, scale_a, scale_b, out_dtype, bias)
if is_rocm_aiter_gemm_w8a8_scaled_mm_enabled():
Contributor

This shouldn't live inside cutlass_scaled_mm; it has nothing to do with cutlass. This code should just live inside AiterScaledMMLinearKernel.apply.

I know the Triton kernel is here but it shouldn't be either, I'm currently refactoring that.
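A sketch of where the reviewer wants the call to end up: a kernel-owned helper (name and signature are hypothetical) rather than a branch inside `cutlass_scaled_mm`. The `gemm_a8w8_CK` call itself matches the one quoted later in this conversation.

```python
from typing import Optional

import torch


def _aiter_w8a8_scaled_mm(x_q: torch.Tensor, w_q: torch.Tensor,
                          x_s: torch.Tensor, w_s: torch.Tensor,
                          bias: Optional[torch.Tensor],
                          out_dtype: torch.dtype) -> torch.Tensor:
    # Lives inside AiterScaledMMLinearKernel.apply, so cutlass_scaled_mm
    # never needs to know about AITER.
    from aiter import gemm_a8w8_CK

    # w_q is stored transposed by the shared ScaledMM weight layout, so it is
    # transposed back before the CK kernel call (see the later thread on why).
    return gemm_a8w8_CK(x_q, w_q.t(), x_s, w_s, bias).to(out_dtype)
```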

Comment on lines 23 to 28
        if current_platform.is_cpu():
            return (
                False,
                "AiterScaledMMLinearKernel requires `aiter` which is not " +
                "currently supported on CPU.")
        if not current_platform.is_rocm():
Contributor

This can be a single check

Contributor Author

Done.

@@ -57,6 +71,11 @@ def use_v0_only(monkeypatch):
)
def test_compressed_tensors_w8a8_static_setup(vllm_runner, model_args):
model_path, strategy, quant_type, shape_0, is_symmetric = model_args

Contributor

This logic is a bit confusing. What models are and aren't supported by aiter vs Triton?

Contributor Author

AITER only supports per-channel-per-channel INT8 GEMM and per-tensor-per-tensor INT8 GEMM. It does not support mixed-precision MM or mixed quantization schemes.
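A small illustration of the two supported scale layouts, assuming an activation matrix of shape (M, K) and a weight of shape (N, K); the `numel()` checks mirror the ones quoted further down in this review:

```python
import torch

M, N, K = 4, 8, 16  # illustrative sizes

# per-tensor-per-tensor: one scale for the activations and one for the weight
x_s = torch.ones(1, 1)
w_s = torch.ones(1, 1)

# one scale per activation row (token) and one per weight output channel --
# the layout the thread below settles on calling per-token-per-channel
x_s = torch.ones(M, 1)
w_s = torch.ones(N, 1)

# The kernel's capability check reduces to numel() comparisons:
assert x_s.numel() == M and w_s.numel() == N
```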

Contributor

Could you add that as a comment?

@pytest.mark.parametrize("num_logprobs", [10])
@pytest.mark.skipif(not current_platform.is_rocm(),
reason="This tests is skipped on non-ROCm platform.")
def test_compressed_tensors_w8a8_logprobs_rocm_aiter(
Contributor

Could this be folded into the existing tests by adding a boolean use_aiter parameter? And we can do [False] if <platform ...> else [False, True]

Contributor Author

Done
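A sketch of the folded-in parametrization (test and parameter names are hypothetical; the actual change may differ):

```python
import pytest

from vllm.platforms import current_platform


# AITER is only exercised on ROCm; everywhere else only the default path runs.
@pytest.mark.parametrize(
    "use_aiter",
    [False] if not current_platform.is_rocm() else [False, True])
def test_compressed_tensors_w8a8_logprobs(use_aiter, monkeypatch):
    if use_aiter:
        monkeypatch.setenv("VLLM_ROCM_USE_AITER", "1")
    ...  # run the shared logprobs comparison as before
```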

tjtanaa added 2 commits March 27, 2025 16:07
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
@ProExpertProg (Contributor) left a comment

A few more comments, thanks for working with me on this!

@@ -20,25 +27,20 @@ def get_min_capability(cls) -> int:
     @classmethod
     def can_implement(
             cls, c: ScaledMMLinearLayerConfig) -> Tuple[bool, Optional[str]]:
-        if current_platform.is_cpu():
+        if current_platform.is_cpu() or not current_platform.is_rocm():
Contributor

Why can't this be simpler?

Suggested change:
-        if current_platform.is_cpu() or not current_platform.is_rocm():
+        if not current_platform.is_rocm():

@tjtanaa (Contributor Author) commented Mar 28, 2025

Ok. I have removed the check for CPU.

from .ScaledMMLinearKernel import ScaledMMLinearLayerConfig


def is_rocm_aiter_gemm_w8a8_scaled_mm_enabled() -> bool:
Contributor

This method should just be inlined to the sole callsite (unless I'm missing another use)

Contributor Author

Resolved.
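The resolved version presumably inlines the environment-flag check at its single callsite, roughly as below; the flag names are taken from the PR description, and the exact expression is an assumption:

```python
import vllm.envs as envs

# At the sole callsite (inside can_implement, where ROCm has already been
# verified), the former helper reduces to checking the two env flags:
if envs.VLLM_ROCM_USE_AITER and envs.VLLM_ROCM_USE_AITER_LINEAR:
    ...  # the AITER w8a8 scaled-GEMM path is enabled
```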

Comment on lines 97 to 98
per_channel_tensor_scale_a = (x_s.numel() == m)
per_channel_tensor_scale_b = (w_s.numel() == n)
Contributor

Isn't this more accurate here?

Suggested change:
-        per_channel_tensor_scale_a = (x_s.numel() == m)
-        per_channel_tensor_scale_b = (w_s.numel() == n)
+        per_token_scale_a = (x_s.numel() == m)
+        per_channel_scale_b = (w_s.numel() == n)

Contributor Author

Yes you are right. I have made the amendments. Thank you so much.

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
"installed supported on ROCm.")
# Check if rocm_aiter_gemm_w8a8_scaled_mm is enabled
if not (
current_platform.is_rocm() \
Contributor

Already checked this above

Contributor Author

OK. removed current_platform.is_rocm()

" ATIER block scaled GEMM yet.")

from aiter import gemm_a8w8_CK
return gemm_a8w8_CK(x_q, w_q.t(), x_s, w_s, bias).to(out_dtype)
Contributor

Just curious for future work: does this kernel support fp8?

Also, can you add a comment why w_q needs to be transposed here? I assume because it's using the Cutlass prepare weights which are transposed so here we restore it?

Contributor Author

ROCm/aiter does not support FP8 at this moment.
I have added the comment.

tjtanaa added 2 commits March 28, 2025 17:49
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Comment on lines 103 to 109
        # per-channel-per-channel a8w8 scacled GEMM
        assert ((per_tensor_scale_a and per_tensor_scale_b)
                or (per_token_scale_a and per_channel_scale_b)), (
                    "Currently only support per-tensor-per-tensor GEMM " +
                    " and per-channel-per-channel GEMM through AITER"
                    " w8a8 scaled gemm. `cutlass_scaled_mm` does not support" +
                    " ATIER block scaled GEMM yet.")
Contributor

Typos/inaccuracies:

Suggested change:
-        # per-channel-per-channel a8w8 scacled GEMM
-        assert ((per_tensor_scale_a and per_tensor_scale_b)
-                or (per_token_scale_a and per_channel_scale_b)), (
-                    "Currently only support per-tensor-per-tensor GEMM " +
-                    " and per-channel-per-channel GEMM through AITER"
-                    " w8a8 scaled gemm. `cutlass_scaled_mm` does not support" +
-                    " ATIER block scaled GEMM yet.")
+        # per-token-per-channel a8w8 scaled GEMM
+        assert ((per_tensor_scale_a and per_tensor_scale_b)
+                or (per_token_scale_a and per_channel_scale_b)), (
+                    "Currently only support per-tensor-per-tensor GEMM " +
+                    " and per-token-per-channel GEMM through AITER"
+                    " w8a8 scaled gemm. `cutlass_scaled_mm` does not support" +
+                    " AITER block scaled GEMM yet.")

And then, what does this mean: "cutlass_scaled_mm does not support AITER block scaled GEMM yet"?

Contributor Author

My bad. This is the modified version:

        # @TODO:
        # Maybe broadcast the per-tensor-scale into per-channel-scale
        # if one of the scale is a per-channel-scale.
        # For now, it only supports:
        # - per-tensor-per-tensor a8w8 scaled GEMM, and
        # - per-token-per-channel a8w8 scaled GEMM
        assert ((per_tensor_scale_a and per_tensor_scale_b)
                or (per_token_scale_a and per_channel_scale_b)), (
                    "Currently only support per-tensor-per-tensor GEMM " +
                    " and per-token-per-channel GEMM through AITER"
                    " w8a8 scaled gemm. `AiterScaledMMLinearKernel` " +
                    "does not support AITER block scaled GEMM.")

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
@mgoin (Member) left a comment

LGTM, nice work!

@mgoin added the rocm (Related to AMD ROCm), quantization, and ready (ONLY add when PR is ready to merge/full CI is needed) labels Mar 28, 2025
@mgoin mgoin enabled auto-merge (squash) March 28, 2025 18:51
@vllm-bot vllm-bot merged commit 4965ec4 into vllm-project:main Mar 29, 2025
42 of 44 checks passed
kylesayrs pushed a commit to neuralmagic/vllm that referenced this pull request Apr 2, 2025
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Alex4210987 pushed a commit to LeiWang1999/vllm-bitblas that referenced this pull request Apr 5, 2025
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: xinyuxiao <xinyuxiao2024@gmail.com>
lulmer pushed a commit to lulmer/vllm that referenced this pull request Apr 7, 2025
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: Louis Ulmer <ulmerlouis@gmail.com>
nishith-fujitsu pushed a commit to nishith-fujitsu/vllm that referenced this pull request Apr 9, 2025
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Labels
quantization · ready (ONLY add when PR is ready to merge/full CI is needed) · rocm (Related to AMD ROCm)