[Quantization][FP8] Adding support for fp8 gemm layer input in fp8 #14578
Conversation
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
@@ -172,6 +173,9 @@ def apply(
        if use_per_token_if_dynamic is None:
            use_per_token_if_dynamic = self.use_per_token_if_dynamic

        if out_dtype is None:
            out_dtype = input.dtype

        # cutlass_scaled_mm supports per tensor/channel W and per tensor/token A
        if self.cutlass_fp8_supported:
            qinput, x_scale = ops.scaled_fp8_quant(
Either add an assert here or use the same logic as the non-cutlass case.
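A minimal sketch of the guard this comment suggests, assuming the surrounding apply() context from the diff above (input, input_scale, ops.scaled_fp8_quant); this is illustrative only, not the exact PR code:

    # Skip dynamic quantization when the activation already arrives in fp8
    # (both e4m3fn and the ROCm e4m3fnuz variant), otherwise quantize as before.
    if input.dtype in (torch.float8_e4m3fn, torch.float8_e4m3fnuz):
        # The caller must then supply the matching activation scale.
        assert input_scale is not None, "fp8 input requires an explicit input_scale"
        qinput, x_scale = input, input_scale
    else:
        qinput, x_scale = ops.scaled_fp8_quant(
            input,
            input_scale,
            use_per_token_if_dynamic=use_per_token_if_dynamic)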
@@ -116,6 +116,21 @@ def get_quant_method(self, layer: torch.nn.Module,
            return Fp8KVCacheMethod(self)
        return None

    def get_cache_scale(self, name: str) -> Optional[str]:
Where is this method used?
For models with additional scales, such as amd/Llama-3.1-8B-Instruct-FP8-QKV-Proj.
This has been broken since this change: de0526f#diff-48d2ca5476d5b776f6401436fcf015c5ce4dc1a23d2b78a09e08fb85acc3697cL399
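For reference, a hedged sketch of what such a remapping method could look like; the substring patterns below are assumptions based on the QKV-projection scales mentioned above, not a quote of the PR:

    from typing import Optional

    def get_cache_scale(self, name: str) -> Optional[str]:
        # Map a checkpoint scale name onto the vLLM attention parameter it
        # should load into; return None when no remapping applies.
        # (Illustrative sketch only; the exact patterns are assumptions.)
        if name.endswith(".output_scale") and ".k_proj" in name:
            return name.replace(".k_proj.output_scale", ".attn.k_scale")
        if name.endswith(".output_scale") and ".v_proj" in name:
            return name.replace(".v_proj.output_scale", ".attn.v_scale")
        return None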
Could we add a test for this? Or does it already exist in CI?
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
@@ -23,6 +23,7 @@ class CompressedTensorsW8A8Fp8(CompressedTensorsScheme):

    def __init__(self, strategy: str, is_static_input_scheme: bool):
        self.strategy = strategy
        self.out_dtype = torch.get_default_dtype()
@ProExpertProg - are you sure this is okay?
I know that throughout the models, we pipe the dtype through.
I didn't know we did that - I thought the default dtype was used for unquantized layers.
That's the unquantized dtype by design here, to do fp8 x fp8 -> half instead of half x fp8 -> half
I looked and we use the default dtype in many places (attention, RMSNorm, etc.). So I think this is fine @robertgshaw2-redhat
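The discussion above hinges on the default dtype tracking the model's activation dtype; a tiny illustrative check, assuming (as the thread notes) that the model dtype is installed as the torch default while the scheme is constructed:

    import torch

    # With bfloat16 set as the default dtype, torch.get_default_dtype()
    # returns the same dtype that unquantized layers already use, so the
    # fp8 x fp8 GEMM emits activations in the model dtype.
    torch.set_default_dtype(torch.bfloat16)
    out_dtype = torch.get_default_dtype()
    assert out_dtype is torch.bfloat16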
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com> Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Force-pushed from 73be373 to 28a1958
Thanks @gshtras - nice work.
Adding support for the case where both inputs to the FP8 GEMM are in the FP8 data type, not only the weights (in preparation for attention with fused FP8 conversion).
Functionality ported over from ROCm/vllm.
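A conceptual demo in plain PyTorch of the data flow this enables (not vLLM code; tensor shapes and the per-tensor scale recipe are illustrative assumptions):

    import torch

    # The activation arrives already quantized to fp8 with a per-tensor scale,
    # so the linear layer can do fp8 x fp8 and emit the model dtype directly
    # instead of quantizing a half-precision tensor inside the GEMM helper.
    x = torch.randn(16, 32, dtype=torch.bfloat16)
    x_scale = x.abs().max().float() / torch.finfo(torch.float8_e4m3fn).max
    x_fp8 = (x.float() / x_scale).to(torch.float8_e4m3fn)  # what fused-fp8 attention would emit
    # x_fp8 and x_scale would then be passed straight into the fp8 GEMM,
    # whose out_dtype defaults to the model dtype (bfloat16 here).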