[NVIDIA] Support nvfp4 cutlass gemm #13571
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀
Force-pushed from c2e2e58 to 50ac4fc.
using LayoutD = decltype(cute::make_layout(make_shape(0, 0, 0), StrideD{}));
};

struct Fp4GemmSm100Bfloat16 {
Looks like there's a lot of commonality between Fp4GemmSm100Float, Fp4GemmSm100Half, and Fp4GemmSm100Bfloat16. Could we just template this out, i.e. Fp4GemmSm100<float>, Fp4GemmSm100<half_t>, and Fp4GemmSm100<bfloat16_t>? This will likely help reduce duplication when we start tuning tile sizes for perf.
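A minimal sketch of what that could look like, assuming the three per-dtype structs differ only in the output element type (the member names follow the snippet quoted above, the rest is illustrative):

#include <cute/layout.hpp>          // cute::make_layout, cute::make_shape
#include <cutlass/numeric_types.h>  // cutlass::half_t, cutlass::bfloat16_t

// Hypothetical single template replacing the three per-dtype structs.
// Only the output element type varies; the remaining config is shared.
template <typename OutType>
struct Fp4GemmSm100 {
  using ElementD = OutType;
  using StrideD = cute::Stride<int64_t, cute::Int<1>, cute::Int<0>>;
  using LayoutD = decltype(cute::make_layout(cute::make_shape(0, 0, 0), StrideD{}));
  // ... remaining mainloop/epilogue configuration, identical across dtypes ...
};

// Dispatch then instantiates Fp4GemmSm100<float>, Fp4GemmSm100<cutlass::half_t>,
// and Fp4GemmSm100<cutlass::bfloat16_t>, keeping tile-size tuning in one place.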
Currently there's an issue with CUTLASS 3.8 and the gcc version we use for compiling the templates, hence the dumb fix for an initial version of the fp4 gemm. Will add a TODO to track this and go back to templating once that issue is fixed.
@tlrmchlsmth thoughts on renaming the
Again, it seems the failed tests are not related to this PR. PTAL.
Sorry, thanks for the updates. Overall this looks OK to me (I left some comments for future work), but I wanted to follow up on this (sorry, just saw it now!):
This PR requires Cutlass 3.8 (which hasn't been officially released yet) to fully function. However, it should still build using placeholder functions.
What's the motivation for landing this before 3.8 is released? To allow users to preview it using VLLM_CUTLASS_SRC_DIR? If so, we should probably add a more verbose comment in:
template <typename T>
void runGemm(at::Tensor& D, at::Tensor const& A, at::Tensor const& B,
at::Tensor const& A_sf, at::Tensor const& B_sf,
at::Tensor const& alpha, int64_t m, int64_t n, int64_t k,
cudaStream_t stream) {
TORCH_CHECK(false, "Unsupported cutlass version");
}
on how to preview it
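For example (the wording and the VLLM_CUTLASS_SRC_DIR hint below are only a suggestion, not code from this PR), the stub could spell out how to build against CUTLASS 3.8 for a preview:

#include <torch/all.h>
#include <cuda_runtime.h>

// Hypothetical, more verbose fallback for builds against CUTLASS < 3.8.
// The message text and the env-var guidance are illustrative.
template <typename T>
void runGemm(at::Tensor& D, at::Tensor const& A, at::Tensor const& B,
             at::Tensor const& A_sf, at::Tensor const& B_sf,
             at::Tensor const& alpha, int64_t m, int64_t n, int64_t k,
             cudaStream_t stream) {
  TORCH_CHECK(false,
              "NVFP4 GEMM requires CUTLASS 3.8, which this build does not include. "
              "To preview the kernel, build vLLM against a CUTLASS 3.8 checkout, "
              "e.g. by setting VLLM_CUTLASS_SRC_DIR to its source directory.");
}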
}

template <typename T>
void runGemm(at::Tensor& D, at::Tensor const& A, at::Tensor const& B,
Future work: we should see if we can unify this with cutlass_gemm_caller in csrc/quantization/cutlass_w8a8/c3x/cutlass_gemm_caller.cuh.
};

template <typename T>
struct Fp4GemmSm100 {
Future work: I think we should try to unify this with cutlass_3x_gemm in csrc/quantization/cutlass_w8a8/c3x/scaled_mm.cuh.
#if defined ENABLE_NVFP4 && ENABLE_NVFP4
  return cutlass_scaled_fp4_mm_sm100a(D, A, B, A_sf, B_sf, alpha);
#endif
  TORCH_CHECK_NOT_IMPLEMENTED(false, "No compiled nvfp4 mm kernel.");
Can you please elaborate on this a bit? E.g. say something like:
TORCH_CHECK_NOT_IMPLEMENTED(false, "No compiled nvfp4 mm kernel, vLLM should be compiled using CUDA 12.8 and target compute capability 100 or above.");
Totally makes sense.
Thanks for the review! I’ve addressed the comments, except for the future work related to code structure changes. @LucasWilkinson PTAL.
@kaixih Thanks for the hard work! Apologies for the long back and forth, but it looks like CUTLASS 3.8 just got released! Do you think you can upgrade to that in this PR, given that it's required? (It doesn't make a ton of sense to rush and land this if it's not going to be functional without it.) Also, while we are still making changes, let's try to adopt this: #13571 (comment)
@LucasWilkinson Can we focus this PR on NVFP4 support and address the code structure changes in a separate PR?
# Set CUTLASS_REVISION manually -- its revision detection doesn't work in this case.
# Please keep this in sync with FetchContent_Declare line below.
set(CUTLASS_REVISION "v3.7.0" CACHE STRING "CUTLASS revision to use")
set(CUTLASS_REVISION "v3.8.0" CACHE STRING "CUTLASS revision to use")
Just noticed this PR didn't actually update CUTLASS to 3.8 (see line 250 below)
Forked from #12519 (will be closed soon); we decided to separate the fp4 quantization and fp4 gemm into two PRs: (1) fp4 quantization (PR merged); (2) fp4 gemm (this PR).
This PR requires Cutlass 3.8 (which hasn't been officially released yet) to fully function. However, it should still build using placeholder functions.
cc. @pavanimajety @kushanam