
[Kernels] LoRA - Retire SGMV and BGMV Kernels #14685

Merged
8 commits merged into vllm-project:main on Mar 18, 2025

Conversation

@varun-sundar-rabindranath (Contributor) commented Mar 12, 2025

Retire SGMV and BGMV LoRA kernels in favor of V1 kernels. This cleans up the tests and kernel dispatch logic.
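For context, here is a minimal PyTorch sketch (plain matmul math, not the actual kernels; tensor names and shapes are illustrative) of the shrink/expand computation that both the retired SGMV/BGMV kernels and their replacements implement. The real kernels also gather per-token adapter weights via token_lora_indices, which is omitted here.

```python
import torch

def lora_shrink_expand(x: torch.Tensor, lora_a: torch.Tensor,
                       lora_b: torch.Tensor, scale: float,
                       base_out: torch.Tensor) -> torch.Tensor:
    """Reference math for a single LoRA adapter: shrink to the low rank,
    then expand back to the output dim and accumulate into the base output."""
    # shrink: [num_tokens, hidden] @ [hidden, rank] -> [num_tokens, rank]
    buffer = x @ lora_a
    # expand: [num_tokens, rank] @ [rank, out_dim], scaled and accumulated
    return base_out + scale * (buffer @ lora_b)

x = torch.randn(16, 4096)            # 16 tokens, hidden size 4096
lora_a = torch.randn(4096, 16)       # rank-16 adapter, A projection
lora_b = torch.randn(16, 4096)       # rank-16 adapter, B projection
base = torch.randn(16, 4096)
print(lora_shrink_expand(x, lora_a, lora_b, 1.0, base).shape)  # torch.Size([16, 4096])
```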


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which runs a small, essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@varun-sundar-rabindranath (Contributor, Author)

@jeejeelee This PR is good to go! Can you take a pass at it when you get a chance, please? Thanks! 🙌

w_t_all: torch.Tensor,
scale: float,
):
bgmv_shrink(x, w_t_all, y, self.token_lora_indices, scale)
Collaborator

Have you tested the impact of replacing these kernels on the performance of V0 LoRA, especially in CUDA graph and eager modes?

Contributor Author

I did some minimal benchmarking for LoRA Rank ablations - https://docs.google.com/spreadsheets/d/1RMOnNc8sm5swWC-ZyI0x4LdmAfLJAoX30xWSi7p3Hng/edit?usp=sharing
I am running more benchmarks that cover more problem sizes. I'll share them when they are ready 👍

@chenhongyu2048

@varun-sundar-rabindranath Hi, I'm confused about some items in the performance data table, such as single-lora roofline using torch.mm (f16xf16=>f16) and SGMV_EXPAND(add_inputs=True) (f32xf16=>f16).
What is the single-lora roofline using torch.mm? And SGMV_EXPAND, I guess, is the implementation from the previous V0?

Contributor Author

Hi @chenhongyu2048 ,

single-lora roofline using torch.mm

These are just the numbers for a single matmul. This is a very loose bound on how good the LoRA kernel performance could be. Please take a look at

def bench_torch_mm(ctx: BenchmarkContext,
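A rough sketch of that roofline idea (assumed shapes and CUDA-event timing, not the actual bench_torch_mm code): time a single f16 matmul of the same problem size, which a LoRA kernel that additionally has to gather per-token adapter weights cannot beat.

```python
import torch

def time_torch_mm(m: int, k: int, n: int, iters: int = 100) -> float:
    """Average latency (ms) of a single f16 torch.mm of the given shape,
    used as a loose roofline for LoRA kernels of the same problem size."""
    a = torch.randn(m, k, dtype=torch.float16, device="cuda")
    b = torch.randn(k, n, dtype=torch.float16, device="cuda")
    torch.mm(a, b)                                   # warmup
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        torch.mm(a, b)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters           # milliseconds per matmul

print(time_torch_mm(m=8192, k=4096, n=16))           # e.g. a shrink to rank 16
```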

SGMV_EXPAND I guess it's the implementation of the previous V0?

Yes.

@chenhongyu2048

Thanks. I've tried benchmark_lora.py and confirmed the performance improvement relative to V0 LoRA in my use case (H20).

# Set is_prefill to True, so we always use the SGMV kernels on
# non-cuda platforms.
# On cuda platforms we use the same kernels for prefill and
# decode and this flag is generally ignored.
lora_mapping = LoRAMapping(token_lora_mapping,
Collaborator

QQ: What do these comments mean?

Contributor Author

For the CPU case, in punica_cpu.py, I see that we still dispatch to different kernels/operations based on whether the tokens are all prefill or all decode.

For the V1 kernels (now lora_kernels), we don't have this distinction, so the is_prefill field in LoRAMapping is ignored.
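A rough sketch of the difference (function names are made up for illustration, not vLLM's API): the CPU path picks a kernel family based on the batch type, while the unified path is the same for prefill and decode and simply ignores the flag.

```python
import torch

def sgmv_like_shrink(x, w, scale):    # stand-in for the prefill (SGMV) path
    return scale * (x @ w)

def bgmv_like_shrink(x, w, scale):    # stand-in for the decode (BGMV) path
    return scale * (x @ w)

def cpu_add_shrink(x, w, scale, is_prefill: bool):
    # CPU backend: kernel choice depends on whether the batch is all-prefill.
    return sgmv_like_shrink(x, w, scale) if is_prefill else bgmv_like_shrink(x, w, scale)

def unified_add_shrink(x, w, scale, is_prefill: bool):
    # Unified kernels: one path for both prefill and decode; is_prefill unused.
    return scale * (x @ w)

x, w = torch.randn(8, 64), torch.randn(64, 16)
assert torch.allclose(cpu_add_shrink(x, w, 1.0, True),
                      unified_add_shrink(x, w, 1.0, True))
```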

):
#No LoRA request, so return directly
if self.no_lora:
return
Collaborator

It seems that deleting these is due to CUDA graphs, is that right?

      #No LoRA request, so return directly
      if self.no_lora:
          return

Contributor Author

CUDA graphs are not the main issue.
The issue is torch.compile. In V1, all model_execute runs go through torch.compile and, IIRC, it doesn't support dynamic control flow.
But let me try re-introducing it and see if I can make it work, so I can record my findings here.
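A minimal illustration of the torch.compile concern (not vLLM code): a Python-level early return becomes a guard that specializes and recompiles the function per branch value, and a branch on a tensor value would instead force a graph break, so dropping the early return keeps a single static graph.

```python
import torch

@torch.compile
def maybe_add_lora(y: torch.Tensor, delta: torch.Tensor, no_lora: bool) -> torch.Tensor:
    if no_lora:          # guard on the Python bool: one compiled graph per value
        return y
    return y + delta     # stand-in for the actual LoRA kernel call

y, delta = torch.randn(4, 8), torch.randn(4, 8)
print(maybe_add_lora(y, delta, no_lora=False).shape)
print(maybe_add_lora(y, delta, no_lora=True).shape)   # flips the guard, recompiles
```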

Contributor Author

I have a PR here, #15152, that re-introduces this flag in a way that works for both V0 and V1.

@jeejeelee (Collaborator)

Could you please merge with the main branch, so we can test LoRA using torch==2.6.0?

Varun Sundar Rabindranath added 7 commits March 14, 2025 23:48
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com> (all commits)
@jeejeelee jeejeelee added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 18, 2025
@jeejeelee jeejeelee enabled auto-merge (squash) March 18, 2025 05:41
@jeejeelee jeejeelee added this to the v0.8.0 milestone Mar 18, 2025
@jeejeelee jeejeelee merged commit 400d483 into vllm-project:main Mar 18, 2025
50 checks passed
simon-mo pushed a commit that referenced this pull request Mar 18, 2025
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
gmarinho2 pushed a commit to gmarinho2/vllm that referenced this pull request Apr 1, 2025
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
lulmer pushed a commit to lulmer/vllm that referenced this pull request Apr 7, 2025
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Signed-off-by: Louis Ulmer <ulmerlouis@gmail.com>
nishith-fujitsu pushed a commit to nishith-fujitsu/vllm that referenced this pull request Apr 9, 2025
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Labels: ready (ONLY add when PR is ready to merge/full CI is needed), v1
3 participants