
[Kernels] LoRA - Retire SGMV and BGMV Kernels #14685

Merged
8 commits merged into vllm-project:main on Mar 18, 2025

Conversation

@varun-sundar-rabindranath (Contributor) commented Mar 12, 2025

Retire SGMV and BGMV LoRA kernels in favor of V1 kernels. This cleans up the tests and kernel dispatch logic.
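For context, here is a minimal PyTorch sketch (plain matmul math, not the actual kernels; tensor names and shapes are illustrative) of the shrink/expand computation that both the retired SGMV/BGMV kernels and their replacements implement. The real kernels also gather per-token adapter weights via token_lora_indices, which is omitted here.

```python
import torch

def lora_shrink_expand(x: torch.Tensor, lora_a: torch.Tensor,
                       lora_b: torch.Tensor, scale: float,
                       base_out: torch.Tensor) -> torch.Tensor:
    """Reference math for a single LoRA adapter: shrink to the low rank,
    then expand back to the output dim and accumulate into the base output."""
    # shrink: [num_tokens, hidden] @ [hidden, rank] -> [num_tokens, rank]
    buffer = x @ lora_a
    # expand: [num_tokens, rank] @ [rank, out_dim], scaled and accumulated
    return base_out + scale * (buffer @ lora_b)

x = torch.randn(16, 4096)            # 16 tokens, hidden size 4096
lora_a = torch.randn(4096, 16)       # rank-16 adapter, A projection
lora_b = torch.randn(16, 4096)       # rank-16 adapter, B projection
base = torch.randn(16, 4096)
print(lora_shrink_expand(x, lora_a, lora_b, 1.0, base).shape)  # torch.Size([16, 4096])
```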


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which runs a small, essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@varun-sundar-rabindranath (Contributor, Author)

@jeejeelee This PR is good to go! Can you take a pass at it when you get a chance, please? Thanks! 🙌

w_t_all: torch.Tensor,
scale: float,
):
bgmv_shrink(x, w_t_all, y, self.token_lora_indices, scale)
Collaborator

Have you tested the impact of replacing these kernels on the performance of V0 LoRA, especially in CUDA graph and eager modes?

Contributor Author

I did some minimal benchmarking for LoRA Rank ablations - https://docs.google.com/spreadsheets/d/1RMOnNc8sm5swWC-ZyI0x4LdmAfLJAoX30xWSi7p3Hng/edit?usp=sharing
I am running more benchmarks that cover more problem sizes. I'll share them when they are ready 👍

@chenhongyu2048

@varun-sundar-rabindranath Hi, I'm confused about some items in the performance data table, such as single-lora roofline using torch.mm (f16xf16=>f16) and SGMV_EXPAND(add_inputs=True) (f32xf16=>f16).
What is the single-lora roofline using torch.mm? And SGMV_EXPAND, I guess, is the implementation from the previous V0?

Contributor Author

Hi @chenhongyu2048 ,

single-lora roofline using torch.mm

These are just the numbers for a single matmul. This is a very loose bound on how good the LoRA kernel performance could be. Please take a look at

def bench_torch_mm(ctx: BenchmarkContext,
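A rough sketch of that roofline idea (assumed shapes and CUDA-event timing, not the actual bench_torch_mm code): time a single f16 matmul of the same problem size, which a LoRA kernel that additionally has to gather per-token adapter weights cannot beat.

```python
import torch

def time_torch_mm(m: int, k: int, n: int, iters: int = 100) -> float:
    """Average latency (ms) of a single f16 torch.mm of the given shape,
    used as a loose roofline for LoRA kernels of the same problem size."""
    a = torch.randn(m, k, dtype=torch.float16, device="cuda")
    b = torch.randn(k, n, dtype=torch.float16, device="cuda")
    torch.mm(a, b)                                   # warmup
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        torch.mm(a, b)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters           # milliseconds per matmul

print(time_torch_mm(m=8192, k=4096, n=16))           # e.g. a shrink to rank 16
```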

SGMV_EXPAND I guess it's the implementation of the previous V0?

Yes.

@chenhongyu2048

Thanks. I've tried benchmark_lora.py and confirmed the performance improvement relative to V0 LoRA in my use case (H20).

# Set is_prefill to True, so we always use the SGMV kernels on
# non-cuda platforms.
# On cuda platforms we use the same kernels for prefill and
# decode and this flag is generally ignored.
lora_mapping = LoRAMapping(token_lora_mapping,
Collaborator

QQ: What do these comments mean?

Contributor Author

For the CPU case, in punica_cpu.py, I see that we still dispatch to different kernels/operations based on whether the tokens are all prefill or all decode.

For the V1 kernels (now lora_kernels), we don't have this distinction, so the is_prefill field in LoRAMapping is ignored.
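A rough sketch of the difference (function names are made up for illustration, not vLLM's API): the CPU path picks a kernel family based on the batch type, while the unified path is the same for prefill and decode and simply ignores the flag.

```python
import torch

def sgmv_like_shrink(x, w, scale):    # stand-in for the prefill (SGMV) path
    return scale * (x @ w)

def bgmv_like_shrink(x, w, scale):    # stand-in for the decode (BGMV) path
    return scale * (x @ w)

def cpu_add_shrink(x, w, scale, is_prefill: bool):
    # CPU backend: kernel choice depends on whether the batch is all-prefill.
    return sgmv_like_shrink(x, w, scale) if is_prefill else bgmv_like_shrink(x, w, scale)

def unified_add_shrink(x, w, scale, is_prefill: bool):
    # Unified kernels: one path for both prefill and decode; is_prefill unused.
    return scale * (x @ w)

x, w = torch.randn(8, 64), torch.randn(64, 16)
assert torch.allclose(cpu_add_shrink(x, w, 1.0, True),
                      unified_add_shrink(x, w, 1.0, True))
```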

):
#No LoRA request, so return directly
if self.no_lora:
return
Collaborator

It seems that deleting these is due to CUDA graphs, is that right?

      #No LoRA request, so return directly
      if self.no_lora:
          return

Contributor Author

CUDA graphs are not the main issue.
The issue is torch.compile. In V1, all model_execute runs go through torch.compile and, IIRC, it doesn't support dynamic control flow.
But let me try re-introducing it and see if I can make it work, so I can record my findings here.
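A minimal illustration of the torch.compile concern (not vLLM code): a Python-level early return becomes a guard that specializes and recompiles the function per branch value, and a branch on a tensor value would instead force a graph break, so dropping the early return keeps a single static graph.

```python
import torch

@torch.compile
def maybe_add_lora(y: torch.Tensor, delta: torch.Tensor, no_lora: bool) -> torch.Tensor:
    if no_lora:          # guard on the Python bool: one compiled graph per value
        return y
    return y + delta     # stand-in for the actual LoRA kernel call

y, delta = torch.randn(4, 8), torch.randn(4, 8)
print(maybe_add_lora(y, delta, no_lora=False).shape)
print(maybe_add_lora(y, delta, no_lora=True).shape)   # flips the guard, recompiles
```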

Contributor Author

I have a PR here, #15152, that re-introduces this flag in a way that works for both V0 and V1.

@jeejeelee (Collaborator)

Could you please merge with the main branch, so we can test LoRA using torch==2.6.0?

Varun Sundar Rabindranath added 7 commits March 14, 2025 23:48
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com> (all commits)
@jeejeelee jeejeelee added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 18, 2025
@jeejeelee jeejeelee enabled auto-merge (squash) March 18, 2025 05:41
@jeejeelee jeejeelee added this to the v0.8.0 milestone Mar 18, 2025
@jeejeelee jeejeelee merged commit 400d483 into vllm-project:main Mar 18, 2025
50 checks passed
simon-mo pushed a commit that referenced this pull request Mar 18, 2025
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
gmarinho2 pushed a commit to gmarinho2/vllm that referenced this pull request Apr 1, 2025
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
lulmer pushed a commit to lulmer/vllm that referenced this pull request Apr 7, 2025
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Signed-off-by: Louis Ulmer <ulmerlouis@gmail.com>
nishith-fujitsu pushed a commit to nishith-fujitsu/vllm that referenced this pull request Apr 9, 2025
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Labels: ready (ONLY add when PR is ready to merge/full CI is needed), v1
3 participants