[Kernels] LoRA - Retire SGMV and BGMV Kernels #14685
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Force-pushed from 42728e1 to 21b8d11.
Force-pushed from e6603ec to 9596a8f.
@jeejeelee This PR is good to go! Can you take a pass at it when you get a chance, please? Thanks! 🙌
w_t_all: torch.Tensor,
scale: float,
):
bgmv_shrink(x, w_t_all, y, self.token_lora_indices, scale)
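For context, here is a minimal pure-PyTorch sketch of the semantics the retired BGMV-style shrink implements. The [num_loras, rank, hidden_dim] weight layout and the reference function name below are illustrative assumptions, not the actual kernel code:

```python
import torch

def bgmv_shrink_reference(x: torch.Tensor, w_t_all: torch.Tensor,
                          y: torch.Tensor, token_lora_indices: torch.Tensor,
                          scale: float) -> torch.Tensor:
    """Reference semantics only: each token selects its LoRA A matrix by
    index and is projected from hidden_dim down to the LoRA rank.

    x:                  [num_tokens, hidden_dim] activations
    w_t_all:            [num_loras, rank, hidden_dim] stacked LoRA A weights
    y:                  [num_tokens, rank] output, accumulated in place
    token_lora_indices: [num_tokens] LoRA id per token
    """
    # Gather each token's LoRA A matrix, then contract over hidden_dim.
    a = w_t_all[token_lora_indices]        # [num_tokens, rank, hidden_dim]
    y += scale * torch.einsum("td,trd->tr", x, a).to(y.dtype)
    return y
```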
Have you tested the impact of replacing these kernels on the performance of V0 LoRA, especially in CUDA graph and eager modes?
I did some minimal benchmarking for LoRA rank ablations: https://docs.google.com/spreadsheets/d/1RMOnNc8sm5swWC-ZyI0x4LdmAfLJAoX30xWSi7p3Hng/edit?usp=sharing
I am running more benchmarks that cover more problem sizes. I'll share them when they are ready 👍
@varun-sundar-rabindranath Hi, I'm confused about some items in the performance data table, such as "single-lora roofline using torch.mm (f16xf16=>f16)" and "SGMV_EXPAND(add_inputs=True) (f32xf16=>f16)".
What is the single-lora roofline using torch.mm? And SGMV_EXPAND, I guess that's the implementation of the previous V0?
Hi @chenhongyu2048,
"single-lora roofline using torch.mm"
This is just the timing of a single matmul. It is a very loose bound on how good the LoRA kernel performance could be. Please take a look at bench_torch_mm in vllm/benchmarks/kernels/benchmark_lora.py (line 604 at d5615af):
def bench_torch_mm(ctx: BenchmarkContext, ...
"SGMV_EXPAND I guess it's the implementation of the previous V0?"
Yes.
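For concreteness, here is a minimal sketch of what such a single-matmul roofline measurement could look like. The actual implementation is bench_torch_mm in benchmark_lora.py; the function name, shapes, and timing setup below are illustrative assumptions:

```python
import torch
import torch.utils.benchmark as benchmark

def single_lora_roofline_ms(num_tokens: int, hidden_dim: int, rank: int,
                            dtype=torch.float16, device="cuda") -> float:
    """Time one dense matmul of the same problem size as the LoRA shrink
    (hidden_dim -> rank). A grouped/batched LoRA kernel over many adapters
    can at best match this, so it serves as a loose lower bound."""
    x = torch.randn(num_tokens, hidden_dim, dtype=dtype, device=device)
    a = torch.randn(hidden_dim, rank, dtype=dtype, device=device)
    timer = benchmark.Timer(stmt="torch.mm(x, a)",
                            globals={"torch": torch, "x": x, "a": a})
    return timer.timeit(100).median * 1e3  # median runtime in milliseconds
```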
Thanks. I've tried benchmark_lora.py and confirmed the performance improvement relative to V0 LoRA in my use case (H20).
Force-pushed from c581ceb to dfde5d2.
# Set is_prefill to True, so we always use the SGMV kernels on
# non-cuda platforms.
# On cuda platforms we use the same kernels for prefill and
# decode and this flag is generally ignored.
lora_mapping = LoRAMapping(token_lora_mapping,
QQ: What do these comments mean?
For the CPU case, in punica_cpu.py, I see that we still dispatch to different kernels/operations based on whether the tokens were all prefill or all decode. For the V1 kernels (now lora_kernels), we don't have this distinction, so the is_prefill variable in LoRAMapping is ignored.
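A hypothetical sketch of that difference; the helper functions below are illustrative stubs, not the actual punica/lora_kernels APIs:

```python
import torch

# Stand-ins for the real kernels (illustrative stubs, not the vLLM ops).
def _sgmv_style(y, x, token_lora_indices): return y
def _bgmv_style(y, x, token_lora_indices): return y
def _unified_v1(y, x, token_lora_indices): return y


def add_lora_cpu(y, x, token_lora_indices, is_prefill: bool):
    # punica_cpu-style dispatch: prefill and decode batches take different
    # code paths, so the is_prefill flag in LoRAMapping matters here.
    kernel = _sgmv_style if is_prefill else _bgmv_style
    return kernel(y, x, token_lora_indices)


def add_lora_cuda(y, x, token_lora_indices, is_prefill: bool):
    # Unified V1 kernels: a single path handles any mix of prefill and
    # decode tokens, so is_prefill is accepted but ignored.
    return _unified_v1(y, x, token_lora_indices)
```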
):
# No LoRA request, so return directly
if self.no_lora:
    return
It seems that these lines were deleted because of CUDA graphs, is that right?
# No LoRA request, so return directly
if self.no_lora:
    return
CUDA graphs are not the main issue. The issue is torch.compile: in V1, all model-execute runs go through torch.compile, and IIRC it doesn't support dynamic control flow.
But let me try re-introducing it and see if I can make it work, so I can record my findings here.
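To illustrate the constraint, here is a standalone toy example (not vLLM code; the function name is made up, and the exact error type varies by torch version and config):

```python
import torch

@torch.compile(fullgraph=True)
def add_lora(y: torch.Tensor, x: torch.Tensor, no_lora: torch.Tensor):
    # Data-dependent control flow: the branch depends on tensor contents,
    # which a single fullgraph torch.compile capture cannot express.
    if bool(no_lora):
        return y          # the "no LoRA requests" early return
    return y + x          # stand-in for the actual LoRA computation

if __name__ == "__main__":
    y, x = torch.randn(4, 8), torch.randn(4, 8)
    try:
        add_lora(y, x, torch.tensor(False))
    except Exception as exc:
        # Expect a torch._dynamo error about the data-dependent branch here;
        # with fullgraph=False it would fall back via a graph break instead.
        print(f"torch.compile rejected the branch: {type(exc).__name__}")
```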
I have a PR here, #15152, that re-introduces this flag in a way that works for both V0 and V1.
Could you please merge with the main branch, so we can test LoRA with torch==2.6.0?
Force-pushed from e2456a6 to cde6f3a.
Retire SGMV and BGMV LoRA kernels in favor of V1 kernels. This cleans up the tests and kernel dispatch logic.