[TPU] Support sliding window and logit soft capping in the paged attention kernel for TPU. #15732
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small, essential subset of tests runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run full CI to test the changes comprehensively before merging. 🚀
This pull request has merge conflicts that must be resolved before it can be merged.
This is great @vanbasten23, thanks for piping this in! Would we be able to add a Gemma 3 (text-only) test, please?
Force-pushed from dc66e34 to 2fab49d
-EXPECTED_VALUE = 0.58
+EXPECTED_VALUES = {
+    "Qwen/Qwen2-1.5B-Instruct": 0.58,
+    "google/gemma-3-1b-it": 0.25,
Have you compared this score with GPU?
Thanks for the suggestion! I ran on GPU and got the same accuracy (0.25) for gemma-3-1b (test output on GPU: https://gist.github.com/vanbasten23/eda147d28db5f11187178771372ee39f)
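(For anyone reproducing this comparison, here is a minimal sketch using the lm-evaluation-harness Python API. The task name and model_args are assumptions for illustration, not the exact configuration behind the gist above.)

```python
import lm_eval

# Evaluate the model served through vLLM; swap model_args/tasks as needed.
results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=google/gemma-3-1b-it,max_model_len=4096",
    tasks=["gsm8k"],
)
print(results["results"]["gsm8k"])
```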
LGTM, thanks!
Note that the CI is not green; you should merge from main to fix the Docker build.
Great job!
I think some of the commits need signing; please take a look.
Force-pushed from e7bf947 to be06f10
@yaochengji we shouldn't merge this until @vanbasten23 validates 27B as well. Let's wait for those results.
Waiting for multi-chip validation, thank you!
I tested the 27B and it fails:
The failure is in https://gist.github.com/vanbasten23/637bd3af3c3946c00e6fb1ce27892e46, which fails in tpu_worker.py, outside the kernel. But I don't think it should block this PR, which is intended to add sliding window/logit soft-capping support. @bvrockwell The gemma-3 27B failure is a separate issue and can be fixed in a later PR.
OK, understood @vanbasten23, looking forward to the multi-chip Gemma 3 PR! Thanks for adding sliding window and logit soft capping.
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com>
Head branch was pushed to by a user without write access
Force-pushed from b342a46 to 352fa12
Hi @robertgshaw2-redhat, I resolved the merge conflicts but the auto-merge was canceled. Could you enable auto-merge again? Thanks!
FYI the TPU V1 test was failing when this was merged.
Good catch. This test was failing even before this PR; for example, in the earlier PR #15586, the TPU CI soft-failed with the same error. I think the TPU CI should hard-fail on this test failure.
The failing test seems to have been fixed later, in #16041 for example.
After I added sliding window and logit soft-capping support to the paged attention Pallas kernel, this PR adds sliding window and logit soft-capping at the vLLM level so that we can support the Gemma models.
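(For reference, here is a minimal NumPy sketch of the two mechanisms being plumbed through. It is not the Pallas kernel itself; `cap` and `window` stand for the model's soft-cap value and sliding-window size.)

```python
import numpy as np

def soft_cap(logits: np.ndarray, cap: float) -> np.ndarray:
    # Logit soft capping squashes raw attention logits into (-cap, cap)
    # via tanh, keeping them bounded while staying smooth.
    return cap * np.tanh(logits / cap)

def sliding_window_mask(q_pos: np.ndarray, k_pos: np.ndarray,
                        window: int) -> np.ndarray:
    # A key is attendable only if it is causal (not in the future) and
    # lies within the last `window` positions relative to the query.
    return (k_pos[None, :] <= q_pos[:, None]) & \
           (k_pos[None, :] > q_pos[:, None] - window)
```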
Test plan:
The Gemma models we care about are google/gemma-3-27b-it and google/gemma-3-1b-it.
cc @miladm