[TPU] Support sliding window and logit soft capping in the paged attention kernel for TPU. #15732


Merged

Conversation

@vanbasten23 (Collaborator) commented Mar 28, 2025

After adding sliding window and logit soft-capping support to the paged attention Pallas kernel, this PR adds sliding window and logit soft-capping at the vLLM level, so that we can support the Gemma models.

Test plans:

  • pytest -s -v /workspace/vllm/tests/v1/tpu/test_pallas.py
  • pytest -v -s tests/entrypoints/llm/test_accuracy.py::test_lm_eval_accuracy_v1_engine 2>&1 | tee out.txt

The Gemma models that we care about are google/gemma-3-27b-it and google/gemma-3-1b-it.
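For context, here is a minimal NumPy sketch of what the two features do to attention logits. This is an illustration only; the function and parameter names are made up and this is not the actual Pallas kernel code, which applies the equivalent logic block by block over the paged KV cache rather than materializing a full score matrix:

import numpy as np

def masked_softmax(scores, q_pos, kv_pos, sliding_window=None, soft_cap=None):
    """Illustrative only: soft-cap logits, apply causal + sliding-window
    masking, then softmax.

    scores: [num_q, num_kv] raw attention logits (q @ k.T / sqrt(head_dim)).
    q_pos, kv_pos: absolute token positions of the queries and keys.
    """
    if soft_cap is not None:
        # Logit soft capping (as in the Gemma family): smoothly squash
        # logits into (-soft_cap, soft_cap) with tanh.
        scores = soft_cap * np.tanh(scores / soft_cap)
    # Causal mask: a query attends only to keys at or before its position.
    mask = kv_pos[None, :] <= q_pos[:, None]
    if sliding_window is not None:
        # Sliding window: additionally restrict each query to the most
        # recent `sliding_window` key positions.
        mask &= kv_pos[None, :] > (q_pos[:, None] - sliding_window)
    # Masked positions get -inf, so they contribute zero softmax weight.
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

# Example: 2 queries at positions 6 and 7 over 8 cached keys.
w = masked_softmax(np.random.randn(2, 8),
                   q_pos=np.arange(6, 8), kv_pos=np.arange(8),
                   sliding_window=4, soft_cap=50.0)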

cc @miladm


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

mergify bot added the ci/build, v1, and tpu (Related to Google TPUs) labels Mar 28, 2025

mergify bot commented Mar 28, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @vanbasten23.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label Mar 28, 2025
@brittrock commented:

This is great @vanbasten23, thanks for piping this in! Would we be able to add a Gemma 3 [text-only] test, please?

Cc @robertgshaw2-redhat @yarongmu-google @yaochengji

EXPECTED_VALUE = 0.58
EXPECTED_VALUES = {
    "Qwen/Qwen2-1.5B-Instruct": 0.58,
    "google/gemma-3-1b-it": 0.25,
Collaborator commented:

Have you compared this score with GPU?

@vanbasten23 (Collaborator, Author) replied:

Thanks for the suggestion! I ran the test on GPU and got the same accuracy (0.25) for gemma-3-1b (test output on GPU: https://gist.github.com/vanbasten23/eda147d28db5f11187178771372ee39f).
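As an aside, here is a hedged sketch of how a per-model expected-value table like the one in the hunk above is typically consumed by an accuracy test. The helper name and tolerance are hypothetical, not the actual code in tests/entrypoints/llm/test_accuracy.py:

EXPECTED_VALUES = {
    "Qwen/Qwen2-1.5B-Instruct": 0.58,
    "google/gemma-3-1b-it": 0.25,
}
RTOL = 0.03  # hypothetical tolerance around the reference score

def check_accuracy(model_name: str, measured: float) -> None:
    # Compare the measured lm-eval score against the per-model reference,
    # failing if it drops below the allowed tolerance.
    expected = EXPECTED_VALUES[model_name]
    assert measured >= expected - RTOL, (
        f"{model_name}: measured {measured:.3f}, expected ~{expected:.3f}")

check_accuracy("google/gemma-3-1b-it", 0.26)  # passes

A per-model table matters here because gemma-3-1b-it legitimately scores much lower than Qwen2-1.5B on this benchmark, as the GPU run above confirms.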

@vanbasten23 requested a review from yaochengji April 2, 2025 00:03
@yaochengji (Collaborator) left a comment:

LGTM, thanks!

Note that the CI is not green.

@DarkLight1337 (Member) commented:

You should merge from main to fix the docker build.

@NickLucche (Contributor) left a comment:

Great job!
I think some of the commits need signing; please take a look.

@vanbasten23 force-pushed the xiowei/sw_logit_cap branch from e7bf947 to be06f10 on April 2, 2025 17:25
@bvrockwell (Contributor) commented Apr 2, 2025:

@yaochengji we shouldn't merge this until @vanbasten23 validates 27B as well. Let's wait for these results.

@bvrockwell (Contributor) left a comment:

Waiting for multi-chip validation, thank you!

@vanbasten23 (Collaborator, Author) replied:

> @yaochengji we shouldn't merge this until @vanbasten23 validates 27B as well. Let's wait for these results.

I tested the 27B and it fails:

MODEL=google/gemma-3-27b-it
VLLM_USE_V1=1 vllm serve $MODEL --seed 42 --disable-log-requests --gpu-memory-utilization 0.95 --max-num-batched-tokens 1024 --max-num-seqs 512 --tensor-parallel-size 4 --max-model-len 2048 &
VLLM_USE_V1=1 python3 benchmarks/benchmark_serving.py --model $MODEL  --dataset-name random --random-input-len 1800 --random-output-len 128

The failure is shown in https://gist.github.com/vanbasten23/637bd3af3c3946c00e6fb1ce27892e46; it fails in tpu_worker.py, outside the kernel.

But I don't think it should block this PR. This PR intends to add sliding window/logit soft-capping support. @bvrockwell The gemma-3 27B failure is a separate issue and can be fixed in a later PR.

@vanbasten23 changed the title from "[TPU] Add sliding window and logit soft capping support for TPU." to "[TPU] Support sliding window and logit soft capping in the paged attention kernel for TPU." Apr 2, 2025
@bvrockwell (Contributor) replied:

> I tested the 27B and it fails […] But I don't think it should block this PR; the gemma-3 27B failure is a separate issue and can be fixed in a later PR.

Ok, understood @vanbasten23, looking forward to the multi-chip Gemma 3 PR! Thanks for adding sliding window and logit soft capping.


mergify bot commented Apr 3, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @vanbasten23.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label Apr 3, 2025
@robertgshaw2-redhat enabled auto-merge (squash) April 3, 2025 00:20
mergify bot removed the needs-rebase label Apr 3, 2025
github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) Apr 3, 2025
Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com> (sign-off trailer repeated across the 8 pushed commits)
auto-merge was automatically disabled April 3, 2025 02:32 (head branch was pushed to by a user without write access)

@vanbasten23 force-pushed the xiowei/sw_logit_cap branch from b342a46 to 352fa12 on April 3, 2025 02:32
@vanbasten23 (Collaborator, Author) commented:

Hi @robertgshaw2-redhat, I resolved the merge conflicts but auto-merge was canceled. Could you enable auto-merge again? Thanks!

@robertgshaw2-redhat merged commit b6be6f8 into vllm-project:main Apr 3, 2025
31 checks passed
@mgoin (Member) commented Apr 4, 2025:

FYI the TPU V1 test was failing when this was merged

@vanbasten23 (Collaborator, Author) replied:

> FYI the TPU V1 test was failing when this was merged

Good catch. This test was failing even before this PR; for example, in #15586 the TPU CI soft-failed with the same error.

I think we should make this a hard failure in the TPU CI rather than a soft one.

@vanbasten23 (Collaborator, Author) added:

The failing test seems to have been fixed later, for example in #16041:


======================================================================= warnings summary =======================================================================
tests/tpu/test_compilation.py::test_tpu_compilation
  :0: UserWarning: /usr/local/lib/python3.10/site-packages/depyf/explain/enable_debugging.py:163: You are trying to debug `torch.compile`. Please make sure the code runs multiple times to cover all the possible branches.

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================================================== 1 passed, 1 warning in 128.19s (0:02:08) ===========================================================


Alex4210987 pushed a commit to LeiWang1999/vllm-bitblas that referenced this pull request Apr 5, 2025
…ntion kernel for TPU. (vllm-project#15732)

Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com>
Signed-off-by: xinyuxiao <xinyuxiao2024@gmail.com>
lulmer pushed a commit to lulmer/vllm that referenced this pull request Apr 7, 2025
…ntion kernel for TPU. (vllm-project#15732)

Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com>
Signed-off-by: Louis Ulmer <ulmerlouis@gmail.com>
nishith-fujitsu pushed a commit to nishith-fujitsu/vllm that referenced this pull request Apr 9, 2025
…ntion kernel for TPU. (vllm-project#15732)

Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com>
Labels: ci/build, ready, tpu, v1