[TPU] Support sliding window and logit soft capping in the paged attention kernel for TPU. #15732


Merged

Conversation

@vanbasten23 (Collaborator) commented Mar 28, 2025

After adding sliding window and logit soft-capping support to the paged attention Pallas kernel, this PR adds sliding window and logit soft-capping at the vLLM level, so that we can support the Gemma models.

Test plans:

  • pytest -s -v /workspace/vllm/tests/v1/tpu/test_pallas.py
  • pytest -v -s tests/entrypoints/llm/test_accuracy.py::test_lm_eval_accuracy_v1_engine 2>&1 | tee out.txt

The Gemma models that we care about are google/gemma-3-27b-it and google/gemma-3-1b-it.
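For context, here is a minimal NumPy sketch of what the two features do to attention logits. This is an illustration only; the function and parameter names are made up and this is not the actual Pallas kernel code, which applies the equivalent logic block by block over the paged KV cache rather than materializing a full score matrix:

import numpy as np

def masked_softmax(scores, q_pos, kv_pos, sliding_window=None, soft_cap=None):
    """Illustrative only: soft-cap logits, apply causal + sliding-window
    masking, then softmax.

    scores: [num_q, num_kv] raw attention logits (q @ k.T / sqrt(head_dim)).
    q_pos, kv_pos: absolute token positions of the queries and keys.
    """
    if soft_cap is not None:
        # Logit soft capping (as in the Gemma family): smoothly squash
        # logits into (-soft_cap, soft_cap) with tanh.
        scores = soft_cap * np.tanh(scores / soft_cap)
    # Causal mask: a query attends only to keys at or before its position.
    mask = kv_pos[None, :] <= q_pos[:, None]
    if sliding_window is not None:
        # Sliding window: additionally restrict each query to the most
        # recent `sliding_window` key positions.
        mask &= kv_pos[None, :] > (q_pos[:, None] - sliding_window)
    # Masked positions get -inf, so they contribute zero softmax weight.
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

# Example: 2 queries at positions 6 and 7 over 8 cached keys.
w = masked_softmax(np.random.randn(2, 8),
                   q_pos=np.arange(6, 8), kv_pos=np.arange(8),
                   sliding_window=4, soft_cap=50.0)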

cc @miladm


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

mergify bot added the ci/build, v1, and tpu (Related to Google TPUs) labels Mar 28, 2025

mergify bot commented Mar 28, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @vanbasten23.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label Mar 28, 2025
@brittrock commented:

This is great @vanbasten23, thanks for piping this in! Would we be able to add a Gemma 3 [text-only] test, please?

Cc @robertgshaw2-redhat @yarongmu-google @yaochengji

EXPECTED_VALUE = 0.58
EXPECTED_VALUES = {
    "Qwen/Qwen2-1.5B-Instruct": 0.58,
    "google/gemma-3-1b-it": 0.25,
Collaborator commented:

Have you compared this score with GPU?

@vanbasten23 (Collaborator, Author) replied:

Thanks for the suggestion! I ran the test on GPU and got the same accuracy (0.25) for gemma-3-1b (test output on GPU: https://gist.github.com/vanbasten23/eda147d28db5f11187178771372ee39f).
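As an aside, here is a hedged sketch of how a per-model expected-value table like the one in the hunk above is typically consumed by an accuracy test. The helper name and tolerance are hypothetical, not the actual code in tests/entrypoints/llm/test_accuracy.py:

EXPECTED_VALUES = {
    "Qwen/Qwen2-1.5B-Instruct": 0.58,
    "google/gemma-3-1b-it": 0.25,
}
RTOL = 0.03  # hypothetical tolerance around the reference score

def check_accuracy(model_name: str, measured: float) -> None:
    # Compare the measured lm-eval score against the per-model reference,
    # failing if it drops below the allowed tolerance.
    expected = EXPECTED_VALUES[model_name]
    assert measured >= expected - RTOL, (
        f"{model_name}: measured {measured:.3f}, expected ~{expected:.3f}")

check_accuracy("google/gemma-3-1b-it", 0.26)  # passes

A per-model table matters here because gemma-3-1b-it legitimately scores much lower than Qwen2-1.5B on this benchmark, as the GPU run above confirms.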

@vanbasten23 requested a review from yaochengji April 2, 2025 00:03
@yaochengji (Collaborator) left a comment:

LGTM, thanks!

Note that the CI is not green.

@DarkLight1337 (Member) commented:

You should merge from main to fix the docker build.

@NickLucche (Contributor) left a comment:

Great job!
I think some of the commits need signing; please take a look.

@vanbasten23 force-pushed the xiowei/sw_logit_cap branch from e7bf947 to be06f10 on April 2, 2025 17:25
@bvrockwell (Contributor) commented Apr 2, 2025:

@yaochengji we shouldn't merge this until @vanbasten23 validates 27B as well. Let's wait for these results.

@bvrockwell (Contributor) left a comment:

Waiting for multi-chip validation, thank you!

@vanbasten23 (Collaborator, Author) replied:

> @yaochengji we shouldn't merge this until @vanbasten23 validates 27B as well. Let's wait for these results.

I tested the 27B and it fails:

MODEL=google/gemma-3-27b-it
VLLM_USE_V1=1 vllm serve $MODEL --seed 42 --disable-log-requests --gpu-memory-utilization 0.95 --max-num-batched-tokens 1024 --max-num-seqs 512 --tensor-parallel-size 4 --max-model-len 2048 &
VLLM_USE_V1=1 python3 benchmarks/benchmark_serving.py --model $MODEL  --dataset-name random --random-input-len 1800 --random-output-len 128

The failure is shown in https://gist.github.com/vanbasten23/637bd3af3c3946c00e6fb1ce27892e46; it fails in tpu_worker.py, outside the kernel.

But I don't think it should block this PR. This PR intends to add sliding window/logit soft-capping support. @bvrockwell The gemma-3 27B failure is a separate issue and can be fixed in a later PR.

@vanbasten23 changed the title from "[TPU] Add sliding window and logit soft capping support for TPU." to "[TPU] Support sliding window and logit soft capping in the paged attention kernel for TPU." Apr 2, 2025
@bvrockwell (Contributor) replied:

> I tested the 27B and it fails […] But I don't think it should block this PR; the gemma-3 27B failure is a separate issue and can be fixed in a later PR.

Ok, understood @vanbasten23, looking forward to the multi-chip Gemma 3 PR! Thanks for adding sliding window and logit soft capping.


mergify bot commented Apr 3, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @vanbasten23.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label Apr 3, 2025
@robertgshaw2-redhat enabled auto-merge (squash) April 3, 2025 00:20
mergify bot removed the needs-rebase label Apr 3, 2025
github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) Apr 3, 2025
Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com> (sign-off trailer repeated across the 8 pushed commits)
auto-merge was automatically disabled April 3, 2025 02:32 (head branch was pushed to by a user without write access)

@vanbasten23 force-pushed the xiowei/sw_logit_cap branch from b342a46 to 352fa12 on April 3, 2025 02:32
@vanbasten23 (Collaborator, Author) commented:

Hi @robertgshaw2-redhat, I resolved the merge conflicts but auto-merge was canceled. Could you enable auto-merge again? Thanks!

@robertgshaw2-redhat merged commit b6be6f8 into vllm-project:main Apr 3, 2025
31 checks passed
@mgoin (Member) commented Apr 4, 2025:

FYI the TPU V1 test was failing when this was merged

@vanbasten23 (Collaborator, Author) replied:

> FYI the TPU V1 test was failing when this was merged

Good catch. This test was failing even before this PR; for example, in #15586 the TPU CI soft-failed with the same error.

I think we should make this a hard failure in the TPU CI rather than a soft one.

@vanbasten23 (Collaborator, Author) added:

The failing test seems to have been fixed later, for example in #16041:


======================================================================= warnings summary =======================================================================
tests/tpu/test_compilation.py::test_tpu_compilation
  :0: UserWarning: /usr/local/lib/python3.10/site-packages/depyf/explain/enable_debugging.py:163: You are trying to debug `torch.compile`. Please make sure the code runs multiple times to cover all the possible branches.

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================================================== 1 passed, 1 warning in 128.19s (0:02:08) ===========================================================


Alex4210987 pushed a commit to LeiWang1999/vllm-bitblas that referenced this pull request Apr 5, 2025
…ntion kernel for TPU. (vllm-project#15732)

Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com>
Signed-off-by: xinyuxiao <xinyuxiao2024@gmail.com>
lulmer pushed a commit to lulmer/vllm that referenced this pull request Apr 7, 2025
…ntion kernel for TPU. (vllm-project#15732)

Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com>
Signed-off-by: Louis Ulmer <ulmerlouis@gmail.com>
nishith-fujitsu pushed a commit to nishith-fujitsu/vllm that referenced this pull request Apr 9, 2025
…ntion kernel for TPU. (vllm-project#15732)

Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com>
Labels: ci/build, ready, tpu, v1