[TPU][V1] Fix Sampler recompilation #15309
Conversation
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀
Hi @NickLucche, one thing I noticed is that the host overhead becomes larger: previously the average was 3.5 ms, now it is about 5 ms. You can set the environment variable
Hi @NickLucche, I found the sampler test is not added to TPU CI. Also, it's running in
Thanks for looking into it. I had to move the sampling pre-processing (slicing) from device to host. I'll look into optimizing it, but on GPU we just keep everything on device and do the slicing there, which may cause recompilation depending on the number of requests being scheduled. TL;DR: it's a trade-off in integrating the existing logic.
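For context, a minimal sketch of the trade-off (names like `padded_num_reqs` and `temperatures_cpu` are illustrative assumptions, not the PR's actual code): slicing on device with the exact request count changes tensor shapes from step to step and can trigger XLA recompilation, while slicing on host to a padded bucket size keeps device shapes fixed at the cost of some host work.

```python
import torch

def device_side_slice(temperatures: torch.Tensor, num_reqs: int) -> torch.Tensor:
    # GPU-style approach: slice on device to the exact number of scheduled
    # requests. The result's shape changes whenever num_reqs changes, so XLA
    # would trace and compile a new graph for each distinct value.
    return temperatures[:num_reqs]

def host_side_slice(temperatures_cpu: torch.Tensor, padded_num_reqs: int,
                    device: torch.device) -> torch.Tensor:
    # TPU-friendly approach: slice on host to one of a few pre-compiled padded
    # sizes, then transfer. Device-side shapes stay fixed, avoiding
    # recompilation at the cost of extra per-step host overhead.
    return temperatures_cpu[:padded_num_reqs].to(device)
```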
Yes, but I'd rather change the logic then, probably merging the two tests into one.
… code Signed-off-by: NickLucche <nlucches@redhat.com>
@lsy323 there's still a very small graph being re-compiled at runtime that I haven't been able to track down. I'll have to hold off on that test until it's figured out.
Looks good to me, just needs some cleanup. Have you validated the performance win with a smoke test?
indices = torch.zeros(
    num_reqs_to_sample,
    dtype=torch.int32,
    device=device,
)
xm.mark_step()
Worth leaving a comment why this isn't at the end of the loop
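One plausible rationale, not confirmed in this thread: the `xm.mark_step()` placed right after the allocation cuts the trace so the dummy inputs are materialized in their own graph, and the sampling call that follows gets compiled as the standalone graph actually hit at serving time, when inputs already live on device. A rough sketch of such a warmup loop, with assumed names (`warmup_request_buckets`, an argmax stand-in for `sample_from_hidden`):

```python
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
warmup_request_buckets = [8, 16, 32]  # illustrative padded request counts

for num_reqs_to_sample in warmup_request_buckets:
    indices = torch.zeros(num_reqs_to_sample, dtype=torch.int32, device=device)
    # Cut the trace here so the allocation above is its own graph; a single
    # mark_step() at the end of the loop would instead fuse allocation and
    # sampling into one graph that never occurs in production.
    xm.mark_step()
    logits = torch.randn(num_reqs_to_sample, 32000, device=device)  # stand-in input
    sampled = torch.argmax(logits, dim=-1)  # stand-in for sample_from_hidden
    sampled_cpu = sampled.cpu()  # pulling to CPU forces execution, hence compilation
xm.wait_device_ops()
```

The final `.cpu()` mirrors the issue mentioned in the PR description: without moving the output to host, XLA may defer execution and the sampling graph never gets compiled during warmup.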
@NickLucche, we can use
Signed-off-by: NickLucche <nlucches@redhat.com> Signed-off-by: Wes Medford <wryanmedford@gmail.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: NickLucche <nlucches@redhat.com> Signed-off-by: Louis Ulmer <ulmerlouis@gmail.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
Fix XLA recompilations introduced by an earlier change. Namely, it factors out the on-device slicing that is happening inside `InputBatch._make_sampling_metadata`, as well as an issue with XLA not pre-compiling `sample_from_hidden` when its output isn't moved to CPU.

Update: sampling tensors are now pre-allocated and managed in `input_batch` to reduce waste; `TPUSupportedSamplingMetadata` is now a simpler wrapper around tensors managed in `input_batch`.
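A rough sketch of what such a wrapper could look like (field and attribute names here are assumptions, not the actual vLLM API): the metadata object no longer allocates or slices device tensors itself; it simply wraps padded, fixed-size slices of tensors that `input_batch` already manages.

```python
from dataclasses import dataclass

import torch

@dataclass
class SamplingMetadataSketch:
    # Padded, fixed-size device tensors; the backing storage lives in input_batch.
    temperature: torch.Tensor
    top_p: torch.Tensor
    top_k: torch.Tensor
    all_greedy: bool

    @classmethod
    def from_input_batch(cls, input_batch, padded_num_reqs: int,
                         device: torch.device) -> "SamplingMetadataSketch":
        # Slice the host-side (CPU) tensors to a pre-compiled padded size and
        # move them to device; shapes stay constant across steps, so XLA does
        # not need to recompile the sampling graph.
        return cls(
            temperature=input_batch.temperature_cpu[:padded_num_reqs].to(device),
            top_p=input_batch.top_p_cpu[:padded_num_reqs].to(device),
            top_k=input_batch.top_k_cpu[:padded_num_reqs].to(device),
            all_greedy=input_batch.all_greedy,
        )
```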