[TPU] optimize the all-reduce performance #15903

yaochengji · 2025-04-01T20:10:09Z

Before this PR, all-reduce performance was suboptimal for two reasons:

On v6e-8, the XLA compiler unexpectedly applies a 2D-ring strategy, whereas a 1D ring is expected.
The ring order cannot be adjusted automatically.
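
For readers unfamiliar with the term, the idea of a 1D-ring all-reduce can be illustrated with a toy simulation in plain Python. This is only a sketch of the textbook reduce-scatter plus all-gather scheme, not a description of what XLA or libtpu do internally; the function name `ring_all_reduce` and the `ring_order` parameter are made up for illustration.

```python
def ring_all_reduce(values, ring_order):
    """Simulate a sum all-reduce over a 1D ring (toy model, not XLA's code).

    values[d] is the list of chunks held by device d; every device must hold
    exactly len(ring_order) chunks. ring_order lists device ids in ring order.
    """
    n = len(ring_order)
    data = {d: list(values[d]) for d in ring_order}

    # Reduce-scatter: n - 1 steps. At each step, the device at ring position
    # p sends chunk (p - step) % n to its right neighbour, which adds it in.
    for step in range(n - 1):
        sends = []
        for pos, dev in enumerate(ring_order):
            chunk = (pos - step) % n
            sends.append((ring_order[(pos + 1) % n], chunk, data[dev][chunk]))
        for dst, chunk, val in sends:  # apply all sends "simultaneously"
            data[dst][chunk] += val

    # Position p now holds the fully reduced chunk (p + 1) % n.
    # All-gather: n - 1 more steps that pass the finished chunks around.
    for step in range(n - 1):
        sends = []
        for pos, dev in enumerate(ring_order):
            chunk = (pos + 1 - step) % n
            sends.append((ring_order[(pos + 1) % n], chunk, data[dev][chunk]))
        for dst, chunk, val in sends:
            data[dst][chunk] = val

    return data


# Three devices, visited in the ring order 0 -> 2 -> 1 -> 0.
vals = {0: [1, 2, 3], 1: [10, 20, 30], 2: [100, 200, 300]}
result = ring_all_reduce(vals, ring_order=[0, 2, 1])
print(result[0])  # [111, 222, 333]
```

The `ring_order` argument only changes which neighbour each device talks to. On real hardware that choice decides which physical links carry the traffic, which is presumably why the ring order matters for performance here.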

github-actions · 2025-04-01T20:10:20Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

yaochengji · 2025-04-01T20:12:58Z

I have two questions:

Currently I wrap the all-reduce in a PyTorch Python custom op to make it compatible with dynamo. Should I wrap it in vllm/distributed/parallel_state.py instead?
I need to set another environment variable, LIBTPU_INIT_ARGS="--xla_tpu_force_1d_allreduce_at_chunk_count=1", for the performance optimization. Should I add it in the code?

alexm-redhat · 2025-04-01T20:18:48Z

@yaochengji thanks for this important PR!

About dynamo, I don't have a strong opinion there.
About the libtpu flag, I think you can detect inside init_device() of tpu_worker that the device is a v6 and simply set the env var there, similar to the code below (which sets PJRT_DEVICE):

def init_device(self):
    os.environ["PJRT_DEVICE"] = "TPU"
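
One way this suggestion could look is sketched below. It is an assumption-laden illustration, not the code merged in this PR: the helper name `add_libtpu_flag` is hypothetical, and how to detect a v6e device is left out. The sketch appends the flag to any existing `LIBTPU_INIT_ARGS` value instead of overwriting it, and leaves a user-provided setting of the same flag untouched.

```python
import os

# Flag quoted in the question above.
FORCE_1D_FLAG = "--xla_tpu_force_1d_allreduce_at_chunk_count=1"


def add_libtpu_flag(flag, env=None):
    """Append `flag` to LIBTPU_INIT_ARGS unless that flag name is already set.

    Hypothetical helper for illustration. `env` defaults to os.environ and
    can be any dict-like mapping, which makes the helper easy to test.
    """
    env = os.environ if env is None else env
    current = env.get("LIBTPU_INIT_ARGS", "")
    name = flag.split("=", 1)[0]
    # Respect a value the user already chose for the same flag.
    if any(tok.split("=", 1)[0] == name for tok in current.split()):
        return current
    env["LIBTPU_INIT_ARGS"] = f"{current} {flag}".strip()
    return env["LIBTPU_INIT_ARGS"]


# Example with a plain dict standing in for the process environment.
fake_env = {"LIBTPU_INIT_ARGS": "--some_other_flag=1"}
print(add_libtpu_flag(FORCE_1D_FLAG, fake_env))
```

In init_device() the call would presumably sit next to the PJRT_DEVICE assignment, guarded by whatever v6 check the worker uses, and would need to run before the TPU runtime is initialized for the flag to take effect.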

yaochengji · 2025-04-01T20:28:58Z

Thanks @alexm-redhat for your suggestion.

Hi @youkaichao, do you have any suggestions on the dynamo custom op?

yaochengji · 2025-04-01T20:29:02Z

My local experiments show that throughput improves from ~4.2 reqs/s to ~4.9 reqs/s (roughly 17%) for Llama 70B on 8 v6e chips.

youkaichao · 2025-04-02T13:47:01Z

vllm/distributed/device_communicators/tpu_communicator.py

    if USE_RAY:
        from vllm.executor import ray_utils

+@torch.library.custom_op("tpu::all_reduce", mutates_args=())

Is it possible to use the custom op call in vLLM directly, e.g. by extending this line to include TPU?

vllm/vllm/distributed/parallel_state.py, line 222 in 44f9905:

self.use_custom_op_call = current_platform.is_cuda_alike()

I didn't do it initially because I remember TPU has some custom dynamo-related logic.

yaochengji · 2025-04-02

Thanks for your suggestion!

I removed the custom op in tpu_communicator.py and now use the existing custom op path in parallel_state.py for TPU.
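
The change discussed in this review comment can be sketched roughly as follows. The merged diff is not shown in this thread, so this is an assumption: `FakePlatform` is a hypothetical stand-in for vLLM's `current_platform`, and `is_tpu()` is assumed to mirror `is_cuda_alike()`.

```python
class FakePlatform:
    """Hypothetical stand-in for vLLM's current_platform object."""

    def __init__(self, name):
        self.name = name

    def is_cuda_alike(self):
        # Assumption: CUDA and ROCm count as "CUDA-alike".
        return self.name in ("cuda", "rocm")

    def is_tpu(self):
        return self.name == "tpu"


def should_use_custom_op_call(platform):
    # Before: only CUDA-alike platforms used the custom op call path.
    # After (as suggested in the review): TPU is included as well.
    return platform.is_cuda_alike() or platform.is_tpu()


for name in ("cuda", "tpu", "cpu"):
    print(name, should_use_custom_op_call(FakePlatform(name)))
```

Routing TPU through the shared check keeps the dynamo-compatible wrapper in one place instead of defining a second custom op inside the TPU communicator.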

alexm-redhat

LGTM!

robertgshaw2-redhat · 2025-04-03T00:15:53Z

Magical incantations!

robertgshaw2-redhat · 2025-04-03T00:16:57Z

NOTE: the failing V1 test is fixed by #15969

yaochengji marked this pull request as draft April 1, 2025 20:10

yaochengji requested a review from alexm-redhat April 1, 2025 20:10

mergify bot added the tpu Related to Google TPUs label Apr 1, 2025

yaochengji requested a review from youkaichao April 1, 2025 20:27

yaochengji marked this pull request as ready for review April 2, 2025 06:10

yaochengji requested review from WoosukKwon, robertgshaw2-redhat, njhill, ywang96 and comaniac as code owners April 2, 2025 06:10

mergify bot added the v1 label Apr 2, 2025

youkaichao reviewed Apr 2, 2025

alexm-redhat approved these changes Apr 2, 2025

alexm-redhat enabled auto-merge (squash) April 2, 2025 17:06

github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Apr 2, 2025

yaochengji added 3 commits April 2, 2025 20:28

optimize all-reduce

3876d37

Signed-off-by: Chengji Yao <chengjiyao@google.com>

fix

87a58ca

Signed-off-by: Chengji Yao <chengjiyao@google.com>

improve custom op wrapper

91628ee

Signed-off-by: Chengji Yao <chengjiyao@google.com>

auto-merge was automatically disabled April 2, 2025 22:04
Head branch was pushed to by a user without write access

yaochengji force-pushed the chengji/optimize-allreduce branch from 5edfa49 to 91628ee Compare April 2, 2025 22:04

robertgshaw2-redhat approved these changes Apr 3, 2025

robertgshaw2-redhat enabled auto-merge (squash) April 3, 2025 00:15

robertgshaw2-redhat merged commit 01b6113 into vllm-project:main Apr 3, 2025
38 checks passed

Potabk mentioned this pull request Apr 3, 2025

[Bugfix]Lazy import vllm config vllm-project/vllm-ascend#462

Merged

Alex4210987 pushed a commit to LeiWang1999/vllm-bitblas that referenced this pull request Apr 5, 2025

[TPU] optimize the all-reduce performance (vllm-project#15903)

fbff907

Signed-off-by: Chengji Yao <chengjiyao@google.com> Signed-off-by: xinyuxiao <xinyuxiao2024@gmail.com>

lulmer pushed a commit to lulmer/vllm that referenced this pull request Apr 7, 2025

[TPU] optimize the all-reduce performance (vllm-project#15903)

5280c49

Signed-off-by: Chengji Yao <chengjiyao@google.com> Signed-off-by: Louis Ulmer <ulmerlouis@gmail.com>

nishith-fujitsu pushed a commit to nishith-fujitsu/vllm that referenced this pull request Apr 9, 2025

[TPU] optimize the all-reduce performance (vllm-project#15903)

f3f1e6f

Signed-off-by: Chengji Yao <chengjiyao@google.com>

ckhordiasma mentioned this pull request Apr 17, 2025

[do not merge] pr test for nm changes into 2.20 red-hat-data-services/vllm#107

Closed
