[ci] feat: make the test_torchrun_example run with tp=2, external_dp=2 #15172

vermouth1992 · 2025-03-20T00:03:08Z

This PR modifies distributed/test_torchrun_example.py to run on two nodes with 4 GPUs in total, with tp=2 and external_dp=2. This makes sure that the SPMD-based runner will not break when there is an external dp (e.g., replicas of vLLM rollouts).
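To make the tp=2, external_dp=2 layout concrete, here is a minimal sketch of how 4 torchrun ranks could be split into data-parallel replicas and tensor-parallel ranks. It assumes contiguous global ranks form one TP group; the names `RankLayout`, `rank_layout`, and `shard_prompts` are hypothetical and are not taken from the vLLM test itself.

```python
from typing import List, NamedTuple


class RankLayout(NamedTuple):
    """Position of one process in a (dp, tp) grid. Hypothetical helper type."""
    dp_rank: int
    tp_rank: int


def rank_layout(global_rank: int, tp_size: int, dp_size: int) -> RankLayout:
    """Map a global rank to (dp_rank, tp_rank).

    Assumption: ranks [0, tp_size) form replica 0, the next tp_size ranks
    form replica 1, and so on.
    """
    world_size = tp_size * dp_size
    if not 0 <= global_rank < world_size:
        raise ValueError(f"rank {global_rank} is outside world size {world_size}")
    return RankLayout(dp_rank=global_rank // tp_size,
                      tp_rank=global_rank % tp_size)


def shard_prompts(prompts: List[str], dp_rank: int, dp_size: int) -> List[str]:
    """Give each external DP replica a disjoint, strided slice of the prompts."""
    return prompts[dp_rank::dp_size]


if __name__ == "__main__":
    # 2 nodes x 2 processes per node = 4 ranks, with tp=2 and external_dp=2.
    for rank in range(4):
        print(rank, rank_layout(rank, tp_size=2, dp_size=2))
```

Under this assumption, ranks 0 and 1 would serve one replica and ranks 2 and 3 the other, so every rank inside a replica must receive the same prompts while different replicas may receive different ones.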

github-actions · 2025-03-20T00:03:16Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

Signed-off-by: Chi Zhang <zhangchi.usc1992@bytedance.com>

youkaichao · 2025-03-20T14:46:34Z

.buildkite/test-pipeline.yaml

@@ -493,6 +493,8 @@ steps:
    - VLLM_MULTI_NODE=1 pytest -v -s distributed/test_pipeline_parallel.py
  - # the following commands are for the second node, with ip 192.168.10.11 (ray environment already set up)
    - VLLM_TEST_SAME_HOST=0 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_same_node.py | grep 'Same node test passed'
+  - VLLM_USE_V1=1 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_torchrun_example.py


This is a 2-node test, so you need to copy the command so that it runs on both nodes.

You can put it in the 4 GPUs tests.

Signed-off-by: youkaichao <youkaichao@gmail.com>

youkaichao

thanks for the investigation!

vllm-project#15172) Signed-off-by: Chi Zhang <zhangchi.usc1992@bytedance.com> Signed-off-by: youkaichao <youkaichao@gmail.com> Co-authored-by: youkaichao <youkaichao@gmail.com>

vllm-project#15172) Signed-off-by: Chi Zhang <zhangchi.usc1992@bytedance.com> Signed-off-by: youkaichao <youkaichao@gmail.com> Co-authored-by: youkaichao <youkaichao@gmail.com> Signed-off-by: Louis Ulmer <ulmerlouis@gmail.com>


mergify bot added the ci/build label Mar 20, 2025

Update test-pipeline.yaml

1f7cc94

Signed-off-by: Chi Zhang <zhangchi.usc1992@bytedance.com>

vermouth1992 force-pushed the patch-1 branch from 73ead34 to 1f7cc94 on March 20, 2025 00:10

youkaichao reviewed Mar 20, 2025


youkaichao added 3 commits March 20, 2025 22:50

fix tests

30ecf94

Signed-off-by: youkaichao <youkaichao@gmail.com>

fix init

5ca6c60

Signed-off-by: youkaichao <youkaichao@gmail.com>

test both v0 and v1

91e4ff6

Signed-off-by: youkaichao <youkaichao@gmail.com>

youkaichao approved these changes Mar 20, 2025


youkaichao merged commit 086b568 into vllm-project:main Mar 20, 2025
11 of 13 checks passed

ckhordiasma mentioned this pull request Apr 17, 2025

[do not merge] pr test for nm changes into 2.20 red-hat-data-services/vllm#107

Closed
