[V1][PP] Do not block engine core when no requests to schedule #14585
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small, essential subset of tests runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can add the ready label to the PR. 🚀
Thanks for the fix
```python
scheduled_batch = (scheduler_output is not None
                   and scheduler_output.total_num_scheduled_tokens > 0)
```
```python
# If no more requests can be scheduled and the job queue is not empty,
```
QQ: If no more requests can be scheduled now, there may still be new ones added in the near future. So we have two options: block here waiting for the oldest batch to finish, or move on to the next loop iteration. We preferred blocking here. Would this always be the best strategy?
We block here only when the batch queue is full (so all stages are busy). In this case, if we scheduled one more batch, the scheduling overhead of that batch could be hidden, which might be better. However, that batch could not include any newly arriving requests, so I'm not sure whether this would be optimal.
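For reference, here is a minimal sketch of the stepping logic being discussed, assuming a deque-based batch queue; `scheduler`, `executor`, and the method names are illustrative stand-ins rather than the exact vLLM internals.

```python
from collections import deque

def step_with_batch_queue(scheduler, executor, batch_queue: deque,
                          max_concurrent_batches: int):
    scheduler_output = None
    # Try to schedule a new batch only while a pipeline stage is free.
    if len(batch_queue) < max_concurrent_batches:
        scheduler_output = scheduler.schedule()
        if scheduler_output.total_num_scheduled_tokens > 0:
            future = executor.execute_model_async(scheduler_output)
            batch_queue.appendleft((future, scheduler_output))

    scheduled_batch = (scheduler_output is not None
                       and scheduler_output.total_num_scheduled_tokens > 0)

    # Nothing could be scheduled and batches are in flight: all stages
    # are busy, so block on the oldest batch instead of spinning.
    if not scheduled_batch and batch_queue:
        oldest_future, oldest_output = batch_queue.pop()
        model_output = oldest_future.result()  # no timeout; see below
        return scheduler.update_from_output(oldest_output, model_output)
    return None
```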
```python
# Blocking until the first result is available.
model_output = future.result()
```
Don't we need a timeout here?
No, because in this case we have nothing else to do and can just wait for the first batch to finish (similar to PP=1).
But in this case, the engine will not add any requests to the scheduler in the meantime, right?
Yes, but that doesn't matter because all GPUs are busy. On the other hand, if we don't wait here and instead add new requests to the scheduler, we suffer from two issues (see the sketch below):
- The newly added requests cannot be scheduled anyway, since there are no free resources.
- If the oldest batch finishes while we are busy adding new requests, consuming that batch's output is delayed.
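For contrast, a sketch of the rejected alternative under the same assumed names: polling the oldest batch while adding requests, which runs into both issues above.

```python
import time

def poll_instead_of_blocking(batch_queue, scheduler, try_get_new_request):
    oldest_future, oldest_output = batch_queue[-1]  # oldest in-flight batch
    while not oldest_future.done():
        # Issue 1: a request added now cannot be scheduled anyway,
        # since all pipeline stages are busy.
        req = try_get_new_request()
        if req is not None:
            scheduler.add_request(req)
        # Issue 2: if the batch completes inside this loop body, its
        # output sits unconsumed until the next iteration.
        time.sleep(0.001)
    return oldest_future.result()
```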
I see. Thanks for the explanation!
Oh, by the way: what about the FSM compilation? Can it be done in parallel with the execution?
I guess so. Although we won't be able to schedule the request before its FSM is compiled, we should be able to kick off the compilation first.
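A hypothetical sketch of what that could look like; `compile_fsm`, `needs_structured_output`, and `fsm_future` are assumed names for illustration, not vLLM's actual API.

```python
from concurrent.futures import ThreadPoolExecutor

fsm_executor = ThreadPoolExecutor(max_workers=1)

def compile_fsm(grammar):
    """Placeholder for the (potentially slow) FSM compilation."""
    ...

def add_request(waiting_queue, request):
    if request.needs_structured_output:
        # Kick off compilation right away so it overlaps with whatever
        # batches are currently executing.
        request.fsm_future = fsm_executor.submit(compile_fsm, request.grammar)
    waiting_queue.append(request)

def is_schedulable(request):
    # The scheduler still skips the request until its FSM is ready.
    fut = getattr(request, "fsm_future", None)
    return fut is None or fut.done()
```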
LGTM! Nice optimization!
…project#14585)
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
Signed-off-by: Louis Ulmer <ulmerlouis@gmail.com>
The current engine step with batch queue blocks the busy loop for POLLING_TIMEOUT_S when the batch queue is empty. This results in unstable and possibly longer TTFT when the first request comes in. This PR solves the issue.

Benchmark results (elided from this extract) compare V0, V1 (main), and V1 (this PR).
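An illustrative reconstruction of the before/after behavior described above; POLLING_TIMEOUT_S comes from the PR text, while its value and the surrounding structure here are assumptions.

```python
import queue

POLLING_TIMEOUT_S = 2.5  # assumed value, for illustration only

def old_step(batch_queue: queue.Queue):
    # Before: with an empty batch queue the step still waited up to
    # POLLING_TIMEOUT_S for a batch to appear, so the busy loop could
    # stall that long before noticing a newly arrived first request.
    try:
        future = batch_queue.get(timeout=POLLING_TIMEOUT_S)
    except queue.Empty:
        return None
    return future.result()

def new_step(batch_queue: queue.Queue):
    # After: an empty batch queue returns control immediately; the step
    # only blocks (without a timeout) when batches are actually in
    # flight and nothing more can be scheduled.
    if batch_queue.empty():
        return None
    return batch_queue.get().result()
```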
cc @ruisearch42