[Feature]: Implement Concurrent Partial Prefills In V1 Engine #14003

Open
1 task done
robertgshaw2-redhat opened this issue Feb 28, 2025 · 4 comments
Labels
feature request (New feature or request), help wanted (Extra attention is needed)

Comments

@robertgshaw2-redhat
Collaborator

🚀 The feature, motivation and pitch

In V0, we support concurrent partial prefills to avoid high TTFT when long requests are being prefilled. We should implement this in V1 as well.

cc @WoosukKwon

Alternatives

No response

Additional context

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
@robertgshaw2-redhat robertgshaw2-redhat added the feature request and help wanted labels Feb 28, 2025
@zhuohan123 zhuohan123 moved this to Todo in Onboarding Tasks Mar 7, 2025
@plops655

I can take this up.

@ccw1996

ccw1996 commented Mar 20, 2025

I'm interested in contributing this.

@houseroad
Collaborator

Shall we mark this as done? In the context of chunked prefill, it probably doesn't make sense to add support for max_num_partial_prefills and max_long_partial_prefills. cc: @comaniac

@comaniac
Collaborator

> Shall we mark this as done? In the context of chunked prefill, it probably doesn't make sense to add support for max_num_partial_prefills and max_long_partial_prefills. cc: @comaniac

IIRC in v0 we still apply max_num_partial_prefills, for example, even when chunked prefill is enabled, but I'm not fully sure which one (max_num_partial_prefills or long_prefill_token_threshold) has higher priority. This is a topic worth discussing. cc @joerunde @njhill
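
To make the discussion concrete, here is a toy sketch of how the three knobs mentioned in this thread could interact within one scheduling step. It is not vLLM's scheduler: the `Request` class, the `schedule_prefill_step` function, and the policy of splitting the token budget evenly across prefill slots are all assumptions made for illustration. Only the parameter names are taken from the comments above.

```python
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical request record, for illustration only.
    name: str
    remaining_prompt_tokens: int


def schedule_prefill_step(
    waiting: list[Request],
    token_budget: int,
    max_num_partial_prefills: int,
    max_long_partial_prefills: int,
    long_prefill_token_threshold: int,
) -> dict[str, int]:
    """Pick prefill chunks for one step; returns {request name: tokens}.

    Assumes max_num_partial_prefills >= 1 and a FIFO waiting queue.
    """
    scheduled: dict[str, int] = {}
    num_long = 0
    # Assumed policy: each prefill slot gets an equal share of the budget.
    per_slot = token_budget // max_num_partial_prefills
    for req in waiting:
        if len(scheduled) == max_num_partial_prefills:
            break  # every prefill slot is taken
        if req.remaining_prompt_tokens > long_prefill_token_threshold:
            if num_long == max_long_partial_prefills:
                continue  # skip so a shorter request can use the slot
            num_long += 1
        scheduled[req.name] = min(req.remaining_prompt_tokens, per_slot)
    return scheduled


queue = [Request("A", 10_000), Request("B", 8_000),
         Request("C", 100), Request("D", 50)]

# One prefill slot: the long request A takes the whole budget, C waits.
print(schedule_prefill_step(queue, 2048, 1, 1, 1000))  # {'A': 2048}

# Two slots, at most one long prefill: C is prefilled alongside A.
print(schedule_prefill_step(queue, 2048, 2, 1, 1000))  # {'A': 1024, 'C': 100}
```

In this sketch the slot limit and the long-request limit are checked independently, so neither overrides the other. Whether V0 actually prioritizes one over the other is the open question raised above.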

Projects
Status: Todo
Development

No branches or pull requests

5 participants