[V1] Support long_prefill_token_threshold in v1 scheduler #15419

houseroad · 2025-03-24T21:26:57Z

To address #14003

To support concurrent partial prefill in v1, we need to honor long_prefill_token_threshold. Capping the number of tokens scheduled per request means tokens from different prefill requests can be scheduled in the same step. Since the v1 scheduler does not distinguish between prefill and decode, we only need to enforce the token threshold.
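As a rough illustration of the idea (not the code in this PR), the sketch below caps each request's share of a per-step token budget. The `Request` class, `schedule_step`, and the `token_budget` parameter are hypothetical stand-ins, not vLLM's actual classes or signatures.

```python
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical stand-in for a scheduler request; not vLLM's class.
    request_id: str
    num_tokens: int
    num_computed_tokens: int = 0


def schedule_step(requests, token_budget, long_prefill_token_threshold=0):
    """Return {request_id: tokens scheduled in this step}."""
    scheduled = {}
    for req in requests:
        if token_budget <= 0:
            break
        num_new_tokens = req.num_tokens - req.num_computed_tokens
        if num_new_tokens <= 0:
            continue
        # Cap a single request's share so one long prompt cannot
        # consume the whole step on its own.
        if long_prefill_token_threshold > 0:
            num_new_tokens = min(num_new_tokens, long_prefill_token_threshold)
        # Never exceed what is left of the step's budget.
        num_new_tokens = min(num_new_tokens, token_budget)
        scheduled[req.request_id] = num_new_tokens
        req.num_computed_tokens += num_new_tokens
        token_budget -= num_new_tokens
    return scheduled
```

With two 1000-token prompts and a budget of 512, a threshold of 0 gives the whole step to the first request, while a threshold of 256 lets both requests prefill 256 tokens concurrently.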

pytest tests/v1/core/test_scheduler.py
e2e test: VLLM_USE_V1=1 pytest tests/v1/core/test_scheduler_e2e.py

Signed-off-by: Lu Fang <lufang@fb.com>

github-actions · 2025-03-24T21:27:05Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

plops655 · 2025-03-24T21:52:55Z

I am confused as to whether this PR properly enables chunked prefill for v1. My understanding of the original issue is that concurrent partial prefills are used only when chunked prefill is enabled. Also, the original scheduler code makes sure that prefills scheduled to finish are listed before unfinished prefills, so that these sequences are prioritized once they transition to the decode stage.

# Because multiple prefills may be running concurrently, we need to
# make sure that prefills which are scheduled to finish are listed
# before those that won't. This is so that on the next scheduling
# iteration when they have transitioned to the decode stage, they are
# properly prioritized over sequences that are still in the prefill
# stage.
self.running.extend(
    self._order_finishing_prefills_first(
        running_scheduled.prefill_seq_groups))
self.running.extend([s.seq_group for s in prefills.seq_groups])

in vllm/core/scheduler.py _schedule_chunked_prefill
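For readers unfamiliar with that ordering, here is a minimal sketch of the idea the comment describes. It is not vLLM's implementation of `_order_finishing_prefills_first`; the tuple layout and function name below are assumptions made for illustration.

```python
def order_finishing_prefills_first(scheduled):
    """Stable partition: prefills that complete this step come first.

    `scheduled` is a list of (name, remaining_prompt_tokens, chunk_size)
    tuples, a hypothetical layout chosen for this sketch.
    """
    finishing = [s for s in scheduled if s[2] >= s[1]]
    not_finishing = [s for s in scheduled if s[2] < s[1]]
    # Relative order inside each group is preserved.
    return finishing + not_finishing


batch = [("a", 100, 32), ("b", 16, 16), ("c", 50, 10), ("d", 5, 8)]
ordered = [name for name, _, _ in order_finishing_prefills_first(batch)]
```

Here `b` and `d` finish their prompts in this step, so they are listed ahead of `a` and `c`.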

houseroad · 2025-03-24T22:07:39Z

This PR is only for concurrent partial prefills, not for chunked prefill. The assumption is that v1 always uses chunked prefill, so there is no difference between prefill and decode.

Feel free to suggest any fix if something is wrong. :-)

plops655 · 2025-03-24T22:15:08Z

Thanks for responding. I have one more question, as I am relatively new to vLLM. In v0, chunked prefill is explicitly run in _schedule_chunked_prefill. This function creates a PartialPrefillMetadata object to control the number of concurrent partial prefills, and batches chunked prefills and decodes together.

I do not see any explicit chunking happening in the v1 schedule() code, and was wondering if you could explain how chunked prefill is the default in v1.

comaniac · 2025-03-24T22:23:20Z

Chunked prefill is a first-class citizen in the v1 scheduler, so there is no explicit flag to configure it. You can observe this behavior in a couple of places:

When determining the number of new tokens to schedule (https://github.com/vllm-project/vllm/blob/main/vllm/v1/core/sched/scheduler.py#L290), we don't require the entire prompt to be scheduled at once. This is what makes the prefill "chunked".
When scheduling running requests, we can schedule an arbitrary number of tokens without checking whether they are prompt tokens or decode tokens. This also follows from chunked prefill support.
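A simplified model of the two points above, for a single request: the scheduler only looks at how many tokens are still uncomputed and clamps that to the step budget, so chunking falls out without a separate prefill path. The function name and parameters are hypothetical, and the real scheduler handles many more concerns (KV cache allocation, preemption, multiple requests).

```python
def simulate(prompt_len, max_new_tokens, token_budget):
    """Tokens scheduled per step for one request (requires token_budget > 0)."""
    num_computed = 0
    num_tokens = prompt_len  # grows by one for each generated token
    generated = 0
    per_step = []
    while generated < max_new_tokens:
        # Same rule for prompt and decode tokens: schedule whatever is
        # not yet computed, up to the budget.
        num_new_tokens = min(num_tokens - num_computed, token_budget)
        per_step.append(num_new_tokens)
        num_computed += num_new_tokens
        if num_computed == num_tokens:
            # All known tokens are processed, so one new token is sampled.
            num_tokens += 1
            generated += 1
    return per_step


steps = simulate(prompt_len=10, max_new_tokens=3, token_budget=4)
```

In this run the 10-token prompt is split into chunks of 4, 4, and 2 tokens, after which each step schedules a single decode token.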


comaniac

LGTM!
Also cc @joerunde @njhill


njhill · 2025-03-26T18:48:31Z

vllm/v1/core/sched/scheduler.py

+            if self.scheduler_config.long_prefill_token_threshold > 0:
+                num_new_tokens = min(
+                    num_new_tokens,
+                    self.scheduler_config.long_prefill_token_threshold)


Just looking at this now. Small nit: this could be simplified slightly:

if 0 < self.scheduler_config.long_prefill_token_threshold < num_new_tokens:
    num_new_tokens = self.scheduler_config.long_prefill_token_threshold

Good point. Fixed in #15307
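The two forms discussed in this review thread should behave identically. A small standalone check, using hypothetical helper names and plain integers in place of `self.scheduler_config`, could look like this:

```python
def cap_with_min(num_new_tokens, threshold):
    # Form from the diff: guard on the threshold, then take the minimum.
    if threshold > 0:
        num_new_tokens = min(num_new_tokens, threshold)
    return num_new_tokens


def cap_with_chained_comparison(num_new_tokens, threshold):
    # Suggested form: a chained comparison covers both conditions.
    if 0 < threshold < num_new_tokens:
        num_new_tokens = threshold
    return num_new_tokens


# Compare the two forms over a small grid, including threshold == 0
# (the "disabled" value in the original condition).
mismatches = [
    (n, t)
    for t in range(0, 8)
    for n in range(0, 16)
    if cap_with_min(n, t) != cap_with_chained_comparison(n, t)
]
```

An empty `mismatches` list indicates the forms agree on every pair in the grid.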


Support long_prefill_token_threshold in v1 scheduler

d5667a7

Signed-off-by: Lu Fang <lufang@fb.com>

mergify bot added the v1 label Mar 24, 2025

comaniac self-assigned this Mar 24, 2025

houseroad added 2 commits March 24, 2025 22:55

Add tests

92b0135

Signed-off-by: Lu Fang <lufang@fb.com>

Add e2e test

fadaa74

Signed-off-by: Lu Fang <lufang@fb.com>

houseroad marked this pull request as ready for review March 25, 2025 07:16

houseroad requested review from WoosukKwon, robertgshaw2-redhat, njhill, ywang96, comaniac and alexm-redhat as code owners March 25, 2025 07:16

houseroad changed the title ~~[v1][wip] Support long_prefill_token_threshold in v1 scheduler~~ [V1] Support long_prefill_token_threshold in v1 scheduler Mar 25, 2025

comaniac approved these changes Mar 25, 2025


comaniac added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 25, 2025

comaniac merged commit 082ab86 into vllm-project:main Mar 25, 2025
45 checks passed

wrmedford pushed a commit to wrmedford/vllm that referenced this pull request Mar 26, 2025

[V1] Support long_prefill_token_threshold in v1 scheduler (vllm-proje…

c08de65

…ct#15419) Signed-off-by: Lu Fang <lufang@fb.com> Signed-off-by: Wes Medford <wryanmedford@gmail.com>

njhill reviewed Mar 26, 2025


lengrongfu pushed a commit to lengrongfu/vllm that referenced this pull request Apr 2, 2025

[V1] Support long_prefill_token_threshold in v1 scheduler (vllm-proje…

531bdcf

…ct#15419) Signed-off-by: Lu Fang <lufang@fb.com>

lulmer pushed a commit to lulmer/vllm that referenced this pull request Apr 7, 2025

[V1] Support long_prefill_token_threshold in v1 scheduler (vllm-proje…

7c60fd4

…ct#15419) Signed-off-by: Lu Fang <lufang@fb.com> Signed-off-by: Louis Ulmer <ulmerlouis@gmail.com>

nishith-fujitsu pushed a commit to nishith-fujitsu/vllm that referenced this pull request Apr 9, 2025

[V1] Support long_prefill_token_threshold in v1 scheduler (vllm-proje…

f45b4c8

…ct#15419) Signed-off-by: Lu Fang <lufang@fb.com>

ckhordiasma mentioned this pull request Apr 17, 2025

[do not merge] pr test for nm changes into 2.20 red-hat-data-services/vllm#107

Closed
