[V1] Implement sliding window attention in kv_cache_manager #14097

Merged: 41 commits merged into vllm-project:main on Apr 1, 2025

Conversation

@heheda12345 (Collaborator) commented Mar 2, 2025

Built on top of #14079; this should be merged after it.

This PR supports a “real” sliding window in V1:

  1. Support dropping blocks outside the sliding window.
  2. For prefix caching, only the last sliding_window tokens need to be cached to achieve a prefix cache hit. E.g., for request ABCDE with sliding window size 2 and block_size 1, if DE is cached while ABC is not, we can still regard ABCDE as the cached prefix (see the sketch after this list).
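
To make the hit rule concrete, here is a minimal sketch (illustrative Python, not vLLM code) of the check, with block_size 1 so each token maps to one block:

# Hypothetical helper illustrating the sliding-window prefix-cache-hit rule.
sliding_window = 2

def is_full_hit(prompt: str, cached: set) -> bool:
    # The whole prompt counts as a cache hit if its last
    # `sliding_window` tokens are all cached.
    return all(tok in cached for tok in prompt[-sliding_window:])

# DE cached, ABC evicted: ABCDE still counts as a cached prefix.
assert is_full_hit("ABCDE", cached={"D", "E"})
# Only C cached: D and E are missing from the window, so no full hit.
assert not is_full_hit("ABCDE", cached={"C"})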

For models that combine global attention and sliding window attention, the kv cache manager still treats them as global-attention-only models.

Answers to some questions raised in #13296:

Q: Is it compatible with cascade attention?
A: Not yet, but the result should still be correct, because cascade attention is disabled in that case:

if use_alibi or use_sliding_window:

Q: How does it work with chunked prefill?
A: It allocates blocks for the tokens that will be computed in the current step, and frees the blocks that fall outside the sliding window in the next step.
Assume window size 1k, chunk size 2k, prompt length 4k, block_size=1 (a runnable replay of these steps follows the list):

  1. Chunked prefill of 2k tokens: block_table: [0:2k]
  2. Chunked prefill of 2k tokens:
    1. Free the first 1k blocks as they won’t be used after step 1. block_table becomes [null_block*1000] + [1k:2k]
    2. Allocate blocks for the [2k:4k] tokens that will be computed. block_table becomes [null_block*1000] + [1k:4k]
  3. First decode step:
    1. Free the [1k:3k] blocks as they won’t be used after step 2. block_table becomes [null_block*3000] + [3k:4k]
    2. Allocate a new slot for decoding. block_table becomes [null_block*3000] + [3k:4001]
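
The walkthrough above can be replayed with a minimal sketch (illustrative Python, not vLLM's allocator). block_size is 1, so block i holds token i, and NULL marks a freed block replaced by the null block:

NULL = -1
WINDOW, CHUNK = 1000, 2000

block_table = []

def step(num_new_tokens):
    # Free blocks outside the sliding window of the tokens computed so far:
    # only the last WINDOW tokens are still needed by later steps.
    keep_from = max(0, len(block_table) - WINDOW)
    for i in range(keep_from):
        block_table[i] = NULL
    # Allocate blocks for the tokens computed in this step.
    start = len(block_table)
    block_table.extend(range(start, start + num_new_tokens))

step(CHUNK)  # prefill chunk 1: block_table = [0:2000]
step(CHUNK)  # prefill chunk 2: [NULL]*1000 + [1000:4000]
step(1)      # first decode:    [NULL]*3000 + [3000:4001]

assert block_table[:3000] == [NULL] * 3000
assert block_table[3000:] == list(range(3000, 4001))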

Q: What's the shape of the block table for SWA? Is it append-only?
A: It has the same length as for global attention, but the blocks outside the sliding window are replaced with a special null_block. This replacement only happens on the kv_cache_manager side. Since the model runner’s block_table is append-only, we do not replace existing blocks with null blocks in the model runner. The result is still correct because the model runner never accesses the blocks outside the sliding window (see the sketch below).
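
A small sketch (illustrative, not vLLM code) of why that is safe, assuming a token at position pos attends to at most the previous window of tokens:

WINDOW = 1000

def slots_read_at(pos):
    # With block_size 1, a token at `pos` reads at most the slots of the
    # last WINDOW positions (including itself).
    return range(max(0, pos - WINDOW + 1), pos + 1)

# In the 4k example above, the manager has nulled blocks [0:3000] by the
# first decode step; the decode at position 4000 never touches them.
assert min(slots_read_at(4000)) >= 3000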

This pr is part of the hybrid allocator #11382


github-actions bot commented Mar 2, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the v1 label Mar 2, 2025

mergify bot commented Mar 5, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @heheda12345.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 5, 2025
@WoosukKwon (Collaborator) commented:

@zhuohan123 Did you have a chance to take a look?

@zhuohan123 (Member) left a comment:

WIP partial review. Will add more tomorrow.

Comment on lines 659 to 664:

# Verify that the virtual layers of each rank are the same.
for kv_cache_config in kv_cache_configs[1:]:
    for virtual_layer1, virtual_layer2 in zip(
            kv_cache_configs[0].virtual_layers,
            kv_cache_config.virtual_layers):
        assert virtual_layer1.kv_cache_spec == virtual_layer2.kv_cache_spec
@zhuohan123 (Member):

Q: Will pipeline parallelism fail this assert for some models?

@heheda12345 (Collaborator, Author):

Yes, it will fail when different stages have different types of layers. For hybrid models, we just throw an error as a first step. For non-hybrid models, the assert won't fail.

This function is introduced in https://github.com/vllm-project/vllm/pull/14079/files; should we discuss it there?

@heheda12345 (Collaborator, Author) left a comment:

Thanks for your review! I've replied to some questions and will update the code after #14079.



mergify bot commented Mar 29, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @heheda12345.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 29, 2025
@mergify mergify bot removed the needs-rebase label Mar 29, 2025
@WoosukKwon (Collaborator) commented Mar 30, 2025:

As we discussed offline, I think we need a clear separation between the two APIs of SpecializedManager:

  1. The first API describes how to free the KV cache for a request. It is invoked whenever the request calls allocate_slots.
  2. The second API describes how to check for prefix cache hits. It is invoked in get_computed_blocks.

Two particular things I want to fix in this PR:

  1. The first API is also used in get_computed_blocks, blurring the separation.
  2. The second API does not provide "efficient intersection" or "early stopping". I think this can be addressed by adding an extended API (e.g., get longest prefix); a sketch of such a split follows below.
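
For illustration, a sketch of what such a split might look like (names and signatures are hypothetical, not the final vLLM API):

from abc import ABC, abstractmethod

class SpecializedManager(ABC):
    # Hypothetical interface sketch of the two-API separation.

    @abstractmethod
    def remove_skipped_blocks(self, block_table, num_computed_tokens):
        """API 1: free the blocks the request no longer needs.
        Invoked whenever the request calls allocate_slots."""

    @abstractmethod
    def find_longest_cache_hit(self, block_hashes):
        """API 2: return the longest cached prefix, with early stopping.
        Invoked in get_computed_blocks, without touching API 1."""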

@heheda12345 (Collaborator, Author) left a comment:

I've changed the prefix caching API in SpecializedManager to find_longest_cache_hit. It seems most of the complexity can be removed with that change. find_longest_cache_hit also works for the hybrid allocator, as long as we can avoid recomputation across the multiple calls made for the same request. A sketch of the idea follows below.
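
A minimal sketch of how find_longest_cache_hit could work for sliding window (simplified and illustrative; the real vLLM code operates on block hashes and handles block-size rounding more carefully): scan backwards for the first contiguous cached run long enough to cover the window, and replace everything before it with the null block:

from math import ceil

NULL_BLOCK = None  # stand-in for vLLM's special null block

def find_longest_cache_hit(blocks, cached, sliding_window, block_size):
    # Trailing contiguous cached blocks needed to cover the window.
    need = ceil(sliding_window / block_size)
    run_len, run_end = 0, 0
    # Scan from the end so the first sufficient run yields the longest hit
    # (early stopping).
    for i in range(len(blocks) - 1, -1, -1):
        if blocks[i] in cached:
            if run_len == 0:
                run_end = i + 1  # exclusive right edge of this run
            run_len += 1
            if run_len == need:
                # Hit of length run_end: the last `need` blocks are real,
                # everything before them is served as the null block.
                return [NULL_BLOCK] * (run_end - need) + blocks[run_end - need:run_end]
        else:
            run_len = 0
    return []  # no window-sized cached run: no prefix cache hit

# Block size 1, window 2: DE cached => ABCDE is a cache hit of length 5.
assert find_longest_cache_hit(list("ABCDE"), {"D", "E"}, 2, 1) == \
    [NULL_BLOCK] * 3 + ["D", "E"]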

@WoosukKwon (Collaborator) left a comment:

Great simplification. I love it!

@WoosukKwon (Collaborator) left a comment:

LGTM. Thanks for the awesome work! I’m really happy with how the final APIs and simplifications turned out 🔥🔥

Also, huge thanks for your incredible patience throughout all the back-and-forth edits and long discussions. Really appreciate it!

@WoosukKwon WoosukKwon added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 31, 2025
@WoosukKwon WoosukKwon merged commit 3a5f0af into vllm-project:main Apr 1, 2025
33 checks passed
kylesayrs pushed a commit to neuralmagic/vllm that referenced this pull request Apr 2, 2025
[V1] Implement sliding window attention in kv_cache_manager (vllm-project#14097)

Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Alex4210987 pushed a commit to LeiWang1999/vllm-bitblas that referenced this pull request Apr 5, 2025
[V1] Implement sliding window attention in kv_cache_manager (vllm-project#14097)

Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Signed-off-by: xinyuxiao <xinyuxiao2024@gmail.com>
lulmer pushed a commit to lulmer/vllm that referenced this pull request Apr 7, 2025
[V1] Implement sliding window attention in kv_cache_manager (vllm-project#14097)

Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Signed-off-by: Louis Ulmer <ulmerlouis@gmail.com>
nishith-fujitsu pushed a commit to nishith-fujitsu/vllm that referenced this pull request Apr 9, 2025
Labels: ready (ONLY add when PR is ready to merge/full CI is needed), tpu (Related to Google TPUs), v1

4 participants