[V1] Implement sliding window attention in kv_cache_manager #14097
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
@zhuohan123 Did you have a chance to take a look?
WIP partial review. Will add more tomorrow.
vllm/v1/core/kv_cache_utils.py
# Verify that the virtual layers of each rank are the same.
for kv_cache_config in kv_cache_configs[1:]:
    for virtual_layer1, virtual_layer2 in zip(
            kv_cache_configs[0].virtual_layers,
            kv_cache_config.virtual_layers):
        assert virtual_layer1.kv_cache_spec == virtual_layer2.kv_cache_spec
Q: Will pipeline parallelism fail this assert for some models?
Yes, it will fail when different stages have different types of layers. For hybrid models, we just throw an error as the first step; for non-hybrid models, the assert won't fail.
This function is introduced in https://github.com/vllm-project/vllm/pull/14079/files; should we discuss it there?
Thanks for your review! I've replied to some questions and will update the code after #14079.
This pull request has merge conflicts that must be resolved before it can be merged.
As we discussed offline, I think we need a clear separation of the two APIs of
Two particular things I want to fix in this PR are that
I've changed the prefix caching API in SpecializedManager to find_longest_cache_hit. It seems most of the complexity can be removed after that. This find_longest_cache_hit also works for the hybrid allocator, as long as we avoid recomputation across the multiple calls made for the same request.
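To make the discussion concrete, here is a minimal sketch of what a per-attention-type find_longest_cache_hit could look like. The class name, signature, Block/null-block representation, and window math are assumptions for illustration only, not the actual vLLM implementation.

# Illustrative sketch only -- not the vLLM API. Assumes a dict-based cache
# keyed by block hash and a shared "null block" placeholder.
from dataclasses import dataclass
from typing import Dict, List

@dataclass(frozen=True)
class Block:
    block_id: int

NULL_BLOCK = Block(block_id=-1)  # hypothetical placeholder for out-of-window blocks

class SlidingWindowManagerSketch:
    def __init__(self, block_size: int, sliding_window: int):
        self.block_size = block_size
        self.sliding_window = sliding_window

    def find_longest_cache_hit(
            self, block_hashes: List[str],
            cached_blocks: Dict[str, Block]) -> List[Block]:
        """Return the longest reusable run of blocks for a request.

        For sliding window attention, only the last `sliding_window` tokens
        of the hit need real blocks; earlier blocks can be swapped for the
        null block because no future token will attend to them.
        """
        hit: List[Block] = []
        for block_hash in block_hashes:
            block = cached_blocks.get(block_hash)
            if block is None:
                break  # the prefix hit ends at the first miss
            hit.append(block)
        # Blocks entirely outside the window of the next computed token can
        # be represented by the null block instead of a real cached block.
        window_blocks = -(-self.sliding_window // self.block_size)  # ceil div
        for i in range(max(0, len(hit) - window_blocks)):
            hit[i] = NULL_BLOCK
        return hit

A real sliding-window manager could also accept a hit that starts mid-prompt (only the last window of tokens needs to be cached); the sketch keeps the simpler prefix-based behavior for brevity.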
Great simplification. I love it!
LGTM. Thanks for the awesome work! I’m really happy with how the final APIs and simplifications turned out 🔥🔥
Also, huge thanks for your incredible patience throughout all the back-and-forth edits and long discussions. Really appreciate it!
Built on top of #14079; this should be merged after it. This PR supports "real" sliding window attention in v1:
For models with both global attention and sliding window attention, we still regard them as global-attention-only models in the kv cache manager.
Answers to some questions raised in #13296:
It isn’t compatible with cascade attention yet but should be correct due to
vllm/vllm/v1/attention/backends/flash_attn.py
Line 351 in cfbb8c9
Q: How does it work with chunked prefill?
A: It allocates blocks for the tokens that will be computed in the current step, and frees the blocks that fall outside the sliding window in the next step.
Assume window size 1k, chunk size 2k, prompt length 4k, and block_size=1; a sketch of the resulting allocate/free pattern is shown below.
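The following is a small sketch (hypothetical helper, not vLLM code) that walks through the numbers above with block_size=1, so one block holds one token; exact off-by-one details at the window boundary are glossed over.

# Sketch: which blocks get allocated and freed per chunked-prefill step,
# assuming window 1024, chunk 2048, prompt 4096, block_size 1.
sliding_window, chunk_size, prompt_len = 1024, 2048, 4096

computed = 0
while computed < prompt_len:
    chunk_end = min(computed + chunk_size, prompt_len)
    # Allocate blocks only for the tokens computed in this step.
    # Tokens more than `sliding_window` before the first token of this step
    # can no longer be attended to, so their blocks can be freed (replaced
    # by the null block) before the step runs.
    freeable_end = max(0, computed - sliding_window)
    print(f"compute tokens [{computed}, {chunk_end}): "
          f"allocate blocks [{computed}, {chunk_end}), "
          f"free blocks [0, {freeable_end})")
    computed = chunk_end

With these numbers, step 1 allocates blocks [0, 2048) and frees nothing, and step 2 allocates blocks [2048, 4096) and can free blocks [0, 1024), matching the description above.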
Q: What's the shape of the block table for SWA? Is it append-only?
A: It has the same length as for global attention, but the blocks outside the sliding window are replaced with a special null_block. This replacement only happens on the kv_cache_manager side. Since the model runner's block_table is append-only, we do not replace existing blocks with null blocks in the model runner. The result is still correct because the model runner won't access the blocks outside the sliding window. A small sketch of the manager-side replacement follows.
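A minimal sketch of the manager-side replacement described above; the function name, the null-block sentinel, and the token accounting are assumptions for illustration, not the actual vLLM code.

# Sketch only: replace out-of-window blocks in the manager's block list with
# a shared null block. The model runner's append-only block table is left
# untouched; that stays correct because the runner never reads positions
# outside the sliding window.
NULL_BLOCK_ID = -1  # hypothetical sentinel for "never read again"

def mask_blocks_outside_window(block_ids: list[int], num_computed_tokens: int,
                               sliding_window: int, block_size: int) -> list[int]:
    first_useful_token = max(0, num_computed_tokens - sliding_window)
    first_useful_block = first_useful_token // block_size
    for i in range(first_useful_block):
        block_ids[i] = NULL_BLOCK_ID
    return block_ids

# Example: with 6 computed tokens, window 4, block_size 1, the first two
# blocks are masked: [0..7] -> [-1, -1, 2, 3, 4, 5, 6, 7].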
This PR is part of the hybrid allocator work (#11382).