[Kernel] LoRA - Enable CUDAGraphs for V1 #14626
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a reduced set of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Force-pushed from 544b16c to f92f3e2.
    embeddings_indices = torch.narrow(
        self.punica_wrapper._embeddings_indices, 1, 0, x.size(0))
^ Changes are to avoid errors such as:
raise ConstraintViolationError(
torch.fx.experimental.symbolic_shapes.ConstraintViolationError: Constraints violated (L['input_ids'].size()[0], L['positions'].size()[0])! For more information, run with TORCH_LOGS="+dynamic".
- Not all values of RelaxedUnspecConstraint(L['input_ids'].size()[0]) are valid because L['input_ids'].size()[0] was inferred to be a constant (8192).
- Not all values of RelaxedUnspecConstraint(L['positions'].size()[0]) are valid because L['positions'].size()[0] was inferred to be a constant (8192).
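To illustrate the failure mode, here is a standalone sketch (not the actual layer code; buffer names and sizes mirror the snippet above, but the surrounding setup is made up): slicing the static buffer with a Python int makes Dynamo treat the token dimension as a constant, which then forces x.size(0) to be that constant and trips the relaxed constraint; narrowing with x.size(0) keeps the dimension symbolic.

    import torch

    MAX_NUM_BATCHED_TOKENS = 8192
    _embeddings_indices = torch.zeros(2, MAX_NUM_BATCHED_TOKENS, dtype=torch.long)
    indices_len = [MAX_NUM_BATCHED_TOKENS]  # host-side Python int, refreshed each step

    def bad_slice(x: torch.Tensor) -> torch.Tensor:
        # indices_len[0] is a plain Python int, so Dynamo bakes in the value
        # seen at trace time (8192) and then infers that x.size(0) must also be
        # 8192, i.e. the "inferred to be a constant (8192)" error quoted above.
        return x + _embeddings_indices[1, :indices_len[0]]

    def good_slice(x: torch.Tensor) -> torch.Tensor:
        # torch.narrow with x.size(0) keeps the token dimension symbolic, so
        # one graph serves every batch size.
        return x + torch.narrow(_embeddings_indices, 1, 0, x.size(0))[1]

    x = torch.zeros(MAX_NUM_BATCHED_TOKENS, dtype=torch.long)
    torch._dynamo.mark_dynamic(x, 0)     # request a dynamic token dimension
    torch.compile(good_slice)(x)         # fine
    # torch.compile(bad_slice)(x)        # raises ConstraintViolationError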
-        full_output = self.base_layer.forward(
-            x.add_(indices * added_tokens_mask))
+        full_output = self.base_layer.forward(x +
+                                              (indices * added_tokens_mask))
x here is the input_ids. In V1, we don't zero out the CUDA graph pad region. Avoid the in-place update here to prevent accumulating garbage into the input buffer.
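A standalone sketch of why the in-place version is a problem with a persistent, replayed input buffer (buffer and argument names are illustrative):

    import torch

    # Persistent input_ids buffer that a captured CUDA graph keeps replaying on.
    # In V1, the pad region past the real tokens is not zeroed between steps.
    input_ids = torch.zeros(8, dtype=torch.long)

    def forward_bad(indices: torch.Tensor, added_tokens_mask: torch.Tensor):
        # In-place: every replay adds into the persistent buffer, so the pad
        # region accumulates garbage that later steps read back as token ids.
        return input_ids.add_(indices * added_tokens_mask)

    def forward_good(indices: torch.Tensor, added_tokens_mask: torch.Tensor):
        # Out-of-place: the persistent buffer is only ever read.
        return input_ids + indices * added_tokens_mask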
vllm/config.py (Outdated)
    vllm_factors.append(
        hashlib.md5(
            str(self.scheduler_config.max_num_batched_tokens).encode()
        ).hexdigest())
During torch.compile, LoRA static buffers like the one in vllm/vllm/lora/punica_wrapper/punica_base.py (line 133 at 5305673),

    self._token_lora_indices = torch.empty(max_num_batched_tokens,

and

    token_lora_mapping = torch.empty(max_num_tokens,

are sized by max_num_batched_tokens. When max_num_batched_tokens changes and the captured graph is executed, we hit assert_size_stride asserts on these tensors. As a solution, we simply recompile when max_num_batched_tokens changes.
str(self.scheduler_config.max_num_batched_tokens) should be enough. You only need to add a string into factors; no need to hash it here.
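A minimal sketch of the suggested simplification, assuming vllm_factors is the list of strings that feeds the compilation cache key:

    # Recompile whenever max_num_batched_tokens changes; the raw value is
    # enough to distinguish cache entries, so no hashing is required.
    vllm_factors.append(str(self.scheduler_config.max_num_batched_tokens))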
Force-pushed from 83efaeb to e18366c.
LoRA TP test times increase from 43m to 51m. I believe this is mostly coming from the CUDAGraph capture for V1.
-        y = self._apply_bias(self.token_lora_indices, y, output_slices,
+        token_lora_indices = torch.narrow(self._token_lora_indices, 0, 0,
+                                          y.size(0))
+        y = self._apply_bias(token_lora_indices, y, output_slices,
                              lora_bias_stacked)
^ Changes are to avoid errors such as,
raise ConstraintViolationError(
torch.fx.experimental.symbolic_shapes.ConstraintViolationError: Constraints violated (L['input_ids'].size()[0], L['positions'].size()[0])! For more information, run with TORCH_LOGS="+dynamic".
- Not all values of RelaxedUnspecConstraint(L['input_ids'].size()[0]) are valid because L['input_ids'].size()[0] was inferred to be a constant (8192).
- Not all values of RelaxedUnspecConstraint(L['positions'].size()[0]) are valid because L['positions'].size()[0] was inferred to be a constant (8192).
@jeejeelee @youkaichao @bnellnm @ProExpertProg Can you please take a look when you get a chance! Thanks 🙏
Thanks @jeejeelee for running this 🙌 I believe #14685 should help the V0 case 👍
This is surprising, I didn't expect the LoRA code could be traced by Dynamo correctly. How do you pass the LoRA metadata, e.g. which LoRA is used for which request?
Hi @youkaichao - the metadata update is done like before, in punica_gpu.py (vllm/vllm/lora/punica_wrapper/punica_gpu.py, line 118 at 1bd32bc).
I did run into issues with the dynamic shape tracing; I have added comments in such places in the PR. Fortunately there were few.
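Roughly, the metadata path stays eager while the compiled/captured graph only reads persistent buffers. A self-contained sketch of that split (names and sizes are illustrative, not the actual punica wrapper API):

    import torch

    MAX_TOKENS = 8192  # stands in for max_num_batched_tokens

    # Persistent buffer, allocated once; the compiled graph keeps reading from
    # this exact storage on every step.
    token_lora_indices = torch.zeros(MAX_TOKENS, dtype=torch.long)

    def update_metadata(mapping: torch.Tensor) -> None:
        # Eager host-side update, outside the traced region: overwrite only
        # the prefix that belongs to the current batch.
        token_lora_indices[:mapping.size(0)].copy_(mapping)

    @torch.compile(dynamic=True)
    def lora_forward(x: torch.Tensor) -> torch.Tensor:
        # Traced region: read the static buffer, narrowed with a symbolic size.
        idx = torch.narrow(token_lora_indices, 0, 0, x.size(0))
        return x + idx.to(x.dtype)

    update_metadata(torch.arange(4))
    out = lora_forward(torch.randn(4))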
@@ -3443,12 +3455,6 @@ def __post_init__(self):
                " Disabling `torch.compile`.")
            self.compilation_config.level = CompilationLevel.NO_COMPILATION

        if self.lora_config is not None and self.compilation_config.level !=\
Is V0 LoRA compatible with torch.compile?
Not at the moment. The SGMV ops take some forward-pass-specific metadata, such as token_nums and max_seq_length, as Python integers and IIUC, during tracing these are captured as constants but they shouldn't be. With the lora/layers.py changes in this PR and with #14685, V0 LoRA should become compatible.
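A toy illustration of that constant capture (not the actual SGMV kernel): when a per-step value is passed as a Python int, Dynamo typically specializes on the concrete value, so each new value means a recompile, whereas deriving the value from a tensor shape keeps it symbolic.

    import torch

    @torch.compile(dynamic=True)
    def scale(x: torch.Tensor, token_nums: int) -> torch.Tensor:
        # token_nums is a Python int; it is traced as a constant, not as a
        # per-step input.
        return x * token_nums

    scale(torch.ones(4), 4)   # graph compiled with token_nums == 4 baked in
    scale(torch.ones(8), 8)   # new value -> guard failure -> recompile

    @torch.compile(dynamic=True)
    def scale_dynamic(x: torch.Tensor) -> torch.Tensor:
        # Deriving the value from the tensor shape keeps it symbolic.
        return x * x.size(0)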
Thanks for the information. Then can you keep the assert in V0?
Yes, re-introduced the check for V0 👍
@@ -237,16 +237,19 @@ def set_lora(
        self.embeddings_weights[:embeddings.shape[0]].copy_(embeddings)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        added_tokens_mask = x > self.base_layer.org_vocab_size - 1
        embeddings_indices = self.punica_wrapper.embeddings_indices
Yeah, this is the culprit. It is actually a function, and it slices a tensor using a Python int, which will fail the symbolic shape compilation. Changing to torch.narrow with x.size(0) is the correct fix 👍
The change makes sense to me. I'll leave it to @jeejeelee to verify the correctness.
Some general comments: for a computation graph to be compatible with vLLM's torch.compile integration, the input/output tensors of every operation seen by the PyTorch compiler (to be specific, Dynamo) must share the same dynamic shape, and all the other shapes must be static. If you want to slice a tensor along a certain dimension to num_tokens, you cannot use a Python int; you should use x.size(0).
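A minimal illustration of the shared-dynamic-shape rule with the two graph inputs named in the error above (everything else here is made up):

    import torch

    def fwd(input_ids: torch.Tensor, positions: torch.Tensor) -> torch.Tensor:
        # Both graph inputs share one dynamic dimension: num_tokens.
        return input_ids + positions

    input_ids = torch.arange(8)
    positions = torch.arange(8)
    # Mark dim 0 of every input as the shared dynamic dimension; any op that
    # pins this dimension to a Python int constant violates the constraint.
    torch._dynamo.mark_dynamic(input_ids, 0)
    torch._dynamo.mark_dynamic(positions, 0)

    compiled = torch.compile(fwd)
    compiled(input_ids, positions)                 # compiles once
    compiled(torch.arange(32), torch.arange(32))   # reuses the same graph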
@varun-sundar-rabindranath do you know why V1 CUDA graph TPOT is worse than V1 eager?
Hi @ProExpertProg, where are you seeing this?
Force-pushed from e18366c to d57fba2.
@@ -52,6 +52,7 @@ def set_active_loras(worker: Union[Worker, V1Worker],
        seed=0,
        dtype="float16",
        revision=None,
        enforce_eager=True,
Are you planning on keeping this eager?
Is it testing code that should be removed before this PR is ready?
I intend to keep it. The CI test was running out of memory, which I assume is because of the CUDA graph capture. Also, that specific test doesn't actually run the model.
Enable CUDAGraphs support for V1 LoRA
Related issue: #10617