[Kernel] [V1] Improved performance for V1 Triton (ROCm) backend #14152

tdoublep · 2025-03-03T19:23:35Z

In this PR we target performance improvement for the V1 ROCmAttentionBackend (which should be renamed TritonAttentionBackend once this PR is merged).

cc @SageMoore

Performance

All of the below results are for meta-llama/Llama-3.1-8B-Instruct on an NVIDIA H100 GPU.

The ROCmAttention backend on main (we have to hack it a bit to make this happen on NVIDIA) currently produces the following results for the serving benchmark:

$ python benchmarks/benchmark_serving.py \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --dataset-name sharegpt \
    --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json
============ Serving Benchmark Result ============
Successful requests:                     1000      
Benchmark duration (s):                  107.33    
Total input tokens:                      215196    
Total generated tokens:                  197284    
Request throughput (req/s):              9.32      
Output token throughput (tok/s):         1838.05   
Total Token throughput (tok/s):          3842.98   
---------------Time to First Token----------------
Mean TTFT (ms):                          10264.30  
Median TTFT (ms):                        8952.08   
P99 TTFT (ms):                           24269.34  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          408.87    
Median TPOT (ms):                        261.25    
P99 TPOT (ms):                           1208.50   
---------------Inter-token Latency----------------
Mean ITL (ms):                           210.67    
Median ITL (ms):                         172.46    
P99 ITL (ms):                            1308.82   
==================================================

Whereas, using the default FlashAttention backend produces:

============ Serving Benchmark Result ============
Successful requests:                     1000      
Benchmark duration (s):                  20.47     
Total input tokens:                      215196    
Total generated tokens:                  198001    
Request throughput (req/s):              48.86     
Output token throughput (tok/s):         9673.74   
Total Token throughput (tok/s):          20187.58  
---------------Time to First Token----------------
Mean TTFT (ms):                          3473.32   
Median TTFT (ms):                        3405.53   
P99 TTFT (ms):                           6302.31   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          82.06     
Median TPOT (ms):                        47.72     
P99 TPOT (ms):                           212.20    
---------------Inter-token Latency----------------
Mean ITL (ms):                           36.34     
Median ITL (ms):                         24.48     
P99 ITL (ms):                            216.40    
==================================================

So, the Triton backend is really under-performing on this hardware: the total throughput is more than 5x worse than the FlashAttention baseline.

Whereas, using the optimized ROCmAttention from this branch produces:

============ Serving Benchmark Result ============
Successful requests:                     1000      
Benchmark duration (s):                  29.38     
Total input tokens:                      215196    
Total generated tokens:                  197281    
Request throughput (req/s):              34.03     
Output token throughput (tok/s):         6713.84   
Total Token throughput (tok/s):          14037.37  
---------------Time to First Token----------------
Mean TTFT (ms):                          3711.10   
Median TTFT (ms):                        3781.32   
P99 TTFT (ms):                           6787.05   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          100.91    
Median TPOT (ms):                        66.34     
P99 TPOT (ms):                           249.37    
---------------Inter-token Latency----------------
Mean ITL (ms):                           53.15     
Median ITL (ms):                         42.07     
P99 ITL (ms):                            259.42    
==================================================

Thus, we see a 3.65x performance boost in terms of throughput, and performance is only 30% off the FlashAttention baseline. We feel like with smart auto-tuning of the kernels (#13305 already goes in this direction), that the performance gap can be closed further.

We are in the process of evaluating the performance on AMD GPUs.

Correctness

The new kernel produces correct results.

Using FlashAttention on H100 we see:

$ export VLLM_V1_USE_TRITON_BACKEND=0
$ VLLM_USE_V1=1 lm_eval --model vllm --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct --tasks gsm8k --num_fewshot 5 --batch_size auto --limit 500
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.792|±  |0.0182|
|     |       |strict-match    |     5|exact_match|↑  |0.770|±  |0.0188|

Using TritonAttention on H100 we see:

$ export VLLM_V1_USE_TRITON_BACKEND=1
$ VLLM_USE_V1=1 lm_eval --model vllm --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct --tasks gsm8k --num_fewshot 5 --batch_size auto --limit 500
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.798|±  |0.0180|
|     |       |strict-match    |     5|exact_match|↑  |0.780|±  |0.0185|

How is this performance improvement achieved?

We have noticed that the prefix_prefill kernel is not well optimized for sequences that are in decode phase. While it may be possible to optimize the kernel to make it better for decode, it is quite limiting to try to combine everything within a single Triton kernel. It implies we have to always use the same grid for prefill, chunked prefills and decodes, and also we have to take great care to choose the tile sizes and auto-tuning settings to work well across all cases. Instead, we think it is better to handle prefill/chunked prefill separately from decodes. We have implemented chunked_prefill_paged_decode in ibm_triton_lib so that the prefix_prefill kernel is used for the sequences in the batch that have query length > 1, and we used an optimized paged decoding Triton kernel for all other sequences. As you can see, no change to vLLM V1 metadata is required.

What is `ibm_triton_lib`?

To run the code on this branch one needs to install the following dependency:

pip install "git+https://github.com/foundation-model-stack/vllm-triton-backend.git#egg=ibm-triton-lib&subdirectory=ibm-triton-lib"

This is, hopefully, just a temporary place for these kernels. If preferred by the community, we would be happy to contribute these new kernels to vLLM repo. Alternatively, we feel like there might be some benefit to having a Triton-equivalent to FlashAttention (e.g., an external project that implements optimized Triton kernels). We feel this could un-burden vLLM from performing extensive AOT auto-tuning that is required to get the most out of these kernels (see @bringlein's talk from Ray Summit). Rather, we could do this in an external project and bundle the optimized configs inside a wheel that vLLM could easily consume. We discussed this a bit already on this RFC. If there is interested from the wider community, we could think about having this as a vLLM ecosystem project. Please note that currently the Triton kernels being used here have not been auto-tuned, so there is definitely more performance improvements to be had here.

Update: this PR is now self-contained and does not have a dependency on ibm_triton_lib

Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>

Signed-off-by: Burkhard Ringlein <ngl@zurich.ibm.com>

Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>

github-actions · 2025-03-03T19:23:48Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

SageMoore · 2025-03-03T19:53:07Z

Thanks for the contribution! I'm totally fine with the dependency if it's easily installable via the requirements-rocm-build.txt file. I do think it would be nice to have this kernel integrated into vllm so that any performance optimizations made to this kernel in the future are visible to the vllm community.

bringlein · 2025-03-04T18:51:55Z

I did some benchmarks for different number of concurrent users on A100 and H100 for this PR, using the benchmarks/benchmark_serving.py as described by @tdoublep above:

Co-authored-by: Burkhard Ringlein <ngl@zurich.ibm.com> Co-authored-by: Jan van Lunteren <jvl@zurich.ibm.com> Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>

tdoublep · 2025-03-04T19:50:32Z

I do think it would be nice to have this kernel integrated into vllm so that any performance optimizations made to this kernel in the future are visible to the vllm community.

@SageMoore We've added the kernel to this PR. This is now a standalone contribution with no dependency on ibm_triton_lib.

maleksan85 · 2025-03-04T19:56:35Z

Hi Thomas,

Can we do joined effort and:

take the kernel from my PR: [Kernel][ROCM] Upstream prefix prefill speed up for vLLM V1 #13305 plus return for ctx len == 1 (because your _fwd_kernel seems the same as original except return for context len 1). Use mine only starting from triton 3.2
take your kernel kernel_paged_attention_2d
put them into path of vllm upstream: /vllm/attention/ops/prefix_prefill.py

With this joined thing above I got:

VLLM_ENABLE_V1_MULTIPROCESSING=0 VLLM_USE_V1=1 lm_eval --model vllm --model_args pretrained=/data/models/Llama-3.1-8B-Instruct --tasks gsm8k --num_fewshot 5 --batch_size auto --limit 500

2025-03-04:19:53:53,697 INFO [evaluation_tracker.py:269] Output path not provided, skipping saving results aggregated
vllm (pretrained=/data/models/Llama-3.1-8B-Instruct), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
gsm8k	3	flexible-extract	5	exact_match	↑	0.790	±	0.0182
		strict-match	5	exact_match	↑	0.774	±	0.0187

and:

python vllm/benchmarks/benchmark_serving.py --model /data/models/Llama-3.1-8B-Instruct --dataset-name sharegpt --dataset-path datasets/ShareGPT/ShareGPT_V3_unfiltered_cleaned_split.json

============ Serving Benchmark Result ============
Successful requests:                     1000
Benchmark duration (s):                  23.68
Total input tokens:                      215196
Total generated tokens:                  198343
Request throughput (req/s):              42.22
Output token throughput (tok/s):         8374.29
Total Token throughput (tok/s):          17460.14
---------------Time to First Token----------------
Mean TTFT (ms):                          1311.16
Median TTFT (ms):                        1366.36
P99 TTFT (ms):                           1808.26
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          45.81
Median TPOT (ms):                        44.80
P99 TPOT (ms):                           72.73
---------------Inter-token Latency----------------
Mean ITL (ms):                           38.17
Median ITL (ms):                         39.32
P99 ITL (ms):                            81.78
==================================================

cc @SageMoore

tdoublep · 2025-03-04T20:01:14Z

@maleksan85 awesome! Yes, there is no reason why these two improvements can't be combined. I had tried it yesterday but didn't see additive gains like that. I will try it again using the latest commits.

maleksan85 · 2025-03-04T20:07:47Z

prefix_prefill.py.txt
I tried benchmarks above on latest main + prefix_prefill.py (from attach) + your file triton_paged_decode_attention_2d.py https://github.com/foundation-model-stack/vllm-triton-backend/blob/40b2fa560f5c5ae00724960852f426a3eb99d71e/ibm-triton-lib/ibm_triton_lib/kernels/triton_paged_decode_attention_2d.py#L73 copied into same folder as prefix_prefill.py.

Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>

SageMoore

This PR generally looks good to me. I only have some small nits. Thanks so much for the contribution! The performance improvements are very exciting.

SageMoore · 2025-03-04T22:56:45Z

vllm/attention/ops/chunked_prefill_paged_decode.py

+        num_query_heads: tl.constexpr,  # int
+        num_queries_per_kv: tl.constexpr,  # int
+        block_table_stride: tl.constexpr,  # int
+        query_stride_0: tl.constexpr,  # int


Nit: Could you explain a bit why all of the strides are constexpr? My understanding is that triton will generate a new kernel each time one of these constexprs changes, which could be non-ideal.

Yes, that's right, triton will generate a new binary if a constant changes. However, having strides as constant enables better optimization of memory accesses (and therefore, automatic tiling, coalescing, etc.). Our experience shows this could be quite significant, esp. for small requests (we've seen latency reduction up to 2x for this 2d kernel in the past for very small batches).
Additionally, we don't expect that num_query_heads or block_table_stride changes within a running inference server (although it will change based on the vllm configuration or model).

vllm/platforms/cuda.py

vllm/attention/ops/chunked_prefill_paged_decode.py

Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>

tdoublep · 2025-03-05T13:59:30Z

Thanks for the review @SageMoore @maleksan85.

We have addressed all the comments.

SageMoore · 2025-03-05T14:44:47Z

LGTM

tlrmchlsmth · 2025-03-05T15:00:16Z

vllm/platforms/cuda.py

+            if os.environ.get("VLLM_V1_USE_TRITON_BACKEND", "0") == "1":
+                logger.info_once(
+                    "Using Triton/ROCm Attention backend on V1 engine.")
+                return ("vllm.v1.attention.backends.rocm_attn."
+                        "ROCmAttentionBackend")


A couple of things:

any environment variables should be handled invllm/envs.py

We should try to avoid proliferation of environment variables in general when possible. I know you're working on a triton backend in general - is there any community consensus for how this should be integrated yet?

Sure, we actually just added this env variable as a way to test the ROCm/Triton backend on NVIDIA hardware. We were kind of hoping that this PR might be merged in the meantime which would give the user a way to select the ROCm/Triton backend without introducing a new env variable.

We have developed a Triton backend for vLLM V0 which was open-sourced as an out-of-tree plugin here. Since V1 already has a Triton-only backend (e.g., what is currently called ROCmBackend) I think the V0 backend code we have developed is now somewhat redundant, however the kernels that we have developed can be re-used for V1 (with some important changes) which is essentially what this PR is doing). We are happy to contribute these kernels to vLLM repo (as suggested by @SageMoore above) so the community can track the progress and contribute further optimizations.

Perhaps we should simply remove the changes to platforms/cuda.py from this PR? It would mean that we can only run this backend on AMD for now, but that is the current behaviour on main anyway. Adding the option to use the Triton backend on NVIDIA can be addressed in a separate PR (e.g. #14071)

All of that sounds good and makes sense to me.

For now, let's remove the changes platforms/cuda.py for this PR and iterate on it future ones

SageMoore · 2025-03-05T16:36:20Z

I ran the same benchmark_serving experiments that you posted above on an MI300x machine and the results are quite good.

branch: main
Command: python benchmarks/benchmark_serving.py \
			   --model meta-llama/Llama-3.1-8B-Instruct \
			   --dataset-name sharegpt \
			   --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
			   --port 4000
Results:
============ Serving Benchmark Result ============
Successful requests:                     1000
Benchmark duration (s):                  136.33
Total input tokens:                      215196
Total generated tokens:                  197051
Request throughput (req/s):              7.34
Output token throughput (tok/s):         1445.39
Total Token throughput (tok/s):          3023.89
---------------Time to First Token----------------
Mean TTFT (ms):                          23988.34
Median TTFT (ms):                        18652.88
P99 TTFT (ms):                           62421.28
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          358.72
Median TPOT (ms):                        325.15
P99 TPOT (ms):                           722.79
---------------Inter-token Latency----------------
Mean ITL (ms):                           249.32
Median ITL (ms):                         171.76
P99 ITL (ms):                            826.05
==================================================


branch: tpa-v1-clean
Command: python benchmarks/benchmark_serving.py \
			   --model meta-llama/Llama-3.1-8B-Instruct \
			   --dataset-name sharegpt \
			   --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
			   --port 4000
Results:
============ Serving Benchmark Result ============
Successful requests:                     1000
Benchmark duration (s):                  36.09
Total input tokens:                      215196
Total generated tokens:                  197037
Request throughput (req/s):              27.71
Output token throughput (tok/s):         5459.77
Total Token throughput (tok/s):          11422.72
---------------Time to First Token----------------
Mean TTFT (ms):                          6215.30
Median TTFT (ms):                        5872.31
P99 TTFT (ms):                           12840.71
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          74.53
Median TPOT (ms):                        73.89
P99 TPOT (ms):                           122.55
---------------Inter-token Latency----------------
Mean ITL (ms):                           55.83
Median ITL (ms):                         47.13
P99 ITL (ms):                            132.60
==================================================

Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>

tlrmchlsmth · 2025-03-05T21:23:58Z

vllm/attention/ops/chunked_prefill_paged_decode.py

+        assert value_cache.dtype == torch.uint8
+
+        if kv_cache_dtype in ("fp8", "fp8_e4m3"):
+            target_dtype = torch.float8_e4m3fn


should this be torch.float8_e4m3fnuz in some cases on RoCM?

Not entirely sure, but this logic is taking direct from what is already happening here.

FYI @SageMoore is going to double-check accuracy on an MI300 system

VLLM_USE_V1=1 lm_eval --model vllm --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct --tasks gsm8k --num_fewshot 5 --batch_size auto --limit 500

vllm (pretrained=meta-llama/Llama-3.1-8B-Instruct), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto |Tasks|Version| Filter |n-shot| Metric | |Value| |Stderr| |-----|------:|----------------|-----:|-----------|---|----:|---|-----:| |gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.780|± |0.0185| | | |strict-match | 5|exact_match|↑ |0.762|± |0.0191|

Looks reasonable

Thanks @SageMoore. And since this is just KV cache compression, the fp8_e4m3fn on RoCM issue is not worrying me like it did when it first caught my eye

tlrmchlsmth

This looks great to me, thanks for the contribution!

tlrmchlsmth · 2025-03-05T22:48:50Z

vllm/attention/ops/chunked_prefill_paged_decode.py

+        assert value_cache.dtype == torch.uint8
+
+        if kv_cache_dtype in ("fp8", "fp8_e4m3"):
+            target_dtype = torch.float8_e4m3fn


Thanks @SageMoore. And since this is just KV cache compression, the fp8_e4m3fn on RoCM issue is not worrying me like it did when it first caught my eye

Co-authored-by: Burkhard Ringlein <ngl@zurich.ibm.com> Co-authored-by: Jan van Lunteren <jvl@zurich.ibm.com> Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>

tdoublep · 2025-03-06T09:09:18Z

PR failure is unrelated. I checked and that test is actually failing on main (see here for a quick fix).

bringlein · 2025-03-06T17:25:14Z

just do document: This PR enables similar performance improvement also on MI250 (4.1x)

* Fix `head_dim` not existing in all model configs (Transformers backend) (vllm-project#14141) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> * [V0][Metrics] Remove unimplemented `vllm:tokens_total` (vllm-project#14134) Signed-off-by: Mark McLoughlin <markmc@redhat.com> * [V0][Metrics] Deprecate some KV/prefix cache metrics (vllm-project#14136) Signed-off-by: Mark McLoughlin <markmc@redhat.com> * [V1] Simplify stats logging (vllm-project#14082) Signed-off-by: Nick Hill <nhill@redhat.com> * [WIP][[V1][Metrics] Implement max_num_generation_tokens, request_params_n, and request_params_max_tokens metrics (vllm-project#14055) Signed-off-by: Mark McLoughlin <markmc@redhat.com> * [Bugfix] Allow shared_experts skip quantization for DeepSeekV2/V3 (vllm-project#14100) Signed-off-by: mgoin <mgoin64@gmail.com> * [Kernel] Optimize moe intermediate_cache usage (vllm-project#13625) Signed-off-by: mgoin <mgoin64@gmail.com> * [Docs] Add GPTQModel (vllm-project#14056) Signed-off-by: mgoin <mgoin64@gmail.com> Co-authored-by: mgoin <mgoin64@gmail.com> * [v1] Add comments to the new ragged paged attention Pallas kernel (vllm-project#14155) Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com> Co-authored-by: Michael Goin <mgoin64@gmail.com> * [Model] Add support for GraniteMoeShared models (vllm-project#13313) Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> * [core] moe fp8 block quant tuning support (vllm-project#14068) Signed-off-by: Divakar Verma <divakar.verma@amd.com> * [Misc] Remove lru_cache in NvmlCudaPlatform (vllm-project#14156) Signed-off-by: Cody Yu <hao.yu.cody@gmail.com> * [core] Pass all driver env vars to ray workers unless excluded (vllm-project#14099) Signed-off-by: Rui Qiao <ruisearch42@gmail.com> * Use math.prod instead of np.prod for trivial ops (vllm-project#14142) * Fix benchmark_moe.py tuning for CUDA devices (vllm-project#14164) * [platform] add debug logging during inferring the device type (vllm-project#14195) Signed-off-by: youkaichao <youkaichao@gmail.com> * [sleep mode] error out with expandable_segments (vllm-project#14189) Signed-off-by: youkaichao <youkaichao@gmail.com> * [doc] add "Failed to infer device type" to faq (vllm-project#14200) Signed-off-by: youkaichao <youkaichao@gmail.com> * [Bugfix] Restrict MacOS CPU detection (vllm-project#14210) Signed-off-by: mgoin <mgoin64@gmail.com> * [V1][BugFix] Fix remaining sync engine client shutdown errors/hangs (vllm-project#13869) Signed-off-by: Nick Hill <nhill@redhat.com> * [V0][Metrics] Deprecate some questionable request time metrics (vllm-project#14135) Signed-off-by: Mark McLoughlin <markmc@redhat.com> * [V1][Molmo] Fix get_multimodal_embeddings() in molmo.py (vllm-project#14161) * add cutlass support for blackwell fp8 gemm (vllm-project#13798) * [TPU][Profiler] Support start_profile/stop_profile in TPU worker (vllm-project#13988) Signed-off-by: Siyuan Liu <lsiyuan@google.com> Co-authored-by: mgoin <mgoin64@gmail.com> * Fix performance when `--generation-config` is not `None` (vllm-project#14223) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> * [Frontend] Do `prompt_logprobs` clamping for chat as well as completions (vllm-project#14225) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> * [Docs] Update Dockerfile dependency image (vllm-project#14215) Signed-off-by: mgoin <mgoin64@gmail.com> * [v1][Metrics] Add design doc (vllm-project#12745) Signed-off-by: Mark McLoughlin <markmc@redhat.com> Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Co-authored-by: Cody Yu <hao.yu.cody@gmail.com> * [Security] Serialize using safetensors instead of pickle in Mooncake Pipe (vllm-project#14228) Signed-off-by: KuntaiDu <kuntai@uchicago.edu> * Clean up unused padding_idx variables across many model definitions (vllm-project#13240) Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> * [ROCm] Disable a few more kernel tests that are broken on ROCm (vllm-project#14145) Signed-off-by: Sage Moore <sage@neuralmagic.com> * [V1][TPU] TPU multimodal model support for ragged attention (vllm-project#14158) Signed-off-by: Michael Goin <mgoin64@gmail.com> * [misc] announce china meetup (vllm-project#14248) Signed-off-by: youkaichao <youkaichao@gmail.com> * Moved numba from common requirements to cuda/rocm specific requirements (vllm-project#14199) Signed-off-by: Nishidha Panpaliya <nishidha.panpaliya@partner.ibm.com> * Disable GPTQ AllSpark kernels for CUDA Compiler < 12.0 (vllm-project#14157) Signed-off-by: mgoin <mgoin64@gmail.com> * [Bugfix] Fix gptq_marlin for deepseek-v3 (vllm-project#13750) Signed-off-by: dangshunya <dangshunya@baichuan-inc.com> Co-authored-by: dangshunya <dangshunya@baichuan-inc.com> * [V1][Bugfix] Do not reset prefix caching metrics (vllm-project#14235) * [Model] New model support for Phi-4-multimodal-instruct (vllm-project#14119) * [V1] EP/TP MoE + DP Attention (vllm-project#13931) * [platforms] improve rocm debugging info (vllm-project#14257) * Temporarily disable test_awq_gemm_opcheck (vllm-project#14251) Signed-off-by: mgoin <mgoin64@gmail.com> * [Frontend] Allow return_tokens_as_token_ids to be passed as a request param (vllm-project#14066) Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai> * [Misc][V1] Avoid using `envs.VLLM_USE_V1` in mm processing (vllm-project#14256) Signed-off-by: Roger Wang <ywang@roblox.com> * [Bugfix][V1] Fix allowed_token_ids for v1 Sampler (vllm-project#14169) Signed-off-by: Lu Fang <lufang@fb.com> * [Doc] Update nginx guide: remove privileged from vllm container run and add target GPU ID (vllm-project#14217) Signed-off-by: Iacopo Poli <iacopo@lighton.ai> * [Doc] [3/N] Refer code examples for common cases in dev multimodal processor (vllm-project#14278) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> * Small update for external_launcher backend docs (vllm-project#14288) * [V1][Frontend] Add Testing For V1 Runtime Parameters (vllm-project#14159) Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com> * [LoRA] Remove linear hack outside transformers backend (vllm-project#14177) Signed-off-by: Isotr0py <2037008807@qq.com> * [Misc] Add Qwen2MoeForCausalLM moe tuning support (vllm-project#14276) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> * prefix_caching.md: Fixed typo (vllm-project#14293) Signed-off-by: Daivid Savernin-Frenk <daivid.frank@TurboNext.ai> * [Bugfix] Fix broken vision language example (vllm-project#14292) Signed-off-by: Isotr0py <2037008807@qq.com> * [Docs] Add Meta Slides (vllm-project#14297) Signed-off-by: simon-mo <simon.mo@hey.com> * [V1][Minor] Remove obsolete FIXME comment (vllm-project#14304) Signed-off-by: Nick Hill <nhill@redhat.com> * Deprecate `best_of` Sampling Parameter in anticipation for vLLM V1 (vllm-project#13997) Signed-off-by: vincent-4 <vincentzhongy+githubvincent4@gmail.com> Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca> Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca> Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> * [V1][BugFix] Fix for mixed top_k batch (vllm-project#14301) Signed-off-by: Nick Hill <nhill@redhat.com> Co-authored-by: Ye Cao <caoye.cao@alibaba-inc.com> * [misc] Add FlashMLA as a new option of VLLM_ATTENTION_BACKEND env (vllm-project#14267) * [V1][Easy] Add empty allowed_token_ids in the v1 sampler test (vllm-project#14308) Signed-off-by: Lu Fang <lufang@fb.com> * init Signed-off-by: Sage Moore <sage@neuralmagic.com> * [Bugfix] Fix DeepSeek MTP crash when using TP1ModelRunner with CUDA graph due to shape mismatch (vllm-project#14237) Signed-off-by: pyc96 <pychen96@gmail.com> * [Bugfix] Remove num_tokens_across_dp (vllm-project#14302) Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> * [BugFix] Fix prefix caching V0 MLA (vllm-project#14255) Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com> Co-authored-by: Ying Zhong <zhongyingmatrix@gmail.com> * [CI/Build] Use spawn multiprocessing mode for V1 test pipeline (vllm-project#14243) Signed-off-by: Russell Bryant <rbryant@redhat.com> * Add benchmark for DeepGEMM and vLLM Block FP8 Dense GEMM (vllm-project#13917) Signed-off-by: mgoin <mgoin64@gmail.com> * [Build] Add UV_HTTP_TIMEOUT to avoid timeout during installation (vllm-project#13850) Signed-off-by: Yuan Tang <terrytangyuan@gmail.com> * [BugFix] MLA + V1, illegal memory access and accuracy issues (vllm-project#14253) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> * [misc] Mention `ray list nodes` command to troubleshoot ray issues (vllm-project#14318) Signed-off-by: Rui Qiao <ruisearch42@gmail.com> * [Bugfix][Structured Output] Support outlines engine with reasoning outputs for DeepSeek R1 (vllm-project#14114) * [V1] LoRA - Enable more V1 tests (vllm-project#14315) Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com> Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com> * [Bugfix][CI] ALiBi test case in xformers multi_query_kv_attention (vllm-project#11301) * [Hardware] Update the flash attn tag to support Blackwell (vllm-project#14244) * [Model] Update Paligemma multimodal processing with PromptUpdate (vllm-project#14015) Signed-off-by: Kyle Huang <kylhuang@nvidia.com> Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk> * [V1][VLM][Pixtral-HF] Support Pixtral-HF on V1 (vllm-project#14275) Signed-off-by: Linkun Chen <github@lkchen.net> * [Core] Optimizing cross-attention `QKVParallelLinear` computation (vllm-project#12325) Signed-off-by: NickLucche <nlucches@redhat.com> Signed-off-by: NickLucche <nick@nlucches-4xa100.c.openshift-330514.internal> Co-authored-by: NickLucche <nick@nlucches-4xa100.c.openshift-330514.internal> * [Frontend][Docs] Transcription API streaming (vllm-project#13301) Signed-off-by: NickLucche <nlucches@redhat.com> * [Doc] Update reasoning with stream example to use OpenAI library (vllm-project#14077) Signed-off-by: liuyanyi <wolfsonliu@163.com> * [Doc] Correct beam_search using in generative_models.md (vllm-project#14363) * [Kernel] [V1] Improved performance for V1 Triton (ROCm) backend (vllm-project#14152) * [Bugfix][Core] fix abort_seq_group and memory leak when n>1 (vllm-project#14326) Signed-off-by: courage17340 <courage17340@163.com> * [Core] Don't use cache during multi-modal profiling (vllm-project#14336) * [Doc] Fix date typo in README.md (vllm-project#14366) Signed-off-by: Jitse Klomp <jitse.klomp@conclusionxforce.nl> * [RLHF] use worker_extension_cls for compatibility with V0 and V1 (vllm-project#14185) Signed-off-by: youkaichao <youkaichao@gmail.com> * Reinstate `best_of` for V0 (vllm-project#14356) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> * Adding cpu inference with VXE ISA for s390x architecture (vllm-project#12613) Signed-off-by: Dilip Gowda Bhagavan <dilip.bhagavan@ibm.com> Signed-off-by: Rishika Kedia <rishika.kedia@in.ibm.com> Co-authored-by: Rishika Kedia <rishika.kedia@in.ibm.com> * Add authors to license header. (vllm-project#14371) Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com> Co-authored-by: Burkhard Ringlein <ngl@zurich.ibm.com> Co-authored-by: Jan van Lunteren <jvl@zurich.ibm.com> * Fix mla prefill context performance (vllm-project#13897) Signed-off-by: ZhongYingMatrix <zhongyingmatrix@gmail.com> * [V1] Do not detokenize if sampling param detokenize is False (vllm-project#14224) Signed-off-by: Himanshu Jaju <hj@mistral.ai> Signed-off-by: Nick Hill <nhill@redhat.com> Co-authored-by: Nick Hill <nhill@redhat.com> * [Distributed] Add enable_expert_parallel arg (vllm-project#14305) Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> * [CI/Build] Use uv python for docker rather than ppa:deadsnakes/ppa (vllm-project#13569) Signed-off-by: mgoin <mgoin64@gmail.com> * [CI] Disable spawn when running V1 Test (vllm-project#14345) Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com> * [Kernel] Add needs_fixed_stride_order tag to most GEMMs (vllm-project#14306) Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> * [Bugfix] Fix use_direct_call condition in FusedMoE layer for (vllm-project#14382) Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> * [Bug] Fix Attention when ignored in by quant_method (vllm-project#14313) Signed-off-by: mgoin <mgoin64@gmail.com> * [V1][Bugfix] Standardize quantized kv cache rejection for attention backends (vllm-project#14221) Signed-off-by: mgoin <mgoin64@gmail.com> * [Docs] Add nsight guide to profiling docs (vllm-project#14298) Signed-off-by: mgoin <mgoin64@gmail.com> * cleanup boolean logic Signed-off-by: Sage Moore <sage@neuralmagic.com> * [Hardware][TPU]Enable ragged paged attention kernel and resolve recompilation issue (vllm-project#14310) Signed-off-by: Chengji Yao <chengjiyao@google.com> * [Doc] Fix a typo (vllm-project#14385) * [Bugfix] Correctly call `cudaProfilerStop` in benchmarks script (vllm-project#14183) Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca> * [Perf] Reduce MLA CPU overheads in V1 (vllm-project#14384) Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com> Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> * [FP8] Refactor apply_fp8_linear and apply_fp8_linear_generic into an object (vllm-project#14390) Signed-off-by: luka <luka@neuralmagic.com> * [BugFix] Illegal Memory Access in the blockwise cutlass fp8 GEMMs (vllm-project#14396) * [Bugfix] Fix JambaForCausalLM LoRA (vllm-project#14370) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> * [Build] Add nightly wheel fallback when latest commit wheel unavailable (vllm-project#14358) Signed-off-by: Isotr0py <2037008807@qq.com> * OpenVINO: added CPU-like conditions (vllm-project#14338) Signed-off-by: Ilya Lavrenov <ilya.lavrenov@intel.com> * [GH] Auto-apply multi-modality label to relevant PRs (vllm-project#14402) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> * correct wrong markdown syntax (vllm-project#14414) Signed-off-by: vincent-pli <justdoit.pli@gmail.com> * [Bugfix] Further clean up LoRA test (vllm-project#14422) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> * [Bugfix] Clean up multi-modal processors (vllm-project#14417) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> * [Misc] Set default value of seed to None (vllm-project#14274) Signed-off-by: மனோஜ்குமார் பழனிச்சாமி <smartmanoj42857@gmail.com> * [BUGFIX] Skip tokenization support for throughput benchmark (vllm-project#12712) Signed-off-by: root <root@banff-cyxtera-s73-5.ctr.dcgpu> Signed-off-by: Aleksandr Malyshev <maleksan@amd.com> Co-authored-by: root <root@banff-cyxtera-s73-5.ctr.dcgpu> Co-authored-by: Aleksandr Malyshev <maleksan@amd.com> * Fix missing `kv_caches` and `attn_metadata` in `OpenVINOCausalLM` (vllm-project#14271) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> * Use the optimized block sizes after tuning the kernel. (vllm-project#14329) * [V1][Core] Support for Structured Outputs (vllm-project#12388) Signed-off-by: Aaron Pham <contact@aarnphm.xyz> Signed-off-by: Russell Bryant <rbryant@redhat.com> Co-authored-by: Russell Bryant <rbryant@redhat.com> Co-authored-by: Michael Goin <mgoin64@gmail.com> Co-authored-by: Nick Hill <nhill@redhat.com> * [Doc] Update prefix_caching.md to match the example image (vllm-project#14420) * [Benchmarks] Make detokenization optional in benchmark scripts (vllm-project#11697) Signed-off-by: Jeremy Arnold <Jeremy.Arnold@amd.com> * comments Signed-off-by: Sage Moore <sage@neuralmagic.com> * [Kernel] optimize performance of gptq marlin kernel when n is small (vllm-project#14138) Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com> * [Misc] Add Phi4-MM example (vllm-project#14343) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> * [v1] torch.compile integration explanation (vllm-project#14437) Signed-off-by: youkaichao <youkaichao@gmail.com> * [V1] Eagerly remove finished requests from the batch (vllm-project#14388) Signed-off-by: Nick Hill <nhill@redhat.com> * [V1][Metrics] Fix traceback with preemptions+LoRA (vllm-project#14220) Signed-off-by: Mark McLoughlin <markmc@redhat.com> * [Bugfix] Fix torch_xla which can't handle None seed introduced in vllm-project#14274 (vllm-project#14459) Signed-off-by: Yarong Mu <ymu@google.com> * [V1] Prompt logprobs + APC compatibility; prompt logprobs reqs cannot fill APC (vllm-project#13949) * [Bugfix][V1] Handle MLA in kv_cache_interface (vllm-project#14462) Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> * Revert "[Perf] Reduce MLA CPU overheads in V1 (vllm-project#14384)" (vllm-project#14471) * [Bugfix][Disaggregated] Add a check in send_kv_caches_and_hidden_states and fix the reshape of the KVCache (vllm-project#14369) Signed-off-by: Mathis Felardos <mathis@mistral.ai> * [MISC][V1] Register process killing handler only in the main thread (vllm-project#14380) Signed-off-by: Cody Yu <hao.yu.cody@gmail.com> * [core] add `extra_args` to `SamplingParams` (vllm-project#13300) Signed-off-by: Aviv Keshet <akeshet@scaledcognition.com> * [CI/Build] refactor: set timezone of container to UTC (vllm-project#12888) Signed-off-by: Roger Meier <r.meier@siemens.com> * Default to `generation_config` from model (vllm-project#12622) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> * [Doc]add doc for Qwen models tool calling (vllm-project#14478) Signed-off-by: WangErXiao <863579016@qq.com> * [Doc] Added QwQ-32B to the supported models list in the reasoning out… (vllm-project#14479) Signed-off-by: WangErXiao <863579016@qq.com> * [Bugfix] Make the deviceprofiler include LoRA memory. (vllm-project#14469) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> * Add training doc signposting to TRL (vllm-project#14439) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> * [Build/BugFix] Fix hopper 12.8 build (vllm-project#14354) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com> Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com> * Add RLHF document (vllm-project#14482) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> * [CI/Build] Use a fixed seed to avoid flaky tests (vllm-project#14480) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> * [V1] TPU - Add tensor parallel support via Ray (vllm-project#13618) Signed-off-by: Alexander Matveev <amatveev@redhat.com> * [VLM] Add TP support for Phi-4-MM (vllm-project#14453) Signed-off-by: Isotr0py <2037008807@qq.com> * [Misc] add `use_tqdm_on_load` to reduce logs (vllm-project#14407) Signed-off-by: Aaron Pham <contact@aarnphm.xyz> * [V1][Core] Fix memory issue with logits & sampling (vllm-project#13776) Signed-off-by: Roger Wang <ywang@roblox.com> * [benchmarks] Add option to use unique jsonschema for each request (vllm-project#14457) Signed-off-by: Russell Bryant <rbryant@redhat.com> * [Misc] Don't run ruff at all on 3rd party libs (vllm-project#14493) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> * Move requirements into their own directory (vllm-project#12547) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> * [Bugfix] DeepSeek Accuracy (vllm-project#14476) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> * [Bugfix] Fix profiling OOM and decouple encoder multimodal profiling (vllm-project#14361) Signed-off-by: Isotr0py <2037008807@qq.com> * Update CODEOWNERS for structured output (vllm-project#14496) Signed-off-by: Russell Bryant <rbryant@redhat.com> * [Misc] Upgrade to Python 3.9 typing for additional directories (vllm-project#14492) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> * [V1] Support bad_words in sampler (vllm-project#13376) Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com> Co-authored-by: Nick Hill <nhill@redhat.com> * Revert "[V1][Core] Fix memory issue with logits & sampling" (vllm-project#14504) Signed-off-by: Roger Wang <ywang@roblox.com> Co-authored-by: Roger Wang <ywang@roblox.com> * [Attention] Default to FlashMLA backend for MLA (vllm-project#14451) Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com> Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com> * [V1][TPU] Remove unnecessary padding for running on TPU. (vllm-project#14467) * [Feat] Support chunked prefill for LMCache connector (vllm-project#14505) Signed-off-by: YaoJiayi <120040070@link.cuhk.edu.cn> * [Bugfix] Fix tqdm progress bar when SamplingParams.n > 1 (vllm-project#12428) Signed-off-by: Yuchen Yan <740987012@qq.com> * [Bugfix] Revert QKVCrossParallelLinear usage in Mllama to keep BNB quantization work (vllm-project#14498) Signed-off-by: Isotr0py <2037008807@qq.com> * [Hardware][TPU] Fix the recompiling issue in logits processor after warmup (vllm-project#14510) Signed-off-by: Chengji Yao <chengjiyao@google.com> * [Misc] Ensure out-of-tree quantization method recognize by cli args (vllm-project#14328) Signed-off-by: liuyanyi <wolfsonliu@163.com> * [Bugfix] Wrong requirements path - rocm (vllm-project#14527) Signed-off-by: Martin Hoyer <mhoyer@redhat.com> * [Feature] Consolidate performance benchmark datasets (vllm-project#14036) Signed-off-by: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com> Signed-off-by: Roger Wang <ywang@roblox.com> Co-authored-by: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com> Co-authored-by: Roger Wang <ywang@roblox.com> * [Misc] Add log information for handle_process_request. (vllm-project#14130) Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com> * [Docs] Mention `model_impl` arg when explaining Transformers fallback (vllm-project#14552) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> * [Frontend] support image embeds (vllm-project#13955) Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com> * [Kernel] Add more dtype support for GGUF kernels (vllm-project#14043) Signed-off-by: SzymonOzog <szymon.ozog@aleph-alpha.com> Signed-off-by: SzymonOzog <szymon.ozog@gmail.com> * [Doc] Update PaliGemma note to a warning (vllm-project#14565) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> * V1 rocm support (#469) * Initial commit for V1 successfull compilation * Small improvement for linear * Small improvement for linear * making use of forward_cuda for all except ROPE in llama --------- Co-authored-by: maleksan85 <maleksan@amd.com> * nightly_fixed_aiter_integration_final_20250305 README update (#470) * nightly_fixed_aiter_integration_final_20250305 README update (perf results only) * Update Docker Manifest git hash * Update Docker Manifest and added nightly_fixed_aiter_integration_final_20250305 * some more updates * Update AITER section with example * Updated AITER command with larger batch size and model name * Fixing typo * Removed --max-model-len in AITER command * Updating AITER instructions * typo * Another typo * Whitespace * modifying whats new section * Another typo --------- Co-authored-by: arakowsk-amd <182798202+arakowsk-amd@users.noreply.github.com> Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com> --------- Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Signed-off-by: Mark McLoughlin <markmc@redhat.com> Signed-off-by: Nick Hill <nhill@redhat.com> Signed-off-by: mgoin <mgoin64@gmail.com> Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com> Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com> Signed-off-by: Divakar Verma <divakar.verma@amd.com> Signed-off-by: Cody Yu <hao.yu.cody@gmail.com> Signed-off-by: Rui Qiao <ruisearch42@gmail.com> Signed-off-by: youkaichao <youkaichao@gmail.com> Signed-off-by: Siyuan Liu <lsiyuan@google.com> Signed-off-by: KuntaiDu <kuntai@uchicago.edu> Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com> Signed-off-by: Sage Moore <sage@neuralmagic.com> Signed-off-by: Michael Goin <mgoin64@gmail.com> Signed-off-by: Nishidha Panpaliya <nishidha.panpaliya@partner.ibm.com> Signed-off-by: dangshunya <dangshunya@baichuan-inc.com> Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai> Signed-off-by: Roger Wang <ywang@roblox.com> Signed-off-by: Lu Fang <lufang@fb.com> Signed-off-by: Iacopo Poli <iacopo@lighton.ai> Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com> Signed-off-by: Isotr0py <2037008807@qq.com> Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> Signed-off-by: Daivid Savernin-Frenk <daivid.frank@TurboNext.ai> Signed-off-by: simon-mo <simon.mo@hey.com> Signed-off-by: vincent-4 <vincentzhongy+githubvincent4@gmail.com> Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca> Signed-off-by: pyc96 <pychen96@gmail.com> Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com> Signed-off-by: Russell Bryant <rbryant@redhat.com> Signed-off-by: Yuan Tang <terrytangyuan@gmail.com> Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com> Signed-off-by: Kyle Huang <kylhuang@nvidia.com> Signed-off-by: Linkun Chen <github@lkchen.net> Signed-off-by: NickLucche <nlucches@redhat.com> Signed-off-by: NickLucche <nick@nlucches-4xa100.c.openshift-330514.internal> Signed-off-by: liuyanyi <wolfsonliu@163.com> Signed-off-by: courage17340 <courage17340@163.com> Signed-off-by: Jitse Klomp <jitse.klomp@conclusionxforce.nl> Signed-off-by: Dilip Gowda Bhagavan <dilip.bhagavan@ibm.com> Signed-off-by: Rishika Kedia <rishika.kedia@in.ibm.com> Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com> Signed-off-by: ZhongYingMatrix <zhongyingmatrix@gmail.com> Signed-off-by: Himanshu Jaju <hj@mistral.ai> Signed-off-by: Chengji Yao <chengjiyao@google.com> Signed-off-by: luka <luka@neuralmagic.com> Signed-off-by: Ilya Lavrenov <ilya.lavrenov@intel.com> Signed-off-by: vincent-pli <justdoit.pli@gmail.com> Signed-off-by: மனோஜ்குமார் பழனிச்சாமி <smartmanoj42857@gmail.com> Signed-off-by: root <root@banff-cyxtera-s73-5.ctr.dcgpu> Signed-off-by: Aleksandr Malyshev <maleksan@amd.com> Signed-off-by: Aaron Pham <contact@aarnphm.xyz> Signed-off-by: Jeremy Arnold <Jeremy.Arnold@amd.com> Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com> Signed-off-by: Yarong Mu <ymu@google.com> Signed-off-by: Mathis Felardos <mathis@mistral.ai> Signed-off-by: Aviv Keshet <akeshet@scaledcognition.com> Signed-off-by: Roger Meier <r.meier@siemens.com> Signed-off-by: WangErXiao <863579016@qq.com> Signed-off-by: Alexander Matveev <amatveev@redhat.com> Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com> Signed-off-by: YaoJiayi <120040070@link.cuhk.edu.cn> Signed-off-by: Yuchen Yan <740987012@qq.com> Signed-off-by: Martin Hoyer <mhoyer@redhat.com> Signed-off-by: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com> Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com> Signed-off-by: SzymonOzog <szymon.ozog@aleph-alpha.com> Signed-off-by: SzymonOzog <szymon.ozog@gmail.com> Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Co-authored-by: Mark McLoughlin <markmc@redhat.com> Co-authored-by: Nick Hill <nhill@redhat.com> Co-authored-by: Michael Goin <michael@neuralmagic.com> Co-authored-by: Qubitium-ModelCloud <qubitium@modelcloud.ai> Co-authored-by: mgoin <mgoin64@gmail.com> Co-authored-by: iefgnoix <isaacwxf23@gmail.com> Co-authored-by: Travis Johnson <tsjohnso@us.ibm.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> Co-authored-by: Divakar Verma <137818590+divakar-amd@users.noreply.github.com> Co-authored-by: Cody Yu <hao.yu.cody@gmail.com> Co-authored-by: Rui Qiao <161574667+ruisearch42@users.noreply.github.com> Co-authored-by: Zhanwen Chen <phil.zhanwen.chen@gmail.com> Co-authored-by: youkaichao <youkaichao@gmail.com> Co-authored-by: lkchen <github@lkchen.net> Co-authored-by: kushanam <42385577+kushanam@users.noreply.github.com> Co-authored-by: Siyuan Liu <lsiyuan@google.com> Co-authored-by: Kuntai Du <kuntai@uchicago.edu> Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com> Co-authored-by: Sage Moore <sage@neuralmagic.com> Co-authored-by: Nishidha <nishidha.panpaliya@partner.ibm.com> Co-authored-by: rainkert <93575312+rainkert@users.noreply.github.com> Co-authored-by: dangshunya <dangshunya@baichuan-inc.com> Co-authored-by: Congcong Chen <congcongchen@microsoft.com> Co-authored-by: Benjamin Chislett <benjamin.chislett@centml.ai> Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com> Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com> Co-authored-by: Iacopo Poli <iacopo@lighton.ai> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk> Co-authored-by: Zhe Zhang <zhz@apache.org> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn> Co-authored-by: Jee Jee Li <pandaleefree@gmail.com> Co-authored-by: DaividFrank <49250948+DaividFrank@users.noreply.github.com> Co-authored-by: Simon Mo <simon.mo@hey.com> Co-authored-by: Vincent <vincentzhongy+githubvincent4@gmail.com> Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca> Co-authored-by: Ye Cao <caoye.cao@alibaba-inc.com> Co-authored-by: Serena <yangsijia.614@bytedance.com> Co-authored-by: pyc96 <pychen96@gmail.com> Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com> Co-authored-by: Ying Zhong <zhongyingmatrix@gmail.com> Co-authored-by: Russell Bryant <rbryant@redhat.com> Co-authored-by: Yuan Tang <terrytangyuan@gmail.com> Co-authored-by: Ce Gao <cegao@tensorchord.ai> Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com> Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com> Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com> Co-authored-by: Pavani Majety <pmajety@nvidia.com> Co-authored-by: kYLe <kylhuang@nvidia.com> Co-authored-by: NickLucche <nick@nlucches-4xa100.c.openshift-330514.internal> Co-authored-by: Yanyi Liu <wolfsonliu@163.com> Co-authored-by: Irina Yuryeva <76484191+upayuryeva@users.noreply.github.com> Co-authored-by: Thomas Parnell <tpa@zurich.ibm.com> Co-authored-by: courage17340 <courage17340@users.noreply.github.com> Co-authored-by: Jitse Klomp <jitse.klomp@conclusionxforce.nl> Co-authored-by: Dilip Gowda Bhagavan <110233170+dilipgb@users.noreply.github.com> Co-authored-by: Rishika Kedia <rishika.kedia@in.ibm.com> Co-authored-by: Burkhard Ringlein <ngl@zurich.ibm.com> Co-authored-by: Jan van Lunteren <jvl@zurich.ibm.com> Co-authored-by: Himanshu Jaju <hj@mistral.ai> Co-authored-by: Chengji Yao <chengjiyao@google.com> Co-authored-by: Daniel Li <dyli@google.com> Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com> Co-authored-by: Ilya Lavrenov <ilya.lavrenov@intel.com> Co-authored-by: Peng Li <justdoit.pli@gmail.com> Co-authored-by: மனோஜ்குமார் பழனிச்சாமி <smartmanoj42857@gmail.com> Co-authored-by: Aleksandr Malyshev <164964928+maleksan85@users.noreply.github.com> Co-authored-by: root <root@banff-cyxtera-s73-5.ctr.dcgpu> Co-authored-by: Aleksandr Malyshev <maleksan@amd.com> Co-authored-by: Aaron Pham <contact@aarnphm.xyz> Co-authored-by: York-RDWang <103811994+York-RDWang@users.noreply.github.com> Co-authored-by: Jeremy Arnold <103538711+JArnoldAMD@users.noreply.github.com> Co-authored-by: Jinzhen Lin <linjinzhen@hotmail.com> Co-authored-by: yarongmu-google <150371854+yarongmu-google@users.noreply.github.com> Co-authored-by: afeldman-nm <156691304+afeldman-nm@users.noreply.github.com> Co-authored-by: Mathis Felardos <mathis@mistral.ai> Co-authored-by: Aviv Keshet <akeshet@scaledcognition.com> Co-authored-by: Roger Meier <r.meier@siemens.com> Co-authored-by: Robin <863579016@qq.com> Co-authored-by: Alexander Matveev <59768536+alexm-redhat@users.noreply.github.com> Co-authored-by: 22quinn <33176974+22quinn@users.noreply.github.com> Co-authored-by: Roger Wang <ywang@roblox.com> Co-authored-by: Jiayi Yao <82156730+YaoJiayi@users.noreply.github.com> Co-authored-by: Yuchen Yan <50619811+yanyc428@users.noreply.github.com> Co-authored-by: Martin Hoyer <mhoyer@redhat.com> Co-authored-by: Jennifer Zhao <JenZhao@users.noreply.github.com> Co-authored-by: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com> Co-authored-by: Chauncey <chaunceyjiang@gmail.com> Co-authored-by: Szymon Ożóg <58388001+SzymonOzog@users.noreply.github.com> Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com> Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com> Co-authored-by: Mcirino1 <57415822+Mcirino1@users.noreply.github.com> Co-authored-by: arakowsk-amd <182798202+arakowsk-amd@users.noreply.github.com>

…-project#14152)

…-project#14152) Signed-off-by: Louis Ulmer <ulmerlouis@gmail.com>

tdoublep and others added 7 commits February 28, 2025 08:32

Working changes

c21782b

Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>

Use ibm_triton_lib

00bafc0

Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>

working changes

152e234

Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>

making backend enabled via env variable

c7087bc

Signed-off-by: Burkhard Ringlein <ngl@zurich.ibm.com>

cleanup

576e1b1

Signed-off-by: Burkhard Ringlein <ngl@zurich.ibm.com>

resolve conflicts

00c0251

Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>

Fix merge error

e9485d2

Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>

tdoublep requested review from WoosukKwon, robertgshaw2-redhat, njhill, ywang96, comaniac and alexm-redhat as code owners March 3, 2025 19:23

mergify bot added the v1 label Mar 3, 2025

hongxiayang added the rocm Related to AMD ROCm label Mar 3, 2025

Add chunked_prefill_paged_decode kernel.

8a7a883

Co-authored-by: Burkhard Ringlein <ngl@zurich.ibm.com> Co-authored-by: Jan van Lunteren <jvl@zurich.ibm.com> Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>

Added unit tests for chunked_prefill_paged_decode

dc1029a

Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>

tdoublep requested a review from tlrmchlsmth as a code owner March 4, 2025 22:17

SageMoore approved these changes Mar 4, 2025

View reviewed changes

maleksan85 reviewed Mar 4, 2025

View reviewed changes

vllm/attention/ops/chunked_prefill_paged_decode.py Outdated Show resolved Hide resolved

maleksan85 reviewed Mar 4, 2025

View reviewed changes

vllm/attention/ops/chunked_prefill_paged_decode.py Outdated Show resolved Hide resolved

maleksan85 reviewed Mar 4, 2025

View reviewed changes

vllm/attention/ops/chunked_prefill_paged_decode.py Outdated Show resolved Hide resolved

tdoublep changed the title ~~[Kernel] Improved performance for V1 Triton (ROCm) backend~~ [Kernel] [V1] Improved performance for V1 Triton (ROCm) backend Mar 5, 2025

Address review comments

b3b873b

Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>

tlrmchlsmth reviewed Mar 5, 2025

View reviewed changes

Revert changes in cuda platform.

5baa879

Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>

tlrmchlsmth reviewed Mar 5, 2025

View reviewed changes

tlrmchlsmth approved these changes Mar 5, 2025

View reviewed changes

tlrmchlsmth added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 6, 2025

tlrmchlsmth enabled auto-merge (squash) March 6, 2025 01:42

tdoublep and others added 2 commits March 6, 2025 02:30

Merge branch 'main' into tpa-v1-clean

eb5a8d1

Ensuring co-authors survive squash.

16500bf

Co-authored-by: Burkhard Ringlein <ngl@zurich.ibm.com> Co-authored-by: Jan van Lunteren <jvl@zurich.ibm.com> Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>

vllm-bot merged commit 6bd1dd9 into vllm-project:main Mar 6, 2025
35 of 37 checks passed

tdoublep mentioned this pull request Mar 6, 2025

Add authors to license header. #14371

Merged

tdoublep deleted the tpa-v1-clean branch March 6, 2025 16:48

captainzmc pushed a commit to captainzmc/vllm that referenced this pull request Mar 12, 2025

[Kernel] [V1] Improved performance for V1 Triton (ROCm) backend (vllm…

fe862db

…-project#14152)

tlrmchlsmth mentioned this pull request Mar 14, 2025

[ROCm] Enable custom paged attention kernel for Navi3/4 #13843

Closed

lulmer pushed a commit to lulmer/vllm that referenced this pull request Apr 7, 2025

[Kernel] [V1] Improved performance for V1 Triton (ROCm) backend (vllm…

c78f96c

…-project#14152) Signed-off-by: Louis Ulmer <ulmerlouis@gmail.com>

ckhordiasma mentioned this pull request Apr 17, 2025

[do not merge] pr test for nm changes into 2.20 red-hat-data-services/vllm#107

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Kernel] [V1] Improved performance for V1 Triton (ROCm) backend #14152

[Kernel] [V1] Improved performance for V1 Triton (ROCm) backend #14152

tdoublep commented Mar 3, 2025 •

edited by github-actions bot

Loading

github-actions bot commented Mar 3, 2025

SageMoore commented Mar 3, 2025 •

edited

Loading

bringlein commented Mar 4, 2025

tdoublep commented Mar 4, 2025 •

edited

Loading

maleksan85 commented Mar 4, 2025 •

edited

Loading

tdoublep commented Mar 4, 2025

maleksan85 commented Mar 4, 2025 •

edited

Loading

SageMoore left a comment

SageMoore Mar 4, 2025

bringlein Mar 5, 2025

tdoublep commented Mar 5, 2025

SageMoore commented Mar 5, 2025

tlrmchlsmth Mar 5, 2025

tdoublep Mar 5, 2025 •

edited

Loading

tdoublep Mar 5, 2025 •

edited

Loading

tlrmchlsmth Mar 5, 2025

tdoublep Mar 5, 2025

SageMoore commented Mar 5, 2025

tlrmchlsmth Mar 5, 2025

tdoublep Mar 5, 2025

tlrmchlsmth Mar 5, 2025

SageMoore Mar 5, 2025

tlrmchlsmth Mar 5, 2025

tlrmchlsmth left a comment

tlrmchlsmth Mar 5, 2025

tdoublep commented Mar 6, 2025

bringlein commented Mar 6, 2025

[Kernel] [V1] Improved performance for V1 Triton (ROCm) backend #14152

[Kernel] [V1] Improved performance for V1 Triton (ROCm) backend #14152

Conversation

tdoublep commented Mar 3, 2025 • edited by github-actions bot Loading

Performance

Correctness

How is this performance improvement achieved?

What is ibm_triton_lib?

github-actions bot commented Mar 3, 2025

SageMoore commented Mar 3, 2025 • edited Loading

bringlein commented Mar 4, 2025

tdoublep commented Mar 4, 2025 • edited Loading

maleksan85 commented Mar 4, 2025 • edited Loading

tdoublep commented Mar 4, 2025

maleksan85 commented Mar 4, 2025 • edited Loading

SageMoore left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tdoublep commented Mar 5, 2025

SageMoore commented Mar 5, 2025

Choose a reason for hiding this comment

tdoublep Mar 5, 2025 • edited Loading

Choose a reason for hiding this comment

tdoublep Mar 5, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

SageMoore commented Mar 5, 2025

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tlrmchlsmth left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tdoublep commented Mar 6, 2025

bringlein commented Mar 6, 2025

tdoublep commented Mar 3, 2025 •

edited by github-actions bot

Loading

What is `ibm_triton_lib`?

SageMoore commented Mar 3, 2025 •

edited

Loading

tdoublep commented Mar 4, 2025 •

edited

Loading

maleksan85 commented Mar 4, 2025 •

edited

Loading

maleksan85 commented Mar 4, 2025 •

edited

Loading

tdoublep Mar 5, 2025 •

edited

Loading

tdoublep Mar 5, 2025 •

edited

Loading