[Model] RowParallelLinear: pass bias to quant_method.apply #6327
Conversation
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
CI failures are a bit weird (see #6332).
This looks good. I see that for the TP > 1 case we can't naively fuse the bias, since each rank will add the bias to its output and then the biases will be redundantly summed during the AllReduce.
Another thing to try is to pick one of the ranks to apply the bias. The overhead of applying a bias during a GEMM is low enough that we probably don't need to worry about load imbalance, and this will let every other rank skip a load and store of the activation tensor.
> which should lead to more use of the fused CUTLASS kernels.

FYI this won't lead to more use of the CUTLASS kernels. Currently they're used whenever there is no bias. The issue …
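For context, the dispatch being described looks roughly like this (a toy sketch with stand-in GEMMs, not vLLM's actual fp8 code):

```python
import torch

def apply_linear(x: torch.Tensor, weight: torch.Tensor,
                 bias: torch.Tensor = None,
                 cutlass_supported: bool = True) -> torch.Tensor:
    """Toy dispatch: the fused-kernel path is only taken when bias is None."""
    if bias is None and cutlass_supported:
        # Stand-in for the fused CUTLASS GEMM; with this PR, a fused bias
        # could also be handled inside this path.
        return x @ weight.t()
    # Fallback: plain GEMM followed by a separate (unfused) bias add,
    # i.e. an extra load and store over the activation tensor.
    out = x @ weight.t()
    if bias is not None:
        out = out + bias
    return out
```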
vllm/model_executor/layers/linear.py (outdated diff)

```diff
@@ -753,18 +753,23 @@ def forward(self, input_):
     # Matrix multiply.
     assert self.quant_method is not None
-    output_parallel = self.quant_method.apply(self, input_parallel)
+    bias_ = None if (self.tp_size > 1 or self.skip_bias_add) else self.bias
+    output_parallel = self.quant_method.apply(self, input_parallel, bias=bias_)
```
Could you add a comment explaining what happens when tp_size > 1? I think it should say something like:

```python
# Fuse bias into the GEMM in the TP == 1 case.
# The TP > 1 case is problematic, as the bias will be redundantly
# summed during the AllReduce if fused.
```
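A quick way to see the redundant summation with toy numbers (illustration only, not vLLM code):

```python
import torch

# TP=2 row-parallel linear: each rank holds a column shard of x and a row
# shard of w, computes a partial GEMM, and the AllReduce sums the partials.
torch.manual_seed(0)
x, w, b = torch.randn(3, 4), torch.randn(4, 2), torch.randn(2)

partial0 = x[:, :2] @ w[:2] + b   # rank 0 fuses the bias
partial1 = x[:, 2:] @ w[2:] + b   # rank 1 fuses it too
fused_both = partial0 + partial1  # "AllReduce": bias counted twice

reference = x @ w + b
print(torch.allclose(fused_both, reference))      # False
print(torch.allclose(fused_both, reference + b))  # True: off by exactly one bias
```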
Added a comment (although it reads a bit differently since the TP>1 case is now handled).
If the bias is fused into the GEMM, it will likely be added to the output at a higher precision. For example, in our CUTLASS fp8 kernels the bias add happens in fp32. On `main`, the output is converted to the output dtype before adding the bias. That could explain the test failure you're seeing.
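A toy illustration of that precision difference (not the CUTLASS kernel itself): adding the bias in fp32 before the downcast, versus after the output has already been cast to fp16, gives slightly different results, which can trip tight test tolerances.

```python
import torch

torch.manual_seed(0)
acc = torch.randn(1024, dtype=torch.float32) * 100   # fp32 accumulator
bias = torch.randn(1024, dtype=torch.float32)

fused = (acc + bias).to(torch.float16)                    # bias added in fp32
unfused = acc.to(torch.float16) + bias.to(torch.float16)  # bias added in fp16

print((fused.float() - unfused.float()).abs().max())  # small but nonzero
```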
Thanks, yes that explains why this change causes the tests to fail (since the same tests also fail in fp32). Now I need to figure out why the tests don't work in fp32; it is somehow related to the chunked prefill tests only.
OK, so the failing tests on this branch are now passing in fp32 but only after this unrelated bug is fixed: #6373 |
The offending tests actually also fail on `main`.
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
I like that idea - I implemented it and it makes the code even simpler.
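For reference, a sketch of what the simplified forward presumably looks like after applying that suggestion (the `tp_rank` check is inferred from the discussion above, not quoted from the final diff):

```python
# Matrix multiply.
assert self.quant_method is not None
# Fuse the bias add into the GEMM on rank 0 only, so that the bias is not
# summed redundantly across ranks during the subsequent AllReduce.
bias_ = None if (self.tp_rank > 0 or self.skip_bias_add) else self.bias
output_parallel = self.quant_method.apply(self, input_parallel, bias=bias_)
```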
Right, but they will be after we merge #6270
The CI tests should pass after we merge #6409
/ready
This looks good to me, although the spec decoding tests look suspicious, so I'll wait until those are resolved before accepting
Have you done any e2e benchmarking of e.g. Granite or Qwen?
I agree - will look into it.
Not yet, but hopefully tomorrow; will share once I have some data.
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
@tlrmchlsmth I fixed the CI issue by changing the failing TP=2 test to use …
LGTM
@tdoublep could you share perf results when you have them? Curious about what the impact will be. Thanks!
@tlrmchlsmth got sidetracked with a few things, but had some time to run benchmarks today. I ran the benchmarking using …

Here are the results: [benchmark figure omitted]

The results look as expected imo. cc @cyang49
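For anyone wanting to reproduce, a minimal end-to-end throughput check with vLLM's Python API might look like the following; the model name, batch size, and lengths are placeholders, not the exact setup behind the numbers above.

```python
import time
from vllm import LLM, SamplingParams

# Hypothetical setup: any bias-using model (e.g. a Granite or Qwen variant)
# exercises the RowParallelLinear bias path; TP=2 exercises the AllReduce.
llm = LLM(model="ibm-granite/granite-8b-code-instruct",
          tensor_parallel_size=2)
params = SamplingParams(temperature=0.0, max_tokens=128)

start = time.perf_counter()
outputs = llm.generate(["Explain tensor parallelism in one paragraph."] * 8,
                       params)
elapsed = time.perf_counter() - start
tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{tokens / elapsed:.1f} generated tok/s")
```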
Nice simplicity with the final version
You are right - have updated the comment + figure accordingly.
[Model] RowParallelLinear: pass bias to quant_method.apply (vllm-project#6327) Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
[Model] RowParallelLinear: pass bias to quant_method.apply (vllm-project#6327) Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com> Signed-off-by: Alvant <alvasian@yandex.ru>
[Model] RowParallelLinear: pass bias to quant_method.apply (vllm-project#6327) Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com> Signed-off-by: LeiWang1999 <leiwang1999@outlook.com>
I think for the TP=1 case we can pass the `bias` to the apply function, which should lead to more use of the fused CUTLASS kernels. cc @tlrmchlsmth @cyang49

Update: this is now implemented for TP>1 too.