Add support to modelopt quantization of Mixtral model #15961
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Force-pushed from a1e9348 to cd7c625
This seems to add support for KV cache quantization to Mixtral. However, there is no MoE layer registered for modelopt to use, so what happens when the MoE layers are quantized?
vllm/vllm/model_executor/layers/quantization/modelopt.py
Lines 222 to 231 in 463bbb1

```python
def get_quant_method(self, layer: torch.nn.Module,
                     prefix: str) -> Optional["QuantizeMethodBase"]:
    from vllm.attention.layer import Attention  # Avoid circular import
    if isinstance(layer, LinearBase):
        if is_layer_skipped(prefix, self.exclude_modules):
            return UnquantizedLinearMethod()
        return ModelOptNvFp4LinearMethod(self)
    elif isinstance(layer, Attention):
        return ModelOptFp8KVCacheMethod(self)
    return None
```
Hi @mgoin, the purpose of this PR is to do key remapping. As for the MoE layer, for each expert, an input_scale and a weight_scale are added to it.
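For illustration, here is a minimal sketch of the kind of per-expert key remapping described above. It is not code from this PR: the checkpoint key pattern (`block_sparse_moe.experts.N.w1/w2/w3`) and the fused target names (`w13_input_scale`, `w2_weight_scale`, etc.) are assumptions about Mixtral/vLLM naming, used only to show the idea of folding per-expert scales into fused MoE parameters.

```python
# Hypothetical sketch of per-expert scale remapping; key names are assumptions.
import re
from typing import Optional, Tuple


def remap_expert_scale(name: str) -> Optional[Tuple[str, int]]:
    """Map a per-expert ModelOpt scale key, e.g.
    'model.layers.3.block_sparse_moe.experts.7.w1.input_scale',
    to a fused-expert parameter name plus the expert index."""
    match = re.match(
        r"(.*\.block_sparse_moe)\.experts\.(\d+)\.(w1|w2|w3)\.(input_scale|weight_scale)$",
        name)
    if match is None:
        return None
    prefix, expert_idx, proj, scale = match.groups()
    # w1/w3 feed the fused gate/up projection ("w13"); w2 is the down projection.
    fused = "w13" if proj in ("w1", "w3") else "w2"
    return f"{prefix}.experts.{fused}_{scale}", int(expert_idx)


print(remap_expert_scale(
    "model.layers.3.block_sparse_moe.experts.7.w1.input_scale"))
# -> ('model.layers.3.block_sparse_moe.experts.w13_input_scale', 7)
```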
LGTM.
@mgoin This is a Python change; I don't see why the Docker build is failing. Can we restart the CI?
I think this PR was made before we made changes to the Docker image. Can you please merge with main?
Force-pushed from cd7c625 to 395ad17
Signed-off-by: Yue <yueshen@nvidia.com>
Force-pushed from 395ad17 to af3a071
…5961) Signed-off-by: Yue <yueshen@nvidia.com>
…5961) Signed-off-by: Yue <yueshen@nvidia.com> Signed-off-by: Yang Wang <elainewy@meta.com>
Add vLLM support for modelopt FP8 quantization of the Mixtral 8x7B model.
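As a quick illustration of the end result, here is a minimal usage sketch, not taken from the PR. It assumes a Mixtral 8x7B checkpoint that has already been quantized and exported with NVIDIA ModelOpt at a hypothetical local path, and that the `modelopt` quantization backend is selected explicitly.

```python
# Minimal sketch: serve a ModelOpt-quantized Mixtral 8x7B checkpoint with vLLM.
# The model path is hypothetical; adjust to a real ModelOpt-exported checkpoint.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/path/to/mixtral-8x7b-instruct-fp8-modelopt",  # hypothetical path
    quantization="modelopt",        # select vLLM's ModelOpt quantization config
    tensor_parallel_size=2,         # Mixtral 8x7B typically needs multiple GPUs
)
params = SamplingParams(temperature=0.0, max_tokens=64)
out = llm.generate(["Explain FP8 KV-cache quantization in one sentence."], params)
print(out[0].outputs[0].text)
```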