Add support to modelopt quantization of Mixtral model #15961
Conversation

@yueshen2016 (Contributor) commented Apr 2, 2025

Add vLLM support for ModelOpt FP8 quantization of the Mixtral-8x7B model.

github-actions bot commented Apr 2, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@yueshen2016 yueshen2016 force-pushed the yueshen/modelopt-quantization-mixtral-support branch from a1e9348 to cd7c625 Compare April 2, 2025 19:01
@DarkLight1337 DarkLight1337 requested a review from mgoin April 3, 2025 08:58
@mgoin (Member) left a comment

This seems to add support for KV cache quantization to Mixtral. However, there is no MoE layer registered for ModelOpt to use, so what happens when the MoE layers are quantized?

def get_quant_method(self, layer: torch.nn.Module,
                     prefix: str) -> Optional["QuantizeMethodBase"]:
    from vllm.attention.layer import Attention  # Avoid circular import
    if isinstance(layer, LinearBase):
        if is_layer_skipped(prefix, self.exclude_modules):
            return UnquantizedLinearMethod()
        return ModelOptNvFp4LinearMethod(self)
    elif isinstance(layer, Attention):
        return ModelOptFp8KVCacheMethod(self)
    return None

@yueshen2016 (Contributor, Author) replied:

This seems to add support for KV cache quantization to Mixtral. However, there is no MoE layer registered for ModelOpt to use, so what happens when the MoE layers are quantized?


Hi @mgoin, the purpose of this PR is to do key remapping:
model.layers.X.self_attn.k_proj.k_scale -> model.layers.X.self_attn.attn.k_scale
model.layers.X.self_attn.v_proj.v_scale -> model.layers.X.self_attn.attn.v_scale

As for the MoE layer, an input_scale and a weight_scale are added to each expert.
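A minimal sketch of the kind of key remapping described above; the regex and helper name are illustrative, not the exact code in this PR.

import re

# Illustrative only: remap ModelOpt checkpoint scale names to the names
# vLLM's attention layer expects, as described above.
_SCALE_PATTERN = re.compile(
    r"(?P<prefix>.*\.self_attn)\.(?:k_proj\.(?P<k>k_scale)|v_proj\.(?P<v>v_scale))$"
)


def remap_kv_scale_name(name: str) -> str:
    match = _SCALE_PATTERN.match(name)
    if match is None:
        return name  # not a k/v scale key; leave it untouched
    scale = match.group("k") or match.group("v")
    return f"{match.group('prefix')}.attn.{scale}"


assert (remap_kv_scale_name("model.layers.0.self_attn.k_proj.k_scale")
        == "model.layers.0.self_attn.attn.k_scale")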

@pavanimajety (Contributor) commented Apr 3, 2025

@mgoin This is mixtral_quant.py, which doesn't use the FusedMoE layer; each expert is simply an MLP layer: here

And the switch between architectures happens here

edit: just fixed the links to point to vllm-project
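To make the per-expert structure concrete, here is an illustrative sketch of the checkpoint keys this implies for one Mixtral MoE block, assuming each expert's w1/w2/w3 projections are plain linear layers that carry their own FP8 scales; the exact key names in a given ModelOpt export may differ.

# Illustrative only: sketch of per-expert parameter names, assuming Mixtral's
# usual block_sparse_moe/experts layout and ModelOpt-style FP8 scale names.
def expert_keys(layer_idx: int, expert_idx: int) -> list[str]:
    base = f"model.layers.{layer_idx}.block_sparse_moe.experts.{expert_idx}"
    keys = []
    for proj in ("w1", "w2", "w3"):  # per-expert projections
        keys += [
            f"{base}.{proj}.weight",        # quantized weight
            f"{base}.{proj}.weight_scale",  # per-expert weight scale
            f"{base}.{proj}.input_scale",   # per-expert activation scale
        ]
    return keys


print(expert_keys(0, 0))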

@pavanimajety (Contributor) left a comment

LGTM.
@mgoin This is a Python-only change; I don't see why the Docker build is failing. Can we restart the CI?

@pavanimajety (Contributor) commented Apr 8, 2025

@simon-mo / @mgoin Can you PTAL?

@mgoin (Member) commented Apr 8, 2025

I think this PR was made before we made changes to the docker image. Can you please merge with main?

@mgoin mgoin added the ready ONLY add when PR is ready to merge/full CI is needed label Apr 8, 2025
@yueshen2016 yueshen2016 force-pushed the yueshen/modelopt-quantization-mixtral-support branch from cd7c625 to 395ad17 Compare April 8, 2025 22:45
Signed-off-by: Yue <yueshen@nvidia.com>
@yueshen2016 yueshen2016 force-pushed the yueshen/modelopt-quantization-mixtral-support branch from 395ad17 to af3a071 Compare April 8, 2025 23:24
@mgoin mgoin enabled auto-merge (squash) April 9, 2025 01:13
@mgoin mgoin merged commit 1f4b09b into vllm-project:main Apr 9, 2025
44 checks passed
nishith-fujitsu pushed a commit to nishith-fujitsu/vllm that referenced this pull request Apr 9, 2025
yangw-dev pushed a commit to yangw-dev/vllm that referenced this pull request Apr 21, 2025
…5961)

Signed-off-by: Yue <yueshen@nvidia.com>
Signed-off-by: Yang Wang <elainewy@meta.com>