Add support to modelopt quantization of Mixtral model #15961
Conversation

@yueshen2016 (Contributor) commented Apr 2, 2025

Add vLLM support for ModelOpt FP8 quantization of the Mixtral-8x7B model.

github-actions bot commented Apr 2, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@yueshen2016 yueshen2016 force-pushed the yueshen/modelopt-quantization-mixtral-support branch from a1e9348 to cd7c625 Compare April 2, 2025 19:01
@DarkLight1337 DarkLight1337 requested a review from mgoin April 3, 2025 08:58
@mgoin (Member) left a comment

This seems to add support for KV cache quantization to Mixtral. However, there is no MoE layer registered for ModelOpt to use, so what happens when the MoE layers are quantized?

def get_quant_method(self, layer: torch.nn.Module,
                     prefix: str) -> Optional["QuantizeMethodBase"]:
    from vllm.attention.layer import Attention  # Avoid circular import
    if isinstance(layer, LinearBase):
        if is_layer_skipped(prefix, self.exclude_modules):
            return UnquantizedLinearMethod()
        return ModelOptNvFp4LinearMethod(self)
    elif isinstance(layer, Attention):
        return ModelOptFp8KVCacheMethod(self)
    return None

@yueshen2016 (Contributor, Author) replied:

This seems to add support for KV cache quantization to Mixtral. However, there is no MoE layer registered for ModelOpt to use, so what happens when the MoE layers are quantized?


Hi @mgoin, the purpose of this PR is to do key remapping:
model.layers.X.self_attn.k_proj.k_scale -> model.layers.X.self_attn.attn.k_scale
model.layers.X.self_attn.v_proj.v_scale -> model.layers.X.self_attn.attn.v_scale

As for the MoE layer, an input_scale and a weight_scale are added to each expert.
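A minimal sketch of the kind of key remapping described above; the regex and helper name are illustrative, not the exact code in this PR.

import re

# Illustrative only: remap ModelOpt checkpoint scale names to the names
# vLLM's attention layer expects, as described above.
_SCALE_PATTERN = re.compile(
    r"(?P<prefix>.*\.self_attn)\.(?:k_proj\.(?P<k>k_scale)|v_proj\.(?P<v>v_scale))$"
)


def remap_kv_scale_name(name: str) -> str:
    match = _SCALE_PATTERN.match(name)
    if match is None:
        return name  # not a k/v scale key; leave it untouched
    scale = match.group("k") or match.group("v")
    return f"{match.group('prefix')}.attn.{scale}"


assert (remap_kv_scale_name("model.layers.0.self_attn.k_proj.k_scale")
        == "model.layers.0.self_attn.attn.k_scale")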

@pavanimajety (Contributor) commented Apr 3, 2025

@mgoin This is mixtral_quant.py, which doesn't use the FusedMoE layer; each expert is simply an MLP layer: here

And the switch between architectures happens here

edit: just fixed the links to point to vllm-project
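To make the per-expert structure concrete, here is an illustrative sketch of the checkpoint keys this implies for one Mixtral MoE block, assuming each expert's w1/w2/w3 projections are plain linear layers that carry their own FP8 scales; the exact key names in a given ModelOpt export may differ.

# Illustrative only: sketch of per-expert parameter names, assuming Mixtral's
# usual block_sparse_moe/experts layout and ModelOpt-style FP8 scale names.
def expert_keys(layer_idx: int, expert_idx: int) -> list[str]:
    base = f"model.layers.{layer_idx}.block_sparse_moe.experts.{expert_idx}"
    keys = []
    for proj in ("w1", "w2", "w3"):  # per-expert projections
        keys += [
            f"{base}.{proj}.weight",        # quantized weight
            f"{base}.{proj}.weight_scale",  # per-expert weight scale
            f"{base}.{proj}.input_scale",   # per-expert activation scale
        ]
    return keys


print(expert_keys(0, 0))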

@pavanimajety (Contributor) left a comment

LGTM.
@mgoin This is a Python-only change; I don't see why the Docker build is failing. Can we restart the CI?

@pavanimajety (Contributor) commented Apr 8, 2025

@simon-mo / @mgoin Can you PTAL?

@mgoin (Member) commented Apr 8, 2025

I think this PR was made before we made changes to the docker image. Can you please merge with main?

@mgoin mgoin added the ready ONLY add when PR is ready to merge/full CI is needed label Apr 8, 2025
@yueshen2016 yueshen2016 force-pushed the yueshen/modelopt-quantization-mixtral-support branch from cd7c625 to 395ad17 Compare April 8, 2025 22:45
Signed-off-by: Yue <yueshen@nvidia.com>
@yueshen2016 yueshen2016 force-pushed the yueshen/modelopt-quantization-mixtral-support branch from 395ad17 to af3a071 Compare April 8, 2025 23:24
@mgoin mgoin enabled auto-merge (squash) April 9, 2025 01:13
@mgoin mgoin merged commit 1f4b09b into vllm-project:main Apr 9, 2025
44 checks passed
nishith-fujitsu pushed a commit to nishith-fujitsu/vllm that referenced this pull request Apr 9, 2025
yangw-dev pushed a commit to yangw-dev/vllm that referenced this pull request Apr 21, 2025
…5961)

Signed-off-by: Yue <yueshen@nvidia.com>
Signed-off-by: Yang Wang <elainewy@meta.com>