
[Model] Add support for Gemma 3 #14660


Merged
merged 54 commits into main from woosuk-gemma3 on Mar 12, 2025

Conversation

WoosukKwon
Collaborator

@WoosukKwon commented Mar 12, 2025

This PR adds support for Gemma 3, an open-source vision-language model from Google.

NOTE:

  • The PR doesn't implement the pan-and-scan pre-processing algorithm; it will be implemented in a follow-up PR. cc @DarkLight1337
  • For text-only inputs, both V0 and V1 should produce accurate outputs with good performance.
  • For image inputs, only V0 implements the attention in the correct way. Gemma 3 uses bidirectional attention only for the image tokens, which none of the current attention backends supports efficiently. Therefore, we temporarily use naive PyTorch SDPA with masking tensors (a rough sketch of such a mask is shown after this list). This could lead to significant memory usage for long prompts (with images).
  • For V1, we currently do not strictly follow the original attention in Gemma 3. The model still generates reasonable outputs, but this needs to be fixed to achieve full accuracy.
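
For illustration, here is a minimal sketch (not vLLM's actual implementation) of the mask described above, in which text tokens attend causally while image tokens also attend to each other bidirectionally. The build_gemma3_style_mask helper and the single-image simplification are assumptions made for the example:

import torch
import torch.nn.functional as F

def build_gemma3_style_mask(is_image: torch.Tensor) -> torch.Tensor:
    """is_image: bool tensor of shape (seq_len,), True at image-token positions."""
    seq_len = is_image.shape[0]
    # Standard causal mask: position i may attend to positions j <= i.
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # Additionally allow bidirectional attention among image tokens.
    # (A faithful implementation would restrict this to tokens of the same image.)
    bidir_image = is_image.unsqueeze(0) & is_image.unsqueeze(1)
    return causal | bidir_image

# Example: 3 text tokens, 4 image tokens, then 2 text tokens.
is_image = torch.tensor([False, False, False, True, True, True, True, False, False])
mask = build_gemma3_style_mask(is_image)

# The boolean mask is passed to PyTorch SDPA, the naive fallback mentioned above.
q = k = v = torch.randn(1, 1, is_image.shape[0], 8)
out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)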

Thanks for the help @ywang96 and @DarkLight1337!

FIX #14663

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
@DarkLight1337
Member

Expediting merge as the other tests have passed in the previous build: 4111004

@vllm-bot merged commit c0c25e2 into main on Mar 12, 2025
15 of 35 checks passed
@vllm-bot deleted the woosuk-gemma3 branch on March 12, 2025 at 15:36
@moficodes

Do we know when the latest Docker image will be published?

@WoosukKwon
Collaborator Author

@DarkLight1337 @ywang96 Thanks for all the fixes!

@erdaltoprak

For V1, we currently do not strictly follow the original attention in Gemma 3. The model still generates reasonable outputs, but this needs to be fixed to get the full accuracy

Hi @WoosukKwon, can you explain the main difference between the two approaches? Thanks :-)

@Dilesh-chouhan

Dilesh-chouhan commented Mar 14, 2025

Do we know when the latest Docker image will be published?

@DarkLight1337
Member

The next release is very soon: https://github.com/vllm-project/vllm/milestone/1

richardsliu pushed a commit to richardsliu/vllm that referenced this pull request Mar 14, 2025
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: Roger Wang <ywang@roblox.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: Roger Wang <ywang@roblox.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Richard Liu <ricliu@google.com>
@Swipe4057

How can I run Gemma 3 with a backend other than xFormers? It is very slow, and it doesn't start with FlashInfer.

@francis2tm

How can I run Gemma 3 with a backend other than xFormers? It is very slow, and it doesn't start with FlashInfer.

Having the same problem

@DarkLight1337
Member

Can you try setting VLLM_USE_V1=0 to enable more backends?
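
For example, a minimal sketch of one way to force the V0 engine from Python (the model name here is only a placeholder):

# Minimal sketch: force the V0 engine by setting VLLM_USE_V1=0 before vLLM is used.
# The model name below is only a placeholder.
import os
os.environ["VLLM_USE_V1"] = "0"

from vllm import LLM

llm = LLM(model="google/gemma-3-27b-it")
outputs = llm.generate("Hello, Gemma 3!")
print(outputs[0].outputs[0].text)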

@Swipe4057

@DarkLight1337 [two screenshots of the error messages were attached here]

@DarkLight1337
Member

The error message shows that your --max-model-len is too high

@Swipe4057

@DarkLight1337 I understand that perfectly, but the Gemma 3 model has a large context window: a total input context of 128K tokens for the 4B, 12B, and 27B sizes, and 32K tokens for the 1B size.

@DarkLight1337
Member

I think the max model len here corresponds to the sliding-window length, not the total context length.

@Swipe4057

In general, try running the model with any other backend and you'll see that it doesn't work, while xFormers is terribly slow.

@xihuai18

When will V1 support Gemma 3?

@DarkLight1337
Member

It's supported if you install from the main branch, but there might be correctness issues because its attention mask is not fully implemented in V1.

@anunknowperson

anunknowperson commented Mar 22, 2025

I think the max model len here corresponds to the sliding-window length, not the total context length.

--max-model-len is the model's total context size; see https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html

There are currently some problems with Gemma 3 support. V0 doesn't support FlashAttention and the context size is huge, and with V1 I wasn't able to load the GPTQ quant.

Edit: I was able to load the GPTQ quant, but the context size is very low (<8192) on a 24 GB GPU.
Edit: I was able to fit an 8K context with --max-num-seqs 10 and --gpu-memory-utilization 0.99, though I expected more context to fit since the model is 16 GB, so there should be 8 GB of free VRAM for context.

@pietrobolcato

pietrobolcato commented Mar 23, 2025

Edit: I was able to load the GPTQ quant, but the context size is very low (<8192) on a 24 GB GPU.
Edit: I was able to fit an 8K context with --max-num-seqs 10 and --gpu-memory-utilization 0.99, though I expected more context to fit since the model is 16 GB, so there should be 8 GB of free VRAM for context.

Could you kindly share a code snippet? I am facing a few issues too and that would help a ton! @anunknowperson 🙏

@anunknowperson

@pietrobolcato

My launch command for the OpenAI-compatible server is:

CUDA_VISIBLE_DEVICES=0 vllm serve ISTA-DASLab/gemma-3-27b-it-GPTQ-4b-128g --max-model-len 8192 --max-num-seqs 10 --gpu-memory-utilization=0.99

You can probably take the parameters from there for code use. The engine is V1; with V0 the context takes too much memory.
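
For reference, a rough offline-API equivalent of the command above (a sketch only; the prompt and sampling parameters are illustrative assumptions):

# Rough Python equivalent of the serve command above (a sketch; the prompt and
# sampling parameters are illustrative assumptions).
from vllm import LLM, SamplingParams

llm = LLM(
    model="ISTA-DASLab/gemma-3-27b-it-GPTQ-4b-128g",
    max_model_len=8192,
    max_num_seqs=10,
    gpu_memory_utilization=0.99,
)
outputs = llm.generate(
    ["Describe the Gemma 3 model family in one sentence."],
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)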

lulmer pushed a commit to lulmer/vllm that referenced this pull request Apr 7, 2025
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: Roger Wang <ywang@roblox.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: Roger Wang <ywang@roblox.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Louis Ulmer <ulmerlouis@gmail.com>
@hahmad2008

hahmad2008 commented Apr 7, 2025

@DarkLight1337 @WoosukKwon
could you please check this issue related to Gemma3-AWQ?

nishith-fujitsu pushed a commit to nishith-fujitsu/vllm that referenced this pull request Apr 9, 2025
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: Roger Wang <ywang@roblox.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: Roger Wang <ywang@roblox.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
Labels
  • ci/build
  • documentation (Improvements or additions to documentation)
  • frontend
  • multi-modality (Related to multi-modality #4194)
  • ready (ONLY add when PR is ready to merge/full CI is needed)
Development

Successfully merging this pull request may close these issues.

[New Model]: New models Gemma 3