[Model] Add support for Gemma 3 #14660
Conversation
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Expediting merge as the other tests have passed in the previous build: 4111004
Do we know when the latest Docker image will be published?
@DarkLight1337 @ywang96 Thanks for all the fixes!
Hi @WoosukKwon, can you explain the main difference between the two approaches? Thanks :-)
Do we know when the latest Docker image will be published?
The next release is very soon: https://github.com/vllm-project/vllm/milestone/1
How can I run Gemma 3 with a backend other than xFormers? It runs very slowly, and it doesn't start with FlashInfer.
Having the same problem.
Can you try setting
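The comment above is cut off. Below is a minimal sketch of switching vLLM's attention backend, assuming the suggestion refers to the `VLLM_ATTENTION_BACKEND` environment variable (an assumption; the original text is truncated):

```python
# Hypothetical illustration: pick the attention backend before the engine starts.
# Known values include "FLASH_ATTN", "FLASHINFER", and "XFORMERS".
import os

os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"  # assumed to be the setting meant above

from vllm import LLM

llm = LLM(model="google/gemma-3-27b-it", max_model_len=8192)
print(llm.generate("Hello, Gemma 3!")[0].outputs[0].text)
```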
@DarkLight1337
The error message shows that your
@DarkLight1337 I understand that perfectly, but the Gemma 3 model has a large context window: a total input context of 128K tokens for the 4B, 12B, and 27B sizes, and 32K tokens for the 1B size.
I think the max model len here corresponds to the sliding-window length, not the total context length.
In general, try to run the model with any other backend and you'll see that it doesn't work, while xFormers is terribly slow.
When will V1 support Gemma 3?
It's supported if you install from the main branch, but there might be correctness issues because its attention mask is not fully implemented in V1.
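For context, a minimal sketch of opting into the V1 engine at the time of this thread, assuming it is still gated behind the `VLLM_USE_V1` environment variable in your build (an assumption; newer versions may enable it by default):

```python
# Hypothetical illustration: opt into the experimental V1 engine.
import os

os.environ["VLLM_USE_V1"] = "1"  # assumption: V1 is opt-in in this build

from vllm import LLM

# Per the discussion above, Gemma 3's attention mask may not be fully
# implemented in V1 yet, so outputs should be sanity-checked.
llm = LLM(model="google/gemma-3-4b-it", max_model_len=8192)
```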
--max-model-len is the model's total context size; see https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html. There are some problems with Gemma 3 support currently: V0 doesn't support FlashAttention, and the context size is huge. With V1 I wasn't able to load a GPTQ quant at first. Edit: I was able to load the GPTQ quant, but the context size is very low (<8192) on a 24GB GPU.
Could you kindly share a code snippet? I am facing a few issues too and that would help a ton! @anunknowperson 🙏
My launch string for the OpenAI-compatible server is CUDA_VISIBLE_DEVICES=0 vllm serve ISTA-DASLab/gemma-3-27b-it-GPTQ-4b-128g --max-model-len 8192 --max-num-seqs 10 --gpu-memory-utilization=0.99. You can probably lift the parameters from there for code. The engine is V1; with V0 the context size is too big.
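Since a code snippet was requested above, here is a rough Python equivalent of that launch string; a sketch rather than the commenter's actual code, with parameter names mirroring the CLI flags:

```python
# Sketch: offline-inference equivalent of the `vllm serve` command above.
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0"

from vllm import LLM, SamplingParams

llm = LLM(
    model="ISTA-DASLab/gemma-3-27b-it-GPTQ-4b-128g",
    max_model_len=8192,
    max_num_seqs=10,
    gpu_memory_utilization=0.99,
)

outputs = llm.generate(
    ["Explain the Gemma 3 architecture in two sentences."],
    SamplingParams(temperature=0.7, max_tokens=256),
)
print(outputs[0].outputs[0].text)
```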
@DarkLight1337 @WoosukKwon
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: Roger Wang <ywang@roblox.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: Roger Wang <ywang@roblox.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
This PR adds support for Gemma 3, an open-source vision-language model from Google.
NOTE:
Thanks for the help @ywang96 and @DarkLight1337!
FIX #14663
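For readers arriving from this PR, a minimal sketch of image + text inference with the newly supported model, assuming the Hugging Face id google/gemma-3-4b-it and vLLM's LLM.chat interface; the exact message layout is an assumption, so check the vLLM multimodal docs for your version:

```python
# Sketch: multimodal (image + text) inference with Gemma 3 via vLLM's chat API.
from vllm import LLM, SamplingParams

llm = LLM(model="google/gemma-3-4b-it", max_model_len=8192)  # assumed model id

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://example.com/cat.jpg"}},  # placeholder URL
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }
]

outputs = llm.chat(messages, SamplingParams(max_tokens=128))
print(outputs[0].outputs[0].text)
```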