
[VLM][Model] TP support for ViTs #7186

Merged: 33 commits merged into vllm-project:main on Aug 30, 2024

Conversation

@ChristopherCho (Contributor) commented Aug 6, 2024

As a follow-up PR to #6942, I've implemented the TP version of various ViTs. The following models have been changed:

  • Siglip
  • Clip
  • Blip
  • Intern ViT

Following Idefics2VisionAttention, I've used memory_efficient_attention_forward from xformers (a rough sketch of the pattern follows the list below).

To load the weights correctly, the load_weights methods of the models that use these ViTs also had to be updated. Thus, the following models have been changed as well:

  • Paligemma
  • Llava
  • Llava-next
  • Phi3v
  • Blip2
  • InternVL
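
For reference, here is a minimal sketch of the attention pattern described above, assuming vLLM's QKVParallelLinear and RowParallelLinear layers. The class name TPVisionAttention and its exact arguments are illustrative, not the actual code added for Siglip/Clip/Blip/Intern ViT:

import torch
import torch.nn as nn
from xformers import ops as xops

from vllm.distributed import get_tensor_model_parallel_world_size
from vllm.model_executor.layers.linear import (QKVParallelLinear,
                                               RowParallelLinear)


class TPVisionAttention(nn.Module):
    """Illustrative sketch of a TP-aware ViT attention block (not the exact PR code)."""

    def __init__(self, hidden_size: int, num_heads: int):
        super().__init__()
        tp_size = get_tensor_model_parallel_world_size()
        self.head_dim = hidden_size // num_heads
        # Each TP rank only holds its own shard of the attention heads.
        self.num_heads_per_rank = num_heads // tp_size
        self.scale = self.head_dim ** -0.5
        # Fused, column-parallel QKV projection; row-parallel output projection.
        self.qkv_proj = QKVParallelLinear(hidden_size, self.head_dim, num_heads)
        self.out_proj = RowParallelLinear(hidden_size, hidden_size)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        bsz, seq_len, _ = hidden_states.shape
        qkv, _ = self.qkv_proj(hidden_states)
        q, k, v = qkv.chunk(3, dim=-1)
        # xformers expects (batch, seq_len, num_heads, head_dim).
        q = q.view(bsz, seq_len, self.num_heads_per_rank, self.head_dim)
        k = k.view(bsz, seq_len, self.num_heads_per_rank, self.head_dim)
        v = v.view(bsz, seq_len, self.num_heads_per_rank, self.head_dim)
        out = xops.memory_efficient_attention_forward(q, k, v, scale=self.scale)
        out = out.reshape(bsz, seq_len, -1)
        attn_output, _ = self.out_proj(out)
        return attn_output

Because each TP rank only holds its shard of the fused QKV weight, the checkpoint's separate q/k/v projection weights have to be mapped onto qkv_proj and sharded at load time, which is why the load_weights methods of the models listed above also had to change.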

github-actions bot commented Aug 6, 2024

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which consists of a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build on the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run the full CI, as it is required for merging (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add ready label to the PR
  • Enable auto-merge.

🚀

@ChristopherCho changed the title from "Tp support for vit" to "TP support for ViTs" on Aug 6, 2024
@ywang96 self-assigned this on Aug 6, 2024
@ChristopherCho (Contributor, Author) commented Aug 6, 2024

[Intermediate status]
The llava and llava_next models aren't passing the test with the updated ClipAttention (the generated output is completely wrong). I'm currently working on this.

Update: this has now been fixed.

@ChristopherCho changed the title from "TP support for ViTs" to "[Model] TP support for ViTs" on Aug 7, 2024
@ChristopherCho (Contributor, Author) commented Aug 7, 2024

With the following simple test script, I can successfully run all listed models in both tensor_parallel_size=1 and tensor_parallel_size=2 scenarios with the expected outputs.

import requests
from PIL import Image
import argparse
from vllm import LLM, SamplingParams
from huggingface_hub import snapshot_download

prompt = "What is on the flower?"
image_file = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg?download=true"

# prompt = "caption es"
# image_file = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"

image = Image.open(requests.get(image_file, stream=True).raw)

model_map = {
    # Siglip based models
    "paligemma": {
        "prompt_template": "{prompt}",
        "model_id": "google/paligemma-3b-mix-224",
        "max_model_len": None,
    },

    # Clip based models
    "llava_next": {
        "prompt_template": (
            "A chat between a curious human and an artificial intelligence assistant. "
            "The assistant gives helpful, detailed, and polite answers to the human's "
            "questions. "
            "USER: <image>\n{prompt} ASSISTANT:"
        ),
        "model_id": "llava-hf/llava-v1.6-vicuna-7b-hf",
        "max_model_len": None,
    },
    "llava": {
        "prompt_template": "USER: <image>\n{prompt}\nASSISTANT:",
        "model_id": "llava-hf/llava-1.5-7b-hf",
        "max_model_len": None,
    },
    "phi3v": {
        "prompt_template": "<|user|>\n<|image_1|>\n{prompt}<|end|>\n<|assistant|>\n",
        "model_id": "microsoft/Phi-3-vision-128k-instruct",
        "max_model_len": 4096,
    },

    # Blip based models
    "blip2": {
        "prompt_template": "Question: {prompt} Answer:",
        "model_id": "Salesforce/blip2-opt-2.7b",
        "max_model_len": None,
    },

    # InternVL based models
    "internvl": {
        "prompt_template": "<|im_start|>User\n<image>\nWhat's the content in the center of the image?<|im_end|>\n<|im_start|>Assistant\n",
        "model_id": snapshot_download("OpenGVLab/InternVL2-1B"),
        "max_model_len": None,
    }
}

def test_suite(model_name, tp_size):
    print("#" * 10 + "#" * len(f" Testing {model_name} ") + "#" * 10)
    print("#" + " " * 9 + " " * len(f" Testing {model_name} ") + " " * 9 + "#")
    print("#" + " " * 9 + f" Testing {model_name} " + " " * 9 + "#")
    print("#" + " " * 9 + " " * len(f" Testing {model_name} ") + " " * 9 + "#")
    print("#" * 10 + "#" * len(f" Testing {model_name} ") + "#" * 10)

    llm = LLM(
        model=model_map[model_name]["model_id"],
        trust_remote_code=True,
        max_model_len=model_map[model_name]["max_model_len"],
        tensor_parallel_size=tp_size
    )
    sampling_params = SamplingParams(
        temperature=0.0
    )

    input_dict = {
        "prompt": model_map[model_name]["prompt_template"].format(prompt=prompt),
        "multi_modal_data": {
            "image": image,
        }
    }
    outputs = llm.generate(input_dict, sampling_params)

    print(f"{model_name} outputs:")
    print(outputs[0].outputs[0].text)
    print("\n" * 5)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--method", type=str, default="paligemma")
    parser.add_argument("--tensor_parallel_size", type=int, default=1)
    args = parser.parse_args()

    test_suite(args.method, args.tensor_parallel_size)
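
For example, assuming the script above is saved as test_vit_tp.py (a hypothetical filename), it can be run as:

python test_vit_tp.py --method llava_next --tensor_parallel_size 2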

@ywang96 (Member) commented Aug 7, 2024

With the following simple test script, I can successfully run all listed models in both tensor_parallel_size=1 and tensor_parallel_size=2 scenarios with the expected outputs.

This is great and thank you so much for the implementation and thorough testing coverage. I will take a look this week and get back to you!

@ChristopherCho (Contributor, Author) commented

@ywang96 @DarkLight1337
Thanks for your feedback! I’ve implemented the changes based on your comments and also merged the main branch for the CI flag.
It looks good to me now—ready to proceed when you are.

@ywang96 (Member) left a comment

LGTM! I've run all the models again with your test file, so let's get this in! Thank you for the work! @ChristopherCho

@ywang96 added the "ready" label (ONLY add when PR is ready to merge/full CI is needed) on Aug 30, 2024
@ywang96 enabled auto-merge (squash) on August 30, 2024 at 07:46
@ywang96 (Member) commented Aug 30, 2024

Ah... this will actually break the CPU test. How about using the transformers Attention module as a fallback in case xformers is not available? @ChristopherCho

@ChristopherCho (Contributor, Author) commented

@ywang96
Oh... I forgot that case. I'll do it right away.

@ChristopherCho (Contributor, Author) commented Aug 30, 2024

@ywang96 I've checked the error message and found a few issues here.

  • In CPU mode, xformers is not installed.
  • However, vllm/tests/models/test_internvl.py and vllm/tests/models/test_intern_vit.py import internvl.py and intern_vit.py because of some required dependencies.
  • Importing those files requires xformers to be installed.
    -> This causes the ModuleNotFoundError: No module named 'xformers' error in the Intel CPU Test.

The other VLM models don't hit this because their tests are deselected during CPU testing (run-cpu-test.sh only runs "not vlm" tests), so the original model files are never imported and no error occurs.

I think we could avoid the error by importing xformers only when it is available (see the rough sketch below), but I'm not sure whether that is a good solution. Do you have any good ideas for this?
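
As a minimal sketch of that conditional-import idea (the function name vit_attention is illustrative, and torch scaled_dot_product_attention is used here purely for illustration rather than the transformers Attention module suggested above):

import torch
import torch.nn.functional as F

# Only import xformers when it is available; otherwise fall back to a plain
# PyTorch attention path so CPU-only environments can still import the model.
try:
    from xformers import ops as xops
    USE_XFORMERS = True
except ImportError:
    USE_XFORMERS = False


def vit_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                  scale: float) -> torch.Tensor:
    """q, k, v have shape (batch, seq_len, num_heads, head_dim)."""
    if USE_XFORMERS:
        return xops.memory_efficient_attention_forward(q, k, v, scale=scale)
    # Fallback: torch SDPA expects (batch, num_heads, seq_len, head_dim);
    # the scale keyword requires PyTorch >= 2.1.
    out = F.scaled_dot_product_attention(
        q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2), scale=scale)
    return out.transpose(1, 2)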

@ywang96 (Member) commented Aug 30, 2024

@ChristopherCho I see. Let me try moving those import statements inside the run_test call and see if that helps.
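
As a rough illustration of that kind of deferred import (test and module names here are hypothetical; pytest.importorskip additionally skips the test when the dependency is missing, which goes slightly beyond just moving the import):

import pytest


def test_vit_attention_cpu_safe():
    # Importing inside the test body (rather than at module level) lets a
    # CPU-only environment without xformers still collect this test file;
    # importorskip then skips the test instead of raising ModuleNotFoundError.
    xops = pytest.importorskip("xformers.ops")
    assert hasattr(xops, "memory_efficient_attention_forward")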

@ywang96 changed the title from "[Model] TP support for ViTs" to "[VLM][Model] TP support for ViTs" on Aug 30, 2024
@WoosukKwon disabled auto-merge on August 30, 2024 at 15:19
@WoosukKwon merged commit f97be32 into vllm-project:main on Aug 30, 2024
34 of 37 checks passed
@SovereignRemedy commented

#8055 (comment)
@ywang96 @ChristopherCho Hello, will this case be fixed in this issue?

Alvant pushed a commit to compressa-ai/vllm that referenced this pull request Oct 26, 2024
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
Signed-off-by: Alvant <alvasian@yandex.ru>
LeiWang1999 pushed a commit to LeiWang1999/vllm-bitblas that referenced this pull request Mar 26, 2025
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
Signed-off-by: LeiWang1999 <leiwang1999@outlook.com>