
[VLM][Model] TP support for ViTs #7186

Merged: 33 commits merged into vllm-project:main on Aug 30, 2024

Conversation

@ChristopherCho (Contributor) commented Aug 6, 2024

As a follow-up PR to #6942, I've implemented the TP version of various ViTs. The following models have been changed:

  • Siglip
  • Clip
  • Blip
  • Intern ViT

Following Idefics2VisionAttention, I've used memory_efficient_attention_forward from xformers (a rough sketch of the pattern follows the list below).

To load the weights correctly, the load_weights methods of the models that use these ViTs also had to be updated. Thus, the following models have been changed as well:

  • Paligemma
  • Llava
  • Llava-next
  • Phi3v
  • Blip2
  • InternVL
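
For reference, here is a minimal sketch of the attention pattern described above, assuming vLLM's QKVParallelLinear and RowParallelLinear layers. The class name TPVisionAttention and its exact arguments are illustrative, not the actual code added for Siglip/Clip/Blip/Intern ViT:

import torch
import torch.nn as nn
from xformers import ops as xops

from vllm.distributed import get_tensor_model_parallel_world_size
from vllm.model_executor.layers.linear import (QKVParallelLinear,
                                               RowParallelLinear)


class TPVisionAttention(nn.Module):
    """Illustrative sketch of a TP-aware ViT attention block (not the exact PR code)."""

    def __init__(self, hidden_size: int, num_heads: int):
        super().__init__()
        tp_size = get_tensor_model_parallel_world_size()
        self.head_dim = hidden_size // num_heads
        # Each TP rank only holds its own shard of the attention heads.
        self.num_heads_per_rank = num_heads // tp_size
        self.scale = self.head_dim ** -0.5
        # Fused, column-parallel QKV projection; row-parallel output projection.
        self.qkv_proj = QKVParallelLinear(hidden_size, self.head_dim, num_heads)
        self.out_proj = RowParallelLinear(hidden_size, hidden_size)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        bsz, seq_len, _ = hidden_states.shape
        qkv, _ = self.qkv_proj(hidden_states)
        q, k, v = qkv.chunk(3, dim=-1)
        # xformers expects (batch, seq_len, num_heads, head_dim).
        q = q.view(bsz, seq_len, self.num_heads_per_rank, self.head_dim)
        k = k.view(bsz, seq_len, self.num_heads_per_rank, self.head_dim)
        v = v.view(bsz, seq_len, self.num_heads_per_rank, self.head_dim)
        out = xops.memory_efficient_attention_forward(q, k, v, scale=self.scale)
        out = out.reshape(bsz, seq_len, -1)
        attn_output, _ = self.out_proj(out)
        return attn_output

Because each TP rank only holds its shard of the fused QKV weight, the checkpoint's separate q/k/v projection weights have to be mapped onto qkv_proj and sharded at load time, which is why the load_weights methods of the models listed above also had to change.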

github-actions bot commented Aug 6, 2024

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which consists of a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build on the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run the full CI, as it is required for merging (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add ready label to the PR
  • Enable auto-merge.

🚀

@ChristopherCho changed the title from "Tp support for vit" to "TP support for ViTs" on Aug 6, 2024
@ywang96 self-assigned this on Aug 6, 2024
@ChristopherCho (Contributor, Author) commented Aug 6, 2024

[Intermediate status]
The llava and llava_next models aren't passing the test with the updated ClipAttention (the generated output is completely wrong). I'm currently working on this.

Update: this has now been fixed.

@ChristopherCho changed the title from "TP support for ViTs" to "[Model] TP support for ViTs" on Aug 7, 2024
@ChristopherCho (Contributor, Author) commented Aug 7, 2024

With the following simple test script, I can successfully run all listed models in both tensor_parallel_size=1 and tensor_parallel_size=2 scenarios with the expected outputs.

import requests
from PIL import Image
import argparse
from vllm import LLM, SamplingParams
from huggingface_hub import snapshot_download

prompt = "What is on the flower?"
image_file = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg?download=true"

# prompt = "caption es"
# image_file = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"

image = Image.open(requests.get(image_file, stream=True).raw)

model_map = {
    # Siglip based models
    "paligemma": {
        "prompt_template": "{prompt}",
        "model_id": "google/paligemma-3b-mix-224",
        "max_model_len": None,
    },

    # Clip based models
    "llava_next": {
        "prompt_template": (
            "A chat between a curious human and an artificial intelligence assistant. "
            "The assistant gives helpful, detailed, and polite answers to the human's "
            "questions. "
            "USER: <image>\n{prompt} ASSISTANT:"
        ),
        "model_id": "llava-hf/llava-v1.6-vicuna-7b-hf",
        "max_model_len": None,
    },
    "llava": {
        "prompt_template": "USER: <image>\n{prompt}\nASSISTANT:",
        "model_id": "llava-hf/llava-1.5-7b-hf",
        "max_model_len": None,
    },
    "phi3v": {
        "prompt_template": "<|user|>\n<|image_1|>\n{prompt}<|end|>\n<|assistant|>\n",
        "model_id": "microsoft/Phi-3-vision-128k-instruct",
        "max_model_len": 4096,
    },

    # Blip based models
    "blip2": {
        "prompt_template": "Question: {prompt} Answer:",
        "model_id": "Salesforce/blip2-opt-2.7b",
        "max_model_len": None,
    },

    # InternVL based models
    "internvl": {
        "prompt_template": "<|im_start|>User\n<image>\nWhat's the content in the center of the image?<|im_end|>\n<|im_start|>Assistant\n",
        "model_id": snapshot_download("OpenGVLab/InternVL2-1B"),
        "max_model_len": None,
    }
}

def test_suite(model_name, tp_size):
    print("#" * 10 + "#" * len(f" Testing {model_name} ") + "#" * 10)
    print("#" + " " * 9 + " " * len(f" Testing {model_name} ") + " " * 9 + "#")
    print("#" + " " * 9 + f" Testing {model_name} " + " " * 9 + "#")
    print("#" + " " * 9 + " " * len(f" Testing {model_name} ") + " " * 9 + "#")
    print("#" * 10 + "#" * len(f" Testing {model_name} ") + "#" * 10)

    llm = LLM(
        model=model_map[model_name]["model_id"],
        trust_remote_code=True,
        max_model_len=model_map[model_name]["max_model_len"],
        tensor_parallel_size=tp_size
    )
    sampling_params = SamplingParams(
        temperature=0.0
    )

    input_dict = {
        "prompt": model_map[model_name]["prompt_template"].format(prompt=prompt),
        "multi_modal_data": {
            "image": image,
        }
    }
    outputs = llm.generate(input_dict, sampling_params)

    print(f"{model_name} outputs:")
    print(outputs[0].outputs[0].text)
    print("\n" * 5)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--method", type=str, default="paligemma")
    parser.add_argument("--tensor_parallel_size", type=int, default=1)
    args = parser.parse_args()

    test_suite(args.method, args.tensor_parallel_size)
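
For example, assuming the script above is saved as test_vit_tp.py (a hypothetical filename), it can be run as:

python test_vit_tp.py --method llava_next --tensor_parallel_size 2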

@ywang96 (Member) commented Aug 7, 2024

With the following simple test script, I can successfully run all listed models in both tensor_parallel_size=1 and tensor_parallel_size=2 scenarios with the expected outputs.

This is great and thank you so much for the implementation and thorough testing coverage. I will take a look this week and get back to you!

@ChristopherCho (Contributor, Author) commented

@ywang96 @DarkLight1337
Thanks for your feedback! I’ve implemented the changes based on your comments and also merged the main branch for the CI flag.
It looks good to me now—ready to proceed when you are.

@ywang96 (Member) left a comment

LGTM! I've run all the models again with your test file, so let's get this in! Thank you for the work! @ChristopherCho

@ywang96 added the "ready" label (ONLY add when PR is ready to merge/full CI is needed) on Aug 30, 2024
@ywang96 enabled auto-merge (squash) on August 30, 2024 at 07:46
@ywang96 (Member) commented Aug 30, 2024

Ah... this will actually break the CPU test. How about using the transformers Attention module as a fallback in case xformers is not available? @ChristopherCho

@ChristopherCho (Contributor, Author) commented

@ywang96
Oh... I forgot that case. I'll do it right away.

@ChristopherCho (Contributor, Author) commented Aug 30, 2024

@ywang96 I've checked the error message and found a few issues here.

  • In CPU mode, xformers is not installed.
  • However, vllm/tests/models/test_internvl.py and vllm/tests/models/test_intern_vit.py import internvl.py and intern_vit.py because of some required dependencies.
  • Importing those files requires xformers to be installed.
    -> This causes the ModuleNotFoundError: No module named 'xformers' error in the Intel CPU Test.

The other VLM models don't hit this because their tests are deselected during CPU testing (run-cpu-test.sh only runs "not vlm" tests), so the original model files are never imported and no error occurs.

I think we could avoid the error by importing xformers only when it is available (see the rough sketch below), but I'm not sure whether that is a good solution. Do you have any good ideas for this?
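
As a minimal sketch of that conditional-import idea (the function name vit_attention is illustrative, and torch scaled_dot_product_attention is used here purely for illustration rather than the transformers Attention module suggested above):

import torch
import torch.nn.functional as F

# Only import xformers when it is available; otherwise fall back to a plain
# PyTorch attention path so CPU-only environments can still import the model.
try:
    from xformers import ops as xops
    USE_XFORMERS = True
except ImportError:
    USE_XFORMERS = False


def vit_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                  scale: float) -> torch.Tensor:
    """q, k, v have shape (batch, seq_len, num_heads, head_dim)."""
    if USE_XFORMERS:
        return xops.memory_efficient_attention_forward(q, k, v, scale=scale)
    # Fallback: torch SDPA expects (batch, num_heads, seq_len, head_dim);
    # the scale keyword requires PyTorch >= 2.1.
    out = F.scaled_dot_product_attention(
        q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2), scale=scale)
    return out.transpose(1, 2)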

@ywang96 (Member) commented Aug 30, 2024

@ChristopherCho I see. Let me try moving those import statements inside the run_test call and see if that helps.
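
As a rough illustration of that kind of deferred import (test and module names here are hypothetical; pytest.importorskip additionally skips the test when the dependency is missing, which goes slightly beyond just moving the import):

import pytest


def test_vit_attention_cpu_safe():
    # Importing inside the test body (rather than at module level) lets a
    # CPU-only environment without xformers still collect this test file;
    # importorskip then skips the test instead of raising ModuleNotFoundError.
    xops = pytest.importorskip("xformers.ops")
    assert hasattr(xops, "memory_efficient_attention_forward")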

@ywang96 changed the title from "[Model] TP support for ViTs" to "[VLM][Model] TP support for ViTs" on Aug 30, 2024
@WoosukKwon disabled auto-merge on August 30, 2024 at 15:19
@WoosukKwon merged commit f97be32 into vllm-project:main on Aug 30, 2024
34 of 37 checks passed
@SovereignRemedy commented

#8055 (comment)
@ywang96 @ChristopherCho Hello, will this case be fixed in this issue?

Alvant pushed a commit to compressa-ai/vllm that referenced this pull request Oct 26, 2024
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
Signed-off-by: Alvant <alvasian@yandex.ru>
LeiWang1999 pushed a commit to LeiWang1999/vllm-bitblas that referenced this pull request Mar 26, 2025
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
Signed-off-by: LeiWang1999 <leiwang1999@outlook.com>