
[Examples] vLLM example for SkyServe + Mixtral #2948


Merged · 4 commits · Jan 11, 2024

91 changes: 91 additions & 0 deletions llm/vllm/README.md
@@ -126,3 +126,94 @@ curl http://$IP:8000/v1/chat/completions \
}
}
```

## Serving the above Llama-2 example with vLLM and SkyServe

1. Add a `service` section to the above `serve-openai-api.yaml` file to make it a [SkyServe Service YAML file](https://skypilot.readthedocs.io/en/latest/serving/service-yaml-spec.html):

```yaml
# The newly-added `service` section to the `serve-openai-api.yaml` file.
service:
# Specifying the path to the endpoint to check the readiness of the service.
readiness_probe: /v1/models
# How many replicas to manage.
replicas: 2
```

The complete Service YAML is shown here: [service.yaml](service.yaml).

2. Start serving with the [SkyServe](https://skypilot.readthedocs.io/en/latest/serving/sky-serve.html) CLI:
```bash
sky serve up -n vllm-llama2 service.yaml
```

3. Use `sky serve status` to check the status of the service:
```bash
sky serve status vllm-llama2
```

You should see output similar to the following:

```console
Services
NAME UPTIME STATUS REPLICAS ENDPOINT
vllm-llama2 7m 43s READY 2/2 3.84.15.251:30001

Service Replicas
SERVICE_NAME ID IP LAUNCHED RESOURCES STATUS REGION
vllm-llama2 1 34.66.255.4 11 mins ago 1x GCP({'L4': 1}) READY us-central1
vllm-llama2 2 35.221.37.64 15 mins ago 1x GCP({'L4': 1}) READY us-east4
```
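If a replica stays in `PROVISIONING` or ends up `FAILED`, inspecting its logs usually reveals the problem. A minimal sketch, assuming a SkyPilot version where `sky serve logs` accepts a service name and a replica ID:

```bash
# Stream the logs of replica 1 of the service (replica ID 1 is just an example).
sky serve logs vllm-llama2 1
```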

4. Check the endpoint of the service:
```bash
ENDPOINT=$(sky serve status --endpoint vllm-llama2)
```
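As a quick sanity check, you can query the same `/v1/models` path used by the readiness probe above (a sketch; note the `-L` flag, used for the same reason as in the chat request below):

```bash
# List the models served behind the SkyServe endpoint.
curl -L $ENDPOINT/v1/models
```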

5. Once its status is `READY`, you can use the endpoint to interact with the model:

```bash
curl -L $ENDPOINT/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-2-7b-chat-hf",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Who are you?"
}
]
}'
```

Notice that this is the same as the previous curl command, except for the `-L` argument. You should get a response similar to the following:

```console
{
"id": "cmpl-879a58992d704caf80771b4651ff8cb6",
"object": "chat.completion",
"created": 1692650569,
"model": "meta-llama/Llama-2-7b-chat-hf",
"choices": [{
"index": 0,
"message": {
"role": "assistant",
"content": " Hello! I'm just an AI assistant, here to help you"
},
"finish_reason": "length"
}],
"usage": {
"prompt_tokens": 31,
"total_tokens": 47,
"completion_tokens": 16
}
}
```
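When you are done experimenting, you can tear the service down so the replicas stop incurring costs (a sketch using the SkyServe CLI):

```bash
# Shut down the service and terminate its replicas.
sky serve down vllm-llama2
```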

## Serving the Mixtral 8x7b model with vLLM and SkyServe

Please refer to the [Mixtral 8x7b example](https://github.com/skypilot-org/skypilot/tree/master/llm/mixtral) for more details.
42 changes: 42 additions & 0 deletions llm/vllm/service.yaml
@@ -0,0 +1,42 @@
# service.yaml
# The newly-added `service` section to the `serve-openai-api.yaml` file.
service:
# Specifying the path to the endpoint to check the readiness of the service.
readiness_probe: /v1/models
# How many replicas to manage.
replicas: 2

# Fields below are the same as in `serve-openai-api.yaml`.
envs:
MODEL_NAME: meta-llama/Llama-2-7b-chat-hf
HF_TOKEN: <your-huggingface-token> # Change to your own huggingface token

resources:
accelerators: L4:1
Collaborator:
Let's use multiple accelerators for this and the original yaml files, so that a user without GCP credentials can use the yaml out-of-the-box, e.g., `{L4:1, A10G:1, A10:1, A100:1, A100-80GB:1}`

Collaborator (Author):
Done. Thanks!

ports:
- 8000

setup: |
conda activate vllm
if [ $? -ne 0 ]; then
conda create -n vllm python=3.9 -y
conda activate vllm
fi

git clone https://github.com/vllm-project/vllm.git || true
# Install fschat and accelerate for chat completion
pip install fschat
pip install accelerate

cd vllm
pip list | grep vllm || pip install .
python -c "import huggingface_hub; huggingface_hub.login('${HF_TOKEN}')"


run: |
conda activate vllm
echo 'Starting vllm openai api server...'
python -m vllm.entrypoints.openai.api_server \
--model $MODEL_NAME --tokenizer hf-internal-testing/llama-tokenizer \
--host 0.0.0.0