# SGLang: Fast and Expressive LLM Inference with RadixAttention for 5x throughput

This README contains instructions to run a demo for SGLang, an open-source library for fast and expressive LLM inference and serving with **5x throughput**.

* [Repo](https://github.com/sgl-project/sglang)
* [Blog](https://lmsys.org/blog/2024-01-17-sglang)

## Prerequisites
Install the latest SkyPilot and check that your cloud credentials are set up:
```bash
pip install "skypilot-nightly[all]"
sky check
```

## Serving Llama-2 with SGLang using SkyServe
1. Create a [`SkyServe Service YAML`](https://skypilot.readthedocs.io/en/latest/serving/service-yaml-spec.html) with a `service` section:

```yaml
service:
  # Path to the endpoint used to check the readiness of the service.
  readiness_probe: /health
  # How many replicas to manage.
  replicas: 2
```
The entire Service YAML can be found here: [sglang.yaml](sglang.yaml).
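
For reference, here is a minimal sketch of what the full YAML might look like. The accelerator, port, and install/launch commands below are illustrative assumptions; the checked-in [sglang.yaml](sglang.yaml) is the authoritative version:

```yaml
service:
  # Path to the endpoint used to check the readiness of the service.
  readiness_probe: /health
  # How many replicas to manage.
  replicas: 2

resources:
  accelerators: L4:1  # one NVIDIA L4 per replica (assumption)
  ports: 8000         # expose the SGLang server port

setup: |
  pip install "sglang[all]"

run: |
  python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf \
    --host 0.0.0.0 --port 8000
```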
2. Start serving with the [SkyServe](https://skypilot.readthedocs.io/en/latest/serving/sky-serve.html) CLI:
```bash
sky serve up -n sglang sglang.yaml
```
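
Note that `meta-llama/Llama-2-7b-chat-hf` is a gated model on Hugging Face. Assuming the YAML declares an `HF_TOKEN` environment variable for downloading the weights, you can pass your access token through SkyPilot's `--env` flag:

```bash
sky serve up -n sglang sglang.yaml --env HF_TOKEN=<your-huggingface-token>
```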

3. Use `sky serve status` to check the status of the service:
```bash
sky serve status sglang
```

You should see output similar to the following:

```console
Services
NAME    VERSION  UPTIME  STATUS  REPLICAS  ENDPOINT
sglang  1        8m 16s  READY   2/2       34.32.43.41:30001

Service Replicas
SERVICE_NAME  ID  VERSION  IP              LAUNCHED     RESOURCES          STATUS  REGION
sglang        1   1        34.85.154.76    16 mins ago  1x GCP({'L4': 1})  READY   us-east4
sglang        2   1        34.145.195.253  16 mins ago  1x GCP({'L4': 1})  READY   us-east4
```

4. Check the endpoint of the service:
```bash
ENDPOINT=$(sky serve status --endpoint sglang)
```
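
As a quick sanity check, you can query the `/health` readiness path declared in the `service` section above (the `-L` flag lets curl follow any redirect from the load balancer to a replica):

```bash
curl -L $ENDPOINT/health
```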

5. Once the service status is `READY`, you can use the endpoint to interact with the model:

```bash
curl -L $ENDPOINT/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Who are you?"
      }
    ]
  }'
```

You should get a response similar to the following:

```console
{
  "id": "cmpl-879a58992d704caf80771b4651ff8cb6",
  "object": "chat.completion",
  "created": 1692650569,
  "model": "meta-llama/Llama-2-7b-chat-hf",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": " Hello! I'm just an AI assistant, here to help you"
    },
    "finish_reason": "length"
  }],
  "usage": {
    "prompt_tokens": 31,
    "total_tokens": 47,
    "completion_tokens": 16
  }
}
```
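
You can also talk to the service programmatically, since SGLang exposes an OpenAI-compatible API. Below is a minimal Python sketch using the `requests` library; it assumes the `ENDPOINT` environment variable is set to the value from step 4 (`requests` follows redirects by default, mirroring curl's `-L`):

```python
import os

import requests

# Endpoint printed by `sky serve status --endpoint sglang`, e.g. "34.32.43.41:30001".
endpoint = os.environ["ENDPOINT"]

payload = {
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who are you?"},
    ],
}

# POST to the OpenAI-compatible chat completions route behind the endpoint.
resp = requests.post(f"http://{endpoint}/v1/chat/completions", json=payload)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```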
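
6. When you are done, tear down the service to stop incurring costs:

```bash
sky serve down sglang
```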