
Performance of Triton + TRT-LLM on llava-onevision compared to vLLM and SGLang #689

Open
alexemme opened this issue Feb 3, 2025 · 0 comments

alexemme commented Feb 3, 2025

Hello everyone,
I have been experimenting with the llava-onevision model.
Converting the model to TRT-LLM and serving it with Triton Inference Server works well,
but I have the impression that I haven't properly optimized the configuration of the individual models.

I ran quick benchmarks of the same model on vLLM and SGLang (the serving framework recommended in the OneVision documentation).
With no concurrency (single requests), Triton is the fastest.
As concurrency increases, Triton still performs better than vLLM, but significantly worse than SGLang.
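
For context, the concurrency sweep is driven by a small client along the lines of the sketch below. The endpoint path, model name, and payload fields are placeholders and need to be adapted to whatever inputs the llava-onevision ensemble actually exposes:

```python
# Minimal concurrency benchmark sketch. The endpoint path, model name, and
# payload fields are placeholders (hypothetical) -- adapt them to the inputs
# your ensemble actually exposes.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:8000/v2/models/ensemble/generate"  # placeholder endpoint
PAYLOAD = {                                                # placeholder fields
    "text_input": "Describe the image.",
    "max_tokens": 128,
}
TOTAL_REQUESTS = 300
CONCURRENCY = 30


def send_one(_):
    # Measure end-to-end latency of a single request.
    start = time.perf_counter()
    resp = requests.post(URL, json=PAYLOAD, timeout=120)
    resp.raise_for_status()
    return time.perf_counter() - start


if __name__ == "__main__":
    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        latencies = list(pool.map(send_one, range(TOTAL_REQUESTS)))
    wall = time.perf_counter() - wall_start
    print(f"total: {wall:.1f}s  throughput: {TOTAL_REQUESTS / wall:.2f} req/s")
    print(f"mean e2e latency: {sum(latencies) / len(latencies):.2f}s")
```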

I tried various parameter combinations, especially around batch_size and in-flight batching,
which helped improve inference times, but not enough to match the performance of SGLang.
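
For concreteness, this is roughly the kind of edit I'm making to the tensorrt_llm model's config.pbtxt, expressed as a small Python helper. The parameter keys (gpt_model_type, batch_scheduler_policy, kv_cache_free_gpu_mem_fraction) and the example values follow the tensorrtllm_backend config template and may differ between versions, so treat them as an illustration rather than a recommendation:

```python
# Sketch of the batching-related edits I'm experimenting with in the
# tensorrt_llm model's config.pbtxt. Exact parameter keys and values are
# assumptions based on the tensorrtllm_backend template, not a recommendation.
from google.protobuf import text_format
from tritonclient.grpc import model_config_pb2


def tune_batching(config_path: str) -> None:
    config = model_config_pb2.ModelConfig()
    with open(config_path) as f:
        text_format.Parse(f.read(), config)

    # Triton-level batching limits.
    config.max_batch_size = 64
    config.dynamic_batching.max_queue_delay_microseconds = 100

    # TRT-LLM backend parameters (string-valued in config.pbtxt).
    config.parameters["gpt_model_type"].string_value = "inflight_fused_batching"
    config.parameters["batch_scheduler_policy"].string_value = "max_utilization"
    config.parameters["kv_cache_free_gpu_mem_fraction"].string_value = "0.9"

    with open(config_path, "w") as f:
        f.write(text_format.MessageToString(config))


if __name__ == "__main__":
    tune_batching("triton_model_repo/tensorrt_llm/config.pbtxt")  # placeholder path
```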

Looking at the /metrics endpoint, everything seems fine.
The only odd thing is that the cumulative queue time for the preprocessing and multimodal_encoders models is quite high,
but not high enough to fully explain the large performance gap.
For example, to complete 300 total requests at a concurrency of 30,
Triton takes around 65 seconds, whereas SGLang takes about 50 seconds.
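
For reference, the queue-time numbers come from the per-model counters in /metrics. A minimal scrape along these lines (assuming the default metrics port 8002 and the standard nv_inference_* counters) is enough to get cumulative and per-request queue time per model:

```python
# Minimal sketch for turning Triton's Prometheus /metrics output into
# per-model queue times. Assumes the default metrics port (8002);
# nv_inference_queue_duration_us (cumulative microseconds) and
# nv_inference_request_success (request count) are standard Triton counters.
import re
from collections import defaultdict

import requests

METRICS_URL = "http://localhost:8002/metrics"  # default Triton metrics port

LINE_RE = re.compile(
    r'^(nv_inference_\w+)\{[^}]*model="([^"]+)"[^}]*\}\s+([0-9.e+]+)'
)


def per_model_queue_time():
    counters = defaultdict(dict)  # model -> metric name -> accumulated value
    for line in requests.get(METRICS_URL, timeout=10).text.splitlines():
        m = LINE_RE.match(line)
        if m:
            metric, model, value = m.group(1), m.group(2), float(m.group(3))
            counters[model][metric] = counters[model].get(metric, 0.0) + value

    for model, metrics in counters.items():
        queue_us = metrics.get("nv_inference_queue_duration_us", 0.0)
        ok = metrics.get("nv_inference_request_success", 0.0)
        if ok:
            print(f"{model}: cumulative queue {queue_us / 1e6:.2f}s, "
                  f"avg {queue_us / ok / 1000:.1f} ms/request")


if __name__ == "__main__":
    per_model_queue_time()
```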

My question is:
since TRT-LLM as an inference backend appears to be faster than SGLang's inference engine
(this can be seen from the lower end-to-end latency on single inference requests),
can I expect better performance even at high concurrency by finding the right combination of parameters,
or does SGLang have additional optimizations that inherently prevent Triton Inference Server from reaching the same performance?
