Hello everyone,
I have been experimenting with the llava-onevision model.
Converting the model to TRT-LLM and serving it with Triton Inference Server works well,
but I have the impression that I haven't configured the individual models in the pipeline optimally.
I ran quick benchmarks of the same model on vLLM and SGLang (the serving framework recommended in the LLaVA-OneVision documentation).
With no concurrency (single sequential requests), Triton is faster.
As concurrency increases, Triton still outperforms vLLM, but falls significantly behind SGLang.
I have tried various parameter combinations, especially around batch size and in-flight batching,
which improved inference times, but not enough to match SGLang.
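To make that concrete, the kind of sweep I mean looks roughly like the sketch below. The parameter names follow the tensorrtllm_backend config.pbtxt template I am using (exact names can differ between backend versions), and the values are purely illustrative, not a recommendation:

```python
from itertools import product

# Hypothetical sweep over the batching-related knobs of the tensorrt_llm model's
# config.pbtxt. Names follow the tensorrtllm_backend template; exact names and
# accepted values depend on the backend version -- treat this as a sketch only.
grid = {
    "max_batch_size": [8, 16, 32],
    "gpt_model_type": ["inflight_fused_batching"],            # vs. "V1" static batching
    "batch_scheduler_policy": ["max_utilization", "guaranteed_no_evict"],
    "max_queue_delay_microseconds": [0, 1000],
    "kv_cache_free_gpu_mem_fraction": [0.85, 0.90],
}

for combo in product(*grid.values()):
    settings = dict(zip(grid.keys(), combo))
    print(settings)  # each dict is one configuration to rebuild/relaunch Triton with
```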
Looking at the /metrics endpoint, everything seems fine.
The only oddity is that the cumulative queue time for the preprocessing and multimodal_encoders models is quite high,
though not high enough to fully explain the performance gap.
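For reference, this is roughly how I read the per-model queue time out of the Prometheus metrics; the port and the set of model names are assumptions about my particular ensemble:

```python
import re
import urllib.request

# Scrape Triton's Prometheus endpoint and report cumulative queue time per model.
# nv_inference_queue_duration_us and nv_inference_count are standard Triton metrics;
# the URL (default metrics port 8002) and model names are assumptions about this setup.
METRICS_URL = "http://localhost:8002/metrics"
text = urllib.request.urlopen(METRICS_URL).read().decode()

def metric(name, model):
    pattern = rf'{name}\{{[^}}]*model="{model}"[^}}]*\}}\s+([0-9.e+]+)'
    m = re.search(pattern, text)
    return float(m.group(1)) if m else 0.0

for model in ("preprocessing", "multimodal_encoders", "tensorrt_llm", "postprocessing"):
    queue_us = metric("nv_inference_queue_duration_us", model)
    count = metric("nv_inference_count", model)
    avg_ms = queue_us / count / 1000 if count else 0.0
    print(f"{model:22s} total queue {queue_us / 1e6:8.2f} s   avg {avg_ms:6.1f} ms/inference")
```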
For example, to complete 300 requests at a concurrency of 30,
Triton takes around 65 seconds, whereas SGLang takes about 50 seconds.
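The benchmark itself is nothing sophisticated; a minimal sketch of the measurement is below. The endpoint, model name, and payload fields are assumptions about my ensemble (a real multimodal request would also carry the image input, whose field name depends on the ensemble config):

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Minimal fixed-concurrency throughput test: 300 requests, ~30 in flight at a time.
# The generate endpoint and payload fields are assumptions about my ensemble;
# text-only here for brevity (the image input is omitted).
URL = "http://localhost:8000/v2/models/ensemble/generate"
PAYLOAD = {"text_input": "Describe the weather today.", "max_tokens": 128}
TOTAL, CONCURRENCY = 300, 30

def one_request(_):
    req = urllib.request.Request(
        URL,
        data=json.dumps(PAYLOAD).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        resp.read()

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    list(pool.map(one_request, range(TOTAL)))
elapsed = time.perf_counter() - start
print(f"{TOTAL} requests @ concurrency {CONCURRENCY}: {elapsed:.1f} s "
      f"({TOTAL / elapsed:.1f} req/s)")
```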
My question is:
Since TRT-LLM as an inference backend appears to be faster than SGLang's engine
(as seen in the lower end-to-end latency on single inference requests),
can I expect to eventually match or beat SGLang at high concurrency by finding the right combination of parameters,
or does SGLang apply additional optimizations that make it inherently impossible to reach the same performance with Triton Inference Server?