TensorRT-LLM Backend with multiple LLMs (not to be confused with multi-model) #714

Open
dduck999 opened this issue Feb 27, 2025 · 5 comments

Comments

dduck999 commented Feb 27, 2025

I followed this guide: https://github.com/triton-inference-server/tensorrtllm_backend?tab=readme-ov-file#quick-start

I got this working, but it only serves one model. It also breaks my understanding of the one-model-per-folder structure of the Triton model repository: an inflight-batching setup for a single model already consists of 4-5 model folders, so how would I add another model, which needs another 4-5 folders of its own, and how would those folders be named?

Is inflight batching just one option (although every example for Triton + the TRT-LLM backend seems to use it)? Or can what I want to do simply be accomplished by copying the engine files (rank0.engine, rank1.engine, etc.) into the model folders?

I've been stuck on this for a week now. Any hints/help/comments would be greatly appreciated!
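
For context, the working single-model repository I got from the quick start looks roughly like this (folder names as in the repo's all_models/inflight_batcher_llm example):

```
triton_model_repo/
├── ensemble/           # chains preprocessing -> tensorrt_llm -> postprocessing
├── preprocessing/      # tokenizer (Python backend)
├── tensorrt_llm/       # config.pbtxt plus the engine files (rank0.engine, ...)
├── tensorrt_llm_bls/   # optional BLS alternative to the ensemble
└── postprocessing/     # detokenizer (Python backend)
```

So "one model" already occupies five Triton model folders, which is exactly why I'm unsure how a second model is supposed to fit in.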

@adityarajsahu

Hi @dduck999, I am trying to deploy the Qwen2.5-0.5B-Instruct model on Triton using tensorrtllm_backend. I have converted the model to a TensorRT-LLM engine, but I am unable to deploy that engine on Triton. Could you please share the steps you used to deploy the engine, along with the Docker image and package versions?

@dduck999 (Author)

Hi @adityarajsahu, have you built the folder structure with the template config files as in the example, and then run the script that fills in those templates?
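
For reference, the flow from the quick start is roughly the following (a sketch only; the exact fill_template.py keys and paths vary between tensorrtllm_backend releases, so check the README of your version):

```bash
# Copy the inflight-batching example into a fresh model repository
cp -r tensorrtllm_backend/all_models/inflight_batcher_llm triton_model_repo

# Fill in the placeholders in config.pbtxt (keys and paths here are illustrative)
python3 tensorrtllm_backend/tools/fill_template.py -i \
    triton_model_repo/tensorrt_llm/config.pbtxt \
    triton_backend:tensorrtllm,engine_dir:/engines/qwen2.5-0.5b,triton_max_batch_size:64,decoupled_mode:False

# Launch Triton against the filled-in repository
python3 tensorrtllm_backend/scripts/launch_triton_server.py \
    --world_size 1 --model_repo triton_model_repo
```

All of this runs inside the Triton TRT-LLM container, e.g. nvcr.io/nvidia/tritonserver:&lt;xx.yy&gt;-trtllm-python-py3, where the tag has to match the TensorRT-LLM version you built the engine with.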

@adityarajsahu

Hey @dduck999, just found the issue, thanks.

dduck999 (Author) commented Mar 12, 2025

> Hey @dduck999, just found the issue, thanks.

@adityarajsahu great! If you ever find a way to host more than one model behind one Triton TRT-LLM endpoint, please share 😊

jadhosn commented Mar 24, 2025

> If you ever find a way to host more than one model behind one Triton TRT-LLM endpoint, please share 😊

Have you tried a custom Python BLS model or an ensemble? You can host any number of models behind a single endpoint that way.
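
A very rough sketch of what the BLS route could look like (tensor names such as model_name, input_ids, request_output_len and the downstream model names are placeholders you would adapt to your repository, and this assumes the downstream TRT-LLM models are not running in decoupled/streaming mode):

```python
# model.py for a Python-backend BLS model that routes each request to one of
# several TRT-LLM models living in the same Triton model repository.
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            # Hypothetical extra input naming the downstream model to call,
            # e.g. b"qwen_tensorrt_llm" or b"llama_tensorrt_llm".
            selector = pb_utils.get_input_tensor_by_name(request, "model_name")
            target = selector.as_numpy().flatten()[0].decode()

            # Forward the relevant input tensors to the chosen model.
            infer_request = pb_utils.InferenceRequest(
                model_name=target,
                requested_output_names=["output_ids"],
                inputs=[
                    pb_utils.get_input_tensor_by_name(request, "input_ids"),
                    pb_utils.get_input_tensor_by_name(request, "request_output_len"),
                ],
            )
            infer_response = infer_request.exec()

            if infer_response.has_error():
                responses.append(pb_utils.InferenceResponse(
                    error=pb_utils.TritonError(infer_response.error().message())))
                continue

            output_ids = pb_utils.get_output_tensor_by_name(infer_response, "output_ids")
            responses.append(pb_utils.InferenceResponse(output_tensors=[output_ids]))
        return responses
```

Clients then always call the router model, and which engine actually runs is just another input value; each model's preprocessing/engine/postprocessing folders only need unique model names inside the repository.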
