[MISC] Dump model runner inputs when crashing #8305
Can we add instructions to the GitHub issue template so users can share their logs when they encounter such errors?
Good point. Will do.
Do we need to add a flag for this? It looks like a debugging feature that could also be added to https://docs.vllm.ai/en/latest/getting_started/debugging.html
The goal is to get logs from production usage to help track down hard-to-replicate bugs (like illegal memory access in prefix caching), so putting this behind a flag would defeat the purpose.
Makes sense then. Please ignore my comment.
@DarkLight1337 Added to the issue template. PTAL.
LGTM. We should be careful when loading untrusted pickle files though.
It should be fine since we never load it automatically? But yes, you could get a virus if someone posts a malicious pickle file to an issue...
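For anyone who does need to open a dump attached to an issue, one way to reduce the risk is the "restricting globals" pattern from the Python `pickle` docs. This is only a sketch: the `ALLOWED` set and `restricted_loads` helper are hypothetical names, and a real dump would likely reference vLLM and tensor classes that would have to be reviewed and added to the allowlist before it could be loaded this way.

```python
import io
import pickle

# Assumption: only these plain builtins may be reconstructed from the file.
ALLOWED = {
    ("builtins", name)
    for name in ("dict", "list", "tuple", "set", "frozenset",
                 "int", "float", "str", "bytes", "bool", "complex")
}


class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called whenever the pickle stream references a global
        # (a class or function). Refuse anything off the allowlist.
        if (module, name) in ALLOWED:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(
            f"global '{module}.{name}' is forbidden")


def restricted_loads(data):
    """Like pickle.loads, but only allowlisted globals are resolved."""
    return RestrictedUnpickler(io.BytesIO(data)).load()
```

An allowlist narrows what a file can reference, but it is not a sandbox. Untrusted dumps are still best opened in an isolated environment.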
vllm-project#8305 was recently added to dump model runner inputs when encountering a fatal error. If this happens during decode, however, the dump will include the KV cache tensors, which are typically huge (~60GB in the case I was testing) and can therefore take minutes to write to disk. When this happens, the engine loop is blocked and health checks time out, causing the server to be killed. This change replaces the KV cache tensors with their dtype and shape. With this, pickling takes less than a second, and the file size in my test case was 7KB.
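A rough sketch of the idea, not the actual change: before pickling, swap each KV cache tensor for a small record of its dtype and shape. The helper names and the `kv_caches` key are assumptions made for illustration. The check is duck-typed, so it applies to any object exposing `dtype` and `shape`.

```python
import pickle


def summarize_tensor(obj):
    """Return a small dict describing a tensor-like object, else obj itself."""
    if hasattr(obj, "dtype") and hasattr(obj, "shape"):
        return {"dtype": str(obj.dtype), "shape": tuple(obj.shape)}
    return obj


def strip_kv_caches(inputs, key="kv_caches"):
    """Shallow-copy `inputs`, replacing each KV cache entry with its summary.

    `key` is a hypothetical field name; the original dict is left unmodified.
    """
    slim = dict(inputs)
    if key in slim:
        slim[key] = [summarize_tensor(t) for t in slim[key]]
    return slim


def dump_slim(inputs, path):
    """Pickle the slimmed-down inputs to `path`."""
    with open(path, "wb") as f:
        pickle.dump(strip_kv_caches(inputs), f)
```

The dtype and shape are still enough to allocate a placeholder cache of the right size when replaying the dump, even though the cache contents are lost.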
Signed-off-by: Alvant <alvasian@yandex.ru>
Signed-off-by: LeiWang1999 <leiwang1999@outlook.com>
To make model runner crashes caused by illegal memory access (and possibly other errors) easier to reproduce, this PR introduces a utility that dumps the model runner inputs when a crash occurs. Since the model runner inputs may be large, I dump them using pickle for now. Any suggestions or better ideas are welcome.
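As a minimal illustration of the approach (not the code in this PR), the dump can be sketched as a decorator that pickles a function's arguments when it raises. The decorator name, file naming scheme, and `dump_dir` parameter are all hypothetical.

```python
import functools
import os
import pickle
import tempfile
import time


def dump_inputs_on_crash(dump_dir=None):
    """Hypothetical decorator: pickle a function's arguments if it raises."""
    dump_dir = dump_dir or tempfile.gettempdir()

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception as err:
                path = os.path.join(
                    dump_dir, f"err_{func.__name__}_{time.time_ns()}.pkl")
                try:
                    with open(path, "wb") as f:
                        pickle.dump({"args": args, "kwargs": kwargs}, f)
                    note = f"inputs dumped to {path}"
                except Exception as dump_err:
                    # Dumping is best effort and must not hide the real error.
                    note = f"input dump failed: {dump_err!r}"
                # Chain the original exception so its traceback is preserved.
                raise RuntimeError(
                    f"{func.__name__} failed; {note}") from err

        return wrapper

    return decorator
```

Wrapping the function as `@dump_inputs_on_crash()` keeps the happy path untouched. The only extra work happens after a failure, which is why this can stay on in production without a flag.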
cc @robertgshaw2-neuralmagic @simon-mo @DarkLight1337