[jobs] catch unimportant errors in controller process #4615
Conversation
4dee032 to 5d480f0
/quicktest-core
/smoke-test --managed-jobs
Thanks @cg505!
sky/jobs/controller.py (Outdated)
logger.info(
    'The user job failed. Please check the logs below.\n'
    f'== Logs of the user job (ID: {self._job_id}) ==\n')

- self._download_log_and_stream(task_id, handle)
+ try:
+     self._download_log_and_stream(task_id, handle)
It seems we already have error handling inside the function. Should we handle it again here?
The handling inside the function only covers some cases. I think we should catch a broader exception here. I will check if we can move the handling inside the function without breaking the SkyServe usage.
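For reference, a minimal, self-contained sketch of the pattern under discussion: wrap the log download in a broad try/except so that a post-completion failure to reach the cluster is logged instead of crashing the controller. The names below (download_and_stream_logs, stream_logs_best_effort) are hypothetical stand-ins for the controller's _download_log_and_stream, not SkyPilot's actual API.

    import logging

    logger = logging.getLogger(__name__)

    def download_and_stream_logs(job_id: int) -> None:
        # Hypothetical stand-in for the controller's _download_log_and_stream;
        # here it always fails, as if the cluster were already torn down.
        raise RuntimeError('cluster is unreachable')

    def stream_logs_best_effort(job_id: int) -> None:
        # The job has already finished, so log streaming is best-effort:
        # swallow the error and keep the controller process alive.
        try:
            download_and_stream_logs(job_id)
        except Exception as e:  # pylint: disable=broad-except
            logger.warning('Failed to stream logs for job %d: %s', job_id, e)

    if __name__ == '__main__':
        logging.basicConfig(level=logging.INFO)
        stream_logs_best_effort(42)  # prints a warning instead of crashing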
sky/jobs/controller.py (Outdated)
end_time = managed_job_utils.get_job_timestamp(
    self._backend, cluster_name, get_end_time=True)
Should we have the function handle the exceptions inside?
That seems counterintuitive; it could break other uses of the function that do not expect this fallback behavior.
Well, it seems this get_job_timestamp is only called twice in this file, and both call sites need the same exception-handling logic. We can probably rename the function to something like try_to_get_job_timestamp_or_current_time and have the exception handling inside?
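A rough sketch of what that renamed helper might look like, assuming the fallback is the current wall-clock time (as the suggested name implies); the stub below stands in for the real managed_job_utils.get_job_timestamp, whose actual signature may differ.

    import logging
    import time

    logger = logging.getLogger(__name__)

    def get_job_timestamp(backend, cluster_name: str, get_end_time: bool) -> float:
        # Hypothetical stand-in for managed_job_utils.get_job_timestamp: the real
        # function queries the job's cluster, which may already be unreachable.
        raise RuntimeError('cluster is unreachable')

    def try_to_get_job_timestamp_or_current_time(backend, cluster_name: str,
                                                 get_end_time: bool) -> float:
        # The wrapper suggested in the review: keep the try/except in one place
        # and fall back to the current time when the cluster cannot be reached.
        try:
            return get_job_timestamp(backend, cluster_name, get_end_time)
        except Exception as e:  # pylint: disable=broad-except
            logger.warning('Failed to get the job timestamp from %s; '
                           'falling back to the current time: %s', cluster_name, e)
            return time.time()

Both call sites in controller.py would then go through this one wrapper instead of duplicating the try/except.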
Thanks @cg505! LGTM
After the managed job completes, there are a few cases where we still try to access the cluster. If that fails, don't crash the controller.
Tested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
conda deactivate; bash -i tests/backward_compatibility_tests.sh