
[jobs] catch unimportant errors in controller process #4615

Merged

cg505 merged 3 commits into skypilot-org:master from jobs-preemption-handling on Feb 1, 2025

Conversation

@cg505 (Collaborator) commented Jan 28, 2025

After the managed job completes, there are a few cases where we still try to access the cluster. If that fails, don't crash the controller.
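For context, a minimal sketch of the pattern this change applies, assuming the post-completion cluster accesses look roughly like the call below (`_download_log_and_stream` appears in the diff; the exception breadth and log message are illustrative, not the exact code):

```python
# Sketch: after the managed job has finished, cluster access is
# best-effort. The cluster may already be preempted or torn down, so
# failures are logged and swallowed instead of crashing the controller.
try:
    # e.g. fetching the job logs or querying the job's end timestamp.
    self._download_log_and_stream(task_id, handle)
except Exception as e:  # pylint: disable=broad-except
    # The user job itself has already completed; this error is unimportant.
    logger.warning(f'Failed to access the cluster after job completion: {e}')
```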

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

@cg505 cg505 requested a review from romilbhardwaj January 28, 2025 01:05
@cg505 cg505 force-pushed the jobs-preemption-handling branch from 4dee032 to 5d480f0 on January 28, 2025 01:07
@cg505 cg505 added this to the v0.8.0 milestone Jan 28, 2025
@cg505 (Collaborator, Author) commented Jan 28, 2025

/quicktest-core

@cg505 (Collaborator, Author) commented Jan 28, 2025

/smoke-test --managed-jobs

@cg505 cg505 requested a review from Michaelvll January 29, 2025 19:17
@Michaelvll (Collaborator) left a comment

Thanks @cg505!

```diff
 logger.info(
     'The user job failed. Please check the logs below.\n'
     f'== Logs of the user job (ID: {self._job_id}) ==\n')

-self._download_log_and_stream(task_id, handle)
+try:
+    self._download_log_and_stream(task_id, handle)
```
Collaborator commented:

It seems we already have the error handling in the function. Should we handle it again here?

cg505 (Collaborator, Author) replied:
The handling inside the function only covers some cases. I think we should catch a broader exception here. I will check if we can move the handling inside the function without breaking the SkyServe usage.
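For illustration, the broader call-site catch under discussion might look like this (a sketch only; the exception type and log message are assumptions):

```python
try:
    self._download_log_and_stream(task_id, handle)
except Exception as e:  # pylint: disable=broad-except
    # The function's internal handling covers only some failure modes;
    # anything else (e.g. the cluster being gone) should not crash the
    # controller, since the user job has already finished.
    logger.error(f'Failed to download and stream the job logs: {e}')
```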

Comment on lines 317 to 318:

```python
end_time = managed_job_utils.get_job_timestamp(
    self._backend, cluster_name, get_end_time=True)
```
Collaborator commented:

Should the function handle the exceptions internally?

cg505 (Collaborator, Author) replied:

That seems counterintuitive; it could break other uses of the function that do not expect this fallback behavior.

Collaborator replied:

Well, it seems this get_job_timestamp is only called twice in this file, and both calls need the same exception-handling logic. We could probably rename the function to something like try_to_get_job_timestamp_or_current_time and handle the exception inside it?
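As a sketch of what that wrapper could look like (the name follows the suggestion above; `get_job_timestamp` and `logger` are assumed to come from the surrounding module, and the fallback logic is an assumption, not the merged code):

```python
import time


def try_to_get_job_timestamp_or_current_time(backend, cluster_name,
                                             get_end_time):
    """Best-effort job timestamp that falls back to the current time.

    If the cluster is already unreachable (e.g. torn down after the job
    finished), querying it raises; instead of letting that crash the
    controller, approximate the timestamp with time.time().
    """
    try:
        return get_job_timestamp(backend, cluster_name,
                                 get_end_time=get_end_time)
    except Exception as e:  # pylint: disable=broad-except
        logger.warning(f'Failed to get the job timestamp: {e}. '
                       'Falling back to the current time.')
        return time.time()
```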

@cg505 cg505 requested a review from Michaelvll January 31, 2025 21:07
@Michaelvll (Collaborator) left a comment

Thanks @cg505! LGTM

@cg505 cg505 merged commit 5c324c1 into skypilot-org:master Feb 1, 2025
18 checks passed