
[jobs] catch unimportant errors in controller process #4615

Merged

cg505 merged 3 commits into skypilot-org:master from jobs-preemption-handling on Feb 1, 2025

Conversation

@cg505 (Collaborator) commented Jan 28, 2025

After the managed job completes, there are a few cases where we still try to access the cluster. If that fails, don't crash the controller.
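For context, a minimal sketch of the pattern this change applies, assuming the post-completion cluster accesses look roughly like the call below (`_download_log_and_stream` appears in the diff; the exception breadth and log message are illustrative, not the exact code):

```python
# Sketch: after the managed job has finished, cluster access is
# best-effort. The cluster may already be preempted or torn down, so
# failures are logged and swallowed instead of crashing the controller.
try:
    # e.g. fetching the job logs or querying the job's end timestamp.
    self._download_log_and_stream(task_id, handle)
except Exception as e:  # pylint: disable=broad-except
    # The user job itself has already completed; this error is unimportant.
    logger.warning(f'Failed to access the cluster after job completion: {e}')
```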

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

@cg505 cg505 requested a review from romilbhardwaj January 28, 2025 01:05
@cg505 cg505 force-pushed the jobs-preemption-handling branch from 4dee032 to 5d480f0 on January 28, 2025 01:07
@cg505 cg505 added this to the v0.8.0 milestone Jan 28, 2025
@cg505 (Collaborator, Author) commented Jan 28, 2025

/quicktest-core

@cg505 (Collaborator, Author) commented Jan 28, 2025

/smoke-test --managed-jobs

@cg505 cg505 requested a review from Michaelvll January 29, 2025 19:17
@Michaelvll (Collaborator) left a comment

Thanks @cg505!

```diff
 logger.info(
     'The user job failed. Please check the logs below.\n'
     f'== Logs of the user job (ID: {self._job_id}) ==\n')

-self._download_log_and_stream(task_id, handle)
+try:
+    self._download_log_and_stream(task_id, handle)
```
Collaborator commented:

It seems we already have the error handling in the function. Should we handle it again here?

cg505 (Collaborator, Author) replied:
The handling inside the function only covers some cases. I think we should catch a broader exception here. I will check if we can move the handling inside the function without breaking the SkyServe usage.
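For illustration, the broader call-site catch under discussion might look like this (a sketch only; the exception type and log message are assumptions):

```python
try:
    self._download_log_and_stream(task_id, handle)
except Exception as e:  # pylint: disable=broad-except
    # The function's internal handling covers only some failure modes;
    # anything else (e.g. the cluster being gone) should not crash the
    # controller, since the user job has already finished.
    logger.error(f'Failed to download and stream the job logs: {e}')
```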

Comment on lines 317 to 318:

```python
end_time = managed_job_utils.get_job_timestamp(
    self._backend, cluster_name, get_end_time=True)
```
Collaborator commented:

Should the function handle the exceptions internally?

cg505 (Collaborator, Author) replied:

That seems counterintuitive; it could break other uses of the function that do not expect this fallback behavior.

Collaborator replied:

Well, it seems this get_job_timestamp is only called twice in this file, and both calls need the same exception-handling logic. We could probably rename the function to something like try_to_get_job_timestamp_or_current_time and handle the exception inside it?
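As a sketch of what that wrapper could look like (the name follows the suggestion above; `get_job_timestamp` and `logger` are assumed to come from the surrounding module, and the fallback logic is an assumption, not the merged code):

```python
import time


def try_to_get_job_timestamp_or_current_time(backend, cluster_name,
                                             get_end_time):
    """Best-effort job timestamp that falls back to the current time.

    If the cluster is already unreachable (e.g. torn down after the job
    finished), querying it raises; instead of letting that crash the
    controller, approximate the timestamp with time.time().
    """
    try:
        return get_job_timestamp(backend, cluster_name,
                                 get_end_time=get_end_time)
    except Exception as e:  # pylint: disable=broad-except
        logger.warning(f'Failed to get the job timestamp: {e}. '
                       'Falling back to the current time.')
        return time.time()
```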

@cg505 cg505 requested a review from Michaelvll January 31, 2025 21:07
@Michaelvll (Collaborator) left a comment

Thanks @cg505! LGTM

@cg505 cg505 merged commit 5c324c1 into skypilot-org:master Feb 1, 2025
18 checks passed