[UX] Allow purging non-existent k8s cluster #4980


Merged: 6 commits merged into skypilot-org:master on Mar 25, 2025

Conversation

@kyuds (Collaborator) commented Mar 18, 2025

Fixes #4769.

SkyPilot was crashing because it could not find the Kubernetes cluster in the _detect_abnormal_non_terminated_nodes helper method in cloud_vm_ray_backend.py. The root cause is an earlier no-cluster exception in the port cleanup section: during teardown, the port cleanup calls run BEFORE the cluster metadata is cleared from the state tables, so when port cleanup raises because the cluster does not exist, the metadata is never cleared and execution still reaches _detect_abnormal_non_terminated_nodes, which then fails. Since that call is the last in the chain, this PR simply checks whether purge is set and, if so, clears any remaining metadata instead of crashing.
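A minimal, self-contained sketch of that approach follows; the exception type and function names are illustrative stand-ins, not SkyPilot's exact identifiers:

    import logging

    logger = logging.getLogger(__name__)


    class ClusterNotFoundError(Exception):
        """Illustrative stand-in for the error raised when the k8s cluster is gone."""


    def cleanup_abnormal_nodes(detect_fn, purge: bool) -> None:
        """Run the node check; with purge set, log and continue instead of crashing."""
        try:
            detect_fn()
        except ClusterNotFoundError as exc:
            if not purge:
                raise
            # With --purge, swallow the error so the metadata cleanup that
            # follows can still clear the state tables.
            logger.warning('Failed abnormal non-terminated nodes cleanup. '
                           f'Skipping as purge is set. Details: {exc}')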

Tested (run the relevant ones):

  • Code formatting: install pre-commit (auto-check on commit) or bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: /smoke-test (CI) or pytest tests/test_smoke.py (local)
  • Relevant individual tests: /smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
  • Backward compatibility: /quicktest-core (CI) or conda deactivate; bash -i tests/backward_compatibility_tests.sh (local)

@kyuds (Collaborator, Author) commented Mar 18, 2025

@romilbhardwaj @Michaelvll requesting a review when you have time

Review thread on the added code:

    'Failed abnormal non-terminated nodes cleanup. '
    'Skipping and cleaning up as purge is set. Cleaning '
    f'up cluster configuration data. Details: {msg}')
    self.remove_cluster_config(handle)
Collaborator:

Why is this needed?

@kyuds (Collaborator, Author), Mar 19, 2025:

It's mainly to remove the config YAML file as well, since remove_cluster (called on line 4373) doesn't contain code to do so. One of the main reasons execution even reaches this point despite the missing cluster is that the YAML file still exists in the first place.

However, I think it's worth looking closely at the definition of remove_cluster_config:

def remove_cluster_config(self, handle: CloudVmRayResourceHandle) -> None:
        """Remove the YAML config of a cluster."""
        handle.cluster_yaml = None
        global_user_state.update_cluster_handle(handle.cluster_name, handle)
        common_utils.remove_file_if_exists(handle.cluster_yaml)

By the time common_utils.remove_file_if_exists is called, handle.cluster_yaml has already been set to None, so the call does nothing; I missed that. Either this is a bug in remove_cluster_config (the likely intended behavior is to call common_utils.remove_file_if_exists BEFORE setting handle.cluster_yaml = None), or we can remove the self.remove_cluster_config call entirely.
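For reference, a sketch of that reordering (delete the file while the path is still known, then clear the field); whether to fix it this way is discussed further below:

    def remove_cluster_config(self, handle: CloudVmRayResourceHandle) -> None:
        """Remove the YAML config of a cluster."""
        # Remove the file first, while handle.cluster_yaml still holds the path,
        # then clear the field and persist the updated handle.
        common_utils.remove_file_if_exists(handle.cluster_yaml)
        handle.cluster_yaml = None
        global_user_state.update_cluster_handle(handle.cluster_name, handle)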

Collaborator:

@Michaelvll do you know what's happened here? Seems like we have inconsistent handling of the cluster yaml file.

@kyuds (Collaborator, Author):

@cg505 any potential updates on this?

Collaborator:

Neither @Michaelvll nor I am sure what happened here. I'd like to dig into it a little and get back to you, but it might have to be tomorrow.

Collaborator:

One thing that would be helpful to clarify - in the non-error case (that is, the cluster is normal and the cluster_yaml is present), when does the cluster yaml get removed? Why don't we hit that path in the non-existent purge case?

@kyuds (Collaborator, Author):

Hmm, I'm not 100% confident about that part of the code path at the moment, but I'll be able to look into it a bit more tomorrow.

Collaborator:

Okay so there are two issues:

  1. Should we call self.remove_cluster_config here?
  2. self.remove_cluster_config has a bug where it doesn't actually do anything.

Let's table 2. I think it's pretty clear what the function was supposed to do, but I'm worried fixing it may introduce some other problems. We want to be careful with that fix, so let's not worry about it here. I'll open an issue.

For 1, let's avoid calling it here; ideally we should refactor the if terminate block above to handle this case instead. However, that also seems broken, because we will never hit that if statement when the cleanup succeeds...

The more I look at this code the more I'm convinced it's pure luck that it works 😂.

Don't want to block this PR on it, so let's just remove the remove_cluster_config call and get this fix in.
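Concretely, based on the excerpt at the top of this thread (the enclosing logger.warning call is an assumption; only the string arguments appear in the excerpt), the purge branch would keep the warning and drop the config-removal call:

    logger.warning(  # assumed enclosing call; only the strings appear in the excerpt above
        'Failed abnormal non-terminated nodes cleanup. '
        'Skipping and cleaning up as purge is set. Cleaning '
        f'up cluster configuration data. Details: {msg}')
    # self.remove_cluster_config(handle)  # dropped per this review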


@kyuds (Collaborator, Author):

Got it. I committed the change and also ran some tests to make sure the originally intended functionality still works.

@cg505 (Collaborator) left a comment:

Left one suggestion, otherwise this looks good. Thank you!

kyuds and others added 3 commits March 25, 2025 09:04
Co-authored-by: Christopher Cooper <christopher@cg505.com>
@kyuds kyuds requested a review from cg505 March 25, 2025 00:40
@cg505 (Collaborator) left a comment:

Thank you for the fix! Sorry for the format issue :)

@cg505 cg505 merged commit 2dba50f into skypilot-org:master Mar 25, 2025
18 checks passed
@kyuds kyuds deleted the fix/ne-jobc-k8s-p branch March 26, 2025 00:18

Successfully merging this pull request may close these issues.

[UX] Terminating jobs controller that is on a non-existing kubernetes cluster fails with -p