[UX] Allow purging non-existent k8s cluster #4980


Merged: 6 commits merged into skypilot-org:master on Mar 25, 2025

Conversation

@kyuds (Collaborator) commented Mar 18, 2025

Fixes #4769.

SkyPilot was crashing because it could not find the Kubernetes cluster in the _detect_abnormal_non_terminated_nodes helper method in cloud_vm_ray_backend.py. The root cause is an earlier no-cluster exception in the port cleanup section: during teardown, the port cleanup calls run BEFORE the cluster metadata is cleared from the state tables, so when port cleanup raises because the cluster does not exist, the metadata is never cleared and execution still reaches _detect_abnormal_non_terminated_nodes, which then fails. Since that call is the last in the chain, this PR simply checks whether purge is set and, if so, clears any remaining metadata instead of crashing.
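A minimal, self-contained sketch of that approach follows; the exception type and function names are illustrative stand-ins, not SkyPilot's exact identifiers:

    import logging

    logger = logging.getLogger(__name__)


    class ClusterNotFoundError(Exception):
        """Illustrative stand-in for the error raised when the k8s cluster is gone."""


    def cleanup_abnormal_nodes(detect_fn, purge: bool) -> None:
        """Run the node check; with purge set, log and continue instead of crashing."""
        try:
            detect_fn()
        except ClusterNotFoundError as exc:
            if not purge:
                raise
            # With --purge, swallow the error so the metadata cleanup that
            # follows can still clear the state tables.
            logger.warning('Failed abnormal non-terminated nodes cleanup. '
                           f'Skipping as purge is set. Details: {exc}')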

Tested (run the relevant ones):

  • Code formatting: install pre-commit (auto-check on commit) or bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: /smoke-test (CI) or pytest tests/test_smoke.py (local)
  • Relevant individual tests: /smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
  • Backward compatibility: /quicktest-core (CI) or conda deactivate; bash -i tests/backward_compatibility_tests.sh (local)

@kyuds (Collaborator, Author) commented Mar 18, 2025

@romilbhardwaj @Michaelvll requesting a review when you have time

Review thread on the added code:

    'Failed abnormal non-terminated nodes cleanup. '
    'Skipping and cleaning up as purge is set. Cleaning '
    f'up cluster configuration data. Details: {msg}')
    self.remove_cluster_config(handle)
Collaborator:

Why is this needed?

@kyuds (Collaborator, Author), Mar 19, 2025:

It's mainly to remove the config YAML file as well, since remove_cluster (called on line 4373) doesn't contain code to do so. One of the main reasons execution even reaches this point despite the missing cluster is that the YAML file still exists in the first place.

However, I think it's worth looking closely at the definition of remove_cluster_config:

def remove_cluster_config(self, handle: CloudVmRayResourceHandle) -> None:
        """Remove the YAML config of a cluster."""
        handle.cluster_yaml = None
        global_user_state.update_cluster_handle(handle.cluster_name, handle)
        common_utils.remove_file_if_exists(handle.cluster_yaml)

By the time common_utils.remove_file_if_exists is called, handle.cluster_yaml has already been set to None, so the call does nothing; I missed that. Either this is a bug in remove_cluster_config (the likely intended behavior is to call common_utils.remove_file_if_exists BEFORE setting handle.cluster_yaml = None), or we can remove the self.remove_cluster_config call entirely.
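For reference, a sketch of that reordering (delete the file while the path is still known, then clear the field); whether to fix it this way is discussed further below:

    def remove_cluster_config(self, handle: CloudVmRayResourceHandle) -> None:
        """Remove the YAML config of a cluster."""
        # Remove the file first, while handle.cluster_yaml still holds the path,
        # then clear the field and persist the updated handle.
        common_utils.remove_file_if_exists(handle.cluster_yaml)
        handle.cluster_yaml = None
        global_user_state.update_cluster_handle(handle.cluster_name, handle)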

Collaborator:

@Michaelvll do you know what's happened here? Seems like we have inconsistent handling of the cluster yaml file.

@kyuds (Collaborator, Author):

@cg505 any potential updates on this?

Collaborator:

Neither @Michaelvll nor I am sure what happened here. I'd like to dig into it a little and get back to you, but it might have to be tomorrow.

Collaborator:

One thing that would be helpful to clarify - in the non-error case (that is, the cluster is normal and the cluster_yaml is present), when does the cluster yaml get removed? Why don't we hit that path in the non-existent purge case?

@kyuds (Collaborator, Author):

Hmm, I'm not 100% confident about that part of the code path at the moment, but I'll be able to look into it a bit more tomorrow.

Collaborator:

Okay so there are two issues:

  1. Should we call self.remove_cluster_config here?
  2. self.remove_cluster_config has a bug where it doesn't actually do anything.

Let's table 2. I think it's pretty clear what the function was supposed to do, but I'm worried fixing it may introduce some other problems. We want to be careful with that fix, so let's not worry about it here. I'll open an issue.

For 1, let's avoid calling it here; ideally we should refactor the if terminate block above to handle this case instead. However, that also seems broken, because we will never hit that if statement when the cleanup succeeds...

The more I look at this code the more I'm convinced it's pure luck that it works 😂.

Don't want to block this PR on it, so let's just remove the remove_cluster_config call and get this fix in.
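Concretely, based on the excerpt at the top of this thread (the enclosing logger.warning call is an assumption; only the string arguments appear in the excerpt), the purge branch would keep the warning and drop the config-removal call:

    logger.warning(  # assumed enclosing call; only the strings appear in the excerpt above
        'Failed abnormal non-terminated nodes cleanup. '
        'Skipping and cleaning up as purge is set. Cleaning '
        f'up cluster configuration data. Details: {msg}')
    # self.remove_cluster_config(handle)  # dropped per this review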


@kyuds (Collaborator, Author):

Got it. I committed the change and also ran some tests to make sure the originally intended functionality still works.

@cg505 (Collaborator) left a comment:

Left one suggestion, otherwise this looks good. Thank you!

kyuds and others added 3 commits March 25, 2025 09:04
Co-authored-by: Christopher Cooper <christopher@cg505.com>
@kyuds kyuds requested a review from cg505 March 25, 2025 00:40
@cg505 (Collaborator) left a comment:

Thank you for the fix! Sorry for the format issue :)

@cg505 cg505 merged commit 2dba50f into skypilot-org:master Mar 25, 2025
18 checks passed
@kyuds kyuds deleted the fix/ne-jobc-k8s-p branch March 26, 2025 00:18

Successfully merging this pull request may close these issues.

[UX] Terminating jobs controller that is on a non-existing kubernetes cluster fails with -p