[UX] Allow purging non-existent k8s cluster #4980
Conversation
@romilbhardwaj @Michaelvll requesting a review when you have time.
sky/backends/cloud_vm_ray_backend.py
Outdated
'Failed abnormal non-terminated nodes cleanup. '
'Skipping and cleaning up as purge is set. Cleaning '
f'up cluster configuration data. Details: {msg}')
self.remove_cluster_config(handle)
Why is this needed?
It's mainly to remove the config yaml file too, as remove_cluster (called at line 4373) doesn't really contain code to do so. One of the main reasons the code executes to this point despite a missing cluster is that the yaml file exists in the first place.
However, I think it is worth considering the definition of remove_cluster_config:
def remove_cluster_config(self, handle: CloudVmRayResourceHandle) -> None:
"""Remove the YAML config of a cluster."""
handle.cluster_yaml = None
global_user_state.update_cluster_handle(handle.cluster_name, handle)
common_utils.remove_file_if_exists(handle.cluster_yaml)
By the time common_utils.remove_file_if_exists is called, handle.cluster_yaml is always None, so the call does nothing; I missed that. Either this is a bug in the remove_cluster_config code (the expected behavior is presumably to call common_utils.remove_file_if_exists BEFORE handle.cluster_yaml = None), or we can remove the self.remove_cluster_config call entirely.
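For reference, here is a rough sketch of what that reordering could look like (untested, just to make the possible fix concrete):

def remove_cluster_config(self, handle: CloudVmRayResourceHandle) -> None:
    """Remove the YAML config of a cluster."""
    # Delete the file first, while handle.cluster_yaml still holds the path,
    # then clear the field and persist the updated handle.
    common_utils.remove_file_if_exists(handle.cluster_yaml)
    handle.cluster_yaml = None
    global_user_state.update_cluster_handle(handle.cluster_name, handle)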
@Michaelvll do you know what's happened here? Seems like we have inconsistent handling of the cluster yaml file.
@cg505 any potential updates on this?
Neither @Michaelvll nor I am sure what happened here. I'd like to dig into it a little bit and get back to you, but it might have to be tomorrow.
One thing that would be helpful to clarify - in the non-error case (that is, the cluster is normal and the cluster_yaml is present), when does the cluster yaml get removed? Why don't we hit that path in the non-existent purge case?
Hmm, not 100% confident about that part of the code path atm, but I'll be able to look into it a bit more tomorrow.
Okay so there are two issues:
1. Should we call self.remove_cluster_config here?
2. self.remove_cluster_config has a bug where it doesn't actually do anything.
Let's table 2. I think it's pretty clear what the function was supposed to do, but I'm worried fixing it may introduce some other problems. We want to be careful with that fix, so let's not worry about it here. I'll open an issue.
For 1: let's avoid calling it here, since ideally we should refactor the above if terminate block to handle this case instead. However, that also seems broken, because we will never hit this if statement if it succeeds...
The more I look at this code the more I'm convinced it's pure luck that it works 😂.
Don't want to block this PR on it, so let's just remove the remove_cluster_config call and get this fix in.
Got it. I committed the change and also ran some tests to make sure that the original intended functionality still works.
Left one suggestion, otherwise this looks good. Thank you!
Co-authored-by: Christopher Cooper <christopher@cg505.com>
Thank you for the fix! Sorry for the format issue :)
Fixes #4769.
SkyPilot was crashing because it was unable to find the Kubernetes cluster in the _detect_abnormal_non_terminated_nodes helper method in cloud_vm_ray_backend.py. This happens because of an earlier no-cluster exception in the port cleanup section. During teardown, SkyPilot cleans up both the ports and the metadata in the state tables, but the port cleanup calls are processed BEFORE the metadata cleanup. This means that when an exception is raised during port cleanup, the metadata is not cleared, which eventually leads to the call to _detect_abnormal_non_terminated_nodes. Since that call is the last one in the chain, we simply check whether purge is set and clear any metadata there (a rough sketch of this handling follows).
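Roughly, the shape of the handling described above (a simplified sketch, not the actual diff; the exception type and the metadata-cleanup call are placeholders for whatever the real teardown path uses):

try:
    # `handle` and `purge` come from the surrounding teardown logic.
    self._detect_abnormal_non_terminated_nodes(handle)
except Exception as e:  # placeholder: the real code catches a specific error
    if not purge:
        raise
    logger.warning('Failed abnormal non-terminated nodes cleanup. '
                   'Skipping and cleaning up as purge is set. Cleaning '
                   f'up cluster configuration data. Details: {e}')
    # Placeholder for clearing the stale metadata from the local state
    # tables so the non-existent cluster no longer blocks the purge.
    global_user_state.remove_cluster(handle.cluster_name, terminate=True)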
Tested (run the relevant ones):
- bash format.sh
- /smoke-test (CI) or pytest tests/test_smoke.py (local)
- /smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
- /quicktest-core (CI) or conda deactivate; bash -i tests/backward_compatibility_tests.sh (local)