[core] optimizer/provisioner fails over for cloud with expired credentials #5015
Conversation
Manual failover test cases:
Result: Similar outputs for each cloud as in case 1.
/quicktest-core
Thanks! First pass, have some questions about which exceptions we need to catch.
Also, should we catch the new exception type when we call write_cluster_config? Here:
skypilot/sky/backends/cloud_vm_ray_backend.py (lines 1404 to 1416 in 51e4765):

```python
try:
    config_dict = backend_utils.write_cluster_config(
        to_provision,
        num_nodes,
        _get_cluster_config_template(to_provision.cloud),
        cluster_name,
        self._local_wheel_path,
        self._wheel_hash,
        region=region,
        zones=zones,
        dryrun=dryrun,
        keep_launch_fields_in_existing_config=cluster_exists)
except exceptions.ResourcesUnavailableError as e:
```
sky/backends/cloud_vm_ray_backend.py (Outdated):

```python
if output.find('InvalidCloudCredentials') != -1:
    _add_to_blocked_resources(
        blocked_resources, resources_lib.Resources(cloud=clouds.AWS()))
```
Shouldn't this already be handled by the default_handler? Why do we need a special case here?
The default handler only blocks one zone or one region at a time, while in this case we block the whole cloud.
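The distinction might be easier to see with a toy sketch. The classes and helper below are hypothetical stand-ins (not SkyPilot's actual `Resources`/blocklist implementation), just to illustrate how a cloud-wide block entry matches more candidates than a region-level one:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Resources:
    """Hypothetical stand-in: None fields act as wildcards when blocking."""
    cloud: str
    region: Optional[str] = None
    zone: Optional[str] = None


def is_blocked(candidate: Resources, blocked: set) -> bool:
    """A None field in a blocked entry matches any value in the candidate."""
    for b in blocked:
        if (b.cloud == candidate.cloud
                and b.region in (None, candidate.region)
                and b.zone in (None, candidate.zone)):
            return True
    return False


blocked = set()
# Default-handler-style behavior: block only the failing region.
blocked.add(Resources(cloud='aws', region='us-east-1'))
assert is_blocked(Resources('aws', 'us-east-1', 'us-east-1a'), blocked)
assert not is_blocked(Resources('aws', 'us-west-2', 'us-west-2a'), blocked)

# Invalid-credentials case: block the entire cloud so failover skips it.
blocked.add(Resources(cloud='aws'))
assert is_blocked(Resources('aws', 'us-west-2', 'us-west-2a'), blocked)
```

With only the region-level entry, other regions of the same cloud would still be retried one by one; the cloud-wide entry short-circuits all of them at once.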
+1, it looks like the exception handling:

```python
except exceptions.InvalidCloudCredentials as e:
    # Failed due to invalid cloud credentials.
```

already did similar work.
> The default handler only blocks one zone or one region at a time

I'm not sure this is right - it does `to_provision.copy(region=None, zone=None)`. I think removing the zone and region information should mean that this will block the whole cloud. Could we double-check this?
@cg505 Check the default handler logic below; it only blocks the region or zones at a time, and the provisioner will retry the others one by one.
https://github.com/skypilot-org/skypilot/blob/master/sky/backends/cloud_vm_ray_backend.py#L1105
https://github.com/skypilot-org/skypilot/blob/master/sky/backends/cloud_vm_ray_backend.py#L1110
@aylei Check the detailed provisioner logic below; the exception is not handled twice.
sky/backends/cloud_vm_ray_backend.py:

```python
# Pseudocode of the provisioning call chain.
_provision():
    try:
        provision_with_retries()
    except exceptions.ResourcesUnavailableError as e:
        fail

provision_with_retries():
    while True:
        try:
            _retry_zones()
        except (exceptions.InvalidClusterNameError,
                exceptions.NotSupportedError,
                exceptions.CloudUserIdentityError) as e:
            continue
        except exceptions.ResourcesUnavailableError as e:
            if not launchable_retries_disabled:
                continue

_retry_zones():
    for zones:
        try:
            write_cluster_config()
        except exceptions.ResourcesUnavailableError as e:
            continue
        except exceptions.InvalidCloudCredentials as e:  # newly added
            # Block the whole cloud.
            raise exceptions.ResourcesUnavailableError  # triggers failover in provision_with_retries()
        except exceptions.InvalidCloudConfigs as e:
            # Block the whole cloud.
            raise exceptions.ResourcesUnavailableError
        try:
            bulk_provision()
        except provision_common.StopFailoverError:
            raise
        except Exception as e:
            # This calls the _$cloud_handler to block the resources in zone,
            # region, or cloud. The _default_handler will only block one zone
            # at a time if `zones` is not empty.
            FailoverCloudErrorHandlerV2.update_blocklist_on_error(
                self._blocked_resources, to_provision, region, zones, e)
            continue
```

sky/provision/provisioner.py:

```python
bulk_provision():
    try:
        _bulk_provision()
    except exceptions.NoClusterLaunchedError:
        raise
    except exceptions.InvalidCloudCredentials:  # newly added; triggers blocking resources in handlers like _aws_handler
        raise
    except Exception:
        # Try to stop/terminate the resources that were provisioned.
        raise StopFailoverError
```
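Condensed into a small runnable toy (all names and signatures here are hypothetical simplifications, not SkyPilot code), the failover behavior under discussion looks like this: a credential failure blocks the entire cloud and the retry loop moves on to the next one:

```python
class InvalidCloudCredentials(Exception):
    """Hypothetical: raised when a cloud's credentials are expired/invalid."""


class ResourcesUnavailableError(Exception):
    """Hypothetical: raised when no cloud can satisfy the request."""


def write_cluster_config(cloud: str, valid_creds: set) -> None:
    """Toy stand-in: fails if the cloud's credentials are not valid."""
    if cloud not in valid_creds:
        raise InvalidCloudCredentials(cloud)


def provision_with_retries(clouds, valid_creds, blocked):
    """Try each candidate cloud; on a credential error, block the whole
    cloud and fail over to the next candidate."""
    for cloud in clouds:
        if cloud in blocked:
            continue
        try:
            write_cluster_config(cloud, valid_creds)
            return cloud  # provisioned successfully
        except InvalidCloudCredentials:
            blocked.add(cloud)  # block the entire cloud, not one region/zone
            continue
    raise ResourcesUnavailableError('all clouds exhausted')


blocked = set()
chosen = provision_with_retries(['aws', 'gcp'],
                                valid_creds={'gcp'},
                                blocked=blocked)
assert chosen == 'gcp'
assert blocked == {'aws'}
```

The key point matches the pseudocode above: the credential exception is converted into a cloud-wide block before the retry loop continues, so no further regions or zones of the failing cloud are attempted.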
Got it. I'm sorry I missed that.
I guess for most clouds, InvalidCloudCredentials should block the whole cloud, so we could probably consider moving this to the default handler. But we can follow up on that later.

The new exception type is caught in the code change, PTAL again. Thanks!
Thanks @DanielZhangQD! Mostly LGTM, just some minor comments:
The log looks too verbose. Can we reduce it to a single error line?
```
D 03-23 18:32:18 backend_utils.py:674] Using ssh_proxy_command: None
E 03-23 18:32:18 authentication.py:209] Error getting GCP project: google.auth.exceptions.RefreshError: ('invalid_grant: Bad Request', {'error': 'invalid_grant', 'error_description': 'Bad Request'})
W 03-23 18:32:18 common_utils.py:475] Caught google.auth.exceptions.RefreshError: ('invalid_grant: Bad Request', {'error': 'invalid_grant', 'error_description': 'Bad Request'}). Retrying.
E 03-23 18:32:20 authentication.py:209] Error getting GCP project: google.auth.exceptions.RefreshError: ('invalid_grant: Bad Request', {'error': 'invalid_grant', 'error_description': 'Bad Request'})
W 03-23 18:32:20 common_utils.py:475] Caught google.auth.exceptions.RefreshError: ('invalid_grant: Bad Request', {'error': 'invalid_grant', 'error_description': 'Bad Request'}). Retrying.
E 03-23 18:32:22 authentication.py:209] Error getting GCP project: google.auth.exceptions.RefreshError: ('invalid_grant: Bad Request', {'error': 'invalid_grant', 'error_description': 'Bad Request'})
```
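One common way to collapse that repetition (sketched below with a hypothetical helper, not SkyPilot's actual retry code) is to log only the first failure at ERROR and demote subsequent retry failures to DEBUG, so users see a single error line:

```python
import logging

logger = logging.getLogger('retry_demo')


def retry_with_quiet_logs(fn, max_attempts=3):
    """Hypothetical helper: surface one ERROR line instead of one per retry."""
    last_exc = None
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as e:  # broad catch for demo purposes only
            last_exc = e
            # First failure is logged at ERROR; later retries go to DEBUG.
            level = logging.ERROR if attempt == 0 else logging.DEBUG
            logger.log(level, 'Attempt %d failed: %s', attempt + 1, e)
    raise last_exc


calls = []


def flaky():
    calls.append(1)
    raise RuntimeError('invalid_grant: Bad Request')


try:
    retry_with_quiet_logs(flaky)
except RuntimeError:
    pass
assert len(calls) == 3  # all attempts ran, but only one ERROR line was emitted
```

At the default log level, the retries are still recorded (at DEBUG) for anyone who needs the full trace, while normal output stays to one error line.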
Remove the log in
Thanks for all the fixes! Looks good as long as smoke tests are passing.
sky/backends/cloud_vm_ray_backend.py (Outdated):

```python
output = str(error)
logger.info(f'AWS handler error: {output}')
```
Minor change:

```diff
- output = str(error)
- logger.info(f'AWS handler error: {output}')
+ logger.info(f'AWS handler error: {error}')
```

Since we only use this variable once, I think this is cleaner.
OK, updated

/smoke-test --aws
lgtm
The failed smoke test cases are not related to this PR and are tracked in #5045.
Fix #4373
Tested (run the relevant ones):
- `bash format.sh`
- `/smoke-test` (CI) or `pytest tests/test_smoke.py` (local)
- `/smoke-test -k test_name` (CI) or `pytest tests/test_smoke.py::test_name` (local)
- `/quicktest-core` (CI) or `pytest tests/smoke_tests/test_backward_compat.py` (local)