[core] optimizer/provisioner fails over for cloud with expired credentials #5015


Merged: 7 commits, Mar 27, 2025

Conversation


@DanielZhangQD commented Mar 23, 2025

Fix #4373

  • For Kubernetes contexts with expired credentials, auto-exclude the affected contexts
  • For AWS, Azure, and GCP clouds whose credentials become invalid after being enabled in the cache, fail over to the other clouds for provisioning
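The context-exclusion behavior can be sketched as follows (a minimal illustration only; `check_context`, `filter_reachable_contexts`, and the exception type are hypothetical stand-ins, not SkyPilot's actual API):

```python
class ContextUnreachableError(Exception):
    """Raised when a context's credentials are expired or invalid (e.g. HTTP 401)."""


def filter_reachable_contexts(contexts, check_context):
    """Return only the contexts whose credentials still work, skipping the rest."""
    reachable = []
    for ctx in contexts:
        try:
            # A cheap API call (e.g. listing namespaces) to probe credentials.
            check_context(ctx)
        except ContextUnreachableError as e:
            print(f'Excluding Kubernetes context {ctx}: {e}')
            continue
        reachable.append(ctx)
    return reachable


def fake_check(ctx):
    # Simulate one context with an expired token.
    if ctx == 'expired-eks':
        raise ContextUnreachableError('Kubernetes API error: (401) Unauthorized')


print(filter_reachable_contexts(['expired-eks', 'healthy-gke'], fake_check))
# ['healthy-gke']
```

The key point is that an unreachable context is logged and skipped rather than aborting the whole launch, matching the "Excluding Kubernetes context ..." lines in the test output below.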

Tested (run the relevant ones):

  • Code formatting: install pre-commit (auto-check on commit) or bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • Details are posted in separate comments
  • All smoke tests: /smoke-test (CI) or pytest tests/test_smoke.py (local)
  • Relevant individual tests: /smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
  • Backward compatibility: /quicktest-core (CI) or pytest tests/smoke_tests/test_backward_compat.py (local)

@DanielZhangQD (Collaborator, Author):

Manual failover test cases:

  1. Kubernetes -> aws -> gcp -> azure
    task.yaml
resources:
  cpus: 0.5+
  ordered:
  - cloud: kubernetes
  - cloud: aws
  - cloud: gcp
  - cloud: azure

Result:

(.venv) ➜  skypilot git:(4373) ✗ sky launch --cpus 0.5+ -c test /Users/danz/dan/go/src/github.com/skypilot-org/tools/task.yaml
D 03-23 18:32:00 skypilot_config.py:155] Using config path: /Users/danz/.sky/config.yaml
D 03-23 18:32:00 skypilot_config.py:160] Config loaded:
D 03-23 18:32:00 skypilot_config.py:160] {}
D 03-23 18:32:00 skypilot_config.py:172] Config syntax check passed.
YAML to run: /Users/danz/dan/go/src/github.com/skypilot-org/tools/task.yaml
I 03-23 18:32:01 optimizer.py:1034] Using user-specified accelerators list (will be tried in the listed order): Kubernetes(cpus=0.5+), AWS(cpus=0.5+), GCP(cpus=0.5+), Azure(cpus=0.5+)
I 03-23 18:32:03 kubernetes.py:203] Excluding Kubernetes context hailong@basic.us-east-1.eksctl.io: Kubernetes API error: (401)
I 03-23 18:32:03 kubernetes.py:203] Reason: Unauthorized
I 03-23 18:32:03 kubernetes.py:203] HTTP response headers: HTTPHeaderDict({'Audit-Id': 'ad4e7d88-12fb-4632-a27b-dfaa240835e0', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Date': 'Sun, 23 Mar 2025 10:32:03 GMT', 'Content-Length': '129'})
I 03-23 18:32:03 kubernetes.py:203] HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Unauthorized","reason":"Unauthorized","code":401}
I 03-23 18:32:03 kubernetes.py:203]
I 03-23 18:32:03 kubernetes.py:203]
W 03-23 18:32:03 kubernetes.py:209] All Kubernetes contexts are unreachable. Retry if it is a transient error, or run sky check to refresh Kubernetes availability if permanent.
I 03-23 18:32:03 optimizer.py:1327] No resource satisfying Kubernetes(cpus=0.5+) on Kubernetes.
I 03-23 18:32:03 optimizer.py:1337] - Try specifying a different CPU count, or add "+" to the end of the CPU count to allow for larger instances.
D 03-23 18:32:03 optimizer.py:302] #### Task<name=minimal>(run=<empty>)
D 03-23 18:32:03 optimizer.py:302]   resources: AWS(cpus=0.5+) ####
D 03-23 18:32:03 optimizer.py:317] Defaulting the task's estimated time to 1 hour.
D 03-23 18:32:03 optimizer.py:339] resources: AWS(m6i.large)
...
03-23 18:32:05 optimizer.py:954] ------------------------------------------------------------------------------------------------
I 03-23 18:32:05 optimizer.py:954]  CLOUD   INSTANCE          vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE     COST ($)   CHOSEN
I 03-23 18:32:05 optimizer.py:954] ------------------------------------------------------------------------------------------------
I 03-23 18:32:05 optimizer.py:954]  AWS     m6i.large         2       8         -              us-east-1       0.10          ✔
I 03-23 18:32:05 optimizer.py:954]  GCP     n2-standard-2     2       8         -              us-central1-a   0.10
I 03-23 18:32:05 optimizer.py:954]  Azure   Standard_D2s_v5   2       8         -              eastus          0.10
I 03-23 18:32:05 optimizer.py:954] ------------------------------------------------------------------------------------------------
Launching a new cluster 'test'. Proceed? [Y/n]:
...
I 03-23 18:32:13 cloud_vm_ray_backend.py:1548] ⚙︎ Launching on AWS us-east-1 (us-east-1a,us-east-1b,us-east-1c,us-east-1d,us-east-1f).
...
E 03-23 18:32:15 utils.py:65] Failed to fetch IAM instance profile data for skypilot-v1 from AWS.
E 03-23 18:32:15 utils.py:65] Error code: ExpiredToken
C 03-23 18:32:15 utils.py:68] Your AWS session has expired.
C 03-23 18:32:15 utils.py:68] You can request a new one using aws sts get-session-token --serial-number arn:aws:iam::ROOT_ACCOUNT_ID:mfa/AWS_USERNAME --token-code TWO_FACTOR_AUTH_CODE then expose it to SkyPilot by setting export AWS_SECRET_ACCESS_KEY = REPLACE_ME # found at Credentials.SecretAccessKey export AWS_SESSION_TOKEN = REPLACE_ME # found at Credentials.SessionToken export AWS_ACCESS_KEY_ID = REPLACE_ME # found at Credentials.AccessKeyId
C 03-23 18:32:15 utils.py:68] You can find a script that automates this at:https://gist.github.com/maximsmol/a0284e1d97b25d417bd9ae02e5f450cf
D 03-23 18:32:15 common_utils.py:541] Tried to remove /Users/danz/.sky/generated/ssh/test but failed to find it. Skip.
I 03-23 18:32:15 cloud_vm_ray_backend.py:1099] AWS handler error: InvalidCloudCredentials: Failed to fetch IAM instance profile data for skypilot-v1 from AWS.
I 03-23 18:32:15 cloud_vm_ray_backend.py:1099] Error code: ExpiredToken
W 03-23 18:32:15 cloud_vm_ray_backend.py:2090] sky.exceptions.ResourcesUnavailableError: Failed to acquire resources in all zones in us-east-1 for {Kubernetes(cpus=0.5+), GCP(cpus=0.5+), Azure(cpus=0.5+), AWS(cpus=0.5+)}.
W 03-23 18:32:15 cloud_vm_ray_backend.py:2126]
W 03-23 18:32:15 cloud_vm_ray_backend.py:2126] ↺ Trying other potential resources.
...
I 03-23 18:32:17 optimizer.py:886] Considered resources (1 node):
I 03-23 18:32:17 optimizer.py:954] ------------------------------------------------------------------------------------------------
I 03-23 18:32:17 optimizer.py:954]  CLOUD   INSTANCE          vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE     COST ($)   CHOSEN
I 03-23 18:32:17 optimizer.py:954] ------------------------------------------------------------------------------------------------
I 03-23 18:32:17 optimizer.py:954]  GCP     n2-standard-2     2       8         -              us-central1-a   0.10          ✔
I 03-23 18:32:17 optimizer.py:954]  Azure   Standard_D2s_v5   2       8         -              eastus          0.10
I 03-23 18:32:17 optimizer.py:954] ------------------------------------------------------------------------------------------------
D 03-23 18:32:18 common.py:234] Updated GCP catalog gcp/images.csv.
D 03-23 18:32:18 backend_utils.py:674] Using ssh_proxy_command: None
E 03-23 18:32:18 authentication.py:209] Error getting GCP project: google.auth.exceptions.RefreshError: ('invalid_grant: Bad Request', {'error': 'invalid_grant', 'error_description': 'Bad Request'})
W 03-23 18:32:18 common_utils.py:475] Caught google.auth.exceptions.RefreshError: ('invalid_grant: Bad Request', {'error': 'invalid_grant', 'error_description': 'Bad Request'}). Retrying.
E 03-23 18:32:20 authentication.py:209] Error getting GCP project: google.auth.exceptions.RefreshError: ('invalid_grant: Bad Request', {'error': 'invalid_grant', 'error_description': 'Bad Request'})
W 03-23 18:32:20 common_utils.py:475] Caught google.auth.exceptions.RefreshError: ('invalid_grant: Bad Request', {'error': 'invalid_grant', 'error_description': 'Bad Request'}). Retrying.
E 03-23 18:32:22 authentication.py:209] Error getting GCP project: google.auth.exceptions.RefreshError: ('invalid_grant: Bad Request', {'error': 'invalid_grant', 'error_description': 'Bad Request'})
W 03-23 18:32:22 cloud_vm_ray_backend.py:1440] sky.exceptions.InvalidCloudCredentials: google.auth.exceptions.RefreshError: ('invalid_grant: Bad Request', {'error': 'invalid_grant', 'error_description': 'Bad Request'})
W 03-23 18:32:22 cloud_vm_ray_backend.py:2090] sky.exceptions.ResourcesUnavailableError: Failed to provision on cloud GCP due to invalid cloud credentials: sky.exceptions.InvalidCloudCredentials: google.auth.exceptions.RefreshError: ('invalid_grant: Bad Request', {'error': 'invalid_grant', 'error_description': 'Bad Request'})
W 03-23 18:32:22 cloud_vm_ray_backend.py:2126]
W 03-23 18:32:22 cloud_vm_ray_backend.py:2126] ↺ Trying other potential resources.
...
I 03-23 18:32:23 optimizer.py:886] Considered resources (1 node):
I 03-23 18:32:23 optimizer.py:954] ----------------------------------------------------------------------------------------------
I 03-23 18:32:23 optimizer.py:954]  CLOUD   INSTANCE          vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE   COST ($)   CHOSEN
I 03-23 18:32:23 optimizer.py:954] ----------------------------------------------------------------------------------------------
I 03-23 18:32:23 optimizer.py:954]  Azure   Standard_D2s_v5   2       8         -              eastus        0.10          ✔
I 03-23 18:32:23 optimizer.py:954] ----------------------------------------------------------------------------------------------
D 03-23 18:32:23 backend_utils.py:674] Using ssh_proxy_command: None
I 03-23 18:32:23 cloud_vm_ray_backend.py:1548] ⚙︎ Launching on Azure eastus.
...
E 03-23 18:32:23 config.py:125] Failed to authenticate with Azure. Please check your Azure credentials. ClientAuthenticationError: azure.identity._exceptions.CredentialUnavailableError: Please run 'az login' to set up an account
D 03-23 18:32:23 common_utils.py:541] Tried to remove /Users/danz/.sky/generated/ssh/test but failed to find it. Skip.
W 03-23 18:32:23 cloud_vm_ray_backend.py:2090] sky.exceptions.ResourcesUnavailableError: Failed to acquire resources in all zones in eastus for {Kubernetes(cpus=0.5+), GCP(cpus=0.5+), Azure(cpus=0.5+), AWS(cpus=0.5+)}.
W 03-23 18:32:23 cloud_vm_ray_backend.py:2126]
W 03-23 18:32:23 cloud_vm_ray_backend.py:2126] ↺ Trying other potential resources.
...
D 03-23 18:32:25 sdk.py:77] To stream request logs: sky api logs 141b1427-3fd8-4cce-add5-fd00f31e5a7c
sky.exceptions.ResourcesUnavailableError: Failed to provision all possible launchable resources. Relax the task's resource requirements: 1x Kubernetes(cpus=0.5+)
To keep retrying until the cluster is up, use the `--retry-until-up` flag.
Reasons for provision failures (for details, please check the log above):
Resources                Reason
AWS(m6i.large)           Failed to acquire resources in all zones in us-east-1 for {Kubernetes(cpus=0.5+), GCP(cpus=0.5+), Azure(cpus=0.5+), AWS(cpus=0.5+)}.
GCP(n2-standard-2)       Failed to provision on cloud GCP due to invalid cloud credentials: sky.exceptions.InvalidCloudCredentials: google.auth.exceptions.RefreshError: ('invalid_grant: Bad Request', {'error': 'invalid_grant', 'error_description': 'Bad Request'})
Azure(Standard_D2s_v5)   Failed to acquire resources in all zones in eastus for {Kubernetes(cpus=0.5+), GCP(cpus=0.5+), Azure(cpus=0.5+), AWS(cpus=0.5+)}.
  2. Kubernetes -> aws -> azure -> gcp
    task.yaml
resources:
  cpus: 0.5+
  ordered:
  - cloud: kubernetes
  - cloud: aws
  - cloud: azure
  - cloud: gcp

Similar outputs for each cloud as in case 1.
3. Kubernetes -> aws -> azure (valid credentials): launched successfully in Azure
4. Kubernetes -> aws -> azure -> gcp (valid credentials): launched successfully in GCP

@DanielZhangQD (Collaborator, Author):

/quicktest-core

@DanielZhangQD marked this pull request as ready for review March 23, 2025 11:35
@cg505 (Collaborator) left a comment:

Thanks! First pass, have some questions about which exceptions we need to catch.

Also, should we catch the new exception type when we call write_cluster_config? Here:

try:
    config_dict = backend_utils.write_cluster_config(
        to_provision,
        num_nodes,
        _get_cluster_config_template(to_provision.cloud),
        cluster_name,
        self._local_wheel_path,
        self._wheel_hash,
        region=region,
        zones=zones,
        dryrun=dryrun,
        keep_launch_fields_in_existing_config=cluster_exists)
except exceptions.ResourcesUnavailableError as e:

Comment on lines 1101 to 1103
if output.find('InvalidCloudCredentials') != -1:
_add_to_blocked_resources(
blocked_resources, resources_lib.Resources(cloud=clouds.AWS()))
Collaborator:

Shouldn't this already be handled by the default_handler? Why do we need a special case here?

Collaborator (Author):

The default handler only blocks one zone or one region at a time, while we block the whole cloud in this case.
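The distinction can be illustrated with a toy blocklist (an illustrative sketch; the real code uses `sky.resources.Resources`, where an unset region/zone acts as a wildcard when matching blocked resources):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Res:
    """Stand-in for resources_lib.Resources: None means 'any'."""
    cloud: str
    region: Optional[str] = None
    zone: Optional[str] = None


def is_blocked(candidate, blocked):
    # A blocked entry matches on every field it specifies; None is a wildcard.
    return any(b.cloud == candidate.cloud
               and b.region in (None, candidate.region)
               and b.zone in (None, candidate.zone)
               for b in blocked)


blocked = set()
# Default handler after a zone failure: only that zone is blocked,
# so the provisioner still retries sibling zones one by one.
blocked.add(Res('AWS', 'us-east-1', 'us-east-1a'))
print(is_blocked(Res('AWS', 'us-east-1', 'us-east-1b'), blocked))  # False

# Credential failure: block the whole cloud (no region/zone given),
# so every AWS candidate is skipped immediately.
blocked.add(Res('AWS'))
print(is_blocked(Res('AWS', 'us-west-2', 'us-west-2a'), blocked))  # True
```

Blocking at the cloud level avoids a slow zone-by-zone retry when the credentials are known to be bad everywhere.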

Collaborator:

+1, looks like the exception handling:

            except exceptions.InvalidCloudCredentials as e:
                # Failed due to invalid cloud credentials.

already did similar work

Collaborator:

The default handler only blocks one zone or one region at a time

I'm not sure this is right - it does to_provision.copy(region=None, zone=None). I think removing the zone and region information should mean that this will block the whole cloud. Could we double-check this?

Collaborator (Author):

@cg505 Check the default handler logic below; it only blocks one region or one set of zones at a time, and the provisioner will retry the others one by one.
https://github.com/skypilot-org/skypilot/blob/master/sky/backends/cloud_vm_ray_backend.py#L1105
https://github.com/skypilot-org/skypilot/blob/master/sky/backends/cloud_vm_ray_backend.py#L1110

@aylei Check the detailed provisioner logic below; the exception is not handled in duplicate.

sky/backends/cloud_vm_ray_backend.py
    _provision()
        try:
            provision_with_retries()
        except exceptions.ResourcesUnavailableError as e:
            fail

    provision_with_retries()
        while True:
            try:
                _retry_zones()
            except (exceptions.InvalidClusterNameError,
                    exceptions.NotSupportedError,
                    exceptions.CloudUserIdentityError) as e:
                continue
            except exceptions.ResourcesUnavailableError as e:
                if not launchable_retries_disabled:
                    continue
    _retry_zones()
        for zones:
            try:
                write_cluster_config()
            except exceptions.ResourcesUnavailableError as e:
                continue
            except exceptions.InvalidCloudCredentials as e: # newly added
                block whole cloud
                raise exceptions.ResourcesUnavailableError # this will trigger failover in provision_with_retries()
            except exceptions.InvalidCloudConfigs as e:
                block whole cloud
                raise exceptions.ResourcesUnavailableError

            try:
                bulk_provision()
            except provision_common.StopFailoverError:
                raise
            except Exception as e:
                # This will call the _$cloud_handler to block the resources in zone, region, or cloud
                # The _default_handler will only block one zone at a time if `zones` is not empty
                FailoverCloudErrorHandlerV2.update_blocklist_on_error(
                        self._blocked_resources, to_provision, region, zones, e)
                continue

sky/provision/provisioner.py
    bulk_provision()
        try:
            _bulk_provision()
        except exceptions.NoClusterLaunchedError:
            raise
        except exceptions.InvalidCloudCredentials: # newly added, this will trigger blocking resources in handlers like _aws_handler
            raise
        except Exception:
            # try to stop/terminate the resources that were provisioned
            raise StopFailoverError
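A runnable distillation of this failover loop (the exception names mirror SkyPilot's; everything else is a simplified stand-in, not the actual implementation):

```python
class ResourcesUnavailableError(Exception):
    pass


class InvalidCloudCredentials(Exception):
    pass


def provision_with_retries(candidates, try_cloud, blocked):
    """Try each candidate cloud in order; return on the first success."""
    for cloud in candidates:
        if cloud in blocked:
            continue
        try:
            return try_cloud(cloud)
        except InvalidCloudCredentials:
            # Newly added path: bad credentials block the whole cloud,
            # then failover continues with the next candidate.
            blocked.add(cloud)
            continue
        except ResourcesUnavailableError:
            # Capacity-style failure: move on without blocking the cloud.
            continue
    raise ResourcesUnavailableError('all candidate clouds exhausted')


def fake_try(cloud):
    # Simulate manual test case 1: k8s unreachable, aws/gcp credentials expired.
    if cloud == 'kubernetes':
        raise ResourcesUnavailableError('no reachable contexts')
    if cloud in ('aws', 'gcp'):
        raise InvalidCloudCredentials(f'{cloud} credentials expired')
    return f'launched on {cloud}'


blocked = set()
print(provision_with_retries(['kubernetes', 'aws', 'gcp', 'azure'],
                             fake_try, blocked))  # launched on azure
print(sorted(blocked))  # ['aws', 'gcp']
```

This mirrors the manual test above: each credential failure blocks one cloud and the loop keeps falling over until a launchable cloud (or exhaustion) is reached.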

Collaborator:

Got it. I'm sorry I missed that.
I guess for most clouds, InvalidCloudCredentials should block the whole cloud, so we probably could consider moving this to the default handler. But we can follow up later on that.

@DanielZhangQD (Collaborator, Author):

Thanks! First pass, have some questions about which exceptions we need to catch.

Also, should we catch the new exception type when we call write_cluster_config? Here:

try:
    config_dict = backend_utils.write_cluster_config(
        to_provision,
        num_nodes,
        _get_cluster_config_template(to_provision.cloud),
        cluster_name,
        self._local_wheel_path,
        self._wheel_hash,
        region=region,
        zones=zones,
        dryrun=dryrun,
        keep_launch_fields_in_existing_config=cluster_exists)
except exceptions.ResourcesUnavailableError as e:

The new exception type is caught in the code change, PTAL again. Thanks!

@DanielZhangQD requested a review from aylei March 25, 2025 02:03
@aylei (Collaborator) left a comment:

Thanks @DanielZhangQD ! Mostly LGTM, just some minors:

The log looks too verbose. Can we reduce it to a single error line?

D 03-23 18:32:18 backend_utils.py:674] Using ssh_proxy_command: None
E 03-23 18:32:18 authentication.py:209] Error getting GCP project: google.auth.exceptions.RefreshError: ('invalid_grant: Bad Request', {'error': 'invalid_grant', 'error_description': 'Bad Request'})
W 03-23 18:32:18 common_utils.py:475] Caught google.auth.exceptions.RefreshError: ('invalid_grant: Bad Request', {'error': 'invalid_grant', 'error_description': 'Bad Request'}). Retrying.
E 03-23 18:32:20 authentication.py:209] Error getting GCP project: google.auth.exceptions.RefreshError: ('invalid_grant: Bad Request', {'error': 'invalid_grant', 'error_description': 'Bad Request'})
W 03-23 18:32:20 common_utils.py:475] Caught google.auth.exceptions.RefreshError: ('invalid_grant: Bad Request', {'error': 'invalid_grant', 'error_description': 'Bad Request'}). Retrying.
E 03-23 18:32:22 authentication.py:209] Error getting GCP project: google.auth.exceptions.RefreshError: ('invalid_grant: Bad Request', {'error': 'invalid_grant', 'error_description': 'Bad Request'})


@DanielZhangQD (Collaborator, Author):

Thanks @DanielZhangQD ! Mostly LGTM, just some minors:

The log looks too verbose. Can we reduce it to a single error line?

D 03-23 18:32:18 backend_utils.py:674] Using ssh_proxy_command: None
E 03-23 18:32:18 authentication.py:209] Error getting GCP project: google.auth.exceptions.RefreshError: ('invalid_grant: Bad Request', {'error': 'invalid_grant', 'error_description': 'Bad Request'})
W 03-23 18:32:18 common_utils.py:475] Caught google.auth.exceptions.RefreshError: ('invalid_grant: Bad Request', {'error': 'invalid_grant', 'error_description': 'Bad Request'}). Retrying.
E 03-23 18:32:20 authentication.py:209] Error getting GCP project: google.auth.exceptions.RefreshError: ('invalid_grant: Bad Request', {'error': 'invalid_grant', 'error_description': 'Bad Request'})
W 03-23 18:32:20 common_utils.py:475] Caught google.auth.exceptions.RefreshError: ('invalid_grant: Bad Request', {'error': 'invalid_grant', 'error_description': 'Bad Request'}). Retrying.
E 03-23 18:32:22 authentication.py:209] Error getting GCP project: google.auth.exceptions.RefreshError: ('invalid_grant: Bad Request', {'error': 'invalid_grant', 'error_description': 'Bad Request'})

Removed the log in authentication.py.

@DanielZhangQD (Collaborator, Author):

@cg505 @aylei Comments addressed/replied, PTAL when you're available. Thanks!

@cg505 (Collaborator) left a comment:

Thanks for all the fixes! Looks good as long as smoke tests are passing.

Comment on lines 1098 to 1099
output = str(error)
logger.info(f'AWS handler error: {output}')
Collaborator:

Minor change, suggested change:

-output = str(error)
-logger.info(f'AWS handler error: {output}')
+logger.info(f'AWS handler error: {error}')

Since we only use this variable once, I think this is cleaner.

Collaborator (Author):

OK, updated

@DanielZhangQD (Collaborator, Author):

/smoke-test --aws

@aylei (Collaborator) left a comment:

lgtm

@DanielZhangQD (Collaborator, Author):

The failed smoke test cases are not related to this PR and are tracked in #5045.

@DanielZhangQD merged commit b523c0e into skypilot-org:master Mar 27, 2025
18 of 19 checks passed
@DanielZhangQD deleted the 4373 branch March 27, 2025 02:31
Successfully merging this pull request may close these issues.

[Core] Expired credentials causes unexpected failure of sky launch
3 participants