Add Nebius Cloud #4573

Merged
merged 63 commits into from
Feb 20, 2025
Conversation

SalikovAlex
Contributor

@SalikovAlex SalikovAlex commented Jan 16, 2025

Add support for Nebius Cloud.

Tested (run the relevant ones):

  • run 'sky launch -c test-single-instance --cloud nebius echo hi'

  • run sky stop test-single-instance; sky start test-single-instance

  • run sky down test-single-instance

  • run sky launch -c test-single-instance --cloud nebius echo hi; sky stop test-single-instance; sky down test-single-instance

  • [UNSUPPORTED] sky launch --cloud nebius -c test-autostop -i 1 echo hi

  • [UNSUPPORTED] sky launch --cloud nebius -c test-autodown -i 1 --down echo hi

  • sky launch examples/multi_hostname.yaml --cloud nebius;

  • Code formatting: bash format.sh

  • All smoke tests: pytest tests/test_smoke.py

Collaborator

@cblmemo cblmemo left a comment

Thanks @SalikovAlex for this amazing work! I think the PR is in good shape. Left some nits. At the same time, could you help test the basic functionality of this new cloud? This includes, but is not limited to:

  • Launch CPU only instance
  • Launch GPU instance
  • Stop & Re-launch, check if the disk is persistent (write some content before stop, and cat them after re-launch)
  • Autostop & Autodown
  • Launch on existing cluster
  • SSH to the cluster
  • Failover: make sure it can fail over from nebius to other clouds and the exceptions are printed correctly
  • Launch on other clouds without nebius dependencies installed (make sure it does not introduce unnecessary dependencies when nebius is not enabled)

@cblmemo
Collaborator

cblmemo commented Jan 22, 2025

@SalikovAlex
Contributor Author

SalikovAlex commented Jan 26, 2025

Unresolved problems:

  • Nebius requires grpcio>=1.56.2, which in turn means ray>=2.6.1, so we need to fix dependencies.py>remote
  • Nebius requires python>=3.10

Feature request:
Split autoterminate into autostop and autodown

@SalikovAlex
Contributor Author

All tests are marked, now pytest tests/test_smoke.py --nebius is green.

@romilbhardwaj
Collaborator

Hey @SalikovAlex thanks for the amazing work!

Nebius requires grpcio>=1.56.2, which in turn means ray>=2.6.1, so we need to fix dependencies.py>remote

Perhaps we can pin grpcio>=1.56.2 as a remote dependency requirement. From my very basic manual testing, it seems to work without requiring us to bump the remote ray version. cc @Michaelvll @cblmemo wdyt?

Nebius requires python>=3.10

This should be ok, since pip on py3.9 environments throws a clear error. Users wanting to use nebius can upgrade to py3.10, which is also supported in SkyPilot.

ERROR: Ignored the following versions that require a different python version: 0.2.0 Requires-Python >=3.10; 0.2.1 Requires-Python >=3.10
ERROR: Could not find a version that satisfies the requirement nebius>=0.2.0; extra == "nebius" (from skypilot[nebius]) (from versions: none)
ERROR: No matching distribution found for nebius>=0.2.0; extra == "nebius"

@romilbhardwaj
Collaborator

BTW, we may need to fix the nebius adaptor to lazily import nebius only when required. E.g., on this branch, if I run pip install -e .[aws] without the nebius Python package in my environment, I get this error:

>>> import sky
Traceback (most recent call last):
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/adaptors/common.py", line 37, in load_module
    self._module = importlib.import_module(self._module_name)
  File "/Users/romilb/tools/anaconda3/envs/py39/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 984, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'nebius'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/__init__.py", line 82, in <module>
    from sky import backends
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/backends/__init__.py", line 4, in <module>
    from sky.backends.cloud_vm_ray_backend import CloudVmRayBackend
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/backends/cloud_vm_ray_backend.py", line 30, in <module>
    from sky import check as sky_check
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/check.py", line 11, in <module>
    from sky import clouds as sky_clouds
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/clouds/__init__.py", line 15, in <module>
    from sky.clouds.aws import AWS
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/clouds/aws.py", line 16, in <module>
    from sky import provision as provision_lib
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/provision/__init__.py", line 23, in <module>
    from sky.provision import nebius
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/provision/nebius/__init__.py", line 4, in <module>
    from sky.provision.nebius.instance import cleanup_ports
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/provision/nebius/instance.py", line 9, in <module>
    from sky.provision.nebius import utils
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/provision/nebius/utils.py", line 12, in <module>
    sdk = nebius.sdk(credentials=nebius.get_iam_token())
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/adaptors/nebius.py", line 73, in sdk
    return nebius.sdk.SDK(credentials=credentials)
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/adaptors/common.py", line 52, in __getattr__
    return getattr(self.load_module(), name)
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/adaptors/common.py", line 42, in load_module
    raise ImportError(self._import_error_message) from e
ImportError: Failed to import dependencies for Nebius AI Cloud. Try running: pip install "skypilot[nebius]"
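
The traceback shows the root cause: `sky/provision/nebius/utils.py` builds the SDK client at module level, so merely running `import sky` touches the lazy wrapper and forces the real import. A minimal sketch of the pattern follows, assuming a wrapper shaped like the `load_module` / `__getattr__` frames in `sky/adaptors/common.py` above. The `get_sdk` accessor and its `token` parameter are hypothetical names for illustration, not the code that was merged.

```python
import functools
import importlib
import types
from typing import Any, Optional


class LazyImport:
    """Defers importing a module until one of its attributes is accessed."""

    def __init__(self, module_name: str, import_error_message: str):
        self._module_name = module_name
        self._import_error_message = import_error_message
        self._module: Optional[types.ModuleType] = None

    def load_module(self) -> types.ModuleType:
        if self._module is None:
            try:
                self._module = importlib.import_module(self._module_name)
            except ImportError as e:
                # Replace the bare ModuleNotFoundError with an actionable hint.
                raise ImportError(self._import_error_message) from e
        return self._module

    def __getattr__(self, name: str) -> Any:
        # Reached only for names not set in __init__, i.e. module attributes.
        return getattr(self.load_module(), name)


# Creating the wrapper is cheap and never imports the package.
nebius = LazyImport(
    'nebius',
    'Failed to import dependencies for Nebius AI Cloud. '
    'Try running: pip install "skypilot[nebius]"')

# Problematic (runs at import time and triggers the real import):
#     sdk = nebius.sdk.SDK(credentials=...)


@functools.lru_cache(maxsize=None)
def get_sdk(token: str) -> Any:
    """Hypothetical accessor: builds the client on first call, then caches it."""
    return nebius.sdk.SDK(credentials=token)
```

With this shape, users who installed only `.[aws]` never hit the Nebius import unless they actually call `get_sdk`.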

@SalikovAlex
Contributor Author

SalikovAlex commented Jan 30, 2025

BTW, we may need to fix the nebius adaptor to lazily import nebius only when required. E.g., on this branch, if I run pip install -e .[aws]

Fixed

@SalikovAlex
Contributor Author

This should be ok, since pip on py3.9 environments throws a clear error. Users wanting to use nebius can upgrade to py3.10, which is also supported in SkyPilot.

Yes, it's ok. But we need to change the Python version in GitHub Actions for pep, lint, etc.

@cblmemo
Collaborator

cblmemo commented Jan 30, 2025

Hey @SalikovAlex thanks for the amazing work!

Nebius requires grpcio>=1.56.2, which in turn means ray>=2.6.1, so we need to fix dependencies.py>remote

Perhaps we can pin grpcio>=1.56.2 as a remote dependency requirement. From my very basic manual testing, it seems to work without requiring us to bump the remote ray version. cc @Michaelvll @cblmemo wdyt?

Nebius requires python>=3.10

This should be ok, since pip on py3.9 environments throws a clear error. Users wanting to use nebius can upgrade to py3.10, which is also supported in SkyPilot.

ERROR: Ignored the following versions that require a different python version: 0.2.0 Requires-Python >=3.10; 0.2.1 Requires-Python >=3.10
ERROR: Could not find a version that satisfies the requirement nebius>=0.2.0; extra == "nebius" (from skypilot[nebius]) (from versions: none)
ERROR: No matching distribution found for nebius>=0.2.0; extra == "nebius"

This looks good to me!

@cblmemo
Collaborator

cblmemo commented Jan 30, 2025

This should be ok, since pip on py3.9 environments throws a clear error. Users wanting to use nebius can upgrade to py3.10, which is also supported in SkyPilot.

Yes, it's ok. But we need to change the Python version in GitHub Actions for pep, lint, etc.

I think the GH actions will not install real cloud dependencies. Every test runs without actually provisioning VMs on the cloud, so IIUC no nebius dependency needs to be installed?

The tests that provision VMs and run workloads are located in the smoke_test folder. In GH actions we only run the unit tests.

@SalikovAlex
Contributor Author

SalikovAlex commented Jan 30, 2025

I think the GH actions will not install real cloud dependencies. Every test runs without actually provisioning VMs on the cloud, so IIUC no nebius dependency needs to be installed?

The tests that provision VMs and run workloads are located in the smoke_test folder. In GH actions we only run the unit tests.

pylint, pytest, doc build run uv pip install ".[all]"

@cblmemo
Collaborator

cblmemo commented Jan 30, 2025

I think the GH actions will not install real cloud dependencies. Every test runs without actually provisioning VMs on the cloud, so IIUC no nebius dependency needs to be installed?
The tests that provision VMs and run workloads are located in the smoke_test folder. In GH actions we only run the unit tests.

pylint, pytest, doc build run uv pip install ".[all]"

Humm, good point! cc @romilbhardwaj @Michaelvll for a look here
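
One way to reconcile `uv pip install ".[all]"` in CI with the python>=3.10 requirement is to build the `all` extra conditionally on the interpreter version. The sketch below is only an illustration of that idea, not necessarily what the PR ended up doing. The `EXTRAS` table, `MIN_PYTHON`, and `build_all_extra` are hypothetical names; only the `nebius>=0.2.0` pin comes from the pip output quoted earlier, and `boto3` is a placeholder.

```python
import sys
from typing import Dict, List, Optional, Tuple

# Hypothetical extras table. Only the nebius pin is taken from this thread.
EXTRAS: Dict[str, List[str]] = {
    'aws': ['boto3'],
    'nebius': ['nebius>=0.2.0'],
}

# Extras that need a newer interpreter than the project's minimum.
MIN_PYTHON: Dict[str, Tuple[int, int]] = {'nebius': (3, 10)}


def build_all_extra(
        extras: Dict[str, List[str]],
        python_version: Optional[Tuple[int, int]] = None) -> List[str]:
    """Merges every extra installable on the given Python version."""
    if python_version is None:
        python_version = (sys.version_info[0], sys.version_info[1])
    selected: List[str] = []
    for name, requirements in extras.items():
        if python_version < MIN_PYTHON.get(name, (0, 0)):
            continue  # Skip extras this interpreter cannot install.
        selected.extend(requirements)
    return sorted(set(selected))
```

Under this sketch, a Python 3.9 lint job would resolve `[all]` without the Nebius package, while 3.10+ environments would still get it.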

Collaborator

@cblmemo cblmemo left a comment

Thanks for the prompt fix @SalikovAlex! Mostly looks good to me. Left some discussions ;)

@SalikovAlex
Contributor Author

SalikovAlex commented Jan 31, 2025

Also, why is there a project metadata ID limitation? Is this project a per-user thing? IIUC, if that is the case, different users would have different IDs?

Added this to the comments.

    # To find a project in a specific region, we rely on the project ID to
    # deduce the region, since there is currently no method to retrieve region
    # information directly from the project. Additionally, there is only one
    # project per region, and projects cannot be created at this time.
    # The region is determined from the project ID using a region-specific
    # identifier embedded in it.
    # https://docs.nebius.com/overview/regions

@cblmemo
Collaborator

cblmemo commented Jan 31, 2025

https://docs.nebius.com/overview/regions

Just to confirm, so if a new user is using nebius for skypilot, does that user need to replace the project id here?

@SalikovAlex
Contributor Author

Just to confirm, so if a new user is using nebius for skypilot, does that user need to replace the project id here?

No. I'm requesting them via the API. The user's project ID looks like "project-e00xxxxxxxxxxxxxx", where "e00" is the region code.

Where did you find a hardcoded project ID?

@cblmemo
Collaborator

cblmemo commented Feb 1, 2025

Just to confirm, so if a new user is using nebius for skypilot, does that user need to replace the project id here?

No. I'm requesting them via the API. The user's project ID looks like "project-e00xxxxxxxxxxxxxx", where "e00" is the region code.

Where did you find a hardcoded project ID?

Oh ic! that makes sense. I thought the e00 is some unique value related to your project id.

If that is the case, can we add a comment explaining what the 8-11 is for (IIUC, 8 is the length of project-?) and what e00 & e01 stand for, to reduce confusion? :)) Thanks!

image

@SalikovAlex
Contributor Author

If that is the case, can we add a comment explaining what the 8-11 is for (IIUC, 8 is the length of project-?) and what e00 & e01 stand for, to reduce confusion? :)) Thanks!

added
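
For readers following the project ID discussion, here is a minimal sketch of the lookup being described. It assumes IDs of the form "project-e00xxxxxxxxxxxxxx", so the three characters after the 8-character "project-" prefix (the 8-11 slice) carry the region code. The function names are hypothetical, and the region names in `REGION_BY_CODE` are an assumption for illustration, not taken from this thread.

```python
from typing import Dict

PROJECT_PREFIX = 'project-'  # 8 characters, so the code sits at [8:11].
REGION_CODE_LEN = 3

# Assumed mapping for illustration only; check the Nebius regions page.
REGION_BY_CODE: Dict[str, str] = {
    'e00': 'eu-north1',
    'e01': 'eu-west1',
}


def region_code(project_id: str) -> str:
    """Extracts the region code embedded in a project ID."""
    if not project_id.startswith(PROJECT_PREFIX):
        raise ValueError(f'Unexpected project ID format: {project_id!r}')
    start = len(PROJECT_PREFIX)
    code = project_id[start:start + REGION_CODE_LEN]
    if len(code) != REGION_CODE_LEN:
        raise ValueError(f'Project ID too short: {project_id!r}')
    return code


def region_for_project(project_id: str) -> str:
    """Maps a project ID to a region name via its embedded code."""
    code = region_code(project_id)
    if code not in REGION_BY_CODE:
        raise ValueError(f'Unknown region code {code!r} in {project_id!r}')
    return REGION_BY_CODE[code]
```

Naming the prefix and the code length, rather than writing a bare `[8:11]`, addresses the confusion raised in the review.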

Collaborator

@cblmemo cblmemo left a comment

Thanks @SalikovAlex for this amazing work! Mostly looks good to me. Left final nits. I'll try to confirm the grpcio issue, and after that it should be ready to go!

Could you also help to merge the latest master?

@cblmemo
Collaborator

cblmemo commented Feb 6, 2025

Bumping this: #4629 (comment)

Also, could you help to merge the latest master branch?

Collaborator

@cblmemo cblmemo left a comment

Thanks @SalikovAlex for the prompt fix! Left some final nits ;)

SalikovAlex and others added 11 commits February 6, 2025 21:38
Co-authored-by: Tian Xia <cblmemo@gmail.com>
Moved GRPC logging suppression logic into the LazyImport setup. This ensures the environment variable is set only during the lazy import process, avoiding potential side effects on other modules.
Reorganized imports to improve structure and readability. Changed the cloud registry decorator to use `sky.utils.registry` for consistency across the codebase. No functional changes introduced.
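
The first commit message above describes setting an environment variable only at lazy-import time. A rough sketch of that idea, assuming a hypothetical `before_import` hook on the lazy wrapper: the hook name, the `quiet_grpc` helper, and the choice of `GRPC_VERBOSITY=ERROR` are all assumptions, since the commit message does not name the variable or the mechanism.

```python
import importlib
import os
import types
from typing import Any, Callable, Optional


class LazyImport:
    """Lazy module wrapper with an optional hook run just before importing."""

    def __init__(self,
                 module_name: str,
                 before_import: Optional[Callable[[], None]] = None):
        self._module_name = module_name
        self._before_import = before_import
        self._module: Optional[types.ModuleType] = None

    def load_module(self) -> types.ModuleType:
        if self._module is None:
            if self._before_import is not None:
                # Side effects happen here, not when the wrapper is created.
                self._before_import()
            self._module = importlib.import_module(self._module_name)
        return self._module

    def __getattr__(self, name: str) -> Any:
        return getattr(self.load_module(), name)


def quiet_grpc() -> None:
    """Assumed hook: lowers gRPC log noise unless the user already set it."""
    os.environ.setdefault('GRPC_VERBOSITY', 'ERROR')


# No environment change yet; it happens on first attribute access.
nebius = LazyImport('nebius', before_import=quiet_grpc)
```

Because the hook runs inside `load_module`, processes that never touch Nebius keep their environment unchanged.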
@SalikovAlex
Contributor Author

@cblmemo final check?

Collaborator

@cblmemo cblmemo left a comment

Thanks @SalikovAlex for the fix! Left final suggestions. After those it should be ready for testing (& merging)!

Ensure each node name is unique by appending a UUID, preventing naming conflicts during instance launches. Additionally, corrected the `None` check for GPU cluster ID to improve clarity and robustness.
The updated comment specifies that unique names are needed to prevent conflicts between multiple worker VMs, improving clarity for maintainers and future contributors. No functional changes were made.
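
As an illustration of the unique-name change described in these commit messages, a minimal sketch might look like the following. The `unique_node_name` helper and the name layout are hypothetical; the thread only states that a UUID is appended to keep worker VM names from colliding.

```python
import uuid


def unique_node_name(cluster_name: str, role: str) -> str:
    """Builds a node name that stays unique across workers in one cluster.

    The layout '<cluster>-<uuid>-<role>' is an assumption for illustration.
    """
    suffix = uuid.uuid4().hex  # 32 hex characters from a random UUID.
    return f'{cluster_name}-{suffix}-{role}'


# Several workers launched for the same cluster get distinct names.
worker_names = [unique_node_name('my-cluster', 'worker') for _ in range(3)]
```

A real implementation may need to truncate the suffix to respect the cloud's name length limit, which trades some collision resistance for shorter names.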
@cblmemo
Collaborator

cblmemo commented Feb 19, 2025

Bump on the grpcio>=1.56.2 version requirements. Do we still need it?

Also, could you help to merge the latest master branch? I'll run smoke tests after that.

@cblmemo
Collaborator

cblmemo commented Feb 19, 2025

/smoke-test --aws

Collaborator

@cblmemo cblmemo left a comment

I just triggered the smoke test on AWS. @SalikovAlex Could you help run the smoke tests on nebius again? After this it should be ready to go!

Tests `test_skyserve_https` and `test_skyserve_multi_ports` are marked with `no_nebius` since the Nebius cloud does not support Autodown and Autostop. This ensures these tests are skipped in unsupported environments.
@SalikovAlex
Contributor Author

pytest tests/test_smoke.py --nebius is green

@cblmemo
Collaborator

cblmemo commented Feb 20, 2025

pytest tests/test_smoke.py --nebius is green

Thanks @SalikovAlex for the prompt reply! LGTM. Merging now!

@cblmemo cblmemo merged commit 95e9ccf into skypilot-org:master Feb 20, 2025
18 checks passed
@SalikovAlex SalikovAlex deleted the nebiuscloud branch March 26, 2025 15:39

3 participants