-
Notifications
You must be signed in to change notification settings - Fork 631
Add Nebius Cloud #4573
New issue
Have a question about this project? No Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “No Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? No Sign in to your account
Add Nebius Cloud #4573
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @SalikovAlex for this amazing work! I think the PR is in a good shape. Left some nits. In the same time, could you help test the basic functionality of this new cloud? include but not limited to:
- Launch CPU only instance
- Launch GPU instance
- Stop & Re-launch, check if the disk is persistent (write some content before stop, and cat them after re-launch)
- Autostop & Autodown
- Launch on existing cluster
- SSH to the cluster
- Failover: make sure it can failover from lambda to other clouds and the exceptions are printed correctly
- launch on other clouds without nebius dependencies installed (make sure it does not introduce unnecessary dependencies when vast is not enabled)
Unresolved problems:
Feature request: |
All tests are marked, now |
Hey @SalikovAlex thanks for the amazing work!
Perhaps we can pin
This should be ok, since pip on py3.9 environments throws a clear error. Users wanting to use nebius can upgrade to py3.10, which is also supported in SkyPilot.
|
BTW, we may need to fix the nebius adaptor to lazy import nebius only when required. E.g., on this branch if I install
|
Fixed |
Yes, it's ok. But we need to change the version in GitHub actions for pep, lint and etc |
This looks good to me! |
I think in the GH actions it will not install real cloud dependencies. Every test is running without actually provision VMs on the cloud, so IIUC no nebius dependency needs to be installed? Those test that will provision VM and run workloads is located in the |
pylint, pytest, doc build run |
Humm, good point! cc @romilbhardwaj @Michaelvll for a look here |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the prompt fix @SalikovAlex! Mostly looks good to me. Left some discussions ;)
Added this to the comments.
|
Just to confirm, so if a new user is using nebius for skypilot, does that user need to replace the project id here? |
No. I'm requesting them by API. The user's project id looks like "project-e00xxxxxxxxxxxxxx", where "e00" is - the code of the region. Where do you find hardcoded project id? |
added |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @SalikovAlex for this amazing work! Mostly looks good to me. Left final nits. I'll try to confirm the grpcio issue, and after that it should be ready to go!
Could you also help to merge the latest master?
Bumping this: #4629 (comment) Also, could you help to merge the latest master branch? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @SalikovAlex for the prompt fix! Left some final nits ;)
Co-authored-by: Tian Xia <cblmemo@gmail.com>
Moved GRPC logging suppression logic into the LazyImport setup. This ensures the environment variable is set only during the lazy import process, avoiding potential side effects on other modules.
Reorganized imports to improve structure and readability. Changed the cloud registry decorator to use `sky.utils.registry` for consistency across the codebase. No functional changes introduced.
@cblmemo final check? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @SalikovAlex for the fix! Left final suggestions. After those it should be ready for testing (& merging)!
Ensure each node name is unique by appending a UUID, preventing naming conflicts during instance launches. Additionally, corrected the `None` check for GPU cluster ID to improve clarity and robustness.
The updated comment specifies that unique names are needed to prevent conflicts between multiple worker VMs, improving clarity for maintainers and future contributors. No functional changes were made.
Bump on the Also, could you help to merge the latest master branch? I'll run smoke tests after that. |
/smoke-test --aws |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just triggered the smoke test on AWS. @SalikovAlex Could you help run the smoke tests on nebius again? After this it should be ready to go!
Tests `test_skyserve_https` and `test_skyserve_multi_ports` are marked with `no_nebius` since the Nebius cloud does not support Autodown and Autostop. This ensures these tests are skipped in unsupported environments.
|
Thanks @SalikovAlex for the prompt reply! LGTM. Merging now! |
Add support Nebius Cloud
Tested (run the relevant ones):
run 'sky launch -c test-single-instance --cloud nebius echo hi'
run
sky stop test-single-instance; sky start test-single-instance
run
sky down test-single-instance
run
sky launch -c test-single-instance --cloud nebius echo hi; sky stop test-single-instance; sky down test-single-instance
[UNSUPPORTED]
sky launch --cloud nebius -c test-autostop -i 1 echo hi
[UNSUPPORTED]
sky launch --cloud fluffycloud -c test-autodown -i 1 --down echo hi
sky launch examples/multi_hostname.yaml --cloud nebius;
Code formatting:
bash format.sh
All smoke tests:
pytest tests/test_smoke.py