-
Notifications
You must be signed in to change notification settings - Fork 631
[Core] Optimize kubernetes cmd executions with kubernetes command runner #3157
New issue
Have a question about this project? No Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “No Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? No Sign in to your account
Conversation
… into kubernetes-runner
… into kubernetes-runner
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding this feature @Michaelvll ! It will helps to speedup the k8s execution a lot. It mostly looks good to me, left something to discuss :))
Co-authored-by: Tian Xia <cblmemo@gmail.com>
Co-authored-by: Tian Xia <cblmemo@gmail.com>
…o kubernetes-runner
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the fix! LGTM except for several nits ;)
Co-authored-by: Tian Xia <cblmemo@gmail.com>
…kypilot into kubernetes-runner
This PR should be ready to go, cc'ing @romilbhardwaj for a final check before we get this in, as it may affect the kubernetes code path significantly |
…o kubernetes-runner
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Michaelvll! Took a quick look at the code and tried it out, provisioning feels much snappier now 🚀 LGTM if kubernetes smoke tests pass.
# It is important to use /bin/bash -c here to make sure we quote the | ||
# command to be run properly. Otherwise, directly appending commands | ||
# after '--' will not work for some commands, such as '&&', '>' etc. | ||
'/bin/bash', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Just noting, no action needed] Many docker images do not come with bash and instead use sh
(e.g., alpine linux based images). If we need to support those images in the future, we may need to update this and our container command
here:
skypilot/sky/templates/kubernetes-ray.yml.j2
Line 303 in 7a6df2f
command: ["/bin/bash", "-c", "--"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! We have a bunch of places using bash
and bashrc
, assuming the existence of bash
(even debian based image, due to the use of apt install
). We should update those places as well to support those base images. : )
# Require a `/` at the end to make sure the parent dir | ||
# are not created locally. We do not add additional '*' as | ||
# kubernetes's rsync does not work with an ending '*'. | ||
source=f'{remote_log_dir}/', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious - does the removal of *
impact how hidden files are handled? On my mac it does not make any difference and this change should be ok, but this article seems to suggest it does. Any thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems no significant difference between /*
and /
, and the latter is more robust. The tailing *
means the shell expands the pattern to include all files and directories in src before rsync runs, while the latter relies on rsync's own logic to sync the content of src.
Tested with |
…ner (#3157) * remove job_owner * remove some clouds.Local related code * Remove Local cloud entirely * remove local cloud * fix * slurm runner * kubernetes runner * Use command runner for kubernetes * rename back to ssh * refactor runners in backend * fix * fix * fix rsync * Fix runner * Fix run() * errors and fix head runner * support different mode * format * use whoami instead of $USER * timeline for run and rsync * lazy imports for pandas and lazy data frame * fix fetch_aws * fix fetchers * avoid sync script for task * add timeline * cache cluster_info * format * cache cluster info * do not stream * fix skip lines * format * avoid source bashrc or -i for internal exec * format * use -i * Add None arg * fix merge conflicts * Fix source bashrc * add connect_timeout * format * Correctly quote the script without source bashrc * fix output * Fix connection output * Fix * check twice * add Job ID * fix * format * fix ip * fix rsync for kubectl command runner * format * Enable output check for kubernetes * Fix * * Fix comments * longer wait * longer wait * Update sky/backends/cloud_vm_ray_backend.py Co-authored-by: Tian Xia <cblmemo@gmail.com> * Update sky/provision/kubernetes/instance.py Co-authored-by: Tian Xia <cblmemo@gmail.com> * address comments * refactor rsync * add comment * fix interface * Update sky/utils/command_runner.py Co-authored-by: Tian Xia <cblmemo@gmail.com> * fix quote * Fix skip lines * fix smoke * format * fix * fix serve failures * Fix condition * trigger test --------- Co-authored-by: Ubuntu <azureuser@ray-dev-zhwu-9ce1-head-e359-868f0.h4nxbv2ixrmevnfzs0oyii0g1h.bx.internal.cloudapp.net> Co-authored-by: Tian Xia <cblmemo@gmail.com>
Fixes #3154, by refactoring the command runner and avoid using ssh in the backend for kubernetes pods. This is helpful for the future support of Slurm and other services that do not have ssh enabled.
This is also a prototype to reduce the launch and exec time on kubernetes cluster.
We will separate the optimization in this PR to multiple separate PRs:
source ~/.bashrc
for better speed #3484: avoid source bashrc for internal remote command execution (1.3x faster for exec)With the three optimization we can get 3x speed up for exec
Exec speed (1.8x speed up)
Master:
This PR
Launch on existing cluster (1.7x speed up)
sky launch -c test-existing
multitime -n 5 sky launch -y -c test-existing echo hi
master:
This PR:
New clusters without setup/run (1.3x speed up)
multitime -n 5 sky launch -y --cloud kubernetes --cpus 2
Master:
This PR:
New clusters with setup/run (1.4x speed up)
multitime -n 5 sky launch -y --cloud kubernetes --cpus 2 task.yaml
Master:
This PR:
New clusters with file mounts(1.6x speed up)
task.yaml
multitime -n 5 sky launch -y --cloud kubernetes --cpus 1 task.yaml
Master:
This PR:
Blocked by #3037.Tested (run the relevant ones):
bash format.sh
sky launch -c test-ssh --cloud gcp --cpus 2 --num-nodes 2 task.yaml
sky launch -c test-k8s --cloud kubernetes --cpus 2 --num-nodes 2 task.yaml
sky launch --cloud aws --cpus 2 task.yaml
with internal ips and jump server.sky launch --cloud kubernetes --cpus 2 task.yaml
sky launch --cloud gcp --cpus 2 task.yaml
sky launch -c test-docker --cloud gcp --cpus 2 --image-id docker:ubuntu:18.04 task.yaml
sky launch -c test-docker --cloud aws --cpus 2 --image-id docker:ubuntu:18.04 task.yaml
pytest tests/test_smoke.py --kubernetes
pytest tests/test_smoke.py::test_fill_in_the_name
Make sure
sky logs --sync-down
works for other clouds.pytest tests/test_smoke.py::test_minimal --aws
pytest tests/test_smoke.py::test_minimal --gcp
pytest tests/test_smoke.py::test_minimal --kubernetes
sky launch --cloud aws --num-nodes 4 -c min echo hi; sky logs --sync-down min
and check the local log dirbash tests/backward_comaptibility_tests.sh
(with stop commands commented out)