
[Core] Optimize kubernetes cmd executions with kubernetes command runner #3157


Merged on Jun 7, 2024 (88 commits)


Michaelvll
Collaborator

@Michaelvll Michaelvll commented Feb 14, 2024

Fixes #3154 by refactoring the command runner and avoiding ssh in the backend for Kubernetes pods. This also helps future support for Slurm and other services that do not have ssh enabled.

This is also a prototype for reducing launch and exec time on Kubernetes clusters.

We will split the optimizations in this PR into separate PRs:

  1. [Core] Reduce import time with lazy imports and exec time by avoiding script rsync #3394: lazy import and avoid rsync for exec (1.3x faster for exec)
  2. [Core] Command runner refactor and avoid source ~/.bashrc for better speed #3484: avoid source bashrc for internal remote command execution (1.3x faster for exec)
  3. This PR: kubernetes command runner
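For reference, the lazy-import idea in item 1 can be sketched roughly as below. This is only a minimal illustration, not the code from #3394; the `LazyImport` class and its `loaded` property are hypothetical names, and the standard-library `json` module stands in for a heavy dependency such as pandas.

```python
import importlib
import types
from typing import Any, Optional


class LazyImport:
    """Hypothetical helper: import a module on first attribute access."""

    def __init__(self, module_name: str) -> None:
        self._module_name = module_name
        self._module: Optional[types.ModuleType] = None

    @property
    def loaded(self) -> bool:
        return self._module is not None

    def __getattr__(self, name: str) -> Any:
        # Reached only for names not defined on the wrapper itself,
        # i.e. attributes that belong to the wrapped module.
        if self._module is None:
            self._module = importlib.import_module(self._module_name)
        return getattr(self._module, name)


json_lazy = LazyImport('json')       # cheap: nothing is imported here
assert not json_lazy.loaded
encoded = json_lazy.dumps({'a': 1})  # the real import happens on this call
assert json_lazy.loaded
```

The import cost is then paid only by code paths that actually touch the module, which is where the CLI start-up saving would come from.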

With the three optimizations combined, we get a 3x speedup for exec.
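As a quick sanity check on the 3x figure, assuming the three per-PR speedups are independent and simply multiply (an assumption, not something measured here):

```python
import math

# Per-step exec speedups as quoted in this description.
per_step_speedups = {
    'lazy imports, no script rsync (#3394)': 1.3,
    'no `source ~/.bashrc` (#3484)': 1.3,
    'kubernetes command runner (this PR)': 1.8,
}

# Assumption: the steps compose multiplicatively.
combined = math.prod(per_step_speedups.values())
print(f'combined exec speedup: {combined:.2f}x')  # combined exec speedup: 3.04x
```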

Exec speed (1.8x speed up)
multitime -n 5 sky exec test-speed echo hi -d

Master:

            Mean        Std.Dev.    Min         Median      Max
real        10.377      0.379       9.948       10.372      10.821      
user        1.373       0.025       1.327       1.381       1.401       
sys         0.178       0.021       0.150       0.171       0.209 

This PR:

            Mean        Std.Dev.    Min         Median      Max
real        5.715       0.146       5.499       5.668       5.925       
user        1.582       0.132       1.475       1.512       1.835       
sys         0.209       0.027       0.170       0.202       0.253  
Launch on existing cluster (1.7x speed up)

sky launch -c test-existing
multitime -n 5 sky launch -y -c test-existing echo hi
Master:

            Mean        Std.Dev.    Min         Median      Max
real        40.467      2.363       37.563      39.865      44.683      
user        3.246       0.573       2.918       2.986       4.389       
sys         0.359       0.065       0.313       0.322       0.487 

This PR:

            Mean        Std.Dev.    Min         Median      Max
real        23.459      0.304       23.020      23.426      23.918      
user        3.220       0.041       3.155       3.223       3.266       
sys         0.433       0.028       0.389       0.438       0.463 
New clusters without setup/run (1.3x speed up)

multitime -n 5 sky launch -y --cloud kubernetes --cpus 2
Master:

            Mean        Std.Dev.    Min         Median      Max
real        55.359      2.039       53.175      54.318      58.499      
user        3.180       0.046       3.112       3.177       3.236       
sys         0.377       0.014       0.356       0.385       0.391  

This PR:

            Mean        Std.Dev.    Min         Median      Max
real        41.574      1.206       39.681      41.475      43.460      
user        3.809       0.506       3.466       3.577       4.816       
sys         0.553       0.081       0.462       0.540       0.699 
New clusters with setup/run (1.4x speed up)

multitime -n 5 sky launch -y --cloud kubernetes --cpus 2 task.yaml

setup: echo Some "setup commands"
run: echo Some "run commands"

Master:

            Mean        Std.Dev.    Min         Median      Max
real        74.712      2.938       71.890      73.015      79.771      
user        3.598       0.465       3.318       3.373       4.526       
sys         0.463       0.104       0.376       0.435       0.661 

This PR:

            Mean        Std.Dev.    Min         Median      Max
real        54.095      1.973       51.659      54.739      56.846      
user        4.221       0.068       4.174       4.181       4.354       
sys         0.647       0.026       0.599       0.652       0.672 
New clusters with file mounts (1.6x speed up)

task.yaml

resources:
  cpus: 1
  # image_id: docker:michaelvll/no-local-bin-test:v3
  # image_id: docker:michaelvll/k8s-dep-test:v2

file_mounts:
  /sky/examples: ./examples
  ~/.viminfo : ~/.viminfo
  ~/task.yaml : ./task.yaml
  ~/.config/gcloud: ~/.config/gcloud

run: |
  echo hi

multitime -n 5 sky launch -y --cloud kubernetes --cpus 1 task.yaml
Master:

            Mean        Std.Dev.    Min         Median      Max
real        99.381      2.770       95.061      101.003     102.193     
user        4.529       0.458       4.255       4.319       5.443       
sys         0.903       0.115       0.821       0.840       1.126  

This PR:

            Mean        Std.Dev.    Min         Median      Max
real        62.851      2.610       59.383      62.714      67.189      
user        5.895       0.590       5.473       5.558       7.043       
sys         1.366       0.236       1.044       1.333       1.778 
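
The speedup figures in the section titles above are the ratio of mean `real` time on master to mean `real` time on this PR. A small sketch that recomputes them from the tables (the dictionary keys are just labels chosen for this illustration):

```python
# (master mean real seconds, this-PR mean real seconds), copied from the tables.
mean_real_seconds = {
    'exec': (10.377, 5.715),
    'launch on existing cluster': (40.467, 23.459),
    'new cluster, no setup/run': (55.359, 41.574),
    'new cluster, with setup/run': (74.712, 54.095),
    'new cluster, with file mounts': (99.381, 62.851),
}

speedups = {
    name: master / pr for name, (master, pr) in mean_real_seconds.items()
}
for name, speedup in speedups.items():
    print(f'{name}: {speedup:.1f}x')
```

This prints 1.8x, 1.7x, 1.3x, 1.4x and 1.6x respectively, matching the titles.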

Blocked by #3037.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • sky launch -c test-ssh --cloud gcp --cpus 2 --num-nodes 2 task.yaml
    • sky launch -c test-k8s --cloud kubernetes --cpus 2 --num-nodes 2 task.yaml
    • sky launch --cloud aws --cpus 2 task.yaml with internal ips and jump server.
    • sky launch --cloud kubernetes --cpus 2 task.yaml
    • sky launch --cloud gcp --cpus 2 task.yaml
    • sky launch -c test-docker --cloud gcp --cpus 2 --image-id docker:ubuntu:18.04 task.yaml
    • sky launch -c test-docker --cloud aws --cpus 2 --image-id docker:ubuntu:18.04 task.yaml
      resources:
        cpus: 2
      
      
      file_mounts:
        /my-examples: ./examples
        ~/task.yaml: ./task.yaml
      
      setup: |
        ls /my-examples
      
      run: |
        ls ~
  • All smoke tests: pytest tests/test_smoke.py --kubernetes
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
    Make sure sky logs --sync-down works for other clouds.
    • pytest tests/test_smoke.py::test_minimal --aws
    • pytest tests/test_smoke.py::test_minimal --gcp
    • pytest tests/test_smoke.py::test_minimal --kubernetes
    • sky launch --cloud aws --num-nodes 4 -c min echo hi; sky logs --sync-down min and check the local log dir
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh (with stop commands commented out)

@Michaelvll Michaelvll changed the base branch from master to remove-local-cloud February 14, 2024 22:25
@Michaelvll Michaelvll requested review from romilbhardwaj and removed request for romilbhardwaj February 14, 2024 22:27
Base automatically changed from remove-local-cloud to master February 18, 2024 08:37
Collaborator

@cblmemo cblmemo left a comment

Thanks for adding this feature @Michaelvll! It will help speed up k8s execution a lot. It mostly looks good to me; I left a few things to discuss :))

@Michaelvll Michaelvll requested a review from cblmemo June 3, 2024 07:50
Collaborator

@cblmemo cblmemo left a comment

Thanks for the fix! LGTM except for several nits ;)

@Michaelvll
Collaborator Author

This PR should be ready to go, cc'ing @romilbhardwaj for a final check before we get this in, as it may affect the kubernetes code path significantly

Collaborator

@romilbhardwaj romilbhardwaj left a comment

Thanks @Michaelvll! Took a quick look at the code and tried it out, provisioning feels much snappier now 🚀 LGTM if kubernetes smoke tests pass.

# It is important to use /bin/bash -c here to make sure we quote the
# command to be run properly. Otherwise, directly appending commands
# after '--' will not work for some commands, such as '&&', '>' etc.
'/bin/bash',
Collaborator

[Just noting, no action needed] Many docker images do not come with bash and instead use sh (e.g., alpine linux based images). If we need to support those images in the future, we may need to update this and our container command here:

command: ["/bin/bash", "-c", "--"]

Collaborator Author

Good point! We have a bunch of places that use bash and bashrc and assume bash exists (we even assume a Debian-based image, due to the use of apt install). We should update those places as well to support such base images. : )
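
To illustrate the `/bin/bash -c` comment in the diff above: when the command is passed as an argv list, the whole script has to be a single argument to `bash -c`, so that operators such as `&&` and `>` are interpreted by the shell inside the pod rather than being handed to the first program as literal arguments. A minimal sketch, where `build_kubectl_exec`, the pod name and the namespace are hypothetical and this is not the runner's actual implementation:

```python
import shlex
from typing import List


def build_kubectl_exec(pod: str, namespace: str, cmd: str) -> List[str]:
    """Hypothetical helper: argv for running a shell snippet inside a pod."""
    return [
        'kubectl', 'exec', '-n', namespace, pod, '--',
        # The full snippet is one argv element, so bash in the pod parses it.
        '/bin/bash', '-c', cmd,
    ]


argv = build_kubectl_exec('my-pod', 'default', 'cd /tmp && echo hi > out.txt')
# Shell-quoted form, e.g. for logging or for embedding in another command.
printable = shlex.join(argv)
print(printable)
```

Supporting images that only ship `sh` would, under this sketch, mean making the `'/bin/bash'` element configurable.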

Comment on lines +3647 to +3650
# Require a `/` at the end to make sure the parent dir
# are not created locally. We do not add additional '*' as
# kubernetes's rsync does not work with an ending '*'.
source=f'{remote_log_dir}/',
Collaborator

Curious - does the removal of * impact how hidden files are handled? On my mac it does not make any difference and this change should be ok, but this article seems to suggest it does. Any thoughts?

Collaborator Author

There seems to be no significant difference between /* and /, and the latter is more robust. A trailing * makes the shell expand the pattern to all files and directories in src before rsync runs, whereas a trailing / relies on rsync's own logic to sync the contents of src.
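
On the hidden-files question, a small illustration of the glob side. Python's `glob` follows the same default rule as a typical shell (for example bash without `dotglob`): a bare `*` does not match names starting with a dot, whereas listing the directory itself, which is roughly what rsync does for `src/`, sees every entry. The file names below are made up for the example.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as src:
    for name in ('run.log', '.hidden.log'):
        open(os.path.join(src, name), 'w').close()

    # What a `src/*` pattern expands to: dotfiles are skipped by default.
    star_matches = sorted(
        os.path.basename(p) for p in glob.glob(os.path.join(src, '*')))
    # What walking `src/` itself yields: every entry, hidden or not.
    all_entries = sorted(os.listdir(src))

print(star_matches)  # ['run.log']
print(all_entries)   # ['.hidden.log', 'run.log']
```

So if the log directory ever contains dotfiles, the trailing `/` form is the one that would pick them up.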

@Michaelvll
Collaborator Author

Michaelvll commented Jun 7, 2024

Tested with pytest tests/test_smoke.py --kubernetes and passed. Merging now : )

@Michaelvll Michaelvll merged commit 53d1705 into master Jun 7, 2024
20 checks passed
@Michaelvll Michaelvll deleted the kubernetes-runner branch June 7, 2024 18:58
Michaelvll added a commit that referenced this pull request Aug 23, 2024
…ner (#3157)

* remove job_owner

* remove some clouds.Local related code

* Remove Local cloud entirely

* remove local cloud

* fix

* slurm runner

* kubernetes runner

* Use command runner for kubernetes

* rename back to ssh

* refactor runners in backend

* fix

* fix

* fix rsync

* Fix runner

* Fix run()

* errors and fix head runner

* support different mode

* format

* use whoami instead of $USER

* timeline for run and rsync

* lazy imports for pandas and lazy data frame

* fix fetch_aws

* fix fetchers

* avoid sync script for task

* add timeline

* cache cluster_info

* format

* cache cluster info

* do not stream

* fix skip lines

* format

* avoid source bashrc or -i for internal exec

* format

* use -i

* Add None arg

* fix merge conflicts

* Fix source bashrc

* add connect_timeout

* format

* Correctly quote the script without source bashrc

* fix output

* Fix connection output

* Fix

* check twice

* add Job ID

* fix

* format

* fix ip

* fix rsync for kubectl command runner

* format

* Enable output check for kubernetes

* Fix *

* Fix comments

* longer wait

* longer wait

* Update sky/backends/cloud_vm_ray_backend.py

Co-authored-by: Tian Xia <cblmemo@gmail.com>

* Update sky/provision/kubernetes/instance.py

Co-authored-by: Tian Xia <cblmemo@gmail.com>

* address comments

* refactor rsync

* add comment

* fix interface

* Update sky/utils/command_runner.py

Co-authored-by: Tian Xia <cblmemo@gmail.com>

* fix quote

* Fix skip lines

* fix smoke

* format

* fix

* fix serve failures

* Fix condition

* trigger test

---------

Co-authored-by: Ubuntu <azureuser@ray-dev-zhwu-9ce1-head-e359-868f0.h4nxbv2ixrmevnfzs0oyii0g1h.bx.internal.cloudapp.net>
Co-authored-by: Tian Xia <cblmemo@gmail.com>
Development

Successfully merging this pull request may close these issues.

[k8s] Slow file_mounts on k8s
3 participants