[k8s] refactor gpu labeler script / gpu labeler optionally accepts a context #5072

Merged
3 commits merged into master from gpu-labeler-refactor on Apr 2, 2025

Conversation

SeungjinYang
Collaborator

@SeungjinYang SeungjinYang commented Mar 28, 2025

  1. Reuse existing helper functions where applicable instead of reimplementing custom logic in the script. This prepares the script to eventually run on the server in response to an API call.
  2. Allow the script to target a specific context, not just the active context.

UX:
If --context is specified:

$ python -m sky.utils.kubernetes.gpu_labeler --context syang@my-cluster-2.us-east-1.eksctl.io                      
Found 1 GPU node(s) in the cluster
Using default RuntimeClass for GPU labeling.
Created GPU labeler job for node ip-192-168-56-224.ec2.internal
GPU labeling started - this may take 10 min or more to complete.
To check the status of GPU labeling jobs, run `kubectl get jobs -n kube-system -l job=sky-gpu-labeler --context syang@my-cluster-2.us-east-1.eksctl.io`
You can check if nodes have been labeled by running `kubectl describe nodes --context syang@my-cluster-2.us-east-1.eksctl.io` and looking for labels of the format `skypilot.co/accelerator: <gpu_name>`. 

If --context is not specified:

$ python -m sky.utils.kubernetes.gpu_labeler                                                
Found 1 GPU node(s) in the cluster
Using default RuntimeClass for GPU labeling.
Created GPU labeler job for node ip-192-168-56-224.ec2.internal
GPU labeling started - this may take 10 min or more to complete.
To check the status of GPU labeling jobs, run `kubectl get jobs -n kube-system -l job=sky-gpu-labeler`
You can check if nodes have been labeled by running `kubectl describe nodes` and looking for labels of the format `skypilot.co/accelerator: <gpu_name>`. 
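
For illustration, the optional-context behavior shown in the two runs above could be sketched roughly as follows. This is not the code from this PR: `parse_args` and `build_status_hint` are hypothetical names, and the sketch only covers argument parsing and how the suggested `kubectl` command changes, not the labeling itself.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse CLI arguments; --context is optional (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description='Label GPU nodes in a Kubernetes cluster.')
    parser.add_argument(
        '--context',
        type=str,
        default=None,
        help='Kubernetes context to use. Defaults to the active context.')
    return parser.parse_args(argv)


def build_status_hint(context: Optional[str] = None) -> str:
    """Build the kubectl command suggested to the user (hypothetical helper).

    The --context suffix is appended only when a context was given, so the
    output without the flag stays the same as before.
    """
    cmd = 'kubectl get jobs -n kube-system -l job=sky-gpu-labeler'
    if context:
        cmd += f' --context {context}'
    return cmd
```

Calling `build_status_hint(parse_args(['--context', 'my-context']).context)` would produce the command with the `--context my-context` suffix, while omitting the flag leaves the command unchanged, which is what keeps the default-context path backward compatible.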

Tested (run the relevant ones):

  • Code formatting: install pre-commit (auto-check on commit) or run bash format.sh
  • Manual test: run the GPU labeler script on a non-default context using the --context flag
  • Manual test (backward compatibility): run the GPU labeler script on the default context without the --context flag
  • Manual test: run the GPU labeler script with the --context flag and verify that the initial cleanup runs correctly

@SeungjinYang SeungjinYang force-pushed the gpu-labeler-refactor branch from 493f30c to cae5d67 March 31, 2025 17:46
@SeungjinYang SeungjinYang requested a review from Michaelvll March 31, 2025 17:48
@SeungjinYang SeungjinYang marked this pull request as ready for review March 31, 2025 17:48
@SeungjinYang SeungjinYang requested a review from cg505 March 31, 2025 18:59
Collaborator

@Michaelvll Michaelvll left a comment

Awesome @SeungjinYang! LGTM

@SeungjinYang SeungjinYang merged commit d02ec5a into master Apr 2, 2025
20 checks passed
@SeungjinYang SeungjinYang deleted the gpu-labeler-refactor branch April 2, 2025 00:18