-
Notifications
You must be signed in to change notification settings - Fork 633
[Docs] K8s Debugging Docs #3128
New issue
Have a question about this project? No Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “No Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? No Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding the doc for debugging the setup of kubernetes cluster @romilbhardwaj!
I just tested setting up a rancher cluster on two Azure VMs with 1 T4 GPU each, and after about 2-3 hours, I got the GPU work! However, the ports opening still fail for me. Just left several confusions and issues in the comments.
Setting up GPU support | ||
~~~~~~~~~~~~~~~~~~~~~~ | ||
If your Kubernetes cluster has Nvidia GPUs, ensure that: | ||
|
||
1. The Nvidia GPU operator is installed (i.e., ``nvidia.com/gpu`` resource is available on each node) and ``nvidia`` is set as the default runtime for your container engine. See `Nvidia's installation guide <https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#install-nvidia-gpu-operator>`_ for more details. | ||
2. Each node in your cluster is labelled with the GPU type. This labelling can be done by adding a label of the format ``skypilot.co/accelerators: <gpu_name>``, where the ``<gpu_name>`` is the lowercase name of the GPU. For example, a node with V100 GPUs must have a label :code:`skypilot.co/accelerators: v100`. | ||
2. Each node in your cluster is labelled with the GPU type. This labelling can be done by adding a label of the format ``skypilot.co/accelerator: <gpu_name>``, where the ``<gpu_name>`` is the lowercase name of the GPU. For example, a node with V100 GPUs must have a label :code:`skypilot.co/accelerator: v100`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we move the tip and note below before 2.
as it seems more relevant to the GPU operator part?
It might be better to keep the label close to the script we provide.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to keep the two steps next to each other since the basic requirements should be clear at a quick glance. Adding notes between them would clutter and make it harder to read. wdyt?
$ sky launch -y -c myserver --cloud kubernetes --port 8080 -- "python -m http.server 8080" | ||
|
||
# Obtain the endpoint of the service | ||
$ sky status --endpoint 8080 myserver |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I run this after my cluster is up, I got the following error:
sky status --endpoint 8080 myserver
RuntimeError: Port 8080 not exposed yet. If the cluster was recently started, please retry after a while. Additionally, make sure your Ingress is configured correctly.
To debug, run: kubectl describe ingress && kubectl describe ingressclass
I run the two commands suggested, but fail to see what should I look at in the output.
In the kubectl describe ingress
, I do see the ingress setup:
kubectl describe ingress
Name: myserver-084e-skypilot-ingress--8080
Labels: <none>
Namespace: default
Address: 10.166.0.4,10.166.0.5
Ingress Class: nginx
Default backend: <default>
Rules:
Host Path Backends
---- ---- --------
*
/skypilot/myserver-084e/8080(/|$)(.*) myserver-084e-skypilot-service--8080:8080 (10.42.167.175:8080)
Annotations: field.cattle.io/publicEndpoints:
[{"addresses":["10.166.0.4","10.166.0.5"],"port":80,"protocol":"HTTP","serviceName":"default:myserver-084e-skypilot-service--8080","ingres...
nginx.ingress.kubernetes.io/rewrite-target: /$2
nginx.ingress.kubernetes.io/use-regex: true
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Sync 6m2s (x3 over 6m23s) nginx-ingress-controller Scheduled for sync
Normal Sync 6m2s (x3 over 6m23s) nginx-ingress-controller Scheduled for sync
However, the sky status --endpoint 8080 myserver
still shows the same error.
I also tried kubectl describe services
and seems the service for the nginx ingress is there:
Name: myserver-084e-skypilot-service--8080
Namespace: default
Labels: parent=skypilot
Annotations: <none>
Selector: skypilot-cluster=myserver-084e
Type: ClusterIP
IP Family Policy: SingleStack
IP Families: IPv4
IP: 10.43.98.142
IPs: 10.43.98.142
Port: <unset> 8080/TCP
TargetPort: 8080/TCP
Endpoints: 10.42.167.175:8080
Session Affinity: None
Events: <none>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just tried to install the ingress controller with the quick start's recommendation, but got an error.
helm upgrade --install ingress-nginx ingress-nginx \
--repo https://kubernetes.github.io/ingress-nginx \
--namespace ingress-nginx --create-namespace
WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /home/azureuser/.kube/config
WARNING: Kubernetes configuration file is world-readable. This is insecure. Location: /home/azureuser/.kube/config
Release "ingress-nginx" does not exist. Installing it now.
Error: Unable to continue with install: IngressClass "nginx" in namespace "" exists and cannot be imported into the current release: invalid ownership metadata; annotation validation error: key "meta.helm.sh/release-name" must equal "ingress-nginx": current value is "rke2-ingress-nginx"; annotation validation error: key "meta.helm.sh/release-namespace" must equal "ingress-nginx": current value is "kube-system"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah looks like the default RKE nginx ingress doesn't have a service configured to access the ingress controller.
The Bare-Metal instructions on the nginx installation docs work here, tested on RKE2 cluster. I'll update the docs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a tip - let me know if that helps!
…o k8s_docs_debuggingpage # Conflicts: # docs/source/reference/kubernetes/kubernetes-setup.rst
Thanks @Michaelvll! Ready for another look. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for updating this @romilbhardwaj! LGTM.
Adding a doc page to help debug common Kubernetes issues.
TODO: