[Docs] K8s Debugging Docs #3128

romilbhardwaj · 2024-02-09T02:14:39Z

Adding a doc page to help debug common Kubernetes issues.

TODO:

Add debugging instructions for opening ports

Michaelvll

Thanks for adding the doc for debugging Kubernetes cluster setup, @romilbhardwaj!

I just tested setting up a Rancher cluster on two Azure VMs with 1 T4 GPU each, and after about 2-3 hours I got the GPUs working! However, opening ports still fails for me. I left several questions and issues in the comments.

tests/kubernetes/cpu_test_pod.yaml

docs/source/reference/kubernetes/kubernetes-debugging.rst

docs/source/reference/kubernetes/kubernetes-setup.rst

Michaelvll · 2024-02-09T23:41:49Z

docs/source/reference/kubernetes/kubernetes-setup.rst

 Setting up GPU support
 ~~~~~~~~~~~~~~~~~~~~~~
 If your Kubernetes cluster has Nvidia GPUs, ensure that:

 1. The Nvidia GPU operator is installed (i.e., ``nvidia.com/gpu`` resource is available on each node) and ``nvidia`` is set as the default runtime for your container engine. See `Nvidia's installation guide <https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#install-nvidia-gpu-operator>`_ for more details.
-2. Each node in your cluster is labelled with the GPU type. This labelling can be done by adding a label of the format ``skypilot.co/accelerators: <gpu_name>``, where the ``<gpu_name>`` is the lowercase name of the GPU. For example, a node with V100 GPUs must have a label :code:`skypilot.co/accelerators: v100`.
+2. Each node in your cluster is labelled with the GPU type. This labelling can be done by adding a label of the format ``skypilot.co/accelerator: <gpu_name>``, where the ``<gpu_name>`` is the lowercase name of the GPU. For example, a node with V100 GPUs must have a label :code:`skypilot.co/accelerator: v100`.


Can we move the tip and note below to before step 2, since they seem more relevant to the GPU operator part?

It might be better to keep the label close to the script we provide.

I wanted to keep the two steps next to each other since the basic requirements should be clear at a quick glance. Adding notes between them would clutter and make it harder to read. wdyt?
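
For reference, the labelling rule in the diff above (the lowercase GPU name under the ``skypilot.co/accelerator`` key) can be sketched in a few lines of Python. This is only an illustration, not SkyPilot's own labelling script: the helper names and the node name ``gpu-node-1`` are hypothetical, and the command is built but never executed.

```python
from __future__ import annotations

# Label key taken from the diff above.
ACCELERATOR_LABEL_KEY = "skypilot.co/accelerator"


def accelerator_label(gpu_name: str) -> str:
    """Return the label in kubectl's key=value form, with the GPU name lowercased."""
    name = gpu_name.strip().lower()
    if not name:
        raise ValueError("gpu_name must be non-empty")
    return f"{ACCELERATOR_LABEL_KEY}={name}"


def label_command(node: str, gpu_name: str) -> list[str]:
    """Build (but do not run) the kubectl invocation that labels one node."""
    return ["kubectl", "label", "nodes", node, accelerator_label(gpu_name)]


if __name__ == "__main__":
    # "gpu-node-1" is a hypothetical node name used only for illustration.
    print(" ".join(label_command("gpu-node-1", "V100")))
```

Printing the command instead of running it makes it easy to review the labels for every node before applying them.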


Michaelvll · 2024-02-10T00:21:39Z

docs/source/reference/kubernetes/kubernetes-debugging.rst

+    $ sky launch -y -c myserver --cloud kubernetes --port 8080 -- "python -m http.server 8080"
+
+    # Obtain the endpoint of the service
+    $ sky status --endpoint 8080 myserver


When I ran this after my cluster was up, I got the following error:

    $ sky status --endpoint 8080 myserver
    RuntimeError: Port 8080 not exposed yet. If the cluster was recently started, please retry after a while. Additionally, make sure your Ingress is configured correctly. To debug, run: kubectl describe ingress && kubectl describe ingressclass

I ran the two suggested commands, but could not tell what I should look for in the output.

In the output of kubectl describe ingress, I do see the ingress set up:

    $ kubectl describe ingress
    Name:             myserver-084e-skypilot-ingress--8080
    Labels:           <none>
    Namespace:        default
    Address:          10.166.0.4,10.166.0.5
    Ingress Class:    nginx
    Default backend:  <default>
    Rules:
      Host  Path                                   Backends
      ----  ----                                   --------
      *     /skypilot/myserver-084e/8080(/|$)(.*)  myserver-084e-skypilot-service--8080:8080 (10.42.167.175:8080)
    Annotations:
      field.cattle.io/publicEndpoints: [{"addresses":["10.166.0.4","10.166.0.5"],"port":80,"protocol":"HTTP","serviceName":"default:myserver-084e-skypilot-service--8080","ingres...
      nginx.ingress.kubernetes.io/rewrite-target: /$2
      nginx.ingress.kubernetes.io/use-regex: true
    Events:
      Type    Reason  Age                   From                      Message
      ----    ------  ----                  ----                      -------
      Normal  Sync    6m2s (x3 over 6m23s)  nginx-ingress-controller  Scheduled for sync
      Normal  Sync    6m2s (x3 over 6m23s)  nginx-ingress-controller  Scheduled for sync

However, the sky status --endpoint 8080 myserver still shows the same error.
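
As an aside for readers of the ingress output above: the path rule ``/skypilot/myserver-084e/8080(/|$)(.*)`` together with the ``rewrite-target: /$2`` annotation means the controller strips the prefix and forwards only the second capture group to the backend. The sketch below imitates that with Python's ``re`` module. It approximates the rule for illustration and is not the NGINX implementation; the ``rewrite`` helper is a hypothetical name.

```python
import re

# Path rule copied from the `kubectl describe ingress` output above.
INGRESS_PATH = re.compile(r"/skypilot/myserver-084e/8080(/|$)(.*)")


def rewrite(request_path: str):
    """Return the path forwarded to the backend, or None if the rule does not match."""
    match = INGRESS_PATH.match(request_path)
    if match is None:
        return None
    # rewrite-target is "/$2": a slash followed by the second capture group.
    return "/" + match.group(2)


if __name__ == "__main__":
    for path in ("/skypilot/myserver-084e/8080/index.html",
                 "/skypilot/myserver-084e/8080",
                 "/other"):
        print(path, "->", rewrite(path))
```

A request for ``/skypilot/myserver-084e/8080/index.html`` would therefore reach the HTTP server as ``/index.html``, while paths outside the prefix are not handled by this rule.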

I also tried kubectl describe services, and it seems the service for the nginx ingress is there:

    Name:              myserver-084e-skypilot-service--8080
    Namespace:         default
    Labels:            parent=skypilot
    Annotations:       <none>
    Selector:          skypilot-cluster=myserver-084e
    Type:              ClusterIP
    IP Family Policy:  SingleStack
    IP Families:       IPv4
    IP:                10.43.98.142
    IPs:               10.43.98.142
    Port:              <unset>  8080/TCP
    TargetPort:        8080/TCP
    Endpoints:         10.42.167.175:8080
    Session Affinity:  None
    Events:            <none>

I just tried to install the ingress controller following the quick start's recommendation, but got an error:

    $ helm upgrade --install ingress-nginx ingress-nginx \
        --repo https://kubernetes.github.io/ingress-nginx \
        --namespace ingress-nginx --create-namespace
    WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /home/azureuser/.kube/config
    WARNING: Kubernetes configuration file is world-readable. This is insecure. Location: /home/azureuser/.kube/config
    Release "ingress-nginx" does not exist. Installing it now.
    Error: Unable to continue with install: IngressClass "nginx" in namespace "" exists and cannot be imported into the current release: invalid ownership metadata; annotation validation error: key "meta.helm.sh/release-name" must equal "ingress-nginx": current value is "rke2-ingress-nginx"; annotation validation error: key "meta.helm.sh/release-namespace" must equal "ingress-nginx": current value is "kube-system"

Ah, it looks like the default RKE nginx ingress doesn't have a service configured to access the ingress controller.

The Bare-Metal instructions in the nginx installation docs work here; I tested them on an RKE2 cluster. I'll update the docs.

Added a tip - let me know if that helps!
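
One way to check for the situation described in this thread (an ingress exists, but no service exposes the ingress controller outside the cluster) is to scan the output of ``kubectl get services --all-namespaces -o json`` for services of type ``NodePort`` or ``LoadBalancer``. The Python sketch below is a minimal illustration, not the check SkyPilot itself performs: the helper name is hypothetical, and the ``ingress-nginx/ingress-nginx-controller`` entry in the sample data is an assumed example rather than output from the cluster discussed here.

```python
from __future__ import annotations

# Service types that are reachable from outside the cluster.
EXTERNAL_TYPES = {"NodePort", "LoadBalancer"}


def externally_reachable(service_list: dict) -> list[str]:
    """Return "namespace/name" for every NodePort or LoadBalancer service."""
    return [
        f'{item["metadata"]["namespace"]}/{item["metadata"]["name"]}'
        for item in service_list.get("items", [])
        if item.get("spec", {}).get("type") in EXTERNAL_TYPES
    ]


if __name__ == "__main__":
    # Sample data shaped like `kubectl get services --all-namespaces -o json`.
    sample = {
        "items": [
            # The ClusterIP service shown earlier in this thread.
            {"metadata": {"namespace": "default",
                          "name": "myserver-084e-skypilot-service--8080"},
             "spec": {"type": "ClusterIP"}},
            # Hypothetical controller service, assumed for illustration only.
            {"metadata": {"namespace": "ingress-nginx",
                          "name": "ingress-nginx-controller"},
             "spec": {"type": "NodePort"}},
        ]
    }
    print(externally_reachable(sample))
```

If the list contains no service belonging to the ingress controller, the ingress rules may exist while the port still cannot be reached from outside the cluster.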

romilbhardwaj · 2024-02-28T21:18:08Z

Thanks @Michaelvll! Ready for another look.

Michaelvll

Thank you for updating this @romilbhardwaj! LGTM.

romilbhardwaj added 2 commits February 8, 2024 18:11

debugging docs

cd12735

fixes

9425543

concretevitamin requested a review from Michaelvll February 9, 2024 03:00

romilbhardwaj added 2 commits February 9, 2024 12:21

update gpu instructions

f179219

services debugging

789b2ff

Michaelvll reviewed Feb 10, 2024

romilbhardwaj added 4 commits February 12, 2024 10:14

services debugging

a3062de

Merge branch 'master' of https://github.com/skypilot-org/skypilot int…

db52702

…o k8s_docs_debuggingpage # Conflicts: # docs/source/reference/kubernetes/kubernetes-setup.rst

comments wip

ce6bdb3

comments wip

68acd2b

romilbhardwaj mentioned this pull request Feb 28, 2024

[k8s] Move socat/netcat dependency checks to sky check for k8s #3252

Closed

romilbhardwaj added 2 commits February 28, 2024 13:10

comments

43bc9c4

fixes

44be414

romilbhardwaj requested a review from Michaelvll February 29, 2024 20:51

Michaelvll approved these changes Mar 1, 2024

romilbhardwaj merged commit a63a56a into master Mar 2, 2024

romilbhardwaj deleted the k8s_docs_debuggingpage branch March 2, 2024 00:05
