Skip to content

[Docs] K8s Debugging Docs #3128

New issue

Have a question about this project? No Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “No Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? No Sign in to your account

Merged
merged 10 commits into from
Mar 2, 2024
Merged

[Docs] K8s Debugging Docs #3128

merged 10 commits into from
Mar 2, 2024

Conversation

romilbhardwaj
Copy link
Collaborator

@romilbhardwaj romilbhardwaj commented Feb 9, 2024

Adding a doc page to help debug common Kubernetes issues.

TODO:

  • Add debugging instructions for opening ports

Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding the doc for debugging the setup of kubernetes cluster @romilbhardwaj!

I just tested setting up a rancher cluster on two Azure VMs with 1 T4 GPU each, and after about 2-3 hours, I got the GPU work! However, the ports opening still fail for me. Just left several confusions and issues in the comments.

Setting up GPU support
~~~~~~~~~~~~~~~~~~~~~~
If your Kubernetes cluster has Nvidia GPUs, ensure that:

1. The Nvidia GPU operator is installed (i.e., ``nvidia.com/gpu`` resource is available on each node) and ``nvidia`` is set as the default runtime for your container engine. See `Nvidia's installation guide <https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#install-nvidia-gpu-operator>`_ for more details.
2. Each node in your cluster is labelled with the GPU type. This labelling can be done by adding a label of the format ``skypilot.co/accelerators: <gpu_name>``, where the ``<gpu_name>`` is the lowercase name of the GPU. For example, a node with V100 GPUs must have a label :code:`skypilot.co/accelerators: v100`.
2. Each node in your cluster is labelled with the GPU type. This labelling can be done by adding a label of the format ``skypilot.co/accelerator: <gpu_name>``, where the ``<gpu_name>`` is the lowercase name of the GPU. For example, a node with V100 GPUs must have a label :code:`skypilot.co/accelerator: v100`.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move the tip and note below before 2. as it seems more relevant to the GPU operator part?

It might be better to keep the label close to the script we provide.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to keep the two steps next to each other since the basic requirements should be clear at a quick glance. Adding notes between them would clutter and make it harder to read. wdyt?

$ sky launch -y -c myserver --cloud kubernetes --port 8080 -- "python -m http.server 8080"

# Obtain the endpoint of the service
$ sky status --endpoint 8080 myserver
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I run this after my cluster is up, I got the following error:

sky status --endpoint 8080 myserver
RuntimeError: Port 8080 not exposed yet. If the cluster was recently started, please retry after a while. Additionally, make sure your Ingress is configured correctly.
To debug, run: kubectl describe ingress && kubectl describe ingressclass

I run the two commands suggested, but fail to see what should I look at in the output.

In the kubectl describe ingress, I do see the ingress setup:

kubectl describe ingress
Name:             myserver-084e-skypilot-ingress--8080
Labels:           <none>
Namespace:        default
Address:          10.166.0.4,10.166.0.5
Ingress Class:    nginx
Default backend:  <default>
Rules:
  Host        Path  Backends
  ----        ----  --------
  *           
              /skypilot/myserver-084e/8080(/|$)(.*)   myserver-084e-skypilot-service--8080:8080 (10.42.167.175:8080)
Annotations:  field.cattle.io/publicEndpoints:
                [{"addresses":["10.166.0.4","10.166.0.5"],"port":80,"protocol":"HTTP","serviceName":"default:myserver-084e-skypilot-service--8080","ingres...
              nginx.ingress.kubernetes.io/rewrite-target: /$2
              nginx.ingress.kubernetes.io/use-regex: true
Events:
  Type    Reason  Age                   From                      Message
  ----    ------  ----                  ----                      -------
  Normal  Sync    6m2s (x3 over 6m23s)  nginx-ingress-controller  Scheduled for sync
  Normal  Sync    6m2s (x3 over 6m23s)  nginx-ingress-controller  Scheduled for sync

However, the sky status --endpoint 8080 myserver still shows the same error.

I also tried kubectl describe services and seems the service for the nginx ingress is there:

Name:              myserver-084e-skypilot-service--8080
Namespace:         default
Labels:            parent=skypilot
Annotations:       <none>
Selector:          skypilot-cluster=myserver-084e
Type:              ClusterIP
IP Family Policy:  SingleStack
IP Families:       IPv4
IP:                10.43.98.142
IPs:               10.43.98.142
Port:              <unset>  8080/TCP
TargetPort:        8080/TCP
Endpoints:         10.42.167.175:8080
Session Affinity:  None
Events:            <none>

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just tried to install the ingress controller with the quick start's recommendation, but got an error.

helm upgrade --install ingress-nginx ingress-nginx \
  --repo https://kubernetes.github.io/ingress-nginx \
  --namespace ingress-nginx --create-namespace
WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /home/azureuser/.kube/config
WARNING: Kubernetes configuration file is world-readable. This is insecure. Location: /home/azureuser/.kube/config
Release "ingress-nginx" does not exist. Installing it now.
Error: Unable to continue with install: IngressClass "nginx" in namespace "" exists and cannot be imported into the current release: invalid ownership metadata; annotation validation error: key "meta.helm.sh/release-name" must equal "ingress-nginx": current value is "rke2-ingress-nginx"; annotation validation error: key "meta.helm.sh/release-namespace" must equal "ingress-nginx": current value is "kube-system"

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah looks like the default RKE nginx ingress doesn't have a service configured to access the ingress controller.

The Bare-Metal instructions on the nginx installation docs work here, tested on RKE2 cluster. I'll update the docs.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a tip - let me know if that helps!

…o k8s_docs_debuggingpage

# Conflicts:
#	docs/source/reference/kubernetes/kubernetes-setup.rst
@romilbhardwaj
Copy link
Collaborator Author

Thanks @Michaelvll! Ready for another look.

Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for updating this @romilbhardwaj! LGTM.

@romilbhardwaj romilbhardwaj merged commit a63a56a into master Mar 2, 2024
@romilbhardwaj romilbhardwaj deleted the k8s_docs_debuggingpage branch March 2, 2024 00:05
No Sign up for free to join this conversation on GitHub. Already have an account? No Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants