
Commit 3441512

Michaelvll, cg505, concretevitamin, romilbhardwaj, and zpoint authored
Merge master (#131)
* [perf] use uv for venv creation and pip install (#4414) * Revert "remove `uv` from runtime setup due to azure installation issue (#4401)" This reverts commit 0b20d56. * on azure, use --prerelease=allow to install azure-cli * use uv venv --seed * fix backwards compatibility * really fix backwards compatibility * use uv to set up controller dependencies * fix python 3.8 * lint * add missing file * update comment * split out azure-cli dep * fix lint for dependencies * use runpy.run_path rather than modifying sys.path * fix cloud dependency installation commands * lint * Update sky/utils/controller_utils.py Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com> --------- Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
* [Minor] README updates. (#4436) * [Minor] README touches. * update * update
* make --fast robust against credential or wheel updates (#4289) * add config_dict['config_hash'] output to write_cluster_config * fix docstring for write_cluster_config This used to be true, but since #2943, 'ray' is the only provisioner. Add other keys that are now present instead. * when using --fast, check if config_hash matches, and if not, provision * mock hashing method in unit test This is needed since some files in the fake file mounts don't actually exist, like the wheel path. * check config hash within provision with lock held * address other PR review comments * rename to skip_if_no_cluster_updates Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com> * add assert details Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com> * address PR comments and update docstrings * fix test * update docstrings Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com> * address PR comments * fix lint and tests * Update sky/backends/cloud_vm_ray_backend.py Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com> * refactor skip_if_no_cluster_update var * clarify comment * format exception --------- Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
* [k8s] Add resource limits only if they exist (#4440) Add limits only if they exist
* [robustness] cover some potential resource leakage cases (#4443) * if a newly-created cluster is missing from the cloud, wait before deleting Addresses #4431. * confirm cluster actually terminates before deleting from the db * avoid deleting cluster data outside the primary provision loop * tweaks * Apply suggestions from code review Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com> * use usage_intervals for new cluster detection get_cluster_duration will include the total duration of the cluster since its initial launch, while launched_at may be reset by sky launch on an existing cluster. So this is a more accurate method to check. * fix terminating/stopping state for Lambda and Paperspace * Revert "use usage_intervals for new cluster detection" This reverts commit aa6d2e9. * check cloud.STATUS_VERSION before calling query_instances * avoid try/catch when querying instances * update comments --------- Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
* smoke tests support storage mount only (#4446) * smoke tests support storage mount only * fix verify command * rename to only_mount
* [Feature] support spot pod on RunPod (#4447) * wip * wip * wip * wip * wip * wip * resolve comments * wip * wip * wip * wip * wip * wip --------- Co-authored-by: hwei <hwei@covariant.ai>
* use lazy import for runpod (#4451) Fixes runpod import issues introduced in #4447.
* [k8s] Fix show-gpus when running with incluster auth (#4452) * Add limits only if they exist * Fix incluster auth handling
* Not mutate azure dep list at runtime (#4457)
* add 1, 2, 4 size H100's to GCP (#4456) * add 1, 2, 4 size H100's to GCP * update
* Support buildkite CICD and restructure smoke tests (#4396) * event based smoke test * more event based smoke test * more test cases * more test cases with managed jobs * bug fix * bump up seconds * merge master and resolve conflict * more test case * support test_managed_jobs_pipeline_failed_setup * support test_managed_jobs_recovery_aws * manged job status * bug fix * test managed job cancel * test_managed_jobs_storage * more test cases * resolve pr comment * private member function * bug fix * restructure * fix import * buildkite config * fix stdout problem * update pipeline test * test again * smoke test for buildkite * remove unsupport cloud for now * merge branch 'reliable_smoke_test_more' * bug fix * bug fix * bug fix * test pipeline pre merge * build test * test again * trigger test * bug fix * generate pipeline * robust generate pipeline * refactor pipeline * remove runpod * hot fix to pass smoke test * random order * allow parameter * bug fix * bug fix * exclude lambda cloud * dynamic generate pipeline * fix pre-commit * format * support SUPPRESS_SENSITIVE_LOG * support env SKYPILOT_SUPPRESS_SENSITIVE_LOG to suppress debug log * support env SKYPILOT_SUPPRESS_SENSITIVE_LOG to suppress debug log * add backward_compatibility_tests to pipeline * pip install uv for backward compatibility test * import style * generate all cloud * resolve PR comment * update comment * naming fix * grammar correction * resolve PR comment * fix import * fix import * support gcp on pre merge test * no gcp test case for pre merge
* [k8s] Make node termination robust (#4469) * Add limits only if they exist * retry deletion * lint * lint * comments * lint
* [Catalog] Bump catalog schema version (#4470) * Bump catalog schema version * trigger CI
* [core] skip provider.availability_zone in the cluster config hash (#4463) skip provider.availability_zone in the cluster config hash
* remove sky jobs launch --fast (#4467) * remove sky jobs launch --fast The --fast behavior is now always enabled. This was unsafe before but since \#4289 it should be safe. We will remove the flag before 0.8.0 so that it never touches a stable version. sky launch still has the --fast flag. This flag is unsafe because it could cause setup to be skipped even though it should be re-run. In the managed jobs case, this is not an issue because we fully control the setup and know it will not change. * fix lint
* [docs] Change urls to docs.skypilot.co, add 404 page (#4413) * Add 404 page, change to docs.skypilot.co * lint
* [UX] Fix unnecessary OCI logging (#4476) Sync PR: fix-oci-logging-master Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
* [Example] PyTorch distributed training with minGPT (#4464) * Add example for distributed pytorch * update * Update examples/distributed-pytorch/README.md Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu> * Update examples/distributed-pytorch/README.md Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu> * Update examples/distributed-pytorch/README.md Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu> * Update examples/distributed-pytorch/README.md Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu> * Update examples/distributed-pytorch/README.md Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu> * Update examples/distributed-pytorch/README.md Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu> * Update examples/distributed-pytorch/README.md Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu> * Fix --------- Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
* Add tests for Azure spot instance (#4475) * verify azure spot instance * string style * echo * echo vm detail * bug fix * remove comment
* rename pre-merge test to quicktest-core (#4486) * rename to test core * rename file
* [k8s] support to use custom gpu resource name if it's not nvidia.com/gpu (#4337) * [k8s] support to use custom gpu resource name if it's not nvidia.com/gpu Signed-off-by: nkwangleiGIT <nkwanglei@126.com> * fix format issue Signed-off-by: nkwangleiGIT <nkwanglei@126.com> --------- Signed-off-by: nkwangleiGIT <nkwanglei@126.com>
* [k8s] Fix IPv6 ssh support (#4497) * Add limits only if they exist * Fix ipv6 support * Fix ipv6 support
* [Serve] Add and adopt least load policy as default poicy. (#4439) * [Serve] Add and adopt least load policy as default poicy. * Docs & smoke tests * error message for different lb policy * add minimal example * fix
* [Docs] Update logo in docs (#4500) * WIP updating Elisa logo; issues with light/dark modes * Fix SVG in navbar rendering by hardcoding SVG + defining text color in css * Update readme images * newline --------- Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
* Replace `len()` Zero Checks with Pythonic Empty Sequence Checks (#4298) * style: mainly replace len() comparisons with 0/1 with pythonic empty sequence checks * chore: more typings * use `df.empty` for dataframe * fix: more `df.empty` * format * revert partially * style: add back comments * style: format * refactor: `dict[str, str]` Co-authored-by: Tian Xia <cblmemo@gmail.com> --------- Co-authored-by: Tian Xia <cblmemo@gmail.com>
* [Docs] Fix logo file path (#4504) * Add limits only if they exist * rename
* [Storage] Show logs for storage mount (#4387) * commit for logging change * logger for storage * grammar * fix format * better comment * resolve copilot review * resolve PR comment * remove unuse var * Update sky/data/data_utils.py Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com> * resolve PR comment * update comment for get_run_timestamp * rename backend_util.get_run_timestamp to sky_logging.get_run_timestamp --------- Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>
* [Examples] Update Ollama setup commands (#4510) wip
* [OCI] Support OCI Object Storage (#4501) * OCI Object Storage Support * example yaml update * example update * add more example yaml * Support RClone-RPM pkg * Add smoke test * ver * smoke test * Resolve dependancy conflict between oci-cli and runpod * Use latest RClone version (v1.68.2) * minor optimize * Address review comments * typo * test * sync code with repo * Address review comments & more testing. * address one more comment
* [Jobs] Allowing to specify intermediate bucket for file upload (#4257) * debug * support workdir_bucket_name config on yaml file * change the match statement to if else due to mypy limit * pass mypy * yapf format fix * reformat * remove debug line * all dir to same bucket * private member function * fix mypy * support sub dir config to separate to different directory * rename and add smoke test * bucketname * support sub dir mount * private member for _bucket_sub_path and smoke test fix * support copy mount for sub dir * support gcs, s3 delete folder * doc * r2 remove_objects_from_sub_path * support azure remove directory and cos remove * doc string for remove_objects_from_sub_path * fix sky jobs subdir issue * test case update * rename to _bucket_sub_path * change the config schema * setter * bug fix and test update * delete bucket depends on user config or sky generated * add test case * smoke test bug fix * robust smoke test * fix comment * bug fix * set the storage manually * better structure * fix mypy * Update docs/source/reference/config.rst Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com> * Update docs/source/reference/config.rst Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com> * limit creation for bucket and delete sub dir only * resolve comment * Update docs/source/reference/config.rst Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com> * Update sky/utils/controller_utils.py Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com> * resolve PR comment * bug fix * bug fix * fix test case * bug fix * fix * fix test case * bug fix * support is_sky_managed param in config * pass param intermediate_bucket_is_sky_managed * resolve PR comment * Update sky/utils/controller_utils.py Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com> * hide bucket creation log * reset green color * rename is_sky_managed to _is_sky_managed * bug fix * retrieve _is_sky_managed from stores * propogate the log --------- Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>
* [Core] Deprecate LocalDockerBackend (#4516) Deprecate local docker backend
* [docs] Add newer examples for AI tutorial and distributed training (#4509) * Update tutorial and distributed training examples. * Add examples link * add rdvz
* [k8s] Fix L40 detection for nvidia GFD labels (#4511) Fix L40 detection
* [docs] Support OCI Object Storage (#4513) * Support OCI Object Storage * Add oci bucket for file_mount
* [Docs] Disable Kapa AI (#4518) Disable kapa
* [DigitalOcean] droplet integration (#3832) * init digital ocean droplet integration * abbreviate cloud name * switch to pydo * adjust polling logic and mount block storage to instance * filter by paginated * lint * sky launch, start, stop functional * fix credential file mounts, autodown works now * set gpu droplet image * cleanup * remove more tests * atomically destroy instance and block storage simulatenously * install docker * disable spot test * fix ip address bug for multinode * lint * patch ssh from job/serve controller * switch to EA slugs * do adaptor * lint * Update sky/clouds/do.py Co-authored-by: Tian Xia <cblmemo@gmail.com> * Update sky/clouds/do.py Co-authored-by: Tian Xia <cblmemo@gmail.com> * comment template * comment patch * add h100 test case * comment on instance name length * Update sky/clouds/do.py Co-authored-by: Tian Xia <cblmemo@gmail.com> * Update sky/clouds/service_catalog/do_catalog.py Co-authored-by: Tian Xia <cblmemo@gmail.com> * comment on max node char len * comment on weird azure import * comment acc price is included in instance price * fix return type * switch with do_utils * remove broad except * Update sky/provision/do/instance.py Co-authored-by: Tian Xia <cblmemo@gmail.com> * Update sky/provision/do/instance.py Co-authored-by: Tian Xia <cblmemo@gmail.com> * remove azure * comment on non_terminated_only * add open port debug message * wrap start instance api * use f-string * wrap stop * wrap instance down * assert credentials and check against all contexts * assert client is None * remove pending instances during instance restart * wrap rename * rename ssh key var * fix tags * add tags for block device * f strings for errors * support image ids * update do tests * only store head instance id * rename image slugs * add digital ocean alias * wait for docker to be available * update requirements and tests * increase docker timeout * lint * move tests * lint * patch test * lint * typo fix * fix typo * patch tests * fix tests * no_mark spot test * handle 2cpu serve tests * lint * lint * use logger.debug * fix none cred path * lint * handle get_cred path * pylint * patch for DO test_optimizer_dryruns.py * revert optimizer dryrun --------- Co-authored-by: Tian Xia <cblmemo@gmail.com> Co-authored-by: Ubuntu <ubuntu@ip-172-31-16-12.ec2.internal>
* [Docs] Refactor pod_config docs (#4427) * refactor pod_config docs * Update docs/source/reference/kubernetes/kubernetes-getting-started.rst Co-authored-by: Zongheng Yang <zongheng.y@gmail.com> * Update docs/source/reference/kubernetes/kubernetes-getting-started.rst Co-authored-by: Zongheng Yang <zongheng.y@gmail.com> --------- Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
* [OCI] Set default image to ubuntu LTS 22.04 (#4517) * set default gpu image to skypilot:gpu-ubuntu-2204 * add example * remove comment line * set cpu default image to 2204 * update change history
* [OCI] 1. Support specify OS with custom image id. 2. Corner case fix (#4524) * Support specify os type with custom image id. * trim space * nit * comment
* Update intermediate bucket related doc (#4521) * doc * Update docs/source/examples/managed-jobs.rst Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com> * Update docs/source/examples/managed-jobs.rst Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com> * Update docs/source/examples/managed-jobs.rst Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com> * Update docs/source/examples/managed-jobs.rst Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com> * Update docs/source/examples/managed-jobs.rst Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com> * Update docs/source/examples/managed-jobs.rst Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com> * add tip * minor changes --------- Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>
* [aws] cache user identity by 'aws configure list' (#4507) * [aws] cache user identity by 'aws configure list' Signed-off-by: Aylei <rayingecho@gmail.com> * refine get_user_identities docstring Signed-off-by: Aylei <rayingecho@gmail.com> * address review comments Signed-off-by: Aylei <rayingecho@gmail.com> --------- Signed-off-by: Aylei <rayingecho@gmail.com>
* [k8s] Add validation for pod_config #4206 (#4466) * [k8s] Add validation for pod_config #4206 Check pod_config when run 'sky check k8s' by using k8s api * update: check pod_config when launch check merged pod_config during launch using k8s api * fix test * ignore check failed when test with dryrun if there is no kube config in env, ignore ValueError when launch with dryrun. For now, we don't support check schema offline. * use deserialize api to check pod_config schema * test * create another api_client with no kubeconfig * test * update error message * update test * test * test * Update sky/backends/backend_utils.py --------- Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>
* [core] fix wheel timestamp check (#4488) Previously, we were only taking the max timestamp of all the subdirectories of the given directory. So the timestamp could be incorrect if only a file changed, and no directory changed. This fixes the issue by looking at all directories and files given by os.walk().
* [docs] Add image_id doc in task YAML for OCI (#4526) * Add image_id doc for OCI * nit * Update docs/source/reference/yaml-spec.rst Co-authored-by: Tian Xia <cblmemo@gmail.com> --------- Co-authored-by: Tian Xia <cblmemo@gmail.com>
* [UX] warning before launching jobs/serve when using a reauth required credentials (#4479) * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * Update sky/backends/cloud_vm_ray_backend.py Minor fix * Update sky/clouds/aws.py Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com> * wip * minor changes * wip --------- Co-authored-by: hong <hong@hongdeMacBook-Pro.local> Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>
* [GCP] Activate service account for storage and controller (#4529) * Activate service account for storage * disable logging if not using service account * Activate for controller as well. * revert controller activate * Add comments * format * fix smoke
* [OCI] Support reuse existing VCN for SkyServe (#4530) * Support reuse existing VCN for SkyServe * fix * remove unused import * format
* [docs] OCI: advanced configuration & add vcn_ocid (#4531) * Add vcn_ocid configuration * Update config.rst
* fix merge issues WIP * fix merging issues * fix imports * fix stores
---------
Signed-off-by: nkwangleiGIT <nkwanglei@126.com>
Signed-off-by: Aylei <rayingecho@gmail.com>
Co-authored-by: Christopher Cooper <cooperc@assemblesys.com>
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@berkeley.edu>
Co-authored-by: zpoint <zp0int@qq.com>
Co-authored-by: Hong <weih1121@qq.com>
Co-authored-by: hwei <hwei@covariant.ai>
Co-authored-by: Yika <yikaluo@assemblesys.com>
Co-authored-by: Seth Kimmel <seth.kimmel3@gmail.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Lei <nkwanglei@126.com>
Co-authored-by: Tian Xia <cblmemo@gmail.com>
Co-authored-by: Andy Lee <andylizf@outlook.com>
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>
Co-authored-by: Hysun He <hysunhe@foxmail.com>
Co-authored-by: Andrew Aikawa <asai@berkeley.edu>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-16-12.ec2.internal>
Co-authored-by: Aylei <rayingecho@gmail.com>
Co-authored-by: Chester Li <chaoleili2@gmail.com>
Co-authored-by: hong <hong@hongdeMacBook-Pro.local>
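
Editor's note on the `--fast` change in #4289 above: the approach described there is to record a hash of the generated cluster config at provision time and, on later launches with `--fast`, re-provision only when that hash no longer matches. A minimal sketch of the pattern, using hypothetical names (`config_hash`, `should_provision`) rather than SkyPilot's actual functions:

import hashlib
import json
from typing import Optional


def config_hash(cluster_config: dict) -> str:
    # Hash a canonical JSON dump of the rendered cluster config, so changes to
    # credentials, wheels, or provider settings change the digest. Per #4463,
    # volatile keys such as provider.availability_zone should be excluded
    # before hashing (omitted here for brevity).
    payload = json.dumps(cluster_config, sort_keys=True).encode('utf-8')
    return hashlib.sha256(payload).hexdigest()


def should_provision(cluster_config: dict,
                     recorded_hash: Optional[str]) -> bool:
    # Re-provision when no hash was recorded (new cluster) or the config has
    # changed since the last launch; otherwise --fast can skip provisioning.
    return recorded_hash is None or config_hash(cluster_config) != recorded_hash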
1 parent 4d9c1d5 commit 3441512

File tree: 179 files changed, +11472 -7330 lines changed


.buildkite/generate_pipeline.py

+252
@@ -0,0 +1,252 @@
"""
This script generates a Buildkite pipeline from test files.

The script will generate two pipelines:

tests/smoke_tests
├── test_*.py -> release pipeline
├── test_quick_tests_core.py -> run quick tests on PR before merging

run `PYTHONPATH=$(pwd)/tests:$PYTHONPATH python .buildkite/generate_pipeline.py`
to generate the pipeline for testing. The CI will run this script as a pre-step,
and use the generated pipeline to run the tests.

1. release pipeline, which runs all smoke tests by default, generates all
   smoke tests for all clouds.
2. pre-merge pipeline, which generates all smoke tests for all clouds,
   author should specify which clouds to run by setting env in the step.

We only have credentials for aws/azure/gcp/kubernetes(CLOUD_QUEUE_MAP and
SERVE_CLOUD_QUEUE_MAP) now, smoke tests for those clouds are generated, other
clouds are not supported yet, smoke tests for those clouds are not generated.
"""
import ast
import os
import random
from typing import Any, Dict, List, Optional

from conftest import cloud_to_pytest_keyword
from conftest import default_clouds_to_run
import yaml

DEFAULT_CLOUDS_TO_RUN = default_clouds_to_run
PYTEST_TO_CLOUD_KEYWORD = {v: k for k, v in cloud_to_pytest_keyword.items()}

QUEUE_GENERIC_CLOUD = 'generic_cloud'
QUEUE_GENERIC_CLOUD_SERVE = 'generic_cloud_serve'
QUEUE_KUBERNETES = 'kubernetes'
QUEUE_KUBERNETES_SERVE = 'kubernetes_serve'
# Only aws, gcp, azure, and kubernetes are supported for now.
# Other clouds do not have credentials.
CLOUD_QUEUE_MAP = {
    'aws': QUEUE_GENERIC_CLOUD,
    'gcp': QUEUE_GENERIC_CLOUD,
    'azure': QUEUE_GENERIC_CLOUD,
    'kubernetes': QUEUE_KUBERNETES
}
# Serve tests runs long, and different test steps usually requires locks.
# Its highly likely to fail if multiple serve tests are running concurrently.
# So we use a different queue that runs only one concurrent test at a time.
SERVE_CLOUD_QUEUE_MAP = {
    'aws': QUEUE_GENERIC_CLOUD_SERVE,
    'gcp': QUEUE_GENERIC_CLOUD_SERVE,
    'azure': QUEUE_GENERIC_CLOUD_SERVE,
    'kubernetes': QUEUE_KUBERNETES_SERVE
}

GENERATED_FILE_HEAD = ('# This is an auto-generated Buildkite pipeline by '
                       '.buildkite/generate_pipeline.py, Please do not '
                       'edit directly.\n')


def _get_full_decorator_path(decorator: ast.AST) -> str:
    """Recursively get the full path of a decorator."""
    if isinstance(decorator, ast.Attribute):
        return f'{_get_full_decorator_path(decorator.value)}.{decorator.attr}'
    elif isinstance(decorator, ast.Name):
        return decorator.id
    elif isinstance(decorator, ast.Call):
        return _get_full_decorator_path(decorator.func)
    raise ValueError(f'Unknown decorator type: {type(decorator)}')


def _extract_marked_tests(file_path: str) -> Dict[str, List[str]]:
    """Extract test functions and filter clouds using pytest.mark
    from a Python test file.

    We separate each test_function_{cloud} into different pipeline steps
    to maximize the parallelism of the tests via the buildkite CI job queue.
    This allows us to visualize the test results and rerun failures at the
    granularity of each test_function_{cloud}.

    If we make pytest --serve a job, it could contain dozens of test_functions
    and run for hours. This makes it hard to visualize the test results and
    rerun failures. Additionally, the parallelism would be controlled by pytest
    instead of the buildkite job queue.
    """
    with open(file_path, 'r', encoding='utf-8') as file:
        tree = ast.parse(file.read(), filename=file_path)

    for node in ast.walk(tree):
        for child in ast.iter_child_nodes(node):
            setattr(child, 'parent', node)

    function_cloud_map = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name.startswith('test_'):
            class_name = None
            if hasattr(node, 'parent') and isinstance(node.parent,
                                                      ast.ClassDef):
                class_name = node.parent.name

            clouds_to_include = []
            clouds_to_exclude = []
            is_serve_test = False
            for decorator in node.decorator_list:
                if isinstance(decorator, ast.Call):
                    # We only need to consider the decorator with no arguments
                    # to extract clouds.
                    continue
                full_path = _get_full_decorator_path(decorator)
                if full_path.startswith('pytest.mark.'):
                    assert isinstance(decorator, ast.Attribute)
                    suffix = decorator.attr
                    if suffix.startswith('no_'):
                        clouds_to_exclude.append(suffix[3:])
                    else:
                        if suffix == 'serve':
                            is_serve_test = True
                            continue
                        if suffix not in PYTEST_TO_CLOUD_KEYWORD:
                            # This mark does not specify a cloud, so we skip it.
                            continue
                        clouds_to_include.append(
                            PYTEST_TO_CLOUD_KEYWORD[suffix])
            clouds_to_include = (clouds_to_include if clouds_to_include else
                                 DEFAULT_CLOUDS_TO_RUN)
            clouds_to_include = [
                cloud for cloud in clouds_to_include
                if cloud not in clouds_to_exclude
            ]
            cloud_queue_map = SERVE_CLOUD_QUEUE_MAP if is_serve_test else CLOUD_QUEUE_MAP
            final_clouds_to_include = [
                cloud for cloud in clouds_to_include if cloud in cloud_queue_map
            ]
            if clouds_to_include and not final_clouds_to_include:
                print(f'Warning: {file_path}:{node.name} '
                      f'is marked to run on {clouds_to_include}, '
                      f'but we do not have credentials for those clouds. '
                      f'Skipped.')
                continue
            if clouds_to_include != final_clouds_to_include:
                excluded_clouds = set(clouds_to_include) - set(
                    final_clouds_to_include)
                print(
                    f'Warning: {file_path}:{node.name} '
                    f'is marked to run on {clouds_to_include}, '
                    f'but we only have credentials for {final_clouds_to_include}. '
                    f'clouds {excluded_clouds} are skipped.')
            function_name = (f'{class_name}::{node.name}'
                             if class_name else node.name)
            function_cloud_map[function_name] = (final_clouds_to_include, [
                cloud_queue_map[cloud] for cloud in final_clouds_to_include
            ])
    return function_cloud_map


def _generate_pipeline(test_file: str) -> Dict[str, Any]:
    """Generate a Buildkite pipeline from test files."""
    steps = []
    function_cloud_map = _extract_marked_tests(test_file)
    for test_function, clouds_and_queues in function_cloud_map.items():
        for cloud, queue in zip(*clouds_and_queues):
            step = {
                'label': f'{test_function} on {cloud}',
                'command': f'pytest {test_file}::{test_function} --{cloud}',
                'agents': {
                    # Separate agent pool for each cloud.
                    # Since they require different amount of resources and
                    # concurrency control.
                    'queue': queue
                },
                'if': f'build.env("{cloud}") == "1"'
            }
            steps.append(step)
    return {'steps': steps}


def _dump_pipeline_to_file(yaml_file_path: str,
                           pipelines: List[Dict[str, Any]],
                           extra_env: Optional[Dict[str, str]] = None):
    default_env = {'LOG_TO_STDOUT': '1', 'PYTHONPATH': '${PYTHONPATH}:$(pwd)'}
    if extra_env:
        default_env.update(extra_env)
    with open(yaml_file_path, 'w', encoding='utf-8') as file:
        file.write(GENERATED_FILE_HEAD)
        all_steps = []
        for pipeline in pipelines:
            all_steps.extend(pipeline['steps'])
        # Shuffle the steps to avoid flakyness, consecutive runs of the same
        # kind of test may fail for requiring locks on the same resources.
        random.shuffle(all_steps)
        final_pipeline = {'steps': all_steps, 'env': default_env}
        yaml.dump(final_pipeline, file, default_flow_style=False)


def _convert_release(test_files: List[str]):
    yaml_file_path = '.buildkite/pipeline_smoke_tests_release.yaml'
    output_file_pipelines = []
    for test_file in test_files:
        print(f'Converting {test_file} to {yaml_file_path}')
        pipeline = _generate_pipeline(test_file)
        output_file_pipelines.append(pipeline)
        print(f'Converted {test_file} to {yaml_file_path}\n\n')
    # Enable all clouds by default for release pipeline.
    _dump_pipeline_to_file(yaml_file_path,
                           output_file_pipelines,
                           extra_env={cloud: '1' for cloud in CLOUD_QUEUE_MAP})


def _convert_quick_tests_core(test_files: List[str]):
    yaml_file_path = '.buildkite/pipeline_smoke_tests_quick_tests_core.yaml'
    output_file_pipelines = []
    for test_file in test_files:
        print(f'Converting {test_file} to {yaml_file_path}')
        # We want enable all clouds by default for each test function
        # for pre-merge. And let the author controls which clouds
        # to run by parameter.
        pipeline = _generate_pipeline(test_file)
        pipeline['steps'].append({
            'label': 'Backward compatibility test',
            'command': 'bash tests/backward_compatibility_tests.sh',
            'agents': {
                'queue': 'back_compat'
            }
        })
        output_file_pipelines.append(pipeline)
        print(f'Converted {test_file} to {yaml_file_path}\n\n')
    _dump_pipeline_to_file(yaml_file_path,
                           output_file_pipelines,
                           extra_env={'SKYPILOT_SUPPRESS_SENSITIVE_LOG': '1'})


def main():
    test_files = os.listdir('tests/smoke_tests')
    release_files = []
    quick_tests_core_files = []
    for test_file in test_files:
        if not test_file.startswith('test_'):
            continue
        test_file_path = os.path.join('tests/smoke_tests', test_file)
        if "test_quick_tests_core" in test_file:
            quick_tests_core_files.append(test_file_path)
        else:
            release_files.append(test_file_path)

    _convert_release(release_files)
    _convert_quick_tests_core(quick_tests_core_files)


if __name__ == '__main__':
    main()
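
For concreteness, here is a small, self-contained sketch (not part of the commit) of the output this generator produces for one marked test. The test name and file path are hypothetical; the step structure mirrors the dict built in _generate_pipeline above, and the queue and env gate follow CLOUD_QUEUE_MAP.

# Suppose tests/smoke_tests/test_example.py contains a test marked only for AWS:
#
#     @pytest.mark.aws
#     def test_example_cluster():
#         ...
#
# The generator would emit one Buildkite step on the generic_cloud queue,
# gated on the aws build env flag:
import yaml

step = {
    'label': 'test_example_cluster on aws',
    'command': ('pytest tests/smoke_tests/test_example.py::'
                'test_example_cluster --aws'),
    'agents': {
        'queue': 'generic_cloud'
    },
    'if': 'build.env("aws") == "1"',
}
print(yaml.dump({'steps': [step]}, default_flow_style=False))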

.pre-commit-config.yaml

+2 -3

@@ -24,7 +24,6 @@ repos:
 args:
 - "--sg=build/**" # Matches "${ISORT_YAPF_EXCLUDES[@]}"
 - "--sg=sky/skylet/providers/ibm/**"
-files: "^(sky|tests|examples|llm|docs)/.*" # Only match these directories
 # Second isort command
 - id: isort
 name: isort (IBM specific)
@@ -56,8 +55,8 @@ repos:
 hooks:
 - id: yapf
 name: yapf
-exclude: (build/.*|sky/skylet/providers/ibm/.*) # Matches exclusions from the script
-args: ['--recursive', '--parallel'] # Only necessary flags
+exclude: (sky/skylet/providers/ibm/.*) # Matches exclusions from the script
+args: ['--recursive', '--parallel', '--in-place'] # Only necessary flags
 additional_dependencies: [toml==0.10.2]
 
 - repo: https://github.com/pylint-dev/pylint

README.md

+17 -17

@@ -6,7 +6,7 @@
 </p>
 
 <p align="center">
-<a href="https://skypilot.readthedocs.io/en/latest/">
+<a href="https://docs.skypilot.co/">
 <img alt="Documentation" src="https://readthedocs.org/projects/skypilot/badge/?version=latest">
 </a>
 
@@ -43,7 +43,7 @@
 <summary>Archived</summary>
 
 - [Jul 2024] [**Finetune**](./llm/llama-3_1-finetuning/) and [**serve**](./llm/llama-3_1/) **Llama 3.1** on your infra
-- [Apr 2024] Serve and finetune [**Llama 3**](https://skypilot.readthedocs.io/en/latest/gallery/llms/llama-3.html) on any cloud or Kubernetes: [**example**](./llm/llama-3/)
+- [Apr 2024] Serve and finetune [**Llama 3**](https://docs.skypilot.co/en/latest/gallery/llms/llama-3.html) on any cloud or Kubernetes: [**example**](./llm/llama-3/)
 - [Mar 2024] Serve and deploy [**Databricks DBRX**](https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm) on your infra: [**example**](./llm/dbrx/)
 - [Feb 2024] Speed up your LLM deployments with [**SGLang**](https://github.com/sgl-project/sglang) for 5x throughput on SkyServe: [**example**](./llm/sglang/)
 - [Dec 2023] Using [**LoRAX**](https://github.com/predibase/lorax) to serve 1000s of finetuned LLMs on a single instance in the cloud: [**example**](./llm/lorax/)
@@ -60,17 +60,17 @@
 SkyPilot is a framework for running AI and batch workloads on any infra, offering unified execution, high cost savings, and high GPU availability.
 
 SkyPilot **abstracts away infra burdens**:
-- Launch [dev clusters](https://skypilot.readthedocs.io/en/latest/examples/interactive-development.html), [jobs](https://skypilot.readthedocs.io/en/latest/examples/managed-jobs.html), and [serving](https://skypilot.readthedocs.io/en/latest/serving/sky-serve.html) on any infra
+- Launch [dev clusters](https://docs.skypilot.co/en/latest/examples/interactive-development.html), [jobs](https://docs.skypilot.co/en/latest/examples/managed-jobs.html), and [serving](https://docs.skypilot.co/en/latest/serving/sky-serve.html) on any infra
 - Easy job management: queue, run, and auto-recover many jobs
 
 SkyPilot **supports multiple clusters, clouds, and hardware** ([the Sky](https://arxiv.org/abs/2205.07147)):
 - Bring your reserved GPUs, Kubernetes clusters, or 12+ clouds
-- [Flexible provisioning](https://skypilot.readthedocs.io/en/latest/examples/auto-failover.html) of GPUs, TPUs, CPUs, with auto-retry
+- [Flexible provisioning](https://docs.skypilot.co/en/latest/examples/auto-failover.html) of GPUs, TPUs, CPUs, with auto-retry
 
 SkyPilot **cuts your cloud costs & maximizes GPU availability**:
-* [Autostop](https://skypilot.readthedocs.io/en/latest/reference/auto-stop.html): automatic cleanup of idle resources
-* [Managed Spot](https://skypilot.readthedocs.io/en/latest/examples/managed-jobs.html): 3-6x cost savings using spot instances, with preemption auto-recovery
-* [Optimizer](https://skypilot.readthedocs.io/en/latest/examples/auto-failover.html): 2x cost savings by auto-picking the cheapest & most available infra
+* [Autostop](https://docs.skypilot.co/en/latest/reference/auto-stop.html): automatic cleanup of idle resources
+* [Managed Spot](https://docs.skypilot.co/en/latest/examples/managed-jobs.html): 3-6x cost savings using spot instances, with preemption auto-recovery
+* [Optimizer](https://docs.skypilot.co/en/latest/examples/auto-failover.html): 2x cost savings by auto-picking the cheapest & most available infra
 
 SkyPilot supports your existing GPU, TPU, and CPU workloads, with no code changes.
 
@@ -79,13 +79,13 @@ Install with pip:
 # Choose your clouds:
 pip install -U "skypilot[kubernetes,aws,gcp,azure,oci,lambda,runpod,fluidstack,paperspace,cudo,ibm,scp]"
 ```
-To get the latest features and fixes, use the nightly build or [install from source](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html):
+To get the latest features and fixes, use the nightly build or [install from source](https://docs.skypilot.co/en/latest/getting-started/installation.html):
 ```bash
 # Choose your clouds:
 pip install "skypilot-nightly[kubernetes,aws,gcp,azure,oci,lambda,runpod,fluidstack,paperspace,cudo,ibm,scp]"
 ```
 
-[Current supported infra](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html) (Kubernetes; AWS, GCP, Azure, OCI, Lambda Cloud, Fluidstack, RunPod, Cudo, Paperspace, Cloudflare, Samsung, IBM, VMware vSphere):
+[Current supported infra](https://docs.skypilot.co/en/latest/getting-started/installation.html) (Kubernetes; AWS, GCP, Azure, OCI, Lambda Cloud, Fluidstack, RunPod, Cudo, Paperspace, Cloudflare, Samsung, IBM, VMware vSphere):
 <p align="center">
 <picture>
 <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/skypilot-org/skypilot/master/docs/source/images/cloud-logos-dark.png">
@@ -95,16 +95,16 @@ pip install "skypilot-nightly[kubernetes,aws,gcp,azure,oci,lambda,runpod,fluidst
 
 
 ## Getting Started
-You can find our documentation [here](https://skypilot.readthedocs.io/en/latest/).
-- [Installation](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html)
-- [Quickstart](https://skypilot.readthedocs.io/en/latest/getting-started/quickstart.html)
-- [CLI reference](https://skypilot.readthedocs.io/en/latest/reference/cli.html)
+You can find our documentation [here](https://docs.skypilot.co/).
+- [Installation](https://docs.skypilot.co/en/latest/getting-started/installation.html)
+- [Quickstart](https://docs.skypilot.co/en/latest/getting-started/quickstart.html)
+- [CLI reference](https://docs.skypilot.co/en/latest/reference/cli.html)
 
 ## SkyPilot in 1 Minute
 
 A SkyPilot task specifies: resource requirements, data to be synced, setup commands, and the task commands.
 
-Once written in this [**unified interface**](https://skypilot.readthedocs.io/en/latest/reference/yaml-spec.html) (YAML or Python API), the task can be launched on any available cloud. This avoids vendor lock-in, and allows easily moving jobs to a different provider.
+Once written in this [**unified interface**](https://docs.skypilot.co/en/latest/reference/yaml-spec.html) (YAML or Python API), the task can be launched on any available cloud. This avoids vendor lock-in, and allows easily moving jobs to a different provider.
 
 Paste the following into a file `my_task.yaml`:
 
@@ -135,7 +135,7 @@ Prepare the workdir by cloning:
 git clone https://github.com/pytorch/examples.git ~/torch_examples
 ```
 
-Launch with `sky launch` (note: [access to GPU instances](https://skypilot.readthedocs.io/en/latest/cloud-setup/quota.html) is needed for this example):
+Launch with `sky launch` (note: [access to GPU instances](https://docs.skypilot.co/en/latest/cloud-setup/quota.html) is needed for this example):
 ```bash
 sky launch my_task.yaml
 ```
@@ -152,10 +152,10 @@ SkyPilot then performs the heavy-lifting for you, including:
 </p>
 
 
-Refer to [Quickstart](https://skypilot.readthedocs.io/en/latest/getting-started/quickstart.html) to get started with SkyPilot.
+Refer to [Quickstart](https://docs.skypilot.co/en/latest/getting-started/quickstart.html) to get started with SkyPilot.
 
 ## More Information
-To learn more, see [Concept: Sky Computing](https://docs.skypilot.co/en/latest/sky-computing.html), [SkyPilot docs](https://skypilot.readthedocs.io/en/latest/), and [SkyPilot blog](https://blog.skypilot.co/).
+To learn more, see [Concept: Sky Computing](https://docs.skypilot.co/en/latest/sky-computing.html), [SkyPilot docs](https://docs.skypilot.co/en/latest/), and [SkyPilot blog](https://blog.skypilot.co/).
 
 <!-- Keep this section in sync with index.rst in SkyPilot Docs -->
 Runnable examples:

docs/requirements-docs.txt

+1

@@ -11,6 +11,7 @@ sphinx-autobuild==2021.3.14
 sphinx-autodoc-typehints==1.25.2
 sphinx-book-theme==1.1.0
 sphinx-togglebutton==0.3.2
+sphinx-notfound-page==1.0.4
 sphinxcontrib-applehelp==1.0.7
 sphinxcontrib-devhelp==1.0.5
 sphinxcontrib-googleanalytics==0.4
