Releases: skypilot-org/skypilot

SkyPilot v0.9.1

24 Apr 18:04
1ffe585

SkyPilot v0.9.1: API Server Architecture, Web Dashboard, Faster Storage, Improved Configuration and more!

We're excited to announce the release of SkyPilot v0.9.1! This update brings major improvements to SkyPilot, making it faster, more powerful and flexible for production-ready deployment.

Highlights

Client-Server Architecture

The new client-server model transforms SkyPilot from a single-user system into a scalable, multi-user platform, making it easier for individuals and teams to run and manage their workloads.

  • Unified view and management: Get a single view of all running clusters and jobs across the organization and all your infrastructure.
  • Integrate with workflow orchestrators: SkyPilot state is centralized on the API server and no longer needs to be maintained in orchestrators like Airflow.
  • Multi-tenancy: Share clusters, jobs, and services securely among teammates.

More: Docs, Blog

Web dashboard

SkyPilot has a new dashboard! Easily view and manage your clusters, jobs and logs.

Access it with sky dashboard.

New configuration system

SkyPilot now supports specifying configuration at various levels: CLI, SkyPilot YAML, project-level config, client-level global config and server-side config.

You can now have a project configuration storing default values for all jobs in a project, a user configuration to apply globally to all projects and Task YAML overrides for specific jobs.
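
To illustrate how layered configuration of this kind can resolve, here is a minimal Python sketch of a recursive merge. It is not SkyPilot's implementation: the function name merge_layers, the sample values, and the assumed precedence (later layers win, with the task-level override applied last) are illustrative assumptions.

```python
from copy import deepcopy


def merge_layers(*layers):
    """Deep-merge config dicts; later layers override earlier ones."""
    merged = {}
    for layer in layers:
        _merge_into(merged, layer)
    return merged


def _merge_into(base, override):
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(base.get(key), dict):
            # Both sides are mappings: merge key by key.
            _merge_into(base[key], value)
        else:
            # Scalars, lists, or new keys: the override replaces the base.
            base[key] = deepcopy(value)


# Hypothetical layers, lowest to highest priority (the ordering is an assumption).
server_config = {"jobs": {"controller": {"autostop": {"idle_minutes": 10}}}}
user_config = {"jobs": {"bucket": "s3://my-bucket"}}
project_config = {"jobs": {"controller": {"autostop": {"idle_minutes": 5}}}}
task_override = {"jobs": {"controller": {"autostop": {"down": True}}}}

effective = merge_layers(server_config, user_config, project_config, task_override)
print(effective)
```

In this sketch a nested key is overridden individually, so a project can change idle_minutes without discarding the bucket set in the user configuration.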

New mount_cached storage - 9.6x faster checkpointing

New storage mode mount_cached uses the local disk as a cache for cloud storage buckets. It boosts GPU utilization by making cloud I/O asynchronous.

file_mounts:
  /checkpoints:
    source: gs://my-checkpoints-bucket
    mode: MOUNT_CACHED  # Will asynchronously upload all writes to the bucket

More: Docs, Blog
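
The idea behind an asynchronous, cache-backed mount can be sketched as a toy write-back cache. This is a conceptual model only, not how mount_cached is implemented: the WriteBackCache class, the in-memory dictionaries standing in for the local disk and the bucket, and the flush method are all hypothetical.

```python
import queue
import threading


class WriteBackCache:
    """Toy write-back cache: writes return immediately, uploads run in a worker thread."""

    def __init__(self, upload_fn):
        self._local = {}                 # stands in for the local-disk cache
        self._pending = queue.Queue()    # paths waiting to be uploaded
        self._upload_fn = upload_fn
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def write(self, path, data):
        self._local[path] = data         # fast local write
        self._pending.put(path)          # schedule the asynchronous upload

    def _drain(self):
        while True:
            path = self._pending.get()
            self._upload_fn(path, self._local[path])
            self._pending.task_done()

    def flush(self):
        """Block until every queued write has been uploaded."""
        self._pending.join()


bucket = {}  # hypothetical stand-in for a cloud bucket
cache = WriteBackCache(upload_fn=bucket.__setitem__)
cache.write("/checkpoints/step_100.pt", b"weights")
cache.flush()
print(sorted(bucket))
```

The caller of write never waits on the upload, which is why a training loop that checkpoints this way can keep the GPU busy while data drains to the bucket in the background.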

New cloud: Nebius

SkyPilot now supports Nebius cloud! Getting started is easy:

$ sky check nebius
$ sky launch --gpus H200:8 --cloud nebius

ARM instance support - run SkyPilot on GH200s, GB200s, and more!

New native images for ARM instances allow you to run SkyPilot on GH200s and GB200s on Lambda Cloud, GCP, or your own Kubernetes clusters! (#4835)

What's new

CLI & Core interfaces

  • sky CLI now returns non-zero exit code on launch/exec/logs/jobs launch/jobs logs failures (#4846)
    • This improves scriptability with sky CLI in automated workflows.
  • sky check now separately checks storage and compute capabilities (#4996, #4977)
  • New --all option for sky jobs queue to show all jobs (#4923)
  • resources.gpus can now be used to alias resources.accelerators in the SkyPilot YAML (#5207)
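
Non-zero exit codes make it straightforward to gate later steps of an automated workflow on the result of an earlier one. The sketch below shows the pattern in Python; run_step is a hypothetical helper, and the commands are stand-ins so the example runs without SkyPilot installed.

```python
import subprocess
import sys


def run_step(cmd):
    """Run a command and report whether it exited with status 0."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        print(f"step failed with exit code {result.returncode}: {' '.join(cmd)}")
    return result.returncode == 0


# Stand-in commands; in a real pipeline these would be sky CLI invocations
# such as ["sky", "launch", ...].
ok_cmd = [sys.executable, "-c", "raise SystemExit(0)"]
failing_cmd = [sys.executable, "-c", "raise SystemExit(3)"]

succeeded = run_step(ok_cmd)
failed = not run_step(failing_cmd)
```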

Managed Jobs

  • Multiple users can now share the same jobs controller (#4733)
  • Autostop and autodown settings for the jobs controller can now be customized (#5182)
    # ~/.sky/config.yaml
    jobs:
      controller:
        # autostop: false  # to disable completely
        autostop:
          idle_minutes: 5
          down: true
    
  • See other users' jobs with sky jobs queue -u when using a shared controller (#4787)
  • Access to cloud object storage is no longer necessary for using file mounts or workdir in managed jobs. (#4708)
    • Running managed jobs on Kubernetes no longer requires cloud access.

Storage

  • New mode: mount_cached (#4369)
    • This mode is optimized for checkpointing large models
    • It asynchronously uploads the cached directory to the cloud storage bucket, increasing GPU utilization.
  • Fix issue with openrsync on macOS 15 causing upload failures (#5196)
  • .gitignore handling is now more robust (#4988)
  • Fix exclusion for AWS bucket upload (#5128)

Kubernetes

  • Revamped /dev/fuse access mechanism on k8s (#5028)
    • We no longer need to request the smarter-devices-fuse resource, making SkyPilot FUSE mounting compatible with autoscaling clusters.
  • B200 GPUs are now supported on GKE (#5102)
  • Scale-to-zero autoscaling is now supported on GKE (#4935)
    • SkyPilot can now inspect the node pools available on scale-to-zero clusters before provisioning.
    • This allows SkyPilot to intelligently filter out clusters that cannot provision the requested GPU type.
  • sky check now detects and hints for unlabeled GPU nodes on GKE (#5065)
  • GPU names are now case-insensitive; numbers-only name formats are now supported (#4756, #4925)
  • Fixed fractional CPU support when using <1 CPU core (#4707)
  • Fix node filtering when provisioning multiple GPUs (#4930)
  • initContainers can now be overridden through pod_config (#5247)
  • Instructions on mounting NFS volumes (#4951)
  • GPU labelling script can now use custom context names (#5072)
  • Fixed a bug where clusters from stale contexts could not be cleaned up (#4980)

Backend

  • New Client-Server Architecture (#4660)
    • This allows SkyPilot to be deployed as a remote service shared by multiple users.
  • Fixed conda support when using python 3.12 (#4035)
  • sky exec now waits for the cluster to be started (#4867)
  • sky local up --ips now supports specifying sudo password (#5030)
  • Clouds with expired credentials are now automatically excluded from failover (#5015)

SkyServe

  • New Spot/On-demand Policy: dynamic_fallback (#4628)
    • New spot_placer field can be set to dynamic_fallback to let SkyPilot automatically switch from spot to on-demand instances if spot instances are not available.
    • More details in paper
  • Fixed: any_of field order issue causing version bump to not work (#4978)
  • Fixed: LiveError on controller (#4995)
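
The fallback idea behind dynamic_fallback can be sketched in a few lines of Python. This is a deliberately simplified model, not the actual policy (see the paper referenced above for that): plan_replicas and its parameters are hypothetical, and it assumes spot capacity is always preferred when available.

```python
def plan_replicas(target_replicas, spot_available):
    """Split a replica target into (spot, on_demand) counts.

    Spot capacity is used first; any shortfall is covered by on-demand.
    """
    if target_replicas < 0 or spot_available < 0:
        raise ValueError("counts must be non-negative")
    spot = min(target_replicas, spot_available)
    on_demand = target_replicas - spot
    return spot, on_demand


all_spot = plan_replicas(target_replicas=4, spot_available=6)
partial = plan_replicas(target_replicas=4, spot_available=1)
no_spot = plan_replicas(target_replicas=4, spot_available=0)
```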

Cloud Support

  • New cloud: Nebius (#4573, #4838)
  • GCP
    • TPU v6e is now supported on GKE clusters (#4986)
    • VPCs from different projects can be used (#5143)
    • Newer instance types (e.g., a3-highgpu-8g) can now be directly selected from the CLI with -t flag (#5120)
  • RunPod
    • Custom docker images with non-root user are now supported (#4683)
  • Lambda
    • New regions: us-east-3 and australia-east-1 (#4703, #4738)
    • Ports can now be opened on Lambda VMs (#5124)
  • Fluidstack: NVLINK GPUs are now supported (#3954)
  • IBM: new fetcher for IBM catalog (#5003)
  • Cloudflare R2: fixed upload issues when using new awscli versions (#5282)

New Examples and Tutorials

⚠️ Deprecations and removals

Removed

  • Env vars starting with SKY_ are no longer supported. Use SKYPILOT_ env vars instead.
  • Old services from 0.7.0 (before #4439) may need to be stopped and restarted.
  • kubernetes is no longer a valid region name. Use the Kubernetes context name to specify a Kubernetes cluster if required.
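
When auditing scripts for the removed SKY_ prefix, a small check like the following Python sketch can help. The helper find_legacy_sky_vars is hypothetical, and the suggested replacement (swapping the prefix for SKYPILOT_) is an assumption; confirm each name against the documented list of supported variables.

```python
def find_legacy_sky_vars(environ):
    """Return names that use the removed SKY_ prefix, with a suggested replacement."""
    suggestions = {}
    for name in environ:
        # "SKYPILOT_..." does not match, because its fourth character is "P", not "_".
        if name.startswith("SKY_"):
            suggestions[name] = "SKYPILOT_" + name[len("SKY_"):]
    return suggestions


# Illustrative sample; pass os.environ to check a real environment.
sample_env = {"SKY_NODE_RANK": "0", "SKYPILOT_NUM_NODES": "2", "PATH": "/usr/bin"}
legacy = find_legacy_sky_vars(sample_env)
print(legacy)
```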

Deprecated

  • experimental.config_overrides has been deprecated. Use the config field instead.

Migration guide

SkyPilot 0.9.1 introduces the asynchronous execution model, which may cause compatibility issues with user programs using SkyPilot SDKs <=0.8.1.

Refer to the migration guide to upgrade your code.

TL;DR: Wrap all SkyPilot SDK function calls (except tail_logs) with sky.stream_and_get() to make your program behave mostly the same as before:

# <= 0.8.1
job_id, handle = sky.launch(task)
# 0.9.1
job_id, handle = sky.stream_and_get(sky.launch(task))
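
To see why the wrapper is needed, it can help to model the asynchronous pattern with a toy example. The Python sketch below does not use the SkyPilot SDK; submit, wait_for, and fake_launch are hypothetical names, and it assumes (as the wrapper pattern above suggests) that a call now returns a reference to a pending request rather than the result itself.

```python
from concurrent.futures import ThreadPoolExecutor

_executor = ThreadPoolExecutor(max_workers=2)
_requests = {}


def submit(fn, *args):
    """Start work in the background and return a request ID immediately."""
    request_id = f"req-{len(_requests) + 1}"
    _requests[request_id] = _executor.submit(fn, *args)
    return request_id


def wait_for(request_id):
    """Block until the request finishes and return its result."""
    return _requests[request_id].result()


def fake_launch(task_name):
    # Hypothetical stand-in for a long-running operation.
    return 1, f"handle-for-{task_name}"


request_id = submit(fake_launch, "train")   # returns at once
job_id, handle = wait_for(request_id)       # blocks until the result is ready
```

Code that unpacked the return value directly would now receive the request reference instead of the result, which is the compatibility issue the migration guide addresses.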

Thanks to all contributors!

New contributors: @kyuds, @BorenTsai, @funkypenguin, @JiangJiaWei1103, @SalikovAlex, @flaviomartins, @ajay, @bradhilton, @SeungjinYang, @eltociear, @vvidovic, @KennBro, @DanielZhangQD

Many thanks to all contributors who contributed to this release!

Contributors: @aylei, @zpoint, @SeungjinYang, @cg505, @michaelvl...

SkyPilot v0.8.1

09 Apr 00:56
1ea3b4d

This patch release is a minor bump over v0.8.0 to get you the latest fixes as soon as possible.

  • Pin wheel<0.46.0 to mitigate build errors when launching clusters in environments with wheel>=0.46.0 (#5153)

Stay tuned for a major upgrade coming up in v0.9.0!

SkyPilot v0.8.0

12 Feb 23:13
c2c49a6

SkyPilot v0.8.0: Faster Managed Jobs, Faster Provisioning, Digital Ocean and Vast support, DeepSeek R1 recipes and more!

We’re thrilled to release SkyPilot v0.8.0! This update makes SkyPilot faster and more robust, with major improvements to Managed Jobs, Kubernetes support, and new cloud integrations.

Highlights

  • Faster Managed Jobs: 3x faster job submission, controller uses 37% less memory, and support for 2000+ concurrent jobs
  • Faster Provisioning: Kubernetes provisioning is 4x faster — provisioning a GPU cluster with 200 nodes takes under 90 seconds. sky launch on existing clusters is 5x faster when using --fast flag.
  • Intermediate buckets for managed jobs: bring your own buckets to be used as intermediate storage for managed jobs.
    # ~/.sky/config.yaml
    jobs:
      bucket: s3://my-bucket
    
  • Exciting new features in SkyServe:
    • SkyServe load balancer now supports TLS via HTTPS
    • New load_balancing_policy field to choose from multiple policies (round_robin, least_load)
    • Replica can now expose multiple ports
  • New clouds: Digital Ocean and Vast
  • New LLM Recipes: DeepSeek R1 and Janus, minGPT with PyTorch Distributed

Managed Jobs

  • Managed jobs scheduler has been reworked: 3x faster, uses 37% less memory and can support up to 2000 jobs running simultaneously (#4318, #4485, #4341)
  • Brand new look for the managed jobs dashboard, with new filters, log download, and failover history (#4253, #4644, #4638)
  • You can now bring your own bucket to act as the intermediate storage for managed jobs (#4257)
    • If no intermediate bucket is specified, we now create one bucket per job instead of one per file_mount/workdir.
  • sky jobs logs has a new flag --sync-down to download logs to local machine (#4527)
  • When fetching managed jobs logs, SkyPilot will autostart the jobs controller if it is not running (#4380)
  • Robustness of managed jobs is greatly improved (#4247, #4283, #4562, #4602, #4615)

Backend

  • sky launch on existing clusters is 5x faster when using --fast flag. We have reworked the provisioning logic to be more efficient when reusing clusters (#4328, #4289)
  • We now use uv under the hood for 3x faster setup phase (#4414)
  • Beefed up resource leak protection (#4443, #4267)
  • Skylet scheduler is 2x faster (#4264)
  • New remote_identity: NO_UPLOAD option to skip uploading credentials to the remote VM (#4307)
  • Other robustness improvements (#4227, #4290, #4310, #4390, #4488)

Kubernetes

  • Multi-node setup is now up to 4x faster: provisioning a GPU cluster with 200 nodes takes under 90 seconds (#4297, #4240, #4393)
  • TPUs (Single-host) on GKE are now supported on fixed and autoscaling node pools (#3947)
  • sky check now shows enabled contexts (#4587)
  • SkyPilot no longer has a dependency on lsof in k8s environments (#4304)
  • sky show-gpus --cloud kubernetes now handles limited permissions gracefully (#4208)
  • Both in-cluster (service account based) and kubeconfig auth are now supported concurrently (#4188)
  • Custom GPU resource names are supported with CUSTOM_GPU_RESOURCE_NAME environment variable (#4337)
  • Fixed a bug with SSH on IPv6 dual stack clusters (#4497)
  • Fixed a bug with L40 detection when using nvidia.com/product labels (#4511)
  • pod_config specified in config.yaml is now validated before launching clusters (#4466)
  • Other performance and robustness improvements (#4398, #4415, #4420, #4425, #4429, #4452, #4469, #4514, #4505, #4558, #4561, #4437)

CLI & Core interfaces

  • sky logs has a new --tail parameter to stream job logs (#4241)
  • sky.jobs.launch from the Python API now returns the job id (#4620)

SkyServe

  • SkyServe now supports choosing a load balancing policy to be used by the service (#4439)

    service:
      load_balancing_policy: round_robin  # round_robin, least_load
    
    Policy       Description
    least_load   (New default) Routes requests to replicas with the lowest current load, optimizing for latency and throughput
    round_robin  Distributes requests evenly across all replicas in a circular order
  • Improved security with TLS support on the load balancer (#3380)

  • You can now expose multiple ports on replicas: useful for running monitoring, UI or other services on the replicas (#4356)
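
The two load balancing policies described above can be sketched in Python as follows. This is an illustration, not SkyServe's code: the class names are hypothetical, and using the number of in-flight requests as the load metric for least_load is an assumption.

```python
import itertools


class RoundRobin:
    """Cycle through replicas in a fixed circular order."""

    def __init__(self, replicas):
        self._cycle = itertools.cycle(replicas)

    def pick(self):
        return next(self._cycle)


class LeastLoad:
    """Pick the replica with the fewest in-flight requests."""

    def __init__(self, replicas):
        self._in_flight = {replica: 0 for replica in replicas}

    def pick(self):
        # Ties are broken by insertion order of the replicas.
        replica = min(self._in_flight, key=self._in_flight.get)
        self._in_flight[replica] += 1
        return replica

    def done(self, replica):
        self._in_flight[replica] -= 1


replicas = ["replica-a", "replica-b", "replica-c"]
rr = RoundRobin(replicas)
rr_order = [rr.pick() for _ in range(4)]

ll = LeastLoad(replicas)
first, second = ll.pick(), ll.pick()
ll.done(first)          # the first replica finishes, so it is least loaded again
third = ll.pick()
```

Round robin ignores how long each request takes, while the least-load variant steers new requests away from replicas that are still busy.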

New LLM recipes

  • DeepSeek R1 (#4603) and DeepSeek Janus (#4611)
  • minGPT with PyTorch Distributed (#4464)

Cloud-specific enhancements

  • AWS:
    • Disable additional auto update services for ubuntu image with cloud-init (#4252)
    • Added an AWS assume-role option and env var detection (#4550)
    • Credentials are no longer uploaded when using service account auth (#4395)
    • Custom process based auth is now supported (#4547)
    • SkyPilot now only uses the specified VPC or the default VPC (No other VPCs are used unless specified) (#4546)
  • GCP: Fixed an issue where the service account was not activated for accessing Google Cloud Storage on the controller; robustness improvements (#4529, #4593)
  • Azure: Support image ids tagged with latest and robustness improvements (#4581, #4411, #4457)
  • Fluidstack: H100 SXM5 support (#4359)
  • Lambda: Added support for GH200 and new regions (us-east-2, us-south-2, us-south-3) (#4291, #4377)
  • RunPod: support spot pods (#4447) and private container registries (#4287)
  • OCI: Faster and new provisioner, support for SkyServe, default image has been upgraded to 22.04 LTS (#4119, #4517)

Storage

  • OCI object storage is now supported (#4501)
  • Fixed a bug where object stores were not being mounted when only object stores were specified in file_mounts (#4317)

Docs

  • Docs have been revamped: brand new Overview page explaining core concepts (#4342), improved structuring (#4664), docs for multi-k8s (#4586), and more!

⚠️ Deprecation notice

  • LocalDockerBackend is deprecated. To run locally, use sky local up to set up a local k8s cluster.
  • sky spot CLI is now removed. Use sky jobs launch --use-spot to launch spot instances.

Thanks to all contributors!

New contributors: @weih1121, @clayrosenthal, @manbeardave, @bend, @nkwangleiGIT, @kristopolous, @sachiniyer, @KeplerC, @aylei, @Yisaer, @cbrownstein, @chesterli29, @sfrolich, @AlexCuadron

Many thanks to all contributors who contributed to this release!

Contributors: @romilbhardwaj, @cg505, @Michaelvll, @zpoint, @HysunHe, @cblmemo, @andylizf, @concretevitamin, @KeplerC, @yika-luo, @cbrownstein, @weih1121, @nkwangleiGIT, @aylei, @clayrosenthal, @sethkimmel3, @landscapepainter, @Conless, @sfrolich, @AlexCuadron, @shashank2000, @mjibril, @asaiacai, @chesterli29, @Yisaer, @sachiniyer, @manbeardave, @bend, @kristopolous

Full Changelog: v0.7.0...v0.8.0

SkyPilot v0.7.0

02 Nov 01:29
3f62588

SkyPilot v0.7.0: 3x faster, reservation support, observability, admin policies, new AI hardware, new UX, and more!

We are excited to announce the release of SkyPilot v0.7.0! This release brings significant performance improvements and many new features:

  • Up to 3x faster provisioning
  • Reservation support: AWS Capacity Reservations, AWS Capacity Blocks, GCP reservations, GCP Dynamic Workload Scheduler (DWS), and more
  • Observability features
  • Admin policy enforcement
  • Support for H100 Mega, TPU v6, TPU v5, gVNIC, Azure Blob Storage, faster disks, and more
  • New UX for sky CLI

and many bug fixes and enhancements!

Release Highlights

Performance

We have made 2-3x performance improvements across cloud providers through optimizations in our provisioning stack and the images we use.

Cloud       Provisioning Time  Speedup
AWS         1 min 10s          3x
GCP         1 min 15s          3x
Azure       2 min 16s          2x
Kubernetes  52s                2.5x

Reservations

SkyPilot now supports short-term and long-term reservations across clouds:

  • AWS Capacity Reservations
  • AWS Capacity Blocks
  • GCP reservations
  • GCP Dynamic Workload Scheduler (DWS)
  • Bring your own VMs or Kubernetes clusters

SkyPilot's failover includes these reservations, so they can be combined with spot instances or any other resources/clouds to create a resilient and cost-effective infrastructure.

Observability on Kubernetes

SkyPilot now has two new observability features on Kubernetes:

  • sky status --kubernetes shows all SkyPilot resources on the cluster. (#4040, #4079)
    $ sky status --cloud kubernetes
    Kubernetes cluster state (context: mycluster)
    SkyPilot clusters
    USER     NAME                           LAUNCHED    RESOURCES                                  STATUS
    alice    infer-svc-1                    23 hrs ago  1x Kubernetes(cpus=1, mem=1, {'L4': 1})    UP
    alice    sky-jobs-controller-80b50983   2 days ago  1x Kubernetes(cpus=4, mem=4)               UP
    alice    sky-serve-controller-80b50983  23 hrs ago  1x Kubernetes(cpus=4, mem=4)               UP
    bob      dev                            1 day ago   1x Kubernetes(cpus=2, mem=8, {'H100': 1})  UP
    bob      multinode-dev                  1 day ago   2x Kubernetes(cpus=2, mem=2)               UP
    bob      sky-jobs-controller-2ea485ea   2 days ago  1x Kubernetes(cpus=4, mem=4)               UP
    
    Managed jobs
    In progress tasks: 1 STARTING
    USER     ID  TASK  NAME      RESOURCES   SUBMITTED   TOT. DURATION  JOB DURATION  #RECOVERIES  STATUS
    alice    1   -     eval      1x[CPU:1+]  2 days ago  49s            8s            0            SUCCEEDED
    bob      4   -     pretrain  1x[H100:4]  1 day ago   1h 1m 11s      1h 14s        0            SUCCEEDED
    bob      3   -     bigjob    1x[CPU:16]  1 day ago   1d 21h 11m 4s  -             0            STARTING
    bob      2   -     failjob   1x[CPU:1+]  1 day ago   54s            9s            0            FAILED
    bob      1   -     shortjob  1x[CPU:1+]  2 days ago  1h 1m 19s      1h 16s        0            SUCCEEDED
    
  • sky show-gpus --cloud kubernetes shows detailed GPU availability information on the cluster. (#3816, #4085)
    $ sky show-gpus --cloud kubernetes
    Kubernetes GPUs
    GPU   REQUESTABLE_QTY_PER_NODE  TOTAL_GPUS  TOTAL_FREE_GPUS
    L4    1, 2, 4                   8           8
    H100  1, 2, 4, 8                16          16
    
    Kubernetes per node GPU availability
    NODE_NAME                  GPU_NAME  TOTAL_GPUS  FREE_GPUS
    my-cluster-0               L4        4           4
    my-cluster-1               L4        4           4
    my-cluster-2               H100      8           8
    my-cluster-3               H100      8           8
    

Admin policy enforcement

SkyPilot has a new admin policy mechanism (#3966) that admins can use to enforce policies on users’ SkyPilot usage. These policies apply custom validation and mutation logic to a user’s tasks and SkyPilot config.

Example policies:

Azure Blob Storage support

In addition to S3, GCS and R2, you can now use Azure Blob Storage as a storage backend for storing and accessing data. (#3032)

New AI hardware support

  • New accelerators: TPU v6 (#4115), TPU v5 (#3814), H100 Mega (#4099)
  • Faster networking on GCP with gVNIC (#4095)
  • Faster disks: new disk tier ultra (#3860) for GCP and AWS.

UX revamp

SkyPilot CLI is cleaner, simpler and even easier to parse now (#4023)

New LLM Recipes

Deprecation Notice

  • All SKY_* environment variables are deprecated in favor of SKYPILOT_* variables.
    • All SKY_* variables will be removed in v0.9.0.
    • See docs for list of currently supported variables.

Backend

New Features

  • Managed jobs can now recover from job-level failures (e.g., GPU errors, non-zero exit codes, etc.) (#3919)
    • Set max_restarts_on_errors to specify the number of times SkyPilot should try to restart the job.
    resources:
      job_recovery:
          max_restarts_on_errors: 3  # Retry 3 times before marking the job as failed
    
  • Nvidia GPUs can now disable ECC (#3676)
  • New environment variable SKYPILOT_NUM_NODES to fetch the number of nodes in the cluster. (#3656)
  • SkyPilot config can now be overridden in the task definition with experimental.config_overrides (#3689)
    experimental:
      config_overrides:
        docker:
          run_options: ...
        kubernetes:
          pod_config: ...
          provision_timeout: ...
        gcp:
          managed_instance_group: ...
        nvidia_gpus:
          disable_ecc: ...
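
The restart budget set by max_restarts_on_errors above can be pictured as a bounded retry loop. The Python sketch below is a conceptual model, not the managed jobs controller: run_with_restarts and both sample jobs are hypothetical, and it assumes the budget counts restarts after the first attempt.

```python
def run_with_restarts(job, max_restarts_on_errors):
    """Run `job` until it succeeds or the restart budget is exhausted.

    Returns (status, attempts).
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            job()
            return "SUCCEEDED", attempts
        except Exception:
            # One initial attempt plus max_restarts_on_errors restarts.
            if attempts > max_restarts_on_errors:
                return "FAILED", attempts


calls = {"n": 0}


def flaky_job():
    # Hypothetical job that fails twice (e.g., a transient GPU error), then succeeds.
    calls["n"] += 1
    if calls["n"] <= 2:
        raise RuntimeError("simulated job-level failure")


def always_failing_job():
    raise RuntimeError("simulated permanent failure")


status, attempts = run_with_restarts(flaky_job, max_restarts_on_errors=3)
final_status, total_attempts = run_with_restarts(always_failing_job, max_restarts_on_errors=3)
```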
    

Enhancements

  • AddKeysToAgent is now set for SSH keys in the ssh config file and ssh command (#3985)
  • SkyPilot runtime is now installed in a separate conda environment, reducing interference with user's environment. (#3639)
    • Similarly, the environment pre-configured in your docker image is no longer shadowed by SkyPilot's runtime environment (#3874, #3867)
  • docker.run_options now allows users to pass additional options when running docker containers. (#3682)

Fixes

  • Fix sky cancel not terminating all child processes (#3919)
  • Fix provisioning failures when multiple versions of SkyPilot are installed (#3866)
  • Shell autocomplete installation is now more robust (#3892, #3893)

Kubernetes

New Features

  • Observability improvements:
    • sky status --cloud kubernetes shows all SkyPilot resources on the Kubernetes cluster. (#4040, #4079)
    • sky show-gpus --cloud kubernetes shows detailed GPU availability information on the cluster. (#3816, #4085)
  • SkyPilot now helps you set up your clusters for running SkyPilot jobs.
  • SkyPilot job output is now piped to the container logs (#3758)
    • Use your existing logging tooling (kubectl logs, filebeat, etc.) to view SkyPilot job outputs.
  • Support for Nvidia GPU operator labels (nvidia.com/gpu.product) for detecting GPU types. (#3493)
    • You no longer need to label GPUs if you have the Nvidia GPU operator installed.
  • Spot instances are now supported on GKE clusters (#3675)
  • [Experimental] Multi-context support (#3913, #3968, #3897, #3772, #4013)

Performance improvements:

  • New command runner: 3x faster command submission for Kubernetes pods. (#3157)
  • sky local up for GPUs is now ~5x faster, provisioning in 2min 30s instead of 12min (#3664)
  • Our GPU images are now 3x smaller (1.5 GB), reducing the time to pull the image (#3665)
  • SSH jump pod is no longer required for port-forward mode (#3657)
  • SSH setup is now parallelized to speed up multi-node provisioning (#4158)

Enhancements and fixes

  • H100 Mega support on GKE (#3891, #3627)
  • Better handling for context names with special characters (#4147)
  • --k8s is now a valid alias for --cloud kubernetes (#4151)
  • Init containers are now supporte...

SkyPilot v0.6.1

27 Jul 01:09
bc30c0b

This patch release brings many improvements and fixes to SkyPilot, including major performance improvements for Kubernetes and Azure and new features for AWS and GCP.

Stay tuned for a detailed changelog coming up in v0.7.0!

SkyPilot v0.6.0

30 May 23:33
e37a39d
Compare
Choose a tag to compare

SkyPilot v0.6.0: Jobs API, SkyServe on Kubernetes, Spot + On-demand mixing, Paperspace support and more!

We are excited to release SkyPilot v0.6.0! This release includes a number of new features:

  • Managed Jobs for job execution and recovery
  • SkyServe and Jobs on Kubernetes
  • Mix on-demand and spot instances in SkyServe
  • New cloud: Paperspace

Release Highlights

Managed Jobs

  • The spot controller has been enhanced to support any job on on-demand or spot instances.
    • To use, run sky jobs launch instead of sky spot launch.
  • The new job controller can automatically recover jobs from any spot preemptions or hardware failures, and also execute pipelines of jobs.
  • The sky jobs API is identical to the sky spot API, but also supports on-demand instances.

SkyServe and Jobs on Kubernetes

  • SkyPilot can now run SkyServe and Managed Job controllers on Kubernetes
    • This means you can now run your SkyServe and Managed Jobs on your Kubernetes cluster!
  • Simply run sky jobs launch or sky serve up, and SkyPilot will automatically deploy the controller on your Kubernetes cluster if available and run jobs on the cheapest available location.

Mix on-demand and spot instances in SkyServe

  • SkyServe now supports a new intelligent policy for mixing spot and on-demand instances. Example.
    • Uses on-demand instances to ensure availability and spot instances to save costs.
  • Dynamically falls back to on-demand replicas when spot replicas are not available. Example.

Paperspace support

  • Newest cloud to join the Sky: Paperspace!
    • Paperspace offers the latest GPUs including H100 and A100-80GB for AI training and inference.
  • Simply add your Paperspace API key to ~/.paperspace/config.json and run sky check paperspace to get started.
  • Big thanks to @asaiacai for contributing Paperspace support!

More LLMs and Recipes

Deprecation Notes

The following features have been deprecated and will be removed in the next minor release:

  • sky spot CLI: use sky jobs CLI instead.
  • core.spot_xxx APIs: refactored to jobs.xxx.
  • qps_lower_threshold and auto_restart in service: use target_qps_per_replica instead.

Changelog

Managed Jobs

  • Changes made to the local catalog at ~/.sky/catalog are now reflected on the controller (#3289)
  • The name of the spot job is now included in the SKYPILOT_TASK_ID environment variable (#3424)
  • Legacy spot job APIs have been refactored from core.spot_xxx to jobs.xxx (#3417)
  • Cloud for the controller is now chosen based on the resources of the replicas (#3363)
  • Bug fixes (#3302, #3397, #3459, #3468, #3480)

SkyServe

New Features

  • New intelligent policy for mixing spot and on-demand instances in SkyServe (#3194)
  • SkyServe now uses proxy instead of HTTP redirect responses for better performance (#3395)
  • Readiness probe now supports headers: this is useful for authentication or other headers required for readiness checks (#3552)

Enhancements

  • Optimizations - replicas are reused when only service section is changed (#3214)
  • Rolling updates are now the default behavior for SkyServe (#3249)
  • Controller cloud is now chosen from replica resources if it is not already up (#3231)
  • Bug fixes and API improvements (#3257, #3299, #3303, #3411, #3546)

Kubernetes

  • Kubernetes clusters can now run SkyServe and Managed Jobs (#3377, #3524, #3521)
  • sky show-gpus now shows realtime availability of GPUs in the cluster (#3499)
  • Autoscaling Kubernetes clusters are now supported: SkyPilot can now wait for GKE node pools, Karpenter and other autoscalers to provision nodes (#3513, #3415)
  • Use Kubernetes service accounts by specifying remote_identity in ~/.sky/config.yaml (#3377, #3527)
  • sky local up now also automatically installs the Nginx Ingress Controller (#3223)
  • Support for specifying custom pod configurations with pod_config (#3244)
    • Use this to modify the pod configuration for your environment, e.g., attaching volumes, specifying imagePullSecrets, increasing /dev/shm size limit, setting HTTP_PROXY and more! See example pod_config here.
  • Support for specifying custom metadata to all Kubernetes resources created by SkyPilot (#3333)
    • Useful for tracking resources created by SkyPilot in your Kubernetes cluster.
  • Support for PodIP mode for exposing ports (#3445)

Enhancements

  • GPU Isolation: SkyPilot no longer uses privileged containers and pods can no longer use GPUs not allocated to them (#3443)
  • Ingress creation requests are now batched to minimize nginx reloads and ingress paths are namespaced (#3263, #3373)
  • All SkyPilot pods are now labelled with skypilot-user to identify the owner of the pod (#3576)
  • Special characters in environment variables are now correctly parsed (#3322)
  • GPU labelling is now more robust (#3274)
  • Bug fixes and quality of life improvements (#3266, #3392, #3439, #3509, #3524, #3525, #3532, #3563, #3578, #3374)

CLI & Core interfaces

New Features

  • resources now supports a labels field to set labels (instance tags on AWS, labels on GCP and Kubernetes) on cloud resources (#3464, #3505)
  • sky check now supports checking credentials for specific clouds, e.g. sky check aws gcp (#3229)
    • You can also restrict which clouds are checked by setting allowed_clouds in ~/.sky/config.yaml. (#3556)
  • any_of or ordered fields in resources can now have clouds that are not enabled (#3567)
  • A new environment variable SKYPILOT_CLUSTER_INFO, containing cluster name, cloud, region and zone is now available in all tasks (#3424)

Enhancements

  • Optimizer is up to 10x faster when multiple resources are specified (#3567)
  • Autostop timer is now reset at the start of a new sky launch to avoid unexpected autostops (#3205)
  • GCP GPUs now include DEVICE_MEM in sky show-gpus (#3375)
  • Better sorting for sky show-gpus (#3492)
  • Handling for usernames containing invalid characters (#3528)
  • Null environment variables now raise an error (#3557)

Runtime & Backend

Optimizations

  • Lazy imports for 2x faster import times (#3394, #3463)
  • Faster setup and job submission (#3523, #3484)

Cloud: GCP

  • H100 GPUs are now supported on GCP (#3279)
  • Support for fine-grained GCP IAM permissions (#3284)

Cloud: Azure

  • Custom images are now supported on Azure. Simply specify image_id in the resources field. (#3362)
  • 8x faster autostop for Azure (#3519)
  • Fix GPUs not being detected in Azure (#3313)
  • Provisioning fixes (#3483)

Cloud: AWS

  • Fine-grained IAM roles: you can now specify IAM roles on a per-resource basis (#3488, #3514)
  • SkyPilot can now be run in ECS containers by assuming container-role IAM roles (#3503)
  • SkyPilot will not delete user-specified security groups (#3402)

Cloud: Fluidstack

  • H100 and A100 Nvlink support for Fluidstack (#3467)
  • Opening ports is now supported for Fluidstack (#3294)
  • Bug fixes (#3254, #3265)

Other Clouds

  • Bug fixes for Lambda provisioning and termination (#3409, #3410)
  • Multi-gpu fixes for RunPod (#3291)
  • Cudo: handle missing project errors (#3438)

Thanks to all contributors!

New contributors: @MysteryManav, @JGSweets, @Harthgar, @mjkanji

Many thanks to all contributors who contributed to this release!

Contributors: @Michaelvll, @romilbhardwaj, @concretevitamin, @cblmemo, @MaoZiming, @shethhriday29, @asaiacai, @JGSweets, @mjkanji, @MysteryManav, @landscapepainter, @Harthgar, @mjibril, @dtran24, @fozziethebeat, @JungleCatSW

Full Changelog: v0.5.0...v0.6.0

SkyPilot v0.5.0

27 Feb 04:36

SkyPilot v0.5.0: SkyServe, New Provisioner, LLMs, Kubernetes, and More Clouds

We are excited to release SkyPilot v0.5.0, which introduces a significant number of new features and enhancements, including:

  • SkyPilot Serving
  • New provisioner
  • LLM recipes for the latest open models and engines
  • Kubernetes support improvement
  • 4 new clouds (contributed by the cloud providers!)

and more!

Release Highlights

New Features

  • Multiple candidate resources: SkyPilot now supports multiple candidate resources for a single task (using multiple accelerators, any_of or ordered in resources), allowing users to significantly enlarge the resource pool and get higher availability.
  • New Provisioner: Provisioner gets a new implementation, which is 2x faster and more reliable for supported clouds. Support launching clusters with more than 100 nodes. Dependency requirements for clouds are also significantly reduced.
  • Disk Tier: Introducing best disk tier for the best performance and cost, so you can choose the best disk for any cloud. (#2434)
  • Twice as many spot jobs can now run concurrently
  • Mount storage back after cluster restart
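
The availability benefit of multiple candidate resources can be illustrated with a small Python sketch of ordered failover. It is not the SkyPilot optimizer or provisioner: launch_first_available, the try_provision callback, and the candidate dictionaries are hypothetical, and it models only the ordered case where candidates are tried in the order given.

```python
def launch_first_available(candidates, try_provision):
    """Try candidates in order; return the first that provisions, else None."""
    for candidate in candidates:
        if try_provision(candidate):
            return candidate
    return None


# Hypothetical candidate list and availability; names are illustrative only.
candidates = [
    {"accelerators": "A100:8", "use_spot": True},
    {"accelerators": "A100:8", "use_spot": False},
    {"accelerators": "V100:8", "use_spot": False},
]
unavailable = [candidates[0]]

chosen = launch_first_available(candidates, lambda c: c not in unavailable)
```

Each extra candidate is another chance to find capacity, which is how a larger resource pool translates into higher availability.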

SkyServe

SkyServe is a serving system on top of SkyPilot that deploys and scales any HTTP services across one or more regions or clouds, with autoscaling, load balancing, and more.

  • Introducing SkyServe: deploy and scale your AI models across multiple regions or clouds. (#2458)
  • Autoscaler: Request rate based autoscaling policy. (#2868, #2878)
  • Autoscaler: Support scaling to 0 when no requests (#2938)
  • Rolling update: Support rolling update for existing services (#2935, #3057)
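A minimal sketch of a service definition using the request-rate autoscaler and scale-to-zero described above. The probe path, port, replica counts, and QPS target are illustrative assumptions, and the field names should be checked against the SkyServe docs for the version in use.

```yaml
# Sketch of a SkyServe service YAML; all values are placeholders.
service:
  readiness_probe: /             # path polled to decide whether a replica is ready
  replica_policy:
    min_replicas: 0              # allow scaling to 0 when there are no requests
    max_replicas: 3
    target_qps_per_replica: 2    # request-rate-based autoscaling target

resources:
  ports: 8080                    # port the replica serves on

run: python3 -m http.server 8080
```

A file like this would typically be deployed with sky serve up followed by the YAML path.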

Other Enhancements

New LLM Recipes

Kubernetes

Kubernetes support received a number of new features and enhancements.

  • Multi-node support for Kubernetes (#2609, #3019)
  • Open ports support for Kubernetes (#2588, #2713, #2997, #3200)
  • Support CoreWeave label for GPUs in Kubernetes (CoreWeave support under development) (#2650)
  • Starting a Kubernetes GPU cluster locally with sky local up (#2890)
  • Custom Image Support for Kubernetes Instances (#2729, #3019, #3210)
  • New provisioner for Kubernetes for better performance and robustness (#3019)
  • Supporting Kubernetes cluster launched with k3s and Rancher (#3148)

Other Enhancements

More Clouds

SkyPilot now supports 13 cloud providers, including 4 new provider-contributed clouds: VMware vSphere, RunPod, Fluidstack and Cudo Compute.

Clouds

AWS

New Features

  • New provisioner for AWS: >2x faster for multi-node provisioning and more reliable for cluster launching. (#1702, #2719, #2792)
  • Support for AWS Trainium accelerator (#2690)
  • Support null for proxy command to filter regions (#2756)
  • Support CUDA 12.1 with default image updates (#2788)
  • Job scheduling on Inferentia and Trainium (#2969, #2798)
  • Allow specifying security_group (#3133)

Enhancements

  • Make public / private subnet selection robust (#2867)
  • Avoid hanging for restarting an instance in STOPPING state (#2998)
  • Remove sunset instance types (#2610)
  • Add docs for custom VPC support (#2776)

Fixes

  • Fix conda installation on AWS default image (#3206)
  • Robustify the custom image support (#3216)
  • Fix subnet selection for AWS and autodown for spot instances (#2921)
  • Fix minimal permission for AWS (#2978)
  • Improve opening ports for AWS (#2716)
  • Autostop with new provisioner (#2719)

GCP

New Features

  • Security: Custom VPC support for GCP. (#2764, #2772, #2854, #2944)
  • Security: Support private IP with proxy jump on GCP. (#2819)
  • New provisioner: Adopted the new provisioner for GCP, giving >2x faster and more robust provisioning (#2681, #2719, #2943)
  • Automatically use reserved instances from multiple reserved pools (#2836, #2681)
  • Support L4 accelerator for GCP (#2724)
  • Allow stopping spot clusters on GCP (#2877)
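The VPC, private-IP, and reservation features above are configured on the client side. A rough sketch of what such a config could look like follows; the file location, every value, and the exact field names are assumptions to verify against the SkyPilot config reference for your version.

```yaml
# ~/.sky/config.yaml (sketch; all names and values are placeholders)
gcp:
  vpc_name: my-vpc                 # launch VMs in this custom VPC
  use_internal_ips: true           # reach VMs via private IPs only
  ssh_proxy_command: ssh -W %h:%p user@jump-host.example.com  # proxy jump host
  prioritize_reservations: true    # prefer reserved capacity when available
  specific_reservations:
    - projects/my-project/reservations/my-reservation
```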

Enhancements

  • Allow stopping VM with local SSD (#2587)
  • Update default runtime version for TPU node (#2601, #2602)
  • Handling transient error during launching GCP clusters (#2669)
  • Update GCSFuse version to 1.3.0 for GCS storage mount (#2887)
  • Set TPU VM the default option for TPU accelerators (#1758)
  • Ignore missing gcp credentials for latest gcloud and avoid duplicating credentials (#3028, #3172, #3234)

Fixes

  • Fix custom docker image support (#3218)
  • Fix minimal roles required for GCP (#2704)
  • Robustify the catalog fetching (#3141)
  • Fix ports on TPU VM and cluster launched before 0.4.0 (#2641)
  • Fix backward compatibility issue with GCP clusters (#2604)
  • Fix --disk-size for Custom Machine Images (#2718)
  • Update catalog fetcher with more options (#2562)
  • Assign GCP VMs with service account (#2972)
  • Fix machine image support (#3030, #3236)
  • Fix error handling for failed provisioning (#2852)
  • Leave out TPU v5 in catalog as it is not supported (#2656)
  • Fix GCP minimal permission (#2947, #2770, #2761)

Azure

Enhancements

  • Make ports opening more robust (#2649, #2891, #3084)
  • Additional arguments for Azure catalog fetcher and support H100 (#2561, #2844, #2847)
  • Support CUDA 12.1 with default image updates (#2468)
  • Support spot instances on Azure (#2871)

Fixes

  • Fix custom docker image support (#3218)
  • UX: Fix Azure disk tier explicitly shown in resources str (#3064)
  • Fix status query for Azure (#3015)

SCP

  • Fix SCP error raised in sky check (#3038)

CLI & Core interfaces

New Features

  • Multi-node jobs fail fast for single node failure (#3081)
  • Add configurations for not uploading credentials (#2904)
  • Adding sky status --endpoints CLI (#3199)
  • Support more characters in cluster name (#3130)
  • Show all regions and more accurate price in sky show-gpus (#2583, #2892, #2933, #2946, #3083, #3149, #3113)
  • Allow inferring cloud from region or zone (#2632)
  • Add --commit and --version for sky CLI (#2720, #2731, #2733)

Enhancements

  • Robustify runtime initialization on remote cluster (#3132)
  • Better error message for YAML parsing (#3040)
  • Smarter GPU name completion (#3014)
  • Speed up retry until up by not doing exponential backoff (#2821)
  • Add schema validation for config (#2645)
  • Allow --disk-tier none override (#2906)
  • sky check improvement (#3174, #3212, #3160)
  • Better logging for CLIs (#2535, #2691, #2728, #3139, #3175)

Fixes

  • Fix permissi...

SkyPilot v0.4.1

29 Oct 21:05

This is a patch release to ship bug fixes faster to our users! This release includes many feature updates and bug fixes, including the new provisioner for AWS, fixing OOM and credential issues for long-running spot jobs, and some additional improvements.

Detailed changelog coming up in v0.5!

SkyPilot v0.4.0

19 Sep 16:57

SkyPilot v0.4.0: Kubernetes, native containers, ports and new clouds

We are excited to release SkyPilot v0.4.0, which brings a host of new features and improvements, including Kubernetes support, native container support, ability to open ports, and more.

Release Highlights

New Features

  • Kubernetes support: SkyPilot tasks and clusters can now run on Kubernetes clusters, including on-prem and cloud hosted deployments (GKE, EKS).
    • If you have a working kubeconfig, simply run sky check and sky launch --cloud kubernetes to run your task on Kubernetes.
    • If desired, tasks can also fail over to the cloud when the Kubernetes cluster does not have enough resources. The same SkyPilot YAMLs and CLI work seamlessly across Kubernetes and clouds.
  • Opening ports on clusters: Open ports on your clusters with the ports field. These ports are publicly accessible and can be used for hosting LLM inference endpoints, Jupyter notebooks, web servers, Tensorboard, and other services.
  • Native container support: If your task uses docker containers, SkyPilot's setup and run commands can now directly be executed in that container. This allows you to wrap your environment in a container and run it on any cloud with SkyPilot.
  • Reservation support: This release adds support for GCP reservations. SkyPilot will now prioritize using your reservations on the cloud to save costs and get higher availability.
  • New Managed Spot Features
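To make the ports and native container features above concrete, here is a minimal task sketch. The container image, port number, and command are placeholders, and the image is assumed to meet SkyPilot's requirements for container runtimes.

```yaml
# Sketch: run a task inside a container and expose one port publicly.
resources:
  image_id: docker:python:3.10   # placeholder image; setup/run execute inside it
  ports: 8888                    # opened on the cluster and publicly reachable

run: |
  python3 -m http.server 8888
```

With a working kubeconfig, the same file could be launched on Kubernetes by passing --cloud kubernetes to sky launch, as described in the list above.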

New LLM Recipes

More Clouds

SkyPilot now supports 8 clouds, including community contributed support for two new clouds:

SkyPilot now also supports IBM COS buckets (#1966).

Core and UX Improvements

  • Faster failover: 30x faster failover with our new quota optimization which checks if quotas are available before launching a cluster (Supported on GCP, AWS).
  • Easily get VM IPs: The new --ip flag for sky status returns the public IP address of the cluster (e.g., sky status --ip mycluster). Use this to access services such as LLM inference endpoints, Jupyter notebooks and more.
  • Improved scriptability: SkyPilot YAMLs and CLI are more scriptable than ever - file_mounts can be dynamically defined with environment variables (docs, example), environment variables can be set through a dotenv file with the new --env-file flag (#2296).
  • Core optimizations: Multi-node clusters stop 4x faster (#2199), sky status updates for stopped clusters are 10x faster (#2288), and the job queue is more memory efficient (#1636).
  • Nightly releases: We now release nightly versions of SkyPilot. To get the cutting edge of SkyPilot without installing from source, run pip install skypilot-nightly (#1446)
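As an illustration of the scriptability improvement above, the sketch below defines file_mounts from environment variables. The variable names, bucket name, and paths are hypothetical.

```yaml
# Sketch: file_mounts parameterized by environment variables.
envs:
  MY_BUCKET: my-example-bucket     # placeholder bucket name
  MODEL_SIZE: 13b

file_mounts:
  /data:
    name: ${MY_BUCKET}             # bucket name resolved from the env var
    mode: MOUNT
  /checkpoint/${MODEL_SIZE}: ~/local-checkpoints   # placeholder local path

run: |
  ls /data
```

The defaults under envs can then be overridden at launch time, for example with values loaded through the --env-file flag mentioned above.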

Deprecation

  • SkyPilot On-prem is now deprecated and Kubernetes will be the recommended mode of running SkyPilot on on-prem clusters.

Below is a detailed list of changes.

Managed Spot

New Features

  • Spot pipeline support: automatically handles a pipeline of spot jobs. (#1982)
  • Spot dashboard is now available with sky spot dashboard: you can now see all your spot jobs in a GUI (#2103, #2136)
  • Spot callback - users can now run custom code when spot job status changes (#2106, #2364)
  • Resource configuration of the spot controller can now be customized (docs, #2040)
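A spot pipeline, as introduced above, can be sketched as a multi-document YAML in which a header names the pipeline and each following document is one stage. The stage names, accelerators, and commands here are placeholders.

```yaml
# Sketch of a two-stage spot pipeline; stages run in the order listed.
name: train-then-eval

---

name: train
resources:
  accelerators: V100:1
run: |
  echo "training"

---

name: eval
resources:
  accelerators: T4:1
run: |
  echo "evaluating"
```

Submitting the file with sky spot launch would hand the whole pipeline to the spot controller, which then runs the stages one after another.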

Enhancements

  • SkyPilot now shows the spot job's resources and estimated cost before confirmation (#2524)
  • Switch to eager failover recovery policy for better spot lifetime (#2234)
  • Reduce the logging for launching spot controller (#2056)

Fixes

CLI & YAML interfaces

New Features

  • Users can now use environment variables to dynamically define file_mounts (docs, #2146)
  • sky status can now show the head IP of the cluster with -a or --ip flags (#2305, #2563)
  • sky down/stop/start defaults to a unique cluster if it exists and sky cancel without cluster cancels the latest task (#2325)

Enhancement

Fixes

  • Fixed the order of VMs in optimizer table when --cpus is provided (#2037)
  • Better handling when sky launch is interrupted (#2206, #2252)

Backend

New Features

  • Users can now open ports for their clusters with the ports field (docs, #2210, #2477)
  • Docker support in image_id - tasks can now be run inside docker containers (docs, #1910)
  • Users can now clone a cluster from an existing cluster's disk with the --clone-disk-from flag (#2098)
  • Users can now launch their own ray cluster on a SkyPilot cluster (#2020)

Enhancements

  • 30x faster failover for AWS and GCP when quotas are not available (#1953, #2187, #2313)
  • Faster sky launch by caching cluster IP address (#2400)
  • Job queue is now more resource efficient, with significant memory consumption reduction on remote cluster (#1636)
  • Cluster names no longer map directly to cloud cluster names. Instead, they are mapped to a unique cluster name on the cloud. This helps with isolation across users sharing cloud accounts. (#2403)
  • More efficient and robust stopping/termination for AWS (#2121)
  • sky status --refresh for STOPPED cluster is 10x faster (#2079)
  • Empty YAML fields are now allowed (#1890)

Fixes

Storage

New Features

  • IBM COS is now supported (#1966)
  • sky spot launch will now exclude files from .gitignore (#2018)

Enhancements

  • Deletion is now parallelized for faster deletion (#2058)
  • UX improvements for sky storage CLI (#2063, #2177)
  • GCS bucket mounting now uses gcsfuse v1.0.1 (#2470)

Fixes

Dependencies

  • Avoid buggy grpcio versions (#2055)
  • Pydantic is pinned to <2.0 (#2157)
  • PyYAML is pinned to >3.13, != 5.4.* to avoid issues with Cython 3 (#2256, #2514)
  • Ray <= 2.6.3 is supported on local machines (#2401)
  • pycryptodome, oauth2client are no longer required (#2515)

Clouds

AWS

  • H100 GPUs are now supported (#2323)
  • New docs for AWS cloud administrator about advanced login option (SSO and account switching) (#1888)
  • Insufficient permission is now handled gracefully (#2415, #2456)
  • Fixed a bug where existing AWS cluster would end up in INIT state after changing identity (#2442)
  • Fix fetching AZ when describe zones permission does not exist in all regions (#2463)

GCP


SkyPilot v0.3.3

17 Jul 23:38
f42c032

This patch release brings many bug fixes and features, including new mechanics for stop/down, callbacks for spot jobs and a critical dependency fix for PyYAML after the release of Cython 3.

Detailed changelog coming up in v0.4!