SkyPilot v0.9.1: API Server Architecture, Web Dashboard, Faster Storage, Improved Configuration and more!
We're excited to announce the release of SkyPilot v0.9.1! This update brings major improvements to SkyPilot, making it faster, more powerful and flexible for production-ready deployment.
Highlights
Client-Server Architecture
The new client-server model transforms SkyPilot from a single-user system into a scalable, multi-user platform, making it easier for individuals and teams to run and manage their workloads.
- Unified view and management: Get a single view of all running clusters and jobs across the organization and all of your infrastructure.
- Integrate with workflow orchestrators: SkyPilot state is centralized on the API server and no longer needs to be maintained in orchestrators like Airflow.
- Multi-tenancy: Share clusters, jobs, and services securely among teammates.
Web dashboard
SkyPilot has a new dashboard! Easily view and manage your clusters, jobs and logs.
Access it with sky dashboard.
New configuration system
SkyPilot now supports specifying configuration at various levels: CLI, SkyPilot YAML, project-level config, client-level global config and server-side config.
You can now have a project configuration storing default values for all jobs in a project, a user configuration to apply globally to all projects and Task YAML overrides for specific jobs.
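The layering described above can be pictured as a deep merge in which more specific levels override more general ones. The following is a minimal sketch, not SkyPilot's implementation: it assumes the levels apply from server-side config (lowest priority) up to task-level overrides (highest), and the `merge_configs` helper and the sample values are hypothetical.

```python
from copy import deepcopy
from typing import Any, Dict, List


def merge_configs(layers: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Deep-merge config layers; later layers override earlier ones."""
    merged: Dict[str, Any] = {}
    for layer in layers:
        _merge_into(merged, layer)
    return merged


def _merge_into(base: Dict[str, Any], override: Dict[str, Any]) -> None:
    """Recursively copy values from override into base, in place."""
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(base.get(key), dict):
            _merge_into(base[key], value)
        else:
            # Copy so that the input layers are never mutated.
            base[key] = deepcopy(value)


# Hypothetical layers, lowest priority first (assumed ordering).
server_config = {"jobs": {"controller": {"autostop": {"idle_minutes": 10, "down": False}}}}
user_config = {"jobs": {"controller": {"autostop": {"idle_minutes": 5}}}}
project_config = {"jobs": {"controller": {"autostop": {"idle_minutes": 30}}}}
task_overrides = {"jobs": {"controller": {"autostop": {"down": True}}}}

effective = merge_configs([server_config, user_config, project_config, task_overrides])
print(effective)
```

In this sketch the project-level value for `idle_minutes` wins over the user-level one, and the task-level override flips `down`, while untouched keys fall through from lower levels.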
New mount_cached storage - 9.6x faster checkpointing
New storage mode mount_cached uses the local disk as a cache for cloud storage buckets. It boosts GPU utilization by making cloud I/O asynchronous.
file_mounts:
  /checkpoints:
    source: gs://my-checkpoints-bucket
    mode: MOUNT_CACHED  # Asynchronously uploads all writes to the bucket
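To illustrate why asynchronous uploads help, here is a toy write-back cache in Python: writes return as soon as the data is on local disk, and a background thread copies files to the "bucket". This is a conceptual sketch only and does not reflect how SkyPilot implements mount_cached; the `WriteBackCache` class is hypothetical and a local directory stands in for the cloud bucket.

```python
import queue
import shutil
import tempfile
import threading
from pathlib import Path


class WriteBackCache:
    """Toy write-back cache: local writes first, background uploads later."""

    def __init__(self, cache_dir: Path, bucket_dir: Path) -> None:
        self.cache_dir = cache_dir
        self.bucket_dir = bucket_dir  # stands in for a cloud bucket
        self._pending: "queue.Queue[Path]" = queue.Queue()
        self._worker = threading.Thread(target=self._upload_loop, daemon=True)
        self._worker.start()

    def write(self, name: str, data: bytes) -> None:
        """Write to the local cache and queue the file for upload."""
        path = self.cache_dir / name
        path.write_bytes(data)
        self._pending.put(path)

    def flush(self) -> None:
        """Block until every queued upload has finished."""
        self._pending.join()

    def _upload_loop(self) -> None:
        """Copy queued files to the bucket directory, one at a time."""
        while True:
            path = self._pending.get()
            try:
                shutil.copy2(path, self.bucket_dir / path.name)
            finally:
                self._pending.task_done()


cache_root = Path(tempfile.mkdtemp())
bucket_root = Path(tempfile.mkdtemp())
cache = WriteBackCache(cache_root, bucket_root)
cache.write("step_100.ckpt", b"model weights")  # caller continues right away
cache.flush()  # wait for uploads, e.g. before the job exits
uploaded = sorted(p.name for p in bucket_root.iterdir())
print(uploaded)
```

The point of the design is that the training loop only pays for a local disk write; the slower transfer to remote storage overlaps with computation.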
New cloud: Nebius
SkyPilot now supports Nebius cloud! Getting started is easy:
$ sky check nebius
$ sky launch --gpus H200:8 --cloud nebius
ARM instance support - run SkyPilot on GH200s, GB200s, and more!
New native images for ARM instances allow you to run SkyPilot on GH200s and GB200s on Lambda Cloud, GCP, or your own Kubernetes clusters! (#4835)
What's new
CLI & Core interfaces
- sky CLI now returns non-zero exit code on launch/exec/logs/jobs launch/jobs logs failures (#4846)
  - This improves scriptability with sky CLI in automated workflows.
- sky check now separately checks storage and compute capabilities (#4996, #4977)
- New --all option for sky jobs queue to show all jobs (#4923)
- resources.gpus can now be used to alias resources.accelerators in the SkyPilot YAML (#5207)
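Because failures now surface as non-zero exit codes, an automation script can stop a pipeline at the first failed step. The sketch below shows one way to do this with Python's standard `subprocess` module; the `run_step` and `run_pipeline` helpers are hypothetical, and stand-in commands replace real sky invocations so the example runs without SkyPilot installed.

```python
import subprocess
import sys
from typing import List


def run_step(cmd: List[str]) -> int:
    """Run a command and return its exit code without raising."""
    completed = subprocess.run(cmd, check=False)
    return completed.returncode


def run_pipeline(steps: List[List[str]]) -> int:
    """Run steps in order; stop at the first non-zero exit code."""
    for cmd in steps:
        code = run_step(cmd)
        if code != 0:
            print(f"step failed with exit code {code}: {' '.join(cmd)}", file=sys.stderr)
            return code
    return 0


# Stand-in commands (assumption): in a real workflow each step would be a
# sky CLI invocation such as a launch or exec command.
ok_step = [sys.executable, "-c", "raise SystemExit(0)"]
failing_step = [sys.executable, "-c", "raise SystemExit(3)"]
result = run_pipeline([ok_step, failing_step, ok_step])
```

Here the third step never runs, because the second one reports a failure through its exit code.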
Managed Jobs
- Multiple users can now share the same jobs controller (#4733)
- Autostop and autodown settings for the jobs controller can now be customized (#5182)
# ~/.sky/config.yaml
jobs:
  controller:
    # autostop: false  # to disable completely
    autostop:
      idle_minutes: 5
      down: true
- See other users' jobs with sky jobs queue -u when using a shared controller (#4787)
- Access to cloud object storage is no longer necessary for using file mounts or workdir in managed jobs. (#4708)
  - Running managed jobs on Kubernetes no longer requires cloud access.
Storage
- New mode: mount_cached (#4369)
  - This mode is optimized for checkpointing large models.
  - It asynchronously uploads the cached directory to the cloud storage bucket, increasing GPU utilization.
- Fixed an issue with openrsync on macOS 15 causing upload failures (#5196)
- .gitignore handling is now more robust (#4988)
- Fixed exclusion for AWS bucket upload (#5128)
Kubernetes
- Revamped /dev/fuse access mechanism on k8s (#5028)
  - We no longer need to request the smarter-devices-fuse resource, making SkyPilot fuse mounting compatible with autoscaling clusters.
- B200 GPUs are now supported on GKE (#5102)
- Scale-to-zero autoscaling is now supported on GKE (#4935)
  - SkyPilot can now inspect the node pools available on scale-to-zero clusters before provisioning.
  - This allows SkyPilot to intelligently filter out clusters that cannot provision the requested GPU type.
- sky check now detects and hints for unlabeled GPU nodes on GKE (#5065)
- GPU names are now case-insensitive; numbers-only name formats are now supported (#4756, #4925)
- Fixed fractional CPU support when using <1 CPU core (#4707)
- Fix node filtering when provisioning multiple GPUs (#4930)
- initContainers can now be overridden through pod_config (#5247)
- Instructions on mounting NFS volumes (#4951)
- GPU labelling script can now use custom context names (#5072)
- Fixed a bug where clusters from stale contexts could not be cleaned up (#4980)
Backend
- New Client-Server Architecture (#4660)
  - This allows SkyPilot to be deployed as a remote service shared by multiple users.
- Fixed conda support when using python 3.12 (#4035)
- sky exec now waits for the cluster to be started (#4867)
- sky local up --ips now supports specifying sudo password (#5030)
- Clouds with expired credentials are now automatically excluded from failover (#5015)
SkyServe
- New Spot/On-demand Policy: dynamic_fallback (#4628)
  - New spot_placer field can be set to dynamic_fallback to let SkyPilot automatically switch from spot to on-demand instances if spot instances are not available.
  - More details in paper
- Fixed: any_of field order issue causing version bump to not work (#4978)
- Fixed: LiveError on controller (#4995)
Cloud Support
- New cloud: Nebius (#4573, #4838)
- GCP
- RunPod
  - Custom docker images with non-root user are now supported (#4683)
- Lambda
- Fluidstack: NVLINK GPUs are now supported (#3954)
- IBM: new fetcher for IBM catalog (#5003)
- Cloudflare R2: fixed upload issues when using new awscli versions (#5282)
New Examples and Tutorials
- Large-Scale batch inference
- Vector DB ingest and querying
- RAG (Retrieval Augmented Generation)
- Deepseek r1-671B with SGLang
- Gemma3 Example
- Llama 4 example
- Hyperpod+EKS example
- High-Performance Model Checkpointing with mount_cached
⚠️ Deprecations and removals
Removed
- Env vars starting with SKY_ are no longer supported. Use SKYPILOT_ env vars instead.
- Old services from 0.7.0 (before #4439) may need to be stopped and restarted.
- kubernetes is no longer a valid region name. Use the k8s context name to specify a Kubernetes cluster if required.
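Before upgrading, it may help to check whether an environment still sets variables with the removed SKY_ prefix. The helper below is a small sketch for that check; `find_legacy_sky_env_vars` and the sample variable names are hypothetical and used only for illustration. Note that names beginning with SKYPILOT_ do not match the SKY_ prefix, so they are not flagged.

```python
import os
from typing import List, Mapping


def find_legacy_sky_env_vars(environ: Mapping[str, str]) -> List[str]:
    """Return the names in environ that use the removed SKY_ prefix."""
    return sorted(name for name in environ if name.startswith("SKY_"))


# Hypothetical variable names, for illustration only.
example_env = {"SKY_DEBUG": "1", "SKYPILOT_DEBUG": "1", "PATH": "/usr/bin"}
legacy = find_legacy_sky_env_vars(example_env)
print(legacy)

# To scan the real process environment instead:
current_legacy = find_legacy_sky_env_vars(os.environ)
```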
Deprecated
- experimental.config_overrides has been deprecated. Use the config field instead.
Migration guide
SkyPilot 0.9.1 introduces the asynchronous execution model, which may cause compatibility issues with user programs using SkyPilot SDKs <=0.8.1.
Refer to the migration guide to upgrade your code.
TL;DR: Wrap all SkyPilot SDK function calls (except tail_logs) with sky.stream_and_get() to make your program behave mostly the same as before:
# <= 0.8.1
job_id, handle = sky.launch(task)
# 0.9.1
job_id, handle = sky.stream_and_get(sky.launch(task))
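If a program makes many SDK calls, the wrapping shown above can be factored into a small helper. The following is a sketch under stated assumptions: `make_blocking` is a hypothetical helper, and the `fake_launch` and `fake_stream_and_get` functions are stand-ins so the example runs without SkyPilot installed. With SkyPilot available, the same pattern would presumably be applied as `make_blocking(sky.launch, sky.stream_and_get)`.

```python
import functools
from typing import Any, Callable, Dict, Tuple


def make_blocking(async_fn: Callable[..., Any],
                  wait: Callable[[Any], Any]) -> Callable[..., Any]:
    """Wrap a call that returns a request handle so it returns the final result."""

    @functools.wraps(async_fn)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        return wait(async_fn(*args, **kwargs))

    return wrapper


# Fake stand-ins (hypothetical) that mimic "submit, then wait for the result".
_results: Dict[str, Tuple[int, str]] = {}


def fake_launch(task: str) -> str:
    """Record a result and return a request ID instead of the result itself."""
    request_id = f"req-{len(_results)}"
    _results[request_id] = (1, f"handle-for-{task}")
    return request_id


def fake_stream_and_get(request_id: str) -> Tuple[int, str]:
    """Look up the result for a previously returned request ID."""
    return _results[request_id]


launch = make_blocking(fake_launch, fake_stream_and_get)
job_id, handle = launch("my-task")
```

Per the TL;DR, tail_logs is the exception and should not be wrapped this way.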
Thanks to all contributors!
New contributors: @kyuds, @BorenTsai, @funkypenguin, @JiangJiaWei1103, @SalikovAlex, @flaviomartins, @ajay, @bradhilton, @SeungjinYang, @eltociear, @vvidovic, @KennBro, @DanielZhangQD
Many thanks to everyone who contributed to this release!
Contributors: @aylei, @zpoint, @SeungjinYang, @cg505, @Michaelvll, @romilbhardwaj, @KeplerC, @concretevitamin, @SalikovAlex, @DanielZhangQD, @kyuds, @cblmemo, @andylizf, @clayrosenthal, @JiangJiaWei1103, @bradhilton, @funkypenguin, @vvidovic, @cbrownstein, @flaviomartins, @KennBro, @mjibril, @kristopolous, @ajay, @landscapepainter, @eltociear, @BorenTsai