Skip to content

Support SSL Key Rotation in HTTP Server #13495

New issue

Have a question about this project? No Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “No Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? No Sign in to your account

Merged
merged 3 commits into from
Feb 22, 2025
Merged

Conversation

youngkent
Copy link
Contributor

@youngkent youngkent commented Feb 18, 2025

Some production setup requires TLS key/cert rotation in HTTP server. This change use watchfiles to async'ly monitor the updates of ssl key, cert, and CA files, and update the SSLContext when changes are detected.

Test cmd
vllm serve /tmp/model -tp 1 --max_num_seqs 32 --ssl-keyfile ~/test_certs/server.key --ssl-certfile ~/test_certs/server.crt --ssl-ca-certs ~/test_certs/rootCA.crt --enable-ssl-refresh

touch ~/test_certs/server.key

Server output:
INFO 02-18 11:40:01 launcher.py:24] Watching files: ['/home/ktong/test_certs/server.key', '/home/ktong/test_certs/server.crt']
INFO 02-18 11:40:01 launcher.py:24] Watching files: ['/home/ktong/test_certs/rootCA.crt']
INFO: Application startup complete.
INFO 02-18 11:42:31 launcher.py:28] File change detected: modified - /home/ktong/test_certs/server.key
INFO 02-18 11:42:31 launcher.py:57] Reloading SSL certificate chain

Signed-off-by: Keyun Tong <tongkeyun@gmail.com>
Copy link

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@WoosukKwon
Copy link
Collaborator

@russellb @robertgshaw2-redhat could you please take a look? Unfortunately I have little background on this.

@@ -36,3 +36,4 @@ einops # Required for Qwen2-VL.
compressed-tensors == 0.9.1 # required for compressed-tensors
depyf==0.18.0 # required for profiling and debugging with compilation config
cloudpickle # allows pickling lambda functions in model_executor/models/registry.py
watchfiles # required for http server to monitor the updates of TLS files
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we want to pin a version?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The API being used is pretty standard, it should not be very sensitive to specific versions.
I would say the latest version is preferred here.

@russellb
Copy link
Member

@russellb @robertgshaw2-redhat could you please take a look? Unfortunately I have little background on this.

Sure.

What are some other examples of services that do dynamic reloading of configuration like this? I would normally expect a configuration rollout to include restart services. Running in an environment like Kubernetes makes this fairly straightforward.

I think it's going to be more complicated to account for all possible cases. What if the files are just deleted? Should the service keep running with what it had previously loaded? Should it exit with an error?

My preference would be to leave it as-is, and leave it to the administrator (or service automation) to decide when the service should restart with new configuration.

@youngkent
Copy link
Contributor Author

@russellb For example, a thrift server also supports SSL key/cert rotation: https://github.com/facebook/fbthrift/blob/a3b88c21b4bf382d506922c2d874b21a7c06b821/thrift/lib/cpp2/server/ThriftServer.cpp#L1876-L1881
Restarting a server should work, but it's intrusive. For infrastructure with decoupled key rotation and server rollout, supporting SSL key rotation without requiring restarting is actually needed.
For infrastructure that does not do key rotation, the logic here basically have no side effect. Or would you prefer adding a gating flag for the feature?

@russellb
Copy link
Member

russellb commented Feb 19, 2025

Thanks for the example. I've thought about this some more and I'm still not really comfortable with the feature.

Automatic reloading based on files changing strikes me as very surprising behavior. An alternative that some services use is allow you to send them a SIGHUP signal to reload their configuration. This would typically be hidden behind something like systemd, so to an administrator it's systemctl reload <service>.

Whether it was via SIGHUP or not, only dynamically reloading this but not other files seems like surprising behavior. For example, what should happen if a model is updated on disk?

Another reason I'm more on the side of keeping this simple is I don't expect using built-in SSL to be the production SSL endpoint in most cases. When running in Kubernetes, I'd expect a load balancer serving as ingress into the cluster would terminate SSL. In other words, I'd prefer to keep this simple and defer the more complex and dynamic configuration management to systems outside of vllm.

@russellb
Copy link
Member

I want to clarify one more thing. You should not interpret my comments as a rejection of the PR! I'm not a maintainer and don't have that authority. I'm just stating my gut reaction to the feature and certainly don't mind if the consensus after maintainer review goes toward accepting it!

@youngkent
Copy link
Contributor Author

@russellb thanks for reviewing and sharing your opinion. The file monitoring is limited to SSL key rotation at the moment. Generally, I feel people shouldn't mix up the expectation of model file loading with SSL key rotation. (Do we need to make it more clearly stated through documentation?)

About relying on systemctl reload <service>, I feel it might be better to keep vLLM service more self contained without relying assumptions like using systemctl to manage it. vLLM service are used in various infra setup, it would be nice to keep it flexible to support different types of infra setup.

cc: @WoosukKwon @simon-mo to hear about your suggestions.

Comment on lines 64 to 74
watch_ssl_cert_task = None
if config.ssl_keyfile and config.ssl_certfile:
watch_ssl_cert_task = loop.create_task(
watch_files([config.ssl_keyfile, config.ssl_certfile],
update_ssl_cert_chain))

watch_ssl_ca_task = None
if config.ssl_ca_certs:
watch_ssl_ca_task = loop.create_task(
watch_files([config.ssl_ca_certs], update_ssl_ca))

Copy link
Collaborator

@yeqcharlotte yeqcharlotte Feb 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i also feel explicitly sending SIGHUP has clearer semantics if reverse proxy is not an option.

in addition, since ssl rotation is irrelevant to most users. i think we should isolate these changes to a dedicated ssl.py module instead of directly putting them in top serve_http entry point.

i also imagine more complexity may come to handle edge cases in production.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, SIGHUP isn't how TW does the key rotation, which we could consider in the long term.
Good point on feature isolation, I can gate the feature and decouple it into a dedicated file.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is TW?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's an infrastructure we are trying to integrate with, which we don't have a lot of control on how SSL files are delivered..

@russellb
Copy link
Member

About relying on systemctl reload , I feel it might be better to keep vLLM service more self contained without relying assumptions like using systemctl to manage it.

just to be clear, I don't suggest that systemd should be required for this. On the vLLM side, it's handling the SIGHUP signal. There's different ways to send the signal to trigger a reload and systemctl reload ... is just one example of where it can be wrapped up in a nice interface, consistent with triggering reloads across multiple system services.

Signed-off-by: Keyun Tong <tongkeyun@gmail.com>
Signed-off-by: Keyun Tong <tongkeyun@gmail.com>
Copy link
Collaborator

@simon-mo simon-mo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is well scoped enough to handle a common deployment scenarios. I do agree SIGHUP might be a better option if designed from scratch but this PR offers isolated functionality for certain integrations.

@simon-mo simon-mo added the ready ONLY add when PR is ready to merge/full CI is needed label Feb 22, 2025
@simon-mo simon-mo merged commit 8db1b9d into vllm-project:main Feb 22, 2025
75 of 83 checks passed
@russellb
Copy link
Member

I think there's at least a small race condition here, where either the server cert OR the CA cert will be updated, but not both (but both need to be updated). One or more connections could get handled in between updating each file.

This is well scoped enough to handle a common deployment scenarios. I do agree SIGHUP might be a better option if designed from scratch but this PR offers isolated functionality for certain integrations.

It's a trivial change to use SIGHUP, FWIW.

@youngkent
Copy link
Contributor Author

@russellb Once we have a deployment environment that supports SIGHUP signaling when certs are updated, I think we can definitely extend the functionality here to support SIGHUP mode.

lulmer pushed a commit to lulmer/vllm that referenced this pull request Apr 7, 2025
Signed-off-by: Louis Ulmer <ulmerlouis@gmail.com>
No Sign up for free to join this conversation on GitHub. Already have an account? No Sign in to comment
Labels
ci/build frontend ready ONLY add when PR is ready to merge/full CI is needed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants