Support SSL Key Rotation in HTTP Server #13495

youngkent · 2025-02-18T19:46:26Z

Some production setups require TLS key/cert rotation in the HTTP server. This change uses watchfiles to asynchronously monitor updates to the SSL key, cert, and CA files, and updates the SSLContext when changes are detected.

Test command:
vllm serve /tmp/model -tp 1 --max_num_seqs 32 --ssl-keyfile ~/test_certs/server.key --ssl-certfile ~/test_certs/server.crt --ssl-ca-certs ~/test_certs/rootCA.crt --enable-ssl-refresh

Then touch the key file to trigger a reload:
touch ~/test_certs/server.key

Server output:
INFO 02-18 11:40:01 launcher.py:24] Watching files: ['/home/ktong/test_certs/server.key', '/home/ktong/test_certs/server.crt']
INFO 02-18 11:40:01 launcher.py:24] Watching files: ['/home/ktong/test_certs/rootCA.crt']
INFO: Application startup complete.
INFO 02-18 11:42:31 launcher.py:28] File change detected: modified - /home/ktong/test_certs/server.key
INFO 02-18 11:42:31 launcher.py:57] Reloading SSL certificate chain
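
For readers who have not seen the diff, the mechanism described above can be sketched roughly as follows. This is an illustration under assumptions, not the code merged in this PR: `watch_files` mirrors the name in the reviewed snippet, but its body here, along with `make_cert_chain_reloader`, is hypothetical. It relies only on `watchfiles.awatch` and `ssl.SSLContext.load_cert_chain`.

```python
import asyncio
import ssl
from typing import Callable, Sequence


async def watch_files(paths: Sequence[str],
                      on_change: Callable[[], None]) -> None:
    """Call on_change() once per batch of changes to any of the paths."""
    # Imported lazily so the helper below is usable without the
    # third-party watchfiles package installed.
    from watchfiles import awatch

    async for changes in awatch(*paths):
        # Each batch is a set of (Change, path) tuples.
        for change, path in changes:
            print(f"File change detected: {change.name} - {path}")
        on_change()


def make_cert_chain_reloader(ctx: ssl.SSLContext, certfile: str,
                             keyfile: str) -> Callable[[], None]:
    """Return a callback that reloads the cert chain into ctx in place."""

    def reload() -> None:
        print("Reloading SSL certificate chain")
        ctx.load_cert_chain(certfile, keyfile)

    return reload
```

A server would schedule this with something like `loop.create_task(watch_files([keyfile, certfile], make_cert_chain_reloader(ctx, certfile, keyfile)))` and cancel the task on shutdown.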

Signed-off-by: Keyun Tong <tongkeyun@gmail.com>

WoosukKwon · 2025-02-19T02:38:23Z

@russellb @robertgshaw2-redhat could you please take a look? Unfortunately I have little background on this.

houseroad · 2025-02-19T08:28:52Z

requirements-common.txt

@@ -36,3 +36,4 @@ einops # Required for Qwen2-VL.
 compressed-tensors == 0.9.1 # required for compressed-tensors
 depyf==0.18.0 # required for profiling and debugging with compilation config
 cloudpickle # allows pickling lambda functions in model_executor/models/registry.py
+watchfiles # required for http server to monitor the updates of TLS files


Do we want to pin a version?

The API being used is pretty standard, so it should not be very sensitive to specific versions.
I would say the latest version is preferred here.

russellb · 2025-02-19T16:19:31Z

> @russellb @robertgshaw2-redhat could you please take a look? Unfortunately I have little background on this.

Sure.

What are some other examples of services that do dynamic reloading of configuration like this? I would normally expect a configuration rollout to include restarting services. Running in an environment like Kubernetes makes this fairly straightforward.

I think it's going to be more complicated to account for all possible cases. What if the files are just deleted? Should the service keep running with what it had previously loaded? Should it exit with an error?

My preference would be to leave it as-is, and leave it to the administrator (or service automation) to decide when the service should restart with new configuration.
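
One possible answer to the deleted-file question above is "keep serving with what was already loaded". A minimal sketch of that policy follows, assuming a hypothetical helper `try_reload_cert_chain`; it is not the behavior of this PR. It validates the new files in a scratch context first, so a missing or malformed file never touches the live context.

```python
import logging
import os
import ssl

logger = logging.getLogger(__name__)


def try_reload_cert_chain(ctx: ssl.SSLContext, certfile: str,
                          keyfile: str) -> bool:
    """Reload the cert chain into ctx; keep the old one on any failure."""
    missing = [p for p in (certfile, keyfile) if not os.path.isfile(p)]
    if missing:
        logger.warning("TLS files missing, keeping current cert: %s", missing)
        return False
    try:
        # Validate in a throwaway context so the live one is only
        # touched once the new files are known to load cleanly.
        scratch = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
        scratch.load_cert_chain(certfile, keyfile)
        ctx.load_cert_chain(certfile, keyfile)
    except OSError as exc:  # ssl.SSLError is a subclass of OSError
        logger.warning("TLS reload failed, keeping current cert: %s", exc)
        return False
    return True
```

Exiting with an error instead would be a one-line change in the failure branches; which policy is right depends on the deployment.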

youngkent · 2025-02-19T17:28:00Z

@russellb For example, a thrift server also supports SSL key/cert rotation: https://github.com/facebook/fbthrift/blob/a3b88c21b4bf382d506922c2d874b21a7c06b821/thrift/lib/cpp2/server/ThriftServer.cpp#L1876-L1881
Restarting a server should work, but it's intrusive. For infrastructure where key rotation is decoupled from server rollout, SSL key rotation without a restart is actually needed.
For infrastructure that does not do key rotation, the logic here basically has no side effects. Or would you prefer adding a gating flag for the feature?

russellb · 2025-02-19T22:39:26Z

Thanks for the example. I've thought about this some more and I'm still not really comfortable with the feature.

Automatic reloading based on files changing strikes me as very surprising behavior. An alternative that some services use is to allow you to send them a SIGHUP signal to reload their configuration. This would typically be hidden behind something like systemd, so to an administrator it's systemctl reload <service>.

Whether it was via SIGHUP or not, only dynamically reloading this but not other files seems like surprising behavior. For example, what should happen if a model is updated on disk?

Another reason I'm more on the side of keeping this simple is I don't expect using built-in SSL to be the production SSL endpoint in most cases. When running in Kubernetes, I'd expect a load balancer serving as ingress into the cluster would terminate SSL. In other words, I'd prefer to keep this simple and defer the more complex and dynamic configuration management to systems outside of vllm.
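
For comparison, the SIGHUP alternative mentioned above could look roughly like this in an asyncio server. It is a sketch only: `install_sighup_reload` and `demo` are hypothetical names, and `loop.add_signal_handler` is available on Unix event loops running in the main thread.

```python
import asyncio
import os
import signal
from typing import Callable, List


def install_sighup_reload(loop: asyncio.AbstractEventLoop,
                          reload_fn: Callable[[], None]) -> None:
    """Run reload_fn on the event loop each time SIGHUP is received."""
    loop.add_signal_handler(signal.SIGHUP, reload_fn)


async def demo() -> List[str]:
    """Send SIGHUP to this process and record that the handler ran."""
    loop = asyncio.get_running_loop()
    calls: List[str] = []
    install_sighup_reload(loop, lambda: calls.append("reload"))
    os.kill(os.getpid(), signal.SIGHUP)
    await asyncio.sleep(0.1)  # give the loop a chance to run the handler
    loop.remove_signal_handler(signal.SIGHUP)
    return calls
```

In a real server, `reload_fn` would reload the certificate chain and CA bundle; an operator or automation would then trigger it with `kill -HUP <pid>` or a wrapper such as `systemctl reload`.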

russellb · 2025-02-19T22:52:51Z

I want to clarify one more thing. You should not interpret my comments as a rejection of the PR! I'm not a maintainer and don't have that authority. I'm just stating my gut reaction to the feature and certainly don't mind if the consensus after maintainer review goes toward accepting it!

youngkent · 2025-02-20T00:30:40Z

@russellb thanks for reviewing and sharing your opinion. The file monitoring is limited to SSL key rotation at the moment. Generally, I feel people shouldn't mix up the expectation of model file loading with SSL key rotation. (Do we need to make it more clearly stated through documentation?)

About relying on systemctl reload <service>, I feel it might be better to keep the vLLM service more self-contained, without relying on assumptions like using systemctl to manage it. vLLM services are used in various infra setups, so it would be nice to keep it flexible enough to support different types of setups.

cc: @WoosukKwon @simon-mo to hear about your suggestions.

yeqcharlotte · 2025-02-20T09:46:55Z

vllm/entrypoints/launcher.py

+    watch_ssl_cert_task = None
+    if config.ssl_keyfile and config.ssl_certfile:
+        watch_ssl_cert_task = loop.create_task(
+            watch_files([config.ssl_keyfile, config.ssl_certfile],
+                        update_ssl_cert_chain))
+
+    watch_ssl_ca_task = None
+    if config.ssl_ca_certs:
+        watch_ssl_ca_task = loop.create_task(
+            watch_files([config.ssl_ca_certs], update_ssl_ca))
+


I also feel that explicitly sending SIGHUP has clearer semantics if a reverse proxy is not an option.

In addition, since SSL rotation is irrelevant to most users, I think we should isolate these changes in a dedicated ssl.py module instead of putting them directly in the top-level serve_http entry point.

I also imagine more complexity may be needed to handle edge cases in production.

IIUC, SIGHUP isn't how TW does the key rotation, which we could consider in the long term.
Good point on feature isolation, I can gate the feature and decouple it into a dedicated file.

What is TW?

It's an infrastructure we are trying to integrate with, where we don't have much control over how SSL files are delivered.

russellb · 2025-02-20T14:18:38Z

> About relying on systemctl reload <service>, I feel it might be better to keep vLLM service more self contained without relying assumptions like using systemctl to manage it.

Just to be clear, I'm not suggesting that systemd should be required for this. On the vLLM side, it's just handling the SIGHUP signal. There are different ways to send the signal to trigger a reload, and systemctl reload ... is just one example of where it can be wrapped up in a nice interface, consistent with triggering reloads across multiple system services.

simon-mo

This is well scoped enough to handle common deployment scenarios. I do agree SIGHUP might be a better option if designed from scratch, but this PR offers isolated functionality for certain integrations.

russellb · 2025-02-23T13:16:04Z

I think there's at least a small race condition here, where either the server cert OR the CA cert will be updated, but not both (but both need to be updated). One or more connections could get handled in between updating each file.
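
One way to narrow the window described above is to watch all three files with a single watcher and reload the certificate chain and CA bundle together in one synchronous callback. The sketch below assumes a hypothetical helper `reload_tls_material`; it is not what this PR does. Since the function never awaits, an asyncio event loop should not begin handling a new connection between the two calls. It does not address the files themselves being written at different times; the batching done by `watchfiles` (its `debounce` setting) may help there.

```python
import ssl
from typing import Optional


def reload_tls_material(ctx: ssl.SSLContext, certfile: str, keyfile: str,
                        ca_certs: Optional[str] = None) -> None:
    """Reload the server cert chain and the CA bundle in one step."""
    ctx.load_cert_chain(certfile, keyfile)
    if ca_certs:
        # Note: this adds the CA certs to the context's existing store.
        ctx.load_verify_locations(cafile=ca_certs)
```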

> This is well scoped enough to handle common deployment scenarios. I do agree SIGHUP might be a better option if designed from scratch but this PR offers isolated functionality for certain integrations.

It's a trivial change to use SIGHUP, FWIW.

youngkent · 2025-02-24T02:56:36Z

@russellb Once we have a deployment environment that supports SIGHUP signaling when certs are updated, I think we can definitely extend the functionality here to support SIGHUP mode.

Support SSL key rotation

5bb7c3d

Signed-off-by: Keyun Tong <tongkeyun@gmail.com>

mergify bot added ci/build frontend labels Feb 18, 2025

houseroad reviewed Feb 19, 2025

yeqcharlotte reviewed Feb 20, 2025

youngkent added 2 commits February 20, 2025 22:25

Modularize SSLCertRefresher

68aa180

Signed-off-by: Keyun Tong <tongkeyun@gmail.com>

Add a unit test for SSLCertRefresher

75c70ff

Signed-off-by: Keyun Tong <tongkeyun@gmail.com>

youngkent requested review from DarkLight1337, robertgshaw2-redhat and simon-mo as code owners February 21, 2025 22:21

simon-mo approved these changes Feb 22, 2025

simon-mo added the ready label Feb 22, 2025

simon-mo merged commit 8db1b9d into vllm-project:main Feb 22, 2025
75 of 83 checks passed
