-
-
Notifications
You must be signed in to change notification settings - Fork 701
fix: refill release concurrency token bucket queue when runs resume before checkpoints are created #1933
New issue
Have a question about this project? No Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “No Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? No Sign in to your account
Conversation
…efore checkpoints are created
|
WalkthroughThis change introduces enhancements to the concurrency control mechanisms in the run engine. It adds new methods to the Changes
Sequence Diagram(s)sequenceDiagram
participant WaitpointSystem
participant ReleaseConcurrencySystem
participant ReleaseConcurrencyTokenBucketQueue
participant DequeueSystem
participant Worker
WaitpointSystem->>ReleaseConcurrencySystem: refillTokensForSnapshot(snapshot)
ReleaseConcurrencySystem->>ReleaseConcurrencyTokenBucketQueue: refillTokenIfNotInQueue(...)
ReleaseConcurrencyTokenBucketQueue-->>ReleaseConcurrencySystem: success/failure
DequeueSystem->>ReleaseConcurrencySystem: refillTokensForSnapshot(previousSnapshotId)
ReleaseConcurrencySystem->>ReleaseConcurrencyTokenBucketQueue: refillTokenIfNotInQueue(...)
ReleaseConcurrencyTokenBucketQueue-->>ReleaseConcurrencySystem: success/failure
DequeueSystem->>Worker: notify (after token refill)
Possibly related PRs
Suggested reviewers
Poem
Tip ⚡💬 Agentic Chat (Pro Plan, General Availability)
📜 Recent review detailsConfiguration used: CodeRabbit UI 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
⏰ Context from checks skipped due to timeout of 90000ms (2)
🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (6)
internal-packages/run-engine/src/engine/tests/releaseConcurrencyTokenBucketQueue.test.ts (4)
716-741
: Consider more robust waiting to improve test reliability.
ThesetTimeout(100)
call may cause flakiness on slower environments. Using event-based determinism or longer timeouts might reduce intermittent failures.
745-770
: Revisit short timeouts to avoid flakiness.
Similar to the previous test, relying onsetTimeout(100)
could introduce sporadic test failures in resource-constrained CI environments.
773-805
: Validate queued logic without fixed delays if feasible.
Using a fixed delay to confirm queue processing can be brittle. An alternative approach is polling for state changes or mocking the queue’s executor.
807-824
: Clarify the return value when max tokens are reached.
Currently,refillTokenIfNotInQueue
returnstrue
even when tokens don’t actually increase because the max threshold was reached. Consider returningfalse
or a separate status to signal no net refill occurred.internal-packages/run-engine/src/engine/systems/releaseConcurrencySystem.ts (1)
72-116
: Refill tokens logic is well-organized.
The overload approach and snapshot checks look good. You might consider returning a status (e.g., success/failure) instead of silently returning when the snapshot is missing or in an invalid state, so callers can handle it intentionally.internal-packages/run-engine/src/engine/tests/releaseConcurrency.test.ts (1)
1091-1235
: Comprehensive waitpoint refill test.
This test thoroughly checks that the token bucket is refilled after waitpoints complete and the run resumes. Consider verifying concurrency changes using a more event-based or shorter polling approach to reduce reliance on a fixed 1s timeout.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
internal-packages/run-engine/package.json
(1 hunks)internal-packages/run-engine/src/engine/index.ts
(1 hunks)internal-packages/run-engine/src/engine/releaseConcurrencyTokenBucketQueue.ts
(3 hunks)internal-packages/run-engine/src/engine/systems/dequeueSystem.ts
(2 hunks)internal-packages/run-engine/src/engine/systems/releaseConcurrencySystem.ts
(2 hunks)internal-packages/run-engine/src/engine/systems/waitpointSystem.ts
(2 hunks)internal-packages/run-engine/src/engine/tests/releaseConcurrency.test.ts
(2 hunks)internal-packages/run-engine/src/engine/tests/releaseConcurrencyTokenBucketQueue.test.ts
(1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (2)
internal-packages/run-engine/src/engine/tests/releaseConcurrencyTokenBucketQueue.test.ts (1)
internal-packages/testcontainers/src/index.ts (1)
redisTest
(128-128)
internal-packages/run-engine/src/engine/systems/dequeueSystem.ts (1)
internal-packages/run-engine/src/engine/systems/releaseConcurrencySystem.ts (1)
ReleaseConcurrencySystem
(26-280)
⏰ Context from checks skipped due to timeout of 90000ms (3)
- GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
- GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
- GitHub Check: typecheck / typecheck
🔇 Additional comments (15)
internal-packages/run-engine/src/engine/index.ts (1)
308-308
: Add releaseConcurrencySystem to DequeueSystemThis change correctly integrates the ReleaseConcurrencySystem with the DequeueSystem, enabling concurrency token refill logic when dequeuing runs. This is crucial for the fix as it allows the DequeueSystem to refill concurrency tokens when runs transition states.
internal-packages/run-engine/src/engine/systems/waitpointSystem.ts (2)
513-516
: Improved observability with debug loggingAdding this debug log statement enhances observability by tracking run state during the critical
continueRunIfUnblocked
operation, making it easier to diagnose concurrency-related issues.
551-551
: Ensure concurrency tokens are refilled when resuming runsThis is the key fix for the issue. When a run continues after being blocked by a waitpoint, this line ensures that concurrency tokens are properly refilled before sending the notification to the worker. This prevents concurrency token leakage when runs resume before checkpoints are created.
internal-packages/run-engine/package.json (1)
39-40
: Enhanced test scripts for improved development workflowThe updated test script now includes the
--run
flag for immediate execution, and the newtest:dev
script facilitates continuous testing during development. These changes support the testing of the concurrency token refill functionality introduced in this PR.internal-packages/run-engine/src/engine/systems/dequeueSystem.ts (3)
14-14
: Add import for ReleaseConcurrencySystemAdding this import enables the DequeueSystem to utilize the ReleaseConcurrencySystem for token refill operations.
21-21
: Add ReleaseConcurrencySystem dependency to DequeueSystemThese changes properly integrate the ReleaseConcurrencySystem as a dependency of the DequeueSystem by:
- Adding it to the DequeueSystemOptions interface
- Adding a private member to store the reference
- Assigning it in the constructor
This enables the DequeueSystem to interact with concurrency token management.
Also applies to: 28-28, 34-34
165-169
: Refill concurrency tokens during execution state transitionsThis is a critical part of the fix. When a run transitions from
QUEUED_EXECUTING
toEXECUTING
, this code ensures that concurrency tokens are refilled for the previous snapshot if it exists. This prevents concurrency token leakage when runs resume before checkpoints are created.The code correctly checks for the existence of
previousSnapshotId
before attempting to refill tokens, ensuring robustness.internal-packages/run-engine/src/engine/systems/releaseConcurrencySystem.ts (2)
135-138
: Helpful debug statement.
This new log clarifies when concurrency is skipped.
144-147
: Good logging for development environments.
Logging the snapshot ID offers valuable insight during debugging. No concerns here.internal-packages/run-engine/src/engine/tests/releaseConcurrency.test.ts (2)
6-6
: Import addition looks fine.
AddingEventBusEventArgs
helps type the notified event properly.
1237-1505
: Excellent multi-run concurrency scenario.
This adds meaningful coverage for partial concurrency reacquisition and queued re-execution. Implementation appears robust.internal-packages/run-engine/src/engine/releaseConcurrencyTokenBucketQueue.ts (4)
262-270
: Clean implementation of queue metrics retrieval.This method provides a useful way to retrieve the current token count and queue length for a release queue, which is valuable for monitoring and diagnostics. The implementation correctly handles potential null values from Redis.
272-313
: Well-designed conditional token refill mechanism.This method implements the core functionality needed to fix the issue described in the PR title. It properly checks if the releaser is not in the queue before refilling a token, handling edge cases like zero max tokens, and providing appropriate logging. The boolean return value makes it easy for callers to determine if the refill was successful.
840-883
: Solid implementation of the Lua script for conditional token refilling.The Lua script correctly:
- Checks if the releaser ID is in the queue
- Only refills a token if the ID is not found
- Enforces the maximum token limit
- Cleans up metadata
- Updates the master queue based on queue length
The implementation is thorough and handles all necessary edge cases.
941-950
: Appropriate extension of the Redis command interface.The interface extension correctly defines the parameters and return type for the new Redis command, following the same pattern as the other commands in the interface.
This PR fixes an issue where the release concurrency queue could get permanently stuck because runs that were in EXECUTING_WITH_WAITPOINT snapshot status would get resumed before the checkpoint was created. This would cause a token of the release concurrency bucket to be consumed (when the run was first blocked), but the token would never get returned, to the bucket, because the checkpoint would never get created. This now adds token returning when a run goes from EXECUTING_WITH_WAITPOINT -> EXECUTING, and EXECUTING_WITH_WAITPOINT -> QUEUED_EXECUTING -> EXECUTING, but only if the snapshot is not in the release concurrency queue.
Summary by CodeRabbit