Job Lifecycle
Every job in Ratchet follows a well-defined state machine from creation to terminal state. Understanding these states and transitions is essential for building reliable job workflows.
State Machine
                ┌──────────┐
     ┌──────────│ PENDING  │◄─────────────────────┐
     │          └────┬─────┘                      │
     │               │                            │
pauseJob()    Poller claims job            scheduleRetry()
     │         (SKIP LOCKED)              (retries remain)
     │               ▼                            │
     │          ┌──────────┐                      │
     │          │ RUNNING  │──────────────────────┤
     │          └──┬───┬───┘                      │
     │             │   │                          │
     │        success failure                     │
     │             │   │                          │
     │             ▼   ▼                          │
     │    ┌─────────┐ ┌────────┐                  │
     │    │SUCCEEDED│ │ FAILED │──────────────────┘
     │    └─────────┘ └────┬───┘
     │                     │
     │                retryJob()
     │              (manual reset)
     │                     │
     ▼                     ▼
┌──────────┐        Back to PENDING
│  PAUSED  │
└──────────┘
     │
resumeJob()
     │
     ▼
Original state
(PENDING or FAILED)

cancelJob() can reach CANCELED from PENDING or RUNNING:
┌──────────┐
│ CANCELED │ (terminal)
└──────────┘
States
PENDING
The job is queued and waiting for execution. A PENDING job becomes visible to the Poller when its scheduled_time <= now. Jobs start in this state when submitted.
- Visible to Poller: Yes, when scheduled time has passed
- Transitions to: RUNNING (claimed by worker), PAUSED (via pauseJob()), CANCELED (via cancelJob())
RUNNING
A worker has claimed the job and is actively executing it. The picked_by field records which node owns the job, and optimistic locking (@Version) prevents duplicate execution.
- Visible to Poller: No
- Transitions to: SUCCEEDED (execution completes), FAILED (exception thrown or timeout), CANCELED (via cancelJob() -- checked mid-execution)
- Guard: Only one node can hold a RUNNING job at a time
SUCCEEDED
The job completed without throwing an exception. This is a terminal state. Succeeded jobs may trigger dependent workflow branches or chain steps.
- Terminal: Yes
- Eligible for archival: Yes, after retention period
FAILED
The job threw an exception during execution. A FAILED job may or may not have retries remaining:
- If retries remain: The engine schedules a retry (back to PENDING with a backoff delay). The job entity stays in FAILED only momentarily during the transition.
- If retries exhausted: The job is permanently FAILED and moved to the Dead Letter Queue. This is a terminal state.
- If @DoNotRetry applies: The engine skips retries entirely and moves the job directly to the DLQ.
Transitions:
- Back to PENDING: Automatic retry (retries remain) or manual retryJob() call
- To PAUSED: Via pauseJob() (records pausedFromStatus = FAILED)
PAUSED
The job is temporarily suspended and invisible to the Poller. The paused_from_status column records the state the job had before pausing, so it can be accurately restored.
- Visible to Poller: No
- Transitions to: Previous state via resumeJob() -- restores PENDING or FAILED
- Idempotent: Pausing an already-paused job returns true without error
CANCELED
The job was explicitly canceled and will not execute. This is a terminal state. Canceling a RUNNING job sets the status; the executor checks status mid-flight and discards results.
- Terminal: Yes
- Cascading: Canceling a chain step cancels all downstream dependents
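The transitions listed for each state can be summarized as a lookup table. The sketch below encodes only the "Transitions to" bullets above; `JobState` and `canTransitionTo` are hypothetical names for illustration, not Ratchet types.

```java
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Hypothetical enum for illustration; not Ratchet's actual status type.
enum JobState {
    PENDING, RUNNING, SUCCEEDED, FAILED, PAUSED, CANCELED;

    // Allowed transitions, taken from the per-state bullets above.
    private static final Map<JobState, Set<JobState>> ALLOWED = Map.of(
        PENDING,   EnumSet.of(RUNNING, PAUSED, CANCELED),
        RUNNING,   EnumSet.of(SUCCEEDED, FAILED, CANCELED),
        FAILED,    EnumSet.of(PENDING, PAUSED),      // automatic or manual retry, pause
        PAUSED,    EnumSet.of(PENDING, FAILED),      // resume restores the previous state
        SUCCEEDED, EnumSet.noneOf(JobState.class),   // terminal
        CANCELED,  EnumSet.noneOf(JobState.class)    // terminal
    );

    boolean canTransitionTo(JobState next) {
        return ALLOWED.get(this).contains(next);
    }
}
```

For example, `JobState.RUNNING.canTransitionTo(JobState.PAUSED)` is `false`, which matches the rule that RUNNING jobs cannot be paused.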
Transition Details
Submission to PENDING
When you call submit() on a builder, the engine:
- Analyzes the lambda to extract target class, method, and arguments
- Converts that metadata into a persisted job payload via the active JobInvocationResolver
- Checks the idempotency key for duplicates (globally unique, forever)
- Checks the business key for active conflicts (unique among PENDING/RUNNING jobs)
- Persists the JobEntity with status PENDING
- For immediate or CRITICAL-priority jobs, publishes a wakeup notification via ClusterCoordinator
JobHandle handle = scheduler.enqueue(() -> service.process(id))
.withIdempotencyKey(requestId) // prevents duplicate submission
.withBusinessKey("process-" + id) // prevents concurrent processing
.submit();
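The two key checks can be pictured with a small in-memory model. This is only a sketch of the rules stated above, not Ratchet's implementation: `SubmissionGuard`, `trySubmit`, and `onTerminal` are hypothetical names, and returning `false` on a conflict is an assumption (the engine may instead throw or return the existing job).

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical in-memory model of the two uniqueness rules; not a Ratchet class.
final class SubmissionGuard {
    // Idempotency keys are remembered forever (never removed).
    private final Set<String> seenIdempotencyKeys = new HashSet<>();
    // Business keys are tracked only while a job is PENDING or RUNNING.
    private final Set<String> activeBusinessKeys = new HashSet<>();

    // ASSUMPTION: a conflict is reported by returning false.
    boolean trySubmit(String idempotencyKey, String businessKey) {
        if (seenIdempotencyKeys.contains(idempotencyKey)) {
            return false; // duplicate submission
        }
        if (activeBusinessKeys.contains(businessKey)) {
            return false; // another active job holds this business key
        }
        seenIdempotencyKeys.add(idempotencyKey);
        activeBusinessKeys.add(businessKey);
        return true;
    }

    // Called when the job leaves PENDING/RUNNING; frees the business key only.
    void onTerminal(String businessKey) {
        activeBusinessKeys.remove(businessKey);
    }
}
```

The asymmetry is the point: a business key becomes reusable once the job that held it is no longer active, while an idempotency key never does.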
PENDING to RUNNING (Claim)
The Poller executes a query like:
SELECT * FROM scheduler_job
WHERE status = 'PENDING'
AND scheduled_time <= NOW()
ORDER BY (priority + age_boost) DESC, scheduled_time ASC
FOR UPDATE SKIP LOCKED
LIMIT :batchSize
age_boost is computed from the configured priority-boost interval, so old low-priority work can outrank newer high-priority work.
SKIP LOCKED is critical -- it allows multiple nodes to poll concurrently without blocking each other. Each node claims a non-overlapping set of jobs. The claimed jobs are atomically updated:
- status = RUNNING
- picked_by = node ID
- picked_at = current timestamp
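To see how an age boost lets old work overtake newer high-priority work, the ordering clause can be mimicked in memory. The boost formula here is an assumption made for illustration (one priority point per full boost interval spent waiting); the text above only says that the boost derives from the configured interval. `QueuedJob` and `ClaimOrder` are hypothetical names.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical row model; fields mirror the columns used in the query above.
final class QueuedJob {
    final String id;
    final int priority;
    final Instant scheduledTime;

    QueuedJob(String id, int priority, Instant scheduledTime) {
        this.id = id;
        this.priority = priority;
        this.scheduledTime = scheduledTime;
    }
}

final class ClaimOrder {
    // ASSUMPTION: one priority point per full boost interval spent waiting.
    static long ageBoost(QueuedJob job, Instant now, Duration boostInterval) {
        long waitedMillis = Duration.between(job.scheduledTime, now).toMillis();
        return Math.max(0, waitedMillis / boostInterval.toMillis());
    }

    // Mirrors ORDER BY (priority + age_boost) DESC, scheduled_time ASC.
    static List<QueuedJob> order(List<QueuedJob> due, Instant now, Duration boostInterval) {
        Comparator<QueuedJob> byEffectivePriority = Comparator
            .comparingLong((QueuedJob j) -> j.priority + ageBoost(j, now, boostInterval))
            .reversed();
        List<QueuedJob> sorted = new ArrayList<>(due);
        sorted.sort(byEffectivePriority.thenComparing((QueuedJob j) -> j.scheduledTime));
        return sorted;
    }
}
```

With a 10-minute interval, a priority-1 job that has waited an hour gets an effective priority of 7 and sorts ahead of a fresh priority-5 job.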
RUNNING to SUCCEEDED
When the job method returns normally:
- Execution timing is recorded (start, end, duration, queue wait)
- Return value is serialized to JSON (if non-void)
- Status atomically transitions RUNNING -> SUCCEEDED via markJobSucceeded()
- JobCompletedEvent is published
- Post-execution handler triggers:
- For batch children: updates parent batch progress
- For chain steps: schedules next step
- For workflow branches: evaluates conditions and schedules matching branches
- Success callback (onSuccess) is invoked if configured
RUNNING to FAILED (with Retry)
When the job throws an exception and retries remain:
- Attempt counter is atomically incremented
- The exception class is checked for @DoNotRetry
- RetryPolicy.shouldRetry() is consulted
- Backoff delay is calculated (see Retry Strategies)
- Job is rescheduled: scheduled_time = now + backoff, status back to PENDING
- JobRetryingEvent is published
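The retry decision above reduces to a small function. The sketch below assumes exponential backoff and treats the maximum as a total number of executions; both are illustrative choices, since the real policies live in Retry Strategies. `RetrySketch` and its methods are hypothetical names and do not reflect the signature of RetryPolicy.shouldRetry().

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Optional;

// Hypothetical helper; not part of Ratchet's API.
final class RetrySketch {
    // ASSUMPTION: exponential backoff, base * 2^(attempt - 1).
    static Duration backoff(int attempt, Duration base) {
        int shift = Math.min(attempt - 1, 20); // cap the exponent to avoid overflow
        return base.multipliedBy(1L << shift);
    }

    // Returns the new scheduled_time when a retry is allowed,
    // or empty when the job should go to the DLQ instead.
    static Optional<Instant> nextRun(int attemptsSoFar, int maxAttempts,
                                     boolean doNotRetry, Instant now, Duration base) {
        int attempt = attemptsSoFar + 1;              // increment the attempt counter
        if (doNotRetry || attempt >= maxAttempts) {   // @DoNotRetry or retries exhausted
            return Optional.empty();
        }
        return Optional.of(now.plus(backoff(attempt, base))); // scheduled_time = now + backoff
    }
}
```

With a 10-second base and a maximum of 3, the first failure reschedules after 10 seconds, the second after 20 seconds, and the third returns empty.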
RUNNING to FAILED (Terminal -- DLQ)
When retries are exhausted or @DoNotRetry applies:
- Status transitions RUNNING -> FAILED via compare-and-swap
- Error message is sanitized via ErrorSanitizer SPI
- DeadLetterService.moveToDlq() records the alert with deduplication
- JobDlqEvent is published
- For batch children: parent batch progress is updated (failure)
- For chain/workflow: downstream evaluation occurs (FAILURE branches may fire)
- Failure callback (onFailure) is invoked if configured
Pause and Resume
Pausing suspends a job without losing its state:
scheduler.pauseJob(jobId); // PENDING or FAILED -> PAUSED
scheduler.resumeJob(jobId); // PAUSED -> original state
The paused_from_status field preserves context:
- A paused PENDING job resumes to PENDING (eligible for polling again)
- A paused FAILED job resumes to FAILED (can then be manually retried)
Only PENDING and FAILED jobs can be paused. RUNNING jobs cannot be paused -- cancel them instead.
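The role of paused_from_status is easiest to see in a toy model. `PausableJob` below is a hypothetical in-memory stand-in for the job entity. The return values for rejected calls (`false`) are assumptions; only the idempotent `true` for an already-paused job comes from the description above.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical in-memory model; not Ratchet's JobEntity.
final class PausableJob {
    enum Status { PENDING, RUNNING, SUCCEEDED, FAILED, PAUSED, CANCELED }

    private static final Set<Status> PAUSABLE = EnumSet.of(Status.PENDING, Status.FAILED);

    Status status;
    Status pausedFromStatus; // null unless the job is currently paused

    PausableJob(Status status) {
        this.status = status;
    }

    boolean pause() {
        if (status == Status.PAUSED) {
            return true;                 // idempotent: already paused
        }
        if (!PAUSABLE.contains(status)) {
            return false;                // e.g. RUNNING jobs cannot be paused
        }
        pausedFromStatus = status;       // remember where to return to
        status = Status.PAUSED;
        return true;
    }

    boolean resume() {
        if (status != Status.PAUSED) {
            return false;
        }
        status = pausedFromStatus;       // restores PENDING or FAILED
        pausedFromStatus = null;
        return true;
    }
}
```

Pausing twice does not overwrite the remembered state, so a FAILED job that is paused, paused again, and resumed comes back as FAILED.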
Manual Retry
For jobs in the Dead Letter Queue, retryJob() provides manual recovery:
scheduler.retryJob(jobId);
This:
- Resets the attempt counter to 0
- Clears error information
- Sets scheduled_time to now
- Transitions FAILED -> PENDING
Only FAILED jobs can be retried. The job becomes immediately eligible for polling.
Cancellation
scheduler.cancelJob(jobId);
Behavior depends on current state:
- PENDING: Immediately transitions to CANCELED
- RUNNING: Sets status to CANCELED. The executor periodically checks wasJobCanceledDuringExecution() and discards results if true
- Terminal states: Returns false (cannot cancel completed jobs)
For chain steps, cancellation cascades to all downstream dependents using depth-first traversal.
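A depth-first cascade over downstream dependents might look like the following. This is a sketch over a hypothetical in-memory dependency map (`downstream`, from a job ID to the IDs of its direct dependents); how Ratchet stores and walks these links is not specified here.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper; not part of Ratchet's API.
final class CascadeCancel {
    // Returns the canceled job plus every job reachable downstream of it.
    static Set<String> collectToCancel(String rootId, Map<String, List<String>> downstream) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> stack = new ArrayDeque<>();
        stack.push(rootId);
        while (!stack.isEmpty()) {
            String id = stack.pop();
            if (!visited.add(id)) {
                continue; // already handled (shared dependent)
            }
            for (String child : downstream.getOrDefault(id, List.of())) {
                stack.push(child); // explicit stack gives depth-first order
            }
        }
        return visited;
    }
}
```

The visited set matters when two steps share a dependent: it is collected once, and the traversal stays bounded.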
Optimistic Locking
The JobEntity uses JPA @Version for optimistic locking. When two nodes attempt to modify the same job concurrently, one will get an OptimisticLockException. Combined with SKIP LOCKED during claiming, this ensures that a job is never executed by two nodes at once:
- SKIP LOCKED prevents two nodes from claiming the same job
- @Version prevents stale updates if a race occurs during status transitions
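The version check can be illustrated without a database. In the sketch below, a compare-and-set on an immutable snapshot plays the role of the versioned UPDATE that JPA issues for a @Version entity; a stale writer gets `false` here, where JPA would raise OptimisticLockException. `VersionedRow` and `Snapshot` are hypothetical names.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical in-memory stand-in for a versioned job row.
final class VersionedRow {
    static final class Snapshot {
        final String status;
        final long version;

        Snapshot(String status, long version) {
            this.status = status;
            this.version = version;
        }
    }

    private final AtomicReference<Snapshot> row =
        new AtomicReference<>(new Snapshot("PENDING", 0));

    Snapshot read() {
        return row.get();
    }

    // Succeeds only if nobody has updated the row since `seen` was read.
    boolean update(Snapshot seen, String newStatus) {
        return row.compareAndSet(seen, new Snapshot(newStatus, seen.version + 1));
    }
}
```

If two callers read the same snapshot and both try to update, only the first succeeds. The second must re-read the row before deciding what to do.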
Orphan Recovery
If a node crashes while executing a job, the job remains in RUNNING state with no node to complete it. The OrphanRecoveryTimer periodically scans for stale RUNNING jobs (based on picked_at timestamp) and resets them to PENDING for re-execution.
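The stale-job scan amounts to a cutoff on picked_at. The sketch below assumes a fixed staleness threshold (`staleAfter`); the actual criterion and its configuration are not described here, and `RunningJob` and `OrphanScan` are hypothetical names.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical view of a RUNNING row; not Ratchet's JobEntity.
final class RunningJob {
    final String id;
    final Instant pickedAt;

    RunningJob(String id, Instant pickedAt) {
        this.id = id;
        this.pickedAt = pickedAt;
    }
}

final class OrphanScan {
    // ASSUMPTION: a job is stale once picked_at is older than now - staleAfter.
    static List<String> findStale(List<RunningJob> running, Instant now, Duration staleAfter) {
        Instant cutoff = now.minus(staleAfter);
        return running.stream()
            .filter(job -> job.pickedAt.isBefore(cutoff))
            .map(job -> job.id)
            .collect(Collectors.toList());
    }
}
```

Because a recovered job is executed again, the job method may run more than once after a crash, so jobs with side effects are safest when they are idempotent.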
Archival
Completed jobs (SUCCEEDED and FAILED) are eligible for archival after a configurable retention period. The JobArchivingService moves old jobs from scheduler_job to scheduler_job_archive, keeping the active table lean for efficient polling.
Related
- Execution Model -- How the Poller and executor work together
- Error Handling -- Detailed retry and DLQ mechanics
- Retry Strategies -- Backoff policies and custom retry logic