Job Lifecycle

Every job in Ratchet follows a well-defined state machine from creation to terminal state. Understanding these states and transitions is essential for building reliable job workflows.

State Machine

                    ┌──────────┐
             ┌─────│ PENDING  │◄────────────────────┐
             │     └────┬─────┘                     │
             │          │                           │
        pauseJob()      │ Poller claims job    scheduleRetry()
             │          │ (SKIP LOCKED)        (retries remain)
             │          ▼                           │
             │     ┌──────────┐                     │
             │     │ RUNNING  │─────────────────────┤
             │     └──┬───┬───┘                     │
             │        │   │                         │
             │   success  failure                   │
             │        │   │                         │
             │        ▼   ▼                         │
             │  ┌─────────┐  ┌────────┐             │
             │  │SUCCEEDED│  │ FAILED │─────────────┘
             │  └─────────┘  └────┬───┘
             │                    │
             │               retryJob()
             │               (manual reset)
             │                    │
             ▼                    ▼
        ┌──────────┐        Back to PENDING
        │  PAUSED  │
        └──────────┘
             │
        resumeJob()
             │
             ▼
        Original state
        (PENDING or FAILED)

   cancelJob() can reach CANCELED from PENDING or RUNNING:

        ┌──────────┐
        │ CANCELED │  (terminal)
        └──────────┘
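The diagram can also be read as a transition table. The following is a minimal sketch of that table, assuming a hypothetical `LifecycleState` enum written for illustration only; it is not Ratchet's own type, and the RUNNING to PENDING edge stands for the automatic retry path described later on this page.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical model of the lifecycle states; not Ratchet's actual enum. */
enum LifecycleState {
    PENDING, RUNNING, SUCCEEDED, FAILED, PAUSED, CANCELED;

    /** States reachable from this one, following the diagram above. */
    Set<LifecycleState> next() {
        switch (this) {
            case PENDING:
                return EnumSet.of(RUNNING, PAUSED, CANCELED);
            case RUNNING:
                // PENDING covers the automatic retry path (retries remain).
                return EnumSet.of(SUCCEEDED, FAILED, PENDING, CANCELED);
            case FAILED:
                return EnumSet.of(PENDING, PAUSED);
            case PAUSED:
                // resumeJob() restores whichever state was paused.
                return EnumSet.of(PENDING, FAILED);
            default:
                // SUCCEEDED and CANCELED have no outgoing transitions.
                return EnumSet.noneOf(LifecycleState.class);
        }
    }

    boolean canTransitionTo(LifecycleState target) {
        return next().contains(target);
    }
}
```

For example, `LifecycleState.RUNNING.canTransitionTo(LifecycleState.PAUSED)` is false, which matches the rule that RUNNING jobs cannot be paused.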

States

PENDING

The job is queued and waiting for execution. A PENDING job becomes visible to the Poller when its scheduled_time <= now. Jobs start in this state when submitted.

Visible to Poller: Yes, when scheduled time has passed
Transitions to: RUNNING (claimed by worker), PAUSED (via pauseJob()), CANCELED (via cancelJob())

RUNNING

A worker has claimed the job and is actively executing it. The picked_by field records which node owns the job, and optimistic locking (@Version) prevents duplicate execution.

Visible to Poller: No
Transitions to: SUCCEEDED (execution completes), FAILED (exception thrown or timeout), CANCELED (via cancelJob() -- checked mid-execution)
Guard: Only one node can hold a RUNNING job at a time

SUCCEEDED

The job completed without throwing an exception. This is a terminal state. Succeeded jobs may trigger dependent workflow branches or chain steps.

Terminal: Yes
Eligible for archival: Yes, after retention period

FAILED

The job threw an exception during execution. A FAILED job may or may not have retries remaining:

If retries remain: The engine schedules a retry (back to PENDING with a backoff delay). The job entity stays in FAILED only momentarily during the transition.
If retries exhausted: The job is permanently FAILED and moved to the Dead Letter Queue. This is a terminal state unless the job is reset manually with retryJob().
If @DoNotRetry: Skips retries entirely, moves directly to DLQ.

Transitions:

Back to PENDING: Automatic retry (retries remain) or manual retryJob() call
To PAUSED: Via pauseJob() (records pausedFromStatus = FAILED)

PAUSED

The job is temporarily suspended and invisible to the Poller. The paused_from_status column records the state the job had before pausing, so it can be accurately restored.

Visible to Poller: No
Transitions to: Previous state via resumeJob() -- restores PENDING or FAILED
Idempotent: Pausing an already-paused job returns true without error

CANCELED

The job was explicitly canceled and will not execute. This is a terminal state. Canceling a RUNNING job sets the status; the executor checks status mid-flight and discards results.

Terminal: Yes
Cascading: Canceling a chain step cancels all downstream dependents

Transition Details

Submission to PENDING

When you call submit() on a builder, the engine:

Analyzes the lambda to extract target class, method, and arguments
Converts that metadata into a persisted job payload via the active JobInvocationResolver
Checks the idempotency key for duplicates (globally unique, forever)
Checks the business key for active conflicts (unique among PENDING/RUNNING jobs)
Persists the JobEntity with status PENDING
For immediate or CRITICAL-priority jobs, publishes a wakeup notification via ClusterCoordinator

JobHandle handle = scheduler.enqueue(() -> service.process(id))
    .withIdempotencyKey(requestId)  // prevents duplicate submission
    .withBusinessKey("process-" + id)  // prevents concurrent processing
    .submit();
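The difference between the two keys is their lifetime: an idempotency key is remembered forever, while a business key only blocks while a job holding it is PENDING or RUNNING. The sketch below models that difference in memory. `SubmissionGuard` and its methods are hypothetical names, and rejecting with `false` is an assumption; the engine's actual response to a duplicate is not specified here.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical in-memory stand-in for the duplicate checks; not Ratchet's code. */
class SubmissionGuard {
    // Idempotency keys are never removed: globally unique, forever.
    private final Set<String> seenIdempotencyKeys = new HashSet<>();
    // Business keys are held only while the owning job is PENDING or RUNNING.
    private final Map<String, String> activeBusinessKeys = new HashMap<>();

    /** Returns true if the submission is accepted, false if it is rejected. */
    boolean tryAccept(String jobId, String idempotencyKey, String businessKey) {
        if (seenIdempotencyKeys.contains(idempotencyKey)) {
            return false; // this exact request was submitted before
        }
        if (activeBusinessKeys.containsKey(businessKey)) {
            return false; // another active job already holds this business key
        }
        seenIdempotencyKeys.add(idempotencyKey);
        activeBusinessKeys.put(businessKey, jobId);
        return true;
    }

    /** Call when a job leaves PENDING/RUNNING; frees only the business key. */
    void onJobNoLongerActive(String businessKey) {
        activeBusinessKeys.remove(businessKey);
    }
}
```

After `onJobNoLongerActive("process-42")`, a new request with a fresh idempotency key and the same business key is accepted, but replaying the original idempotency key is still rejected.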

PENDING to RUNNING (Claim)

The Poller executes a query like:

SELECT * FROM scheduler_job
WHERE status = 'PENDING'
  AND scheduled_time <= NOW()
ORDER BY (priority + age_boost) DESC, scheduled_time ASC
LIMIT :batchSize
FOR UPDATE SKIP LOCKED

age_boost is computed from the configured priority-boost interval, so long-waiting low-priority work can eventually outrank newer high-priority work.

SKIP LOCKED is critical: it lets multiple nodes poll concurrently without blocking each other, and each node claims a non-overlapping set of jobs. The claimed jobs are then updated atomically:

status = RUNNING
picked_by = node ID
picked_at = current timestamp
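To make the ordering concrete, here is a rough sketch of the ORDER BY clause in plain Java. The exact age_boost formula is not given on this page, so the sketch assumes one extra priority point per full boost interval spent waiting; `ClaimOrdering`, `Candidate`, and `effectivePriority` are hypothetical names, and the boost interval is assumed to be positive.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch of the claim ordering; not Ratchet's actual query code. */
class ClaimOrdering {

    /** Only the columns the ORDER BY clause needs. */
    static final class Candidate {
        final String id;
        final int priority;
        final Instant scheduledTime;

        Candidate(String id, int priority, Instant scheduledTime) {
            this.id = id;
            this.priority = priority;
            this.scheduledTime = scheduledTime;
        }
    }

    /** ASSUMPTION: one extra priority point per full boost interval spent waiting. */
    static long effectivePriority(Candidate c, Instant now, Duration boostInterval) {
        long waitedMillis = Math.max(0L, Duration.between(c.scheduledTime, now).toMillis());
        return c.priority + waitedMillis / boostInterval.toMillis();
    }

    /** Mirrors ORDER BY (priority + age_boost) DESC, scheduled_time ASC LIMIT batchSize. */
    static List<Candidate> claimOrder(List<Candidate> due, Instant now,
                                      Duration boostInterval, int batchSize) {
        Comparator<Candidate> byBoostedPriorityDesc = Comparator
                .comparingLong((Candidate c) -> effectivePriority(c, now, boostInterval))
                .reversed();
        List<Candidate> sorted = new ArrayList<>(due);
        sorted.sort(byBoostedPriorityDesc.thenComparing((Candidate c) -> c.scheduledTime));
        return new ArrayList<>(sorted.subList(0, Math.min(batchSize, sorted.size())));
    }
}
```

Under this assumed formula, a priority-1 job that has waited ten one-minute intervals has an effective priority of 11 and is claimed ahead of a priority-5 job that has only just become due.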

RUNNING to SUCCEEDED

When the job method returns normally:

Execution timing is recorded (start, end, duration, queue wait)
Return value is serialized to JSON (if non-void)
Status atomically transitions RUNNING -> SUCCEEDED via markJobSucceeded()
JobCompletedEvent is published
Post-execution handler triggers:
- For batch children: updates parent batch progress
- For chain steps: schedules next step
- For workflow branches: evaluates conditions and schedules matching branches
Success callback (onSuccess) is invoked if configured

RUNNING to FAILED (with Retry)

When the job throws an exception and retries remain:

Attempt counter is atomically incremented
The exception class is checked for @DoNotRetry (if present, the job goes straight to the DLQ)
RetryPolicy.shouldRetry() is consulted
Backoff delay is calculated (see Retry Strategies)
Job is rescheduled: scheduled_time = now + backoff, status back to PENDING
JobRetryingEvent is published
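The decision part of these steps can be sketched as a single function. Everything in this sketch is illustrative: `NoRetrySketch` is a stand-in for @DoNotRetry declared locally so the example compiles, `RetryDecision.nextDelay` is a hypothetical helper rather than the RetryPolicy API, and the exponential backoff is an assumed policy (see Retry Strategies for the real options).

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.time.Duration;
import java.util.Optional;

/** Stand-in for Ratchet's @DoNotRetry, declared here only so the sketch compiles. */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface NoRetrySketch {}

/** Example of an exception type that should never be retried. */
@NoRetrySketch
class InvalidPayloadException extends RuntimeException {}

class RetryDecision {
    /**
     * Returns the backoff before the next attempt, or empty if the job
     * should go to the DLQ instead.
     */
    static Optional<Duration> nextDelay(int attemptsSoFar, int maxAttempts,
                                        Throwable error, Duration baseDelay) {
        int attempt = attemptsSoFar + 1; // count the attempt that just failed
        if (error.getClass().isAnnotationPresent(NoRetrySketch.class)) {
            return Optional.empty(); // marked as non-retryable
        }
        if (attempt >= maxAttempts) {
            return Optional.empty(); // retries exhausted
        }
        // ASSUMED policy: exponential backoff, base * 2^(attempt - 1), capped exponent.
        int exponent = Math.min(attempt - 1, 20);
        return Optional.of(baseDelay.multipliedBy(1L << exponent));
    }
}
```

When a delay is returned, the job would be rescheduled with scheduled_time = now + delay and status PENDING; an empty result corresponds to the terminal path in the next section.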

RUNNING to FAILED (Terminal -- DLQ)

When retries are exhausted or @DoNotRetry applies:

Status transitions RUNNING -> FAILED via compare-and-swap
Error message is sanitized via ErrorSanitizer SPI
DeadLetterService.moveToDlq() records the alert with deduplication
JobDlqEvent is published
For batch children: parent batch progress is updated (failure)
For chain/workflow: downstream evaluation occurs (FAILURE branches may fire)
Failure callback (onFailure) is invoked if configured

Pause and Resume

Pausing suspends a job without losing its state:

scheduler.pauseJob(jobId);   // PENDING or FAILED -> PAUSED
scheduler.resumeJob(jobId);  // PAUSED -> original state

The paused_from_status field preserves context:

A paused PENDING job resumes to PENDING (eligible for polling again)
A paused FAILED job resumes to FAILED (can then be manually retried)

Only PENDING and FAILED jobs can be paused. RUNNING jobs cannot be paused -- cancel them instead.
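The role of paused_from_status is easiest to see in a small model. This is a sketch under assumptions, not the engine's implementation: `PausableJob` is a hypothetical class, and returning `false` for a job that cannot be paused is a guess at the behavior (the real API might throw instead).

```java
/** Hypothetical model of pause/resume bookkeeping; not Ratchet's JobEntity. */
class PausableJob {
    enum Status { PENDING, RUNNING, SUCCEEDED, FAILED, PAUSED, CANCELED }

    private Status status;
    private Status pausedFromStatus; // mirrors the paused_from_status column

    PausableJob(Status initial) {
        this.status = initial;
    }

    Status status() {
        return status;
    }

    boolean pause() {
        if (status == Status.PAUSED) {
            return true; // idempotent: already paused
        }
        if (status != Status.PENDING && status != Status.FAILED) {
            return false; // ASSUMPTION: other states are rejected with false
        }
        pausedFromStatus = status; // remember where to return to
        status = Status.PAUSED;
        return true;
    }

    boolean resume() {
        if (status != Status.PAUSED) {
            return false;
        }
        status = pausedFromStatus; // restore PENDING or FAILED
        pausedFromStatus = null;
        return true;
    }
}
```

Pausing twice does not overwrite the remembered state, so a FAILED job that is paused, paused again, and then resumed still returns to FAILED.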

Manual Retry

For jobs in the Dead Letter Queue, retryJob() provides manual recovery:

scheduler.retryJob(jobId);

This:

Resets the attempt counter to 0
Clears error information
Sets scheduled_time to now
Transitions FAILED -> PENDING

Only FAILED jobs can be retried. The job becomes immediately eligible for polling.

Cancellation

scheduler.cancelJob(jobId);

Behavior depends on current state:

PENDING: Immediately transitions to CANCELED
RUNNING: Sets status to CANCELED. The executor periodically checks wasJobCanceledDuringExecution() and discards results if true
Terminal states: Returns false (cannot cancel completed jobs)

For chain steps, cancellation cascades to all downstream dependents using depth-first traversal.
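A depth-first cascade of this kind can be sketched with an explicit stack. The `CancelCascade` class and the dependents map (job ID to the IDs of its direct downstream steps) are hypothetical; how Ratchet actually stores and walks dependencies is not described on this page.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of a cascading cancel; not Ratchet's traversal code. */
class CancelCascade {
    /** Returns the job plus every downstream dependent, in depth-first visit order. */
    static Set<String> collect(String rootJobId, Map<String, List<String>> dependents) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> stack = new ArrayDeque<>();
        stack.push(rootJobId);
        while (!stack.isEmpty()) {
            String id = stack.pop();
            if (!visited.add(id)) {
                continue; // already collected (two branches can share a dependent)
            }
            for (String child : dependents.getOrDefault(id, Collections.emptyList())) {
                stack.push(child);
            }
        }
        return visited;
    }
}
```

Each ID in the returned set would then be moved to CANCELED. Tracking visited IDs keeps the walk from canceling a shared dependent twice.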

Optimistic Locking

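The JobEntity uses JPA @Version for optimistic locking. When two nodes attempt to modify the same job concurrently, one of them gets an OptimisticLockException. Combined with SKIP LOCKED during claiming, this prevents two nodes from executing the same job at the same time. It is not a strict exactly-once guarantee, because orphan recovery can re-run a job whose node crashed mid-execution. The two mechanisms cover different races: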

SKIP LOCKED prevents two nodes from claiming the same job
@Version prevents stale updates if a race occurs during status transitions
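The version check behaves like a compare-and-swap on the row. The sketch below imitates it in memory with an AtomicReference; `VersionedJob` and `Snapshot` are hypothetical names, and where this sketch returns `false`, a JPA provider would raise OptimisticLockException.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical in-memory imitation of a versioned row; not Ratchet's JobEntity. */
class VersionedJob {

    /** An immutable view of the row as one reader saw it. */
    static final class Snapshot {
        final String status;
        final long version;

        Snapshot(String status, long version) {
            this.status = status;
            this.version = version;
        }
    }

    private final AtomicReference<Snapshot> row =
            new AtomicReference<>(new Snapshot("RUNNING", 0));

    Snapshot read() {
        return row.get();
    }

    /** Applies the update only if nobody has written since `expected` was read. */
    boolean update(Snapshot expected, String newStatus) {
        Snapshot next = new Snapshot(newStatus, expected.version + 1);
        return row.compareAndSet(expected, next);
    }
}
```

If two callers read the same snapshot and both try to update, the first write succeeds and bumps the version, and the second is rejected because its snapshot is stale.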

Orphan Recovery

If a node crashes while executing a job, the job remains in RUNNING state with no node to complete it. The OrphanRecoveryTimer periodically scans for stale RUNNING jobs (based on picked_at timestamp) and resets them to PENDING for re-execution.
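The scan itself amounts to a cutoff comparison on picked_at. Here is a minimal sketch, assuming a job counts as stale once picked_at is older than a configurable threshold; `OrphanScan`, `RunningJob`, and the `staleAfter` parameter are hypothetical, and the real timer may use additional signals such as node heartbeats.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical sketch of the stale-job scan; not the OrphanRecoveryTimer itself. */
class OrphanScan {

    /** Only the columns the scan needs from a RUNNING row. */
    static final class RunningJob {
        final String id;
        final Instant pickedAt;

        RunningJob(String id, Instant pickedAt) {
            this.id = id;
            this.pickedAt = pickedAt;
        }
    }

    /** ASSUMPTION: a RUNNING job is stale once picked_at is older than staleAfter. */
    static List<String> findStale(List<RunningJob> running, Instant now, Duration staleAfter) {
        Instant cutoff = now.minus(staleAfter);
        return running.stream()
                .filter(job -> job.pickedAt.isBefore(cutoff))
                .map(job -> job.id)
                .collect(Collectors.toList());
    }
}
```

The returned IDs are the jobs that would be reset to PENDING. Because a reset job runs again from the start, job methods should tolerate being executed more than once.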

Archival

Completed jobs (SUCCEEDED and FAILED) are eligible for archival after a configurable retention period. The JobArchivingService moves old jobs from scheduler_job to scheduler_job_archive, keeping the active table lean for efficient polling.

Related

Execution Model -- How the Poller and executor work together
Error Handling -- Detailed retry and DLQ mechanics
Retry Strategies -- Backoff policies and custom retry logic
