Job Lifecycle

Every job in Ratchet follows a well-defined state machine from creation to terminal state. Understanding these states and transitions is essential for building reliable job workflows.

State Machine

                      ┌──────────┐
          ┌───────────│ PENDING  │◄────────────────────┐
          │           └────┬─────┘                     │
          │                │                           │
     pauseJob()            │ Poller claims job         │ scheduleRetry()
          │                │ (SKIP LOCKED)             │ (retries remain)
          │                ▼                           │ or retryJob()
          │           ┌──────────┐                     │ (manual reset)
          │           │ RUNNING  │                     │
          │           └──┬───┬───┘                     │
          │              │   │                         │
          │        success   failure                   │
          │              │   │                         │
          │              ▼   ▼                         │
          │     ┌─────────┐ ┌────────┐                 │
          │     │SUCCEEDED│ │ FAILED │─────────────────┘
          │     └─────────┘ └───┬────┘
          │                     │
          ▼        pauseJob()   │
     ┌──────────┐◄──────────────┘
     │  PAUSED  │
     └──────────┘
          │
     resumeJob()
          │
          ▼
     original state
     (PENDING or FAILED)

cancelJob() can reach CANCELED from PENDING or RUNNING:

     ┌──────────┐
     │ CANCELED │  (terminal)
     └──────────┘

States

PENDING

The job is queued and waiting for execution. A PENDING job becomes visible to the Poller when its scheduled_time <= now. Jobs start in this state when submitted.

  • Visible to Poller: Yes, when scheduled time has passed
  • Transitions to: RUNNING (claimed by worker), PAUSED (via pauseJob()), CANCELED (via cancelJob())

RUNNING

A worker has claimed the job and is actively executing it. The picked_by field records which node owns the job, and optimistic locking (@Version) prevents duplicate execution.

  • Visible to Poller: No
  • Transitions to: SUCCEEDED (execution completes), FAILED (exception thrown or timeout), CANCELED (via cancelJob() -- checked mid-execution)
  • Guard: Only one node can hold a RUNNING job at a time

SUCCEEDED

The job completed without throwing an exception. This is a terminal state. Succeeded jobs may trigger dependent workflow branches or chain steps.

  • Terminal: Yes
  • Eligible for archival: Yes, after retention period

FAILED

The job threw an exception during execution. A FAILED job may or may not have retries remaining:

  • If retries remain: The engine schedules a retry (back to PENDING with a backoff delay). The job entity stays in FAILED only momentarily during the transition.
  • If retries exhausted: The job is permanently FAILED and moved to the Dead Letter Queue. This is a terminal state.
  • If @DoNotRetry: Skips retries entirely, moves directly to DLQ.

Transitions:

  • Back to PENDING: Automatic retry (retries remain) or manual retryJob() call
  • To PAUSED: Via pauseJob() (records pausedFromStatus = FAILED)

PAUSED

The job is temporarily suspended and invisible to the Poller. The paused_from_status column records the state the job had before pausing, so it can be accurately restored.

  • Visible to Poller: No
  • Transitions to: Previous state via resumeJob() -- restores PENDING or FAILED
  • Idempotent: Pausing an already-paused job returns true without error

CANCELED

The job was explicitly canceled and will not execute. This is a terminal state. Canceling a RUNNING job sets the status; the executor checks status mid-flight and discards results.

  • Terminal: Yes
  • Cascading: Canceling a chain step cancels all downstream dependents
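The legal moves between the states above can be captured in a small validity check. The following sketch is illustrative only — the enum and transition table are not Ratchet's actual types — but it encodes exactly the edges described in this section:

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

public class JobStates {
    public enum Status { PENDING, RUNNING, SUCCEEDED, FAILED, PAUSED, CANCELED }

    // Legal transitions taken from the state machine above.
    private static final Map<Status, Set<Status>> ALLOWED = new EnumMap<>(Status.class);
    static {
        ALLOWED.put(Status.PENDING,   EnumSet.of(Status.RUNNING, Status.PAUSED, Status.CANCELED));
        ALLOWED.put(Status.RUNNING,   EnumSet.of(Status.SUCCEEDED, Status.FAILED, Status.CANCELED));
        ALLOWED.put(Status.FAILED,    EnumSet.of(Status.PENDING, Status.PAUSED)); // retry or pause
        ALLOWED.put(Status.PAUSED,    EnumSet.of(Status.PENDING, Status.FAILED)); // resume restores
        ALLOWED.put(Status.SUCCEEDED, EnumSet.noneOf(Status.class));              // terminal
        ALLOWED.put(Status.CANCELED,  EnumSet.noneOf(Status.class));              // terminal
    }

    public static boolean canTransition(Status from, Status to) {
        return ALLOWED.getOrDefault(from, EnumSet.noneOf(Status.class)).contains(to);
    }
}
```

Modeling transitions as a table like this makes invalid moves (for example, resurrecting a SUCCEEDED job) unrepresentable rather than merely unlikely.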

Transition Details

Submission to PENDING

When you call submit() on a builder, the engine:

  1. Analyzes the lambda to extract target class, method, and arguments
  2. Converts that metadata into a persisted job payload via the active JobInvocationResolver
  3. Checks the idempotency key for duplicates (globally unique, forever)
  4. Checks the business key for active conflicts (unique among PENDING/RUNNING jobs)
  5. Persists the JobEntity with status PENDING
  6. For immediate or CRITICAL-priority jobs, publishes a wakeup notification via ClusterCoordinator

JobHandle handle = scheduler.enqueue(() -> service.process(id))
    .withIdempotencyKey(requestId)    // prevents duplicate submission
    .withBusinessKey("process-" + id) // prevents concurrent processing
    .submit();

PENDING to RUNNING (Claim)

The Poller executes a query like:

SELECT * FROM scheduler_job
WHERE status = 'PENDING'
AND scheduled_time <= NOW()
ORDER BY (priority + age_boost) DESC, scheduled_time ASC
FOR UPDATE SKIP LOCKED
LIMIT :batchSize

age_boost is computed from the configured priority-boost interval, so old low-priority work can outrank newer high-priority work. SKIP LOCKED is critical -- it allows multiple nodes to poll concurrently without blocking each other. Each node claims a non-overlapping set of jobs. The claimed jobs are atomically updated:

  • status = RUNNING
  • picked_by = node ID
  • picked_at = current timestamp
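The effective-priority ordering in the claim query can be sketched as follows. The formula here — one boost level per full interval waited — is an assumption for illustration; the real computation depends on the configured priority-boost interval:

```java
import java.time.Duration;
import java.time.Instant;

public class ClaimOrdering {
    /**
     * Effective priority as used by the claim query's ORDER BY
     * (priority + age_boost). Assumes the boost grows by one level per
     * configured interval the job has waited past its scheduled time.
     */
    public static long effectivePriority(int basePriority, Instant scheduledTime,
                                         Instant now, Duration boostInterval) {
        long waitedMillis = Duration.between(scheduledTime, now).toMillis();
        long ageBoost = Math.max(0, waitedMillis / boostInterval.toMillis());
        return basePriority + ageBoost;
    }
}
```

Under this scheme a priority-1 job that has waited two boost intervals outranks a freshly scheduled priority-2 job, which is the starvation-avoidance property the age boost exists to provide.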

RUNNING to SUCCEEDED

When the job method returns normally:

  1. Execution timing is recorded (start, end, duration, queue wait)
  2. Return value is serialized to JSON (if non-void)
  3. Status atomically transitions RUNNING -> SUCCEEDED via markJobSucceeded()
  4. JobCompletedEvent is published
  5. Post-execution handler triggers:
    • For batch children: updates parent batch progress
    • For chain steps: schedules next step
    • For workflow branches: evaluates conditions and schedules matching branches
  6. Success callback (onSuccess) is invoked if configured

RUNNING to FAILED (with Retry)

When the job throws an exception and retries remain:

  1. Attempt counter is atomically incremented
  2. The exception class is checked for @DoNotRetry
  3. RetryPolicy.shouldRetry() is consulted
  4. Backoff delay is calculated (see Retry Strategies)
  5. Job is rescheduled: scheduled_time = now + backoff, status back to PENDING
  6. JobRetryingEvent is published
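The backoff calculation in step 4 depends on the configured RetryPolicy (see Retry Strategies). As a hedged sketch only, a common default is capped exponential backoff; the method below is illustrative and not Ratchet's actual implementation:

```java
import java.time.Duration;

public class Backoff {
    /**
     * Capped exponential backoff: base * 2^(attempt-1), never exceeding cap.
     * Illustrative only -- the real delay comes from the configured RetryPolicy.
     */
    public static Duration delay(int attempt, Duration base, Duration cap) {
        // Clamp the shift so the multiplier cannot overflow a long.
        long millis = base.toMillis() * (1L << Math.min(attempt - 1, 30));
        return Duration.ofMillis(Math.min(millis, cap.toMillis()));
    }
}
```

With a 1-second base and a 5-minute cap, attempts 1, 2, 3, 4 would wait 1s, 2s, 4s, 8s, and the delay flattens at 5 minutes once the exponential curve crosses the cap.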

RUNNING to FAILED (Terminal -- DLQ)

When retries are exhausted or @DoNotRetry applies:

  1. Status transitions RUNNING -> FAILED via compare-and-swap
  2. Error message is sanitized via ErrorSanitizer SPI
  3. DeadLetterService.moveToDlq() records the alert with deduplication
  4. JobDlqEvent is published
  5. For batch children: parent batch progress is updated (failure)
  6. For chain/workflow: downstream evaluation occurs (FAILURE branches may fire)
  7. Failure callback (onFailure) is invoked if configured

Pause and Resume

Pausing suspends a job without losing its state:

scheduler.pauseJob(jobId);   // PENDING or FAILED -> PAUSED
scheduler.resumeJob(jobId); // PAUSED -> original state

The paused_from_status field preserves context:

  • A paused PENDING job resumes to PENDING (eligible for polling again)
  • A paused FAILED job resumes to FAILED (can then be manually retried)

Only PENDING and FAILED jobs can be paused. RUNNING jobs cannot be paused -- cancel them instead.
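The pause/resume bookkeeping can be sketched as follows. The class below is an illustrative model, not Ratchet's JobEntity, but it mirrors the documented rules: only PENDING and FAILED are pausable, pausing is idempotent, and paused_from_status drives the restore:

```java
public class PauseState {
    public enum Status { PENDING, RUNNING, FAILED, PAUSED }

    public Status status;
    public Status pausedFromStatus; // mirrors the paused_from_status column

    public PauseState(Status status) { this.status = status; }

    /** Pauses the job if eligible; returns true when the job ends up paused. */
    public boolean pause() {
        if (status == Status.PAUSED) return true; // idempotent: already paused
        if (status != Status.PENDING && status != Status.FAILED) return false;
        pausedFromStatus = status;
        status = Status.PAUSED;
        return true;
    }

    /** Restores the pre-pause status recorded in pausedFromStatus. */
    public boolean resume() {
        if (status != Status.PAUSED) return false;
        status = pausedFromStatus;
        pausedFromStatus = null;
        return true;
    }
}
```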

Manual Retry

For jobs in the Dead Letter Queue, retryJob() provides manual recovery:

scheduler.retryJob(jobId);

This:

  1. Resets the attempt counter to 0
  2. Clears error information
  3. Sets scheduled_time to now
  4. Transitions FAILED -> PENDING

Only FAILED jobs can be retried. The job becomes immediately eligible for polling.
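The reset performed by retryJob() can be sketched like this. The Job fields below are a hypothetical stand-in for the persisted entity, but the four steps match the list above:

```java
import java.time.Instant;

public class ManualRetry {
    public enum Status { PENDING, FAILED, SUCCEEDED }

    /** Minimal stand-in for the persisted job entity (illustrative only). */
    public static class Job {
        public Status status;
        public int attempts;
        public String lastError;
        public Instant scheduledTime;
    }

    /** Mirrors retryJob(): only FAILED jobs are reset and requeued. */
    public static boolean retry(Job job, Instant now) {
        if (job.status != Status.FAILED) return false;
        job.attempts = 0;          // 1. reset attempt counter
        job.lastError = null;      // 2. clear error information
        job.scheduledTime = now;   // 3. immediately eligible for polling
        job.status = Status.PENDING; // 4. FAILED -> PENDING
        return true;
    }
}
```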

Cancellation

scheduler.cancelJob(jobId);

Behavior depends on current state:

  • PENDING: Immediately transitions to CANCELED
  • RUNNING: Sets status to CANCELED. The executor periodically checks wasJobCanceledDuringExecution() and discards results if true
  • Terminal states: Returns false (cannot cancel completed jobs)

For chain steps, cancellation cascades to all downstream dependents using depth-first traversal.
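The cascading cancellation can be sketched as a depth-first walk over a step-to-dependents graph. The graph representation here is illustrative, not Ratchet's internal model:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class CascadeCancel {
    /**
     * Depth-first traversal from the canceled step through its downstream
     * dependents. Returns every step ID that should be canceled.
     */
    public static Set<String> cancelCascade(String start, Map<String, List<String>> dependents) {
        Set<String> canceled = new LinkedHashSet<>();
        Deque<String> stack = new ArrayDeque<>();
        stack.push(start);
        while (!stack.isEmpty()) {
            String id = stack.pop();
            if (!canceled.add(id)) continue; // already visited (diamond-shaped graphs)
            for (String next : dependents.getOrDefault(id, List.of())) {
                stack.push(next);
            }
        }
        return canceled;
    }
}
```

The visited-set check makes the walk safe even when two branches reconverge on the same downstream step.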

Optimistic Locking

The JobEntity uses JPA @Version for optimistic locking. When two nodes attempt to modify the same job concurrently, one will get an OptimisticLockException. Combined with SKIP LOCKED during claiming, this ensures exactly-once execution semantics:

  • SKIP LOCKED prevents two nodes from claiming the same job
  • @Version prevents stale updates if a race occurs during status transitions
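What @Version enforces can be modeled in miniature with a compare-and-set on a (version, status) pair. This is a sketch of the semantics, not how JPA implements it (JPA does the equivalent check in the UPDATE's WHERE clause):

```java
import java.util.concurrent.atomic.AtomicReference;

public class VersionedStatus {
    /** A (version, status) pair, analogous to the @Version column plus status. */
    public record State(long version, String status) {}

    private final AtomicReference<State> state =
            new AtomicReference<>(new State(0, "RUNNING"));

    /**
     * Succeeds only if the caller still holds the current version; a stale
     * writer is rejected, like an OptimisticLockException.
     */
    public boolean transition(long expectedVersion, String newStatus) {
        State current = state.get();
        if (current.version() != expectedVersion) return false;
        return state.compareAndSet(current, new State(expectedVersion + 1, newStatus));
    }

    public State current() { return state.get(); }
}
```

Two racing writers both read version 0; whichever commits first bumps the version to 1, and the loser's write is rejected instead of silently overwriting the new status.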

Orphan Recovery

If a node crashes while executing a job, the job remains in RUNNING state with no node to complete it. The OrphanRecoveryTimer periodically scans for stale RUNNING jobs (based on picked_at timestamp) and resets them to PENDING for re-execution.
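The staleness test the timer applies can be sketched as a simple predicate on picked_at. The threshold is an assumed configuration value here; the actual cutoff is configurable:

```java
import java.time.Duration;
import java.time.Instant;

public class OrphanCheck {
    /**
     * A RUNNING job is considered orphaned when its claim (picked_at) is
     * older than the staleness threshold.
     */
    public static boolean isOrphaned(Instant pickedAt, Instant now, Duration staleAfter) {
        return Duration.between(pickedAt, now).compareTo(staleAfter) > 0;
    }
}
```

The threshold should comfortably exceed the longest legitimate execution time, otherwise a slow-but-healthy job could be reset and run twice.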

Archival

Completed jobs (SUCCEEDED and FAILED) are eligible for archival after a configurable retention period. The JobArchivingService moves old jobs from scheduler_job to scheduler_job_archive, keeping the active table lean for efficient polling.