Troubleshooting Overview

When something goes wrong with your Ratchet jobs, there are several layers of observability available to help you pinpoint the issue. This guide covers the diagnostic tools at your disposal and how to use them effectively.

Diagnostic Approach

Follow this general strategy when troubleshooting Ratchet issues:

Check the job status in the database -- most issues are visible in the scheduler_job table
Review the logs -- Ratchet uses java.util.logging (JUL) with detailed lifecycle messages
Listen to events -- the event system provides real-time visibility into job state transitions
Inspect execution history -- the scheduler_job_execution table records every attempt

Quick Health Check

Run this query to get a snapshot of your scheduler's current state:

SELECT status, COUNT(*) as count
FROM scheduler_job
GROUP BY status
ORDER BY count DESC;

A healthy system typically shows mostly SUCCEEDED jobs with a small number of PENDING and RUNNING jobs. Red flags include:

Many RUNNING jobs with old picked_at timestamps -- jobs may be stuck (orphaned)
Growing PENDING count -- the poller may not be running or the thread pool is exhausted
Many FAILED jobs -- check last_error for patterns
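
If you run the health-check query on a schedule, the red flags above can be turned into a simple automated check. The following is a minimal sketch, not part of Ratchet: the class name `SchedulerHealthCheck` and both thresholds are assumptions chosen for illustration, and the input map stands for the rows returned by the query. It covers the PENDING and FAILED flags only; detecting stuck RUNNING jobs also needs the picked_at column.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Turns the status counts from the health-check query into readable warnings. */
final class SchedulerHealthCheck {

    // Illustrative thresholds (assumptions, not Ratchet defaults); tune them for your workload.
    static final long MAX_PENDING = 1_000;
    static final double MAX_FAILED_RATIO = 0.05;

    /** @param counts status name mapped to row count, as returned by the GROUP BY query */
    static List<String> evaluate(Map<String, Long> counts) {
        long total = counts.values().stream().mapToLong(Long::longValue).sum();
        long pending = counts.getOrDefault("PENDING", 0L);
        long failed = counts.getOrDefault("FAILED", 0L);

        List<String> warnings = new ArrayList<>();
        if (pending > MAX_PENDING) {
            warnings.add("PENDING backlog of " + pending + ": check the poller and thread pool");
        }
        if (total > 0 && (double) failed / total > MAX_FAILED_RATIO) {
            warnings.add(failed + " of " + total + " jobs FAILED: inspect last_error for patterns");
        }
        return warnings;
    }
}
```

An empty result means neither threshold was crossed; it does not prove the scheduler is healthy.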

Event System for Debugging

Ratchet fires events at every major lifecycle transition. You can observe these through two mechanisms: programmatic listeners and CDI observers.

Programmatic Event Listeners

@Inject
JobSchedulerService scheduler;

// Note: `log` is assumed to be an SLF4J-style logger ({} placeholders) declared in this class.
public void enableDiagnostics() {
    scheduler.addEventListener(event -> {
        if (event instanceof JobFailedEvent failed) {
            // A single attempt failed
            log.error("Job {} failed: {}", failed.getJobId(), failed.getErrorMessage());
        } else if (event instanceof JobRetryingEvent retrying) {
            // The job will run again at the reported time
            log.warn("Job {} retrying (attempt {}), next at: {}",
                retrying.getJobId(), retrying.getAttempt(), retrying.getNextScheduledTime());
        } else if (event instanceof JobDlqEvent dlq) {
            // Retries are exhausted; the job now sits in the dead letter queue
            log.error("Job {} moved to DLQ after {} attempts: {}",
                dlq.getJobId(), dlq.getAttempt(), dlq.getErrorMessage());
        }
    });
}
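
When you need a trend rather than individual log lines, a listener can tally events instead of logging each one. The sketch below is deliberately independent of Ratchet's types: `EventTally` is a hypothetical helper that accepts any object, so whether it can be passed to addEventListener directly (or needs a one-line lambda wrapper) depends on the listener interface's actual signature, which this page does not show.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Consumer;

/** Thread-safe tally of observed events, keyed by the event's simple class name. */
final class EventTally implements Consumer<Object> {

    private final Map<String, LongAdder> counts = new ConcurrentHashMap<>();

    /** Records one event; safe to call from several worker threads at once. */
    @Override
    public void accept(Object event) {
        counts.computeIfAbsent(event.getClass().getSimpleName(), k -> new LongAdder())
              .increment();
    }

    /** Returns a sorted copy of the current counts, for example for periodic logging. */
    Map<String, Long> snapshot() {
        Map<String, Long> copy = new TreeMap<>();
        counts.forEach((name, adder) -> copy.put(name, adder.sum()));
        return copy;
    }
}
```

A rising count for one event type relative to the others is usually a quicker signal than scanning individual log lines.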

CDI Event Observers

For type-safe event observation in a CDI environment, use @Observes:

import jakarta.enterprise.context.ApplicationScoped;
import jakarta.enterprise.event.Observes;
import java.util.logging.Logger;
// Imports for the Ratchet event classes are omitted here.

@ApplicationScoped
public class JobDiagnosticObserver {

    private static final Logger log = Logger.getLogger(JobDiagnosticObserver.class.getName());

    public void onJobStarted(@Observes JobStartedEvent event) {
        log.info("Job " + event.getJobId() + " started on node " + event.getNodeId());
    }

    public void onJobCompleted(@Observes JobCompletedEvent event) {
        log.info("Job " + event.getJobId() + " completed in " + event.getExecutionTimeMs() + " ms");
    }

    // Fires only once retries are exhausted, not on every failed attempt
    public void onJobDlq(@Observes JobDlqEvent event) {
        log.severe("Job " + event.getJobId() + " sent to DLQ: " + event.getErrorMessage());
    }
}

Available Event Types

Event	When Fired
`JobStartedEvent`	Job begins execution on a worker thread
`JobCompletedEvent`	Job finishes successfully
`JobRetryingEvent`	Job failed but will be retried (includes next scheduled time)
`JobDlqEvent`	Job exhausted retries and moved to dead letter queue
`JobCancellingEvent`	Cancel request received for a job
`JobCancelledEvent`	Job successfully canceled
`JobPausedEvent`	Job paused via `pauseJob()`
`JobResumedEvent`	Job resumed via `resumeJob()`
`BatchCompletingEvent`	Last child of a batch completed
`BatchCompletedEvent`	Batch fully finalized
`ChainStartedEvent`	First step of a chain begins
`ChainCompletedEvent`	All chain steps completed successfully
`ChainFailedEvent`	A chain step failed permanently
`WorkflowBranchTriggeredEvent`	A conditional workflow branch was activated
`PerformanceMetricsEvent`	Periodic metrics snapshot

Logging Configuration

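Ratchet logs through java.util.logging (JUL), using logger names under the run.ratchet package. Most Jakarta EE runtimes bridge JUL to their own logging subsystem, so you normally raise the level through the runtime's configuration.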
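
Outside an application server, for example in a standalone JVM or an integration test, the same level can be set through the plain JUL API. This is a minimal sketch using only `java.util.logging`; the class name `RatchetLogLevels` is made up for illustration. Inside a runtime that manages logging itself, prefer the runtime-specific configuration described in the following sections.

```java
import java.util.logging.ConsoleHandler;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Raises the run.ratchet logger tree to FINE using only the JUL API. */
final class RatchetLogLevels {

    // Keep a strong reference: JUL holds loggers weakly, so a configured
    // logger could otherwise be garbage collected and lose its settings.
    private static final Logger RATCHET = Logger.getLogger("run.ratchet");

    static Logger enableFine() {
        RATCHET.setLevel(Level.FINE);

        // Handlers filter as well (ConsoleHandler defaults to INFO),
        // so attach one that lets FINE records through.
        Handler handler = new ConsoleHandler();
        handler.setLevel(Level.FINE);
        RATCHET.addHandler(handler);

        // Stop records from also reaching the root handlers, which would duplicate INFO+ lines.
        RATCHET.setUseParentHandlers(false);
        return RATCHET;
    }
}
```

Child loggers such as run.ratchet.ri.core.Poller inherit the level unless they set their own. Call `enableFine()` once at startup; each call attaches another handler.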

WildFly / JBoss EAP

Add a logger category in your standalone.xml:

<subsystem xmlns="urn:jboss:domain:logging:8.0">
    <logger category="run.ratchet">
        <level name="DEBUG"/>
    </logger>
    <!-- For detailed poller and thread pool diagnostics -->
    <logger category="run.ratchet.ri.core.Poller">
        <level name="FINE"/>
    </logger>
    <logger category="run.ratchet.ri.core.ThreadPoolManager">
        <level name="FINE"/>
    </logger>
</subsystem>

Payara / GlassFish

Use the asadmin CLI:

asadmin set-log-levels run.ratchet=FINE

Open Liberty

Add to server.xml:

<logging traceSpecification="run.ratchet.*=fine"/>

Key Logger Categories

Logger	What It Logs
`run.ratchet.ri.core.JobTask`	Job execution lifecycle, payload resolution, retry decisions
`run.ratchet.ri.core.Poller`	Poll cycle results, claim counts, adaptive delay changes
`run.ratchet.ri.core.OrphanRecoveryTimer`	Orphan detection and recovery actions
`run.ratchet.ri.core.JobTimeoutHandler`	Soft and hard timeout warnings
`run.ratchet.ri.resilience.CircuitBreaker`	Circuit breaker state transitions
`run.ratchet.ri.security.JobSecurityValidator`	Security validation results and rejections
`run.ratchet.ri.security.PackagePrefixClassPolicy`	Class policy allow/deny decisions

MDC Context

Ratchet automatically sets MDC (Mapped Diagnostic Context) values during job execution:

jobId -- the job's unique identifier
nodeId -- the cluster node executing the job
jobCreator -- the user who created the job (if set)

These MDC values are available in your log format patterns for correlation:

# Example log4j2 pattern
%d{ISO8601} [%X{jobId}] [%X{nodeId}] %-5p %c - %m%n

Configuration Reference

Ratchet requires a CDI-produced RatchetOptions bean; without one, deployment fails with UnsatisfiedResolutionException. Applications can either write a programmatic producer or call RatchetOptionsFactory.fromEnvironment() inside a producer to read environment variables and MicroProfile Config. See Configuration.

Key diagnostic-related settings:

Option	Default	Purpose
`polling.minDelayMs(...)`	`2000`	Minimum time between poll cycles
`polling.maxDelayMs(...)`	`10000`	Maximum time between poll cycles (idle)
`polling.batchSize(...)`	`50`	Jobs claimed per poll cycle
`node.orphanGraceSeconds(...)`	`60`	Time before a stale node's jobs are recovered
`node.orphanScanIntervalMinutes(...)`	`5`	How often to scan for orphaned jobs
`timeout.softTimeoutPercent(...)`	`80`	Percentage of timeout at which warning fires
`timeout.defaultSlaSeconds(...)`	`1800`	Default job timeout in seconds (30 min)
`circuitBreaker.enabled(...)`	`true`	Enable/disable the built-in circuit breaker
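
To make the two timeout settings concrete: with the defaults above, the soft-timeout warning would fire at 80% of 1800 seconds, which is 1440 seconds (24 minutes). The sketch below shows that arithmetic. It assumes the warning point is a straight percentage of the job's timeout, as the table describes; the helper name `SoftTimeout` is hypothetical, and Ratchet's internal rounding may differ.

```java
import java.time.Duration;

/** Computes when a soft-timeout warning would fire for a given timeout and percentage. */
final class SoftTimeout {

    static Duration softThreshold(Duration timeout, int softTimeoutPercent) {
        if (softTimeoutPercent < 0 || softTimeoutPercent > 100) {
            throw new IllegalArgumentException("percent must be between 0 and 100");
        }
        // Multiply before dividing so the result is exact for whole-second timeouts.
        return timeout.multipliedBy(softTimeoutPercent).dividedBy(100);
    }
}
```

For example, `SoftTimeout.softThreshold(Duration.ofSeconds(1800), 80)` yields a duration of 24 minutes, leaving 6 minutes between the warning and the hard timeout.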

Getting Help

If you cannot resolve an issue using these guides:

Search existing issues on the Ratchet GitHub repository
Open a new issue with:
- Ratchet version and Jakarta EE runtime (WildFly, Payara, GlassFish, etc.)
- Database vendor and version
- Relevant log output (with run.ratchet set to FINE)
- The SQL output of SELECT status, COUNT(*) FROM scheduler_job GROUP BY status
- Steps to reproduce the issue
Check the Common Issues page for known problems and their solutions
