Troubleshooting Overview

When something goes wrong with your Ratchet jobs, there are several layers of observability available to help you pinpoint the issue. This guide covers the diagnostic tools at your disposal and how to use them effectively.

Diagnostic Approach

Follow this general strategy when troubleshooting Ratchet issues:

  1. Check the job status in the database -- most issues are visible in the scheduler_job table
  2. Review the logs -- Ratchet uses java.util.logging (JUL) with detailed lifecycle messages
  3. Listen to events -- the event system provides real-time visibility into job state transitions
  4. Inspect execution history -- the scheduler_job_execution table records every attempt

Quick Health Check

Run this query to get a snapshot of your scheduler's current state:

SELECT status, COUNT(*) as count
FROM scheduler_job
GROUP BY status
ORDER BY count DESC;

A healthy system typically shows mostly SUCCEEDED jobs with a small number of PENDING and RUNNING jobs. Red flags include:

  • Many RUNNING jobs with old picked_at timestamps -- jobs may be stuck (orphaned)
  • Growing PENDING count -- the poller may not be running or the thread pool is exhausted
  • Many FAILED jobs -- check last_error for patterns
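The red flags above can be checked mechanically once the query results are loaded into memory. The following is a minimal sketch, not part of Ratchet: the class name SchedulerHealth, the 10% failure threshold, and the idea of comparing two consecutive snapshots are all assumptions made for illustration. It takes the status counts as plain maps so it can be fed from any query mechanism.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not a Ratchet class.
final class SchedulerHealth {

    // Assumed threshold, not a Ratchet setting: flag when over 10% of jobs are FAILED.
    static final double MAX_FAILED_RATIO = 0.10;

    private SchedulerHealth() {}

    // Compares two snapshots of "status -> count" taken some time apart.
    static List<String> redFlags(Map<String, Long> previous, Map<String, Long> current) {
        List<String> flags = new ArrayList<>();

        long pendingBefore = previous.getOrDefault("PENDING", 0L);
        long pendingNow = current.getOrDefault("PENDING", 0L);
        if (pendingNow > pendingBefore) {
            flags.add("PENDING grew from " + pendingBefore + " to " + pendingNow);
        }

        long total = current.values().stream().mapToLong(Long::longValue).sum();
        long failed = current.getOrDefault("FAILED", 0L);
        if (total > 0 && (double) failed / total > MAX_FAILED_RATIO) {
            flags.add("FAILED is " + failed + " of " + total + " jobs");
        }
        return flags;
    }

    // A RUNNING job whose picked_at is older than the expected runtime may be orphaned.
    static boolean looksStuck(Instant pickedAt, Instant now, Duration maxExpectedRuntime) {
        return pickedAt != null
                && Duration.between(pickedAt, now).compareTo(maxExpectedRuntime) > 0;
    }
}
```

A reasonable value for maxExpectedRuntime is the job timeout you have configured; the thresholds here are starting points to tune, not recommendations.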

Event System for Debugging

Ratchet fires events at every major lifecycle transition. You can observe these through two mechanisms: programmatic listeners and CDI observers.

Programmatic Event Listeners

Register a listener via the JobSchedulerService API to receive all events in a single callback:

@Inject
JobSchedulerService scheduler;

// Assumes an SLF4J-style logger named "log" is available in the enclosing class.
public void enableDiagnostics() {
    scheduler.addEventListener(event -> {
        if (event instanceof JobFailedEvent failed) {
            log.error("Job {} failed: {}", failed.getJobId(), failed.getErrorMessage());
        } else if (event instanceof JobRetryingEvent retrying) {
            log.warn("Job {} retrying (attempt {}), next at: {}",
                    retrying.getJobId(), retrying.getAttempt(), retrying.getNextScheduledTime());
        } else if (event instanceof JobDlqEvent dlq) {
            log.error("Job {} moved to DLQ after {} attempts: {}",
                    dlq.getJobId(), dlq.getAttempt(), dlq.getErrorMessage());
        }
    });
}

CDI Event Observers

For type-safe event observation in a CDI environment, use @Observes:

import jakarta.enterprise.context.ApplicationScoped;
import jakarta.enterprise.event.Observes;
import java.util.logging.Logger;
// Imports for the Ratchet event classes are omitted here.

@ApplicationScoped
public class JobDiagnosticObserver {

    private static final Logger log = Logger.getLogger(JobDiagnosticObserver.class.getName());

    public void onJobStarted(@Observes JobStartedEvent event) {
        log.info("Job " + event.getJobId() + " started on node " + event.getNodeId());
    }

    public void onJobCompleted(@Observes JobCompletedEvent event) {
        log.info("Job " + event.getJobId() + " completed in " + event.getExecutionTimeMs() + " ms");
    }

    // Observes the dead-letter event, so the method is named for it.
    public void onJobDlq(@Observes JobDlqEvent event) {
        log.severe("Job " + event.getJobId() + " sent to DLQ: " + event.getErrorMessage());
    }
}

Available Event Types

Event                         When Fired
JobStartedEvent               Job begins execution on a worker thread
JobCompletedEvent             Job finishes successfully
JobRetryingEvent              Job failed but will be retried (includes next scheduled time)
JobDlqEvent                   Job exhausted retries and moved to dead letter queue
JobCancellingEvent            Cancel request received for a job
JobCancelledEvent             Job successfully canceled
JobPausedEvent                Job paused via pauseJob()
JobResumedEvent               Job resumed via resumeJob()
BatchCompletingEvent          Last child of a batch completed
BatchCompletedEvent           Batch fully finalized
ChainStartedEvent             First step of a chain begins
ChainCompletedEvent           All chain steps completed successfully
ChainFailedEvent              A chain step failed permanently
WorkflowBranchTriggeredEvent  A conditional workflow branch was activated
PerformanceMetricsEvent       Periodic metrics snapshot

Logging Configuration

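Ratchet logs through java.util.logging (JUL), with all logger names under the run.ratchet package. Most Jakarta EE runtimes bridge JUL to their own logging subsystem, so log levels are normally set in the runtime's configuration as shown below.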
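Outside an application server (for example in a local test run), the same level can be raised through the standard JUL API. This is a minimal sketch under that assumption; the class name RatchetLogSetup is hypothetical, and only the logger name run.ratchet comes from this guide. Note that a JUL ConsoleHandler defaults to INFO, so the handler level has to be lowered as well as the logger level.

```java
import java.util.logging.ConsoleHandler;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper, not a Ratchet class.
final class RatchetLogSetup {

    // Keep a strong reference: JUL holds loggers weakly, so an otherwise
    // unreferenced logger can be garbage collected and lose its level.
    private static final Logger RATCHET = Logger.getLogger("run.ratchet");

    private RatchetLogSetup() {}

    static Logger enableFineLogging() {
        RATCHET.setLevel(Level.FINE);
        // Attach a handler only once so repeated calls do not duplicate output.
        if (RATCHET.getHandlers().length == 0) {
            Handler handler = new ConsoleHandler();
            handler.setLevel(Level.FINE); // the default handler level is INFO
            RATCHET.addHandler(handler);
            // Avoid printing INFO and above twice via the root logger's handlers.
            RATCHET.setUseParentHandlers(false);
        }
        return RATCHET;
    }
}
```

Child loggers such as run.ratchet.ri.core.Poller inherit the level unless they are configured separately.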

WildFly / JBoss EAP

Add a logger category in your standalone.xml:

<subsystem xmlns="urn:jboss:domain:logging:8.0">
    <logger category="run.ratchet">
        <level name="DEBUG"/>
    </logger>
    <!-- For detailed poller and thread pool diagnostics -->
    <logger category="run.ratchet.ri.core.Poller">
        <level name="FINE"/>
    </logger>
    <logger category="run.ratchet.ri.core.ThreadPoolManager">
        <level name="FINE"/>
    </logger>
</subsystem>

Payara / GlassFish

Use the asadmin CLI:

asadmin set-log-levels run.ratchet=FINE

Open Liberty

Add to server.xml:

<logging traceSpecification="run.ratchet.*=fine"/>

Key Logger Categories

Logger                                            What It Logs
run.ratchet.ri.core.JobTask                       Job execution lifecycle, payload resolution, retry decisions
run.ratchet.ri.core.Poller                        Poll cycle results, claim counts, adaptive delay changes
run.ratchet.ri.core.OrphanRecoveryTimer           Orphan detection and recovery actions
run.ratchet.ri.core.JobTimeoutHandler             Soft and hard timeout warnings
run.ratchet.ri.resilience.CircuitBreaker          Circuit breaker state transitions
run.ratchet.ri.security.JobSecurityValidator      Security validation results and rejections
run.ratchet.ri.security.PackagePrefixClassPolicy  Class policy allow/deny decisions

MDC Context

Ratchet automatically sets MDC (Mapped Diagnostic Context) values during job execution:

  • jobId -- the job's unique identifier
  • nodeId -- the cluster node executing the job
  • jobCreator -- the user who created the job (if set)

These MDC values are available in your log format patterns for correlation:

# Example log4j2 pattern
%d{ISO8601} [%X{jobId}] [%X{nodeId}] %-5p %c - %m%n
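Once log lines carry the MDC fields, they can be grouped by job when you go through a log file. The sketch below parses a line written with the example pattern above; the record name JobLogLine and the sample values used in the usage note are made up for illustration, and the regular expression assumes that exact pattern and single-line messages.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a Ratchet class. Requires Java 16+ (records).
record JobLogLine(String timestamp, String jobId, String nodeId,
                  String level, String logger, String message) {

    // Mirrors: %d{ISO8601} [%X{jobId}] [%X{nodeId}] %-5p %c - %m
    // %-5p pads short levels such as INFO, hence \s+ after the level.
    private static final Pattern LINE = Pattern.compile(
            "^(\\S+) \\[([^\\]]*)\\] \\[([^\\]]*)\\] (\\S+)\\s+(\\S+) - (.*)$");

    static Optional<JobLogLine> parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return Optional.empty(); // e.g. stack trace continuation lines
        }
        return Optional.of(new JobLogLine(
                m.group(1), m.group(2), m.group(3), m.group(4), m.group(5), m.group(6)));
    }
}
```

For example, parsing a line such as "2024-01-15T10:30:00,123 [job-42] [node-a] ERROR run.ratchet.ri.core.JobTask - Execution failed" yields a record whose jobId() is job-42. Lines logged outside job execution have empty brackets and parse to an empty jobId.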

Configuration Reference

Ratchet requires a CDI-produced RatchetOptions bean; without one, deployment fails with UnsatisfiedResolutionException. Applications can either build the options programmatically in a producer or, inside a producer, call RatchetOptionsFactory.fromEnvironment() to read environment variables and MicroProfile Config. See Configuration.

Key diagnostic-related settings:

Option                               Default  Purpose
polling.minDelayMs(...)              2000     Minimum time between poll cycles
polling.maxDelayMs(...)              10000    Maximum time between poll cycles (idle)
polling.batchSize(...)               50       Jobs claimed per poll cycle
node.orphanGraceSeconds(...)         60       Time before a stale node's jobs are recovered
node.orphanScanIntervalMinutes(...)  5        How often to scan for orphaned jobs
timeout.softTimeoutPercent(...)      80       Percentage of timeout at which warning fires
timeout.defaultSlaSeconds(...)       1800     Default job timeout in seconds (30 min)
circuitBreaker.enabled(...)          true     Enable/disable the built-in circuit breaker
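When reading logs it helps to translate these defaults into concrete numbers. The sketch below does the arithmetic for two of them. The class name DiagnosticMath is hypothetical, and the claim-rate estimate rests on an assumption the table does not state: that each poll cycle claims at most batchSize jobs and that cycles run no more often than minDelayMs.

```java
import java.time.Duration;

// Hypothetical helper, not a Ratchet class.
final class DiagnosticMath {

    private DiagnosticMath() {}

    // Point at which the soft timeout warning fires, as a share of the job timeout.
    static Duration softTimeout(Duration timeout, int softTimeoutPercent) {
        if (softTimeoutPercent < 0 || softTimeoutPercent > 100) {
            throw new IllegalArgumentException("percent must be between 0 and 100");
        }
        return timeout.multipliedBy(softTimeoutPercent).dividedBy(100);
    }

    // Assumed upper bound on jobs claimed per second by one node:
    // at most batchSize jobs per cycle, cycles at least minDelayMs apart.
    static double maxClaimsPerSecond(int batchSize, long minDelayMs) {
        if (minDelayMs <= 0) {
            throw new IllegalArgumentException("minDelayMs must be positive");
        }
        return batchSize * 1000.0 / minDelayMs;
    }
}
```

With the defaults, softTimeout(Duration.ofSeconds(1800), 80) is 1440 seconds (24 minutes), and maxClaimsPerSecond(50, 2000) is 25.0. If jobs arrive faster than that estimate for a sustained period, a growing PENDING count is expected rather than a sign of a stalled poller.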

Getting Help

If you cannot resolve an issue using these guides:

  1. Check the Common Issues page for known problems and their solutions
  2. Search existing issues on the Ratchet GitHub repository
  3. Open a new issue with:
    • Ratchet version and Jakarta EE runtime (WildFly, Payara, GlassFish, etc.)
    • Database vendor and version
    • Relevant log output (with run.ratchet set to FINE)
    • The output of the query SELECT status, COUNT(*) FROM scheduler_job GROUP BY status
    • Steps to reproduce the issue