
Debugging Guide

This guide covers practical techniques for understanding what your Ratchet jobs are doing at runtime, from event tracing to database-level inspection. JobContext.logger() exists on every job and is backed by JBoss Logging by default; database persistence of those log lines is optional.

Using JobContext.logger()

Every job has access to a JobContext that provides a job-scoped logger. The default logger writes to the runtime logging backend and publishes JobLogLine events. If your application observes and persists those events, the entries can be stored in scheduler_job_log, giving you a structured execution trace tied to each job ID.

@ApplicationScoped
public class DataImportService {

    public void importData() {
        JobContext ctx = JobContext.current();
        JobLogger log = ctx.logger();

        log.info("Starting import for job " + ctx.jobId());

        String batchSize = ctx.param("batchSize", "100");
        log.info("Using batch size: " + batchSize);

        try {
            int rows = processRecords(Integer.parseInt(batchSize));
            log.info("Imported " + rows + " rows successfully");
        } catch (Exception e) {
            log.error("Import failed: " + e.getMessage());
            throw e;
        }
    }
}
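
The persistence side is left to the application. As a minimal sketch — assuming the runtime fires JobLogLine as a CDI event and that the event exposes jobId(), timestamp(), level(), and message() accessors (verify both against the actual event class) — an observer can copy each line into scheduler_job_log:

@ApplicationScoped
public class JobLogPersister {

    @Inject
    DataSource dataSource;

    public void onLogLine(@Observes JobLogLine line) {
        // Column names match the scheduler_job_log queries below;
        // the accessors on JobLogLine are assumptions.
        String sql = "INSERT INTO scheduler_job_log (job_id, ts, level, message) VALUES (?, ?, ?, ?)";
        try (Connection con = dataSource.getConnection();
             PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setString(1, line.jobId());
            ps.setTimestamp(2, Timestamp.from(line.timestamp()));
            ps.setString(3, line.level());
            ps.setString(4, line.message());
            ps.executeUpdate();
        } catch (SQLException e) {
            // Swallow: log persistence must never fail the job itself.
        }
    }
}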

Querying Job Logs

If your application persists JobLogLine events, you can retrieve them after a job runs:

-- Get all log entries for a specific job, ordered chronologically
SELECT ts, level, message
FROM scheduler_job_log
WHERE job_id = '01902c4e-c4f3-7b8a-9d3e-fedcba987654'
ORDER BY ts ASC;

-- Find jobs that logged errors in the last hour
SELECT DISTINCT jl.job_id, j.status, j.last_error, jl.message
FROM scheduler_job_log jl
JOIN scheduler_job j ON j.job_id = jl.job_id
WHERE jl.level = 'ERROR'
AND jl.ts >= NOW() - INTERVAL '1 hour'
ORDER BY jl.ts DESC;

Using Job Parameters for Debug Context

Attach parameters at submission time to carry context through to execution:

scheduler.enqueue(importService::importData)
    .withParam("batchSize", "500")
    .withParam("source", "quarterly-report")
    .withParam("requestedBy", currentUser.getName())
    .submit();

These parameters are accessible via JobContext.current().param("key") and stored in the params column of scheduler_job:

-- Find jobs by parameter values
SELECT job_id, status, params::jsonb ->> 'source' as source
FROM scheduler_job
WHERE params::jsonb ->> 'source' = 'quarterly-report';
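
On the job side, the same values can be read back through the context:

// Inside the job method: recover the submission-time context.
JobContext ctx = JobContext.current();
String source = ctx.param("source");
String requestedBy = ctx.param("requestedBy");
ctx.logger().info("Import from " + source + " requested by " + requestedBy);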

Event Listeners for Tracing Job Lifecycle

For real-time debugging, register an event listener that logs all lifecycle transitions:

@ApplicationScoped
public class LifecycleTracer {

    private static final Logger log = Logger.getLogger(LifecycleTracer.class.getName());

    @Inject
    JobSchedulerService scheduler;

    private Consumer<Object> listener;

    public void startTracing() {
        listener = event -> {
            switch (event) {
                case JobStartedEvent e ->
                    log.info("[TRACE] Job " + e.getJobId() + " STARTED on " + e.getNodeId()
                            + " [type=" + e.getJobType() + ", priority=" + e.getPriority() + "]");

                case JobCompletedEvent e ->
                    log.info("[TRACE] Job " + e.getJobId() + " COMPLETED in "
                            + e.getExecutionTimeMs() + " ms");

                case JobRetryingEvent e ->
                    log.warning("[TRACE] Job " + e.getJobId() + " RETRYING (attempt "
                            + e.getAttempt() + "), next at " + e.getNextScheduledTime()
                            + ", error: " + e.getErrorMessage());

                case JobDlqEvent e ->
                    log.severe("[TRACE] Job " + e.getJobId() + " -> DLQ after "
                            + e.getAttempt() + " attempts: " + e.getErrorMessage());

                case JobCancelledEvent e ->
                    log.info("[TRACE] Job " + e.getJobId() + " CANCELLED from "
                            + e.getPreviousStatus() + " after " + e.getExecutionTimeMs() + " ms");

                case BatchCompletedEvent e ->
                    log.info("[TRACE] Batch " + e.getJobId() + " COMPLETED");

                case ChainFailedEvent e ->
                    log.severe("[TRACE] Chain " + e.getJobId() + " FAILED");

                default -> log.fine("[TRACE] " + event.getClass().getSimpleName()
                        + " for job " + ((AbstractJobSchedulerEvent) event).getJobId());
            }
        };
        scheduler.addEventListener(listener);
    }

    public void stopTracing() {
        if (listener != null) {
            scheduler.removeEventListener(listener);
        }
    }
}
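
Nothing in the tracer starts itself. One way to keep it active for the whole application lifetime — a sketch using plain CDI lifecycle observers, not a Ratchet-specific API — is:

@ApplicationScoped
public class TracingBootstrap {

    @Inject
    LifecycleTracer tracer;

    // Standard CDI: runs once when the application context starts.
    void onStart(@Observes @Initialized(ApplicationScoped.class) Object started) {
        tracer.startTracing();
    }

    // Detach the listener cleanly on shutdown.
    void onStop(@Observes @BeforeDestroyed(ApplicationScoped.class) Object stopping) {
        tracer.stopTracing();
    }
}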

CDI @Observes for Event Monitoring

CDI observers provide type-safe event handling. Use specific event types to narrow what you observe:

@ApplicationScoped
public class SchedulerMonitor {

    private static final Logger log = Logger.getLogger(SchedulerMonitor.class.getName());

    /**
     * Track slow jobs. Fires after every successful completion.
     */
    public void onJobCompleted(@Observes JobCompletedEvent event) {
        if (event.getExecutionTimeMs() > 60_000) {
            log.warning("Slow job detected: " + event.getJobId()
                    + " took " + event.getExecutionTimeMs() + " ms"
                    + " [type=" + event.getJobType() + "]");
        }
    }

    /**
     * Alert on DLQ entries. Fires when a job exhausts all retries.
     */
    public void onDlq(@Observes JobDlqEvent event) {
        log.severe("DLQ ALERT: Job " + event.getJobId()
                + " failed permanently after " + event.getAttempt() + " attempts"
                + " - " + event.getErrorMessage());
        // Send alert to monitoring system, Slack, etc.
    }

    /**
     * Monitor retry patterns. Fires each time a job is rescheduled for retry.
     */
    public void onRetry(@Observes JobRetryingEvent event) {
        log.info("Retry: Job " + event.getJobId()
                + " attempt " + event.getAttempt()
                + " scheduled for " + event.getNextScheduledTime());
    }
}
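
Unlike the programmatic listener above, these observers need no registration or cleanup: the CDI container discovers them at deployment time and invokes them for every matching event.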

Database Queries to Inspect Job State

Find a Specific Job

-- Full job details by ID
SELECT job_id, status, job_type, priority,
       attempts, max_retries, backoff_policy,
       scheduled_time, picked_by, picked_at,
       execution_start_time, execution_end_time, execution_duration_ms,
       last_error, business_key, resource_name,
       payload::jsonb ->> 'target' as target_class,
       payload::jsonb ->> 'method' as method_name
FROM scheduler_job
WHERE job_id = '01902c4e-c4f3-7b8a-9d3e-fedcba987654';

Find Failed Jobs by Error Pattern

-- Find jobs that failed with a specific exception type
SELECT job_id, status, attempts, max_retries, last_error,
       payload::jsonb ->> 'target' as target_class,
       payload::jsonb ->> 'method' as method_name,
       created_at, updated_at
FROM scheduler_job
WHERE status = 'FAILED'
AND last_error LIKE '%ConnectionTimeout%'
ORDER BY updated_at DESC
LIMIT 20;

View Execution History for a Job

Each job execution attempt is recorded in scheduler_job_execution:

-- See all attempts for a job
SELECT id, attempt, node_id, started_at, ended_at, status,
       duration_ms, error_message, error_class
FROM scheduler_job_execution
WHERE job_id = '01902c4e-c4f3-7b8a-9d3e-fedcba987654'
ORDER BY attempt ASC;

Monitor Job Throughput

-- Jobs completed per hour over the last 24 hours
SELECT date_trunc('hour', execution_end_time) as hour,
       COUNT(*) as completed,
       AVG(execution_duration_ms) as avg_ms,
       MAX(execution_duration_ms) as max_ms
FROM scheduler_job
WHERE status = 'SUCCEEDED'
AND execution_end_time >= NOW() - INTERVAL '24 hours'
GROUP BY hour
ORDER BY hour DESC;

Check Node Health

-- See which nodes are active and when they last checked in
SELECT node_id, heartbeat_ts, started_at,
       EXTRACT(EPOCH FROM (NOW() - heartbeat_ts)) as seconds_since_heartbeat
FROM scheduler_node
ORDER BY heartbeat_ts DESC;

A seconds_since_heartbeat greater than the orphan grace period (default 60s) indicates a dead or unresponsive node. The OrphanRecoveryTimer will recover jobs from stale nodes automatically.
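
To run the same check from application code rather than a SQL console, a small JDBC sketch over the same scheduler_node columns works (the 60 s default grace period is hard-coded here, and an injected or configured DataSource is assumed):

public class NodeHealthCheck {

    private final DataSource dataSource;

    public NodeHealthCheck(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    // Returns the IDs of nodes whose last heartbeat is older than 60 seconds.
    public List<String> staleNodes() throws SQLException {
        String sql = "SELECT node_id FROM scheduler_node "
                   + "WHERE heartbeat_ts < NOW() - INTERVAL '60 seconds'";
        List<String> stale = new ArrayList<>();
        try (Connection con = dataSource.getConnection();
             PreparedStatement ps = con.prepareStatement(sql);
             ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                stale.add(rs.getString("node_id"));
            }
        }
        return stale;
    }
}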

Inspect Batch Progress

-- Batch completion status
SELECT b.batch_id, j.status as parent_status, j.business_key,
       b.total_items, b.completed_items, b.failed_items,
       b.completion_processed,
       ROUND(100.0 * (b.completed_items + b.failed_items) / GREATEST(b.total_items, 1), 1) as pct_done
FROM scheduler_batch b
JOIN scheduler_job j ON j.job_id = b.batch_id
WHERE j.status IN ('PENDING', 'RUNNING')
ORDER BY b.batch_id DESC;

Find Orphaned Jobs

-- Jobs stuck in RUNNING on nodes that have gone stale
SELECT j.job_id, j.picked_by, j.picked_at,
       EXTRACT(EPOCH FROM (NOW() - j.picked_at)) / 60 as stuck_minutes,
       n.heartbeat_ts
FROM scheduler_job j
LEFT JOIN scheduler_node n ON n.node_id = j.picked_by
WHERE j.status = 'RUNNING'
AND (n.node_id IS NULL OR n.heartbeat_ts < NOW() - INTERVAL '60 seconds')
ORDER BY j.picked_at ASC;

Enabling Debug Logging

By Component

Set fine-grained log levels to focus on the subsystem you are debugging:

What you are debugging              Logger to set to FINE/DEBUG
Job not being picked up             run.ratchet.ri.core.Poller
Job execution failures              run.ratchet.ri.core.JobTask
Retry/backoff decisions             run.ratchet.ri.core.JobTask
Security/ClassPolicy rejections     run.ratchet.ri.security
Circuit breaker behavior            run.ratchet.ri.resilience
Thread pool saturation              run.ratchet.ri.core.ThreadPoolManager
Orphan recovery                     run.ratchet.ri.core.OrphanRecoveryTimer
Timeout enforcement                 run.ratchet.ri.core.JobTimeoutHandler
CDI bean resolution                 run.ratchet.ri.cdi
Recurring job scheduling            run.ratchet.ri.core.RecurringScheduler
Batch orchestration                 run.ratchet.ri.core.BatchService
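
For a quick experiment (in a test, say), the same levels can be set programmatically through java.util.logging — assuming your server routes Ratchet's loggers through JUL, which depends on the logging setup:

// Hold a strong reference so the JUL logger (and its level) is not garbage-collected.
private static final Logger POLLER_LOG = Logger.getLogger("run.ratchet.ri.core.Poller");

public static void enablePollerTracing() {
    POLLER_LOG.setLevel(Level.FINE);
}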

Full Debug Mode

To enable debug logging for all of Ratchet (generates significant output):

WildFly CLI:

/subsystem=logging/logger=run.ratchet:add(level=FINE)

Payara asadmin:

asadmin set-log-levels run.ratchet=FINE

Open Liberty server.xml:

<logging traceSpecification="run.ratchet.*=fine"/>

Reading the Log Output

Ratchet logs follow a consistent pattern. Here is what a healthy job execution looks like:

INFO    JobTask - Job 12345 starting execution [type=SINGLE, priority=NORMAL, attempt=1/4, payload=MyService.processData]
INFO    JobTask - Job 12345 resolving target: com.myapp.MyService.processData (static=false)
INFO    JobTask - Job 12345 succeeded in 234 ms
INFO    JobTask - Job 12345 execution complete - cleaning up context

A failed job with retry:

SEVERE JobTask - Job 12345 failed with java.net.ConnectException: Connection refused
WARNING JobTask - Job 12345 retrying in 2000 ms (attempt 1/3) due to: java.net.ConnectException: Connection refused

A job that exhausted retries:

SEVERE JobTask - Job 12345 failed with java.net.ConnectException: Connection refused
SEVERE JobTask - Job 12345 moved to DLQ after 4 attempts

Debugging Checklist

When a job is not behaving as expected, work through this checklist; a JDBC sketch for the first step follows the list:

  1. Check the job status: SELECT status, last_error FROM scheduler_job WHERE job_id = ?
  2. Check execution history: SELECT * FROM scheduler_job_execution WHERE job_id = ? ORDER BY attempt
  3. Check per-job logs (if enabled): SELECT * FROM scheduler_job_log WHERE job_id = ? ORDER BY ts
  4. Check if the node is alive: SELECT * FROM scheduler_node WHERE node_id = ?
  5. Check the payload: SELECT payload FROM scheduler_job WHERE job_id = ? -- verify the target class and method are correct
  6. Enable debug logging for the relevant subsystem (see table above)
  7. Register an event listener to watch lifecycle events in real time
  8. Check for resource contention: SELECT * FROM scheduler_resource_permit WHERE job_id = ?
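
A minimal JDBC sketch for step 1 (assuming job_id is stored as text; use setObject for a native UUID column):

// Checklist step 1 as code: fetch status and last_error for one job.
public static void printJobStatus(DataSource ds, String jobId) throws SQLException {
    String sql = "SELECT status, last_error FROM scheduler_job WHERE job_id = ?";
    try (Connection con = ds.getConnection();
         PreparedStatement ps = con.prepareStatement(sql)) {
        ps.setString(1, jobId);
        try (ResultSet rs = ps.executeQuery()) {
            if (rs.next()) {
                System.out.println("status=" + rs.getString("status")
                        + ", last_error=" + rs.getString("last_error"));
            } else {
                System.out.println("No job found for id " + jobId);
            }
        }
    }
}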