Deployment Troubleshooting
This guide covers problems that surface during deployment and operations. For general debugging techniques, see Troubleshooting Overview.
Schema Issues
Tables/collections not found
Symptom: Table 'scheduler_job' doesn't exist or similar errors at startup.
Fix: For SQL stores, apply the DDL for your store module. DDL files are in the module's src/main/resources/ddl/ directory:
# PostgreSQL
psql -U ratchet -d myapp -f ratchet-store-postgresql/src/main/resources/ddl/postgresql-schema.sql
# MySQL
mysql -u ratchet -p myapp < ratchet-store-mysql/src/main/resources/ddl/mysql-schema.sql
# MongoDB — collections and indexes are initialized automatically at store startup
SQL stores ship plain SQL files, not migrations. Apply them however your project manages DDL (Flyway, Liquibase, manual scripts, etc.).
Missing indexes
Symptom: Slow job claiming, increasing poll cycle times, growing queue despite available workers.
Diagnosis:
-- PostgreSQL
EXPLAIN ANALYZE
SELECT * FROM scheduler_job
WHERE status = 'PENDING' AND scheduled_time <= NOW()
ORDER BY priority + FLOOR(GREATEST(0, EXTRACT(EPOCH FROM (statement_timestamp() - scheduled_time)) / 60) / 15) DESC,
scheduled_time ASC
FOR UPDATE SKIP LOCKED LIMIT 10;
If you see a sequential scan instead of an index scan, the composite index is missing.
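The ORDER BY expression in the query above is easier to follow as arithmetic: a job's base priority plus one point for every full 15 minutes it has waited past its scheduled_time, with the wait clamped at zero. A minimal Java sketch of the same calculation follows. The class and method names are hypothetical and are not part of Ratchet's API.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper: mirrors the ORDER BY arithmetic from the diagnostic query.
// It is not part of Ratchet; the names are illustrative only.
final class EffectivePriority {
    private EffectivePriority() {}

    // priority + FLOOR(GREATEST(0, minutesWaiting) / 15)
    static long of(int priority, Instant scheduledTime, Instant now) {
        double minutesWaiting = Duration.between(scheduledTime, now).toMillis() / 60_000.0;
        double clamped = Math.max(0.0, minutesWaiting); // GREATEST(0, ...)
        return priority + (long) Math.floor(clamped / 15.0);
    }
}
```

Because the sort key depends on the current time, it cannot be read directly from a stored column, which is why the query relies on the index to narrow the candidate rows before sorting.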
Fix: Re-apply the DDL or create the index manually:
-- PostgreSQL
CREATE INDEX IF NOT EXISTS idx_job_claim_cover
ON scheduler_job (job_type, scheduled_time ASC, priority DESC, job_id ASC)
WHERE status = 'PENDING';
Schema version mismatch
Symptom: Column not found errors or constraint violations after upgrading Ratchet.
Fix: Compare your applied schema against the latest DDL in the new version. Ratchet does not auto-migrate — you must diff and apply changes manually or through your migration tool.
Connection Issues
DataSource not found
Symptom: JNDI lookup failed for 'java:comp/DefaultDataSource' or No qualifying bean of type 'DataSource'.
Fix: Ensure your application server has a DataSource configured and bound to the expected JNDI name. For WildFly:
<!-- standalone.xml -->
<datasource jndi-name="java:jboss/datasources/RatchetDS" pool-name="RatchetDS">
<connection-url>jdbc:postgresql://localhost:5432/myapp</connection-url>
<driver>postgresql</driver>
<security>
<user-name>ratchet</user-name>
<password>ratchet</password>
</security>
</datasource>
Connection pool exhaustion
Symptom: Unable to acquire JDBC Connection or timeouts during high load.
Cause: Ratchet's poller and workers each hold connections during claim and execution. If your pool is smaller than the number of concurrent workers, connections can be exhausted.
Fix: Set your connection pool size to at least workerThreads + pollerThreads + margin:
<datasource ...>
<pool>
<min-pool-size>5</min-pool-size>
<max-pool-size>30</max-pool-size>
</pool>
</datasource>
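The sizing rule above is simple arithmetic, and a minimal Java sketch of it follows. The class name, the parameter names, and the example thread counts are assumptions for illustration; substitute the values from your own scheduler configuration.

```java
// Hypothetical helper for the sizing rule: workerThreads + pollerThreads + margin.
// Not part of Ratchet's API.
final class PoolSizing {
    private PoolSizing() {}

    // Smallest max-pool-size that satisfies the rule.
    static int minPoolSize(int workerThreads, int pollerThreads, int margin) {
        if (workerThreads < 0 || pollerThreads < 0 || margin < 0) {
            throw new IllegalArgumentException("thread counts and margin must be non-negative");
        }
        return workerThreads + pollerThreads + margin;
    }

    // True when the configured max-pool-size satisfies the rule.
    static boolean isSufficient(int maxPoolSize, int workerThreads, int pollerThreads, int margin) {
        return maxPoolSize >= minPoolSize(workerThreads, pollerThreads, margin);
    }
}
```

For example, with 20 workers, 2 poller threads, and a margin of 5, the minimum is 27, so the max-pool-size of 30 shown above would be sufficient.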
MongoDB transaction error
Symptom: Transaction numbers are only allowed on a replica set member or mongos at startup.
Fix: Ratchet's MongoDB store does not require multi-document transactions. This error usually means application code, a test fixture, or another library started a MongoDB transaction. Use a replica set for that workload, or remove the transaction wrapper around Ratchet store calls. For development:
mongod --replSet rs0
mongosh --eval 'rs.initiate()'
Clustering Problems
Duplicate job execution
Symptom: The same job runs on two nodes simultaneously.
Possible causes:
- Missing SKIP LOCKED support: You're on a database version that doesn't support it (MySQL < 8.0, PostgreSQL < 9.5)
- Job timeout too short: The job takes longer than its timeout, so the poller reclaims it while it's still running on another node
- Clock skew: Nodes have significantly different system clocks
Fix:
- Upgrade to a supported database version
- Increase withTimeout() on long-running jobs
- Synchronize clocks with NTP. Check skew: SELECT NOW() on the DB vs Instant.now() on each node
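The skew check in the last step can be sketched in Java. This assumes you have already fetched the database's NOW() as an Instant (for example through JDBC). The class name and the tolerance are hypothetical, not Ratchet settings.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper: compares the database clock with a node's clock.
// Not part of Ratchet's API.
final class ClockSkew {
    private ClockSkew() {}

    // Absolute difference between the two clocks.
    static Duration between(Instant databaseNow, Instant nodeNow) {
        return Duration.between(databaseNow, nodeNow).abs();
    }

    // True when the skew is larger than the tolerance you are willing to accept.
    static boolean exceeds(Instant databaseNow, Instant nodeNow, Duration tolerance) {
        return between(databaseNow, nodeNow).compareTo(tolerance) > 0;
    }
}
```

Sample both clocks as close together as possible, since the query round trip itself adds a little apparent skew.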
Node appears in scheduler_node but isn't running
Symptom: Stale entries in the node table for nodes that have been decommissioned.
Fix: Stale entries are harmless but confusing. Clean them up:
-- PostgreSQL syntax; on MySQL write INTERVAL 1 HOUR
DELETE FROM scheduler_node
WHERE last_heartbeat < NOW() - INTERVAL '1 hour';
Or use the programmatic API:
nodeStore.deleteInactiveNodesSince(Instant.now().minus(Duration.ofHours(1)));
Jobs stuck in RUNNING
Symptom: Jobs have status = 'RUNNING' and picked_by set to a node that's no longer alive.
Cause: The node crashed mid-execution. The job's timeout hasn't expired yet, or there's no timeout set.
Fix:
- Prevent: Always set a reasonable withTimeout() on jobs
- Recover: After the timeout expires, the poller will automatically reclaim the job. To force recovery:
-- PostgreSQL syntax; on MySQL write INTERVAL 30 MINUTE
UPDATE scheduler_job
SET status = 'PENDING', picked_by = NULL, picked_at = NULL
WHERE status = 'RUNNING'
AND picked_at < NOW() - INTERVAL '30 minutes';
Only reset jobs if you're certain the claiming node is truly dead. Resetting a job that's still executing will cause duplicate execution.
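The warning above amounts to two checks before a forced reset: the job has been RUNNING longer than your cutoff, and the claiming node has gone silent. The following Java sketch encodes that decision as a pure function. The class, its parameters, and the use of the node's last_heartbeat as the liveness signal are assumptions for illustration; Ratchet may expose node liveness differently.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical guard for the manual reset; not part of Ratchet's API.
final class StuckJobCheck {
    private StuckJobCheck() {}

    // pickedAt:          when the job was claimed (picked_at)
    // nodeLastHeartbeat: last_heartbeat of the claiming node, or null if its row is gone
    static boolean safeToReset(Instant pickedAt, Instant nodeLastHeartbeat, Instant now,
                               Duration runningCutoff, Duration heartbeatCutoff) {
        boolean runningTooLong = pickedAt.isBefore(now.minus(runningCutoff));
        boolean nodeSilent = nodeLastHeartbeat == null
                || nodeLastHeartbeat.isBefore(now.minus(heartbeatCutoff));
        return runningTooLong && nodeSilent;
    }
}
```

A stale heartbeat is only evidence, not proof, that the node is dead: a network partition can produce the same signal while the job is still executing.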
Performance Issues
Poll cycle taking too long
Symptom: PerformanceMetricsEvent shows poll cycles > 1 second.
Diagnosis: Check if the polling query is using indexes:
EXPLAIN ANALYZE
SELECT * FROM scheduler_job
WHERE status = 'PENDING' AND scheduled_time <= NOW()
ORDER BY priority + FLOOR(GREATEST(0, EXTRACT(EPOCH FROM (statement_timestamp() - scheduled_time)) / 60) / 15) DESC,
scheduled_time ASC
FOR UPDATE SKIP LOCKED LIMIT 10;
Fixes:
- Ensure the composite index exists (see Missing indexes)
- Reduce batch size — claiming fewer jobs per cycle reduces lock contention
- Archive completed jobs to keep the table lean
Queue growing despite available capacity
Symptom: Pending job count increases while worker threads are idle.
Possible causes:
- Scheduling mismatch: Jobs have scheduled_time in the future
- Resource permits: Jobs require a resource permit that's exhausted
- Paused jobs: Jobs are in PAUSED status
Diagnosis:
SELECT status, COUNT(*) FROM scheduler_job
GROUP BY status ORDER BY status;
-- Check future-scheduled jobs
SELECT COUNT(*) FROM scheduler_job
WHERE status = 'PENDING' AND scheduled_time > NOW();
High lock contention on PostgreSQL
Symptom: deadlock detected errors or long wait times on scheduler_lock.
Fix: Advisory locks in Ratchet use short TTLs. If you see contention:
- Ensure lock_timeout is set reasonably: SET lock_timeout = '5s';
- Check for long-running transactions holding locks
- Increase the poll interval to reduce claim frequency
CDI Issues
Ratchet beans not discovered
Symptom: Unsatisfied dependency for type JobSchedulerService at deployment.
Fix: Ensure CDI bean discovery is enabled:
- beans.xml exists in META-INF/ or WEB-INF/
- Ratchet JARs are in the deployment (not just on the classpath as external modules)
- Bean discovery mode is all or annotated (not none)
Multiple store implementations on classpath
Symptom: Ambiguous dependency for type JobStore — CDI found two store beans.
Fix: Include only one store module in your deployment. If you need both (e.g., for migration), mark one as @Alternative and don't enable it.
See Also
- Troubleshooting Overview — General debugging techniques
- Common Issues — Frequently encountered problems
- Clustering — Multi-node architecture and failure modes
- Performance Tuning — Optimizing for throughput