INFRA-159: Confluence failed to start (CLUSTERSAFETY table)

Metadata

Source
INFRA-159
Type
Bug
Priority
Major
Status
Closed
Resolution
Fixed
Assignee
Giovanni Tirloni
Reporter
Giovanni Tirloni
Created
2018-04-10T10:17:29.582-0400
Updated
2018-04-10T10:20:47.308-0400
Versions
N/A
Fixed Versions
N/A
Component
N/A

Description

Fatal error in Confluence cluster: Database is being updated by an instance which is not part of the current cluster.

https://confluence.atlassian.com/confkb/confluence-will-not-start-due-to-fatal-error-in-confluence-cluster-179439771.html

Comments

  • Giovanni Tirloni commented 2018-04-10T10:20:41.073-0400

    Root cause:

    2018-04-09 20:37:56,707 ERROR [Caesium-1-3] [scheduler.caesium.impl.SchedulerQueueWorker] executeJob Unhandled exception thrown by job QueuedJob[jobId=ClusterSafetyJob,deadline=1523306258345]
    java.lang.OutOfMemoryError: Java heap space
    

    Symptom:

    2018-04-09 20:38:08,683 WARN [Caesium-1-3] [confluence.cluster.safety.DefaultClusterSafetyManager] onNumbersAreDifferent Detected different number in database [ -2137431111 ] and cache [ -1148876792 ]. Cache number last set by [ not clustered ]. Triggering panic on node [ not clustered ]
    2018-04-09 20:38:08,730 WARN [Caesium-1-3] [analytics.client.listener.ProductEventListener] processEventWithTiming Processing a critical event: com.atlassian.confluence.cluster.safety.ClusterPanicAnalyticsEvent@446d65da
    2018-04-09 20:38:08,731 ERROR [Caesium-1-3] [confluence.cluster.safety.ClusterPanicListener] onClusterPanicEvent Received a panic event, stopping processing on the node: Non Clustered Confluence: Database is being updated by another Confluence instance. Please see http://confluence.atlassian.com/x/mwiyCg for more details.
    2018-04-09 20:38:08,731 WARN [Caesium-1-3] [confluence.cluster.safety.ClusterPanicListener] onClusterPanicEvent Shutting down
    

    Confluence was running with a 1 GB heap. Increased it to 4 GB and restarted Confluence.

    While it was down, Confluence kept returning 200 OK, so our monitoring did not detect the outage.
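
A status-code-only probe cannot catch this failure mode, so a check that also inspects the response body may help. The sketch below is illustrative only. It assumes the instance exposes a `/status` endpoint that returns JSON such as `{"state": "RUNNING"}`; whether this Confluence version does, and which URL the existing monitoring probes, is not stated in this ticket. The names `is_healthy` and `check` are hypothetical.

```python
import json
import urllib.error
import urllib.request


def is_healthy(status_code: int, body: str) -> bool:
    """Return True only if the HTTP status is 200 AND the body reports RUNNING.

    Assumption: the body is JSON with a "state" field. Anything else
    (an HTML error page, a different state, invalid JSON) counts as down.
    """
    if status_code != 200:
        return False
    try:
        payload = json.loads(body)
    except ValueError:
        # Not JSON at all, e.g. an error page served with 200 OK.
        return False
    return isinstance(payload, dict) and payload.get("state") == "RUNNING"


def check(base_url: str, timeout: float = 10.0) -> bool:
    """Probe <base_url>/status (assumed endpoint) and classify the response."""
    url = base_url.rstrip("/") + "/status"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", "replace")
            return is_healthy(resp.status, body)
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx response.
        return False
```

With a check like this, a 200 OK response that carries an error page or a non-running state is treated as a failure rather than a success.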