GPE crash on some DR nodes

Symptoms

GPE crashes after receiving invalid delta attributes from Kafka. Resetting GPE brings it back up, but replication is halted.

Error Messages

============ Call stacktrace. signal(Aborted) ============
 0# FailureSignalHandler at /home/graphsql/product/src/engine/utility/gutil/glogging.cpp:142
 1# 0x00007F4C22654630 in /home/tigergraph/tigergraph/app/3.7.0/.syspre/usr/lib_ld1/libpthread.so.0
 2# 0x00007F4C20CE8387 in /home/tigergraph/tigergraph/app/3.7.0/.syspre/usr/lib_ld1/libc.so.6
 3# 0x00007F4C20CE9A78 in /home/tigergraph/tigergraph/app/3.7.0/.syspre/usr/lib_ld1/libc.so.6
 4# 0x00007F4C255BEA33 in /home/tigergraph/tigergraph/app/3.7.0/bin/libtigergraph.so
 5# 0x00007F4C255B7EA2 in /home/tigergraph/tigergraph/app/3.7.0/bin/libtigergraph.so
 6# 0x00007F4C255B7E01 in /home/tigergraph/tigergraph/app/3.7.0/bin/libtigergraph.so
 7# 0x00007F4C255B77C9 in /home/tigergraph/tigergraph/app/3.7.0/bin/libtigergraph.so
 8# 0x00007F4C255BAA68 in /home/tigergraph/tigergraph/app/3.7.0/bin/libtigergraph.so
 9# topology4::Attribute::MoveToNextPosition_Ext(unsigned char*&, topology4::AttributeMeta const&) at /home/graphsql/product/src/engine/core/topology/topology4/attribute.cpp:1912
10# topology4::DeltaAttributeCombiner::AdvanceOneDeltaAttribute(unsigned char*&, topology4::AttributeMeta&) at /home/graphsql/product/src/engine/core/topology/topology4/deltaattribute.cpp:54
11# topology4::DeltaRecords::HandleEdgeDeltaAttribute(unsigned char*&, unsigned long&, unsigned char*, topology4::AttributesMeta&, std::vector<unsigned int, std::allocator<unsigned int> >*, gutil::GCharBuffer*) at /home/graphsql/product/src/engine/core/topology/topology4/deltarecords.cpp:1213
12# topology4::DeltaRecords::AddOneEdgeDelta_Internal(unsigned long, unsigned long, unsigned int, unsigned int, unsigned char*, unsigned long, unsigned long, unsigned long&, unsigned char*, topology4::AttributesMeta*, std::vector<unsigned int, std::allocator<unsigned int> >*, gutil::GCharBuffer*) at /home/graphsql/product/src/engine/core/topology/topology4/deltarecords.cpp:2328
13# topology4::DeltaRecords::AddEdgeDeltaThread(unsigned int) at /home/graphsql/product/src/engine/core/topology/topology4/deltarecords.cpp:2283
14# 0x00007F4C255A7565 in /home/tigergraph/tigergraph/app/3.7.0/bin/libtigergraph.so
15# 0x00007F4C2264CEA5 in /home/tigergraph/tigergraph/app/3.7.0/.syspre/usr/lib_ld1/libpthread.so.0
16# 0x00007F4C20DB0B0D in /home/tigergraph/tigergraph/app/3.7.0/.syspre/usr/lib_ld1/libc.so.6
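
To confirm that a crash on another node is the same issue, compare its stack trace against the frames shown above. The following is a minimal sketch, not an official diagnostic tool. The function name and the choice of which frames make up the signature are assumptions made for illustration. The caller supplies the log text, because the log location varies by installation.

```python
# Frames copied from the stack trace above; treating these three as the
# "signature" of this crash is an assumption made for illustration.
SIGNATURE_FRAMES = (
    "topology4::Attribute::MoveToNextPosition_Ext",
    "topology4::DeltaAttributeCombiner::AdvanceOneDeltaAttribute",
    "topology4::DeltaRecords::HandleEdgeDeltaAttribute",
)


def matches_known_crash(log_text):
    """Return True if log_text contains an abort with all signature frames."""
    if "signal(Aborted)" not in log_text:
        return False
    return all(frame in log_text for frame in SIGNATURE_FRAMES)
```

Read the relevant GPE log into a string and pass it to `matches_known_crash`. A `True` result only indicates that the trace resembles this one; it does not prove the same root cause.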

Cause

The crash is caused by a race condition between a delayed schema change and delta handling.

Resolution

  1. Restart GPE repeatedly. GPE may crash again until all messages are processed, but they should all be processed correctly.

  2. Reset GPE. This brings GPE up but stops replication.
     2-1) If deltas are rebuilt on the primary cluster, you should be able to restore the backup and re-enable CRR.
     2-2) You may also upgrade at this point, then restore the backup and re-enable CRR.
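
Option 1 above can be automated with a small retry loop. The sketch below is illustrative only. It assumes that `gadmin start gpe` and `gadmin status gpe` are available on the node, and that a healthy GPE shows "Running" and not "Down" in the status output. Check both assumptions against your installation before relying on it. The function names, attempt limit, and wait time are hypothetical.

```python
import subprocess
import time


def run_gadmin(args):
    """Run a gadmin command and return its standard output as text."""
    result = subprocess.run(["gadmin"] + list(args), capture_output=True, text=True)
    return result.stdout


def looks_running(status_output):
    # Assumption: a healthy GPE shows "Running" and no "Down" in the status text.
    return "Running" in status_output and "Down" not in status_output


def restart_gpe_until_up(run=run_gadmin, max_attempts=10, wait_seconds=60,
                         sleep=time.sleep):
    """Start GPE repeatedly; return the attempt that succeeded, or None."""
    for attempt in range(1, max_attempts + 1):
        run(["start", "gpe"])
        # Give GPE time to consume pending deltas (and possibly crash again).
        sleep(wait_seconds)
        if looks_running(run(["status", "gpe"])):
            return attempt
    return None
```

Calling `restart_gpe_until_up()` with the defaults tries up to ten times, one minute apart. A return value of `None` means GPE was still not up after the last attempt, at which point option 2 is the fallback.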

Workaround

  1. Bring GPE back up using one of the options under Resolution.

  2. Upgrade TigerGraph.