GPE down due to disk space issues

Symptoms

While running a batch loading job, the GPE on m2 (cluster topology is 3x1) went down. After the downed GPE was manually restarted, 2 out of 3 GPEs (on m2 and m3) entered WARMUP state and stayed in warmup for at least 24 hours.

|   GPE_1#1    |     Online     |     Running     |
|   GPE_2#1    |     Warmup     |     Running     |
|   GPE_3#1    |     Warmup     |     Running     |
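
The per-replica GPE state above can be re-checked at any time with gadmin. A minimal sketch, assuming a TigerGraph 3.x installation with gadmin on the PATH:

[source,bash]
----
# Show verbose service status and keep only the GPE rows;
# each replica should report Online once warmup has finished.
gadmin status -v | grep GPE
----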

Error Messages

The GPE crash stack trace is:

E0522 23:59:02.547024 15964 glogging.cpp:132] ============ Crashed with stacktrace ============
  0# FailureSignalHandler at /home/graphsql/product/src/engine/utility/gutil/glogging.cpp:132
 1# 0x00007F68D7EC65F0 in /data/tigergraph/app/3.5.1/.syspre/usr/lib_ld1/libpthread.so.0
 2# topology4::SegmentVertexReader::MoveTo(unsigned long, topology4::QueryState*, topology4::SegmentVertexReaderResult&, bool) at /home/graphsql/product/src/engine/core/topology/topology4/segmentvertexattributereader.cpp:271
 3# topology4::DeltaRebuilder::WriteSegmentFiles(unsigned int, topology4::SegmentMeta&, topology4::QueryState*, gutil::GTimer&, topology4::SegmentCheckSum*, gutil::BitQueue*&) at /home/graphsql/product/src/engine/core/topology/topology4/deltarebuilder.cpp:1074
 4# topology4::DeltaRebuilder::RunOneSegment(topology4::QueryState*, unsigned long, bool, bool) at /home/graphsql/product/src/engine/core/topology/topology4/deltarebuilder.cpp:1327
 5# gutil::GThreadPool::pool_main(unsigned char) at /home/graphsql/product/src/engine/utility/gutil/gthreadpool.cpp:125
 6# 0x00007F68DB208299 in /data/tigergraph/app/3.5.1/bin/libtigergraph.so
 7# 0x00007F68D7EBEE65 in /data/tigergraph/app/3.5.1/.syspre/usr/lib_ld1/libpthread.so.0
 8# 0x00007F68D6D9F88D in /data/tigergraph/app/3.5.1/.syspre/usr/lib_ld1/libc.so.6
============ End of stacktrace ============

Diagnosis

There are two known issues in this case:

  1. Disk space issue: This is the root cause of the GPE going down; the crash happened while the batch data load was running. The GPE logs contained Alert messages reporting that only 8% of disk space was left, and checking the disk usage on the node confirmed that only 8% remained (a verification sketch follows this list).

  2. GPEs in WARMUP after restart: After restarting the down GPE, 2 of the 3 GPEs were in WARMUP state. Troubleshooting showed that a KAFKA-LOADER job was running while the GPEs were in WARMUP. A GPE only considers itself caught up when no new delta records arrive for 1 second, so if new delta records keep arriving at intervals shorter than 1 second, the GPE never becomes online. The following GPE log lines show that a loading job was running and preventing the GPE from going online:

    #data getting pulled here while GPE is in warmup
    I0523 16:52:21.118593 22502 post_listener.cpp:221] Request|gle,250598.KAFKA-LOADER_3_9.1652732703634.N,NAC,599,2,0,S|Post_listener|getDeltaMessage|25340085|25340084|852225
    #GPE in warmup since it's stuck in WaitForCatchUp
    I0523 16:52:07.979143 31967 serviceapi.cpp:638] System_ServiceAPI|Wait for topology to catch up
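
For the first issue, the low-disk finding can be verified directly on the affected node. This is a minimal sketch, not the exact commands used in this case: the /data/tigergraph mount point is taken from the paths in the stack trace above, and the GPE log location is an assumption that should be adjusted to the actual installation.

[source,bash]
----
# Check how much space is left on the disk holding the TigerGraph install.
df -h /data/tigergraph

# Scan the GPE logs for disk-space related Alert messages
# (log directory is assumed; point it at your real GPE log path).
grep -ri "disk" /data/tigergraph/log/gpe/ | tail -n 20
----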

Further information can be retrieved by running this gdb command:

[source,bash]
----
rm gdb.txt; gdb -p $(pgrep gpe) -batch -ex "set logging on" -ex "set pagination off" -ex "thread apply all bt" -ex "quit"
----

The relevant part of the gdb output in this case is:

[source,bash]
----
#3  gutil::GSparseArray<topology4::DeltaVertexAttributeRecord>::GetPointerForWrite (this=0x7fc1a26d00d0, id=id@entry=1, ret_lock=@0x7fa7305f6d00: 0x7fc1a2cfa600, create_if_not_exist=create_if_not_exist@entry=true) at /home/graphsql/product/src/engine/utility/gutil/gsparsearray.hpp:278
#4  0x00007fc229fa70c2 in topology4::DeltaRecords::ReadOneDelta (this=this@entry=0x7fc224ed9800, delta=0x7fa72eb02a5f "\377h\001\020", delta@entry=0x7fa72eb00307 "", postbinarysize=postbinarysize@entry=1384138, tid=tid@entry=25340213, current_postqueue_pos=current_postqueue_pos@entry=25340213, num_inup_vertices=@0x7fa7305f6f90: 0, num_delete_vertices=@0x7fa7305f6fa0: 0, num_inup_edges=@0x7fa7305f6fb0: 0, num_delete_edges=@0x7fa7305f6fc0: 0, batchdelta=batchdelta@entry=false, attrbuffer=attrbuffer@entry=0x7fa7305f7250, requestid=...) at /home/graphsql/product/src/engine/core/topology/topology4/deltarecords.cpp:1413
#5  0x00007fc229fab5c4 in topology4::DeltaRecords::ReadDeltas (this=0x7fc224ed9800, deltadata=deltadata@entry=0x7fa72eb002c0 ":igle,250598.KAFKA-LOADER_3_9.1652732703634.N,NAC,599,2,0,S5\251\202\001", deltasize=deltasize@entry=1384209, current_postqueue_pos=current_postqueue_pos@entry=25340213, servicesummary=servicesummary@entry=0x7fc224e60200, attrbuffer=attrbuffer@entry=0x7fa7305f7250, aborted_requests=...) at /home/graphsql/product/src/engine/core/topology/topology4/deltarecords.cpp:818
#6  0x00007fc22a0da87c in gperun::EngineJobRunner::PullDeltaThread (this=0x7fc1f1bb0400) at /home/graphsql/product/src/engine/olgp/gpe/enginejobrunner.cpp:578
----

Here we can see that the job 250598.KAFKA-LOADER_3_9.1652732703634 is actively running and that PullDeltaThread is consuming new delta records.
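
The same conclusion can be reached from the GSQL side by listing the loading jobs that are still running. A minimal sketch, assuming gsql is available on the node and using a hypothetical graph name MyGraph:

[source,bash]
----
# List the status of all loading jobs on the (hypothetical) graph MyGraph;
# a RUNNING entry for 250598.KAFKA-LOADER_3_9.1652732703634 would confirm
# that deltas are still being produced while the GPEs are in WARMUP.
gsql -g MyGraph "SHOW LOADING STATUS ALL"
----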

Workaround

For the 1st issue, the solution was to free up disk space by deleting older log files and flushing the TS3 database; the user also had to allocate more disk space.
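
The log cleanup part can be scripted along these lines; this is only a sketch with an assumed log root and retention window, so review the matched files before deleting anything.

[source,bash]
----
# List log files older than 30 days under the assumed log root.
find /data/tigergraph/log -type f -mtime +30 -print
# Once the list looks right, replace -print with -delete to remove them:
# find /data/tigergraph/log -type f -mtime +30 -delete
----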

For the 2nd issue, the solution was to stop the KAFKA loading job, wait for the GPEs to come online, and then restart the KAFKA loading job.
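
A possible sequence for this workaround, again as a hedged sketch that reuses the hypothetical graph name MyGraph from above:

[source,bash]
----
# 1. Stop the running Kafka loading job (or pass the real job id instead of ALL).
gsql -g MyGraph "ABORT LOADING JOB ALL"

# 2. Wait until no GPE replica is still in Warmup state.
while gadmin status -v | grep GPE | grep -q Warmup; do sleep 30; done

# 3. Restart the Kafka loading job once all GPEs are Online
#    (the loading job name here is hypothetical):
# gsql -g MyGraph "RUN LOADING JOB load_kafka_job"
----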