Map Reduce Debugging by taking JVM heap dump

How to debug a MapReduce program whose tasks are randomly killed due to OOM/GC issues.

 

Taking Heap Dump manually:
 

To see a histogram of live objects in the JVM at a point in time:
jmap -histo:live pid

To take a heap dump (writes the JVM heap to a binary file on disk):
jmap -dump:live,format=b,file=file-name.bin pid

    Log on to the datanode where the map/reduce JVM is running, and run ps -eaf | grep attempt_id to get the pid.
    Run jmap via sudo -u <user owning the task JVM> to take the heap dump.
    Never use the -F (force) option while taking the dump with jmap.
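The manual steps above can be sketched as a short shell session. The attempt id and user name below are illustrative placeholders, and the jmap commands are printed rather than executed, so you can review them before running them by hand on the datanode:

```shell
#!/bin/sh
# Hypothetical attempt id -- substitute the real one from the JobTracker UI.
ATTEMPT="attempt_201201010000_0001_m_000000_0"
# Hypothetical user owning the task JVM.
TASK_USER="mapred"

# Find the pid of the task JVM; `grep -v grep` drops the grep process itself.
pid=$(ps -eaf | grep "$ATTEMPT" | grep -v grep | awk '{print $2}' | head -1)

# Live-object histogram, then a binary heap dump for later jhat analysis.
echo "sudo -u $TASK_USER jmap -histo:live $pid"
echo "sudo -u $TASK_USER jmap -dump:live,format=b,file=/tmp/$ATTEMPT.bin $pid"
```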

To analyse the dump, use jhat:
jhat -port port_no heap_file_path

What to look for in the jhat analysis:
    1. Objects with the highest memory footprint
    2. Objects with the highest instance count

Taking a Heap Dump on OutOfMemoryError using JVM -XX options

Set the following option in the job configuration:

set mapred.child.java.opts '-Xmx512m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/@taskid@.hprof'

This option launches the map/reduce task JVM with the specified values, giving us a handle to control various JVM memory-related parameters.

A few things to note:

  1. -Xmx512m : maximum heap size for the task JVM (512 MB)
  2. -XX:+HeapDumpOnOutOfMemoryError : dump the heap to disk when the JVM runs out of memory
  3. -XX:HeapDumpPath=/tmp/@taskid@.hprof : @taskid@ is replaced by the Hadoop framework with the actual task id, which is unique
One needs to log on to the datanode; the heap dump file will be present in /tmp, named @taskid@.hprof (with @taskid@ replaced by the Hadoop framework with the actual task id). jhat can then be used to analyze the dump.


Taking a Heap Dump on OutOfMemoryError and collecting the dump files across datanodes into an HDFS location for further analysis

The option above requires one to log on to the datanode on which the map/reduce task was spawned and run jmap and jhat on that machine. For a job with hundreds of map/reduce tasks this becomes very cumbersome. The option below provides a mechanism to collect all the heap dumps in a specified HDFS location.

Make a shell script named dump.sh :

#!/bin/sh
# Flatten the task's working directory into a file name, so we can tell
# which heap dump belongs to which task.
text=`echo $PWD | sed 's=/=_=g'`
hadoop fs -put heapdump.hprof /user/dirName/hprof/$text.hprof
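As a quick sanity check of the sed substitution in dump.sh, a made-up task working-directory path flattens into a file name like this:

```shell
# A made-up task working directory path (illustrative only):
dir=/grid/local/taskTracker/jobcache/job_0001/attempt_0001_m_000000_0/work
# Same substitution as in dump.sh: every '/' becomes '_'.
echo "$dir" | sed 's=/=_=g'
# prints: _grid_local_taskTracker_jobcache_job_0001_attempt_0001_m_000000_0_work
```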
  •  Place the dump.sh script in an HDFS location: hadoop fs -put dump.sh /user/dirName/dump.sh
  •  Create a directory on HDFS where you want to gather all the heap dumps and give it 777 permissions: hadoop fs -chmod -R 777 /user/dirName/hprof
  •  Set the following properties in the MR job:
  •  set mapred.child.java.opts '-Xmx256m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=./heapdump.hprof -XX:OnOutOfMemoryError=./dump.sh'
  •  set mapred.create.symlink 'yes'
  •  set mapred.cache.files 'hdfs:///user/dirName/dump.sh#dump.sh'

 

Run the MR job; an OOM in any task on any datanode will trigger a heap dump, and the dump file will be placed in the specified HDFS location. One can verify that the script executed correctly in the task's stdout log.

In the stdout log:

java.lang.OutOfMemoryError: Java heap space
Dumping heap to ./heapdump.hprof ...
Heap dump file created [12039655 bytes in 0.081 secs]
#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="./dump.sh"
#   Executing /bin/sh -c "./dump.sh"...
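Once dumps have been collected, they can be pulled out of HDFS and browsed with jhat. A sketch, where the HDFS directory matches the one configured above but the dump file name and port are made up; the commands are printed rather than executed:

```shell
#!/bin/sh
# HDFS directory configured in the steps above; the file name is illustrative.
HPROF_DIR=/user/dirName/hprof
DUMP=_grid_local_taskdir.hprof

echo "hadoop fs -ls $HPROF_DIR"
echo "hadoop fs -get $HPROF_DIR/$DUMP /tmp/$DUMP"
echo "jhat -port 7000 /tmp/$DUMP"
```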

Use Hadoop Default profiler for profiling and finding issues

set mapred.task.profile 'true';
set mapred.task.profile.params '-agentlib:hprof=cpu=samples,heap=sites,depth=6,force=n,thread=y,verbose=n,file=%s';
set mapred.task.profile.maps '0-1';
set mapred.task.profile.reduces '0-1';

The profiler will profile the task JVMs in the specified ranges. The profiler output will be available in the task logs, under the profile.out section.