Thursday, October 29, 2015

hive 1.2.1 mapjoin slow [launch local task to process map join cost long time]


Problem: in Hive 1.2.1, mapjoin is extremely slow during the local task's hashtable dump ("launch local task to process map join" takes a very long time):
 hadoop  jar  /data/home/hive-1.2.1/lib/hive-exec-1.2.1.jar org.apache.hadoop.hive.ql.exec.mr.ExecDriver -localtask -plan file:/data/home/eagooqi/plan.xml -jobconffile file:/data/home/eagooqi/jobconf.xml
The local task process pinned a CPU at 100%. Inspect the process:
top -H -p 6730
This gives the id of the thread with high CPU usage; convert it to hexadecimal:
25793 -> 64c1
jstack -l 6730
"main" prio=10 tid=0x0000000000618000 nid=0x64c1 runnable [0x00002b32aa88f000]
   java.lang.Thread.State: RUNNABLE
        at java.util.HashMap.getEntry(HashMap.java:426)
        at java.util.HashMap.getEntry(HashMap.java:418)
        at java.util.HashMap.get(HashMap.java:406)
        at org.apache.hadoop.hive.ql.exec.persistence.HashMapWrapper.get(HashMapWrapper.java:105)
        at org.apache.hadoop.hive.ql.exec.HashTableSinkOperator.process(HashTableSinkOperator.java:243)
        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:837)
        at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:97)
        at org.apache.hadoop.hive.ql.exec.mr.MapredLocalTask.startForward(MapredLocalTask.java:416)
        at org.apache.hadoop.hive.ql.exec.mr.MapredLocalTask.startForward(MapredLocalTask.java:378)
        at org.apache.hadoop.hive.ql.exec.mr.MapredLocalTask.executeInProcess(MapredLocalTask.java:344)
        at org.apache.hadoop.hive.ql.exec.mr.ExecDriver.main(ExecDriver.java:745)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
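As an aside, the nid=0x64c1 in the stack trace header is just the decimal thread id from `top -H` printed in hex; a minimal sketch of the conversion step:

```java
// Convert a decimal thread id from `top -H` into the hex nid jstack prints.
public class TidToNid {
    static String toNid(int tid) {
        // jstack prints the native thread id in lowercase hex with an 0x prefix
        return "0x" + Integer.toHexString(tid);
    }

    public static void main(String[] args) {
        System.out.println(toNid(25793)); // prints 0x64c1
    }
}
```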
So the time is going into HashMap get/put, but why is that slow?
Searching the community turned up some related discussion, but none of it solved the problem.
Here is the root cause.
One look at the mapjoin table's schema makes it clear why this happens:
Our mapjoin table's key is `id bigint`, and the join condition on the Hive side is t1.resource_id1 = t2.id, where t1.resource_id1 is a string and id is a bigint; Hive converts both sides to DoubleWritable.
hive (default)> desc t_rd_appid_map;
OK
id                      bigint                  None                
pkgname                 string                  None                
updatetime              string                  None                
acid                    int                     None                
ds                      bigint                  None                
                 
# Partition Information          
# col_name              data_type               comment             
                 
ds                      bigint                  None          
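One possible workaround (my addition, not from the original investigation) is to keep both join keys the same type so Hive never promotes them to double; for example, casting the bigint side to string should keep the mapjoin key as Text instead of DoubleWritable:

```sql
-- Hypothetical query shape: t1 and its resource_id1 column are assumed
-- from the description above; the explicit cast keeps the join key a string.
SELECT /*+ MAPJOIN(t2) */ t1.resource_id1, t2.pkgname
FROM t1
JOIN t_rd_appid_map t2
  ON t1.resource_id1 = CAST(t2.id AS STRING);
```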

The problem occurs only when all of the following conditions hold:
a. The mapjoin key is a DoubleWritable.
b. As described in HIVE-12093, at some point Hive switched to using Hadoop's org.apache.hadoop.io.DoubleWritable:
"Evidently, Hive has its own wrapper/subclass around Hadoop's DoubleWritable that used to override hashCode() with a correct implementation, but for some reason they recently removed that code, so it now uses the incorrect hashCode() method inherited from Hadoop's DoubleWritable."
But Hadoop's DoubleWritable hashCode() is flawed: it simply truncates the 64-bit bit pattern down to 32 bits, so many distinct values end up with the same hash code and HashMap lookups degenerate into long bucket scans. See reference 1 for details.
 @Override
  public int hashCode() {
    return (int)Double.doubleToLongBits(value);
  }
Hive's earlier DoubleWritable instead XORed the upper 32 bits into the lower 32, which distributes the hash codes evenly:
 @Override
  public int hashCode() {
    long v = Double.doubleToLongBits(get());
    return (int) (v ^ (v >>> 32));
  }
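The difference is dramatic for exactly the values a bigint join key becomes: whole-number doubles of modest magnitude have all-zero low mantissa bits, so the truncating hashCode maps every one of them to 0. A standalone sketch (not Hive code) comparing the two implementations:

```java
import java.util.HashSet;
import java.util.Set;

public class DoubleHashDemo {
    // Hadoop org.apache.hadoop.io.DoubleWritable: keeps only the low 32 bits
    static int truncatingHash(double value) {
        return (int) Double.doubleToLongBits(value);
    }

    // Hive's earlier implementation: fold the high 32 bits into the low 32
    static int foldingHash(double value) {
        long v = Double.doubleToLongBits(value);
        return (int) (v ^ (v >>> 32));
    }

    public static void main(String[] args) {
        Set<Integer> truncated = new HashSet<>();
        Set<Integer> folded = new HashSet<>();
        // Simulate 100000 bigint ids promoted to double, as in the join above
        for (long id = 1; id <= 100000; id++) {
            truncated.add(truncatingHash((double) id));
            folded.add(foldingHash((double) id));
        }
        // Whole numbers below 2^21 use only the top 20 mantissa bits, so the
        // low 32 bits are zero and the truncating hash is 0 for all of them.
        System.out.println("distinct truncating hashes: " + truncated.size()); // 1
        System.out.println("distinct folding hashes:    " + folded.size());    // 100000
    }
}
```

With every key in one HashMap bucket, each get() walks a chain of up to 100000 entries, which matches the HashMap.getEntry hot spot in the jstack output.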

The final fix was to restore the override in Hive's DoubleWritable class:
import org.apache.hadoop.io.WritableComparator;

// Hive's wrapper around Hadoop's DoubleWritable (the full source also defines
// the inner Comparator class registered below).
public class DoubleWritable extends org.apache.hadoop.io.DoubleWritable {

  public DoubleWritable() {
    super();
  }

  public DoubleWritable(double value) {
    super(value);
  }

  static { // register this comparator
    WritableComparator.define(DoubleWritable.class, new Comparator());
  }

  // Added by Eagooqi: restore the evenly distributed hashCode()
  @Override
  public int hashCode() {
    long v = Double.doubleToLongBits(get());
    return (int) (v ^ (v >>> 32));
  }
}
In addition, these two HiveConf parameters can also improve performance:
HIVEHASHTABLETHRESHOLD("hive.hashtable.initialCapacity", 100000, "Initial capacity of " +
        "mapjoin hashtable if statistics are absent, or if hive.hashtable.stats.key.estimate.adjustment is set to 0"),
    HIVEHASHTABLELOADFACTOR("hive.hashtable.loadfactor", (float) 0.75, ""),
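Those are the HiveConf definitions; in a session you could raise them before running the query, for example (the values below are illustrative, not recommendations from the investigation):

```
set hive.hashtable.initialCapacity=1000000;
set hive.hashtable.loadfactor=0.5;
```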

References:
1. http://coding-geek.com/how-does-a-hashmap-work-in-java/
2. http://mail-archives.apache.org/mod_mbox/hive-dev/201503.mbox/%3CCAKDnX7kVjLofA+hHPvCVukNz=VQ6BsvSC5s+ZNuecgpmZH11VA@mail.gmail.com%3E
