HDFS中LightWeightGSet与HashMap结构解析

看见HashMap我就想起了一道HashMap的面试题，先来瞅下这个面试题：
HashMap和HashTable的区别？

HashMap是非线程安全的，HashTable是线程安全的。
HashMap的键值都可以为null，HashTable则不行。
HashMap是非线程安全的则效率会高于HashTable，这也是一般都用HashMap的原因。

深入的问题，Java中另一个线程安全的与HashMap类似的类是？
ConcurrentHashMap也是一个线程安全类，与HashTable不同的是提供的锁机制不一样(与SynchronizedMap也不一样)
HashTable中采用的锁机制是一次锁住整个hash表，从而同一时刻只能由一个线程对其进行操作；
ConcurrentHashMap是一次锁住一个桶，hash表默认是16个桶，诸如get、put、remove等操作只锁住当前需要的桶。

上面简单回忆了下HashMap，下面说下LightWeightGSet，LightWeightGSet是HDFS中的一个数据结构，类似HashMap，用来存储block和副本所在dn的映射，是Namenode中存储全部数据块信息类BlocksMap的一个重要的数据结构。

LightWeightGSet使用数组来存储元素，使用链表来解决冲突，非线程安全。这点与HashMap类似，但是LightWeightGSet中数组的大小不会发生变化，而HashMap会根据其内部空间利用率对数组的大小进行再分配，进行重新哈希分区。再者就是LightWeightGSet不能存null元素。

下面从源码角度对比一下LightWeightGSet和HashMap的区别。

LightWeightGSet和HashMap的初始化

接口实现

LightWeightGSet实现了GSet接口，HashMap实现了Map接口。其中GSet和Map接口分别定义了键映射到值的规则。

初始化

LightWeightGSet只有一个构造函数，初始化时传入数组的大小，在BlocksMap中初始化代码如下：
blocks = new LightWeightGSet<Block, BlockInfo>(capacity)

HashMap提供了三个构造函数：
HashMap()：构造一个具有默认初始容量(16)和默认loadFactor(0.75)的空HashMap。
HashMap(int initialCapacity)：构造一个带指定初始容量和默认loadFactor(0.75)的空HashMap。
HashMap(int initialCapacity, float loadFactor)：构造一个带指定初始容量和loadFactor的空HashMap。

在这里提到了两个参数：初始容量，loadFactor。这两个参数是影响HashMap性能的重要参数，其中容量表示哈希表中桶的数量，初始容量是创建哈希表时的容量，loadFactor是哈希表在其容量自动增加之前可以达到多满的一种尺度，它衡量的是一个散列表的空间的使用程度，loadFactor越大表示散列表的装填程度越高，反之愈小。对于使用链表法的散列表来说，查找一个元素的平均时间是O(1+a)，因此如果负载因子越大，对空间的利用更充分，然而后果是查找效率的降低；如果负载因子太小，那么散列表的数据将过于稀疏，对空间造成严重浪费。系统默认负载因子为0.75，一般情况下我们是无需修改的。

数据结构

我们知道在Java中最常用的两种结构是数组和模拟指针(引用)，几乎所有的数据结构都可以利用这两种来组合实现，HashMap和LightWeightGSet就是如此。

数组的特点：连续空间，寻址迅速，但是在删除或者添加元素的时候需要有较大幅度移动，所以查询速度快，增删较慢。
链表正好相反，由于空间不连续，寻址困难，增删元素只需修改指针，所以查询慢，增删快。

HashMap数据结构

实际上HashMap和LightWeightGSet都是一个“链表散列”，如下是HashMap数据结构：

从上图我们可以看出HashMap底层实现还是数组，只是数组的每一项都是一条链。其中参数initialCapacity就代表了该数组的长度。下面为HashMap构造函数的源码：

public HashMap(int initialCapacity, float loadFactor) {
    //初始容量不能<0
    if (initialCapacity < 0)
        throw new IllegalArgumentException("Illegal initial capacity: "
                + initialCapacity);
    //初始容量不能 > 最大容量值，HashMap的最大容量值为2^30
    if (initialCapacity > MAXIMUM_CAPACITY)
        initialCapacity = MAXIMUM_CAPACITY;
    //负载因子不能 < 0
    if (loadFactor <= 0 || Float.isNaN(loadFactor))
        throw new IllegalArgumentException("Illegal load factor: "
                + loadFactor);
    // 计算出大于 initialCapacity 的最小的 2 的 n 次方值。
    int capacity = 1;
    // 这里保证了table数组的长度是2的n次方，这样会出现传入的initialCapacity比实际
    while (capacity < initialCapacity)     
        // 分配的capacity小。这样的好处是在利用hash寻址时方便，快速，且key分布均匀
        capacity <<= 1;                    
    this.loadFactor = loadFactor;
    //设置HashMap的容量极限，当HashMap的容量达到该极限时就会进行扩容操作
    threshold = (int) (capacity * loadFactor);
    //初始化table数组
    table = new Entry[capacity];
    init();
}

从源码中可以看出，每次新建一个HashMap时，都会初始化一个table数组。table数组的元素为Entry节点。

static class Entry<K,V> implements Map.Entry<K,V> {
    final K key;
    V value;
    Entry<K,V> next;
    final int hash;
    /**
     * Creates new entry.
     */
    Entry(int h, K k, V v, Entry<K,V> n) {
        value = v;
        next = n;
        key = k;
        hash = h;
    }
    .......
}

其中Entry为HashMap的内部类，它包含了键key、值value、下一个节点next，以及hash值（这个hash值是key的hash赋值的），这是非常重要的，正是由于Entry才构成了table数组的项为链表。

LightWeightGSet数据结构

先来看下LightWeightGSet的构造函数：

public LightWeightGSet(final int recommended_length) {
  final int actual = actualArrayLength(recommended_length);
  ...
  entries = new LinkedElement[actual];
  hash_mask = entries.length - 1;
}

由代码可以看出LightWeightGSet的数据结构和HashMap类似，LightWeightGSet初始化时会创建一个LinkedElement类型的数组，LinkedElement是LightWeightGSet的内部接口，由此解决了元素的冲突问题，使冲突的元素形成一个链表。BlocksMap的结构中通过BlockInfo实现LinkedElement接口，通过nextLinkedElement连接下一个BlockInfo。

public static interface LinkedElement {
  /** Set the next element. */
  public void setNext(LinkedElement next);
  /** Get the next element. */
  public LinkedElement getNext();
}

LightWeightGSet与HashMap类似，都是创建一个数组，LightWeightGSet的数组中存储LinkedElement类型的元素，HashMap的数组中存储Entry类型的元素，这些元素都会将key冲突的元素通过引用连成一个链表。

上面简单分析了LightWeightGSet和HashMap的数据结构，下面将探讨他们是如何实现快速存取的。

存储实现put

分别看下LightWeightGSet和HashMap的put方法
LightWeightGSet.put

// LightWeightGSet
public E put(final E element) {
  //validate element
  if (element == null) {
    throw new NullPointerException("Null element is not supported.");
  }
  if (!(element instanceof LinkedElement)) {
    throw new HadoopIllegalArgumentException(
        "!(element instanceof LinkedElement), element.getClass()="
        + element.getClass());
  }
  final LinkedElement e = (LinkedElement)element;

  //find index
  final int index = getIndex(element);

  //remove if it already exists
  final E existing = remove(index, element);

  //insert the element to the head of the linked list
  modification++;
  size++;
  e.setNext(entries[index]);
  entries[index] = e;

  return existing;
}

HashMap.put

public V put(K key, V value) {
        //当key为null，调用putForNullKey方法，保存null与table第一个位置中，这是HashMap允许为null的原因
        if (key == null)
            return putForNullKey(value);
        //计算key的hash值
        int hash = hash(key.hashCode());           
        //计算key hash 值在 table 数组中的位置
        int i = indexFor(hash, table.length); 
        //从i出开始迭代 e,找到 key 保存的位置
        for (Entry<K, V> e = table[i]; e != null; e = e.next) {
            Object k;
            //判断该条链上是否有hash值相同的(key相同)
            //若存在相同，则直接覆盖value，返回旧value
            if (e.hash == hash && ((k = e.key) == key || key.equals(k))) {
                V oldValue = e.value;    //旧值 = 新值
                e.value = value;
                e.recordAccess(this);
                return oldValue;     //返回旧值            
            }
        }
        //修改次数增加1
        modCount++;
        //将key、value添加至i位置处       
        addEntry(hash, key, value, i);
        return null;
}

void addEntry(int hash, K key, V value, int bucketIndex) {
    // 保存“bucketIndex”位置的值到“e”中
    Entry<K,V> e = table[bucketIndex];
    // 设置“bucketIndex”位置的元素为“新Entry”，
    // 设置“e”为“新Entry的下一个节点”
    table[bucketIndex] = new Entry<K,V>(hash, key, value, e);
    // 若HashMap的实际大小 不小于 “阈值”，则调整HashMap的大小
    if (size++ >= threshold)
        resize(2 * table.length);
}

两个put都是先对element进行校验，然后拿到key对应数组的index，其index都是对key取hashcode值与运算因子进行&操作得出的。两部分代码如下：
LightWeightGSet

private int getIndex(final K key) {
  return key.hashCode() & hash_mask;
}
// BlockInfo.class
public int hashCode() {
  // Super implementation is sufficient
  return super.hashCode();
}
// Block.class
public int hashCode() {
  //GenerationStamp is IRRELEVANT and should not be used here
  return (int)(blockId^(blockId>>>32));
}

HashMap

static int hash(int h) { 
	h ^= (h >>> 20) ^ (h >>> 12); 
	return h ^ (h >>> 7) ^ (h >>> 4); 
}

static int indexFor(int h, int length) {
        return h & (length-1);
    }

这里重点说下运算因子，LightWeightGSet的运算因子是hash_mask，在构造函数中赋值，hash_mask = entries.length - 1，是entries数组长度减一。HashMap的运算因子也是entry数组的长度减一。之所以这样是为了让数据在LightWeightGSet和HashMap中的分布尽量均匀(最好每项都只有一个元素，这样就可以直接找到)，不能太紧也不能太松，太紧会导致查询速度慢，太松则浪费空间。

由于LightWeightGSet和HashMap底层数组的长度都是2的n次方，LightWeightGSet中数组的长度计算如下：

private static int actualArrayLength(int recommended) {
  if (recommended > MAX_ARRAY_LENGTH) {
    return MAX_ARRAY_LENGTH;
  } else if (recommended < MIN_ARRAY_LENGTH) {
    return MIN_ARRAY_LENGTH;
  } else {
    final int a = Integer.highestOneBit(recommended);
    return a == recommended? a: a << 1;
  }
}

HashMap中数组的长度计算为capacity <<= 1。则使用h&(length - 1)对length取模。同时运算因子length-1还有一个非常重要的责任：均匀分布table数据和充分利用空间。

当length为2的n次方时，h&(length - 1)就相当于对length取模(只有具备length为2的n次方的条件时，才成立)，也就是h%length，而且速度比直接取模(%)快得多，这是HashMap在速度上的一个优化。

得到key对应的index之后，进行查重put，LightWeightGSet通过remove方法进行查重，如果链表中的第一个元素不等于此值，则遍历该链表，查看是否有相同值，有则删除。如果等于链表中的第一个值则直接删除。LightWeightGSet的插入链表操作比较简单，直接调用setNext方法指向下一个元素，并将此值放入index处就可以了(这里实际存储的是BlockInfo对象，BlockInfo实现了LightWeightGSet.LinkedElement接口，并且声明了一个nextLinkedElement属性，该属性是下一个元素的引用)。remove代码如下：

private E remove(final int index, final K key) {
  if (entries[index] == null) {
    return null;
  } else if (entries[index].equals(key)) {
  	// 等于链表中的第一个元素
    //remove the head of the linked list
    modification++;
    size--;
    // 得到index处的值
    final LinkedElement e = entries[index];
    // 将index指向下一个元素
    entries[index] = e.getNext();
    // 将元素e从链表中移除
    e.setNext(null);
    return convert(e);
  } else {
    //head != null and key is not equal to head
    //search the element
    // 如果不等于第一个元素则遍历该链表查找是否有相同值
    LinkedElement prev = entries[index];
    for(LinkedElement curr = prev.getNext(); curr != null; ) {
      if (curr.equals(key)) {
        //found the element, remove it
        modification++;
        size--;
        prev.setNext(curr.getNext());
        curr.setNext(null);
        return convert(curr);
      } else {
        prev = curr;
        curr = curr.getNext();
      }
    }
    //element not found
    return null;
  }
}

HashMap是在put中实现，若该位置没有元素，则直接插入。否则迭代该处元素链表并依此比较其key的hash值。如果两个hash值相等且key值相等(e.hash == hash && ((k = e.key) == key || key.equals(k))),则用新的Entry的value覆盖原来节点的value。如果两个hash值相等但key值不等，则将该节点插入该链表的链头。具体的实现过程见addEntry方法，如下：

void addEntry(int hash, K key, V value, int bucketIndex) {
    //获取bucketIndex处的Entry
    Entry<K, V> e = table[bucketIndex];
    //将新创建的Entry放入bucketIndex 索引处，并让新的Entry指向原来的Entry 
    table[bucketIndex] = new Entry<K, V>(hash, key, value, e);
    //若HashMap中元素的个数超过极限了，则容量扩大两倍
    if (size++ >= threshold)
        resize(2 * table.length);
}

这里需要注意一点，LightWeightGSet不会根据数据的多少而进行二次扩容，而HashMap会有个临界点来触发扩容。该临界点在当HashMap中元素的数量等于table数组长度*加载因子。但是扩容是一个非常耗时的过程，因为它需要重新计算这些数据在新table数组中的位置并进行复制处理。所以如果我们已经预知HashMap中元素的个数，那么预设元素的个数能够有效的提高HashMap的性能。

读取实现get

get就比较简单，都是通过key的hash值找到在底层数组的index，然后遍历该链表找到key对应的value。
LightWeightGSet.get

public E get(final K key) {
  //validate key
  if (key == null) {
    throw new NullPointerException("key == null");
  }

  //find element
  final int index = getIndex(key);
  for(LinkedElement e = entries[index]; e != null; e = e.getNext()) {
    if (e.equals(key)) {
      return convert(e);
    }
  }
  //element not found
  return null;
}

HashMap.get

public V get(Object key) {
    // 若为null，调用getForNullKey方法返回相对应的value
    if (key == null)
        return getForNullKey();
    // 根据该 key 的 hashCode 值计算它的 hash 码  
    int hash = hash(key.hashCode());
    // 取出 table 数组中指定索引处的值
    for (Entry<K, V> e = table[indexFor(hash, table.length)]; e != null; e = e.next) {
        Object k;
        //若搜索的key与查找的key相同，则返回相对应的value
        if (e.hash == hash && ((k = e.key) == key || key.equals(k)))
            return e.value;
    }
    return null;
}

在HashMap中能够根据key快速的取到value除了和HashMap的数据结构密不可分外，还和Entry有莫大的关系，在前面就提到过，HashMap在存储过程中并没有将key，value分开来存储，而是当做一个整体key-value来处理的，这个整体就是Entry对象。同时value也只相当于key的附属而已。在存储的过程中，系统根据key的hashcode来决定Entry在table数组中的存储位置，在取的过程中同样根据key的hashcode取出相对应的Entry对象。

在LightWeightGSet中，key和value都是BlockInfo对象。

附加

LightWeightGSet中底层数组的大小是在构造函数中固定的，并且其数组的大小不会扩容，则其数组的初始化就尤为重要，下面看下BlocksMap设置的LightWeightGSet数组的大小。代码如下：

/**
 * Let t = percentage of max memory.
 * Let e = round(log_2 t).
 * Then, we choose capacity = 2^e/(size of reference),
 * unless it is outside the close interval [1, 2^30].
 */
public static int computeCapacity(double percentage, String mapName) {
  return computeCapacity(Runtime.getRuntime().maxMemory(), percentage,
      mapName);
}

static int computeCapacity(long maxMemory, double percentage,
    String mapName) {
  ... // 参数校验
  //VM detection
  //See http://java.sun.com/docs/hotspot/HotSpotFAQ.html#64bit_detection
  final String vmBit = System.getProperty("sun.arch.data.model");

  //Percentage of max memory
  final double percentDivisor = 100.0/percentage;
  // 运行时的最大内存Runtime.getRuntime().maxMemory()
  final double percentMemory = maxMemory/percentDivisor;
  
  //compute capacity
  final int e1 = (int)(Math.log(percentMemory)/Math.log(2.0) + 0.5);
  final int e2 = e1 - ("32".equals(vmBit)? 2: 3);
  final int exponent = e2 < 0? 0: e2 > 30? 30: e2;
  final int c = 1 << exponent;
  ...
  return c;
}