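Yesterday the Hadoop get command suddenly stopped working. It returned a NullPointerException, so no data could be pulled from HDFS. Other commands worked normally, and nobody had changed any configuration recently.
That was alarming, but panicking doesn't help, so I went back to the logs. They contained no specific error message. The only thing I noticed was that the NN's state had changed to standby.
In my experience, though, an NN failover does not make Hadoop commands return a NullPointerException. Was something wrong with the original configuration?
First things first: I switched the NN back so that the production jobs could keep running. After the switch everything worked again, which left me thoroughly confused...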
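
For reference, checking which NN is active and switching back can be done with `hdfs haadmin`. The sketch below only wraps those two subcommands in Python. The service IDs `nn1` and `nn2` are assumptions (the real ones come from `dfs.ha.namenodes.<nameservice>` in your hdfs-site.xml), and `run` is a hypothetical helper, not something from my setup.

```python
import shutil
import subprocess


def get_state_cmd(service_id):
    """Build the command that reports whether one NameNode is active or standby."""
    return ["hdfs", "haadmin", "-getServiceState", service_id]


def failover_cmd(from_id, to_id):
    """Build the command that moves the active role from one NameNode to another."""
    return ["hdfs", "haadmin", "-failover", from_id, to_id]


def run(cmd):
    """Run a command and return its trimmed stdout, or None if the binary is missing."""
    if shutil.which(cmd[0]) is None:
        return None
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.strip()


# Example usage on a cluster node (assumed service IDs):
#   state = run(get_state_cmd("nn1"))   # state of nn1 as reported by haadmin
#   run(failover_cmd("nn2", "nn1"))     # hand the active role back to nn1
```

Building the argument list separately from running it makes it easy to print the command first and check it before touching a production cluster.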

Here is what the failure looked like:

$ hadoop dfs -get /test/test_1527672887521.sh .
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

get: java.lang.NullPointerException

Service was restored, but the root cause still had to be found so the problem could be fixed for good. Time to dig in.

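  1. First I tested an NN failover in the test environment to see whether it would make the get command return NullPointerException. The problem did not reproduce there.
  2. I packaged the Hadoop installation from NN1, built a fresh Hadoop environment from it, and tested again. The problem still did not reproduce.
  3. I restarted the production NN and manually triggered a failover again. The get command still returned NullPointerException.
  4. I checked NN2's log (finally remembering that it existed), went to the time of the failure, and searched for NullPointerException. This is the error I found: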
18/07/31 11:53:32 WARN net.ScriptBasedMapping: Exception running ${HADOOP_HOME}/etc/hadoop/rack_awareness.py 127.0.0.1
java.io.IOException: Cannot run program "${HADOOP_HOME}/etc/hadoop/rack_awareness.py" (in directory "${HADOOP_HOME}"): error=13, Permission denied
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:526)
at org.apache.hadoop.util.Shell.run(Shell.java:482)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:776)
at org.apache.hadoop.net.ScriptBasedMapping$RawScriptBasedMapping.runResolveCommand(ScriptBasedMapping.java:251)
at org.apache.hadoop.net.ScriptBasedMapping$RawScriptBasedMapping.resolve(ScriptBasedMapping.java:188)
at org.apache.hadoop.net.CachedDNSToSwitchMapping.resolve(CachedDNSToSwitchMapping.java:119)
at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.resolveNetworkLocation(DatanodeManager.java:751)
at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.resolveNetworkLocation(DatanodeManager.java:731)
at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.resolveNetworkLocationWithFallBackToDefaultLocation(DatanodeManager.java:705)
at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.registerDatanode(DatanodeManager.java:958)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.registerDatanode(FSNamesystem.java:4481)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.registerDatanode(NameNodeRpcServer.java:1286)
at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolServerSideTranslatorPB.registerDatanode(DatanodeProtocolServerSideTranslatorPB.java:96)
at org.apache.hadoop.hdfs.protocol.proto.DatanodeProtocolProtos$DatanodeProtocolService$2.callBlockingMethod(DatanodeProtocolProtos.java:28752)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2217)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2213)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2213)
Caused by: java.io.IOException: error=13, Permission denied
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.<init>(UNIXProcess.java:247)
at java.lang.ProcessImpl.start(ProcessImpl.java:134)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
... 22 more
18/07/31 11:53:32 ERROR blockmanagement.DatanodeManager: The resolve call returned null!
18/07/31 11:53:32 ERROR blockmanagement.DatanodeManager: Unresolved topology mapping. Using /default-rack for host 127.0.0.1
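
The log shows the contract between the NameNode and the script: the script is started with a DataNode address as its argument, the NameNode expects a rack path back, and when nothing comes back it falls back to /default-rack. I am not pasting the real rack_awareness.py here, so the following is only a minimal sketch of a script honoring that contract. The address table and rack names are made up for illustration.

```python
#!/usr/bin/env python3
import sys

# Hypothetical address-to-rack table; a real cluster would maintain its own.
RACKS = {
    "10.0.1.11": "/dc1/rack1",
    "10.0.1.12": "/dc1/rack1",
    "10.0.2.21": "/dc1/rack2",
}

# Returned for any address that is not in the table.
DEFAULT_RACK = "/default-rack"


def resolve(hosts):
    """Return one rack path per host, in the same order as the input."""
    return [RACKS.get(host, DEFAULT_RACK) for host in hosts]


if __name__ == "__main__":
    # Hadoop passes one or more addresses as arguments and reads
    # whitespace-separated rack paths from stdout.
    print("\n".join(resolve(sys.argv[1:])))
```

A script like this is normally wired in through the `net.topology.script.file.name` property in core-site.xml, and the user running the NameNode must be able to execute it.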

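Pretty clear now, isn't it?

It turned out that the rack awareness script on NN2 did not have execute permission. That also explains why the problem only showed up after a failover: the script was broken only on NN2, so it only mattered while NN2 was the active NN. After adding execute permission, all commands work normally after an NN failover.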
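
To avoid running into this again, the same check can be run on every NN before a planned failover. The sketch below is just one way to do it, assuming it runs as the same user as the NameNode; `is_executable` and `add_execute_bits` are hypothetical helper names, and the second one does the same thing as `chmod a+x`.

```python
import os
import stat


def is_executable(path):
    """True if path is a regular file that the current user may execute."""
    return os.path.isfile(path) and os.access(path, os.X_OK)


def add_execute_bits(path):
    """Add execute permission for user, group, and others (like chmod a+x)."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)


# Example usage, with the script path taken from the log above:
#   script = os.path.join(os.environ["HADOOP_HOME"], "etc/hadoop/rack_awareness.py")
#   if not is_executable(script):
#       add_execute_bits(script)
```

`os.access` answers for the user running the check, so running it as a different user than the NameNode can give a misleading result.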