• Ubuntu 20.04.5 LTS

  • Hadoop 2.10.2

Installing Ubuntu

Download the installation image from a mirror site, for example the USTC Open Source Software Mirror.

I use VirtualBox for the virtual machines. When creating a VM, check Skip Unattended Installation; it is also advisable to disable the network and choose a minimal installation. During OS installation, set the username to hadoop, the NameNode machine's hostname to master, and the DataNode machines' hostnames to slave0, slave1, and so on. This guide deploys a three-machine cluster.
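
If a hostname needs to be adjusted after installation, there is no need to reinstall; hostnamectl can set it (shown here for one of the data nodes), and the new name appears in the prompt after you open a new shell or reboot.

sudo hostnamectl set-hostname slave0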

With Simplified Chinese as the installation language, the user's home subdirectories get Chinese names, which is awkward on the command line. You can switch the language to English in Settings, log out and back in, and accept the system's prompt to rename the user folders to English; then switch back to Chinese, log out and back in again, and this time choose to keep the existing folder names.
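
The same renaming can also be triggered from a terminal; a minimal sketch, assuming the xdg-user-dirs-gtk package that ships with the desktop is installed:

export LANG=en_US.UTF-8
xdg-user-dirs-gtk-update    # a dialog offers to rename the folders to English
export LANG=zh_CN.UTF-8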

After installation, switch the package sources to a domestic mirror, enable the network, then update the system and install the necessary software.
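
To point apt at the USTC mirror, for instance, a sed one-liner like the following can be used (a sketch; it assumes the stock focal sources.list, so back it up first and check the resulting URLs):

sudo cp /etc/apt/sources.list /etc/apt/sources.list.bak
sudo sed -i 's|http://.*archive.ubuntu.com|https://mirrors.ustc.edu.cn|g; s|http://security.ubuntu.com|https://mirrors.ustc.edu.cn|g' /etc/apt/sources.list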

sudo apt update
sudo apt upgrade
sudo apt -y install vim openssh-server net-tools
sudo snap install code

Cluster Communication Configuration

Network Configuration

In the VM settings, set the network adapter to "Bridged Adapter".

Run ifconfig to check the machine's IP address, e.g. 192.168.2.254.

hadoop@master:~$ ifconfig
enp0s3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.2.254 netmask 255.255.255.0 broadcast 192.168.2.255
inet6 fe80::4bb5:bdec:2d63:fece prefixlen 64 scopeid 0x20<link>
ether 08:00:27:2b:ed:a7 txqueuelen 1000 (Ethernet)
RX packets 9885 bytes 11971313 (11.9 MB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 1294 bytes 98350 (98.3 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 1000 (Local Loopback)
RX packets 149 bytes 11998 (11.9 KB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 149 bytes 11998 (11.9 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
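
A bridged adapter normally gets its address from the router's DHCP, so it may change across reboots and break the host mappings below. To pin a static address, a netplan sketch along these lines can be adapted on each node with its own IP (the adapter name enp0s3 comes from the output above; the file name, gateway, and DNS server are assumptions for a 192.168.2.0/24 network, and on a desktop install you can just as well set the address in the NetworkManager GUI):

sudo tee /etc/netplan/99-hadoop-static.yaml > /dev/null <<'EOF'
network:
  version: 2
  ethernets:
    enp0s3:
      dhcp4: no
      addresses: [192.168.2.254/24]
      gateway4: 192.168.2.1
      nameservers:
        addresses: [192.168.2.1]
EOF
sudo netplan apply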

Map the IP addresses of all machines in the /etc/hosts file on every machine.

127.0.0.1       localhost
192.168.2.254 master
192.168.2.10 slave0
192.168.2.2 slave1

Disable the firewall on all systems.

sudo systemctl stop ufw
sudo systemctl disable ufw
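
Equivalently, ufw's own front end can be used, and the state confirmed afterwards:

sudo ufw disable
sudo ufw status    # should report: Status: inactive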

Passwordless SSH Login

First, create the .ssh directory on all machines.

cd ~
mkdir .ssh

Generate a key pair on the master node; just keep pressing Enter after running the command.

ssh-keygen -t rsa

Next, enable the master node to log in to itself without a password.

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Upload the public key to all of the other nodes with commands like the following.

scp ~/.ssh/id_rsa.pub hadoop@slave1:/home/hadoop/

Then run the following on each slave machine.

cat ~/id_rsa.pub >> ~/.ssh/authorized_keys
rm ~/id_rsa.pub
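
Alternatively, the copy-and-append steps can be collapsed into a single command per node with ssh-copy-id, run from the master (it asks for the remote password once):

ssh-copy-id hadoop@slave0
ssh-copy-id hadoop@slave1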

On the master node, try ssh slave0 and ssh slave1 to verify that passwordless login works.

Environment Setup

Installing Java

Install Java on all machines.

sudo apt install -y openjdk-8-jdk

Installing and Configuring Hadoop

Hadoop only needs to be installed and configured on the master node; the finished directory is then copied to all the other nodes.

cd ~/Downloads
wget https://mirrors.nju.edu.cn/apache/hadoop/common/stable2/hadoop-2.10.2.tar.gz
sudo tar -zxvf hadoop-2.10.2.tar.gz -C /usr/local
cd /usr/local/
sudo mv hadoop-2.10.2 hadoop
sudo chown -R hadoop ./hadoop

All configuration files are under /usr/local/hadoop/etc/hadoop/.

Edit slaves and fill in the data nodes.

slave0
slave1

Edit core-site.xml.

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/usr/local/hadoop/tmp</value>
        <description>A base for other temporary directories.</description>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://master:9000</value>
    </property>
</configuration>

Edit hdfs-site.xml.

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->
<configuration>
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>master:50090</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/usr/local/hadoop/tmp/dfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/usr/local/hadoop/tmp/dfs/data</value>
    </property>
</configuration>

Create mapred-site.xml.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>master:10020</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>master:19888</value>
    </property>
</configuration>

Edit yarn-site.xml.

<?xml version="1.0"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<configuration>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>master</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>

Edit hadoop-env.sh, setting JAVA_HOME to an explicit path.

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Set Hadoop-specific environment variables here.

# The only required environment variable is JAVA_HOME. All others are
# optional. When running a distributed configuration it is best to
# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.

# The java implementation to use.
export JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64" # ${JAVA_HOME}

# The jsvc implementation to use. Jsvc is required to run secure datanodes
# that bind to privileged ports to provide authentication of data transfer
# protocol. Jsvc is not required if SASL is configured for authentication of
# data transfer protocol using non-privileged ports.
#export JSVC_HOME=${JSVC_HOME}

export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-"/etc/hadoop"}

# Extra Java CLASSPATH elements. Automatically insert capacity-scheduler.
for f in $HADOOP_HOME/contrib/capacity-scheduler/*.jar; do
  if [ "$HADOOP_CLASSPATH" ]; then
    export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$f
  else
    export HADOOP_CLASSPATH=$f
  fi
done

# The maximum amount of heap to use, in MB. Default is 1000.
#export HADOOP_HEAPSIZE=
#export HADOOP_NAMENODE_INIT_HEAPSIZE=""

# Enable extra debugging of Hadoop's JAAS binding, used to set up
# Kerberos security.
# export HADOOP_JAAS_DEBUG=true

# Extra Java runtime options. Empty by default.
# For Kerberos debugging, an extended option set logs more information
# export HADOOP_OPTS="-Djava.net.preferIPv4Stack=true -Dsun.security.krb5.debug=true -Dsun.security.spnego.debug"
export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true"

# Command specific options appended to HADOOP_OPTS when specified
export HADOOP_NAMENODE_OPTS="-Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,RFAS} -Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-INFO,NullAppender} $HADOOP_NAMENODE_OPTS"
export HADOOP_DATANODE_OPTS="-Dhadoop.security.logger=ERROR,RFAS $HADOOP_DATANODE_OPTS"

export HADOOP_SECONDARYNAMENODE_OPTS="-Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,RFAS} -Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-INFO,NullAppender} $HADOOP_SECONDARYNAMENODE_OPTS"

export HADOOP_NFS3_OPTS="$HADOOP_NFS3_OPTS"
export HADOOP_PORTMAP_OPTS="-Xmx512m $HADOOP_PORTMAP_OPTS"

# The following applies to multiple commands (fs, dfs, fsck, distcp etc)
export HADOOP_CLIENT_OPTS="$HADOOP_CLIENT_OPTS"
# set heap args when HADOOP_HEAPSIZE is empty
if [ "$HADOOP_HEAPSIZE" = "" ]; then
  export HADOOP_CLIENT_OPTS="-Xmx512m $HADOOP_CLIENT_OPTS"
fi
#HADOOP_JAVA_PLATFORM_OPTS="-XX:-UsePerfData $HADOOP_JAVA_PLATFORM_OPTS"

# On secure datanodes, user to run the datanode as after dropping privileges.
# This **MUST** be uncommented to enable secure HDFS if using privileged ports
# to provide authentication of data transfer protocol. This **MUST NOT** be
# defined if SASL is configured for authentication of data transfer protocol
# using non-privileged ports.
export HADOOP_SECURE_DN_USER=${HADOOP_SECURE_DN_USER}

# Where log files are stored. $HADOOP_HOME/logs by default.
#export HADOOP_LOG_DIR=${HADOOP_LOG_DIR}/$USER

# Where log files are stored in the secure data environment.
#export HADOOP_SECURE_DN_LOG_DIR=${HADOOP_LOG_DIR}/${HADOOP_HDFS_USER}

###
# HDFS Mover specific parameters
###
# Specify the JVM options to be used when starting the HDFS Mover.
# These options will be appended to the options specified as HADOOP_OPTS
# and therefore may override any similar flags set in HADOOP_OPTS
#
# export HADOOP_MOVER_OPTS=""

###
# Router-based HDFS Federation specific parameters
# Specify the JVM options to be used when starting the RBF Routers.
# These options will be appended to the options specified as HADOOP_OPTS
# and therefore may override any similar flags set in HADOOP_OPTS
#
# export HADOOP_DFSROUTER_OPTS=""
###

###
# Advanced Users Only!
###

# The directory where pid files are stored. /tmp by default.
# NOTE: this should be set to a directory that can only be written to by
# the user that will run the hadoop daemons. Otherwise there is the
# potential for a symlink attack.
export HADOOP_PID_DIR=${HADOOP_PID_DIR}
export HADOOP_SECURE_DN_PID_DIR=${HADOOP_PID_DIR}

# A string representing this instance of hadoop. $USER by default.
export HADOOP_IDENT_STRING=$USER

Once everything is configured, copy the Hadoop directory from the master node to each of the other nodes; the first command below runs on the master, the second on the receiving slave.

# On the master node:
scp -r /usr/local/hadoop slave1:/home/hadoop/Downloads/
# On the slave node:
sudo mv ~/Downloads/hadoop /usr/local/
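
To push the directory to every data node in one go, a small loop over the node names works just as well (a sketch, run on the master node):

for node in slave0 slave1; do
  scp -r /usr/local/hadoop "$node":/home/hadoop/Downloads/
done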

Environment Variable Configuration

Environment variables must be configured on every machine; append the following to the end of ~/.bashrc.

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=${JAVA_HOME}/bin:/usr/local/hadoop/bin:/usr/local/hadoop/sbin:$PATH
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export JAVA_LIBRARY_PATH=${HADOOP_HOME}/lib/native

Then reload the configuration.

source ~/.bashrc
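
A quick way to confirm that the variables took effect is to check that both tools resolve from the new PATH:

java -version
hadoop version    # should print Hadoop 2.10.2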

Running the Hadoop Cluster

Starting the Cluster

On the first start, format the NameNode on the master node.

hdfs namenode -format

Hadoop can now be started; this is done on the master node.

start-dfs.sh
start-yarn.sh
mr-jobhistory-daemon.sh start historyserver

The jps command shows which daemons each node is running. On the master node you should see Jps, ResourceManager, JobHistoryServer, NameNode, and SecondaryNameNode; on the slave nodes, Jps, NodeManager, and DataNode.
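
Thanks to the passwordless SSH configured earlier, the slave daemons can be checked without leaving the master (if jps is not found over a non-interactive SSH session, log in to the node first and run it there):

jps
ssh slave0 jps
ssh slave1 jps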

You should also check that the data nodes came up properly, for example with hdfs dfsadmin -report on the master node. This cluster uses three machines, two of which serve as data nodes, so the report shows "Live datanodes (2)".

Configured Capacity: 25140936704 (23.41 GB)
Present Capacity: 4816609280 (4.49 GB)
DFS Remaining: 4816560128 (4.49 GB)
DFS Used: 49152 (48 KB)
DFS Used%: 0.00%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0
Pending deletion blocks: 0

-------------------------------------------------
Live datanodes (2):

Name: 192.168.2.10:50010 (slave0)
Hostname: slave0
Decommission Status : Normal
Configured Capacity: 12570468352 (11.71 GB)
DFS Used: 24576 (24 KB)
Non DFS Used: 9061605376 (8.44 GB)
DFS Remaining: 2847973376 (2.65 GB)
DFS Used%: 0.00%
DFS Remaining%: 22.66%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Fri Feb 10 22:11:05 CST 2023
Last Block Report: Fri Feb 10 22:09:42 CST 2023


Name: 192.168.2.2:50010 (slave1)
Hostname: slave1
Decommission Status : Normal
Configured Capacity: 12570468352 (11.71 GB)
DFS Used: 24576 (24 KB)
Non DFS Used: 9940992000 (9.26 GB)
DFS Remaining: 1968586752 (1.83 GB)
DFS Used%: 0.00%
DFS Remaining%: 15.66%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Fri Feb 10 22:11:05 CST 2023
Last Block Report: Fri Feb 10 22:09:42 CST 2023

The cluster status can also be viewed in a browser at http://master:50070/.

Running a Distributed Example

To use HDFS, you first need to create a user directory in HDFS.

hdfs dfs -mkdir -p /user/hadoop

Next, copy the configuration files from ${HADOOP_HOME}/etc/hadoop into the distributed file system as the input. Since we are running as the hadoop user and the corresponding user directory /user/hadoop has been created, commands can use a relative path such as input, whose absolute path is /user/hadoop/input.

hdfs dfs -mkdir input
hdfs dfs -put $HADOOP_HOME/etc/hadoop/*.xml input

Once the copy is done, the file list can be checked with the following command.

hdfs dfs -ls input

Run the example MapReduce job, which greps the input files for strings matching the regular expression dfs[a-z.]+.

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar grep input output "dfs[a-z.]+"

Job progress can be followed at http://master:8088.
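
When the job completes, the results are under the output directory in HDFS; they can be printed directly or copied back to the local file system:

hdfs dfs -cat output/*
hdfs dfs -get output ./output    # copy the results to local disk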

Remember to shut the cluster down when the job is finished.

mr-jobhistory-daemon.sh stop historyserver
stop-yarn.sh
stop-dfs.sh

References