Setting Up an HDFS Cluster

Halim Kim · May 13, 2021

Running Apache Nutch, an open-source distributed crawler, in distributed mode requires a Hadoop cluster. So I tested an HDFS setup on two Linux (Ubuntu) servers.

Hadoop Install

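Download Hadoop 3.2.2 with wget. (If the file is no longer on downloads.apache.org, older releases are kept under https://archive.apache.org/dist/hadoop/common/.)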

$ sudo wget -P /usr/local/ https://downloads.apache.org/hadoop/common/hadoop-3.2.2/hadoop-3.2.2.tar.gz

Move to the directory where the archive was downloaded and extract it.

$ cd /usr/local
$ sudo tar xzf hadoop-3.2.2.tar.gz
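Before extracting, it can be worth checking that the download is intact. The following is a minimal sketch, assuming the expected digest is taken from the .sha512 file that Apache normally publishes next to the tarball. verify_sha512 is a helper name made up for this example, and the demo runs on a throwaway file rather than the real archive.

```shell
# Sketch: compare a file's SHA-512 digest with an expected value.
# verify_sha512 is a hypothetical helper, not part of Hadoop.
verify_sha512() {
  local file="$1" expected="$2" actual
  # sha512sum prints "<digest>  <file>"; keep only the digest
  actual="$(sha512sum "$file" | awk '{print $1}')"
  [ "$actual" = "$expected" ]
}

# Self-contained demo on a throwaway file
demo_file="$(mktemp)"
printf 'hadoop\n' > "$demo_file"
demo_digest="$(sha512sum "$demo_file" | awk '{print $1}')"

if verify_sha512 "$demo_file" "$demo_digest"; then
  echo "checksum OK"
else
  echo "checksum MISMATCH"
fi
```

For the real archive, the first argument would be hadoop-3.2.2.tar.gz and the second the digest copied from the published checksum file. The comparison is case-sensitive, so the digest may need to be lowercased first.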

Setting Environment Variables

Since the test ran on Ubuntu, I added the environment variables to ~/.bashrc.

$ echo 'export HADOOP_HOME=/usr/local/hadoop-3.2.2' >> ~/.bashrc
$ echo 'export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin' >> ~/.bashrc

The source command applies the environment variables right away.

$ source ~/.bashrc
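Running the echo commands more than once leaves duplicate lines in ~/.bashrc. A minimal sketch of one way to avoid that is shown here, assuming an exact whole-line match is good enough. append_once is a hypothetical helper, and a temporary file stands in for ~/.bashrc so the example has no side effects.

```shell
# Sketch: append a line to a file only if that exact line is not there yet.
# append_once is a hypothetical helper name.
append_once() {
  local line="$1" file="$2"
  # -q: quiet, -x: match the whole line, -F: fixed string (no regex)
  grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

rc_file="$(mktemp)"   # stand-in for ~/.bashrc

append_once 'export HADOOP_HOME=/usr/local/hadoop-3.2.2' "$rc_file"
append_once 'export HADOOP_HOME=/usr/local/hadoop-3.2.2' "$rc_file"   # no effect
append_once 'export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin' "$rc_file"

cat "$rc_file"
```

The second call changes nothing, so the file ends up with one HADOOP_HOME line and one PATH line.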

Installing SSH

Hadoop's start and stop scripts use SSH to reach the other nodes, so an SSH server has to be installed.

$ sudo apt install openssh-server

The default PDSH remote command type also has to be changed from rsh to ssh. (I have not yet worked out exactly what this setting is for.)

$ echo 'export PDSH_RCMD_TYPE=ssh' >> ~/.bashrc

PDSH reference: https://linux.die.net/man/1/pdsh

To apply the change after running the command above, run source again, just as was done after registering the Hadoop environment variables.

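With SSH installed, create an RSA key and distribute it so that the master node can log in to the other nodes without a password over an encrypted connection.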

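# Run on the master node
$ ssh-keygen -t rsa
$ cp ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys

# Repeat for every namenode
# {namenode00} is a placeholder for each namenode's host
$ scp ~/.ssh/authorized_keys {namenode00}:~/.ssh/authorized_keys

# Repeat for every datanode
# {datanode00} is a placeholder for each datanode's host

# Create the .ssh directory on the datanode first, then copy the key from the master
$ ssh {datanode00} 'mkdir -p /home/hadoop/.ssh'
$ scp ~/.ssh/authorized_keys {datanode00}:~/.ssh/authorized_keys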
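With more than a couple of nodes, the per-host commands get repetitive. As a rough sketch, the loop below only prints one ssh-copy-id command per host so the list can be reviewed before anything is run. ssh-copy-id is an alternative to the scp approach used in this post (it appends the key instead of overwriting authorized_keys). print_key_commands is a hypothetical helper, and the hadoop user name is an assumption based on the /home/hadoop path above.

```shell
# Sketch: print (not run) one key-distribution command per host.
# print_key_commands is a hypothetical helper; "hadoop" as the login user
# is an assumption taken from the /home/hadoop path used in this post.
print_key_commands() {
  local user="$1"; shift
  local host
  for host in "$@"; do
    # ssh-copy-id appends the public key to the remote authorized_keys
    printf 'ssh-copy-id -i ~/.ssh/id_rsa.pub %s@%s\n' "$user" "$host"
  done
}

print_key_commands hadoop namenode02 datanode01
```

Printing first is a deliberate choice: the output can be checked and then pasted into the shell, or piped to sh once the host list looks right.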

Hadoop Configuration

Configuring Hadoop requires four XML files. The contents of the XML files I used in my test are included below.

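Host IP Alias

Wherever a node's host IP address is needed, I did not type the IP address directly. Instead I defined an alias for each IP in /etc/hosts and used that alias.

For example, you can see that the IP address xxx.xxx.xxx.xxx is given the alias datanode01 and used under that name.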
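To make the alias idea concrete, here is a small sketch that builds entries in the /etc/hosts format and looks one up again. The IP addresses are invented and come from the 192.0.2.0/24 range reserved for documentation, hosts_line is a hypothetical helper, and a temporary file stands in for /etc/hosts.

```shell
# Sketch: build /etc/hosts style entries ("<ip><TAB><alias>").
# hosts_line is a hypothetical helper; the IPs below are placeholders
# from the 192.0.2.0/24 documentation range, not the real cluster IPs.
hosts_line() {
  printf '%s\t%s\n' "$1" "$2"
}

hosts_file="$(mktemp)"   # stand-in for /etc/hosts
{
  hosts_line 192.0.2.11 namenode01
  hosts_line 192.0.2.12 namenode02
  hosts_line 192.0.2.21 datanode01
} >> "$hosts_file"

# Look up the IP registered for an alias
awk -v name=datanode01 '$2 == name {print $1}' "$hosts_file"
```

On a real node the entries go into /etc/hosts (with sudo), and getent hosts datanode01 is a quick way to confirm that the alias resolves.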

Storage for HDFS

HDFS will use an HDD rather than the SSD on which the operating system is installed, so I formatted the HDD with a Linux file system and mounted it.

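# List the storage devices
sudo fdisk -l

# Format with a Linux file system (this erases everything on /dev/sdb)
sudo mkfs.ext4 /dev/sdb

# Create the directory to mount on
sudo mkdir /data

# Mount
sudo mount /dev/sdb /data

# Give ownership of the mounted directory to the hadoop user
sudo chown -R hadoop:hadoop /data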
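A mount made with the mount command does not survive a reboot. One common way to make it permanent is an /etc/fstab entry, and the sketch below only formats such a line. fstab_line is a hypothetical helper, and the options (defaults, dump 0, fsck pass 2) are ordinary choices for a data disk rather than anything taken from my test setup.

```shell
# Sketch: format an /etc/fstab entry for the HDFS data disk.
# fstab_line is a hypothetical helper; "defaults 0 2" is an assumed,
# conventional choice for a non-root data disk.
fstab_line() {
  local device="$1" mount_point="$2" fs_type="${3:-ext4}"
  # Fields: device, mount point, type, options, dump, fsck pass
  printf '%s %s %s defaults 0 2\n' "$device" "$mount_point" "$fs_type"
}

fstab_line /dev/sdb /data
# prints: /dev/sdb /data ext4 defaults 0 2
```

Device names such as /dev/sdb can change between boots, so many setups put UUID=... (as reported by blkid) in the first field instead.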

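The XML files below are meant to use the mounted /data directory. Note that their paths are written as /home/data/..., so adjust them to match the actual mount point.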

core-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
	<property>
		<name>fs.defaultFS</name>
		<value>hdfs://namenode01:9000</value>
	</property>
	<property>
		<name>hadoop.tmp.dir</name>
		<value>/home/data/hadoop/tmp</value>
	</property>
</configuration>
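To double-check what a configuration file actually contains, a property can be pulled out by name. The sketch below is a deliberately simple awk-based reader that assumes each <name> and <value> sits on its own line, as in the file above. get_property is a hypothetical helper, and the demo uses a temporary copy of the two properties. On a machine with Hadoop installed, hdfs getconf -confKey fs.defaultFS gives the effective value.

```shell
# Sketch: read one property value from a Hadoop *-site.xml file.
# get_property is a hypothetical helper and assumes <name> and <value>
# are each on their own line; it is not a general XML parser.
get_property() {
  local name="$1" file="$2"
  awk -v want="<name>${name}</name>" '
    index($0, want) { found = 1; next }
    found && /<value>/ {
      gsub(/^.*<value>|<\/value>.*$/, "")   # strip the tags and indentation
      print
      exit
    }
  ' "$file"
}

conf_file="$(mktemp)"   # stand-in for core-site.xml
cat > "$conf_file" <<'EOF'
<configuration>
	<property>
		<name>fs.defaultFS</name>
		<value>hdfs://namenode01:9000</value>
	</property>
	<property>
		<name>hadoop.tmp.dir</name>
		<value>/home/data/hadoop/tmp</value>
	</property>
</configuration>
EOF

get_property fs.defaultFS "$conf_file"
```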

hdfs-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
	<property>
		<name>dfs.replication</name>
		<value>1</value>
	</property>
	<property>
		<name>dfs.namenode.name.dir</name>
		<value>/home/data/hadoop/dfs/name</value>
	</property>
	<property>
		<name>dfs.namenode.edits.dir</name>
		<value>/home/data/hadoop/dfs/edit</value>
	</property>	
	<property>
		<name>dfs.datanode.data.dir</name>
		<value>/home/data/hadoop/dfs/data</value>
	</property>
	<property>
		<name>dfs.namenode.secondary.http-address</name>
		<value>namenode02:9868</value>
	</property>
	<property>
		<name>dfs.namenode.checkpoint.dir</name>
		<value>/home/data/hadoop/dfs/checkpoint</value>
	</property>
</configuration>
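The four dfs.*.dir properties above point at sibling directories under one base path. Hadoop can usually create them itself when the parent is writable, but creating them up front is one way to catch permission problems early. The sketch below does this under a temporary directory. make_dfs_dirs is a hypothetical helper, and on a real node the base would be /home/data/hadoop as in the file above.

```shell
# Sketch: pre-create the directories named in hdfs-site.xml.
# make_dfs_dirs is a hypothetical helper; the sub-directory names mirror
# the dfs.namenode.*.dir and dfs.datanode.data.dir values above.
make_dfs_dirs() {
  local base="$1" sub
  for sub in name edit data checkpoint; do
    mkdir -p "$base/dfs/$sub"
  done
}

base_dir="$(mktemp -d)"   # stand-in for /home/data/hadoop
make_dfs_dirs "$base_dir"
ls "$base_dir/dfs"
```

Not every node needs every directory (a datanode only uses the data directory), so this could be trimmed per role.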

mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
        <property>
                <name>mapreduce.framework.name</name>
                <value>yarn</value>
        </property>
        <property>
                <name>yarn.app.mapreduce.am.resource.mb</name>
                <value>1228</value>
        </property>
        <property>
                <name>yarn.app.mapreduce.am.env</name>
                <value>HADOOP_MAPRED_HOME=/usr/local/hadoop-3.2.2</value>
        </property>
        <property>
                <name>mapreduce.map.env</name>
                <value>HADOOP_MAPRED_HOME=/usr/local/hadoop-3.2.2</value>
        </property>
        <property>
                <name>mapreduce.reduce.env</name>
                <value>HADOOP_MAPRED_HOME=/usr/local/hadoop-3.2.2</value>
        </property>
        <property>
                <name>mapreduce.application.classpath</name>
                <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*,$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*,$HADOOP_MAPRED_HOME/share/hadoop/common/*,$HADOOP_MAPRED_HOME/share/hadoop/common/lib/*,$HADOOP_MAPRED_HOME/share/hadoop/yarn/*,$HADOOP_MAPRED_HOME/share/hadoop/yarn/lib/*,$HADOOP_MAPRED_HOME/share/hadoop/hdfs/*,$HADOOP_MAPRED_HOME/share/hadoop/hdfs/lib/*</value>
        </property>
</configuration>

yarn-site.xml

YARN manages the resources of the Hadoop cluster. It consists mainly of the ResourceManager and the NodeManager, and I plan to write a separate post about YARN later.
For now, all that needs to be set in yarn-site.xml is the namenode's host address in the resourcemanager.hostname property and the datanode's host address in the nodemanager.hostname property shown below. (If there are several datanodes, the nodemanager.hostname in each datanode's yarn-site.xml should be that node's own host address. For example, datanode01 gets datanode01's host address, and datanode02 gets datanode02's.)

<?xml version="1.0"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->
<configuration>

<!-- Site specific YARN configuration properties -->
        <property>
                <name>yarn.resourcemanager.hostname</name>
                <value>namenode01</value>
        </property>
        <property>
                <name>yarn.nodemanager.hostname</name>
                <value>datanode01</value>
        </property>
        <property>
                <name>yarn.nodemanager.aux-services</name>
                <value>mapreduce_shuffle</value>
        </property>
        <property>
                <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
                <value>1</value>
        </property>
        <property>
                <name>yarn.nodemanager.resource.memory-mb</name>
                <value>4096</value>
        </property>
        <property>
                <name>yarn.nodemanager.env-whitelist</name>
                <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
        </property>
</configuration>
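Since yarn.nodemanager.hostname differs on every datanode, editing each copy of yarn-site.xml by hand invites mistakes. One possible approach, sketched here, is to keep a template with a placeholder and fill it in per node. render_yarn_site and the @NODE_HOST@ token are both made up for this example and are not a Hadoop feature.

```shell
# Sketch: fill a per-node hostname into a yarn-site.xml template.
# render_yarn_site and the @NODE_HOST@ token are hypothetical.
render_yarn_site() {
  local template="$1" node_host="$2"
  sed "s/@NODE_HOST@/${node_host}/g" "$template"
}

template="$(mktemp)"   # stand-in for a yarn-site.xml template
cat > "$template" <<'EOF'
<property>
        <name>yarn.nodemanager.hostname</name>
        <value>@NODE_HOST@</value>
</property>
EOF

render_yarn_site "$template" datanode02
```

The rendered text goes to standard output, so it can be redirected into a file and copied to the matching node.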

Finally, enter the host addresses of the corresponding nodes in the master and workers files.
(The IP of the host configured as the namenode goes in the master file, and the IPs of the hosts configured as datanodes go in the workers file.)
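The workers file is plain text with one host per line. As a small sketch, the helper below writes such a file from a list of hosts. write_workers is a hypothetical name, datanode02 is only an example of a second worker, and a temporary file stands in for the real workers file under the Hadoop configuration directory.

```shell
# Sketch: write a workers file with one host per line.
# write_workers is a hypothetical helper; it overwrites the target file.
write_workers() {
  local file="$1"; shift
  printf '%s\n' "$@" > "$file"
}

workers_file="$(mktemp)"   # stand-in for $HADOOP_HOME/etc/hadoop/workers
write_workers "$workers_file" datanode01 datanode02
cat "$workers_file"
```

Aliases from /etc/hosts work here as well as raw IP addresses, as long as every node resolves them the same way.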

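Running Hadoop

Move to the directory where Hadoop is located. (In my test this was /usr/local/hadoop-3.2.2.) On the very first start, the NameNode has to be formatted once beforehand with hdfs namenode -format.

Then run the start-all.sh script in the sbin directory. Among the shell scripts in sbin there is also a stop-all.sh, which, as the name suggests, stops Hadoop.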

$ sbin/start-all.sh
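After starting, the JDK's jps command lists the running Java processes, which makes it easy to see whether the expected daemons came up. The sketch below checks a jps-style listing for a set of daemon names and prints the ones that are missing. missing_daemons is a hypothetical helper, and the sample listing with its PIDs is invented so the example runs without a cluster.

```shell
# Sketch: report which expected daemons are absent from a jps listing.
# missing_daemons is a hypothetical helper; jps prints "<pid> <name>" lines.
missing_daemons() {
  local jps_output="$1"; shift
  local daemon
  for daemon in "$@"; do
    # Match on the second column; awk exits non-zero when nothing matched
    printf '%s\n' "$jps_output" \
      | awk -v d="$daemon" '$2 == d {found = 1} END {exit !found}' \
      || echo "$daemon"
  done
}

# Invented sample in the shape jps prints (PIDs are placeholders)
sample='12001 NameNode
12345 ResourceManager
12400 Jps'

missing_daemons "$sample" NameNode ResourceManager SecondaryNameNode
```

On a real node the first argument would be "$(jps)". With the configuration in this post I would expect NameNode and ResourceManager on namenode01, SecondaryNameNode on namenode02, and DataNode plus NodeManager on each datanode.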
