Hadoop + HBase + Phoenix + Spark Environment Setup and Installation

I. Overview

1. Versions

2. Node Plan

| Node     | ZK | NameNode | DataNode | JournalNode | ResourceManager | NodeManager | HMaster | HRegionServer | QueryServer |
|----------|----|----------|----------|-------------|-----------------|-------------|---------|---------------|-------------|
| hadoop01 | Y  | Y        | Y        | Y           | N               | Y           | Y       | Y             | Y           |
| hadoop02 | Y  | Y        | Y        | Y           | N               | Y           | N       | Y             | Y           |
| hadoop03 | Y  | N        | Y        | Y           | N               | Y           | N       | Y             | N           |
| spark01  | N  | N        | Y        | N           | Y               | Y           | N       | N             | N           |
| spark02  | N  | N        | Y        | N           | Y               | Y           | N       | N             | N           |
| spark03  | N  | N        | Y        | N           | N               | Y           | N       | N             | N           |

3. Environment Preparation

  • Install the required tools

    yum -y install wget
    yum -y install vim
    yum -y install rsync
    
  • Create a hadoop user on all six servers (omitted)

  • Edit /etc/hosts on all six servers (map each hostname to its internal IP address)

  • Configure passwordless SSH between all six servers (run the following as the hadoop user on each of them); a quick verification sketch follows at the end of this section

    [hadoop@hadoop01 bin]$ cd /home/hadoop/.ssh/
    #press Enter three times to accept the defaults
    [hadoop@hadoop02 .ssh]$ ssh-keygen -t rsa
    [hadoop@hadoop02 .ssh]$ ssh-copy-id hadoop01
    [hadoop@hadoop02 .ssh]$ ssh-copy-id hadoop02
    [hadoop@hadoop02 .ssh]$ ssh-copy-id hadoop03
    [hadoop@hadoop02 .ssh]$ ssh-copy-id spark01
    [hadoop@hadoop02 .ssh]$ ssh-copy-id spark02
    [hadoop@hadoop02 .ssh]$ ssh-copy-id spark03
    
  • Helper scripts (optional)

    Sync script (xsync)

    [hadoop@hadoop01 bin]$ mkdir /home/hadoop/bin

    [hadoop@hadoop01 bin]$ cd /home/hadoop/bin

    [hadoop@hadoop01 bin]$ vim xsync

    Write the following (note the hostnames: the hadoop nodes and the spark nodes are synced separately)

    #!/bin/bash
    
    #1. Check the number of arguments
    if [ $# -lt 1 ]
    then
        echo Not Enough Arguments!
        exit;
    fi
    
    #2. Iterate over every machine in the cluster
    # for host in spark01 spark02 spark03
    for host in hadoop01 hadoop02 hadoop03
    do
        echo ====================  $host  ====================
        #3. Iterate over all files and send them one by one
    
        for file in $@
        do
            #4. Check whether the file exists
            if [ -e $file ]
                then
                    #5. Get the parent directory
                    pdir=$(cd -P $(dirname $file); pwd)
    
                    #6. Get the file name
                    fname=$(basename $file)
                    ssh $host "mkdir -p $pdir"
                    rsync -av $pdir/$fname $host:$pdir
                else
                    echo $file does not exist!
            fi
        done
    done
    

    [hadoop@hadoop01 bin]$ chmod +x xsync

    Sync script for all nodes (xsyncall)

    [hadoop@hadoop01 bin]$ vim xsyncall

    Write the following

    #!/bin/bash
    
    #1. Check the number of arguments
    if [ $# -lt 1 ]
    then
        echo Not Enough Arguments!
        exit;
    fi
    
    #2. Iterate over every machine in the cluster
    for host in hadoop01 hadoop02 hadoop03 spark01 spark02 spark03
    do
        echo ====================  $host  ====================
        #3. Iterate over all files and send them one by one
    
        for file in $@
        do
            #4. Check whether the file exists
            if [ -e $file ]
                then
                    #5. Get the parent directory
                    pdir=$(cd -P $(dirname $file); pwd)
    
                    #6. Get the file name
                    fname=$(basename $file)
                    ssh $host "mkdir -p $pdir"
                    rsync -av $pdir/$fname $host:$pdir
                else
                    echo $file does not exist!
            fi
        done
    done
    

    jps process-listing script (jpsall)

    [hadoop@hadoop02 bin]$ vim jpsall

    Write the following

    #!/bin/bash
    
    for host in hadoop01 hadoop02 hadoop03 spark01 spark02 spark03
    do
            echo =============== $host ===============
            ssh $host $JAVA_HOME/bin/jps
    done
    

    [hadoop@hadoop01 bin]$ chmod +x xsyncall jpsall
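
As a quick sanity check of the passwordless SSH configured above, each host should print its hostname without asking for a password. A minimal sketch, run from hadoop01 as the hadoop user:

    # BatchMode=yes makes ssh fail instead of prompting for a password
    for host in hadoop01 hadoop02 hadoop03 spark01 spark02 spark03
    do
        ssh -o BatchMode=yes $host hostname || echo "passwordless SSH to $host failed"
    done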


II. Installation

1. Hadoop Installation

1.1 Download and Environment Variables

  • Download and extract

    [hadoop@hadoop01 Jack]$ cd /home/hadoop
    [hadoop@hadoop01 ~]$ wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz
    [hadoop@hadoop01 ~]$ tar -zxvf hadoop-3.3.0.tar.gz
    
  • Add the Hadoop binaries to PATH by editing /home/hadoop/.profile

    [hadoop@hadoop01 ~]$ vim /home/hadoop/.profile
    

    Add the following:

    PATH=/home/hadoop/hadoop-3.3.0/bin:/home/hadoop/hadoop-3.3.0/sbin:$PATH
    
  • Add Hadoop to PATH for the shell

    [hadoop@hadoop01 ~]$ vim /home/hadoop/.bashrc
    

    Add the following:

    export HADOOP_HOME=/home/hadoop/hadoop-3.3.0
    export PATH=${PATH}:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin
    
  • Run source to apply the changes (a quick check follows below)

    [hadoop@hadoop01 ~]$ source /home/hadoop/.profile
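
    A quick check that the PATH changes took effect (this should print the Hadoop 3.3.0 version banner):

    [hadoop@hadoop01 ~]$ hadoop version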
    

1.2 Hadoop Configuration

  1. Configure environment variables by editing ${HADOOP_HOME}/etc/hadoop/hadoop-env.sh

    [hadoop@hadoop01 ~]$ vim /home/hadoop/hadoop-3.3.0/etc/hadoop/hadoop-env.sh
    

    Append the following at the end

    export JAVA_HOME=/usr/lib/jvm/jdk8
    export HADOOP_SSH_OPTS="-o BatchMode=yes -o StrictHostKeyChecking=no -o ConnectTimeout=10s -p 3844" # only needed when the SSH port is not the default 22
    export HADOOP_PID_DIR=/home/hadoop/tmp
    export HDFS_NAMENODE_OPTS="-Dcom.sun.management.jmxremote=true -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.port=1026" # enable JMX monitoring for the NameNode
    export HDFS_DATANODE_OPTS="-Dcom.sun.management.jmxremote=true -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.port=1027" # enable JMX monitoring for the DataNode
    
  2. Create the required directories

    [hadoop@hadoop01 ~]$ mkdir -p /home/hadoop/data/tmp
    [hadoop@hadoop01 ~]$ mkdir -p /home/hadoop/tmp
    
  3. Edit ${HADOOP_HOME}/etc/hadoop/core-site.xml

    [hadoop@hadoop01 ~]$ vim /home/hadoop/hadoop-3.3.0/etc/hadoop/core-site.xml
    

    Write the following; the name after hdfs:// in fs.defaultFS is the cluster (nameservice) name

    <configuration>
        <property>
            <name>fs.defaultFS</name>
            <value>hdfs://macccluster</value>
            <description>
                The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.
            </description>
        </property>
        <property>
            <name>io.file.buffer.size</name>
            <value>131072</value>
            <description>Buffer size for sequence files. It should be a multiple of the hardware page size (4096 on Intel x86) and determines how much data is buffered during read and write operations. The same buffer is used by SequenceFile reads/writes and by map output, and a larger value reduces the number of I/O operations. 64KB to 128KB is recommended.
            </description>
        </property>
        <property>
            <name>hadoop.tmp.dir</name>
            <value>/home/hadoop/data/tmp</value>
            <description>If the server has multiple disks, configure a temporary directory on each disk so that MapReduce, HDFS, etc. can spread I/O across the disks for better efficiency
            </description>        
        </property>
    </configuration>
    
  4. Configure the HDFS directories and replication factor in ${HADOOP_HOME}/etc/hadoop/hdfs-site.xml

    [hadoop@hadoop01 ~]$ vim /home/hadoop/hadoop-3.3.0/etc/hadoop/hdfs-site.xml
    

    Write the following.

    Note: use your actual ZooKeeper addresses (ha.zookeeper.quorum).

    In practice, with a disk transfer rate of around 200MB/s the block size is typically set to 256MB; at around 400MB/s, 512MB (dfs.blocksize).

    Set the number of replicas according to your actual needs; the default is 3 (dfs.replication).

    The macccluster name in this configuration corresponds to the fs.defaultFS setting in core-site.xml.

    <configuration>
    <property>
                <!--Comma-separated list of the directory to store the name table. The table is replicated across the list for redundancy management -->
                <name>dfs.namenode.name.dir</name>
                <value>/home/hadoop/data/namenode</value>
                <description>Comma-separated list of the directory to store the name table. The table is replicated across the list for redundancy management
            </description>
        </property>
        <property>
            <name>dfs.blocksize</name>
            <value>268435456</value>
            <description>
                The default block size for new files, in bytes. You can use the following suffix (case insensitive): k(kilo), m(mega), g(giga), t(tera), p(peta), e(exa) to specify the size (such as 128k, 512m, 1g, etc.), Or provide complete size in bytes (such as 134217728 for 128 MB).
            </description>
        </property>
        <property>
                <name>dfs.datanode.data.dir</name>
                <value>/home/hadoop/data/datanode</value>
                    <description>
            </description>
        </property>
    
        <property>
                <name>dfs.replication</name>
                <value>2</value>
                <description>Default replication factor for each file block. default:3
            </description>
        </property>
        <property>
            <name>dfs.namenode.handler.count</name>
            <value>40</value>
            <description>
                    The number of NameNode server threads that handle RPCs from clients and cluster daemons. A common rule of thumb is 20 * log2(N), where N is the cluster size.
            </description>
        </property>
        <property>
            <name>dfs.datanode.handler.count</name>
            <value>30</value>
            <description>Number of DataNode service threads; choose it based on the CPU core count and testing, typically a few more than the number of cores
            </description>
        </property>
        <property>
            <name>dfs.blockreport.incremental.intervalMsec</name>
            <value>300</value>
            <description>
            If set to a positive integer, the value in ms to wait between sending incremental block reports from the Datanode to the Namenode. Delaying incremental block reports reduces the number of reports a DataNode sends after finishing a block, improving NameNode RPC response time and throughput
            </description>
        </property>
        <property>
            <name>dfs.nameservices</name>
            <value>macccluster</value>
            <description>
                the logical name for this new nameservice
                Choose a logical name for this nameservice, for example “macccluster”, and use this logical name for the value of this config option. The name you choose is arbitrary. It will be used both for configuration and as the authority component of absolute HDFS paths in the cluster.
            </description>
        </property>
        <property>
            <name>dfs.ha.namenodes.macccluster</name>
            <value>hadoop01,hadoop02</value>
            <description>
            Format: dfs.ha.namenodes.[nameservice ID]
                unique identifiers for each NameNode in the nameservice.
                Configure with a list of comma-separated NameNode IDs. This will be used by DataNodes to determine all the NameNodes in the cluster. For example, if you used “macccluster” as the nameservice ID previously, and you wanted to use “nn1”, “nn2” and “nn3” as the individual IDs of the NameNodes
        </description>
        </property>
        <property>
            <name>dfs.namenode.rpc-address.macccluster.hadoop01</name>
            <value>hadoop01:8020</value>
        </property>
        <property>
            <name>dfs.namenode.rpc-address.macccluster.hadoop02</name>
            <value>hadoop02:8020</value>
        </property>
        <property>
            <name>dfs.namenode.http-address.macccluster.hadoop01</name>
            <value>hadoop01:9870</value>
        </property>
        <property>
            <name>dfs.namenode.http-address.macccluster.hadoop02</name>
            <value>hadoop02:9870</value>
        </property>
        <property>
            <name>dfs.namenode.shared.edits.dir</name>
            <value>qjournal://hadoop01:8485;hadoop02:8485;hadoop03:8485/macccluster</value>
        </property>
        <property>
            <name>dfs.client.failover.proxy.provider.macccluster</name>
            <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
            <description>
                Configure the name of the Java class which will be used by the DFS Client to determine which NameNode is the current Active, and therefore which NameNode is currently serving client requests. The two implementations which currently ship with Hadoop are the ConfiguredFailoverProxyProvider and the RequestHedgingProxyProvider (which, for the first call, concurrently invokes all namenodes to determine the active one, and on subsequent requests, invokes the active namenode until a fail-over happens), so use one of these unless you are using a custom proxy provider
            </description>
        </property>
        <property>
            <name>dfs.journalnode.edits.dir</name>
            <value>/home/hadoop/data/journalnode</value>
            <description>
                This is the absolute path on the JournalNode machines where the edits and other local state used by the JNs will be stored. You may only use a single path for this configuration. Redundancy for this data is provided by running multiple separate JournalNodes, or by configuring this directory on a locally-attached RAID array.
            </description>
        </property>
        <property>
            <name>dfs.ha.automatic-failover.enabled</name>
            <value>true</value>
        </property>
         <property>
            <name>ha.zookeeper.quorum</name>
            <value>hadoop01:2181,hadoop02:2181,hadoop03:2181</value>
        </property>
        <property>      
            <name>dfs.ha.fencing.methods</name>
            <value>shell(/bin/true)</value>
            <description>
            </description>
        </property>
    </configuration>
    
  5. Configure YARN as the MapReduce framework by editing ${HADOOP_HOME}/etc/hadoop/mapred-site.xml

    [hadoop@hadoop01 ~]$ vim /home/hadoop/hadoop-3.3.0/etc/hadoop/mapred-site.xml
    

    Write the following

    <configuration>
    <property>
                <name>mapreduce.framework.name</name>
                <value>yarn</value>
        </property>
        <property>
                <name>yarn.app.mapreduce.am.env</name>
                <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
        </property>
        <property>
                <name>mapreduce.map.env</name>
                <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
        </property>
        <property>
                <name>mapreduce.reduce.env</name>
                <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
        </property>
    </configuration>
    
  6. Configure YARN by editing ${HADOOP_HOME}/etc/hadoop/yarn-site.xml

    [hadoop@hadoop01 ~]$ vim /home/hadoop/hadoop-3.3.0/etc/hadoop/yarn-site.xml
    

    Write the following

    <configuration>
        <property>
                <name>yarn.acl.enable</name>
                <value>0</value>
        </property>
    
        <property>
                <name>yarn.resourcemanager.hostname</name>
                <value>spark01</value>
        </property>
        <!-- tell the node manager that a MapReduce container will have to shuffle the map tasks to the reduce tasks -->
        <property>
                <name>yarn.nodemanager.aux-services</name>
                <value>mapreduce_shuffle</value>
        </property>
    </configuration>
    
  7. Configure the DataNode list by editing ${HADOOP_HOME}/etc/hadoop/workers

    hadoop01
    hadoop02
    hadoop03
    spark01
    spark02
    spark03
    
  8. Sync Hadoop to the other nodes

    #Prerequisite: the hosts listed in xsyncall must be all nodes of the cluster; adjust to your actual setup
    [hadoop@hadoop01 ~]$ ~/bin/xsyncall /home/hadoop/hadoop-3.3.0/
    [hadoop@hadoop01 ~]$ ~/bin/xsyncall /home/hadoop/.profile
    [hadoop@hadoop01 ~]$ ~/bin/xsyncall /home/hadoop/.bashrc
    #Apply the environment variables on the other nodes
    [hadoop@hadoop02 ~]$ source /home/hadoop/.profile
    [hadoop@hadoop03 ~]$ source /home/hadoop/.profile
    [hadoop@spark01 ~]$ source /home/hadoop/.profile
    [hadoop@spark02 ~]$ source /home/hadoop/.profile
    [hadoop@spark03 ~]$ source /home/hadoop/.profile
    
  9. Start the cluster

    YARN is not started at this point; it will be started later, after Spark has been integrated. (An HA status check sketch follows after this step.)

    #Start the JournalNode on hadoop01, hadoop02 and hadoop03
    [hadoop@hadoop01 ~]$ /home/hadoop/hadoop-3.3.0/bin/hdfs --daemon start journalnode
    [hadoop@hadoop02 ~]$ /home/hadoop/hadoop-3.3.0/bin/hdfs --daemon start journalnode
    [hadoop@hadoop03 ~]$ /home/hadoop/hadoop-3.3.0/bin/hdfs --daemon start journalnode
    #Initialize the HA state in ZooKeeper (run on one node only)
    [hadoop@hadoop01 ~]$ /home/hadoop/hadoop-3.3.0/bin/hdfs zkfc -formatZK
    #Format the HDFS file system
    [hadoop@hadoop01 ~]$ /home/hadoop/hadoop-3.3.0/bin/hdfs namenode -format
    #Start the NameNode on the active NameNode host, hadoop01
    [hadoop@hadoop01 ~]$ /home/hadoop/hadoop-3.3.0/bin/hdfs --daemon start namenode
    #Copy the metadata to every standby NameNode; run this on each standby NameNode. Per the plan above, only hadoop02 needs to run it:
    [hadoop@hadoop02 ~]$ /home/hadoop/hadoop-3.3.0/bin/hdfs namenode -bootstrapStandby
    #Stop the active NameNode
    [hadoop@hadoop01 ~]$ /home/hadoop/hadoop-3.3.0/bin/hdfs --daemon stop namenode
    #Stop the JournalNode on hadoop01, hadoop02 and hadoop03 (repeat on each of them)
    [hadoop@hadoop01 ~]$ /home/hadoop/hadoop-3.3.0/bin/hdfs --daemon stop journalnode
    #Start HDFS
    [hadoop@hadoop01 ~]$ /home/hadoop/hadoop-3.3.0/sbin/start-dfs.sh
    #Verify the cluster state (use ~/bin/jpsall, or run jps on each node)
    [hadoop@hadoop01 ~]$ ~/bin/jpsall
    =============== hadoop01 ===============
    1393 JournalNode
    1139 DataNode
    1004 NameNode
    15373 Jps
    1646 DFSZKFailoverController
    =============== hadoop02 ===============
    10304 Jps
    6178 NameNode
    6568 DFSZKFailoverController
    6393 JournalNode
    6267 DataNode
    =============== hadoop03 ===============
    32439 Jps
    21818 DataNode
    21947 JournalNode
    =============== spark01 ===============
    25798 DataNode
    1659 Jps
    =============== spark02 ===============
    18348 DataNode
    25711 Jps
    =============== spark03 ===============
    13122 DataNode
    7449 Worker
    1582 Jps
    #Start YARN
    [hadoop@hadoop01 ~]$ /home/hadoop/hadoop-3.3.0/sbin/start-yarn.sh
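
    Once the cluster is up, the HA state can be checked as sketched below. The arguments are the NameNode IDs from dfs.ha.namenodes.macccluster; one of the two should report active and the other standby:

    [hadoop@hadoop01 ~]$ hdfs haadmin -getServiceState hadoop01   # prints active or standby
    [hadoop@hadoop01 ~]$ hdfs haadmin -getServiceState hadoop02
    #basic read/write smoke test
    [hadoop@hadoop01 ~]$ hdfs dfs -mkdir -p /tmp/smoke
    [hadoop@hadoop01 ~]$ hdfs dfs -ls /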
    
    

1.3 Common Commands and Common Errors

  • Common commands
    start-all.sh  # start the daemons (HDFS, YARN, JournalNode)
    hdfs dfs -mkdir /test  # create a directory
    hdfs dfs -ls /        # list a directory
    
  • Common errors
    1. HA is not enabled for this namenode

      2020-08-13 14:11:38,789 ERROR tools.DFSZKFailoverController: DFSZKFailOverController exiting due to earlier exception org.apache.hadoop.HadoopIllegalArgumentException: HA is not enabled for this namenode.
      

      Possible cause: the values listed in dfs.ha.namenodes.macccluster must match the node1 part of dfs.namenode.rpc-address.macccluster.node1

    2. Does not contain a valid host:port authority

       2020-08-13 14:33:54,129 ERROR tools.DFSZKFailoverController: DFSZKFailOverController exiting due to earlier exception java.lang.IllegalArgumentException: Does not contain a valid host:port authority: MACC_Local_HBASE_A:8020
      

      Possible cause: hostnames in /etc/hosts must not contain characters such as '.', '/' or '_'

    3. Permission-related errors at startup, for example

Starting namenodes on [node1 node2]
Starting datanodes
node3: ERROR: Cannot set priority of datanode process 5660
Starting journal nodes [node2 node3 node1]
node3: ERROR: Cannot set priority of journalnode process 5745
Fix: on both master and slave nodes, edit the four scripts start-dfs.sh, stop-dfs.sh, start-yarn.sh and stop-yarn.sh in the sbin directory of the Hadoop installation (an optional helper sketch follows below).

Add the following at the top of xxxx-dfs.sh:

HDFS_DATANODE_USER=hadoop
HDFS_DATANODE_SECURE_USER=hadoop
HDFS_NAMENODE_USER=hadoop
HDFS_SECONDARYNAMENODE_USER=hadoop
HDFS_JOURNALNODE_USER=hadoop
HDFS_ZKFC_USER=hadoop

Add the following at the top of xxxx-yarn.sh:

YARN_RESOURCEMANAGER_USER=hadoop
HDFS_DATANODE_SECURE_USER=hadoop
HADOOP_SECURE_DN_USER=hadoop
YARN_NODEMANAGER_USER=hadoop
HDFS_JOURNALNODE_USER=hadoop
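
If editing the four scripts by hand on every node is tedious, the exports can also be inserted right after the shebang line with a small helper. This is only a sketch: it assumes GNU sed (as shipped with CentOS) and the installation path used throughout this guide, and it has to be run (or the files synced) on every node:

    cd /home/hadoop/hadoop-3.3.0/sbin
    # insert the export lines as line 2, directly after the shebang
    for f in start-dfs.sh stop-dfs.sh; do
        sed -i '2i HDFS_DATANODE_USER=hadoop\nHDFS_DATANODE_SECURE_USER=hadoop\nHDFS_NAMENODE_USER=hadoop\nHDFS_SECONDARYNAMENODE_USER=hadoop\nHDFS_JOURNALNODE_USER=hadoop\nHDFS_ZKFC_USER=hadoop' $f
    done
    for f in start-yarn.sh stop-yarn.sh; do
        sed -i '2i YARN_RESOURCEMANAGER_USER=hadoop\nHDFS_DATANODE_SECURE_USER=hadoop\nHADOOP_SECURE_DN_USER=hadoop\nYARN_NODEMANAGER_USER=hadoop\nHDFS_JOURNALNODE_USER=hadoop' $f
    done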

2. HBase Integration

2.1 Download and Environment Variables

  1. Download and extract
    [hadoop@hadoop01 ~]$ cd /home/hadoop
    [hadoop@hadoop01 ~]$ wget https://archive.apache.org/dist/hbase/2.2.5/hbase-2.2.5-bin.tar.gz
    [hadoop@hadoop01 ~]$ tar -zxvf hbase-2.2.5-bin.tar.gz 
    
  2. Configure environment variables
    1. Add the HBase binaries to PATH

      [hadoop@hadoop01 ~]$ vim /home/hadoop/.profile
      

      Add the following:

      export HBASE_HOME=/home/hadoop/hbase-2.2.5 # use the actual path
      export PATH=${HBASE_HOME}/bin:${PATH} # put HBASE_HOME at the front of PATH
      
    2. Add HBase to PATH for the shell

      [hadoop@hadoop01 ~]$ vim /home/hadoop/.bashrc
      

      Add the following

      export HBASE_HOME=/home/hadoop/hbase-2.2.5
      export PATH=${PATH}:${HBASE_HOME}/bin
      
    3. Run source to apply the changes (a quick check follows after this list)

      #Sync to hadoop01, hadoop02 and hadoop03
      [hadoop@hadoop01 ~]$ ~/bin/xsync /home/hadoop/.bashrc /home/hadoop/.profile
      [hadoop@hadoop01 ~]$ source /home/hadoop/.profile
      [hadoop@hadoop02 ~]$ source /home/hadoop/.profile
      [hadoop@hadoop03 ~]$ source /home/hadoop/.profile
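
      A quick check that HBase is now on the PATH (this should print the HBase 2.2.5 version banner):

      [hadoop@hadoop01 ~]$ hbase version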
      

2.2 HBase Configuration

  1. Configure the Java environment by editing ${HBASE_HOME}/conf/hbase-env.sh (adjust the JVM settings below to your actual servers)

    [hadoop@hadoop01 ~]$ vim /home/hadoop/hbase-2.2.5/conf/hbase-env.sh
    [hadoop@hadoop01 ~]$ mkdir -p /home/hadoop/data/hbase/tmp
    
    export JAVA_HOME=/usr/lib/jvm/jdk8 # set to the actual JDK location
    export HBASE_JMX_BASE="-Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false" # JMX without authentication
    export HBASE_MASTER_OPTS="-Xmx8g -Xms8g -XX:+PrintGCDetails -XX:+PrintGCDateStamps" # JVM options for the HMaster
    export HBASE_MASTER_OPTS="$HBASE_MASTER_OPTS $HBASE_JMX_BASE -Dcom.sun.management.jmxremote.port=10101" # open the JMX port for the HMaster
    export HBASE_REGIONSERVER_OPTS="-Xmx30g -Xms30g -XX:MaxDirectMemorySize=8g -XX:+UseG1GC -XX:MaxGCPauseMillis=90 -XX:+UnlockExperimentalVMOptions -XX:G1NewSizePercent=5 -XX:InitiatingHeapOccupancyPercent=50 -XX:+ParallelRefProcEnabled -XX:ConcGCThreads=4 -XX:ParallelGCThreads=16 -XX:G1HeapRegionSize=32m -XX:+PrintGCDetails -XX:+PrintGCDateStamps" # JVM options for the RegionServer
    export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS $HBASE_JMX_BASE -Dcom.sun.management.jmxremote.port=10102" # open the JMX port for the RegionServer
    export HBASE_SSH_OPTS="-o ConnectTimeout=1 -o SendEnv=HBASE_CONF_DIR -p 2222" # only needed when the SSH port is not the default 22
    export HBASE_MANAGES_ZK=false # use the separately deployed ZooKeeper
    # The directory where pid files are stored. /tmp by default.
    export HBASE_PID_DIR=/home/hadoop/data/hbase/tmp
    
  2. Configure the RegionServer list

    [hadoop@hadoop01 ~]$ vim /home/hadoop/hbase-2.2.5/conf/regionservers
    

    Write the following

    hadoop01
    hadoop02
    hadoop03
    
  3. Configure backup-masters

    [hadoop@hadoop01 ~]$ vim /home/hadoop/hbase-2.2.5/conf/backup-masters
    

    Write the following

    hadoop02
    
  4. Symlink the Hadoop configuration files

    [hadoop@hadoop01 ~]$ ln -s /home/hadoop/hadoop-3.3.0/etc/hadoop/hdfs-site.xml /home/hadoop/hbase-2.2.5/conf/hdfs-site.xml
    [hadoop@hadoop01 ~]$ ln -s /home/hadoop/hadoop-3.3.0/etc/hadoop/core-site.xml /home/hadoop/hbase-2.2.5/conf/core-site.xml
    
  5. Edit hbase-site.xml

    [hadoop@hadoop01 ~]$ vim /home/hadoop/hbase-2.2.5/conf/hbase-site.xml
    

    Configure as follows

    <configuration>
      <!--
               The following properties are set for running HBase as a single process on a
        developer workstation. With this configuration, HBase is running in
        "stand-alone" mode and without a distributed file system. In this mode, and
        without further configuration, HBase and ZooKeeper data are stored on the
        local filesystem, in a path under the value configured for `hbase.tmp.dir`.
        This value is overridden from its default value of `/tmp` because many
        systems clean `/tmp` on a regular basis. Instead, it points to a path within
        this HBase installation directory.
    
        Running against the `LocalFileSystem`, as opposed to a distributed
        filesystem, runs the risk of data integrity issues and data loss. Normally
        HBase will refuse to run in such an environment. Setting
        `hbase.unsafe.stream.capability.enforce` to `false` overrides this behavior,
        permitting operation. This configuration is for the developer workstation
        only and __should not be used in production!__
    
        See also https://hbase.apache.org/book.html#standalone_dist
      -->
    <!-- The HDFS directory shared by the HRegionServers. It must use the dfs.nameservices value from hdfs-site.xml (macccluster) as the authority, with no port number; this property makes the HMaster create a /hbase directory on the HDFS cluster -->
           <property>
               <name>hbase.rootdir</name>
               <value>hdfs://macccluster/hbase</value>
           </property>
           <!-- Enable distributed mode -->
           <property>
               <name>hbase.cluster.distributed</name>
               <value>true</value>
           </property>
           <!-- When distributed mode is enabled, the following stream-capability enforcement must be set to false -->
           <property>
               <name>hbase.unsafe.stream.capability.enforce</name>
               <value>false</value>
           </property>
           <!-- ZooKeeper quorum; each value can be a hostname or hostname:port -->
           <property>
               <name>hbase.zookeeper.quorum</name>
               <value>hadoop01,hadoop02,hadoop03</value>
           </property>
           <!-- ZooKeeper client port -->
           <property>
               <name>hbase.zookeeper.property.clientPort</name>
               <value>2181</value>
           </property>
           <property>
               <name>dfs.client.failover.proxy.provider.macccluster</name>
               <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
           </property>
    
           <!-- Short-circuit reads -->
           <property>
               <name>dfs.client.read.shortcircuit</name>
               <value>true</value>
           </property>
               <property>
               <name>dfs.domain.socket.path</name>
               <value>/home/hadoop/dn_socket</value>
           </property>
               <property>
               <name>dfs.client.read.shortcircuit.buffer.size</name>
               <value>131072</value>
           </property>
           <!-- Write the WAL to SSD; adjust to your environment -->
           <property>
               <name>hbase.wal.storage.policy</name>
               <value>ONE_SSD</value>
           </property>
           
           <!-- Default 128M (134217728): a memstore larger than this threshold triggers a flush. If flushes are currently frequent and memory is plentiful, this can reasonably be raised to 256M. -->
           <property>
               <name>hbase.hregion.memstore.flush.size</name>
               <value>268435456</value>
           </property>
           <!-- Default 4: once the total data written to the memstores of a region reaches hbase.hregion.memstore.block.multiplier * hbase.hregion.memstore.flush.size, a flush is executed and RegionTooBusyException is thrown. If the logs contain messages like "Above memstore limit, regionName = ***, server=***, memstoreSize=***, blockingMemstoreSize=***", consider tuning this parameter. -->
           <property>
               <name>hbase.hregion.memstore.block.multiplier</name>
               <value>4</value>
           </property>
           <!-- The total size of data written to memstores across the whole RegionServer must not exceed this fraction, otherwise all writes are blocked and flushes are forced until the total drops below hbase.regionserver.global.memstore.lowerLimit. In off-heap mode this should be 0.6~0.65. If writes become blocked, check the logs for "Blocking update on ***: the global memstore size *** is >= than blocking *** size"; this normally indicates too many regions or too many column families per table. -->
           <property>
               <name>hbase.regionserver.global.memstore.size</name>
               <value>0.4</value>
           </property>
           <!-- Default 1h (3600000): HBase runs a thread that periodically flushes all memstores at this interval. In production a larger value such as 6h is recommended: flushing every hour produces many small files, which means both frequent flushes and degraded random-read performance. -->
           <property>
               <name>hbase.regionserver.optionalcacheflushinterval</name>
               <value>21600000</value>
           </property>
           <!-- The compaction module merges small files and removes expired and deleted data. It involves many parameters and strongly affects read/write performance. -->
           <!-- Default 3: one of the compaction triggers; a compaction starts once the number of files in a store exceeds this threshold. On clusters with high write QPS a value between 5 and 10 is recommended -->
           <property>
               <name>hbase.hstore.compactionThreshold</name>
               <value>5</value>
           </property>    
           <!-- Default 10: the maximum number of files that can participate in a minor compaction; usually set to roughly 2~3 times hbase.hstore.compactionThreshold -->
           <property>
               <name>hbase.hstore.compaction.max</name>
               <value>15</value>
           </property>    
           <!-- Default 16: once the number of files in a store exceeds this threshold, all updates are blocked. In production a value around 100 is recommended to avoid blocked updates; if the logs contain '*** too many store files***', check this setting -->
           <property>
               <name>hbase.hstore.blockingStoreFiles</name>
               <value>64</value>
           </property>    
           <!-- Default 1 week (1000*60*60*24*7): the major compaction period. In production it is recommended to run major compactions of large tables manually and set this to 0 to disable the automatic trigger -->
           <property>
               <name>hbase.hregion.majorcompaction</name>
               <value>0</value>
           </property>
           <!-- hbase.regionserver.maxlogs: one of the region flush triggers; a flush is forced once the number of WAL files exceeds this threshold. The old default is too small for many clusters; current versions compute it dynamically, see https://issues.apache.org/jira/browse/HBASE-14951 -->
    
           <!-- Default false: whether to enable the quota feature, which rate-limits per-user/per-table QPS. Recommended to set to true in production -->
           <property>
               <name>hbase.quota.enabled</name>
               <value>true</value>
           </property>
           <property>
                 <name>hbase.wal.provider</name>
                 <value>filesystem</value>
           </property>
    </configuration>
    
  6. Sync HBase to hadoop02 and hadoop03

    [hadoop@hadoop01 ~]$ ~/bin/xsync /home/hadoop/hbase-2.2.5
    
  7. Start HBase (a shell smoke test follows after this list)

    [hadoop@hadoop01 ~]$ start-hbase.sh
    #Check that startup succeeded
    [hadoop@hadoop01 ~]$ ~/bin/jpsall 
    =============== hadoop01 ===============
    31313 HMaster
    19963 Jps
    31486 HRegionServer
    =============== hadoop02 ===============
    6178 HMaster
    31790 HRegionServer
    13598 Jps
    =============== hadoop03 ===============
    2076 HRegionServer
    764 Jps
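
    As a basic smoke test, a table can be created, written to and read back from the HBase shell. The table name smoke_test below is only an example:

    [hadoop@hadoop01 ~]$ hbase shell
    hbase(main):001:0> status
    hbase(main):002:0> create 'smoke_test', 'cf'
    hbase(main):003:0> put 'smoke_test', 'row1', 'cf:c1', 'value1'
    hbase(main):004:0> scan 'smoke_test'
    hbase(main):005:0> disable 'smoke_test'
    hbase(main):006:0> drop 'smoke_test'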
    

2.3 Common Issues and Notes

  1. Notes
    The cluster name in hbase.rootdir must match the Hadoop nameservice.
    hbase.wal.provider must specify the WAL write mode, otherwise the RegionServers cannot be connected to.
    Disable the ZooKeeper bundled with HBase; if it is left on, the id of the separately deployed ZK will be overwritten by HBase's own ZK.
    The ZooKeeper quorum does not need port numbers; 2181 is assumed by default, and an error is raised if the ZK port is not 2181.
    
  2. Common problems
    1. stop-hbase.sh hangs and waits forever

      First run hbase-daemon.sh stop master,

      then run stop-hbase.sh.

    2. HBase shell commands fail with "Server is not running yet"

      #Check whether HDFS is in safe mode (see the check below); if it is, run
      hdfs dfsadmin -safemode leave
      #then restart HBase
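
      To see the current state before forcing safe mode off:

      hdfs dfsadmin -safemode get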
      

3. Phoenix Integration

3.1 Download and Configuration

  1. Download and extract

    [hadoop@hadoop01 ~]$ wget http://archive.apache.org/dist/phoenix/phoenix-5.1.2/phoenix-hbase-2.2-5.1.2-bin.tar.gz
    [hadoop@hadoop01 ~]$ wget http://archive.apache.org/dist/phoenix/phoenix-queryserver-6.0.0/phoenix-queryserver-6.0.0-bin.tar.gz
    [hadoop@hadoop01 ~]$ tar -zxvf phoenix-hbase-2.2-5.1.2-bin.tar.gz
    [hadoop@hadoop01 ~]$ tar -zxvf phoenix-queryserver-6.0.0-bin.tar.gz
    
  2. Copy the Phoenix jars into HBase

    [hadoop@hadoop01 ~]$ cp /home/hadoop/phoenix-hbase-2.2-5.1.2-bin/phoenix-pherf-5.1.2.jar /home/hadoop/hbase-2.2.5/lib/
    [hadoop@hadoop01 ~]$ cp /home/hadoop/phoenix-hbase-2.2-5.1.2-bin/phoenix-server-hbase-2.2-5.1.2.jar /home/hadoop/hbase-2.2.5/lib/
    
  3. Update the hbase-site.xml configuration

    [hadoop@hadoop01 ~]$ vim /home/hadoop/hbase-2.2.5/conf/hbase-site.xml 
    

    Add the following configuration

    <!-- Enable HBase namespace mapping for Phoenix -->
      <property>
        <name>phoenix.schema.isNamespaceMappingEnabled</name>
        <value>true</value>
      </property>
    
      <property>
        <name>phoenix.schema.mapSystemTablesToNamespace</name>
        <value>true</value>
      </property>
    
  4. Symlink the hbase-site.xml configuration

    ln -sf /home/hadoop/hbase-2.2.5/conf/hbase-site.xml /home/hadoop/phoenix-hbase-2.2-5.1.2-bin/bin/hbase-site.xml
    
  5. Distribute to hadoop02 and hadoop03

    [hadoop@hadoop01 ~]$ ~/bin/xsync /home/hadoop/hbase-2.2.5
    [hadoop@hadoop01 ~]$ ~/bin/xsync /home/hadoop/phoenix-hbase-2.2-5.1.2-bin
    [hadoop@hadoop01 ~]$ ~/bin/xsync /home/hadoop/phoenix-queryserver-6.0.0
    
  6. Restart HBase

    [hadoop@hadoop01 ~]$ stop-hbase.sh
    [hadoop@hadoop01 ~]$ start-hbase.sh
    
  7. Verify Phoenix (the first start has to initialize the system tables, so being slow is normal); a basic SQL smoke test follows after this list

    [hadoop@hadoop02 ~]$ /home/hadoop/phoenix-hbase-2.2-5.1.2-bin/bin/sqlline.py 
    Setting property: [incremental, false]
    Setting property: [isolation, TRANSACTION_READ_COMMITTED]
    issuing: !connect -p driver org.apache.phoenix.jdbc.PhoenixDriver -p user "none" -p password "none" "jdbc:phoenix:"
    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/home/hadoop/phoenix-hbase-2.2-5.1.2-bin/phoenix-client-hbase-2.2-5.1.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/macc/install/hadoop-3.3.0/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
    Connecting to jdbc:phoenix:
    22/07/04 07:48:48 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Connected to: Phoenix (version 5.1)
    Driver: PhoenixEmbeddedDriver (version 5.1)
    Autocommit status: true
    Transaction isolation: TRANSACTION_READ_COMMITTED
    sqlline version 1.9.0
    0: jdbc:phoenix:> !tables 
    
    
  8. Start the QueryServer service

    [hadoop@hadoop01 ~]$ /home/hadoop/phoenix-queryserver-6.0.0/bin/queryserver.py
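
    A basic SQL smoke test that can be run at the sqlline prompt opened in step 7 (the table name and columns are only examples; this is standard Phoenix DDL/DML):

    0: jdbc:phoenix:> CREATE TABLE IF NOT EXISTS smoke_test (id BIGINT NOT NULL PRIMARY KEY, name VARCHAR);
    0: jdbc:phoenix:> UPSERT INTO smoke_test VALUES (1, 'hello');
    0: jdbc:phoenix:> SELECT * FROM smoke_test;
    0: jdbc:phoenix:> DROP TABLE smoke_test;

    The queryserver package also ships a thin JDBC client (sqlline-thin.py) that talks to the QueryServer over HTTP, by default on port 8765; it can be used to verify the QueryServer started in step 8.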
    

4. Spark on YARN

4.1 Scala Installation

  1. Download and configure
    [hadoop@spark01 ~]$ wget https://downloads.lightbend.com/scala/2.12.12/scala-2.12.12.tgz
    [hadoop@spark01 ~]$ tar -zxvf scala-2.12.12.tgz -C /home/hadoop/
    [hadoop@spark01 ~]$ mv /home/hadoop/scala-2.12.12 /home/hadoop/scala
    [hadoop@spark01 ~]$ sudo -i
    [root@spark01 ~]$ vim /etc/profile
    #Add environment variables
    #SCALA
    export SCALA_HOME=/home/hadoop/scala
    export PATH=$PATH:$SCALA_HOME/bin
    [root@spark01 ~]$ source /etc/profile
    
  2. Check that it works
    [root@macc-hk-spark-1 ~]# scala
    Welcome to Scala 2.12.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_192).
    Type in expressions for evaluation. Or try :help.
    
    scala> 
    

4.2 Spark Download and Configuration

  1. Download and environment variables

    [root@spark01 ~]$ su hadoop
    [hadoop@spark01 ~]$ wget https://archive.apache.org/dist/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz
    [hadoop@spark01 ~]$ tar -zxvf spark-3.1.2-bin-hadoop3.2.tgz -C /home/hadoop/
    [hadoop@spark01 ~]$ cd /home/hadoop
    [hadoop@spark01 ~]$ mv /home/hadoop/spark-3.1.2-bin-hadoop3.2 /home/hadoop/spark
    
    [root@spark01 ~]$ vim /etc/profile
    #Add environment variables
    #Spark
    export SPARK_HOME=/home/hadoop/spark
    export PATH=$PATH:$SPARK_HOME/bin
    
    [root@spark01 ~]$ source /etc/profile
    
  2. Check that the installation succeeded

    [hadoop@spark01 ~]$ cd /home/hadoop/spark/bin/
    [hadoop@spark01 bin]$ run-example SparkPi 10
    If the output contains
    Pi is roughly 3.1406511406511406
    the run succeeded
    

4.3 Spark on YARN

  1. Edit spark-env.sh

    [hadoop@spark01 conf]$ cd /home/hadoop/spark/conf/
    [hadoop@spark01 conf]$ cp spark-env.sh.template spark-env.sh
    [hadoop@spark01 conf]$ vim spark-env.sh
    

    Modify the following settings

    # master node hostname
    export SPARK_MASTER_HOST=spark01
    # default port is 7077
    export SPARK_MASTER_PORT=7077
    #Spark on YARN mode
    export YARN_CONF_DIR=/home/hadoop/hadoop-3.3.0/etc/hadoop
    
  2. Edit workers

    [hadoop@spark01 conf]$ cd /home/hadoop/spark/conf/
    [hadoop@spark01 conf]$ cp workers.template workers
    [hadoop@spark01 conf]$ vim workers
    

    Change it to the following

    #localhost
    spark01
    spark02
    spark03
    
  3. Sync to spark02 and spark03

    #After syncing, configure the environment variables on spark02 and spark03
    [hadoop@spark01 ~]$ ~/bin/xsync spark
    [hadoop@spark01 ~]$ ~/bin/xsync scala
    
  4. Cluster test (a small job to run in the shell follows after this list)

    #Start the cluster
    [hadoop@spark01 ~]$ /home/hadoop/spark/sbin/start-all.sh
    #Test the cluster
    [hadoop@spark01 ~]$ spark-shell --master yarn
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /___/ .__/\_,_/_/ /_/\_\   version 3.1.2
          /_/
             
    Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_192)
    Type in expressions to have them evaluated.
    Type :help for more information.
    
    scala>
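
    Inside this shell, a trivial job confirms that executors are actually scheduled through YARN (the application also shows up in the ResourceManager web UI while it runs):

    scala> spark.range(0, 1000).count()
    res0: Long = 1000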
    

4.4 YARN Label-Based Scheduling and YARN HA

  1. Copy the Spark shuffle jar into the YARN lib directory

    [hadoop@spark01 ~]$ cp /home/hadoop/spark/yarn/spark-3.1.2-yarn-shuffle.jar /home/hadoop/hadoop-3.3.0/share/hadoop/yarn/lib/
    #Sync to all nodes
    [hadoop@spark01 ~]$ ~/bin/xsyncall /home/hadoop/hadoop-3.3.0/share/hadoop/yarn/lib/
    
  2. Edit yarn-site.xml and capacity-scheduler.xml on all nodes

    [hadoop@hadoop01 ~]$ vim /home/hadoop/hadoop-3.3.0/etc/hadoop/yarn-site.xml
    

    Change the configuration to the following (note the ZK addresses); tune the settings under the resource-configuration comment to your actual hardware:

    <configuration>
        <property>
            <name>yarn.acl.enable</name>
            <value>0</value>
        </property>
    
        <property>
            <name>yarn.nodemanager.aux-services</name>
            <value>mapreduce_shuffle,spark_shuffle</value>
        </property>
        <!-- tell the node manager that a MapReduce container will have to shuffle the map tasks to the reduce tasks;
            the external shuffle service offloads this work from the Spark executors -->
        <property>
            <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
            <value>org.apache.spark.network.yarn.YarnShuffleService</value>
        </property>
        <!-- YARN Node Labels Config -->
        <property>
            <name>yarn.node-labels.fs-store.root-dir</name>
            <value>file:///home/hadoop/tmp/node-labels</value>
        </property>
        <property>
            <name>yarn.node-labels.enabled</name>
            <value>true</value>
        </property>
        <property>
            <name>yarn.node-labels.manager-class</name>
            <value>org.apache.hadoop.yarn.server.resourcemanager.nodelabels.RMNodeLabelsManager</value>
        </property>
    
    
        <!-- ================ Resource configuration ==================== -->
        <!-- Scheduler selection; the capacity scheduler is the default -->
        <property>
            <description>The class to use as the resource scheduler.</description>
            <name>yarn.resourcemanager.scheduler.class</name>
            <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
        </property>
        <!-- Whether YARN should auto-detect the hardware for its configuration; default false.
            If the node runs many other applications, manual configuration is recommended; if not, auto-detection can be used -->
        <property>
            <description>Enable auto-detection of node capabilities such as
                memory and CPU.
            </description>
            <name>yarn.nodemanager.resource.detect-hardware-capabilities</name>
            <value>false</value>
        </property>
        <!-- Number of ResourceManager threads that handle scheduler requests; default 50. If more than 50 jobs are submitted concurrently this can be increased,
        but it should not exceed the servers' total thread count minus the number of servers,
        e.g. 3 servers * 4 threads = 12 threads (in practice no more than 8 once other applications are excluded) -->
        <property>
            <description>Number of threads to handle scheduler interface.</description>
            <name>yarn.resourcemanager.scheduler.client.thread-count</name>
            <value>6</value>
        </property>
        <!-- Whether to count logical processors (hyper-threads) as cores; default false, i.e. physical cores are used -->
        <property>
            <description>Flag to determine if logical processors(such as
                hyperthreads) should be counted as cores. Only applicable on Linux
                when yarn.nodemanager.resource.cpu-vcores is set to -1 and
                yarn.nodemanager.resource.detect-hardware-capabilities is true.
            </description>
            <name>yarn.nodemanager.resource.count-logical-processors-as-cores</name>
            <value>false</value>
        </property>
        <!-- Multiplier used to convert physical cores to vcores; default 1.0 -->
        <property>
            <description>Multiplier to determine how to convert phyiscal cores to
                vcores. This value is used if yarn.nodemanager.resource.cpu-vcores
                is set to -1(which implies auto-calculate vcores) and
                yarn.nodemanager.resource.detect-hardware-capabilities is set to true. The number of vcores will be
                calculated as number of CPUs * multiplier.
            </description>
            <name>yarn.nodemanager.resource.pcores-vcores-multiplier</name>
            <value>1.0</value>
        </property>
        <!-- Memory available to the NodeManager; default 8G.
            Keep it below the server's total memory so that other applications have room -->
        <property>
            <description>Amount of physical memory, in MB, that can be allocated
                for containers. If set to -1 and
                yarn.nodemanager.resource.detect-hardware-capabilities is true, it is
                automatically calculated(in case of Windows and Linux).
                In other cases, the default is 8192MB.
            </description>
            <name>yarn.nodemanager.resource.memory-mb</name>
            <value>4096</value>
        </property>
        <!-- Number of vcores for the NodeManager; defaults to 8 when not auto-detected from the hardware, set it according to your machines -->
        <property>
            <description>Number of vcores that can be allocated
                for containers. This is used by the RM scheduler when allocating
                resources for containers. This is not used to limit the number of
                CPUs used by YARN containers. If it is set to -1 and
                yarn.nodemanager.resource.detect-hardware-capabilities is true, it is
                automatically determined from the hardware in case of Windows and Linux.
                In other cases, number of vcores is 8 by default.
            </description>
            <name>yarn.nodemanager.resource.cpu-vcores</name>
            <value>1</value>
        </property>
        <!-- Minimum container memory; default 1G -->
        <property>
            <description>The minimum allocation for every container request at the RM in MBs. Memory requests lower than
                this will be set to the value of this property. Additionally, a node manager that is configured to have less
                memory than this value will be shut down by the resource manager.
            </description>
            <name>yarn.scheduler.minimum-allocation-mb</name>
            <value>1024</value>
        </property>
    
        <!-- Maximum container memory; default 8G, adjust as needed -->
        <property>
            <description>The maximum allocation for every container request at the RM in MBs. Memory requests higher than
                this will throw an InvalidResourceRequestException.
            </description>
            <name>yarn.scheduler.maximum-allocation-mb</name>
            <value>2048</value>
        </property>
    
        <!-- Minimum container vcores; default 1 -->
        <property>
            <description>The minimum allocation for every container request at the RM in terms of virtual CPU cores.
                Requests lower than this will be set to the value of this property. Additionally, a node manager that is
                configured to have fewer virtual cores than this value will be shut down by the resource manager.
            </description>
            <name>yarn.scheduler.minimum-allocation-vcores</name>
            <value>1</value>
        </property>
    
        <!-- Maximum container vcores; default 4, adjust as needed -->
        <property>
            <description>The maximum allocation for every container request at the RM in terms of virtual CPU cores.
                Requests higher than this will throw an
                InvalidResourceRequestException.
            </description>
            <name>yarn.scheduler.maximum-allocation-vcores</name>
            <value>1</value>
        </property>
    
        <!-- Virtual memory check; enabled by default, disabled here -->
        <property>
            <description>Whether virtual memory limits will be enforced for
                containers.
            </description>
            <name>yarn.nodemanager.vmem-check-enabled</name>
            <value>false</value>
        </property>
    
        <!-- Ratio of virtual memory to physical memory; default 2.1 -->
        <property>
            <description>Ratio between virtual memory to physical memory when setting memory limits for containers.
                Container allocations are expressed in terms of physical memory, and virtual memory usage is allowed to
                exceed this allocation by this ratio.
            </description>
            <name>yarn.nodemanager.vmem-pmem-ratio</name>
            <value>2.1</value>
        </property>
    
    
        <!-- ============= YARN high-availability configuration ================== -->
        <!-- Enable ResourceManager HA -->
        <property>
            <name>yarn.resourcemanager.ha.enabled</name>
            <value>true</value>
        </property>
        <!-- Cluster id shared by the two ResourceManagers -->
        <property>
            <name>yarn.resourcemanager.cluster-id</name>
            <value>maccyarn</value>
        </property>
        <!-- Logical list of ResourceManager ids -->
        <property>
            <name>yarn.resourcemanager.ha.rm-ids</name>
            <value>rm1,rm2</value>
        </property>
        <!-- ========== rm1 configuration ========== -->
        <!-- rm1 hostname -->
        <property>
            <name>yarn.resourcemanager.hostname.rm1</name>
            <value>spark01</value>
        </property>
        <!-- rm1 web UI address -->
        <property>
            <name>yarn.resourcemanager.webapp.address.rm1</name>
            <value>spark01:8088</value>
        </property>
        <!-- rm1 internal RPC address -->
        <property>
            <name>yarn.resourcemanager.address.rm1</name>
            <value>spark01:8032</value>
        </property>
        <!-- Address that ApplicationMasters use to request resources from rm1 -->
        <property>
            <name>yarn.resourcemanager.scheduler.address.rm1</name>
            <value>spark01:8030</value>
        </property>
        <!-- Address that NodeManagers connect to -->
        <property>
            <name>yarn.resourcemanager.resource-tracker.address.rm1</name>
            <value>spark01:8031</value>
        </property>
        <!-- ========== rm2 configuration ========== -->
        <!-- rm2 hostname -->
        <property>
            <name>yarn.resourcemanager.hostname.rm2</name>
            <value>spark02</value>
        </property>
        <property>
            <name>yarn.resourcemanager.webapp.address.rm2</name>
            <value>spark02:8088</value>
        </property>
        <property>
            <name>yarn.resourcemanager.address.rm2</name>
            <value>spark02:8032</value>
        </property>
        <property>
            <name>yarn.resourcemanager.scheduler.address.rm2</name>
            <value>spark02:8030</value>
        </property>
        <property>
            <name>yarn.resourcemanager.resource-tracker.address.rm2</name>
            <value>spark02:8031</value>
        </property>
        <!-- ZooKeeper quorum address -->
        <property>
            <name>yarn.resourcemanager.zk-address</name>
            <value>hadoop01:2181,hadoop02:2181,hadoop03:2181</value>
        </property>
        <!-- Automatic recovery of ResourceManager state (disabled here; set to true to enable it) -->
        <property>
            <name>yarn.resourcemanager.recovery.enabled</name>
            <value>false</value>
        </property>
        <!-- Store the ResourceManager state in the ZooKeeper cluster -->
        <property>
            <name>yarn.resourcemanager.store.class</name>
            <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
        </property>
        <!-- Environment variable inheritance -->
        <property>
            <name>yarn.nodemanager.env-whitelist</name>
            <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
        </property>
    </configuration>
    
    

    Then edit

    [hadoop@hadoop01 ~]$ vim /home/hadoop/hadoop-3.3.0/etc/hadoop/capacity-scheduler.xml
    

    Write the following configuration

    <configuration>
    
      <property>
        <name>yarn.scheduler.capacity.maximum-applications</name>
        <value>10000</value>
        <description>
          Maximum number of applications that can be pending and running.
        </description>
      </property>
    
      <property>
        <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
        <value>0.5</value>
        <description>
      This controls concurrency: the fraction of cluster resources that ApplicationMasters may use; concurrency = maximum AM resources / resources per AM
        </description>
      </property>
    
      <property>
        <name>yarn.scheduler.capacity.resource-calculator</name>
        <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
        <description>
          The ResourceCalculator implementation to be used to compare
          Resources in the scheduler.
          The default i.e. DefaultResourceCalculator only uses Memory while
          DominantResourceCalculator uses dominant-resource to compare
          multi-dimensional resources such as Memory, CPU etc.
        </description>
      </property>
    
      <property>
        <name>yarn.scheduler.capacity.root.queues</name>
        <value>hadoop,spark</value>
      </property>
    
      <property>
        <name>yarn.scheduler.capacity.root.hadoop.capacity</name>
        <value>0</value>
      </property>
      <property>
        <name>yarn.scheduler.capacity.root.spark.capacity</name>
        <value>100</value>
      </property>
        <property>
        <name>yarn.scheduler.capacity.root.hadoop.maximum-capacity</name>
        <value>0</value>
      </property>
      
      
      <property>
        <name>yarn.scheduler.capacity.root.spark.maximum-capacity</name>
        <value>100</value>
      </property>
      
      
      <property>
        <name>yarn.scheduler.capacity.root.accessible-node-labels</name>
        <value>*</value>
      </property>
      <property>
        <name>yarn.scheduler.capacity.root.hadoop.accessible-node-labels</name>
        <value>lb_hdfs</value>
      </property>
      <property>
        <name>yarn.scheduler.capacity.root.spark.accessible-node-labels</name>
        <value>lb_spark</value>
      </property>
      <property>
        <name>yarn.scheduler.capacity.root.accessible-node-labels.lb_hdfs.capacity</name>
        <value>0</value>
      </property>
      <property>
        <name>yarn.scheduler.capacity.root.accessible-node-labels.lb_spark.capacity</name>
        <value>100</value>
      </property>
      <property>
        <name>yarn.scheduler.capacity.root.hadoop.accessible-node-labels.lb_hdfs.capacity</name>
        <value>0</value>
      </property>
      <property>
        <name>yarn.scheduler.capacity.root.spark.accessible-node-labels.lb_spark.capacity</name>
        <value>100</value>
      </property>
     <property>
        <name>yarn.scheduler.capacity.root.hadoop.default-node-label-expression</name>
        <value>lb_hdfs</value>
      </property>
      <property>
        <name>yarn.scheduler.capacity.root.spark.default-node-label-expression</name>
        <value>lb_spark</value>
      </property>
        <property>
        <name>yarn.scheduler.capacity.root.hadoop.state</name>
        <value>RUNNING</value>
      </property>
      <property>
        <name>yarn.scheduler.capacity.root.spark.state</name>
        <value>RUNNING</value>
      </property>
    
      <property>
        <name>yarn.scheduler.capacity.root.hadoop.acl_submit_applications</name>
        <value>*</value>
      </property>
      <property>
        <name>yarn.scheduler.capacity.root.spark.acl_submit_applications</name>
        <value>*</value>
      </property>
    
      <property>
        <name>yarn.scheduler.capacity.root.hadoop.acl_administer_queue</name>
        <value>*</value>
      </property>
      <property>
        <name>yarn.scheduler.capacity.root.spark.acl_administer_queue</name>
        <value>*</value>
      </property>
        <property>
        <name>yarn.scheduler.capacity.node-locality-delay</name>
        <value>2</value>
        <description>
          Number of missed scheduling opportunities after which the CapacityScheduler
          attempts to schedule rack-local containers.
          Typically this should be set to number of nodes in the cluster, By default is setting
          approximately number of nodes in one rack which is 40.
        </description>
      </property>
    
      <property>
        <name>yarn.scheduler.capacity.queue-mappings</name>
        <value></value>
        <description>
          A list of mappings that will be used to assign jobs to queues
          The syntax for this list is [u|g]:[name]:[queue_name][,next mapping]*
          Typically this list will be used to map users to queues,
          for example, u:%user:%user maps all users to queues with the same name
          as the user.
        </description>
      </property>
    
      <property>
        <name>yarn.scheduler.capacity.queue-mappings-override.enable</name>
        <value>false</value>
        <description>
          If a queue mapping is present, will it override the value specified
          by the user? This can be used by administrators to place jobs in queues
          that are different than the one specified by the user.
          The default is false.
        </description>
      </property>
    
    </configuration>
    

    Sync to all nodes

    [hadoop@hadoop01 ~]$ ~/bin/xsyncall /home/hadoop/hadoop-3.3.0/etc/hadoop/yarn-site.xml
    [hadoop@hadoop01 ~]$ ~/bin/xsyncall /home/hadoop/hadoop-3.3.0/etc/hadoop/capacity-scheduler.xml
    
  3. Restart YARN

    [hadoop@hadoop01 ~]$ stop-yarn.sh
    [hadoop@hadoop01 ~]$ start-yarn.sh
    
  4. Label the servers

    The Hadoop nodes and the Spark nodes share one YARN cluster; label-based scheduling is used to confine Spark jobs to the Spark nodes. The setup is as follows:

    spark01/spark02/spark03: run the Spark cluster and YARN's RM/NM

    hadoop01/hadoop02/hadoop03: run the Hadoop cluster and YARN's NM

    #Define the cluster-wide node labels
    [hadoop@hadoop01 ~]$ yarn rmadmin -addToClusterNodeLabels "lb_spark(exclusive=true),lb_hdfs(exclusive=true)"
    # Check the result
    [hadoop@hadoop01 ~]$ yarn cluster --list-node-labels
    # Label the servers; if the hostnames differ, use yarn node -list -all to look up each node's nodeAddress
    [hadoop@hadoop01 ~]$ yarn rmadmin -replaceLabelsOnNode "spark01=lb_spark spark02=lb_spark spark03=lb_spark hadoop01=lb_hdfs hadoop02=lb_hdfs hadoop03=lb_hdfs"
    #No YARN restart is needed afterwards; refresh with the following command
    yarn rmadmin -refreshQueues
    
  5. Verify

    [hadoop@spark01 ~]$ cd /home/hadoop/spark/bin/
    [hadoop@macc-hk-spark-1 bin]$ ./spark-submit --class org.apache.spark.examples.SparkPi --master yarn --queue spark --deploy-mode client ../examples/jars/spark-examples_2.12-3.1.2.jar 10
    
   Open http://spark01:8088/cluster and check that the application ran successfully.
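
   A couple of optional follow-up checks with the standard YARN CLI (run from any node with the Hadoop client configured):

   [hadoop@hadoop01 ~]$ yarn queue -status spark                     # queue state and capacity
   [hadoop@hadoop01 ~]$ yarn application -list -appStates FINISHED   # the SparkPi run should appear here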