Connecting Apache Flume
Available in VPC
Apache Flume is a service that efficiently collects large amounts of log data in distributed environments and transfers it to data storage.
For more information, see the official Apache Flume website.
Characteristics
- Distributed: while it may differ depending on how the topology is configured, you can build pipelines with multiple Flume Agents. Usually, Avro-type Sinks and Sources are used to connect Agents (next hop).
- Reliable: within a Flume Agent, events move through the components called Source, Channel, and Sink. An event is not removed from the Channel until it has been delivered to the Sink. To guarantee this, Flume implements a transactional approach.
- Available: when a disk-backed system is used for the Channel, the data delivered from Source to Channel can be recovered even if the Agent fails.
Purpose of use
- From many sources to a centralized store: a Flume Agent can collect logs from multiple nodes and eventually save them to centralized storage.
- Collecting, aggregating, moving: Agents can collect and combine logs. During this process, you can use Selectors and Interceptors to change the form of events.
The collected events can be forwarded to the next Agent or saved to the final Sink.
Component
- Event: the basic unit of data transferred by a Flume Agent. You can optionally attach header values to events; headers are usually used to inspect and modify event content.
- Flume Agent: a JVM process that hosts the Source, Channel, and Sink components. Events flow through Agents from an external source to the next-hop destination.
- Source: consumes events delivered from clients. When the Source receives an event, it passes it to one or more Channels.
- Channel: temporary storage for events. It connects the Source and Sink and plays an important role in guaranteeing the durability of the event flow.
- Sink: removes events from a Channel and forwards them to the next hop of the flow.
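The transactional hand-off between these components can be sketched in Python. This is a simplified illustration of the durability guarantee, not Flume's actual implementation: a taken event stays tracked until the Sink commits, so a failed delivery can be rolled back into the Channel.

```python
from collections import deque

class Channel:
    """Simplified channel: events stay recoverable until a take
    transaction commits, mirroring Flume's durability guarantee."""
    def __init__(self):
        self.queue = deque()
        self.in_flight = []

    def put(self, event):
        # Source-side transaction: event enters the channel
        self.queue.append(event)

    def take(self):
        # Sink-side: remove tentatively, keep a copy until commit
        if self.queue:
            event = self.queue.popleft()
            self.in_flight.append(event)
            return event
        return None

    def commit(self):
        # Sink delivered the events; forget the in-flight copies
        self.in_flight.clear()

    def rollback(self):
        # Delivery failed: return events to the front of the queue
        while self.in_flight:
            self.queue.appendleft(self.in_flight.pop())

channel = Channel()
channel.put({"headers": {"host": "log-gen-001"}, "body": "vmstat line"})
event = channel.take()
channel.rollback()                 # simulate a failed sink write
assert len(channel.queue) == 1     # the event survived the failure
```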
This guide explains how to configure a Flume topology to store the server's logs in Cloud Hadoop HDFS.
Use Flume Agent
You can try configuring a Flume topology that uses Flume Agent to collect vmstat results from each server and store them in Cloud Hadoop HDFS as follows:
Install Flume
The following describes how to install Flume Agents.
Create 3 servers to collect the logs from. (See the Create Server guide.)
- Each server must be created in the ACG that includes Cloud Hadoop.
- log-gen-001 / centos-7.8-64 / 2vCPU, 8GB Mem
- log-gen-002 / centos-7.8-64 / 2vCPU, 8GB Mem
- log-gen-003 / centos-7.8-64 / 2vCPU, 8GB Mem
Create the ~/downloads and ~/apps directories, then download the Flume package to that path and unzip it to complete the installation.

```shell
mkdir ~/downloads ~/apps
cd ~/downloads
wget https://archive.apache.org/dist/flume/1.9.0/apache-flume-1.9.0-bin.tar.gz
tar -xvf apache-flume-1.9.0-bin.tar.gz
mv apache-flume-1.9.0-bin ~/apps/
cd ~/apps
ln -s apache-flume-1.9.0-bin flume
```
Preparations
Using Cloud Hadoop as HDFS Sink requires the following preparations:
1. Preparations for communication between the log servers and HDFS
In order for each log server to communicate with the other log servers and the HDFS NameNode hosts, the private IP and name of each host must be registered in /etc/hosts.
You can check this information in /etc/hosts of the Cloud Hadoop edge node (e.g., e-001-xxx).
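For example, the entries in /etc/hosts might look like the following. The IP addresses and Hadoop host names below are placeholders; use the values from your own cluster's edge node.

```
10.0.1.11  log-gen-001
10.0.1.12  log-gen-002
10.0.1.13  log-gen-003
10.0.1.21  m-001-example-hadoop
10.0.1.22  m-002-example-hadoop
```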
2. Preparations for using HDFS Sink
This topology uses HDFS Sink as its final destination. The Hadoop common jar library is required on the node where the Flume Agent runs. Configuration files such as hdfs-site.xml and core-site.xml are also necessary to use a name service for NameNode HA.
- Download Hadoop binary
Download the Hadoop binary and the .jar libraries you need under /home using the following commands:
Use the following commands for Cloud Hadoop version 1.3:

```shell
# Download Hadoop binary
wget https://archive.apache.org/dist/hadoop/common/hadoop-2.6.5/hadoop-2.6.5.tar.gz -P ~/apps
cd ~/apps
tar xfz hadoop-2.6.5.tar.gz
# Download HDFS Jar
wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-auth/2.6.5/hadoop-auth-2.6.5.jar -P ~/apps/hadoop-2.6.5/share/hadoop/common
wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-hdfs/2.6.5/hadoop-hdfs-2.6.5.jar -P ~/apps/hadoop-2.6.5/share/hadoop/common
wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-hdfs-client/2.6.5/hadoop-hdfs-client-2.6.5.jar -P ~/apps/hadoop-2.6.5/share/hadoop/common
```
Use the following commands for Cloud Hadoop version 1.4 or later:

```shell
# Download Hadoop binary
wget https://archive.apache.org/dist/hadoop/common/hadoop-3.1.4/hadoop-3.1.4.tar.gz -P ~/apps
cd ~/apps
tar xfz hadoop-3.1.4.tar.gz
# Download HDFS Jar
wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-auth/3.1.4/hadoop-auth-3.1.4.jar -P ~/apps/hadoop-3.1.4/share/hadoop/common
wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-hdfs/3.1.4/hadoop-hdfs-3.1.4.jar -P ~/apps/hadoop-3.1.4/share/hadoop/common
wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-hdfs-client/3.1.4/hadoop-hdfs-client-3.1.4.jar -P ~/apps/hadoop-3.1.4/share/hadoop/common
```
Set Hadoop config
Download the Hadoop config files under $FLUME_CLASS_PATH/conf. (Note that the Ambari URL must be in double quotes so the shell variables expand.)

```shell
cd ~/apps/flume/conf
curl -u $AMBARI_ID:$AMBARI_PASS -G "http://$AMBARI_URI:8080/api/v1/clusters/$CLUSTER_NAME/components?format=client_config_tar" -o client_config.tgz
tar xfz client_config.tgz
rm -f client_config.tgz
```
Set Hadoop environment variables
Run the following commands to set the Hadoop environment variables.

```shell
export HADOOP_HOME=~/apps/hadoop-3.1.4
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=~/apps/flume/conf/HDFS_CLIENT/
export PATH=${JAVA_HOME}/bin:${HADOOP_HOME}/bin:${PATH}
```
Change Flume configurations
The following describes how to change the Flume configuration.
Run the following commands in the Flume Agent to create configuration values.
```shell
cd ~/apps/flume/conf
cp flume-conf.properties.template flume.conf
cp flume-env.sh.template flume-env.sh
cp ~/apps/hadoop-3.1.4/share/hadoop/common/*.jar ~/apps/flume/lib/
cp ~/apps/hadoop-3.1.4/share/hadoop/common/lib/woodstox-core-5.0.3.jar ~/apps/flume/lib/
# Flume 1.9.0 bundles guava-11.0.2, which conflicts with the Guava version Hadoop 3 expects
mv ~/apps/flume/lib/guava-11.0.2.jar ~/apps/flume/lib/guava-11.0.2.jar.bak
```
Edit the JAVA_HOME and HADOOP_HOME option values in hadoop-env.sh on each Flume Agent as shown below.
- Java settings can have different option values depending on the installation method. The following shows the path set after installing the Java package with yum.

```shell
# Java installation
yum install -y java-1.8.0-openjdk

# Edit hadoop-env.sh
vi $HADOOP_HOME/etc/hadoop/hadoop-env.sh

# The java implementation to use. Required.
export JAVA_HOME=/usr/lib/jvm/jre-openjdk

# Hadoop home directory
export HADOOP_HOME=/root/apps/hadoop-3.1.4
```
flume.conf
- Define the names of the Agent and each component. (Agent name: fooAgent)
- You can use a path that includes the name service as the HDFS Sink path. In Cloud Hadoop, the cluster name becomes the name service.
- Since the node information is included in hdfs-site.xml, you don't have to specify which NameNode is in the active status.

```
fooAgent.sources = Exec
fooAgent.channels = MemChannel
fooAgent.sinks = HDFS

fooAgent.sources.Exec.type = exec
fooAgent.sources.Exec.command = /usr/bin/vmstat 1
fooAgent.sources.Exec.channels = MemChannel

fooAgent.channels.MemChannel.type = memory
fooAgent.channels.MemChannel.capacity = 10000
fooAgent.channels.MemChannel.transactionCapacity = 1000

fooAgent.sinks.HDFS.channel = MemChannel
fooAgent.sinks.HDFS.type = hdfs
fooAgent.sinks.HDFS.hdfs.path = hdfs://$CLUSTER_NAME/user/hduser/flume/events/
fooAgent.sinks.HDFS.hdfs.fileType = DataStream
fooAgent.sinks.HDFS.hdfs.writeFormat = Text
fooAgent.sinks.HDFS.hdfs.batchSize = 1000
fooAgent.sinks.HDFS.hdfs.rollSize = 0
fooAgent.sinks.HDFS.hdfs.rollCount = 10000
```
flume-env.sh
Add the path of the hadoop client installed in the preparations to FLUME_CLASSPATH.

```shell
export JAVA_HOME="/usr/lib/jvm/jre-openjdk"
export JAVA_OPTS="-Xms100m -Xmx2000m -Dcom.sun.management.jmxremote"
export HADOOP_CONF_DIR="/root/apps/flume/conf/HDFS_CLIENT"
FLUME_CLASSPATH="/root/apps/flume/lib"
```
Start process
Create the directory and set owner permissions.

```shell
sudo su - hdfs
hdfs dfs -mkdir -p /user/hduser/flume/events/
hdfs dfs -chown -R sshuser: /user/hduser/flume/events/
exit
```
Use the following command to start each Flume Agent.
```shell
cd ~/apps/flume/
./bin/flume-ng agent --conf ./conf/ -f conf/flume.conf -Dflume.root.logger=DEBUG,console -n fooAgent
```

```
....
20/09/09 12:32:22 INFO hdfs.HDFSDataStream: Serializer = TEXT, UseRawLocalFileSystem = false
20/09/09 12:32:22 INFO hdfs.BucketWriter: Creating hdfs://xxxxx/user/hduser/flume/events/FlumeData.1599622342911.tmp
20/09/09 12:32:52 INFO hdfs.HDFSEventSink: Writer callback called.
20/09/09 12:32:52 INFO hdfs.BucketWriter: Closing hdfs://xxxxx/user/hduser/flume/events/FlumeData.1599622342911.tmp
20/09/09 12:32:52 INFO hdfs.BucketWriter: Renaming hdfs://xxxxx/user/hduser/flume/events/FlumeData.1599622342911.tmp to hdfs://xxxxx/user/hduser/flume/events/FlumeData.1599622342911
20/09/09 12:32:55 INFO hdfs.HDFSDataStream: Serializer = TEXT, UseRawLocalFileSystem = false
20/09/09 12:32:55 INFO hdfs.BucketWriter: Creating hdfs://xxxxx/user/hduser/flume/events/FlumeData.1599622375913.tmp
20/09/09 12:33:25 INFO hdfs.HDFSEventSink: Writer callback called.
20/09/09 12:33:25 INFO hdfs.BucketWriter: Closing hdfs://xxxxx/user/hduser/flume/events/FlumeData.1599622375913.tmp
20/09/09 12:33:25 INFO hdfs.BucketWriter: Renaming hdfs://xxxxx/user/hduser/flume/events/FlumeData.1599622375913.tmp to hdfs://xxxxx/user/hduser/flume/events/FlumeData.1599622375913
20/09/09 12:33:28 INFO hdfs.HDFSDataStream: Serializer = TEXT, UseRawLocalFileSystem = false
20/09/09 12:33:28 INFO hdfs.BucketWriter: Creating hdfs://xxxxx/user/hduser/flume/events/FlumeData.1599622408915.tmp
20/09/09 12:33:58 INFO hdfs.HDFSEventSink: Writer callback called.
20/09/09 12:33:58 INFO hdfs.BucketWriter: Closing hdfs://xxxxx/user/hduser/flume/events/FlumeData.1599622408915.tmp
20/09/09 12:33:58 INFO hdfs.BucketWriter: Renaming hdfs://xxxxx/user/hduser/flume/events/FlumeData.1599622408915.tmp to hdfs://xxxxx/user/hduser/flume/events/FlumeData.1599622408915
20/09/09 12:34:01 INFO hdfs.HDFSDataStream: Serializer = TEXT, UseRawLocalFileSystem = false
20/09/09 12:34:01 INFO hdfs.BucketWriter: Creating
```
You can use the following command to check the results in HDFS.

```shell
hadoop fs -ls /user/hduser/flume/events/
```

```
Found 17 items
-rw-r--r--   2 root hdfs       3089 2020-09-09 12:25 /user/hduser/flume/events/FlumeData.1599621914876
-rw-r--r--   2 root hdfs       3093 2020-09-09 12:26 /user/hduser/flume/events/FlumeData.1599621946882
-rw-r--r--   2 root hdfs       2931 2020-09-09 12:26 /user/hduser/flume/events/FlumeData.1599621979885
-rw-r--r--   2 root hdfs       3091 2020-09-09 12:27 /user/hduser/flume/events/FlumeData.1599622012888
-rw-r--r--   2 root hdfs       2931 2020-09-09 12:27 /user/hduser/flume/events/FlumeData.1599622045890
-rw-r--r--   2 root hdfs       3091 2020-09-09 12:28 /user/hduser/flume/events/FlumeData.1599622078893
-rw-r--r--   2 root hdfs       2930 2020-09-09 12:29 /user/hduser/flume/events/FlumeData.1599622111895
-rw-r--r--   2 root hdfs       3093 2020-09-09 12:29 /user/hduser/flume/events/FlumeData.1599622144897
-rw-r--r--   2 root hdfs       3092 2020-09-09 12:30 /user/hduser/flume/events/FlumeData.1599622177899
-rw-r--r--   2 root hdfs       2931 2020-09-09 12:30 /user/hduser/flume/events/FlumeData.1599622210902
-rw-r--r--   2 root hdfs       3093 2020-09-09 12:31 /user/hduser/flume/events/FlumeData.1599622243904
-rw-r--r--   2 root hdfs       2932 2020-09-09 12:31 /user/hduser/flume/events/FlumeData.1599622276906
```
In an actual production environment, two or more Flume Agents are piped together and Interceptors are used to transform events.
There are various types of Sources, Channels, and Sinks; Kafka is often used as a Channel or Sink.
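The kind of event transformation an Interceptor performs can be sketched in Python. This is a simplified illustration only; real Flume Interceptors are Java classes, and the two transforms below mimic Flume's built-in Host and Timestamp Interceptors, which add routing-friendly headers to each event.

```python
import time

def host_interceptor(event, hostname):
    """Add a 'host' header, similar to Flume's built-in Host
    Interceptor; a Selector could then route events on this header."""
    event["headers"]["host"] = hostname
    return event

def timestamp_interceptor(event):
    """Add a millisecond 'timestamp' header, similar to Flume's
    built-in Timestamp Interceptor."""
    event["headers"]["timestamp"] = str(int(time.time() * 1000))
    return event

# A toy event as produced by the vmstat exec source above
event = {"headers": {}, "body": "procs ... vmstat output line"}
event = host_interceptor(event, "log-gen-001")
event = timestamp_interceptor(event)
print(event["headers"]["host"])  # prints "log-gen-001"
```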