Connecting Apache Flume

    Available in VPC

    Apache Flume is a service that efficiently collects large amounts of log data in distributed environments and transfers it to a data store.
    For more information, see the official Apache Flume website.


    • Characteristics

      • Distributed: depending on how the topology is configured, you can build pipelines out of multiple Flume Agents. An Avro Sink and Avro Source (next hop) pair is typically used to connect one Agent to the next.
      • Reliable: inside a Flume Agent, events move through the components called Source, Channel, and Sink. An event is not removed from the Channel until it has been delivered to the Sink; Flume guarantees this with a transactional approach.
      • Available: when a disk-backed Channel is used, the data already delivered from the Source to the Channel can be recovered even if an error occurs in the Agent.
    • Purpose of use

      • From many sources to a centralized store: a Flume Agent can collect logs from multiple nodes and eventually save them to centralized storage.
      • Collecting, aggregating, moving: Agents collect and combine logs. During this process, you can use Selectors and Interceptors to change the form of events.
        The collected events can be forwarded to the next Agent or saved to the final Sink.
    • Component

      • Event: the basic unit of data transferred by a Flume Agent. Optionally, you can attach header values to events; headers are usually used to inspect and change the content of events.
      • Flume Agent: a JVM process that hosts the Source, Channel, and Sink components. Events flow through Agents from an external source to the next-hop destination (see the sketch after this list).
        • Source: consumes the events forwarded from a client. When the Source receives an event, it passes it to one or more Channels.
        • Channel: temporary storage for events. It connects the Source and Sink and plays an important role in guaranteeing the durability of the event flow.
        • Sink: removes events from the Channel and forwards them to the next hop of the flow.
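
    As an illustration of how these components are wired together, the following minimal sketch defines one hypothetical agent (demoAgent) with a netcat Source, a memory Channel, a logger Sink, and a host Interceptor that adds a header to each event. The agent name, bind address, and port are examples only and are not part of this guide's topology.

      # Hypothetical single-agent example: netcat Source -> memory Channel -> logger Sink
      demoAgent.sources = NetSrc
      demoAgent.channels = MemCh
      demoAgent.sinks = LogSink

      # Source: turns each line received on a local TCP port into an event
      demoAgent.sources.NetSrc.type = netcat
      demoAgent.sources.NetSrc.bind = 127.0.0.1
      demoAgent.sources.NetSrc.port = 44444
      demoAgent.sources.NetSrc.channels = MemCh

      # Interceptor: stamps the agent's host name into an event header
      demoAgent.sources.NetSrc.interceptors = hostInt
      demoAgent.sources.NetSrc.interceptors.hostInt.type = host

      # Channel: in-memory buffer between Source and Sink
      demoAgent.channels.MemCh.type = memory
      demoAgent.channels.MemCh.capacity = 1000

      # Sink: writes events to the Flume log, useful for testing
      demoAgent.sinks.LogSink.type = logger
      demoAgent.sinks.LogSink.channel = MemCh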

    This guide explains how to configure a Flume topology to store the server's logs in Cloud Hadoop HDFS.

    Use Flume Agent

    You can configure a Flume topology that uses Flume Agents to collect vmstat results from each server and store them in Cloud Hadoop HDFS, as follows:


    Install Flume

    The following describes how to install Flume Agents.

    1. Create 3 servers from which to collect logs. (See the Create Server guide.)

      • Each server has to be created in the ACG that includes the Cloud Hadoop cluster.
        • log-gen-001 / centos-7.8-64 / 2 vCPU, 8 GB Mem
        • log-gen-002 / centos-7.8-64 / 2 vCPU, 8 GB Mem
        • log-gen-003 / centos-7.8-64 / 2 vCPU, 8 GB Mem
    2. Create the ~/downloads and ~/apps directories. Download the Flume package to ~/downloads, extract it, and move it under ~/apps to complete the installation.

      mkdir ~/downloads ~/apps
      
      cd ~/downloads
      wget https://archive.apache.org/dist/flume/1.9.0/apache-flume-1.9.0-bin.tar.gz
      tar -xvf apache-flume-1.9.0-bin.tar.gz
      
      mv apache-flume-1.9.0-bin ~/apps/
      cd ~/apps
      ln -s apache-flume-1.9.0-bin flume
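
      As a quick sanity check, you can print the installed Flume version, which should report 1.9.0:

      ~/apps/flume/bin/flume-ng version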
      

    Preparations

    Using Cloud Hadoop HDFS as the Sink requires the following preparations:

    1. Preparations for communication between the log servers and HDFS

    For each log server to communicate with the HDFS NameNode hosts, the private IP and host name of each host need to be registered in /etc/hosts.

    You can check this information in /etc/hosts on the Cloud Hadoop edge node (e.g., e-001-xxx).
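
    The entries use the standard /etc/hosts format. The IP addresses and host names below are placeholders only; copy the actual values from the edge node.

      # Placeholder entries; replace with the private IPs and host names from the edge node's /etc/hosts
      10.10.10.11   hadoop-master-001
      10.10.10.12   hadoop-master-002
      10.10.10.13   hadoop-worker-001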

    2. Preparations for using HDFS Sink

    This topology uses an HDFS Sink. The Hadoop common JAR libraries are required on the node where the Flume Agent runs. Configuration files such as hdfs-site.xml and core-site.xml are also necessary to use a name service for NameNode HA.

    • Download Hadoop binary
      Download the Hadoop binary and the .jar libraries you need under ~/apps using the following commands:
    1. Use the following commands for Cloud Hadoop version 1.3:

      # Download Hadoop binary
      wget https://archive.apache.org/dist/hadoop/common/hadoop-2.6.5/hadoop-2.6.5.tar.gz -P ~/apps
      cd ~/apps
      tar xfz hadoop-2.6.5.tar.gz
      
      # Download HDFS Jar
      wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-auth/2.6.5/hadoop-auth-2.6.5.jar -P ~/apps/hadoop-2.6.5/share/hadoop/common
      wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-hdfs/2.6.5/hadoop-hdfs-2.6.5.jar -P ~/apps/hadoop-2.6.5/share/hadoop/common
      wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-hdfs-client/2.6.5/hadoop-hdfs-client-2.6.5.jar -P ~/apps/hadoop-2.6.5/share/hadoop/common
      
    2. Use the following commands for Cloud Hadoop version 1.4 or later:

      # Download Hadoop binary
      wget https://archive.apache.org/dist/hadoop/common/hadoop-3.1.4/hadoop-3.1.4.tar.gz -P ~/apps
      cd ~/apps
      tar xfz hadoop-3.1.4.tar.gz
      
      # Download HDFS Jar
      wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-auth/3.1.4/hadoop-auth-3.1.4.jar -P ~/apps/hadoop-3.1.4/share/hadoop/common
      wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-hdfs/3.1.4/hadoop-hdfs-3.1.4.jar -P ~/apps/hadoop-3.1.4/share/hadoop/common
      wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-hdfs-client/3.1.4/hadoop-hdfs-client-3.1.4.jar -P ~/apps/hadoop-3.1.4/share/hadoop/common
      
    • Set Hadoop config
      Download the Hadoop config files into the Flume configuration directory (~/apps/flume/conf).

      $ cd ~/apps/flume/conf
      $ curl -u $AMBARI_ID:$AMBARI_PASS -G "http://$AMBARI_URI:8080/api/v1/clusters/$CLUSTER_NAME/components?format=client_config_tar" -o client_config.tgz
      $ tar xfz client_config.tgz
      $ rm -f client_config.tgz
      
    • Set Hadoop environment variables
      Run the following commands to set Hadoop environment variables.

      export HADOOP_HOME=~/apps/hadoop-3.1.4
      export HADOOP_HDFS_HOME=$HADOOP_HOME
      export HADOOP_CONF_DIR=~/apps/flume/conf/HDFS_CLIENT/
      export PATH=${JAVA_HOME}/bin:${HADOOP_HOME}/bin:${PATH}
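
      To verify these preparations, you can list the downloaded client configuration files and check that the name service resolves. This is an illustrative check based on the paths used above; HDFS_CLIENT is the directory extracted from the Ambari client configuration archive.

      # Confirm that core-site.xml and hdfs-site.xml were extracted
      ls ~/apps/flume/conf/HDFS_CLIENT/

      # With HADOOP_CONF_DIR set, this should print the cluster's name service (e.g., hdfs://$CLUSTER_NAME)
      hdfs getconf -confKey fs.defaultFS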
      

    Change Flume configurations

    The following describes how to change the Flume configuration.

    1. Run the following commands on each Flume Agent node to create the configuration files.

      cd ~/apps/flume/conf
      cp flume-conf.properties.template flume.conf
      cp flume-env.sh.template flume-env.sh
      cp ~/apps/hadoop-3.1.4/share/hadoop/common/*.jar ~/apps/flume/lib/ 
      cp ~/apps/hadoop-3.1.4/share/hadoop/common/lib/woodstox-core-5.0.3.jar ~/apps/flume/lib/
      mv ~/apps/flume/lib/guava-11.0.2.jar ~/apps/flume/lib/guava-11.0.2.jar.bak 
      
    2. On each Flume Agent node, edit the JAVA_HOME and HADOOP_HOME values in hadoop-env.sh as shown below.

      • The Java path can differ depending on the installation method. The following shows the paths set after installing the Java package with yum.
    # Java installation
    yum install -y java-1.8.0-openjdk
    
    # Edit hadoop-env.sh
    vi $HADOOP_HOME/etc/hadoop/hadoop-env.sh 
    
    # The java implementation to use.  Required.
    export JAVA_HOME=/usr/lib/jvm/jre-openjdk
    # Hadoop home directory
    export HADOOP_HOME=/root/apps/hadoop-3.1.4
    
    • flume.conf

      • Define the Agent name and each component. (Agent name: fooAgent)
      • You can use a path that includes the name service as the HDFS Sink path. In Cloud Hadoop, the cluster name is the name service.
      • Since the NameNode information is included in hdfs-site.xml, you don't have to specify which NameNode is currently active.
      fooAgent.sources = Exec
      fooAgent.channels = MemChannel
      fooAgent.sinks = HDFS
      
      fooAgent.sources.Exec.type = exec
      fooAgent.sources.Exec.command = /usr/bin/vmstat 1
      fooAgent.sources.Exec.channels = MemChannel 
      
      fooAgent.channels.MemChannel.type = memory
      fooAgent.channels.MemChannel.capacity = 10000
      fooAgent.channels.MemChannel.transactionCapacity = 1000
      
      fooAgent.sinks.HDFS.channel = MemChannel
      fooAgent.sinks.HDFS.type = hdfs
      fooAgent.sinks.HDFS.hdfs.path = hdfs://$CLUSTER_NAME/user/hduser/flume/events/
      fooAgent.sinks.HDFS.hdfs.fileType = DataStream
      fooAgent.sinks.HDFS.hdfs.writeFormat = Text
      fooAgent.sinks.HDFS.hdfs.batchSize = 1000
      fooAgent.sinks.HDFS.hdfs.rollSize = 0
      fooAgent.sinks.HDFS.hdfs.rollCount = 10000
      
    • flume-env.sh
      Add the path of the Hadoop client libraries installed in the preparations to FLUME_CLASSPATH.

      export JAVA_HOME="/usr/lib/jvm/jre-openjdk"
      export JAVA_OPTS="-Xms100m -Xmx2000m -Dcom.sun.management.jmxremote"
      export HADOOP_CONF_DIR="/root/apps/flume/conf/HDFS_CLIENT"
      FLUME_CLASSPATH="/root/apps/flume/lib"
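
      The guava JAR is set aside in step 1 because the version bundled with Flume 1.9 can conflict with the Guava version that Hadoop 3.x uses; renaming it to .bak keeps it off the classpath, which only picks up .jar files. The listing below is an illustrative sanity check of the copied libraries:

      # Confirm the Hadoop JARs are present and the bundled guava JAR is disabled
      ls ~/apps/flume/lib/ | grep -E 'hadoop-common|hadoop-hdfs|guava'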
      

    Start process

    1. Create the directory and set ownership

      $ sudo su - hdfs
      $ hdfs dfs -mkdir /user/hduser/flume/events/
      $ hdfs dfs -chown -R sshuser: /user/hduser/flume/events/
      $ exit
      
    2. Use the following command to start each Flume Agent in the foreground. (A sketch of a background start command follows step 3.)

      cd ~/apps/flume/
      
      ./bin/flume-ng agent --conf ./conf/ -f conf/flume.conf -Dflume.root.logger=DEBUG,console -n  fooAgent
      
      ....
      
      20/09/09 12:32:22 INFO hdfs.HDFSDataStream: Serializer = TEXT, UseRawLocalFileSystem = false
      20/09/09 12:32:22 INFO hdfs.BucketWriter: Creating hdfs://xxxxx/user/hduser/flume/events/FlumeData.1599622342911.tmp
      20/09/09 12:32:52 INFO hdfs.HDFSEventSink: Writer callback called.
      20/09/09 12:32:52 INFO hdfs.BucketWriter: Closing hdfs://xxxxx/user/hduser/flume/events/FlumeData.1599622342911.tmp
      20/09/09 12:32:52 INFO hdfs.BucketWriter: Renaming hdfs://xxxxx/user/hduser/flume/events/FlumeData.1599622342911.tmp to hdfs://xxxxx/user/hduser/flume/events/FlumeData.1599622342911
      20/09/09 12:32:55 INFO hdfs.HDFSDataStream: Serializer = TEXT, UseRawLocalFileSystem = false
      20/09/09 12:32:55 INFO hdfs.BucketWriter: Creating hdfs://xxxxx/user/hduser/flume/events/FlumeData.1599622375913.tmp
      20/09/09 12:33:25 INFO hdfs.HDFSEventSink: Writer callback called.
      20/09/09 12:33:25 INFO hdfs.BucketWriter: Closing hdfs://xxxxx/user/hduser/flume/events/FlumeData.1599622375913.tmp
      20/09/09 12:33:25 INFO hdfs.BucketWriter: Renaming hdfs://xxxxx/user/hduser/flume/events/FlumeData.1599622375913.tmp to hdfs://xxxxx/user/hduser/flume/events/FlumeData.1599622375913
      20/09/09 12:33:28 INFO hdfs.HDFSDataStream: Serializer = TEXT, UseRawLocalFileSystem = false
      20/09/09 12:33:28 INFO hdfs.BucketWriter: Creating hdfs://xxxxx/user/hduser/flume/events/FlumeData.1599622408915.tmp
      20/09/09 12:33:58 INFO hdfs.HDFSEventSink: Writer callback called.
      20/09/09 12:33:58 INFO hdfs.BucketWriter: Closing hdfs://xxxxx/user/hduser/flume/events/FlumeData.1599622408915.tmp
      20/09/09 12:33:58 INFO hdfs.BucketWriter: Renaming hdfs://xxxxx/user/hduser/flume/events/FlumeData.1599622408915.tmp to hdfs://xxxxx/user/hduser/flume/events/FlumeData.1599622408915
      20/09/09 12:34:01 INFO hdfs.HDFSDataStream: Serializer = TEXT, UseRawLocalFileSystem = false
      20/09/09 12:34:01 INFO hdfs.BucketWriter: Creating
      
    3. You can use the following command to check the stored files in HDFS.

      $ hadoop fs -ls /user/hduser/flume/events/
      
      Found 17 items
      
      -rw-r--r--   2 root hdfs       3089 2020-09-09 12:25 /user/hduser/flume/events/FlumeData.1599621914876
      -rw-r--r--   2 root hdfs       3093 2020-09-09 12:26 /user/hduser/flume/events/FlumeData.1599621946882
      -rw-r--r--   2 root hdfs       2931 2020-09-09 12:26 /user/hduser/flume/events/FlumeData.1599621979885
      -rw-r--r--   2 root hdfs       3091 2020-09-09 12:27 /user/hduser/flume/events/FlumeData.1599622012888
      -rw-r--r--   2 root hdfs       2931 2020-09-09 12:27 /user/hduser/flume/events/FlumeData.1599622045890
      -rw-r--r--   2 root hdfs       3091 2020-09-09 12:28 /user/hduser/flume/events/FlumeData.1599622078893
      -rw-r--r--   2 root hdfs       2930 2020-09-09 12:29 /user/hduser/flume/events/FlumeData.1599622111895
      -rw-r--r--   2 root hdfs       3093 2020-09-09 12:29 /user/hduser/flume/events/FlumeData.1599622144897
      -rw-r--r--   2 root hdfs       3092 2020-09-09 12:30 /user/hduser/flume/events/FlumeData.1599622177899
      -rw-r--r--   2 root hdfs       2931 2020-09-09 12:30 /user/hduser/flume/events/FlumeData.1599622210902
      -rw-r--r--   2 root hdfs       3093 2020-09-09 12:31 /user/hduser/flume/events/FlumeData.1599622243904
      -rw-r--r--   2 root hdfs       2932 2020-09-09 12:31 /user/hduser/flume/events/FlumeData.1599622276906
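
    If the agent needs to keep running after the SSH session ends, it can be started in the background instead of with the foreground command in step 2. This is a minimal sketch; the output file name is an arbitrary choice, and the LOGFILE logger relies on the log4j.properties shipped in the Flume conf directory.

      cd ~/apps/flume/

      # Start the agent in the background and keep it alive after logout
      nohup ./bin/flume-ng agent --conf ./conf/ -f conf/flume.conf \
          -Dflume.root.logger=INFO,LOGFILE -n fooAgent > flume-agent.out 2>&1 &

      # Confirm that the agent process is running
      ps -ef | grep flume-ng | grep -v grep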
      
    Note

    In an actual production environment, two or more Flume Agents are usually piped together, and Interceptors are used to transform events.
    There are various types of Sources, Channels, and Sinks; Kafka is often used as a Channel or Sink (see the sketch below).
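
    For reference, a Kafka-backed Channel for the agent above could be declared in the same properties format. This is only a sketch: the broker addresses, topic, and consumer group are placeholders, and Kafka itself is not part of this guide's topology.

      # Hypothetical replacement of MemChannel with a Kafka channel for added durability
      fooAgent.channels = KafkaCh
      fooAgent.channels.KafkaCh.type = org.apache.flume.channel.kafka.KafkaChannel
      fooAgent.channels.KafkaCh.kafka.bootstrap.servers = kafka-001:9092,kafka-002:9092
      fooAgent.channels.KafkaCh.kafka.topic = flume-channel
      fooAgent.channels.KafkaCh.kafka.consumer.group.id = flume-hdfs-sink

      # The Source and Sink then reference the new channel
      fooAgent.sources.Exec.channels = KafkaCh
      fooAgent.sinks.HDFS.channel = KafkaCh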

