Connecting Apache Flume

    Available in VPC

    Apache Flume is a service that efficiently collects large amounts of log data in distributed environments and transfers it to a data store.
    For more information, see the official Apache Flume website.


    • Characteristics

      • Distributed: depending on how the topology is configured, you can build pipelines out of multiple Flume Agents. An Avro Sink and Avro Source (next hop) pair is typically used to connect one Agent to the next.
      • Reliable: inside a Flume Agent, events move through the components called Source, Channel, and Sink. An event is not removed from the Channel until it has been delivered to the Sink; Flume guarantees this with a transactional approach.
      • Available: when a disk-backed Channel is used, the data already delivered from the Source to the Channel can be recovered even if an error occurs in the Agent.
    • Purpose of use

      • From many sources to a centralized store: a Flume Agent can collect logs from multiple nodes and eventually save them to centralized storage.
      • Collecting, aggregating, moving: Agents collect and combine logs. During this process, you can use Selectors and Interceptors to change the form of events.
        The collected events can be forwarded to the next Agent or saved to the final Sink.
    • Component

      • Event: the basic unit of data transferred by a Flume Agent. Optionally, you can attach header values to events; headers are usually used to inspect and change the content of events.
      • Flume Agent: a JVM process that hosts the Source, Channel, and Sink components. Events flow through Agents from an external source to the next-hop destination (see the sketch after this list).
        • Source: consumes the events forwarded from a client. When the Source receives an event, it passes it to one or more Channels.
        • Channel: temporary storage for events. It connects the Source and Sink and plays an important role in guaranteeing the durability of the event flow.
        • Sink: removes events from the Channel and forwards them to the next hop of the flow.
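
    As an illustration of how these components are wired together, the following minimal sketch defines one hypothetical agent (demoAgent) with a netcat Source, a memory Channel, a logger Sink, and a host Interceptor that adds a header to each event. The agent name, bind address, and port are examples only and are not part of this guide's topology.

      # Hypothetical single-agent example: netcat Source -> memory Channel -> logger Sink
      demoAgent.sources = NetSrc
      demoAgent.channels = MemCh
      demoAgent.sinks = LogSink

      # Source: turns each line received on a local TCP port into an event
      demoAgent.sources.NetSrc.type = netcat
      demoAgent.sources.NetSrc.bind = 127.0.0.1
      demoAgent.sources.NetSrc.port = 44444
      demoAgent.sources.NetSrc.channels = MemCh

      # Interceptor: stamps the agent's host name into an event header
      demoAgent.sources.NetSrc.interceptors = hostInt
      demoAgent.sources.NetSrc.interceptors.hostInt.type = host

      # Channel: in-memory buffer between Source and Sink
      demoAgent.channels.MemCh.type = memory
      demoAgent.channels.MemCh.capacity = 1000

      # Sink: writes events to the Flume log, useful for testing
      demoAgent.sinks.LogSink.type = logger
      demoAgent.sinks.LogSink.channel = MemCh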

    This guide explains how to configure a Flume topology to store the server's logs in Cloud Hadoop HDFS.

    Use Flume Agent

    You can configure a Flume topology that uses Flume Agents to collect vmstat results from each server and store them in Cloud Hadoop HDFS, as follows:


    Install Flume

    The following describes how to install Flume Agents.

    1. Create 3 servers from which to collect logs. (See the Create Server guide.)

      • Each server has to be created in the ACG that includes the Cloud Hadoop cluster.
        • log-gen-001 / centos-7.8-64 / 2 vCPU, 8 GB Mem
        • log-gen-002 / centos-7.8-64 / 2 vCPU, 8 GB Mem
        • log-gen-003 / centos-7.8-64 / 2 vCPU, 8 GB Mem
    2. Create the ~/downloads and ~/apps directories. Download the Flume package to ~/downloads, extract it, and move it under ~/apps to complete the installation.

      mkdir ~/downloads ~/apps
      
      cd ~/downloads
      wget https://archive.apache.org/dist/flume/1.9.0/apache-flume-1.9.0-bin.tar.gz
      tar -xvf apache-flume-1.9.0-bin.tar.gz
      
      mv apache-flume-1.9.0-bin ~/apps/
      cd ~/apps
      ln -s apache-flume-1.9.0-bin flume
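
      As a quick sanity check, you can print the installed Flume version, which should report 1.9.0:

      ~/apps/flume/bin/flume-ng version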
      

    Preparations

    Using Cloud Hadoop HDFS as the Sink requires the following preparations:

    1. Preparations for communication between the log servers and HDFS

    For each log server to communicate with the HDFS NameNode hosts, the private IP and host name of each host need to be registered in /etc/hosts.

    You can check this information in /etc/hosts on the Cloud Hadoop edge node (e.g., e-001-xxx).
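
    The entries use the standard /etc/hosts format. The IP addresses and host names below are placeholders only; copy the actual values from the edge node.

      # Placeholder entries; replace with the private IPs and host names from the edge node's /etc/hosts
      10.10.10.11   hadoop-master-001
      10.10.10.12   hadoop-master-002
      10.10.10.13   hadoop-worker-001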

    2. Preparations for using HDFS Sink

    This topology uses an HDFS Sink. The Hadoop common JAR libraries are required on the node where the Flume Agent runs. Configuration files such as hdfs-site.xml and core-site.xml are also necessary to use a name service for NameNode HA.

    • Download Hadoop binary
      Download the Hadoop binary and the .jar libraries you need under ~/apps using the following commands:
    1. Use the following commands for Cloud Hadoop version 1.3:

      # Download Hadoop binary
      wget https://archive.apache.org/dist/hadoop/common/hadoop-2.6.5/hadoop-2.6.5.tar.gz -P ~/apps
      cd ~/apps
      tar xfz hadoop-2.6.5.tar.gz
      
      # Download HDFS Jar
      wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-auth/2.6.5/hadoop-auth-2.6.5.jar -P ~/apps/hadoop-2.6.5/share/hadoop/common
      wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-hdfs/2.6.5/hadoop-hdfs-2.6.5.jar -P ~/apps/hadoop-2.6.5/share/hadoop/common
      wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-hdfs-client/2.6.5/hadoop-hdfs-client-2.6.5.jar -P ~/apps/hadoop-2.6.5/share/hadoop/common
      
    2. Use the following commands for Cloud Hadoop version 1.4 or later:

      # Download Hadoop binary
      wget https://archive.apache.org/dist/hadoop/common/hadoop-3.1.4/hadoop-3.1.4.tar.gz -P ~/apps
      cd ~/apps
      tar xfz hadoop-3.1.4.tar.gz
      
      # Download HDFS Jar
      wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-auth/3.1.4/hadoop-auth-3.1.4.jar -P ~/apps/hadoop-3.1.4/share/hadoop/common
      wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-hdfs/3.1.4/hadoop-hdfs-3.1.4.jar -P ~/apps/hadoop-3.1.4/share/hadoop/common
      wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-hdfs-client/3.1.4/hadoop-hdfs-client-3.1.4.jar -P ~/apps/hadoop-3.1.4/share/hadoop/common
      
    • Set Hadoop config
      Download the Hadoop config files into the Flume configuration directory (~/apps/flume/conf).

      $ cd ~/apps/flume/conf
      $ curl -u $AMBARI_ID:$AMBARI_PASS -G "http://$AMBARI_URI:8080/api/v1/clusters/$CLUSTER_NAME/components?format=client_config_tar" -o client_config.tgz
      $ tar xfz client_config.tgz
      $ rm -f client_config.tgz
      
    • Set Hadoop environment variables
      Run the following commands to set Hadoop environment variables.

      export HADOOP_HOME=~/apps/hadoop-3.1.4
      export HADOOP_HDFS_HOME=$HADOOP_HOME
      export HADOOP_CONF_DIR=~/apps/flume/conf/HDFS_CLIENT/
      export PATH=${JAVA_HOME}/bin:${HADOOP_HOME}/bin:${PATH}
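
      To verify these preparations, you can list the downloaded client configuration files and check that the name service resolves. This is an illustrative check based on the paths used above; HDFS_CLIENT is the directory extracted from the Ambari client configuration archive.

      # Confirm that core-site.xml and hdfs-site.xml were extracted
      ls ~/apps/flume/conf/HDFS_CLIENT/

      # With HADOOP_CONF_DIR set, this should print the cluster's name service (e.g., hdfs://$CLUSTER_NAME)
      hdfs getconf -confKey fs.defaultFS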
      

    Change Flume configurations

    The following describes how to change the Flume configuration.

    1. Run the following commands on each Flume Agent node to create the configuration files.

      cd ~/apps/flume/conf
      cp flume-conf.properties.template flume.conf
      cp flume-env.sh.template flume-env.sh
      cp ~/apps/hadoop-3.1.4/share/hadoop/common/*.jar ~/apps/flume/lib/ 
      cp ~/apps/hadoop-3.1.4/share/hadoop/common/lib/woodstox-core-5.0.3.jar ~/apps/flume/lib/
      mv ~/apps/flume/lib/guava-11.0.2.jar ~/apps/flume/lib/guava-11.0.2.jar.bak 
      
    2. On each Flume Agent node, edit the JAVA_HOME and HADOOP_HOME values in hadoop-env.sh as shown below.

      • The Java path can differ depending on the installation method. The following shows the paths set after installing the Java package with yum.
    # Java installation
    yum install -y java-1.8.0-openjdk
    
    # Edit hadoop-env.sh
    vi $HADOOP_HOME/etc/hadoop/hadoop-env.sh 
    
    # The java implementation to use.  Required.
    export JAVA_HOME=/usr/lib/jvm/jre-openjdk
    # Hadoop home directory
    export HADOOP_HOME=/root/apps/hadoop-3.1.4
    
    • flume.conf

      • Define the Agent name and each component. (Agent name: fooAgent)
      • You can use a path that includes the name service as the HDFS Sink path. In Cloud Hadoop, the cluster name is the name service.
      • Since the NameNode information is included in hdfs-site.xml, you don't have to specify which NameNode is currently active.
      fooAgent.sources = Exec
      fooAgent.channels = MemChannel
      fooAgent.sinks = HDFS
      
      fooAgent.sources.Exec.type = exec
      fooAgent.sources.Exec.command = /usr/bin/vmstat 1
      fooAgent.sources.Exec.channels = MemChannel 
      
      fooAgent.channels.MemChannel.type = memory
      fooAgent.channels.MemChannel.capacity = 10000
      fooAgent.channels.MemChannel.transactionCapacity = 1000
      
      fooAgent.sinks.HDFS.channel = MemChannel
      fooAgent.sinks.HDFS.type = hdfs
      fooAgent.sinks.HDFS.hdfs.path = hdfs://$CLUSTER_NAME/user/hduser/flume/events/
      fooAgent.sinks.HDFS.hdfs.fileType = DataStream
      fooAgent.sinks.HDFS.hdfs.writeFormat = Text
      fooAgent.sinks.HDFS.hdfs.batchSize = 1000
      fooAgent.sinks.HDFS.hdfs.rollSize = 0
      fooAgent.sinks.HDFS.hdfs.rollCount = 10000
      
    • flume-env.sh
      Add the path of the Hadoop client libraries installed in the preparations to FLUME_CLASSPATH.

      export JAVA_HOME="/usr/lib/jvm/jre-openjdk"
      export JAVA_OPTS="-Xms100m -Xmx2000m -Dcom.sun.management.jmxremote"
      export HADOOP_CONF_DIR="/root/apps/flume/conf/HDFS_CLIENT"
      FLUME_CLASSPATH="/root/apps/flume/lib"
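
      The guava JAR is set aside in step 1 because the version bundled with Flume 1.9 can conflict with the Guava version that Hadoop 3.x uses; renaming it to .bak keeps it off the classpath, which only picks up .jar files. The listing below is an illustrative sanity check of the copied libraries:

      # Confirm the Hadoop JARs are present and the bundled guava JAR is disabled
      ls ~/apps/flume/lib/ | grep -E 'hadoop-common|hadoop-hdfs|guava'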
      

    Start process

    1. Create the directory and set ownership

      $ sudo su - hdfs
      $ hdfs dfs -mkdir /user/hduser/flume/events/
      $ hdfs dfs -chown -R sshuser: /user/hduser/flume/events/
      $ exit
      
    2. Use the following command to start each Flume Agent in the foreground. (A sketch of a background start command follows step 3.)

      cd ~/apps/flume/
      
      ./bin/flume-ng agent --conf ./conf/ -f conf/flume.conf -Dflume.root.logger=DEBUG,console -n  fooAgent
      
      ....
      
      20/09/09 12:32:22 INFO hdfs.HDFSDataStream: Serializer = TEXT, UseRawLocalFileSystem = false
      20/09/09 12:32:22 INFO hdfs.BucketWriter: Creating hdfs://xxxxx/user/hduser/flume/events/FlumeData.1599622342911.tmp
      20/09/09 12:32:52 INFO hdfs.HDFSEventSink: Writer callback called.
      20/09/09 12:32:52 INFO hdfs.BucketWriter: Closing hdfs://xxxxx/user/hduser/flume/events/FlumeData.1599622342911.tmp
      20/09/09 12:32:52 INFO hdfs.BucketWriter: Renaming hdfs://xxxxx/user/hduser/flume/events/FlumeData.1599622342911.tmp to hdfs://xxxxx/user/hduser/flume/events/FlumeData.1599622342911
      20/09/09 12:32:55 INFO hdfs.HDFSDataStream: Serializer = TEXT, UseRawLocalFileSystem = false
      20/09/09 12:32:55 INFO hdfs.BucketWriter: Creating hdfs://xxxxx/user/hduser/flume/events/FlumeData.1599622375913.tmp
      20/09/09 12:33:25 INFO hdfs.HDFSEventSink: Writer callback called.
      20/09/09 12:33:25 INFO hdfs.BucketWriter: Closing hdfs://xxxxx/user/hduser/flume/events/FlumeData.1599622375913.tmp
      20/09/09 12:33:25 INFO hdfs.BucketWriter: Renaming hdfs://xxxxx/user/hduser/flume/events/FlumeData.1599622375913.tmp to hdfs://xxxxx/user/hduser/flume/events/FlumeData.1599622375913
      20/09/09 12:33:28 INFO hdfs.HDFSDataStream: Serializer = TEXT, UseRawLocalFileSystem = false
      20/09/09 12:33:28 INFO hdfs.BucketWriter: Creating hdfs://xxxxx/user/hduser/flume/events/FlumeData.1599622408915.tmp
      20/09/09 12:33:58 INFO hdfs.HDFSEventSink: Writer callback called.
      20/09/09 12:33:58 INFO hdfs.BucketWriter: Closing hdfs://xxxxx/user/hduser/flume/events/FlumeData.1599622408915.tmp
      20/09/09 12:33:58 INFO hdfs.BucketWriter: Renaming hdfs://xxxxx/user/hduser/flume/events/FlumeData.1599622408915.tmp to hdfs://xxxxx/user/hduser/flume/events/FlumeData.1599622408915
      20/09/09 12:34:01 INFO hdfs.HDFSDataStream: Serializer = TEXT, UseRawLocalFileSystem = false
      20/09/09 12:34:01 INFO hdfs.BucketWriter: Creating
      
    3. You can use the following command to check the stored files in HDFS.

      $ hadoop fs -ls /user/hduser/flume/events/
      
      Found 17 items
      
      -rw-r--r--   2 root hdfs       3089 2020-09-09 12:25 /user/hduser/flume/events/FlumeData.1599621914876
      -rw-r--r--   2 root hdfs       3093 2020-09-09 12:26 /user/hduser/flume/events/FlumeData.1599621946882
      -rw-r--r--   2 root hdfs       2931 2020-09-09 12:26 /user/hduser/flume/events/FlumeData.1599621979885
      -rw-r--r--   2 root hdfs       3091 2020-09-09 12:27 /user/hduser/flume/events/FlumeData.1599622012888
      -rw-r--r--   2 root hdfs       2931 2020-09-09 12:27 /user/hduser/flume/events/FlumeData.1599622045890
      -rw-r--r--   2 root hdfs       3091 2020-09-09 12:28 /user/hduser/flume/events/FlumeData.1599622078893
      -rw-r--r--   2 root hdfs       2930 2020-09-09 12:29 /user/hduser/flume/events/FlumeData.1599622111895
      -rw-r--r--   2 root hdfs       3093 2020-09-09 12:29 /user/hduser/flume/events/FlumeData.1599622144897
      -rw-r--r--   2 root hdfs       3092 2020-09-09 12:30 /user/hduser/flume/events/FlumeData.1599622177899
      -rw-r--r--   2 root hdfs       2931 2020-09-09 12:30 /user/hduser/flume/events/FlumeData.1599622210902
      -rw-r--r--   2 root hdfs       3093 2020-09-09 12:31 /user/hduser/flume/events/FlumeData.1599622243904
      -rw-r--r--   2 root hdfs       2932 2020-09-09 12:31 /user/hduser/flume/events/FlumeData.1599622276906
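
    If the agent needs to keep running after the SSH session ends, it can be started in the background instead of with the foreground command in step 2. This is a minimal sketch; the output file name is an arbitrary choice, and the LOGFILE logger relies on the log4j.properties shipped in the Flume conf directory.

      cd ~/apps/flume/

      # Start the agent in the background and keep it alive after logout
      nohup ./bin/flume-ng agent --conf ./conf/ -f conf/flume.conf \
          -Dflume.root.logger=INFO,LOGFILE -n fooAgent > flume-agent.out 2>&1 &

      # Confirm that the agent process is running
      ps -ef | grep flume-ng | grep -v grep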
      
    Note

    In an actual production environment, two or more Flume Agents are usually piped together, and Interceptors are used to transform events.
    There are various types of Sources, Channels, and Sinks; Kafka is often used as a Channel or Sink (see the sketch below).
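
    For reference, a Kafka-backed Channel for the agent above could be declared in the same properties format. This is only a sketch: the broker addresses, topic, and consumer group are placeholders, and Kafka itself is not part of this guide's topology.

      # Hypothetical replacement of MemChannel with a Kafka channel for added durability
      fooAgent.channels = KafkaCh
      fooAgent.channels.KafkaCh.type = org.apache.flume.channel.kafka.KafkaChannel
      fooAgent.channels.KafkaCh.kafka.bootstrap.servers = kafka-001:9092,kafka-002:9092
      fooAgent.channels.KafkaCh.kafka.topic = flume-channel
      fooAgent.channels.KafkaCh.kafka.consumer.group.id = flume-hdfs-sink

      # The Source and Sink then reference the new channel
      fooAgent.sources.Exec.channels = KafkaCh
      fooAgent.sinks.HDFS.channel = KafkaCh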

