Wednesday, February 19, 2014

Install Apache Spark for Fast Log Analytics


Spark is a cool and fast tool: a processing layer that runs on top of Hadoop HDFS.
I tried it out after installation, and it really is fast, thanks to its advanced caching and job scheduling for distributed computation.
In this example, I count the occurrences of the IP "121.242.255.20" in one month of the nguyentantrieu.info access log (12 MB).
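Once everything below is installed, that whole job is only a few lines in the Spark shell. Here is a minimal sketch in Scala; the file name access.log is a placeholder for wherever your own log lives:

 // Load the access log as an RDD of lines and count the matching ones
 val log = sc.textFile("access.log")
 val hits = log.filter(line => line.contains("121.242.255.20")).count()
 println("Requests from 121.242.255.20: " + hits)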

Testing Apache Spark 
Installing Apache Spark only involves a few simple steps:
  1. Install Java
  2. Install Hadoop
  3. Install Scala
  4. Install Spark
Install Java on Ubuntu
Java can be installed as shown in this howto:
  • sudo add-apt-repository ppa:webupd8team/java 
  • sudo apt-get update
  • sudo apt-get install oracle-java7-installer
After installation, you can test whether it works by typing java -version at the command prompt. This should print the installed Java version.

Install Hadoop on Ubuntu
Hadoop can simply be installed by downloading a .deb file:
  • go to http://www.apache.org/dyn/closer.cgi/hadoop/common/ and choose a mirror
  • choose the Hadoop version you prefer (I chose hadoop-1.2.1)
  • download the matching .deb file (I chose hadoop_1.2.1-1_x86_64.deb)
When you open it, the Ubuntu Software Center launches and installs it.
After installation, you can check that it works by typing hadoop at the command prompt. It should print some usage information for the hadoop command.

After installing Hadoop, open /etc/hadoop/hadoop-env.sh and change the line:
 export JAVA_HOME=/usr/lib/jvm/java-6-sun
into
 #export JAVA_HOME=/usr/lib/jvm/java-6-sun
Commenting it out removes the hard-coded java-6-sun path, so Hadoop falls back to the JAVA_HOME from your environment (the Oracle Java 7 installed above).

Install Scala on Ubuntu
Download Scala from http://scala-lang.org/ and save it somewhere you can find it (e.g. ~/).
Then, at the command prompt, type:
  • cd /usr/share
  • sudo tar -zxf <location and name of the tgz file> (e.g. sudo tar -zxf ~/scala-2.10.3.tgz)
  • link (ln -s) the executables to the /usr/bin location, e.g.:
  • sudo ln -s /usr/share/scala-2.10.3/bin/scala /usr/bin/scala
  • sudo ln -s /usr/share/scala-2.10.3/bin/scalac /usr/bin/scalac
  • sudo ln -s /usr/share/scala-2.10.3/bin/fsc /usr/bin/fsc
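To confirm that the compiler and runner are wired up correctly, you can compile and run a one-line program (a minimal sketch; the file name Hello.scala is arbitrary):

 // Hello.scala - smoke test for the scalac/scala symlinks
 object Hello {
   def main(args: Array[String]): Unit =
     println("Scala " + util.Properties.versionString + " is working")
 }

Compile and run it with scalac Hello.scala && scala Hello.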
Install Spark on Ubuntu
Getting Spark up and running is easy, as described on http://spark.incubator.apache.org/docs/latest/:
  • Go to http://spark.incubator.apache.org/downloads.html and download Spark.
  • Unpack it at a preferred location.
  • Go to your Spark home directory in a terminal and type: sbt/sbt assembly
  • You can start Spark by executing ./spark-shell in the Spark home directory.
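Once the shell is up, a trivial job confirms that everything works end to end. A minimal sketch, typed at the scala> prompt (sc is the SparkContext the shell creates for you):

 // Distribute the numbers 1..1000 across workers and count them back
 val data = sc.parallelize(1 to 1000)
 data.count()   // should return 1000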
Note: log4j is still configured so that all logging messages appear in your main window. You can redirect the messages to a standard log file by creating a file log/log4j.properties with the following content:
log4j.rootLogger = DEBUG, A1
log4j.appender.A1=org.apache.log4j.RollingFileAppender
log4j.appender.A1.File=SparkLog.log
log4j.appender.A1.MaxFileSize = 100KB
log4j.appender.A1.layout=org.apache.log4j.PatternLayout
log4j.appender.A1.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Ignore messages below warning level from Jetty, because it's a bit verbose
log4j.logger.org.eclipse.jetty=WARN

Now you are ready to make some Sparks!