Big Data Vietnam: February 2014

Wednesday, February 26, 2014

Emergency Notification System: Applying the Reactive Lambda Architecture

An emergency notification system that applies the Reactive Lambda Architecture.

http://nguyentantrieu.info/blog/the-slide-for-techcamp-vn-2014

Monday, February 24, 2014

Responsive Big Data?

Personalized homepage

Responsive Web Design

Wednesday, February 19, 2014

Install Apache Spark and Fast Log Analytics

Spark is a cool, fast tool: a processing layer on top of Hadoop HDFS.

I tried it right after installation, and it is really fast thanks to advanced caching and job scheduling in a distributed computation system.

In this example, I count the occurrences of the IP "121.242.255.20" in one month of the nguyentantrieu.info access log (12 MB).
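For comparison, the same count can be sketched in plain Python without Spark. This is a minimal sketch, not the Spark job used in the test: it assumes the access log is in the common/combined format, where the client IP is the first whitespace-separated field. The names count_ip and count_ip_in_file are hypothetical.

```python
from typing import Iterable


def count_ip(lines: Iterable[str], ip: str) -> int:
    """Count log lines whose first field (assumed to be the client IP) equals `ip`."""
    count = 0
    for line in lines:
        fields = line.split(None, 1)  # split off the first whitespace-separated field
        if fields and fields[0] == ip:
            count += 1
    return count


def count_ip_in_file(path: str, ip: str) -> int:
    """Stream a log file line by line, so a large log never has to fit in memory."""
    with open(path, encoding="utf-8", errors="replace") as log:
        return count_ip(log, ip)
```

Calling count_ip_in_file with the path to your own access log and "121.242.255.20" gives the number of requests from that address. Comparing the whole first field, rather than searching for a substring, avoids counting addresses such as 121.242.255.200 by mistake.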

Testing Apache Spark

Installing Apache Spark involves only a few simple steps:

Install Java
Install Hadoop
Install Scala
Install Spark

Install Java on Ubuntu

Java can be installed as shown in this how-to:

sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java7-installer

After installation, you can test whether it works by typing java -version at the command prompt. This should print the installed Java version.

Install Hadoop on Ubuntu

Hadoop can be installed simply by downloading a .deb file:
Go to http://www.apache.org/dyn/closer.cgi/hadoop/common/ and choose a mirror.
Choose the Hadoop version you prefer (I chose hadoop-1.2.1).
Download the .deb file (I chose hadoop_1.2.1-1_x86_64.deb).

Alternatively, install it manually by following http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster

When you open the file, the Ubuntu Software Center launches to install it.

After installation, you can see if it works by typing hadoop at the command prompt. It should give you some information about using the hadoop command.

After installing Hadoop, open /etc/hadoop/hadoop-env.sh and comment out the JAVA_HOME line, changing:

export JAVA_HOME=/usr/lib/jvm/java-6-sun

into

#export JAVA_HOME=/usr/lib/jvm/java-6-sun

Install Scala on Ubuntu

Follow the steps as presented on this page:

Download Scala from http://scala-lang.org/ and save it somewhere you can find it (e.g. ~/)

At the command prompt, type:

cd /usr/share
sudo tar -zxf <location and name of the tgz file> (e.g. sudo tar -zxf ~/scala-2.10.3.tgz)
Then link (ln -s) the executables into /usr/bin, e.g.:
sudo ln -s /usr/share/scala-2.10.3/bin/scala /usr/bin/scala
sudo ln -s /usr/share/scala-2.10.3/bin/scalac /usr/bin/scalac
sudo ln -s /usr/share/scala-2.10.3/bin/fsc /usr/bin/fsc
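Before moving on to Spark, it can help to confirm that the prerequisites installed so far are actually on the PATH. The following is a minimal sketch using Python's standard shutil.which; the function name missing_tools is hypothetical, and the tool names java, hadoop and scala are the commands installed in the steps above.

```python
import shutil
from typing import Iterable, List


def missing_tools(tools: Iterable[str]) -> List[str]:
    """Return the commands from `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

Calling missing_tools(["java", "hadoop", "scala"]) returns an empty list when all three commands are reachable, and otherwise lists the ones that still need to be installed or linked.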

Installing Spark on Ubuntu

Getting Spark up and running is easy, as described at http://spark.incubator.apache.org/docs/latest/:
Go to http://spark.incubator.apache.org/downloads.html and download Spark.
Unpack it at a preferred location.
Go to your Spark home directory in a terminal and type: sbt/sbt assembly
Start the Spark shell by running ./spark-shell in the Spark home (./bin/spark-shell in Spark 0.9 and later).

Note: by default, log4j is configured so that all log messages appear in your terminal window. You can redirect them to a log file by creating the file conf/log4j.properties with the following content:

log4j.rootLogger=DEBUG, A1
log4j.appender.A1=org.apache.log4j.RollingFileAppender
log4j.appender.A1.File=SparkLog.log
log4j.appender.A1.MaxFileSize=100KB
log4j.appender.A1.layout=org.apache.log4j.PatternLayout
log4j.appender.A1.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
# Ignore messages below warning level from Jetty, because it's a bit verbose
log4j.logger.org.eclipse.jetty=WARN

Now you are ready to make some Sparks

Tuesday, February 18, 2014

new tools for reactive => creative big data

These are my ideas, drawn from several open source projects that make the world of big data more real-time, cooler, more accessible and more usable.
I am just "connecting the dots" to create a new tool.

Name:

Reactive Functor Framework / Platform (aka: Rfx Framework)

Goals (RFVA):

Reactive to data with logic rules, fuzzy-logic rules (RxSQL) and stream algorithms
Full Stack (backend+frontend) real-time big data framework

Backend: Data Crawler + Kafka + Netty + Redis + Hadoop Tools
Frontend: Groovy + AngularJS + D3 + Bootstrap

Visualize data with Accessibility and Usability (D3.js)
Agent-based Processing (Akka): social network simulation problems

Target audience:

Big Data Developers
Data Scientists
Data Analysts
All users who want to play with stream data science on social networks

Problems & Domains:

Social Media Research (Facebook Graph + News) for data-driven marketing
Humanity issues (data science)
Social Science (Classical Statistics with stream data from news, social data)
Real-time Data-Driven Business
Time series data visualization

Wednesday, February 12, 2014

Reactive Real-time Big Data System at techcamp.vn

What I will cover at http://techcamp.vn

Vote for my topic at http://techcamp.vn/voting/topic/view/id/21/title/Reactive+Real-time+Big+Data+with+Open+Source+Lambda+Architecture

The mission: "Help developers and data scientists take the opportunity to build a reputation as the creator of new information experiences."

The story of “Reactive Real-time Big Data System”

Concepts, Implementations & Practices

This story is about:

Reactive
Real-time
Big Data
System
Our small worlds

Breaking down stories into sub-stories:

Short history: how big data was born
Problems: what are the issues?
Demands: what do we need?
Dreams: what do we dream about?
Supply: solutions (frameworks, patterns, platforms, best practices)
Real-time data-driven business

Real story (a non-fiction story about Data-Driven Marketing):

The small world of Flappy Bird with Active Functor Framework

subscribe * from Article where title contains [‘Flappy Bird’] and Facebook stats (like + share + comment) > 1000
subscribe * from my Facebook feeds where my Facebook friends shared and domain contains [‘diadiemanuong.com’] or title contains [‘caffee’]
subscribe * from Article where I could like and category in [‘big data’, ‘computer’, ‘mobile’]
subscribe * from my Facebook feeds where my Facebook friends shared and title contains ‘Fast and furious’
subscribe, visualize places from my Facebook feeds where my wife and I took photos
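To make the first subscription rule concrete, here is a minimal Python sketch of how such a rule could filter a stream of articles. It is not the framework's implementation: the Article class, its fields, and the functions matches and subscribe are all hypothetical, and case-insensitive title matching is an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class Article:
    """Hypothetical article record with its Facebook engagement counts."""
    title: str
    likes: int = 0
    shares: int = 0
    comments: int = 0


def matches(article: Article, keywords: List[str], min_stats: int) -> bool:
    """True if the title contains any keyword and like + share + comment > min_stats."""
    title = article.title.lower()  # assumption: matching ignores case
    has_keyword = any(keyword.lower() in title for keyword in keywords)
    stats = article.likes + article.shares + article.comments
    return has_keyword and stats > min_stats


def subscribe(stream: Iterable[Article], keywords: List[str],
              min_stats: int) -> Iterator[Article]:
    """Lazily yield matching articles as they arrive on the stream."""
    for article in stream:
        if matches(article, keywords, min_stats):
            yield article
```

With this sketch, subscribe(stream, ["Flappy Bird"], 1000) corresponds to the first rule above. Because subscribe is a generator, it reacts to each article as it arrives instead of waiting for the whole stream.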

What’s next?

Thursday, February 6, 2014

Harnessing the Power of Big Data for Media (Journalism/Content)

https://reutersinstitute.politics.ox.ac.uk/about/news/item/article/big-data-for-media-conference-2014.html

Mainstream media and the distribution of news
https://reutersinstitute.politics.ox.ac.uk/sites/default/files/Mainstream%20media%20and%20the%20distribution%20of%20news%20in%20the%20age%20of%20social%20discovery.pdf

Discovering Company Revenue Relations from News: A Network Approach

http://home.business.utah.edu/actgp/Papers/Discovering%20Company%20Revenue%20Relations%20from%20News.pdf

Real Time News Analysis for Improved Social Relationship Discovery

http://www.dtic.mil/dtic/tr/fulltext/u2/a481369.pdf

Semi-supervised Statistical Inference for Business Entities Extraction and Business Relations Discovery

http://research.microsoft.com/en-us/um/beijing/events/eos2011/11.pdf

my improvements for Apache Kafka

Apache Kafka https://kafka.apache.org/ , an open source project from LinkedIn
My forked version https://bitbucket.org/trieunt/kafka/overview
Original paper: http://sites.computer.org/debull/A12june/pipeline.pdf

What I can do:

Upgrade to the latest Scala version (2.10.3), with which it compiles OK
Notify when Kafka's producer has finished writing, which should make it faster
Real-time indexing and search of Kafka messages and offsets using Apache Lucene http://lucene.apache.org/core/
Real-time monitoring and alerting using Redis http://redis.io/

Here is the first attempt, after one day of hacking:

Reactive Real-time Big Data with Open Source Lambda Architecture Stack

5 Whys

1) Why "Reactive"? http://www.reactivemanifesto.org/

react to events
react to load
react to failure
react to users

2) Why "Real-time"? http://en.wikipedia.org/wiki/Real-time
3) Why "Big Data"? http://www.bigdata-startups.com/best-practices/
4) Why "Open Source"?

Security, Quality, Customizability, Freedom, Flexibility, Interoperability, Auditability, Support Options, Cost, Try Before You Buy

http://www.pcworld.com/article/209891/10_reasons_open_source_is_good_for_business.html
http://www.redhat.com/about/whoisredhat/opensource.html

5) Why "Lambda Architecture"?
http://www.slideshare.net/tantrieuf31/lambda-architecture-for-real-time-big-data

The list of open source frameworks and tools I have tried:
● Netty (http://netty.io/) a framework that uses the reactive programming pattern to make scaling HTTP systems easier, originally from JBoss http://www.jboss.org
https://blog.twitter.com/2013/netty-4-at-twitter-reduced-gc-overhead

● Apache Kafka (http://kafka.apache.org/) publish-subscribe messaging rethought as a distributed commit log, open sourced by LinkedIn.
http://www.slideshare.net/amywtang/building-a-realtime-data-pipeline-apache-kafka-at-linked-in

● Storm (http://storm-project.net/) a distributed real-time computation system, open sourced by Twitter
http://www.quora.com/Apache-Storm/What-are-some-of-the-use-cases-of-Apache-Storm

● Akka http://akka.io/ (Actor Model), a toolkit and runtime for building highly concurrent, distributed, and fault tolerant event-driven applications on the JVM.
More use cases at http://doc.akka.io/docs/akka/2.2.3/intro/use-cases.html

● Redis (http://redis.io/) an advanced in-memory key-value NoSQL database; all the fast statistical computations happen here.
http://openmymind.net/redis.pdf
http://www.manning.com/carlson/

● OrientDB, an Open Source NoSQL DBMS with the features of both Document and Graph DBMSs for KPI Report Data Management http://pettergraff.blogspot.it/2014/01/getting-started-with-orientdb.html

● Groovy http://groovy.codehaus.org/ and Grails http://grails.org/ for the scripting layer on the JVM, ad-hoc queries on Redis, and the front end

● Hadoop ecosystem http://hadoop.apache.org/ : HDFS, Hive, HBase for batch processing

● RxJava https://github.com/Netflix/RxJava a library for composing asynchronous and event-based programs
https://www.coursera.org/course/reactive

● Hystrix https://github.com/Netflix/Hystrix a latency and fault tolerance library for distributed systems
http://techblog.netflix.com/2012/11/hystrix.html

● NVD3 Reusable D3 Chart http://nvd3.org http://d3js.org/
http://techslides.com/over-1000-d3-js-examples-and-demos/
https://github.com/anvaka/VivaGraphJS
