Estimate the number of unique words in a data stream, using this URL as the source: http://en.wikipedia.org/wiki/List_of_United_States_counties_and_county_equivalents
http://redis.io/commands#hyperloglog
AddThis open-source stream library (stream-lib)
https://github.com/addthis/stream-lib
HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm
Original Paper: http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf
Mining Data Streams
Slide: http://www.stanford.edu/class/cs246/slides/16-streams.pdf
Applicable Problems:
- Estimating the number of unique elements in a continuous data stream
- Cardinality estimation over data sets too large to count exactly in memory
- Networking and traffic monitoring: detecting worm propagation, network attacks (e.g., Denial of Service), and link-based spam on the web
- Counting distinct active flows, an important indicator for detecting attacks and monitoring traffic
Further Reading:
- http://antirez.com/news/75
- http://blog.aggregateknowledge.com/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/
- http://research.google.com/pubs/pub40671.html (HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm)
- How can sliding HyperLogLog and EWMA detect port scan attacks in IP traffic? http://jis.eurasipjournals.com/content/2014/1/5
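
To make the unique-element estimation above concrete, here is a minimal Python sketch of the HyperLogLog idea from the original paper. It is an illustration under stated assumptions, not the Redis or stream-lib implementation: the class name `HyperLogLog`, the helper `estimate_unique_words`, the precision parameter `p`, and the use of SHA-1 as the hash function are all choices made for this example only.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog sketch with m = 2**p registers (illustrative only)."""

    def __init__(self, p=12):
        if not 4 <= p <= 16:
            raise ValueError("p must be between 4 and 16")
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m
        # Bias-correction constant alpha_m as given in the original paper.
        if self.m == 16:
            self.alpha = 0.673
        elif self.m == 32:
            self.alpha = 0.697
        elif self.m == 64:
            self.alpha = 0.709
        else:
            self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        # Assumption: the first 64 bits of SHA-1 stand in for a uniform hash.
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        # The top p bits select a register.
        idx = h >> (64 - self.p)
        # The remaining (64 - p) bits give the rank: the position of the
        # leftmost 1-bit, counting from 1 (all zeros -> 64 - p + 1).
        rest = h & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def count(self):
        # Raw estimate: normalized harmonic mean of 2**register values.
        raw = self.alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        # Small-range correction: fall back to linear counting while
        # some registers are still empty.
        if raw <= 2.5 * self.m:
            zeros = self.registers.count(0)
            if zeros:
                return self.m * math.log(self.m / zeros)
        return raw


def estimate_unique_words(words, p=12):
    """Feed an iterable of words into the sketch and return the rounded estimate."""
    hll = HyperLogLog(p)
    for word in words:
        hll.add(word.lower())
    return round(hll.count())
```

With `p=12` the sketch keeps 4096 small registers regardless of stream length, and the paper gives a standard error of roughly 1.04 / sqrt(m), about 1.6% here. The large-range correction from the paper is omitted because a 64-bit hash makes collisions negligible at these sizes.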