Advanced Analytics with Spark: Patterns for Learning from Data at Scale

Advanced Analytics with Spark: Patterns for Learning from Data at Scale by Sandy Ryza

I don’t like to think I have many regrets, but it’s hard to believe anything good came out of a particular lazy moment in 2011 when I was looking into how to best distribute tough discrete optimization problems over clusters of computers. My advisor explained this newfangled Spark thing he had heard of, and I basically wrote off the concept as too good to be true and promptly got back to writing my undergrad thesis in MapReduce. Since then, Spark and I have both matured a bit, but one of us has seen a meteoric rise that’s nearly impossible to avoid making “ignite” puns about. Cut to two years later, and it has become crystal clear that Spark is something worth paying attention to.


Spark’s long lineage of predecessors, running from MPI to MapReduce, makes it possible to write programs that take advantage of massive resources while abstracting away the nitty-gritty details of distributed systems. As much as data processing needs have motivated the development of these frameworks, in a way the field of big data has become so related to these frameworks that its scope is defined by what these frameworks can handle. Spark’s promise is to take this a little further—to make writing distributed programs feel like writing regular programs.


Spark will be great at giving ETL pipelines huge boosts in performance and easing some of the pain that feeds the MapReduce programmer’s daily chant of despair (“why? whyyyyy?”) to the Hadoop gods. But the exciting thing for me about it has always been what it opens up for complex analytics. With a paradigm that supports iterative algorithms and interactive exploration, Spark is finally an open source framework that allows a data scientist to be productive with large data sets.


I think the best way to teach data science is by example. To that end, my colleagues and I have put together a book of applications, trying to touch on the interactions between the most common algorithms, data sets, and design patterns in large-scale analytics. This book isn’t meant to be read cover to cover. Page to a chapter that looks like something you’re trying to accomplish, or that simply ignites your interest.



Licenses:

  • CC BY-NC-SA 3.0 PH
  • The author's reference is not required
