Machine Learning with PySpark. 2nd Edition
With Natural Language Processing and Recommender Systems, by Pramod Singh
I am going to be very honest with you. When I signed the contract to write this second edition, I thought it would be a bit easier to write, but I couldn't have been more wrong. It took me a significant amount of time to complete the chapters. What I have come to realize is that it's never easy to break down a thought process and put it on paper in the most convincing manner. There are many retries in that process, but what helped was the foundation, or blueprint, already established in the first edition of this book. The main challenge was to figure out how I could make this book more relevant and useful for readers. There are already so many books on this subject that this one might just end up as another book on the shelf.
To find the answer, I spent a lot of time thinking and going through the messages I received from the many people who read the first edition of the book. After a while, a few patterns started to emerge. The first realization was that data continues to be generated at a much faster pace. The basic premise of the first edition was that a data scientist needs to be familiar with at least one big data framework in order to handle ML work at scale. That requires gradually moving away from libraries like sklearn, which have certain limitations in handling large datasets. This is still highly relevant today, as businesses want to leverage as much data as possible to derive powerful and significant insights. Hence, people would be excited to learn new things about the Spark framework.
Most of the books published on this subject were either too detailed or lacked a high-level overview. Readers would find the start easy but, after a couple of chapters, would begin to feel overwhelmed as the content became too technical. As a result, they would give up without getting enough out of the book. That’s why I wanted to write a book that demonstrates the different ways of using Machine Learning without going too deep, yet still captures the complete methodology for building an ML model from scratch.
Another issue that I wanted to address in this edition is the development environment. It was evident that many people struggled to set up the right environment on their local machines to install Spark properly, and ran into a lot of issues. Hence, I wrote this edition using Databricks as the core development platform, which is easy to access and does not require setting up anything on the local system. The best thing about Databricks is that it provides a platform to code in multiple languages such as Python, R, and Scala. The other extension in this edition is that the codebase demonstrates end-to-end development of ML models, including automating the intermediate steps using Spark pipelines. The libraries used are from the latest Spark version.
This book is divided into three sections. The first section covers how to access Databricks and alternate ways to use Spark. It goes into the architecture of the Spark framework, along with an introduction to Machine Learning. The second section focuses on the details of different machine learning algorithms and on executing end-to-end pipelines for different use cases in PySpark. The algorithms are explained in simple terms for anyone to read and understand. The datasets used in the book are relatively small in scale, but the overall process and steps remain the same on big data as well. The third and final section showcases how to build a distributed recommender system and how to perform Natural Language Processing in PySpark. The bonus part covers creating and visualizing sequence embeddings in PySpark.
This book might also be relevant for data analysts and data engineers, as it covers the steps of big data processing using PySpark. Readers who want to make a transition to the data science and Machine Learning field should find this book an easy place to start and can gradually take up more complicated material later. The case studies and examples given in the book make it easy to follow along and understand the fundamental concepts. Moreover, there are few books available on PySpark, and this book should add value for readers looking to upskill. The strength of this book lies in explaining the Machine Learning algorithms in the simplest possible way and taking a practical approach toward building and training them using PySpark.
I have put all of my experience and learnings into this book and feel it is closely aligned with what readers are seeking, whether to upskill or to solve ML problems. I hope you find some useful takeaways in this book.