Machine learning with Apache Spark

Installing Apache Spark on Ubuntu

  1. Download a prebuilt release from the project page.
  2. Extract the tarball into a directory. For example, I put all my custom stuff in `~/bin`.
  3. Add the `bin` directory of the extracted tree to your `PATH` variable.
  4. Running `which spark-shell` should print the correct path. Restart your shell if it doesn't. (Example commands are sketched after this list.)
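
For concreteness, here's roughly what those steps look like in a shell. The release file name below is a placeholder; substitute whatever version you actually downloaded:

```sh
# Extract the downloaded release into ~/bin (file name is a placeholder).
cd ~/bin
tar xzf ~/Downloads/spark-x.y.z-bin-hadoopN.tgz

# Put Spark's bin directory on the PATH; add this line to ~/.bashrc to make it stick.
export PATH="$HOME/bin/spark-x.y.z-bin-hadoopN/bin:$PATH"

# Sanity check: this should print a path under ~/bin.
which spark-shell
```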

Books

  • Fast Data Processing with Spark
  • Scala for Machine Learning
  • Learning Spark

Logistic Regression on Apache Spark

Logistic regression, in its simplest form, is a classification tool. You train the regression model to divide the training data into separate buckets, the simplest and most common case being two buckets. These can be Yes/No, A/B, 0/1, Buy/Sell, Buy/Do_not_buy, etc. Once the model has finished training, you feed it new data and it uses its training to identify which bucket that new data belongs in.
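
In MLlib's RDD-based API that train-then-predict cycle is only a few lines. Here's a minimal sketch with made-up feature values (not the actual Titanic features):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

object LogitSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("logit-sketch").setMaster("local[*]"))

    // Toy training data: a label (the bucket, 0.0 or 1.0) plus a feature vector.
    val training = sc.parallelize(Seq(
      LabeledPoint(1.0, Vectors.dense(1.0, 38.0)),
      LabeledPoint(0.0, Vectors.dense(3.0, 22.0)),
      LabeledPoint(1.0, Vectors.dense(2.0, 14.0)),
      LabeledPoint(0.0, Vectors.dense(3.0, 40.0))
    ))

    // Train a two-bucket (binary) logistic regression model.
    val model = new LogisticRegressionWithLBFGS().setNumClasses(2).run(training)

    // Feed the trained model new data; it returns the bucket it predicts (0.0 or 1.0).
    println(model.predict(Vectors.dense(1.0, 30.0)))

    sc.stop()
  }
}
```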

Apache Spark’s MLlib is a machine-learning library that implements a number of machine-learning algorithms, one of which is logistic regression. I’ve recently discovered Kaggle, and I’m going to run their Titanic training data through the MLlib logit trainer to predict whether a particular passenger of the Titanic would be expected to live or die, based on data about them such as age, sex, and ticket class.
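
Most of the work is turning the Kaggle CSV into `LabeledPoint`s the trainer can consume. The snippet below is a rough sketch of that step, not the code from the repository linked below; the column positions match Kaggle's `train.csv`, and the constant I plug in for missing ages is an arbitrary choice:

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Assumes the SparkContext `sc` from the previous sketch.
// Split on commas outside double quotes: the Name column is quoted
// and itself contains a comma ("Braund, Mr. Owen Harris").
val csvSplit = ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)"

val raw = sc.textFile("train.csv")
val header = raw.first()
val trainingData = raw.filter(_ != header).map { line =>
  val cols = line.split(csvSplit, -1)
  val survived = cols(1).toDouble                                 // label: 1.0 lived, 0.0 died
  val pclass   = cols(2).toDouble                                 // ticket class: 1, 2 or 3
  val sex      = if (cols(4) == "male") 1.0 else 0.0
  val age      = if (cols(5).isEmpty) 30.0 else cols(5).toDouble  // crude guess for missing ages
  LabeledPoint(survived, Vectors.dense(pclass, sex, age))
}
// trainingData can now be fed to LogisticRegressionWithLBFGS as above.
```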

The code for this, commented, is available on GitHub.

Things learned

  • Scala

    This is the first thing I’ve ever done with Scala. Apache Spark is implemented in Scala, so it made sense to learn Scala for this. It’s a fairly simple language to get into. I would like to read more Scala code, so I can learn the conventions of that language community. I like that it has a lot less boilerplate than Java, but I did miss Clojure while writing this Spark script. Scala’s use of OOP with functional idioms makes it feel a lot like Ruby, at least at this stage of my experience with it.

  • Apache Spark

    I’m very interested in learning more about this program. Ever since using Ruby, I’ve grown to appreciate using existing libraries and leveraging the efforts of the community instead of rolling everything on my own, as I did with Common Lisp. I like that I can trust the Apache Foundation to maintain and extend the Spark software. I like that it abstracts away the parallelization of working through the data (as sketched after this list). I like that it has built-in facilities to scale horizontally over a cluster. I like that it supports Apache Mesos out of the box. I like that I don’t have to write the code to do all that myself.
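
To illustrate that abstraction: the data-processing code is identical whether the job runs on one laptop core or a whole cluster; only the master URL changes. The input file and Mesos host below are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// "local[*]" uses every local core; swapping in "mesos://host:5050"
// (placeholder host) would run the same job on a Mesos cluster.
val conf = new SparkConf().setAppName("word-count").setMaster("local[*]")
val sc = new SparkContext(conf)

// Spark partitions the file and parallelizes the map and count for us.
val wordCount = sc.textFile("data.txt")   // placeholder input file
  .flatMap(_.split("\\s+"))
  .filter(_.nonEmpty)
  .count()
println(s"Words: $wordCount")

sc.stop()
```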

Next steps

Do more Spark. I’m going to keep running through the various competitions on Kaggle and switch up the different ML systems I use. I feel it’s going to help me get a lot better with Scala, learn more about the Spark way of doing things, and teach me more about the ML techniques out there that are popular enough to have made it into the Spark project (and that lend themselves to processing huge amounts of data).
