MOOC On Apache Spark
Written by Alex Denham   
Thursday, 28 May 2015

If you want to apply data science techniques using parallel programming in Apache Spark, you'll be interested in an edX course starting Monday June 1st that prepares you for the Spark Certified Developer exam.


CS 100.1x Introduction to Big Data with Apache Spark is a 5-week course at Intermediate level under the auspices of UC BerkeleyX, Berkeley's online course outfit, and sponsored by Databricks, a company founded by the creators of Apache Spark.

It will be taught by Anthony D Joseph, who is both Professor in Electrical Engineering and Computer Science and Technical Adviser at Databricks.

With a required effort of 5-7 hours per week (around 30 hours in total), students will learn how to:

  • Use Apache Spark to perform data analysis

  • Use parallel programming to explore data sets

  • Apply Log Mining, Textual Entity Recognition and Collaborative Filtering to real-world data questions

  • Prepare for the Spark Certified Developer exam

The Spark Certified Developer exam is offered by Databricks in conjunction with O'Reilly at a cost of $300. It can be taken in person during sessions at Strata events or online from your computer.

This certification enables you to:


  • Demonstrate industry recognized validation for your expertise.
  • Meet global standards required to ensure compatibility between Spark applications and distributions.
  • Stay up to date with the latest advances and training in Spark.
  • Become an integral part of the growing Spark developer community.

Of course, you don't have to take this certification and can use this MOOC simply to extend your knowledge of data science. It is part of a two-module Big Data XSeries, with the other module being CS 190.1x: Scalable Machine Learning, which starts on June 29.



According to its rubric:

This course will attempt to articulate the expected output of Data Scientists and then teach students how to use PySpark (part of Apache Spark) to deliver against these expectations. The course assignments include Log Mining, Textual Entity Recognition, Collaborative Filtering exercises that teach students how to manipulate data sets using parallel processing with PySpark.

Because all exercises will use PySpark (part of Apache Spark), you need either experience with Python or to take a free online Python mini-course supplied by UC Berkeley.








Last Updated ( Wednesday, 17 August 2016 )