Small Data or Big Data, Your Choice
Written by Sue Gee   
Friday, 11 August 2023

In this digital age we are collecting data at an unprecedented rate and the career opportunities in this field continue to expand. Nanodegrees and Courses from Udacity's School of Data Science start again on August 16th, so it's time to take another look at what's on offer. A new short course on Small Data caught my attention, raising questions about how this differs from big data.

Udacity recently launched a set of Short Online Courses designed to teach practical, job-ready skills in less than a month. We have already explored three of them related to Artificial Intelligence. This time the topic is data and again we have a line-up of three.

Disclosure: When you make a purchase having followed a link from this article, we may earn an affiliate commission.

There was a time when we'd put air quotes around "big data" because the concept seemed new. Now we take massive data sets for granted. They are what provide fodder for machine learning and in turn fuel developments in Artificial Intelligence.

So what is Small Data and why has Udacity developed a new course about it? What will you learn by following the course and what pre-requisites do you need?

According to Udacity's co-founder, Sebastian Thrun:

Small data is a really interesting topic because we now have these universally trained broad AI systems that don't quite know what your task is and you might not have a billion data points on your task. It might just have a small data set, yet.

He goes on to say that it is possible to take the same large models used with massive datasets and tailor them effectively towards a specific task for which you only have a small, but still sufficient, amount of data.
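As a rough illustration of the idea Thrun describes, the Python sketch below keeps a "pretrained" feature extractor frozen and trains only a small classification head on six examples. It is a toy under stated assumptions, not the course's method: the fixed weights in PRETRAINED_WEIGHTS merely stand in for a large model, and the dataset, function names and hyperparameters are all hypothetical.

```python
import math

# ASSUMPTION: these fixed weights stand in for a large model pretrained on
# massive data. They are never updated during training (the "frozen" part).
PRETRAINED_WEIGHTS = [[1.0, -1.0], [0.5, 0.5]]


def extract_features(x):
    """Frozen feature extractor: a linear map followed by tanh."""
    return [
        math.tanh(sum(w * xi for w, xi in zip(row, x)))
        for row in PRETRAINED_WEIGHTS
    ]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_head(data, epochs=500, lr=0.5):
    """Train only a small logistic-regression head on the frozen features."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in data:
            features = extract_features(x)
            p = sigmoid(sum(w * f for w, f in zip(weights, features)) + bias)
            err = p - y  # gradient of the log loss with respect to the logit
            weights = [w - lr * err * f for w, f in zip(weights, features)]
            bias -= lr * err
    return weights, bias


def predict(head, x):
    weights, bias = head
    features = extract_features(x)
    score = sum(w * f for w, f in zip(weights, features)) + bias
    return 1 if sigmoid(score) >= 0.5 else 0


# Hypothetical small dataset: label 1 when the first value exceeds the second.
SMALL_DATASET = [
    ([2.0, 0.0], 1), ([1.5, 0.2], 1), ([1.0, -0.5], 1),
    ([0.0, 2.0], 0), ([0.2, 1.5], 0), ([-0.5, 1.0], 0),
]

head = train_head(SMALL_DATASET)
print(predict(head, [3.0, 1.0]), predict(head, [1.0, 3.0]))
```

Only the three head parameters are learned here, which is why a handful of examples can be enough. In a real setting the frozen part would be a large network loaded from a library, and some of its later layers might also be fine-tuned.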

Small Data is a 1-month course at intermediate level that requires learners to have intermediate Python and Machine Learning knowledge. Graduates of the course will be able to:

  • Apply transfer learning techniques in machine learning problems to small datasets
  • Generate synthetic tabular data using variational autoencoders to train machine learning models
  • Identify appropriate machine learning techniques to use with small datasets

The course is based around a project with the title Transfer Learning and Data Generation Solutions. In this project, the learner will start with 2 small datasets and apply the appropriate technique to solve specific problems. One small dataset will require the learner to utilize transfer learning to categorize data from a relatively small dataset correctly. The other dataset will require the learner to augment the small dataset with synthetically generated data suitable for developing a robust machine learning model.

The first module of supporting lesson content introduces the concept of Small Data. The next, on Machine Learning Techniques for Small Data, covers how to classify and describe small data, compare and contrast it with big data, relate the concept of small data to real-world situations and outline the various solutions available. After this comes a module on Transfer Learning, which culminates in creating a transfer learning solution. The final module, on Synthetic Data, covers differentiating between synthetic data and fake data, evaluating whether synthetic data is appropriate for a scenario, and contrasting synthetic image data with synthetic tabular data, before creating a synthetic data solution.

The other two 1-month short courses are related to Big Data. Both of them are also part of the Data Engineer Nanodegree, which we have previously outlined in detail and which is currently billed as one of the most popular programs on Udacity.

The newly available Cloud Data Warehouses is at intermediate level, requiring experience in relational database design, SQL, basic dimensional modeling, Amazon Web Services basics, and Python. The course is estimated to take a month, during which learners will build warehousing skills, gain an understanding of data infrastructure, and build on the cloud using AWS. In the course project, learners act as a data engineer for a streaming music service to build an ELT pipeline that extracts data from S3, stages it in Redshift, and transforms it into a set of dimensional tables for an analytics team to find insights into what songs their users are listening to.

Those who complete it will be ready to apply the following skills:

  • Data Engineering: Data Extraction, ETL
  • Data Architecture: OLAP Cubes, Data Warehouse Architecture
  • Databases: Database Fundamentals
  • Amazon Web Services: Redshift, Amazon S3
  • Cloud Strategy and Governance: Cloud Computing Fluency
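
The ELT flow described for the Cloud Data Warehouses project (load raw data into staging, then transform it into dimensional tables inside the warehouse) can be sketched in a few lines. This is only an illustration under loud assumptions: Python's built-in sqlite3 stands in for Redshift, an in-memory list of JSON strings stands in for files on S3, and all table and column names are hypothetical rather than taken from the course.

```python
import json
import sqlite3

# ASSUMPTION: a hypothetical raw event log. In the course project the raw
# data would sit in S3; here it is just a list of JSON strings.
RAW_EVENTS = [
    '{"user_id": 1, "user_name": "Ana", "song": "Blue", "artist": "Kite", "ts": 1000}',
    '{"user_id": 2, "user_name": "Ben", "song": "Blue", "artist": "Kite", "ts": 1010}',
    '{"user_id": 1, "user_name": "Ana", "song": "Red", "artist": "Moss", "ts": 1020}',
    '{"user_id": 1, "user_name": "Ana", "song": "Blue", "artist": "Kite", "ts": 1030}',
]


def run_elt(raw_lines):
    """Extract and load raw events into staging, then transform with SQL."""
    conn = sqlite3.connect(":memory:")  # stand-in for the warehouse
    cur = conn.cursor()

    # Extract + Load: copy the raw records into a staging table unchanged.
    cur.execute(
        "CREATE TABLE staging_events "
        "(user_id INTEGER, user_name TEXT, song TEXT, artist TEXT, ts INTEGER)"
    )
    rows = [json.loads(line) for line in raw_lines]
    cur.executemany(
        "INSERT INTO staging_events "
        "VALUES (:user_id, :user_name, :song, :artist, :ts)",
        rows,
    )

    # Transform: build dimension and fact tables inside the warehouse.
    cur.execute(
        "CREATE TABLE dim_users AS "
        "SELECT DISTINCT user_id, user_name FROM staging_events"
    )
    cur.execute(
        "CREATE TABLE dim_songs AS "
        "SELECT DISTINCT song, artist FROM staging_events"
    )
    cur.execute(
        "CREATE TABLE fact_songplays AS "
        "SELECT ts, user_id, song FROM staging_events"
    )
    conn.commit()
    return conn


conn = run_elt(RAW_EVENTS)

# The kind of question an analytics team might ask of the dimensional model.
top_songs = conn.execute(
    "SELECT song, COUNT(*) AS plays FROM fact_songplays "
    "GROUP BY song ORDER BY plays DESC, song"
).fetchall()
print(top_songs)
```

The point of the ordering is that the transformation runs as SQL inside the warehouse after loading, which is what distinguishes ELT from transforming the data before it is loaded.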

Spark & Data Lakes is also at intermediate level, requiring intermediate knowledge of Python and SQL and AWS basics. In the course of a month, learners will build a data lake on AWS and a data catalog following the principles of data lakehouse architecture. They will learn about the big data ecosystem and the power of Apache Spark for data wrangling and transformation. They’ll work with AWS data tools and services to extract, load, process, query, and transform semi-structured data in data lakes. In the course project learners will act as a data engineer for a team building a data lakehouse solution for sensor data that trains a machine learning model. They will build an ELT (Extract, Load, Transform) pipeline for lakehouse architecture, load data from an AWS S3 data lake, process the data into analytics tables using Spark and AWS Glue, and load them back into lakehouse architecture.

Those who complete it will be ready to apply the following skills:

  • Amazon Web Services: AWS Glue, AWS Data Lakes, Amazon S3, Amazon Athena
  • Big Data Tools: Apache Spark
  • Data Engineering: Data Transformation, Data Wrangling, ELT
  • Data Architecture: Data Lakes, Data Lakehouse Architecture
  • Data Formats: Data Format Fundamentals
  • Big Data: Big Data Fluency

These are sought-after skills and Udacity helps learners to succeed with services customized for their needs along the learning journey, including timely personalized feedback and on-demand help.

More Information

Small Data

Cloud Data Warehouses

Spark & Data Lakes

Data Engineer Nanodegree

Related Articles

AI Short Course New From Udacity

Data Scientist or Data Engineer? Choose Your Path On Udacity

 


Last Updated ( Friday, 11 August 2023 )