New Public Datasets Added To AWS
Written by Kay Ewbank   
Wednesday, 06 February 2019

Amazon has announced nine new AWS public datasets for researchers and developers interested in machine learning, environmental science, geospatial, astronomy, cybersecurity, and housing.

The AWS Public Dataset Program covers the cost of storage for publicly available high-value cloud-optimized datasets. The datasets within it can be used for analysis on AWS, and the aim is also to develop new cloud-native techniques, formats, and tools that lower the cost of working with data.
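As a rough illustration of what "analysis on AWS" can start with, here is a minimal sketch that lists the first few objects in a publicly listable S3 bucket using only the Python standard library. It assumes the bucket allows anonymous listing and is reachable at the standard virtual-hosted S3 endpoint, which isn't true of every dataset. The bucket name used below is hypothetical; the real one is on each dataset's registry page. The helper names are ours, not part of any AWS SDK.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# XML namespace used in S3 ListObjectsV2 responses.
S3_NS = "{http://s3.amazonaws.com/doc/2006-03-01/}"


def build_list_url(bucket, prefix="", max_keys=10):
    """Build an unsigned ListObjectsV2 URL (assumes anonymous listing is allowed)."""
    query = urllib.parse.urlencode(
        {"list-type": "2", "prefix": prefix, "max-keys": str(max_keys)}
    )
    return f"https://{bucket}.s3.amazonaws.com/?{query}"


def parse_keys(xml_data):
    """Return (key, size in bytes) pairs from a ListObjectsV2 XML response."""
    root = ET.fromstring(xml_data)
    return [
        (item.findtext(f"{S3_NS}Key"), int(item.findtext(f"{S3_NS}Size")))
        for item in root.iter(f"{S3_NS}Contents")
    ]


def list_public_objects(bucket, prefix="", max_keys=10):
    """Fetch and parse one page of a bucket listing (requires network access)."""
    url = build_list_url(bucket, prefix, max_keys)
    with urllib.request.urlopen(url) as resp:
        return parse_keys(resp.read())
```

Calling `list_public_objects("example-public-dataset", prefix="data/")` would return up to ten key and size pairs, where `example-public-dataset` stands in for a real bucket name. For anything beyond a quick look, an AWS SDK or the AWS command line tools are likely a better fit.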



The machine learning dataset is a massively multilingual image dataset from the University of Pennsylvania. It contains images paired with the words they represent in 100 languages, and it is doubly parallel: for each language, words are stored parallel to images that represent the word, as well as parallel to the word's translation into English. One example is the Indonesian word "kucing", a word with high predicted concreteness, which is shown with five images along with its top four ranked translations using CNN features.



There are three environmental datasets. The first is a set of atmospheric deterministic and probabilistic forecasts from the UK Meteorological Office. This data was previously available, but it is now updated daily.

The second environmental dataset is a collection of scientific information for land owners from the Queensland Government. The database is made up of Australian climate data from 1889 to the present.

The third collection of environmental data is air quality and radiation data from Safecast. Safecast was started after the Fukushima Daiichi Nuclear Power Plant meltdown, when volunteers began monitoring radiation levels. Air quality measurements were added later, and the project has spread around the world.

There are two new geospatial datasets: the USGS 3D elevation data, which contains elevation data in the form of light detection and ranging (LiDAR) data over the United States, Hawaii, and the U.S. territories, with data acquired over an 8-year period; and a set of images from AMS Kepler collected by the China-Brazil Earth Resources Satellite.

In the astronomy sector, there's data from the Transiting Exoplanet Survey Satellite (TESS), a two-year survey looking for exoplanets in orbit around bright stars.

The Open City Model data has also been made available. This is an initiative to provide CityGML data for all the buildings in the United States. By using other open datasets in conjunction with the researchers' own code and algorithms, the intention is to provide 3D geometries for every US building.

The final addition is a collection of datasets from QIIME 2. The Microbiome Research User Tutorial Datasets contain the user documents and datasets for QIIME 2. QIIME 2 is an extensible and decentralized microbiome analysis package with a focus on data and analysis transparency. It enables researchers to start an analysis with raw DNA sequence data and finish with publication-quality figures and statistical results.



More Information

Massively Multilingual Image Dataset

Learning Translations via Images with a Massively Multilingual Image Dataset

Atmospheric Deterministic and Probabilistic Forecasts

Scientific Information for Land Owners

Safecast Air Quality and Radiation data

USGS 3DEP LiDAR Point Clouds 

China-Brazil Earth Resources Satellite

Transiting Exoplanet Survey Satellite

Open City Model

Microbiome Research User Tutorial Datasets

Related Articles

Amazon Releases Managed Message Broker Service for ActiveMQ

AWS Lambda for the Impatient Part 1

AWS Lambda for the Impatient Part 2

AWS Lambda for the Impatient Part 3

Amazon Adds Game Dev Options To AWS

Amazon Strengthens Data Offerings

New Amazon Elasticsearch Service

Amazon Introduces Quicksight - Cloud BI

New AWS Managed Services



Last Updated ( Wednesday, 06 February 2019 )