Awesome Public Data Sets
This is a link to another list of public data sets maintained by other people. It looks really promising.
Live Streaming Temporal Data
- http://data.numenta.org (lots of different data, curated by the NuPIC community; cancelled for lack of interest)
- https://data.sparkfun.com/streams/ (IoT data, live & free)
- http://www.acleddata.com/ (Armed Conflict Location and Event Data Project)
NLP
Biometrics
- http://www.physionet.org/cgi-bin/atm/ATM (annotated, digitized physiologic signals and time series)
General Data Sets
- http://www.cs.cmu.edu/~ark/
- https://cloud.google.com/bigquery/public-data/
- http://webscope.sandbox.yahoo.com/catalog.php
- http://www.quandl.com
- http://archive.ics.uci.edu/ml/
- http://aws.amazon.com/datasets?_encoding=UTF8&jiveRedirect=1
- http://www.w3.org/wiki/TaskForces/CommunityProjects/LinkingOpenData/DataSets
- http://www.data.gov/
- http://www.data.gc.ca
- http://www.factual.com/
- http://www.google.com/publicdata/directory
- http://www.guardian.co.uk/news/datablog
- http://www.kaggle.com/
- http://www.nationalarchives.gov.uk/records/catalogues-and-online-records.htm
- https://datamarket.com/data/list/?q=provider:tsdl (Time series datasets)
- http://ckan.net/
- http://www.delicious.com/pskomoroch/redistributable+dataset
- http://www.reddit.com/r/datasets/
- http://www.reddit.com/r/opendata
- http://www.trustlet.org/wiki/Repositories_of_datasets
- http://www.diggingintodata.org/Repositories/tabid/167/Default.aspx
- http://oad.simmons.edu/oadwiki/Data_repositories
- http://data-ac-uk.ecs.soton.ac.uk/
Topic Specific Collections
- https://www.datarepository.movebank.org/ (migratory animal paths over time)
- http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml (NYC Taxi and Limo trip records)
- http://crawdad.org/ (Wireless data and traces)
- http://data.cityofchicago.org/ (Chicago government data)
- http://data.govloop.com/ (National government data)
- http://data.gov.uk/data (UK government data)
- http://data.medicare.gov/
- http://data.seattle.gov/
- http://data.sunlightlabs.com/ (White House and Congressional data)
- http://gettingpastgo.socrata.com/ (Education and educational accountability)
- http://snap.stanford.edu/data/index.html (Social Network Datasets)
- http://timetric.com/public-data/ (Economic datasets)
- http://www.bls.gov/ (Labor statistics)
- http://www.dartmouthatlas.org/tools/downloads.aspx (Health care data)
- http://www.datakc.org/ (King County, Washington data)
- http://research.stlouisfed.org/fred2/ (Economic data)
- http://www.who.int/research/en/ (World Health Data)
- http://www.datasf.org/ (San Francisco Data)
- http://www.nyc.gov/html/datamine/html/home/home.shtml (New York City Data)
- http://data.vancouver.ca (Vancouver Data)
- http://stats.oecd.org/index.aspx (Organization for Economic Cooperation and Development)
- http://data.un.org/Explorer.aspx (UN Data)
- http://www.ngdc.noaa.gov/ngdc.html (Geophysical Data)
- http://data.worldbank.org (World Bank)
- https://wist.echo.nasa.gov/~wist/api/imswelcome/ (Earth Science Data)
- http://www.stats.com/ (Sports Data)
- http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC93S1 (The TIMIT dataset as mentioned by Andrew Ng in Unsupervised Feature Learning and Deep Learning. Note: available for purchase only; no free download.)
- http://www.ncbi.nlm.nih.gov/genome/ (Genomic Data)
- http://noticeboard.americascup.com/race-data/data-api-intro/ (America’s Cup)
- http://www.frixo.com/ (traffic data)
Specific Datasets
- http://en.wikipedia.org/wiki/Wikipedia:Database_download
- http://factfinder.census.gov/servlet/DatasetMainPageServlet (The US Census)
- http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13 (Full Google N-Gram dataset for English, $150)
- Viewer for above: http://ngrams.googlelabs.com/
- http://www2.jpl.nasa.gov/srtm (Elevation dataset)
- http://build.kiva.org/ (Stats from Kiva.org - Available through API)
- http://www.imdb.com/interfaces
- http://db.humanconnectome.org/ (Neuroimaging)
Sports
- http://www.retrosheet.org/game.htm (Baseball)
- http://www.basketball-reference.com/ (Basketball)
Health Datasets
Traffic
- Colorado: http://www.cotrip.org/xmlFeed.htm
- NYC: http://www.nyc.gov/html/dot/html/about/datafeeds.shtml#realtime
Data Stream Sources
- Pachube - http://www.pachube.com/ - Store, share & discover realtime sensor, energy and environment data from objects, devices & buildings around the world.
- Gnip - http://www.gnip.com/ - Social media API
- Factual - http://www.factual.com/ - Factual has constantly evolving data on thousands of topics
Community of Data Nerds
- https://groups.google.com/forum/#!forum/get-theinfo - HN: ‘[T]he best way to find data sets. They are a bunch of data hoarders who can help you’
- http://opendata.stackexchange.com/ - StackExchange site to find data sets
Dataset Articles / Presentation
- How Infochimps uses Hadoop and the cloud
- Amazon Machine Image for data processing (could be very useful): http://ebiquity.umbc.edu/blogger/2009/02/14/infochimps-amazon-machine-image-for-data-analysis-and-viz/
- Kate’s Data Presentation.ppt