Datasets for machine learning, data mining, app mashups and research
As a market researcher, app producer and software entrepreneur, I use a lot of different data sets, either for research or to tell stories. Here are a few great repositories I use regularly:
- USA tapestry data 2015 is great for cross-referencing where people live with the type of psychographic lifestyle they exhibit.
- USA Demographic data 2015 is the foundation of marketing: age, sex, location and ethnicity.
- General Social Survey from the National Opinion Research Center offers the most often used survey data on happiness in the U.S. since 1972.
- Gallup poll and Gallup Analytics provide a variety of data sets focusing on:
- economic confidence
- entrepreneurial energy
- confidence in leadership
- confidence in military and police
- food access
- freedom of media
- life evaluations
- Yahoo Query Language - YQL isn't so much an API as it is a tool for collecting and processing data from different web sources. This tool is surprisingly powerful and I highly suggest developers look at it at least once.
- Programmable Web offers more than 15,000 APIs.
- Mashape offers a really clear way to access and sell data as a service.
- Google APIs are great for mashups, specifically the Google Maps Geocode API and Reverse Geocode API, Google Calendar for event-type apps, and Google Apps Script for data mining and collection.
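As a rough illustration of the geocode and reverse geocode calls mentioned above, here is a minimal Python sketch that only builds the request URLs for Google's Geocoding web service. The helper names `geocode_url` and `reverse_geocode_url` are my own, and `YOUR_KEY` is a placeholder; check Google's current documentation for quotas, billing and response format before relying on it.

```python
from urllib.parse import urlencode

# JSON endpoint of the Google Geocoding web service.
GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"


def geocode_url(address, api_key):
    """Build a forward-geocoding request URL (address -> coordinates)."""
    query = urlencode({"address": address, "key": api_key})
    return GEOCODE_ENDPOINT + "?" + query


def reverse_geocode_url(lat, lng, api_key):
    """Build a reverse-geocoding request URL (coordinates -> address)."""
    query = urlencode({"latlng": f"{lat},{lng}", "key": api_key})
    return GEOCODE_ENDPOINT + "?" + query


if __name__ == "__main__":
    # YOUR_KEY is a placeholder, not a working credential.
    print(geocode_url("1600 Amphitheatre Parkway", "YOUR_KEY"))
    print(reverse_geocode_url(40.7, -74.0, "YOUR_KEY"))
```

Fetching either URL with any HTTP client should return a JSON document; the sketch stops at URL construction so it runs without a key or a network connection.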
Data sources for civic engagement
- City data has collected and analyzed data from numerous sources to create profiles of all U.S. cities that are as complete and interesting as possible.
- Envirofacts provides a single point of access to U.S. EPA environmental data contained in U.S. EPA databases. Interested parties from state and local governments, EPA or other federal agencies, or individuals can search for information about environmental activities that may affect air, water, and land anywhere in the United States. Envirofacts makes it easy to find information using an address, ZIP Code, city, county, water body, or other geographic designation, and to search all sources or specific environmental subject areas such as Waste, Water, Toxics, Air, Radiation, and Land. Experienced users can use more sophisticated capabilities such as maps or customized reporting.
- U.S. Census: The American Community Survey 5 Year Data covers a broad range of topics about social, economic, demographic, and housing characteristics of the U.S. population.
- Foursquare: The Foursquare API gives you access to all of the data used by the Foursquare mobile applications and, in some cases, even more.
- Yelp: The Yelp v2.0 API enables access to more relevant search results that more closely match the results on Yelp. It lets you:
- Find up to the 40 best results for a geographically-oriented search
- Sort results by the best match for the query, highest ratings, or distance
- Limit results to those businesses offering a Yelp Deal, displaying information about the deal such as title, savings and purchase URL
- Identify and display whether a business has been claimed on Yelp.com
- Dark Sky API lets you query for short-term precipitation forecast data at geographical points inside the United States.
- Weather API offers these data features: alerts, almanac, astronomy, conditions, currenthurricane, forecast, forecast10day, geolookup, history, hourly, hourly10day, planner, rawtide, satellite, tide, webcams, yesterday.
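The Yelp bullets above describe sorting by distance or rating and capping results at 40. When you combine results from several sources in a mashup, you sometimes need to do the same thing on your own side. Here is a minimal Python sketch of that idea, assuming each business is a plain dict with hypothetical `lat`, `lon` and `rating` keys. It uses the haversine great-circle formula and is not how Yelp itself ranks results.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = (sin(dlat / 2) ** 2
         + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def nearest(businesses, lat, lon, limit=40):
    """Return up to `limit` businesses, closest to (lat, lon) first."""
    ranked = sorted(
        businesses,
        key=lambda b: haversine_km(lat, lon, b["lat"], b["lon"]),
    )
    return ranked[:limit]


def top_rated(businesses, limit=40):
    """Return up to `limit` businesses, highest rating first."""
    return sorted(businesses, key=lambda b: b["rating"], reverse=True)[:limit]


if __name__ == "__main__":
    # Made-up sample records, for illustration only.
    sample = [
        {"name": "A", "lat": 0.0, "lon": 2.0, "rating": 3.5},
        {"name": "B", "lat": 0.0, "lon": 1.0, "rating": 4.5},
    ]
    print([b["name"] for b in nearest(sample, 0.0, 0.0)])
    print([b["name"] for b in top_rated(sample)])
```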
Here's my curated list of Map APIs, which also includes geocoding and GIS.
- Million Song Dataset on EC2
- Million Song Dataset from LabROSA.
- Check out my other post on Music Industry APIs
- UC Irvine Machine Learning Repository
- Machine learning data set repository
- LIBSVM offers different regression, binary, and multilabel classification datasets stored in the LIBSVM format.
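If you haven't seen the LIBSVM format mentioned above, each line is a label followed by sparse `index:value` pairs, for example `+1 1:0.5 3:1.2`. Here is a minimal Python sketch of a reader, assuming whitespace-separated fields and comma-separated labels for multilabel files. The function names are my own, and it skips extras such as comments or ranking `qid` fields.

```python
def parse_libsvm_line(line):
    """Parse one LIBSVM-format line into (labels, {feature_index: value})."""
    parts = line.split()
    # Multilabel files put several comma-separated labels in the first field.
    labels = [float(x) for x in parts[0].split(",")]
    features = {}
    for item in parts[1:]:
        index, value = item.split(":", 1)
        features[int(index)] = float(value)
    return labels, features


def parse_libsvm(text):
    """Parse a whole LIBSVM-format string, skipping blank lines."""
    return [parse_libsvm_line(l) for l in text.splitlines() if l.strip()]


if __name__ == "__main__":
    data = "+1 1:0.5 3:1.2\n-1 2:0.25\n"
    for labels, features in parse_libsvm(data):
        print(labels, features)
```

The features are kept in a dict because the format is sparse: any index that does not appear on a line is implicitly zero.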
I found this post on Stack Exchange and it's very good.