It can be fun to sift through dozens of datasets to find the perfect one, but it can also be frustrating to download and import multiple CSV files only to realize that the data is ultimately not interesting. Fortunately, there are online repositories that curate good datasets.
In this article, we will go through several types of Data Science projects: Data Visualization, Data Cleaning and Machine Learning.
For each type, we will identify good places to find suitable datasets. Whether you want to polish your portfolio by showing that you're familiar with Data Visualization, or want to practice by testing Machine Learning algorithms, you're in the right place to find the perfect dataset.
Datasets for Data Visualization Projects
A typical data visualization project could be: "I want to create graphs showing how income varies across different European countries or the evolution of unemployment in a country of choice".
Let's look at what characterizes a good dataset for data visualization:
It should not be messy, because you do not want to spend a lot of time cleaning up the data.
It should be nuanced and interesting enough to make diagrams.
Ideally, each column should be well explained to make the visualization accurate.
It should not contain too many rows or columns, so that it stays easy to work with.
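The checklist above can be turned into a rough screening step before you commit to a dataset. The sketch below is only an illustration using Python's standard library: the function name `screen_csv` and the thresholds (`max_rows`, `max_cols`, `max_missing_ratio`) are assumptions chosen for this example, not established rules, and the sample values are made up.

```python
import csv
import io


def screen_csv(text, max_rows=10_000, max_cols=30, max_missing_ratio=0.05):
    """Roughly check whether CSV text looks easy to visualize.

    The thresholds are arbitrary illustrative defaults, not standards.
    """
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        raise ValueError("empty CSV")
    header, data = rows[0], rows[1:]

    # Count cells that are empty or absent (short rows count as missing).
    total_cells = len(header) * len(data)
    missing = 0
    for row in data:
        padded = row + [""] * (len(header) - len(row))
        missing += sum(1 for cell in padded[: len(header)] if not cell.strip())
    missing_ratio = missing / total_cells if total_cells else 0.0

    return {
        "rows": len(data),
        "columns": len(header),
        "missing_ratio": missing_ratio,
        "small_enough": len(data) <= max_rows and len(header) <= max_cols,
        "clean_enough": missing_ratio <= max_missing_ratio,
    }


if __name__ == "__main__":
    # Made-up sample with one missing income value.
    sample = "country,income\nFrance,31000\nSpain,\nItaly,27000\n"
    print(screen_csv(sample))
```

A check like this does not tell you whether the data is interesting or well documented; those two criteria still require reading the dataset's description.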
To find good datasets for Data Visualization projects, we will favor news sites that publish their data publicly. They usually clean the data for you and already have charts that you can replicate or improve.
1. FiveThirtyEight
FiveThirtyEight is an incredibly popular interactive news and sports site created by Nate Silver (author of the book ‘The Signal and the Noise’, which I recommend). They write interesting articles that are always grounded in data.
FiveThirtyEight publishes the datasets used in its articles on GitHub.
Here are some examples:
- Airline Safety – contains information about the accidents of each airline.
- US Weather History – historical weather data for the United States.
- Study Drugs – data on Adderall drug users in the United States.
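Because these files are plain CSVs hosted on GitHub, they can be read directly from a script. The following is a minimal sketch using only Python's standard library. The URL is an assumption about where the Airline Safety file lives in the FiveThirtyEight repository and should be verified before use; the helper names `parse_csv` and `fetch_csv` are hypothetical.

```python
import csv
import io
import urllib.request

# Assumed location of the Airline Safety file in FiveThirtyEight's GitHub
# repository; verify the path before relying on it.
AIRLINE_SAFETY_URL = (
    "https://raw.githubusercontent.com/fivethirtyeight/data/"
    "master/airline-safety/airline-safety.csv"
)


def parse_csv(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def fetch_csv(url, timeout=10):
    """Download a CSV file over HTTP and parse it into a list of dicts."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        text = response.read().decode("utf-8")
    return parse_csv(text)
```

Calling `fetch_csv(AIRLINE_SAFETY_URL)` should return one dictionary per row, provided the assumed path is still valid. All values come back as strings, so numeric columns need converting before plotting.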
2. BuzzFeed
BuzzFeed started as a purveyor of low-quality articles but has since evolved and now publishes some solid investigative pieces.
BuzzFeed also makes the datasets used in its articles available on GitHub.
Here are some examples:
Federal Surveillance Planes - contains data on aircraft used for domestic surveillance.
Zika Virus - data on the geography of the Zika virus outbreak.
Firearm background checks - data on background checks of people who are trying to buy firearms.
3. Socrata OpenData
Socrata OpenData is a portal that contains cleaned datasets that can be browsed or downloaded. Much of this data comes from US government sources, and many of the datasets are out of date.
You can browse and download data from OpenData without registering. You can also use its visualization and exploration tools to examine the data directly in the browser.
Here are some examples:
White House staff salaries - salary data for each White House staff member in 2010.
Radiation Analysis - data on radioactive dairy products in some states of the United States.
Workplace fatalities by US state - the number of deaths in the workplace in the United States.
Datasets for Data Processing Projects
Sometimes you just want to practice working with large volumes of data. Here, the end result matters less than the process of reading and analyzing the data. You can use tools such as Spark or Hadoop to distribute the processing across multiple nodes.
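The idea behind distributing work across nodes can be illustrated on a single machine. The sketch below is not Spark or Hadoop code: it uses Python's standard library, with worker threads standing in for cluster nodes, to show the split / process / merge pattern on a word count. The function names and the chunking scheme are assumptions made for this example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(lines, n_chunks):
    """Split a list of lines into at most n_chunks roughly equal parts."""
    size = max(1, -(-len(lines) // n_chunks))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def count_words(chunk):
    """'Map' step: count the words in one chunk independently."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def distributed_word_count(lines, n_workers=4):
    """Count words by processing chunks concurrently, then merging."""
    chunks = split_into_chunks(lines, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_counts = list(pool.map(count_words, chunks))
    # 'Reduce' step: merge the per-chunk results into one total.
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


if __name__ == "__main__":
    text = ["big data is big", "data processing at scale", "big clusters"]
    print(distributed_word_count(text, n_workers=2).most_common(2))
```

Each chunk is processed without knowledge of the others, which is what allows a real framework to place chunks on different machines. Threads in one Python process will not speed up this particular computation; they only mimic the structure.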
Keep the following in mind when looking for a good dataset for data processing:
The cleaner the data, the better for you - cleaning up a large amount of data can take a lot of time.
The dataset should be interesting.
There should be an interesting question that can be answered with this data.
Cloud hosting providers like Amazon and Google are a good place to find large public datasets. They have an incentive to host these datasets because you then analyze them using their infrastructure (and pay for it).


