R has a great many datasets built into its core installation under the package datasets. In addition, both caret and its graphical dependency ggplot2 come with several datasets built in.
If you see a dataset that interests you, make sure the relevant package is loaded (if necessary). Then run data(<dataset>) to load your dataset of choice into your session. Technically the dataset doesn't fully arrive until you try to do something with it, but since you'll want to inspect it almost immediately:
data(iris)
str(iris)
This sequence will load the iris dataset and then tell you about its structure. You could also run ?iris to get the documentation on the data set, which might bring more clarity to the columns.
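If you are not sure which datasets are available in the first place, data() can also list them. A minimal sketch, assuming you only want the names and one-line titles from the datasets package (the variable name available is just for illustration):

```r
# Ask data() for everything shipped with the built-in 'datasets' package.
# The result is a list; its 'results' element is a character matrix.
available <- data(package = "datasets")$results

# The Item column holds the dataset name, Title a short description.
head(available[, c("Item", "Title")])
```

Swapping in another installed package name for "datasets" should list that package's datasets instead.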
Unfortunately, several of these data sets are quite small. Two that might be of interest:
airquality - Ozone readings in parts per billion based on 5 predictors.
iris - A classification of iris species on 4 predictors.
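To get a feel for just how small these two are, you can check their dimensions directly. A quick sketch using only base R:

```r
# Both datasets live in the built-in 'datasets' package.
data(airquality)
data(iris)

dim(airquality)  # 153 6
dim(iris)        # 150 5

# airquality contains missing values, so count them per column
# before fitting anything.
colSums(is.na(airquality))
```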
Much more thorough descriptions are available at the caret page. Each dataset is relatively rich in predictors and observations.
A relatively approachable dataset might be cars.
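One thing to watch for: the datasets package also ships a dataset called cars, so it helps to be explicit about which one you want. A hedged sketch, assuming caret is installed (the environment name caret_env is only illustrative):

```r
# Without caret, 'cars' is the small speed/distance data from 'datasets'.
dim(datasets::cars)  # 50 2

# Load caret's version into a separate environment so nothing is overwritten.
if (requireNamespace("caret", quietly = TRUE)) {
  caret_env <- new.env()
  data(cars, package = "caret", envir = caret_env)
  str(caret_env$cars)
}
```

Passing package = explicitly avoids relying on the order in which packages were attached.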
Official list (look under the Data heading a little over half-way down). No descriptions here (though again, you can use ?<dataset> to get the documentation).
Of particular interest here is the diamonds dataset: nearly 54,000 observations, 9 predictors for price, endless possibilities.
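Loading diamonds works the same way as the earlier iris example, except that the data comes from ggplot2. A minimal sketch, assuming ggplot2 is installed:

```r
# diamonds ships with ggplot2, so name the package explicitly.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  data(diamonds, package = "ggplot2")

  # Rows and columns: price plus the 9 predictors.
  print(dim(diamonds))

  # Column types; cut, color and clarity are ordered factors.
  str(diamonds)
}
```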