Commit
1 parent abf8f9d, commit 145236c
Showing 17 changed files with 213 additions and 66 deletions.
77 changes: 60 additions & 17 deletions
.../Data-Science/correlations/langs/en-us/pages/notes-DUPLICATE-scatter-plots.adoc
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,27 +1,70 @@ | ||
= Correlations in Scatter Plots in a Nutshell | ||
= Scatter Plots in a Nutshell | ||
|
||
*Scatter Plots* can be used to show a relationship between two quantitative columns. Each row in the dataset is represented by a point, with one column providing the x-value and the other providing the y-value. | ||
++++ | ||
<style> | ||
body.LessonNotes li { | ||
margin-bottom: 1px; | ||
} | ||
</style> | ||
++++ | ||
|
||
The resulting “point cloud” makes it possible to look for a relationship between those two columns. | ||
== Relationships Between two Quantitative Columns | ||
|
||
- If the points in a scatter plot appear to follow a straight line, it suggests that a linear relationship exists between those two columns. A number called a *correlation* can be used to summarize this relationship. | ||
@vspace{1ex} | ||
|
||
- @math{r} is the name of the *correlation statistic*. The @math{r}-value will always fall between −1 and +1. The sign tells us whether the correlation is positive or negative. Distance from 0 tells us the strength of the correlation. | ||
** −1 is the strongest possible negative correlation. | ||
** +1 is the strongest possible positive correlation. | ||
** 0 means no correlation. | ||
** ±0.65 or ±0.70 or more is typically considered a "strong correlation". | ||
** ±0.35 and ±0.65 is typically considered “moderately correlated”. | ||
** Anything less than about ±0.25 or ±0.35 may be considered weak. | ||
** *However, these cutoffs are not an exact science!* In some contexts an @math{r}-value of ±0.50 might be considered impressively strong! | ||
Scatter plots can be used to look for relationships between columns. Each row in the dataset is represented by a point, with one column providing the x-value (@vocab{explanatory variable}) and the other providing the y-value (@vocab{response variable}). The resulting “point cloud” makes it possible to look for a relationship between those two columns. | ||
|
||
@vspace{1ex} | ||
|
||
- The correlation is *positive* if the point cloud slopes up as it goes farther to the right. This means larger y-values tend to go with larger x-values. The correlation is *negative* if the point cloud slopes down as it goes farther to the right. | ||
- _Form_ | ||
|
||
- It is a *strong* correlation if the points are tightly clustered around a line. In this case, knowing the x-value gives us a pretty good idea of the y-value. It is a *weak* correlation if the points are loosely scattered and the y-value doesn't depend much on the x-value. | ||
* If the points in a scatter plot appear to follow a straight line, it suggests that a @vocab{linear relationship} exists between those two columns. | ||
* Relationships may take other forms (U-shaped, for example). If they aren't linear, it won't make sense to look for a correlation. | ||
* Sometimes there will be no relationship at all between two variables. | ||
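To see why form matters, here is a minimal Python sketch. The data and the helper name `r_value` are made up for illustration and are not part of the lesson materials. It computes the correlation for a perfectly U-shaped dataset and for a perfectly linear one. The U-shaped points clearly have a relationship, yet their correlation comes out at essentially 0, because correlation only measures how well a straight line describes the points.

```python
from math import sqrt


def r_value(xs, ys):
    """Correlation between two equal-length lists of numbers (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How much x and y vary together, and how much each varies on its own.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Made-up x-values, symmetric around 0.
xs = [-3, -2, -1, 0, 1, 2, 3]

u_shaped = [x * x for x in xs]    # points follow a U-shaped curve
linear = [2 * x + 1 for x in xs]  # points follow a straight line exactly

r_u = r_value(xs, u_shaped)
r_line = r_value(xs, linear)

print("U-shaped data:", r_u)
print("Linear data:  ", r_line)
```

Plotting the data first, and only then asking about correlation, avoids mistaking a strong curved relationship for "no relationship".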
|
||
- Points that do not fit the trend line in a scatter plot are called *unusual observations*. | ||
@vspace{1ex} | ||
|
||
- We graphically summarize this relationship by drawing a straight line through the data cloud, so that the vertical distance between the line and all the points taken together is as small as possible. This line is called the *line of best fit* and allows us to predict y-values based on x-values. | ||
== Line of Best Fit | ||
|
||
- [.underline]#*Correlation is not causation!*# Correlation only suggests that two column variables are related, but does not tell us if one causes the other. For example, hot days are correlated with people running their air conditioners, but air conditioners do not cause hot days! | ||
@vspace{1ex} | ||
|
||
@vocab{Linear Relationships} can be graphically summarized by drawing a straight line through the data cloud, so that the vertical distance between the line and all the points taken together is as small as possible. This allows us to predict y-values (the @vocab{response variable}) based on x-values (the @vocab{explanatory variable}). Points that do not fit the trend line in a scatter plot are called *unusual observations*. | ||
|
||
@vspace{1ex} | ||
|
||
- _Direction_ | ||
|
||
* The correlation is *positive* if the point cloud slopes up as it goes farther to the right. This means larger y-values tend to go with larger x-values. | ||
* The correlation is *negative* if the point cloud slopes down as it goes farther to the right. | ||
|
||
- _Strength_ | ||
|
||
* It is a *strong* correlation if the points are tightly clustered around a line. In this case, knowing the x-value gives us a pretty good idea of the y-value. | ||
* It is a *weak* correlation if the points are loosely scattered and the y-value doesn't depend much on the x-value. | ||
|
||
*Linear Regression* is a way of computing the *line of best fit*. (Want details? It minimizes the _sum of the squares_ of the vertical distances from the points to the line. There's a reason we use computers to do this!) | ||
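The "sum of the squares" idea can be sketched in a few lines of Python. This is an illustrative sketch, not the tool used in the lesson: the function names and the tiny `ages`/`weeks` dataset are assumptions invented for the example. It computes the slope and intercept with the standard least-squares formulas, then checks that nudging the slope makes the sum of squared vertical distances larger.

```python
def line_of_best_fit(xs, ys):
    """Return (slope, intercept) of the least-squares line (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The least-squares line always passes through (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_of_squares(xs, ys, slope, intercept):
    """Sum of squared vertical distances from the points to the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Made-up data: animal ages in years and weeks until adoption.
ages = [1, 2, 3, 4, 5]
weeks = [2, 3, 5, 4, 6]

slope, intercept = line_of_best_fit(ages, weeks)
best = sum_of_squares(ages, weeks, slope, intercept)

# Any other line, such as one with a slightly steeper slope, fits worse.
nudged = sum_of_squares(ages, weeks, slope + 0.1, intercept)

print("slope:", slope, "intercept:", intercept)
print("best fit error:", best, "nudged line error:", nudged)
```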
|
||
@vspace{1ex} | ||
|
||
== Summarizing Correlations with @math{r}-values | ||
|
||
@vspace{1ex} | ||
|
||
The @vocab{correlation} between two quantitative columns can be summarized in a single number, the @math{r}-value. | ||
|
||
- The sign tells us whether the correlation is positive or negative. | ||
- Distance from 0 tells us the strength of the correlation. | ||
- Here is how we might interpret some specific @math{r}-values: | ||
* −1 is the strongest possible negative correlation. | ||
* +1 is the strongest possible positive correlation. | ||
* 0 means no correlation. | ||
* ±0.65 or ±0.70 or more is typically considered a "strong correlation". | ||
* ±0.35 to ±0.65 is typically considered “moderately correlated”. | ||
* Anything less than about ±0.25 or ±0.35 may be considered weak. | ||
|
||
_Note: These cutoffs are not an exact science!_ In some contexts an @math{r}-value of ±0.50 might be considered impressively strong! And sample size matters! We'd be more convinced of a positive relationship in general between cat age and time to adoption if a correlation of +0.57 were based on 50 cats instead of 5. | ||
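As a rough illustration, the reading of an @math{r}-value described above can be written as a small Python function. The name `describe_r` and the exact cutoffs of 0.35 and 0.65 are assumptions chosen from within the ranges listed; since the cutoffs are not an exact science, they are parameters that can be changed.

```python
def describe_r(r, moderate_cutoff=0.35, strong_cutoff=0.65):
    """Return (direction, strength) for an r-value.

    The default cutoffs are illustrative assumptions, not fixed rules.
    """
    if not -1 <= r <= 1:
        raise ValueError("an r-value must fall between -1 and +1")

    # The sign gives the direction of the correlation.
    if r > 0:
        direction = "positive"
    elif r < 0:
        direction = "negative"
    else:
        direction = "none"

    # The distance from 0 gives the strength.
    size = abs(r)
    if size >= strong_cutoff:
        strength = "strong"
    elif size >= moderate_cutoff:
        strength = "moderate"
    else:
        strength = "weak"

    return direction, strength


print(describe_r(0.9))
print(describe_r(-0.5))
print(describe_r(0.1))
```

In a field where an @math{r}-value of ±0.50 counts as impressive, passing a lower `strong_cutoff` would be a reasonable adjustment.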
|
||
@vspace{1ex} | ||
|
||
|
||
[.underline]#*Correlation is not causation!*# Correlation only suggests that two variables are related. It does not tell us if one causes the other. For example, hot days are correlated with people running their air conditioners, but air conditioners do not cause hot days! |
69 changes: 62 additions & 7 deletions
...-Science/linear-regression/langs/en-us/pages/notes-computing-relationships.adoc
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,15 +1,70 @@ | ||
= Linear Regression in a Nutshell | ||
= Scatter Plots in a Nutshell | ||
|
||
++++ | ||
<style> | ||
body.LessonNotes li { | ||
margin-bottom: 1px; | ||
} | ||
</style> | ||
++++ | ||
|
||
== Relationships Between two Quantitative Columns | ||
|
||
@vspace{1ex} | ||
|
||
Scatter plots can be used to look for relationships between columns. Each row in the dataset is represented by a point, with one column providing the x-value (@vocab{explanatory variable}) and the other providing the y-value (@vocab{response variable}). The resulting “point cloud” makes it possible to look for a relationship between those two columns. | ||
|
||
@vspace{1ex} | ||
|
||
- _Form_ | ||
|
||
* If the points in a scatter plot appear to follow a straight line, it suggests that a @vocab{linear relationship} exists between those two columns. | ||
* Relationships may take other forms (U-shaped, for example). If they aren't linear, it won't make sense to look for a correlation. | ||
* Sometimes there will be no relationship at all between two variables. | ||
|
||
@vspace{1ex} | ||
|
||
== Line of Best Fit | ||
|
||
@vspace{1ex} | ||
|
||
@vocab{Linear Relationships} can be graphically summarized by drawing a straight line through the data cloud, so that the vertical distance between the line and all the points taken together is as small as possible. This allows us to predict y-values (the @vocab{response variable}) based on x-values (the @vocab{explanatory variable}). Points that do not fit the trend line in a scatter plot are called *unusual observations*. | ||
|
||
@vspace{1ex} | ||
|
||
* *We compute linear relationships to predict the future!* Well...sort of. Given a dataset, like ages of animals v. how long before they're adopted, we try to compute the relationship between `age` and `weeks` so that we can _predict_ how long a new animal might stay, based on their age. | ||
- _Direction_ | ||
|
||
* The correlation is *positive* if the point cloud slopes up as it goes farther to the right. This means larger y-values tend to go with larger x-values. | ||
* The correlation is *negative* if the point cloud slopes down as it goes farther to the right. | ||
|
||
- _Strength_ | ||
|
||
* When we compute linear relationships, we're talking about *straight-line patterns* that appear on a scatter plot. | ||
* It is a *strong* correlation if the points are tightly clustered around a line. In this case, knowing the x-value gives us a pretty good idea of the y-value. | ||
* It is a *weak* correlation if the points are loosely scattered and the y-value doesn't depend much on the x-value. | ||
|
||
*Linear Regression* is a way of computing the *line of best fit*. (Want details? It minimizes the _sum of the squares_ of the vertical distances from the points to the line. There's a reason we use computers to do this!) | ||
|
||
@vspace{1ex} | ||
|
||
* A scatter plot has an x-axis and a y-axis. When looking for relationships, the y-axis is called the @vocab{response variable}, and the x-axis is called the @vocab{explanatory variable}. In our example, we are trying to figure out how much of the `weeks` variable is _explained by_ the `age` variable. | ||
== Summarizing Correlations with @math{r}-values | ||
|
||
* *Linear Regression* is a way of computing the *line of best fit*, which tries to draw a line as close as possible to all the points. (Want details? It minimizes the _sum of the squares_ of the vertical distances from the points to the line. There's a reason we use computers to do this!) | ||
@vspace{1ex} | ||
|
||
The @vocab{correlation} between two quantitative columns can be summarized in a single number, the @math{r}-value. | ||
|
||
- The sign tells us whether the correlation is positive or negative. | ||
- Distance from 0 tells us the strength of the correlation. | ||
- Here is how we might interpret some specific @math{r}-values: | ||
* −1 is the strongest possible negative correlation. | ||
* +1 is the strongest possible positive correlation. | ||
* 0 means no correlation. | ||
* ±0.65 or ±0.70 or more is typically considered a "strong correlation". | ||
* ±0.35 to ±0.65 is typically considered “moderately correlated”. | ||
* Anything less than about ±0.25 or ±0.35 may be considered weak. | ||
|
||
_Note: These cutoffs are not an exact science!_ In some contexts an @math{r}-value of ±0.50 might be considered impressively strong! And sample size matters! We'd be more convinced of a positive relationship in general between cat age and time to adoption if a correlation of +0.57 were based on 50 cats instead of 5. | ||
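The point about sample size can be explored with a small simulation. The following Python sketch is only an illustration under stated assumptions: the helper names, the use of normally distributed random numbers, the 2000 trials, and the fixed seed are all choices made for this example. It generates x- and y-values that are unrelated by construction, then counts how often the @math{r}-value still reaches +0.57 purely by chance, once with 5 points per sample and once with 50.

```python
import random
from math import sqrt


def r_value(xs, ys):
    """Correlation between two equal-length lists of numbers (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def share_of_large_r(sample_size, threshold=0.57, trials=2000, seed=0):
    """Fraction of unrelated random samples whose r-value is at least threshold."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(trials):
        # x and y are generated independently, so there is no real relationship.
        xs = [rng.gauss(0, 1) for _ in range(sample_size)]
        ys = [rng.gauss(0, 1) for _ in range(sample_size)]
        if r_value(xs, ys) >= threshold:
            hits += 1
    return hits / trials


share_5 = share_of_large_r(5)
share_50 = share_of_large_r(50)

print("Samples of 5 reaching r >= 0.57 by chance: ", share_5)
print("Samples of 50 reaching r >= 0.57 by chance:", share_50)
```

With only 5 points, a correlation of +0.57 turns up by chance fairly often even when nothing is going on; with 50 points it almost never does, which is why the larger sample is more convincing.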
|
||
@vspace{1ex} | ||
|
||
* *Slope* is how much we predict the @vocab{response variable} will increase or decrease for each unit that the @vocab{explanatory variable} increases. In our example, a slope of 0.5 would mean "we predict that each additional year of age means an extra half-week in the shelter". (What would a slope of 3 mean?) | ||
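Here is a short Python sketch of what the slope means for predictions. The function name `predict_weeks` and the intercept of 2.0 are hypothetical values chosen for illustration; only the slopes of 0.5 and 3 come from the text above.

```python
def predict_weeks(age, slope, intercept):
    """Predicted weeks in the shelter for an animal of the given age (in years)."""
    return slope * age + intercept


INTERCEPT = 2.0  # hypothetical starting value, not taken from real data

# With a slope of 0.5, one extra year of age adds half a week to the prediction.
three_year_old = predict_weeks(3, 0.5, INTERCEPT)
four_year_old = predict_weeks(4, 0.5, INTERCEPT)
extra_weeks = four_year_old - three_year_old

# With a slope of 3, one extra year of age adds three weeks instead.
extra_weeks_steeper = predict_weeks(4, 3, INTERCEPT) - predict_weeks(3, 3, INTERCEPT)

print("slope 0.5:", extra_weeks, "extra weeks per year of age")
print("slope 3:  ", extra_weeks_steeper, "extra weeks per year of age")
```

The intercept shifts every prediction by the same amount, so the difference between two predictions depends only on the slope.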
|
||
* *Sample size matters!* The number of data values is also relevant. We'd be more convinced of a positive relationship in general between cat age and time to adoption if a correlation of +0.57 were based on 50 cats instead of 5. | ||
[.underline]#*Correlation is not causation!*# Correlation only suggests that two variables are related. It does not tell us if one causes the other. For example, hot days are correlated with people running their air conditioners, but air conditioners do not cause hot days! |