more workbook updates (see #2395)
flannery-denny committed Feb 14, 2025
1 parent abf8f9d commit 145236c
Showing 17 changed files with 213 additions and 66 deletions.
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,7 @@ body.LessonNotes li {

@vspace{1ex}

*You've learned that functions are _machines that consume and produce data_.* +
*You've learned that functions are _machines that consume and produce values_.* +

@vspace{1ex}

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -34,9 +34,28 @@ Then compute:

Make box plots for each family's age distribution on the number lines below. _Hint: Plot the 5-Number Summaries, draw a box around the IQR (from Q1 to Q3), let the median split the box into 2 parts, and add whiskers from the box to the minimum and maximum values._
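
The hint above can be checked numerically. The following is a minimal Python sketch (not part of the workbook, which uses Pyret and CODAP) that computes a 5-Number Summary and the IQR. The `ages` list and the function name `five_number_summary` are hypothetical, and the quartiles are assumed to be the medians of the lower and upper halves, with the middle value excluded when the count is odd; other tools may use slightly different quartile rules.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of at least 2 numbers.

    Assumption: Q1 and Q3 are the medians of the lower and upper halves,
    leaving out the middle value when the count is odd.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]            # values before the median position
    upper = ordered[half + (n % 2):]  # values after the median position
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])

# Hypothetical ages, not taken from the workbook's families
ages = [2, 5, 9, 14, 30, 41, 63]
summary = five_number_summary(ages)
iqr = summary[3] - summary[1]  # width of the box: Q3 minus Q1
print(summary, iqr)
```

The box of the plot spans `summary[1]` to `summary[3]`, and the whiskers reach `summary[0]` and `summary[4]`.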

@n Ledet: @ifnotsoln{@image{../images/blank-0to80-num-line-numbered.png}} @ifsoln{@image{../images/ledet.png}}
@vspace{1ex}

@teacher{The student version of this page will have two number lines with Ledet above Watson.}

@ifnotsoln{

@n Ledet: @image{../images/blank-0to80-num-line-numbered.png}

@n Watson: @image{../images/blank-0to80-num-line-numbered.png}

}

@ifsoln{
[cols="1a,1a", options="header", stripes="none"]
|===
| Ledet
| Watson

@n Watson: @ifnotsoln{@image{../images/blank-0to80-num-line-numbered.png}} @ifsoln{@image{../images/watson.png}}
|@image{../images/ledet-pyret.png}
|@image{../images/watson-pyret.png}
|===
}

== Compare and Contrast

Expand Down
Original file line number Diff line number Diff line change
@@ -1,27 +1,70 @@
= Correlations in Scatter Plots in a Nutshell
= Scatter Plots in a Nutshell

*Scatter Plots* can be used to show a relationship between two quantitative columns. Each row in the dataset is represented by a point, with one column providing the x-value and the other providing the y-value.
++++
<style>
body.LessonNotes li {
margin-bottom: 1px;
}
</style>
++++

The resulting “point cloud” makes it possible to look for a relationship between those two columns.
== Relationships Between Two Quantitative Columns

- If the points in a scatter plot appear to follow a straight line, it suggests that a linear relationship exists between those two columns. A number called a *correlation* can be used to summarize this relationship.
@vspace{1ex}

- @math{r} is the name of the *correlation statistic*. The @math{r}-value will always fall between −1 and +1. The sign tells us whether the correlation is positive or negative. Distance from 0 tells us the strength of the correlation.
** −1 is the strongest possible negative correlation.
** +1 is the strongest possible positive correlation.
** 0 means no correlation.
** ±0.65 or ±0.70 or more is typically considered a "strong correlation".
** ±0.35 and ±0.65 is typically considered “moderately correlated”.
** Anything less than about ±0.25 or ±0.35 may be considered weak.
** *However, these cutoffs are not an exact science!* In some contexts an @math{r}-value of ±0.50 might be considered impressively strong!
Scatter plots can be used to look for relationships between columns. Each row in the dataset is represented by a point, with one column providing the x-value (@vocab{explanatory variable}) and the other providing the y-value (@vocab{response variable}). The resulting “point cloud” makes it possible to look for a relationship between those two columns.

@vspace{1ex}

- The correlation is *positive* if the point cloud slopes up as it goes farther to the right. This means larger y-values tend to go with larger x-values. The correlation is *negative* if the point cloud slopes down as it goes farther to the right.
- _Form_

- It is a *strong* correlation if the points are tightly clustered around a line. In this case, knowing the x-value gives us a pretty good idea of the y-value. It is a *weak* correlation if the points are loosely scattered and the y-value doesn't depend much on the x-value.
* If the points in a scatter plot appear to follow a straight line, it suggests that a @vocab{linear relationship} exists between those two columns.
* Relationships may take other forms (U-shaped, for example). If they aren't linear, it won't make sense to look for a correlation.
* Sometimes there will be no relationship at all between two variables.

- Points that do not fit the trend line in a scatter plot are called *unusual observations*.
@vspace{1ex}

- We graphically summarize this relationship by drawing a straight line through the data cloud, so that the vertical distance between the line and all the points taken together is as small as possible. This line is called the *line of best fit* and allows us to predict y-values based on x-values.
== Line of Best Fit

- [.underline]#*Correlation is not causation!*# Correlation only suggests that two column variables are related, but does not tell us if one causes the other. For example, hot days are correlated with people running their air conditioners, but air conditioners do not cause hot days!
@vspace{1ex}

@vocab{Linear Relationships} can be graphically summarized by drawing a straight line through the data cloud, so that the vertical distance between the line and all the points taken together is as small as possible. This allows us to predict y-values (the @vocab{response variable}) based on x-values (the @vocab{explanatory variable}). Points that do not fit the trend line in a scatter plot are called *unusual observations*.

@vspace{1ex}

- _Direction_

* The correlation is *positive* if the point cloud slopes up as it goes farther to the right. This means larger y-values tend to go with larger x-values.
* The correlation is *negative* if the point cloud slopes down as it goes farther to the right.

- _Strength_

* It is a *strong* correlation if the points are tightly clustered around a line. In this case, knowing the x-value gives us a pretty good idea of the y-value.
* It is a *weak* correlation if the points are loosely scattered and the y-value doesn't depend much on the x-value.

*Linear Regression* is a way of computing the *line of best fit*. (Want details? It minimizes the _sum of the squares_ of the vertical distances from the points to the line. There's a reason we use computers to do this!)
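
For readers who do want the details, here is a minimal Python sketch of the least-squares idea (an illustration outside the workbook's Pyret and CODAP tools). The function names and the `ages`/`weeks` values are hypothetical. The slope and intercept use the standard closed-form least-squares formulas, and `sum_of_squares` measures the quantity being minimized.

```python
def line_of_best_fit(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How x and y vary together, and how x varies on its own
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def sum_of_squares(xs, ys, slope, intercept):
    """Sum of squared vertical distances from the points to the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data: explanatory variable (age) and response variable (weeks)
ages = [1, 2, 3, 4, 5]
weeks = [2, 4, 5, 4, 5]

slope, intercept = line_of_best_fit(ages, weeks)
best = sum_of_squares(ages, weeks, slope, intercept)
# Nudging the slope away from the fitted value makes the total larger
nudged = sum_of_squares(ages, weeks, slope + 0.1, intercept)
print(slope, intercept, best, nudged)
```

Any other line through the same points gives a sum of squares at least as large as `best`, which is what "as small as possible" means here.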

@vspace{1ex}

== Summarizing Correlations with @math{r}-values

@vspace{1ex}

The @vocab{correlation} between two quantitative columns can be summarized in a single number, the @math{r}-value.

- The sign tells us whether the correlation is positive or negative.
- Distance from 0 tells us the strength of the correlation.
- Here is how we might interpret some specific @math{r}-values:
* −1 is the strongest possible negative correlation.
* +1 is the strongest possible positive correlation.
* 0 means no correlation.
* ±0.65 or ±0.70 or more is typically considered a "strong correlation".
* ±0.35 to ±0.65 is typically considered “moderately correlated”.
* Anything less than about ±0.25 or ±0.35 may be considered weak.

_Note: These cutoffs are not an exact science!_ In some contexts an @math{r}-value of ±0.50 might be considered impressively strong! And sample size matters! We'd be more convinced of a positive relationship in general between cat age and time to adoption if a correlation of +0.57 were based on 50 cats instead of 5.
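
As a rough illustration of where an @math{r}-value comes from, the Python sketch below (not part of the workbook) computes the standard Pearson correlation and labels it. The function names and the data are hypothetical, and the 0.65 and 0.35 cutoffs in `describe` are just one reading of the approximate ranges listed above.

```python
from math import sqrt

def r_value(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def describe(r):
    """Label direction and strength. The cutoffs are assumptions, not exact rules."""
    size = abs(r)
    if size >= 0.65:
        strength = "strong"
    elif size >= 0.35:
        strength = "moderate"
    else:
        strength = "weak"
    if r > 0:
        direction = "positive"
    elif r < 0:
        direction = "negative"
    else:
        direction = "no direction"
    return strength + ", " + direction

# Hypothetical data: explanatory variable (age) and response variable (weeks)
ages = [1, 2, 3, 4, 5]
weeks = [2, 4, 5, 4, 5]
r = r_value(ages, weeks)
print(round(r, 3), describe(r))
```

With only five points, as the note above says, even a fairly large @math{r}-value is weak evidence of a general relationship.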

@vspace{1ex}


[.underline]#*Correlation is not causation!*# Correlation only suggests that two variables are related. It does not tell us if one causes the other. For example, hot days are correlated with people running their air conditioners, but air conditioners do not cause hot days!
Original file line number Diff line number Diff line change
Expand Up @@ -22,8 +22,8 @@ The code below draws a solid, green triangle, using 5x the age of `old-row` as t
@n An animal is "young" if it is less than 1 year old. Is `female-row` young? +
@fitb{}{@ifsoln{`female-row["age"] < 1`}}

@n Using `hermaphrodite-row`, draw a solid, blue circle where the radius is the `age` of the row. +
@fitb{}{@ifsoln{`circle(hermaphrodite-row["age"], "solid", "blue")`}}
@n Using `male-row`, draw a solid, blue circle where the radius is the `age` of the row. +
@fitb{}{@ifsoln{`circle(male-row["age"], "solid", "blue")`}}

@n If every week is 7 days, how many _days_ did it take for our `rabbit-row` to be adopted? +
@fitb{}{@ifsoln{`rabbit-row["weeks"] * 7`}}
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -17,8 +17,8 @@ The numbers tell the computer which Row we want from the Table. _Note: Rows are

For example:
```
row-n(animals-table, 0) # the first row
row-n(animals-table, 2) # the third row
row-n(animals-table, 0) # the first row (Sasha)
row-n(animals-table, 2) # the third row (Mittens)
```

When we define these rows, it's most useful to name them based on their _properties_ (rather than their identifiers, e.g. snuffles):
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -22,7 +22,11 @@ Try some other bin sizes (be sure to experiment with bigger and smaller bins!)
@n What shape emerges? @fitb{}{@ifsoln{right skew}}
@n What bin size gives you the best picture of the distribution? (Note: _ideally your histogram should have between 5 and 10 bars_) @fitb{}{@ifsoln{5}}
@n What bin size gives you a picture of the distribution with between 5 and 10 bins? @fitb{}{@ifsoln{5}}
@vspace{1ex}
@indented{@teacher{Be Prepared - Due to a bug in Google Charts, if students use a bin size of 5, 6, or 7, the histogram they get back will use a bin size of 5.}}
@n Are there any outliers? If so, are they high or low? @fitb{}{@ifsoln{high}}
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -28,7 +28,8 @@ For each of the scatter plots below, draw a *predictor line* that seems like the
!===
! *Direction*: ! @ifsoln-choice{Positive} ! Negative ! None
! *Strength*: ! @ifsoln-choice{Stronger} ! Weaker !
! estimated *r*: 3+! -0.9 @hspace{3em} -0.5 @hspace{3em} 0 @hspace{3em} 0.5 @hspace{3em} @ifsoln{*0.9*} @ifnotsoln{0.9}
4+! I would guess that *r* is closest to...
4+! @center{-1 @hspace{3em} -0.5 @hspace{3em} 0 @hspace{3em} 0.5 @hspace{3em} @ifsoln{*1*} @ifnotsoln{1}}
!===

| *B*
Expand All @@ -39,7 +40,8 @@ For each of the scatter plots below, draw a *predictor line* that seems like the
!===
! *Direction*: ! Positive ! @ifsoln-choice{Negative} ! None
! *Strength*: ! @ifsoln-choice{Stronger} ! Weaker !
! estimated *r*: 3+! -0.9 @hspace{3em} @ifnotsoln{-0.5} @ifsoln{*-0.5*} @hspace{3em} 0 @hspace{3em} 0.5 @hspace{3em} 0.9
4+! I would guess that *r* is closest to...
4+! @center{ -1 @hspace{3em} @ifnotsoln{-0.5} @ifsoln{*-0.5*} @hspace{3em} 0 @hspace{3em} 0.5 @hspace{3em} 1}
!===

| *C*
Expand All @@ -50,7 +52,8 @@ For each of the scatter plots below, draw a *predictor line* that seems like the
!===
! *Direction*: ! Positive ! @ifsoln-choice{Negative} ! None
! *Strength*: ! @ifsoln-choice{Stronger} ! Weaker !
! estimated *r*: 3+! @ifnotsoln{-0.9} @ifsoln{*-0.9*} @hspace{3em} -0.5 @hspace{3em} 0 @hspace{3em} 0.5 @hspace{3em} 0.9
4+! I would guess that *r* is closest to...
4+! @center{ @ifnotsoln{-1} @ifsoln{*-1*} @hspace{3em} -0.5 @hspace{3em} 0 @hspace{3em} 0.5 @hspace{3em} 1 }
!===

| *D*
Expand All @@ -61,7 +64,8 @@ For each of the scatter plots below, draw a *predictor line* that seems like the
!===
! *Direction*: ! @ifsoln-choice{Positive} ! Negative ! None
! *Strength*: ! @ifsoln-choice{Stronger} ! Weaker !
! estimated *r*: 3+! -0.9 @hspace{3em} @ifnotsoln{-0.5} @ifsoln{*0.5*} @hspace{3em} 0 @hspace{3em} 0.5 @hspace{3em} 0.9
4+! I would guess that *r* is closest to...
4+! @center{ -1 @hspace{3em} -0.5 @hspace{3em} 0 @hspace{3em} @ifnotsoln{0.5} @ifsoln{*0.5*} @hspace{3em} 1}
!===

|===
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,7 @@
[.linkInstructions]
You should already have @ifproglang{pyret}{plotted @show{(code '(lr-plot animals-table "name" "age" "weeks"))}}@ifproglang{codap}{created a Least Squares line with `Weeks` on the x-axis and `Age` on the y-axis} in the @starter-file{animals}.

@n What is the predictor function? @math{y =} @fitb{10em}{@ifsoln{0.78924}} @math{x +} @fitb{10em}{@ifsoln{2.30936}}
@n What is the predictor function? @math{y =} @fitb{10em}{@ifsoln{0.78924}} @math{x +} @fitb{10em}{@ifsoln{2.30936}} @hspace{10em} @math{r=}@fitb{}{}

@n What is the slope? @fitb{}{@ifsoln{0.78924}}

Expand Down Expand Up @@ -59,7 +59,7 @@ lr-plot(filter(animals-table, is-cat), "name", "age", "weeks")

@vspace{1ex}

@n What is the predictor function? @math{y =} @fitb{5em}{@ifsoln{0.23161}} @math{x +} @fitb{5em}{@ifsoln{2.48598}}
@n What is the predictor function? @math{y =} @fitb{10em}{@ifsoln{0.23161}} @math{x +} @fitb{10em}{@ifsoln{2.48598}} @hspace{10em} @math{r=}@fitb{}{}

@n What is the slope? @fitb{}{@ifsoln{0.23161}}

Expand Down
Original file line number Diff line number Diff line change
@@ -1,15 +1,70 @@
= Linear Regression in a Nutshell
= Scatter Plots in a Nutshell

++++
<style>
body.LessonNotes li {
margin-bottom: 1px;
}
</style>
++++

== Relationships Between Two Quantitative Columns

@vspace{1ex}

Scatter plots can be used to look for relationships between columns. Each row in the dataset is represented by a point, with one column providing the x-value (@vocab{explanatory variable}) and the other providing the y-value (@vocab{response variable}). The resulting “point cloud” makes it possible to look for a relationship between those two columns.

@vspace{1ex}

- _Form_

* If the points in a scatter plot appear to follow a straight line, it suggests that a @vocab{linear relationship} exists between those two columns.
* Relationships may take other forms (U-shaped, for example). If they aren't linear, it won't make sense to look for a correlation.
* Sometimes there will be no relationship at all between two variables.

@vspace{1ex}

== Line of Best Fit

@vspace{1ex}

@vocab{Linear Relationships} can be graphically summarized by drawing a straight line through the data cloud, so that the vertical distance between the line and all the points taken together is as small as possible. This allows us to predict y-values (the @vocab{response variable}) based on x-values (the @vocab{explanatory variable}). Points that do not fit the trend line in a scatter plot are called *unusual observations*.

@vspace{1ex}

* *We compute linear relationships to predict the future!* Well...sort of. Given a dataset, like ages of animals v. how long before they're adopted, we try to compute the relationship between `age` and `weeks` so that we can _predict_ how long a new animal might stay, based on their age.
- _Direction_

* The correlation is *positive* if the point cloud slopes up as it goes farther to the right. This means larger y-values tend to go with larger x-values.
* The correlation is *negative* if the point cloud slopes down as it goes farther to the right.

- _Strength_

* When we compute linear relationships, we're talking about *straight-line patterns* that appear on a scatter plot.
* It is a *strong* correlation if the points are tightly clustered around a line. In this case, knowing the x-value gives us a pretty good idea of the y-value.
* It is a *weak* correlation if the points are loosely scattered and the y-value doesn't depend much on the x-value.

*Linear Regression* is a way of computing the *line of best fit*. (Want details? It minimizes the _sum of the squares_ of the vertical distances from the points to the line. There's a reason we use computers to do this!)

@vspace{1ex}

* A scatter plot has an x-axis and a y-axis. When looking for relationships, the y-axis is called the @vocab{response variable}, and the x-axis is called the @vocab{explanatory variable}. In our example, we are trying to figure out how much of the `weeks` variable is _explained by_ the `age` variable.
== Summarizing Correlations with @math{r}-values

* *Linear Regression* is a way of computing the *line of best fit*, which tries to draw a line as close as possible to all the points. (Want details? It minimizes the _sum of the squares_ of the vertical distances from the points to the line. There's a reason we use computers to do this!)
@vspace{1ex}

The @vocab{correlation} between two quantitative columns can be summarized in a single number, the @math{r}-value.

- The sign tells us whether the correlation is positive or negative.
- Distance from 0 tells us the strength of the correlation.
- Here is how we might interpret some specific @math{r}-values:
* −1 is the strongest possible negative correlation.
* +1 is the strongest possible positive correlation.
* 0 means no correlation.
* ±0.65 or ±0.70 or more is typically considered a "strong correlation".
* ±0.35 to ±0.65 is typically considered “moderately correlated”.
* Anything less than about ±0.25 or ±0.35 may be considered weak.

_Note: These cutoffs are not an exact science!_ In some contexts an @math{r}-value of ±0.50 might be considered impressively strong! And sample size matters! We'd be more convinced of a positive relationship in general between cat age and time to adoption if a correlation of +0.57 were based on 50 cats instead of 5.

@vspace{1ex}

* *Slope* is how much we predict the @vocab{response variable} will increase or decrease for each unit that the @vocab{explanatory variable} increases. In our example, a slope of 0.5 would mean "we predict that each additional year of age means an extra half-week in the shelter". (What would a slope of 3 mean?)

* *Sample size matters!* The number of data values is also relevant. We'd be more convinced of a positive relationship in general between cat age and time to adoption if a correlation of +0.57 were based on 50 cats instead of 5.
[.underline]#*Correlation is not causation!*# Correlation only suggests that two variables are related. It does not tell us if one causes the other. For example, hot days are correlated with people running their air conditioners, but air conditioners do not cause hot days!
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,7 @@
@vspace{1ex}

* *Mean* is the average of all the numbers in a dataset.
* *Median* is a value that is smaller than half the dataset, and larger than the other half of a dataset . In an ordered list the median will either be the middle number or the average of the two middle numbers.
* *Median*: Half of the dataset will always be greater than or equal to the median. Half of the dataset will always be less than or equal to the median. In an ordered list, the median will either be the middle number or the average of the two middle numbers.
* *Mode(s)* of a dataset is the value (or values) occurring most often. When all of the values occur equally often, a dataset has no mode.
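
The three definitions above can be tried out on a small list. This is a minimal Python sketch (an illustration outside the workbook's own tools). The `data` values and the helper `modes` are hypothetical, and `modes` follows the convention stated above that a dataset whose values all occur equally often has no mode.

```python
from collections import Counter
from statistics import mean, median

def modes(values):
    """Return the most frequent value(s) of a non-empty list.

    Returns an empty list when every value occurs equally often (no mode).
    """
    counts = Counter(values)
    highest = max(counts.values())
    if all(count == highest for count in counts.values()):
        return []
    return sorted(value for value, count in counts.items() if count == highest)

# Hypothetical dataset with one unusually large value
data = [1, 2, 2, 3, 10]
m = median(data)
at_or_above = sum(1 for v in data if v >= m)  # values greater than or equal to the median
at_or_below = sum(1 for v in data if v <= m)  # values less than or equal to the median
print(mean(data), m, modes(data), at_or_above, at_or_below)
```

Because the value 2 repeats, more than half of the list is at or above the median and more than half is at or below it, which is why the definition says "greater than or equal to" rather than "greater than".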

@vspace{1ex}
Expand Down