\(\large{Z = \frac{\hat{\pi} - \pi_0}{\sqrt{\frac{\pi_0(1-\pi_0)}{n}}} \sim N(0,1)}\),
\(\large{100(1-\alpha)\% \space CI = \hat{\pi} \pm z_{\alpha / 2}\sqrt{\frac{\hat{\pi}(1-\hat{\pi})}{n}}}\)
\(\large{Z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \sim N(0,1)}\),
\(\large{100(1-\alpha)\% \space CI = \bar{x} \pm z_{\alpha / 2}{\frac{\sigma}{\sqrt{n}}}}\)
where \(\mu\) and \(\sigma\) are known population parameters and the distribution of \(\bar{X}\) is normal.
Note: If \(X\) itself is normal, \(\bar{X}\) is exactly normal; otherwise, by the central limit theorem, \(\bar{X}\) is approximately normal for large \(n\) regardless of the underlying distribution of \(X\).
\(\large{T = \frac{\bar{x} - \mu}{s/\sqrt{n}} \sim t_{n-1}}\),
\(\large{100(1-\alpha)\% \space CI = \bar{x} \pm t_{\alpha/2, \space n-1}{\frac{s}{\sqrt{n}}}}\)
where \(\mu\) is a known parameter and \(s\) is the sample standard deviation of the observed \(x_i\)'s.
Note: As \(n\) gets larger, the t-distribution approaches the normal distribution. If \(n\) is small, then the underlying distribution of \(X\) must be normal. Also, when using the table, look up \(t_{\alpha}\) rather than \(t_{\alpha/2}\), since Patrick's t-table already accounts for both tails.
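The convergence of the t-distribution to the normal can be seen directly by comparing critical values as the degrees of freedom grow (a small illustrative sketch; the specific df values are arbitrary):

```r
# t critical values shrink toward the normal critical value as df grows
sapply(c(5, 10, 30, 100, 1000), function(df) qt(0.975, df))
# approximately 2.57 2.23 2.04 1.98 1.96
qnorm(0.975)
# approximately 1.96
```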
Suppose the incidence rate of myocardial infarction per year was 0.005 among males age 45-54 in 1970. For 1 year starting in 1980, 5000 males age 45-54 were followed, and 15 new myocardial infarction cases were observed.
From the central limit theorem, we know that the sample proportion approximately follows a normal distribution (if the sample size is reasonably large), so we can perform a z-test on this data.
Conduct a hypothesis test to determine if the true myocardial infarction rate changed from 1970 to 1980.
\(H_0: \pi = 0.005\) vs. \(H_A: \pi \neq 0.005\)
\(\pi_0 = 0.005\)
\(\hat{\pi} = \frac{15}{5000} = 0.003\)
\(n = 5000\)
\(SE = \sqrt{\frac{\pi_0(1-\pi_0)}{n}}\)
\(SE = \sqrt{\frac{0.005(1-0.005)}{5000}}\)
\(SE = 0.000997\)
Remember that to compute a test statistic we use:
\(z = \frac{\hat{\pi}-\pi_0}{SE}\)
\(z = \frac{0.003-0.005}{0.000997}\)
\(z = -2.01\)
Find the 2-tailed probability by looking up this z-score in the z-table:
\(p = 2(0.022) = 0.044\)
We can use the ‘pnorm’ function to calculate this p-value in R.
2*pnorm(2.01,mean=0,sd=1,lower.tail=FALSE)
## [1] 0.04443119
# OR
2*(1-pnorm(2.01,mean=0,sd=1))
## [1] 0.04443119
We can compare this to what we would get doing the exact test using binom.test().
binom.test(15, 5000, p = 0.005)
##
## Exact binomial test
##
## data: 15 and 5000
## number of successes = 15, number of trials = 5000, p-value = 0.04422
## alternative hypothesis: true probability of success is not equal to 0.005
## 95 percent confidence interval:
## 0.001680019 0.004943224
## sample estimates:
## probability of success
## 0.003
From the p-value given by the exact binomial test (p = 0.0442), we can see that the normal-approximation p-value (p = 0.0444) is virtually identical.
Why do you think this is, especially when p is so close to 0?
The CLT approximation works well when \(n\) is fairly large and \(\pi\) is not close to 0 or 1. Here, even though \(\hat{\pi}\) is close to 0, the extremely large sample size \(n\) makes the approximation very accurate.
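One common rule of thumb (not stated in these notes, but widely used in intro texts) says the normal approximation is reasonable when \(n\pi_0(1-\pi_0) \geq 5\); it is easily satisfied here:

```r
# Rule-of-thumb check for the normal approximation
n <- 5000
pi0 <- 0.005
n * pi0 * (1 - pi0)
# [1] 24.875  -- well above 5, so the approximation is safe
```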
Based on this data, there is significant evidence to suggest that the true myocardial infarction rate of males age 45-54 decreased from 1970 to 1980 (p = 0.044).
Now we want to create a 95% confidence interval for \(\pi\). Interpret the interval.
Note that although \(\hat{\pi}\) is very close to 0, since \(n\) is very large, the approximation will still be precise.
Remember that now standard error is based on \(\hat{\pi}\) and becomes:
\(SE = \sqrt{\frac{\hat{\pi}(1-\hat{\pi})}{n}}\)
\(SE = \sqrt{\frac{0.003(1-0.003)}{5000}}\)
\(SE = 0.000773\)
We will have to find \(z_{\alpha/2}\) using the z-table. What is our \(\alpha\) for a 95% confidence interval?
\(z_{\alpha/2} = z_{0.025} = 1.96\) (from table)
Remember that the equation for the confidence interval is:
\(\hat{\pi} \pm z_{\alpha/2}*SE\)
\(0.003 \pm 1.96*0.000773\)
95% CI: (0.0015, 0.0045)
pi_hat <- 15/5000
n <- 5000
SE <- sqrt((pi_hat * (1 - pi_hat))/n)
pi_hat + qnorm(c(.025,.975)) * SE
## [1] 0.001484097 0.004515903
We can say with 95% confidence that this interval contains the true myocardial infarction rate in males 45-54 in 1980.
Interpretation Note: Remember that when we say “95% confidence” about an interval, this does NOT mean that there is a 95% probability of the true parameter being in the interval. It means that if we were to repeat this experiment many times, 95% of the intervals constructed in this manner would contain the true parameter. It’s a bit of a touchy subject, so overall just be careful to not say “probability” when you’re interpreting confidence intervals.
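A quick simulation makes the repeated-sampling interpretation concrete (an illustrative sketch; the true proportion and sample size below are arbitrary choices, not from this example):

```r
set.seed(1)
pi_true <- 0.3   # assumed true proportion (arbitrary for the demo)
n <- 200         # sample size per repeated experiment
reps <- 10000    # number of repeated experiments

# For each experiment, build a 95% CI and check whether it covers pi_true
covered <- replicate(reps, {
  p_hat <- rbinom(1, n, pi_true) / n
  se <- sqrt(p_hat * (1 - p_hat) / n)
  ci <- p_hat + qnorm(c(.025, .975)) * se
  ci[1] <= pi_true && pi_true <= ci[2]
})
mean(covered)  # close to 0.95
```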
The distribution of weights for the population of males in the United States is approximately normal. We believe the mean \(\mu\) = 172.2 lbs. We conduct an experiment with a sample size of 50, and we find our sample mean to be 180 lbs and the sample standard deviation to be 30 lbs. Conduct a hypothesis test to determine if the true mean is 172.2 lbs based on our data. How would you interpret the result?
\(H_0: \mu = 172.2\) vs. \(H_A: \mu \neq 172.2\)
\(\mu = 172.2\)
\(\hat{\mu} = 180\)
\(s = 30\)
\(n = 50\)
\(df = n-1 = 49\)
To compute a test statistic we use:
\(t = \frac{\hat{\mu}-\mu}{s/\sqrt{n}}\)
\(t = \frac{180-172.2}{30/\sqrt{50}}\)
\(t = 1.84\)
Find 2-tailed probability using this test statistic and Student’s t-table:
\((0.05 < p < 0.1)\)
(Note that we don’t need to multiply this by 2 since Patrick’s t-table already accounts for both tails)
We can use the ‘pt’ function to calculate this p-value in R.
2*pt(1.84, df=49,lower.tail=FALSE)
## [1] 0.07182936
Notice that this p-value fits with what we were able to calculate by hand.
There is not significant evidence to suggest that the true mean weight of males in the United States is different from 172.2 lbs, based on this data (0.05 < p < 0.1).
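Equivalently, the decision can be made with the critical-value approach (a small sketch): we fail to reject \(H_0\) at \(\alpha = 0.05\) because \(|t|\) falls below \(t_{0.025, \space 49}\).

```r
# Critical-value approach: reject H0 only if |t| exceeds qt(.975, 49)
t_stat <- (180 - 172.2) / (30 / sqrt(50))
t_crit <- qt(0.975, df = 49)
c(t_stat, t_crit)     # roughly 1.84 vs 2.01
abs(t_stat) > t_crit  # FALSE, so fail to reject H0
```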
Now we want to create a 95% confidence interval for \(\mu\). Interpret the interval.
Recall the formula to create a confidence interval is:
\(\hat{\mu} \pm t_{\alpha/2}*SE\)
We can then find \(t_{\alpha/2}\), plug in our given values, and calculate the interval.
\(t_{\alpha/2} = 2.01\) (from table)
\(180 \pm 2.01*\frac{30}{\sqrt{50}}\)
(171.4, 188.5)
mu <- 172.2
mu_hat <- 180
s <- 30
n <- 50
mu_hat + qt(c(.025,.975), n-1)*s/sqrt(n)
## [1] 171.4741 188.5259
# which is the same as
180 + qt(c(.025,.975), 49)*30/sqrt(50)
## [1] 171.4741 188.5259
We can say with 95% confidence that this interval contains the true mean weight of males in the US.
Suppose that the average IQ is 95. Using the lead-IQ dataset as a sample, perform a test to see if the children in this dataset have an average IQ of 95. Also, create a 95% confidence interval for the mean IQ based on this data. (https://iowabiostat.github.io/data-sets/lead-iq/lead-iq.html)
\(H_0: \mu = 95\) vs. \(H_A: \mu \neq 95\)
leadIQ <- read.delim('https://raw.githubusercontent.com/IowaBiostat/data-sets/main/lead-iq/lead-iq.txt')
mu <- 95
mu.hat <- mean(leadIQ$IQ)
s <- sd(leadIQ$IQ)
n <- length(leadIQ$IQ)
df <- n-1
t <- (mu.hat-mu)/(s/sqrt(n))
2*pt(t,df) # t is negative here; in general, use 2*pt(-abs(t), df) for a two-sided p-value
## [1] 0.002981458
This gives a p-value of 0.00298, which means there is strong evidence to suggest that the true mean IQ of the children represented by this dataset is not 95.
mu.hat+qt(c(.025,.975),n-1)*s/sqrt(n)
## [1] 88.52022 93.64107
This gives a confidence interval of 88.52 to 93.64. We could also use the ‘t.test’ function (as shown below) for this dataset, and it would provide us with both the p-value and the 95% confidence interval. This function works similarly to the ‘binom.test’ function.
t.test(IQ ~ 1, leadIQ, mu=95)
##
## One Sample t-test
##
## data: IQ
## t = -3.03, df = 123, p-value = 0.002981
## alternative hypothesis: true mean is not equal to 95
## 95 percent confidence interval:
## 88.52022 93.64107
## sample estimates:
## mean of x
## 91.08065
Suppose that the current commonly used screening test for breast cancer has a sensitivity of 68%. A new screening test was used on 200 breast cancer patients, of whom 147 tested positive. Construct a 95% confidence interval for the new test’s sensitivity, and test whether its sensitivity differs from 68%.
p <- 147/200
p + c(-1, 1) * qnorm(0.975) * sqrt((p*(1-p))/200)
## [1] 0.6738355 0.7961645
p_hat <- 147/200
p0 <- 0.68
z <- (p_hat - p0) / (sqrt((p0*(1-p0))/200))
2*pnorm(z, lower.tail = F) # z is positive here; in general, use 2*pnorm(abs(z), lower.tail = FALSE)
## [1] 0.09542845
binom.test(x = 147, n = 200, p = 0.68)
##
## Exact binomial test
##
## data: 147 and 200
## number of successes = 147, number of trials = 200, p-value = 0.1111
## alternative hypothesis: true probability of success is not equal to 0.68
## 95 percent confidence interval:
## 0.6681299 0.7947609
## sample estimates:
## probability of success
## 0.735
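For comparison, base R’s prop.test() with correct = FALSE reproduces the z-test above: its chi-squared statistic equals \(z^2\), so its p-value matches the normal-approximation p-value (p ≈ 0.0954) rather than the exact one.

```r
# Without the continuity correction, X-squared = z^2, and the p-value
# equals 2*pnorm(z, lower.tail = FALSE) from the hand calculation
prop.test(x = 147, n = 200, p = 0.68, correct = FALSE)
```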
A patient recently diagnosed with Alzheimer’s disease takes a cognitive abilities test. The mean of this test is \(\mu = 52\) and the variance is \(\sigma^2 = 25\). Assume the cognitive abilities test scores are normally distributed. Find the answers to the following questions with the Z distribution table, your calculators, or in R. Remember the Z table gives you the left-tailed probability.
# P(47 < X < 56): probability a single score falls between 47 and 56
pnorm(47, 52, 5, lower.tail = FALSE) - pnorm(56, 52, 5, lower.tail = FALSE)
## [1] 0.6294893
# P(X-bar > 60): probability the mean score of n = 9 patients exceeds 60
pnorm(60, 52, 5/sqrt(9), lower.tail = F)
## [1] 7.933282e-07
# P(X < 43): probability a single score is below 43
pnorm(43, 52, 5)
## [1] 0.03593032
# Score with upper-tail probability 0.271 (the 72.9th percentile)
qnorm(.271, 52, 5, lower.tail = FALSE)
## [1] 55.04896
# P(at least 2 of 25 patients score below 43), using p ~ 0.036 from above
pbinom(1, 25, 0.036, lower.tail = FALSE)
## [1] 0.2267949
The weights of pumpkins at Wilson’s orchard are known to follow a normal distribution with population mean \(\mu = 18\) lbs. and variance \(\sigma^2 = 16\) lbs\(^2\). Each year Wilson’s orchard randomly selects 4 pumpkins and measures the mean weight of the pumpkins.
\(\bar{X} \sim N(\mu, \sigma^2/n) = N(18, 16/4) = N(18, 4)\)
# P(X-bar < 16): probability the mean weight is below 16 lbs
pnorm(16, 18, sd = sqrt(4))
## [1] 0.1586553
# P(X-bar > 21): probability the mean weight exceeds 21 lbs
pnorm(21, 18, sqrt(4), lower.tail = FALSE)
## [1] 0.0668072
# P(14 < X-bar < 20): probability the mean weight falls between 14 and 20 lbs
(p <- pnorm(14, 18, 2, lower.tail = FALSE) - pnorm(20, 18, 2, lower.tail = FALSE))
## [1] 0.8185946
# P(the yearly mean falls between 14 and 20 lbs in at least 2 of 5 years)
pbinom(1, 5, p, lower.tail = FALSE)
## [1] 0.9953711