Objectives

Note: A hat on a Greek letter indicates an estimator, so for example, when you see \(\hat{\mu}\), this is the same thing as \(\bar{x}\).

Normal Distribution and T Distribution Key Formulas:

    • \(\large{Z = \frac{\hat{\pi} - \pi_0}{\sqrt{\frac{\pi_0(1-\pi_0)}{n}}} \sim N(0,1)}\),

    • \(\large{100(1-\alpha)\% \space CI = \hat{\pi} \pm z_{\alpha / 2}\sqrt{\frac{\hat{\pi}(1-\hat{\pi})}{n}}}\)

    where \(\pi_0\) is the proportion under the null hypothesis, \(\hat{\pi}\) is the sample proportion, and \(n\) is the sample size. The normal approximation is accurate if \(n\) is large and \(\pi_0\) is not close to 0 or 1. (Note that in the case of the CI, the same condition applies to \(\hat{\pi}\): it can’t be close to 0 or 1.)
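
The two proportion formulas above can be sketched in Python using only the standard library. The function names and the data (60 successes in 100 trials, null proportion 0.5) are hypothetical and chosen only for illustration. Note that the test uses \(\pi_0\) in the standard error, while the interval uses \(\hat{\pi}\).

```python
from math import sqrt
from statistics import NormalDist


def proportion_z_test(successes, n, pi_0):
    """Return (z, two-sided p-value) for H0: pi = pi_0."""
    pi_hat = successes / n
    se_null = sqrt(pi_0 * (1 - pi_0) / n)  # standard error under H0
    z = (pi_hat - pi_0) / se_null
    p_value = 2 * NormalDist().cdf(-abs(z))
    return z, p_value


def proportion_ci(successes, n, alpha=0.05):
    """Return the 100(1 - alpha)% interval based on the sample proportion."""
    pi_hat = successes / n
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z_crit * sqrt(pi_hat * (1 - pi_hat) / n)  # uses pi_hat, not pi_0
    return pi_hat - margin, pi_hat + margin


# Hypothetical data: 60 successes in 100 trials, null proportion 0.5
z, p = proportion_z_test(60, 100, 0.5)
low, high = proportion_ci(60, 100)
print(f"z = {z:.3f}, p = {p:.4f}, 95% CI = ({low:.3f}, {high:.3f})")
```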


    • \(\large{Z = \frac{x - \mu}{\sigma} \sim N(0,1)}\),
    where \(\mu\) and \(\sigma\) are known population parameters and the distribution of X is normal. (Note that this does not have a confidence interval. We would just find the middle 95% of the distribution directly.)
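
As a minimal sketch of this single-observation case, the snippet below standardizes one value and finds the middle 95% of the distribution directly. The parameters (\(\mu = 100\), \(\sigma = 15\)) and the observation (130) are hypothetical, not taken from this handout.

```python
from statistics import NormalDist

mu, sigma = 100.0, 15.0  # hypothetical known population parameters
x = 130.0                # hypothetical single observation

z = (x - mu) / sigma                  # standardize the observation
upper_tail = 1 - NormalDist().cdf(z)  # P(X > x)

# Middle 95% of the distribution: its 2.5th and 97.5th percentiles
population = NormalDist(mu, sigma)
lower = population.inv_cdf(0.025)
upper = population.inv_cdf(0.975)

print(f"z = {z:.2f}, P(X > {x}) = {upper_tail:.4f}")
print(f"middle 95%: ({lower:.1f}, {upper:.1f})")
```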


    • \(\large{Z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \sim N(0,1)}\),

    • \(\large{100(1-\alpha)\% \space CI = \bar{x} \pm z_{\alpha / 2}{\frac{\sigma}{\sqrt{n}}}}\)

    where \(\mu\) and \(\sigma\) are known population parameters and the distribution of \(\bar{X}\) is normal.

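    Note: By the central limit theorem, \(\bar{X}\) is approximately normal when \(n\) is large, regardless of the underlying distribution of X. When \(n\) is small, X itself must be normal for this to hold.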
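
The sample-mean version differs from the single-observation case only in the standard error, \(\sigma/\sqrt{n}\). The sketch below uses hypothetical values (\(\mu = 100\), \(\sigma = 15\), \(n = 25\), \(\bar{x} = 106\)) that do not come from this handout.

```python
from math import sqrt
from statistics import NormalDist

mu_0, sigma = 100.0, 15.0  # hypothetical null mean and known population SD
n, x_bar = 25, 106.0       # hypothetical sample size and sample mean

se = sigma / sqrt(n)       # standard error of the sample mean
z = (x_bar - mu_0) / se
p_value = 2 * NormalDist().cdf(-abs(z))  # two-sided p-value

z_crit = NormalDist().inv_cdf(0.975)     # z_{alpha/2} for alpha = 0.05
ci_low, ci_high = x_bar - z_crit * se, x_bar + z_crit * se

print(f"z = {z:.2f}, p = {p_value:.4f}")
print(f"95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```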


    • \(\large{T = \frac{\bar{x} - \mu}{s/\sqrt{n}} \sim t_{n-1}}\),

    • \(\large{100(1-\alpha)\% \space CI = \bar{x} \pm t_{\alpha/2, \space n-1}{\frac{s}{\sqrt{n}}}}\)

    where \(\mu\) is a known parameter and \(s\) is the sample standard deviation of the observed \(x_i\)'s.

    Note: As \(n\) gets larger, the t-distribution approaches the standard normal distribution. If \(n\) is small, then the underlying distribution of \(X\) must be normal. Also, when using the table, use \(t_{\alpha}\).
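
To illustrate how the t-distribution approaches the normal as the degrees of freedom grow, the sketch below compares the upper-tail probability beyond 2.0 for several degrees of freedom. Python's standard library has no t-distribution, so the tail area is approximated here by integrating the t density with Simpson's rule. The helper names `t_pdf` and `t_upper_tail` are hypothetical, and in practice a statistics package or the t table would be used instead.

```python
from math import exp, lgamma, log, pi
from statistics import NormalDist


def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_const = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_const) * (1 + t * t / df) ** (-(df + 1) / 2)


def t_upper_tail(t, df, steps=2000):
    """Approximate P(T > t) for t >= 0 with Simpson's rule (steps must be even)."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_0_to_t = total * h / 3
    return 0.5 - area_0_to_t  # the density is symmetric about 0


z_tail = 1 - NormalDist().cdf(2.0)
for df in (5, 30, 1000):
    print(f"df = {df:4d}: P(T > 2) = {t_upper_tail(2.0, df):.4f}")
print(f"normal:    P(Z > 2) = {z_tail:.4f}")
```

The t tail is heavier than the normal tail for small degrees of freedom and shrinks toward it as the degrees of freedom increase.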


Example 1: Z-test for categorical data

Suppose the incidence rate of myocardial infarction per year was 0.005 among males aged 45-54 in 1970. For 1 year starting in 1980, 5000 males aged 45-54 were followed, and 15 new myocardial infarction cases were observed.

From the central limit theorem, we know that the sample proportion approximately follows a normal distribution (if the sample size is reasonably large), so we can perform a z-test on this data.

Conduct a hypothesis test to determine if the true myocardial infarction rate changed from 1970 to 1980.

How would you interpret the result?

Creating a confidence interval (z)

Now we want to create a 95% confidence interval for \(\pi\). Interpret the interval.

Note that although \(\hat{\pi}\) is very close to 0, \(n\) is very large, so the approximation will still be accurate.
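
One way to carry out Example 1 is sketched below in Python, assuming a two-sided test of \(H_0: \pi = 0.005\) at the 5% level (the significance level is an assumption; the example does not state one). The data values come from the example, and the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

n, cases, pi_0 = 5000, 15, 0.005  # values from Example 1
pi_hat = cases / n                # sample proportion, 0.003

# z-test: the standard error uses the null proportion pi_0
z = (pi_hat - pi_0) / sqrt(pi_0 * (1 - pi_0) / n)
p_value = 2 * NormalDist().cdf(-abs(z))

# 95% confidence interval: the standard error uses pi_hat
z_crit = NormalDist().inv_cdf(0.975)
margin = z_crit * sqrt(pi_hat * (1 - pi_hat) / n)
ci = (pi_hat - margin, pi_hat + margin)

print(f"z = {z:.3f}, two-sided p = {p_value:.3f}")
print(f"95% CI = ({ci[0]:.4f}, {ci[1]:.4f})")
```

Under these assumptions the statistic is close to -2.0 with a two-sided p-value just under 0.05, and the interval lies entirely below 0.005, so both approaches point to a lower rate in 1980.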


Example 2: T-test for continuous data

The distribution of weights for the population of males in the United States is approximately normal. We believe the mean \(\mu\) = 172.2. We conduct an experiment with a sample size of 50, and we find our sample mean to be 180 and the sample standard deviation to be 30. Conduct a hypothesis test to determine if the true mean is 172.2 based on our data. How would you interpret the result?
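
As a sketch of the calculation, assume a two-sided test of \(H_0: \mu = 172.2\) at the 5% level (the level is an assumption; the example does not state one). Plugging the sample values into the t formula, and taking the critical value with 49 degrees of freedom approximately from a t table:

```latex
T = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}
  = \frac{180 - 172.2}{30/\sqrt{50}}
  \approx \frac{7.8}{4.243}
  \approx 1.84,
\qquad
t_{0.025,\,49} \approx 2.01
```

Because \(|T| \approx 1.84\) is smaller than the critical value of about 2.01, this sketch would not reject the null hypothesis: the data are consistent with a true mean of 172.2 at the 5% level.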


Practice Problem 1:

Suppose that the current commonly used screening test for breast cancer has a sensitivity of 68%. A new screening test was used to test 200 breast cancer patients, of whom 147 tested positive.

  1. Create a 95% confidence interval for the sensitivity of the test.

  2. Perform a hypothesis test to determine if there is a significant difference in the sensitivity of the old and new test.

  3. Using R, calculate the exact confidence interval and conduct a hypothesis test using the binomial distribution.

Practice Problem 2:

A patient recently diagnosed with Alzheimer’s disease takes a cognitive abilities test. The mean score on this test is \(\mu = 52\) and the variance is \(\sigma^2 = 25\). Assume the cognitive abilities test scores are normally distributed. Find the answers to the following questions with the Z distribution table, a calculator, or R. Remember that the Z table gives you the left-tailed probability.

  1. What percent of individuals scored between a 47 and a 56?

  2. Suppose we have a sample of 9 individuals. Calculate the probability that the sample mean test score is greater than 60.

  3. Patients can be considered for an alternative treatment if they score below a 43 on this test. What percent of patients can be considered for this treatment?

  4. Find the test score above which 27.1% of patients lie.

  5. What is the probability that at least 2 patients of 25 sampled Alzheimer’s patients will be considered for the alternative treatment?


Practice Problem 3:

Pumpkin weights at Wilson’s orchard are known to follow a normal distribution with population mean \(\mu = 18\) lbs and variance \(\sigma^2 = 16\) lbs\(^2\). Each year Wilson’s orchard randomly selects 4 pumpkins and measures their mean weight.

  1. What distribution do the sample means follow?

  2. Using this distribution, calculate the probability that this year’s sample mean weight is less than 16 lbs.

  3. What is the probability that this year’s sample mean weight is greater than 21 lbs?

  4. What is the probability that at least 2 of the next 5 years’ sample means are between 14 and 20 lbs?


Practice Problem 4:

Suppose that the average IQ is 95. Using the lead IQ dataset as a sample, perform a test to see whether the children’s mean IQ differs from this average. Also, create a 95% confidence interval for the mean IQ based on this data. (https://s3.amazonaws.com/pbreheny-data-sets/lead-iq.txt)