Objectives

The Normal Distribution Recap

As previously mentioned in class, much of the data we encounter in nature follows the normal distribution. The normal distribution is used for continuous data, i.e., values that can take any number on the real number line, unlike discrete data that takes on count values.

The normal distribution is characterized by its mean and standard deviation. There are therefore many different normal curves, each with its own mean and standard deviation, but all taking on a bell shape. One of these is the standard normal curve, with mean 0 and standard deviation 1, whose lower-tail probabilities are given in the table shown in class. This is the normal curve we will be working with the most.

Since we know a lot about the standard normal distribution (i.e., we have the table), we can translate any normally distributed data into a standard normal to make the appropriate calculations using the following equation:
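\[z = \frac{x - \mu}{\sigma}\]

Here \(x\) is the observed value, \(\mu\) is the mean, and \(\sigma\) is the standard deviation of the distribution the data come from.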

Normally Distributed Data

We will be using the lipids.txt dataset from the course website. Here we are going to look at various lipid levels of the 3026 adults in the study.

For reference, triglycerides (TRG) are a type of fat in the blood and, together with LDL and HDL, factor into a total cholesterol reading. Low-density lipoprotein (LDL) is a more specific cholesterol measurement and is typically referred to as “bad” cholesterol.

nhanes <- read.delim("http://myweb.uiowa.edu/pbreheny/data/lipids.txt")

Let’s begin by looking at a histogram of LDL measurements.

hist(nhanes$LDL)

Although the LDL values are slightly right-skewed, let’s assume they are close enough to normal to answer some health questions using the normal distribution. (A blue curve is superimposed on the histogram to show its shape; you don’t need to worry about coding it yourselves.)
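For the curious, a curve like that can be superimposed with code along these lines (a sketch, assuming nhanes has been loaded as above; the histogram must be drawn on the density scale for the curve to be comparable):

```r
# Histogram on the density scale, with a normal curve using the
# sample mean and sd of LDL overlaid (not required for the lab)
hist(nhanes$LDL, freq = FALSE)
curve(dnorm(x, mean = mean(nhanes$LDL), sd = sd(nhanes$LDL)),
      add = TRUE, col = "blue", lwd = 2)
```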

First, to use the normal distribution, we must find the mean and standard deviation of the data:

xbar <- mean(nhanes$LDL)
std.dev <- sd(nhanes$LDL)

Suppose we are interested in comparing the NHANES LDL data to the following guidelines:

  • LDL cholesterol levels should be less than 100 mg/dL.

  • Levels of 100 to 129 mg/dL are acceptable for people with no health issues but may be of more concern for those with heart disease or heart disease risk factors.

  • A reading of 130 to 159 mg/dL is borderline high and 160 to 189 mg/dL is high.

  • A reading of 190 mg/dL or higher is considered very high.


  1. What is the probability of observing an LDL measurement that is 160 or greater (LDL readings that are considered high or very high)?

We can answer this using the ‘pnorm’ function. By default, the pnorm function gives us the probability of the lower tail, so we must change the direction of the tail if we want the probability of being greater than a number of interest.

# Using the pnorm function directly
pnorm(160, mean = xbar, sd = std.dev, lower.tail = FALSE)
## [1] 0.06872828
# Or simply find the z-score first and plug it into the pnorm function
z <- (160 - xbar) / std.dev
pnorm(z, lower.tail = FALSE) # No mean/sd arguments supplied: R defaults to the standard normal curve, mean = 0, sd = 1
## [1] 0.06872828

Note: The normal distribution is used for continuous data, unlike the binomial distribution, which is used for discrete data. For a discrete random variable, when finding a complement such as \(P(X \ge k)\) you must not include the number of interest in the ‘pbinom’ function; use 1 - pbinom(k - 1, ...) instead. With the ‘pnorm’ function, by contrast, you should include the number of interest even when finding the complement.
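For contrast, here is a hypothetical discrete example (a Binomial(10, 0.5) variable, not from the lab data): to find \(P(X \ge 7)\), the lower tail must stop at 6, not 7.

```r
# Hypothetical discrete example: X ~ Binomial(n = 10, p = 0.5)
# P(X >= 7) = 1 - P(X <= 6), so 7 itself is excluded from the lower tail
1 - pbinom(6, size = 10, prob = 0.5)
# Equivalently, ask for the upper tail directly
pbinom(6, size = 10, prob = 0.5, lower.tail = FALSE)
```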

Essentially, this is because the distribution is continuous. If you were to find the probability of an LDL of 160 or greater using ‘1 - pnorm(159, mean = xbar, sd = std.dev)’, you would see that the probability is different from the one above and is incorrect, because you would be including values such as 159.8, 159.9, etc., which are below 160. Therefore include 160 itself, as shown below, and you will get the same answer as above:

Note that, by the nature of continuous distributions, a single point on the curve carries no probability. We can only find the probability of an interval, i.e., \(P(X = 160) = 0\).

1 - pnorm(160, mean = xbar, sd = std.dev, lower.tail = TRUE)
## [1] 0.06872828

How would we interpret our findings?

  • Under the fitted normal distribution, 6.87% of LDL measurements fall above 160 mg/dL.

Now, let’s compare this to the probability of 160 or greater using the actual data.

Do you expect the probability of observing an LDL of 160 or greater using the actual data to be exactly the same as, less than, or greater than the probability found using the normal distribution? Why?

sum(nhanes$LDL >= 160) / length(nhanes$LDL)
## [1] 0.07865169
  2. What is the probability of observing an LDL that would be classified as high (160 to 189)?

We can do this in 3 ways:

# Finding the area of the lower tails: P(X < 189) - P(X < 160)
pnorm(189, mean = xbar, sd=std.dev, lower.tail = TRUE) - 
pnorm(160, mean = xbar, sd=std.dev, lower.tail = TRUE)
## [1] 0.05785488
#OR

# Finding the area of the upper tails: P(X > 160) - P(X > 189)
pnorm(160, mean = xbar, sd=std.dev, lower.tail = FALSE) -
pnorm(189, mean = xbar, sd=std.dev, lower.tail = FALSE)
## [1] 0.05785488
#OR

# Finding the area outside the desired interval and taking the complement: 1 - (P(X < 160) + P(X > 189))
1 - (pnorm(160, mean = xbar, sd=std.dev, lower.tail = TRUE) +
pnorm(189, mean = xbar, sd=std.dev, lower.tail = FALSE))
## [1] 0.05785488

As you can see, you can find the answer in many ways, as long as you understand the direction in which you are finding the probability.

  3. Recall that the qnorm function is the inverse of the pnorm function: you can feed it an upper-tail or lower-tail probability, and it will give you the corresponding value on the x-axis. Suppose we have an LDL measurement such that 6.872828% of all the data lies above it. What would that LDL measurement be?
qnorm(p = 0.06872828, mean = xbar, sd=std.dev, lower.tail = F)
## [1] 160

Central Limit Theorem

Recall from class that the central limit theorem states that as you increase your sample size, the distribution of the sample means will approach normal with \(mean = \mu\) and \(SD = \frac{\sigma}{\sqrt{n}}\), regardless of what the underlying distribution looks like.

We are going to simulate the central limit theorem using the TRG readings of the nhanes dataset. Let us assume that this data represents the population. First, let’s look at the distribution.

hist(nhanes$TRG)
abline(v=mean(nhanes$TRG), lty=1, lwd = 2, col="blue")
abline(v=median(nhanes$TRG), lty=2, lwd = 2, col="red")
legend(x = "topright",
       legend = c("Mean","Median"),
       lty = c(1, 2),
       col = c("blue", "red"))

From looking at the histogram we can see that the data is skewed right and does not follow a normal distribution very closely.

Now let’s say that we are going to randomly select from the population to conduct a study. Suppose we go out and conduct 3 separate studies. The first time we are able to recruit 10 people, the second time 30 people, and the third time 300 people.

Distribution of Means

R lets you randomly select samples from a dataset using the sample() function. We will perform our own studies, using the sample function to draw random TRG measurements for 10, 30, and 300 subjects. We save these draws and find the mean of each study as follows:

(Note that the sample function draws randomly, so if you run this code yourself you will get different results; you can use set.seed() to make your draws reproducible.)

sample10 <- sample(nhanes$TRG, 10)
sample30 <- sample(nhanes$TRG, 30)
sample300 <- sample(nhanes$TRG, 300)

mean(sample10)
## [1] 144
mean(sample30)
## [1] 127.6333
mean(sample300)
## [1] 111.9

For now, let’s focus on the sample with 10 observations. Run the sample function again and find its mean. Note that you will get a different mean, since it is a different sample of size 10. You can keep doing this over and over and get a whole dataset of means, which will take on a certain distribution. We have done so for you below:
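One way to carry out this repetition yourself is with the replicate() function (a sketch, assuming nhanes has been loaded as above; your histogram will differ from run to run):

```r
# Repeat the size-10 study 1000 times, saving each study's sample mean
means10 <- replicate(1000, mean(sample(nhanes$TRG, 10)))
hist(means10)
```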

Note that the histogram of these sample means looks a bit like the distribution of the data, but it is more bell-shaped, despite still being a bit skewed.

The same process can be performed if we increase the sample size. The corresponding histograms for sample sizes of 30 and 300 are shown below.
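The same sketch, with the sample size swapped out (again assuming nhanes is loaded as above):

```r
# Sampling distributions of the mean for n = 30 and n = 300
means30 <- replicate(1000, mean(sample(nhanes$TRG, 30)))
means300 <- replicate(1000, mean(sample(nhanes$TRG, 300)))
hist(means30)
hist(means300)
```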

What do you observe about the distribution of the means compared to the population distribution?

  • As we increased the sample size, the distribution of the means became more and more bell-shaped, and its spread decreased.

The following table gives us the mean and standard deviation for the three sampling distributions as well as the original population.

##                               mean        sd
## means of samples of 10    117.0097 21.455609
## means of samples of 30    116.9510 12.286945
## means of samples of 300   116.9331  3.702799
## population                116.9451 67.943216

Standard Error

Recall from class that the standard error is the standard deviation of the population divided by the square root of the total number of observations in a study (\(SE = \frac{\sigma}{\sqrt{n}}\)). In this case, since we know the population, we can find the population standard deviation and therefore the standard error for each of the experiments.

Find the standard error for each study with 10, 30, and 300 subjects and show that it approximately equals the standard deviation of the respective sampling distribution above.
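One way to carry out this check (a sketch, assuming nhanes is loaded as above; note that sd() technically computes a sample standard deviation, which is close enough when treating all 3026 observations as the population):

```r
# Standard errors for studies of 10, 30, and 300 subjects:
# population sd of TRG divided by sqrt(n)
sigma <- sd(nhanes$TRG)
sigma / sqrt(c(10, 30, 300))
```

Compare these with the sd column of the table above; they should be close, though not identical, since the table is based on a finite number of simulated studies.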



Practice Problems

  1. Find the probability that a randomly selected \(\bf{LDL}\) measurement has a value above 123.
mean_LDL <- mean(nhanes$LDL)
sd_LDL <- sd(nhanes$LDL)

pnorm(q = 123, mean = mean_LDL, sd = sd_LDL, lower.tail = F)
## [1] 0.3254162
  2. Find the probability that a randomly selected \(\bf{LDL}\) measurement has a value between 118 and 126.
mean_LDL <- mean(nhanes$LDL)
sd_LDL <- sd(nhanes$LDL)

pnorm(q = 126, mean = mean_LDL, sd = sd_LDL, lower.tail = T) - pnorm(q = 118, mean = mean_LDL, sd = sd_LDL, lower.tail = T)
## [1] 0.0812601
  3. Find the probability that a sample of 50 \(\bf{LDL}\) measurements will have a mean greater than 123.
mean_LDL <- mean(nhanes$LDL)
se_LDL <- sd(nhanes$LDL) / sqrt(50)

pnorm(q = 123, mean = mean_LDL, sd = se_LDL, lower.tail = F)
## [1] 0.0006861629
  4. Find the probability that a sample of 50 \(\bf{LDL}\) measurements will have a mean between 118 and 126.
mean_LDL <- mean(nhanes$LDL)
se_LDL <- sd(nhanes$LDL) / sqrt(50)

pnorm(q = 126, mean = mean_LDL, sd = se_LDL, lower.tail = T) - pnorm(q = 118, mean = mean_LDL, sd = se_LDL, lower.tail = T)
## [1] 0.0133539
  5. What two values contain the middle 95% of sample means of LDL measurements for samples of size 50?
mean_LDL <- mean(nhanes$LDL)
se_LDL <- sd(nhanes$LDL) / sqrt(50)

qnorm(0.025, mean = mean_LDL, sd = se_LDL)
## [1] 96.85348
qnorm(0.975, mean = mean_LDL, sd = se_LDL)
## [1] 116.7149