9 Confidence Intervals
In this practical, you will learn how to derive confidence intervals for a particular problem using Monte Carlo simulations.
In a random experiment, a sample of data is collected from which we can estimate a population parameter of interest. This estimate can either be a point estimate or an interval estimate - a range of values.
A confidence interval is an interval estimate which has an associated confidence level. The confidence level tells us the probability that the procedure that is used to construct the confidence interval will result in the interval containing the true population parameter. It is not the probability that the population parameter lies in the range.
This is a very counter-intuitive concept which we shall now illustrate in this exercise.
9.1 Setup
First, create a new R
script within Rstudio then we will start with some code preamble. We will use the package ggplot2
for plotting.
library(ggplot2)
Use install.packages("ggplot2")
in the console window if the package is not installed on your system.
Lets define the width of an interval, we will set this to 1 initially but we will change this later on:
= 1 # width of confidence interval interval_width
9.2 Simulating data
We are now going to generate some simulated data for our experiment. We will create 30 samples from a Normal distribution with mean 2.5 and variance 1. These are true values of the population parameters. In a real experiment, we would not know these values but using simulated data, we obviously control these.
Let define these first:
# number of data points to generate
<- 30
n # population mean
<- 2.5
mu # population standard deviation (square root of population variance)
<- 1.0 sigma
Now, we generate some normally distributed data using the R
function rnorm
,
# generate n values from the Normal distribution N(mu, sigma)
<- rnorm(n, mean = mu, sd = sigma) x
We now have 30 samples from a Normal distribution with population mean 2.5 and variance 1.
9.3 Constructing the confidence interval
We are going to pretend that we do not know the population mean value (2.5) used to generate this dataset and try to provide an interval estimate for it from the simulated sample data.
Remember, from lectures, that the sample mean \(\bar{x}\) is a natural point estimate for the population mean \(\mu\).
= mean(x) # compute sample mean x_bar
so a suitable interval might be centred on the sample mean and extend out,
<- c(x_bar - interval_width / 2, x_bar + interval_width / 2) interval
Let’s look at this interval:
print(interval)
## [1] 1.891126 2.891126
Q: Does the confidence interval contain the true parameter?
9.4 Experiment
The previous experiment only examined one simulated dataset so we cannot fully understand the probabilistic interpretation of the confidence interval just yet. At the moment, the interval you have calculated will either contain the population mean or not.
In order to understand the probabilistic interpretation, we will need to generate many data sets, construct confidence intervals as we have for each and then see across all generated data sets, how often those intervals cover the true population mean.
For a Monte Carlo simulation, we will need many repeats of the simulation. Lets define the number of repeats to be used:
<- 1000 # number of Monte Carlo simulation runs nreps
We will use 1000 simulations initially to make the code quick to run but you may want to make this higher later on for greater accuracy.
Now, let us define a series of interval widths to simultaneously test,
# define a series of interval widths
<- seq(0.1, 1.0, 0.1)
interval_width # store the number of interval widths generated
<- length(interval_width) n_interval_widths
This creates a sequence of values from 0.1 to 1.0 in steps of 0.1 in the vector interval_width
:
print(interval_width)
## [1] 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
Now, we will create a vector of zeros of the same length. We will use this to store the number of times that a confidence interval of those specific widths contain the true population mean
# create a vector to store the number of times the population mean is contained
<- rep(0, n_interval_widths) mu_contained
The hard work now begins. We use a for
loop to repeat the simulation nreps
times. Within each loop, we will simulate a new data set, compute a sample mean and then check if the confidence interval contains the true population mean. Since we are using more than one confidence width, we use a second for
loop to cycle through the different widths.
for (replicate in 1:nreps) {
<- sigma * rnorm(n) + mu # simulate a data set
x
<- mean(x) # compute the sample mean
xbar
# for each interval width that we are testing ...
for (j in 1:n_interval_widths) {
# check if the interval contains the true mean
if ((mu > xbar - 0.5 * interval_width[j]) &
< xbar + 0.5 * interval_width[j])) {
(mu # if it is, we increment the count by one for this width
<- mu_contained[j] + 1
mu_contained[j]
}
}
}
We can now calculate, for each width, an estimate of the probability that a confidence interval of that width will contain the population mean.
<- mu_contained / nreps probability_mean_contained
Let’s use ggplot2
to plot this relationship,
# create a data frame containing the variables we wish to plot
<- data.frame(interval_width = interval_width,
df probability_mean_contained = probability_mean_contained)
# initialise the ggplot
<- ggplot(df, aes(x = interval_width, y = probability_mean_contained))
plt # create a line plot
<- plt + geom_line()
plt # add a horizontal axis label
<- plt + xlab("Interval Width")
plt # create a vertical axis label
<- plt + ylab("Probability that mu is contained")
plt
print(plt) # plot to screen
Can you see that an interval width of \(0.6\) \((\bar{x} \pm 0.3)\) gives a confidence interval close to 90% probability of containing the population mean?
Remember from the lectures that we saw that the theory says \(\bar{x} \pm 1.65\,\frac{\sigma}{\sqrt{n}}\) gives a 90% confidence interval?
So, if we compute \(2 \times 1.65\, \frac{\sigma}{\sqrt{n}}\), what do we get?
print(2 * 1.65 * sigma / sqrt(n))
## [1] 0.6024948
The Monte Carlo estimate matches up with the theory!
9.5 Problem: Confidence Interval
Can you devise a way to compute a confidence interval for the population standard deviation?
You can make use of the following as a point estimate of the sample variance:
\[ s^2 = \frac{1}{n - 1}\sum_{i = 1}^n (x - \bar{x})^2 \]
which can be calculated using the sd
function in R
, remember the relationship between the standard deviation and variance.