STAT 231 - Statistics

STAT 231 - Statistics
Instructor: Surya Banerjee
Location: DC 1350
Time: Mondays and Wednesdays 2:30am - 3:50pm
Tutorials: DC 1351 Mondays 5:30pm - 6:20pm
Term: Spring 2016

May 2, 2016 - Lecture 1

Email: or

Course Marking Scheme: 3 Tutorial Quizzes (TQ) and 2 Midterms

STAT 231 is the reverse of STAT 230. E.g. We toss a coin 100 times, we get 60 heads. In STAT 230, we calculate the probability of that happening (binomial distribution, 100 and 0.5, P(X=60)) with all the parameters given (P = 0.5 to get a head for a fair coin). In STAT 231, we try to find the parameters by experiments, and try to infer what we can say about the “fairness” of the coin.

MLB: Between 1901-1941, batters with BA .400+: 11. 1941 onwards: 0. Why? Use statistics to find out.

Kansas Weathermen: They are correct about whether it will rain the next day about 85% of the time. Good record, right? However, it only rains about 10% of the time in Kansas. So if I go to the weather broadcasting station and say “no rain tomorrow” and then go home, I will be statistically more accuarate that these meteorologists.

Baseball Pundits: Pundits are right about the outcome of baseball games about 48% of the time. Literally worse than a coin.

In STAT 231, we learn how to test claims and check for correlation and causation. How can we make the leap from correlation to causation? (Recommended reading: Freakanomics)

Introduction to Data Analysis

Data can be classified into two categories: numerical and categorical (non-numerical).

Numerical Data is either discrete or continuous. Discrete data (e.g. number of accidents) takes integer values, while continuous data (e.g. height, weight) measure non-integer values.

Categorical Data is either ordinal (if there’s an underlying order, e.g. on a scale from 1 to 10, 1 being least satisfied and 10 being most satisfied) or non-ordinal (no order, e.g. nationality, favorite color).

Transformation {x1, x2,…, xn}
yi = f(xi) is a transformation.

An affine transformation is a linear transformation: yi = axi + b

Coding is the conversion of categorical data into numerical data.

Data Summaries

We want to extract important information about data.

  1. Numerical: come up with numbers that represent different properties of the data set
  2. Graphical: graph that tells us the shape of the data set

There are some properties of interest:

  • Centre of the data set -> Central Tendency

  • How volatile is the data set -> Dispersion

  • If the data set is symmetric -> Measures of Symmetry

  • How “fat” the tails of the data set are (i.e. the size of extrema) -> Kurtosis

Measures of Central Tendency

Data set: {y1, y2,…, yn}

Sample Mean:

Property: The sum of the deviations of the observations from the sample mean is 0.

Geometric Mean

Harmonic Mean

May 4, 2016 - Lecture 2

Lecture notes will be posted every weekend.
Practice questions (with solutions) will be posted this week. First tutuorial quiz on May 17 (T), 2016 (11:25am).


  • measures of central tendency: medias, quartiles, mode
  • measures of variability: range, IQR, variance

Sample -> Predictions about the population

Data Summaries

  • Variability: DISPERSION
  • Tails are similar: SYMMETRY
  • Frequency of Extreme Obersevations: KURTOSIS

Numerical and Graphical Summaries

{y1, y2, … , yn}

Sample mean: 1/n sigma yi
Properties: sigma (yi - mean) = 0

Also, affine transformation preserves linear combinations (i.e. the arithmetic mean changes accordingly by the affine transformation)

Median: the middle most observations
arrange the data set in ascending order and pick the middle one. e.g. 1, 3, 7, 13, 25. Median is 7.

Median is not too sensitive with extreme values.

Quartiles: Q1: lower quartile, Q3: upper quartile

Percentiles: Instead of dividing into 4 parts, the data is divided into 100 parts.

Eg. p=0.25, m = (n+1) * p, n is the number of observations

Mode: Observation that occurs with the maximum frequency

St.Petersburg’s Paradox. We care about averages (expected values), but also risk. Variability is a very important property of a data set. E.g. Country A 0 0 0 1000 and Country B 250 250 250 250 Goalie A 0 6 0 6 0 6 0 6 Goalie B 3 3 3 3 3 3 3

Measures of Volatility

Range: Max - Min

Interquartile Range (IQR): IQR = Q3 - Q1

Sample Variance and Sample Standard Deviation:

Sample Variance: s2 = 1/(n-1) sigma (yi - y average)2 is approximatey the average of the squared deviation from the mean

Standard Deviation: s = positive square root of s^2, the sample variance

Mean Absolute Deviation(MAD): 1/n sigma abs(yi - y average)

Say a data set of xi is transformed by an affine transformation yi = a + bxi. Then s2 of y = b2s2. And the standard deviation is multiplied by the absolute value of b.

May 9, 2016 - Lecture 3

  • Practice problems with solutions on LEARN
  • <= Wednesday’s class for the TQ1 next week


  • Measures of Symmetry
  • Measures of Kurtosis
  • Applications
  • Measures of Association
  • Graphical Measures

Objective: To figure out the shape of our sample

  • Centre: sample mean
  • Variability: This measures how much the data set is spread out across the centre; how volatile the data set is; Use s2, variance; Standard deviation is the sqrt(variance)
    • If y=a+bx, s2y = b^2s^2_x s_y = abs(b)s_x
    • The variance treats data on either side of the mean symmetrically
  • Symmetry:
    • 3 shapes:
      • Symmetric
      • Right-skewed (mode is on the left side, mean is on the right side), long right tail
      • Left-skewed (mode is on the right side, mean is on the left side), long left tail
    • Measure of Skewness:
      • Quick estimate of skewness:
        e.g. 1, 3, 5, 7, 9, 11, 13
        Symmetric Data -> Mean = Median
        e.g. 1, 3, 5, 7, 9, 11, 107 Mean » Median
        Measure of Skewness = Mean - Median
        • if > 0, right-skewed, mean greater than median
        • if < 0, left-skewed, mean less than median
        • if = 0, symmetric

Gould: “The median is not the message”
“Abdominal mesothalamia”: median of life expectancy after diagnosis = 8 months

If the distribution (of frequency of death against time since diagnosis) is symmetric, then he will die within 16 months at best.

The actual distribution is right-skewed.

Kurtosis measure how frequent extreme observations are, with respect to the Normal Distribution.

Kurtosis checks whether the data set has “fatter” tails compared to the Normal distribution.

Kurtosis: how much “peakedness” the graph has

For a normal distribution, kurtosis = 3. If k for your data > 3, then it has fatter tails compared to the normal. If it is < 3, then it has narrower tails.

Kurtosis checks whether we can apply the Normal Distribution assumption to our data set by comparing our sample kurtosis to 3.

Measures of Association
Objective: To find whether there is any evidence of association between two variables

  • Categorical Variables lmao ayy hey
    • Relative Risk: measures the association between two catgorical variables; Measured by [R.R.]
    • A and B are indepedent iff P(A|B) = P(A)
    • If R.R. is approximately equal to 1, then there is little evidence of association
    • If R.R. > > 1 or < < 1, evidence of association
    • But how high is too high? UNANSWERED QUESTION
    • The table = Two Way Contingency Table
  • Numerical Variables
    How to find a measure of association between two numerical variables?
    • Sample Correlation Coefficient rxy: the sign of r tells us the direction of the relationship, and the value of r tells us the strength of the association/relationship

xi = # of beers you drink
yi = STAT 231 mark
Collect a data set:
{x1, y1},…{xn, yn}

So it is most likely that if xn - x average > 0, yn - y average < 0

Thus, the numerator tells us the direction of the relationship. The denominator ensures that -1 <= rxy <= 1

The closer abs(r_xy) is to 1, the strong is the evidence of association.

r captures the linear relationship between x and y. Not good in capture non-linear relationship

E.g. y = x2, x=-3, y=9, x=-1, y=1, x=1, y=1, x=3, y=9. r_xy = 0 !!! It’s a function so the two variables are associated for sure, but the Coefficient gives 0. Does not work well with quadratic relationships, or non-linear in general.

E.g. y = ax + b. r_xy = 1 if a > 0, = -1 if a < 0

Since we are going from sample to population, we can only find evidence of association. Strong correlation != Causation

Graphical Measures

  • Density Histogram: Histogram -> grouped data; Density histogram -> area of every rectangle = relative frequency of the group
  • Empirical cdf
  • Box Plot
  • Scatter Plot
  • Q-Q Plot

May 11, 2016 - Lecture 4

A note about TQ1 will be posted on LEARN tonight -> Up to today’s material (Ignore the last question on the sample quiz)


  • Centre -> Sample Mean
  • Variability -> Variance, Standard Deviation
  • Symmetry -> Mean - Median
  • Kurtosis -> Compare to Normal Distribution
  • Correlation Coefficient: rxy measures association

Reasons why we care about Graphical Summaries

  • To “identify” the distribution from which the data is drawn,
  • To find the shape of the data set

The Five Number Summary

  • Minimum
  • Q1
  • Q2
  • Q3
  • Maximum

These five numbers provide the rough shape of the graph

Correlation Coefficient measures the linear association between two variables

Graphical Summary

Histogram: Grouped Data Example: Group Freq (The cohorts are called bins)
Frequency Histogram: Frequency as y-axis, and bins as x-axis (Bar graph)
Density Histogram: x-axis = bins, y-axis = densities

  • The length(height) of each rectangle in a density histogram is chosen such that the area of the rectangle is equal to the relative frequency of the bin
  • To construct a density histogram: Find the frequency and relative frequency for each bin; Relative Frequency = frequency for the bin / total frequency

The reason why we draw density histogram is that we want to compare our data (shape of graph) with known density functions of distributions

Box Plot: Box and Whiskers Plot

  • Lower end of the box/rectangle -> Q1
  • Upper end of the box/rectangle -> Q3
  • The line inside the box -> Median = Q2
  • The upper whisker goes up to the maximum of the data set, that is NOT AN OUTLIER (extreme observations; a data point yi is an outlier if yi > Q3 + 1.5IQR or yi < Q1 - 1.5IQR; Remember IQR = Q3-Q1); The upper whisker ends at the highest value of your data set that is lower than the outliers; We mark the outliers each and every one of them separately
  • The lower whisker goes down to the minimum of the data set, that is not an outlier
  • The Box Plot gives us the Five Number Summary
  • Reading the Box Plot sideways gives the rough shape and the skewness of the distribution

Empirical cdf (Cumulative Distribution Function)
Definition: F(y) = # of observations <= y / Total # of observations, and the plot (y, F(y)) is called the empirical cdf.

{1, 3, 5, 5, 9}, F(0) = 0, F(1) = 0.2, F(3) = 0.4 … the plot of a step function

  • We can find a lot of info from the empirical cdf
  • F(median) = 1/2, we can identify any percentile
  • The number with the biggest jump vertically on the graph is the mode of this data set

Q-Q Plot
We want to check whether our data set “resembles” a Normal Distribution. The Q-Q plot plots the alpha-th quantile of your sample to the alpha-th quantile of Z Distribution ~N(0, 1); Plots two quantiles: Sample quantiles against the theoretical quantiles

{1, 2, 7, …, 155}

Z(alpha) on the x-axis, y(alpha) on the y-axis. 0 = median and mean of Z

If the Q-Q Plot is a straight line, then the data resembles the Normal Distribution Z

Q: If we have a distribution that is normal, how do we prove that it will have a straight line Q-Q plot?
That is, y~N(mu, sigma^2), the Q-Q plot will be a straight line


Scatter Plot
Scatter Plot checks for association. Just plot (xi, yi).

May 16, 2016 - Lecture 5


  • typical problem: we have a population of observations, the characteristics of the population is unknown
  • We draw a sample, and based on the sample observations infer properties about the population (statistical inference)
  • Examples: 1. we want to find the approval rating of Trump among likely US voters; 2. Average income of a starting math undergraduate career in Canada; 3. Does a regular intake of Vitamin C reduce the duration of a flu? 4. Are Canadian Jeopardy contestants better than Americans?
  • PPDAC approach gives you an algorithm to tackle statistical problems that are mentioned above; PPDAC = Problem, Plan, Data, Analysis, Conclusion

Problem: Descriptive, Causative, Predictive

  • Descriptive: We are interested in some unknown characteristic of the population (e.g. Trump)
  • Causative: Whether a variable X causes a variable Y to change (e.g. Vitamin C)
  • Predictive: To estimate the future values of some variable (e.g. predicting Blackberry stock prices)

Target Population: the population of interest (e.g. All likely US voters in the Trump example)
Unit: each member of the population
Variate: a characteristic of the unit (e.g. whether the voter approves or disapproves of Trump)
Attribute: a function of the variate (e.g. Proportion of approvals = Attribute)

For the income of math undergrad problem, each individual income is the variate.
Average income is the attribute.

Typically, we want to find some attributes of the population.


Study Population: the set of observations from which your sample is drawn
Example: Suppose we are doing phone interviews -> study population would be all voters with a phone; In medical tests, study population might not be a subset of the target population

  • Experimental Plan: Some of the variates can be controlled by the analyst with dependent and independent variables
  • Observational Plan: The variates are not under the analyst’s control

Setting up a statistical model: we assume that the data is drawn from known distribution with unknown parameters.


  • Bias: systematic error -> make sure the data is not biased
  • Measurement error: random error -> measured value - actual value

Population parameters, usually Greek letters, are unknown. Sample parameters, usually Latin letters, are known (sample mean, sample variance).

Random Sampling
Sampling protocol: we want the sample to represent the population
Sampling techniques: methods of drawing the “right” sample
Errors: The difference between the target and the study populations = Study Error; Difference between the study population and the sample = Sampling Error

Difference in the value of the attribute between the Total Population and the Study Population is the study error. Difference in the value of the attribute between the Study Population and the Sample is the sampling error. We want to quantify the sampling error.

We want to make sure that the conclusion is understood by non-statisticians

##The Theory of Estimation## Method of Maximum Likelihood
Based on our sample, what is the mostly likely value of the unknown parameter? This is a question every statistician asks. Example: We have a biased coin. The probability of getting a head is either 1/2 or 1/3. Toss the coin 100 times and we see 25 heads. We choose the value of the parameter that maximizes the probability of observing your sample.

Suppose Y is a discrete distribution with a pf f and an unknown parameter theta. Data = {y1, …, yn}.
L(theta; y1, … , yn) = Likelihood Function = P(Y1 = y1, Y2 = y2, … , Yn = yn) as a function of theta; the chance of observing the sample

E.g. Number of accidents ~ Poi(mu), with mu unknown. Sample is {1, 0, 2, 5, 7}. Based on the sample, what is the MLE (Maximum Likelihood Estimate) for mu? Use the pf for the Poisson distribution to calculate the probability of observing each unit in the sample. Multiply them together. The value of mu that maximizes L above is called the MLE. In the above example, L = e^-5mu* mu^15 / 1!0!…7!. In order to maximize it, take the log(base e) function and maximize the log function. Take derivatives and equate it to zero, solve for mu (MLE). log L = l = log-likelihood function. In the above example, mu = 3

May 18, 2016 - Lecture 6


We are interested in some unknown characteristics (attributes) of the population. We draw a random, independent sample from this population {y1,…,yn}. Based on these sample observations, we want to come up with the best estimate for the unknown characterstic g(y1,…,yn) will be my best guess for theta.

Some starting points

+ usually, all statistical inference problems start with some estimation of some parameter.
+ {y1,...,yn} -> sample
+ we think of the data points {y1,...,yn} as outcomes of some random variable Y (statistical model); We assume that the data is drawn from some distribution with unknown parameters

Example: toss a coin; objective: estimate pi = P(head). Each data point yi can be thought of as an outcome of Yi. The distribution of Yi is called the Statistical Model.

##Steps Involving Estimation##

  • Identify the distribution from which your data set is drawn
  • Construct the likelihood function L(theta; y1,…,yn) = P(Y1=y1,…Yn=yn) = Probability of observing your sample = P(Y1=y1)P(Yn=yn)
  • Find the MLE (best guess for the unknown parameter). Theta hat is that value of theta that maximizes the likelihood function

Example 1: To estimate the approval rating of Trump. Approved Rating = pi (Population proportion, unknown). Sample = {YYNNNNYNNN} Given this sample, what is pi hat, the MLE for pi? Y is either Yes with probability pi, or No with probability 1-pi

L(pi; y1,…yn) = pipi(1-pi)(1-pi) = pi^3 * (1-pi)^7 is the Maximum Likelihood Function. The value of pi that maximizes the function is pi hat = MLE. The log likelihood function is l(pi) = 3ln(pi) + 7ln(1-pi), take derivative, find maximum value of pi = 0.3

Example 2: Objective: To estimate mu, the average number of hits on your blog in an hour. A sample of n hours {y1,…,yn}. What is mu hat, the MLE for mu? We will assume that the data is drawn from a Poisson distribution. L(mu, y1,…yn) = P(Y=y1)*…P(Y=yn). Calculating mu hat, we get mu hat = 1/n(sigma yi) = y average

Example 3: To estimate pi = prob that a Canadian contestant wins Jeopardy. Sample = {y1,…,yn} yi = number of shows contestant i appeared in. Sample = {1,2,1,3,5}. The data in this problem is drawn from a Geometric distribution. P(Y=2) = pi(1-pi). P(Y=4) = pi^3(1-pi). L(pi; y1,…yn) = pi^[(sigma yi)-n] * (1-pi)^n. Take log derivatives and equate it to zero. Get pi hat.

Example 4: To estimate the proportion of southpaws at UW.

+ Strategy 1: Go interview people till we get 10 left handers, interviewed 100 people
+ Strategy 2: Go interview 100 people and 10 of them are left handers

Strategy 1: pi = proportion of left handers. L(pi) = C^99_9pi^9(1-pi)^90 * pi
Strategy 2: L(pi) = C^100_10pi^10(1-pi)^90

So pi hat in both cases are 1/10.

Some Final Points

  1. For complete analysis, we have to make sure theta hat is indeed the maximum (check the 2nd order condition)
  2. L(theta; y1,…,yn) = P(Y1=y1)P(Yn=yn) = f(y1)*…*f(yn). f = distribution function = pi(f(yi))

Since the likelihood function is a product of the probabilities, the value can be really small for large sample sizes.
R(theta) = Relative Likelihood Function = L(theta)/L(theta hat), where theta hat = MLE. The graph of R(theta) against theta is bell-shaped, maximized at theta hat. 0<=R(theta)<=1, R hits its maximum at theta = theta hat = MLE

May 25, 2016 - Lecture 7, 2016

Syllabus for the Midterm: <= End of this week + STAT 230; Fall 2015 Midterm and practice problems posted (solutions will be posted this Friday)


  • Overview of statistical modelling, estimatation and the MLE calculation
  • Likelihood functions and the MLE for continuous r.v.s
  • Invariance property of the MLE
  • Special case -> Uniform distribution

Paul the Octopus Problem

  • Toss a coin 100 times. Your friend “guesses” right 70 times. Is there evidence that your friend has ESP?
  • What is the attribute of interest? (What do we care about the population we are looking at?)
  • We are trying to find pi, the probability(your friend will guess right); it is unknown
  • y = number of successes in 100 trials, y = 70 -> outcome of some r.v. Y. Data points are not just numbers, but outcomes of a random experiment
  • We need to construct a statistical model (assuption on the distribution Y)
  • In this case, Y~Bin(100, pi)

Statistical Model

  • Construct the likelihood function
  • L(pi) = C^100_70 pi^70(1-pi)^30
  • Find the MLE(we choose pi hat which maximizes L(pi)), pi hat = 0.7
  • We choose that value of the parameter that maximizes the chance of what we observed

Formal Definition: Yi = f(theta;yi), i=1,…,n, f=distribution function, Yi’s are independent, {y1,…,yn} -> data set

L(theta;y1,y2,…,yn) = n Pi i=1 f(theta;yi) Product of the Distribution functions evaluated at the sample points

Example: {y1,…,yn} is independently drawn from yi
f(y;theta) = (1-theta)^y * theta; y = 0,1,2,…

Continuous Distributions

Definition: Model: Yi~f(yi;theta) where Yi is a continuous random variable with unknown parameter theta

Example: Objective: To estimate mu = average lifetime of a lightbulb produced by a company. An independent sample {y1,…,yn} is drawn. Based on this sample, what is mu hat?

Model: Yi~Exp(mu), i=1,..,n, Yi’s are independent; L(mu, yi,…,yn) = i/mu^n e^(-1/u sigma yi). Take ln, take derivative, equate to zero; get mu hat = y average

Example: Objective: To estimate mu = population average IQ of UW Math students; sigma^2 = variance of the IQ

An independent sample {y1,y2…,yn} is drawn. Based on the sample, what are sigma^2 hat and mu hat?

Solution: Model: Yi~N(mu, sigma^2), assuming normality, i=1,…,n, Yi’s independent

Likelihood function: Take partial derivatives: first with respect to mu, get mu hat = y average (If the distribution is Normal, then y average is the MLE for mu); then take partial derivative with respect to sigma, get sigma^2 hat = 1/n sigma(yi - y average)^2. The MLE for population mean is the sample mean, but the population variance is NOT the sample variance (which is 1/n-1 instead of 1/n)

Example: Suppose {y1,…,yn} is drawn from a random variable with density function f(theta;y) = 2y/theta e^(-(y^2)/theta), with y>0, theta>0. L(theta) = (2^n(y1y2…yn))/theta^n e^(-1/theta sigma yi^2), take ln, l(theta) - ln K - nln(theta) - 1/theta sigma yi^2. Take derivative, get theta hat = 1/n sigma yi^2

##Invariance Property##

Result: If theta hat is the MLE (best guess for theta) for theta, then g(theta hat) is the MLE for g(theta) for any continuous g

Example: The scores of STAT231 are normally distributed with mean mu and variance sigma^2. Find the estimate for the 95-th percentile of the population. Looking at the Z table, 95-th percentile corresponds to 1.64.
P(Y<=x) = 0.95
P((Y-mu)/sigma <= (x-mu)/sigma) = 0.95
P(Z<=x-mu/sigma) = 0.95
x-mu/sigma = 1.64
x = mu + 1.64 sigma

MLE for the 95th percentile = mu hat + 1.64 sigma hat. Find mu hat and sigma hat, plug in, done.

Example: Suppose we toss a coin 60 times, and we observe 20 successes (heads). Find the MLE for the variance of this distribution. pi hat = 1/3
Var = npi(1-pi), Mean = npi

##Uniform Distribution##

Yi~U[0,theta], theta is unknown. {yi,…,yn} What is theta hat? U[a,b], pf = 1/(b-a). f(y;theta) = 1/theta if 0<y<theta and 0 otherwise. Then L(theta) = 1/theta^n if 0<=yi<=theta for all i (or can be written as “if theta >= max{yi,…,yn}”), or 0 otherwise. But the graph is not continuous at max{y1,…,yn}. MLE theta hat = max{y1,…,yn}

May 30, 2016 - Lecture 8

Interval Estimation: Likelihood function -> likelihood intervals, sampling distribution -> confidence intervals

The Theory of Estimation

Problem: theta = unknown population attribute of interest; {y1,…,yn} -> sample drawn from the population; based on the sample, theta hat(y1,…,yn) which is our ESTIMATE for theta

Method of Maximum Likelihood

  1. Identify the distribution from which the data is drawn (Modelling)
  2. Construct the likelihood function, calculate the log likelihood function, take derivatives to find the maximum likelihood estimate

Example: Y1,…,Yn ~ N(mu, sigma^2). Data set: {y1,…,yn}. Objective: Estimate mu
Alternative Method - Method of Least Squares (more practical, do not care about distribution. only care about how accurate the prediction is): Choose mu hat such that the sum of SQUARED errors is minimized. Min sigma[(yi-mu)^2]

2 sigma(yi-mu)(-1) = 0
sigma(yi-mu) = 0
mu = y average

For the Normal problem, MLE and LS estimate for mu are the same

Interval Estimation

Obejective: We have an unknown parameter theta. We want to estimate the random interval [A, B] which would contain theta with a “high” probability (that is pre-specified/defined)

Example: Margin of Error for opinion polls

Method 1: Through the likelihood function
Definition: A 100p % likelihood interval (p is in (0,1)) for an unknown parameter theta is given by

{theta: R(theta) >= p}, where R(theta) = Relative Likelihood Function

Example: p=0.5, you want a 50% likelihood interval; so {theta: R(theta) >= 0.5}; Draw a horizontal line at y=0.5, the line intersects the RLF graph at points a and b [a,b] = 50% likelihood interval

Question: If theta_0 belongs to the 20% likelihood interval, does it belong to the 10% likelihood interval? Yes.

Question: Does theta hat (MLE) belong to all likelihood intervals? Yes.

Suppose theta_0 belongs to the 30% likelihood interval -> means that the likelihood function, evaluated at theta_0, is at least 30% of the likelihood function evaluated at the MLE. L(theta_0)/L(theta hat) >= 0.3 => L(theta_0) >= 0.3 L(theta hat)

Method 2 - Method of Sampling Distributions: Method 1 is not an INTUITIVE interpretation


  1. Identify the PIVOTAL DISTRIBUTION from your model
  2. Find the endpoints of your pivotal distribution and rearrange to construct the coverage interval
  3. Estimate the coverage interval using your sample -> confidence interval

Notations: Y bar (a random variable from which sample means , y bar, are drawn), y bar (sample mean, known), and mu (population mean, unknown)

Example: A population of 1, 2, 3, 4, 5
A sample of size 2 is drawn with replacement from this population

  • mu = 3 (usually unknown)
  • y bar: {1, 3} -> 2; {1, 1} -> 1; depends on sampling
  • Y bar: random variable from drawing y bars by graphing the frequency of each

Capital letters = random variables = Estimators
Lowercase letters = Estimates from the sample -> known numbers
Greek letters = Population parameters -> unknown parameters

Example: Objective: To construct a 95% confidence interval for the average IQ of UW professors. {y1,…,yn}, n=25, sigma yi = 2200. We know that the IQs are normally distributed with a variance of sigma^2 = 49 (unrealistic, because sigma^2 is known but not mu). Construct a 95% confidence interval for mu

Model: Yi ~ N(mu, 49). i = 1,…,25 are independent

Y bar ~ N(mu, 49/25) since Y bar ~ N(mu, sigma^2 / n)

(Y bar - mu) / 7/5 = z ~ N(0, 1) PIVOT

Go to the Z-table and find the 95% interval. Find 0.975. a = 1.96

P(-1.96 < z < 1.96) = 0.95
P(-1.96 < (Y bar - mu) / 7/5 < 1.96) = 0.95

So mu > Y bar - 1.967/5, and mu < Y bar + 1.967/5, and these two numbers make up the bounds of the coverage interval. [y bar - 1.967/5, y bar + 1.967/5] since y bar is known is the confidence interval

General Rules to estimate mu when sigma is known: (Y bar - Z* sigma/sqrt(n), Y bar + Z* sigma/sqrt(n)) is the coverage interval. Sub y for Y, confidence interval

Suppose we had a 99% coverage, then Z* will increase (to 2.58).

Suppose we want the 95% confidence interval to be of fixed length y bar +- A.

Two problems:

  1. Sigma is usually not known
  2. The population might not be normal

Central Limit Theorem: Suppose n is large, and Y1,…,Yn, are drawn from a distribution with mean mu and variance signma^2, then Y bar is approximately Y bar ~N(mu, sigma^2 / n)

June 1, 2016 - Lecture 9


  • Confidence interval for mu from a normal population with known sigma (population SD)
  • Confidence interval for Binomial
  • Confidence interval for Exponential (when n is large)
  • The chi-squared distribution chi^2_k
  • The T distribution

Example: (pre-specified) 99% CI

Example 1: The incomes of UW grads are Normally distributed with mean mu, and SD sigma = $5000

Object: To fund the 99% for mu, i.e. we want to construct a random interval [A, B], such that P(A < mu < B) = 0.99 (This is called the coverage interval, and 0.99 is called the coverage probability or level of confidence)

A sample of 64 UW grads are drawn at random {y1,…,y64}, y bar = 60000, sigma = 7000. What is your CI?

Example 2: November 16, an exit poll is conducted with a random sample of 1000 US voters and 530 of them voted for Trump. Pi = Proportion of total votes Trump will get. We want to construct a 95% CI for pi.

Confidence Interval Approach: construct the pivotal distribution based on your estimator or a function of it; mu = population mean, y bar = sample mean = a number, Y bar = r.v. from which y bar is drawn. The distribution of Y bar is the pivot.

Example 2: pi = population proportion, pi hat = 530/1000 = 0.53 (Estimate), pi tilda (pivot) = r.v. from which pi hat is drawn

An estimate is a number, an estimator is a random variable.

Construct a coverage interval from the pivot. This interval will contain the parameter with the given probability. Estimate this interval using your sample -> Confidence Interval

Example 1: Model: Yi ~ N(mu, 5000^2), i=1,…,64; Yi’s independent

Y1,…,Yn ~ N(mu, sigma^2), Y bar ~ N(mu, sigma^2/n). So in this case, Y bar ~ N(mu, 5000^2/64)

(Y bar - mu) / (5000/8) = Z ~ N(0,1) is the pivotal distribution; Z-table -> 0.995, a=2.58. P(-2.58 <= Z <= 2.58) = 0.99. Plug (Y bar - mu) / (5000/8) = Z back in, we get mu >= Y bar - 2.58 * 5000 / 8 and mu <= Y bar + 2.58 * 5000 / 8. So P(Y bar - 2.58 * 5000 / 8 < mu < Y bar + 2.58 * 5000 / 8) = 0.99. This interval is the coverage interval. [60000 +- 2.58 * 5000 / 8]

NOTE: It is not true that the above interval contains mu with probability 0.99

General Rule

If the population is normal, and the variance is known, then the CI for mu is given [y bar += z* sigma/sqrt(n)]. So z = from the z0table, y bar = sample mean, r = population standard deviation, n = sample size. The interval will be wider if the level of confidence increase

Example 2: pi, pi hat = 0.53, pi tilda = estimator

Theorem - Central Limit Theorem
If n is “large”, then pi

Y1,…,Yn ~ N(mu, sigma^2), Y bar ~ N(mu, sigma^2/n). Let n be large, Y1,…,Yn have mean mu and variance sigma^2. Then (Y bar - mu) / (sigma/sqrt(n)) approximately equals Z for large n

Find the two points for the pivot, between [-a, a] the probability is 0.95, a = 1.96

Example: Y1,…,Y100 ~ Exp(mu); {y1,…,y100} -> Data. Find the 95% CI for mu.

By CLT, (Y bar - mu) / (mu/sqrt(n)) = Z ~ N(0,1)