STAT 231  Statistics
STAT 231  Statistics
Instructor: Surya Banerjee
Location: DC 1350
Time: Mondays and Wednesdays 2:30am  3:50pm
Tutorials: DC 1351 Mondays 5:30pm  6:20pm
Term: Spring 2016
May 2, 2016  Lecture 1
Email: s22baner@uwaterloo.ca or suryabanerjee@gmail.com
Course Marking Scheme: 3 Tutorial Quizzes (TQ) and 2 Midterms
STAT 231 is the reverse of STAT 230. E.g. We toss a coin 100 times, we get 60 heads. In STAT 230, we calculate the probability of that happening (binomial distribution, 100 and 0.5, P(X=60)) with all the parameters given (P = 0.5 to get a head for a fair coin). In STAT 231, we try to find the parameters by experiments, and try to infer what we can say about the “fairness” of the coin.
MLB: Between 19011941, batters with BA .400+: 11. 1941 onwards: 0. Why? Use statistics to find out.
Kansas Weathermen: They are correct about whether it will rain the next day about 85% of the time. Good record, right? However, it only rains about 10% of the time in Kansas. So if I go to the weather broadcasting station and say “no rain tomorrow” and then go home, I will be statistically more accuarate that these meteorologists.
Baseball Pundits: Pundits are right about the outcome of baseball games about 48% of the time. Literally worse than a coin.
In STAT 231, we learn how to test claims and check for correlation and causation. How can we make the leap from correlation to causation? (Recommended reading: Freakanomics)
Introduction to Data Analysis
Data can be classified into two categories: numerical and categorical (nonnumerical).
Numerical Data is either discrete or continuous. Discrete data (e.g. number of accidents) takes integer values, while continuous data (e.g. height, weight) measure noninteger values.
Categorical Data is either ordinal (if there’s an underlying order, e.g. on a scale from 1 to 10, 1 being least satisfied and 10 being most satisfied) or nonordinal (no order, e.g. nationality, favorite color).
Transformation {x_{1}, x_{2},…, x_{n}}
y_{i} = f(x_{i}) is a transformation.
An affine transformation is a linear transformation: y_{i} = ax_{i} + b
Coding is the conversion of categorical data into numerical data.
Data Summaries
We want to extract important information about data.
 Numerical: come up with numbers that represent different properties of the data set
 Graphical: graph that tells us the shape of the data set
There are some properties of interest:

Centre of the data set > Central Tendency

How volatile is the data set > Dispersion

If the data set is symmetric > Measures of Symmetry

How “fat” the tails of the data set are (i.e. the size of extrema) > Kurtosis
Measures of Central Tendency
Data set: {y_{1}, y_{2},…, y_{n}}
Sample Mean:
Property: The sum of the deviations of the observations from the sample mean is 0.
Geometric Mean
Harmonic Mean
May 4, 2016  Lecture 2
Lecture notes will be posted every weekend.
Practice questions (with solutions) will be posted this week. First tutuorial quiz on May 17 (T), 2016 (11:25am).
Outline
 measures of central tendency: medias, quartiles, mode
 measures of variability: range, IQR, variance
Sample > Predictions about the population
Data Summaries
 Centre: CENTRAL TENDENCY
 Variability: DISPERSION
 Tails are similar: SYMMETRY
 Frequency of Extreme Obersevations: KURTOSIS
Numerical and Graphical Summaries
{y_{1}, y_{2}, … , y_{n}}
Sample mean: 1/n sigma y_{i}
Properties: sigma (y_{i}  mean) = 0
Also, affine transformation preserves linear combinations (i.e. the arithmetic mean changes accordingly by the affine transformation)
Median: the middle most observations
arrange the data set in ascending order and pick the middle one. e.g. 1, 3, 7, 13, 25. Median is 7.
Median is not too sensitive with extreme values.
Quartiles: Q1: lower quartile, Q3: upper quartile
Percentiles: Instead of dividing into 4 parts, the data is divided into 100 parts.
Eg. p=0.25, m = (n+1) * p, n is the number of observations
Mode: Observation that occurs with the maximum frequency
St.Petersburg’s Paradox. We care about averages (expected values), but also risk. Variability is a very important property of a data set. E.g. Country A 0 0 0 1000 and Country B 250 250 250 250 Goalie A 0 6 0 6 0 6 0 6 Goalie B 3 3 3 3 3 3 3
Measures of Volatility
Range: Max  Min
Interquartile Range (IQR): IQR = Q3  Q1
Sample Variance and Sample Standard Deviation:
Sample Variance: s^{2} = 1/(n1) sigma (y_{i}  y average)^{2} is approximatey the average of the squared deviation from the mean
Standard Deviation: s = positive square root of s^2, the sample variance
Mean Absolute Deviation(MAD): 1/n sigma abs(y_{i}  y average)
Say a data set of x_{i} is transformed by an affine transformation y_{i} = a + bx_{i}. Then s^{2} of y = b^{2}s^{2}. And the standard deviation is multiplied by the absolute value of b.
May 9, 2016  Lecture 3
 Practice problems with solutions on LEARN
 <= Wednesday’s class for the TQ1 next week
Today
 Measures of Symmetry
 Measures of Kurtosis
 Applications
 Measures of Association
 Graphical Measures
Objective: To figure out the shape of our sample
 Centre: sample mean
 Variability: This measures how much the data set is spread out across the centre; how volatile the data set is; Use s^{2}, variance; Standard deviation is the sqrt(variance)
 If y=a+bx, s^{2}_{y} = b^2s^2_x s_y = abs(b)s_x
 The variance treats data on either side of the mean symmetrically
 Symmetry:
 3 shapes:
 Symmetric
 Rightskewed (mode is on the left side, mean is on the right side), long right tail
 Leftskewed (mode is on the right side, mean is on the left side), long left tail
 Measure of Skewness:
 Quick estimate of skewness:
e.g. 1, 3, 5, 7, 9, 11, 13
Symmetric Data > Mean = Median
e.g. 1, 3, 5, 7, 9, 11, 10^{7} Mean » Median
Measure of Skewness = Mean  Median if > 0, rightskewed, mean greater than median
 if < 0, leftskewed, mean less than median
 if = 0, symmetric
 Quick estimate of skewness:
 3 shapes:
Applications
Gould: “The median is not the message”
“Abdominal mesothalamia”: median of life expectancy after diagnosis = 8 months
If the distribution (of frequency of death against time since diagnosis) is symmetric, then he will die within 16 months at best.
The actual distribution is rightskewed.
Kurtosis
Kurtosis measure how frequent extreme observations are, with respect to the Normal Distribution.
Kurtosis checks whether the data set has “fatter” tails compared to the Normal distribution.
Kurtosis: how much “peakedness” the graph has
For a normal distribution, kurtosis = 3. If k for your data > 3, then it has fatter tails compared to the normal. If it is < 3, then it has narrower tails.
Kurtosis checks whether we can apply the Normal Distribution assumption to our data set by comparing our sample kurtosis to 3.
Measures of Association
Objective: To find whether there is any evidence of association between two variables
 Categorical Variables
lmao
ayy
hey
 Relative Risk: measures the association between two catgorical variables; Measured by [R.R.]
 A and B are indepedent iff P(AB) = P(A)
 If R.R. is approximately equal to 1, then there is little evidence of association
 If R.R. > > 1 or < < 1, evidence of association
 But how high is too high? UNANSWERED QUESTION
 The table = Two Way Contingency Table
 Numerical Variables
How to find a measure of association between two numerical variables? Sample Correlation Coefficient r_{xy}: the sign of r tells us the direction of the relationship, and the value of r tells us the strength of the association/relationship
E.g.
x_{i} = # of beers you drink
y_{i} = STAT 231 mark
Collect a data set:
{x1, y1},…{xn, yn}
So it is most likely that if xn  x average > 0, yn  y average < 0
Thus, the numerator tells us the direction of the relationship. The denominator ensures that 1 <= r_{xy} <= 1
The closer abs(r_xy) is to 1, the strong is the evidence of association.
r captures the linear relationship between x and y. Not good in capture nonlinear relationship
E.g. y = x^{2}, x=3, y=9, x=1, y=1, x=1, y=1, x=3, y=9. r_xy = 0 !!! It’s a function so the two variables are associated for sure, but the Coefficient gives 0. Does not work well with quadratic relationships, or nonlinear in general.
E.g. y = ax + b. r_xy = 1 if a > 0, = 1 if a < 0
Since we are going from sample to population, we can only find evidence of association. Strong correlation != Causation
Graphical Measures
 Density Histogram: Histogram > grouped data; Density histogram > area of every rectangle = relative frequency of the group
 Empirical cdf
 Box Plot
 Scatter Plot
 QQ Plot
May 11, 2016  Lecture 4
A note about TQ1 will be posted on LEARN tonight > Up to today’s material (Ignore the last question on the sample quiz)
Recap
 Centre > Sample Mean
 Variability > Variance, Standard Deviation
 Symmetry > Mean  Median
 Kurtosis > Compare to Normal Distribution
 Correlation Coefficient: r_{xy} measures association
Reasons why we care about Graphical Summaries
 To “identify” the distribution from which the data is drawn,
 To find the shape of the data set
The Five Number Summary
 Minimum
 Q1
 Q2
 Q3
 Maximum
These five numbers provide the rough shape of the graph
Correlation Coefficient measures the linear association between two variables
Graphical Summary
Histogram: Grouped Data
Example: Group Freq (The cohorts are called bins)
Frequency Histogram: Frequency as yaxis, and bins as xaxis (Bar graph)
Density Histogram: xaxis = bins, yaxis = densities
 The length(height) of each rectangle in a density histogram is chosen such that the area of the rectangle is equal to the relative frequency of the bin
 To construct a density histogram: Find the frequency and relative frequency for each bin; Relative Frequency = frequency for the bin / total frequency
The reason why we draw density histogram is that we want to compare our data (shape of graph) with known density functions of distributions
Box Plot: Box and Whiskers Plot
 Lower end of the box/rectangle > Q1
 Upper end of the box/rectangle > Q3
 The line inside the box > Median = Q2
 The upper whisker goes up to the maximum of the data set, that is NOT AN OUTLIER (extreme observations; a data point y_{i} is an outlier if y_{i} > Q3 + 1.5IQR or y_{i} < Q1  1.5IQR; Remember IQR = Q3Q1); The upper whisker ends at the highest value of your data set that is lower than the outliers; We mark the outliers each and every one of them separately
 The lower whisker goes down to the minimum of the data set, that is not an outlier
 The Box Plot gives us the Five Number Summary
 Reading the Box Plot sideways gives the rough shape and the skewness of the distribution
Empirical cdf (Cumulative Distribution Function)
Definition: F(y) = # of observations <= y / Total # of observations, and the plot (y, F(y)) is called the empirical cdf.
{1, 3, 5, 5, 9}, F(0) = 0, F(1) = 0.2, F(3) = 0.4 … the plot of a step function
 We can find a lot of info from the empirical cdf
 F(median) = 1/2, we can identify any percentile
 The number with the biggest jump vertically on the graph is the mode of this data set
QQ Plot
We want to check whether our data set “resembles” a Normal Distribution. The QQ plot plots the alphath quantile of your sample to the alphath quantile of Z Distribution ~N(0, 1); Plots two quantiles: Sample quantiles against the theoretical quantiles
{1, 2, 7, …, 155}
Z(alpha) on the xaxis, y(alpha) on the yaxis. 0 = median and mean of Z
If the QQ Plot is a straight line, then the data resembles the Normal Distribution Z
Q: If we have a distribution that is normal, how do we prove that it will have a straight line QQ plot?
That is, y~N(mu, sigma^2), the QQ plot will be a straight line
Proof:?
Scatter Plot
Scatter Plot checks for association. Just plot (xi, yi).
May 16, 2016  Lecture 5
PPDAC
 typical problem: we have a population of observations, the characteristics of the population is unknown
 We draw a sample, and based on the sample observations infer properties about the population (statistical inference)
 Examples: 1. we want to find the approval rating of Trump among likely US voters; 2. Average income of a starting math undergraduate career in Canada; 3. Does a regular intake of Vitamin C reduce the duration of a flu? 4. Are Canadian Jeopardy contestants better than Americans?
 PPDAC approach gives you an algorithm to tackle statistical problems that are mentioned above; PPDAC = Problem, Plan, Data, Analysis, Conclusion
Problem: Descriptive, Causative, Predictive
 Descriptive: We are interested in some unknown characteristic of the population (e.g. Trump)
 Causative: Whether a variable X causes a variable Y to change (e.g. Vitamin C)
 Predictive: To estimate the future values of some variable (e.g. predicting Blackberry stock prices)
Target Population: the population of interest (e.g. All likely US voters in the Trump example)
Unit: each member of the population
Variate: a characteristic of the unit (e.g. whether the voter approves or disapproves of Trump)
Attribute: a function of the variate (e.g. Proportion of approvals = Attribute)
For the income of math undergrad problem, each individual income is the variate.
Average income is the attribute.
Typically, we want to find some attributes of the population.
Plan
Study Population: the set of observations from which your sample is drawn
Example: Suppose we are doing phone interviews > study population would be all voters with a phone; In medical tests, study population might not be a subset of the target population
 Experimental Plan: Some of the variates can be controlled by the analyst with dependent and independent variables
 Observational Plan: The variates are not under the analyst’s control
Analysis
Setting up a statistical model: we assume that the data is drawn from known distribution with unknown parameters.
Data
 Bias: systematic error > make sure the data is not biased
 Measurement error: random error > measured value  actual value
Population parameters, usually Greek letters, are unknown. Sample parameters, usually Latin letters, are known (sample mean, sample variance).
Random Sampling
Sampling protocol: we want the sample to represent the population
Sampling techniques: methods of drawing the “right” sample
Errors: The difference between the target and the study populations = Study Error; Difference between the study population and the sample = Sampling Error
Difference in the value of the attribute between the Total Population and the Study Population is the study error. Difference in the value of the attribute between the Study Population and the Sample is the sampling error. We want to quantify the sampling error.
Conclusion
We want to make sure that the conclusion is understood by nonstatisticians
##The Theory of Estimation##
Method of Maximum Likelihood
Based on our sample, what is the mostly likely value of the unknown parameter? This is a question every statistician asks. Example: We have a biased coin. The probability of getting a head is either 1/2 or 1/3. Toss the coin 100 times and we see 25 heads. We choose the value of the parameter that maximizes the probability of observing your sample.
Suppose Y is a discrete distribution with a pf f and an unknown parameter theta. Data = {y1, …, yn}.
L(theta; y1, … , yn) = Likelihood Function = P(Y1 = y1, Y2 = y2, … , Yn = yn) as a function of theta; the chance of observing the sample
E.g. Number of accidents ~ Poi(mu), with mu unknown. Sample is {1, 0, 2, 5, 7}. Based on the sample, what is the MLE (Maximum Likelihood Estimate) for mu? Use the pf for the Poisson distribution to calculate the probability of observing each unit in the sample. Multiply them together. The value of mu that maximizes L above is called the MLE. In the above example, L = e^5mu* mu^15 / 1!0!…7!. In order to maximize it, take the log(base e) function and maximize the log function. Take derivatives and equate it to zero, solve for mu (MLE). log L = l = loglikelihood function. In the above example, mu = 3
May 18, 2016  Lecture 6
##Estimation##
We are interested in some unknown characteristics (attributes) of the population. We draw a random, independent sample from this population {y1,…,yn}. Based on these sample observations, we want to come up with the best estimate for the unknown characterstic g(y1,…,yn) will be my best guess for theta.
Some starting points
+ usually, all statistical inference problems start with some estimation of some parameter.
+ {y1,...,yn} > sample
+ we think of the data points {y1,...,yn} as outcomes of some random variable Y (statistical model); We assume that the data is drawn from some distribution with unknown parameters
Example: toss a coin; objective: estimate pi = P(head). Each data point yi can be thought of as an outcome of Yi. The distribution of Yi is called the Statistical Model.
##Steps Involving Estimation##
 Identify the distribution from which your data set is drawn
 Construct the likelihood function L(theta; y1,…,yn) = P(Y1=y1,…Yn=yn) = Probability of observing your sample = P(Y1=y1)…P(Yn=yn)
 Find the MLE (best guess for the unknown parameter). Theta hat is that value of theta that maximizes the likelihood function
Example 1: To estimate the approval rating of Trump. Approved Rating = pi (Population proportion, unknown). Sample = {YYNNNNYNNN} Given this sample, what is pi hat, the MLE for pi? Y is either Yes with probability pi, or No with probability 1pi
L(pi; y1,…yn) = pipi(1pi)…(1pi) = pi^3 * (1pi)^7 is the Maximum Likelihood Function. The value of pi that maximizes the function is pi hat = MLE. The log likelihood function is l(pi) = 3ln(pi) + 7ln(1pi), take derivative, find maximum value of pi = 0.3
Example 2: Objective: To estimate mu, the average number of hits on your blog in an hour. A sample of n hours {y1,…,yn}. What is mu hat, the MLE for mu? We will assume that the data is drawn from a Poisson distribution. L(mu, y1,…yn) = P(Y=y1)*…P(Y=yn). Calculating mu hat, we get mu hat = 1/n(sigma yi) = y average
Example 3: To estimate pi = prob that a Canadian contestant wins Jeopardy. Sample = {y1,…,yn} yi = number of shows contestant i appeared in. Sample = {1,2,1,3,5}. The data in this problem is drawn from a Geometric distribution. P(Y=2) = pi(1pi). P(Y=4) = pi^3(1pi). L(pi; y1,…yn) = pi^[(sigma yi)n] * (1pi)^n. Take log derivatives and equate it to zero. Get pi hat.
Example 4: To estimate the proportion of southpaws at UW.
+ Strategy 1: Go interview people till we get 10 left handers, interviewed 100 people
+ Strategy 2: Go interview 100 people and 10 of them are left handers
Strategy 1: pi = proportion of left handers. L(pi) = C^99_9pi^9(1pi)^90 * pi
Strategy 2: L(pi) = C^100_10pi^10(1pi)^90
So pi hat in both cases are 1/10.
Some Final Points
 For complete analysis, we have to make sure theta hat is indeed the maximum (check the 2nd order condition)
 L(theta; y1,…,yn) = P(Y1=y1)…P(Yn=yn) = f(y1)*…*f(yn). f = distribution function = pi(f(yi))
Since the likelihood function is a product of the probabilities, the value can be really small for large sample sizes.
R(theta) = Relative Likelihood Function = L(theta)/L(theta hat), where theta hat = MLE. The graph of R(theta) against theta is bellshaped, maximized at theta hat. 0<=R(theta)<=1, R hits its maximum at theta = theta hat = MLE
May 25, 2016  Lecture 7, 2016
Syllabus for the Midterm: <= End of this week + STAT 230; Fall 2015 Midterm and practice problems posted (solutions will be posted this Friday)
Today
 Overview of statistical modelling, estimatation and the MLE calculation
 Likelihood functions and the MLE for continuous r.v.s
 Invariance property of the MLE
 Special case > Uniform distribution
Paul the Octopus Problem
 Toss a coin 100 times. Your friend “guesses” right 70 times. Is there evidence that your friend has ESP?
 What is the attribute of interest? (What do we care about the population we are looking at?)
 We are trying to find pi, the probability(your friend will guess right); it is unknown
 y = number of successes in 100 trials, y = 70 > outcome of some r.v. Y. Data points are not just numbers, but outcomes of a random experiment
 We need to construct a statistical model (assuption on the distribution Y)
 In this case, Y~Bin(100, pi)
Statistical Model
 Construct the likelihood function
 L(pi) = C^100_70 pi^70(1pi)^30
 Find the MLE(we choose pi hat which maximizes L(pi)), pi hat = 0.7
 We choose that value of the parameter that maximizes the chance of what we observed
Formal Definition: Yi = f(theta;yi), i=1,…,n, f=distribution function, Yi’s are independent, {y1,…,yn} > data set
L(theta;y1,y2,…,yn) = n Pi i=1 f(theta;yi) Product of the Distribution functions evaluated at the sample points
Example: {y1,…,yn} is independently drawn from yi
f(y;theta) = (1theta)^y * theta; y = 0,1,2,…
Continuous Distributions
Definition: Model: Yi~f(yi;theta) where Yi is a continuous random variable with unknown parameter theta
Example: Objective: To estimate mu = average lifetime of a lightbulb produced by a company. An independent sample {y1,…,yn} is drawn. Based on this sample, what is mu hat?
Model: Yi~Exp(mu), i=1,..,n, Yi’s are independent; L(mu, yi,…,yn) = i/mu^n e^(1/u sigma yi). Take ln, take derivative, equate to zero; get mu hat = y average
Example: Objective: To estimate mu = population average IQ of UW Math students; sigma^2 = variance of the IQ
An independent sample {y1,y2…,yn} is drawn. Based on the sample, what are sigma^2 hat and mu hat?
Solution: Model: Yi~N(mu, sigma^2), assuming normality, i=1,…,n, Yi’s independent
Likelihood function: Take partial derivatives: first with respect to mu, get mu hat = y average (If the distribution is Normal, then y average is the MLE for mu); then take partial derivative with respect to sigma, get sigma^2 hat = 1/n sigma(yi  y average)^2. The MLE for population mean is the sample mean, but the population variance is NOT the sample variance (which is 1/n1 instead of 1/n)
Example: Suppose {y1,…,yn} is drawn from a random variable with density function f(theta;y) = 2y/theta e^((y^2)/theta), with y>0, theta>0. L(theta) = (2^n(y1y2…yn))/theta^n e^(1/theta sigma yi^2), take ln, l(theta)  ln K  nln(theta)  1/theta sigma yi^2. Take derivative, get theta hat = 1/n sigma yi^2
##Invariance Property##
Result: If theta hat is the MLE (best guess for theta) for theta, then g(theta hat) is the MLE for g(theta) for any continuous g
Example: The scores of STAT231 are normally distributed with mean mu and variance sigma^2. Find the estimate for the 95th percentile of the population. Looking at the Z table, 95th percentile corresponds to 1.64.
P(Y<=x) = 0.95
P((Ymu)/sigma <= (xmu)/sigma) = 0.95
P(Z<=xmu/sigma) = 0.95
xmu/sigma = 1.64
x = mu + 1.64 sigma
MLE for the 95th percentile = mu hat + 1.64 sigma hat. Find mu hat and sigma hat, plug in, done.
Example: Suppose we toss a coin 60 times, and we observe 20 successes (heads). Find the MLE for the variance of this distribution. pi hat = 1/3
Var = npi(1pi), Mean = npi
##Uniform Distribution##
Yi~U[0,theta], theta is unknown. {yi,…,yn} What is theta hat? U[a,b], pf = 1/(ba). f(y;theta) = 1/theta if 0<y<theta and 0 otherwise. Then L(theta) = 1/theta^n if 0<=yi<=theta for all i (or can be written as “if theta >= max{yi,…,yn}”), or 0 otherwise. But the graph is not continuous at max{y1,…,yn}. MLE theta hat = max{y1,…,yn}
May 30, 2016  Lecture 8
Interval Estimation: Likelihood function > likelihood intervals, sampling distribution > confidence intervals
The Theory of Estimation
Problem: theta = unknown population attribute of interest; {y1,…,yn} > sample drawn from the population; based on the sample, theta hat(y1,…,yn) which is our ESTIMATE for theta
Method of Maximum Likelihood
 Identify the distribution from which the data is drawn (Modelling)
 Construct the likelihood function, calculate the log likelihood function, take derivatives to find the maximum likelihood estimate
Example: Y1,…,Yn ~ N(mu, sigma^2). Data set: {y1,…,yn}. Objective: Estimate mu
Alternative Method  Method of Least Squares (more practical, do not care about distribution. only care about how accurate the prediction is): Choose mu hat such that the sum of SQUARED errors is minimized. Min sigma[(yimu)^2]
2 sigma(yimu)(1) = 0
sigma(yimu) = 0
mu = y average
For the Normal problem, MLE and LS estimate for mu are the same
Interval Estimation
Obejective: We have an unknown parameter theta. We want to estimate the random interval [A, B] which would contain theta with a “high” probability (that is prespecified/defined)
Example: Margin of Error for opinion polls
Method 1: Through the likelihood function
Definition: A 100p % likelihood interval (p is in (0,1)) for an unknown parameter theta is given by
{theta: R(theta) >= p}, where R(theta) = Relative Likelihood Function
Example: p=0.5, you want a 50% likelihood interval; so {theta: R(theta) >= 0.5}; Draw a horizontal line at y=0.5, the line intersects the RLF graph at points a and b [a,b] = 50% likelihood interval
Question: If theta_0 belongs to the 20% likelihood interval, does it belong to the 10% likelihood interval? Yes.
Question: Does theta hat (MLE) belong to all likelihood intervals? Yes.
Suppose theta_0 belongs to the 30% likelihood interval > means that the likelihood function, evaluated at theta_0, is at least 30% of the likelihood function evaluated at the MLE. L(theta_0)/L(theta hat) >= 0.3 => L(theta_0) >= 0.3 L(theta hat)
Method 2  Method of Sampling Distributions: Method 1 is not an ＩＮＴＵＩＴＩＶＥ interpretation
Steps:
 Identify the ＰＩＶＯＴＡＬ ＤＩＳＴＲＩＢＵＴＩＯＮ from your model
 Find the endpoints of your pivotal distribution and rearrange to construct the coverage interval
 Estimate the coverage interval using your sample > confidence interval
Notations: Y bar (a random variable from which sample means , y bar, are drawn), y bar (sample mean, known), and mu (population mean, unknown)
Example: A population of 1, 2, 3, 4, 5
A sample of size 2 is drawn with replacement from this population
 mu = 3 (usually unknown)
 y bar: {1, 3} > 2; {1, 1} > 1; depends on sampling
 Y bar: random variable from drawing y bars by graphing the frequency of each
Capital letters = random variables = Estimators
Lowercase letters = Estimates from the sample > known numbers
Greek letters = Population parameters > unknown parameters
Example: Objective: To construct a 95% confidence interval for the average IQ of UW professors. {y1,…,yn}, n=25, sigma yi = 2200. We know that the IQs are normally distributed with a variance of sigma^2 = 49 (unrealistic, because sigma^2 is known but not mu). Construct a 95% confidence interval for mu
Model: Yi ~ N(mu, 49). i = 1,…,25 are independent
Y bar ~ N(mu, 49/25) since Y bar ~ N(mu, sigma^2 / n)
(Y bar  mu) / 7/5 = z ~ N(0, 1) PIVOT
Go to the Ztable and find the 95% interval. Find 0.975. a = 1.96
P(1.96 < z < 1.96) = 0.95
P(1.96 < (Y bar  mu) / 7/5 < 1.96) = 0.95
So mu > Y bar  1.967/5, and mu < Y bar + 1.967/5, and these two numbers make up the bounds of the coverage interval. [y bar  1.967/5, y bar + 1.967/5] since y bar is known is the confidence interval
General Rules to estimate mu when sigma is known: (Y bar  Z* sigma/sqrt(n), Y bar + Z* sigma/sqrt(n)) is the coverage interval. Sub y for Y, confidence interval
Suppose we had a 99% coverage, then Z* will increase (to 2.58).
Suppose we want the 95% confidence interval to be of fixed length y bar + A.
Two problems:
 Sigma is usually not known
 The population might not be normal
Central Limit Theorem: Suppose n is large, and Y1,…,Yn, are drawn from a distribution with mean mu and variance signma^2, then Y bar is approximately Y bar ~N(mu, sigma^2 / n)
June 1, 2016  Lecture 9
Today
 Confidence interval for mu from a normal population with known sigma (population SD)
 Confidence interval for Binomial
 Confidence interval for Exponential (when n is large)
 The chisquared distribution chi^2_k
 The T distribution
Example: (prespecified) 99% CI
Example 1: The incomes of UW grads are Normally distributed with mean mu, and SD sigma = $5000
Object: To fund the 99% for mu, i.e. we want to construct a random interval [A, B], such that P(A < mu < B) = 0.99 (This is called the coverage interval, and 0.99 is called the coverage probability or level of confidence)
A sample of 64 UW grads are drawn at random {y1,…,y64}, y bar = 60000, sigma = 7000. What is your CI?
Example 2: November 16, an exit poll is conducted with a random sample of 1000 US voters and 530 of them voted for Trump. Pi = Proportion of total votes Trump will get. We want to construct a 95% CI for pi.
Confidence Interval Approach: construct the pivotal distribution based on your estimator or a function of it; mu = population mean, y bar = sample mean = a number, Y bar = r.v. from which y bar is drawn. The distribution of Y bar is the pivot.
Example 2: pi = population proportion, pi hat = 530/1000 = 0.53 (Estimate), pi tilda (pivot) = r.v. from which pi hat is drawn
An estimate is a number, an estimator is a random variable.
Construct a coverage interval from the pivot. This interval will contain the parameter with the given probability. Estimate this interval using your sample > Confidence Interval
Example 1: Model: Yi ~ N(mu, 5000^2), i=1,…,64; Yi’s independent
Y1,…,Yn ~ N(mu, sigma^2), Y bar ~ N(mu, sigma^2/n). So in this case, Y bar ~ N(mu, 5000^2/64)
(Y bar  mu) / (5000/8) = Z ~ N(0,1) is the pivotal distribution; Ztable > 0.995, a=2.58. P(2.58 <= Z <= 2.58) = 0.99. Plug (Y bar  mu) / (5000/8) = Z back in, we get mu >= Y bar  2.58 * 5000 / 8 and mu <= Y bar + 2.58 * 5000 / 8. So P(Y bar  2.58 * 5000 / 8 < mu < Y bar + 2.58 * 5000 / 8) = 0.99. This interval is the coverage interval. [60000 + 2.58 * 5000 / 8]
NOTE: It is not true that the above interval contains mu with probability 0.99
General Rule
If the population is normal, and the variance is known, then the CI for mu is given [y bar += z* sigma/sqrt(n)]. So z = from the z0table, y bar = sample mean, r = population standard deviation, n = sample size. The interval will be wider if the level of confidence increase
Example 2: pi, pi hat = 0.53, pi tilda = estimator
Theorem  Central Limit Theorem
If n is “large”, then pi
Y1,…,Yn ~ N(mu, sigma^2), Y bar ~ N(mu, sigma^2/n). Let n be large, Y1,…,Yn have mean mu and variance sigma^2. Then (Y bar  mu) / (sigma/sqrt(n)) approximately equals Z for large n
Find the two points for the pivot, between [a, a] the probability is 0.95, a = 1.96
Example: Y1,…,Y100 ~ Exp(mu); {y1,…,y100} > Data. Find the 95% CI for mu.
By CLT, (Y bar  mu) / (mu/sqrt(n)) = Z ~ N(0,1)