ECON 400: Introduction to Econometrics
Problem Set #5
Problem #1: You have collected 14,925 observations from the Current Population Survey. There are 6,285
females in the sample, and 8,640 males. The females report a mean of average hourly earnings of
$16.50 with a standard deviation of $9.06. The males have an average of $20.09 and a standard
deviation of $10.85. The overall mean average hourly earnings is $18.58.
a) Using the t-statistic for testing differences between two means (section 3.4 of your
textbook), decide whether or not there is sufficient evidence to reject the null hypothesis
that females and males have identical average hourly earnings.
b) You decide to run two regressions: first, you simply regress average hourly earnings on an
intercept only. Next, you repeat this regression, but only for the 6,285 females in the sample.
What will the regression coefficients be in each of the two regressions?
c) Finally, you run a regression over the entire sample of average hourly earnings on an
intercept and a binary variable DFemme where this variable takes on a value of 1 if the
individual is a female, and is 0 otherwise. What will be the value of the intercept? What will
be the value of the coefficient of the binary variable?
d) What is the standard error on the slope coefficient? What is the t-statistic?
e) Had you used the homoskedasticity-only standard error in (d) and calculated the t-statistic,
how would you have had to change the test-statistic in (a) to get the identical result?
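The arithmetic behind parts (a) through (e) can be sketched in Python from the summary statistics quoted above. This is a minimal sketch rather than an official solution: the function and variable names are illustrative assumptions, and the pooled-variance version is included only as the usual counterpart of a homoskedasticity-only standard error.

```python
import math

# Summary statistics quoted in the problem statement.
n_f, mean_f, sd_f = 6285, 16.50, 9.06    # females
n_m, mean_m, sd_m = 8640, 20.09, 10.85   # males


def unequal_variance_t(m1, s1, n1, m2, s2, n2):
    """t-statistic for a difference in means, allowing different group variances."""
    se = math.sqrt(s1**2 / n1 + s2**2 / n2)
    return (m1 - m2) / se, se


def pooled_variance_t(m1, s1, n1, m2, s2, n2):
    """t-statistic that uses one pooled variance estimate for both groups."""
    pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (m1 - m2) / se, se


# With a female dummy as the only regressor, OLS reproduces the group means:
# the intercept is the male mean and the dummy coefficient is the female-male gap.
intercept = mean_m
female_coefficient = mean_f - mean_m

# Sample-size-weighted average of the two group means, for comparison with
# the overall mean quoted in the problem.
overall_mean = (n_f * mean_f + n_m * mean_m) / (n_f + n_m)

t_unequal, se_unequal = unequal_variance_t(mean_f, sd_f, n_f, mean_m, sd_m, n_m)
t_pooled, se_pooled = pooled_variance_t(mean_f, sd_f, n_f, mean_m, sd_m, n_m)

print(f"intercept = {intercept:.2f}, female coefficient = {female_coefficient:.2f}")
print(f"overall mean = {overall_mean:.2f}")
print(f"unequal-variance: SE = {se_unequal:.4f}, t = {t_unequal:.2f}")
print(f"pooled-variance:  SE = {se_pooled:.4f}, t = {t_pooled:.2f}")
```

Comparing the two printed standard errors shows how much the choice of variance formula matters for these data.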
Problem #2: Labor economists studying the determinants of women’s earnings discovered a puzzling empirical
result. Using randomly selected employed women, they regressed earnings on the women’s number
of children and a set of control variables (age, education, occupation, and so forth). They found that
women with more children had higher wages, controlling for these other factors. Explain how
sample selection might be the cause of this result. (Hint: Notice that women who do not work
outside the home are missing from the sample.) [This empirical puzzle motivated James Heckman’s
research on sample selection that led to his 2000 Nobel Prize in Economics.]
Problem #3: Assume that the regression model $Y_i = \beta_0 + \beta_1 X_i + u_i$ satisfies the least squares assumptions.
You and a friend collect a random sample of 300 observations on Y and X.
a. Your friend reports that he inadvertently scrambled the X observations for 20% of the
sample. For these scrambled observations, the value of X does not correspond to $X_i$ for the
$i$th observation; rather, it corresponds to the value of X for some other observation. The
measured value of the regressor, $\tilde{X}_i$, is equal to $X_i$ for 80% of the observations, but it is
equal to a randomly selected $X_j$ for the remaining 20% of the observations. You regress $Y_i$ on
$\tilde{X}_i$. Show that $E(\hat{\beta}_1) = 0.8\beta_1$.
b. Explain how you could construct an unbiased estimate of $\beta_1$ using the OLS estimator in (a).
c. Suppose now your friend tells you that the X’s were scrambled for the first 60 observations
but that the remaining 240 observations are correct. You estimate $\beta_1$ by regressing Y on X,
using only the correctly measured 240 observations. Is this estimator of $\beta_1$ better than the
estimator you proposed in (b)? Explain.
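A small Monte Carlo experiment can make the scrambling setup concrete. The sketch below is only an illustration under assumed values: the true coefficients, the standard normal distributions for X and u, the seed, and the number of replications are all choices made here, not part of the problem. It scrambles the first 60 of 300 observations and compares the slope from the scrambled data with the slope from the 240 clean observations.

```python
import random
import statistics


def ols_slope(x, y):
    """OLS slope from a regression of y on an intercept and x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


def simulate_once(rng, n=300, n_bad=60, beta0=1.0, beta1=2.0):
    """Draw one sample; return (slope on scrambled X, slope on clean subsample)."""
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [beta0 + beta1 * xi + rng.gauss(0, 1) for xi in x]
    x_tilde = list(x)
    for i in range(n_bad):
        # Replace X_i with the X value of a randomly chosen other observation.
        j = rng.randrange(n - 1)
        if j >= i:
            j += 1
        x_tilde[i] = x[j]
    slope_scrambled = ols_slope(x_tilde, y)
    slope_clean = ols_slope(x[n_bad:], y[n_bad:])
    return slope_scrambled, slope_clean


TRUE_BETA1 = 2.0  # assumed value for the illustration
rng = random.Random(0)
results = [simulate_once(rng, beta1=TRUE_BETA1) for _ in range(1000)]
scrambled = [s for s, _ in results]
clean = [c for _, c in results]

mean_scrambled = statistics.fmean(scrambled)
mean_clean = statistics.fmean(clean)
sd_rescaled = statistics.pstdev([s / 0.8 for s in scrambled])
sd_clean = statistics.pstdev(clean)

print(f"mean slope, scrambled X: {mean_scrambled:.3f}")
print(f"mean slope, clean 240:   {mean_clean:.3f}")
print(f"spread of rescaled slope: {sd_rescaled:.3f}, spread of clean slope: {sd_clean:.3f}")
```

The ratio of the first printed mean to the assumed true slope can be compared with the factor in part (a), and the two printed spreads are relevant to the comparison asked for in part (c).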
Problem #4: For this problem, you need to copy and run missing.do and missing.dta from Canvas. This question
illustrates some basic data manipulation techniques for dealing with missing data. Suppose a
researcher is estimating the effects of demographic factors on an individual’s income using the
General Social Survey of 1991. Her research assistant has tried to prepare the data for her to
generate the estimates, and her do file includes some hard-to-understand comments. Your job is to
help decipher exactly what was done by the research assistant and to help interpret her results.
a) Based on the frequencies from part 1 of the program, how prevalent is missing data? Does it
exist primarily in the DV (Income), one or more of the IVs, or both?
b) In part 2, why do you think her assistant decided to recode the income variable? Why didn’t
the assistant think MD was being handled correctly in the original coding?
c) In part 3, why does the assistant create the PAEDUC2 and MDPAEDUC variables? Why are
they coded that way?
d) In part 4, the assistant comments that “This regression model will give us an idea of
whether or not the MD in PAEDUC is missing on a random basis.” How do these regression
models accomplish this? What does the coefficient for MDPAEDUC supposedly tell you? Is
this a valid approach? Why or why not?
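The contents of missing.do are not reproduced here, so the following is only a hypothetical sketch of one common way to build a filled-in variable and a missing-data flag. The helper name, the fill value of 0, and the sample values are all assumptions made for illustration; the assistant's actual coding may differ and should be read from the do file itself.

```python
def dummy_variable_adjustment(values, fill_value=0.0):
    """Return (filled_values, missing_flags) for a list that uses None for missing data."""
    filled = [fill_value if v is None else v for v in values]
    flags = [1 if v is None else 0 for v in values]
    return filled, flags


# Made-up values for illustration only; None marks a missing response.
paeduc = [12, None, 16, 8, None, 12]

# Hypothetical analogues of the PAEDUC2 and MDPAEDUC variables named in part (c).
paeduc2, mdpaeduc = dummy_variable_adjustment(paeduc)

print(paeduc2)
print(mdpaeduc)
```

In a regression that includes both variables, the coefficient on the flag compares cases with missing values to cases with observed values, which is the kind of comparison part (d) asks you to evaluate.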
Problem #5: In the process of collecting weight and height data from 29 female and 81 male students at your
university, you also asked the students for the number of siblings they have. Although it was not
quite clear to you initially what you would use that variable for, you construct a new theory that
suggests that children who have more siblings come from poorer families and will have to share
the food on the table. Although a friend tells you that this theory does not pass the “straight-face”
test, you decide to hypothesize that peers with many siblings will weigh less, on average, for a given
height. In addition, you believe that the muscle/fat tissue composition of male bodies suggests that
females will weigh less, on average, for a given height. To test these theories, you perform the
following regression:
$\widehat{Studentw}$ = –229.92 – 6.52 × Female + 0.51 × Sibs + 5.58 × Height; R2 = 0.50, SER = 21.08
(44.01) (5.52) (2.25) (0.62)
where Studentw is in pounds, Height is in inches, Female takes a value of 1 for females and is 0
otherwise, and Sibs is the number of siblings (heteroskedasticity-robust standard errors in
parentheses).
a) Carrying out hypothesis tests using the relevant t-statistics to test your two claims separately, is
there strong evidence in favor of your hypotheses? Is it appropriate to use two separate tests in this
situation?
b) You also perform an F-test on the joint hypothesis that the two coefficients for females and
siblings are zero. The calculated F-statistic is 0.84. Find the critical value from the F-table. Can you
reject the null hypothesis? Is it possible that one of the two parameters is zero in the population, but
not the other?
c) You are now a bit worried that the entire regression does not make sense and therefore also test
for the height coefficient to be zero. The resulting F-statistic is 57.25. Does that prove that there is a
relationship between weight and height?
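The test statistics in this problem follow directly from the reported estimates. The sketch below is illustrative, with names chosen here: it divides each coefficient by its standard error and computes a large-sample 5% critical value for a joint test of two restrictions. Using the F(2, infinity) distribution is an assumption; with 110 observations, a finite-sample F table would give a slightly larger critical value.

```python
import math

# (coefficient, heteroskedasticity-robust standard error) as reported in the regression.
estimates = {
    "Female": (-6.52, 5.52),
    "Sibs": (0.51, 2.25),
    "Height": (5.58, 0.62),
}

# t-statistic for the null hypothesis that each coefficient is zero.
t_stats = {name: coef / se for name, (coef, se) in estimates.items()}


def f_critical_two_restrictions(alpha):
    """Upper-alpha critical value of F(2, infinity).

    F(2, infinity) is a chi-squared(2) variable divided by 2, and the
    chi-squared(2) CDF is 1 - exp(-x / 2), so its upper-alpha quantile is
    -2 * ln(alpha). Dividing by 2 gives -ln(alpha).
    """
    return -math.log(alpha)


f_crit_5pct = f_critical_two_restrictions(0.05)
f_joint = 0.84  # F-statistic reported in part (b)

for name, t in t_stats.items():
    print(f"{name}: t = {t:.2f}")
print(f"5% critical value, F(2, inf): {f_crit_5pct:.2f}")
print(f"reject joint null at 5%: {f_joint > f_crit_5pct}")
```

Note that the two claims in part (a) are one-sided, so the printed t-statistics should be compared with one-sided critical values.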
Problem #6: Select the appropriate answer to each question, along with the reasoning supporting
your choice.
Part 1: Under the least squares assumptions (zero conditional mean for the error term, Xi and Yi
being i.i.d., and Xi and ui having finite fourth moments), the OLS estimator for the slope and
intercept
A) has an exact normal distribution for n > 15.
B) is BLUE.
C) has a normal distribution even in small samples.
D) is unbiased.
Part 2: One of the following steps is not required as a step to test for the null hypothesis:
A) compute the standard error of $\hat{\beta}_1$.
B) test for the errors to be normally distributed.
C) compute the t-statistic.
D) compute the p-value.
Part 3: If you wanted to test, using a 5% significance level, whether or not a specific slope
coefficient is equal to one, then you should
A) subtract 1 from the estimated coefficient, divide the difference by the standard error, and check if
the resulting ratio is larger than 1.96.
B) add and subtract 1.96 from the slope and check if that interval includes 1.
C) see if the slope coefficient is between 0.95 and 1.05.
D) check if the adjusted R2 is close to 1.
Part 4: When your multiple regression function includes a single omitted variable regressor, then
A) use a two-sided alternative hypothesis to check the influence of all included variables.
B) the estimator for your included regressors will be biased if at least one of the included variables
is correlated with the omitted variable.
C) the estimator for your included regressors will always be biased.
D) lower the critical value to 1.645 from 1.96 in a two-sided alternative hypothesis to test the
significance of the coefficients of the included variables.
Part 5: All of the following are true, with the exception of one condition:
A) a high R2 or adjusted R2 does not mean that the regressors are a true cause of the dependent variable.
B) a high R2 or adjusted R2 does not mean that there is no omitted variable bias.
C) a high R2 or adjusted R2 always means that an added variable is statistically significant.
D) a high R2 or adjusted R2 does not necessarily mean that you have the most appropriate set of regressors.
Part 6: The following interactions between binary and continuous variables are possible, with the
exception of one:
A) Yi = β0 + β1Xi + β2Di + β3(Xi × Di) + ui.
B) Yi = β0 + β1Xi + β2(Xi × Di) + ui.
C) Yi = (β0 + Di) + β1Xi + ui.
D) Yi = β0 + β1Xi + β2Di + ui.
Part 7: To decide whether Yi = β0 + β1X + ui or ln(Yi) = β0 + β1X + ui fits the data better, you cannot
consult the regression R2 because
A) ln(Y) may be negative for 0<Y<1.
B) the TSS are not measured in the same units between the two models.
C) the slope no longer indicates the effect of a unit change of X on Y in the log-linear model.
D) the regression R2 can be greater than one in the second model.
Part 8: To test whether or not the population regression function is linear rather than a polynomial
of order r,
A) check whether the regression R2 for the polynomial regression is higher than that of the linear
regression.
B) compare the TSS from both regressions.
C) look at the pattern of the coefficients: if they change from positive to negative to positive, etc.,
then the polynomial regression should be used.
D) use the test of (r-1) restrictions using the F-statistic.
Part 9: The binary variable interaction regression
A) can only be applied when there are two binary variables, but not three or more.
B) is the same as testing for differences in means.
C) cannot be used with logarithmic regression functions because ln(0) is not defined.
D) allows the effect of changing one of the binary independent variables to depend on the value of
the other binary variable.
Part 10: A statistical analysis is internally valid if
A) its inferences and conclusions can be generalized from the population and setting studied to
other populations and settings.
B) statistical inference is conducted inside the sample period.
C) the hypothesized parameter value is inside the confidence interval.
D) the statistical inferences about causal effects are valid for the population being studied.