使用R检验一些基本且重要的过程行为

FIT3154 Assignment 1

Due Date: 11:55PM, Thursday, 26/9/2019

Introduction

There are a total of four questions worth 9 + 14 + 15 + 12 = 50 marks in this assignment.

This assignment is worth a total of 20% of your final mark, subject to hurdles and any other matters (e.g., late penalties, special consideration, etc.) as specified in the FIT3154 Unit Guide or elsewhere in the FIT3154 Moodle site (including Faculty of I.T. and Monash University policies).

1

Question 1 (9 marks)

In this question you will use R to examine the behaviour of two basic and important shrinkage proce- dures: (i) the hard thresholding method, and (ii) the lasso shrinkage method in the context of the very simple (single) mean model. In particular, you will examine the behaviour of these two estimators under the very simple model:

y ∼ N(μ,1) (1) For reference, the least-squares estimate of μ using y is just μˆLS(y) = y which has squared-error risk

E (μ − μˆLS(y))2 = 1.

Under this very simple model the estimates of μ produced by the thresholding estimator and lasso

estimator are

and

TH y if|y|>λ μˆ (y; λ) = 0 if |y| < λ

μˆL1(y; λ) = sgn(y) max(|y| − λ, 0)

respectively, where λ > 0 is the hyperparameter controlling the degree of shrinkage. We are interested in the squared-error estimation risks (average squared errors) of these two estimators under the model (1). Complete the following questions; provide R code in your report for questions 1.1 and 1.2.

1. Write an R function that takes a μ, λ and calculates the squared error risk of the hard thresholding function, i.e.,

2∞ 2

Rμ,μˆTH=E μ−μˆTH(y;λ) =

where y follows the model (1). You should do this using simulation, as generating random samples from the data model (1) is very straightforward. Your function should take a number of simulations to run as an additional argument. [2 marks]

2. Write an R function that takes a μ, λ and a number of simulation iterations, and calculates the approximate squared error risk of the lasso shrinkage estimator, i.e.,

2∞ 2 Rμ,μˆL1=E μ−μˆL1(y;λ) = p(y|μ,1)μ−μˆL1(y;λ) dy

−∞

As above, this should be done using simulation. [2 marks]

3. Once you have coded these functions, you should produce the squared-error risk curves as μ is varied from (−10, 10) by steps of 0.1 for λ = 1 and λ = 2. Produce plots for each of these settings, and in each plot overlay the squared error risk curves for the hard-thresholding estimator, the lasso estimator and the least-squares estimator. Make sure these plots are neatly formatted with appropriate axis labels. Make sure you run your simulations for sufficient iterations to get smooth looking curves. [2 marks]

4. Analyse the resulting plots. Discuss how the hard thresholding estimator compares to the lasso estimator, and how they both compare to the least-squares estimator? What possible advantages and drawbacks do these plots suggest that each of these estimators might have? [3 marks]

2

−∞

p(y|μ,1)μ−μˆTH(y;λ) dy

Question 2 (14 marks)

This question will require you to analyse a regression dataset using Bayesian lasso regression. In particular, you will be looking at predicting the fuel efficiency of a car (in litres per kilometer) based on characteristics of the car and its engine. This is clearly an important and useful problem. The dataset fuel2017-20.csv contains n = 2, 000 observations on p = 9 predictors obtained from actual fuel efficiency tables for car models available for sale during the years 2017 through to 2020. The data dictionary for this dataset is given in Table 1.

To analyse this data in a Bayesian fashion you will use the Bayesian lasso as implemented in the file bayes.lasso.R. This has the same inputs and produces the same output variables as the bayes.cauchy.linreg.R code you used in Studio 6.

1. Fit a regression model to the fuel efficiency data using the least-squares method in R (i.e., using lm()). From this output, what variables do you believe are potentially associated with increased concrete strength, and why? [2 marks]

2. Analyse the fuel efficiency data using the Bayesian lasso. Compare posterior means, standard deviations and t-statistics to those obtained using least-squares. How do they compare, and if you believe they are different, why do you think they are different? [2 marks]

3. From your Bayesian analysis, which variables seem unimportant/unlikely to be associated with fuel efficiency? Justify your answer [2 marks]

4. Fit a model using the Bayesian lasso with the unimportant variables removed

• Write down your best estimate of the regression equation relating fuel efficiency to the

explanatory variables. [2 marks]

• If we wanted to advise a car manufacturer regarding the choice of an engine to have likely

good fuel efficiency, what does this model suggest we should look for? [2 marks]

5. Imagine that a friend proposes purchasing a particular model of car with the properties given in

Table 2 Use your Bayesian model to:

(a) Predict the mean fuel efficiency for this car. [1 mark]

(b) Provide a histogram of the posterior distribution of the mean fuel efficiency for this car, and a 95% credible interval on the plausible range of mean fuel efficiency for this car. [1 mark]

6. An engineer suggests they believe there is likely an interaction effect between engine displacement and the number of cylinders. Using the Bayesian lasso, and statistics produced by the Bayesian lasso code, assess whether you think this is the case, and what effect it has on the model? [2 marks]

3

Variable name

Model.Year

Eng.Displacement

No.Cylinders

Aspiration

No.Gears

Lockup.Torque.Converter

Drive.Sys

Max.Ethanol

Fuel.Type

Comb.FE

Description

Year of sale

Engine Displacement (litres, l) Number of Cylinders

Engine Aspiration (Oxygen intake)

Number of Gears

Lockup torque converter present? Drive System

Maximum % of Ethanol allowed Type of Fuel

Fuel Efficiency (l/km)

Values

2017 − 2020

0.9 − 8.4

3 − 16

N: Naturally∗

OT: Other

SC: Supercharged

TC: Turbocharged

TS: Turbo+supercharged 1 − 10

N∗ and Y

4∗: 4-wheel drive

A:All-wheel

F:Front-wheel

P:Part-time 4-wheel

R:Rear-wheel

10 − 85

G∗: Regular Unleaded

GM: Mid-grade Unleaded Recommended GP: Premium Unleaded Recommended GPR: Premium Unleaded Required 4.974 − 26.224

Table 1: Fuel efficiency data dictionary. The ∗ denotes the reference category for each categorical variable.

Variable Model.Year Eng.Displacement No.Cylinders Aspiration No.Gears Lockup.Torque.Converter Drive.Sys Max.Ethanol Fuel.Type Value 2018 3.5 6 N 8 N R 15 GPR

some text

Table 2: Engine Properties.

4

Question 3 (15 marks)

This question involves some theory regarding estimation of the Poisson distribution. Imagine we have n observations generated from a Poisson distribution

y1,…,yn ∼ Poi(λ) with unknown λ and choose to estimate λ using

ˆ 1

λa,b= n+b (a+ny ̄) (2)

where y ̄ = (1/n) ni=1 yi is the sample mean and a ≥ 0 and b ≥ 0 are user chosen hyperparameters. The bias and variance of the estimator (2) are

ˆ a−bλ ˆ nλ

biasλ(λa,b)= n+b , Vλ λa,b = (n+b)2 (3)

Please answer the following questions. Provide appropriate working/mathematical justification.

1. Prove that the bias and variance are equal to (3). [3 marks]

2. Describe the dependency of the bias and variance on the hyperparameters a and b and the population value of λ. [2 marks]

3. Plot the squared-error risk (expected squared-error) for n = 5 and λ ∈ (0, 10) for (a = 1, b = 1), (a = 5, b = 1), (a = 5, b = 5) and maximum likelihood. You may plot all curves on the one plot. Describe the key features of the risk curves. [3 marks]

4. Given an a and a b, for which population value of λ does the estimator (2) attain the smallest risk? [2 marks]

5. Is the estimator (2) asymptotically efficient as n → ∞? [2 marks]

Sometimes the alternative log-rate parameterisation v = logλ is used for the Poisson distribution.

This is in some (statistical) ways a more natural parameterisation than the rate λ.

6. Write down the probability mass function of the Poisson distribution parameterised in terms of

the log-rate v. [1 mark]

7. Derive the Fisher information Jn(v) for the log-rate parameter v. (hint: remember the observa-

tions are independently and identically distributed) [2 marks]

5

Question 4 (12 marks)

It was believed for a long time by medical practitioners that the full moon influenced the expression of medical conditions including fevers, rheumatism, epilepsy and bipolar disorder – in fact, the antiquated term “lunatic” derives from the word lunar, i.e., of the moon. In the late 1990’s a (tongue in cheek) study was undertaken to test if the full moon induced dogs to become more aggressive, with a resulting increased likelihood of biting people. In addition to being a little bit of fun, examining a problem like this through the lense of data science is an instructive example on how quantitative methods can be used to answer “folk-lore” questions/hypotheses.

A dataset was recorded that contained the daily number of admissions to hospital of people being bitten by dogs from 13th of June, 1997 through to 30th of June, 19981. The number of dogbite incidences on full moon days was

y = (0,3,4,3,6,3,9,6,3,5,8,4,1)

The average number of dogbites on the remaining non-full moon days was 4.51, which we can take as being a sufficiently accurate estimate given the amount of data on which it is based.

This question involves using the Poisson distribution to analyse this dogbite data. Let us first analyse the data using maximum likelihood and the asymptotic distribution of maximum likelihood estimates. Please answer the following questions. Provide working/justifications/code as required to support your answers.

1. Fit a Poisson distribution to the dog-bite data above using maximum likelihood. What is the estimated rate of dog-bites on full moon days? Provide an approximate 95% confidence interval for this estimate using the asymptotic distribution of the maximum likelihood estimate. [3 marks]

2. Using the asymptotic distribution of the ML estimate of λ, test the hypothesis that dogs bite more frequently on full moon nights than on non-full moon nights. Clearly specify the hypothesis you are testing and discuss the resulting p-value. [3 marks]

Let us now consider a Bayesian analysis of the data using the hierarchy: yi |λ ∼ Poi(λ), i = 1,…,n

λ|a,b ∼ Ga(a,b)

where Ga(a,b) denotes a gamma distribution with shape a and scale b. After consulting a number of physicians at emergency wards, we find a reasonable guess at the average number of dog bite incidences on any given day to be 5, with a 95% prior probability of the average number of daily dog bite incidences ranging anywhere from 1 to 14.

3. Select appropriate values of a and b for our gamma prior distribution to (approximately) encode the specified information (hint: use the pgamma() function). [1 mark]

4. Using this prior information and the provided data, write down the posterior distribution for λ (i.e., the estimate of the average number of dog bite incidences on a full moon day). [1 mark]

5. Plot the prior and posterior distributions of λ (hint: use the dgamma() function). [1 mark]

6. Report on the posterior mean and 95% credible intervals. How do these compare to our maximum

likelihood estimates? (hint: use the qgamma() function) [2 marks]

7. Do you think the Bayesian analysis supports the hypothesis that dogs bite more frequently on

full moon days than non-full moon days? [1 mark]

1Data source is taken from the Australian Institute of Health and Welfare Database of Australian Hospital Statistics.

6