This assignment uses R to work through exercises on Bayesian learning and graphical models.

SIT743 Bayesian Learning and Graphical Models

Assignment-1

Total Marks = 120, Weighting – 25%

Due date: 26 April 2020 by 11.30 PM

—————————————————————————————————————

INSTRUCTIONS:

• For this assignment, you need to submit the following THREE files.

1. A written document (a single PDF only) covering all of the items described in the questions. All answers to the questions must be written in this document, i.e., not in the other files (code files) that you will be submitting. All the relevant results (outputs, figures) obtained by executing your R code must be included in this document.

For questions that involve mathematical formulas, you may write the answers by hand, scan them to PDF and combine them with your answer document. Submit a single combined PDF of your answer document.

2. A separate “.R” or “.txt” file containing the R code (script) that you implemented to produce the results. Name the file “name-StudentID-Ass1-Code.R” (where ‘name’ is replaced with your name, either your surname or first name, and ‘StudentID’ with your student ID).

3. A data file named “name-StudentID-LzMyData.txt” (where ‘name’ is replaced with your name, either your surname or first name, and ‘StudentID’ with your student ID).

• All the documents and files should be submitted (uploaded) via the SIT743 CloudDeakin Assignment Dropbox by the due date and time.

• Zip files are NOT accepted. All three files should be uploaded separately to CloudDeakin.

• E-mail or manual submissions are NOT allowed. Photos of the document are NOT

allowed.

• The questions Q2 and Q3 do not require any R programming.

=================================================================

Some of the questions in this assignment require you to use the “Lizard Island” dataset. This

dataset is given as a CSV file, named “LZIsData.csv”. You can download this from the

Assignment folder in CloudDeakin. Below is the description of this dataset.

Lizard Island dataset:

This dataset gives the weather measurements collected at Lizard Island, which is an island in

the Great Barrier Reef (North Queensland, Australia).

[http://weather.aims.gov.au/#/station/1166 ].

The data gives 10-minute sample measurements collected over a 1-month period between May 2019 and June 2019.

The variables include the following (4 variables, in the same order as the columns appear in the file LZIsData.csv):

Air Temperature: Air temperature in degrees Celsius.

Humidity: Humidity in percentage.

Wind Speed: Maximum wind speed in kilometres per hour.

Air Pressure: Air pressure in hectopascals (hPa).

Q1) [19 Marks]:

• Download the data file “LZIsData.csv” and save it to your R working directory.

• Assign the data to a matrix, e.g. using

the.data <- as.matrix(read.csv("LZIsData.csv", header = FALSE, sep = ","))

• Generate a sample of 1500 data points using the following:

my.data <- the.data[sample(1:4464, 1500), c(1:4)]

Save “my.data” to a text file titled “name-StudentID-LzMyData.txt” using the

following R code (NOTE: you ‘must’ upload this data text file and the R code along

with your submission. If not, ZERO marks will be given for this whole question).

write.table(my.data, "name-StudentID-LzMyData.txt")

Use the sampled data (“my.data”) to answer the following questions.
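The load, sample and save steps above could be sketched as follows. This is only an illustration: LZIsData.csv is not available here, so a simulated 4464 x 4 matrix stands in for `the.data`, and `set.seed`, `tempfile` and the name `out.file` are additions made for this sketch rather than part of the assignment.

```r
# Stand-in for the.data: in the assignment this comes from
# read.csv("LZIsData.csv", ...). The simulated values below are placeholders,
# NOT real Lizard Island measurements.
set.seed(743)  # assumption: fixed seed so the sample is reproducible
the.data <- cbind(rnorm(4464, 26, 1.5),    # air temperature (deg C)
                  runif(4464, 50, 100),    # humidity (%)
                  rexp(4464, 1 / 25),      # wind speed (km/h)
                  rnorm(4464, 1013, 3))    # air pressure (hPa)

# Draw 1500 distinct row indices and keep all four columns.
my.data <- the.data[sample(1:nrow(the.data), 1500), 1:4]

# Write to a text file (a temporary path here; use the required name in practice).
out.file <- tempfile(fileext = ".txt")
write.table(my.data, out.file)

# Read the file back to confirm that the shape survived the round trip.
check <- as.matrix(read.table(out.file))
dim(check)
```

Because `sample()` is random, fixing a seed is one way to make sure the saved text file and the reported results come from the same 1500 rows.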

1.1) Draw histograms for ‘Air Temperature’ and ‘Air Pressure’ values, and comment on

them. [2 Marks]

1.2) Draw a parallel Box plot using the two variables; ‘Air Temperature’ and the

‘Wind Speed’.

Find five number summaries of these two variables.

Use both five number summaries and the Boxplots to compare and comment on

them. [5 Marks]
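A minimal sketch of the plotting and summary calls that 1.1 and 1.2 rely on, using base R only. The two vectors are simulated placeholders (an assumption) standing in for columns of `my.data`, so the shapes seen here say nothing about the real measurements.

```r
set.seed(1)
# Placeholder data standing in for two columns of my.data (assumption).
air.temp   <- rnorm(1500, mean = 26, sd = 1.5)   # roughly symmetric
wind.speed <- rexp(1500, rate = 1 / 25)          # right-skewed

# Side-by-side histograms.
par(mfrow = c(1, 2))
hist(air.temp, main = "Air Temperature", xlab = "Degrees Celsius")
hist(wind.speed, main = "Wind Speed", xlab = "km/h")
par(mfrow = c(1, 1))

# Parallel box plots on a shared axis.
boxplot(list("Air Temperature" = air.temp, "Wind Speed" = wind.speed),
        ylab = "Value")

# Five-number summaries: minimum, lower hinge, median, upper hinge, maximum.
five.temp <- fivenum(air.temp)
five.wind <- fivenum(wind.speed)
rbind(five.temp, five.wind)
```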

1.3) Which summary statistics would you choose to summarize the center and spread

for the ‘Humidity’ data? Why (support your answer with proper plot/s)? Find

those summary statistics for the “Humidity” data.

[4 Marks]

1.4) Draw a scatterplot of ‘Air Temperature’ (as x) and ‘Humidity’ (as y) for the first 1000 data vectors selected from “my.data” (name the axes).

Fit a linear regression model to the above two variables and plot the (regression)

line on the same scatter plot.

Write down the linear regression equation.

Compute the correlation coefficient and the coefficient of determination.

Explain what these results reveal. [8 Marks]
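The regression steps in 1.4 could look roughly like the sketch below. The data are simulated (an assumption), with humidity made to fall linearly with temperature purely so that the fitted line has something to find; `temp`, `hum` and the chosen coefficients are hypothetical.

```r
set.seed(2)
# Placeholder data (assumption): humidity falls roughly linearly with temperature.
temp <- rnorm(1000, mean = 26, sd = 1.5)
hum  <- 150 - 3 * temp + rnorm(1000, sd = 5)

# Scatter plot with named axes, then the least-squares line on top.
plot(temp, hum, xlab = "Air Temperature (deg C)", ylab = "Humidity (%)")
fit <- lm(hum ~ temp)
abline(fit, col = "red", lwd = 2)

# Intercept and slope of the fitted line: hum = b[1] + b[2] * temp.
b <- coef(fit)

# Correlation coefficient and coefficient of determination.
r  <- cor(temp, hum)
r2 <- summary(fit)$r.squared
c(intercept = b[[1]], slope = b[[2]], r = r, r.squared = r2)
```

For a simple linear regression with one predictor, the coefficient of determination equals the square of the correlation coefficient, which gives a quick consistency check on the two numbers.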

Q2) [21 Marks]

2.1) The table below shows the results of a survey of favourite sports in different states, conducted over some period in 2020.

                                     State
Sports            New South Wales (N)   Victoria (V)   Queensland (Q)
Footy (F)                1000               2000            1300
Basketball (B)           1500                500             500
Cricket (C)              1400               1000             800

Suppose we select a person at random,

a) What is the probability that the person is from Victoria (V)? [1 mark]

b) What is the probability that the person likes cricket (C) and is from New South Wales (N)? [1 Mark]

c) What is the probability that the person likes Footy (F) given that he/she is from

Queensland (Q)? [2 Marks]

d) What is the probability that a person who likes Basketball (B) is from Victoria (V)? [2 Marks]

e) What is the probability that the person is from Victoria (V) or likes cricket (C)?

[2 Marks]

f) Find the marginal distribution of sports. [3 marks]

g) Are sports and state mutually exclusive? Explain [2 Marks]

h) Are sports and state independent? Explain [3 marks]
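Q2 does not require R, but the survey counts can be turned into joint, marginal and conditional tables as an optional way to check hand calculations. The sketch below uses only base R; the object names (`counts`, `joint`, `given.state`, `max.gap`) are chosen for illustration.

```r
# Survey counts from the table above: rows are sports, columns are states.
counts <- matrix(c(1000, 2000, 1300,
                   1500,  500,  500,
                   1400, 1000,  800),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(Sport = c("F", "B", "C"),
                                 State = c("N", "V", "Q")))

joint       <- prop.table(counts)        # joint probabilities P(sport, state)
p.sport     <- margin.table(joint, 1)    # marginal distribution of sports
p.state     <- margin.table(joint, 2)    # marginal distribution of states
given.state <- prop.table(counts, 2)     # each column gives P(sport | state)

# Independence would require every joint cell to equal the product of its
# two marginals; max.gap measures the largest departure from that.
max.gap <- max(abs(joint - outer(p.sport, p.state)))
```

A `max.gap` of zero (up to rounding) would indicate independence; anything clearly larger indicates that sport preference varies with state.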

2.2) The weather in Victoria can be summarised as follows:

If it rains one day, there is a 75% chance that it will rain the following day. If it is sunny one day, there is a 30% chance that it will be sunny the following day. Assuming that the prior probability that it rained yesterday is 0.6, what is the probability that it was sunny yesterday given that it is rainy today? [5 Marks]
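The structure of this calculation is Bayes' rule combined with the law of total probability. The symbols below are introduced only for this sketch, and it assumes that each day is either rainy or sunny with no third state: $R_t$ means rainy today, $R_y$ rained yesterday, and $S_y$ sunny yesterday.

```latex
% R_t: rainy today; R_y: rained yesterday; S_y: sunny yesterday (two-state assumption)
P(S_y \mid R_t)
  = \frac{P(R_t \mid S_y)\, P(S_y)}
         {P(R_t \mid S_y)\, P(S_y) + P(R_t \mid R_y)\, P(R_y)},
\qquad
P(R_t \mid S_y) = 1 - P(S_t \mid S_y), \quad P(S_y) = 1 - P(R_y)
```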

Q3) [5 Marks]

3.1) State two differences between the frequentist way and the Bayesian way of estimating a parameter. [2 marks]

3.2) Why are conjugate priors useful in Bayesian statistics? [1 mark]

3.3) Give two examples of Conjugate pairs (i.e., give two pairs of distributions that

can be used for prior and likelihood) [2 marks]

Q4) Frequentist and Bayesian estimations [31 Marks]

An Artificial Intelligence solutions provider, BigSecAI Ltd. houses several computing servers

to perform computationally intensive processing, such as deep learning, on sensitive (secure)

data for customers, including government agencies. In order to provide reliable service,

BigSecAI wants to improve their monitoring and maintenance activities of their computer

servers. As part of its planning, BigSecAI wants to model the lifetime pattern of its servers. BigSecAI assumes that the length of time $X$ (in years) a computer server lasts follows a form of exponential distribution with an unknown parameter $\theta$, as shown below. Here, the quantity $1/\theta$ represents, on average, how long a server lasts.

$X \sim \mathrm{Exp}(\theta)$

$p(X) = p(X \mid \theta) = \theta \exp(-\theta X)$

Assume that there are $N$ servers in use, and that their lifetimes are independently and identically distributed (iid).

4.1) BigSecAI first decided to use a frequentist approach to arrive at an estimate for $\theta$.

Answer the following questions.

a) Show that the joint distribution of the lifetimes of the $N$ servers is given by the equation below (show the steps clearly).

$p(X \mid \theta) = \theta^{N} \exp(-\theta S)$, where $S = \sum_{i=1}^{N} x_i$

[3 marks]

b) Find a simplified expression for the log-likelihood function $L(\theta) = \ln(p(X \mid \theta))$

[3 marks]

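c) Show that the Maximum Likelihood Estimate ($\hat{\theta}$) of the parameter $\theta$ is given by:

$\hat{\theta} = \frac{1}{\bar{x}}$, where $\bar{x} = \frac{S}{N}$ and $S = \sum_{i=1}^{N} x_i$

[4 Marks]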

d) Suppose that the lifetimes of six of their servers are {2, 7, 6, 10, 8, 3}. What is the Maximum Likelihood Estimate (MLE) of the parameter $\theta$ given this data? [2 Marks]

e) Hence, on average, how long would 7 servers last if they are used one after another? [2 Marks]

f) What is the probability that a server lasts between six and twelve years?

Hint: Use the cumulative distribution function (cdf) of the exponential distribution. The cdf of the exponential distribution is given by $F(x) = 1 - e^{-\theta x}$.

[4 marks]
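One way to sanity-check a hand-derived MLE is to maximise the log-likelihood numerically and compare. The sketch below assumes the rate parameterisation implied by the cdf hint, $F(x) = 1 - e^{-\theta x}$, and uses simulated lifetimes (an assumption) rather than the six values in the question; `loglik` and `prob.between` are hypothetical helper names.

```r
set.seed(3)
# Simulated lifetimes (assumption): 50 servers with a true rate of 0.2 per year.
x <- rexp(50, rate = 0.2)
N <- length(x)
S <- sum(x)

# Log-likelihood of the rate parameter: N * log(theta) - theta * S.
loglik <- function(theta) N * log(theta) - theta * S

# Closed-form MLE and a numerical maximiser for comparison.
theta.hat <- N / S
theta.num <- optimize(loglik, interval = c(1e-6, 10), maximum = TRUE)$maximum

# Probability that a lifetime falls between lo and hi, from the exponential cdf.
prob.between <- function(lo, hi, rate) pexp(hi, rate) - pexp(lo, rate)
p <- prob.between(6, 12, theta.hat)
```

`pexp()` is R's exponential cdf in the rate parameterisation, so `prob.between()` is simply $F(\text{hi}) - F(\text{lo})$ evaluated at the estimated rate.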

4.2) BigSecAI has now consulted an overseas computer hardware vendor,

HardwareExpert, which has more experience working with large servers, and obtained

some prior information about the lifetime of servers of similar capacity and processing

capabilities. The HardwareExpert mentioned that their $\theta$ value follows a pattern that can be described using a form of Gamma distribution, Gamma(a, b), where $a$ and $b$ are the hyper-parameters of the Gamma distribution, with $a = 0.1$ and $b = 0.1$.

$\mathrm{Gamma}(a, b) = \frac{b^{a}}{\Gamma(a)} \theta^{a-1} e^{-b\theta}$, where $\Gamma(a)$ is a constant.

a) BigSecAI has decided to use this prior information from HardwareExpert for their

estimation. If it uses the Gamma distribution prior, Gamma (a,b), obtain an

expression for the posterior distribution (show all the steps).

Show that the posterior distribution is also a Gamma distribution, Gamma(a', b'), with different hyper-parameters $a'$ and $b'$. Express $a'$ and $b'$ in terms of $a$, $b$, $N$ and $S = \sum_{i=1}^{N} x_i$. [5 Marks]

b) Use the values of the hyper-parameters $a$ and $b$ suggested by HardwareExpert, and the server lifetimes that have been observed from 6 servers, {2, 7, 6, 10, 8, 3}, to find the values of $a'$ and $b'$. What is the posterior mean estimate of $\theta$? [4 Marks]

c) Write an R program and plot the obtained likelihood distribution, the prior

distribution and the posterior distribution on the same graph. Use different colors

to show the distributions on the plot. [4 Marks]
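A sketch of how the three curves might be drawn on one graph. It assumes the standard Gamma-exponential conjugate update (shape $a + N$, rate $b + S$) with $b$ read as a rate parameter, and it uses simulated lifetimes (an assumption) instead of the six values in the question. The likelihood is rescaled to integrate to 1 over the grid purely so that it is visible alongside the densities.

```r
set.seed(4)
x <- rexp(20, rate = 0.2)      # simulated lifetimes (assumption)
N <- length(x); S <- sum(x)

a <- 0.1; b <- 0.1             # prior hyper-parameters from the text
a.post <- a + N                # conjugate update: shape gains N
b.post <- b + S                # conjugate update: rate gains the total lifetime

theta <- seq(0.001, 1, length.out = 1000)
prior <- dgamma(theta, shape = a, rate = b)
post  <- dgamma(theta, shape = a.post, rate = b.post)

# Likelihood theta^N * exp(-theta * S), rescaled to integrate to 1 on the grid.
lik <- exp(N * log(theta) - theta * S)
lik <- lik / (sum(lik) * (theta[2] - theta[1]))

plot(theta, post, type = "l", col = "blue", lwd = 2,
     xlab = expression(theta), ylab = "Density")
lines(theta, lik, col = "red", lwd = 2, lty = 2)
lines(theta, prior, col = "darkgreen", lwd = 2)
legend("topright", legend = c("Prior", "Likelihood (rescaled)", "Posterior"),
       col = c("darkgreen", "red", "blue"), lty = c(1, 2, 1), lwd = 2)

post.mean <- a.post / b.post   # mean of a Gamma(shape, rate) distribution
```

With hyper-parameters as small as 0.1, the prior is very diffuse, so the posterior and the rescaled likelihood are expected to lie almost on top of each other.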

Q5) Bayesian inference for Gaussians (unknown mean and known variance) [15 marks]

A metal factory in Queensland produces iron bars. They are quality tested by measuring

the amount of sag they undergo under a standard load. A random sample of n iron bars

shows an average sag measurement of 5cm. Assume that the sag measurements are

normally distributed with unknown mean θ and known standard deviation 0.25 cm.

Suppose your prior distribution for θ is normal with mean 4 cm and standard deviation of

2 cm.

a) Find the posterior distribution for θ in terms of n. (Do not derive the formulae.)

[3 Marks]

b) For n=20, find the mean and the standard deviation of the posterior distribution.

Comment on the posterior variance [3 Marks]

c) For n=100, find the mean and the standard deviation of the posterior distribution.

Compare with the results obtained for n=20 in the above question Q5(b) and comment.

[3 Marks]
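The normal-normal update used in parts (a) to (c) adds precisions (reciprocal variances) and takes a precision-weighted average of the prior mean and the sample mean. A small helper along these lines makes it easy to see how the posterior changes with n; the function and argument names are illustrative, not from the assignment.

```r
# Posterior for a normal mean with known sd, under a normal prior.
# All argument names are illustrative (assumption).
normal.posterior <- function(xbar, n, sigma, prior.mean, prior.sd) {
  prior.prec <- 1 / prior.sd^2        # precision of the prior
  data.prec  <- n / sigma^2           # precision contributed by n observations
  post.prec  <- prior.prec + data.prec
  post.mean  <- (prior.prec * prior.mean + data.prec * xbar) / post.prec
  list(mean = post.mean, sd = sqrt(1 / post.prec))
}

# Toy check with round numbers: prior N(0, 1), one observation of 2 with sd 1.
toy <- normal.posterior(xbar = 2, n = 1, sigma = 1, prior.mean = 0, prior.sd = 1)

# The posterior sd shrinks as n grows, here for the setting described above.
sds <- sapply(c(1, 20, 100), function(n)
  normal.posterior(xbar = 5, n = n, sigma = 0.25, prior.mean = 4, prior.sd = 2)$sd)
```

In the toy case the prior and the single observation carry equal precision, so the posterior mean sits halfway between them and the posterior variance is half the prior variance.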

d) Assume that the prior distribution is changed, and now the prior is distributed as a

triangle defined over the range 4 to 6, as shown below:

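$p(\theta) = \begin{cases} \frac{4}{3}\theta - \frac{16}{3} & \text{for } 4 \le \theta \le 4.75 \\ -\frac{4}{5}\theta + \frac{24}{5} & \text{for } 4.75 < \theta \le 6 \end{cases}$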

Write an R program to implement this triangular prior, and compute the posterior distribution considering n = 1. Using the R program, find the posterior mean estimate of θ. Sketch, on a single set of coordinate axes, the prior, likelihood and posterior distributions obtained. [6 Marks]

(Hint: Use the ‘Bolstad’ package in R to perform this.

library(Bolstad)

#https://cran.r-project.org/web/packages/Bolstad/Bolstad.pdf)
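The hint points to the Bolstad package; the sketch below deliberately swaps in a plain grid approximation in base R instead, so that each step (prior, likelihood, normalisation, posterior mean) is visible. It assumes the sample mean of 5 cm and the known standard deviation of 0.25 cm from the question, with n = 1; the names `tri.prior`, `h` and `post.mean` are chosen for illustration.

```r
# Triangular prior on [4, 6] with its peak at 4.75, as defined above.
tri.prior <- function(theta) {
  ifelse(theta >= 4 & theta <= 4.75, (4 / 3) * theta - 16 / 3,
  ifelse(theta > 4.75 & theta <= 6, -(4 / 5) * theta + 24 / 5, 0))
}

theta <- seq(4, 6, length.out = 2001)   # grid over the support of the prior
h     <- theta[2] - theta[1]            # grid spacing

prior <- tri.prior(theta)

# Likelihood of the sample mean: xbar ~ N(theta, sigma^2 / n) with n = 1.
xbar <- 5; sigma <- 0.25; n <- 1
lik <- dnorm(xbar, mean = theta, sd = sigma / sqrt(n))

# Posterior is proportional to prior * likelihood; normalise on the grid.
post <- prior * lik
post <- post / (sum(post) * h)
post.mean <- sum(theta * post) * h

plot(theta, post, type = "l", col = "blue", lwd = 2,
     xlab = expression(theta), ylab = "Density")
lines(theta, prior, col = "darkgreen", lwd = 2)
lines(theta, lik, col = "red", lwd = 2, lty = 2)
legend("topright", legend = c("Prior", "Likelihood", "Posterior"),
       col = c("darkgreen", "red", "blue"), lty = c(1, 2, 1), lwd = 2)
```

A finer grid gives a more accurate normalising constant and posterior mean, at the cost of more evaluations.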

Q6) Clustering: [11 marks]

6.1) K-Means clustering: Use the data file “IOCdata2020.txt” provided in CloudDeakin for

this question. Load the file “IOCdata2020.txt” using the following:

zz <- read.table("IOCdata2020.txt")

zz <- as.matrix(zz)

a) Draw a scatter plot of the data. [1 mark]

b) State the number of classes/clusters that can be found in the data (by visual examination of the scatter plot). [1 mark]

c) Use the above number of classes as the k value and perform the k-means clustering on

that data. Show the results using a scatterplot (show the different clusters with different

colours). Comment on the clusters obtained. [2 Marks]

d) Vary the number of clusters (k value) from 2 to 20 in increments of 1 and perform k-means clustering on the above data. Record the total within-cluster sum of squares (TOTWSS) value for each k, and plot a graph of TOTWSS versus k. Explain how you can use this graph to find the correct number of classes/clusters in the data. [3 marks]
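The k-means steps in 6.1 could be sketched with base R's `kmeans()` as below. The matrix `zz` here is a synthetic stand-in of three Gaussian blobs (an assumption), not the contents of IOCdata2020.txt; `nstart` and `iter.max` are set above their defaults to make the runs less sensitive to the random starting centres.

```r
set.seed(5)
# Synthetic stand-in for zz (assumption): three Gaussian blobs in 2-D.
zz <- rbind(matrix(rnorm(200, mean = 0, sd = 0.5), ncol = 2),
            matrix(rnorm(200, mean = 4, sd = 0.5), ncol = 2),
            cbind(rnorm(100, 0, 0.5), rnorm(100, 4, 0.5)))

# k-means with k = 3; nstart > 1 reruns the algorithm from several random starts.
km <- kmeans(zz, centers = 3, nstart = 20, iter.max = 50)
plot(zz, col = km$cluster, pch = 19, xlab = "x1", ylab = "x2")
points(km$centers, pch = 8, cex = 2)

# Total within-cluster sum of squares for k = 2, ..., 20.
ks <- 2:20
totwss <- sapply(ks, function(k)
  kmeans(zz, centers = k, nstart = 20, iter.max = 50)$tot.withinss)
plot(ks, totwss, type = "b", xlab = "k", ylab = "TOTWSS")
```

TOTWSS generally keeps falling as k increases, so the usual reading of this plot looks for the "elbow" where the decrease flattens out rather than for a minimum.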

6.2) Spectral Clustering: Use the same dataset (zz) and run a spectral clustering (use the

number of clusters/centers as 3) on it. Show the results on a scatter plot (with colour coding).

Compare these clusters with the clusters obtained using the k-means above and comment on

the results. [4 Marks]
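One possible route for 6.2 is `specc()` from the kernlab package, shown here next to k-means for comparison. This is a sketch under assumptions: kernlab must be installed (the spectral step is skipped otherwise), and `zz` is a synthetic dataset of two rings around a central blob, chosen because centroid-based methods typically struggle with such shapes. The helper `ring()` is hypothetical.

```r
set.seed(6)
# Synthetic stand-in for zz (assumption): two concentric rings plus a central blob.
ring <- function(n, r) {
  ang <- runif(n, 0, 2 * pi)
  cbind(r * cos(ang), r * sin(ang)) + matrix(rnorm(2 * n, sd = 0.1), ncol = 2)
}
zz <- rbind(ring(150, 1), ring(150, 3), matrix(rnorm(100, sd = 0.1), ncol = 2))

km <- kmeans(zz, centers = 3, nstart = 20)

# specc() comes from the kernlab package (assumed installed; skipped otherwise).
if (requireNamespace("kernlab", quietly = TRUE)) {
  sc <- kernlab::specc(zz, centers = 3)
  par(mfrow = c(1, 2))
  plot(zz, col = km$cluster, pch = 19, main = "k-means")
  plot(zz, col = sc, pch = 19, main = "Spectral")
  par(mfrow = c(1, 1))
}
```

Spectral clustering groups points by connectivity in a similarity graph rather than by distance to a centroid, which is why the two methods can disagree on non-convex clusters.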

Q7) [18 Marks]

For this question you will be using “HeronIslandWaterTemp” dataset. This dataset

is given as a CSV file, named “HeronIslandWaterTemp.csv”. You can download

this dataset from the Assignment folder in CloudDeakin.

This dataset contains only one variable, namely “Water Temperature” (WT).

Use the following R code to load the whole data for the WT variable:

WTempdata <- as.matrix(read.csv("HeronIslandWaterTemp.csv", header = TRUE, sep = ","))

7.1) Provide a time series plot of the WT data (use the index as the time (x-axis))

using R code. [1 Mark]

7.2) Plot the histogram for WT data. Comment on the shape. How many modes can

be observed in the data? [2 Marks]

7.3) Fit a single Gaussian model $N(\mu, \sigma^2)$ to the distribution of the data, where $\mu$ is the mean and $\sigma$ is the standard deviation of the Gaussian distribution.

Find the maximum likelihood estimate (MLE) of the parameters, i.e., the mean $\mu$ and the standard deviation $\sigma$.

Plot the obtained (single Gaussian) density distribution along with the

histogram on the same graph.

[3 Marks]
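For a single Gaussian, the maximum likelihood estimates have closed forms: the sample mean, and the standard deviation computed with a 1/n divisor (R's `sd()` uses 1/(n-1) instead). A sketch on synthetic data follows; the vector `wt` is a simulated stand-in (an assumption) for the water temperature column.

```r
set.seed(7)
# Synthetic stand-in for the WT column (assumption): a bimodal sample.
wt <- c(rnorm(600, mean = 22, sd = 0.8), rnorm(400, mean = 26, sd = 1.0))

# MLE of a single Gaussian: sample mean and the 1/n standard deviation.
mu.hat    <- mean(wt)
sigma.hat <- sqrt(mean((wt - mu.hat)^2))

# Density-scaled histogram with the fitted curve overlaid.
hist(wt, breaks = 30, freq = FALSE, main = "Single Gaussian fit",
     xlab = "Water Temperature")
curve(dnorm(x, mean = mu.hat, sd = sigma.hat), add = TRUE, col = "red", lwd = 2)
```

Using `freq = FALSE` puts the histogram on the density scale, which is what allows the fitted density curve to be overlaid directly.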

7.4) Fit a mixture of Gaussians model to the distribution of the data using the

number of Gaussians equal to the number of modes found in the data (in

Q7.2 above). Write the R code to perform this. Provide the mixing coefficients,

mean and standard deviation for each of the Gaussians found. [4 Marks]

7.5) Plot these Gaussians on top of the histogram plot. Include a plot of the combined

density distribution as well (use different colors for the density plots in the same

graph). [3 Marks]

7.6) Provide a plot of the log likelihood values obtained over the iterations and

comment on them. [2 Marks]

7.7) Comment on the distribution models obtained in Q7.3 and Q7.4. Which one is

better? [1 Mark]

7.8) What is the main problem that you might come across when performing maximum likelihood estimation using a mixture of Gaussians? How can you resolve that problem in practice? [2 Marks]
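The mixture fitting, component plots and log-likelihood trace in 7.4 to 7.6 can be illustrated with a hand-written EM loop in base R. This is a sketch under assumptions, not the required implementation: the data are simulated, the number of components is fixed at two, the initial values are simple guesses, and a small floor on the standard deviations is included as one simple guard against a component collapsing onto a single point.

```r
set.seed(8)
# Synthetic stand-in for the WT data (assumption): two overlapping Gaussians.
x <- c(rnorm(600, mean = 22, sd = 0.8), rnorm(400, mean = 26, sd = 1.0))
K <- 2

# Initial guesses: equal weights, means spread over the quantiles, common sd.
w  <- rep(1 / K, K)
mu <- as.numeric(quantile(x, probs = (1:K) / (K + 1)))
s  <- rep(sd(x), K)

loglik <- numeric(0)
for (iter in 1:200) {
  # E-step: weighted component densities, one column per component.
  dens <- sapply(1:K, function(k) w[k] * dnorm(x, mu[k], s[k]))
  loglik <- c(loglik, sum(log(rowSums(dens))))
  resp <- dens / rowSums(dens)          # responsibilities

  # M-step: re-estimate weights, means and standard deviations.
  nk <- colSums(resp)
  w  <- nk / length(x)
  mu <- colSums(resp * x) / nk
  s  <- sqrt(colSums(resp * outer(x, mu, "-")^2) / nk)
  s  <- pmax(s, 1e-3)                   # floor guards against a collapsing component

  # Stop once the log-likelihood has effectively stopped increasing.
  n.it <- length(loglik)
  if (n.it > 1 && loglik[n.it] - loglik[n.it - 1] < 1e-8) break
}

# Histogram with each weighted component and the combined mixture density.
hist(x, breaks = 30, freq = FALSE, main = "Two-component mixture", xlab = "x")
for (k in 1:K) curve(w[k] * dnorm(x, mu[k], s[k]), add = TRUE, col = k + 1, lwd = 2)
curve(w[1] * dnorm(x, mu[1], s[1]) + w[2] * dnorm(x, mu[2], s[2]),
      add = TRUE, lwd = 2, lty = 2)

# Log-likelihood trace over the EM iterations.
plot(loglik, type = "b", xlab = "Iteration", ylab = "Log-likelihood")
```

Each EM iteration should leave the log-likelihood no lower than before, so a trace that rises and then flattens is the expected shape. Results can depend on the starting values, which is why several initialisations are often tried.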
