SAS Data Analysis Assignment

Instructions

The assignment should be typed. You can copy and paste selected SAS syntax and SAS output into a Word document. For questions that require SAS, you must provide a copy of your SAS program (i.e. include the syntax in your answer) as well as the relevant output, together with any requested commentary and hand calculations. Do not hand in duplicated or unrequested output. Show your working and reasoning, and write in complete sentences.

Marking

Total marks for each question are shown. Marks will be deducted for incorrect or incomplete answers, inadequate explanation, poor-quality comments and interpretation sentences, failure to follow instructions, and poor presentation.

NOTE: Questions 1 and 2 both use the dataset survey.sas7bdat, which comes from a cross-sectional community health survey and contains data on 1,552 participants aged 40-69 years. To investigate relationships with two dichotomous outcomes (high cholesterol [high_CHOL] and chronic obstructive pulmonary disease [COPD]), use the code below to create these variables, and use the format statements to help with your interpretation.

data surveynew;
  set biostats.survey;
  COPD=(FEV/FVC<0.70);
  high_CHOL=CHOL>6;
run;

proc format;
  value yesnof 0='b No' 1='a Yes';
  value smokef 1='d Never' 2='c Former' 3='b <15 cig/d' 4='a 15+ cig/d';
  value drinkf 1='e Never' 2='d Former' 3='c <20gms/d' 4='b 20-60gms/d' 5='a >60gms/d';
  value sexf 0='Male' 1='Female';
  value rxh 1='a Yes' 0='b No';
run;
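
A quick check of the derived variables can be sketched as follows. This is a minimal sketch, not part of the supplied code, and it assumes that the biostats library is already assigned and that FEV, FVC, or CHOL may contain missing values. SAS treats a missing value as smaller than any number, so a comparison such as FEV/FVC<0.70 evaluates to 1 when the ratio is missing; counting the missing inputs first shows whether that matters here. The letter prefixes in the format labels control sort order: PROC LOGISTIC orders levels by formatted value by default, so 'a Yes' sorts ahead of 'b No'.

```sas
/* Sketch only: count missing inputs behind the derived indicators. */
proc means data=surveynew n nmiss min max;
  var FEV FVC CHOL;
run;

/* Tabulate the two outcomes using the supplied yes/no format. */
proc freq data=surveynew;
  tables COPD high_CHOL;
  format COPD high_CHOL yesnof.;
run;
```

If NMISS is zero for all three variables, the indicators can be used as created.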

Question 1 [7 marks] Use the community survey data to explore which variables are associated with high cholesterol (people with CHOL measure greater than 6):

  (a) [1 mark] Use Proc logistic to fit a logistic regression model that examines whether there is an association between high cholesterol (use high_CHOL as the response) and treatment for hypertension (use RXHYPER). Obtain the odds ratio and 95% confidence interval for high cholesterol comparing RXHYPER=yes vs no and provide an associated p-value. Write a sentence that includes and interprets these results.
  (b) [2 marks] Examine the relationship between high cholesterol (use high_CHOL as the response) and age (AGE) by using Proc logistic to fit a model that allows a quadratic relationship with AGE (i.e. include AGE and AGE*AGE).
      Write down the algebraic representation of the fitted (i.e. with estimated coefficients) quadratic relationship. Provide evidence from your output as to whether you believe the relationship between high cholesterol and age is curved or straight. Also provide an interpretation of the relationship between high cholesterol and age using one or more odds ratios obtained from your fitted model that included the quadratic term.
  (c) [2 marks] Use Proc logistic to investigate the relationship between high cholesterol and AGE that allows separate curved relationships with AGE for men and women.
      Write down the algebraic representation of the fitted model separately for men and women. Perform (and interpret) a test of whether the curved relationship between AGE and high cholesterol is significantly different between men and women.
  (d) [1 mark] Obtain appropriate odds ratio estimates (using the fitted model in (c)) that compare high cholesterol between women and men at ages 40, 50, 60, and 70 years and provide an interpretation of these odds ratios in the context of the model fitted.
  (e) [1 mark] Use Proc logistic to fit a logistic regression model that examines the relationship between high cholesterol and RXHYPER, after adjustment for SEX and AGE (use your results from (b) and (c) to decide how to fully adjust for SEX and AGE). Describe how adjustment for SEX and AGE has changed your findings from (a) of the relationship between high cholesterol and RXHYPER.
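
For the quadratic and sex-specific models in Question 1, the general algebraic form and the odds ratios it implies can be written out as below. This is a sketch of the standard derivation only: the symbols p(a), the betas, the gammas, and the sex indicator S (0 = male, 1 = female, following the sexf format) are notation introduced here, not estimates from the data.

```latex
\begin{align*}
\operatorname{logit} p(a) &= \beta_0 + \beta_1 a + \beta_2 a^2 \\
\mathrm{OR}(a+\Delta \text{ vs } a)
  &= \exp\{\beta_1 \Delta + \beta_2[(a+\Delta)^2 - a^2]\}
   = \exp\{\Delta[\beta_1 + \beta_2(2a + \Delta)]\} \\[6pt]
\operatorname{logit} p(a, S) &= \beta_0 + \beta_1 a + \beta_2 a^2
   + S(\gamma_0 + \gamma_1 a + \gamma_2 a^2) \\
\mathrm{OR}(\text{women vs men at age } a)
  &= \exp\{\gamma_0 + \gamma_1 a + \gamma_2 a^2\}
\end{align*}
```

Because of the squared term, the odds ratio for a fixed age increment depends on the starting age a, so a single odds ratio for AGE does not summarise the relationship.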

 

Question 2 [8 marks] Use the community survey data to explore predictors of COPD (i.e. those with a FEV to FVC ratio less than 0.7):

  (a) [1 mark] Perform a stepwise (backward) search for predictors of COPD from among the following list of potential predictors: SEX, ANGINA, asthma, bronch, diabetes, hayfever, myocard, smoking (all as categorical variables), age, alcgrams, bmi, chol, dbp, exercise, sbp, weight (all as quantitative variables). In your search for predictors, consider main effects only, ignore squares of quantitative variables and interactions, and use the p = 0.05 criterion for dropping variables. Provide the output that shows the order in which terms were dropped and the output showing the fitted final model (i.e. its estimated coefficients).
  (b) [2 marks] Which of the categorical variables (i.e. sex, angina, asthma, bronch, diabetes, hayfever, myocard, smoking) in your final model from (a) has the largest effect on COPD?
      Which of the quantitative variables (i.e. age, alcgrams, bmi, chol, dbp, exercise, sbp, weight) in your final model has the largest effect on COPD?
  (c) [2 marks] Evaluate the prediction performance of your final model from (a) by
      (i) obtaining and comparing the histograms of the estimated probabilities of COPD = 1 for people with COPD = 1 and people with COPD = 0;
      (ii) obtaining and interpreting the area under the ROC curve.
  (d) [1 mark] If people in the dataset with an estimated probability ≥ pcut were predicted to have COPD, what value of pcut would give the highest percentage of correct predictions for the 1,552 people in this dataset, and what would be the associated sensitivity and specificity?
  (e) [1 mark] Do you think this model is good at predicting individuals who have COPD? Explain your answer.
  (f) [1 mark] If one person in the dataset who had COPD is chosen at random and one person who did not have COPD is randomly chosen, what is the probability that the person who actually had COPD has a larger estimated probability of COPD from the model?
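
The workflow described in Question 2 (backward elimination, predicted probabilities, ROC curve, and classification at a range of cut-points) could be sketched as below. This is a minimal sketch under stated assumptions, not a model answer: the variable names come from the question, but their coding is assumed, and the dataset name pred and the variable name phat are hypothetical. Observations with a missing value in any candidate predictor are dropped by PROC LOGISTIC, so the model may be fitted to fewer than 1,552 people.

```sas
/* Sketch only: backward elimination with a 0.05 stay criterion. */
ods graphics on;
proc logistic data=surveynew plots(only)=roc;
  class SEX ANGINA asthma bronch diabetes hayfever myocard smoking / param=ref;
  model COPD(event='1') = SEX ANGINA asthma bronch diabetes hayfever myocard
                          smoking age alcgrams bmi chol dbp exercise sbp weight
        / selection=backward slstay=0.05
          ctable pprob=(0.05 to 0.95 by 0.05);   /* classification tables */
  output out=pred p=phat;   /* phat: estimated P(COPD=1); hypothetical name */
run;

/* Histograms of the estimated probabilities, one panel per COPD status. */
proc sgpanel data=pred;
  panelby COPD / columns=1;
  histogram phat;
run;
```

The c statistic reported in the "Association of Predicted Probabilities and Observed Responses" table is the area under the ROC curve.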

 

Question 3 [5 marks] The data set below arises from an age-stratified case-control study of the association between alcohol consumption and oesophageal cancer in a region of France. The age strata were 40 – 49, 50 – 59 and 60 – 69 years. All people in the age group newly diagnosed with oesophageal cancer were the cases. For each age group, the same number of controls (people without cancer) were randomly selected from the electoral register. The usual alcohol consumption over the last 5 years was obtained for each case and control and categorised as < 80g/day and 80 + g/day.

 

                         Alcohol consumption
Age group               80+ g/day   < 80 g/day   Total
40 – 49    Cases            25          21         46
           Controls          8          38         46
50 – 59    Cases            42          34         76
           Controls         13          63         76
60 – 69    Cases            19          36         55
           Controls          9          46         55

 

  (a) [1 mark] Create a SAS dataset for these data suitable for a logistic regression analysis.
  (b) [1 mark] Use Proc Logistic to fit an appropriate logistic regression model that produces the age-adjusted estimate (and 95% CI) of the odds ratio that compares the odds of cancer in 80 + vs < 80 alcohol groups. In your output highlight the odds ratio, its 95% CI and the relevant test p-value of whether the age-adjusted odds ratio estimate is significantly different from one and provide a conclusion.
  (c) [2 marks] Use Proc Logistic to fit a single, appropriate logistic regression model that performs a test of whether the odds ratios (that compare odds of cancer in 80 + vs < 80 alcohol groups) are significantly different across the age groups and produces estimates (and 95% CI) of the odds ratio for each age group. In your output highlight the relevant test p-value and the odds ratios (and their 95% CI). State your conclusion about whether the odds ratios differ significantly across age groups.
  (d) [1 mark] Compare your results from (b) and (c). Which results do you think are better and why?
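
One way to enter the tabulated counts for Question 3 is in grouped (events/trials) form, with one row per age group and alcohol category. The sketch below assumes this layout; the dataset name oesoph and the variable names agegrp, alcohol, cases, controls, and total are hypothetical choices, and the counts are copied from the table above.

```sas
/* Sketch only: grouped data, one row per age group x alcohol category. */
data oesoph;
  input agegrp $ alcohol $ cases controls;
  total = cases + controls;   /* denominator for events/trials syntax */
  datalines;
40-49 80+ 25 8
40-49 <80 21 38
50-59 80+ 42 13
50-59 <80 34 63
60-69 80+ 19 9
60-69 <80 36 46
;
run;

proc print data=oesoph noobs;
run;
```

With this layout, PROC LOGISTIC accepts a response of the form cases/total in the MODEL statement, and a reference level can be set explicitly in the CLASS statement, for example alcohol(ref='<80').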