Univariate Statistics and Methodology using R – 2019/2020
Read this whole document before you do anything else.
Deadline: 20th of January 2020
For the course assignment, you will be expected to retrieve, clean, and analyse a data set. In this document
we provide the primary research questions to be answered, information on the structure and format of the
final report, information on code that should be submitted, and a brief overview of the marking criteria. You
can find an R script template on LEARN.
It can be tempting to over-complicate assessments like this, particularly if you have a long time to complete
them. The labs have been designed to prepare you for this assignment: to explore data, to conduct appropriate analyses for given data types, and to make decisions that you can justify. Bear in mind that completing
this assessment does not require any knowledge that wasn’t covered in lectures, labs, and readings.
What you need to submit
For your assessment you need to submit two documents: your report and your R code. More instructions
on how to submit are below. Here, we provide more detail on what to submit.
You need to produce a report answering the assignment questions below. Your report should include appropriate analyses to provide answers to these questions while describing the process and utilising graphics
where necessary to illustrate your points.
• Your report should clearly identify the decisions you made in analysing the data, as well as summarising what can be concluded from your analysis.
• Figures and tables should be numbered and captioned, and referred to in the text; important statistical outcomes should be summarised in the text.
• Reporting should follow APA 6th Edition guidelines for the presentation of tables, figures, and statistical results (see final lecture for more information). An alternative style is acceptable so long as it is clear and consistent.
• Your report should be a maximum of 4 sides of A4 (including tables and figures), in a standard font,
size 12, with normal 1 inch margins.
Your report must be accompanied by an R script (a text file with the extension .R, the default file type when
saving a script from RStudio) which can be used to exactly reproduce the results set out in your submitted
report. It should include all steps taken in data cleaning and all analyses. Every answer to the assignment
tasks/questions given below must be obtainable from your code.
Please try to write clear and informative comments within the file. You may find it useful to have a comment
heading for the code which answers each question.
Any code copied and pasted or otherwise adapted from internet examples should be cited appropriately
in the comments. An appropriate citation should include the URL where the code was found, the name
of the website or blog, and the original author’s name. In the absence of a proper name, you can cite the
contributor’s nickname or alias.
1. A code template is available on LEARN.
2. At the top of your script you should place any use of library() to load any packages you need,
the reading in of data, any self-written functions, and details of collaboration (see below).
3. Collaboration: You can work on the R script in small groups (no more than 4 students) if preferred. Please
include a comment line (line starting with #) which includes the exam numbers (not the names) of
those you worked with. For example:
# Produced in collaboration with students B045329 and B018429
Within the script point out (again using comments) which blocks of code are shared. Please ensure
that your acknowledgements match those of others in your group (if you say you produced the script
in collaboration with B045329, we expect B045329 to acknowledge you).
Important: While the code can be worked on in small groups, the written report must be produced entirely independently. It is not OK to include sections in the written report that are written collaboratively.
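As an illustration of the layout described in points 2 and 3, the top of a script might look something like the sketch below. It is not the template from LEARN: the package line is commented out, the data URL is left as a placeholder, and standardise() is a hypothetical helper included only to show where self-written functions go.

```r
# Produced in collaboration with students B045329 and B018429
# (exam numbers only; these are the example numbers used in this document)

# ---- Packages ---------------------------------------------------------
# library(tidyverse)  # load whichever packages you actually use, here at the top

# ---- Data -------------------------------------------------------------
# Placeholder only: substitute the URL given in the dataset section.
# drinkdriving <- read.csv("<URL from the assignment>")

# ---- Self-written functions -------------------------------------------
# Hypothetical helper: convert a numeric vector to z-scores
standardise <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

# ---- Question 1 -------------------------------------------------------
# SHARED BLOCK (written with B045329): start
# ... code answering question 1 ...
# SHARED BLOCK: end
```

Marking the start and end of each shared block in comments makes it easy to see which parts were written collaboratively.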
Submission and Marking
Submitting your work
All coursework must be submitted before 12:00 (noon) on 20th of January 2020 via Turnitin. You can access
it by clicking on the “Assessment details and submission” tab of the course page on LEARN. There are two
sections there, one for each of the two files you are required to submit. You will be asked to provide your
name and submission title. The submission title must be your exam number (and nothing more). Your name
will not appear anywhere in the documents accessed by the markers. To ensure that the marking is entirely
anonymous, please do not include your name or student number anywhere in either of the submitted files.
Remember, the files you are required to submit are:
• Report, as described above. The filename must be your exam number with whatever extension is provided by your chosen word processor (e.g., ‘B045329.docx’). The file you create should have your
exam number on each page (e.g., in the header or footer).
• R script which runs all of the data cleaning and the final analyses reported. The filename must be your
exam number with the .R extension (e.g., ‘B045329.R’).
Please ensure that you name your documents exactly as above. File names such as ‘R Script for B04329.R’ or
‘B044329 Report final.docx’ slow down document matching and marking and will result in loss of marks.
Please check LEARN for detailed instructions on the submission process prior to submitting.
The code is worth 20% of the coursework marks, and the report is worth 80% of the coursework marks.
Work will automatically fail (max mark of 30%) unless both components are submitted.
You will be assessed on the following:
1. Appropriate cleaning of the data set and key variables of interest, making appropriate and justified
decisions on the steps you take.
2. Selection of appropriate statistical tests and variables to answer the primary research question and
the justifications provided for your selections.
3. Interpretation of the results of the selected analyses.
4. R code that runs without errors all the way through, and is clear and appropriately commented. For handy
tips on writing good code, see http://adv-r.had.co.nz/Style.html (no need to stick religiously to the
guidelines but following them does make code nice and tidy).
5. Last but not least: Clarity of writing and formatting. The report should conform to the APA 6th Edition style guidelines for formatting text, tables, and figures, reporting results of statistical analyses,
writing style, etc. However, alternative style is acceptable provided it is comparably clear and consistent. For a useful resource, see https://owl.purdue.edu/owl/research_and_citation/apa_style/
1. Do you have two separate files to submit, and are the filenames correct?
2. Is your exam number present?
3. Have you removed any mention of your name?
4. Is your code reproducible? If you clear your environment (use the little broomstick icon in the RStudio environment pane), and run your script from start to finish, does it successfully reproduce your results?
5. Does your code include the exam numbers of all of your collaborators?
6. Is your report within the page limit?
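One way to carry out the reproducibility check in point 4 without relying on the broomstick icon is to run the script in a completely fresh R process. The sketch below assumes the Rscript program that ships with R is available; runs_from_scratch() is a hypothetical helper name, not something provided with the course.

```r
# Run a script in a fresh R process, so that nothing in the current
# environment can leak into it. Returns TRUE if it finished without error.
runs_from_scratch <- function(script_path) {
  rscript <- file.path(R.home("bin"), "Rscript")
  status <- system2(rscript, c("--vanilla", shQuote(script_path)))
  status == 0
}

# Example (file name follows the naming rule above):
# runs_from_scratch("B045329.R")
```

If this returns FALSE, the error message printed by the fresh process usually points to an object or package that exists in your interactive session but is never created or loaded by the script itself.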
Motoring offences dataset
Two datasets can be loaded from the following url:
The data provided contain information about the nature and circumstances of motorists stopped and breathalysed by the Police.
Data are collected every time a driver is stopped by the Police and breathalysed. Records indicate the
speed at which the driver was travelling when they were stopped, and the blood alcohol content of the driver
when measured via breathalyser. Information is also captured on the age and prior motoring offences of the
driver, and whether the incident occurred at day or night. Police officers may have had reasons for stopping
drivers other than presuming them to be intoxicated (for instance, someone who is stopped for speeding
may subsequently be breathalysed if they are deemed to be acting unusually).
Each time a police officer stops a motorist, an incident ID is created. A separate database used primarily for
administrative purposes includes records of which officer (recorded as initials) attends which incidents.
Variable        Description
age             Age of driver (in years)
nighttime       Whether or not the incident occurred at night
prior_offence   Offence code for any prior motoring offences
speed           Speed when stopped by police (mph)
bac             Blood Alcohol Content (%) as measured by breathalyser
outcome         Outcome of stop (‘fine’, ‘warning’)
incident_id     ID of incident
officer         Officer attending (initials)
Offence code    Description
N               No prior offence
DR50            In charge of veh. while unfit through drink
DR80            In charge of veh. while unfit through drugs
CD..            Careless Driving offences …
PL..            Driving without ‘L’ Plates
SP..            Speeding offences …
TS..            Traffic direction and signs offences …
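Because officer information is held in a separate database, the two datasets need to be joined before officers can be related to incidents. The sketch below uses small invented data frames, not the real data. It assumes that incident_id appears in both datasets, and the specific codes SP30 and CD10 are made-up examples of the SP.. and CD.. patterns.

```r
# Toy stand-ins for the two datasets (all values invented for illustration)
incidents <- data.frame(
  incident_id   = 1:5,
  age           = c(23, 41, 35, 58, 19),
  prior_offence = c("N", "DR50", "SP30", "CD10", "DR80"),
  stringsAsFactors = FALSE
)
officers <- data.frame(
  incident_id = c(5, 3, 1, 2, 4),
  officer     = c("AB", "CD", "AB", "EF", "CD"),
  stringsAsFactors = FALSE
)

# Attach the attending officer to each incident via the shared ID
drinkdriving <- merge(incidents, officers, by = "incident_id")

# Collapse offence codes to their two-letter family; "N" means no prior offence
drinkdriving$offence_family <- ifelse(
  drinkdriving$prior_offence == "N",
  "none",
  substr(drinkdriving$prior_offence, 1, 2)
)

table(drinkdriving$offence_family)
```

Note that a two-letter family lumps DR50 (drink) and DR80 (drugs) together, so a finer split may be needed for some questions.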
1 The relationship between blood alcohol content and age.
Once you are content that the data are appropriately cleaned, run the following model:
m1 <- lm(bac ~ age, data = drinkdriving)
Check model diagnostics, and when you are happy, concisely report and interpret the results of your model.
What is the predicted blood alcohol content for a 50 year old driver who gets stopped by the Police?
Produce and interpret a diagnostic plot of the model that shows whether or not the model residuals are
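For reference, the mechanics of predicting from a fitted lm object and drawing its diagnostic plots can be sketched as follows. The data here are simulated purely so that the example runs; the numbers have nothing to do with the real dataset, and the choice of plots is only one option.

```r
set.seed(1)
# Simulated stand-in for the cleaned data (not the real assignment data)
drinkdriving <- data.frame(age = round(runif(200, 17, 80)))
drinkdriving$bac <- 0.02 + 0.0005 * drinkdriving$age + rnorm(200, sd = 0.01)

m1 <- lm(bac ~ age, data = drinkdriving)
summary(m1)

# Predicted value for a new observation: supply the predictor in a data frame
pred_50 <- predict(m1, newdata = data.frame(age = 50))

# The same prediction by hand: intercept + slope * 50
by_hand <- coef(m1)[["(Intercept)"]] + coef(m1)[["age"]] * 50

# Standard diagnostic plots (residuals vs fitted, normal Q-Q, and so on)
par(mfrow = c(2, 2))
plot(m1)
```

The hand calculation is included only to show what predict() is doing; in a report, one of the two is enough.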
2 Driving speeds, night vs. day
Does either time of day or speed of driving predict a driver’s blood alcohol content over and above their age?
Fit appropriate model(s) to test this question.
Run model diagnostics and, if needed, re-fit the model(s). How much variation in blood alcohol content is
accounted for by drivers’ ages and speeds, and time of day of incidents?
Is there evidence to suggest that people drive faster at night than during the day?
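Questions phrased as "over and above" are commonly approached by comparing nested models. The sketch below shows one possible way of doing this on simulated data. The variable names follow the table above, but the values, the effect sizes, and the choice of tests are assumptions for illustration only.

```r
set.seed(2)
n <- 200
# Simulated stand-in data (not the real assignment data)
d <- data.frame(
  age       = round(runif(n, 17, 80)),
  speed     = rnorm(n, 40, 8),
  nighttime = factor(sample(c("day", "night"), n, replace = TRUE))
)
d$bac <- 0.02 + 0.0004 * d$age + 0.0005 * d$speed + rnorm(n, sd = 0.01)

m_age  <- lm(bac ~ age, data = d)
m_full <- lm(bac ~ age + speed + nighttime, data = d)

# F-test of whether the added predictors improve on the age-only model
anova(m_age, m_full)

# Proportion of variance accounted for by the full model
r2_full <- summary(m_full)$r.squared

# One option for comparing speeds between two groups (Welch t-test)
t.test(speed ~ nighttime, data = d)
```

Whether these particular tests are appropriate for the real data depends on the checks and cleaning decisions you make, which should be justified in the report.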
3 Fines vs. Warnings
Construct a model to investigate what contributes to the likelihood of a driver receiving a fine as opposed
to a warning.
Report the results of your model.
Which has the bigger effect on the likelihood of receiving a fine: a 1 SD increase in driving speed or a 1 SD
increase in blood alcohol content?
Are people with prior convictions for drink driving offences more likely to get a penalty fine (as opposed to
a warning) than those who have non-drink-related offences?
Does whether or not a driver has a prior motoring offence (of any kind) influence the likelihood of receiving a fine?
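Comparing effects "per 1 SD" requires the predictors to be on a standardised scale. A minimal sketch of a logistic regression with standardised predictors is given below, again on simulated data. The effect sizes are invented, and only two predictors are included to keep the example short.

```r
set.seed(3)
n <- 300
# Simulated stand-in data (not the real assignment data)
d <- data.frame(
  speed = rnorm(n, 40, 8),
  bac   = pmax(0, rnorm(n, 0.05, 0.03))
)
p_fine <- plogis(-0.5 + 0.8 * as.numeric(scale(d$speed)) +
                   1.2 * as.numeric(scale(d$bac)))

# With levels ordered (warning, fine), glm() models the probability of "fine"
d$outcome <- factor(ifelse(runif(n) < p_fine, "fine", "warning"),
                    levels = c("warning", "fine"))

# scale() expresses each predictor in SD units
m_fine <- glm(outcome ~ scale(speed) + scale(bac), data = d, family = binomial)

exp(coef(m_fine))  # odds ratios per 1 SD increase
```

The order of the factor levels matters: glm() treats the first level as the reference outcome, so it is worth checking levels(d$outcome) before interpreting the signs of the coefficients.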
4 Plotting predicted probabilities.
Create a visualisation of the predicted probabilities of receiving a fine for different ages of drivers who have
no prior offences and are stopped during the daytime driving 30mph with 0 blood alcohol content.
• To do this, you will need to make predictions from your model for a set of different values for age,
while holding the other variables constant at the values above (see below). You needn’t worry about
• Then you will need to plot the predictions for each age value
age   nighttime   speed   bac   prior_offence   model_prediction
17    day         30      0     None            ?
18    day         30      0     None            ?
19    day         30      0     None            ?
20    day         30      0     None            ?
21    day         30      0     None            ?
22    day         30      0     None            ?
…     …           …       …     …               …
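The two steps above (build a data frame of values to predict for, then plot the predictions) can be sketched as follows. The model and data are simulated stand-ins. In particular, the two-level coding of prior_offence ("None" vs a hypothetical "Some") and the outcome column fined are assumptions made only so the example runs.

```r
set.seed(4)
n <- 400
# Simulated stand-in data (not the real assignment data)
d <- data.frame(
  age           = round(runif(n, 17, 80)),
  nighttime     = factor(sample(c("day", "night"), n, replace = TRUE)),
  speed         = rnorm(n, 40, 8),
  bac           = pmax(0, rnorm(n, 0.05, 0.03)),
  prior_offence = factor(sample(c("None", "Some"), n, replace = TRUE))
)
d$fined <- rbinom(n, 1, plogis(-3 + 0.02 * d$age + 0.03 * d$speed + 10 * d$bac))

m <- glm(fined ~ age + nighttime + speed + bac + prior_offence,
         data = d, family = binomial)

# Vary age; hold everything else constant at the values given in the task
newdata <- data.frame(
  age           = 17:80,
  nighttime     = factor("day",  levels = levels(d$nighttime)),
  speed         = 30,
  bac           = 0,
  prior_offence = factor("None", levels = levels(d$prior_offence))
)

# type = "response" returns probabilities rather than log-odds
newdata$model_prediction <- predict(m, newdata = newdata, type = "response")

plot(model_prediction ~ age, data = newdata, type = "l",
     ylim = c(0, 1), ylab = "Predicted probability of a fine")
```

The column names in newdata must match the predictor names used in the model formula exactly, otherwise predict() will fail or silently use the wrong data.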
5 Corrupt cops?
Investigate the hypothesis that one of the police officers is biased in how they give out fines and warnings
with respect to a driver’s age.
Do the data suggest that this might be the case? If so, which police officer(s) is biased, and how?