MATH2349 Data Wrangling  Assignment 1

( Last Updated 24.02.2020 )

Assessment type: Written report (PDF document) using R Markdown
Due date: 29 March 2020, 23:59 AEST
Weighting: 5%
Word limit: Maximum 10 pages
Feedback mode: Feedback will be provided using Turnitin’s inline marking tool and general text comments.

Overview

This assignment allows you to apply the data pre-processing knowledge and skills learned in Weeks 1-3. You will locate an open data set on the web, import it into R, reflect upon the data types, formats and structures in your data set, and inspect the data using R functions. You will then create a report using the R Markdown template to explain the steps you took to perform these tasks.

Assessment criteria and weighting

Please see the marking rubric for the assessment criteria and their weighting.

Course Learning outcomes

This assessment is linked to the following course learning outcomes:

1. Accurately, logically and ethically combine data from multiple sources to make it suitable for statistical analysis and draw valid interpretations.
2. Select, perform and justify data validation processes for raw datasets.
3. Use leading open source software (e.g. R) for reproducible, automated data processing.

Assignment Instructions

This assignment requires you to locate open data from the web, import it into R, and reflect upon the data types, formats and structures in your data set. You are expected to submit a report on your data exploration. Use the given R Markdown template to create the report.

Step 1. Locate an open source of data on the web. This can be tabular or spreadsheet data (e.g. .txt, .csv, .xls, .xlsx files) or a data set from other statistical software (e.g. SPSS, SAS or Stata data files), or you can scrape data from an HTML table.

Some sources for open data are provided below, but I encourage you to find others:

As a minimum, the data set should include:

• one numeric variable.
• one qualitative (categorical) variable.

There is no limit on the number of observations or variables, but keep in mind that a very large data set will take longer to read into R.

Step 2. Read/Import the data into R, then save it as a data frame. You can use Base R functions or the readr, xlsx, readxl, foreign or rvest packages for this purpose. In this step, you must provide the R code with outputs (e.g. the head of the data set) and explain everything that you do in order to import/read/scrape the data set.
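As an illustration only, the import step might look like the sketch below when the source is a .csv file and Base R is used. The file contents, the `pets` object and its column names are hypothetical; the file is written to a temporary location purely so that the example runs on its own, whereas in your report you would read the file you actually downloaded.

```r
# Hypothetical example data, written to a temporary file so the sketch is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "id,species,weight_kg",
  "1,cat,4.2",
  "2,dog,12.5",
  "3,cat,3.9"
), csv_path)

# read.csv() returns a data frame; stringsAsFactors = FALSE keeps text columns as character.
pets <- read.csv(csv_path, stringsAsFactors = FALSE)

class(pets)  # confirm that the object is a data frame
head(pets)   # show the first rows as evidence of a successful import
```

Other formats would need a different reader (for example, functions from readxl for .xlsx files or foreign for SPSS files), so treat this as one possible pattern rather than the required approach.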

Step 3. Provide a clear description of the data and its source (i.e. URL of the web site). Provide variable descriptions.

Step 4. Inspect the dataset and variables using R functions. You should:

• check the dimensions of the data frame.
• summarise the types of variables by checking the data types (i.e., character, numeric, integer, factor, and logical) of the variables in the data set. If variables are not in the correct data type, apply proper type conversions.
• check the levels of factor variables, rename/rearrange them if required.
• check the column names in the data frame, rename them if required.
• provide the R code with outputs and explain everything that you do in this step.
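The checks listed above could be sketched as follows. The `survey` data frame and its column names are made up for illustration, and the particular conversions shown (text to integer, character to factor) are assumptions about what a data set might need, not a prescription for yours.

```r
# Hypothetical data frame with deliberately imperfect types and names.
survey <- data.frame(
  Resp.ID = 1:4,
  age     = c("23", "35", "41", "29"),         # numbers stored as text
  rating  = c("low", "high", "medium", "low"),  # categorical, stored as character
  stringsAsFactors = FALSE
)

dim(survey)            # number of rows and columns
sapply(survey, class)  # data type of each variable

# Apply type conversions where the stored type does not match the meaning.
survey$age    <- as.integer(survey$age)
survey$rating <- factor(survey$rating, levels = c("low", "medium", "high"))

# Check the factor levels, then rename them (new labels follow the existing level order).
levels(survey$rating)
levels(survey$rating) <- c("Low", "Medium", "High")

# Check the column names and rename one of them.
colnames(survey)
colnames(survey)[colnames(survey) == "Resp.ID"] <- "resp_id"

str(survey)  # final structure after the changes
```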

Step 5. Subset the data frame using the first 10 observations (include all variables). Then convert it to a matrix. Check the structure of that matrix (i.e. check whether the matrix is a character, numeric, integer, factor, or logical) and explain in a few words why you ended up with that structure. Provide the R code with outputs and explain everything that you do in this step.
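A minimal sketch of this step is given below, using a hypothetical `mixed` data frame that combines numeric and character columns. It is meant to show the kind of check expected, not the result you will necessarily get, because the matrix type depends on the variables in your own data.

```r
# Hypothetical data frame with 12 rows and mixed column types.
mixed <- data.frame(
  id    = 1:12,
  score = seq(0.5, 6, by = 0.5),
  group = rep(c("a", "b", "c"), times = 4),
  stringsAsFactors = FALSE
)

first10 <- mixed[1:10, ]   # first 10 observations, all variables
m <- as.matrix(first10)    # convert the subset to a matrix

dim(m)
typeof(m)        # "character": a matrix holds one type, so numbers are coerced to text
is.character(m)
str(m)

# For comparison: with only numeric columns the matrix stays numeric.
m_num <- as.matrix(first10[, c("id", "score")])
is.numeric(m_num)
```

A matrix can store only a single data type, so R coerces every column to the most flexible type present; this is the reasoning your explanation should address for your own data.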

Step 6. Subset the data frame so that it includes only the first and the last variable in the data set, then save it as an R object file (.RData). Provide the R code with outputs and explain everything that you do in this step.
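One way this could be done is sketched below. The `cars_df` data frame, the object name `first_last` and the file name are hypothetical, and the file is saved to a temporary directory only so that the example leaves nothing behind; you would normally save into your project folder.

```r
# Hypothetical data frame with three variables.
cars_df <- data.frame(
  model = c("A", "B", "C"),
  year  = c(2015L, 2018L, 2020L),
  price = c(15000, 22000, 30500),
  stringsAsFactors = FALSE
)

# Keep only the first and the last variable, whatever the number of columns.
first_last <- cars_df[, c(1, ncol(cars_df))]
head(first_last)

# Save the subset as an .RData file.
rdata_path <- file.path(tempdir(), "first_last.RData")
save(first_last, file = rdata_path)
file.exists(rdata_path)

# Optional check: load the file into a separate environment and compare.
check_env <- new.env()
load(rdata_path, envir = check_env)
identical(check_env$first_last, first_last)
```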

Step 7. You will create a new data frame from scratch. Note that this step is independent of the dataset that you used in the previous steps. In this step you should:

• Create a data frame from scratch with 2 variables and 10 observations. Your data frame has to contain one integer variable and one ordinal variable. Make sure that you have factorised and ordered the ordinal variable properly. Show the structure of your variables and the levels of the ordinal variable.
• Create another numeric vector and use cbind() to add this vector to your data frame. After this step you should have 3 variables in the data frame.
• Check the attributes and the dimension of your new data frame.
• Provide the R code with outputs and explain everything that you do in this step.
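The tasks in this step could be sketched as follows. The variable names, the grade labels and their ordering are invented for illustration; any integer variable and any sensibly ordered categorical variable would serve equally well.

```r
# Data frame with one integer variable and one ordinal (ordered factor) variable.
students <- data.frame(
  id = 1:10,
  grade = factor(
    c("pass", "credit", "fail", "pass", "distinction",
      "credit", "pass", "fail", "credit", "distinction"),
    levels  = c("fail", "pass", "credit", "distinction"),  # lowest to highest
    ordered = TRUE
  )
)

str(students)            # structure of both variables
levels(students$grade)   # levels of the ordinal variable, in order

# Another numeric vector, added as a third variable with cbind().
hours <- c(5.5, 8, 2, 6.5, 12, 9, 6, 1.5, 7.5, 11)
students <- cbind(students, hours)

attributes(students)     # names, class and row names
dim(students)            # 10 observations, 3 variables
```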

Important Note: You must provide the R code with outputs and explain everything that you do in each step. Failure to do this will result in a reduced mark (Needs Improvement). Check the report sections below and the marking rubric for more information.

Create the report using R Markdown

The assignment 1 report must be completed using the R Markdown template provided here:

R Markdown Template – Assignment 1

Note that this is an R Markdown notebook template. Information for using the R Markdown package can be found here. The R Markdown template must be updated with your name and student number. You must use the headings and chunks provided in the template. You can add more chunks if required. Your report will be composed of the following sections.

Sections of the report:

1. Report title and student details [YAML input]: You can add the title of your report (i.e. Assignment 1) and student number by updating the “title” and “author” entries in the YAML header (located at the top of the R Markdown Template).
2. Data Description [Plain text]: A clear description of data, its source (i.e. URL of the web site) and variable descriptions should be provided.
3. Read/Import Data [Plain text & R code & Output]: Read/Import the data into R, then save it as a data frame. You can use Base R functions or the readr, xlsx, readxl, foreign or rvest packages for this purpose. In this section, you must provide the R code with outputs (e.g. the head of the data set) and explain everything that you do in order to import/read/scrape the data set.
4. Inspect and Understand [Plain text & R code & Output]: Summarise the types of variables and data structures, apply proper type conversions, and check the attributes in the data. Provide the R code with outputs and explain everything that you do in this step.
5. Subsetting I [Plain text & R code & Output]: Subset the data frame using the first 10 observations (include all variables). Then convert it to a matrix. Check the structure of that matrix (i.e. check whether the matrix is a character, numeric, integer, factor, or logical) and explain in a few words why you ended up with that structure. Provide the R code with outputs and explain everything that you do in this step.
6. Subsetting II [Plain text & R code & Output]: Subset the data frame so that it includes only the first and the last variable in the data set, then save it as an R object file (.RData). Provide the R code with outputs and explain everything that you do in this step.
7. Create a New Data frame [Plain text & R code & Output]: Create a data frame from scratch with 2 variables and 10 observations. Your data frame has to contain one integer variable and one ordinal variable. Make sure that you have factorised and ordered the ordinal variable properly. Create another numeric vector and use cbind() to add this vector to your data frame. After this step you should have 3 variables in the data frame. Check the attributes and the dimension of your new data frame. Provide the R code with outputs and explain everything that you do in this step.

Submission Format

Upload the report as one single file (PDF) via the Assignment 1 page in CANVAS.

The easiest way to produce a PDF file from the RMarkdown is to Run all R chunks, then Preview your notebook in HTML (by clicking Preview) → Open in Browser (Chrome) → Right click on the report in Chrome → Click Print and Select the Destination Option to Save as PDF.

After creating your PDF file, check that your code and outputs are visible.

Referencing guidelines

You must acknowledge all the sources of information you have used in your assessments. Refer to the RMIT to see examples and tips on how to reference in the appropriate style. You can also refer to the for more tools such as EndNote, referencing tutorials and referencing guides for printing. Use the RMIT Harvard referencing method for this assessment.

Collaboration

You are permitted to discuss and collaborate on the assignment with your classmates. However, the write-up of the report must be an individual effort. Assignments will be submitted through Turnitin, so if you’ve copied from a classmate, it will be detected. It is your responsibility to ensure you do not copy or do not allow another classmate to copy your work. If plagiarism is detected, both the copier and the student copied from will be responsible. It is good practice to never share assignment files with other students. You should ensure you understand your responsibilities by reading the RMIT University website on academic integrity. Ignorance is no excuse.

You should take extreme care that you have:

• acknowledged words, data, diagrams, models, frameworks and/or ideas of others you have quoted (i.e. directly copied), summarised, paraphrased, discussed or mentioned in your assessment through the appropriate referencing methods
• provided a reference list of the publication details so your reader can locate the source if necessary. This includes material taken from internet sites.

If you do not acknowledge the sources of your material, you may be accused of plagiarism because you have passed off the work and ideas of another person, without appropriate referencing, as if they were your own.

RMIT University treats plagiarism as a very serious offence constituting misconduct. Plagiarism covers a variety of inappropriate behaviours, including:

• failure to properly document a source
• copyright material from the internet or databases
• collusion between students

For further information on our policies and procedures, please refer to the

Assessment Declaration

When you submit work electronically, you agree to the Assessment Declaration.

Extensions and Special Consideration

This course follows the RMIT University Assessment policy for extensions and special consideration. Information is available. Ensure you understand these guidelines before applying.

Extensions will only be granted in accordance with the RMIT University Extension and Special Consideration Policy. No exceptions. Assignments submitted late will be penalised (see below for further details).

Late Submission of Assessment

Late submissions, without an approved extension or special consideration, will incur a penalty of 10% of the total mark per business day late for up to 5 business days late (so the maximum late penalty is 50%). Submissions more than 5 days late are not accepted.

See the examples of penalties applied to the assessment’s mark below:

Overdue               Penalty   Example 1 (Mark 8/10)   Example 2 (Mark 6/10)
≤ 1 business day      -10%      7.2                     5.4
≤ 2 business days     -20%      6.4                     4.8
≤ 3 business days     -30%      5.6                     4.2
≤ 4 business days     -40%      4.8                     3.6
≤ 5 business days     -50%      4                       3
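The arithmetic behind these examples can be sketched in R as below. The function name `late_mark` and its arguments are hypothetical, and the sketch assumes, as the example values suggest, that the percentage is deducted from the mark awarded. Returning NA for submissions more than 5 business days late is simply one way of representing "not accepted".

```r
# Hypothetical helper: mark after the late penalty is applied.
# Assumes 10% of the awarded mark is deducted per business day late.
late_mark <- function(mark, business_days_late, rate = 0.10, max_days = 5) {
  if (business_days_late > max_days) {
    return(NA_real_)  # assumption: NA stands for "submission not accepted"
  }
  mark * (1 - rate * business_days_late)
}

example_1 <- sapply(1:5, late_mark, mark = 8)  # awarded mark 8/10
example_2 <- sapply(1:5, late_mark, mark = 6)  # awarded mark 6/10

data.frame(days_late = 1:5, example_1, example_2)
```

Up to floating-point rounding, the two computed columns should match the Example 1 and Example 2 values listed above.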

Assignment 1 Marking Rubric

Each criterion is marked as Not acceptable (0), Needs Improvement (1) or Excellent (3).

Locate data (10%)
• Not acceptable (0): No data source was given or the data didn’t meet the minimum requirements.
• Needs Improvement (1): The data source was given but it was described poorly, OR R code was given but the output was missing, OR output was given but R code was missing.
• Excellent (3): A complete data source was provided and data met the minimum requirements.

Read/Import and save data (20%)
• Not acceptable (0): The attempt to read/import data set was unsuccessful.
• Needs Improvement (1): The attempt to read/import data set was successful but unable to save the data in the correct format, OR R code was given but the output was missing, OR output was given but R code was missing.
• Excellent (3): Able to read/import and save the data correctly.

Inspect data (30%)
• Not acceptable (0): There was no attempt to inspect the data and the variables in the data set.
• Needs Improvement (1): There was an attempt to inspect the data and variables but it didn’t meet the minimum requirements, OR R code was given but the output was missing, OR output was given but R code was missing.
• Excellent (3): A complete inspection of data and variables.

Subset and convert to a matrix (20%)
• Not acceptable (0): Unable to subset the data frame correctly.
• Needs Improvement (1): Subsetting data frame was successful, but attempt to convert it to a matrix was missing or needed improvement, OR R code was given but the output was missing, OR output was given but R code was missing.
• Excellent (3): A complete subsetting and data type conversion were provided.

Subset and save as an R object (10%)
• Not acceptable (0): Unable to subset the data frame and save it as an R object.
• Needs Improvement (1): Able to subset the data frame correctly but failed to save it as an R object, OR R code was given but the output was missing, OR output was given but R code was missing.
• Excellent (3): A complete subsetting and data conversion were provided.

Create a new Data frame (10%)
• Not acceptable (0): Unable to create a new data frame with given specifications.
• Needs Improvement (1): Able to create a data frame but there is room for improvement (i.e. at least one of the required tasks is missing) or it was poorly described, OR R code was given but the output was missing, OR output was given but R code was missing.
• Excellent (3): A complete set of tasks were provided to create a new data frame.