STA238 – Winter 2021

Final Project Instructions





Project grading

There are three parts to this project. You must complete all three parts to be considered for the full 30%. For instance, if you do NOT submit a Final Report you will not receive the completion points from the rough draft and peer review process.

This project will be marked based on the output in the pdf submission. You must submit both the Rmd and pdf files to receive full marks for reproducibility. Furthermore, this is an individual project, and you are expected to work individually. Since this is a project, the workload is higher than that of an assignment, so it is recommended that you start early.

This project will be graded based on the rubric available on the Assignment Quercus page. TAs will look over each section of the submitted pdf and select the appropriate grade for that section based on a coarse overview (a single read-through). Your project should be understandable to the average university-level student after reading it once. I would suggest you make sure your pdf document looks clean, is aesthetically pleasing, and has been proofread. Since this is a final project, the process to review your grade on this assessment is handled by the Department of Statistical Sciences. Thus, you will need to apply through a process (TBA at a later date) to see the graded rubric and potentially inquire about a regrade. The TAs may provide some comments/feedback if the same issue arises in multiple sections, but you will likely receive no comments/feedback (due to the size of the class and the volume of marking).


In this project you will write a report on a data analysis in which your main methodology will consist of a collection of techniques taught in STA238 Winter 2021. The methodology must include the following:

  • at least one simple linear regression;
  • at least one confidence interval (either through a bootstrap or the Z/t approach);
  • at least one maximum likelihood estimator derivation (I would recommend putting the mathematics in the Appendix);

  • at least one hypothesis test of the mean;
  • at least one goodness of fit test;
  • at least one Bayesian credible interval. (Put derivations of the posterior into the Appendix).
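
As a rough illustration of the kind of mathematics that could go in the Appendix, here is a minimal sketch of a maximum likelihood derivation and a posterior derivation. The Bernoulli model, the Beta(a, b) prior, and the symbols s, a and b are assumptions chosen only for this example; your own model should follow from your data and question.

```latex
% Illustrative model (an assumption, not a course requirement):
% X_1, ..., X_n i.i.d. Bernoulli(p), with s = \sum_i x_i successes and 0 < s < n.
\begin{align*}
L(p) &= \prod_{i=1}^{n} p^{x_i}(1-p)^{1-x_i} = p^{s}(1-p)^{n-s}, \\
\ell(p) &= \log L(p) = s\log p + (n-s)\log(1-p), \\
\ell'(p) &= \frac{s}{p} - \frac{n-s}{1-p} = 0
  \quad\Longrightarrow\quad \hat{p}_{\mathrm{MLE}} = \frac{s}{n}, \\
\ell''(p) &= -\frac{s}{p^{2}} - \frac{n-s}{(1-p)^{2}} < 0
  \quad \text{(so } \hat{p}_{\mathrm{MLE}} \text{ is a maximum).}
\end{align*}

% Posterior under a hypothetical Beta(a, b) prior on p:
\begin{align*}
\pi(p \mid x_1,\dots,x_n) &\propto L(p)\,\pi(p)
  \propto p^{s}(1-p)^{n-s}\, p^{a-1}(1-p)^{b-1}
  = p^{a+s-1}(1-p)^{b+n-s-1},
\end{align*}
% which is the kernel of a Beta(a + s, b + n - s) distribution.
```

Under these assumptions, a 95% equal-tailed credible interval is given by the 2.5% and 97.5% quantiles of the Beta(a + s, b + n - s) distribution, which in R could be computed with qbeta(c(0.025, 0.975), a + s, b + n - s).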

Please keep in mind that this analysis is for our course. Thus, the analysis should answer a question about an underlying random process from which we have data. You will find some data, form an interesting question, and answer the question through your analysis. Your question should be stated clearly so that the reader can quickly identify it in the Introduction (and perhaps restated more formally as a hypothesis test in the Methods section).

The report will consist of 8 sections: Abstract, Introduction, Data, Methods, Results, Conclusions, Bibliography and Appendix.

There should be no evidence that this is a class project; I should be able to take a screenshot of it and paste it into a newspaper/blog. There should be no raw code. All output, tables, figures, etc. should be nicely formatted.

This will allow you to look at some interesting aspects of the data. Please find some open source data through any R package that has not been used on a previous assignment in this course. Some examples of R packages with data that we have used in this course are dplyr, nycflights13, etc. Here is a list of available R packages: https://cran.r-project.org/web/packages/available_packages_by_name.html. Additionally, if you prefer to use other data available through a website (e.g., Kaggle, GitHub, etc.), that is also an option so long as the data is open, free and ethically viable for you to analyze. If you are unsure whether your data is appropriate, please visit one of our office hours and we will be happy to discuss.

Based on the above criteria, the following three data sources are OFF LIMITS. You CANNOT use data from any of them: the Toronto Open Data Portal, survey data from the 2019 Canadian Election Study (CES), and Stats Canada.

If you use data from the CES or the Open Toronto Data Portal, you will receive a 0 on this project. (I.e., do NOT use Open Toronto data on this project; do NOT use CES data on this project; do NOT use Stats Canada data on this project.)

The material and text in this project should be different from that of your previous assignments in this course. Thus, you should NOT directly copy your previous assignment work. We highly encourage you to use feedback from previous assignments to amend/proofread/update your Final Project. If your work is a direct copy of a previous submission or of another person’s submission, this is considered an academic offense.