A key feature of this class is applying what you learn to a real-world dataset. This project, to be completed individually, involves performing the steps of data analysis: exploring the data, summarizing it, preparing it, analyzing it, and presenting findings. The project requires you to submit your predictions on the competition page, complete a report, and make a presentation.
The first part of this project involves conducting the steps of data analysis to generate a set of predictions. You must submit your Kaggle username and first set of predictions before the end of Week 9 and submit at least five sets of predictions before the deadline. Note that there is a limit of three submissions per day.
The second part involves putting together a succinct presentation outlining what you learned from the experience. Your presentation should be supported by just one slide.
The third part is a short report summarizing the data analysis process and what you learned from the experience. Your report should include insights from exploring the data, efforts to prepare the data, and analysis techniques explored. The report should cover not only the ingredients of the final analysis but also the failed steps or missteps along the way.
The project is hosted on Kaggle, where you can download the data, submit your predictions, and monitor your performance. For this project, you are given a dataset of over 25,000 Airbnb rental listings in New York City. The goal of this competition is to predict the price of a rental using over 90 variables on the property, host, and past reviews. To arrive at the predictions, you are encouraged to apply what you have learned about data exploration, summarization, preparation, and analysis.
Arriving at good predictions begins with gaining a thorough understanding of the data. This understanding can be gleaned from examining the description of the predictors, learning the types of variables, and inspecting summary characteristics of the variables. Visual exploration may yield insights that are missed when merely examining descriptive characteristics. Often the edge in predictive modeling comes from variable transformations such as mean centering or imputing missing values. Review the predictors to look for candidates for transformation.
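As an illustration of the two transformations named above, here is a minimal sketch, written in Python using only the standard library. The column name `cleaning_fee` and its values are hypothetical, and median imputation is just one of several reasonable choices, not a prescribed method for this project.

```python
from statistics import mean, median

def impute_missing(values, fill=None):
    """Replace None entries with `fill` (default: median of the observed values)."""
    observed = [v for v in values if v is not None]
    if fill is None:
        fill = median(observed)
    return [fill if v is None else v for v in values]

def mean_center(values):
    """Subtract the column mean so the transformed column averages zero."""
    m = mean(values)
    return [v - m for v in values]

# Hypothetical predictor with two missing entries (assumed data, not from the competition).
cleaning_fee = [50.0, None, 100.0, 75.0, None]

imputed = impute_missing(cleaning_fee)   # missing values become the median, 75.0
centered = mean_center(imputed)          # imputed column shifted to have mean 0

print(imputed)
print(centered)
```

When the same transformation is applied to a scoring dataset, the fill value and the mean would normally be computed on the training data and reused, so that both datasets are transformed identically.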
Not all variables are predictive, and models with too many predictors often overfit the data they are estimated on. With the large number of predictors available for this project, it is critically important to judiciously select features for inclusion in the model.
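One simple way to screen a large set of predictors is a correlation filter: rank each numeric predictor by the absolute value of its correlation with price and keep the top few. The sketch below, in standard-library Python, is only one of many feature selection approaches and is not necessarily the one you should use; the predictor names and values are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)

def top_k_by_correlation(predictors, target, k):
    """Return the names of the k predictors most correlated (in absolute value) with the target."""
    scores = {name: abs(pearson(col, target)) for name, col in predictors.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical data: five rentals, three candidate predictors.
price = [100, 150, 200, 250, 300]
predictors = {
    "accommodates": [1, 2, 3, 4, 5],      # moves in step with price
    "constant_flag": [7, 7, 7, 7, 7],     # no variation, so no predictive signal
    "minimum_nights": [3, 1, 4, 1, 5],    # weakly related to price
}

selected = top_k_by_correlation(predictors, price, k=2)
print(selected)
```

A filter like this looks at one predictor at a time and only captures linear relationships, so it is best treated as a first pass before model-based selection.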
A number of predictive techniques are discussed in this course, each with its own strengths. Furthermore, default model parameters seldom yield the best fit. Each problem is different and therefore deserves a model that is tuned for it.
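To make the idea of tuning concrete, the following sketch tries several values of one parameter and keeps the value with the lowest error on held-out data. It uses a toy k-nearest-neighbors regressor written in standard-library Python; the model, the single predictor, and all numbers are assumptions chosen for illustration, not the technique required for the project.

```python
from math import sqrt

def rmse(actual, predicted):
    """Root mean squared error between two equal-length lists."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def knn_predict(train_x, train_y, x, k):
    """Predict by averaging the targets of the k training points closest to x."""
    nearest = sorted(zip(train_x, train_y), key=lambda pair: abs(pair[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def tune_k(train_x, train_y, valid_x, valid_y, candidates):
    """Evaluate each candidate k on the validation set; return the best k and all scores."""
    results = {}
    for k in candidates:
        preds = [knn_predict(train_x, train_y, x, k) for x in valid_x]
        results[k] = rmse(valid_y, preds)
    best_k = min(results, key=results.get)
    return best_k, results

# Hypothetical data: one predictor (e.g. number of guests) and price.
train_x = [1, 2, 3, 4, 5, 6]
train_y = [80, 120, 150, 210, 240, 300]
valid_x = [2.5, 4.5]
valid_y = [140, 220]

best_k, results = tune_k(train_x, train_y, valid_x, valid_y, candidates=[1, 2, 6])
print(best_k, results)
```

Here neither the smallest nor the largest candidate wins, which is the usual reason for searching over parameter values rather than accepting a default.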
Finally, predictive modeling is an iterative exercise. It is more than likely that after estimating the model, you will want to go back to the data preparation stage to try a different variable transformation.
Once you construct a set of predictions for Airbnb rentals in the scoring dataset, you will upload your prediction file to Kaggle. Your submission will be evaluated based on RMSE (root mean squared error), and the results will be posted on Kaggle’s Leaderboard. The lower the RMSE, the better the model.
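RMSE is the square root of the average squared difference between actual and predicted prices. The sketch below computes it in standard-library Python on made-up numbers, which can be useful for checking a model on a held-out portion of the training data before submitting; the prices shown are hypothetical.

```python
from math import sqrt

def rmse(actual, predicted):
    """Root mean squared error: sqrt(mean((actual - predicted)^2))."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical actual and predicted prices for four rentals.
actual_price = [100, 200, 300, 400]
predicted_price = [106, 192, 300, 400]

# Errors are -6, 8, 0, 0; squared errors sum to 100; mean is 25; square root is 5.
score = rmse(actual_price, predicted_price)
print(score)
```

Because errors are squared before averaging, a few large misses raise RMSE more than many small ones.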
This assignment has the following deliverables.
You must submit your first set of predictions through the Kaggle competition page before the end of Week 9. You can verify your submission on Kaggle. You also need to share your Kaggle username through Canvas before the end of Week 9. Your Kaggle username does not have to be related to your real name or anything about yourself but once set, this username must not be changed until the end of the course.
You must submit a total of at least five sets of predictions before the deadline. Note that there is a limit of three submissions per day.
The ability to explain and communicate your analytical findings to a general audience is critical to your success in using data to influence decisions at your organization. Equally important is to Keep it Simple and Short. Accordingly, you will deliver a succinct presentation supported by just one presentation slide. The specific time allowed for your presentation will depend on class size and will be determined by your instructor, but you should expect it to be between 1 and 3 minutes. Your brief presentation should focus on just two issues:
• What you did right with the analysis and where you went wrong.
• If you had to do it over, what you would do differently.
This is a short report summarizing the data analysis process and what you learned from the experience. Your report should include insights from exploring the data, efforts to prepare the data, and analysis techniques explored. The report should cover not only the ingredients of the final analysis but also the failed steps or missteps along the way. The length of the report should be 2-4 pages of text plus an Appendix including R code or R Markdown.
Your assignment will be graded on three criteria described below:
1. Commitment to the project (20 pts): consistent work and completion of interim deliverables
2. Prediction accuracy (100 pts): accuracy of predictions at the end of the project
3. Quality of modeling (30 pts): knowledge and use of data exploration methods