## Question 1 (40 pts)

For students NOT on the programme Data Science with Financial Technology

Download the dataset coursework other.csv from Blackboard. This dataset consists of the number of rental bikes out for rent at each hour (column Rented Bike count), together with other features such as the hour of the day, the date, humidity, and so on. Your task is to build a model to predict the number of rental bikes out at each hour, given the values of the other features of the dataset.

For students ON the programme Data Science with Financial Technology

Download the dataset coursework fintech.csv from Blackboard. This dataset consists of daily stock prices of Apple Inc. from 03/01/1995 to 31/12/2021, including open, high, low, close,and adjusted close price. Your task is to build a model to predict adjusted close price, given the values of the other features of the dataset.

For ALL students You should consider the following aspects:

• The kind of algorithm to use (e.g.: classifification/regression/clustering)
• The metric to use to measure the performance of the model
• What sort of baseline to compare the model to (sklearn has a module sklearn.dummy which may be useful in generating a baseline)
• How to choose the hyperparameters of your model
• How to test the performance of your model

Concretely, you should use two algorithms from scikit-learn and compare their performance on the dataset. You should also compare the performance of your chosen models against a baseline–i.e. a simple model that more complex models should be able to beat. sklearn has a module sklearn.dummy which may be useful in generating a baseline. You should use techniques to assess the ability of the models to generalise to unseen data and to ensure that your assessment of the models’ performance is robust.

Material from worksheets 13, 14, 16, and 17 will be helpful here.

Your answer to this question should take the form of a short report (maximum 4 pages),together with commented code, detailing the approach you will take. Make sure you address all the bullet points above, and explain your decisions. For example: ‘I chose to use a X algorithm because Y’. ‘Because of Z, I used metric M’. You should use plots and fifigures as appropriate to illustrate your decisions.

The code will not be marked for elegance, but it should run correctly. If you are using jupyter, a good tip is to make sure you have restarted the kernel and made sure that the code can run from scratch before submitting.

### Q1 mark scheme (40 pts)

At least 2 algorithms should be tested. If only 1 is tested then the maximum points for the question is 20. You can obtain full marks using 2 algorithms plus the baseline.

(5pts) Overall presentation of the report, including use of appropriate sections, plots,diagrams, or tables to make your point. Do not include code snippets in the report. Instead,describe in words or equations what you are implementing. Format equations correctly.

(3pts) Picking a suitable type of algorithm (classifification/regression/clustering) and justifying this choice. The lectures and worksheet from week 13 will be helpful here.

(3 pts) An appropriate choice of performance metric (e.g.: accuracy/precision/mean squared error etc) and justifification. The lectures and worksheet from week 13 will be helpful here.

(4 pts) Discussion of the kind of baseline to compare against. (sklearn has a module sklearn.dummy which may be useful in generating a baseline).

(15 pts) Use of an appropriate method to select the hyperparameters of the chosen algorithms. The explanation of which hyperparameters are selected should be backed up with e.g. tables and plots to show which hyperparameter values were chosen and why. Please choose at least one model that uses hyperparameters so that you can show your knowledge in this area. If you choose one model without hyperparameters then please explain in a couple of sentences what the benefifits of choosing a model without hyperparameters are. The lectures and worksheet from week 13 will be helpful here.

Breakdown

• 3 pts: Show that you understand what hyperparameters are and how they can be selected.
• 5 pts: Look at the effffects of difffferent hyperparameter choices on the performance of your models.
• 5 pts: Present the effffects of the difffferent hyperparameter choices on the performance of your models using tables, plots, or other presentation.
• 2 pts: State what hyperparameter choices you make and why.

(10 pts) Training and testing the performance of the models in a way that shows whether the models are able to generalise to unseen data and that ensures that the performance of the models is robust. The lectures and worksheet from week 13 will be helpful here.

• 4 pts: Train models and select hyperparameters in a way that gives robust performance
• 3 pts: Test the performance of your models and compare their performance
• 3 pts: Make sure your models are tested in a way that shows whether they are able to generalise to unseen data

Recommended structure of the short report

The short report should be no more than 4 pages. Shorter is fifine. You should use LATEX,MS Word, or a similar text editor to prepare the report and submit it as a pdf document.

• Introduction: State what the problem is. State what kind of algorithm needs to be used (classifification/regression/clustering) and explain why that kind of algorithm needs to be used.
• Methods: State which specifific algorithms you will use. State which performance metric you will use and why. Describe the baseline that you will measure your algorithms against. Describe how you will choose the hyperparameters of the algorithms. Explain which hyperparameters you have selected for each model using tables or plots to illustrate your decision.
• Results: Report the results of your models. Use tables or plots as appropriate to illustrate your results.

### Question 2: 10pts

The Flickr-Faces-HQ (FFHQ) dataset is available at https://github.com/NVlabs/ffhq-dataset and described in appendix A of the paper Karras et al. [2019]. NB: You do not need to read the whole paper. I have provided a template with a selection of the datasheet questions in sections 3.2 (Composition), 3.3 (Collection Process) and 3.5 (Uses) of the paper Gebruet al. [2021]. Please provide answers to the questions in the template.

Page guide: The template is 2 pages long. The completed template with your answers should be about 3 pages long – most questions need a sentence or two answer. Some may need longer or shorter answers.

Question 2 Mark Scheme

• Section 3.2: Composition. 5 pts
• Section 3.3: Collection Process. 3 pts
• Section 3.5: Uses. 2 pts

A template containing just the relevant questions is available on Blackboard.

The worksheet from week 19 will be helpful here. Example datasheets can also be seen in the appendix to the paper.

## References

Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daum´e Iii, and Kate Crawford. Datasheets for datasets. Communications of the ACM, 64(12):86–92, 2021. URL https://arxiv.org/abs/1803.09010.

Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4401–4410, 2019. URL https://arxiv.org/abs/1812.04948.