ST 625: Spring 2020
Individual Data Assignment
Due: April 19th
1. Demonstrate mastery of ST625 content
2. Communicate results of a statistical analysis to a general audience
You’ll prepare a report for your client (played by me). The project description is on the next
page. The primary components of your report should be:
1. Abstract: A brief statement (few sentences at most) summarizing the purpose of the
report as well as the results and what they mean in substantive rather than statistical
terms. Be brief and to the point, stimulating your reader’s interest. You essentially
have 15 seconds to let the reader know what’s in the report and if they should read it.
2. Introduction: Give background and motivate the question to be investigated. Introduce the data, perhaps with visualization or descriptive statistics, but only if you think
it significantly adds to the narrative. A reader could jump here to the Results. While
a formal research study has explicit hypotheses, you likely won’t have any, but at least
try to suggest which variables in what form you expect to be important predictors.
3. Methodology: Describe the approach used to analyze the problem while keeping
in mind that your reader likely never took or doesn’t remember ST625. Why is your
approach appropriate? What are the assumptions? Are there any concerns about
these assumptions?
4. Results: Present the recommendation simply and clearly. Use graphical displays, tables,
and discussion as you see fit. You are providing an answer to the question
posed in the introduction.
5. Conclusion: Briefly summarize everything. Does the model make sense? Do the
predictors seem reasonable? What does it all mean? Suggestions for further analysis
or other data might be appropriate.
• The report must be typed and self-contained, with equations or plots incorporated.
• While there is no length requirement, your report will likely be a few pages. Significantly
shorter is probably not thorough, while much longer is likely too wordy and contains
irrelevant information… your client will not appreciate either of these!
ST 625: Spring 2020 Prof. Cherveny
• Your report should be mathematically correct, professional, engaging, efficient, and
written at the correct level. Correct level means that your client with only a
general education should be able to follow it. I’m not interested in how much
statistics you can show off, and in fact doing so will hurt your grade. In your professional
life, you’ll need to get the statistics right on your own and communicate the results to
people who aren’t mathematics professors!
• On that note, undigested computer output is definitely not appropriate.
• When writing the methodology, please focus on what will interest your reader or be
important to them. Many issues you can gloss over or dismiss with a single sentence if
there was nothing extraordinary. You certainly should not give a detailed rundown of
every single model you tried.
• Any plots should be fully labeled and also referenced in the report.
Further Comments
• There is no “correct” final model or minimum R². Rather, there are things
you should check and try, and things you should uncover.
• This project involves no outside research on bike sharing, weather, Washington DC,
etc. Just do as much as you can with the dataset provided and the context.
• I’m always willing to give general advice, but in the interest of time I won’t be reading
rough drafts from everybody. It takes forever! I’m also not going to say what to
try, or if your model is “good enough”. It’s on you to decide if it’s good enough!
• You’ll turn in the report by uploading a Word .doc file to Blackboard.
• This is an individual assignment. Submitting your report is affirmation that you
alone did the work.
The Problem: Bike Sharing
Bike sharing is a 21st-century take on traditional bike rental in which the whole process
from membership to rental to return has been automated. Through these systems, a user
is able to rent a bike from a particular location and return the bike at another point easily.
Understanding the factors influencing demand for bikes is important for addressing traffic, environmental and health issues. Fortunately, because the system is fully automated,
extensive data about each trip is readily available.
Capital Bikeshare is metro DC’s bike sharing system (operated by the same parent company as Boston’s Bluebikes). With more than 4,500 bikes available 24/7 from 500+ stations
across seven jurisdictions of Washington DC, Capital Bikeshare offers a large bike sharing
network for the nation’s capital. Users rent bikes either for a single 30-minute ride or a 24-
hour period (casual), or subscribe to a 30-day or annual membership plan (registered).
The data set BIKESHARE contains 752 one-hour observations of the following variables:
casual Number of unregistered/non-subscription rentals
registered Number of subscription rentals
season Season
hr Hour (6 to 23, where 6 means 6am-7am and 23 means 11pm-12am)
holiday Federal holiday or DC Emancipation Day (1 is yes and 0 is no)
day Day of the week (1 to 7, where 1 is Sunday and 7 is Saturday)
weather Weather situation, labelled by
1 = Clear or Partly cloudy
2 = Cloudy and/or Mist
3 = Light Rain or Light Snow
4 = Heavy Rain or Heavy Snow or Ice Pellets
temp Temperature (Fahrenheit)
feelslike “Feels like” temperature (Fahrenheit)
hum Humidity (percent)
windspeed Windspeed (miles per hour)
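Several of these variables (season, day, weather) are category codes rather than quantities, so they would normally enter a regression as 0/1 indicator ("dummy") columns. The following is a minimal Python sketch of that encoding. The helper name `dummy_code`, the five sample weather codes, and the choice of the first level as baseline are all illustrative assumptions, not part of the assignment or the BIKESHARE file.

```python
def dummy_code(values, levels):
    """Return one 0/1 indicator column per non-baseline level.

    The first entry of `levels` is treated as the baseline and gets no
    column, which avoids perfect collinearity with the intercept.
    """
    return {
        level: [1 if v == level else 0 for v in values]
        for level in levels[1:]
    }


# Hypothetical weather codes for five hourly observations
# (1 = clear ... 4 = heavy rain/snow); these are made-up values.
weather = [1, 3, 2, 1, 4]
columns = dummy_code(weather, levels=[1, 2, 3, 4])

for level, column in columns.items():
    print(f"weather_{level}: {column}")
```

With this encoding, each coefficient on a weather indicator is read as the expected difference in rentals relative to the baseline category (here, clear or partly cloudy), holding the other predictors fixed.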
Using the range of models, tools, and techniques studied in ST625, build and present two
models: one for predicting casual bike rentals using the independent variables, and another
for predicting registered bike rentals. Your ultimate goal is to explain the factors that
influence bike demand, and as such your model/variable selection should be based both on
context and statistics. Model interpretation will be a very important part of your report!
What influences demand for bikes, and how, and to the extent plausible why?
You should at a minimum perform linear regression using all the available independent
variables as well as consider some more complex models (with higher-order, interaction,
or dummy-variable terms), then perform variable selection and compare models. And to be very
clear, casual should not be a variable used to predict registered, nor vice-versa.
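As a rough illustration of the workflow described above (fit a baseline linear model, add a more complex term, compare), here is a hedged Python sketch using NumPy. It runs on synthetic stand-in data, not the BIKESHARE file. The function name `fit_ols`, the `weekend` indicator, and the simulated coefficients are assumptions made only for this example, and adjusted R² is used as just one possible comparison criterion. The assignment does not prescribe any particular software.

```python
import numpy as np


def fit_ols(X, y):
    """Least-squares fit; returns (coefficients, adjusted R^2).

    X must already include a leading column of ones for the intercept.
    """
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    sse = float(residuals @ residuals)
    sst = float(((y - y.mean()) ** 2).sum())
    n, p = X.shape  # p counts the intercept column
    adj_r2 = 1 - (sse / (n - p)) / (sst / (n - 1))
    return beta, adj_r2


# Synthetic stand-in data (NOT the real dataset): rentals rise with
# temperature, fall off in extreme heat, and are higher on weekends.
rng = np.random.default_rng(0)
n = 200
temp = rng.uniform(20, 100, n)
weekend = rng.integers(0, 2, n)  # hypothetical 0/1 dummy variable
rentals = (-100 + 5 * temp - 0.03 * temp**2 + 20 * weekend
           + rng.normal(0, 10, n))

ones = np.ones(n)
X_main = np.column_stack([ones, temp, weekend])           # first-order model
X_quad = np.column_stack([ones, temp, temp**2, weekend])  # adds temp^2

_, adj_main = fit_ols(X_main, rentals)
beta_quad, adj_quad = fit_ols(X_quad, rentals)

print(f"adjusted R^2, main effects only: {adj_main:.3f}")
print(f"adjusted R^2, with temp^2:       {adj_quad:.3f}")
```

Because the simulated rentals really do curve with temperature, the model with the squared term should score higher here. On real data the comparison can go either way, and context should weigh in alongside the statistic.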
Further Remarks
• The dataset is a curated sample assembled from https://www.capitalbikeshare.com
as well as historical weather databases. It is a simple random sample of a size sufficient
for a good deal of analysis but not so large that everything you try looks significant. I
tailored the number of holiday and heavy rain/snow observations.
• Rides between 12am and 6am are intentionally omitted. Although it’s possible to rent
bikes during these times, round-the-clock rentals are likely periodic after controlling for
other factors and so naturally modeled with trig functions… I didn’t want that.
• The exact date of each rental is also intentionally omitted. Feel free to ask for the
specific calendar date of a very small number of observations if you think it
will be of use.