2. Assignment description
The coursework consists of two scenarios that assess the implementation of the Data
Science process, with Python as the main tool.
Scenario 1 of 2: Twitter network map data extraction, pre-processing,
You have been asked to analyse information from the social media platform Twitter,
such as the network of certain accounts, hashtags, and other data that can be
extracted from it. You are required to implement a full Data Science workflow, from
data gathering, cleaning, and pre-processing through the implementation of a model
(network) to the analysis of different statistics (e.g. degree distribution,
clustering coefficient). You are also required to justify the process, analyse the
findings, and explain the reasoning behind the design and implementation, including
any decisions and assumptions.
Your overall task is to implement the data science process on data collected from
Twitter for at least three accounts and three hundred tweets (the most recent ones)
from each account. The tasks must be developed in a Jupyter notebook.
Task 1 – Data Gathering, Pre-processing and EDA
Implement a process/workflow to extract information from Twitter. Your solution
• API connection and data extraction from the data source.
• Data Pre-processing from the data source to transform the original data into a
• Perform a data cleansing activity considered relevant for the process.
• Provide the explanation of the process, the justification behind it, lessons
learned and findings.
• Exploratory Data Analysis of the accounts, e.g. the number of followers, whether
the accounts produce original tweets or mostly retweet, etc.
For more details on data extraction from Twitter, please see Section 5, Additional
Considerations, later in this document.
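The cleansing and EDA bullets above can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions, not a required approach: the record layout (keys `account` and `text`) and the sample tweets are hypothetical placeholders rather than the format returned by the Twitter API, and the retweet check assumes the convention that the text of a retweet starts with `RT @`.

```python
import re
from collections import Counter

# Hypothetical sample records; real data would come from the Twitter API.
tweets = [
    {"account": "alice", "text": "RT @bob: Great talk on graphs! https://t.co/abc"},
    {"account": "alice", "text": "Reading about #DataScience   today"},
    {"account": "carol", "text": "RT @alice: Reading about #DataScience today"},
    {"account": "carol", "text": "RT @bob: New post https://t.co/xyz #Python"},
]

URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#(\w+)")
# Assumption: a retweet's text starts with "RT @<screen_name>:".
RETWEET_RE = re.compile(r"^RT @(\w+):")


def clean_text(text):
    """Remove URLs and collapse runs of whitespace."""
    without_urls = URL_RE.sub("", text)
    return " ".join(without_urls.split())


def retweet_share(records):
    """Return, per account, the fraction of its tweets that are retweets."""
    totals = Counter()
    retweets = Counter()
    for rec in records:
        totals[rec["account"]] += 1
        if RETWEET_RE.match(rec["text"]):
            retweets[rec["account"]] += 1
    return {acc: retweets[acc] / totals[acc] for acc in totals}


cleaned = [clean_text(t["text"]) for t in tweets]
hashtags = Counter(
    tag.lower() for t in tweets for tag in HASHTAG_RE.findall(t["text"])
)
shares = retweet_share(tweets)
```

Here `shares` answers the "original tweets or mostly retweeting" question for each account, and `hashtags` gives a simple frequency count that could feed a bar chart in the notebook.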
Task 2 – Network analysis
The goal of this task is to create a network that represents the area of influence of the
accounts/influencers selected. For this you need to consider the network as
bidirectional. There are two ways to build it: you can extract the accounts that the
influencer is following and/or create the links from the accounts that were retweeted.
You need to provide the following:
• Provide a sample (max 10 records) of the edge list and the neighbour list of the
• Produce a visualisation of the network topology and discuss the output.
• Calculate statistics of the network, plot them where relevant, and discuss the
results, explaining the meaning of any statistics you have calculated.
o Statistics of the network such as
▪ Degree Distribution
▪ Clustering coefficient
▪ Betweenness Centrality
• Conclusions and lessons learned.
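As an illustration of the first deliverable, the sketch below derives an edge list and a neighbour list from retweets and prints at most 10 records of each. It is only one possible approach: the `(account, text)` tuples are hypothetical sample data, the `RT @name:` prefix is an assumed way to detect the retweeted account, and each link is stored once as an unordered pair because the network is treated as bidirectional.

```python
import re
from collections import defaultdict

# Hypothetical (account, tweet text) pairs standing in for collected data.
tweets = [
    ("alice", "RT @bob: Great talk on graphs!"),
    ("alice", "RT @dave: Release notes"),
    ("carol", "RT @bob: New post"),
    ("carol", "Original tweet, no retweet"),
]

# Assumption: a retweet's text starts with "RT @<screen_name>:".
RETWEET_RE = re.compile(r"^RT @(\w+):")

# Edge list: one undirected edge per (account, retweeted account) pair.
edges = set()
for account, text in tweets:
    match = RETWEET_RE.match(text)
    if match:
        edges.add(tuple(sorted((account, match.group(1)))))
edge_list = sorted(edges)

# Neighbour list: every node mapped to the set of nodes it is linked to.
neighbours = defaultdict(set)
for u, v in edge_list:
    neighbours[u].add(v)
    neighbours[v].add(u)

# Print a sample of at most 10 records of each structure.
for u, v in edge_list[:10]:
    print(u, v)
for node in sorted(neighbours)[:10]:
    print(node, sorted(neighbours[node]))
```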
Use NetworkX (Python library) to calculate statistics of the network, rather than
implementing your own Python code to do so. The visualisation may be hard to
interpret at first; experimenting with different settings for the layout may help.
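The three statistics listed above can be computed with NetworkX roughly as follows. This is a minimal sketch on a hypothetical four-node edge list, not a model answer; it assumes an undirected `nx.Graph` to match the bidirectional reading of the network.

```python
import networkx as nx

# Hypothetical edge list; in the coursework it would come from Twitter data.
edge_list = [
    ("alice", "bob"),
    ("alice", "dave"),
    ("bob", "carol"),
    ("bob", "dave"),
]

# Undirected graph, since the network is treated as bidirectional.
G = nx.Graph()
G.add_edges_from(edge_list)

# Degree distribution: entry i is the number of nodes with degree i.
degree_histogram = nx.degree_histogram(G)

# Local clustering coefficient per node, and the network average.
clustering = nx.clustering(G)
average_clustering = nx.average_clustering(G)

# Betweenness centrality, normalised by default.
betweenness = nx.betweenness_centrality(G)

for node in sorted(G.nodes):
    print(node, G.degree[node], round(clustering[node], 3), round(betweenness[node], 3))
```

For the visualisation, `nx.draw(G, pos=...)` accepts positions from layout functions such as `nx.spring_layout` or `nx.circular_layout` (drawing requires matplotlib), so comparing a couple of layouts is a cheap way to find a readable plot.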