42913 Social and Information Network Analysis
Project topics
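In this group project, you need to apply network analysis tools and algorithms to solve real-life problems. So that you can follow your own interests, the project is not limited to
a specific topic. It can be any topic related to social and information networks, for example: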

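• Network analysis and visualization. Analyze an interesting network from different
aspects, such as degree distribution, network centrality, community detection, network
evolution, and graph visualization.
• Algorithms. A scalable implementation of an algorithm for processing large graphs
that solves a real problem.
• Applications. Develop a novel application that provides new functionality for a real problem
based on network analysis.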
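To help students settle on a project topic, we provide some topics that students
are free to choose from:
• Selected topics. We provide 5 selected topics: friend recommendation,
movie recommendation, POI recommendation, a search engine prototype, and cycle
detection. Details of each selected topic are listed in the Appendix.
• Other possible topics.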
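1. Tasks from online competitions, such as the Yelp Dataset Challenge
(https://www.yelp.com/dataset_challenge)
2. Network visualization and analysis. Some examples from a Stanford course:
analysis of the YouTube channel recommendation network,
of startups and venture capital firms, and of the evolution of Wikipedia. There is also an
interesting paper that analyzes the characters in Game of Thrones: Network of
Thrones.
3. Implementation and possible improvement of an existing research paper related to
networks. You may find interesting research papers in top
conferences such as WWW, KDD, WSDM, IJCAI, AAAI, VLDB, and SIGMOD.
4. Design an efficient algorithm to find the shortest path between any two actors in the
IMDB dataset, similar to Six Degrees of Kevin Bacon
(https://zh.wikipedia.org/wiki/Six_Degrees_of_Kevin_Bacon).
5. Topics related to graph node classification (see
https://hpi.de/fileadmin/user_upload/fachgebiete/mueller/courses/graphmining/GraphMining-06-NodeClassification.pdf)
or graph embedding (see
https://cs.stanford.edu/~jure/pubs/graphrepresentation-ieee17.pdf).
6. Your own topic. It must involve some practical work on network analysis
(networks other than social and information networks are fine). Note:
your project topic needs to be approved by the instructor before Week 5.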
42913 Social and Information Network Analysis @ FEIT UTS
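Assessment criteria
Students should prepare a publishable or near-publishable report for their project. The report
may include an abstract, introduction, related work, datasets, methods (e.g. the algorithms or
network metrics used), results (e.g. experiment reports, analysis, and visualization), and
conclusions.
The project will be evaluated on the following:
• Technical quality of the work: Does the technical material make sense? Are the
things tried reasonable? Are the proposed algorithms or applications clever and interesting?
Do the authors convey novel insight about the problem and/or algorithms?
• Significance: Did the authors choose an interesting or “real” problem to work on, or
only a small “toy” problem? Is this work likely to be useful and/or have impact?
• Novelty of the work and clarity of the writing
• Presentation of results. A properly formatted, well-organized document that has been spell-checked and grammar-checked.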

• Plagiarism check
• Members in the same group will receive equal marks. If some members of the
group feel that other members are not contributing, the instructor should be informed
and a group meeting should be held to find a solution 4 weeks before the deadline. No
complaints about group operation will be considered after the project has been handed
in.
==== More details ===
1. Technical Quality (40%)
Are the results technically sound?
Are there obvious flaws in the conceptual approach?
Are claims well-supported by theoretical analysis or experimental results?
Are the experiments well thought out and convincing?
Will it be possible for other researchers to replicate these results?
Is the evaluation appropriate? Did the authors clearly assess both the strengths and
weaknesses of their approach?
2. Quality of writing (30%)
Is the paper clearly written?
Is there a good use of examples and figures?
Is it well organized? Are there problems with style and grammar?
Are there issues with typos, formatting, references, etc.?
3. Novelty and Significance (30%)
We will recognise and reward papers that propose genuinely new ideas. Novel combinations,
adaptations or extensions of existing ideas are also valuable.
Is this a significant advance in the state of the art?
Is this a paper that people are likely to read and cite?
Does the paper address an important problem?
Is it a paper that is likely to have a lasting impact?
Research Proposal
A description of what you are planning for the group project. A couple of paragraphs would
usually be enough, including problem investigated, dataset, planned methods, and expected
outcomes.
Technical Report
• How to write a technical report. https://www.uts.edu.au/currentstudents/support/helps/self-help-resources/academic-writing/report-writing
• Subject 32144: Technology Research Preparation
• Approximately 10-15 pages for single column, single space format.
Resources of some real-life network data
Gephi https://github.com/gephi/gephi/wiki/Datasets
SNAP http://snap.stanford.edu/data/
Mark Newman http://www-personal.umich.edu/~mejn/netdata/
Data sets at CFinder.org http://cfinder.org/wiki/?n=Main.Data
UC Irvine Network Data Repository http://networkdata.ics.uci.edu/
You may find social and information network data from other resources, or crawl/generate
the network data by yourselves.
Appendix
Selected Topic 1: Friend Recommendation on Social Networks
Description
Social networks are usually highly dynamic; they grow and change quickly over time
through the addition of new edges and the removal of old ones. Identifying the mechanisms
by which they evolve over time is a fundamental question. In this project, we focus on the
link prediction problem on evolving social networks, which aims to predict the future links
between nodes by utilizing node features and network features.
Let’s take the Facebook “People You May Know” feature as an example. Facebook
periodically recommends new people to users such that users can make more new friends.
You may wonder how Facebook recommends friends to you. Are these people just randomly
selected, or do they have a lot in common with you? In fact, Facebook follows the
simple intuition that “similar” users are more likely to get connected in real life than the
“dissimilar” ones, and thus should be recommended to each other. Following this idea,
Facebook recommendation is achieved by mining the implicit online relationships between
users, which might finally lead to offline friendship in the future. For example, if two people
have lots of common friends, live in the same city or go to the same university, they are very
likely to be friends in the future. In this project, you need to investigate various features that
may contribute to the connection between two people by exploring network structures.
Datasets
A dataset will be provided for this project. We will provide two graph snapshots. The old
snapshot is used for algorithm training while the new one is used for evaluation.
Astro Physics collaboration network dataset
The Arxiv ASTRO-PH (Astro Physics) collaboration network is from the e-print arXiv and
covers scientific collaborations between authors of papers submitted to the Astro Physics category.
If an author i co-authored a paper with author j, the graph contains an undirected edge
from i to j. If the paper is co-authored by k authors this generates a completely connected
(sub)graph on k nodes. The data covers papers in the period from January 1993 to April
2003 (124 months). It begins within a few months of the inception of the arXiv, and thus
represents essentially the complete history of its ASTRO-PH section.
The data file contains 18772 nodes (i.e., authors) and 198110 edges (i.e., collaborations).
Each line of the data file contains two values representing an edge. The first value is the
fromNodeId, and the second value is the toNodeId.
Data Download: https://www.dropbox.com/s/f3ct2c941y4wrnq/release.zip?dl=0
Evaluations
The evaluation method can be found in the paper “The link prediction problem for social
networks” listed in the references. Basically, it counts the size of the intersection of the predicted
friends and the true friends. The higher the count, the better the prediction result.
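As a rough illustration of this evaluation, the sketch below scores unconnected pairs in the old snapshot by their number of common neighbours (one of the simple predictors studied in the referenced paper) and then counts how many of the top-ranked pairs appear as new edges. The function names and the toy snapshots are hypothetical and are not taken from the ASTRO-PH data.

```python
from collections import defaultdict
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected adjacency map from (u, v) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def predict_links(old_edges, k):
    """Rank non-adjacent node pairs by their number of common neighbours."""
    adj = build_adjacency(old_edges)
    scores = {}
    for u, v in combinations(sorted(adj), 2):
        if v in adj[u]:
            continue  # already connected in the old snapshot
        common = len(adj[u] & adj[v])
        if common > 0:
            scores[(u, v)] = common
    # Highest score first; ties broken by node ids for determinism.
    ranked = sorted(scores, key=lambda pair: (-scores[pair], pair))
    return ranked[:k]


def count_correct(predicted, new_edges, old_edges):
    """Count predicted pairs that show up as genuinely new edges."""
    old = {tuple(sorted(e)) for e in old_edges}
    new = {tuple(sorted(e)) for e in new_edges} - old
    return len(set(predicted) & new)


# Hypothetical toy snapshots (assumption, for illustration only).
old_snapshot = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (4, 5)]
new_snapshot = old_snapshot + [(1, 4), (3, 5)]

top = predict_links(old_snapshot, k=2)
hits = count_correct(top, new_snapshot, old_snapshot)
print(top, hits)
```

This version enumerates every pair of nodes, which is fine for a toy graph but likely too slow for a graph with 18772 nodes; scoring only pairs that are two hops apart is one way to cut the work.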
References
1. Liben-Nowell, David, and Jon Kleinberg. “The link prediction problem for social networks.”
Journal of the Association for Information Science and Technology 58.7 (2007): 1019-1031.
2. Backstrom, Lars, and Jure Leskovec. “Supervised random walks: predicting and recommending
links in social networks.” Proceedings of the fourth ACM international conference on Web
search and data mining. ACM, 2011.
Selected Topic 2: Movie Recommendation
Description
A recommendation system is used to predict the “rating” or “preference” that a user would
give to an item. Recommender systems have become increasingly popular in recent years,
and are utilized in a variety of areas such as movie recommendation. In this project, you
will be given the MovieLens dataset, which includes information about movies and users and the
ratings users give to movies. Based on that, you can build a simple recommendation
system to predict which movies a user may like and the rating the user would give to a
movie.
Datasets
The MovieLens dataset will be provided for this project. The data was collected through the
MovieLens web site (movielens.umn.edu) during the seven-month period from September
19th, 1997 through April 22nd, 1998. The dataset consists of 100,000 ratings (1-5) from 943
users on 1682 movies. This data has been cleaned up: users who had fewer than 20 ratings or
did not have complete demographic information were removed from this data set.
The details:
u.data: The full u data set, 100000 ratings by 943 users on 1682 items. Each user has rated at
least 20 movies. Users and items are numbered consecutively from 1. The data is randomly
ordered. This is a tab separated list of user id | item id | rating | timestamp. The timestamps
are Unix seconds since 1/1/1970 UTC.
u.info: The number of users, items, and ratings in the u data set.
u.item: Information about the items (movies); in the file itself the fields are separated by the '|' character:
movie id | movie title | release date | video release date | IMDb URL | unknown | Action |
Adventure | Animation | Children’s | Comedy | Crime | Documentary | Drama | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western
The last 19 fields are the genres; a 1 indicates the movie is of that genre, a 0 indicates it is
not. Movies can be in several genres at once.
The movie ids are the ones used in the u.data data set.
u.genre: a list of all the genres.
u.user: Demographic information about the users; in the file itself the fields are separated by the '|' character:
user id | age | gender | occupation | zip code
The user ids are the ones used in the u.data data set.
u.occupation: a list of the occupations
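To make the u.data layout concrete, here is a minimal sketch that reads tab-separated rating rows into a per-user dictionary using only the standard library. The sample rows and the helper names `read_ratings` and `ratings_by_user` are made up for illustration; in practice you would pass an open handle to the real u.data file instead of the in-memory sample.

```python
import csv
import io

# Hypothetical sample rows in the u.data layout (tab separated):
# user id, item id, rating, timestamp.
SAMPLE_U_DATA = (
    "196\t242\t3\t881250949\n"
    "186\t302\t3\t891717742\n"
    "196\t302\t5\t881251000\n"
)


def read_ratings(handle):
    """Yield (user_id, item_id, rating, timestamp) from a u.data-style file."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        user_id, item_id, rating, timestamp = (int(x) for x in row)
        yield user_id, item_id, rating, timestamp


def ratings_by_user(rows):
    """Group ratings as {user_id: {item_id: rating}}."""
    by_user = {}
    for user_id, item_id, rating, _timestamp in rows:
        by_user.setdefault(user_id, {})[item_id] = rating
    return by_user


ratings = ratings_by_user(read_ratings(io.StringIO(SAMPLE_U_DATA)))
print(ratings)
```

A nested dictionary like this is convenient for neighbourhood-based collaborative filtering; a matrix library may be a better fit for factorization methods.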
Data download: https://www.dropbox.com/s/7a1rpq684c33nca/ml-20m.zip?dl=0
https://www.dropbox.com/s/ip7x5v26a5kvixg/ml-100k.zip?dl=0
For more details: https://grouplens.org/datasets/movielens/
Evaluation
The data set comes with an 80%/20% split into training data and test data. You can use the provided test data to
evaluate your results, or you can split the data yourself using the timestamp to evaluate your
recommender system.
You can evaluate it by aiming for a low prediction error (RMSD) and high recall
coverage. For details, see the links in the references. In your report you need to give the
RMSD and recall of your recommender system.
RMSD
The root-mean-square deviation (RMSD) is a frequently used measure of the differences
between values (sample and population values) predicted by a model or an estimator and the
values actually observed.
The RMSD of predicted values $\hat{y}_t$ for times $t$ of a regression’s dependent variable $y_t$ is computed
for $n$ different predictions as the square root of the mean of the squares of the deviations:

$$\mathrm{RMSD} = \sqrt{\frac{\sum_{t=1}^{n} (\hat{y}_t - y_t)^2}{n}}$$
Recall:
Recall ($R$) is defined as the number of true positives ($T_p$) over the number of true positives
plus the number of false negatives ($F_n$):

$$R = \frac{T_p}{T_p + F_n}$$
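The two metrics can be computed in a few lines. The sketch below is one possible reading of the definitions above: `rmsd` compares predicted and observed ratings pairwise, and `recall` treats the items a user actually liked as the relevant set and the recommended items found in it as true positives. The function names and the sample numbers are hypothetical.

```python
import math


def rmsd(predicted, observed):
    """Root-mean-square deviation between two equally long rating lists."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


def recall(recommended, relevant):
    """True positives divided by (true positives + false negatives)."""
    relevant = set(relevant)
    if not relevant:
        return 0.0  # convention chosen here for an empty relevant set
    true_positives = len(set(recommended) & relevant)
    return true_positives / len(relevant)


# Hypothetical predictions and ground truth for four test ratings.
error = rmsd([4, 3, 5, 2], [5, 3, 4, 2])

# Hypothetical item ids: 2 of the 4 relevant items were recommended.
coverage = recall(recommended=[10, 20, 30], relevant=[20, 30, 40, 50])
print(error, coverage)
```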
References
1. BN Miller, I Albert, SK Lam, JA Konstan. MovieLens unplugged: experiences with an
occasionally connected recommender system. IUI 2003.
2. Jure Leskovec, Anand Rajaraman, Jeff Ullman, Mining of Massive Datasets Chapter 9,
Cambridge University Press, second edition, 2014 (can be downloaded via http://www.mmds.org)
3. Evaluating Recommender Systems.
(https://medium.com/recombee-blog/evaluating-recommender-systems-choosing-the-best-onefor-your-business-c688ab781a35)
Selected Topic 3: POI Recommendation
Description
Point-of-interest (POI) recommendation has become a major issue with the rapid emergence
of location-based social networks (LBSNs). Unlike traditional recommendation approaches,
the LBSNs application domain comes with significant geographical and temporal
dimensions.
In this project, you can use the data from Yelp and Foursquare. Let’s take the Yelp data as
an example. Yelp is one of the most famous LBSNs, and you will be provided with information
about the businesses and users, e.g. the type of the business, the location of the business, the rating a
user gives to a business, and the check-in information of a user. Based on this information,
you can find what type of business a given user likes. For example, if a user often goes to
Vietnamese restaurants and always rates them highly, we might conclude
that the user likes Vietnamese food, and with the location information you can
recommend some nearby Vietnamese restaurants to that user.
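The Vietnamese restaurant example can be sketched as two steps: infer the user's preferred category from past ratings, then keep only businesses of that category within some radius. The code below is a minimal sketch under assumptions: the tuple layouts, the scoring rule (sum of ratings per category), the 2 km radius, and all names and coordinates are hypothetical and do not reflect the Yelp or Foursquare schema. Distances use the standard haversine formula.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def favourite_category(history):
    """Pick the category with the largest total rating (assumed rule)."""
    totals = {}
    for category, rating in history:
        totals[category] = totals.get(category, 0) + rating
    return max(totals, key=totals.get)


def recommend_nearby(history, user_location, businesses, radius_km):
    """Return names of nearby businesses in the user's favourite category."""
    wanted = favourite_category(history)
    lat, lon = user_location
    nearby = []
    for name, category, b_lat, b_lon in businesses:
        if category != wanted:
            continue
        distance = haversine_km(lat, lon, b_lat, b_lon)
        if distance <= radius_km:
            nearby.append((distance, name))
    return [name for _, name in sorted(nearby)]  # closest first


# Hypothetical data: (category, rating) history and (name, category, lat, lon).
history = [("Vietnamese", 5), ("Vietnamese", 4), ("Pizza", 3)]
businesses = [
    ("Pho A", "Vietnamese", -33.8840, 151.2010),
    ("Pho B", "Vietnamese", -33.9500, 151.2500),
    ("Pizza C", "Pizza", -33.8831, 151.2001),
]
suggestions = recommend_nearby(history, (-33.8830, 151.2000), businesses, 2.0)
print(suggestions)
```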
Datasets
You will be provided with two datasets for this project. One is from the Yelp Dataset Challenge,
and the other is from Foursquare, which is also an LBSN. Both datasets contain
common information about businesses and users as well as location information. You are also
encouraged to find a dataset that interests you.
The provided data has the following information.
It includes the business information, the users’ information, the review details, the check-in
information and the tip information. You can get the Yelp dataset at
https://www.dropbox.com/s/e80rv0wwm800mvq/yelp_dataset_challenge_round9.tar?dl=0
The Foursquare dataset download:
https://www.dropbox.com/sh/tf9wvlkk6ene6ry/AACOMB5QeR8BRziOh2QVPzdXa?dl=0
Evaluation
In this project you might need to predict the rating a user might give to a business and the
probability that a user checks in at a place. For the Yelp dataset you will need to set aside some data
as test data. You can split it by date and use the most recent ratings as test data.
You can evaluate your recommender system by aiming for a low prediction error
(RMSD) and high recall coverage. For details, see the link in the references. In your
report you need to give the RMSD and recall of your recommender system.
RMSD
The root-mean-square deviation (RMSD) is a frequently used measure of the differences
between values (sample and population values) predicted by a model or an estimator and the
values actually observed.
The RMSD of predicted values $\hat{y}_t$ for times $t$ of a regression’s dependent variable $y_t$ is computed
for $n$ different predictions as the square root of the mean of the squares of the deviations:

$$\mathrm{RMSD} = \sqrt{\frac{\sum_{t=1}^{n} (\hat{y}_t - y_t)^2}{n}}$$
Recall:
Recall ($R$) is defined as the number of true positives ($T_p$) over the number of true positives
plus the number of false negatives ($F_n$):

$$R = \frac{T_p}{T_p + F_n}$$
References
1. Bin Liu, and Hui Xiong. Point-of-Interest Recommendation in Location Based Social Networks
with Topic and Location Awareness. Proceedings of the 2013 SIAM International.
2. M Xie, H Yin, H Wang, F Xu, W Chen, and S Wang, Learning Graph-based POI Embedding for
Location-based Recommendation. CIKM 16
3. Evaluating Recommender Systems.
(https://medium.com/recombee-blog/evaluating-recommender-systems-choosing-the-best-onefor-your-business-c688ab781a35)
Selected Topic 4: A Simple Google Search Prototype
Description
The amount of information on the web is growing rapidly every day. Thanks to search
engines like Google, users can easily find the information they want with a simple click.
Most search engines return pages of results ordered by their relevance to the
user query. One of the main factors that contributed to Google’s initial success is a ranking
model called PageRank. PageRank makes use of the link structure of the web to calculate a
quality ranking for each web page. In this project, you need to implement a simple Google
search prototype using PageRank. Specifically, given a query such as “uts Australia”, your
search engine should be able to return a set of pages that contain both “uts” and “Australia”,
with the most relevant pages at the top. The proposed ranking model should at
least use the PageRank metric, and students are encouraged to investigate other features that
could be used to improve the ranking.
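A minimal sketch of such a prototype is shown below, assuming a tiny hypothetical link graph and page texts rather than the WebSpam files. It computes PageRank by power iteration with the commonly used damping factor of 0.85 (an assumption, not a requirement of the project), spreads the rank of pages without out-links evenly over all pages, and answers a query by keeping pages that contain every query term and ordering them by PageRank.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank on a dict: page -> list of linked pages."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages with no out-links is spread over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n
                    for p in pages}
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


def search(query, documents, rank):
    """Return pages containing every query term, highest PageRank first."""
    terms = query.lower().split()
    matches = [
        page for page, text in documents.items()
        if all(term in text.lower().split() for term in terms)
    ]
    return sorted(matches, key=lambda page: (-rank.get(page, 0.0), page))


# Hypothetical toy web: page ids, links, and page texts are made up.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
documents = {
    "a": "UTS Australia campus",
    "b": "uts sydney",
    "c": "Study at UTS in Australia",
    "d": "australia travel",
}

rank = pagerank(links)
results = search("uts Australia", documents, rank)
print(results)
```

Matching on whitespace-separated words is a simplification; real HTML would need tag stripping and tokenization, and an inverted index would avoid scanning every page per query.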
Datasets
We will provide a web dataset, named WebSpam. WebSpam contains about 200K web
pages as well as their link structures and their raw html contents.
Specifically, we will provide three files:
1. url_graph_file: each node is represented by a unique URL. In this file, every unique
URL in the corpus is treated as a node in the web graph, and every unique link to another
URL in the corpus is stored as an edge in the web graph.
2. url_id_mapping: maps ids to real URLs.
3. Webspam2011_htl.tgz: the raw html files. You need to match the real URL to the raw
html content in order to get the mapping between real URL and its html.
Data Download: https://www.cc.gatech.edu/projects/doi/WebbSpamCorpus.html
Evaluations
The evaluation will be based on whether the implemented system can achieve its desired
functions. There is no ground truth in this project.
References
1. Page, Lawrence, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation
ranking: Bringing order to the web. Stanford InfoLab, 1999.
2. Brin, Sergey, and Lawrence Page. “The anatomy of a large-scale hypertextual web search
engine.” Computer networks and ISDN systems 30, no. 1 (1998): 107-117.
Selected Topic 5: Cycle Detection in Dynamic Graphs
Description
Data generated by an increasing number of applications is being modeled as graphs. This is
because the graph structure can encode complex relationships among entities which can
appear in social networks, e-commerce transactions, and electronic payments, etc.
Sophisticated analytics over such graphs provides valuable insights to the underlying dataset
and interactions among different entities.
As one analytical approach, cycle detection is the algorithmic problem of finding cycles
in a graph (a set of vertices and edges). In this project, you will be given a
dynamic graph dataset including nodes, static edges and dynamic edges. Your goal is to
identify the newly generated cycles for each incoming edge of the dynamic graph and return them
for a set of continuous queries. Each query can ask for cycles
satisfying some predefined constraints, such as length constraints.
The following is a simple example of cycle detection among buyers and sellers on an e-commerce
platform. We denote individual users (buyers or sellers) and their accounts as
vertices in the graph. There are two types of edges. Static edges (in solid lines)
model the association of accounts to users and the relationships among different users,
while online transactions, including payment activities, are denoted as dynamic edges (in
dotted lines) between the corresponding vertices. In order to increase the popularity of a
piece of merchandise and so improve future sales, fake transactions are placed to artificially bump
up the number of past transactions.
In this example, this is achieved through a third-party account (vertex 3) from which a
normal order is placed and its payment (edge 3 → 4) is completed at time t2. However, the
merchandise is never shipped by the seller (vertex 5), and the money used for the payment by
the fake buyer (vertex 3) was previously transferred to him/her by the seller’s friend (vertex
1) at time t1 using his or her own account (vertex 2). The entire process is rather
complicated and involves multiple entities. Interestingly, it generates a cycle (1 → 2 → 3 → 4 →
5 → 1) in the graph, which can be returned as a strong indication that fraud may exist.
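One straightforward way to report the cycles created by an incoming edge u → v is to search for simple paths from v back to u in the graph as it stood before the insertion, bounded by the length constraint. The sketch below does this with a depth-first search; it is a baseline illustration, not the indexed method of the referenced paper. The class name is hypothetical, and the choice of which edges of the example are static (1 → 2, 4 → 5, 5 → 1) is an assumption, since the original figure is not reproduced here.

```python
from collections import defaultdict


class DynamicGraph:
    """Directed graph that reports the cycles closed by each new edge."""

    def __init__(self, static_edges):
        self.out = defaultdict(set)
        for u, v in static_edges:
            self.out[u].add(v)

    def add_edge(self, u, v, max_len):
        """Insert u -> v and return the simple cycles it closes.

        Each cycle is listed as its vertices starting at u and contains
        at most max_len edges. Self-loops are ignored in this sketch.
        """
        cycles = []
        path = [u, v]
        on_path = {u, v}

        def extend(node):
            for nxt in sorted(self.out[node]):
                if nxt == u:
                    cycles.append(list(path))  # path closes back at u
                elif nxt not in on_path and len(path) < max_len:
                    path.append(nxt)
                    on_path.add(nxt)
                    extend(nxt)
                    path.pop()
                    on_path.remove(nxt)

        if u != v and max_len >= 2:
            extend(v)
        self.out[u].add(v)
        return cycles


# Assumed static edges for the fraud example; transactions arrive later.
graph = DynamicGraph([(1, 2), (4, 5), (5, 1)])
at_t1 = graph.add_edge(2, 3, max_len=6)   # transfer at time t1
at_t2 = graph.add_edge(3, 4, max_len=6)   # payment at time t2
print(at_t1, at_t2)
```

Because evaluation is based on response time, a plain search like this would mainly serve as a correctness baseline against which faster approaches can be checked.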
Datasets
The p2p-Gnutella (04-09) series of datasets will be provided for this project. The data was collected
through the Gnutella peer-to-peer network from August 2002. The dataset consists of about
10,000 nodes and 20,000-40,000 directed edges. You can randomly split the whole edge set into 3/4
static edges and 1/4 dynamic edges to set up the query processing. You can
also use other datasets with directed edges.
Dataset format:
FromNodeId ToNodeId
Dataset link: http://snap.stanford.edu/data/index.html#amazon
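The random 3/4 static and 1/4 dynamic split can be sketched as follows. The parser assumes one "FromNodeId ToNodeId" pair per line and skips blank lines and lines starting with '#'; the function names, the fixed seed, and the sample edge list are hypothetical.

```python
import random


def parse_edge_list(text):
    """Parse 'FromNodeId ToNodeId' lines, skipping blanks and '#' comments."""
    edges = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        src, dst = line.split()[:2]
        edges.append((int(src), int(dst)))
    return edges


def split_edges(edges, static_fraction=0.75, seed=0):
    """Shuffle the edges and split them into (static, dynamic) lists."""
    shuffled = list(edges)
    random.Random(seed).shuffle(shuffled)  # seeded for repeatable runs
    cut = int(len(shuffled) * static_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical sample in the dataset format.
SAMPLE = """# FromNodeId ToNodeId
0 1
0 2
1 3
2 3
3 4
4 0
4 5
5 1
"""

edges = parse_edge_list(SAMPLE)
static_edges, dynamic_edges = split_edges(edges)
print(len(static_edges), len(dynamic_edges))
```

Fixing the seed keeps the split identical across runs, which makes response-time measurements comparable.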
Evaluation
The evaluation will be based on query response time, that is, how long it takes for the
system to return correct query results. The faster correct results are returned, the better the
performance of the system.
References
1. Qiu, X., Cen, W., Qian, Z., Peng, Y., & Zhang, Y. (2018). Real-time constrained cycle
detection in large dynamic graphs. Proceedings of the VLDB Endowment, 11(12), 1876-1888.