A Python programming assignment from a US course: implementing a Naive Bayes classifier

Misinformation data collection

Any text classification project starts with data collection. For the first part of this assignment, you
will have to identify online true information and online misinformation in English on one topic: the
Ukraine-Russia war or Climate change. (We understand that some students may be directly affected
by the war; if that is the case, please select the Climate change topic for the data collection and later
for the experiments you will run.)

Please use this form to submit true information and misinformation you identify online.
https://docs.google.com/forms/d/e/1FAIpQLSfKTaaY6BRqHq5v75J3xQ1BuVdWIdDjcuqNEn1BIvDH7Ronhg/viewform

These are the steps you will have to follow to complete the form:

● First, please select the topic for which you will be submitting data: the Ukraine-Russia war,
or Climate change.

● Next, on the topic selected, please identify 7 pieces of true information and 7 pieces of
misinformation, all written in English. Any English online text source is acceptable, e.g., news
sources, Twitter, blogs, etc. Regardless of the source, please make sure each piece of
(mis)information is at least 200 characters long and at most 1000 characters long. Please
also keep track of and record the URL from where you are getting the (mis)information.

● For at least 3 pieces of true information and at least 3 pieces of misinformation, please also write
an explanation indicating why you believe the information is truthful (or misinformative);
for the remaining cases where you do not provide an explanation, please include NA.

● Consider saving the data prior to completing this form, to make sure you don’t lose it if
something goes wrong with the form submission.

Naive Bayes classifier

Write a Python program that implements the Naive Bayes text classifier, as discussed in class. To
avoid zero counts, make sure you also implement add-one smoothing.
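
As a minimal sketch of add-one smoothing, assuming the usual multinomial formulation P(w | c) = (count(w, c) + 1) / (count(c) + |V|), where |V| is the vocabulary size. The function name and the toy counts below are hypothetical, and the version discussed in class may differ in detail.

```python
import math
from collections import Counter

def smoothed_log_prob(word, class_counts, vocab_size):
    """Add-one smoothed log P(word | class).

    class_counts: Counter of token counts for one class (hypothetical structure).
    vocab_size: number of distinct tokens across all training files.
    """
    total = sum(class_counts.values())
    # A Counter returns 0 for unseen words, so the numerator is never zero.
    return math.log((class_counts[word] + 1) / (total + vocab_size))

# Toy counts, invented purely for illustration.
fake_counts = Counter({"shocking": 3, "report": 1})
vocab_size = 5

seen = smoothed_log_prob("shocking", fake_counts, vocab_size)   # (3 + 1) / (4 + 5)
unseen = smoothed_log_prob("senate", fake_counts, vocab_size)   # (0 + 1) / (4 + 5)
```

Without the added 1, the unseen word would have probability zero and its logarithm would be undefined.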

Evaluate your implementation on:

(1) the fake news dataset fakenews.zip provided on Canvas under the Files section;

(2) one of the ClimateChange or UkraineWar datasets (your choice), which will be made available
on 3/17.

The first dataset is the folder fakenews/, which contains 480 files of fake and legitimate news
covering several domains (technology, education, business, sports, politics, entertainment and
celebrity news). The ground truth (in this case also referred to as the label or the class) for each
statement is encoded in the filename; for instance, the statement stored in the file fake13.txt is
misinformative (the class is fake).
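
One possible way to recover the class from a filename is sketched below. It assumes the label is the alphabetic prefix that precedes the file number, as in fake13.txt; the helper name label_from_filename is hypothetical, and the prefix used for legitimate news is not stated here, so check the actual file names in the dataset.

```python
import os
import re

def label_from_filename(path):
    """Return the leading alphabetic prefix of the file name as the class label."""
    name = os.path.basename(path)
    match = re.match(r"[A-Za-z]+", name)
    if match is None:
        raise ValueError(f"cannot infer a class label from {name!r}")
    return match.group(0).lower()

label = label_from_filename("fakenews/fake13.txt")
```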

For both evaluations, use the leave-one-out strategy. That is, assuming there are N files in a dataset,
train your Naive Bayes classifier on N-1 files, and test on the remaining one file. Repeat this process N
times, using one file at a time as your test file.
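
The leave-one-out loop can be sketched as follows. This is only an illustration: leave_one_out_accuracy and the three toy_* stand-ins (a majority-class "classifier" and a filename-prefix labeler) are hypothetical names, used so the loop can run without the real training and testing code.

```python
from collections import Counter

def leave_one_out_accuracy(files, train_fn, predict_fn, label_fn):
    """Hold out each file once, train on the other N-1, and return the accuracy."""
    correct = 0
    for i, test_file in enumerate(files):
        train_files = files[:i] + files[i + 1:]   # all files except the i-th
        model = train_fn(train_files)
        if predict_fn(test_file, model) == label_fn(test_file):
            correct += 1
    return correct / len(files)

# Toy stand-ins, invented for illustration only.
def toy_label(path):
    # Assumes names like "fake13.txt": drop the extension, then the digits.
    return path.split(".")[0].rstrip("0123456789")

def toy_train(train_files):
    # "Model" is simply the most frequent label in the training files.
    return Counter(toy_label(f) for f in train_files).most_common(1)[0][0]

def toy_predict(test_file, model):
    return model

files = ["fake1.txt", "fake2.txt", "fake3.txt", "legit1.txt"]
accuracy = leave_one_out_accuracy(files, toy_train, toy_predict, toy_label)
```

Note that the model is retrained from scratch in every iteration, so the held-out file never influences the counts used to classify it.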

Programming guidelines:

Write a program called naivebayes.py that trains and tests a Naive Bayes classification algorithm.
The program will receive one argument on the command line, consisting of the name of a folder
containing all the data files.
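
A minimal sketch of reading the folder name from the command line, assuming every regular file directly inside the folder is a data file. The helper names list_data_files and main are hypothetical.

```python
import os
import sys

def list_data_files(folder):
    """Return the sorted paths of the regular files directly inside folder."""
    return sorted(
        os.path.join(folder, name)
        for name in os.listdir(folder)
        if os.path.isfile(os.path.join(folder, name))
    )

def main(argv):
    # argv[0] is the script name; argv[1] should be the data folder.
    if len(argv) != 2:
        print("usage: python naivebayes.py <data_folder>", file=sys.stderr)
        return 1
    files = list_data_files(argv[1])
    print(f"found {len(files)} data files in {argv[1]}")
    return 0
```

In a script, this would typically be invoked as sys.exit(main(sys.argv)) under the usual if __name__ == "__main__": guard, e.g. python naivebayes.py fakenews/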

Include the following functions in naivebayes.py:

a. Function that trains a Naive Bayes classifier:

name: trainNaiveBayes;
input: the list of file paths to be used for training;
output: data structure with class probabilities (or log of probabilities);
output: data structure with word conditional probabilities (or log of probabilities);
output: any other parameters required (e.g., vocabulary size).
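
One way the interface above could be laid out is sketched here, under several assumptions that are not part of the specification: a lowercase whitespace split stands in for the Assignment 1 tokenizer, the class label is taken from the alphabetic prefix of the file name, log probabilities are returned, and add-one smoothing is applied over the training vocabulary.

```python
import math
import os
import re
from collections import Counter, defaultdict

def trainNaiveBayes(train_paths):
    """Return (class log-probabilities, word log-probabilities per class, vocabulary size)."""
    class_doc_counts = Counter()          # number of training files per class
    word_counts = defaultdict(Counter)    # token counts per class
    vocabulary = set()                    # all tokens, regardless of class

    for path in train_paths:
        # Assumption: file names start with the class label, e.g. fake13.txt.
        label = re.match(r"[A-Za-z]+", os.path.basename(path)).group(0)
        with open(path, encoding="utf-8", errors="ignore") as f:
            tokens = f.read().lower().split()   # placeholder tokenizer
        class_doc_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)

    n_docs = sum(class_doc_counts.values())
    vocab_size = len(vocabulary)
    class_log_probs = {c: math.log(n / n_docs) for c, n in class_doc_counts.items()}

    word_log_probs = {}
    for c, counts in word_counts.items():
        denom = sum(counts.values()) + vocab_size   # add-one smoothing denominator
        word_log_probs[c] = {
            w: math.log((counts[w] + 1) / denom) for w in vocabulary
        }
    return class_log_probs, word_log_probs, vocab_size
```

Returning a tuple is just one choice; a single dictionary or a small class holding the same three pieces would satisfy the listed outputs equally well.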

Given a set of training files, this function will:

– preprocess the content of the files provided as input, i.e., tokenize the text (you are encouraged to
use the functions you implemented for Assignment 1, but you can also use another
CAEN-compatible tokenizer if you prefer; please also include the related code if you choose to use
your own processing code) and compute a vocabulary (which is composed of all the tokens in the
training files, regardless of what class they belong to). In the basic implementation of your function,