### Problem 1 (15 pts): A Market Basket of Marvel Heroes

In HW 2.1, we analyzed the Marvel Universe using network analytics. This problem will use concepts from Association Analysis to explore the data. Use the dataset `marvel_association.csv`, which gives the heroes that appeared in each comic, to answer the following questions. Treat the heroes as the *items* and the comics as the *transactions*.

a. Provide the following descriptive analysis of the data: the number of items (heroes), the number of transactions (comics), and a visual representation of the distribution of *the number of items per transaction*.
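A minimal sketch of one way to start, using the `arules` package and assuming the csv is in "basket" format (one comic per row, heroes separated by commas); adjust `format`/`sep` to match the actual file:

```r
# Sketch, assuming basket format; adjust to the actual layout of the csv.
library(arules)

trans = read.transactions("marvel_association.csv", format = "basket", sep = ",")

length(itemLabels(trans))   # number of items (heroes)
length(trans)               # number of transactions (comics)
hist(size(trans), breaks = 50,
     main = "Distribution of items per transaction",
     xlab = "Number of heroes in a comic")
```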

b. What is the *lift* of the itemset {CAPTAIN AMERICA, SPIDER-MAN}? What does the lift imply about the association between CAPTAIN AMERICA and SPIDER-MAN?
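Recall that for an itemset $\{A, B\}$, $\text{lift} = \frac{P(A \cap B)}{P(A)\,P(B)}$; values above 1 suggest the items co-occur more often than expected under independence. A hedged sketch, assuming `trans` is the transactions object from part (a) and the hero names match the item labels exactly:

```r
# Sketch: lift from the logical comics-x-heroes incidence matrix.
library(arules)
m   = as(trans, "matrix")     # logical incidence matrix: comics x heroes
pA  = mean(m[, "CAPTAIN AMERICA"])
pB  = mean(m[, "SPIDER-MAN"])
pAB = mean(m[, "CAPTAIN AMERICA"] & m[, "SPIDER-MAN"])
pAB / (pA * pB)               # lift of {CAPTAIN AMERICA, SPIDER-MAN}
```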

c. The [Fantastic Four](https://en.wikipedia.org/wiki/Fantastic_Four) comprises the heroes MR. FANTASTIC, THING, INVISIBLE WOMAN, and HUMAN TORCH. If a comic includes the Fantastic Four, which other hero is most likely to be in the comic? What is the estimated probability?
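One way to attack this is to mine rules whose left-hand side is exactly the four Fantastic Four members; the confidence of such a rule estimates the conditional probability of the right-hand-side hero. A sketch, assuming `trans` from part (a); the support threshold is an assumption and may need tuning:

```r
# Sketch: rules with the Fantastic Four on the lhs and one extra hero on the rhs.
library(arules)
ff = c("MR. FANTASTIC", "THING", "INVISIBLE WOMAN", "HUMAN TORCH")
rules = apriori(trans,
                parameter  = list(supp = 1e-4, conf = 0, minlen = 5, maxlen = 5),
                appearance = list(lhs = ff, default = "rhs"))
inspect(head(sort(rules, by = "confidence"), 3))  # top candidates
```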

### Problem 2 (15 pts): Hero Clustering

Consider two *binary* vectors $A \in \{0, 1\}^p$ and $B \in \{0, 1\}^p$, e.g., $A=[0,1,1,1]$ and $B = [1,0,1,0]$ for $p=4$. The dissimilarity, or distance, between two binary vectors can often be constructed from the following three measures:

- $N_A$: the total number of 1's in $A$. In the example, $N_A = 3$.
- $N_B$: the total number of 1's in $B$. In the example, $N_B = 2$.
- $N_{AB}$: the total number of positions where $A$ and $B$ both have a value of 1. In the example, $N_{AB} = 1$ (due to the 3rd position/element of the vectors).

a. Write out the equations for [*cosine distance*](https://en.wikipedia.org/wiki/Cosine_similarity), [*Jaccard's distance*](https://en.wikipedia.org/wiki/Jaccard_index), and [*squared Euclidean distance*](https://en.wikipedia.org/wiki/Euclidean_distance#Squared_Euclidean_distance) using $N_{AB}$, $N_A$, and $N_B$. Note: use $1 - \text{similarity}$ to convert a similarity score to a *distance*.
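The three counts for the running example can be verified directly in R:

```r
# Verify the three counts for the example vectors A and B.
A = c(0, 1, 1, 1)
B = c(1, 0, 1, 0)
N_A  = sum(A == 1)            # 3
N_B  = sum(B == 1)            # 2
N_AB = sum(A == 1 & B == 1)   # 1 (third position)
```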

b. The Marvel Heroes INVISIBLE WOMAN and THING appeared in $668$ comics together. Use the data from Problem 1 to calculate the three *distances* between INVISIBLE WOMAN and THING.
- The vectors represent the presence or absence of the heroes in each comic.
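A sketch of how to get the three counts from the data, assuming `m` is the logical comics-by-heroes incidence matrix from Problem 1 and the hero names match the item labels exactly; the counts then plug into the part (a) equations:

```r
# Sketch: binary presence/absence vectors for the two heroes.
A = m[, "INVISIBLE WOMAN"]
B = m[, "THING"]
N_A  = sum(A)        # comics with INVISIBLE WOMAN
N_B  = sum(B)        # comics with THING
N_AB = sum(A & B)    # comics with both; 668 per the problem statement
```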

c. The dendrogram below is constructed by running hierarchical clustering using *Jaccard's Distance* and *Single Linkage* on the 30 most frequent heroes. Describe the first 3 merges. Who gets merged, at what (approximate) height are they merged, and how is single linkage used to calculate the height?
```{r, echo=FALSE, out.width="70%", fig.align='center'}
knitr::include_graphics(file.path(data.dir, "../other/jaccards-single.png"))
```
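For reference, a sketch of how such a dendrogram could be produced, assuming `m` is the logical comics-by-heroes incidence matrix from Problem 1; `hc$merge` and `hc$height` record the merge order and heights:

```r
# Sketch: Jaccard distance + single linkage on the 30 most frequent heroes.
top30 = names(sort(colSums(m), decreasing = TRUE))[1:30]
D  = dist(t(m[, top30] * 1), method = "binary")  # "binary" = Jaccard distance
hc = hclust(D, method = "single")
plot(hc)
cbind(hc$merge, height = hc$height)[1:3, ]       # the first three merges
```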

d. The dendrogram below is constructed by running hierarchical clustering using *Cosine Distance* and *Complete Linkage* on the 30 most frequent heroes. How many clusters result if the dendrogram is cut at a height of 0.70? What is the largest possible Cosine Distance between THOR and SCARLET WITCH?
```{r, echo=FALSE, out.width="70%", fig.align='center'}
knitr::include_graphics(file.path(data.dir, "../other/cosine-complete.png"))
```
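Cosine distance is not built into `dist()`, so it must be computed from the incidence matrix. A sketch, assuming `m` and `top30` as in part (c):

```r
# Sketch: cosine distance + complete linkage on the 30 most frequent heroes.
X   = m[, top30] * 1                  # numeric 0/1 comics-x-heroes matrix
nrm = sqrt(colSums(X^2))              # column (hero) norms
S   = crossprod(X) / (nrm %o% nrm)    # cosine similarity between heroes
hc  = hclust(as.dist(1 - S), method = "complete")
plot(hc)
table(cutree(hc, h = 0.70))           # cluster sizes when cut at height 0.70
```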

### Problem 3 (15 pts): Predictive Density Estimation Contest

In HW 3.2, you estimated the (space-time) density of highway crashes on I-64 in 2016 using Kernel Density Estimation (KDE). In this problem, you will re-estimate the density, using the method of your choice, with the goal of predicting the density of the crashes in 2017. This will be a contest. You will submit your 2017 density estimates at a set of evaluation grid points, and I will evaluate your predictions using the predictive log-likelihood ratio (or information gain) score:
\[ \text{score} = \sum_{i=1}^n \log \frac{\hat{f}(mile_i, time_i)}{p(mile_i, time_i)} \]
where $\hat{f}(m, t)$ is your prediction for $(mile=m, time=t)$, $p(m, t)$ is my prediction, and the sum is over the $n$ points in the 2017 test data. A positive score means that you did better than my predictions, and you will receive a 4% bonus on your exam score!

a. Use the same data that was given in HW 3.2, `crashes16.csv`, to make your predictive density estimate. Make predictions at the grid points
```{r}
eval.pts = expand.grid(mile = seq(86.9, 136.1, by=.25), time = seq(0, 7, by=1/24))
```
Create a .csv file named `lastname_firstname.csv` that includes the columns *mile*, *time*, and *f*, where *f* is your estimated density.
- Submit this file in Collab (along with your .Rmd and .html).
- Show the code that makes your predictions.
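As a starting point, a hedged sketch of one possible workflow using a bivariate KDE from the `ks` package; the column names `mile` and `time` in `crashes16.csv` are assumptions, and any density method may be substituted:

```r
# Sketch: bivariate KDE evaluated on the required grid, written to csv.
library(ks)
crashes = read.csv("crashes16.csv")                  # assumed columns: mile, time
H = Hpi(crashes[, c("mile", "time")])                # plug-in bandwidth matrix
eval.pts = expand.grid(mile = seq(86.9, 136.1, by = .25),
                       time = seq(0, 7, by = 1/24))
fhat = kde(crashes[, c("mile", "time")], H = H, eval.points = eval.pts)
out  = data.frame(eval.pts, f = fhat$estimate)
write.csv(out, "lastname_firstname.csv", row.names = FALSE)
```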

b. Describe the model you used. Make sure to mention why you chose your model, what the unknown parameters in your model are, and how those parameters were estimated. You may need to read the function documentation; I'm not looking for the mathematical details, but you do need to specify the method used.

### Problem 4 (5 pts): Network Clustering

It was mentioned in class (or class notes) that *community detection* can be considered as a clustering of the nodes in a network. *Spectral clustering* is a clustering approach that uses the eigenvectors of the graph *Laplacian* as the features for clustering. In this exercise you will implement a version of spectral clustering. More specifically, the spectral clustering algorithm for $K$ clusters is as follows:

1. Find the eigenvectors associated with the $K$ *smallest, non-zero* eigenvalues of the graph Laplacian $L$.
2. Run $k$-means clustering to find $K$ clusters using the $K$ eigenvectors as the features/variables.

Use the Laplacian $L = D-A$, where $D$ is the diagonal matrix with the node degrees along the diagonal and $A$ is the (unweighted) adjacency matrix.

Hints: the R function `eigen()` finds the eigenvalues and eigenvectors of a matrix. There will be one $0$ eigenvalue in this problem, but due to round-off error it will have a value on the order of $10^{-16}$; do not include the eigenvector associated with this zero eigenvalue when running $k$-means.

a. The `UKfaculty` network, found in the `igraphdata` R package, is a social network of university faculty. The graph can be loaded into R with the command `data(UKfaculty, package='igraphdata')`. Calculate the graph Laplacian using an undirected and unweighted version of the graph.
- Show the code used to produce $L$.
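A sketch of one way to build $L = D - A$ from an undirected, unweighted version of the graph:

```r
# Sketch: graph Laplacian for an undirected, unweighted UKfaculty graph.
library(igraph)
data(UKfaculty, package = "igraphdata")
g = as.undirected(UKfaculty, mode = "collapse")   # drop edge directions
A = as.matrix(as_adjacency_matrix(g))
A[A > 0] = 1                                      # force unweighted 0/1 entries
D = diag(rowSums(A))                              # degree matrix
L = D - A
```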

b. Implement spectral clustering of the `UKfaculty` network using $K=1,2,\ldots, 9$.
- Show your code.
- Also plot the sum of squared errors (SSE) as a function of $K$.
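A hedged sketch of the loop, assuming `L` from part (a); note that `eigen()` returns eigenvalues in decreasing order, so the (near-)zero eigenvalue sits in the last column and is excluded per the hint:

```r
# Sketch: spectral clustering for K = 1..9, tracking k-means SSE.
e = eigen(L, symmetric = TRUE)
n = nrow(L)
sse = numeric(9)
clusters = vector("list", 9)
for (K in 1:9) {
  # K smallest non-zero eigenvalues -> columns (n-K)..(n-1); column n is ~0.
  X = e$vectors[, (n - K):(n - 1), drop = FALSE]
  km = kmeans(X, centers = K, nstart = 25)
  sse[K] = km$tot.withinss
  clusters[[K]] = km$cluster
}
plot(1:9, sse, type = "b", xlab = "K", ylab = "SSE")
```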

c. Estimate $K$. Explain why you chose that value of $K$.

d. Using your chosen value of $K$, evaluate how well the resulting clustering can distinguish between the school affiliation of the nodes. Use the vertex attribute `Group` (in the `UKfaculty` graph) as the true label and calculate how many nodes would be *misclassified* by the clustering.
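One way to count misclassifications is to map each cluster to its majority `Group` and count the nodes that disagree. A sketch, assuming `cl` holds the cluster labels from part (b) at your chosen $K$:

```r
# Sketch: cross-tabulate clusters vs. true groups; count non-majority nodes.
library(igraph)
grp = vertex_attr(UKfaculty, "Group")        # true school affiliation
tab = table(cl, grp)                         # cl: cluster labels from part (b)
misclassified = sum(tab) - sum(apply(tab, 1, max))
misclassified
```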