Homework #1
SYS 4582/6018 | Spring 2019
Problem 1.1: Interestingness
Suppose we have market basket data consisting of 100 transactions and 20 items. Assume the support for item {a} is 20%, support for item {b} is 85%, and support for itemset {a,b} is 15%.
- What is the confidence of the rule {a} → {b}?

$c(\{a\} \rightarrow \{b\}) = s(\{a,b\})/s(\{a\}) = 0.15/0.20 = 0.75$
- Will the apriori algorithm find this rule (interesting) given the minimum support and minimum confidence thresholds?

Yes, provided the thresholds do not exceed the rule's support (15%) and confidence (75%), since apriori reports every rule that meets both minimums.
- Find the lift of this rule.

$\text{lift}(\{a\} \rightarrow \{b\}) = c/s(\{b\}) = 0.75/0.85 \approx 0.88$
- Find the addedValue of this rule.

$AV(\{a\} \rightarrow \{b\}) = c - s(\{b\}) = 0.75 - 0.85 = -0.10$
- Find the leverage/PS of this rule.

$PS(\{a\} \rightarrow \{b\}) = s(\{a,b\}) - s(\{a\})\,s(\{b\}) = 0.15 - (0.20)(0.85) = -0.02$
- Describe the nature of the relationship between items {a} and {b} according to lift, addedValue and leverage/PS. What observation can you draw from parts (b) and (c-e)?
There is a negative relationship between {a} and {b}: fewer transactions contain both {a} and {b} than would be expected if {a} and {b} were independent. This shows that confidence alone may not be a great measure of how interesting a rule is.
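These measures can be checked numerically from the supports given in the problem; a quick sketch in R:

```r
#-- Supports given in the problem
s_a  = 0.20   # support of {a}
s_b  = 0.85   # support of {b}
s_ab = 0.15   # support of {a,b}

conf = s_ab / s_a        # confidence of {a} -> {b}
lift = conf / s_b        # lift (< 1 indicates a negative relationship)
av   = conf - s_b        # addedValue
ps   = s_ab - s_a * s_b  # leverage / PS

c(confidence = conf, lift = round(lift, 3), addedValue = av, PS = ps)
```

All three interest measures come out negative (or below 1 for lift), consistent with the negative relationship noted above.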
- Let $p_a$, $p_b$, and $p_{ab}$ be the actual probabilities of observing items {a}, {b}, and {a,b} respectively in a transaction. What is the expected confidence of the rule {a} → {b} if a and b are independent?

Confidence is $c = p_{ab}/p_a$. Under independence, $p_{ab} = p_a p_b$, and thus the confidence is expected to be $c = p_a p_b / p_a = p_b$. So even if the items are independent, confidence can be large if $p_b$ is large.
Problem 1.2: Online Retail
The website http://archive.ics.uci.edu/ml/datasets/online+retail describes some transactional data from an online retailer.
- Download the Excel file to your machine and read it into R.
library(readxl)
data.dir = <"path/to/datadir">
X = read_excel(file.path(data.dir, "Online Retail.xlsx"))
library(readxl)
#-- Load the "Online Retail.xlsx" data
data.dir = "../data"
X = read_excel(file.path(data.dir, "Online Retail.xlsx"))
- There are many quality problems with this dataset, but we will only address two of them. Remove all of the rows with missing Description values (NAs) and remove any duplicate items in a single transaction. Print the first 10 rows of the resulting data.
library(dplyr)  # for %>%, filter(), distinct()
Y = X %>%
  filter(!is.na(Description)) %>%                    # remove rows with missing Description
  distinct(InvoiceNo, Description, .keep_all=TRUE)   # remove duplicate items in a transaction
#-- Print the first 10 rows
print(Y, n=10)
#> # A tibble: 529,477 x 8
#> InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice
#> <chr> <chr> <chr> <dbl> <dttm> <dbl>
#> 1 536365 85123A WHITE HANG… 6 2010-12-01 08:26:00 2.55
#> 2 536365 71053 WHITE META… 6 2010-12-01 08:26:00 3.39
#> 3 536365 84406B CREAM CUPI… 8 2010-12-01 08:26:00 2.75
#> 4 536365 84029G KNITTED UN… 6 2010-12-01 08:26:00 3.39
#> 5 536365 84029E RED WOOLLY… 6 2010-12-01 08:26:00 3.39
#> 6 536365 22752 SET 7 BABU… 2 2010-12-01 08:26:00 7.65
#> 7 536365 21730 GLASS STAR… 6 2010-12-01 08:26:00 4.25
#> 8 536366 22633 HAND WARME… 6 2010-12-01 08:28:00 1.85
#> 9 536366 22632 HAND WARME… 6 2010-12-01 08:28:00 1.85
#> 10 536367 84879 ASSORTED C… 32 2010-12-01 08:34:00 1.69
#> # ... with 5.295e+05 more rows, and 2 more variables: CustomerID <dbl>,
#> # Country <chr>
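As a sanity check (not part of the required output), the cleaning can be verified directly on `Y`; a short sketch:

```r
#-- No missing descriptions should remain
anyNA(Y$Description)                                 # expect FALSE

#-- No (InvoiceNo, Description) pair should appear more than once
any(duplicated(Y[, c("InvoiceNo", "Description")]))  # expect FALSE
```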
- Find the number of transactions and number of items using InvoiceNo for transactions and Description as items (i.e., ignore the StockCode column).
(NT = n_distinct(Y$InvoiceNo)) # Number of transactions
#> [1] 24446
(NI = n_distinct(Y$Description)) # Number of items
#> [1] 4211
- Convert the data frame into a transaction list and convert it into a transactions object (don't forget to load the arules package). Print a summary (using summary()) of the new object.
library(arules)
tList = split(Y$Description, Y$InvoiceNo) # get transaction list
trans = as(tList, "transactions") # convert to transactions object
summary(trans) # print summary info
#> transactions as itemMatrix in sparse format with
#> 24446 rows (elements/itemsets/transactions) and
#> 4211 columns (items) and a density of 0.005143444
#>
#> most frequent items:
#> WHITE HANGING HEART T-LIGHT HOLDER REGENCY CAKESTAND 3 TIER
#> 2302 2169
#> JUMBO BAG RED RETROSPOT PARTY BUNTING
#> 2135 1706
#> LUNCH BAG RED RETROSPOT (Other)
#> 1607 519558
#>
#> element (itemset/transaction) length distribution:
#> sizes
#> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
#> 4440 1590 1080 812 791 671 654 634 635 562 568 505 513 537 555
#> 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
#> 557 468 444 491 438 407 349 351 310 249 262 243 242 272 226
#> 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
#> 199 189 162 177 137 137 131 122 139 122 123 103 97 104 100
#> 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
#> 91 84 95 88 86 57 65 78 70 73 50 65 51 36 61
#> 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75
#> 40 29 43 39 39 42 34 40 29 33 39 23 25 34 26
#> 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
#> 21 19 27 15 13 20 21 15 23 17 17 9 17 11 12
#> 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105
#> 9 15 16 7 5 10 9 13 5 11 11 3 6 9 2
#> 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120
#> 4 7 4 4 4 7 3 5 6 6 8 6 4 8 5
#> 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135
#> 6 11 4 5 3 4 8 1 2 4 3 3 2 5 4
#> 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150
#> 2 6 6 2 5 6 2 2 5 5 3 2 4 5 3
#> 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165
#> 5 3 6 2 2 2 4 4 1 2 3 3 3 2 5
#> 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180
#> 4 1 4 4 2 2 4 3 4 2 5 5 4 2 4
#> 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195
#> 2 6 4 3 3 3 2 3 4 4 2 3 2 3 3
#> 196 197 198 199 202 203 204 205 206 207 208 210 211 212 213
#> 4 2 2 3 2 5 5 1 2 1 4 1 4 1 1
#> 214 215 216 217 218 219 220 222 223 224 225 226 227 228 229
#> 2 1 2 4 2 2 2 1 1 3 3 1 1 1 2
#> 230 232 233 234 235 237 238 239 241 242 243 244 247 249 250
#> 1 1 1 1 1 3 3 1 2 1 2 2 2 3 2
#> 253 254 255 257 259 261 262 263 264 266 267 270 275 279 280
#> 1 2 2 2 1 2 2 1 2 1 1 2 1 2 2
#> 282 283 285 286 288 289 291 292 295 296 298 299 301 309 310
#> 1 2 2 1 2 1 1 1 1 1 2 1 1 1 1
#> 315 319 320 331 332 333 334 339 341 344 345 347 348 349 352
#> 1 1 1 1 4 1 1 1 1 1 1 2 1 1 2
#> 354 357 358 363 369 375 376 379 382 386 388 399 404 408 411
#> 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1
#> 414 415 416 419 420 428 433 434 438 439 443 449 453 455 458
#> 1 1 2 1 1 1 1 2 1 2 1 1 1 1 1
#> 460 463 471 482 486 487 488 494 499 503 506 514 515 517 518
#> 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1
#> 520 522 524 525 527 529 531 536 539 541 543 552 561 567 572
#> 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1
#> 578 585 588 589 593 595 599 601 607 622 629 635 645 647 649
#> 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1
#> 661 673 676 687 703 720 731 748 1108
#> 1 1 1 1 1 1 1 1 1
#>
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 1.00 3.00 11.00 21.66 24.00 1108.00
#>
#> includes extended item information - examples:
#> labels
#> 1 ?
#> 2 ? sold as sets?
#> 3 ??
#>
#> includes extended transaction information - examples:
#> transactionID
#> 1 536365
#> 2 536366
#> 3 536367
- Find the items with the highest support. Print and plot the support of the top 10.
library(tidyverse)
#-- get item counts and support for single itemsets
itemFreq = count(Y, Description, sort=TRUE) %>% mutate(support=n/NT)
#-- Print the top 10
itemFreq %>% slice(1:10) %>% knitr::kable()
| Description | n | support |
|---|---|---|
| WHITE HANGING HEART T-LIGHT HOLDER | 2302 | 0.0941667 |
| REGENCY CAKESTAND 3 TIER | 2169 | 0.0887262 |
| JUMBO BAG RED RETROSPOT | 2135 | 0.0873354 |
| PARTY BUNTING | 1706 | 0.0697865 |
| LUNCH BAG RED RETROSPOT | 1607 | 0.0657367 |
| ASSORTED COLOUR BIRD ORNAMENT | 1467 | 0.0600098 |
| SET OF 3 CAKE TINS PANTRY DESIGN | 1458 | 0.0596417 |
| PACK OF 72 RETROSPOT CAKE CASES | 1334 | 0.0545693 |
| LUNCH BAG BLACK SKULL. | 1295 | 0.0529739 |
| NATURAL SLATE HEART CHALKBOARD | 1266 | 0.0517876 |
# plot top 10 items
itemFreq %>% slice(1:10) %>%
ggplot(aes(fct_reorder(Description, support), support)) + # Order Description
geom_col() + # barplot
coord_flip() + # rotate plot 90 deg
theme(axis.title.y = element_blank()) # remove y axis title
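The same top-10 supports can be cross-checked against the transactions object using arules' own item-frequency tools; a sketch assuming `trans` from above:

```r
library(arules)
#-- Support of every item, computed directly from the transactions object
supp = itemFrequency(trans, type = "relative")
head(sort(supp, decreasing = TRUE), 10)

#-- Or plot the top 10 directly
itemFrequencyPlot(trans, topN = 10, type = "relative")
```

The counts agree with the tidyverse calculation because duplicate items within a transaction were already removed.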
- Find the frequent itemsets that contain at least 3 items and have $s \ge 0.02$. Show the top 10 results, ordered by support.
#-- Convert to data frame / tibble
# use this instead of inspect(), which only prints to screen
apriori2df <- function(x){
  if(is(x, "itemsets")){
    out = data.frame(items=labels(x), x@quality, stringsAsFactors = FALSE)
  }
  else if(is(x, "rules")){
    out = data.frame(
      lhs = labels(lhs(x)),
      rhs = labels(rhs(x)),
      x@quality,
      stringsAsFactors = FALSE)
  }
  else stop("Only works with class of itemsets or rules")
  if(require(dplyr)) as_tibble(out) else out  # as_tibble replaces the deprecated tbl_df
}
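As an aside, recent versions of arules ship a built-in converter, `DATAFRAME()`, that makes a helper like this largely unnecessary; a self-contained sketch using the Groceries data bundled with arules:

```r
library(arules)
data(Groceries)  # example transactions bundled with arules

g.rules = apriori(Groceries,
                  parameter = list(support = 0.01, confidence = 0.5),
                  control = list(verbose = FALSE))

#-- DATAFRAME() splits each rule into LHS/RHS columns plus the quality measures
head(DATAFRAME(g.rules))
```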
#-- Frequent itemsets with len>=3, s>=.02
fis3 = apriori(trans,
parameter = list(support = .02, minlen=3, target="frequent"),
control = list(verbose=FALSE))
apriori2df(fis3) %>% # convert to dataframe/tibble
arrange(-support) %>% # order by support (descending)
pull(items) # show items
#> [1] "{GREEN REGENCY TEACUP AND SAUCER,PINK REGENCY TEACUP AND SAUCER,ROSES REGENCY TEACUP AND SAUCER}"
- Find all of the association rules with $s \ge 0.02$ and $c \ge 0.70$. Add the PS/leverage and addedValue metrics. Show all results, ordered by addedValue.
#-- Find association rules with support>=.02 and confidence>=.70
rules = apriori(trans,
parameter = list(support=.02, confidence=.70,
minlen=2,target="rules"),
control = list(verbose=FALSE))
#-- Add other interest measures
apriori2df(rules) %>%
mutate(addedValue = interestMeasure(rules, measure="addedValue", trans),
PS = interestMeasure(rules, measure="leverage", trans)) %>%
arrange(-addedValue)
#> # A tibble: 8 x 8
#> lhs rhs support confidence lift count addedValue PS
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 {PINK REGENCY… {GREEN … 0.0225 0.894 20.7 549 0.851 0.0214
#> 2 {GREEN REGENC… {ROSES … 0.0225 0.852 18.6 549 0.807 0.0213
#> 3 {PINK REGENCY… {GREEN … 0.0263 0.804 18.6 644 0.761 0.0249
#> 4 {PINK REGENCY… {ROSES … 0.0251 0.767 16.7 614 0.721 0.0236
#> 5 {GREEN REGENC… {ROSES … 0.0321 0.742 16.2 784. 0.696 0.0301
#> 6 {GARDENERS KN… {GARDEN… 0.0225 0.718 19.0 549 0.680 0.0213
#> 7 {GREEN REGENC… {PINK R… 0.0225 0.700 21.4 549 0.667 0.0214
#> 8 {ROSES REGENC… {GREEN … 0.0321 0.7 16.2 784. 0.657 0.0301
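An alternative to mutating the converted data frame is to attach the extra measures to the rules object itself, so they survive later subsetting and can be used with arules' own sort(); a sketch assuming the `rules` and `trans` objects from above:

```r
library(arules)
#-- Attach additional interest measures to the rules object
quality(rules) = cbind(
  quality(rules),
  addedValue = interestMeasure(rules, measure = "addedValue", transactions = trans),
  PS         = interestMeasure(rules, measure = "leverage",   transactions = trans))

inspect(sort(rules, by = "addedValue"))
```

Note that leverage/PS is symmetric in lhs and rhs, which is why mirror-image rules (e.g., rows 1 and 7 above) share the same PS value.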
- Find one rule that you think is interesting. Write the rule and explain why you find it interesting.
I wanted to see what might be associated with HAND WARMERs, so I pulled all items whose Description contains "HAND WARMER". Then apriori is run, forcing HAND WARMER items to appear only on the lhs.
#-- Find rules that involve HAND WARMER on the lhs
lhs.items = itemFreq %>%
filter(str_detect(Description, "HAND WARMER")) %>%
pull(Description)
r = apriori(trans,
parameter=list(support=0.005, confidence=.2, minlen=2),
appearance = list(lhs=lhs.items),
control=list(verbose=FALSE))
apriori2df(r) %>%
mutate(addedValue = interestMeasure(r, measure="addedValue", trans),
PS = interestMeasure(r, measure="leverage", trans)) %>%
arrange(-addedValue)
#> # A tibble: 21 x 8
#> lhs rhs support confidence lift count addedValue PS
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 {HAND WAR… {HOT WATE… 0.00548 0.338 10.4 134 0.305 0.00496
#> 2 {HAND WAR… {SCOTTIE … 0.00679 0.300 9.43 166 0.268 0.00607
#> 3 {HAND WAR… {PAPER CH… 0.00700 0.309 6.46 171 0.261 0.00591
#> 4 {HAND WAR… {PAPER CH… 0.00695 0.304 6.34 170 0.256 0.00586
#> 5 {HAND WAR… {PAPER CH… 0.00794 0.292 6.10 194. 0.244 0.00663
#> 6 {HAND WAR… {HOT WATE… 0.00548 0.276 8.53 134 0.243 0.00484
#> 7 {HAND WAR… {HOT WATE… 0.00593 0.262 9.89 145 0.236 0.00533
#> 8 {HAND WAR… {HOT WATE… 0.00565 0.246 9.30 138 0.220 0.00504
#> 9 {HAND WAR… {HOT WATE… 0.00679 0.250 7.72 166 0.217 0.00591
#> 10 {HAND WAR… {HOT WATE… 0.00560 0.248 7.67 137 0.215 0.00487
#> # ... with 11 more rows
Looks like HOT WATER BOTTLES are associated with HAND WARMERs. The single rule with the highest lift and addedValue is: {HAND WARMER RED LOVE HEART} → {HOT WATER BOTTLE KEEP CALM}.
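That top rule can be isolated directly; a sketch using the `r` rules object from above:

```r
library(arules)
#-- Pull out and print the single highest-lift HAND WARMER rule
best = head(sort(r, by = "lift"), 1)
inspect(best)
```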