SYS 4582/6018 | Spring 2019 


Problem 1.1: Interestingness

Suppose we have market basket data consisting of 100 transactions and 20 items. Assume the support for item {} is 20%, support for item {} is 85%, and support for itemset {} is 15%.

  1. What is the confidence of the rule {a} {b}?
  1. Will the apriori algorithm find this rule (interesting) if and ?

    Yes

    1. Find the lift of this rule.
    1. Find the addedValue of this rule.
    1. Find the leverage/PS of this rule.
    1. Describe the nature of the relationship between items {a} and {b} according to lift, addedValue and leverage/PS. What observation can you draw from parts (b) and (c-e)?

    There is a negative relationship between {a} and {b}, meaning that there are fewer transactions that contain items {a} and {b} than would be expected if {a} and {b} are independent. This shows us that confidence alone may not be a great measure of how interesting a rule is.

    1. Let , , and be the actual probabilities of observing items {a}, {b}, and {a,b} respectively in a transaction. What is the expected confidence rule a -> {b} if a and b are independent?

    Confidence is . Under independence, and . Thus the confidence is expected to be . So even if the items are independent, confidence can be large if is large.

    Problem 1.2: Online Retail

    The website http://archive.ics.uci.edu/ml/datasets/online+retail describes some transactional data from an online retailer.

    1. Download the excel file to your machine and read it into R.
    library(readxl)
    data.dir = <"path/to/datadir">
    X = read_excel(file.path(data.dir, "Online Retail.xlsx"))
    library(readxl)
    #-- Load the "Online Retail.csv" 
    data.dir = "../data"
    X = read_excel(file.path(data.dir, "Online Retail.xlsx"))
    1. There are many quality problems with this dataset, but we will only address two of them. Remove all of the rows with missing Description values (NAs) and remove any duplicate items in a single transaction. Print the first 10 rows of the resulting data.
    Y = X %>% 
      filter(!is.na(Description)) %>%    #Remove rows with missing Description 
      distinct(InvoiceNo, Description, .keep_all=TRUE) # remove duplicates
      
    #-- Print the first 10 rows  
    print(Y, n=10) 
    #> # A tibble: 529,477 x 8
    #>    InvoiceNo StockCode Description Quantity InvoiceDate         UnitPrice
    #>    <chr>     <chr>     <chr>          <dbl> <dttm>                  <dbl>
    #>  1 536365    85123A    WHITE HANG…        6 2010-12-01 08:26:00      2.55
    #>  2 536365    71053     WHITE META…        6 2010-12-01 08:26:00      3.39
    #>  3 536365    84406B    CREAM CUPI…        8 2010-12-01 08:26:00      2.75
    #>  4 536365    84029G    KNITTED UN…        6 2010-12-01 08:26:00      3.39
    #>  5 536365    84029E    RED WOOLLY…        6 2010-12-01 08:26:00      3.39
    #>  6 536365    22752     SET 7 BABU…        2 2010-12-01 08:26:00      7.65
    #>  7 536365    21730     GLASS STAR…        6 2010-12-01 08:26:00      4.25
    #>  8 536366    22633     HAND WARME…        6 2010-12-01 08:28:00      1.85
    #>  9 536366    22632     HAND WARME…        6 2010-12-01 08:28:00      1.85
    #> 10 536367    84879     ASSORTED C…       32 2010-12-01 08:34:00      1.69
    #> # ... with 5.295e+05 more rows, and 2 more variables: CustomerID <dbl>,
    #> #   Country <chr>
    1. Find the number of transactions and number of items using InvoiceNo for transactions and Description as items (i.e., ignore the StockCode column).
    (NT = n_distinct(Y$InvoiceNo))     # Number of transactions
    #> [1] 24446
    (NI = n_distinct(Y$Description))   # Number of items
    #> [1] 4211
    1. Convert the data frame into a transaction list and convert it into a transactions object (don’t forget to load the package). Print a summary (using ) of the new object.
    library(arules)
    
    tList = split(Y$Description, Y$InvoiceNo)    # get transaction list
    trans = as(tList, "transactions")            # convert to transactions object
    summary(trans)   # print summary info
    #> transactions as itemMatrix in sparse format with
    #>  24446 rows (elements/itemsets/transactions) and
    #>  4211 columns (items) and a density of 0.005143444 
    #> 
    #> most frequent items:
    #> WHITE HANGING HEART T-LIGHT HOLDER           REGENCY CAKESTAND 3 TIER 
    #>                               2302                               2169 
    #>            JUMBO BAG RED RETROSPOT                      PARTY BUNTING 
    #>                               2135                               1706 
    #>            LUNCH BAG RED RETROSPOT                            (Other) 
    #>                               1607                             519558 
    #> 
    #> element (itemset/transaction) length distribution:
    #> sizes
    #>    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
    #> 4440 1590 1080  812  791  671  654  634  635  562  568  505  513  537  555 
    #>   16   17   18   19   20   21   22   23   24   25   26   27   28   29   30 
    #>  557  468  444  491  438  407  349  351  310  249  262  243  242  272  226 
    #>   31   32   33   34   35   36   37   38   39   40   41   42   43   44   45 
    #>  199  189  162  177  137  137  131  122  139  122  123  103   97  104  100 
    #>   46   47   48   49   50   51   52   53   54   55   56   57   58   59   60 
    #>   91   84   95   88   86   57   65   78   70   73   50   65   51   36   61 
    #>   61   62   63   64   65   66   67   68   69   70   71   72   73   74   75 
    #>   40   29   43   39   39   42   34   40   29   33   39   23   25   34   26 
    #>   76   77   78   79   80   81   82   83   84   85   86   87   88   89   90 
    #>   21   19   27   15   13   20   21   15   23   17   17    9   17   11   12 
    #>   91   92   93   94   95   96   97   98   99  100  101  102  103  104  105 
    #>    9   15   16    7    5   10    9   13    5   11   11    3    6    9    2 
    #>  106  107  108  109  110  111  112  113  114  115  116  117  118  119  120 
    #>    4    7    4    4    4    7    3    5    6    6    8    6    4    8    5 
    #>  121  122  123  124  125  126  127  128  129  130  131  132  133  134  135 
    #>    6   11    4    5    3    4    8    1    2    4    3    3    2    5    4 
    #>  136  137  138  139  140  141  142  143  144  145  146  147  148  149  150 
    #>    2    6    6    2    5    6    2    2    5    5    3    2    4    5    3 
    #>  151  152  153  154  155  156  157  158  159  160  161  162  163  164  165 
    #>    5    3    6    2    2    2    4    4    1    2    3    3    3    2    5 
    #>  166  167  168  169  170  171  172  173  174  175  176  177  178  179  180 
    #>    4    1    4    4    2    2    4    3    4    2    5    5    4    2    4 
    #>  181  182  183  184  185  186  187  188  189  190  191  192  193  194  195 
    #>    2    6    4    3    3    3    2    3    4    4    2    3    2    3    3 
    #>  196  197  198  199  202  203  204  205  206  207  208  210  211  212  213 
    #>    4    2    2    3    2    5    5    1    2    1    4    1    4    1    1 
    #>  214  215  216  217  218  219  220  222  223  224  225  226  227  228  229 
    #>    2    1    2    4    2    2    2    1    1    3    3    1    1    1    2 
    #>  230  232  233  234  235  237  238  239  241  242  243  244  247  249  250 
    #>    1    1    1    1    1    3    3    1    2    1    2    2    2    3    2 
    #>  253  254  255  257  259  261  262  263  264  266  267  270  275  279  280 
    #>    1    2    2    2    1    2    2    1    2    1    1    2    1    2    2 
    #>  282  283  285  286  288  289  291  292  295  296  298  299  301  309  310 
    #>    1    2    2    1    2    1    1    1    1    1    2    1    1    1    1 
    #>  315  319  320  331  332  333  334  339  341  344  345  347  348  349  352 
    #>    1    1    1    1    4    1    1    1    1    1    1    2    1    1    2 
    #>  354  357  358  363  369  375  376  379  382  386  388  399  404  408  411 
    #>    1    1    1    1    1    1    1    1    1    1    1    2    2    1    1 
    #>  414  415  416  419  420  428  433  434  438  439  443  449  453  455  458 
    #>    1    1    2    1    1    1    1    2    1    2    1    1    1    1    1 
    #>  460  463  471  482  486  487  488  494  499  503  506  514  515  517  518 
    #>    1    1    1    1    1    1    1    1    1    2    1    1    1    1    1 
    #>  520  522  524  525  527  529  531  536  539  541  543  552  561  567  572 
    #>    1    1    1    2    1    1    1    1    1    1    1    1    1    1    1 
    #>  578  585  588  589  593  595  599  601  607  622  629  635  645  647  649 
    #>    1    2    1    1    1    1    1    1    1    1    1    1    1    1    1 
    #>  661  673  676  687  703  720  731  748 1108 
    #>    1    1    1    1    1    1    1    1    1 
    #> 
    #>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    #>    1.00    3.00   11.00   21.66   24.00 1108.00 
    #> 
    #> includes extended item information - examples:
    #>            labels
    #> 1               ?
    #> 2 ? sold as sets?
    #> 3              ??
    #> 
    #> includes extended transaction information - examples:
    #>   transactionID
    #> 1        536365
    #> 2        536366
    #> 3        536367
    1. Find the items with the highest support. Print and plot the support of the top 10.
    library(tidyverse)
    
    #-- get item counts and support for single itemsets
    itemFreq = count(Y, Description, sort=TRUE) %>% mutate(support=n/NT)
    
    #-- Print the top 10
    itemFreq %>% slice(1:10) %>% knitr::kable()
    Description n support
    WHITE HANGING HEART T-LIGHT HOLDER 2302 0.0941667
    REGENCY CAKESTAND 3 TIER 2169 0.0887262
    JUMBO BAG RED RETROSPOT 2135 0.0873354
    PARTY BUNTING 1706 0.0697865
    LUNCH BAG RED RETROSPOT 1607 0.0657367
    ASSORTED COLOUR BIRD ORNAMENT 1467 0.0600098
    SET OF 3 CAKE TINS PANTRY DESIGN 1458 0.0596417
    PACK OF 72 RETROSPOT CAKE CASES 1334 0.0545693
    LUNCH BAG BLACK SKULL. 1295 0.0529739
    NATURAL SLATE HEART CHALKBOARD 1266 0.0517876
    # plot top 10 items
    itemFreq %>% slice(1:10) %>% 
      ggplot(aes(fct_reorder(Description, support), support)) + # Order Description
      geom_col() +         # barplot
      coord_flip() +       # rotate plot 90 deg
      theme(axis.title.y = element_blank()) # remove y axis title

    1. Find the frequent itemsets that contain at least 3 items and have
    1. . Show the top 10 results, ordered by lift.
    #-- Convert to data frame / tibble
    # use this instead of inspect(), which only prints to screen
    apriori2df <- function(x){
      if(class(x) == "itemsets"){
        out = data.frame(items=labels(x), x@quality, stringsAsFactors = FALSE)
      }
      else if(class(x) == "rules"){
        out = data.frame(
          lhs = labels(lhs(x)),
          rhs = labels(rhs(x)),
          x@quality, 
          stringsAsFactors = FALSE)
      }
      else stop("Only works with class of itemsets or rules")
      if(require(dplyr)) tbl_df(out) else out
    }
    #-- Frequent itemsets with len>=3, s>=.02
    fis3 = apriori(trans, 
                   parameter = list(support = .02, minlen=3, target="frequent"), 
                   control = list(verbose=FALSE))
    
    apriori2df(fis3) %>%    # convert to dataframe/tibble
      arrange(-support) %>% # order by support (descending)
      pull(items)           # show items
    #> [1] "{GREEN REGENCY TEACUP AND SAUCER,PINK REGENCY TEACUP AND SAUCER,ROSES REGENCY TEACUP AND SAUCER}"
    1. Find all of the association rules with

    ,

    1. . Add the PS/leverage and addedValue metrics. Show all results, ordered by addedValue
    #-- Find association rules with support>=.02 and confidence>=.70
    rules = apriori(trans, 
                 parameter = list(support=.02, confidence=.70, 
                                  minlen=2,target="rules"),
                 control = list(verbose=FALSE))
    
    #-- Add other interest measures
    apriori2df(rules) %>% 
      mutate(addedValue = interestMeasure(rules, measure="addedValue", trans), 
             PS = interestMeasure(rules, measure="leverage", trans)) %>% 
      arrange(-addedValue)
    #> # A tibble: 8 x 8
    #>   lhs            rhs      support confidence  lift count addedValue     PS
    #>   <chr>          <chr>      <dbl>      <dbl> <dbl> <dbl>      <dbl>  <dbl>
    #> 1 {PINK REGENCY… {GREEN …  0.0225      0.894  20.7  549       0.851 0.0214
    #> 2 {GREEN REGENC… {ROSES …  0.0225      0.852  18.6  549       0.807 0.0213
    #> 3 {PINK REGENCY… {GREEN …  0.0263      0.804  18.6  644       0.761 0.0249
    #> 4 {PINK REGENCY… {ROSES …  0.0251      0.767  16.7  614       0.721 0.0236
    #> 5 {GREEN REGENC… {ROSES …  0.0321      0.742  16.2  784.      0.696 0.0301
    #> 6 {GARDENERS KN… {GARDEN…  0.0225      0.718  19.0  549       0.680 0.0213
    #> 7 {GREEN REGENC… {PINK R…  0.0225      0.700  21.4  549       0.667 0.0214
    #> 8 {ROSES REGENC… {GREEN …  0.0321      0.7    16.2  784.      0.657 0.0301
    1. Find one rule that you think is interesting. Write the rule and explain why you find it interesting.

    I wanted to see what might be associated with HAND WARMERs, so I pulled all items that involve some version of HAND WARMER. Then apriori is run forcing only HAND WARMER on the lhs.

    ## Find rules that involve HAND WAMER on the lhs
    
    lhs.items = itemFreq %>% 
      filter(str_detect(Description, "HAND WARMER")) %>% 
      pull(Description)
    
    r = apriori(trans, 
                parameter=list(support=0.005, confidence=.2, minlen=2),
                appearance = list(lhs=lhs.items), 
                control=list(verbose=FALSE))
    
    apriori2df(r) %>% 
      mutate(addedValue = interestMeasure(r, measure="addedValue", trans), 
             PS = interestMeasure(r, measure="leverage", trans)) %>% 
      arrange(-addedValue)
    #> # A tibble: 21 x 8
    #>    lhs        rhs        support confidence  lift count addedValue      PS
    #>    <chr>      <chr>        <dbl>      <dbl> <dbl> <dbl>      <dbl>   <dbl>
    #>  1 {HAND WAR… {HOT WATE… 0.00548      0.338 10.4   134       0.305 0.00496
    #>  2 {HAND WAR… {SCOTTIE … 0.00679      0.300  9.43  166       0.268 0.00607
    #>  3 {HAND WAR… {PAPER CH… 0.00700      0.309  6.46  171       0.261 0.00591
    #>  4 {HAND WAR… {PAPER CH… 0.00695      0.304  6.34  170       0.256 0.00586
    #>  5 {HAND WAR… {PAPER CH… 0.00794      0.292  6.10  194.      0.244 0.00663
    #>  6 {HAND WAR… {HOT WATE… 0.00548      0.276  8.53  134       0.243 0.00484
    #>  7 {HAND WAR… {HOT WATE… 0.00593      0.262  9.89  145       0.236 0.00533
    #>  8 {HAND WAR… {HOT WATE… 0.00565      0.246  9.30  138       0.220 0.00504
    #>  9 {HAND WAR… {HOT WATE… 0.00679      0.250  7.72  166       0.217 0.00591
    #> 10 {HAND WAR… {HOT WATE… 0.00560      0.248  7.67  137       0.215 0.00487
    #> # ... with 11 more rows

    Looks like HOT WATER BOTTLES are associated with HAND WARMER. For a single rule, the highest life and addedValue is: HAND WARMER RED LOVE HEART}

    {HOT WATER BOTTLE KEEP CALM}

    分类: R代写