the discovery of frequent itemsets = discovering “association rules”!
<Course Outline>
1) “market-basket” model
2) First: Define
• Frequent itemsets
• Association rules: Confidence, Support, Interestingness
3) Then: Algorithms for finding frequent itemsets
• Finding frequent pairs
• A-Priori algorithm
• PCY algorithm
📎 “frequent”
⇒ a number s, called the support threshold
I : set of items → support for I : the number of baskets that contain I as a subset
if support for I ≥ s : I is frequent!
Example 6.1 : (items are words; there are 8 baskets, each a set of words)
Since the empty set is a subset of any set, the support for ∅ is 8. ⇒ But in general we don't care about the empty set, since it tells us nothing.
Example) “Dog” appears in every basket except (5) ⇒ the support of “Dog” is 7
Suppose that we set our threshold at s = 3 : Then there are five frequent singleton itemsets: {dog}, {cat}, {and}, {a}, and {training}
doubletons : a doubleton cannot be frequent unless both of its items are frequent by themselves
a frequent triple is only possible as a combination of frequent doubletons (every pair inside it must be a frequent doubleton)
In this example there is only one frequent triple, so there are no frequent quadruples or larger sets.
: If someone buys diapers and milk, then he/she is likely to buy beer.
: extracting frequent sets of items from data = presented as a collection of if-then rules called association rules
* association rules
form : I → j, where I : a set of items, j : an item
means: “if a basket contains all of I, then it is likely to contain j”
In practice there are many rules, want to find significant/interesting ones!
1) Defining the confidence of the rule I → j
📌 conf(I → j) = support(I ∪ {j}) / support(I)
: the fraction of the baskets containing all of I that also contain j
= given that I appears in a basket, how often does j appear along with it?
Example) In Fig 6.1, the confidence of the rule {cat, dog} → and is 3/5.
If the support of I is fairly large, then the confidence alone can already be quite useful.
However, an association rule matters more when it reflects a genuine relationship in which I (somehow) influences j.
2) Thus, define the interest of an association rule I → j : the difference between (the confidence of the rule) and (the fraction of baskets that contain j)
📌 interest(I → j) = conf(I → j) − (fraction of baskets containing j)
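A minimal sketch of these three definitions in Python, on made-up baskets (the helper names `support`, `confidence`, `interest` and the sample data are mine, for illustration only):

```python
def support(itemset, baskets):
    """support(I): the number of baskets that contain I as a subset."""
    return sum(1 for b in baskets if itemset <= b)

def confidence(I, j, baskets):
    """conf(I -> j) = support(I ∪ {j}) / support(I)."""
    return support(I | {j}, baskets) / support(I, baskets)

def interest(I, j, baskets):
    """interest(I -> j) = conf(I -> j) - (fraction of baskets containing j)."""
    return confidence(I, j, baskets) - support({j}, baskets) / len(baskets)

baskets = [{"cat", "dog", "a"}, {"cat", "and", "dog"}, {"cat"}, {"dog", "a"}]
print(confidence({"cat"}, "dog", baskets))  # 2/3: "cat" in 3 baskets, {cat, dog} in 2
print(interest({"cat"}, "dog", baskets))    # 2/3 - 3/4 = -1/12
```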
To find association rules that apply to a reasonable fraction of the baskets,
1) the support of I must be reasonably high.
2) the confidence of the rule must be reasonably high.
💡 Problem: Find all association rules with support ≥ s and confidence ≥ c
→ This means: support(I ∪ {j}) ≥ s, and conf(I → j) = support(I ∪ {j}) / support(I) ≥ c
why?
Suppose) 1. s and c : given to us by the user or by the data analyst.
→ then we can find the association rules with high support && high confidence
That is) if J is a frequent itemset (support ≥ s) with n items,
→ there are only n possible association rules involving this set of items
= J − {j} → j for every j in J (n rules in total)
→ if J is frequent (≥ s), then J − {j} is also frequent (≥ s) by monotonicity
Why? If the rule J − {j} → j has high support and confidence, then both J and J − {j} will be “frequent”
⇒ the ratio of the supports of J and J − {j}
= the confidence of the rule J − {j} → j : conf(J − {j} → j) = support(J) / support(J − {j})
Assumed that) there are not too many frequent itemsets and thus
not too many candidates for high-support, high-confidence association rules. → It is common to adjust the support threshold s so that we do not get too many frequent itemsets!
* How do we build the algorithm?
Step 1) Find all frequent itemsets
(we will explain this next)
Step 2) Rule generation
why? support({A,B}) ≥ support({A,B,C}) (monotonicity)
conf(A,B,C→D) = support(A,B,C,D) / support(A,B,C)
conf(A,B→C,D) = support(A,B,C,D) / support(A,B)
∴ in every case, conf(A,B,C→D) ≥ conf(A,B→C,D)
- If A,B,C→D is below the confidence threshold, then so is A,B→C,D ⇒ can generate “bigger” rules from smaller ones!
Output the rules above the confidence threshold
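A hedged sketch of Step 2 in Python, assuming Step 1 has produced `frequent`, a dict from each frequent itemset (as a frozenset) to its support count (the function name and the sample data are mine):

```python
def generate_rules(frequent, c):
    """For each frequent itemset J, emit the rules J - {j} -> j (one per j in J)
    whose confidence support(J) / support(J - {j}) meets the threshold c."""
    rules = []
    for J, supp_J in frequent.items():
        if len(J) < 2:
            continue
        for j in J:
            left = J - {j}                 # frequent by monotonicity, so it's in the dict
            conf = supp_J / frequent[left]
            if conf >= c:
                rules.append((left, j, conf))
    return rules

frequent = {frozenset({"m"}): 5, frozenset({"b"}): 6, frozenset({"m", "b"}): 4}
print(generate_rules(frequent, c=0.7))  # [({'m'}, 'b', 0.8)]; b -> m has conf 4/6 < 0.7
```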
→ Finding Frequent Itemsets
1) Naive Algorithm : read the data file once and count how many times each pair occurs. → infeasible: with n items there are n(n−1)/2 pairs, and their counts may not fit in main memory.
2) Counting Pairs in Memory
Approach 1) Triangular Matrix : count all possible pairs in a triangular array (suited to dense data)
Approach 2) Triples : keep a table of triples [i, j, c] = “the count of the pair {i, j} is c” (suited to sparse data)
→ Decide which method to use based on “how many different pairs actually occurred in the data”.
& Approach 2 beats Approach 1 if fewer than 1/3 of the possible pairs actually occur.
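As a sketch of Approach 1, the triangular-matrix layout from MMDS stores the count of pair {i, j} (1 ≤ i < j ≤ n) at position k = (i−1)(n − i/2) + j − i of a one-dimensional array; here is that index, 0-based, in Python:

```python
def triangular_index(i, j, n):
    """Slot of pair {i, j}, 1 <= i < j <= n, in a 1-D array of n(n-1)/2 counts."""
    assert 1 <= i < j <= n
    return (i - 1) * (2 * n - i) // 2 + (j - i) - 1  # (i-1)(n - i/2) + j - i, 0-based

n = 5
counts = [0] * (n * (n - 1) // 2)       # one counter per possible pair
counts[triangular_index(2, 4, n)] += 1  # increment the count of pair {2, 4}
```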
💥 Problem : if we have too many items so the pairs do not fit into memory. Can we do better? → A-Priori Algorithm !
Key idea
: Monotonicity of “frequent” ⇒ if a set of items I appears at least s times, then every subset J of I also appears at least s times. J cannot be less frequent than I.
Contrapositive ⇒ if an item i does not appear in s baskets (does not reach the support threshold), then no set containing i can appear in s or more baskets.
method
: from frequent singletons to frequent pairs!
Pass 1) Read baskets and count in main memory the # of occurrences of each individual item
→ Items that appear ≥ s times are the frequent items
Pass 2) Read baskets again and keep track of the count of only those pairs where both elements are frequent (from Pass 1)
→ Requires memory proportional to square of frequent items only (not all items)
→ Generalizable to k-tuples
: singletons → pairs → triples → ... → k-tuples
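A minimal two-pass A-Priori sketch for frequent pairs in Python (the function name `a_priori_pairs` and the toy baskets are mine; later sketches below reuse this function):

```python
from collections import defaultdict
from itertools import combinations

def a_priori_pairs(baskets, s):
    # Pass 1: count occurrences of each individual item.
    item_counts = defaultdict(int)
    for basket in baskets:
        for item in basket:
            item_counts[item] += 1
    frequent_items = {i for i, c in item_counts.items() if c >= s}

    # Pass 2: count only pairs in which BOTH elements are frequent.
    pair_counts = defaultdict(int)
    for basket in baskets:
        for pair in combinations(sorted(set(basket) & frequent_items), 2):
            pair_counts[pair] += 1
    return {p: c for p, c in pair_counts.items() if c >= s}

baskets = [{"b", "m"}, {"b", "c", "m"}, {"b", "c"}, {"c", "m"}]
print(a_priori_pairs(baskets, s=2))  # {('b','c'): 2, ('b','m'): 2, ('c','m'): 2}
```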
Example)
The candidates {b,c,j}, {b,m,j}, {c,m,j} in C3 cannot be frequent.
why? Take {b,c,j}: the pairs {b,c} and {c,j} are frequent, but {b,j} is not (by L2, the set of frequent pairs).
⇒ {b,c,j} cannot be frequent, so we don't even need to generate it as a candidate.
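A sketch of this pruning when building C3 from L2 (the function name is mine; L2 is the set of frequent pairs, as in the example):

```python
from itertools import combinations

def candidate_triples(L2):
    """C3: triples all of whose 2-item subsets are frequent pairs (monotonicity)."""
    items = sorted({x for pair in L2 for x in pair})
    return {frozenset(t) for t in combinations(items, 3)
            if all(frozenset(p) in L2 for p in combinations(t, 2))}

L2 = {frozenset(p) for p in [("b", "c"), ("b", "m"), ("c", "m"), ("c", "j")]}
print(candidate_triples(L2))  # only {b, c, m}; {b, c, j} is pruned since {b, j} ∉ L2
```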
→ PCY Algorithm: use a hash table during Pass 1 to cut down on the size of the candidate set
= Also known as “DHP (Direct Hashing and Pruning)”
FOR (each basket) :
    FOR (each item in the basket) :
        add 1 to item's count;
    // new in PCY: the hashing process
    FOR (each pair of items in the basket) :
        hash the pair to a bucket;
        add 1 to the count for that bucket;
PCY - Hash Table
Between Passes
Observation:
1) If a bucket contains a frequent pair, then the bucket is surely frequent
2) But, for a bucket with total count less than s, none of its pairs can be frequent 🙂
cf> When building the hash table, (if memory allows) it is better to have as few collisions as possible!
implementation : Replace the buckets by a bit-vector!
→ 1 means the bucket count exceeded the support s (call it a frequent bucket); 0 means it did not
→ replacing the 4-byte integer counts with a bit-vector means only 1/32 of the memory is needed!
Moving from Pass 1 to Pass 2, the hash table is summarized as a bitmap recording whether each bucket is a frequent bucket (count ≥ s).
Pass 2
Count all pairs {i, j} that meet the conditions for being a candidate pair:
⇒ track a pair {i, j} only when both conditions hold :
1) both i and j are frequent items (from Pass 1)
2) the pair {i, j} hashes to a frequent bucket (its bit in the bitmap is 1)
Both conditions are necessary for the pair to have a chance of being frequent.
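Putting both passes together, a minimal end-to-end PCY sketch in Python (names like `pcy_frequent_pairs` and `num_buckets` are mine; Python's built-in `hash` stands in for the bucket hash, and the bitmap is a list of booleans rather than a packed bit-vector):

```python
from collections import defaultdict
from itertools import combinations

def pcy_frequent_pairs(baskets, s, num_buckets):
    # Pass 1: count items AND hash every pair in each basket to a bucket.
    item_counts = defaultdict(int)
    bucket_counts = [0] * num_buckets
    for basket in baskets:
        for item in basket:
            item_counts[item] += 1
        for pair in combinations(sorted(set(basket)), 2):
            bucket_counts[hash(pair) % num_buckets] += 1

    frequent_items = {i for i, c in item_counts.items() if c >= s}
    # Between passes: summarize bucket counts as a bitmap of frequent buckets.
    bitmap = [c >= s for c in bucket_counts]

    # Pass 2: count {i, j} only if both items are frequent AND the pair
    # hashes to a frequent bucket.
    pair_counts = defaultdict(int)
    for basket in baskets:
        for pair in combinations(sorted(set(basket) & frequent_items), 2):
            if bitmap[hash(pair) % num_buckets]:
                pair_counts[pair] += 1
    return {p: c for p, c in pair_counts.items() if c >= s}

baskets = [{"b", "c"}, {"b", "c", "j"}, {"b", "m"}]
print(pcy_frequent_pairs(baskets, s=2, num_buckets=16))  # {('b', 'c'): 2}
```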
Pass 1 : (assume the support threshold is s = 3)
Item's Count : count how many times each item appears.
* Item Count result

| Item | Count |
| --- | --- |
| 1 | 5 |
| 2 | 5 |
| 3 | 6 |
| 4 | 2 |
| 5 | 4 |
Make Hash Table for bucket counts
step 1) generate all possible pairs within each basket
step 2) hash each pair into the hash table
→ here we define the hashing rule as
: hash a pair {i, j} to a bucket k, where k = hash(i, j) = (i + j) mod 5
For example, hashing {1,2} gives (1+2) mod 5 = 3, so the pair goes into bucket 3.
(1, 4) and (2, 3) -> k = 0
(1, 5) and (2, 4) -> k = 1
(2, 5) and (3, 4) -> k = 2
(1, 2) and (3, 5) -> k = 3
(1, 3) and (4, 5) -> k = 4
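A quick check of these bucket assignments in Python:

```python
# Toy bucket hash from this example: k = (i + j) mod 5
for i, j in [(1, 4), (2, 3), (1, 5), (2, 4), (2, 5), (3, 4), (1, 2), (3, 5), (1, 3), (4, 5)]:
    print(f"({i},{j}) -> bucket {(i + j) % 5}")
```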
* Hash Table result

| Bucket | Count |
| --- | --- |
| 0 | 6 |
| 1 | 2 |
| 2 | 4 |
| 3 | 4 |
| 4 | 5 |
Pass 2 :
Frequent items : {1, 2, 3, 5} (by Pass 1 - item counts)
→ Accordingly, the possible candidate pairs are (1,2) (1,3) (1,5) (2,3) (2,5) (3,5)
⭐ (1,5) is discarded : because bucket 1 is not frequent! (by Pass 1 - hash table)
→ Surviving pairs = (1,2) (1,3) (2,3) (2,5) (3,5)
→ Counts of the Surviving Pairs
| Pair | Count |
| --- | --- |
| (1,2) | 2 |
| (1,3) | 4 |
| (2,3) | 4 |
| (2,5) | 3 |
| (3,5) | 2 |
⇒ Result : the frequent itemsets (count ≥ s = 3) are {1} {2} {3} {5} {1,3} {2,3} {2,5}
The MMDS book covers several other extensions beyond the PCY idea: “Multistage” and “Multihash”
: Can we use fewer passes? (frequent itemsets in ≤ 2 passes)
* Frequent Itemsets in ≤ 2 Passes
Use 2 or fewer passes for all sizes, but may miss some frequent itemsets
* Random Sampling
: 1) Take a random sample of the market baskets
→ 2) Run A-Priori or one of its improvements in main memory, with the support threshold scaled down in proportion to the sample (e.g. p·s for a sample of fraction p)
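A hedged sketch of the sampling idea in Python, reusing the `a_priori_pairs` sketch from above (the function name and the default fraction are mine):

```python
import random

def sampled_frequent_pairs(baskets, s, p=0.01):
    # Keep each basket with probability p, then run the in-memory algorithm
    # with the support threshold scaled down to roughly p * s.
    sample = [b for b in baskets if random.random() < p]
    return a_priori_pairs(sample, max(1, round(p * s)))
```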
* SON Algorithm
: Repeatedly read small subsets of the baskets into main memory and run an in-memory algorithm to find all frequent itemsets
Pass 1) An itemset becomes a candidate if it is found to be frequent in any one or more subsets of the baskets. (: make candidate itemsets)
Pass 2) Count all the candidate itemsets and determine which are frequent in the entire set. (: verify)
→ The SON algorithm lends itself to a parallel-computing environment: each pass can be expressed as a MapReduce job, so the whole algorithm runs as a two-stage MapReduce-MapReduce sequence.
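A hedged single-machine sketch of SON's two passes in Python (names are mine; each chunk plays the role of one MapReduce map task, and `a_priori_pairs` from above is the in-memory algorithm):

```python
from collections import defaultdict

def son_frequent_pairs(baskets, s, num_chunks):
    chunks = [baskets[i::num_chunks] for i in range(num_chunks)]

    # Pass 1: anything frequent in at least one chunk becomes a candidate;
    # each chunk holds ~1/num_chunks of the data, so scale the threshold down.
    candidates = set()
    for chunk in chunks:
        candidates |= set(a_priori_pairs(chunk, max(1, s // num_chunks)))

    # Pass 2: count every candidate over the entire dataset and verify.
    counts = defaultdict(int)
    for basket in baskets:
        for pair in candidates:
            if set(pair) <= set(basket):
                counts[pair] += 1
    return {p: c for p, c in counts.items() if c >= s}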
* Toivonen's Algorithm
Pass 1)
Start with a random sample
Find frequent itemsets in the sample
Construct the negative border (= itemsets that are not frequent in the sample, but all of whose immediate subsets are frequent in the sample)
Pass 2) Over the entire dataset, count all candidate frequent itemsets from the first pass, and also count the sets in their negative border. (If a set in the negative border turns out to be frequent, repeat with a new sample.)
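A hedged sketch of the negative border, restricted to itemsets of size ≤ 2 for brevity (names are mine; `frequent` is the collection of itemsets found frequent in the sample, as frozensets):

```python
from itertools import combinations

def negative_border(frequent, items):
    """Itemsets NOT frequent in the sample whose immediate subsets all are."""
    border = set()
    # Non-frequent singletons: their only subset, ∅, is trivially frequent.
    for x in items:
        if frozenset([x]) not in frequent:
            border.add(frozenset([x]))
    # Non-frequent pairs whose two singletons are both frequent.
    for i, j in combinations(sorted(items), 2):
        pair = frozenset([i, j])
        if pair not in frequent and all(frozenset([x]) in frequent for x in pair):
            border.add(pair)
    return border

frequent = {frozenset({"a"}), frozenset({"b"}), frozenset({"a", "b"})}
print(negative_border(frequent, items={"a", "b", "c"}))  # {frozenset({'c'})}
```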
※ In general, Pass 1 builds candidate itemsets that could be frequent → Pass 2 verifies the candidates against the entire dataset!