Apriori algorithm




The Apriori algorithm was the first algorithm proposed for frequent itemset mining. It was introduced by R. Agrawal and R. Srikant in 1994.

The algorithm is named Apriori because it uses prior knowledge of frequent itemset properties.

Frequent Item Set

  • A frequent itemset is an itemset whose support is greater than or equal to a given minimum support threshold (min_sup).

The Apriori algorithm uses frequent itemsets to generate association rules. To make the level-wise generation of frequent itemsets efficient, it relies on an important property, called the Apriori property, which reduces the search space.
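
For example, with support expressed as an absolute count, the support of an itemset is simply the number of transactions that contain all of its items. A minimal Python sketch (the sample_transactions list here is illustrative only, not part of the worked example below):

def support_count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t)

sample_transactions = [
    {"I1", "I2", "I3"},
    {"I2", "I3", "I4"},
    {"I4", "I5"},
]
print(support_count({"I2", "I3"}, sample_transactions))  # 2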

Apriori Property

  • All subsets of a frequent itemset must themselves be frequent.
  • Equivalently, if an itemset is infrequent, all of its supersets will be infrequent; this is what allows candidates to be pruned without counting them (see the sketch below).
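
The second point is what makes Apriori efficient in practice: once a k-itemset is known to be infrequent, every (k+1)-candidate containing it can be discarded without scanning the database. A small Python sketch of that subset check (the helper name has_infrequent_subset is an illustrative choice):

from itertools import combinations

def has_infrequent_subset(candidate, frequent_k_itemsets):
    """Apriori property: a (k+1)-candidate can be frequent only if
    every one of its k-item subsets is frequent."""
    k = len(candidate) - 1
    return any(frozenset(sub) not in frequent_k_itemsets
               for sub in combinations(candidate, k))

# If {I1, I4} is infrequent (absent from L2), {I1, I2, I4} can be pruned:
L2 = {frozenset(s) for s in [("I1", "I2"), ("I1", "I3"), ("I2", "I3"), ("I2", "I4")]}
print(has_infrequent_subset(("I1", "I2", "I4"), L2))  # True -> prune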

Steps in Apriori algorithm

The Apriori algorithm is a sequence of steps followed to find all frequent itemsets in a given database. A minimum support threshold is either given in the problem or assumed by the user.

The steps followed in the Apriori algorithm are:

Join Step: This step generates (k+1)-itemset candidates from the frequent k-itemsets by joining L_k with itself.

Prune Step: This step scans the database to obtain the support count of each candidate. If a candidate does not meet the minimum support, it is regarded as infrequent and removed. This step reduces the size of the candidate set.

The join and prune steps are repeated iteratively until no new frequent itemsets can be generated.
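
Putting the two steps together, a compact, unoptimized Python sketch of this level-wise loop might look as follows (the apriori function and the set-based representation are illustrative choices, not a reference implementation):

from itertools import combinations

def apriori(transactions, min_sup):
    """Level-wise frequent itemset mining (minimal, unoptimized sketch).
    transactions: list of sets of items; min_sup: absolute support count.
    Returns {frozenset(itemset): support count} for all frequent itemsets."""
    # C1 -> L1: count single items and keep those meeting min_sup.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_sup}
    all_frequent = dict(frequent)
    k = 1
    while frequent:
        # Join step: union pairs of frequent k-itemsets into (k+1)-candidates.
        candidates = {a | b for a in frequent for b in frequent
                      if len(a | b) == k + 1}
        # Apriori-property prune: drop candidates with an infrequent k-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k))}
        # Prune step: count candidates against the database, keep >= min_sup.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: n for s, n in counts.items() if n >= min_sup}
        all_frequent.update(frequent)
        k += 1
    return all_frequent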


[Figure: Apriori algorithm in data mining]

Apriori Algorithm Example

Consider the following dataset; find the frequent itemsets and generate association rules for them. Assume a minimum support threshold of s = 50% and a minimum confidence threshold of c = 80%.

Transaction   List of items
T1            I1, I2, I3
T2            I2, I3, I4
T3            I4, I5
T4            I1, I2, I4
T5            I1, I2, I3, I5
T6            I1, I2, I3, I4

Solution

Finding frequent itemsets:

Support threshold = 50% ⇒ 0.5 × 6 = 3 ⇒ min_sup = 3

Step-1:

(i) Create a table containing the support count of each item present in the dataset, called C1 (the candidate set).

Item   Count
I1     4
I2     5
I3     4
I4     4
I5     2

(ii) Prune Step: Compare each candidate's support count with the minimum support count. The table above shows that I5 does not meet min_sup = 3, so it is removed; only I1, I2, I3, and I4 meet the min_sup count.

This gives us L1, the set of frequent 1-itemsets:

Item   Count
I1     4
I2     5
I3     4
I4     4
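
These counts can be reproduced by tallying each item across the six transactions. A short verification sketch, representing the dataset as Python sets:

from collections import Counter

transactions = [
    {"I1", "I2", "I3"},        # T1
    {"I2", "I3", "I4"},        # T2
    {"I4", "I5"},              # T3
    {"I1", "I2", "I4"},        # T4
    {"I1", "I2", "I3", "I5"},  # T5
    {"I1", "I2", "I3", "I4"},  # T6
]
c1 = Counter(item for t in transactions for item in t)
l1 = {item: n for item, n in c1.items() if n >= 3}
print(c1)  # counts: I1=4, I2=5, I3=4, I4=4, I5=2
print(l1)  # I5 is dropped: its count of 2 is below min_sup = 3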

Step-2:

(i) Join step: Generate the candidate set C2 (2-itemsets) by joining L1 with itself, and find the occurrences of each 2-itemset in the given dataset.

Itemset    Count
{I1, I2}   4
{I1, I3}   3
{I1, I4}   2
{I2, I3}   4
{I2, I4}   3
{I3, I4}   2
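
The 2-itemset counts above can be checked the same way, by counting every pair of L1 items that occurs together in a transaction (continuing the sketch from Step-1):

from itertools import combinations
from collections import Counter

l1_items = {"I1", "I2", "I3", "I4"}  # frequent items from Step-1
c2 = Counter()
for t in transactions:
    for pair in combinations(sorted(t & l1_items), 2):
        c2[pair] += 1
print(c2)  # (I1,I2)=4, (I1,I3)=3, (I1,I4)=2, (I2,I3)=4, (I2,I4)=3, (I3,I4)=2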

(ii) Prune Step: Compare each candidate's support count with the minimum support count. The table above shows that the itemsets {I1, I4} and {I3, I4} do not meet min_sup = 3, so they are removed.

This gives us L2, the set of frequent 2-itemsets:

Itemset    Count
{I1, I2}   4
{I1, I3}   3
{I2, I3}   4
{I2, I4}   3

Step-3:

(i) Join step: Generate the candidate set C3 (3-itemsets) by joining L2 with itself, and find the occurrences of each 3-itemset in the given dataset.

Itemset        Count
{I1, I2, I3}   3
{I1, I2, I4}   2
{I1, I3, I4}   1
{I2, I3, I4}   2

(ii) Prune Step: Compare each candidate's support count with the minimum support count. The table above shows that the itemsets {I1, I2, I4}, {I1, I3, I4} and {I2, I3, I4} do not meet min_sup = 3, so they are removed; only the itemset {I1, I2, I3} meets the min_sup count. (In fact, the Apriori property would let us prune all three of these candidates before counting, since each contains an infrequent 2-subset: {I1, I4} or {I3, I4}.)
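
Running the apriori sketch from the Steps section on this dataset reproduces all three levels at once (a usage example that assumes the apriori function and the transactions list defined above):

result = apriori(transactions, min_sup=3)
for itemset in sorted(result, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), result[itemset])
# L1: I1=4, I2=5, I3=4, I4=4
# L2: {I1,I2}=4, {I1,I3}=3, {I2,I3}=4, {I2,I4}=3
# L3: {I1,I2,I3}=3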

Generate Association Rules:

Thus, we have discovered all the frequent itemsets. Now we need to generate strong association rules (rules that satisfy the minimum confidence threshold) from the frequent itemsets. For that, we calculate the confidence of each rule.

The given confidence threshold is 80%.

All possible association rules from the frequent itemset {I1, I2, I3} are:

{I1, I2} ⇒ {I3}

Confidence = support{I1, I2, I3} / support{I1, I2} = (3/4) × 100 = 75% (Rejected)


{I1, I3} ⇒ {I2}

Confidence = support{I1, I2, I3} / support{I1, I3} = (3/3) × 100 = 100% (Selected)


{I2, I3} ⇒ {I1}

Confidence = support{I1, I2, I3} / support{I2, I3} = (3/4) × 100 = 75% (Rejected)


{I1} ⇒ {I2, I3}

Confidence = support{I1, I2, I3} / support{I1} = (3/4) × 100 = 75% (Rejected)


{I2} ⇒ {I1, I3}

Confidence = support{I1, I2, I3} / support{I2} = (3/5) × 100 = 60% (Rejected)


{I3} ⇒ {I1, I2}

Confidence = support{I1, I2, I3} / support{I3} = (3/4) × 100 = 75% (Rejected)


This shows that only the association rule {I1, I3} ⇒ {I2} is strong when the minimum confidence threshold is 80%.
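
These six calculations can be automated by enumerating every non-empty proper subset of the frequent itemset as a rule antecedent (a sketch that reuses the support_count helper and the transactions list from earlier; the function name rules_from_itemset is illustrative):

from itertools import combinations

def rules_from_itemset(itemset, transactions, min_conf):
    """Yield (antecedent, consequent, confidence) for strong rules:
    confidence = support(itemset) / support(antecedent)."""
    itemset = frozenset(itemset)
    whole = support_count(itemset, transactions)
    for r in range(1, len(itemset)):
        for ante in combinations(sorted(itemset), r):
            ante = frozenset(ante)
            conf = whole / support_count(ante, transactions)
            if conf >= min_conf:
                yield ante, itemset - ante, conf

for a, c, conf in rules_from_itemset({"I1", "I2", "I3"}, transactions, 0.8):
    print(sorted(a), "=>", sorted(c), f"{conf:.0%}")
# Only ['I1', 'I3'] => ['I2'] 100% survives the 80% threshold.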

Exercise 1: Apriori Algorithm

TID   Items
T1    I1, I2, I5
T2    I2, I4
T3    I2, I3
T4    I1, I2, I4
T5    I1, I3
T6    I2, I3
T7    I1, I3
T8    I1, I2, I3, I5
T9    I1, I2, I3

Consider the above dataset; find the frequent itemsets and generate association rules for them. Assume a minimum support count of 2 and a minimum confidence threshold of c = 50%.

Exercise 2: Apriori Algorithm

TID   Items
T1    {milk, bread}
T2    {bread, sugar}
T3    {bread, butter}
T4    {milk, bread, sugar}
T5    {milk, bread, butter}
T6    {milk, bread, butter}
T7    {milk, sugar}
T8    {milk, sugar}
T9    {sugar, butter}
T10   {milk, sugar, butter}
T11   {milk, bread, butter}

Consider the above dataset; find the frequent itemsets and generate association rules for them. Assume a minimum support count of 3 and a minimum confidence threshold of c = 60%.

