Market Basket Analysis
Market basket analysis is a technique retailers use to find patterns in customer behaviour from their transaction history. It asks: if a customer buys item A, how likely is that customer to buy item B along with it?
With Market Basket Analysis (MBA) we can discover which items are more or less likely to be bought together. Based on this analysis of customers' transaction history, we can answer questions such as:
- Which items should be placed together on the shelves, paired in a sale, and/or rewarded with more points?
- Which customers should we target with which ads?
- How can we increase the sales of a specific item?
The data set I am using for this analysis contains information about customers buying different grocery items at a mall.
The analysis was done using the Python programming language.
import pandas as pd
import seaborn as sns
import numpy as np
from pandas import DataFrame
import matplotlib.pyplot as plt
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
Reading the data set
basket = pd.read_csv(r"D:\building a portfolio\Market Basket Analysis\Market_Basket_Optimisation.csv", header = None)
basket.head()
basket.shape
#Checking the data type
print(basket.dtypes)
Data Preparation
Before we can apply the Apriori algorithm, we need to reshape the data set.
#Converting the data frame into a list of lists (empty cells become the string 'nan')
records = []
n_rows, n_cols = basket.shape
for i in range(n_rows):
    records.append([str(basket.values[i, j]) for j in range(n_cols)])
Using TransactionEncoder we can transform this data set into a logical (Boolean) data frame. Each column represents an item and each row represents one transaction, that is, one purchase.
A cell is True if the transaction contains the item and False if not.
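To make the encoding concrete, here is a minimal pure-Python sketch of the same idea on a toy basket. It is not mlxtend's implementation; the function name one_hot_encode and the three transactions are made up for illustration.

```python
# Toy transactions (hypothetical data, not taken from the article's data set).
transactions = [
    ["milk", "bread"],
    ["bread", "eggs"],
    ["milk", "bread", "eggs"],
]

def one_hot_encode(transactions):
    """Return (sorted item names, Boolean rows), one row per transaction."""
    # Every distinct item becomes one column.
    items = sorted({item for t in transactions for item in t})
    # A cell is True when the transaction contains that column's item.
    rows = [[item in set(t) for item in items] for t in transactions]
    return items, rows

columns, encoded = one_hot_encode(transactions)
print(columns)
for row in encoded:
    print(row)
```

Each printed row lines up with the column list, so the first row says that the first purchase contained bread and milk but no eggs.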
TE = TransactionEncoder()
array = TE.fit(records).transform(records)
#building the data frame: rows are transactions and columns are the items that have been purchased
transf_df = pd.DataFrame(array, columns = TE.columns_)
transf_df
#check column names
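for col in transf_df.columns:
    print(col)
One of the columns is named nan. It comes from the empty cells of shorter transactions, which were converted to the string 'nan', so we need to drop it from the data frame.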
#drop nan column
basket_clean = transf_df.drop(['nan'], axis = 1)
basket_clean
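As an alternative sketch, the missing cells could be skipped while building the list of lists, so that no nan column is created in the first place. The helper clean_row and the sample rows below are hypothetical and only illustrate the idea.

```python
import math

def clean_row(row):
    """Keep the non-missing cells of one row as strings."""
    return [str(v) for v in row
            if not (isinstance(v, float) and math.isnan(v))]

# Hypothetical rows padded with NaN, mimicking a CSV with lines of unequal length.
nan = float("nan")
raw_rows = [
    ["milk", "bread", nan],
    ["eggs", nan, nan],
]

records_no_nan = [clean_row(r) for r in raw_rows]
print(records_no_nan)
```

Applied to the real data, the same helper could be mapped over basket.values.tolist(); the drop step above would then no longer be needed.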
Let us take a look at the most popular items in our data set.
#most popular items
count = basket_clean.sum()
pop_item = count.sort_values(ascending = False).head(10)
pop_item = pop_item.to_frame()
pop_item = pop_item.reset_index()
pop_item = pop_item.rename(columns = {"index": "items", 0: "count"})
#Data Visualization
%matplotlib inline
plt.rcParams['figure.figsize'] = (10, 6)
plt.style.use('dark_background')
ax = pop_item.plot.barh(x = 'items', y = 'count')
plt.title('Most popular items')
plt.gca().invert_yaxis()
Data Processing:
For this case study I used the Apriori algorithm.
Now that the data is ready, we first need to define some terms.
Association mining searches for item-sets that occur frequently in the data set. Frequent item-set mining shows which items appear together in a transaction or relation.
Association rules are if-then statements that describe how likely it is that data items occur together in large data sets.
How to measure the association rules?
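1. Support: how frequent an item-set is across all transactions. It is the ratio of the number of transactions in which item X appears to the total number of transactions.
2. Confidence: how likely item Y is to be purchased when item X is purchased. It is the share of the transactions containing X that also contain Y.
3. Lift: how likely item Y is to be purchased when item X is purchased, while controlling for how popular item Y is. It is the confidence of the rule divided by the support of Y, so a lift above 1 means X and Y are bought together more often than their individual popularity would suggest.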
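The three measures can be computed by hand on a small example. The sketch below uses five made-up transactions and plain Python; the function names are illustrative and are not part of mlxtend.

```python
# Toy transactions (hypothetical, for illustration only).
transactions = [
    {"milk", "bread"},
    {"milk", "eggs"},
    {"milk", "bread", "eggs"},
    {"bread"},
    {"eggs"},
]

def support(itemset, transactions):
    """Share of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y, transactions):
    """Share of the transactions containing x that also contain y."""
    return support(set(x) | set(y), transactions) / support(x, transactions)

def lift(x, y, transactions):
    """Confidence of x -> y divided by the support of y."""
    return confidence(x, y, transactions) / support(y, transactions)

print(support({"milk", "bread"}, transactions))
print(confidence({"milk"}, {"bread"}, transactions))
print(lift({"milk"}, {"bread"}, transactions))
```

Here milk and bread appear together in 2 of 5 transactions, and the lift of the rule milk -> bread comes out slightly above 1, which indicates a weak positive association in this toy data.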
Apriori Algorithm
The Apriori algorithm uses frequent item-sets to generate association rules. It is based on the principle that every subset of a frequent item-set must itself be frequent, so any item-set with an infrequent subset can be discarded without counting it. (A frequent item-set is an item-set whose support reaches a specified minimum threshold.)
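The level-wise search behind this idea can be sketched in a few lines of plain Python. This is a simplified illustration under my own assumptions (the function name frequent_itemsets, the toy data, and a "support >= threshold" test are all mine); it is not how mlxtend implements apriori.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Level-wise search for item-sets whose support is >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s

    result = dict(current)
    k = 2
    while current:
        prev = set(current)
        # Join step: unions of frequent (k-1)-item-sets that have exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: keep a candidate only if all its (k-1)-subsets are frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # Count support only for the surviving candidates.
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        result.update(current)
        k += 1
    return result

# Toy data (hypothetical).
toy = [{"milk", "bread"}, {"milk", "eggs"}, {"milk", "bread", "eggs"},
       {"bread"}, {"eggs"}]
found = frequent_itemsets(toy, min_support=0.4)
for itemset, s in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

In the toy run, the three-item set is never counted: its subset of bread and eggs is infrequent, so the prune step removes it first. That shortcut is what keeps the search manageable on larger data sets.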
#I chose 0.04 minimum support
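a_rules = apriori(basket_clean, min_support = 0.04, use_colnames = True)
a_rules.head()
In total, apriori returned 35 frequent item-sets with 0.04 minimum support.
Next, we generate the association rules from these item-sets, using lift as the metric with a minimum threshold of 1.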
rules = association_rules(a_rules, metric = 'lift', min_threshold = 1)
rules
We can interpret the result of the first rule as:
The support is 0.05, calculated by dividing the number of transactions containing both mineral water and chocolate by the total number of transactions.
The confidence of 0.32 shows that, out of all the transactions that contain chocolate, 32% contain mineral water too.
The lift of 1.348 tells us that a customer who buys chocolate is 1.348 times more likely to buy mineral water than a customer picked at random.
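Because lift is the confidence divided by the support of the consequent, the two figures quoted for this rule also imply how popular mineral water is overall. The short check below uses only the rounded values from the text, so the result is approximate.

```python
# Rounded values quoted in the text for the rule chocolate -> mineral water.
confidence = 0.32
lift = 1.348

# lift = confidence / support(consequent), so rearranging gives:
support_mineral_water = confidence / lift
print(round(support_mineral_water, 3))
```

This suggests that mineral water appears in roughly 24% of all transactions, which fits with it being the most popular item in the data set.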
Conclusion
- The most popular item in this data set is mineral water, followed by eggs and spaghetti.
- By applying the Apriori algorithm and association rules, we gain better insight into which items are more likely to be bought together.
Data source