Santander Product Recommendation

Santander Bank is interested in improving its personal product recommendation system for its customers. The bank collects marketing information about its customers, and associates it with bank account products. Currently, customers receive recommendations, but the system will provide more recommendations to some customers while neglecting others. The goal is to create a more effective product recommendation system so that the individual needs of customers are met.

Problem Description

The Santander data set contains both training data and test data. The focus of this article will be in making sense of the training data. The training data contains 48 columns: 24 are marketing data related, and the other 24 represent the products customers currently own. The marketing data will be treated as the predictor variables, while the product data will be treated as the target variables. As a forewarning to the reader, all of the column headers are in Spanish.

Marketing Data

The marketing data contains mixed (numerical and categorical) data. The only two columns that provide numerical data are renta and age. The data could possibly contain naturally occurring market segments. Perhaps a clustering algorithm could be used to find the segments. The draw back is that a valid distance metric needs to be used for the mixed data types. The following table describes the marketing data columns.

Column names

Fecha_dato

The month from which the data originates

Fecha_alta

Date in which customers became as the first holder of a contract in the bank

Tiprel_1mes

Customer relation type at beginning of month A(active), I (inactive), P (former customer), R(potential)

Tipodom

Address type. 1, primary address

Ncodpers

Customer code

Ind_nuevo

New customer flag. 1 if the customer registered in the last 6 months

Indresi

Residence index (S (yes) or N (no) if the residence country is the same as the bank’s)

Cod_prov

Province code (customer’s address)

Ind_empleado

Employee index

A active, B ex employed, F filial, N not employee, P pasive

Antiguedad

Customer seniority (in months)

Indext

Foreigner index

Customer’s birth country same as bank country? (S (yes), N (no))

Nomprov

Province name

Pais_residencia

Customer’s country of residence

Indrel

1(first/primary), 99(primary customer during the month but not at the end of the month)

Conyuemp

Spouse index: 1 if the customer is the spouse of an employee

Ind_actividad_cliente

Activity index

(1, active customer; 0, inactive customer)

Sexo

Gender

Ult_fec_cli_1t

Last date as primary customer (if he isn’t at the end of the month)

Canal_entrada

Channel used by customer to join bank

Renta

Gross income of household

Age

Indrel_1mes

Customer type at beginning of month (1 (first/pimary customer), 2(co owner), P(potential), 3(former primary), 4 (former co owner)

Indfall

Deceased index

Segmento

Segmentation: 01 – VIP, 02 – individuals, 03 – college graduated

Product Data

The product data for each customer is represented using a feature vector. A one indicates if a customer has the product, and a zero indicates if the customer does not. Market basket analysis could be used to find association rules between one account and another. The association rules could then help the bank with cross-selling their products. Further, the association rules could be mined for each individual customer segment discovered from the clustering algorithm. The following table translates from Spanish to English the product columns.

Ind_ahor_fin_ult1

Savings account

Ind_ctma_fin_ult1

Particular account

Ind_ecue_fin_ult1

e-account

Ind_tjcr_fin_ult1

Credit card

Ind_aval_fin_ult1

Guarantees

Ind_ctop_fin_ult1

Particular account

Ind_fond_fin_ult1

Funds

Ind_valo_fin_ult1

Securities

Ind_cco_fin_ult1

Current accounts

Ind_ctpp_fin_ult1

Particular plus account

Ind_hip_fin_ult1

Mortgage

Ind_viv_fin_ult1

Home account

Ind_cder_fin_ult1

Derivative account

Ind_deco_fin_ult1

Short-term deposits

Ind_plan_fin_ult1

Pensions

Ind_nomina_ult1

Payroll

Ind_cno_fin_ult1

Payroll account

Ind_deme_fin_ult1

Medium-term deposits

Ind_pres_fin_ult1

Loans

Ind_nom_pens_ult1

Pensions

Ind_ctju_fin_ult1

Junior account

Ind_dela_fin_ult1

Long-term deposits

Ind_reca_fin_ult1

Taxes

Ind_recibo_ult1

Direct debit

Strategy

The strategy for the Santander product recommendation data set is to use clustering to discover customer segments, then to use the apriori algorithm to find bank account correlation rules for each customer segment. Since clustering algorithms rely on some notion of distance, a valid metric must be used for the mixed data types. Typically, the Euclidean distance between data points are calculated to enable clustering algorithms to work properly; however, since this data set contains categorical data, Euclidean distances will not work. Instead, the Gower distance (link to it) will be computed between data points.

The k-medoids clustering algorithm will be used instead of the k-means algorithm. This is because the centroids from the k-means clustering algorithm are sensitive to outliers, and can take on values not within the data set. K-medoids addresses these issues since it uses median values. K-modes and k-prototype were considered, but the algorithms were not readily available or implementable in Python and R. Since the number of customer segments is not known, the silhouette method will be used to discover the optimal number of clusters.

Analysis

Analysis for the Santander data set will be performed in Python and R. In addition to the analysis for the Santander data set, this article will explore the strengths and weaknesses of both languages in terms of implementation and ease. Further, this article will incorporate an aspect of super computing into the data analytics workflow.

Preprocessing The Data

The Santander analysis begins by loading in the data, and splitting the target and predictor data into variables. The Dataset class loads in the data and has subset functions to partition the data.

The data is then cleaned using functions in the CleanData class. Since the column data types are specified incorrectly, each column is formatted according to their observed data type. In addition, some of the values in the columns are not consistent. For example, indrel_1mes contains the value, “p”, but all of the other values are numerical ranging from 1 to 4. Because of this, the “p” values in the column are converted to the number 5. The data is then subdivided into months using the 17 unique values in the fecha_dato column. CSV files are written with corresponding data. The data is now ready to be processed.

import os
import sys
import csv
import numpy as np
import pandas as pd
class SantanderAnalysis(object):
    def __init__(self, filepath):
        self.pred_vars = pd.Series()
        self.clustees = pd.Series()
        self.clusters = []
        self.train_data = Dataset()
        self.train_data.data = self.train_data.load_data(filepath)
        self.get_pred_vars(self.train_data.data, 24, 1)
    
    def fix_data(self, data, column, old_dtype, new_dtype):
        #5, 8, 11, 15 have mixed types
        #age, antiguedad, indrel_1mes, conyuemp
        CleanData.format_col_dtype(CleanData, self.train_data, "age", "float")
        CleanData.format_col_dtype(CleanData, self.train_data, 
                                   "antiguedad", "float")
        CleanData.replace_col_value(CleanData, self.train_data, 
                                   "indrel_1mes", "P", "5")
        CleanData.format_col_dtype(CleanData, self.train_data, 
                                   "indrel_1mes", "float")
        CleanData.format_col_dtype(CleanData, self.train_data, 
                                   "conyuemp", "float")
    def separate_by_months(data):
        Dataset.write_split_data(Dataset, data, "fecha_dato")
    
    def subset_month_data(data, directory):
        # C:\\Users\Owner\Desktop\MGMT552\Santander\Months
        os.chdir(directory)
        sample = Dataset(data)
        sample_set = sample.sample_data(0.01)
        Dataset.write_split_data(Dataset, sample_set, "fecha_dato")
    
    def get_pred_vars(self, data, total_cols, clustee_col):
        print("Getting predictor data...")
        pred_cols = list(range(total_cols)) #predictor variable columns
        pred_cols.remove(clustee_col) # remove data ids
        self.pred_vars = data[pred_cols].copy()
        clustee = [clustee_col] #pandas copy function requires list
        self.clustees = data[clustee].values.tolist(

import os

import sys

import csv

import numpy as np

import pandas as pd

class SantanderAnalysis(object):

def __init__(self, filepath):

self.pred_vars = pd.Series()

self.clustees = pd.Series()

self.clusters = []

self.train_data = Dataset()

self.train_data.data = self.train_data.load_data(filepath)

self.get_pred_vars(self.train_data.data, 24, 1)

def fix_data(self, data, column, old_dtype, new_dtype):

#5, 8, 11, 15 have mixed types

#age, antiguedad, indrel_1mes, conyuemp

CleanData.format_col_dtype(CleanData, self.train_data, "age", "float")

CleanData.format_col_dtype(CleanData, self.train_data,

"antiguedad", "float")

CleanData.replace_col_value(CleanData, self.train_data,

"indrel_1mes", "P", "5")

CleanData.format_col_dtype(CleanData, self.train_data,

"indrel_1mes", "float")

CleanData.format_col_dtype(CleanData, self.train_data,

"conyuemp", "float")

def separate_by_months(data):

Dataset.write_split_data(Dataset, data, "fecha_dato")

def subset_month_data(data, directory):

# C:\\Users\Owner\Desktop\MGMT552\Santander\Months

os.chdir(directory)

sample = Dataset(data)

sample_set = sample.sample_data(0.01)

Dataset.write_split_data(Dataset, sample_set, "fecha_dato")

def get_pred_vars(self, data, total_cols, clustee_col):

print("Getting predictor data...")

pred_cols = list(range(total_cols)) #predictor variable columns

pred_cols.remove(clustee_col) # remove data ids

self.pred_vars = data[pred_cols].copy()

clustee = [clustee_col] #pandas copy function requires list

self.clustees = data[clustee].values.tolist(

Originally, the plan was to use the k-prototypes algorithm on the data set since it is recommended for mixed data; however, a readily available solution in Python capable of mining this data set was not available. Instead, the data mining was performed in R.

Data Mining in R

Luckily, an R library exists that provides an alternative solution to k-prototypes. The solution involved computing a custom distance matrix using Gower distances. The Gower distance is a metric perfectly suited for mixed data sets. Once the Gower distance matrix was computed, the PAM algorithm was used to compute the clusters.

Load Libraries

set.seed(1680)   # for reproducibility
library(dplyr)   # for data cleaning
library(cluster) # for gower similarity and pam
library(Rtsne)   # for t-SNE plot
library(ggplot2) # for visualization

set.seed(1680) # for reproducibility

library(dplyr) # for data cleaning

library(cluster) # for gower similarity and pam

library(Rtsne) # for t-SNE plot

library(ggplot2) # for visualization

Loading Data

library(readr)
data_dir <- "C:/Users/Owner/Desktop/MGMT552/Santander/renta_samples/2015-12-28.csv"
train_2 <- "C:/Users/Owner/Desktop/MGMT552/Santander/train_ver2.csv"
data <- read_csv(data_dir)
# View columns
names(data)

library(readr)

data_dir <- "C:/Users/Owner/Desktop/MGMT552/Santander/renta_samples/2015-12-28.csv"

train_2 <- "C:/Users/Owner/Desktop/MGMT552/Santander/train_ver2.csv"

data <- read_csv(data_dir)

# View columns

names(data)

Exploring the data

# View columns
names(data)
# View summary statistics for each column
summary(data)
# View unique values
apply(data, 2, function(x)length(unique(x)))
# View column data type
sapply(data, typeof)
# View total number of NaNs
sum(apply(data, 2, is.na))

# View columns

names(data)

# View summary statistics for each column

summary(data)

# View unique values

apply(data, 2, function(x)length(unique(x)))

# View column data type

sapply(data, typeof)

# View total number of NaNs

sum(apply(data, 2, is.na))

The loaded data needed to be reformatted in order to compute the Gower distances. The Santander data set contains two columns that have 99% NaN values. Those columns were dropped. Further, the only columns that should be numerical are age and renta. All of the other columns were factored.

#ult_fec_cli_1t (11), conyuemp (16)
clean_data <- data[c(-11,-16)]
factor_data <- clean_data
cols <- 1:length(clean_data)
cols <- cols[! cols %in% c(6, 21)] # exclude age and renta indices
factor_data[cols] <- lapply(factor_data[cols], factor)
sapply(factor_data, nlevels) # how many levels does each factored column have?
#sapply(factor_data, typeof)

#ult_fec_cli_1t (11), conyuemp (16)

clean_data <- data[c(-11,-16)]

factor_data <- clean_data

cols <- 1:length(clean_data)

cols <- cols[! cols %in% c(6, 21)] # exclude age and renta indices

factor_data[cols] <- lapply(factor_data[cols], factor)

sapply(factor_data, nlevels) # how many levels does each factored column have?

#sapply(factor_data, typeof)

The predictor and target variables are separated.

pred_var <- factor_data[1:22]
target_var <- factor_data[23:46]
# Merge feature vector rows in target_var
target_var_merge <- apply(target_var, 1, paste, collapse="")

pred_var <- factor_data[1:22]

target_var <- factor_data[23:46]

# Merge feature vector rows in target_var

target_var_merge <- apply(target_var, 1, paste, collapse="")

The PAM algorithm was run with clusters ranging from 2 – 112. The value 112 was determined by looking at the number of unique row entries in the target variable data set. This lead to the conclusion that there were 112 different classes, which meant there could be up to 112 different clusters.

library(plyr)
#counts frequency of each unique combination rows
classes <- count(target_var, vars = names(target_var))
nrow(classes) # 112 rows, therefore, 112 potential classes
# sorts unique rows from most frequent to least
classes[with(classes, order(-freq)), ]
# *** Alternatively ***
classes <- count(target_var_merge)
classes[with(classes, order(-freq)), ]

library(plyr)

#counts frequency of each unique combination rows

classes <- count(target_var, vars = names(target_var))

nrow(classes) # 112 rows, therefore, 112 potential classes

# sorts unique rows from most frequent to least

classes[with(classes, order(-freq)), ]

# *** Alternatively ***

classes <- count(target_var_merge)

classes[with(classes, order(-freq)), ]

Once the Gower distance matrix computation was completed, the most similar and and dissimilar row entries were evaluated.

library(cluster) # for gower similarity and pam
gower_dist <- daisy(pred_var[, -2],
                    metric = "gower",
                    type = list(logratio = 3))
summary(gower_dist)
# View most similar and disimilar items
gower_mat <- as.matrix(gower_dist)
# Most similar
pred_var[
  which(gower_mat == min(gower_mat[gower_mat != min(gower_mat)]),
        arr.ind = TRUE)[1, ], ]
# Most dissimilar
pred_var[
  which(gower_mat == max(gower_mat[gower_mat != max(gower_mat)]),
        arr.ind = TRUE)[1, ], ]

library(cluster) # for gower similarity and pam

gower_dist <- daisy(pred_var[, -2],

metric = "gower",

type = list(logratio = 3))

summary(gower_dist)

# View most similar and disimilar items

gower_mat <- as.matrix(gower_dist)

# Most similar

pred_var[

which(gower_mat == min(gower_mat[gower_mat != min(gower_mat)]),

arr.ind = TRUE)[1, ], ]

# Most dissimilar

pred_var[

which(gower_mat == max(gower_mat[gower_mat != max(gower_mat)]),

arr.ind = TRUE)[1, ], ]

The Gower distance matrix is evaluated using the PAM algorithm. Since there were 112 unique classes, the PAM algorithm is computed with the number of centroids ranging from 2 to 112.

# K-Medoids Run PAM on gower distances of marketing data
```{r}
#Determine how to parallelize this for larger dataset
sil_width <- c(NA)

for(i in 2:112){
  print(i) # print the cluster number being computed.
  pam_fit <- pam(gower_dist,
                 diss = TRUE,
                 k = i)
  
  sil_width[i] <- pam_fit$silinfo$avg.width
  
}
```

# K-Medoids Run PAM on gower distances of marketing data

```{r}

#Determine how to parallelize this for larger dataset

sil_width <- c(NA)

for(i in 2:112){

print(i) # print the cluster number being computed.

pam_fit <- pam(gower_dist,

diss = TRUE,

k = i)

sil_width[i] <- pam_fit$silinfo$avg.width

}

```

The cluster number was then evaluated using the silhouette method.

Plot silhouette
```{r}
# Plot sihouette width (higher is better)
plot(1:112, sil_width,
     xlab = "Number of clusters",
     ylab = "Silhouette Width")
lines(1:112, sil_width)
```
Using silhouette widths, the optimal number of clusters is three. However, since there are so very many more classes, it may make more sense to use a higher number of clusters in order have higher resolution in predicting which accounts customers are likely to want. The top five cluster numbers are: 3, 2, 102, 104, 103
```{r}
cluster_df <- data.frame(1:112, sil_width)
# sort from highest silhouette width to least
head(cluster_df[with(cluster_df, order(-sil_width)), ], 5)
```

Plot silhouette

```{r}

# Plot sihouette width (higher is better)

plot(1:112, sil_width,

xlab = "Number of clusters",

ylab = "Silhouette Width")

lines(1:112, sil_width)

```

Using silhouette widths, the optimal number of clusters is three. However, since there are so very many more classes, it may make more sense to use a higher number of clusters in order have higher resolution in predicting which accounts customers are likely to want. The top five cluster numbers are: 3, 2, 102, 104, 103

```{r}

cluster_df <- data.frame(1:112, sil_width)

# sort from highest silhouette width to least

head(cluster_df[with(cluster_df, order(-sil_width)), ], 5)

```

It is clear to see that the best cluster number is 3.

The cluster number assignments were then mapped to the corresponding rows in the predictor variable’s dataset. The predictor variables were then subdivided into the different clusters. At this point, the bank account product data is mapped to the respective market segments. The data is now ready to be mined for association rules using the Apriori algorithm.

pam_fit <- pam(gower_dist, diss = TRUE, k = 3)
pam_results <- pred_var %>%
  dplyr::select(-ncodpers) %>%
  mutate(cluster = pam_fit$clustering) %>%
  group_by(cluster) %>%
  do(the_summary = summary(.))

pam_results$the_summary

# Map clusters back to their target variables.

# merged target variables with cluster numbers
target_var_clusters <- data.frame(accounts = target_var_merge, 
                                  cluster = pam_fit$clustering)
# split target variables with cluster numbers
tar_var_clusters <- data.frame(target_var, cluster = pam_fit$clustering)
tar_var_clusters$cluster <- factor(tar_var_clusters$cluster)

# Divide target variables into their clusters. Remove cluster number column
cluster1 <- target_var_clusters[which(target_var_clusters$cluster==1), -2]
cluster2 <- target_var_clusters[which(target_var_clusters$cluster==2), -2]
cluster3 <- target_var_clusters[which(target_var_clusters$cluster==3), -2]

# Find frequency of unique rows in each cluster.
c1.classes <- count(cluster1)
c1.classes <- c1.classes[with(c1.classes, order(-freq)), ]
c2.classes <- count(cluster2)
c2.classes <- c2.classes[with(c2.classes, order(-freq)), ]
c3.classes <- count(cluster3)
c3.classes <- c3.classes[with(c3.classes, order(-freq)), ]

# Find differences between each cluster
c1c2 <- union(cluster1, cluster2)
c2c3 <- union(cluster2, cluster3)
c1c3 <- union(cluster1, cluster3)
c1.c2c3 <- setdiff(cluster1, c2c3) # in c1, not in c2 and c3
c2.c1c3 <- setdiff(cluster2, c1c3) # in c2, not in c1 and c3
c3.c1c2 <- setdiff(cluster3, c1c2) # in c3, not in c1 and c2

pam_fit <- pam(gower_dist, diss = TRUE, k = 3)

pam_results <- pred_var %>%

dplyr::select(-ncodpers) %>%

mutate(cluster = pam_fit$clustering) %>%

group_by(cluster) %>%

do(the_summary = summary(.))

pam_results$the_summary

# Map clusters back to their target variables.

# merged target variables with cluster numbers

target_var_clusters <- data.frame(accounts = target_var_merge,

cluster = pam_fit$clustering)

# split target variables with cluster numbers

tar_var_clusters <- data.frame(target_var, cluster = pam_fit$clustering)

tar_var_clusters$cluster <- factor(tar_var_clusters$cluster)

# Divide target variables into their clusters. Remove cluster number column

cluster1 <- target_var_clusters[which(target_var_clusters$cluster==1), -2]

cluster2 <- target_var_clusters[which(target_var_clusters$cluster==2), -2]

cluster3 <- target_var_clusters[which(target_var_clusters$cluster==3), -2]

# Find frequency of unique rows in each cluster.

c1.classes <- count(cluster1)

c1.classes <- c1.classes[with(c1.classes, order(-freq)), ]

c2.classes <- count(cluster2)

c2.classes <- c2.classes[with(c2.classes, order(-freq)), ]

c3.classes <- count(cluster3)

c3.classes <- c3.classes[with(c3.classes, order(-freq)), ]

# Find differences between each cluster

c1c2 <- union(cluster1, cluster2)

c2c3 <- union(cluster2, cluster3)

c1c3 <- union(cluster1, cluster3)

c1.c2c3 <- setdiff(cluster1, c2c3) # in c1, not in c2 and c3

c2.c1c3 <- setdiff(cluster2, c1c3) # in c2, not in c1 and c3

c3.c1c2 <- setdiff(cluster3, c1c2) # in c3, not in c1 and c2

The unique sequences in the various clusters were compared. Cluster 1 contained the most unique sequences with 95 sequences. Cluster 2 only contained two unique sequences. Cluster 3 did not contain any unique clusters.

The three cluster centroids are defined as follows:

Column	Cluster 1	Cluster 2	Cluster 3
Ind_empleado	B: 1 N: 263	B: 1 N: 223	B: 0 N: 226
Pais_residencia	ES: 264	ES: 224	ES: 226
Sexo	H: 100 V: 164	H: 83 V: 141	H: 142 V: 84
Age	Min: 3.00 1^st Q: 38.00 Median: 45.00 Mean: 45.75 3^rd Q: 53.25 Max: 90.00	Min: 20 1^st Q: 41.75 Median: 50 Mean: 52.40 3^rd Q: 63 Max: 97	Min: 17 1^st Q: 22 Median: 24 Mean: 25.51 3^rd Q: 27 Max: 62

Ind_nuevo	0: 255 1: 9	0: 223 1: 1	0: 201 1: 25

Indrel	1: 263 99: 1	1: 223 99: 1	1: 226 99: 0
Indrel_1mes	1: 264	1: 224	1: 226
Tiprel_1mes	A: 254 I: 10	A: 10 I: 214	A: 38 I: 188
Indresi	S: 264	S: 224	S: 226
indext	N: 252 S: 12	N: 213 S: 11	N: 216 S: 10
canal_entrada	KAT: 99 KFC: 83 KFA: 53 KHE: 14 KHK: 8 KHQ: 5	KAT: 92 KFC: 70 KFA: 7 KHE: 6 KCI: 5 KBZ: 4	KHE: 175 KHQ: 24 KFC: 12 KHD: 7 KHK: 3
indfall	N: 263 S: 1	N: 222 S: 2	N: 226 S: 0
Tipodom	1: 264	1: 224	1: 226
Nomprov	MADRID: 126 BARCELONA: 27 MALAGA: 14 SEVILLA: 10 VALENCIA: 9	MADRID: 77 BARCELONA: 27 VALENCIA: 13 ZARAGOZA: 9 CADIZ: 8	MADRID: 43 BARCELONA: 25 SEVILLA: 19 VALENCIA: 13 MALAGA: 12
Ind_actividad_cliente	0: 8 1: 256	0: 218 1: 6	0: 190 1: 36
Renta	Min: 22416 1^st Q: 75128 Median: 113960 Mean: 130795 3^rd Q: 155679 Max : 784578	Min: 12994 1^st Q: 69037 Median: 106792 Mean: 133758 3^rd Q: 166061 Max: 749963	Min: 17493 1^st Qu.: 60879 Median: 89193 Mean: 112073 3^rd Qu.: 133473 Max: 791313
Segmento	TOP: 23 PARTICULARES: 214 UNIVERSITARIO: 27	TOP: 0 PARTICULARES: 223 UNIVERSITARIO: 1	TOP: 0 PARTICULARES: 19 UNIVERSITARIO: 204

Cluster 1 is characterized by individual, middle-aged, male customers with active bank accounts that have a household income ranging from €75128 to €155679. Cluster 2 is characterized by individual, middle-aged, male customers with inactive bank accounts that have a household income ranging from €69037 to €166061. Cluster 3 is characterized by young, college-educated female adults with inactive bank accounts that have a household income ranging from €60879 to €133473.

The goal for the Apriori algorithm is to find association rules among data sets in each market segment. These association rules will then help Santander Bank make product recommendations to its customers within each respective segment.

library(arules)
rules = apriori(target_trans, parameter=list(minlen=24, maxlen=24, support=0.01, confidence=0.5))
inspect(head(sort(target_trans, by="lift"),10))

library(arules)

rules = apriori(target_trans, parameter=list(minlen=24, maxlen=24, support=0.01, confidence=0.5))

inspect(head(sort(target_trans, by="lift"),10))

Since we are only concerned with active customers, we will only evaluate rules from cluster 1. The Apriori algorithm identified two association rules with a lift greater than 2 and confidence greater than 80%. The association rules are as follows:

A customer will likely bundle a payroll account with a pensions account
A customer will likely bundle a direct debit account with a current account

Problem Description

Marketing Data

Product Data

Strategy

Analysis

Preprocessing The Data

Data Mining in R

Leave a Reply Cancel reply