X-Education sells online courses to professionals, and one of its main business goals is to get potential users to enroll in some of the courses it offers. Suppose the company offers you a position on this project and you, as a marketing expert, accept. The information you have about the project consists of the database, the data dictionary of its variables, and a short summary sent by the company.
When users visit the website they can browse the courses, watch videos, or fill in a form to be contacted. In the latter case they provide a phone number or e-mail address and are from then on treated as a “lead” (identified by the variable Lead Number). The next phase is customer acquisition. Once a person is a “lead”, the sales team contacts them by phone or e-mail. This process also uses “leads” that were flagged in the past (the variable Lead Origin indicates where each “lead” comes from). As a result of this strategy, some “leads” become customers and others do not (variable Converted). Currently about 30% of the contacted “leads” become customers, and the company sees raising this figure as a good opportunity for improvement: by identifying the best “leads”, it can refine its sales strategy and focus on those prospects instead of spending time contacting every “lead”.
In this tutorial session we look at where the leads come from.
rm(list=ls()) # Remove all objects from the R workspace
graphics.off() # Close all open graphics devices
options(digits = 3) # Print numbers with about 3 significant digits (not 3 decimal places)
set.seed(12345) # Fix the random seed for reproducibility
library(readr) # To read the .csv file
library(glmnet) # Fits regularized generalized linear models (lasso / elastic net)
library(ggplot2) # For building more complex plots
library(corrplot)
library(dplyr)
library(fastDummies)
library(naniar)
library(RColorBrewer)
library(ggcorrplot)
library(caret)
library(MASS)
library(ggpmisc)
library(MLmetrics)
library(Metrics)
library(mlogit)
library(reshape2)
##
## -- Column specification --------------------------------------------------------
## cols(
## .default = col_character(),
## `Lead Number` = col_double(),
## Converted = col_double(),
## TotalVisits = col_double(),
## `Total Time Spent on Website` = col_double(),
## `Page Views Per Visit` = col_double(),
## `Asymmetrique Activity Score` = col_double(),
## `Asymmetrique Profile Score` = col_double()
## )
## i Use `spec()` for the full column specifications.
Before fitting any model, the data set is cleaned and an EDA is carried out. The value “Select” is assumed to be equivalent to NA (it is the default option shown when a field is left unanswered), so it is replaced accordingly.
# This section gives a general overview of the data set, in order to understand what each variable represents and which values it can take.
#View(BB)
# Lead Origin has some nearly empty categories, so the rows belonging to them are removed.
BB<-BB[!(BB$`Lead Origin`=="Quick Add Form"),]
BB<-BB[!(BB$`Lead Origin`=="Lead Import"),]
summary(BB) # Summary of each column, including its data type
## Prospect ID Lead Number Lead Origin Lead Source
## Length:9184 Min. :579533 Length:9184 Length:9184
## Class :character 1st Qu.:596415 Class :character Class :character
## Mode :character Median :615140 Mode :character Mode :character
## Mean :616997
## 3rd Qu.:637086
## Max. :660737
##
## Do Not Email Do Not Call Converted TotalVisits
## Length:9184 Length:9184 Min. :0.000 Min. : 0.0
## Class :character Class :character 1st Qu.:0.000 1st Qu.: 1.0
## Mode :character Mode :character Median :0.000 Median : 3.0
## Mean :0.386 Mean : 3.5
## 3rd Qu.:1.000 3rd Qu.: 5.0
## Max. :1.000 Max. :251.0
## NA's :112
## Total Time Spent on Website Page Views Per Visit Last Activity
## Min. : 0 Min. : 0.0 Length:9184
## 1st Qu.: 14 1st Qu.: 1.0 Class :character
## Median : 250 Median : 2.0 Mode :character
## Mean : 489 Mean : 2.4
## 3rd Qu.: 938 3rd Qu.: 3.2
## Max. :2272 Max. :55.0
## NA's :112
## Country Specialization How did you hear about X Education
## Length:9184 Length:9184 Length:9184
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## What is your current occupation What matters most to you in choosing a course
## Length:9184 Length:9184
## Class :character Class :character
## Mode :character Mode :character
##
##
##
##
## Search Magazine Newspaper Article X Education Forums
## Length:9184 Length:9184 Length:9184 Length:9184
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Newspaper Digital Advertisement Through Recommendations
## Length:9184 Length:9184 Length:9184
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## Receive More Updates About Our Courses Tags Lead Quality
## Length:9184 Length:9184 Length:9184
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## Update me on Supply Chain Content Get updates on DM Content Lead Profile
## Length:9184 Length:9184 Length:9184
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## City Asymmetrique Activity Index Asymmetrique Profile Index
## Length:9184 Length:9184 Length:9184
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## Asymmetrique Activity Score Asymmetrique Profile Score
## Min. : 7 Min. :11
## 1st Qu.:14 1st Qu.:15
## Median :14 Median :16
## Mean :14 Mean :16
## 3rd Qu.:15 3rd Qu.:18
## Max. :18 Max. :20
## NA's :4218 NA's :4218
## I agree to pay the amount through cheque
## Length:9184
## Class :character
## Mode :character
##
##
##
##
## A free copy of Mastering The Interview Last Notable Activity
## Length:9184 Length:9184
## Class :character Class :character
## Mode :character Mode :character
##
##
##
##
BB[BB=='Select']<-NA # Turn 'Select' values into NA (it is assumed that people left those fields blank, so they are in fact NAs)
BB$Country[BB$Country!='India']<- NA # More than 50% of the rows come from India, so the other countries are not treated separately
BB$Country[is.na(BB$Country)]<-"Otro" # Missing and non-India countries are grouped into the category "Otro" (other)
BB$Specialization[is.na(BB$Specialization)]<-"Otro" # It is assumed that people left this field empty because their specialization was not among the options, so a separate category is created
BB$`What is your current occupation`[is.na(BB$`What is your current occupation`)]<-"Otro" # Same reasoning as in the previous case
# Merge the two spellings of Google
BB$`Lead Source` <- as.character(BB$`Lead Source`)
BB$`Lead Source`[BB$`Lead Source` == 'google'] <- 'Google'
Inspecting the data set shows that many variables have missing values. To avoid distorting the inference, variables with more than 40% missing values are dropped, except for those considered highly relevant to the question at hand (in the code below, Lead Quality, with about 51% missing, is the only such variable that is kept). The variables dropped are Asymmetrique Profile Index, Asymmetrique Profile Score, How did you hear about X Education, Lead Profile, Asymmetrique Activity Index, and Asymmetrique Activity Score. The listing that follows gives the percentage of missing values for each variable.
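The code that produced the listing of missing-value percentages is not shown in the source. A minimal sketch of one way to obtain such a listing is given here, on a made-up data frame `toy` that stands in for `BB`; the column values and the names `pct_missing` and `high_missing` are illustrative assumptions, not the original code.

```r
# Toy data frame standing in for BB (values are made up for illustration)
toy <- data.frame(
  City   = c("Mumbai", NA, NA, "Pune"),
  Tags   = c(NA, "Ringing", "Ringing", "Busy"),
  Visits = c(1, 2, 3, 4),
  stringsAsFactors = FALSE
)

# Share of missing values per column, expressed as a percentage
pct_missing <- round(colMeans(is.na(toy)) * 100, 3)
print(pct_missing)

# Columns above the 40% threshold, i.e. candidates to drop
high_missing <- names(pct_missing)[pct_missing > 40]
print(high_missing)
```

Applied to the real data, the same two steps would flag the columns to review by hand before deciding which ones to keep.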
## Prospect ID
## 0.000
## Lead Number
## 0.000
## Lead Origin
## 0.000
## Lead Source
## 0.368
## Do Not Email
## 0.000
## Do Not Call
## 0.000
## Converted
## 0.000
## TotalVisits
## 1.212
## Total Time Spent on Website
## 0.000
## Page Views Per Visit
## 1.212
## Last Activity
## 0.942
## Country
## 0.000
## Specialization
## 0.000
## How did you hear about X Education
## 77.868
## What is your current occupation
## 0.000
## What matters most to you in choosing a course
## 29.221
## Search
## 0.000
## Magazine
## 0.000
## Newspaper Article
## 0.000
## X Education Forums
## 0.000
## Newspaper
## 0.000
## Digital Advertisement
## 0.000
## Through Recommendations
## 0.000
## Receive More Updates About Our Courses
## 0.000
## Tags
## 36.115
## Lead Quality
## 51.255
## Update me on Supply Chain Content
## 0.000
## Get updates on DM Content
## 0.000
## Lead Profile
## 73.701
## City
## 39.697
## Asymmetrique Activity Index
## 45.649
## Asymmetrique Profile Index
## 45.649
## Asymmetrique Activity Score
## 45.649
## Asymmetrique Profile Score
## 45.649
## I agree to pay the amount through cheque
## 0.000
## A free copy of Mastering The Interview
## 0.000
## Last Notable Activity
## 0.000
# Drop the variables that have many missing values and that, intuitively, have little predictive power
BDLimpia = subset(BB, select = -c(`Asymmetrique Profile Index`,`Asymmetrique Profile Score`,`How did you hear about X Education`,`Lead Profile`,`Asymmetrique Activity Index`,`Asymmetrique Activity Score` ))
In the interest of time, and to avoid overwhelming you with every relationship that was explored in the data set, the EDA shown here is restricted to the variables that are ultimately used in parts 1 and 2 of the tutorial.
# Question 1: COUNTRY VS. LEAD ORIGIN
ggplot(BDLimpia, aes(`Country`)) +
geom_bar(aes(fill=`Lead Origin`), position = "dodge") +
theme_classic()+labs(title="Country V/S Lead Origin")
# Question 1: SPECIALIZATION VS. LEAD ORIGIN
tableespecialization<-table(BDLimpia$`Lead Origin`, BDLimpia$Specialization)
tableespecialization
##
## Banking, Investment And Insurance
## API 39
## Landing Page Submission 270
## Lead Add Form 29
##
## Business Administration E-Business E-COMMERCE
## API 38 2 7
## Landing Page Submission 339 53 99
## Lead Add Form 25 2 3
##
## Finance Management Healthcare Management
## API 89 21
## Landing Page Submission 799 119
## Lead Add Form 84 18
##
## Hospitality Management Human Resource Management
## API 14 103
## Landing Page Submission 88 656
## Lead Add Form 11 86
##
## International Business IT Projects Management
## API 17 28
## Landing Page Submission 155 327
## Lead Add Form 5 11
##
## Marketing Management Media and Advertising
## API 135 23
## Landing Page Submission 615 176
## Lead Add Form 88 3
##
## Operations Management Otro Retail Management
## API 74 2915 8
## Landing Page Submission 398 125 90
## Lead Add Form 30 301 2
##
## Rural and Agribusiness Services Excellence
## API 9 3
## Landing Page Submission 62 36
## Lead Add Form 2 1
##
## Supply Chain Management Travel and Tourism
## API 37 18
## Landing Page Submission 296 183
## Lead Add Form 15 2
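Raw counts are hard to compare across specializations of very different sizes. As a small illustration, the sketch below takes the counts of two columns from the table above (Finance Management and Otro) and converts them to within-specialization shares with `prop.table`. The object names `counts` and `shares` are illustrative.

```r
# Counts copied from the cross-tabulation above for two specializations
counts <- matrix(
  c(89, 799, 84,      # Finance Management: API, Landing Page Submission, Lead Add Form
    2915, 125, 301),  # Otro:               API, Landing Page Submission, Lead Add Form
  nrow = 3,
  dimnames = list(
    `Lead Origin`  = c("API", "Landing Page Submission", "Lead Add Form"),
    Specialization = c("Finance Management", "Otro")
  )
)

# Share of each lead origin within each specialization (each column sums to 1)
shares <- prop.table(counts, margin = 2)
print(round(shares, 3))
```

Most leads with specialization Otro arrive through API, whereas most Finance Management leads arrive through Landing Page Submission, which is the kind of association the model is meant to capture.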
Formulate a homogeneous model to determine where a lead comes from (variable Lead Origin). The EDA suggested some association between Lead Origin and Country and between Lead Origin and Specialization, so we now look at how these variables behave in the model.
# Solution
# To use the mlogit library, which was covered in class, the data must be in long format. It is currently in wide format, so it has to be reshaped around the variables of interest.
# Build the data set used in this question
PreguntaUno<-select(BDLimpia, Country, Specialization, `Lead Origin`, `Lead Number`)
UnoModificable<-PreguntaUno # Working copy that will be modified
head(PreguntaUno,5)
# Create one indicator variable per alternative so they can be used in the long-format data
API <- (ifelse(UnoModificable$`Lead Origin`=="API", TRUE, FALSE))
Landing_Page_Submission <- (ifelse(UnoModificable$`Lead Origin`=="Landing Page Submission", TRUE, FALSE))
Lead_Add_Form <- (ifelse(UnoModificable$`Lead Origin`=="Lead Add Form", TRUE, FALSE))
UnoModificable <- cbind(UnoModificable, API) # Append the indicator columns
UnoModificable <- cbind(UnoModificable, Landing_Page_Submission)
UnoModificable <- cbind(UnoModificable, Lead_Add_Form)
UnoModificable$`Lead Origin` = NULL # Drop the original Lead Origin column
head(UnoModificable,5)
# Reshape the data to long format
BaseLong<-melt(UnoModificable,id.vars = c("Country", "Specialization","Lead Number"),
variable.name = "Lead Origin",value.name = "Choice")
head(BaseLong,5) # First 5 rows
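To make the wide-to-long step concrete, here is a minimal sketch on two made-up leads. It uses base R only, so it mirrors what `melt` achieves rather than reproducing the call above. The data frame `wide`, its column names, and its values are illustrative assumptions.

```r
# Toy wide data: one row per lead (values are made up for illustration)
wide <- data.frame(
  lead_id = c(101, 102),
  Country = c("India", "Otro"),
  origin  = c("API", "Lead Add Form"),
  stringsAsFactors = FALSE
)
alternatives <- c("API", "Landing Page Submission", "Lead Add Form")

# Long format: one row per lead and alternative, with Choice = TRUE
# only for the alternative that was actually observed
long <- do.call(rbind, lapply(alternatives, function(alt) {
  data.frame(
    lead_id     = wide$lead_id,
    Country     = wide$Country,
    alternative = alt,
    Choice      = wide$origin == alt,
    stringsAsFactors = FALSE
  )
}))

# Sort so that the three rows of each lead are contiguous
long <- long[order(long$lead_id, match(long$alternative, alternatives)), ]
rownames(long) <- NULL
print(long)
```

Each lead now occupies three rows, exactly one of which has `Choice` equal to `TRUE`; this is the structure a long-format choice data set is expected to have.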
With the data in long format, the model can be estimated with the mlogit function.
logit_data <- mlogit.data(data = BaseLong, shape = "long", choice = "Choice",id.var = "Lead Number")
# Fit the model
modelo1<-mlogit(Choice ~ 0| Country + Specialization | 0, data = logit_data)
summary(modelo1) # Print the estimates reported below
The alternatives are coded as: 1 = API, 2 = Landing Page Submission, 3 = Lead Add Form.
##
## Call:
## mlogit(formula = Choice ~ 0 | Country + Specialization | 0, data = logit_data,
## method = "nr")
##
## Frequencies of alternatives:choice
## 1 2 3
## 0.3898 0.5320 0.0782
##
## nr method
## 8 iterations, 0h:0m:3s
## g'(-H)^-1g = 9.96E-05
## successive function values within tolerance limits
##
## Coefficients :
## Estimate Std. Error z-value Pr(>|z|)
## (Intercept):2 2.11419 0.17593 12.02 < 2e-16
## (Intercept):3 -2.83426 0.35149 -8.06 6.7e-16
## CountryOtro:2 -1.82379 0.10216 -17.85 < 2e-16
## CountryOtro:3 4.10313 0.21494 19.09 < 2e-16
## SpecializationBusiness Administration:2 0.33666 0.24867 1.35 0.1758
## SpecializationBusiness Administration:3 -0.44013 0.41824 -1.05 0.2926
## SpecializationE-Business:2 1.32433 0.74785 1.77 0.0766
## SpecializationE-Business:3 0.38504 1.16359 0.33 0.7407
## SpecializationE-COMMERCE:2 0.71685 0.43462 1.65 0.0991
## SpecializationE-COMMERCE:3 -0.56268 0.82345 -0.68 0.4944
## SpecializationFinance Management:2 0.25235 0.20914 1.21 0.2276
## SpecializationFinance Management:3 0.27444 0.35528 0.77 0.4398
## SpecializationHealthcare Management:2 -0.15330 0.30069 -0.51 0.6102
## SpecializationHealthcare Management:3 -0.05270 0.49352 -0.11 0.9150
## SpecializationHospitality Management:2 -0.10615 0.34269 -0.31 0.7568
## SpecializationHospitality Management:3 0.10122 0.59076 0.17 0.8640
## SpecializationHuman Resource Management:2 -0.06404 0.20641 -0.31 0.7564
## SpecializationHuman Resource Management:3 0.02968 0.35128 0.08 0.9327
## SpecializationInternational Business:2 0.42242 0.31745 1.33 0.1833
## SpecializationInternational Business:3 -1.42298 0.61488 -2.31 0.0207
## SpecializationIT Projects Management:2 0.63200 0.26792 2.36 0.0183
## SpecializationIT Projects Management:3 -1.03323 0.48488 -2.13 0.0331
## SpecializationMarketing Management:2 -0.34904 0.20126 -1.73 0.0829
## SpecializationMarketing Management:3 -0.40527 0.34336 -1.18 0.2379
## SpecializationMedia and Advertising:2 0.08516 0.28669 0.30 0.7664
## SpecializationMedia and Advertising:3 -1.66728 0.72634 -2.30 0.0217
## SpecializationOperations Management:2 -0.11797 0.21981 -0.54 0.5915
## SpecializationOperations Management:3 -1.07039 0.38466 -2.78 0.0054
## SpecializationOtro:2 -4.66347 0.19827 -23.52 < 2e-16
## SpecializationOtro:3 -2.93368 0.31294 -9.37 < 2e-16
## SpecializationRetail Management:2 0.52466 0.41616 1.26 0.2074
## SpecializationRetail Management:3 -1.25557 0.89901 -1.40 0.1625
## SpecializationRural and Agribusiness:2 -0.00407 0.40572 -0.01 0.9920
## SpecializationRural and Agribusiness:3 -1.21193 0.91991 -1.32 0.1877
## SpecializationServices Excellence:2 0.55992 0.63722 0.88 0.3796
## SpecializationServices Excellence:3 -0.84738 1.30533 -0.65 0.5162
## SpecializationSupply Chain Management:2 0.36357 0.25314 1.44 0.1509
## SpecializationSupply Chain Management:3 -1.26004 0.44181 -2.85 0.0043
## SpecializationTravel and Tourism:2 0.41183 0.30775 1.34 0.1808
## SpecializationTravel and Tourism:3 -2.02099 0.83344 -2.42 0.0153
##
## (Intercept):2 ***
## (Intercept):3 ***
## CountryOtro:2 ***
## CountryOtro:3 ***
## SpecializationBusiness Administration:2
## SpecializationBusiness Administration:3
## SpecializationE-Business:2 .
## SpecializationE-Business:3
## SpecializationE-COMMERCE:2 .
## SpecializationE-COMMERCE:3
## SpecializationFinance Management:2
## SpecializationFinance Management:3
## SpecializationHealthcare Management:2
## SpecializationHealthcare Management:3
## SpecializationHospitality Management:2
## SpecializationHospitality Management:3
## SpecializationHuman Resource Management:2
## SpecializationHuman Resource Management:3
## SpecializationInternational Business:2
## SpecializationInternational Business:3 *
## SpecializationIT Projects Management:2 *
## SpecializationIT Projects Management:3 *
## SpecializationMarketing Management:2 .
## SpecializationMarketing Management:3
## SpecializationMedia and Advertising:2
## SpecializationMedia and Advertising:3 *
## SpecializationOperations Management:2
## SpecializationOperations Management:3 **
## SpecializationOtro:2 ***
## SpecializationOtro:3 ***
## SpecializationRetail Management:2
## SpecializationRetail Management:3
## SpecializationRural and Agribusiness:2
## SpecializationRural and Agribusiness:3
## SpecializationServices Excellence:2
## SpecializationServices Excellence:3
## SpecializationSupply Chain Management:2
## SpecializationSupply Chain Management:3 **
## SpecializationTravel and Tourism:2
## SpecializationTravel and Tourism:3 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Log-Likelihood: -3870
## McFadden R^2: 0.533
## Likelihood ratio test : chisq = 8840 (p.value = <2e-16)
What do the estimates mean?
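As a starting point for the answer: in a multinomial logit, each coefficient is a change in log-odds relative to the base alternative (here alternative 1, API, whose utility is fixed at 0), and choice probabilities are obtained by exponentiating the utilities and normalizing. The sketch below plugs in the intercepts and the CountryOtro coefficients from the summary above for a lead in the reference specialization (Banking, Investment And Insurance, the level that has no coefficient of its own). The helper `softmax` and the short alternative names are illustrative.

```r
# Coefficients copied from the summary output above.
# API is the base alternative, so its utility is fixed at 0.
intercept    <- c(API = 0, Landing = 2.11419, AddForm = -2.83426)
country_otro <- c(API = 0, Landing = -1.82379, AddForm = 4.10313)

# Turn a vector of utilities into choice probabilities
softmax <- function(v) exp(v) / sum(exp(v))

# Reference specialization, Country = India (the reference level)
p_india <- softmax(intercept)

# Reference specialization, Country = Otro
p_otro <- softmax(intercept + country_otro)

print(round(rbind(India = p_india, Otro = p_otro), 3))

# Odds of Lead Add Form versus API are multiplied by this factor
# when Country changes from India to Otro
odds_ratio_addform <- exp(country_otro[["AddForm"]])
print(odds_ratio_addform)
```

For this reference profile, the positive CountryOtro:3 coefficient shifts most of the probability toward Lead Add Form, while the negative CountryOtro:2 coefficient lowers the odds of Landing Page Submission relative to API.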
Discuss the predictive ability of the previous model.
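One in-sample ingredient for that discussion is the McFadden R^2 reported in the summary. As a sanity check, it can be recovered from the reported log-likelihood and likelihood ratio statistic, assuming both are computed against the same null model. The variable names below are illustrative and the inputs are the rounded values printed above.

```r
loglik_model <- -3870  # Log-Likelihood reported in the summary
lr_chisq     <- 8840   # Likelihood ratio statistic reported in the summary

# LR = 2 * (loglik_model - loglik_null), so the null log-likelihood is:
loglik_null <- loglik_model - lr_chisq / 2

# McFadden R^2 = 1 - loglik_model / loglik_null
mcfadden_r2 <- 1 - loglik_model / loglik_null
print(round(mcfadden_r2, 3))
```

This measures in-sample fit only; out-of-sample performance is assessed separately with a train/test split.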
# Split the data into train and test sets
index <- sample(1:nrow(UnoModificable), size=round(0.8*nrow(UnoModificable),0))
train <- UnoModificable[index,]
test <- UnoModificable[-index,]
We repeat the process to put the data in the required format.
# Data in long format
# train
long.train <- melt(train, id.vars = c("Lead Number", "Specialization", "Country" ), variable.name = "Lead_Origin", value.name = "choice")
# test
long.test <- melt(test, id.vars = c("Lead Number", "Specialization", "Country"), variable.name = "Lead_Origin", value.name = "choice")
# mlogit data
# train
long.train.data <- mlogit.data(data=long.train, shape="long", choice="choice", id.var = "Lead Number")
# test
long.test.data <- mlogit.data(data=long.test, shape="long", choice="choice", id.var = "Lead Number")
The model is fitted on the training data so that its predictive ability can then be evaluated on the test data.
modelotrain<-mlogit(choice ~ 0| Country + Specialization |0, data = long.train.data, model = TRUE)
summary(modelotrain)
##
## Call:
## mlogit(formula = choice ~ 0 | Country + Specialization | 0, data = long.train.data,
## model = TRUE, method = "nr")
##
## Frequencies of alternatives:choice
## 1 2 3
## 0.3870 0.5349 0.0781
##
## nr method
## 18 iterations, 0h:0m:3s
## g'(-H)^-1g = 3.69E-07
## gradient close to zero
##
## Coefficients :
## Estimate Std. Error z-value Pr(>|z|)
## (Intercept):2 2.2492 0.2062 10.91 < 2e-16
## (Intercept):3 -2.9441 0.4059 -7.25 4.1e-13
## CountryOtro:2 -1.7921 0.1137 -15.77 < 2e-16
## CountryOtro:3 4.1084 0.2401 17.11 < 2e-16
## SpecializationBusiness Administration:2 0.3009 0.2921 1.03 0.303
## SpecializationBusiness Administration:3 -0.3397 0.4865 -0.70 0.485
## SpecializationE-Business:2 0.9289 0.7622 1.22 0.223
## SpecializationE-Business:3 -16.1244 2714.1327 -0.01 0.995
## SpecializationE-COMMERCE:2 0.4510 0.4760 0.95 0.343
## SpecializationE-COMMERCE:3 -0.4654 0.9881 -0.47 0.638
## SpecializationFinance Management:2 0.2631 0.2460 1.07 0.285
## SpecializationFinance Management:3 0.5238 0.4096 1.28 0.201
## SpecializationHealthcare Management:2 -0.2393 0.3399 -0.70 0.481
## SpecializationHealthcare Management:3 0.0334 0.5500 0.06 0.952
## SpecializationHospitality Management:2 -0.3240 0.3780 -0.86 0.391
## SpecializationHospitality Management:3 0.4622 0.6650 0.70 0.487
## SpecializationHuman Resource Management:2 -0.2250 0.2382 -0.94 0.345
## SpecializationHuman Resource Management:3 0.1631 0.4024 0.41 0.685
## SpecializationInternational Business:2 0.3785 0.3563 1.06 0.288
## SpecializationInternational Business:3 -1.2279 0.6477 -1.90 0.058
## SpecializationIT Projects Management:2 0.4632 0.3068 1.51 0.131
## SpecializationIT Projects Management:3 -0.8156 0.5508 -1.48 0.139
## SpecializationMarketing Management:2 -0.4597 0.2333 -1.97 0.049
## SpecializationMarketing Management:3 -0.1818 0.3941 -0.46 0.645
## SpecializationMedia and Advertising:2 -0.1438 0.3158 -0.46 0.649
## SpecializationMedia and Advertising:3 -1.5860 0.7496 -2.12 0.034
## SpecializationOperations Management:2 -0.2392 0.2533 -0.94 0.345
## SpecializationOperations Management:3 -0.9900 0.4387 -2.26 0.024
## SpecializationOtro:2 -4.7779 0.2295 -20.82 < 2e-16
## SpecializationOtro:3 -2.8587 0.3618 -7.90 2.9e-15
## SpecializationRetail Management:2 0.1068 0.4375 0.24 0.807
## SpecializationRetail Management:3 -1.3046 0.9139 -1.43 0.153
## SpecializationRural and Agribusiness:2 0.0810 0.4859 0.17 0.868
## SpecializationRural and Agribusiness:3 -1.3552 1.2184 -1.11 0.266
## SpecializationServices Excellence:2 0.2890 0.6512 0.44 0.657
## SpecializationServices Excellence:3 -0.8069 1.3162 -0.61 0.540
## SpecializationSupply Chain Management:2 0.1985 0.2882 0.69 0.491
## SpecializationSupply Chain Management:3 -0.9855 0.4902 -2.01 0.044
## SpecializationTravel and Tourism:2 0.2823 0.3451 0.82 0.413
## SpecializationTravel and Tourism:3 -1.7994 0.8602 -2.09 0.036
##
## (Intercept):2 ***
## (Intercept):3 ***
## CountryOtro:2 ***
## CountryOtro:3 ***
## SpecializationBusiness Administration:2
## SpecializationBusiness Administration:3
## SpecializationE-Business:2
## SpecializationE-Business:3
## SpecializationE-COMMERCE:2
## SpecializationE-COMMERCE:3
## SpecializationFinance Management:2
## SpecializationFinance Management:3
## SpecializationHealthcare Management:2
## SpecializationHealthcare Management:3
## SpecializationHospitality Management:2
## SpecializationHospitality Management:3
## SpecializationHuman Resource Management:2
## SpecializationHuman Resource Management:3
## SpecializationInternational Business:2
## SpecializationInternational Business:3 .
## SpecializationIT Projects Management:2
## SpecializationIT Projects Management:3
## SpecializationMarketing Management:2 *
## SpecializationMarketing Management:3
## SpecializationMedia and Advertising:2
## SpecializationMedia and Advertising:3 *
## SpecializationOperations Management:2
## SpecializationOperations Management:3 *
## SpecializationOtro:2 ***
## SpecializationOtro:3 ***
## SpecializationRetail Management:2
## SpecializationRetail Management:3
## SpecializationRural and Agribusiness:2
## SpecializationRural and Agribusiness:3
## SpecializationServices Excellence:2
## SpecializationServices Excellence:3
## SpecializationSupply Chain Management:2
## SpecializationSupply Chain Management:3 *
## SpecializationTravel and Tourism:2
## SpecializationTravel and Tourism:3 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Log-Likelihood: -3080
## McFadden R^2: 0.536
## Likelihood ratio test : chisq = 7090 (p.value = <2e-16)
To build the confusion matrix, the test labels must have the same shape and values as the output of predict. This requires one row per person, that is, one row per observation rather than three as in the long format. It also requires that, instead of the category names of Lead_Origin, the labels use the levels returned by the model: API = 1, Landing_Page_Submission = 2, Lead_Add_Form = 3.
# Get the model's predictions for the test set
logit.test.m <- predict(modelotrain, newdata = long.test.data)
logit.test.m <- max.col(logit.test.m)
logit.test.m <- factor(logit.test.m)
# Build the "true" classification vector by transforming the test set
# (note: this assumes the rows of test.filt are in the same order as the rows returned by predict)
test.filt <- filter(long.test, choice == TRUE)
test.filt.org <- (ifelse(test.filt$Lead_Origin=="API", 1, ifelse(test.filt$Lead_Origin == "Landing_Page_Submission", 2, 3)))
confusionMatrix(logit.test.m, factor(test.filt.org))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3
## 1 269 355 61
## 2 429 543 74
## 3 39 58 9
##
## Overall Statistics
##
## Accuracy : 0.447
## 95% CI : (0.424, 0.47)
## No Information Rate : 0.52
## P-Value [Acc > NIR] : 1.00000
##
## Kappa : -0.006
##
## Mcnemar's Test P-Value : 0.00324
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3
## Sensitivity 0.365 0.568 0.0625
## Specificity 0.622 0.429 0.9427
## Pos Pred Value 0.393 0.519 0.0849
## Neg Pred Value 0.594 0.478 0.9220
## Prevalence 0.401 0.520 0.0784
## Detection Rate 0.146 0.296 0.0049
## Detection Prevalence 0.373 0.569 0.0577
## Balanced Accuracy 0.493 0.499 0.5026
What do the estimated values mean?
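One thing worth checking when reading the matrix above: a Kappa close to zero together with an accuracy below the No Information Rate is what one would see if predictions and reference labels were compared in different row orders. With class shares of roughly 0.40, 0.52 and 0.08, two unrelated label vectors agree about 0.40^2 + 0.52^2 + 0.08^2 ≈ 0.44 of the time, close to the reported 0.447. Whether that is what happened here depends on the row order that predict returns, which is not verified in this document. The sketch below uses simulated labels (all names and values are made up) to show the effect and one way to realign two vectors by identifier.

```r
set.seed(1)
n     <- 300
ids   <- sample(1000:9999, n)  # hypothetical, unique lead identifiers
truth <- sample(1:3, n, replace = TRUE, prob = c(0.40, 0.52, 0.08))

# Pretend the predictions are perfect and come back sorted by identifier
ord_id <- order(ids)
pred   <- data.frame(id = ids[ord_id], class = truth[ord_id])

# Reference labels grouped by class, as happens when a long table
# stacked alternative by alternative is filtered on choice == TRUE
ord_class <- order(truth)
ref       <- data.frame(id = ids[ord_class], class = truth[ord_class])

# Comparing row by row without aligning: agreement is close to chance
acc_misaligned <- mean(pred$class == ref$class)

# Realign the reference to the prediction order using the identifier
ref_aligned <- ref[match(pred$id, ref$id), ]
acc_aligned <- mean(pred$class == ref_aligned$class)

print(c(misaligned = acc_misaligned, aligned = acc_aligned))
```

If the same check on the real objects shows that the identifiers of the two vectors do not line up, the reference vector should be reordered by Lead Number before calling confusionMatrix.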