

Evaluate and predict customer churn

This notebook is an adaptation from the work done by Sidney Phoon and Eleva Lowery with the following modifications:

Use datasets persisted in DB2 Warehouse running on ICP
Deploy and run the notebook on DSX local running on IBM Cloud Private (ICP)
Run spark Machine learning job on ICP as part of the worker nodes.
Document some actions for a beginner data scientist / developer who wants to understand what's going on.

The goal is to demonstrate how to build a predictive model with Spark machine learning API (SparkML) to predict customer churn, and deploy it for scoring in Machine Learning (ML) running on ICP. There is an equivalent notebook to run on Watson Data Platform and Watson Machine Learning.

Scope

A lot of industries have the issue of customers moving to competitors when the product differentiation is not that important, or there is some customer support issues. One industry illustrating this problem is the telecom industry with mobile, internet and IP TV product offerings.

Note book explanations

The notebook aims to follow the classical data science modeling steps:

load the data prepare the data analyze the data (iterate on those two activities) build a model validate the accuracy of the model deploy the model consume the model as a service

This jupyter notebook uses Apache Spark to run the machine learning jobs to build decision trees using random forest classifier to assess when a customer is at risk to move to competitor. Apache Spark offers a Python module called pyspark to operate on data and use ML constructs.

Start by all imports

As a best practices for notebook implementation is to do the import at the top of the notebook. Spark SQLContext a spark module to process structured data spark conf to access Spark cluster configuration and then be able to execute queries pandas Python super library for data analysis brunel API and tool to visualize data quickly. * pixiedust Visualize data inside Jupyter notebooks

The first cell below is to execute some system commands to update the kernel with updated dependant library.

# Library required for pixiedust - a visualization and dashboarding framework
!pip install --user --upgrade pixiedust

Requirement already up-to-date: pixiedust in /user-home/1002/.local/lib/python2.7/site-packages
Requirement already up-to-date: astunparse in /user-home/1002/.local/lib/python2.7/site-packages (from pixiedust)
Requirement already up-to-date: mpld3 in /opt/conda/lib/python2.7/site-packages (from pixiedust)
Requirement already up-to-date: markdown in /user-home/1002/.local/lib/python2.7/site-packages (from pixiedust)
Requirement already up-to-date: geojson in /user-home/1002/.local/lib/python2.7/site-packages (from pixiedust)
Requirement already up-to-date: lxml in /user-home/1002/.local/lib/python2.7/site-packages (from pixiedust)
Requirement already up-to-date: six<2.0,>=1.6.1 in /user-home/1002/.local/lib/python2.7/site-packages (from astunparse->pixiedust)
Requirement already up-to-date: wheel<1.0,>=0.23.0 in /user-home/1002/.local/lib/python2.7/site-packages (from astunparse->pixiedust)

import pyspark
import pandas as pd
import brunel
import numpy as np
from pyspark.sql import SQLContext
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType
from pyspark.sql.types import DecimalType
from pyspark.sql.types import IntegerType
from pyspark.sql.types import LongType
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorIndexer, IndexToString
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pixiedust.display import *

Pixiedust database opened successfully




    <div style="margin:10px">
        <a href="https://github.com/ibm-watson-data-lab/pixiedust" target="_new">
            <img src="https://github.com/ibm-watson-data-lab/pixiedust/raw/master/docs/_static/pd_icon32.png" style="float:left;margin-right:10px"/>
        </a>
        <span>Pixiedust version 1.1.7.1</span>
    </div>

Load data from DB2

DSX ICP can be used to bring in data from multiple sources including but not limited to, files, datastores on cloud as well as on premises. DSX ICP includes features to connect to data sources, bring in the data, refine, and then perform analytics.

In this sample we connect to DB2 Data Warehouse deployed on ICP and bring data about customer, call notes and marketing campaign in.

import dsx_core_utils

dataSet = dsx_core_utils.get_remote_data_set_info('CUSTOMER')
dataSource = dsx_core_utils.get_data_source_info(dataSet['datasource'])
print(dataSource)
dbTableOrQuery = dataSet['schema'] + '.' + dataSet['table']
print(dbTableOrQuery)
sparkSession = SparkSession(sc).builder.getOrCreate()
df_customer_transactions = sparkSession.read.format("jdbc").option("url", dataSource['URL']).option("dbtable",dbTableOrQuery).option("user",'BLUADMIN').option("password","changemeplease").load()
df_customer_transactions.show(5)
df_customer_transactions.printSchema()

{u'description': u'', u'URL': u'jdbc:db2://172.16.40.131:32166/BLUDB', 'driver_class': 'com.ibm.db2.jcc.DB2Driver', u'dsx_artifact_type': u'datasource', u'shared': True, u'type': u'DB2', u'name': u'CUSTOMER'}
BLUADMIN.CUSTOMER
+----+------+------+--------+----------+---------+---------+--------------+-------+------------+-------------+-----+-------+---------+-------------+--------------------+------+--------+-----------+-----+
|  ID|GENDER|STATUS|CHILDREN|EST_INCOME|CAR_OWNER|      AGE|MARITAL_STATUS|ZIPCODE|LONGDISTANCE|INTERNATIONAL|LOCAL|DROPPED|PAYMETHOD|LOCALBILLTYPE|LONGDISTANCEBILLTYPE| USAGE|RATEPLAN|DEVICEOWNED|CHURN|
+----+------+------+--------+----------+---------+---------+--------------+-------+------------+-------------+-----+-------+---------+-------------+--------------------+------+--------+-----------+-----+
|1311|     F|     M|       0|  53432.20|        Y|57.000000|       Married|   null|       12.56|            0|83.48|      0|       CH|    FreeLocal|            Standard| 96.04|       1|       ipho|    F|
|1312|     M|     M|       0|  78894.20|        N|52.740000|       Married|   null|        2.00|            1|43.23|      0|       CH|       Budget|      Intnl_discount| 46.65|       3|       ipho|    T|
|1313|     M|     S|       1|  16432.10|        Y|20.920000|        Single|   null|       20.41|            8|80.47|      0|       CH|       Budget|      Intnl_discount|109.78|       3|       ipho|    T|
|1314|     M|     M|       0|  62797.90|        N|54.446667|       Married|   null|        8.32|            4|55.48|      0|       CH|       Budget|      Intnl_discount| 68.09|       3|       ipho|    T|
|1315|     M|     S|       1|  71329.90|        Y|42.233333|        Single|   null|       14.27|            3|45.92|      0|       CC|       Budget|            Standard| 63.75|       1|       ipho|    T|
+----+------+------+--------+----------+---------+---------+--------------+-------+------------+-------------+-----+-------+---------+-------------+--------------------+------+--------+-----------+-----+
only showing top 5 rows

root
 |-- ID: integer (nullable = true)
 |-- GENDER: string (nullable = true)
 |-- STATUS: string (nullable = true)
 |-- CHILDREN: integer (nullable = true)
 |-- EST_INCOME: decimal(8,2) (nullable = true)
 |-- CAR_OWNER: string (nullable = true)
 |-- AGE: decimal(14,6) (nullable = true)
 |-- MARITAL_STATUS: string (nullable = true)
 |-- ZIPCODE: integer (nullable = true)
 |-- LONGDISTANCE: decimal(6,2) (nullable = true)
 |-- INTERNATIONAL: integer (nullable = true)
 |-- LOCAL: decimal(7,2) (nullable = true)
 |-- DROPPED: integer (nullable = true)
 |-- PAYMETHOD: string (nullable = true)
 |-- LOCALBILLTYPE: string (nullable = true)
 |-- LONGDISTANCEBILLTYPE: string (nullable = true)
 |-- USAGE: decimal(7,2) (nullable = true)
 |-- RATEPLAN: integer (nullable = true)
 |-- DEVICEOWNED: string (nullable = true)
 |-- CHURN: string (nullable = true)

dataSet = dsx_core_utils.get_remote_data_set_info('CALLNOTES')
dataSource = dsx_core_utils.get_data_source_info(dataSet['datasource'])
print(dataSource)
dbTableOrQuery = dataSet['schema'] + '.' + dataSet['table']
print(dbTableOrQuery)
sparkSession = SparkSession(sc).builder.getOrCreate()
df_call_notes = sparkSession.read.format("jdbc").option("url", dataSource['URL']).option("dbtable",dbTableOrQuery).option("user",'BLUADMIN').option("password","changemeplease").load()
df_call_notes.show(5)
df_call_notes.printSchema()

{u'description': u'', u'URL': u'jdbc:db2://172.16.40.131:32166/BLUDB', 'driver_class': 'com.ibm.db2.jcc.DB2Driver', u'dsx_artifact_type': u'datasource', u'shared': True, u'type': u'DB2', u'name': u'CUSTOMER'}
BLUADMIN.CALLNOTES
+----+--------------------+----------+--------------+--------+
|  ID|            COMMENTS|SENTIMENTS|      KEYWORD1|KEYWORD2|
+----+--------------------+----------+--------------+--------+
|2253|Wants to change a...|      null|update records| address|
|2254|Wants to change a...|      null|update records| address|
|2255|Wants to change a...|      null|update records| address|
|2256|Wants to change a...|      null|update records| address|
|2257|Needed help figur...|analytical|       billing| charges|
+----+--------------------+----------+--------------+--------+
only showing top 5 rows

root
 |-- ID: integer (nullable = true)
 |-- COMMENTS: string (nullable = true)
 |-- SENTIMENTS: string (nullable = true)
 |-- KEYWORD1: string (nullable = true)
 |-- KEYWORD2: string (nullable = true)

dataSet = dsx_core_utils.get_remote_data_set_info('CAMPAIGNRESPONSES_EXPANDED')
dataSource = dsx_core_utils.get_data_source_info(dataSet['datasource'])
print(dataSource)
dbTableOrQuery = dataSet['schema'] + '.' + dataSet['table']
print(dbTableOrQuery)
sparkSession = SparkSession(sc).builder.getOrCreate()
df_campaign_responses = sparkSession.read.format("jdbc").option("url", dataSource['URL']).option("dbtable",dbTableOrQuery).option("user",'BLUADMIN').option("password","changemeplease").load()
df_campaign_responses.show(5)
df_campaign_responses.printSchema()

{u'description': u'', u'URL': u'jdbc:db2://172.16.40.131:32166/BLUDB', 'driver_class': 'com.ibm.db2.jcc.DB2Driver', u'dsx_artifact_type': u'datasource', u'shared': True, u'type': u'DB2', u'name': u'CUSTOMER'}
BLUADMIN.CAMPAIGNRESPONSES_EXPANDED
+----+------------------+---------------------------+-------------------------------------+
|  ID|RESPONDED_CAMPAIGN|OWNS_MULTIPLE_PHONE_NUMBERS|AVERAGE_TEXT_MESSAGES__90_DAY_PERIOD_|
+----+------------------+---------------------------+-------------------------------------+
|3064|       Kids Tablet|                          Y|                                 1561|
|3077|       Kids Tablet|                          Y|                                 1225|
|3105|       Kids Tablet|                          Y|                                 1661|
|3106|       Kids Tablet|                          N|                                 2498|
|3108|       Kids Tablet|                          N|                                 1118|
+----+------------------+---------------------------+-------------------------------------+
only showing top 5 rows

root
 |-- ID: integer (nullable = true)
 |-- RESPONDED_CAMPAIGN: string (nullable = true)
 |-- OWNS_MULTIPLE_PHONE_NUMBERS: string (nullable = true)
 |-- AVERAGE_TEXT_MESSAGES__90_DAY_PERIOD_: integer (nullable = true)

Data Preparation

The next few steps involve a series of data preparation tasks such as filling the missing values, joining datasets etc. The following cell fills the null values for average SMS count and replaces Nulls with spaces for other fields.

df_campaign_responses = df_campaign_responses.na.fill({'AVERAGE_TEXT_MESSAGES__90_DAY_PERIOD_':'0'})
df_call_notes = df_call_notes.na.fill({'SENTIMENTS':' '})
df_call_notes = df_call_notes.na.fill({'KEYWORD1':' '})
df_call_notes = df_call_notes.na.fill({'KEYWORD2':' '})

In the following cell we join some of the customer and call note data sources using the ID field. This ID field is the one coming from the CUSTOMER DB2 transactional database.

data_joined_callnotes_churn = df_call_notes.join(df_customer_transactions,df_call_notes['ID']==df_customer_transactions['ID'],'inner').select(df_call_notes['SENTIMENTS'],df_call_notes['KEYWORD1'],df_call_notes['KEYWORD2'],df_customer_transactions['*'])
data_joined_callnotes_churn_campaign = df_campaign_responses.join(data_joined_callnotes_churn,df_campaign_responses['ID']==data_joined_callnotes_churn['ID'],'inner').select(data_joined_callnotes_churn['*'],df_campaign_responses['RESPONDED_CAMPAIGN'],df_campaign_responses['OWNS_MULTIPLE_PHONE_NUMBERS'],df_campaign_responses['AVERAGE_TEXT_MESSAGES__90_DAY_PERIOD_'])
data_joined_callnotes_churn_campaign.take(5)

[Row(SENTIMENTS=u' ', KEYWORD1=u'help', KEYWORD2=u'support', ID=148, GENDER=u'M', STATUS=u'M', CHILDREN=2, EST_INCOME=Decimal('91272.20'), CAR_OWNER=u'Y', AGE=Decimal('25.033333'), MARITAL_STATUS=u'Married', ZIPCODE=None, LONGDISTANCE=Decimal('26.99'), INTERNATIONAL=0, LOCAL=Decimal('13.01'), DROPPED=0, PAYMETHOD=u'CC', LOCALBILLTYPE=u'FreeLocal', LONGDISTANCEBILLTYPE=u'Standard', USAGE=Decimal('40.00'), RATEPLAN=3, DEVICEOWNED=u'ipho', CHURN=u'F', RESPONDED_CAMPAIGN=u'Android Phone', OWNS_MULTIPLE_PHONE_NUMBERS=u'N', AVERAGE_TEXT_MESSAGES__90_DAY_PERIOD_=1900),
 Row(SENTIMENTS=u' ', KEYWORD1=u'update records', KEYWORD2=u'address', ID=463, GENDER=u'M', STATUS=u'M', CHILDREN=0, EST_INCOME=Decimal('69168.40'), CAR_OWNER=u'Y', AGE=Decimal('62.426667'), MARITAL_STATUS=u'Married', ZIPCODE=None, LONGDISTANCE=Decimal('14.16'), INTERNATIONAL=6, LOCAL=Decimal('214.73'), DROPPED=0, PAYMETHOD=u'CC', LOCALBILLTYPE=u'Budget', LONGDISTANCEBILLTYPE=u'Standard', USAGE=Decimal('234.91'), RATEPLAN=2, DEVICEOWNED=u'sam', CHURN=u'T', RESPONDED_CAMPAIGN=u'Dual SIM', OWNS_MULTIPLE_PHONE_NUMBERS=u'Y', AVERAGE_TEXT_MESSAGES__90_DAY_PERIOD_=1586),
 Row(SENTIMENTS=u'analytical', KEYWORD1=u'billing', KEYWORD2=u'charges', ID=471, GENDER=u'M', STATUS=u'M', CHILDREN=2, EST_INCOME=Decimal('90103.70'), CAR_OWNER=u'N', AGE=Decimal('34.946667'), MARITAL_STATUS=u'Married', ZIPCODE=None, LONGDISTANCE=Decimal('12.23'), INTERNATIONAL=8, LOCAL=Decimal('45.34'), DROPPED=0, PAYMETHOD=u'CC', LOCALBILLTYPE=u'Budget', LONGDISTANCEBILLTYPE=u'Intnl_discount', USAGE=Decimal('66.45'), RATEPLAN=3, DEVICEOWNED=u'sam', CHURN=u'F', RESPONDED_CAMPAIGN=u'More Storage', OWNS_MULTIPLE_PHONE_NUMBERS=u'N', AVERAGE_TEXT_MESSAGES__90_DAY_PERIOD_=1114),
 Row(SENTIMENTS=u'satisfied', KEYWORD1=u'battery', KEYWORD2=u'unpredictability', ID=1238, GENDER=u'F', STATUS=u'M', CHILDREN=2, EST_INCOME=Decimal('3193.60'), CAR_OWNER=u'N', AGE=Decimal('54.046667'), MARITAL_STATUS=u'Married', ZIPCODE=None, LONGDISTANCE=Decimal('4.19'), INTERNATIONAL=0, LOCAL=Decimal('114.62'), DROPPED=1, PAYMETHOD=u'CH', LOCALBILLTYPE=u'Budget', LONGDISTANCEBILLTYPE=u'Standard', USAGE=Decimal('118.82'), RATEPLAN=3, DEVICEOWNED=u'ipho', CHURN=u'F', RESPONDED_CAMPAIGN=u'Large Display', OWNS_MULTIPLE_PHONE_NUMBERS=u'N', AVERAGE_TEXT_MESSAGES__90_DAY_PERIOD_=1697),
 Row(SENTIMENTS=u'frustrated', KEYWORD1=u'charger', KEYWORD2=u'switch carrier', ID=1342, GENDER=u'M', STATUS=u'S', CHILDREN=0, EST_INCOME=Decimal('94928.30'), CAR_OWNER=u'N', AGE=Decimal('40.180000'), MARITAL_STATUS=u'Single', ZIPCODE=None, LONGDISTANCE=Decimal('14.42'), INTERNATIONAL=5, LOCAL=Decimal('73.74'), DROPPED=0, PAYMETHOD=u'CC', LOCALBILLTYPE=u'FreeLocal', LONGDISTANCEBILLTYPE=u'Standard', USAGE=Decimal('93.78'), RATEPLAN=1, DEVICEOWNED=u'ipho', CHURN=u'T', RESPONDED_CAMPAIGN=u'Dual SIM', OWNS_MULTIPLE_PHONE_NUMBERS=u'Y', AVERAGE_TEXT_MESSAGES__90_DAY_PERIOD_=1540)]

The following code block is intended to get a feel for Spark DataFrame APIs. We attempt to fix some of the column titles to promote readability, and also remove a duplicate column (Status and Marital Status are the same). Finally convert the DataFrame to Python Pandas structure for visualization. Since some fields are string types from DB2 tables, let us change some of them to other types

# Change some column names

data_joined_callnotes_churn_campaign = data_joined_callnotes_churn_campaign.withColumnRenamed("SENTIMENTS", "SENTIMENT").withColumnRenamed("OWNS_MULTIPLE_PHONE_NUMBERS","OMPN")
data_joined_callnotes_churn_campaign = data_joined_callnotes_churn_campaign.withColumnRenamed("KEYWORD1", "KEYWORD_COMPONENT").withColumnRenamed("KEYWORD2","KEYWORD_QUERY")
data_joined_callnotes_churn_campaign = data_joined_callnotes_churn_campaign.withColumnRenamed("AVERAGE_TEXT_MESSAGES__90_DAY_PERIOD_", "SMSCOUNT").withColumnRenamed("CAR_OWNER","CAROWNERSHIP")
data_joined_callnotes_churn_campaign = data_joined_callnotes_churn_campaign.withColumnRenamed("MARITAL_STATUS", "MARITALSTATUS").withColumnRenamed("EST_INCOME","INCOME")
data_joined_callnotes_churn_campaign = data_joined_callnotes_churn_campaign.drop('Status')

# Change some of the data types

data_joined_callnotes_churn_campaign = data_joined_callnotes_churn_campaign.withColumn("CHILDREN", data_joined_callnotes_churn_campaign["CHILDREN"].cast(IntegerType()))
data_joined_callnotes_churn_campaign = data_joined_callnotes_churn_campaign.withColumn("INCOME", data_joined_callnotes_churn_campaign["INCOME"].cast(DecimalType()))
data_joined_callnotes_churn_campaign = data_joined_callnotes_churn_campaign.withColumn("AGE", data_joined_callnotes_churn_campaign["AGE"].cast(IntegerType()))
data_joined_callnotes_churn_campaign = data_joined_callnotes_churn_campaign.withColumn("LONGDISTANCE", data_joined_callnotes_churn_campaign["LONGDISTANCE"].cast(DecimalType()))
data_joined_callnotes_churn_campaign = data_joined_callnotes_churn_campaign.withColumn("INTERNATIONAL", data_joined_callnotes_churn_campaign["INTERNATIONAL"].cast(DecimalType()))
data_joined_callnotes_churn_campaign = data_joined_callnotes_churn_campaign.withColumn("LOCAL", data_joined_callnotes_churn_campaign["LOCAL"].cast(DecimalType()))
data_joined_callnotes_churn_campaign = data_joined_callnotes_churn_campaign.withColumn("DROPPED", data_joined_callnotes_churn_campaign["DROPPED"].cast(IntegerType()))
data_joined_callnotes_churn_campaign = data_joined_callnotes_churn_campaign.withColumn("USAGE", data_joined_callnotes_churn_campaign["USAGE"].cast(DecimalType()))
data_joined_callnotes_churn_campaign = data_joined_callnotes_churn_campaign.withColumn("RATEPLAN", data_joined_callnotes_churn_campaign["RATEPLAN"].cast(IntegerType()))
data_joined_callnotes_churn_campaign = data_joined_callnotes_churn_campaign.withColumn("SMSCOUNT", data_joined_callnotes_churn_campaign["SMSCOUNT"].cast(IntegerType()))
data_joined_callnotes_churn_campaign.show(10)
data_joined_callnotes_churn_campaign.printSchema()

pandas_df_callnotes_campaign_churn = data_joined_callnotes_churn_campaign.toPandas()
pandas_df_callnotes_campaign_churn.head(12)

+----------+-----------------+----------------+----+------+--------+------+------------+---+-------------+-------+------------+-------------+-----+-------+---------+-------------+--------------------+-----+--------+-----------+-----+------------------+----+--------+
| SENTIMENT|KEYWORD_COMPONENT|   KEYWORD_QUERY|  ID|GENDER|CHILDREN|INCOME|CAROWNERSHIP|AGE|MARITALSTATUS|ZIPCODE|LONGDISTANCE|INTERNATIONAL|LOCAL|DROPPED|PAYMETHOD|LOCALBILLTYPE|LONGDISTANCEBILLTYPE|USAGE|RATEPLAN|DEVICEOWNED|CHURN|RESPONDED_CAMPAIGN|OMPN|SMSCOUNT|
+----------+-----------------+----------------+----+------+--------+------+------------+---+-------------+-------+------------+-------------+-----+-------+---------+-------------+--------------------+-----+--------+-----------+-----+------------------+----+--------+
|          |             help|         support| 148|     M|       2| 91272|           Y| 25|      Married|   null|          27|            0|   13|      0|       CC|    FreeLocal|            Standard|   40|       3|       ipho|    F|     Android Phone|   N|    1900|
|          |   update records|         address| 463|     M|       0| 69168|           Y| 62|      Married|   null|          14|            6|  215|      0|       CC|       Budget|            Standard|  235|       2|        sam|    T|          Dual SIM|   Y|    1586|
|analytical|          billing|         charges| 471|     M|       2| 90104|           N| 34|      Married|   null|          12|            8|   45|      0|       CC|       Budget|      Intnl_discount|   66|       3|        sam|    F|      More Storage|   N|    1114|
| satisfied|          battery|unpredictability|1238|     F|       2|  3194|           N| 54|      Married|   null|           4|            0|  115|      1|       CH|       Budget|            Standard|  119|       3|       ipho|    F|     Large Display|   N|    1697|
|frustrated|          charger|  switch carrier|1342|     M|       0| 94928|           N| 40|       Single|   null|          14|            5|   74|      0|       CC|    FreeLocal|            Standard|   94|       1|       ipho|    T|          Dual SIM|   Y|    1540|
|analytical|       new number|   customer care|1591|     F|       0| 45613|           N| 14|       Single|   null|          13|            0|  311|      0|       CC|       Budget|            Standard|  324|       4|       ipho|    F|     Android Phone|   N|    1681|
|frustrated|  call forwarding|        features|1645|     M|       1| 92648|           N| 56|       Single|   null|          16|            5|   10|      0|       CC|       Budget|            Standard|   32|       4|       ipho|    T|     Android Phone|   N|    2291|
|analytical|           tablet|    new offering|1959|     F|       1| 13829|           N| 19|      Married|   null|          42|            0|  160|      0|       CC|    FreeLocal|            Standard|  177|       2|       ipho|    T|     Android Phone|   N|    1821|
|analytical|        rate plan|   customer care|1959|     F|       1| 13829|           N| 19|      Married|   null|          42|            0|  160|      0|       CC|    FreeLocal|            Standard|  177|       2|       ipho|    T|     Android Phone|   N|    1821|
|          |       new number|         service|2122|     M|       2| 49911|           Y| 51|      Married|   null|          27|            0|   24|      0|       CC|       Budget|            Standard|   51|       1|       ipho|    F|     Android Phone|   N|    1487|
+----------+-----------------+----------------+----+------+--------+------+------------+---+-------------+-------+------------+-------------+-----+-------+---------+-------------+--------------------+-----+--------+-----------+-----+------------------+----+--------+
only showing top 10 rows

root
 |-- SENTIMENT: string (nullable = false)
 |-- KEYWORD_COMPONENT: string (nullable = false)
 |-- KEYWORD_QUERY: string (nullable = false)
 |-- ID: integer (nullable = true)
 |-- GENDER: string (nullable = true)
 |-- CHILDREN: integer (nullable = true)
 |-- INCOME: decimal(10,0) (nullable = true)
 |-- CAROWNERSHIP: string (nullable = true)
 |-- AGE: integer (nullable = true)
 |-- MARITALSTATUS: string (nullable = true)
 |-- ZIPCODE: integer (nullable = true)
 |-- LONGDISTANCE: decimal(10,0) (nullable = true)
 |-- INTERNATIONAL: decimal(10,0) (nullable = true)
 |-- LOCAL: decimal(10,0) (nullable = true)
 |-- DROPPED: integer (nullable = true)
 |-- PAYMETHOD: string (nullable = true)
 |-- LOCALBILLTYPE: string (nullable = true)
 |-- LONGDISTANCEBILLTYPE: string (nullable = true)
 |-- USAGE: decimal(10,0) (nullable = true)
 |-- RATEPLAN: integer (nullable = true)
 |-- DEVICEOWNED: string (nullable = true)
 |-- CHURN: string (nullable = true)
 |-- RESPONDED_CAMPAIGN: string (nullable = true)
 |-- OMPN: string (nullable = true)
 |-- SMSCOUNT: integer (nullable = true)

	SENTIMENT	KEYWORD_COMPONENT	KEYWORD_QUERY	ID	GENDER	CHILDREN	INCOME	CAROWNERSHIP	AGE	MARITALSTATUS	...	PAYMETHOD	LOCALBILLTYPE	LONGDISTANCEBILLTYPE	USAGE	RATEPLAN	DEVICEOWNED	CHURN	RESPONDED_CAMPAIGN	OMPN	SMSCOUNT
0		help	support	148	M	2	91272	Y	25	Married	...	CC	FreeLocal	Standard	40	3	ipho	F	Android Phone	N	1900
1		update records	address	463	M	0	69168	Y	62	Married	...	CC	Budget	Standard	235	2	sam	T	Dual SIM	Y	1586
2	analytical	billing	charges	471	M	2	90104	N	34	Married	...	CC	Budget	Intnl_discount	66	3	sam	F	More Storage	N	1114
3	satisfied	battery	unpredictability	1238	F	2	3194	N	54	Married	...	CH	Budget	Standard	119	3	ipho	F	Large Display	N	1697
4	frustrated	charger	switch carrier	1342	M	0	94928	N	40	Single	...	CC	FreeLocal	Standard	94	1	ipho	T	Dual SIM	Y	1540
5	analytical	new number	customer care	1591	F	0	45613	N	14	Single	...	CC	Budget	Standard	324	4	ipho	F	Android Phone	N	1681
6	frustrated	call forwarding	features	1645	M	1	92648	N	56	Single	...	CC	Budget	Standard	32	4	ipho	T	Android Phone	N	2291
7	analytical	tablet	new offering	1959	F	1	13829	N	19	Married	...	CC	FreeLocal	Standard	177	2	ipho	T	Android Phone	N	1821
8	analytical	rate plan	customer care	1959	F	1	13829	N	19	Married	...	CC	FreeLocal	Standard	177	2	ipho	T	Android Phone	N	1821
9		new number	service	2122	M	2	49911	Y	51	Married	...	CC	Budget	Standard	51	1	ipho	F	Android Phone	N	1487
10		lost phone	service suspension	2366	M	2	1765	N	18	Married	...	CC	FreeLocal	Standard	48	4	sam	F	More Storage	Y	1865
11	frustrated	data plan	speed	2659	F	1	98680	Y	43	Married	...	CH	FreeLocal	Intnl_discount	26	3	sam	F	Windows Phone	N	1732

12 rows × 25 columns

The following brunel based visualization can also be performed from Data Refinery. Shown here to get the feel for APIs

%brunel data('pandas_df_callnotes_campaign_churn') bar y(#count) stack polar color(Sentiment) sort(#count) label(Sentiment, ' (', #count, '%)') tooltip(#all) percent(#count) legends(none)

wzxhzdk:9

wzxhzdk:10

##### The following cell shows an example of how pixiedust can be used to build interactive dashboards, and how it can be exported out wzxhzdk:11

Hey, there's something awesome here! To see it, open this notebook outside GitHub, in a viewer like Jupyter

##### Building RandomForest based classifier wzxhzdk:12 wzxhzdk:13 ##### Split the dataset into training and test using 70:30 split ratio and build the model wzxhzdk:14 ##### Testing the test dataset wzxhzdk:15

	ID	CHURN	label	predictedLabel	prediction	probability
0	148	F	0.0	F	0.0	[0.8728832555, 0.1271167445]
1	1959	T	1.0	T	1.0	[0.285629950796, 0.714370049204]
2	3749	T	1.0	T	1.0	[0.165697434112, 0.834302565888]
3	2659	F	0.0	F	0.0	[0.714051839779, 0.285948160221]
4	1238	F	0.0	F	0.0	[0.87563833506, 0.12436166494]
5	1460	F	0.0	F	0.0	[0.925838160212, 0.0741618397881]

##### Model Evaluation wzxhzdk:16 Precision model1 = 0.89. Area under ROC curve = 0.88. wzxhzdk:17 wzxhzdk:18 wzxhzdk:19 ##### Save pipeline and model artifacts to Machine Learning repository: wzxhzdk:20 wzxhzdk:21 modelType: sparkml-model-2.0 creationTime: 2018-02-19 22:18:53.157000+00:00 modelVersionHref: https://internal-nginx-svc.ibm-private-cloud.svc.cluster.local:12443/v2/artifacts/models/56af74d4-93f4-41bd-8069-850a5117475d/versions/4de851a3-4d05-4957-a27f-6701418e6ce4 label: CHURN