
Evaluate and predict customer churn

This notebook is an adaptation of the work done by Sidney Phoon and Eleva Lowery, with the following modifications:

  • Use datasets persisted in DB2 Warehouse running on ICP
  • Use additional datasets from multiple data sources, such as Cloudant on the public cloud and Amazon S3
  • Use Watson Data Platform for data ingestion and preparation (using Data Refinery)
  • Deploy and run the notebook on DSX Enterprise running on IBM Cloud Private (ICP)
  • Run the Spark machine learning jobs on ICP worker nodes
  • Document some actions for a beginner data scientist / developer who wants to understand what's going on
  • The web application has been moved to a separate git project

The goal is to demonstrate how to build a predictive model with the Spark machine learning API (SparkML) to predict customer churn, and how to deploy it for scoring either in Machine Learning (ML) running on ICP or in the Watson Machine Learning service on the IBM public cloud.

Scope

Many industries face the problem of customers moving to competitors when product differentiation is weak or when there are customer support issues. The telecom industry, with its mobile, internet, and IP TV offerings, is a good illustration of this problem.

Notebook explanations

The notebook aims to follow the classical data science modeling steps:

  • load the data
  • prepare the data
  • analyze the data (iterate on those two activities)
  • build a model
  • validate the accuracy of the model
  • deploy the model
  • consume the model as a service

This Jupyter notebook uses Apache Spark to run the machine learning jobs that build decision tree and random forest classifiers to assess when a customer is at risk of moving to a competitor. Apache Spark offers a Python module called pyspark to operate on data and use ML constructs.

Start with all imports

A best practice for notebook implementation is to do all the imports at the top of the notebook:

  • SQLContext: a Spark module to process structured data
  • SparkConf: to access the Spark cluster configuration and be able to execute queries
  • jaydebeapi: used to connect to the DB2 Warehouse where the customer data are persisted (we assume they are already loaded)
  • ibmdbpy: interface for data manipulation and access to in-database algorithms in IBM dashDB and IBM DB2
  • pandas: Python library for data analysis
  • brunel: API and tool to visualize data quickly
  • pixiedust: visualize data inside Jupyter notebooks

The first cell below executes some system commands to update the kernel with the required dependent libraries.

# Needed to access data from IBM Cloud Object Storage
!pip install --upgrade ibm-cos-sdk
Requirement already up-to-date: ibm-cos-sdk in /gpfs/global_fs01/sym_shared/YPProdSpark/user/s796-0c7a8ec08c3a1c-8b16cafc4fd4/.local/lib/python3.5/site-packages
Requirement already up-to-date: ibm-cos-sdk-core==2.*,>=2.0.0 in /gpfs/global_fs01/sym_shared/YPProdSpark/user/s796-0c7a8ec08c3a1c-8b16cafc4fd4/.local/lib/python3.5/site-packages (from ibm-cos-sdk)
Requirement already up-to-date: ibm-cos-sdk-s3transfer==2.*,>=2.0.0 in /gpfs/global_fs01/sym_shared/YPProdSpark/user/s796-0c7a8ec08c3a1c-8b16cafc4fd4/.local/lib/python3.5/site-packages (from ibm-cos-sdk)
Requirement already up-to-date: python-dateutil<3.0.0,>=2.1 in /gpfs/global_fs01/sym_shared/YPProdSpark/user/s796-0c7a8ec08c3a1c-8b16cafc4fd4/.local/lib/python3.5/site-packages (from ibm-cos-sdk-core==2.*,>=2.0.0->ibm-cos-sdk)
Requirement already up-to-date: docutils>=0.10 in /gpfs/global_fs01/sym_shared/YPProdSpark/user/s796-0c7a8ec08c3a1c-8b16cafc4fd4/.local/lib/python3.5/site-packages (from ibm-cos-sdk-core==2.*,>=2.0.0->ibm-cos-sdk)
Requirement already up-to-date: jmespath<1.0.0,>=0.7.1 in /gpfs/global_fs01/sym_shared/YPProdSpark/user/s796-0c7a8ec08c3a1c-8b16cafc4fd4/.local/lib/python3.5/site-packages (from ibm-cos-sdk-core==2.*,>=2.0.0->ibm-cos-sdk)
Requirement already up-to-date: six>=1.5 in /gpfs/global_fs01/sym_shared/YPProdSpark/user/s796-0c7a8ec08c3a1c-8b16cafc4fd4/.local/lib/python3.5/site-packages (from python-dateutil<3.0.0,>=2.1->ibm-cos-sdk-core==2.*,>=2.0.0->ibm-cos-sdk)
# Required for accessing data on IBM Cloud Object Storage
!pip install --upgrade boto3
Requirement already up-to-date: boto3 in /gpfs/global_fs01/sym_shared/YPProdSpark/user/s796-0c7a8ec08c3a1c-8b16cafc4fd4/.local/lib/python3.5/site-packages
Requirement already up-to-date: botocore<1.9.0,>=1.8.36 in /gpfs/global_fs01/sym_shared/YPProdSpark/user/s796-0c7a8ec08c3a1c-8b16cafc4fd4/.local/lib/python3.5/site-packages (from boto3)
Requirement already up-to-date: s3transfer<0.2.0,>=0.1.10 in /gpfs/global_fs01/sym_shared/YPProdSpark/user/s796-0c7a8ec08c3a1c-8b16cafc4fd4/.local/lib/python3.5/site-packages (from boto3)
Requirement already up-to-date: jmespath<1.0.0,>=0.7.1 in /gpfs/global_fs01/sym_shared/YPProdSpark/user/s796-0c7a8ec08c3a1c-8b16cafc4fd4/.local/lib/python3.5/site-packages (from boto3)
Requirement already up-to-date: python-dateutil<3.0.0,>=2.1 in /gpfs/global_fs01/sym_shared/YPProdSpark/user/s796-0c7a8ec08c3a1c-8b16cafc4fd4/.local/lib/python3.5/site-packages (from botocore<1.9.0,>=1.8.36->boto3)
Requirement already up-to-date: docutils>=0.10 in /gpfs/global_fs01/sym_shared/YPProdSpark/user/s796-0c7a8ec08c3a1c-8b16cafc4fd4/.local/lib/python3.5/site-packages (from botocore<1.9.0,>=1.8.36->boto3)
Requirement already up-to-date: six>=1.5 in /gpfs/global_fs01/sym_shared/YPProdSpark/user/s796-0c7a8ec08c3a1c-8b16cafc4fd4/.local/lib/python3.5/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.9.0,>=1.8.36->boto3)
# The following will be needed if you want to download datasets from outside the WDP environment for any reason. But we won't be needing this in the current sample exercise
!pip install --upgrade wget
Requirement already up-to-date: wget in /gpfs/global_fs01/sym_shared/YPProdSpark/user/s796-0c7a8ec08c3a1c-8b16cafc4fd4/.local/lib/python3.5/site-packages
# Library required for pixiedust - a visualization and dashboarding framework
!pip install --user --upgrade pixiedust
Requirement already up-to-date: pixiedust in /gpfs/global_fs01/sym_shared/YPProdSpark/user/s796-0c7a8ec08c3a1c-8b16cafc4fd4/.local/lib/python3.5/site-packages
Requirement already up-to-date: astunparse in /usr/local/src/conda3_runtime.v27/4.1.1/lib/python3.5/site-packages (from pixiedust)
Requirement already up-to-date: lxml in /gpfs/global_fs01/sym_shared/YPProdSpark/user/s796-0c7a8ec08c3a1c-8b16cafc4fd4/.local/lib/python3.5/site-packages (from pixiedust)
Requirement already up-to-date: mpld3 in /gpfs/global_fs01/sym_shared/YPProdSpark/user/s796-0c7a8ec08c3a1c-8b16cafc4fd4/.local/lib/python3.5/site-packages (from pixiedust)
Requirement already up-to-date: geojson in /gpfs/global_fs01/sym_shared/YPProdSpark/user/s796-0c7a8ec08c3a1c-8b16cafc4fd4/.local/lib/python3.5/site-packages (from pixiedust)
Requirement already up-to-date: markdown in /gpfs/global_fs01/sym_shared/YPProdSpark/user/s796-0c7a8ec08c3a1c-8b16cafc4fd4/.local/lib/python3.5/site-packages (from pixiedust)
Requirement already up-to-date: wheel<1.0,>=0.23.0 in /gpfs/global_fs01/sym_shared/YPProdSpark/user/s796-0c7a8ec08c3a1c-8b16cafc4fd4/.local/lib/python3.5/site-packages (from astunparse->pixiedust)
Requirement already up-to-date: six<2.0,>=1.6.1 in /gpfs/global_fs01/sym_shared/YPProdSpark/user/s796-0c7a8ec08c3a1c-8b16cafc4fd4/.local/lib/python3.5/site-packages (from astunparse->pixiedust)
# Needed to deploy the model on Watson Machine Learning Service
!pip install --upgrade watson-machine-learning-client
Requirement already up-to-date: watson-machine-learning-client in /gpfs/global_fs01/sym_shared/YPProdSpark/user/s796-0c7a8ec08c3a1c-8b16cafc4fd4/.local/lib/python3.5/site-packages
Requirement already up-to-date: lomond in /usr/local/src/conda3_runtime.v27/4.1.1/lib/python3.5/site-packages (from watson-machine-learning-client)
Requirement already up-to-date: tabulate in /usr/local/src/conda3_runtime.v27/4.1.1/lib/python3.5/site-packages (from watson-machine-learning-client)
Requirement already up-to-date: tqdm in /gpfs/global_fs01/sym_shared/YPProdSpark/user/s796-0c7a8ec08c3a1c-8b16cafc4fd4/.local/lib/python3.5/site-packages (from watson-machine-learning-client)
Requirement already up-to-date: requests in /gpfs/global_fs01/sym_shared/YPProdSpark/user/s796-0c7a8ec08c3a1c-8b16cafc4fd4/.local/lib/python3.5/site-packages (from watson-machine-learning-client)
Requirement already up-to-date: urllib3 in /gpfs/global_fs01/sym_shared/YPProdSpark/user/s796-0c7a8ec08c3a1c-8b16cafc4fd4/.local/lib/python3.5/site-packages (from watson-machine-learning-client)
Requirement already up-to-date: certifi in /gpfs/global_fs01/sym_shared/YPProdSpark/user/s796-0c7a8ec08c3a1c-8b16cafc4fd4/.local/lib/python3.5/site-packages (from watson-machine-learning-client)
Requirement already up-to-date: pandas in /gpfs/global_fs01/sym_shared/YPProdSpark/user/s796-0c7a8ec08c3a1c-8b16cafc4fd4/.local/lib/python3.5/site-packages (from watson-machine-learning-client)
Requirement already up-to-date: six>=1.10.0 in /gpfs/global_fs01/sym_shared/YPProdSpark/user/s796-0c7a8ec08c3a1c-8b16cafc4fd4/.local/lib/python3.5/site-packages (from lomond->watson-machine-learning-client)
Requirement already up-to-date: chardet<3.1.0,>=3.0.2 in /gpfs/global_fs01/sym_shared/YPProdSpark/user/s796-0c7a8ec08c3a1c-8b16cafc4fd4/.local/lib/python3.5/site-packages (from requests->watson-machine-learning-client)
Requirement already up-to-date: idna<2.7,>=2.5 in /gpfs/global_fs01/sym_shared/YPProdSpark/user/s796-0c7a8ec08c3a1c-8b16cafc4fd4/.local/lib/python3.5/site-packages (from requests->watson-machine-learning-client)
Requirement already up-to-date: python-dateutil>=2 in /gpfs/global_fs01/sym_shared/YPProdSpark/user/s796-0c7a8ec08c3a1c-8b16cafc4fd4/.local/lib/python3.5/site-packages (from pandas->watson-machine-learning-client)
Requirement already up-to-date: numpy>=1.9.0 in /gpfs/global_fs01/sym_shared/YPProdSpark/user/s796-0c7a8ec08c3a1c-8b16cafc4fd4/.local/lib/python3.5/site-packages (from pandas->watson-machine-learning-client)
Requirement already up-to-date: pytz>=2011k in /gpfs/global_fs01/sym_shared/YPProdSpark/user/s796-0c7a8ec08c3a1c-8b16cafc4fd4/.local/lib/python3.5/site-packages (from pandas->watson-machine-learning-client)
import pyspark
import pandas as pd
import brunel
import numpy as np
from pyspark.sql import SQLContext
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType
from pyspark.sql.types import DecimalType
from pyspark.sql.types import IntegerType
from pyspark.sql.types import LongType
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorIndexer, IndexToString
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pixiedust.display import *
import ibm_boto3

Load data using Watson Data Platform (WDP)

Watson Data Platform (WDP) can be used to bring in data from multiple sources, including but not limited to files and data stores on the cloud as well as on premises. WDP includes features to connect to data sources, bring in the data, refine it, and then perform analytics.

In this sample we use the WDP approach. We connect to Amazon S3, Cloudant on the IBM public (or private) cloud, and DB2 Warehouse on the IBM public or private cloud. Once we bring in the data, we refine / cleanse it using Data Refinery and export the result as a CSV file for training the analytical model. These steps have already been accomplished in WDP, so we can start loading that data for analytics in the following cells.

# Load customer information along with churn status. We read this from the CSV file prepared for training purposes
import ibmos2spark

# @hidden_cell
credentials = {
    'endpoint': 'https://s3-api.us-geo.objectstorage.service.networklayer.com',
    'api_key': '',
    'service_id': 'iam-ServiceId-de6a2704-3436-4927-8ceb-a2e3dfc3288e',
    'iam_service_endpoint': 'https://iam.ng.bluemix.net/oidc/token'}

configuration_name = 'os_d21bf0a6588f494d822173729799c934_configs'
cos = ibmos2spark.CloudObjectStorage(sc, credentials, configuration_name, 'bluemix_cos')

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df_customer_transactions = spark.read\
  .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat')\
  .option('header', 'true')\
  .load(cos.url('data_asset/YTc2MDUyNjAtYzg5OC00MDk5LTgyZGItYThlZjI4ZDczNmZl_data_6283e5af-2e69-426e-b327-1a3bbc23ded3.csv', 'customerchurnproject86ce96f6a9384a669a14bd5dd9b3028e'))
df_customer_transactions.take(5)
df_customer_transactions.printSchema()
root
 |-- ID: string (nullable = true)
 |-- Gender: string (nullable = true)
 |-- Status: string (nullable = true)
 |-- Children: string (nullable = true)
 |-- Est Income: string (nullable = true)
 |-- Car Owner: string (nullable = true)
 |-- Age: string (nullable = true)
 |-- Marital Status: string (nullable = true)
 |-- zipcode: string (nullable = true)
 |-- LongDistance: string (nullable = true)
 |-- International: string (nullable = true)
 |-- Local: string (nullable = true)
 |-- Dropped: string (nullable = true)
 |-- Paymethod: string (nullable = true)
 |-- LocalBilltype: string (nullable = true)
 |-- LongDistanceBilltype: string (nullable = true)
 |-- Usage: string (nullable = true)
 |-- RatePlan: string (nullable = true)
 |-- DeviceOwned: string (nullable = true)
 |-- CHURN: string (nullable = true)
# Load the call notes dataset
df_call_notes = spark.read\
  .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat')\
  .option('header', 'true')\
  .load(cos.url('data_asset/MTIwYWQwZDctMmMyMC00NTkzLWI3YjItNDI4NGVjMzhlYjA5_data_0601330c-2331-46df-b9bc-df5ab0eb9a41.csv', 'customerchurnproject86ce96f6a9384a669a14bd5dd9b3028e'))
df_call_notes.take(5)
df_call_notes.describe()
df_call_notes.show(25)
+--------------------+---+------------+-------------+----------+
|            Comments| ID|    Keyword1|     Keyword2|Sentiments|
+--------------------+---+------------+-------------+----------+
|Asked about stora...|  1|         sim|      storage|analytical|
|Asked about low-e...|  6|basic config|   smartphone|analytical|
|Dissatisfied with...|  8|   promotion|       rebate|frustrated|
|Asked about low-e...| 11|basic config|   smartphone|analytical|
|Asked about low-e...| 14|basic config|   smartphone|analytical|
|Dissatisfied with...| 17|   promotion|       rebate|frustrated|
|Asked about low-e...| 18|basic config|   smartphone|analytical|
|Asked about low-e...| 21|basic config|   smartphone|analytical|
|Upset about the d...| 22|   data plan|        speed|frustrated|
|Asked about low-e...| 23|basic config|   smartphone|analytical|
|Asked how to inst...| 24|     sd card|         apps|analytical|
|Asked how to inst...| 29|     sd card|         apps|analytical|
|Said his battery ...| 35|     battery|    new phone|frustrated|
|Said his battery ...| 36|     battery|    new phone|frustrated|
|Said his battery ...| 37|     battery|    new phone|frustrated|
|Said his battery ...| 38|     battery|    new phone|frustrated|
|Said his battery ...| 40|     battery|    new phone|frustrated|
|He charges it and...| 42|    charging|      battery|frustrated|
|Asked how to inst...| 45|     sd card|         apps|analytical|
|Asked how to inst...| 48|     sd card|         apps|analytical|
|He expected signi...| 52|   technical|      support|frustrated|
|He asked for a ne...| 53|  new number|customer care|analytical|
|He expected signi...| 54|   technical|      support|frustrated|
|He expected signi...| 60|   technical|      support|frustrated|
|transferred to th...| 61|  supervisor|   delegation|   neutral|
+--------------------+---+------------+-------------+----------+
only showing top 25 rows
# Load customer campaign responses dataset
df_campaign_responses = spark.read\
  .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat')\
  .option('header', 'true')\
  .load(cos.url('data_asset/MmQwNjUxYmUtYjU3My00ZmViLTk0YWEtYTc5ZWExZmU3Mjg3_data_e26e9c54-7143-449c-bbf4-71998290d86a.csv', 'customerchurnproject86ce96f6a9384a669a14bd5dd9b3028e'))
df_campaign_responses.take(5)

[Row(CUSTOMERID='6', Preference='android phone', Owns Multiple Phone Numbers='N', Ave Text Msgs='1807', Owns Multiple Lines='0'),
 Row(CUSTOMERID='8', Preference='android phone', Owns Multiple Phone Numbers='N', Ave Text Msgs='1364', Owns Multiple Lines='0'),
 Row(CUSTOMERID='18', Preference='android phone', Owns Multiple Phone Numbers='N', Ave Text Msgs='1407', Owns Multiple Lines='0'),
 Row(CUSTOMERID='29', Preference='android phone', Owns Multiple Phone Numbers='N', Ave Text Msgs='1043', Owns Multiple Lines='0'),
 Row(CUSTOMERID='37', Preference='android phone', Owns Multiple Phone Numbers='N', Ave Text Msgs='841', Owns Multiple Lines='0')]
The next few steps involve a series of data preparation tasks, such as filling in missing values and joining datasets. The following cell fills the null values for the average SMS count. Note that this could also be done with Data Refinery; the snippet is shown to illustrate the API way of accomplishing the same.
df_campaign_responses = df_campaign_responses.na.fill({'Ave Text Msgs':'0'})
In the following cell we join some of our data sources. Note that we could have done some of these joins with the GUI support of Data Refinery on Watson Data Platform.
data_joined_callnotes_churn = df_call_notes.join(df_customer_transactions,df_call_notes['ID']==df_customer_transactions['ID'],'inner').select(df_call_notes['Sentiments'],df_call_notes['Keyword1'],df_call_notes['Keyword2'],df_customer_transactions['*'])
data_joined_callnotes_churn_campaign = df_campaign_responses.join(data_joined_callnotes_churn,df_campaign_responses['CUSTOMERID']==data_joined_callnotes_churn['ID'],'inner').select(data_joined_callnotes_churn['*'],df_campaign_responses['Preference'],df_campaign_responses['Owns Multiple Phone Numbers'],df_campaign_responses['Ave Text Msgs'])
data_joined_callnotes_churn_campaign.take(10)
[Row(Sentiments='analytical', Keyword1='sd card', Keyword2='apps', ID='24', Gender='F', Status='M', Children='2', Est Income='47902.00', Car Owner='N', Age='26.033333', Marital Status='Married', zipcode=None, LongDistance='17.44', International='4.94', Local='49.92', Dropped='1', Paymethod='Auto', LocalBilltype='FreeLocal', LongDistanceBilltype='Standard', Usage='72.31', RatePlan='2', DeviceOwned='moto', CHURN='F', Preference='more storage', Owns Multiple Phone Numbers='N', Ave Text Msgs='1864'),
 Row(Sentiments='frustrated', Keyword1='battery', Keyword2='new phone', ID='35', Gender='F', Status='S', Children='0', Est Income='78851.30', Car Owner='N', Age='48.373333', Marital Status='Single', zipcode=None, LongDistance='0.37', International='0.00', Local='28.66', Dropped='0', Paymethod='CC', LocalBilltype='FreeLocal', LongDistanceBilltype='Standard', Usage='29.04', RatePlan='4', DeviceOwned='ipho', CHURN='T', Preference='dual sim', Owns Multiple Phone Numbers='Y', Ave Text Msgs='1173'),
 Row(Sentiments='frustrated', Keyword1='battery', Keyword2='new phone', ID='37', Gender='F', Status='M', Children='0', Est Income='83891.90', Car Owner='Y', Age='61.020000', Marital Status='Married', zipcode=None, LongDistance='28.92', International='0.00', Local='45.47', Dropped='0', Paymethod='CH', LocalBilltype='Budget', LongDistanceBilltype='Standard', Usage='74.40', RatePlan='4', DeviceOwned='ipho', CHURN='T', Preference='android phone', Owns Multiple Phone Numbers='N', Ave Text Msgs='841'),
 Row(Sentiments='frustrated', Keyword1='battery', Keyword2='new phone', ID='38', Gender='F', Status='M', Children='2', Est Income='28220.80', Car Owner='N', Age='38.766667', Marital Status='Married', zipcode=None, LongDistance='26.49', International='0.00', Local='12.46', Dropped='0', Paymethod='CC', LocalBilltype='FreeLocal', LongDistanceBilltype='Standard', Usage='38.95', RatePlan='4', DeviceOwned='ipho', CHURN='T', Preference='windows phone', Owns Multiple Phone Numbers='N', Ave Text Msgs='1779'),
 Row(Sentiments='frustrated', Keyword1='battery', Keyword2='new phone', ID='40', Gender='F', Status='S', Children='0', Est Income='28589.10', Car Owner='N', Age='15.600000', Marital Status='Single', zipcode=None, LongDistance='13.19', International='0.00', Local='87.09', Dropped='0', Paymethod='CC', LocalBilltype='FreeLocal', LongDistanceBilltype='Standard', Usage='100.28', RatePlan='4', DeviceOwned='ipho', CHURN='T', Preference='large display', Owns Multiple Phone Numbers='N', Ave Text Msgs='1720'),
 Row(Sentiments='analytical', Keyword1='sd card', Keyword2='apps', ID='45', Gender='M', Status='S', Children='2', Est Income='89459.90', Car Owner='N', Age='53.280000', Marital Status='Single', zipcode=None, LongDistance='11.54', International='1.61', Local='22.90', Dropped='0', Paymethod='CC', LocalBilltype='FreeLocal', LongDistanceBilltype='Standard', Usage='36.05', RatePlan='2', DeviceOwned='ipho', CHURN='T', Preference='more storage', Owns Multiple Phone Numbers='N', Ave Text Msgs='1886'),
 Row(Sentiments='frustrated', Keyword1='technical', Keyword2='support', ID='52', Gender='F', Status='M', Children='2', Est Income='67388.00', Car Owner='N', Age='53.120000', Marital Status='Married', zipcode=None, LongDistance='4.79', International='0.50', Local='91.04', Dropped='1', Paymethod='CC', LocalBilltype='Budget', LongDistanceBilltype='Standard', Usage='96.33', RatePlan='3', DeviceOwned='ipho', CHURN='T', Preference='android phone', Owns Multiple Phone Numbers='N', Ave Text Msgs='1590'),
 Row(Sentiments='analytical', Keyword1='new number', Keyword2='customer care', ID='53', Gender='F', Status='M', Children='1', Est Income='57063.00', Car Owner='Y', Age='52.333333', Marital Status='Married', zipcode=None, LongDistance='16.79', International='0.00', Local='81.30', Dropped='0', Paymethod='CH', LocalBilltype='Budget', LongDistanceBilltype='Standard', Usage='98.10', RatePlan='4', DeviceOwned='ipho', CHURN='F', Preference='android phone', Owns Multiple Phone Numbers='N', Ave Text Msgs='1205'),
 Row(Sentiments='frustrated', Keyword1='technical', Keyword2='support', ID='54', Gender='F', Status='M', Children='2', Est Income='84166.10', Car Owner='N', Age='54.013333', Marital Status='Married', zipcode=None, LongDistance='3.28', International='0.00', Local='11.74', Dropped='1', Paymethod='CC', LocalBilltype='Budget', LongDistanceBilltype='Standard', Usage='15.02', RatePlan='2', DeviceOwned='ipho', CHURN='T', Preference='dual sim', Owns Multiple Phone Numbers='Y', Ave Text Msgs='1625'),
 Row(Sentiments='neutral', Keyword1='supervisor', Keyword2='delegation', ID='61', Gender='M', Status='S', Children='2', Est Income='100020.00', Car Owner='N', Age='50.000000', Marital Status='Single', zipcode=None, LongDistance='21.37', International='0.00', Local='293.24', Dropped='0', Paymethod='CH', LocalBilltype='Budget', LongDistanceBilltype='Standard', Usage='314.62', RatePlan='4', DeviceOwned='ipho', CHURN='T', Preference='android phone', Owns Multiple Phone Numbers='N', Ave Text Msgs='1696')]
The following code block is intended to give a feel for the Spark DataFrame APIs. We rename some of the column titles for readability and remove a duplicate column (Status and Marital Status carry the same information). Since all columns are read as strings from the CSV file, we also cast some of them to numeric types. Finally, we convert the DataFrame to a pandas structure for visualization.
# Change some column names

data_joined_callnotes_churn_campaign = data_joined_callnotes_churn_campaign.withColumnRenamed("Sentiments", "Sentiment").withColumnRenamed("Owns Multiple Phone Numbers","OMPN")
data_joined_callnotes_churn_campaign = data_joined_callnotes_churn_campaign.withColumnRenamed("Keyword1", "Keyword_Component").withColumnRenamed("Keyword2","Keyword_Query")
data_joined_callnotes_churn_campaign = data_joined_callnotes_churn_campaign.withColumnRenamed("Ave Text Msgs", "SMSCount").withColumnRenamed("Car Owner","CarOwnership")
data_joined_callnotes_churn_campaign = data_joined_callnotes_churn_campaign.withColumnRenamed("Marital Status", "MaritalStatus").withColumnRenamed("Est Income","Income")
data_joined_callnotes_churn_campaign = data_joined_callnotes_churn_campaign.drop('Status')

# Change some of the data types

data_joined_callnotes_churn_campaign = data_joined_callnotes_churn_campaign.withColumn("Children", data_joined_callnotes_churn_campaign["Children"].cast(IntegerType()))
data_joined_callnotes_churn_campaign = data_joined_callnotes_churn_campaign.withColumn("Income", data_joined_callnotes_churn_campaign["Income"].cast(DecimalType()))
data_joined_callnotes_churn_campaign = data_joined_callnotes_churn_campaign.withColumn("Age", data_joined_callnotes_churn_campaign["Age"].cast(IntegerType()))
data_joined_callnotes_churn_campaign = data_joined_callnotes_churn_campaign.withColumn("LongDistance", data_joined_callnotes_churn_campaign["LongDistance"].cast(DecimalType()))
data_joined_callnotes_churn_campaign = data_joined_callnotes_churn_campaign.withColumn("International", data_joined_callnotes_churn_campaign["International"].cast(DecimalType()))
data_joined_callnotes_churn_campaign = data_joined_callnotes_churn_campaign.withColumn("Local", data_joined_callnotes_churn_campaign["Local"].cast(DecimalType()))
data_joined_callnotes_churn_campaign = data_joined_callnotes_churn_campaign.withColumn("Dropped", data_joined_callnotes_churn_campaign["Dropped"].cast(IntegerType()))
data_joined_callnotes_churn_campaign = data_joined_callnotes_churn_campaign.withColumn("Usage", data_joined_callnotes_churn_campaign["Usage"].cast(DecimalType()))
data_joined_callnotes_churn_campaign = data_joined_callnotes_churn_campaign.withColumn("RatePlan", data_joined_callnotes_churn_campaign["RatePlan"].cast(IntegerType()))
data_joined_callnotes_churn_campaign = data_joined_callnotes_churn_campaign.withColumn("SMSCount", data_joined_callnotes_churn_campaign["SMSCount"].cast(IntegerType()))
data_joined_callnotes_churn_campaign.show(10)
data_joined_callnotes_churn_campaign.printSchema()

pandas_df_callnotes_campaign_churn = data_joined_callnotes_churn_campaign.toPandas()
pandas_df_callnotes_campaign_churn.head(12)
+----------+-----------------+-------------+---+------+--------+------+------------+---+-------------+-------+------------+-------------+-----+-------+---------+-------------+--------------------+-----+--------+-----------+-----+-------------+----+--------+
| Sentiment|Keyword_Component|Keyword_Query| ID|Gender|Children|Income|CarOwnership|Age|MaritalStatus|zipcode|LongDistance|International|Local|Dropped|Paymethod|LocalBilltype|LongDistanceBilltype|Usage|RatePlan|DeviceOwned|CHURN|   Preference|OMPN|SMSCount|
+----------+-----------------+-------------+---+------+--------+------+------------+---+-------------+-------+------------+-------------+-----+-------+---------+-------------+--------------------+-----+--------+-----------+-----+-------------+----+--------+
|analytical|          sd card|         apps| 24|     F|       2| 47902|           N| 26|      Married|   null|          17|            5|   50|      1|     Auto|    FreeLocal|            Standard|   72|       2|       moto|    F| more storage|   N|    1864|
|frustrated|          battery|    new phone| 35|     F|       0| 78851|           N| 48|       Single|   null|           0|            0|   29|      0|       CC|    FreeLocal|            Standard|   29|       4|       ipho|    T|     dual sim|   Y|    1173|
|frustrated|          battery|    new phone| 37|     F|       0| 83892|           Y| 61|      Married|   null|          29|            0|   45|      0|       CH|       Budget|            Standard|   74|       4|       ipho|    T|android phone|   N|     841|
|frustrated|          battery|    new phone| 38|     F|       2| 28221|           N| 38|      Married|   null|          26|            0|   12|      0|       CC|    FreeLocal|            Standard|   39|       4|       ipho|    T|windows phone|   N|    1779|
|frustrated|          battery|    new phone| 40|     F|       0| 28589|           N| 15|       Single|   null|          13|            0|   87|      0|       CC|    FreeLocal|            Standard|  100|       4|       ipho|    T|large display|   N|    1720|
|analytical|          sd card|         apps| 45|     M|       2| 89460|           N| 53|       Single|   null|          12|            2|   23|      0|       CC|    FreeLocal|            Standard|   36|       2|       ipho|    T| more storage|   N|    1886|
|frustrated|        technical|      support| 52|     F|       2| 67388|           N| 53|      Married|   null|           5|            1|   91|      1|       CC|       Budget|            Standard|   96|       3|       ipho|    T|android phone|   N|    1590|
|analytical|       new number|customer care| 53|     F|       1| 57063|           Y| 52|      Married|   null|          17|            0|   81|      0|       CH|       Budget|            Standard|   98|       4|       ipho|    F|android phone|   N|    1205|
|frustrated|        technical|      support| 54|     F|       2| 84166|           N| 54|      Married|   null|           3|            0|   12|      1|       CC|       Budget|            Standard|   15|       2|       ipho|    T|     dual sim|   Y|    1625|
|   neutral|       supervisor|   delegation| 61|     M|       2|100020|           N| 50|       Single|   null|          21|            0|  293|      0|       CH|       Budget|            Standard|  315|       4|       ipho|    T|android phone|   N|    1696|
+----------+-----------------+-------------+---+------+--------+------+------------+---+-------------+-------+------------+-------------+-----+-------+---------+-------------+--------------------+-----+--------+-----------+-----+-------------+----+--------+
only showing top 10 rows

root
 |-- Sentiment: string (nullable = true)
 |-- Keyword_Component: string (nullable = true)
 |-- Keyword_Query: string (nullable = true)
 |-- ID: string (nullable = true)
 |-- Gender: string (nullable = true)
 |-- Children: integer (nullable = true)
 |-- Income: decimal(10,0) (nullable = true)
 |-- CarOwnership: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- MaritalStatus: string (nullable = true)
 |-- zipcode: string (nullable = true)
 |-- LongDistance: decimal(10,0) (nullable = true)
 |-- International: decimal(10,0) (nullable = true)
 |-- Local: decimal(10,0) (nullable = true)
 |-- Dropped: integer (nullable = true)
 |-- Paymethod: string (nullable = true)
 |-- LocalBilltype: string (nullable = true)
 |-- LongDistanceBilltype: string (nullable = true)
 |-- Usage: decimal(10,0) (nullable = true)
 |-- RatePlan: integer (nullable = true)
 |-- DeviceOwned: string (nullable = true)
 |-- CHURN: string (nullable = true)
 |-- Preference: string (nullable = true)
 |-- OMPN: string (nullable = true)
 |-- SMSCount: integer (nullable = true)
Sentiment Keyword_Component Keyword_Query ID Gender Children Income CarOwnership Age MaritalStatus ... Paymethod LocalBilltype LongDistanceBilltype Usage RatePlan DeviceOwned CHURN Preference OMPN SMSCount
0 analytical sd card apps 24 F 2 47902 N 26 Married ... Auto FreeLocal Standard 72 2 moto F more storage N 1864
1 frustrated battery new phone 35 F 0 78851 N 48 Single ... CC FreeLocal Standard 29 4 ipho T dual sim Y 1173
2 frustrated battery new phone 37 F 0 83892 Y 61 Married ... CH Budget Standard 74 4 ipho T android phone N 841
3 frustrated battery new phone 38 F 2 28221 N 38 Married ... CC FreeLocal Standard 39 4 ipho T windows phone N 1779
4 frustrated battery new phone 40 F 0 28589 N 15 Single ... CC FreeLocal Standard 100 4 ipho T large display N 1720
5 analytical sd card apps 45 M 2 89460 N 53 Single ... CC FreeLocal Standard 36 2 ipho T more storage N 1886
6 frustrated technical support 52 F 2 67388 N 53 Married ... CC Budget Standard 96 3 ipho T android phone N 1590
7 analytical new number customer care 53 F 1 57063 Y 52 Married ... CH Budget Standard 98 4 ipho F android phone N 1205
8 frustrated technical support 54 F 2 84166 N 54 Married ... CC Budget Standard 15 2 ipho T dual sim Y 1625
9 neutral supervisor delegation 61 M 2 100020 N 50 Single ... CH Budget Standard 315 4 ipho T android phone N 1696
10 frustrated car adapter battery 62 F 2 45288 Y 29 Married ... CC Budget Standard 3 3 ipho F dual sim N 1579
11 frustrated technical support 63 F 2 59613 N 34 Married ... CC Budget Intnl_discount 176 1 ipho T dual sim Y 1662

12 rows × 25 columns

The following Brunel-based visualization can also be performed from Data Refinery; it is shown here to give a feel for the APIs.
%brunel data('pandas_df_callnotes_campaign_churn') bar y(#count) stack polar color(Sentiment) sort(#count) label(Sentiment, ' (', #count, '%)') tooltip(#all) percent(#count) legends(none)

(Brunel output: polar bar chart showing the percentage of call notes per sentiment)
The following cell shows an example of how pixiedust can be used to build interactive dashboards, and how the result can be exported. The interactive output is only visible when the notebook is opened outside GitHub, in a viewer like Jupyter.
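As a minimal sketch (assuming the joined DataFrame data_joined_callnotes_churn_campaign prepared above), the pixiedust call can be as simple as the following; the chart type, keys, values, and renderer are then chosen interactively in the widget, which also supports exporting the underlying data.

# display() is provided by pixiedust (imported above with 'from pixiedust.display import *').
# It renders an interactive table/chart widget around the DataFrame; options such as the
# chart type and the fields to plot are selected in the widget UI rather than in code.
display(data_joined_callnotes_churn_campaign)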
Building a random forest classifier

Split the dataset into training and test sets using a 70:30 ratio and build the model

Test the model on the test dataset
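Below is a minimal sketch of these three steps, assuming the joined and typed DataFrame data_joined_callnotes_churn_campaign prepared above; the feature subset, the numTrees value, and the split seed are illustrative assumptions rather than the notebook's exact choices.

# Hypothetical reconstruction of these steps; the actual feature selection may differ.
categorical_cols = ['Gender', 'MaritalStatus', 'CarOwnership', 'Paymethod', 'LocalBilltype',
                    'LongDistanceBilltype', 'Sentiment', 'OMPN', 'Preference']
numeric_cols = ['Children', 'Income', 'Age', 'LongDistance', 'International', 'Local',
                'Dropped', 'Usage', 'RatePlan', 'SMSCount']

# Drop rows with missing numeric values so VectorAssembler does not fail on nulls
df_model = data_joined_callnotes_churn_campaign.na.drop(subset=numeric_cols)

# Index the target column: CHURN ('T'/'F') -> numeric label
label_indexer = StringIndexer(inputCol='CHURN', outputCol='label').fit(df_model)

# Index the categorical feature columns
indexers = [StringIndexer(inputCol=c, outputCol=c + '_ix', handleInvalid='skip')
            for c in categorical_cols]

# Assemble numeric and indexed categorical columns into a single feature vector
assembler = VectorAssembler(inputCols=numeric_cols + [c + '_ix' for c in categorical_cols],
                            outputCol='features')

# Random forest classifier; IndexToString maps predictions back to the original 'T'/'F' labels
rf = RandomForestClassifier(labelCol='label', featuresCol='features', numTrees=20)
label_converter = IndexToString(inputCol='prediction', outputCol='predictedLabel',
                                labels=label_indexer.labels)

pipeline = Pipeline(stages=[label_indexer] + indexers + [assembler, rf, label_converter])

# 70:30 split, train the pipeline on the larger part
train_data, test_data = df_model.randomSplit([0.7, 0.3], seed=24)
model = pipeline.fit(train_data)

# Score the held-out test set; selecting a few columns yields a table like the one below
results = model.transform(test_data)
results.select('ID', 'CHURN', 'label', 'predictedLabel', 'prediction', 'probability') \
       .toPandas().head(6)

# Area under the ROC curve on the test set
evaluator = BinaryClassificationEvaluator(labelCol='label', rawPredictionCol='rawPrediction',
                                          metricName='areaUnderROC')
print('Area under ROC curve = %.2f' % evaluator.evaluate(results))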
ID CHURN label predictedLabel prediction probability
0 1006 T 1.0 T 1.0 [0.15866010494609709, 0.8413398950539029]
1 1231 F 0.0 F 0.0 [0.8565502761039058, 0.14344972389609417]
2 1239 F 0.0 F 0.0 [0.5644293832170035, 0.43557061678299647]
3 1292 F 0.0 F 0.0 [0.7184336545774561, 0.281566345422544]
4 1293 F 0.0 F 0.0 [0.7324334755270526, 0.26756652447294743]
5 1299 F 0.0 F 0.0 [0.7425781475656279, 0.25742185243437216]
Model Evaluation

Precision model1 = 0.88. Area under ROC curve = 0.86.

Deployment in Watson Machine Learning

Note that there are multiple ways of deploying a model in WML. Reading the credentials of your service from an external directory, or at least from a file, is better than hard-coding them in the notebook. The credentials are loaded first, and the service client is then instantiated.

02f481c2-7717-4e79-b60f-3c1214b5f374

modelType: sparkml-model-2.1
creationTime: 2018-01-29 05:00:35.460000+00:00
modelVersionHref: https://ibm-watson-ml.mybluemix.net/v2/artifacts/models/a6d5cae3-c31a-46a6-b160-af82d11a6df9/versions/fefe0848-8426-463c-984f-0137e28ad56e
label: CHURN
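A minimal sketch of this deployment flow, assuming the V3 watson-machine-learning-client Python API, a hypothetical local credentials file wml_credentials.json, and the model, pipeline, and train_data names used in the training sketch above:

import json
from watson_machine_learning_client import WatsonMachineLearningAPIClient

# Load the WML service credentials (url, username, password, instance_id) from a local
# JSON file instead of hard-coding them in the notebook. The file name is a placeholder.
with open('wml_credentials.json') as f:
    wml_credentials = json.load(f)

client = WatsonMachineLearningAPIClient(wml_credentials)

# Store the trained Spark ML pipeline model in the WML repository
meta_props = {client.repository.ModelMetaNames.NAME: 'Customer churn prediction model'}
stored_model = client.repository.store_model(model=model,
                                             meta_props=meta_props,
                                             training_data=train_data,
                                             pipeline=pipeline)
model_uid = client.repository.get_model_uid(stored_model)

# Create an online deployment so the model can be scored through a REST endpoint
deployment = client.deployments.create(model_uid, name='Customer churn deployment')
scoring_url = client.deployments.get_scoring_url(deployment)
print(scoring_url)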