
Le passage à l’échelle de la connaissance client, l’application du Big Data dans l’e-commerce vers une meilleure expérience client : cas de Jumia

Mémoire de fin d’études présenté pour l’obtention

du diplôme d’Ingénieur d’Etat

de l’Ecole des Sciences de l’Information

Naima HAMMOUTI

Sous la direction de : Mme. Najima Daoudi

Membres du jury :

Président Mme. Maryem RHANOUI

Membre 1 Mr. Zakaria ELMAHFOUDI

Membre 2 Mme. Najima Daoudi

Promotion 2014-2017

“We want to prove ourselves. We are proud of ourselves when we see that we are trusted and that our work is appreciated: it is rewarding. I [anonimizat], and enriched from a personal as well as a professional point of view.”

[anonimizat], and friends.

First and foremost, I want to thank my advisor Mrs. [anonimizat] (ESI). I [anonimizat]. [anonimizat]. I am also thankful for the excellent example she has provided as a successful woman engineer and professor.

[anonimizat]. Zakaria ELMAHFOUDI. Thank you for the opportunities you’ve given me to start my professional journey, and for all of the meetings and chats over these four months. You recognized that I at times needed to work alone but also made sure to check in on me so that I stayed on the right path.

[anonimizat]. Homame SOUSSI. [anonimizat], and willingness that allowed me to pursue research on topics for which I am truly passionate. I [anonimizat] I thank you for letting me do the same.

To The members of Jumia group who have contributed immensely to my personal and professional time at Jumia. The group has been a source of friendships as well as good advice and collaboration.

[anonimizat]. I [anonimizat]. I always knew that you believed in me and wanted the best for me.

To my sister Mrs. [anonimizat], [anonimizat].

[anonimizat].

Abstract

No one can deny that the internet and Big Data are enabling new and better ways of finding and obtaining customers. Big Data allows us to identify customers’ [anonimizat]. Indeed, Big Data provides extremely robust insights into the needs and preferences of customers. As a result, Big Data provides the opportunity to identify precisely the customers that have certain preferences and to market to them.

A key factor for a company is to implement a CRM strategy that allows it to leverage the opportunity of Big Data with its customer information. [anonimizat] to reams of data about the company and its competitors.

It is in this context that we developed our graduation thesis project within Jumia, whose objective was to integrate all the existing data sources into a single platform in order to implement a Big Data solution that constitutes a 360° view of clients and their emotions. Two sub-objectives drew the path from this perspective, namely:

Establish an overview of the ecommerce fabric in general and Jumia in particular;

Identify the axes of analysis in response to the needs identified from the research methods used and the expectations of JUMIA;

Keywords: Big Data, Big Data Analytics, Customer Relationship Management, E-commerce, Social Media, Data Sources

Résumé

Les Big Data désignent l’ensemble des méthodes et des technologies permettant de stocker, traiter et analyser des données et contenus hétérogènes, afin d’en faire ressortir de la valeur ajoutée et de la richesse, pour des environnements évolutifs qui se caractérisent par l’augmentation du volume de données et du nombre d’utilisateurs/producteurs, la variété des données ainsi que la vitesse de production des flux importants de données.

Témoignant d’une efficacité irréfutable au profit des secteurs techniques et industriels, le Big Data est mûr pour proposer des enjeux intéressants pour la scène du e-commerce.

L’e-commerce, qui est un domaine en pleine expansion, avec une croissance annuelle très intéressante au Maroc, atteint désormais un niveau de maturité qui l’amène à se confronter aux mêmes problématiques et enjeux que les autres secteurs et canaux de vente traditionnels, et donc à mettre en œuvre à son tour de véritables stratégies et outils de Gestion de la Relation Client (ou CRM) : prospection de nouveaux clients, fidélisation des clients existants… autant de tactiques à mettre en œuvre, de processus à définir par les e-commerçants et donc d’outils à déployer.

Ces stratégies dépendent en grande partie de grandes quantités de données ; ces données et leur traitement introduisent donc la notion de Big Data, qui offre la réponse permettant le déploiement d’un CRM efficace et puissant, dont l’objectif est d’intégrer le « e-marketing intelligence » pour s’assurer un CRM opérationnel qui intègre et propose toutes les fonctions utiles au e-marketing et à la mise en place de programmes de fidélisation clients et/ou d’animation de base prospects.

Avec les Big Data, Jumia veut avoir la capacité de maintenir l’efficacité de son CRM et même de l’élargir à un niveau intelligent et prometteur selon deux axes : l’anticipation et la personnalisation.

L’intérêt de leur application peut donc se résumer en plusieurs points dont principalement :

Connaitre la provenance des utilisateurs

Avoir une idée sur le comportement des utilisateurs sur le site

Dresser le profil comportemental des utilisateurs en dehors du site

Mots-clés : Big Data, Big Data Analysis, Gestion des Relations Clients, E-commerce, Réseaux sociaux, Sources de données.

List of Figures

Figure 1 Rocket Internet regional leaders

Figure 2 AIG’s presence within Africa’s countries

Figure 3 Jumia’s subsidiaries in Africa

Figure 4 Jumia’s organizational chart 2016/2017

Figure 5 African GDP growth in 2012-2014

Figure 6 Direct observation plan scheme

Figure 7 Big Data redesign diagram: on-demand analytics, Amazon web experience

Figure 8 Xcally dashboard screenshot

Figure 9 Real-time graph screenshots on the Xcally dashboard

Figure 10 Administrator user interface, with the Apps administration page selected

Figure 11 Jumia’s presence within social media

Figure 12 Interfaces used by Jumia customer service

Figure 13 The power of Salesforce in today’s tech-savvy world

Figure 14 Service Cloud / Big CRM environment

Figure 15 Overview of Salesforce Wave Analytics to uncover insights and take instant action

Figure 16 Simplified diagram of the process of building an app within Salesforce Wave Analytics

Figure 17 Deep dive into Jumia’s current information system

Figure 18 Data analysis cycle

Figure 19 Data mining process

Figure 20 Application of classification

Figure 21 Zendesk features within Salesforce

Figure 22 Overview of combining the audit information for each ticket with the ticket table

Figure 23 Resulting dataset overview sample

Figure 24 Hadoop timeline

Figure 25 Spark bases

Figure 26 The Spark stack

Figure 27 Introduction to Databricks

Figure 28 Spark features

Figure 29 Overview of the HDFS formatting script

Figure 30 Overview of the Spark shell

Figure 31 Overview of the Scala API section of the project program

Figure 32 Overview of the Scala API for the updates

Figure 33 Databricks overview of the spark-Salesforce package

Figure 34 Databricks overview of the table created after the data extraction

Figure 35 Creating the Twitter API

Figure 36 Packages to be imported into our Spark program

Figure 37 First overview of the Twitter program running within the Eclipse IDE

Figure 38 Second overview of the Twitter program running within the Eclipse IDE

Figure 39 Overview of the result console

Figure 40 Graph of the optimal number of clusters in the dataset

Figure 41 K-means clustering results in R

Figure 42 Result console: Scala API output

Figure 43 Output files and directory

Figure 44 Output file containing Twitter username and timestamp

Figure 45 Output file containing extracted Twitter data and sentiment analysis

Figure 46 How Salesforce Analytics Cloud works

Figure 47 Datasets overview within Salesforce

Figure 48 Resulting dataset overview within Salesforce

Figure 49 An overall screenshot of the Big CRM platform // Salesforce

Figure 50 An overall screenshot of the 360° client view feature

Figure 51 An overall screenshot of the client segmentation feature we would want to achieve

List of Tables

Table 1 Definitional aspects of big data analytics (BDA) in e-commerce

Table 2 Xcally integration functionalities overview

Table 3 Types of data about calls

Table 4 Types of data to extract from Zendesk and its equivalent within Salesforce

Table 5 An output SQL table of test data

Table 6 Apache Spark vs. Hadoop MapReduce

Abbreviations list

CRM Customer Relationship Management

BDA Big Data Analytics

OMS Order Management System

AIG Africa Internet Group

API Application Programming Interface

Introduction

The idea of data creating business value is not new; however, the effective use of data is becoming the basis of competition. Businesses have always wanted to derive insights from information in order to make better, smarter, real-time, fact-based decisions.

Big data will fundamentally change the way businesses compete and operate. Companies that invest in and successfully derive value from their data will have a distinct advantage over their competitors: emerging technologies and digital channels offer better data acquisition and enable faster and easier data analysis.

In short, the possibilities of big data (the data itself and the technologies for harnessing it) are quite amazing.

Talking about data and value leads us directly to customer data, and thus to the Customer Relationship Management (CRM) process, whose main goal is to increase customer satisfaction in order to achieve organizational business objectives. CRM is about investing in the customer, who is the prime asset of any business. Implemented right, CRM helps organizations understand and engage customers better and become more relevant to customers and users.

CRM solutions can become smarter by converting their data into customer value, subsequently using it to improve customer processes and to predict opportunities that would improve customer satisfaction, conversion, loyalty, and advocacy.

This field of reflection was warmly welcomed by Jumia, a leading organization in its domain in Morocco. Since its beginnings, Jumia has made winning the love of its customers its strategic path; for that reason, the overall project focuses on designing and prototyping a Big Data solution to improve customer knowledge and attract new customers.

This field is animated by two questions:

What will Big Data change for Jumia in terms of customer knowledge and customer relationship management?

Does Big Data have the potential to change the way Jumia manages customer relationships simply by combining customers’ internal data with their behavior on social networks?

This thesis aims to give this migration a theoretical and technological perspective. Indeed, we carried out a project whose main objective was to outline the e-commerce fabric along the new vector of customer behavior.

So, in order to give an accurate account of the work carried out, this paper is organized in three parts:

The first one will cover the contextual and methodological perimeter;

The second section will take the existing data sources and potential solutions as its main focus;

The third section will cover both the proposed prototype and the implementation process.

Research problems & hypothesis

No one can deny that e-commerce is an expanding field with very interesting annual growth in Africa, and it is now reaching a level of maturity that brings it face to face with the same issues and challenges as other traditional sectors and sales channels. For an e-commerce business, customers are its lifeblood and essentially its most precious asset. Whether it has already built up a loyal customer base or is setting out to execute a business development plan, selecting a way of effectively managing customers should be a priority.

These priorities depend largely on large amounts of data. This data and its processing introduce the concept of Big Data: the set of methods and technologies for storing, processing, and analyzing heterogeneous data and content. Having proved irrefutably effective in the technical and industrial sectors, Big Data is ripe to raise interesting issues for the e-commerce scene. It offers the answer that will allow the deployment of an efficient, intelligent, and powerful CRM tool, whose objective is to integrate e-marketing intelligence, ensuring the integration of all CRM operations and providing all the functions useful to e-marketing and to the implementation of customer loyalty programs and/or prospect-base animation: exploration, reporting, analysis, and so on.

With Big Data, we will have the ability to maintain the effectiveness of CRMs along two axes: Anticipation and Personalization.

The interest of their application can be summarized in several points, mainly:

Improved customer analysis – The analysis of all customer touch points, including social media, email, the Internet, and call centers, allows CRM and big data to segment customers according to their actions. Customer trends can be mined from big data and used to predict needs, directing product development and promotional efforts.

Better picture of customer-facing operations – Big data will provide businesses with sales, marketing and customer service performance metrics. With big data, organizations can predict and determine ROI and use it to endorse additional CRM investment.

Better decision making– Once the value in customer-facing operations is made clear, businesses can make course corrections and better decisions going forward.

Predictive Modeling – Using big data, businesses gain the ability to predict how customers will respond in the future, based on demographics and behavioral history.

Benchmarking – A powerful component of big data is the ability to implement comprehensive benchmarking over time, enabling organizations to define vital indicators such as customer sentiment, retention, and cost vs. revenue per service. Once the areas that need improvement are identified, companies have the tools necessary to rise above industry standards.
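Segmentation according to customer actions, as described above, is typically done with clustering; the project later applies k-means (in R) for exactly this purpose. The sketch below is a minimal, self-contained k-means in Python over hypothetical (order count, average basket in MAD) features — the customer data and feature choice are illustrative assumptions, not Jumia's actual schema.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means: assign each point to the nearest centroid, then recenter."""
    rnd = random.Random(seed)
    centroids = rnd.sample(points, k)  # initialize from actual data points
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            groups[nearest].append(p)
        # Recenter each centroid on the mean of its group (keep old centroid if empty).
        centroids = [tuple(sum(dim) / len(g) for dim in zip(*g)) if g else centroids[i]
                     for i, g in enumerate(groups)]
    return centroids, groups

# Hypothetical customer features: (number of orders, average basket in MAD)
customers = [(1, 150), (2, 180), (1, 200),      # occasional, low-basket buyers
             (12, 900), (15, 1100), (11, 950)]  # loyal, high-basket buyers
centroids, segments = kmeans(customers, k=2)
```

On this toy data the two behavioral segments separate cleanly; in practice a library implementation (e.g. R's `kmeans` or Spark MLlib) would replace this sketch.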

As a leader in its field, Jumia is the first e-commerce firm in Africa, present in more than 26 countries, with more than 20,000 real-time visits per day on each website.

So, the main improvement of the Big Data era for Jumia lies in the wide variety of data sources it has, a variety that raises three main questions:

What will Big Data change for Jumia in terms of customer knowledge and customer relationship management?

Does Big Data have the potential to change the way Jumia manages customer relationships simply by combining customers’ internal data with their behavior on social networks?

Can we imagine predicting customer behavior and anticipating new customers just by using Big Data?

These three questions are theoretically intriguing in a context strongly colored by technological trends. No answer can be given unless we understand that Big Data can open a new frontier, translated into a set of business innovations oriented not only around the personalization of the offer but also around the improvement of processes and the optimization of marketing actions.

Chapter 1: Study Context

Introduction

As is customary, at the beginning of this paper we will introduce the general context of our study: first a brief presentation of the host organization, then a general glance at e-commerce in Africa.

Organization

As we’ve mentioned earlier, we will present in this part the organization that hosted our project.

The project took place within Jumia Morocco, which is also known as one of the most influential Rocket Internet regional leaders. In this section we will first briefly present the Rocket Internet group, then Jumia as a whole, and finally put the spotlight on Jumia Morocco as a unit.

Rocket Internet group

Rocket Internet SE is a German Internet company headquartered in Berlin. The company builds online startups and owns shareholdings in various models of internet retail businesses. The company model is known as a startup studio or a venture builder.

Rocket Internet has more than 28,000 employees across its worldwide network of companies, which consists of over 100 entities active in 110 countries. The company's market capitalization was 3.151 billion euros on October 10, 2016.

The network of companies

Rocket Internet follows the strategy of building companies on the basis of proven Internet-based business models. According to Rocket Internet's financial statements, the company especially concentrates on Food & Groceries, Fashion, Home & Living, and Travel. In addition to the companies in these industry sectors, Rocket Internet owns stakes in companies at varying maturity stages, ranging from recently launched models to companies that are in the process of establishing leadership positions or still expanding their geographic reach.

Figure 1 Rocket Internet regional leaders

Africa Internet Group (AIG)

AIG is one of Rocket Internet's regional leaders. It is a young company that has existed for almost 5 years and is present in 26 countries in Africa, where it has created nearly 5,000 direct jobs. Its capital is shared equally between Rocket Internet; MTN Group, the South African telecom operator and leader in mobile telephony in Africa; and Millicom, a leader in mobile telephony in Rwanda, Senegal, Tanzania, the Democratic Republic of the Congo, Chad, and Ghana. For its part, AIG holds 100% of the capital of all its subsidiaries.

Figure 2 AIG’s presence within Africa’s countries

Africa Internet Group is the first internet group in Africa. Founded in 2012, AIG is currently present in 26 countries with 71 enterprises; it is a company with several subsidiaries in Africa that follows the same strategy as Rocket Internet SE of building companies on the basis of proven Internet-based business models.

Jumia

Jumia is Africa’s leading online shopping destination. Customers across the continent can shop from the widest assortment of high quality products with everything from fashion, consumer electronics, and home appliances to beauty products on offer at affordable prices. Jumia was the first African company to win an award at the World Retail Awards 2013 in Paris, receiving the title of “Best New Retail Launch” of the year.

Figure 3 Jumia’s subsidiaries in Africa

Jumia Morocco

Launched in June 2012, Jumia Morocco is the Moroccan subsidiary of Africa Internet Group. It has the same online-mall concept as Jumia, offering a wide choice of products from different brands at the best prices.

More than 200 employees

Warehouses in the main cities of the kingdom, covering 3,000 m²

4 to 6 million visits per month

An average basket of 600 to 1,000 dirhams, with annual turnover between 144 million MAD and 240 million MAD

Represents 10% of the activity of AIG; the platform receives more visits than those in other countries where the population is higher.

Mission:

Revolutionizing the concept of shopping by offering customers the best online shopping experience: this is the main mission of JUMIA.

Vision:

Becoming the leading online merchant in Morocco, offering an unmatched assortment, with a wide choice of product categories, and offering a unique customer experience before, during and after the purchase.

Organizational chart:

Figure 4 Jumia’s Organigramme 2016/2017

E-commerce in Africa: A growing business with a promising future

E-commerce is the buying or selling of goods and services online. Electronic commerce draws on technologies such as mobile commerce, electronic funds transfer, supply chain management, Internet marketing, online transaction processing, electronic data interchange (EDI), inventory management systems, and automated data collection systems. Modern electronic commerce typically uses the World Wide Web for at least one part of the transaction's life cycle, although it may also use other technologies such as e-mail.

Today, since the majority of companies have an online presence, e-commerce has grown in importance. In fact, having the ability to conduct business through the Internet has become a necessity. Everything from food and clothes to entertainment and furniture can be purchased online.

E-commerce has existed in Morocco since 2008, when banks established a secure credit-card payment system. In 2010, 223,000 credit-card transactions were recorded on the internet. The e-commerce sector in Morocco has truly boomed in the last two years, providing an easy alternative for Moroccans to shop, pay their bills, and use other online services from the comfort of their homes.

Africa Internet Group, now known as Jumia, is considered a leader in the African e-commerce fabric, with the most influential contribution to African gross domestic product (GDP) growth between 2012 and 2014:

Figure 5 African GDP Growth in 2012-2014

Conclusion

Through this first chapter, we have defined the contextual anchor of our study, which rests on two axes: the host organization and the global renewal that e-commerce actors bring to African society. This raises a double question: about the maturity and power of influence of e-commerce as a powerful new means of commerce in Africa, and about the existence of the technological means to gauge the new esplanade of Big Data applications.

Chapter 2: Methodological anchoring of the study

Introduction

In the following, we will present the methodology of our study. Based on our stated problem, we first dissect the objectives to be attained and the questions arising from them, then the methods and instruments used to collect the data, the population being studied, and the research flow, and last but not least the limits encountered and the scope of our work.

Research Objectives

In order to see the full picture of the study, we decided to divide the research objectives into a main objective and operational ones.

Main objective

As we all know, winning the love of customers is the key to Jumia's strategy; for that reason, the overall project focuses on designing and prototyping a Big Data solution to improve customer knowledge and attract new customers. So the main objective of my research is first to identify Jumia Morocco's needs and expectations for this project, and then to determine the technical and human means to prepare the environment for a Big Data solution that will allow us to extract data on and off the site, in real time, in order to investigate the possibility of establishing what can be called the “Big CRM”.

Operational objectives

We can divide this main objective into three directions:

Collect and diagnose the existing e-marketing strategy and CRM tools used within Jumia Morocco's services.

Study the methods and techniques used to implement the Big Data project to improve customer relations.

Implement a CRM solution based on a data mining process in the context of Big data, in favor of Jumia Morocco.

Research questions

The objectives set out above can be divided into several research questions, which can be presented as follows:

What will Big Data change for Jumia in terms of customer knowledge and customer relationship management?

Does Big Data have the potential to change the way Jumia manages customer relationships simply by combining customers’ internal data with their behavior on social networks?

Can we imagine predicting customer behavior and anticipating new customers just by using Big Data?

Research methods

As we all know, the choice of a research method is influenced by the data collection strategy, the type of variable, the accuracy required, and the collection point. Links between a variable, its source, and practical methods for its collection can help in choosing appropriate methods. The main data collection methods are:

Documentary method

The documentary method enabled us to seek, collect, and identify all the documents that deal with the different aspects of our research topic. Indeed, we were able to write a detailed literature review that covers all the themes attached to our study.

Field Investigation method

A field survey consists of forms completed by a specific population in order to collect their opinions directly on a subject. The field survey enabled us to go deeper into our study of the existing situation and tools in order to confirm the predefined needs. Given the project's nature and the limited size of the study population, we decided to use one and only one method: direct observation, the most appropriate tool for confirming and detailing the expectations of our project. It relies especially on making direct measurements of many variables, and was deployed during the various searches and navigations inside and outside Jumia's website, which enabled us to further explore the existing tools and instruments.


Figure 6 Direct Observation plan scheme

Research population

The target population of this survey consisted of Jumia in general and Jumia Morocco in particular. Because of the broad definition of CRM, its increasingly wide applicability in business practice, and the fact that CRM systems house some of the most valuable data, considered one of Jumia's most valuable strategic assets, our research population will concentrate first on the customer service department.

So, as we know, data in and of itself is often of limited practical use; its real value comes from data analysis, visualization tools, and its wide range of resources. These mitigate the complexity that permeates the object of the study on the one hand, and on the other propose theoretical and technological choices aligned with the possibilities offered by the existing tools and the new ones that will need to be implemented.

Conclusion

This study, the first of its kind at Jumia, is part of the application of Big Data technology to the e-commerce field in general, crystallized through a study of one of its most critical components, namely the commercial fabric.

Chapter 3: Review of literature

Introduction

It is generally agreed that, with more and more data generated, it has become a big challenge for traditional architectures and infrastructures to process large amounts of data within an acceptable time and with acceptable resources. Moreover, because of value networks, the emergence of social networks, and the huge information flow across and within organizations, more and more businesses are interested in utilizing big data analytics.

E-commerce companies likewise find themselves in a situation that is promising and challenging at the same time. E-commerce is growing overall, yet the competitive pressure is significant, since it is a scalable business and the winner takes it all. Customers are not loyal and mostly arrive from search engines and online advertisements. On search engines, they are typically presented with multiple search results, including prices, which puts pressure on margins. At the same time, advertisements on these search engines are sold through a bidding process, leaving decreasing inefficiencies in the market to be exploited. So, in order to efficiently extract value from this data, organizations need to find new tools and methods specialized for big data processing.

For those reasons, big data analytics has become a key factor for companies to reveal hidden information and achieve competitive advantages in the market.
Currently, an enormous number of publications on big data, big data analytics, e-commerce, and customer behavior have been released, making it difficult for practitioners and researchers to find the topics they are interested in and keep up to date. However, big data remains poorly explored as a concept, which obstructs its theoretical and practical development. This literature review explores Big Data in e-commerce by drawing on a systematic review of the literature, and aims to present an overview of the content, scope, and findings of big data analytics, as well as the opportunities provided by its application within e-commerce and customer relationship management.

To summarize, the main focus of this chapter is to elucidate the characteristics of the big data analytics literature, to explore the areas that lack sufficient research within the big data analytics domain, and to present a new term, “Big Customer Relationship Management” (Big CRM), that captures the application of big data to the customer experience.

Definition of Big data

What is big data?

Big data is a term for data sets that are so large or complex that traditional data processing software is inadequate to deal with them. It is also defined as the term that describes the large volume of data, whether structured or unstructured, that inundates a business on a day-to-day basis. Challenges include capture, storage, analysis, data curation, search, sharing, transfer, visualization, querying, updating, and information privacy.

The term "big data" often refers simply to the use of predictive analytics, user behavior analytics, or certain other advanced data analytics methods that extract value from data.

« Big Data is the result of collecting information at its most granular level — it’s what you get when you instrument a system and keep all of the data that your instrumentation is able to gather ». Furthermore, the concept gained momentum in the early 2000s when industry analyst Doug Laney articulated the now-mainstream definition of big data as the three Vs:

Volume: Volume dictates the amount of data being processed; it is the huge amount of information that needs to be parsed to make the data usable, which organizations collect from a variety of sources, including business transactions, social media, and sensor or machine-to-machine data.

Velocity: The speed at which data can be sent, shared, and processed; data streams in at an unprecedented speed and must be dealt with in a timely manner.

Variety: Data comes in all types of formats, from structured, numeric data in traditional databases to unstructured data; big data is not just basic information. Data can be drawn from images, audio, video, and other sources, as well as text.
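The three Vs above can be made concrete with a toy clickstream log, of the kind granular instrumentation produces: events arrive continuously (velocity), accumulate into very large files (volume), and mix structured fields with free text (variety). This Python sketch uses invented event fields, not an actual e-commerce schema.

```python
import json
import time

def make_event(user_id, action, payload):
    """Build one clickstream event: structured fields plus a free-text payload."""
    return {
        "ts": time.time(),   # velocity: events are timestamped as they stream in
        "user": user_id,     # structured, numeric
        "action": action,    # structured, categorical
        "payload": payload,  # unstructured free text (variety)
    }

# Volume: even a modest shop emits many thousands of such lines per day.
events = [
    make_event(1, "search", "cheap android phone"),
    make_event(1, "click", "product 4521"),
    make_event(2, "review", "great delivery, loved it"),
]
lines = [json.dumps(e) for e in events]  # one JSON line per event, ready to store
```

Serializing each event as one JSON line is a common convention because it lets batch tools split and process the log in parallel.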

In defining big data, IBM (2012), Johnson (2012a), and Davenport (2012) focused more on the variety of data sources, while other authors, such as Rouse (2011), Fisher (2012), Havens (2012), and Jacobs (2009), emphasized the storage and analysis requirements of dealing with big data and claim that it is not the amount of data that matters but what organizations do with it. Big data can be analyzed for insights that lead to better decisions and strategic business moves.

The sheer volume of academic and industry research provides evidence on the importance of big data in many functional areas, and many practical business environments.

What is big data analytics (BDA)?

Big data analytics is an analytical process that examines large amounts of data to uncover hidden patterns, correlations, and other insights. With today's technology, it is possible to analyze data and get answers from it almost immediately.

Big data analytics helps organizations harness their data and use it to identify new opportunities. That, in turn, leads to smarter business moves, more efficient operations, higher profits and happier customers. A well implemented BDA process offers companies value in different fields and that by following different ways such as:

Cost reduction: Big data technologies and cloud-based analytics bring significant cost advantages when it comes to storing large amounts of data.

Faster, better decision making: With the speed of the new Big Data technologies, their real time and in-memory analytics, combined with the ability to analyze new sources of data, businesses are able to analyze information immediately – and make decisions based on what they’ve learned.

New products and services: With the ability to gauge customer needs and satisfaction through analytics comes the power to give customers what they want. Davenport points out that with big data analytics, more companies are creating new products to meet customers’ needs.

Defining BDA in the e-commerce environment

In defining BDA, two streams of research offer different perspectives. One stream has focused on analytics that create sustainable value for business. To highlight this, LaValle (2011) explained that the application of business analytics (or the ability to use big data) for decision making must essentially be connected with the organization's strategy. Indeed, strategy-driven analytics has received much attention due to its role in better decision making. Studies have also focused on competitive advantage and differentiation while applying analytics to real-time data.

The other stream of research defines BDA from the perspective of identifying new opportunities with big data. For example, Davenport (2012) explained that BDA attempts to explore new products and value-added activities. Similar arguments have been offered in another study by Davenport on scanning the external environment and identifying emerging events and opportunities.

E-commerce firms are among the fastest adopters of BDA due to their need to stay on top of their field. In most cases, e-commerce firms deal with both structured and unstructured data. Whereas structured data covers demographic attributes including name, age, gender, date of birth, address, and preferences, unstructured data includes clicks, likes, links, tweets, voice recordings, etc. In the big data analytics environment, the challenge is to deal with both types of data in order to generate meaningful insights that increase conversions. Schroeck (2012) found that the definition of big data incorporates various dimensions, including: a greater scope of information; new kinds of data and analysis; real-time information; non-traditional forms of media data; new technology-driven data; a large volume of data; the latest buzzword; and social media data.

Table 1 below summarizes the most popular studies and aspects of BDA in the e-commerce environment, explaining each perspective, giving its purpose, and opening the door to new and innovative potential research areas.

Table 1: Definitional aspects of big data analytics (BDA) in e-commerce

Types of big data used in E-commerce

E-commerce refers to online transactions: selling goods and services on the internet, either in a single transaction or through an ongoing relationship. E-commerce firms capture various types of data within and outside their websites, which can be broadly classified into three categories:

Transaction or business activity data: Structured data from retail transactions, customer profiles, distribution frequency and volume, product consumption and service usage, nature and frequency of customer complaints.

Click-stream data: Data gathered from the web, social media content, online advertisements (tweets, blogs, Facebook wall postings, etc.)

Video and voice data: Voice data from phone calls, call centers, and customer service.

In e-commerce, data are the key to tracking consumer shopping behavior and personalizing offers; they are collected over time from consumer browsing and transaction touchpoints.
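
The three categories can be made concrete with small, invented examples of each record type (the field names and values below are illustrative, not any firm's actual schema):

```python
import json
from datetime import datetime

# 1. Transaction data: structured, fits a fixed schema.
transaction = {
    "order_id": "ORD-1001",
    "customer_id": "C-42",
    "amount": 349.99,
    "items": ["phone-case", "charger"],
    "placed_at": "2017-05-12T14:03:00",
}

# 2. Click-stream data: semi-structured web events, often arriving as JSON.
clickstream_event = json.loads(
    '{"customer_id": "C-42", "page": "/phones/acme-x1",'
    ' "action": "add_to_cart", "ts": "2017-05-12T13:58:21"}'
)

# 3. Voice/call data: an unstructured payload plus structured metadata.
call_record = {
    "customer_id": "C-42",
    "duration_s": 312,
    "transcript": "I would like to return my last order...",
}

def minutes_between(ts_click, ts_order):
    """Time from last click to purchase, a simple behavioral signal."""
    fmt = "%Y-%m-%dT%H:%M:%S"
    delta = datetime.strptime(ts_order, fmt) - datetime.strptime(ts_click, fmt)
    return delta.total_seconds() / 60

lag = minutes_between(clickstream_event["ts"], transaction["placed_at"])
print(f"Click-to-purchase lag: {lag:.1f} minutes")
```

Joining the three record types on a shared customer identifier, as above, is what lets behavioral signals (the click-to-purchase lag here) be derived across sources.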

Big Data and CRM: a winning pair for the future of e-commerce

After defining Big Data and big data analytics, and explaining the huge role both play in adding value to customer data and building a high level of competitive advantage for e-commerce companies, it is useful to discuss Big Data and CRM together.

We can all agree that the volume and velocity of quality performance data, coupled with always-on connectivity, enables improved customer service. Machine-to-machine automation enables the real-time aggregation of performance data from thousands of devices. That in turn enables the rapid deployment of over-the-air application updates to address error codes and security threats, or to deliver enhanced services based on actual usage.

Convenient access to account, service, and delivery information is becoming a standard customer expectation. When brands respond quickly to anticipate and manage customer expectations, the customer experience improves.

Communication systems that allow a choice of media and provide automated or personalized messages for customer service or product updates guarantee a steady flow of communication between customers and companies, reduce the number of inbound customer service calls, and increase overall satisfaction. Such systems are generally known as customer relationship management (CRM) systems, defined as an approach applied within companies to manage interactions with current and potential future customers. CRM analyzes data about customers' history with a company in order to improve business relationships, focusing specifically on customer retention and ultimately driving sales growth.

Combining Big Data applications with the CRM process introduces a new term, Big CRM (big data customer relationship management), which refers to the practice of integrating big data into a company's CRM processes with the goals of improving customer service, calculating return on investment for various initiatives, and predicting client behavior.

As mentioned above, companies generally struggle to make sense of big data because of its sheer volume, the speed at which it is collected, and the great variety of content it encompasses. Tools and procedures are evolving to help companies house and examine these large amounts of data and move toward making data-driven decisions.

Companies are now trying to make their services smarter for their customers. Using BDA, their goal is to combine internal CRM data with customer sentiment data that exists outside the company's existing system, such as on social media networks, while taking full advantage of internal data such as product costs and prices, stock levels, sales, advertising campaigns, and pricing data, in order to improve customer analysis and enable predictive modeling and other practices.

The American experience: How Do They Do It? Amazon's CRM Success Story

Established in the US, Amazon is an e-commerce company that sells products to millions of customers all over the world. It originally started as a book retailer but now sells an extensive range of products through its own store as well as through extensive online marketplaces.

Over the past 20 years, Amazon has consistently proven it is capable of running a world-class CRM strategy. For its millions of loyal customers, Amazon has remained their preferred online shopping destination amid mass competition. A well-managed and efficient CRM strategy has been a crucial aspect of their success.

Amazon has a reputation for providing customers with everything they need, all in one place. In what has since become known as 'the Amazon Effect', the company has successfully managed relationships with millions of customers without ever meeting them face-to-face.

The Secret to CRM Success

CRM is the basis of all of these useful features. Amazon has built its own CRM software in-house, meaning it is tailored to the company's exact requirements. The software allows Amazon to capture customer data, such as location and previous purchases, and use it to instantly customize a user's on-site experience.

Amazon's CRM also deals with most customer queries before they reach the stage where human intervention is required. For example, each customer can access their own order history, which allows them to see what they have ordered, where it is in the delivery process, and how much they spent on it. The returns policy is similarly automated, vastly reducing the need for customer service staff and the costs associated with them.

Figure 7 Big-Data-Redesign_Diagram_On-Demand-Analytics-Amazon web experience

Top 5 Ways Amazon Uses CRM

In this section we explain the diagram given previously, summarizing each step in which Amazon uses big data in order to achieve this level of maturity.

Data collection

Amazon encourages all users to create accounts to make it easier for them to make future purchases. These accounts also give Amazon a targeted marketing method as customers can be emailed with offers and promotions based on their past purchases.

Personal data storage

One way in which creating an Amazon account benefits the customer is ease of purchase. With an account, all payment, personal, and address details can be stored, allowing for quick and easy future purchases.

Recommendations

Amazon pioneered the recommended products feature. Whenever users are logged into their account, Amazon will recommend products they may be interested in based on past buying habits. More recently, Amazon introduced the 'customers who bought this item also bought' feature. These recommendations are perfect for boosting sales without the customer feeling pressured into buying.
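
As an illustration of the co-purchase idea behind the 'customers who bought this item also bought' feature, here is a minimal sketch based on co-occurrence counts. The baskets and item names are invented, and Amazon's actual recommendation algorithms are far more sophisticated:

```python
from collections import Counter

# Hypothetical order baskets: sets of item IDs purchased together.
baskets = [
    {"kindle", "case"},
    {"kindle", "case", "lamp"},
    {"kindle", "lamp"},
    {"case", "charger"},
]

def also_bought(item, baskets, top_n=2):
    """Rank items most often co-purchased with `item`."""
    counts = Counter()
    for basket in baskets:
        if item in basket:
            counts.update(basket - {item})  # count every co-occurring item
    return [i for i, _ in counts.most_common(top_n)]

print(also_bought("kindle", baskets))
```

Even this toy version shows why the feature scales: the co-occurrence counts can be precomputed offline and simply looked up at page-render time.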

Customer support

Amazon’s returns process is dealt with entirely online through a customer’s account. If there is an issue that does require a customer to speak with a customer service assistant over the phone, they will have access to the customer’s account and order details, meaning that any issues can be dealt with quickly and efficiently.

Kindle Marketplace

From personal accounts to storage, payments, and recommendations, the Kindle products and experience would not be the same without CRM.

Conclusion

No one can deny that, when referring to customer data, big data means large amounts of either transactional or analytical data. It can be structured, i.e. easily quantified in charts, graphs or other standard record-keeping applications, or unstructured, containing things like audio, video or other images. These data are collected and aggregated by companies from an array of mostly free, often social media, data sources and delivered to their existing CRM. To make sense of the big data collected, big data CRM therefore requires powerful data integration capabilities as well as data quality and cleaning work that must be addressed before any value can be extracted from analysis. Big data can provide businesses with metrics on sales, marketing and other areas to gauge performance and quality. It can also support better forecasting by allowing real-time decision-making, providing information on product inventories and customer segmentation, and assisting in the development of products and services.

Chapter 4: Analysis of the current information system and solution

Introduction

In accordance with scientific writing tradition, we devote this chapter to the study of the existing system through an identification of the customer relationship management systems and channels used within Jumia Customer Service. We also present the needs and expectations that we aim to fulfill following this analysis.

Inventory of data sources

It is generally agreed that customers are the most important asset of an organization. For building, managing and strengthening loyal and long-lasting customer relationship, the strategy called CRM is used.

As for Jumia, winning the love of its customers is key to the company strategy, but that strategy is not really reflected in the general contact processes between clients and customer service. Jumia uses a large variety of contact channels to stay connected with customers:

Xcally

Zendesk

Social media (Facebook)

Order Management System (OMS) developed by Jumia

All these channels are manually connected to each other, which means that the agent must collect and extract information in real time at the end of each contact between the client and customer service, then enter that information into the required information system.

Xcally

XCALLY Shuttle is an innovative omnichannel solution that integrates Asterisk™ with the Shuttle and Motion technologies, developed in the Xenia Lab research center. It is considered one of the best contact center management platforms for multiple channels (voice, chat, email, SMS, fax and custom channels such as social and video) through standard APIs:

Responsive Supervisor web interface HTML5

Support for three types of agent experience:

Windows Computer Telephony Integration (CTI) phone bar

External CTI SIP phones

WebRTC (experimental)

Integration with 3rd party software (i.e. Zendesk) using the Shuttle Push Technology

Advanced reporting

Dashboard:

The Dashboard section gives an overview of the state of the system, in particular:

Real-time monitoring of agents, calls and queues

Server and Disk Stats

Analytics and real-time graphs

Figure 8 Screenshot of the Xcally Dashboard

Real Time Graphs & Global Service Level

The Dashboard also provides real-time graphs and a global service level, showing dynamically:

The Waiting and Active calls;

The Answer Rate, showing the analytics about Completed, Abandoned and Timeout Calls;

The Service Level (SL) 90% [10s], SL 80% [20s], SL 70% [30s], SL 70%+ [30+s].

Figure 9 Real Time Graphs screenshots on Xcally Dashboard

Xcally and its real-time analytics enable Jumia systems to collect phone call tracking information including:

Calling phone number

Receiving phone number

When the call was made

Duration of the call

Sometimes, for cell phones, the location of the caller and receiver

With such a large volume of phone calls, there is certainly a Big Data problem. No one can possibly find anything of value amid all the calls, or gain significant insight, by looking at an Excel spreadsheet or a database showing rows and columns of all those calls.
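
To make the dashboard's service-level figures concrete, here is a minimal sketch of how a metric such as 'SL 80% [20s]' can be computed from call wait times. The wait times below are invented, and Xcally's exact formula may differ (for instance in how it treats abandoned calls):

```python
# Service level: the share of calls answered within a threshold.
wait_times_s = [3, 8, 12, 5, 25, 9, 40, 7, 11, 6]  # seconds until answer

def service_level(waits, threshold_s):
    """Percentage of calls answered within threshold_s seconds."""
    answered_in_time = sum(1 for w in waits if w <= threshold_s)
    return 100 * answered_in_time / len(waits)

for threshold in (10, 20, 30):
    print(f"SL within {threshold}s: {service_level(wait_times_s, threshold):.0f}%")
```

Plotted over a sliding window of recent calls, this single number is what the dashboard's real-time gauges summarize.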

Zendesk

Zendesk is a customer service solution designed to be smart and simple, and is used by many organizations to provide support to their customers. It is a Software-as-a-Service (SaaS) product.

Figure 10 Administrator user interface, with the Apps administration page selected

Zendesk offers its users a large variety of services, but rather than explaining all of the product terms up front, we explain only the most important ones used by Jumia to collect and extract customers' requests and complaints:

Ticket

A support request submitted by a customer asking for assistance. The term is deliberately generic, to capture the broad range of requests submitted to a customer service team.

Tickets are the means through which customers communicate with agents in Zendesk Support. Tickets can originate via a number of support channels, including email, Help Center, chat, phone call, Twitter, Facebook, or the API. All tickets have a core set of properties.

Field

Before a ticket is submitted, the user will provide details about his request by entering values into the ticket fields. Examples of default system fields are Subject, Description, and Priority. It’s also possible for administrators to add custom fields, which capture more specific information in the ticket.

Comment

These are pieces of text that are added to a ticket and form the conversation that will help solve it. Comments can be public, which means that they’re visible to end users who have access to the ticket. Comments can also be private, which means that only members of your internal support team and administrators will be able to read them.

Each week Zendesk's system generates thousands of consumer requests and complaints. Data from those complaints can help Jumia understand its marketplace and protect consumers.

A ticket captures a customer complaint or request, which highlights a problem, whether with a product, with employees or with internal processes. By hearing these problems directly from customers, we can investigate and improve in order to prevent further complaints, or provide what was requested as soon as possible.

Tickets originating via email, Help Center, chat, phone call, Twitter, Facebook, or the API share a core set of properties and can also be considered a Big Data problem.
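
As an illustration of those core properties, a ticket created through Zendesk's public REST API (POST to the `/api/v2/tickets.json` endpoint) carries them in a JSON payload. The sketch below only builds such a payload; the subject, body and email address are invented examples:

```python
import json

def build_ticket(subject, body, requester_email, priority="normal"):
    """Assemble the JSON body for a Zendesk ticket-creation request."""
    return {
        "ticket": {
            "subject": subject,                          # default system field
            "comment": {"body": body, "public": True},   # first comment
            "priority": priority,                        # default system field
            "requester": {"email": requester_email},
        }
    }

payload = build_ticket(
    "Order not delivered",
    "My order placed last week has not arrived yet.",
    "customer@example.com",
)
print(json.dumps(payload, indent=2))
```

In a real integration this payload would be sent with an authenticated HTTP POST; custom fields, as described above, would appear as an additional `custom_fields` list in the same structure.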

Social media

One of the best ways to connect with customers is through social media, such as Facebook and Twitter. With social media, Jumia can reach out to its customers at any moment rather than waiting for customers to send emails, requests or phone calls with feedback.

Figure 11 Jumia's presence on social media

Figure 12 Interfaces used by Jumia customer service

Jumia uses its Facebook and Twitter fan pages to engage followers and keep conversations going.

At Jumia, Facebook and Twitter are also used as a customer service channel: social media representatives answer customer questions and concerns directly, and each private message or comment on Facebook or Twitter creates a Zendesk ticket that is processed by the CS team.

In brief, understanding customers requires a thoughtful analysis of where and how we can collect meaningful data. By better defining which aspects of their behavior or profiles are most significant to our business, we can start to measure and analyze better ways to engage them and ultimately sell more. All these contact forms and channels are legitimately the main sources of customer information.

If we focus only on the field of CRM, the data gathered every day from each source, such as the number of visits to our website per day or in real time, conversations on blogs and social media, votes, reviews, phone calls, and emails, adds to the already substantial volume of data collected.
To this is added open data: all the data that are publicly available and exploitable thanks to increasingly complete APIs, for example data related to the environment such as weather or geolocation, or data on searches made on Google by internet users.

Our project is a process designed to blend data from surveys, inbound customer communications, social media, Zendesk and Xcally, so that managers can develop a comprehensive perspective of the total customer experience.   It is a process intended to make all data sources “work together” to furnish insights that could not be derived from any single source alone.

Order management system

The Order Management System (OMS) is powered by Jumia; for confidentiality reasons we cannot go into details.

In general, an OMS is a computer software system used for order entry and processing. It is an integrated order management system encompassing these modules:

Product information (descriptions, attributes, locations, quantities)

Vendors, purchasing, and receiving

Marketing (catalogs, promotions, pricing)

Customers and prospects

Order entry and customer service (including returns and refunds)

Financial processing (credit cards, billing, payment on account)

Order processing (selection, printing, picking, packing, and shipping)
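
The order-entry and order-processing modules above can be pictured through a minimal order record; the field names and the status progression below are illustrative, not Jumia's actual OMS schema:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class OrderLine:
    sku: str
    quantity: int
    unit_price: float

@dataclass
class Order:
    order_id: str
    customer_id: str
    lines: List[OrderLine] = field(default_factory=list)
    status: str = "entered"  # hypothetical flow: entered -> picked -> packed -> shipped

    def total(self) -> float:
        """Order value, needed by the financial-processing module."""
        return sum(l.quantity * l.unit_price for l in self.lines)

order = Order("ORD-1001", "C-42",
              [OrderLine("SKU-1", 2, 19.9), OrderLine("SKU-2", 1, 5.0)])
print(order.total())
```

A real OMS links each such record to the product, vendor, and customer modules listed above through shared identifiers like `sku` and `customer_id`.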

Potential solution: Salesforce

As a starting point for a Big Data solution, we can all agree that Big Data consists of the huge volumes of data generated every day that can be mined for information. In our case, the data is generated from multiple sources such as phone calls, social media platforms, server logs, web clickstreams, mobile apps, database stores, business records, etc. The possibilities that Big Data offers are endless, but we need the right tools to derive detailed insights from heaps of data.

So, in order to reap the full benefits of our data sources, Jumia would first need to invest in customer relationship management software: a CRM that can keep track of the data trail customers leave on various online platforms and sources and present the data in a rational and coherent way.

Interestingly, Jumia is already connected to the Salesforce CRM software, which is known for its intelligence and its ability to handle such large amounts and volumes of data.

In this chapter we present this existing potential solution, Salesforce, which will constitute our "Big CRM" environment and enable us to exploit, in real time, the limitless data generated by the sources connected to it.

Salesforce

Salesforce is a cloud computing company offering a variety of services (SaaS) and products (PaaS). Salesforce started as a Software-as-a-Service (SaaS) CRM company and now provides various software solutions and a platform for users and developers to develop and distribute custom software.

Salesforce's customer relationship management (CRM) service is broken down into several broad categories. Of the large portfolio of intelligent products and services that Salesforce offers its users, Jumia uses just two:

Sales Cloud, which is used by Jumia's Sales team

Marketing Cloud, which is used by Jumia's marketing team

In order to reap the full benefits of the generated data, Jumia first needs to invest in the integration of all the existing data sources and apply a Big Data approach to them, so that we can take full advantage of the huge potential Salesforce offers to create what we call the "Big CRM".

Salesforce.com has already made significant acquisitions in the big data analytics space. In time, all of the possibilities of big data can become realities for businesses like Jumia. Through enhanced analysis of the massive volumes of data coming from all sources, Salesforce gives us the ability to transform Big Data into customer success with the Salesforce Analytics Cloud.

Figure 13 The power of Salesforce in today’s tech-savvy world.

In our study we will concentrate on two Salesforce products:

Service Cloud: which will be our purely “Big CRM” interface to integrate all data sources.

Service Analytics Cloud: which will be our Big Data analysis mirror to explore data and get answers instantly in real-time.

Salesforce Service Cloud

Salesforce Service Cloud is a tool that lets users engage and interact with their customers through all known business channels, be it calls or even social media. It helps businesses manage customer cases and other issues from a single, unified interface.

Customers get the full benefit of Salesforce's Service Cloud as they can tap into the system's knowledge base and get in touch with other community members while looking for solutions to their issues. With this kind of setup, Salesforce Service Cloud becomes a platform that increases customer engagement and improves customer retention.

The system also offers a variety of useful integrations with Salesforce products and third-party applications, covering providers such as Desk.com, Vocalcom, InGenius, Zendesk, Xcally etc.

Service Cloud puts all the information representatives need at their fingertips, all in one console. With Service Cloud from Salesforce, agents can manage cases, track customer history, view dashboards and much more, all in one view.

Service Cloud is the technology that will enable us to gather everything in one place and build a global, real-time view of customers.

Figure 14 Service Cloud / Big CRM environment

Salesforce Analytics Cloud

Salesforce Analytics Cloud was designed to empower everyone to explore all forms of data, uncover new insights and take action instantly from any device. With Salesforce Wave for Big Data, it extends the volume, variety and velocity of big data to the business user, unlocking its value to transform every customer relationship. Each Salesforce product (sales, service and marketing) can discover correlations and patterns across any combination of transactional data and unstructured or semi-structured big data sets, all from within the Analytics Cloud.

Salesforce Analytics Cloud can be considered a big data reporting interface that can help transform virtually all aspects of the enterprise. From quickly producing actionable intelligence, to driving productivity, to gaining real-time visibility into customers and markets, analysis and big data reporting promise to deliver a wealth of benefits for competitive advantage.

Figure 15 Salesforce wave analytics to uncover insights and take instant action overview

The Salesforce Wave platform is one of the fastest emerging analytics platforms, offering greater capabilities than simple business intelligence (BI) or reporting tools.

Figure 16 Simplified diagram that explains the process of building an app within Salesforce Wave analytics

Deep Dive into Jumia’s Current Information System

In brief, we can close the loop on Jumia's information system and give a close-up view of all its data sources, whether those discussed earlier or others, such as:

Figure 17 Deep dive into Jumia's current Information system

Conclusion

This chapter allowed us to survey both the existing data sources within Jumia in general and its current information system in particular. That said, this study must be reflected in the choices and models designed to allow a better projection with respect to the studied context and the existing potential solutions.

Chapter 5: Design & Modelization of the study case

Introduction

This chapter presents the global design of the proposed solution, from the integration of all customer data sources (CS support and contact sources) within Salesforce to the integration of a Big Data analysis solution, in order to build a 360° view of the customer and initiate a client segmentation approach to increase customer satisfaction.

We first examine the full scope of the integration process and the manipulations that can be envisaged, before moving on to the global design, which focuses on the Big Data solution.

Designing the overall solution

The implementation plan requires drawing a guideline for the project. As such, it is recommended to use increments (coherent functional parts) where each increment covers a set of technological requirements that can be tested separately. The architecture proposed includes five decisive phases, which we present as follows:

Data extraction

This stage covers the accumulation of information from several kinds of data sources, data marts, and data warehouses. The data needed in our project is data about each of our customers, collected on a daily basis through their regular actions on Jumia's sites and elsewhere. In addition to internal data sources, external sources are considered just as important.

Hence, Zendesk, Xcally, OMS and social media data will be the raw material of this mechanism, encapsulating the data coming from:

Zendesk as Jumia’s current CRM software

Xcally as the phone call contact channel

OMS as our structured Data source

And last but not least, the social networks, especially Facebook and Twitter.

The actual collection comes later and examines the range of technological possibilities for capturing data from the selected sources according to a precise, extensible and flexible thematic classification scheme that supports the submitted queries. The choice of framework is conditioned by the requirements dictated by the 3Vs characterizing the object to be extracted, which impose new metrics, namely frequency and recency. A benchmark of the plethora of available tools would help in deciding which one to deploy. This benchmark can include several criteria, including:

Scalability: the tool's ability to meet future needs through its highly flexible architecture;

Integrity: the ability of the tool to reproduce the information as accurately as possible, which requires that the tool understand the characteristics of what it receives: format, frequency, type of encoding, etc.;

Interoperability: level of architecture independence from other implementations;

Manageability: including the human/machine experience, fault tolerance, response time, and scalability through the ability to respond to an increasing volume of requests;

Streaming: the ability of the tool to deliver results in real time.

Preprocessing

Commonly referred to as formatting or data conditioning, this phase includes data cleaning routines that "clean" the data by filling in missing values, smoothing noisy data, identifying or removing outliers, resolving inconsistencies, and transforming data sets to facilitate their ingestion by the rest of the chain.
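
Two of the cleaning steps named above, filling missing values and removing outliers, can be sketched as follows. The numeric values are invented, and a production pipeline would use a dedicated library and more robust rules:

```python
from statistics import mean, stdev

# Toy order-amount series with missing values (None) and one extreme outlier.
raw = [120.0, None, 95.0, 110.0, 5000.0, 105.0, None, 98.0]

def clean(values, z_cut=2.0):
    """Fill missing values with the mean, then drop points beyond z_cut sigmas."""
    present = [v for v in values if v is not None]
    fill = mean(present)
    filled = [fill if v is None else v for v in values]
    mu, sigma = mean(filled), stdev(filled)
    return [v for v in filled if abs(v - mu) <= z_cut * sigma]

print(clean(raw))
```

Mean imputation and a fixed z-score cut are the simplest possible choices; depending on the distribution of the source data, median imputation or percentile-based trimming may be more appropriate.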

Data storage & integration

As already stated, the integration process is not only one of the five phases but also one of our project objectives.

Data integration involves combining data from several disparate sources, stored using various technologies, to provide a unified view of the data. It becomes increasingly important in a case like ours, where the main objective is to create a Big CRM, defined as a CRM that combines all customer data sources in order to prepare them for a Big Data analysis process and data mining techniques.

As for the storage phase, the choice is arbitrated by data management requirements. Between batch processing, micro-batch and real-time processing, the chosen solution must support the analytical processes and ensure the streaming process.

Storage consists of two components: the shell infrastructure, adaptable to any environment, and the data storage methods deployed on the hardware layer to support large-scale or on-demand intensive analysis of huge data sets.
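
The unified-view goal of the integration phase can be sketched as a join of per-source records on a shared customer ID. The source names follow the chapter; the records themselves, and the idea of a flat merged dictionary, are invented simplifications:

```python
# One record per source, keyed by a shared customer identifier.
oms = {"C-42": {"orders": 7, "total_spent": 1240.5}}
zendesk = {"C-42": {"open_tickets": 1}}
xcally = {"C-42": {"calls_last_month": 3}}

def unified_view(customer_id, *sources):
    """Merge every source's record for one customer into a single view."""
    view = {"customer_id": customer_id}
    for source in sources:
        view.update(source.get(customer_id, {}))  # missing sources contribute nothing
    return view

print(unified_view("C-42", oms, zendesk, xcally))
```

In the real solution this join happens inside the Big CRM rather than in application code, but the principle is the same: a common key turns isolated silos into one customer profile.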

Data Analysis

Data analysis is the process of examining large and varied data sets, i.e. big data, to uncover problem sources, unknown correlations, market trends, customer preferences and other useful information that can help us make more informed business decisions.
Through various methods, tools and algorithms, data analysis is the heart of the process. The purpose of the analysis is to examine the large data sets and extract useful values or insights for understanding the data.

Predictive analysis, sentiment mining, statistical analysis and machine learning are all possible avenues of interest for our study, each in its own way.

Figure 18 Data analysis cycle

Data analysis results visualization

As is commonly known, data visualization is the presentation of data in a pictorial or graphical format. It enables decision makers to see analytics presented visually, so they can grasp difficult concepts or identify new patterns, whether through point clouds, tag clouds, bubble charts, or histogram matrices over maps.

With its interactive visualization, Salesforce has been chosen by Jumia as its data visualization interface, which means we will be able to take the concept a step further, using a CRM tool and new technology to drill down into charts and graphs for more detail, interactively changing what data we see and how it is processed.

Refined modeling of the solution in relation to BDA:

With Big Data and the integration processes within Salesforce, big data analytics has become a necessity. This interest reflects the need to use advanced techniques, mostly data mining and statistics, to find hidden patterns in our (big) data solution.

So, talking about data mining leads us directly to our main project objectives: building a 360° view of the customer, clustering, classification, and last but not least increasing customer retention and loyalty.

In this section we describe data mining in our "Big CRM". Next we study the techniques, such as sentiment analysis, clustering and classification, that we will rely on in order to give our project a valuable meaning.

Data mining within a CRM approach:

Data mining, also called KDD (Knowledge Discovery in Databases), is the process of extracting implicit, previously unknown and potentially useful information and knowledge from plentiful, incomplete, noisy, fuzzy and stochastic real-world data. Simply speaking, it is a process for picking up information and knowledge of potential value that cannot be discovered directly from a mass of data. Here, we use clustering, classification and neural networks to mine the data.

Data Mining is defined as “the process of discovering meaningful new correlations, patterns, and trends by digging into large amounts of data stored in warehouses”. Data mining is not specific to any industry. It requires intelligent technologies and the willingness to explore the possibility of hidden knowledge that resides in the data.

Figure 19 Data mining process

Data Mining Techniques in Analyzing the Effectiveness of CRM

A rule-based data mining technique has been used to generate new rules and patterns from sales, marketing, IT and customer data.

Customer data are clustered using several customer characteristics in order to recognize and understand the customer. Data mining is used to extract knowledge from all existing data sources and evaluate it for future purposes.

Clustering and classification techniques are used to put the data into a manipulable form.

Whenever a query is raised, the reply is given with the help of the mined data, and the result is saved in a database for future augmentation:

Classification and prediction aim to build a model that predicts future customer behavior by classifying all data source records into a number of predefined classes based on certain criteria. Classification methods can be used to classify potential customers into the existing categories. Customers are abandoned if they are found unlikely to bring profit to the enterprise.
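As an illustration (not Jumia's actual business rules; the thresholds, field names and sample customers below are hypothetical), a rule-based classification into predefined classes might look like:

```python
# Minimal sketch of rule-based classification of customers into
# predefined classes. Thresholds (5000, 1000) and fields are made up.

def classify_customer(total_spend, orders):
    """Assign a customer to a predefined class based on simple criteria."""
    if total_spend >= 5000 and orders >= 10:
        return "Gold"
    if total_spend >= 1000:
        return "Silver"
    return "Bronze"

customers = [
    {"id": "C1", "total_spend": 7200, "orders": 15},
    {"id": "C2", "total_spend": 1500, "orders": 3},
    {"id": "C3", "total_spend": 200,  "orders": 1},
]

segments = {c["id"]: classify_customer(c["total_spend"], c["orders"])
            for c in customers}
print(segments)  # {'C1': 'Gold', 'C2': 'Silver', 'C3': 'Bronze'}
```

In practice such rules would be learned from historical data rather than hand-written, but the principle of mapping records into predefined classes is the same.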

Clustering is used to group similar customers together and divide dissimilar customers into different groups. Using cluster analysis, the enterprise can find customer groups with different characteristics and purchase patterns, enabling a more efficient marketing strategy.

Figure 20 Application of Classification

Conclusion

To conclude this section, data mining is part of our Big CRM and a growing discipline in the data and big data management community. Applying data mining techniques within CRM provides satisfactory results and is clearly beneficial. Future work will focus on customer retention techniques that enhance the customer relationship via data mining and thereby lead to good profit.

Chapter 6: Implementing the solution prototype

Introduction

This is the long-awaited section of this report: the technological ramification of the solution prototyping. We have chosen to test a wide range of data collection, pre-processing and analysis tools, as far as possible all freely accessible, open source and mostly under the Apache label.

From this point of view, we will start by presenting the main data source integration processes within Salesforce, then present the wide range of Big Data technologies used according to the place they occupy in the mesh specified in Chapter 5, before returning to the implementation of the scenario chosen as a prototype of the solution.

Data sources Integration & preprocessing

As we know, data integration involves combining data from several disparate sources, stored using various technologies, to provide a unified view of the data.

As we previously announced, there are four main data sources within Jumia’s customer service.

Following the technical PMO of the project, we are obliged to first integrate all four data sources within Salesforce; we will then be able to extract the data from Salesforce directly and in real time. We will first examine the integration process and the manipulations that can be envisaged in full, before moving on to the global design that focuses on the Big Data solution.

Given the nature of the project, we are obliged to follow the PMO from integration to the data visualization phase, but for technical reasons we will extract some data directly from its sources in order to feed the chosen Big Data engine.

So first, we will explain in detail the integration of Xcally and Zendesk only, since social media will be treated separately, from a data design point of view rather than a technical integration one.

PS: The integration technical assets are attached as an annex.

Xcally Integration

XCALLY Shuttle provides a seamless computer telephony integration with Salesforce. The integration works on every Salesforce.com product (as seen previously).

In our project we are trying to collect all the telephone call detail records (CDR) in the Salesforce Service Cloud for Jumia’s customer service, aiming to gather them in one interface. These data generally break down into:

Cases which will be created after every call

Calling phone number

Receiving phone number

When the call was made

Duration of the call

Sometimes, for cell phones, the location of the caller and receiver

With hundreds of phone calls per agent every day, this generates billions of records, which can certainly be considered a Big Data problem; it also turns out to be a door to a lot of knowledge once the CDR are loaded into our Salesforce Service Cloud, especially when additional information identifies the phone numbers associated with known customers.

With such an integration we will be able to extract and use this data as it should be used. We will first present the functionalities we expect from the Xcally integration, then detail the data design generated by each phone call, whether outbound or inbound.

Functionalities overview:

Tableau 2 Xcally integration functionalities overview

Data Calls design

Each outbound or inbound call creates a case, and each case automatically provides us with important data, designed as follows:

Tableau 3 Types of Data about calls

Of course, just having the call data records alone is pretty useless. Without advanced analytics software, the data is overwhelming. Our project aims to provide a platform to gain insight into the massive number of phone calls. It's about managing large amounts of data, seeing the relationships between entities (phone numbers and people), drilling down where necessary, and filtering based on time, geography, and relationships:

Call detail records are imported into Salesforce service cloud.

Link Analysis Networks can be used to visually see the calls made by any phone number

Multiple levels of phone calls can be linked to identify groups of phones related to each other (cells of activity)

Geospatial Mapping and integration with Google Earth to see calls across the world

Social Network Analysis (SNA) to identify related phones (cells) and spanners between cells

Temporal Analysis can be used to filter data to specific time ranges

Link Traversal Analysis can be performed between two phone numbers to show all the phones related to them through multiple levels, and quickly filter out the unrelated calls to identify the "community of interest".
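As a sketch of link traversal over call detail records, the following Python example (with made-up phone numbers) builds a call graph and returns the numbers reachable from a seed within a hop limit, i.e. its "community of interest":

```python
# Sketch of link-traversal analysis over call detail records (CDR):
# build an undirected call graph and find all numbers reachable from a
# seed number within a given number of hops. Sample records are
# illustrative, not real Jumia data.
from collections import deque, defaultdict

calls = [
    ("111", "222"), ("222", "333"), ("333", "444"),
    ("555", "666"),  # an unrelated cell of activity
]

graph = defaultdict(set)
for a, b in calls:
    graph[a].add(b)
    graph[b].add(a)

def community(seed, max_hops):
    """Numbers linked to `seed` through at most `max_hops` calls (BFS)."""
    seen, frontier = {seed}, deque([(seed, 0)])
    while frontier:
        number, hops = frontier.popleft()
        if hops == max_hops:
            continue
        for peer in graph[number]:
            if peer not in seen:
                seen.add(peer)
                frontier.append((peer, hops + 1))
    return seen

print(sorted(community("111", 2)))  # ['111', '222', '333']; '444' is 3 hops away
```

Filtering on time ranges or geography would simply restrict which `calls` rows feed the graph before traversal.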

Zendesk integration:

With more than a million customers in more than 20 different countries, Zendesk has become one of the main customer data sources in Jumia. The service collects an incredible amount of data about Jumia’s customer experience in both its ticketing and help desk products, data that absolutely should be analyzed.

While Zendesk does provide the capability to create basic reports through its interface, the solution is neither flexible nor intelligent enough to build sophisticated analyses or combine Zendesk data with data from other services (e.g. social media data). To make the best use of the data collected by Zendesk, we will integrate it with Salesforce in order to be able to extract data for our Big Data solution.

The challenge of such an integration then lies in the entirety and types of the data and in its integration.

Functionalities overview

Zendesk for Salesforce helps close the loop between sales and support teams by enhancing visibility into customer information, gathering a 360° view of the customer experience and support activity between Salesforce and Zendesk.

Since Zendesk & Salesforce have been already installed in Jumia’s system, we are able to choose which features we want to configure.

The integration is designed so that we can choose the features we need and ignore others.

Figure 21 Zendesk Features within SalesForce

In addition to the ticket features, we can also configure an ongoing sync from Zendesk to Salesforce that enables us to sync Contacts/Leads and Accounts to users.

Integrating such features demands a full understanding of the data assets of the enterprise as a whole and of the software environment as a unit.

Data design

In order to know what data we can migrate from Zendesk to Salesforce, we present the types of data we want to extract from Zendesk and their equivalents within Salesforce, so as to have a global view of the data we will extract next.

Tableau 4 Types of data we need to extract from Zendesk & its equivalent within Salesforce

Example of Zendesk Data Modeling within Salesforce:

Before we dive into SQL, it is worth emphasizing how important data cleansing and transformation are in our case. They provide the underlying layer upon which all of our later analysis is built.

We cannot deny that the main feature Jumia’s support service needs is Tickets; for that reason, the subsequent work depends on the following four tables having clean data:

Our first step will be to combine ticket, audit and audit_events tables returned by the Zendesk API to get all audit events for every ticket.
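Since the combining query itself appears only as a figure, here is a rough stand-in: an analogous three-way join run in SQLite, with simplified, made-up column names and sample rows:

```python
# Illustrative sketch of joining the ticket, audit and audit_events
# tables (as returned by the Zendesk API) to list every audit event per
# ticket. Columns and rows are simplified assumptions, not the real schema.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE ticket (id INTEGER, subject TEXT);
CREATE TABLE audit (id INTEGER, ticket_id INTEGER, created_at TEXT);
CREATE TABLE audit_events (audit_id INTEGER, type TEXT, value TEXT);

INSERT INTO ticket VALUES (1, 'Late delivery');
INSERT INTO audit VALUES (10, 1, '2017-05-01'), (11, 1, '2017-05-02');
INSERT INTO audit_events VALUES (10, 'Create', 'new'), (11, 'Change', 'solved');
""")

rows = con.execute("""
SELECT t.id, t.subject, a.created_at, e.type, e.value
FROM ticket t
JOIN audit a        ON a.ticket_id = t.id
JOIN audit_events e ON e.audit_id  = a.id
ORDER BY a.created_at
""").fetchall()

for row in rows:
    print(row)
# (1, 'Late delivery', '2017-05-01', 'Create', 'new')
# (1, 'Late delivery', '2017-05-02', 'Change', 'solved')
```

The real query joins the same three tables on their ticket and audit identifiers; only the column set differs.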

In order to get a sense of what this data looks like, here’s the output of the above query on our test data:

Tableau 5 An output SQL table of test data Table

Next, we combine the audit information for each ticket with the ticket table to get a summary table that will be used repeatedly in our subsequent analysis:

Figure 22 An Overview of the combining process of the audit information for each ticket with the ticket table

Without even having to dive into the big data process, the resulting dataset is already extremely useful. We can think of every row in this dataset as the answer to the question: “What was the lifecycle of this ticket?” Here is sample output to give a better sense of what it looks like:

Figure 23 Resulting dataset overview sample

Now we can use this single table to look at the number of new and solved tickets across time, the time it took to reply to a ticket, the time it took to solve it, etc.
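As an illustration of the kind of metric this table supports, the sketch below computes time-to-solve from hypothetical lifecycle rows (field names and timestamps are made up):

```python
# Sketch of deriving a lifecycle metric (hours to solve) from per-ticket
# status events. Statuses and timestamps are illustrative assumptions.
from datetime import datetime

events = [
    {"ticket_id": 1, "status": "new",    "at": "2017-05-01 09:00"},
    {"ticket_id": 1, "status": "solved", "at": "2017-05-02 15:00"},
    {"ticket_id": 2, "status": "new",    "at": "2017-05-01 10:00"},
]

def hours_to_solve(ticket_id):
    """Hours between a ticket's 'new' and 'solved' events, or None if open."""
    times = {e["status"]: datetime.strptime(e["at"], "%Y-%m-%d %H:%M")
             for e in events if e["ticket_id"] == ticket_id}
    if "solved" not in times:
        return None
    return (times["solved"] - times["new"]).total_seconds() / 3600

print(hours_to_solve(1))  # 30.0
print(hours_to_solve(2))  # None (still open)
```

Counting new versus solved tickets per day is the same pattern: group the events by date and status, then aggregate.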

Proof of concept & Big Data technical engine choice

In this section, we introduce the new generation of massive data management and processing tools. We will shed light on the tools covered by our benchmark in order to choose the one most responsive to our needs, and then run technical tests to illustrate the process of acquiring and extracting value from the collected data.

Benchmark between Apache Spark and Hadoop

Hadoop and Spark are popular Apache projects in the big data ecosystem. Apache Spark is an improvement on the original Hadoop MapReduce component of the Hadoop big data ecosystem. There is great excitement around Apache Spark, as it provides a real advantage for interactive data interrogation on in-memory data sets and for multi-pass iterative machine learning algorithms. However, there is a hot debate on whether Spark can mount a challenge to Apache Hadoop by replacing it and becoming the top big data analytics tool.

For that reason we will especially shine a light on Spark as the newest engine in the big data field.

Hadoop Mapreduce

Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs.

Hadoop history

As the World Wide Web grew in the late 1990s and early 2000s, search engines and indexes were created to help locate relevant information amid the text-based content. In the early years, search results were returned by humans, but as the web grew from dozens to millions of pages, automation was needed. Web crawlers were created, many as university-led research projects, and search engine start-ups took off (Yahoo, AltaVista, etc.).

Figure 24 Hadoop timeline

Why is Hadoop important?

Ability to store and process huge amounts of any kind of data, quickly: with data volumes and varieties constantly increasing, especially from social media and the Internet of Things (IoT), that's a key consideration.

Computing power: Hadoop's distributed computing model processes big data fast. The more computing nodes you use, the more processing power you have.

Fault tolerance: Data and application processing are protected against hardware failure. If a node goes down, jobs are automatically redirected to other nodes to make sure the distributed computing does not fail. Multiple copies of all data are stored automatically.

Flexibility: Unlike traditional relational databases, you don’t have to preprocess data before storing it. You can store as much data as you want and decide how to use it later. That includes unstructured data like phone calls, text, images and videos.

Low cost: The open-source framework is free and uses commodity hardware to store large quantities of data.

Scalability: You can easily grow your system to handle more data simply by adding nodes. Little administration is required.

Apache Spark

Apache® Spark™ is a powerful open source processing engine built around speed, ease of use, and sophisticated analytics. It was originally developed at UC Berkeley in 2009.

Apache Spark is a cluster computing platform designed to be fast and general-purpose.

Spark is designed to cover a wide range of workloads that previously required separate distributed systems, including batch applications, iterative algorithms, and interactive queries and streaming. By supporting these workloads in the same engine, Spark makes it easy and inexpensive to combine different processing types, which is often necessary in production data analysis pipelines. In addition, it reduces the management burden of maintaining separate tools.

Spark is designed to be highly accessible, offering simple APIs in Python, Java, Scala and SQL, and rich built-in libraries. It also integrates closely with other big data tools.

Figure 25 Spark bases

The Spark Stack

In this section we will briefly introduce each of the components shown in this figure:

Figure 26 The Spark Stack

Spark Core contains the basic functionality of Spark, including components for task scheduling, memory management, fault recovery, interacting with storage systems, and more. Spark Core is also home to the API that defines Resilient Distributed Datasets (RDDs), which are Spark’s main programming abstraction. RDDs represent a collection of items distributed across many compute nodes that can be manipulated in parallel.
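As a rough, Spark-free illustration of the RDD idea, the following pure-Python sketch splits a collection into partitions, maps them in parallel (threads standing in for cluster nodes), and reduces the results:

```python
# Pure-Python sketch of the RDD concept: a collection split into
# partitions that are mapped in parallel, then reduced. Real Spark
# distributes partitions across cluster nodes; threads stand in here.
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

data = list(range(1, 11))             # the "dataset"
partitions = [data[0:5], data[5:10]]  # two "compute nodes"

def map_partition(part):
    return [x * x for x in part]      # the map step, applied per partition

with ThreadPoolExecutor(max_workers=2) as pool:
    mapped = list(pool.map(map_partition, partitions))

# the reduce step combines per-partition results into one value
total = reduce(lambda a, b: a + b, (sum(p) for p in mapped))
print(total)  # 385 == 1^2 + 2^2 + ... + 10^2
```

In Spark the same pipeline would be `sc.parallelize(data).map(lambda x: x * x).sum()`, with the partitioning and scheduling handled by Spark Core.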

Spark SQL is Spark’s package for working with structured data. It allows querying data via SQL as well as the Apache Hive variant of SQL, called the Hive Query Language (HQL), and it supports many sources of data including Hive tables, Parquet, and JSON. Beyond providing a SQL interface to Spark, Spark SQL allows developers to intermix SQL queries with the programmatic data manipulations supported by RDDs in Python, Java and Scala, all within a single application, thus combining SQL with complex analytics. This tight integration with the rich computing environment provided by Spark makes Spark SQL unlike any other open source data warehouse tool. Spark SQL was added to spark in version 1.0. Shark was an older SQL-on-Spark project out of UC Berkeley that modified Apache Hive to run on Spark. It has now been replaced by Spark SQL to provide better integration with the Spark engine and language APIs.

Spark Streaming is a Spark component that enables processing live streams of data. Examples of data streams include log files generated by production web servers, or queues of messages containing status updates posted by users of a web service. Spark Streaming provides an API for manipulating data streams that closely matches the Spark Core’s RDD API, making it easy for programmers to learn the project and move between applications that manipulate data stored in memory, on disk, or arriving in real-time. Underneath its API, Spark Streaming was designed to provide the same degree of fault tolerance, throughput, and scalability that the Spark Core provides.

MLlib Spark comes with a library containing common machine learning (ML) functionality called MLlib. MLlib provides multiple types of machine learning algorithms, including classification, regression, clustering and collaborative filtering, as well as support in functionality such as model evaluation and data import. It also provides some lower level ML primitives including a generic gradient descent optimization algorithm. All of these methods are designed to scale out across a cluster.

Data Bricks

Databricks is a company founded by the creators of Apache Spark and a number of executives with strong past experience starting up companies such as Conviva, Opsware, and Nicira.

Databricks offers a cloud platform powered by Spark that makes it easy to turn data into value, from ingest to production, without the hassle of managing complex infrastructure, systems and tools.

Figure 27 Introduction to DataBricks

Databricks enhances Spark: it makes it very easy to create a Spark cluster out of the box instead of having to determine the infrastructure requirements for each particular use case.

Databricks Cloud helps analysts by organizing the data into "notebooks" and making it easy to visualize data through dashboards. It also makes it easy to analyze data using machine learning (MLlib), GraphX and Spark SQL.

Why is Spark important?

Figure 28 Spark features

General Comparison

Tableau 6 Apache Spark Vs Hadoop Mapreduce

The differences between Apache Spark and Hadoop MapReduce show that Apache Spark is a much more advanced cluster computing engine than MapReduce. Spark can handle any type of requirement (batch, interactive, iterative, streaming, graph) while MapReduce is limited to batch processing. Spark is one of the favorite choices of data scientists, and it is growing very quickly and replacing MapReduce.

For these reasons we take Spark as our big data engine rather than Hadoop.

Downloading & Installing Spark under a Linux environment

The first step to using Spark is to download and unpack it into a usable form. Let’s start by downloading a recent precompiled released version of Spark that matches the Hadoop cluster (HDFS 2.4) we already have; we do not need to install the Hadoop package a second time.

To do so, we visited http://spark.apache.org/downloads.html, selected the package type “Pre-built package for Hadoop 2.4” and clicked “direct file download”. This downloads a compressed tar file, or “tarball,” called spark-2.1.0-bin-hadoop2.4.tgz.

Now that we have downloaded Spark, let’s unpack it and take a look at what comes with the default Spark distribution; the following commands accomplish all of that.

Then we should format and start the existing HDFS in order to compile Spark:

Figure 29 An overview of the HDFS formatting script

Spark comes with interactive shells that make ad-hoc data analysis easy. Spark’s shells will feel familiar if you have used other shells such as those in R, Python, and Scala, or operating system shells like Bash or the Windows command prompt.

The first step is to open up one of Spark’s shells:

Figure 30 An overview of sparkshell

As we can see, the default shell we entered is the Scala one, since Apache Spark is built on Scala.

Being proficient in Scala thus helps us dig into the source code when something does not work as we expect, and since Spark is implemented in Scala, using Scala gives us access to the latest and greatest features: most features are first available in Scala and then ported to Python.

In summary, Scala is my first choice of programming language for Spark projects and I will keep Python in mind when the use case fits.

PS: In order to make things easier, we decided to work with Eclipse on Ubuntu as our integrated development environment (IDE) for the Scala programming for Spark.

Technical choice of data extraction & data storage

Extracting Data from Salesforce

After integrating all the data sources within Salesforce, we are now able to extract data from Salesforce, knowing that all the data are converted to the same design as Salesforce data (contacts, emails, cases, tickets, etc.), using an existing library that connects Spark with Salesforce and the Salesforce Analytics Cloud (Wave).

Dependencies needed:

Within Eclipse, as an Apache Maven project, the following dependencies are needed to link against this library in our program:

Features

This library, which links Salesforce to our Spark environment, enables an important set of features for us, such as:

Dataset Creation – Create dataset in Salesforce Wave from Spark DataFrames

Read Salesforce Wave Dataset – User has to provide SAQL to read data from Salesforce Wave. The query result will be constructed as dataframe

Read Salesforce Object – User has to provide SOQL to read data from Salesforce object. The query result will be constructed as dataframe

Update Salesforce Object – Salesforce object will be updated with the details present in dataframe

Options

Salesforce Wave username. This user should have the privilege to upload datasets and to execute SAQL or SOQL.

Salesforce Wave Password. We should append security token along with password.

(Optional) Salesforce Login URL.

(Optional) Name of the dataset to be created in Salesforce Wave. (Required) for Dataset Creation.

(Optional) Salesforce Object to be updated. (e.g.) Contact.

(Optional) Metadata configuration which will be used to construct [Salesforce Wave Dataset Metadata] .

(Optional) SAQL query to use to query Salesforce Wave. Mandatory for reading Salesforce Wave dataset.

(Optional) SOQL query to use to query Salesforce Object. Mandatory for reading Salesforce Object like Opportunity.

(Optional) Salesforce API Version. Default 35.0

(Optional) Inferschema from the query results. Sample rows will be taken to find the datatype.

(Optional) result variable used in SAQL query. To paginate SAQL queries this package will add the required offset and limit. For example, in this SAQL query q is the result variable.

(Optional) Page size for each query to be executed against Salesforce Wave. Default value is 2000. This option can only be used if the result variable option is set.

Scala API

Here is an example of writing a dataset using the spark-csv package in order to load DataFrames from Salesforce into CSV form, in real time within Databricks:

Figure 31 An overview of the SCALA API section of the project program

Then we update the Salesforce object; the CSV file should contain an Id column followed by the other fields to be updated.

Figure 32 An overview of the SCALA API for the updates

We can clearly see the result within Databricks, as follows:

Figure 33 Databricks overview of the spark-Salesforce package within it

Figure 34 Databricks overview of the table created after the data extraction

Extracting Data from Twitter

Since the project spans Jumia as an entity present in more than 26 countries, which consider Twitter their first go-to social medium, our sentiment analysis will concentrate on Twitter data in real time.

For that, and before starting with Spark, we must create our Twitter API credentials in order to be able to extract data and use it.

Twitter API

We should first create our Twitter API and get the authentication and access keys at https://apps.twitter.com/app

Figure 35 Creating the Twitter API

Scala Twitter Streaming API

After creating the Twitter API, we next need to create a streaming program that runs constantly, fetching Twitter data in real time and clustering the tweets based on the sentiment behind them.

Spark enables us to write such a program quickly and easily, with a minimal number of lines of code. We will practice a wide range of Spark commands here to implement this application.

We would need first to import the necessary packages into the Spark program:

Figure 36 Packages needed to be imported to our Spark Program

Our Spark Program:

We next need to set the system properties so that the Twitter4j library used by the Twitter stream uses them to generate the OAuth credentials.

The streaming context takes two parameters: the application configuration and the streaming interval. As Spark streams data in micro-batches, we need to set an interval so that, every set time, be it seconds or milliseconds, it streams the data. Here, we have set 5 seconds, so every 5 seconds it streams the data from Twitter and saves it in a new file.
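The micro-batch model can be sketched without Spark: the hypothetical generator below stands in for the live Twitter feed, and an item count stands in for the 5-second wall-clock interval:

```python
# Sketch of Spark Streaming's micro-batch model: a continuous stream is
# cut into fixed-interval batches and each batch is processed as a unit.
# Names and tweet contents are illustrative assumptions.
import itertools

def tweet_stream():
    """A stand-in for the live Twitter feed."""
    for i in itertools.count():
        yield f"tweet {i}"

def micro_batches(stream, batch_size, n_batches):
    """Group the stream into micro-batches of `batch_size` items."""
    for _ in range(n_batches):
        yield [next(stream) for _ in range(batch_size)]

for batch in micro_batches(tweet_stream(), batch_size=3, n_batches=2):
    print(batch)  # each batch would be processed (and saved) as one unit
# ['tweet 0', 'tweet 1', 'tweet 2']
# ['tweet 3', 'tweet 4', 'tweet 5']
```

Spark Streaming does the same slicing by wall-clock time, turning each interval's data into an RDD that the rest of the program transforms and saves.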

After that, an RDD transformation is needed; in our case we use sortBy and the map function.

After that we will try to save our output at ~/twitter/directory

Executing the Scala program

After executing the program we can see that the program is running like so:

Figure 37 First overview of the Twitter program running within ECLIPSE IDE

Figure 38 second overview of the Twitter program running within ECLIPSE IDE

To take a closer look at the program result:

Figure 39 An overview of the result console

To conclude, Spark Streaming is used to collect tweets as the dataset. The tweets are written out in JSON format, one tweet per line. A file of tweets is written every time interval until at least the desired number of tweets is collected.

IV. Technical choice of Data Analysis

As we mentioned earlier, Apache Spark is a framework for distributed computing and data analysis; we have chosen to concentrate on data clustering and Twitter sentiment analysis as our data analysis methods.

For these reasons we will use MLlib, which includes the popular K-means algorithm, for clustering, and Spark Streaming for the Twitter sentiment analysis.

Clustering

Clustering, or segmentation, is an integral component of the territory planning process. When managers are drawing the lines, market segments are just as important as geographical areas. Without segmentation, sales reps naturally gravitate towards one of two extremes: they tend to chase either the largest opportunities with the fewest accounts or the smallest opportunities with the quickest turn-around time.

Client segmentation procedures include:

Deciding what data will be collected and how it will be gathered

Collecting data and integrating data from various sources

Developing methods of data analysis for segmentation

Establishing effective communication among relevant business units (such as marketing and customer service) about the segmentation

Implementing applications to effectively deal with the data and respond to the information it provides

For that we decided to use the MLlib library and K-means.

MLlib

MLlib is known as Spark’s machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. At a high level, it provides tools such as:

ML Algorithms: common learning algorithms such as classification, regression, clustering, and collaborative filtering

Featurization: feature extraction, transformation, dimensionality reduction, and selection

Pipelines: tools for constructing, evaluating, and tuning ML Pipelines

Persistence: saving and loading algorithms, models, and pipelines

Utilities: linear algebra, statistics, data handling, etc.

K-means

MLlib includes the popular K-means algorithm for clustering, as well as a variant called K-means|| that provides better initialization in parallel environments. K-means|| is similar to the K-means++ initialization procedure often used in single-node settings.

The most important parameter in K-means is the target number of clusters to generate, K. In practice, we rarely know the “true” number of clusters in advance, so the best practice is to try several values of K until the average inter-cluster distance stops decreasing dramatically. However, the algorithm only takes one K at a time. Apart from K, K-means in MLlib takes the following parameters:

• InitializationMode: the method to initialize cluster centers, which can be either “k-means||” or “random”; k-means|| (the default) generally leads to better results but is slightly more expensive.

• MaxIterations: maximum number of iterations to run (default: 100).

• Runs: number of concurrent runs of the algorithm to execute. MLlib’s K-means supports running from multiple starting positions concurrently and picking the best result, which is a good way to get a better overall model (as K-means runs can stop in local minima).

Concept

Generally, clustering is defined as grouping objects in sets, such that objects within a cluster are as similar as possible, whereas objects from different clusters are as dissimilar as possible. A good clustering will generate clusters with a high intra-class similarity and a low inter-class similarity.

Hence, the basic idea behind K-means clustering consists of defining clusters so that the total intra-cluster variation (known as total within-cluster variation) is minimized.

The equation to be solved can be defined as follows:

Equation 1 K-means concept

$$\min \sum_{k=1}^{K} W(C_k)$$

where $C_k$ is the $k$-th cluster and $W(C_k)$ is the within-cluster variation of that cluster:

Equation 2 Within-cluster variation of the k-th cluster

$$W(C_k) = \sum_{x_i \in C_k} (x_i - \mu_k)^2$$

where $x_i$ designates a data point belonging to the cluster $C_k$, and $\mu_k$ is the mean value of the points assigned to the cluster $C_k$.

Algorithm

Now that we know what K-means is, we can move on to its application:

We should first parse and load the data from the CSV file that we created on Databricks.

We then use the KMeans object to cluster the data into the three clusters commonly known within Jumia:

Gold Client

Silver Client

Bronze Client

That is, since we want to cluster our clients into three categories depending on several criteria (client ID, sales amount, opportunity, etc.), the number of desired clusters is passed to the algorithm.
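To make the clustering step concrete, here is a minimal pure-Python K-means sketch (not the MLlib call itself), grouping made-up (sales amount, order count) points into k = 3 clusters:

```python
# Minimal K-means sketch on toy customer points (sales amount, orders),
# mirroring a Gold/Silver/Bronze segmentation. Data and initial centers
# are assumptions for illustration, not real Jumia figures.

def kmeans(points, centers, iterations=10):
    for _ in range(iterations):
        # assignment step: each point joins its nearest center
        clusters = [[] for _ in centers]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[d.index(min(d))].append(p)
        # update step: each center moves to the mean of its cluster
        centers = [tuple(sum(x) / len(cl) for x in zip(*cl)) if cl else c
                   for cl, c in zip(clusters, centers)]
    return centers, clusters

points = [(100, 1), (120, 2), (1500, 10), (1600, 12), (7000, 40), (7200, 45)]
centers, clusters = kmeans(points, centers=[(0, 0), (2000, 0), (8000, 0)])
for c, cl in zip(centers, clusters):
    print(c, cl)
```

MLlib's KMeans performs the same assignment/update loop, distributed over the cluster and with the k-means|| initialization discussed above.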

Next we present the Scala script code we used to cluster the clients; we need a visualization engine in order to be able to see the result.

K-means clustering in R

What is R programming language?

R is an open source programming language and software environment for statistical computing and graphics that is supported by the R Foundation for Statistical Computing.

R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering…) and graphical techniques, and is highly extensible. The S language is often the vehicle of choice for research in statistical methodology, and R provides an Open Source route to participation in that activity.

R function for K-means clustering

K-means clustering must be performed only on data in which all variables are continuous, as the algorithm uses variable means.

The standard R function for k-means clustering is kmeans() [in the stats package]. A simplified format is kmeans(x, centers, iter.max = 10, nstart = 1), with:

x: numeric matrix, numeric data frame or a numeric vector

centers: Possible values are the number of clusters (k) or a set of initial (distinct) cluster centers. If a number, a random set of (distinct) rows in x is chosen as the initial centers.

iter.max: The maximum number of iterations allowed. Default value is 10.

nstart: The number of random starting partitions when centers is a number. Trying nstart > 1 is often recommended.

Data Format

Preload the data

The R code below prepares the two-dimensional data that will be used for performing k-means clustering:

We’ll use the Databricks dataset we’ve created earlier in which contains statistics, about more than 100,000 costumer for in each of the 20 African countries and 6 Asian ones.

We were be able now to load the data and choose to remove any missing values with the na.omit function

It contains:

As we don’t want the k-means algorithm to depend to an arbitrary variable unit, we start by scaling the data using the R function scale () as follow:

Determine the number of optimal clusters in the data

Even though Jumia decided to cluster its customers into 3 segments, we could not ignore the possibilities that k-means offers for determining the optimal number of clusters in the data. The idea is to run the clustering algorithm with different values of k. Next, the WSS (within sum of squares) is plotted against the number of clusters. The location of a bend (knee) in the plot is generally considered an indicator of the appropriate number of clusters.

We’ll use the function fviz_nbclust() [in factoextra package] which format is:

With :

x: numeric matrix or data frame

FUNcluster: a partitioning function such as kmeans.

method: the method to be used for determining the optimal number of clusters.

The R code below computes the elbow method for kmeans ():

Using the factoextra library, which implements the elbow method, we get the following results:

Figure 40 The number of optimal clusters in the dataset graph

We can clearly see that the plot draws an elbow at k = 4 as the optimal number of clusters, but as mentioned above, the nature of the project requires setting k to 3.
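The elbow heuristic behind fviz_nbclust() can be sketched directly: compute the WSS for a range of k values and look for the bend. The data below is synthetic (three well-separated blobs), purely to make the elbow visible:

```python
# Elbow method sketch: WSS for k = 1..6 on synthetic data
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated blobs -> the elbow should appear around k = 3
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in (0, 5, 10)])

wss = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wss.append(km.inertia_)  # inertia_ is the within-cluster sum of squares

# WSS drops sharply up to the true k, then flattens: the "elbow"
print([round(w, 1) for w in wss])
```

Plotting wss against k (e.g. with matplotlib) reproduces the kind of curve shown in Figure 40.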

Compute K-means Clustering

The R code below performs k-means clustering with k = 3.

It's possible to plot the data, coloring each data point according to its cluster assignment. The cluster centers are marked with large stars:

Figure 41 K-means clustering in R results

In this figure we can clearly see how R's kmeans() function has plotted the clusters, assigning each customer based on the opportunities they created and the quantity of products they bought, per SKU_ID and CustID.

It’s possible to compute the mean of each of the variables in the clusters:

Emotions Analysis

As we’ve already mentioned and due to the project nature we’ve chosen Twitter as our social media data source, after extracting the data from twitter in the previous section we can now pars the data into a Sentiment Analysis process which is known as the process of computationally identifying and categorizing opinions expressed in a piece of text, especially in order to determine whether the writer's attitude towards a particular topic, product, etc., is positive, negative, or neutral. In essence, it is the process of determining the emotional tone behind a series of words, used to gain an understanding of the attitudes, opinions and emotions expressed within an online mention.

Sentiment analysis is extremely useful in social media monitoring, as it allows us to gain an overview of the wider public opinion on certain topics; in our case we chose to investigate Jumia's customers' sentiments about Jumia services in general.

Now, after extracting the amount of data needed, we want to add functionality to get an overall opinion of what people think about Jumia.

Figure 42 Result Console –Scala API resulting

We can see that some of the files have been created in our specified path.

Figure 43 Output files and directory

On opening those files, we can see that parts of them contain Twitter usernames, timestamps, and tweets tagged with "NEGATIVE", "POSITIVE" or "NEUTRAL" sentiment.

We refer to the below screen shots for this.

Figure 44 Output file containing Twitter username and timestamp

Figure 45 Output file containing Twitter extracted data and sentiment analysis

Note that neutrality can mean two different things. Type 1: a review contains as many positive as negative words. Type 2: there is (almost) no sentiment expressed.

In brief, sentiment analysis is the process of analyzing the opinions expressed in a piece of text about a person, a thing or a topic. It derives whether the writer holds a positive, negative or neutral opinion about that topic.
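A minimal lexicon-based polarity classifier illustrates the idea; the word lists below are tiny assumptions for the example, not the lexicon used in the actual pipeline:

```python
# Toy lexicon-based sentiment scoring: count positive vs negative words
POSITIVE = {"good", "great", "love", "fast", "excellent", "happy"}
NEGATIVE = {"bad", "slow", "hate", "late", "poor", "broken"}

def sentiment(tweet: str) -> str:
    words = tweet.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "POSITIVE"
    if score < 0:
        return "NEGATIVE"
    # score == 0 covers both neutrality types described above:
    # balanced sentiment words, or (almost) no sentiment words at all
    return "NEUTRAL"

print(sentiment("jumia delivery was fast and the price is great"))  # POSITIVE
print(sentiment("my order arrived late and broken"))                # NEGATIVE
print(sentiment("I ordered a phone yesterday"))                     # NEUTRAL
```

Real pipelines use far larger lexicons or trained models, plus handling for negation and emojis, but the three-way POSITIVE/NEGATIVE/NEUTRAL output is the same.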

Data Visualization within Salesforce:

From data extraction to data analysis, we need to visualize the results within our "Big CRM" platform, which is Salesforce. For that we need a connector between Spark and Salesforce, making it easy to run our algorithms in Spark and push the results to Salesforce Wave.

Because of the project's nature and deadlines, we are not able to show the results of the data visualization; we will only highlight the connector, its uses, and what we would want to see in our platform, given that this phase is the last one and the project aims to deliver by the end of August 2017.

For that we first need to create an app within Salesforce Analytics Cloud in order to push the algorithms and their results into our Salesforce platform, and then connect Spark with Salesforce.

Salesforce Analytics cloud

Because of the project roadmap, we cannot show the visualization part yet. The whole principle was to be able to load the algorithms and the datasets we created and extracted using Spark into our visualization platform, Salesforce Wave, in other words Analytics Cloud.

As we’ve already mentioned (Chapter 4) , Analytics cloud will give us the opportunity and the ability to explore all the extracted data and give a sense to all algorithms we’ve created or we would want to create in real time and as we would want it easily which means that with SalesForce Wave any one would be able to see advanced reports , graphs and schemes choosing the variables and the type of the reports they prefer and be able to have the answers they need without having to refer to an engineer or so .

Figure 46 How Salesforce analytics cloud works

Highlights of the connector

Using the connector, we can easily reproduce the previous analyses we carried out; we can also use it to explore predictive analytics (e.g. lead scoring, customer segmentation, propensity to buy), and see those results clearly within one single platform.

This connector offers a large variety of features, such as:

A packaged library that can be deployed in Spark clusters.

Java, Scala and Python libraries to push the output of a DataFrame to Salesforce Wave.

Reading files from AWS S3 and pushing them to Salesforce Wave.

Pushing the output of MLlib into Salesforce Wave; the same applies to SparkR, whose R results can be pushed to Wave easily.

Use of the Spark API, with concurrent execution to push large files; batching and data sequencing are built in.

Easy import of the library in Databricks cloud.

Automated metadata JSON generation to easily push datasets.

After connecting Spark to Salesforce Wave through the connector, we are able to see the dataset we created:

Figure 47 Datasets overview within Salesforce

Figure 48 Resulting Dataset overview within Salesforce

We can then explore the algorithms and reports we created in order to take the most advantage of them, solve the main problems Jumia suffers from, and gain insights towards a better client experience.

Results & Conclusion:

As we’ve already discus, the main reason behind this project is that Jumia suffers from two major pains:

Separated interaction channels with no 360° view of the client within one brand

No data sharing across brands with Limited cross sell

So, we’ve judged that the application of big Data and the integration of all data sources within one interface would be very interesting and very useful to solve such problems.

So, by the end of the project we will be able to have a one platform that gathers all the data sources used by Jumia ‘agents in order to connect with customers everywhere, to have a 360° view of the client, be able to have an idea about the client segmentation and last but not least to be able to see real time and streaming analytics using Spark connector with Salesforce.

We refer to the screenshots below for the results obtained so far.

Figure 49 an overall screenshot of the Big CRM platform // SalesForce

Figure 50 An overall screenshot of the 360° client view feature

An up-close screenshot of the client segmentation or clustering results

Figure 51 An overall screenshot of the client segmentation icon we would want to achieve

In this screen we only see the VIP icon placed next to the client's most important information; in its place we would want the segmentation and classification whose algorithm we have already created in R, pushed to Salesforce via the R connector, namely: Gold, Silver and Bronze clients.

Unfortunately, the results we would want to see will not be available until the end of August 2017. With this we come to the end of the sixth chapter and of this case study; nonetheless, we would be more than grateful to continue the study in the future, given its importance and impact on the e-commerce fabric.

Conclusion

True data-driven insight calls for domain expertise. For the CRM, this means in-depth knowledge of how clients interact and feel, what data to pull from the existing information systems, and an understanding of how to connect and integrate data from multiple sources end-to-end to yield an enriched set of information sources. This is what ultimately enables the creation of a range of services and client-centric applications. Smarter customer relationship management, customer experience management, data brokering and marketing are just some examples of what is possible.

A common, horizontal big data analytics platform is necessary to support a variety of analytics applications. Such a platform analyzes incoming data in real time, makes correlations, produces insights and exposes those insights to various applications. This approach both enhances the performance of each application and leverages the big data investments across multiple applications. Storing and processing huge amounts of information is no longer the issue. The challenge now is to know what needs to be done within the big data analytics platform to create specific value. While big data storage and processing techniques are necessary enablers, the goal must be the creation of the right use cases. The big data tools and technologies deployed have to support the process of finding insights that are adequate, accurate and actionable.

Hosted by Jumia, our project was the fruit of integrating all data sources in one interface. The main objective was to make that interface not only a data store but also a working environment: exploring the intelligence of Salesforce; designing a Big Data prototype using Apache Spark as our Big Data engine of choice and connecting it to Salesforce in both directions, as a source of data and as a visualization tool; and establishing a basic polarity analysis in R, also connected to Salesforce for aggregation and visualization of the data and algorithms we created, and for more detailed statistical analysis using the R language.

The work carried out represents a window onto a world of future reflections, especially since the project has not yet reached its final stage and is the first of its kind.

Finally, this project gave us a valuable opportunity to learn about Salesforce as one of the most intelligent and powerful CRMs currently on the market, and it also enabled us to learn more about Big Data technologies.

Bibliography & Webography

Abbasi, M. (2017). Learning Apache Spark 2. Packt Publishing.

Catlett, C. and Ghani, R. (2015). Big Data for Social Good. Big Data, 3(1), pp.1-2.

Davenport, H. (2013). Controlling, 25(6), pp.311-311.

Gong, A. (2013). Comment on “Data Science and its Relationship to Big Data and Data-Driven Decision Making”. Big Data, 1(4), pp.194-194.

Gupta, S. (n.d.). Learning real-time processing with Spark Streaming.

Harrigan, P. and Miles, M. (2014). From e-CRM to s-CRM. Critical factors underpinning the social CRM activities of SMEs. Small Enterprise Research, 21(1), pp.99-116.

Hurwitz, J., Nugent, A., Halper, F. and Kaufman, M. (2015). Big data for dummies. Hoboken, N.J.: For Dummies.

Karau, H., Konwinski, A., Wendell, P. and Zaharia, M. (2015.). Learning spark.

Kim, Y., Yang, S., Lee, S. and Park, S. (2014). Design and Implementation of Mobile CRM Utilizing Big Data Analysis Techniques. The Journal of the Institute of Webcasting, Internet and Telecommunication, 14(6), pp.289-294.

Kossecki, P. (n.d.). Building Trust in eCommerce – Quantitative Analysis. SSRN Electronic Journal.

Leonard, P. (2013). Customer data analytics: privacy settings for 'Big Data' business. International Data Privacy Law, 4(1), pp.53-68.

Liu, A. (2016). Apache Spark Machine Learning Blueprints. Packt Publishing.

Liu, C. (2015). A Conceptual Framework of Analytical CRM in Big Data Age. International Journal of Advanced Computer Science and Applications, 6(6).

Ohlhorst, F. (2013). Big data analytics. Hoboken, N.J.: Wiley.

Roehlkepartain, J. (2012). Spark student motivation. Minneapolis, MN: Search Institute.

Pang, B. and Lee, L. (2008). Opinion Mining and Sentiment Analysis. Foundations and Trends in Information Retrieval, 2(1-2), pp.1-135.

Tufekci, Z. (2014). Big Questions for Social Media Big Data: Representativeness, Validity and Other Methodological Pitfalls.

CRM Systems | Expert Market. (2016). Amazon CRM Case Study. [online] Available at: http://crmsystems.expertmarket.co.uk/Amazon-CRM-Case-Study?question_page=5 [Accessed 19 Apr. 2017].

COMMERCE, R. (2015). [Tribune] Le CRM et le Big Data fidélisent vos clients et augmentent vos ventes : – Capitaine Commerce. [online] Capitaine Commerce. Available at: http://www.capitaine-commerce.com/2015/02/24/44092-tribune-le-crm-et-le-big-data-fidelisent-vos-clients-et-augmentent-vos-ventes/ [Accessed 12 Mar. 2017].

Gantz, J. and Reinsel, D. (2012). THE DIGITAL UNIVERSE IN 2020: Big Data, Bigger Digital Shadow s, and Biggest Grow th in the Far East. [online] https://www.emc.com/collateral/analyst-reports/idc-the-digital-universe-in-2020.pdf. Available at: https://www.emc.com/collateral/analyst-reports/idc-the-digital-universe-in-2020.pdf [Accessed 13 Mar. 2017].

Davenport, T. and Dyché, J. (2013). Big Data in Big Companies. [online] Available at: http://docs.media.bitpipe.com/io_10x/io_102267/item_725049/Big-Data-in-Big-Companies.pdf [Accessed 28 Feb. 2017]

Turner, D., Schroeck, M. and Shockley, R. (2013). Analytics: The real-world use of big data in financial services. [online] Available at: https://www-935.ibm.com/services/multimedia/Analytics_The_real_world_use_of_big_data_in_Financial_services_Mai_2013.pdf [Accessed 12 Mar. 2017].

Official websites:

Official Apache Spark website : http://spark.apache.org

Official Apache Hadoop website : https://hadoop.apache.org

Apache Maven project official website : https://maven.apache.org

Salesforce official website : https://www.salesforce.com/

Zendesk Official website : https://www.zendesk.com/

Xcally official website : https://www.xcally.com/en/

Data Bricks : https://databricks.gitbooks.io/

Blogs :

BigData Analysis : http://bigdatanalysis.blogspot.com

Business Decision : http://blog.businessdecision.com

DigitalOcean : https://www.digitalocean.com/

Idatassist : http://idatassist.com

Marc Bonzanini : http://marcobonzanini.com

R Datamining : http://www.rdatamining.com

Others:

Edureka  Apache Spark Certification Training : https://www.edureka.co/
