
DL/ML Methodology


Identify Problem
Gather Data
Identify Solution
Curate Data
Build Model
Train Model
Test Model
Repeat

Train/Test Split (ML)
Entire Dataset
Training (70%)
Train Model
Testing (30%)
Test Model
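
A minimal sketch of the 70/30 split above, assuming scikit-learn (the notes don't name the split tooling); the data here is a random placeholder, not the project data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder feature matrix and Fraud/Normal labels (hypothetical data).
X = np.random.rand(1000, 10)
y = np.random.randint(0, 2, 1000)

# 70/30 train/test split, stratified so both splits keep the class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y)

print(X_train.shape, X_test.shape)  # (700, 10) (300, 10)
```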

Train + Validate/Test Split (DL)
Entire Dataset
Training (70%)
Training (90%)
Validate (10%)
Train/Val Model
Testing (30%)
Test Model
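
Continuing the sketch above for the DL case: the 70% training portion is split again, 90/10, into train and validation sets (still assuming scikit-learn; in Keras the same effect comes from validation_split=0.1 in model.fit).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data (hypothetical).
X = np.random.rand(1000, 10)
y = np.random.randint(0, 2, 1000)

# Outer split: 70% train / 30% test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y)

# Inner split: hold out 10% of the training portion for validation.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.10, random_state=42, stratify=y_train)

print(len(X_tr), len(X_val), len(X_test))  # 630 70 300
```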

A Note on Data & Data Curation
ML/DL Models work off of numeric data (height, weight, cost, value, …)
Sometimes data is categorical (red, blue, yellow)
Can’t just convert to an index map (1=red, 2=blue, 3=yellow, …)
1, 2, and 3 are still just categories and are not comparable to each other (blue is not < yellow, and yellow is not > red)
One-Hot Encode (see the sketch after this section):
Is red - Is blue - Is yellow
0 - 1 - 0
0 - 0 - 1
1 - 0 - 0
Feature data points must relate to each other in some numeric way
Minimize DB queries (ideally query only once)
Work off of static data (flat files)
Train/Test Split

Query -> Raw Data -> Adjust -> Save -> Data Sets
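
A minimal one-hot encoding sketch for the color example above, assuming pandas (the notes don't name a library); get_dummies creates one 0/1 column per distinct category, so no false ordering is introduced.

```python
import pandas as pd

# Toy categorical column matching the red/blue/yellow example.
df = pd.DataFrame({"color": ["blue", "yellow", "red"]})

# One new 0/1 column per distinct value.
one_hot = pd.get_dummies(df["color"], prefix="is").astype(int)
print(one_hot)
#    is_blue  is_red  is_yellow
# 0        1       0          0
# 1        0       0          1
# 2        0       1          0
```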

Project Overview
Deep Learning Telephony Discovery
Classification Problem (Fraud/Normal)
Sometimes fraud is initiated in the Interactive Voice Response (IVR) channel
Project goal is to develop a predictive model to identify potential fraud coming from IVR
5 different datasets – 6 different models:
Infomart IVR Session aggregate – Dense Neural Network, Autoencoder
IVR Event Sequence (focus of today’s presentation) – Recurrent Neural Network
Call Center (CC) agent transactions – Recurrent Neural Network
Speech Analytics – Dense Neural Network
CC Audio Dialog (used by fraud specialists today) – Convolutional Neural Network

IVR Sequence Analysis - Process
Balanced Dataset
50/50 Normal/Fraud
Data Curation
Word to Index map
Reduce the vocabulary: drop any token with fewer than 100 occurrences
Convert input sequence to event/result sequence
Padding (Left)
Label binarization (Fraud/Normal = 0/1)
Train/Test Split
Model Construction (Keras; see the sketch after this list)
Model Training
TensorBoard Callback
Auto save Callback
History Callback
Model Validation & Testing
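
A minimal Keras sketch of the curation and model steps above, not the production pipeline: the event/result tokens, layer sizes, LSTM choice, and file paths are all assumptions made for illustration.

```python
import numpy as np
from tensorflow import keras

# --- Data curation: toy stand-in for the IVR event/result sequences ---
sequences = [["menu_enter:ok", "balance:ok", "transfer:fail"],
             ["menu_enter:ok", "agent:ok"]]
labels = [1, 0]  # Fraud/Normal binarized to 0/1

# Word-to-index map; index 0 is reserved for padding.
vocab = {tok: i + 1 for i, tok in enumerate(
    sorted({t for seq in sequences for t in seq}))}
encoded = [[vocab[t] for t in seq] for seq in sequences]

# Left padding up to the 255-operation max sequence size from the notes.
X = keras.preprocessing.sequence.pad_sequences(
    encoded, maxlen=255, padding="pre")
y = np.array(labels)

# --- Model construction (layer sizes are assumptions) ---
model = keras.Sequential([
    keras.layers.Embedding(input_dim=len(vocab) + 1, output_dim=32),
    keras.layers.LSTM(64),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

# --- Training with the callbacks listed above (paths are hypothetical) ---
callbacks = [
    keras.callbacks.TensorBoard(log_dir="./logs"),                       # TensorBoard callback
    keras.callbacks.ModelCheckpoint("ivr_rnn.h5", save_best_only=True),  # auto-save callback
]
history = model.fit(X, y, validation_split=0.1, epochs=2,
                    callbacks=callbacks)  # History is returned by fit itself
```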

Security Engineering Overview
Fraud Prevention & Authentication within DCT
Tasked with detecting and preventing Account Validation traffic against the online (wellsfargo.com) authentication system
Application of DL/ML detection agents to block unwanted traffic before the login requests are authenticated
Disguise detection by presenting requestor with standard Login Failed page

Current Findings
Achieved 94% accuracy on predicting Fraud or Normal
7.9MM records in balanced data set
Produced 181,381 sequences
Max sequence size is 255 operations (event + result)
Training model on CPU takes about 17,000 seconds (~5 hours/epoch)
Training model on 1 GPU takes about 1,200 seconds (~20 minutes/epoch)
Training model on 4 GPUs takes about 170 seconds (~3 minutes/epoch)

DL – IVR Sequence
Mode : Deep Learning
Tools : Google TensorFlow (Keras)
Hardware : GPU
Language : Python 3
Model : RNN
Train Accuracy : 94%
Val Accuracy : 94%
Test Accuracy : 94%
Total Records : 181,381
Status : Preliminary Analysis

Confusion Matrix (All Data)
AV: 87,981 (green, correctly classified); 3,004 (red, misclassified)
Normal: 7,245 (red, misclassified); 83,151 (green, correctly classified)
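
The two green cells above (87,981 + 83,151 = 171,132 of 181,381) line up with the reported ~94% accuracy. A quick sketch of producing such a matrix from predictions, assuming scikit-learn (not named in the notes) and random placeholder labels:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Placeholder true labels and thresholded model predictions (hypothetical).
y_true = np.random.randint(0, 2, 1000)
y_pred = np.random.randint(0, 2, 1000)

print(confusion_matrix(y_true, y_pred))  # rows = actual class, columns = predicted class
print(accuracy_score(y_true, y_pred))    # correct (diagonal) cells / total
```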

TensorBoard

Initial Findings
IVR data in the Infomart format was provided as the primary data set for March, April, and May 2017
The Infomart data was augmented to contain a label (FRAUD/NORMAL)
A balanced data set of 176K IVR sessions was created (123k more than Preliminary data set)
Deep Learning Neural Networks proved to be quite useful in creating predictive models
An 87% accuracy score was achieved using a comparatively small data set (by Deep Learning standards)
These results are far better than random and demonstrate the ability to classify IVR sessions. It is our recommendation to move forward with more extensive analysis using Deep Learning

Analysis Details – Data Curation
In order to prepare the data for analysis, several operations were performed on the data sets:
Combine fraud labels with Infomart data points
Balance the data set to contain an equal distribution of FRAUD and NORMAL data (see the balancing sketch after this section)
Reduce the data features to the following:
LAST_INTERACTION_RESOURCE
pbnk_authstatus
ecda_authtype
ecda_AcctType
ecda_hvc
ecda_nlu
As well as the following identification fields:
CONNID
One Hot encode features (increased feature set from 257 to 377 columns)
Take a column and create a new column for each distinct value and mark with 1 or 0 for each record
Final data stats
Sample data
NORMAL: 2,066,492
FRAUD: 88,145
Balanced data set: 176,290 data points (50/50)
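
A sketch of the balancing step above: the NORMAL majority is randomly down-sampled to the FRAUD count so the result is 50/50. Down-sampling is an assumption here (consistent with 2 × 88,145 = 176,290, but the notes don't state the method), and the frame is toy data.

```python
import numpy as np
import pandas as pd

# Toy labeled frame standing in for the Infomart data.
df = pd.DataFrame({
    "label": ["NORMAL"] * 900 + ["FRAUD"] * 100,
    "feature": np.random.rand(1000),
})

fraud = df[df["label"] == "FRAUD"]
normal = df[df["label"] == "NORMAL"].sample(n=len(fraud), random_state=42)  # down-sample majority

balanced = pd.concat([fraud, normal]).sample(frac=1, random_state=42)       # shuffle the 50/50 set
print(balanced["label"].value_counts())  # FRAUD 100, NORMAL 100
```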

Analysis Details – Deep Learning
TensorFlow 1.2
Keras contrib models
Hyper Parameters (see the sketch after this section):
Learning Rate: 0.001
Epochs: 10
Batch Size: 64
Train/Test 70/30
Train/Validation 90/10
Training:
Loss: Binary Cross Entropy
Optimizer: Adam
Final Training Results:
Test split (30% of the 176,290-record balanced set): 52,887 records
Testing Set: 538,322 records
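
A minimal Keras sketch wiring together the hyperparameters above (Adam at 0.001, binary cross entropy, 10 epochs, batch size 64, 70/30 test split, 10% validation split) on a 377-column one-hot input; the layer sizes and the random data are placeholders, not the actual Infomart model.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow import keras

# Placeholder one-hot features (377 columns per the notes) and binary labels.
X = np.random.randint(0, 2, size=(2000, 377)).astype("float32")
y = np.random.randint(0, 2, size=(2000,))

# 70/30 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y)

# Dense network; hidden layer sizes are assumptions.
model = keras.Sequential([
    keras.layers.Dense(128, activation="relu", input_shape=(377,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(
    loss="binary_crossentropy",                             # binary cross entropy
    optimizer=keras.optimizers.Adam(learning_rate=0.001),   # Adam, learning rate 0.001
    metrics=["accuracy"],
)

# 10 epochs, batch size 64, 10% of the training data held out for validation.
model.fit(X_train, y_train, epochs=10, batch_size=64, validation_split=0.10)
print(model.evaluate(X_test, y_test))
```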


last updated september 2019