
IP Classification Machine Learning Exercise


Overview

Problem: We want to understand which network environment external Login Application traffic originates from, and label incoming traffic with an environment tag.

Solution: Use Machine Learning (ML) techniques to find patterns in historical Login App data that let us categorize and label new incoming requests.

Results: Created four categories to represent the network environment of incoming requests: Personal Wireless Network (Desktop), Personal Wireless Network (Mobile), Mobile Carrier Proxy, and Standard Proxy.

Benefits and Recommendations

Benefits:
To Company: By understanding the environment that a request originates from, we can make more informed decisions in the Login App pre-authentication process
To Team: Knowledge sharing

Recommendations:
For Project: Perform secondary comprehensive investigation of traffic to allow for better categorization
For Company: Invest time and resources into educating and enabling our development community in Machine Learning techniques and tools

Deep dive
Explore process in detail step by step

Our Problem
Business Goal
Expose East Coast ET to basic ML concepts through peer collaboration and knowledge sharing.
Develop a component that would allow WF to more quickly identify types of inbound web traffic to the Login App.
This component would attempt to categorize inbound traffic, supplying additional insight that can be used to make more informed security decisions.

Technical Goal
Gain familiarity with commonly used tools and techniques that facilitate analytics and ML (e.g., Python-based tools, types of models)
Analyze collected data to uncover useful information from patterns in IP addresses
Attempt to classify types of logins into specific groups.

Our Process
Data Collection
Data Curation
Clustering
Classification

Data Collection
Splunk
Captures, indexes, and correlates login data
Easy-to-interpret charts and graphs
Helps detect fraudulent activity
Over 800k individual login events
The first step in our project was to collect user data, and for that we used Splunk. Splunk captures, indexes, and correlates user login data in real time. The collected data is placed into a repository where queries can be run against it, and easy-to-read graphs, charts, and reports can be generated. These visual aids helped us identify data patterns and collect metrics we could leverage to identify fraudulent web login activity. Using Splunk we gathered over 800k individual login events, which we then analyzed further in our data curation process.

Data Curation
Why curate the data?
Algorithms don’t understand words very well.
Give the data mathematical representation.
Need to distinguish different types of IP traffic.
How do we curate data?
Divide & conquer.
Create “buckets” of data.
Focus on the important features.
Before curation: 133 columns – After curation: 17 columns
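The curation step above (cutting 133 columns down to the important features and giving them a mathematical representation) can be sketched roughly as follows. The feature names and values here are illustrative assumptions, not the project's actual schema:

```python
# Hypothetical sketch of curation: keep only the important features and
# encode every value as a number so the algorithms can consume it.
# These field names are illustrative, not the real 133-column schema.
IMPORTANT_FEATURES = ["unique_users", "mobile_ratio", "request_count"]

def curate(event: dict) -> list:
    """Reduce a raw login event to a numeric feature vector."""
    row = []
    for name in IMPORTANT_FEATURES:
        value = event.get(name, 0)
        if isinstance(value, bool):
            value = int(value)  # booleans become 0/1
        row.append(float(value))
    return row

raw_event = {"unique_users": 3, "mobile_ratio": 0.5, "request_count": 42,
             "user_agent": "Mozilla/5.0", "country": "US"}  # many fields dropped
print(curate(raw_event))  # [3.0, 0.5, 42.0]
```

Everything not on the keep-list (user agents, free-text fields, etc.) is discarded, which is the "focus on the important features" bullet in practice.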

Sidebar: Unsupervised vs Supervised Learning
Machine learning algorithms can be split into two groups based on how they learn from data to make predictions: supervised and unsupervised learning. In supervised learning, all possible outcomes are known, and the data must be labeled with the correct outputs. In unsupervised learning, the outcomes are unknown, so the data is unlabeled. To build a classifier, you must train a machine learning model with labeled data.

Clustering
K-Means
Assign points to groups based on distance from learned centers
We used K-means clustering to group similar data based on distance from the learned centers. In this chart, each data point is plotted individually, and the colors represent the cluster class.
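The assign-then-update loop that K-means runs can be sketched in plain Python. This is a toy 2-D version for illustration (the project's real data had ~17 features), not the actual implementation used:

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=100, seed=0):
    """Minimal K-means: assign each point to its nearest center,
    then move each center to the mean of its assigned points."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = []
    for _ in range(iterations):
        # Assignment step: index of the nearest learned center per point.
        labels = [min(range(k), key=lambda c: dist2(p, centers[c]))
                  for p in points]
        # Update step: each center becomes the mean of its cluster.
        for c in range(k):
            members = [p for p, lbl in zip(points, labels) if lbl == c]
            if members:
                centers[c] = tuple(sum(dim) / len(members)
                                   for dim in zip(*members))
    return centers, labels

# Toy data: low-traffic IPs vs. high-traffic IPs form two obvious groups.
points = [(1, 1), (1, 2), (2, 1), (10, 10), (10, 11), (11, 10)]
centers, labels = kmeans(points, k=2)
```

After convergence the three low-traffic points share one label and the three high-traffic points share the other, which is exactly the "assign points to groups based on distance from learned centers" idea above.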

Minimization and Labeling
Started with 7 clusters, ended with 5 clusters.
How and why did we choose these labels?
On our first run we started with 7 clusters, but we found that some of them had similar values, so we cut the number of clusters from 7 to 5 to minimize redundancy. As you can see, two of them here have the same name: the third and fifth, both called "Desktop Single Instance". They can be treated as the same cluster because their data is identical: on average, one unique desktop user per IP address. We chose these labels by looking at the average count of each feature in a cluster to help describe it. The first cluster is called Mobile Single Instance because it has on average one unique mobile user per IP address. The second cluster has the most users on average per IP address, and we labeled it Mobile Carrier Proxy: mobile service providers use proxies based on geographic location, where all customers in an area share one IP address, which you can actually see in the graph. On average this cluster had 132 unique users per IP address, overwhelmingly mobile users. The Standard Proxy cluster has on average 6 mobile users per IP.
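The labeling logic described above (name each cluster after the average feature counts of its members) might look something like this sketch. The thresholds and feature layout are illustrative assumptions drawn from the averages quoted in the text, not the project's actual rules:

```python
# Hedged sketch: label a cluster by the average count of each feature.
# Thresholds are illustrative, loosely based on the averages in the notes
# (~132 users/IP for carrier proxies, ~6 for standard proxies, ~1 otherwise).
def describe_cluster(rows):
    """rows: list of (unique_mobile_users, unique_desktop_users) per IP."""
    avg_mobile = sum(r[0] for r in rows) / len(rows)
    avg_desktop = sum(r[1] for r in rows) / len(rows)
    total = avg_mobile + avg_desktop
    if total > 50:
        return "Mobile Carrier Proxy"    # ~132 users per IP, mostly mobile
    if total > 2:
        return "Standard Proxy"          # ~6 users per IP
    if avg_mobile >= avg_desktop:
        return "Mobile Single Instance"  # ~1 unique mobile user per IP
    return "Desktop Single Instance"     # ~1 unique desktop user per IP

print(describe_cluster([(130, 2), (134, 2)]))  # Mobile Carrier Proxy
print(describe_cluster([(1, 0), (1, 0)]))      # Mobile Single Instance
```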

Classification
K-Nearest Neighbor
How similar is this point to its neighbors?
Support Vector Machine
Focus on the points that are most difficult to distinguish.
Decision Trees
What do an item’s features say about its classification?
Why multiple classifiers? To compare results and see which is most accurate for our data. Higher accuracy = fewer misclassifications = less $$$.
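Of the three classifiers above, K-Nearest Neighbor is the easiest to sketch from scratch: "how similar is this point to its neighbors?" becomes a majority vote among the k closest labeled points. This is a toy illustration with made-up labels, not the project's model:

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest
    labeled training points (squared Euclidean distance)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = sorted(train, key=lambda item: dist2(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy labeled training set: (feature vector, environment label).
train = [((1, 1), "single_instance"), ((1, 2), "single_instance"),
         ((2, 1), "single_instance"), ((130, 120), "carrier_proxy"),
         ((132, 125), "carrier_proxy"), ((128, 119), "carrier_proxy")]

print(knn_predict(train, (2, 2)))      # single_instance
print(knn_predict(train, (131, 122)))  # carrier_proxy
```

SVMs and decision trees were compared the same way: train each on the labeled clusters, then keep whichever misclassifies least on held-out data.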

Results – Accuracy & Confusion Matrix
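Accuracy and a confusion matrix are computed by tallying predicted labels against actual labels. This sketch uses toy labels and counts, not the project's actual results:

```python
# Hedged sketch: build a confusion matrix from actual vs. predicted labels.
# matrix[i][j] = count of items whose actual label is labels[i]
# but whose predicted label is labels[j]; the diagonal holds the hits.
def confusion_matrix(actual, predicted, labels):
    idx = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[idx[a]][idx[p]] += 1
    return matrix

labels = ["proxy", "single"]
actual    = ["proxy", "proxy", "single", "single", "single"]  # toy data
predicted = ["proxy", "single", "single", "single", "proxy"]

m = confusion_matrix(actual, predicted, labels)
accuracy = sum(m[i][i] for i in range(len(labels))) / len(actual)
print(m)         # [[1, 1], [1, 2]]
print(accuracy)  # 0.6
```

Off-diagonal cells show exactly which environments get confused with each other, which is more informative than the single accuracy number when deciding which classifier to keep.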

Recommendations
Collect more data
Possibly missed full picture of unique IP traffic
Revisit Curation
Create more features

Takeaways
Recurring learning sessions 
Increase pool of ML experts
Numerous business applications
Potential to better understand our customers


Last updated September 2019