
Exploratory Analysis with Pandas, Matplotlib, and Numpy

In [1]: import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
This will be a short adventure into the world of these Python libraries and why Jupyter Notebook is really nice for data exploration.
In [2]: # We can read many file formats with Pandas. We'll start with reading a csv file
# with 'read_csv' and passing in the filepath.
df = pd.read_csv('temperature.csv')
In [3]: # By default, Jupyter Notebooks will print the output of the last line run in the cell if it is an expression.
# Let's take a look at what the csv contains with the 'info' method.
df.info()
In [4]: # Now that we know what columns (Series in Pandas language) there are, we can get a
# count of the rows for each gender.
# The 'value_counts' method works on Series objects, so we need to select a Series first,
# which can be done with dot notation or with brackets.
print(df.Gender.value_counts())
df['Gender'].value_counts()
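As a small aside on the two ways of selecting a Series, here is a minimal sketch on a made-up DataFrame (the `toy` frame and its values are assumptions for illustration, not the contents of temperature.csv). It suggests that both notations give the same counts, and that brackets are the safer choice when a column name contains a space.

```python
import pandas as pd

# Hypothetical toy data, not the temperature.csv used in this notebook.
toy = pd.DataFrame({'Gender': ['Male', 'Female', 'Female'],
                    'Temperature': [98.1, 98.6, 98.4]})

# Dot notation and bracket notation select the same Series.
by_dot = toy.Gender.value_counts()
by_brackets = toy['Gender'].value_counts()
print(by_dot.equals(by_brackets))

# Bracket notation is needed when the column name is not a valid
# Python identifier, for example when it contains a space.
toy['Heart Rate'] = [70, 72, 75]
print(toy['Heart Rate'].mean())
```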
In [5]: # If we want to see the first n rows, we can use the 'head' method, and pass in the number
# of rows we want. Head will return 5 rows as the default setting.
print(df.head(3))
df.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 130 entries, 0 to 129
Data columns (total 3 columns):
Gender 130 non-null object
Temperature 130 non-null float64
HeartRate 130 non-null int64
dtypes: float64(1), int64(1), object(1)
memory usage: 3.1+ KB
Male 65
Female 65
Name: Gender, dtype: int64
Out[4]: Male 65
Female 65
Name: Gender, dtype: int64
Gender Temperature HeartRate
0 Male 96.3 70
1 Male 96.7 71
2 Male 96.9 74
Out[5]:
Gender Temperature HeartRate
0 Male 96.3 70
1 Male 96.7 71
2 Male 96.9 74
3 Male 97.0 80
4 Male 97.1 73
In [6]: # We can use the 'describe' method to gather some summary statistics on our dataset.
# This by default will only work on numerical data types, so check to see if your data
# is being assigned a type correctly when you run the 'info' method.
# Already we can tell that there appears to be a fairly low range in body temperatures,
# but a larger range in heart rates.
df.describe()
In [7]: # If you are unfamiliar with f-strings: any expression wrapped in curly braces inside the
# string is evaluated, and Python converts the result to a string for us.
print(f'The difference in mean temperature and 98.6 degrees fahrenheit is: {98.6 - np.mean(df.Temperature)}')
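If the long tail of decimals is distracting, an f-string can also carry a format specifier after a colon. The sketch below assumes the mean temperature reported by df.describe() in this notebook and simply hard-codes it, so it runs without the csv file.

```python
# Mean Temperature as reported by df.describe() in this notebook.
mean_temp = 98.249231
diff = 98.6 - mean_temp

# Without a format specifier the full float is printed.
plain = f'Difference: {diff}'
# ':.3f' rounds the displayed value to three decimal places.
rounded = f'Difference: {diff:.3f} °F'
print(plain)
print(rounded)
```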
Looking at the distribution of Temperature and Heart Rate by histogram, swarm plot included to avoid binning bias
Out[6]:
Temperature HeartRate
count 130.000000 130.000000
mean 98.249231 73.761538
std 0.733183 7.062077
min 96.300000 57.000000
25% 97.800000 69.000000
50% 98.300000 74.000000
75% 98.700000 79.000000
max 100.800000 89.000000
The difference in mean temperature and 98.6 degrees fahrenheit is: 0.35076923076923094
In [8]: #Histogram of human body temperature
# We can create multiple plots in the same space if we define a subplot.
# We pass in the number of rows, the number of columns, and then the index of the subplot
# that the following commands should draw on. That subplot receives all commands until
# we select another one. In this histogram we plot 2 sets of information on the same subplot:
# the histogram itself and a vertical line marking the mean.
plt.subplot(1,2,1)
# 'hist' is one of the plot types we can use with pyplot. 'density=True' normalizes the histogram so
# that its area sums to 1 (older versions of matplotlib called this argument 'normed').
# There are a few ways to handle bins. We can pass a list of bin edges, where each bin
# is left inclusive, right exclusive, except for the last bin which is left and right inclusive.
# In this case we use an integer which will create that many bins (101 edges).
plt.hist(df.Temperature, bins=100, density=True, label='Data', color='blue')
# To show the mean on the plot as well we draw a vertical line using the mean point of the x-axis.
plt.axvline(np.mean(df.Temperature), label='Mean', linestyle='-', color='red', alpha=.5)
# Add in the legend, and place it at a dynamically determined 'best' location.
plt.legend(loc='best')
# Add a label to the x-axis
plt.xlabel('Temperature (°F)')
# Add a label to the y-axis
plt.ylabel('Probability density')
#Swarmplot of human body temperature
# Move to plot at index 2 so we can send information to it.
plt.subplot(1,2,2)
# Create a swarmplot by defining the series and the dataset
sns.swarmplot(y='Temperature', data=df)
# Add a label to the x-axis
plt.xlabel('Distribution of Temperature (°F)')
# Add a label to the y-axis
plt.ylabel('Temperature (°F)')
# Add a title to the whole plot space.
plt.suptitle('Measured human body temperature (°F)', fontsize=16)
# Dynamically format the plots to a best-fit. This will make sure if we save the figure as
# an image, all of the contents including titles and labels are included.
plt.tight_layout()
# Move the subplots down so they are not partially covered by the title
plt.subplots_adjust(top=0.9)
# Style the margins if you would like
plt.margins(.02)
# In Jupyter we use this to display all figures.
plt.show()
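The binning rules described in the comments above can be checked without drawing anything. This is a minimal sketch using np.histogram on a handful of made-up values (an assumption for illustration); as far as I know, plt.hist applies the same binning rules.

```python
import numpy as np

values = np.array([1, 2, 2, 3, 4])  # made-up values

# Explicit edges give the bins [1, 2), [2, 3) and [3, 4];
# only the last bin includes its right edge.
counts, edges = np.histogram(values, bins=[1, 2, 3, 4])
print(counts)

# An integer asks for that many equal-width bins, so there is one more edge than bins.
_, auto_edges = np.histogram(values, bins=100)
print(len(auto_edges))

# density=True rescales the counts so that the bar areas sum to 1.
density, _ = np.histogram(values, bins=[1, 2, 3, 4], density=True)
total_area = np.sum(density * np.diff(edges))
print(total_area)
```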
In [9]: # Histogram of human heart rate
plt.subplot(1,2,1)
plt.hist(df.HeartRate, bins=100, density=True, label='Data', color='blue')
plt.axvline(np.mean(df.HeartRate), label='Mean', linestyle='-', color='red', alpha=.5)
plt.legend(loc='best')
plt.xlabel('Heart Rate')
plt.ylabel('Probability density')
# Swarmplot of human heart rate
plt.subplot(1,2,2)
sns.swarmplot(y='HeartRate', data=df)
plt.xlabel('Distribution of Heart Rate')
plt.ylabel('Heart Rate')
plt.suptitle('Measured human heart rate', fontsize=16)
plt.tight_layout()
plt.subplots_adjust(top=0.9)
plt.margins(.02)
plt.show()
Computing the covariance matrix and Pearson correlation coefficient
In [10]: # Compute the covariance matrix: covariance_matrix
covariance_matrix = np.cov(df.HeartRate, df.Temperature)
# Print the covariance matrix
print(covariance_matrix)
# Extract covariance of heart rate and temperature of human body: hr_temp_cov
# This is by default the 0, 1 location in the resulting matrix when there are two
# items being compared. The resulting 2x2 matrix takes the form of:
# (HeartRate=hr, Temperature=t)
# [[cov(hr,hr) cov(hr,t)]
# [cov(hr,t) cov(t,t)]]
hr_temp_cov = covariance_matrix[0,1]
print(hr_temp_cov)
In [11]: # Compute the Pearson correlation coefficient between two lists/arrays.
def pearson_r(x,y):
    # Compute the correlation matrix and return entry [0,1]
    return np.corrcoef(x,y)[0,1]
[[49.87292785 1.31338104]
[ 1.31338104 0.53755754]]
1.3133810375670802
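The Pearson coefficient is the covariance rescaled by the two standard deviations, so it can be recovered from the matrix printed above. The sketch below copies the three printed values by hand (the variable names are my own) and should land close to the 0.2537 that pearson_r reports later in this notebook.

```python
import math

# Values copied from the covariance matrix printed above.
cov_hr_t = 1.31338104   # cov(hr, t)
var_hr = 49.87292785    # cov(hr, hr), the variance of heart rate
var_t = 0.53755754      # cov(t, t), the variance of temperature

# r = cov(hr, t) / (std(hr) * std(t))
r = cov_hr_t / math.sqrt(var_hr * var_t)
print(f'{r:.4f}')
```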
In [12]: plt.figure(figsize=(20,10))
plt.plot(df.Temperature, df.HeartRate, marker='.', label='H.R./Temp', linestyle='none')
# Add mean lines for temperature and heart rate
plt.axvline(np.mean(df.Temperature), linestyle='-', label='Temp Mean', color='red', alpha=.5)
plt.axhline(np.mean(df.HeartRate), linestyle='-', label='H.R. Mean', color='green', alpha=.5)
# Perform a linear regression using np.polyfit(): a,b
a, b = np.polyfit(df.Temperature, df.HeartRate,1)
# Make theoretical line to plot
x = np.array([np.floor(df.Temperature.min()),np.ceil(df.Temperature.max())])
y = a * x + b
# Add regression line to the plot
plt.plot(x,y,label='Lin. Reg. line', alpha=.5)
plt.xlabel('Temperature (°F)')
plt.ylabel('Heart Rate')
plt.margins(.02)
plt.legend(bbox_to_anchor=(1.05,1), loc=2, borderaxespad=0.)
plt.show()
print(f'The Pearson coefficient is: {pearson_r(df.HeartRate, df.Temperature)}')
print(f'slope = {a}; Heart Rate/Temperature')
print(f'intercept = {b}')
In [13]: # Compute observed correlation: r_obs
r_obs = pearson_r(df.HeartRate, df.Temperature)
# Initialize permutation replicates: perm_replicates
perm_replicates = np.empty(100000)
# Draw replicates
for i in range(100000):
    hr_permuted = np.random.permutation(df.HeartRate)
    perm_replicates[i] = pearson_r(hr_permuted, df.Temperature)
# Or with list comprehension
# perm_replicates = [pearson_r(np.random.permutation(df.HeartRate), df.Temperature) for x in range(100000)]
# Compute p-value: p
p = np.sum(perm_replicates >= r_obs) / len(perm_replicates)
print(f'p-value = {p}')
The Pearson coefficient is: 0.2536564027207643
slope = 2.4432380386118884; Heart Rate/Temperature
intercept = -166.2847194182037
p-value = 0.00202
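The low p-value (about 0.002) means that a correlation at least as large as the observed one rarely appears when heart rate is shuffled at random. The positive correlation between heart rate and temperature is therefore unlikely to be due to chance alone, even though the Pearson coefficient itself is low.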
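For readers who want a repeatable version of the permutation test, here is a minimal sketch wrapped in a function. The name permutation_p_value, the seeded generator, and the synthetic data are all assumptions for illustration; this is not the notebook's cell and does not use its measurements.

```python
import numpy as np

def permutation_p_value(x, y, n_perm=10_000, seed=0):
    """One-sided permutation p-value for the Pearson correlation of x and y."""
    rng = np.random.default_rng(seed)  # seeded so the result is repeatable
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    r_obs = np.corrcoef(x, y)[0, 1]
    # Shuffling x breaks the pairing with y while keeping both sets of values.
    r_perm = np.array([np.corrcoef(rng.permutation(x), y)[0, 1]
                       for _ in range(n_perm)])
    # Fraction of shuffles with a correlation at least as large as observed.
    return np.mean(r_perm >= r_obs)

# Hypothetical synthetic data with a weak positive relationship.
data_rng = np.random.default_rng(1)
t = data_rng.normal(98.2, 0.7, size=130)
hr = 70 + 3 * (t - 98.2) + data_rng.normal(0, 7, size=130)
p_synthetic = permutation_p_value(hr, t, n_perm=2_000)
print(f'p-value = {p_synthetic}')
```

A seeded np.random.default_rng generator is used here only so that repeated runs give the same p-value; the loop itself mirrors the cell above.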
The scatter plot suggests a weak positive linear relationship between temperature and heart rate, with a Pearson coefficient of 0.254. In
our further analysis of temperatures, heart rate won't be a strong predictor of temperature. However, the positive correlation of 0.25 could warrant further
investigation of how temperature affects heart rate (cases of hypo- or hyperthermia). Regression analysis with heart rate as the dependent variable gives
the equation:
HeartRate = 2.443 × Temperature - 166.285
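As a cross-check on the regression equation, the least-squares slope of a straight-line fit equals the covariance divided by the variance of the independent variable, and the fitted line passes through the two means. The sketch below copies the summary values from the outputs printed earlier in this notebook (the variable names are my own).

```python
# Summary values copied from the outputs printed earlier in this notebook.
cov_hr_t = 1.31338104   # covariance of heart rate and temperature
var_t = 0.53755754      # variance of temperature
mean_t = 98.249231      # mean temperature
mean_hr = 73.761538     # mean heart rate

# Least-squares slope and intercept for HeartRate as a function of Temperature.
slope = cov_hr_t / var_t
intercept = mean_hr - slope * mean_t
print(f'slope = {slope:.3f}, intercept = {intercept:.3f}')
```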
Is the distribution of body temperatures normal?
We've already looked at the distribution of the human body temperature data as a histogram. To help determine if the data is a normal distribution we'll
also graph the Cumulative Distribution Function (CDF).
In [14]: # Function call to compute the Empirical Cumulative Distribution Function (ECDF)
def ecdf(data):
    # Compute the ECDF for a one-dimensional array of values and return a coordinate set of the data: x, y
    return np.sort(data), np.arange(1, len(data) + 1) / len(data)
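To make the return value concrete, here is a minimal usage sketch on four made-up temperatures (an assumption for illustration). Each y value is the fraction of observations less than or equal to the matching x value.

```python
import numpy as np

def ecdf(data):
    # Same definition as the cell above, repeated so this sketch runs on its own.
    return np.sort(data), np.arange(1, len(data) + 1) / len(data)

# Made-up temperatures, deliberately unsorted.
x, y = ecdf(np.array([98.6, 97.1, 99.0, 98.2]))
for xi, yi in zip(x, y):
    print(f'{yi:.0%} of the values are <= {xi}')
```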
In [15]: # Calculating the cdf from the empirical data set
plt.figure(figsize=(20,10))
x_temp, y_temp = ecdf(df.Temperature)
plt.plot(x_temp, y_temp, marker='.', linestyle='none')
# Calculating a theoretical normal distribution from the mean and stdev of the empirical temperature data
norm_dist = np.random.normal(np.mean(df.Temperature), np.std(df.Temperature), size=10000)
x_norm_temp, y_norm_temp = ecdf(norm_dist)
plt.plot(x_norm_temp, y_norm_temp)
plt.title('Human Body Temperature Actual vs Normal Distribution')
plt.ylabel('ECDF')
plt.xlabel('Temperature (°F)')
plt.margins(.02)
plt.legend(('ECDF', 'Theoretical CDF'))
plt.show()
Comparing the ECDF with the theoretical CDF helps us judge whether our temperature data set is normal. From this graph we can see that
the distribution of the data is approximately normal.
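The visual comparison can be backed by a number: the largest vertical gap between the ECDF and a normal CDF with the same mean and standard deviation. The sketch below is one possible way to compute it, assuming an exact normal CDF from math.erf instead of simulated draws; the function name max_ecdf_gap and both synthetic samples are hypothetical. A small gap is consistent with normality, but it is not a formal test.

```python
import math
import numpy as np

def max_ecdf_gap(data):
    """Largest vertical gap between the ECDF of data and a fitted normal CDF."""
    x = np.sort(np.asarray(data, dtype=float))
    n = len(x)
    mu, sigma = x.mean(), x.std()
    # Exact normal CDF evaluated at each sorted observation.
    cdf = np.array([0.5 * (1 + math.erf((v - mu) / (sigma * math.sqrt(2))))
                    for v in x])
    upper = np.arange(1, n + 1) / n   # ECDF height just after each point
    lower = np.arange(0, n) / n       # ECDF height just before each point
    return max(np.max(upper - cdf), np.max(cdf - lower))

# Hypothetical synthetic samples, not the notebook's data.
rng = np.random.default_rng(0)
gap_normal = max_ecdf_gap(rng.normal(98.2, 0.7, size=1000))
gap_skewed = max_ecdf_gap(rng.exponential(1.0, size=1000))
print(f'normal sample gap: {gap_normal:.3f}, skewed sample gap: {gap_skewed:.3f}')
```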
As an additional check, we'll bootstrap the sample: we repeatedly resample the observations with replacement, compute the mean of each resample,
and see whether the resulting distribution of means looks normal.
In [16]: # Create a function that draws bootstrap replicates: each replicate applies 'func'
# to a resample of the data drawn with replacement.
def draw_bs_reps(data, func, size=1):
    # Initialize array of replicates: bs_replicates
    bs_replicates = np.empty(size)
    # Generate replicates
    for i in range(size):
        bs_replicates[i] = func(np.random.choice(data, len(data)))
    return bs_replicates
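Because each run draws different resamples, the printed numbers change slightly from run to run. A minimal sketch of a repeatable variant follows; the name draw_bs_reps_seeded, the seed argument, and the sample values are assumptions for illustration and are not part of the original notebook.

```python
import numpy as np

def draw_bs_reps_seeded(data, func, size=1, seed=0):
    """Hypothetical variant of draw_bs_reps that uses a seeded generator."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    # Each replicate: resample with replacement, then apply func.
    return np.array([func(rng.choice(data, size=len(data), replace=True))
                     for _ in range(size)])

sample = np.array([96.3, 97.0, 97.8, 98.3, 98.7, 99.1, 100.8])  # made-up values
reps = draw_bs_reps_seeded(sample, np.mean, size=5_000)

# Percentiles of the replicates give an interval for the mean.
low, high = np.percentile(reps, [2.5, 97.5])
print(f'95% interval for the mean: [{low:.2f}, {high:.2f}]')
```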
In [17]: # Take 100,000 bootstrap replicates of the mean: bs_replicates
bs_replicates = draw_bs_reps(df.Temperature, np.mean, 100000)
# Compute and print the SEM
sem = np.std(df.Temperature) / np.sqrt(len(df.Temperature))
print(f'The standard error of the mean of the original data is: {sem}')
# Compute and print the standard deviation of bootstrap replicates
bs_std = np.std(bs_replicates)
print(f'The standard deviation of the mean of the bootstrapped data is: {bs_std}')
# Make a histogram of the results
plt.hist(bs_replicates, bins=50, density=True)
plt.xlabel('Mean Human Body Temperature (°F)')
plt.ylabel('PDF')
plt.show()
print(f'The 95% confidence interval for the mean is {np.percentile(bs_replicates, [2.5, 97.5])}')
print(f'The distances from the mean to the 2.5% and 97.5% percentiles are: '
      f'{np.mean(bs_replicates) - np.percentile(bs_replicates, [2.5])} '
      f'{np.percentile(bs_replicates, [97.5]) - np.mean(bs_replicates)}')
print('This shows the bootstrapped means are symmetric, consistent with a normal distribution')
The standard error of the mean of the original data is: 0.06405661469519337
The standard deviation of the mean of the bootstrapped data is: 0.06421745517365847
The 95% confidence interval for the mean is [98.12384615 98.37538462]
The distances from the mean to the 2.5% and 97.5% percentiles are: [0.12532052] [0.12621794]
This shows the bootstrapped means are symmetric, consistent with a normal distribution
This is a probabilistic estimate of the mean human body temperature. The distribution of the bootstrapped means is approximately normal, and the SEM
computed from our original data set is nearly identical to the standard deviation of the bootstrap replicates.
When taken in context with the ECDF and CDF comparison, we can reasonably say that the distribution of human body temperature is approximately normal.
The standard deviation of the distribution of sample means, called the standard error of the mean, or SEM, is given by the standard deviation of the data divided by the
square root of the number of data points.
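The SEM formula can be worked through with the numbers already printed in this notebook. One detail worth noting: df.describe() reports the standard deviation with an n - 1 divisor, while np.std divides by n by default, so the two give slightly different SEM values. The sketch below copies the 'std' value from the describe output (the variable names are my own).

```python
import math

n = 130
std_sample = 0.733183  # 'std' row of df.describe(), which divides by n - 1

# np.std divides by n by default, so rescale the describe() value to match it.
std_population = std_sample * math.sqrt((n - 1) / n)

# SEM = standard deviation / sqrt(number of data points)
sem_numpy_default = std_population / math.sqrt(n)
sem_sample = std_sample / math.sqrt(n)
print(f'{sem_numpy_default:.5f} {sem_sample:.5f}')
```

The first value should agree with the SEM printed by the notebook, which was computed with np.std.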
In [18]: def diff_of_means(data_1, data_2):
    return np.mean(data_1) - np.mean(data_2)
In [19]: female_temp = df.Temperature[df['Gender'] == 'Female']
male_temp = df.Temperature[df['Gender'] == 'Male']
emp_mean_diff = diff_of_means(female_temp, male_temp)
print(f'Difference of means: {emp_mean_diff}°F')
In [20]: bs_replicates_female = draw_bs_reps(female_temp, np.mean, 10000)
bs_replicates_male = draw_bs_reps(male_temp, np.mean, 10000)
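One way to use two sets of bootstrap replicates like these is to subtract them and read an interval for the difference of means off the percentiles. The sketch below assumes synthetic stand-ins for the two groups (group_a, group_b, and the helper bs_mean are hypothetical names with made-up values), so its numbers say nothing about the real data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for female_temp and male_temp (synthetic values).
group_a = rng.normal(98.4, 0.7, size=65)
group_b = rng.normal(98.1, 0.7, size=65)

def bs_mean(data, size):
    # Mean of each resample, drawn with replacement from one group.
    return np.array([rng.choice(data, size=len(data)).mean()
                     for _ in range(size)])

# Pair up the replicates and take their difference.
diff_reps = bs_mean(group_a, 5_000) - bs_mean(group_b, 5_000)
low, high = np.percentile(diff_reps, [2.5, 97.5])
print(f'95% interval for the difference of means: [{low:.3f}, {high:.3f}]')
```

If the interval excludes zero, that would suggest the two group means differ; an interval that straddles zero would leave the question open.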
In [21]: sns.swarmplot(x='Gender', y='Temperature', data=df)
plt.xlabel('Gender')
plt.ylabel('Temperature (°F)')
plt.show()
Difference of means: 0.2892307692307554°F
In [22]: plt.hist(bs_replicates_female, bins=50, density=True, alpha=.5, label='Mean Female Temperature', color='orange')
plt.hist(bs_replicates_male, bins=50, density=True, alpha=.5, label='Mean Male Temperature', color='purple')
plt.xlabel('Mean Temperature (°F) by Gender')
plt.ylabel('PDF')
plt.margins(.02)
plt.legend(loc='best')
plt.show()
In [23]: male_x, male_y = ecdf(male_temp)
female_x, female_y = ecdf(female_temp)
plt.plot(male_x, male_y, marker='.', linestyle='none', color='red')
plt.plot(female_x, female_y, marker='.', linestyle='none', color='blue')
plt.title('Human Body Temperature By Gender')
plt.ylabel('CDF')
plt.xlabel('Temperature (°F)')
plt.margins(.02)
plt.legend(('Male', 'Female'))
plt.show()


last updated september 2019