Python for Data Science: How to Output Basic Summary Statistics using a Single Pandas Function

In the previous blog post, we wrote a program to read from a text file (marks.txt) that consists of integers separated by commas.

Based on the contents of the text file, we constructed the frequency distribution table of the integers, and calculated their sum and mean using basic Python.

In this tutorial, we’re going to explore an alternative way of doing it using the Pandas library.

Table of Contents

What is Pandas?
Installing Pandas
Creating a Pandas Series
Creating a Pandas DataFrame
Using Pandas for Statistical Analysis

What is Pandas?

Pandas is a powerful data analysis library in Python. It offers two data structures and lots of built-in methods for manipulating numerical data in Python.

The two data structures that Pandas offers are the Pandas Series and the Pandas DataFrame.

A Pandas Series is a one-dimensional array of indexed data. It can be created from a basic Python list or dictionary using the Series() method. We’ll learn how to do that later.

A Pandas DataFrame, on the other hand, is a 2-dimensional labeled data structure. You can think of it as an Excel spreadsheet or a SQL table. We can convert a Python list or dictionary into a data frame using the DataFrame() method. We’ll do that later as well.

Installing Pandas

In order to make use of the Pandas library, we need to install it.

If you have followed the instructions in this post, you would have successfully installed Python, pip and numpy.

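Both the Pandas Series and DataFrame are built on top of the NumPy array, so Pandas needs NumPy in order to work. If NumPy is not installed yet, pip will normally install it automatically as a dependency of Pandas, but it does no harm to install it first.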

Once NumPy is installed, you simply need to run the command

pip install pandas

in your Command Prompt or Terminal to install Pandas.
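
To confirm that the installation worked, one quick check is to import the library and print its version. This is a minimal sketch rather than part of the original instructions, and the variable name installed_version is chosen only for illustration.

```python
import pandas as pd

# If the import succeeds, Pandas is installed; __version__ reports which release
installed_version = pd.__version__
print(installed_version)
```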

Creating a Pandas Series

Once Pandas is successfully installed, you are ready to start using this powerful data analysis library to perform statistical analysis on your data.

Let’s first learn how to create a Pandas series.

As mentioned earlier, a Pandas series can be created from a Python list or dictionary.

Creating a Pandas Series from a Python List

Here’s an example of how we can convert a Python list into a Pandas series:

import pandas as pd

#Creating a Python List
myList = [1, 3, 5, 1, 2, 7]

#Converting myList to a series
myList_series = pd.Series(myList)
print(myList_series)

In the code above, we first import pandas on line 1. By convention, pd is used as the alias when importing pandas.

Next, we create a basic Python list called myList (line 4) and use the Series() method on line 7 to convert this list into a Pandas series called myList_series.

Series() is a built-in method that comes with Pandas. Finally, we print myList_series.

If you run the code above, you’ll get

0    1
1    3
2    5
3    1
4    2
5    7
dtype: int64

as the output.

You can see that when we print myList_series, we get two columns.

The first column is actually the index of the elements in the series while the second column is the value.

At this point, you may notice that there is not much difference between a Pandas series and a Python list. In this example, the indexes of the series run from 0 to 5, much like the indexes of a Python list.

Indeed, if you run the following statements,

print(myList[5])
print(myList_series[5])

you’ll get the same output. In both cases, you’ll get 7, which is the 6th element in the list/series.
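
One difference that can be sketched at this point is that the index of a series does not have to be 0, 1, 2 and so on. The example below is an illustration that is not part of the original program; the labels 'a' to 'f' and the name labelled_series are made up for the demonstration.

```python
import pandas as pd

myList = [1, 3, 5, 1, 2, 7]

# Same values as before, but with custom labels as the index (labels are illustrative)
labelled_series = pd.Series(myList, index=['a', 'b', 'c', 'd', 'e', 'f'])

# Elements are now looked up by label rather than by position
print(labelled_series['f'])

# The index and the values can also be inspected separately
print(list(labelled_series.index))
print(list(labelled_series.values))
```

A plain Python list has no equivalent of this; its elements can only be reached by position.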

The ‘power’ of a Pandas series lies in the vast number of Pandas methods that we can apply to the series to perform all sorts of statistical analysis on it. These methods are not available for normal Python lists.
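
As a brief preview, the sketch below calls a few of these methods on the same numbers used earlier. The variable names total, average, middle and largest are chosen for illustration only.

```python
import pandas as pd

myList_series = pd.Series([1, 3, 5, 1, 2, 7])

total = myList_series.sum()       # 1 + 3 + 5 + 1 + 2 + 7
average = myList_series.mean()    # total divided by the number of elements
middle = myList_series.median()   # midpoint of the sorted values 1, 1, 2, 3, 5, 7
largest = myList_series.max()     # biggest value in the series

print(total, average, middle, largest)
```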

Before we learn how to use these statistical methods, let’s look at one more example of how we can create a Pandas series.

Creating a Pandas Series from a Python Dictionary

The code below shows how we can create a Pandas series from a Python dictionary:

fruit_counts = {"Apple": 10, 
                 "Banana": 20, 
                 "Cherries": 30}

fruit_counts_series = pd.Series(fruit_counts)
print(fruit_counts_series)

If you run the code, you’ll get

Apple       10
Banana      20
Cherries    30
dtype: int64

as the output.
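
When a series is built from a dictionary, the dictionary keys become the index. The sketch below, which is an illustration rather than part of the original program, looks up one value by its key and adds up all the counts. The names banana_count and total_fruits are assumptions made for the example.

```python
import pandas as pd

fruit_counts = {"Apple": 10, "Banana": 20, "Cherries": 30}
fruit_counts_series = pd.Series(fruit_counts)

# The dictionary keys act as index labels
banana_count = fruit_counts_series["Banana"]

# Statistical methods work the same way as for a series built from a list
total_fruits = fruit_counts_series.sum()

print(banana_count, total_fruits)
```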

Creating a Pandas DataFrame

Next, let’s look at how we can create Pandas data frames.

Recall that a Pandas data frame is a 2-dimensional labeled data structure. (In contrast, a Pandas series is a 1-dimensional data structure.) You can convert a Python list of dictionaries or a Python dictionary of lists into a data frame.
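
The difference in dimensions can be checked directly with the ndim and shape attributes. The following is a minimal sketch; the sample data and the names demo_series and demo_df are assumptions made for illustration.

```python
import pandas as pd

demo_series = pd.Series([10, 20, 30])
demo_df = pd.DataFrame({'x': [10, 20, 30], 'y': [1, 2, 3]})

# A series has one dimension; its shape is (number of elements,)
print(demo_series.ndim, demo_series.shape)

# A data frame has two dimensions; its shape is (rows, columns)
print(demo_df.ndim, demo_df.shape)
```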

Creating a Pandas DataFrame from a Python List

This example shows how you can convert a Python list of dictionaries into a data frame.

emp1 = {'name': 'Alex', 'age': 20, 'salary': 1050}
emp2 = {'name': 'Benny', 'age': 52, 'salary': 1400}
emp3 = {'name': 'Cathy', 'age': 23, 'salary': 1690}

emp_list = [emp1, emp2, emp3]
emp_df = pd.DataFrame(emp_list)
print(emp_df)

In the code above, the emp_list list (line 5) is made up of three dictionaries, emp1, emp2 and emp3. We use the DataFrame() method on line 6 to convert this list into a data frame. If you run the code, you’ll get

    name  age  salary
0   Alex   20    1050
1  Benny   52    1400
2  Cathy   23    1690

as the output.

Each dictionary (emp1, emp2, emp3) corresponds to a row in the data frame.
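
Because the dictionary keys become the column names, each column can be pulled out of the data frame as a series of its own. The sketch below is an illustration under that assumption; the name salary_series is made up for the example.

```python
import pandas as pd

emp_list = [
    {'name': 'Alex', 'age': 20, 'salary': 1050},
    {'name': 'Benny', 'age': 52, 'salary': 1400},
    {'name': 'Cathy', 'age': 23, 'salary': 1690},
]
emp_df = pd.DataFrame(emp_list)

# Selecting one column gives back a Pandas series
salary_series = emp_df['salary']
print(salary_series.mean())

# loc selects by row label and column name; row 1 came from emp2
print(emp_df.loc[1, 'name'])
```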

Creating a Pandas DataFrame from a Python Dictionary

Next, let’s convert a Python dictionary of lists into a data frame.

pet_list = ['Hope', 'Coco', 'Evan']
gender_list = ['F', 'F', 'M']
weight_list = [6.7, 8.1, 10.2]

pet_dict = {
     'Pet Name': pet_list,
     'Gender': gender_list,
     'Weight': weight_list    
}

pet_df = pd.DataFrame(pet_dict)
print(pet_df)

In the code above, the pet_dict dictionary (line 5) is made up of 3 lists, pet_list, gender_list and weight_list. We use the DataFrame() method on line 11 to convert this dictionary into a data frame. If you run the code, you’ll get

  Pet Name Gender  Weight
0     Hope      F     6.7
1     Coco      F     8.1
2     Evan      M    10.2

as the output. Each list in the dictionary corresponds to a column in the data frame.
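
Once the data is organised in columns, rows can be filtered by the values in a column. The following is a small sketch of that idea, not part of the original program; the name female_pets is chosen for illustration.

```python
import pandas as pd

pet_df = pd.DataFrame({
    'Pet Name': ['Hope', 'Coco', 'Evan'],
    'Gender': ['F', 'F', 'M'],
    'Weight': [6.7, 8.1, 10.2],
})

# The dictionary keys are the column names
print(list(pet_df.columns))

# Keep only the rows where the Gender column equals 'F'
female_pets = pet_df[pet_df['Gender'] == 'F']
print(female_pets['Weight'].mean())
```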

Creating a Pandas DataFrame from Pandas Series

We can also combine and convert one or more Pandas series into a data frame. One easy way to do that is to create a dictionary of series first. Here’s an example:

series1 = pd.Series(['Aaron', 'Ben', 'Carol', 'Darren'])
series2 = pd.Series([12, 11, 23, 21])

points_dict = { 'Names': series1, 'Points': series2 }
points_df = pd.DataFrame(points_dict)

print(points_df)

In the code above, the points_dict dictionary (line 4) is made up of 2 series, series1 and series2. If you run the code, you’ll get the following output:

    Names  Points
0   Aaron      12
1     Ben      11
2   Carol      23
3  Darren      21
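
As an aside, a dictionary is not the only way to combine series. One alternative, sketched here as an option rather than as the approach used in this tutorial, is the pd.concat() function with axis=1. The name points_df_alt is made up for the example.

```python
import pandas as pd

series1 = pd.Series(['Aaron', 'Ben', 'Carol', 'Darren'])
series2 = pd.Series([12, 11, 23, 21])

# axis=1 places the series side by side as columns; keys supplies the column names
points_df_alt = pd.concat([series1, series2], axis=1, keys=['Names', 'Points'])

print(points_df_alt)
```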

Using Pandas for Statistical Analysis

Now that we are familiar with Pandas series and data frames, let’s look at a simple program for performing basic statistical analysis on our data.

Suppose the file marks2.txt contains the following data:

1, 2, 0, 12, 17
3, 4, 1, 15, 6
18, 9, 20, 1, 4

The code below shows how we can use Pandas to generate the frequency distribution table and calculate various descriptive statistics.

import pandas as pd

marks = []
f = open('marks2.txt', 'r')

for line in f:
    line = line.rstrip()
    marks = marks + line.split(', ')

marks = list(map(int, marks))

marks_series = pd.Series(marks)

#Descriptive statistics using describe()
stats = marks_series.describe()
print(stats)

#Frequency Distribution Table using value_counts()
freq = marks_series.value_counts()
print(freq)

#Converting freq into a DataFrame with heading
marks_df = pd.DataFrame({'Frequency': freq})
print(marks_df)

#Converting freq into a DataFrame with heading and sorted index
marks_df2 = pd.DataFrame({'Frequency': freq}, index = sorted(freq.index))
print(marks_df2)

Lines 3 to 10 are identical to what we have in the previous post. We simply read the integers from marks2.txt and store them in a list called marks.
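
The file-reading step can also be written with a with statement, which closes the file automatically. The sketch below is an alternative, not the code from the previous post. So that it runs on its own, it first writes the sample data to a temporary file; the names sample_text and sample_path are assumptions made for the example.

```python
import os
import tempfile

# Write the sample data to a temporary file so the sketch runs on its own
sample_text = "1, 2, 0, 12, 17\n3, 4, 1, 15, 6\n18, 9, 20, 1, 4\n"
sample_path = os.path.join(tempfile.mkdtemp(), 'marks2.txt')
with open(sample_path, 'w') as f:
    f.write(sample_text)

marks = []
# The with statement closes the file automatically when the block ends
with open(sample_path, 'r') as f:
    for line in f:
        # int() ignores surrounding whitespace, so splitting on ',' is enough
        marks.extend(int(value) for value in line.split(','))

print(marks)
```

Note that this version assumes every line contains at least one integer; a blank line in the file would need to be skipped before conversion.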

Next, on line 12, we use the Series() method to convert marks into a Pandas series called marks_series.

Line 15 is where the magic happens. We use the Pandas method describe() to get the descriptive statistics of marks_series, without having to calculate them ourselves.

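The output lists the count (15), the mean (about 7.53), the standard deviation (about 7.05), the minimum (0), the 25th, 50th and 75th percentiles (1.5, 4 and 13.5) and the maximum (20) of marks_series.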
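
Because describe() returns its result as a Pandas series, individual statistics can be looked up by their labels. The sketch below illustrates this with the integers from the sample marks2.txt typed in directly, which is a shortcut taken for the example rather than part of the original program.

```python
import pandas as pd

# The integers from the sample marks2.txt shown earlier
marks_series = pd.Series([1, 2, 0, 12, 17, 3, 4, 1, 15, 6, 18, 9, 20, 1, 4])
stats = marks_series.describe()

# describe() returns a series, so each statistic can be looked up by its label
print(stats['count'])   # number of values
print(stats['mean'])    # arithmetic mean
print(stats['50%'])     # median

# The same figures are available from individual methods
print(marks_series.mean(), marks_series.median())
```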

Next, we want to get the frequency distribution table of marks_series.

To do that, we use the built-in value_counts() method on line 19 and assign the result to a variable called freq.

The next line, print(freq), prints each distinct value in marks_series alongside the number of times it occurs, with the most frequent value listed first.

You may notice that this output looks like the output for myList_series from above. Indeed, the value_counts() method returns its result as a Pandas series. If you study this series carefully, you may observe that this Pandas series is different from a normal Python list.

In this output, the indexes (the first column) do not run from 0 to 11. Instead, the indexes represent the values from marks_series.

Essentially, this output is saying that there are three 1s, two 4s, one 15, and so on, in marks_series. Hence, if you want to know how many 18s there are in marks_series, you can simply access the frequency by using freq[18].

print(freq[18]) will give us 1 as there is only one 18 in the series.
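
Looking up a value that never occurs in marks_series is a different matter, because that value is not in the index of freq. The sketch below shows one way to handle this with the get() method and a default of 0. It is an illustration rather than part of the original program, and the names count_of_18 and count_of_5 are made up for the example.

```python
import pandas as pd

marks_series = pd.Series([1, 2, 0, 12, 17, 3, 4, 1, 15, 6, 18, 9, 20, 1, 4])
freq = marks_series.value_counts()

# 18 occurs once, so it is present in the index of freq
count_of_18 = freq.get(18, 0)

# 5 never occurs; freq[5] would raise a KeyError, while get() returns the default
count_of_5 = freq.get(5, 0)

print(count_of_18, count_of_5)
```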

If you want to display a more reader-friendly version of this frequency distribution table, you can convert freq into a data frame and give it a column name. This is done on line 23. The printed table shows the same values and counts, now under a Frequency column heading.

If you want to sort the table by its index, you can do it using the index attribute. This is done on line 27 where we add the parameter index = sorted(freq.index) to the DataFrame() method. Here, we are saying we want to use the sorted index of freq as the index of our data frame. You’ll get the following output:

    Frequency
0           1
1           3
2           1
3           1
4           2
6           1
9           1
12          1
15          1
17          1
18          1
20          1
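
As an aside, the same sorted table can be produced with the sort_index() and to_frame() methods. This is sketched below as an alternative to passing a sorted index to DataFrame(), not as the approach used in the program above; the name marks_df_sorted is made up for the example.

```python
import pandas as pd

marks_series = pd.Series([1, 2, 0, 12, 17, 3, 4, 1, 15, 6, 18, 9, 20, 1, 4])
freq = marks_series.value_counts()

# sort_index() orders the rows by value; to_frame() names the single column
marks_df_sorted = freq.sort_index().to_frame('Frequency')

print(marks_df_sorted)
```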

