Getting Started with Pandas

In this notebook we will be:

  1. Reviewing the basics of Pandas

  2. Reviewing Pandas DataFrames

Notebook instructions

Recall that to run a notebook cell that contains Python code, you hold down the ‘shift’ key and then press the ‘enter’ key (you’ll have to do this to get the YouTube videos to run). To edit a cell (to add answers, for example), double-click on the cell, add your text, and then enter it by holding down ‘shift’ and pressing ‘enter’.

Imports for this Notebook

One of the downsides of notebooks is that it is hard to keep track of when things were imported and which modules are needed. Getting into the habit of putting all of the imports in a cell at the top of the notebook is a good way to avoid issues later on. When you restart the notebook, you can run that first cell to get the imports right.

from IPython.display import HTML
from IPython.display import YouTubeVideo
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

1. The basics of Pandas

Pandas is a fundamental tool for doing Data Science work in Python. While we can do this work in straight Python, Pandas makes much of that work easier. We cannot do justice to all of Pandas, as it is a big package, but here we’ll remind ourselves of some of the basics. As you do more data science work, you’ll pick up more and more of Pandas as you go along.

1.1 Pandas Series

The basic Pandas data structure is called a Series. It is a sequence, not unlike a NumPy array, but with an associated set of labels for each value called the index. If you don’t provide the index labels, Pandas will use the regular 0-based index as the label. Again: if you don’t provide index labels, it will use the numeric index as the index label. That will be important later.

You can make a Series using either a Python dict, with keys as the indices, or by separately providing the values and indices. You can also update the index labels or reset the labels to the default. Note, however, that the reset_index method does not change the Series itself. By default it returns a new DataFrame, not a Series, in which the original index is preserved as a new column; passing drop=True discards the old labels and returns a Series instead.

Review the following example and make sure you understand everything that is happening. Discuss with your group mates.

# assumes you have imported Pandas as pd in the cell at the top of this page
series_index = pd.Series([1,2,3,4], index=['one', 'two', 'three', 'four'])
print("\nSeries with indices")
print("Type:", type(series_index))
print(series_index)

series_noindex = pd.Series([5,6,7,8])
print("\nSeries with default indices")
print("Type:", type(series_noindex))
print(series_noindex)

my_dictionary = { 'nine': 9, 'ten':10, 'eleven':11, 'twelve':12 }
series_dictionary = pd.Series(my_dictionary)
print("\nSeries from a dictionary")
print(series_dictionary)
Series with indices
Type: <class 'pandas.core.series.Series'>
one      1
two      2
three    3
four     4
dtype: int64

Series with default indices
Type: <class 'pandas.core.series.Series'>
0    5
1    6
2    7
3    8
dtype: int64

Series from a dictionary
nine       9
ten       10
eleven    11
twelve    12
dtype: int64
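
The cell above does not exercise the index-updating and reset_index behavior described at the start of this section. The following is a minimal sketch of it; the Series and its labels are made up for illustration and are not part of the notebook's data.

```python
import pandas as pd

# Illustrative data only (not from the notebook's examples).
s = pd.Series([1, 2, 3], index=["one", "two", "three"])

# Update the labels by assigning a new list of the same length.
s.index = ["a", "b", "c"]

# reset_index() returns a new object and leaves s untouched.
as_frame = s.reset_index()            # DataFrame; old labels land in a column named "index"
as_series = s.reset_index(drop=True)  # Series; old labels are discarded

print(type(as_frame).__name__)    # DataFrame
print(type(as_series).__name__)   # Series
print(list(s.index))              # ['a', 'b', 'c']
```

To keep the result, assign it to a name as done here; calling s.reset_index() on its own line has no lasting effect.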

1.2 Manipulating Series

Once you have a Pandas Series object, you can access the values in a number of ways:

  • using the label in [ ], much as you would in a dictionary

  • using data member “dot” (.) access, if the label name would constitute a valid Python variable name (can’t start with a digit for example)

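  • using 0-based positional indexing, as you would with a NumPy array (.iloc[ ] is the explicit form; recent Pandas versions deprecate a bare integer in [ ] on a Series whose labels are not integers)

With the default indices the labels are themselves the 0-based integers, so an integer in [ ] is the only option: dot access does not work because a number is not a valid Python variable name.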

Review the following mechanisms for accessing data in a Pandas series based on the format and structure of the Series object.

#using label
print(series_index["three"])

#using data member access
print(series_index.three)

#using position, 0-based (.iloc is explicit; a bare series_index[2] is deprecated in recent Pandas)
print(series_index.iloc[2])

# no labels
print(series_noindex[2])
# series_noindex.2   # can't, 2 isn't a valid Python variable name
3
3
3
7
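
Label and position only coincide when the index is the default 0-based one. As a small sketch of what happens otherwise, the Series below (made-up values, with integer labels deliberately out of order) shows that .loc looks up by label while .iloc looks up by position.

```python
import pandas as pd

# Illustrative Series whose integer labels do NOT match the positions.
s = pd.Series([10, 20, 30], index=[2, 1, 0])

by_label = s.loc[2]      # the element labelled 2
by_position = s.iloc[2]  # the third element

print(by_label)      # 10
print(by_position)   # 30
print(s[2])          # 10, because [ ] treats an integer as a label when the index holds integers
```

Using .loc and .iloc explicitly avoids having to remember which interpretation a bare [ ] will pick.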

Once you have a series object, you can assign/change the values to any of the locations that you can access. Like so:

print("Before:")
print(series_dictionary)
print("---")

series_dictionary["eleven"] = 111
series_dictionary.twelve = 122

print("After:")
print(series_dictionary)
Before:
nine       9
ten       10
eleven    11
twelve    12
dtype: int64
---
After:
nine        9
ten        10
eleven    111
twelve    122
dtype: int64
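
The example above only assigns to labels that already exist. As a hedged aside, the two access styles behave differently for a label that is not yet in the Series; the sketch below uses a small made-up Series to show this.

```python
import pandas as pd

# Illustrative data only.
s = pd.Series({"nine": 9, "ten": 10})

# Bracket assignment with a new label appends a new element.
s["eleven"] = 11

# Dot assignment with a new label only sets a plain Python attribute
# on the object; it does not add an element to the Series.
s.twelve = 12

print(list(s.index))          # ['nine', 'ten', 'eleven']
print("twelve" in s.index)    # False
```

For that reason, bracket (or .loc) assignment is the safer habit whenever the label might be new.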

1.3 NumPy-like operations

Finally, you can do many of the things with a Pandas Series object that you can do with NumPy arrays, such as slicing, Boolean masking, and element-wise arithmetic.

Review the following examples to convince yourself how you can use NumPy-style operations to access Series data in Pandas. Discuss with your group mates.

Look at how you can work with ranges of the series elements. The labels are ordered and so the following works:

print(series_index["two":])
two      2
three    3
four     4
dtype: int64
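
One difference from NumPy is worth checking for yourself: a slice written with labels includes its end label, while a slice written with positions excludes its stop position. A minimal sketch, rebuilding the same four-element Series so the cell is self-contained:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=["one", "two", "three", "four"])

label_slice = s["two":"three"]   # label slice: both endpoints are included
position_slice = s.iloc[1:2]     # positional slice: the stop position is excluded

print(list(label_slice))      # [2, 3]
print(list(position_slice))   # [2]
```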

You can also apply Boolean masks to a Series:

print(series_dictionary[series_dictionary <= 10])
nine     9
ten     10
dtype: int64
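
Masks can also be combined. The sketch below assumes the original values of the dictionary Series (9 through 12) and uses the element-wise operators & (and) and ~ (not); each comparison needs its own parentheses because these operators bind more tightly than the comparisons.

```python
import pandas as pd

s = pd.Series({"nine": 9, "ten": 10, "eleven": 11, "twelve": 12})

# Element-wise "and" of two Boolean Series.
mask = (s > 9) & (s < 12)

inside = s[mask]     # values strictly between 9 and 12
outside = s[~mask]   # everything else

print(list(inside.index))    # ['ten', 'eleven']
print(list(outside.index))   # ['nine', 'twelve']
```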

And you can perform operations which return a new series (but don’t modify the existing one):

print(series_dictionary * 2)
print(series_dictionary.mean() )
nine       18
ten        20
eleven    222
twelve    244
dtype: int64
63.0

There are many operations you can perform on a Pandas Series object (over 200 last we checked!). You’ll pick up more as you continue to become a Pandas expert.


2. The Pandas DataFrame

A Pandas DataFrame is a two-dimensional data structure. The easiest way to think of a DataFrame is as a group of Series objects where each Series represents a column in the 2D structure. As with Series, you can make them in a number of ways, but the standard way is to use a dictionary where the keys are the column headers and each value is the list of values under that header.

It is always important to know the types in each column, as that can affect the kinds of operations you can perform on a column. The .dtypes attribute lists the type of every column. A type of object is likely (though not necessarily) a string.

An index for the rows is provided by default using 0-based array indexing. Indexing with a column label in [ ] returns that column as a Series. The returned Series uses the row index of the entire DataFrame.

Run and review the following code. Discuss the results with your group mates.

patient_dict = {"name":["james","jim","joan","jill"],
                 "age":[10, 20, 30, 40],
                 "weight":[150, 140, 130, 120],
                 "complaint": ["sprain", "cut", "headache", "break"]}
patient_df = pd.DataFrame(patient_dict)

print(type(patient_df))
print(patient_df)

print("\n Column types")
print(patient_df.dtypes)

print("\n age column")
age = patient_df["age"]
print(age)
print(type(age))
<class 'pandas.core.frame.DataFrame'>
    name  age  weight complaint
0  james   10     150    sprain
1    jim   20     140       cut
2   joan   30     130  headache
3   jill   40     120     break

 Column types
name         object
age           int64
weight        int64
complaint    object
dtype: object

 age column
0    10
1    20
2    30
3    40
Name: age, dtype: int64
<class 'pandas.core.series.Series'>
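
Two points from the text above can be checked with a small sketch: a column of type object need not hold strings, and [ ] returns a Series for a single label. The DataFrame and its column names below are made up for illustration; selecting with a list of labels is shown alongside as a contrast.

```python
import pandas as pd

# Illustrative data: the "mixed" column holds an int, a string, and a float.
df = pd.DataFrame({"mixed": [1, "two", 3.0], "number": [1, 2, 3]})

print(df["mixed"].dtype)   # object

one_column = df["number"]              # a single label returns a Series
two_columns = df[["mixed", "number"]]  # a list of labels returns a DataFrame

print(type(one_column).__name__)    # Series
print(type(two_columns).__name__)   # DataFrame
```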

2.1 DataFrame indexing

As we noted above, the index for a DataFrame by default is a 0-based array index. However, we can choose to use a different column as an index for the DataFrame. The .set_index() method allows us to set the index.

An irritating issue is whether the DataFrame being worked on changes. Typically it does not: most methods return a new object, and you must assign the result to keep it. Many methods also accept an inplace=True setting that applies the change to the existing object instead.

By default, the column used as the index is no longer available using standard column indexing. If you provide drop=False in set_index, the column is preserved in the DataFrame as well as in the index. Calling .reset_index() returns the DataFrame to 0-based indexing and moves the old index back into a column.

To access a row, you can use either .loc or .iloc

  • .loc uses the row index label to access the row (or the 0-based index if none is provided). It returns a Series.

  • .iloc uses the 0-based index regardless of whether a label exists. It too returns a Series.

Indices and headers are preserved in the Series indexed from a DataFrame.

Take a look at the following and make sure you can follow how the dataframe is being manipulated. Discuss with your group mates.

patient_df.set_index("age", inplace=True)
print("\nAge is now the index")
print(patient_df)

# reset to 0-based
patient_df.reset_index(inplace=True)
print("\nBack to 0-based indexing")
print(patient_df)

# keep age as a column
new_df = patient_df.set_index("age", drop=False)
print("\nDon't change the original")
print(patient_df)
print("\nIndex by age, keep the age column")
print(new_df)
Age is now the index
      name  weight complaint
age                         
10   james     150    sprain
20     jim     140       cut
30    joan     130  headache
40    jill     120     break

Back to 0-based indexing
   age   name  weight complaint
0   10  james     150    sprain
1   20    jim     140       cut
2   30   joan     130  headache
3   40   jill     120     break

Don't change the original
   age   name  weight complaint
0   10  james     150    sprain
1   20    jim     140       cut
2   30   joan     130  headache
3   40   jill     120     break

Index by age, keep the age column
     age   name  weight complaint
age                              
10    10  james     150    sprain
20    20    jim     140       cut
30    30   joan     130  headache
40    40   jill     120     break
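
The cell above does not use .loc or .iloc, so here is a minimal sketch of row access. It rebuilds a cut-down version of the patient data and indexes it by name, which is an arbitrary choice made for this illustration.

```python
import pandas as pd

patients = pd.DataFrame({"name": ["james", "jim", "joan", "jill"],
                         "age": [10, 20, 30, 40],
                         "weight": [150, 140, 130, 120]}).set_index("name")

row_by_label = patients.loc["joan"]   # look the row up by its index label
row_by_position = patients.iloc[2]    # look the row up by its 0-based position

# Both are Series: the column headers become the Series index,
# and the row label becomes the Series name.
print(list(row_by_label.index))   # ['age', 'weight']
print(row_by_label.name)          # joan
print(row_by_label["age"])        # 30
```

Here both lookups return the same row, because "joan" is the third entry in the index.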

2.2 Try it yourself

Try doing the following:

  • Make a DataFrame to store student grades. The column headers should be:

    • Name

    • ID

    • Total Percent (out of 100)

    • Final Grade Point (on the 0.0 - 4.0 scale)

Make up some names and values to fill your DataFrame. Include at least 8 students.

Then:

  • Set the index to be the ID

  • Print every student in the dataframe who got a 3.0 or greater

# Put your code here

Follow-up Questions

  1. Is there anything involving using Pandas that you’re curious about or are there any specific issues you’ve run into in the past with Pandas that you couldn’t find a solution for?