Introduction
The purpose of this article is to show some common Excel tasks and how
you would execute similar tasks in pandas. Some of the examples are somewhat trivial
but I think it is important to show the simple as well as the more
complex functions you can find elsewhere. As an added bonus, I’m going to do some fuzzy string
matching to show a little twist to the process and show how pandas can
utilize the full python system of modules to do something simply in
python that would be complex in Excel.
Make sense? Let’s get started.
Adding a Sum to a Row
The first task I’ll cover is summing some columns to add a total column.
We will start by importing our excel data into a pandas dataframe.
account
name
street
city
state
postal-code
Jan
Feb
Mar
0
211829
Kerluke, Koepp and Hilpert
34456 Sean Highway
New Jaycob
Texas
28752
10000
62000
35000
1
320563
Walter-Trantow
1311 Alvis Tunnel
Port Khadijah
NorthCarolina
38365
95000
45000
35000
2
648336
Bashirian, Kunde and Price
62184 Schamberger Underpass Apt. 231
New Lilianland
Iowa
76517
91000
120000
35000
3
109996
D’Amore, Gleichner and Bode
155 Fadel Crescent Apt. 144
Hyattburgh
Maine
46021
45000
120000
10000
4
121213
Bauch-Goldner
7274 Marissa Common
Shanahanchester
California
49681
162000
120000
35000
We want to add a total column to show total sales for Jan, Feb and Mar.
This is straightforward in Excel and in pandas. For Excel, I have added the formula sum(G2:I2) in column J.
Here is what it looks like in Excel:
Next, here is how we do it in pandas:
account
name
street
city
state
postal-code
Jan
Feb
Mar
total
0
211829
Kerluke, Koepp and Hilpert
34456 Sean Highway
New Jaycob
Texas
28752
10000
62000
35000
107000
1
320563
Walter-Trantow
1311 Alvis Tunnel
Port Khadijah
NorthCarolina
38365
95000
45000
35000
175000
2
648336
Bashirian, Kunde and Price
62184 Schamberger Underpass Apt. 231
New Lilianland
Iowa
76517
91000
120000
35000
246000
3
109996
D’Amore, Gleichner and Bode
155 Fadel Crescent Apt. 144
Hyattburgh
Maine
46021
45000
120000
10000
175000
4
121213
Bauch-Goldner
7274 Marissa Common
Shanahanchester
California
49681
162000
120000
35000
317000
Next, let’s get some totals and other values for each month. Here is what
we are trying to do as shown in Excel:
As you can see, we added a SUM(G2:G16) in row 17 in each of the columns to get
totals by month.
Performing column level analysis is easy in pandas. Here are a couple of examples.
Now, we want to add a total by month and grand total. This is where
pandas and Excel diverge a little. It is very simple to add totals in
cells in Excel for each month. Because pandas need to maintain the
integrity of the entire DataFrame, there are a couple more steps.
First, create a sum for the month and total columns.
This is fairly intuitive however, if you want to add totals as a row,
you need to do some minor manipulations.
We need to transpose the data and convert the Series to a DataFrame so
that it is easier to concat onto our existing data. The T function
allows us to switch the data from being row-based to column-based.
Jan
Feb
Mar
total
0
1462000
1507000
717000
3686000
The final thing we need to do before adding the totals back is to add
the missing columns. We use reindex to do this for us. The trick is to
add all of our columns and then allow pandas to fill in the values that
are missing.
account
name
street
city
state
postal-code
Jan
Feb
Mar
total
0
NaN
NaN
NaN
NaN
NaN
NaN
1462000
1507000
717000
3686000
Now that we have a nicely formatted DataFrame, we can add it to our
existing one using append.
account
name
street
city
state
postal-code
Jan
Feb
Mar
total
11
231907
Hahn-Moore
18115 Olivine Throughway
Norbertomouth
NorthDakota
31415
150000
10000
162000
322000
12
242368
Frami, Anderson and Donnelly
182 Bertie Road
East Davian
Iowa
72686
162000
120000
35000
317000
13
268755
Walsh-Haley
2624 Beatty Parkways
Goodwinmouth
RhodeIsland
31919
55000
120000
35000
210000
14
273274
McDermott PLC
8917 Bergstrom Meadow
Kathryneborough
Delaware
27933
150000
120000
70000
340000
15
NaN
NaN
NaN
NaN
NaN
NaN
1462000
1507000
717000
3686000
Additional Data Transforms
For another example, let’s try to add a state abbreviation to the data set.
From an Excel perspective the easiest way is probably to add a new
column, do a vlookup on the state name and fill in the abbreviation.
I did this and here is a snapshot of what the results looks like:
You’ll notice that after performing the vlookup, there are some values that are not
coming through correctly. That’s because we misspelled some of the
states. Handling this in Excel would be really challenging (on big data sets).
Fortunately with pandas we have the full power of the python ecosystem
at our disposal. In thinking about how to solve this type of messy data problem,
I thought about trying to do some fuzzy text matching to determine the correct value.
Fortunately someone else has done a lot of work in this are. The fuzzy wuzzy library has some pretty useful
functions for this type of situation. Make sure to get it and install it first.
The other piece of code we need is a state name to abbreviation mapping. Instead of trying to
type it myself, a little googling found this code.
Get started by importing the appropriate fuzzywuzzy functions and define our state map dictionary.
Here are some example of how the fuzzy text matching function works.
Now that we know how this works, we create our function to take the state column and convert it to a valid
abbreviation. We use the 80 score_cutoff for this data. You can play
with it to see what number works for your data. You’ll notice that we either return a valid
abbreviation or an np.nan so that we have some valid values in the field.
Add the column in the location we want and fill it with NaN values
account
name
street
city
state
postal-code
abbrev
Jan
Feb
Mar
total
0
211829
Kerluke, Koepp and Hilpert
34456 Sean Highway
New Jaycob
Texas
28752
NaN
10000
62000
35000
107000
1
320563
Walter-Trantow
1311 Alvis Tunnel
Port Khadijah
NorthCarolina
38365
NaN
95000
45000
35000
175000
2
648336
Bashirian, Kunde and Price
62184 Schamberger Underpass Apt. 231
New Lilianland
Iowa
76517
NaN
91000
120000
35000
246000
3
109996
D’Amore, Gleichner and Bode
155 Fadel Crescent Apt. 144
Hyattburgh
Maine
46021
NaN
45000
120000
10000
175000
4
121213
Bauch-Goldner
7274 Marissa Common
Shanahanchester
California
49681
NaN
162000
120000
35000
317000
We use apply to add the abbreviations into the approriate column.
account
name
street
city
state
postal-code
abbrev
Jan
Feb
Mar
total
11
231907
Hahn-Moore
18115 Olivine Throughway
Norbertomouth
NorthDakota
31415
ND
150000
10000
162000
322000
12
242368
Frami, Anderson and Donnelly
182 Bertie Road
East Davian
Iowa
72686
IA
162000
120000
35000
317000
13
268755
Walsh-Haley
2624 Beatty Parkways
Goodwinmouth
RhodeIsland
31919
RI
55000
120000
35000
210000
14
273274
McDermott PLC
8917 Bergstrom Meadow
Kathryneborough
Delaware
27933
DE
150000
120000
70000
340000
15
NaN
NaN
NaN
NaN
NaN
NaN
NaN
1462000
1507000
717000
3686000
I think this is pretty cool. We have developed a very simple process to intelligently clean up
this data. Obviously when you only have 15 or so rows, this is not a big deal. However, what if you had
15,000? You would have to do something manual in Excel to clean this up.
Subtotals
For the final section of this article, let’s get some subtotals by state.
In Excel, we would use the subtotal tool to do this for us.
The output would look like this:
Creating a subtotal in pandas, is accomplished using groupby
Jan
Feb
Mar
total
abbrev
AR
150000
120000
35000
305000
CA
162000
120000
35000
317000
DE
150000
120000
70000
340000
IA
253000
240000
70000
563000
ID
70000
120000
35000
225000
ME
45000
120000
10000
175000
MS
62000
120000
70000
252000
NC
95000
45000
35000
175000
ND
150000
10000
162000
322000
PA
70000
95000
35000
200000
RI
200000
215000
70000
485000
TN
45000
120000
55000
220000
TX
10000
62000
35000
107000
Next, we want to format the data as currency by using applymap to all the values in the
data frame.
Jan
Feb
Mar
total
abbrev
AR
$150,000
$120,000
$35,000
$305,000
CA
$162,000
$120,000
$35,000
$317,000
DE
$150,000
$120,000
$70,000
$340,000
IA
$253,000
$240,000
$70,000
$563,000
ID
$70,000
$120,000
$35,000
$225,000
ME
$45,000
$120,000
$10,000
$175,000
MS
$62,000
$120,000
$70,000
$252,000
NC
$95,000
$45,000
$35,000
$175,000
ND
$150,000
$10,000
$162,000
$322,000
PA
$70,000
$95,000
$35,000
$200,000
RI
$200,000
$215,000
$70,000
$485,000
TN
$45,000
$120,000
$55,000
$220,000
TX
$10,000
$62,000
$35,000
$107,000
The formatting looks good, now we can get the totals like we did earlier.
Convert the values to columns and format it.
Jan
Feb
Mar
total
0
$1,462,000
$1,507,000
$717,000
$3,686,000
Finally, add the total value to the DataFrame.
Jan
Feb
Mar
total
AR
$150,000
$120,000
$35,000
$305,000
CA
$162,000
$120,000
$35,000
$317,000
DE
$150,000
$120,000
$70,000
$340,000
IA
$253,000
$240,000
$70,000
$563,000
ID
$70,000
$120,000
$35,000
$225,000
ME
$45,000
$120,000
$10,000
$175,000
MS
$62,000
$120,000
$70,000
$252,000
NC
$95,000
$45,000
$35,000
$175,000
ND
$150,000
$10,000
$162,000
$322,000
PA
$70,000
$95,000
$35,000
$200,000
RI
$200,000
$215,000
$70,000
$485,000
TN
$45,000
$120,000
$55,000
$220,000
TX
$10,000
$62,000
$35,000
$107,000
0
$1,462,000
$1,507,000
$717,000
$3,686,000
You’ll notice that the index is ‘0’ for the total line. We want to change
that using rename.
Jan
Feb
Mar
total
AR
$150,000
$120,000
$35,000
$305,000
CA
$162,000
$120,000
$35,000
$317,000
DE
$150,000
$120,000
$70,000
$340,000
IA
$253,000
$240,000
$70,000
$563,000
ID
$70,000
$120,000
$35,000
$225,000
ME
$45,000
$120,000
$10,000
$175,000
MS
$62,000
$120,000
$70,000
$252,000
NC
$95,000
$45,000
$35,000
$175,000
ND
$150,000
$10,000
$162,000
$322,000
PA
$70,000
$95,000
$35,000
$200,000
RI
$200,000
$215,000
$70,000
$485,000
TN
$45,000
$120,000
$55,000
$220,000
TX
$10,000
$62,000
$35,000
$107,000
Total
$1,462,000
$1,507,000
$717,000
$3,686,000
Conclusion
By now, most people know that pandas can do a lot of complex
manipulations on data - similar to Excel. As I have been learning about
pandas, I still find myself trying to remember how to do things that I
know how to do in Excel but not in pandas. I realize that this
comparison may not be exactly fair - they are different tools. However,
I hope to reach people that know Excel and want to learn what
alternatives are out there for their data processing needs. I hope
these examples will help others feel confident that they can replace a lot
of their crufty Excel data manipulations with pandas.
I found this exercise helpful to cement these ideas in my mind. I hope it works
for you as well. If you have other Excel tasks that you would like to learn how
to do in pandas, let me know via the comments below and I will try to help.