
[ Pandas Dataframes: How to groupby on a groupby? ]

I'm trying to generalize the question I asked here.

The mlb dataframe looks like

    Player             Position          Salary     Year
0   Mike Witt          Pitcher           1400000    1988
1   George Hendrick    Outfielder        989333     1988
2   Chili Davis        Outfielder        950000     1988
3   Brian Downing      Designated Hitter 900000     1988
4   Bob Boone          Catcher           883000     1988
5   Bob Boone          Catcher           883000     1989
6   Frank Smith        Catcher           993000     1988
7   Frank Smith        Pitcher           1300000    1989

Note that the same player may be listed multiple times for different years. I'm trying to find the player with maximum total salary for each position. Output should be something like:

    Position           Player            Salary    
 0  Pitcher            Mike Witt         1400000
 1  Outfielder         George Hendrick   989333
 2  Designated Hitter  Brian Downing     900000
 3  Catcher            Bob Boone         1766000

I think I need to do something like group by Position, then group by Player, then sum for each player and find the maximum. But I'm having trouble doing this.

Once I do positions = mlb.groupby("Position") I'm having trouble doing the next step. I think a nested groupby by Player is necessary, but I don't know how to proceed.

Answer 1


This is messy but gets the job done.

import pandas as pd

df = pd.DataFrame({'Player':['Mike Witt','George Hendrick','Chili Davis','Brian Downing','Bob Boone','Bob Boone'],
                'Position':['Pitcher','Outfielder','Outfielder','Designated Hitter','Catcher','Catcher'],
                'Salary':[1400000,989333, 950000,900000,883000,900000],
                'Year':[1988,1988,1988,1988,1988,1988]})

# Total salary per (Player, Position) pair
gp = df.groupby(['Player', 'Position'], as_index=False)['Salary'].sum()
# Sort by total, then keep the first (highest-paid) row for each position
gp.sort_values('Salary', ascending=False).drop_duplicates('Position')

OR

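# A plain .max() takes each column's maximum independently, so Player and Salary
# could come from different rows; idxmax keeps them together
gp.loc[gp.groupby('Position')['Salary'].idxmax()]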

As @dawg mentioned, this treats a player who is listed under several positions as a separate player for each position, so the salaries shown here are totals per player per position.

            Player           Position   Salary
0        Bob Boone            Catcher  1783000
4        Mike Witt            Pitcher  1400000
3  George Hendrick         Outfielder   989333
1    Brian Downing  Designated Hitter   900000
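
To see what that caveat means on the question's own data, here is a minimal sketch. The frame is retyped by hand from the table in the question, and the names `totals` and `top` are only illustrative. Frank Smith earns 2,293,000 overall, more than anyone else, but because that total is split between Catcher and Pitcher he does not come out on top for either position.

```python
import pandas as pd

# The question's mlb table, retyped by hand for illustration
mlb = pd.DataFrame({
    'Player': ['Mike Witt', 'George Hendrick', 'Chili Davis', 'Brian Downing',
               'Bob Boone', 'Bob Boone', 'Frank Smith', 'Frank Smith'],
    'Position': ['Pitcher', 'Outfielder', 'Outfielder', 'Designated Hitter',
                 'Catcher', 'Catcher', 'Catcher', 'Pitcher'],
    'Salary': [1400000, 989333, 950000, 900000, 883000, 883000, 993000, 1300000],
    'Year': [1988, 1988, 1988, 1988, 1988, 1989, 1988, 1989],
})

# Total salary per (Position, Player) pair
totals = mlb.groupby(['Position', 'Player'], as_index=False)['Salary'].sum()

# Row label of the largest total within each position, then select those rows
top = totals.loc[totals.groupby('Position')['Salary'].idxmax()].reset_index(drop=True)
print(top)
```

If a player should instead be credited with their full salary regardless of position, the grouping would have to start from Player alone, which is a different question from the one asked here.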

Answer 2


Try this

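# df is assumed to be a frame with Player, Position and Salary columns
# Total salary per (Position, Player) pair
g = df.groupby(['Position', 'Player'])['Salary'].sum().reset_index()
# Keep the row with the largest total in each position
print(g.loc[g.groupby('Position')['Salary'].idxmax()])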