WeChat official account: “Python reads money”

If you have any questions or suggestions, please leave a message on the official account.

In day-to-day data analysis, we often need to **split the data into groups according to one or more fields**. In e-commerce, for example, nationwide sales are broken down by province to analyze how sales change in each one. In social products, users are segmented by profile (gender and age) to study their usage and preferences. In pandas, this kind of processing is done mainly with `groupby`. This article introduces the basic principle of `groupby` and the corresponding `agg`, `transform`, and `apply` operations.

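For the illustrations that follow, we use a small simulated data set. The code below generates 10 random rows. Because the values are random, your output will differ from the table, which lists the rows (index 0 to 8) used in the rest of this article: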

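```
import numpy as np
import pandas as pd

# Pick a random company for each of 10 rows, plus a random salary and age.
company = ["A", "B", "C"]
data = pd.DataFrame({
    "company": [company[x] for x in np.random.randint(0, len(company), 10)],
    "salary": np.random.randint(5, 50, 10),
    "age": np.random.randint(15, 50, 10),
})
```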

index | company | salary | age |
---|---|---|---|
0 | C | 43 | 35 |
1 | C | 17 | 25 |
2 | C | 8 | 30 |
3 | A | 20 | 22 |
4 | B | 10 | 17 |
5 | B | 21 | 40 |
6 | A | 23 | 33 |
7 | C | 49 | 19 |
8 | B | 8 | 30 |
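Since the generator above is random, rerunning it will not reproduce the rows in the table. To follow along with the exact numbers used below, one option is to type the same nine rows in by hand. This is only a minimal sketch: the values are copied from the table, and the variable name `data` matches the one used in the article.

```python
import pandas as pd

# The nine sample rows from the table, entered explicitly so that the
# results shown in this article can be reproduced.
data = pd.DataFrame({
    "company": ["C", "C", "C", "A", "B", "B", "A", "C", "B"],
    "salary":  [43, 17, 8, 20, 10, 21, 23, 49, 8],
    "age":     [35, 25, 30, 22, 17, 40, 33, 19, 30],
})

print(data)
```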

# 1、 Basic principles of groupby

In pandas, grouping takes just one line of code. Here we group the data set above by the `company` field:

`In [5]: group = data.groupby("company")`

Entering this in `ipython` gives you a `DataFrameGroupBy` object:

```
In [6]: group
Out[6]: <pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002B7E2650240>
```

So what is this `DataFrameGroupBy` object, and what actually happened to `data` after `groupby`? The result that `ipython` returns is only its memory address, which does not help build intuition. To see what is inside `group`, let's convert it to a `list`:

```
In [8]: list(group)
Out[8]:
[('A',   company  salary  age
  3       A      20   22
  6       A      23   33),
 ('B',   company  salary  age
  4       B      10   17
  5       B      21   40
  8       B       8   30),
 ('C',   company  salary  age
  0       C      43   35
  1       C      17   25
  2       C       8   30
  7       C      49   19)]
```

After converting to a list, you can see that it consists of three tuples. In each tuple, the first element is the group key (we grouped by `company`, so the groups are `A`, `B`, and `C`), and the second element is the `DataFrame` that belongs to that group.
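The same (key, sub-`DataFrame`) structure can also be seen without building a list, by iterating over the grouped object directly. The sketch below assumes the sample rows from this article, typed in by hand. `get_group` is a standard pandas method for pulling out a single group, and the names `sub_df` and `group_b` are only illustrative.

```python
import pandas as pd

# Sample rows from this article, entered by hand.
data = pd.DataFrame({
    "company": ["C", "C", "C", "A", "B", "B", "A", "C", "B"],
    "salary":  [43, 17, 8, 20, 10, 21, 23, 49, 8],
    "age":     [35, 25, 30, 22, 17, 40, 33, 19, 30],
})

group = data.groupby("company")

# Iterating over a GroupBy object yields (group key, sub-DataFrame) pairs.
for name, sub_df in group:
    print(name, sub_df.shape)

# A single group can also be fetched by its key.
group_b = group.get_group("B")
print(group_b)
```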

The whole process can be illustrated as follows:

In short, `groupby` splits the original `DataFrame` into several sub-`DataFrame`s according to the `groupby` field (here, `company`). There are as many sub-`DataFrame`s as there are groups. **So the operations that follow `groupby` (such as `agg`, `apply`, and so on) are all operations on these sub-`DataFrame`s.** Once this is clear, the main principle of `groupby` in pandas is basically understood. Next, let's look at the common operations that follow `groupby`.

# 2、 agg aggregation operation

Aggregation is a very common operation after `groupby`, and anyone who writes `SQL` should be very familiar with it. Aggregations can compute sums, means, maximums, minimums, and so on. The following table lists the common aggregation functions in pandas.

function | purpose |
---|---|
min | minimum |
max | maximum |
sum | sum |
mean | mean |
median | median |
std | standard deviation |
var | variance |
count | count |

For the sample data set, to find the average age and average salary of employees at each company, we can use the following code:

```
In [12]: data.groupby("company").agg('mean')
Out[12]:
         salary    age
company
A         21.50  27.50
B         13.00  29.00
C         29.25  27.25
```

To apply a different aggregation to each column, for example the mean age and the median salary of each company's employees, pass a dictionary that maps column names to aggregation functions:

```
In [17]: data.groupby('company').agg({'salary':'median','age':'mean'})
Out[17]:
         salary    age
company
A          21.5  27.50
B          10.0  29.00
C          30.0  27.25
```
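Two related forms may be useful here. `agg` also accepts a list of function names when several statistics are wanted for one column, and pandas 0.25 and later supports "named aggregation", which lets you choose the output column names. The sketch below assumes the sample rows from this article. The output names `median_salary` and `avg_age` are made up for illustration.

```python
import pandas as pd

# Sample rows from this article, entered by hand.
data = pd.DataFrame({
    "company": ["C", "C", "C", "A", "B", "B", "A", "C", "B"],
    "salary":  [43, 17, 8, 20, 10, 21, 23, 49, 8],
    "age":     [35, 25, 30, 22, 17, 40, 33, 19, 30],
})

# Several aggregations on one column: pass a list of function names.
salary_stats = data.groupby("company")["salary"].agg(["min", "max", "mean"])
print(salary_stats)

# Named aggregation: output_name=(input_column, function).
summary = data.groupby("company").agg(
    median_salary=("salary", "median"),
    avg_age=("age", "mean"),
)
print(summary)
```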

The `agg` aggregation process can be illustrated as follows (using the second example):

# 3、 Transform

What kind of operation is `transform`, and how does it differ from `agg`? To make the difference clearer, let's compare the two in a practical scenario.

In the `agg` section above, we learned how to compute the average salary of employees at each company. Suppose we now need to add a new column `avg_salary` to the original data set, representing **the average salary of the company each employee works for (employees at the same company share the same value)**. How can we do this? Following the usual steps, we would first compute the average salary of each company and then fill it in row by row according to each employee's company. Without `transform`, the code looks like this:

```
In [21]: avg_salary_dict = data.groupby('company')['salary'].mean().to_dict()
In [22]: data['avg_salary'] = data['company'].map(avg_salary_dict)
In [23]: data
Out[23]:
  company  salary  age  avg_salary
0       C      43   35       29.25
1       C      17   25       29.25
2       C       8   30       29.25
3       A      20   22       21.50
4       B      10   17       13.00
5       B      21   40       13.00
6       A      23   33       21.50
7       C      49   19       29.25
8       B       8   30       13.00
```

With `transform`, only one line of code is needed:

```
In [24]: data['avg_salary'] = data.groupby('company')['salary'].transform('mean')
In [25]: data
Out[25]:
  company  salary  age  avg_salary
0       C      43   35       29.25
1       C      17   25       29.25
2       C       8   30       29.25
3       A      20   22       21.50
4       B      10   17       13.00
5       B      21   40       13.00
6       A      23   33       21.50
7       C      49   19       29.25
8       B       8   30       13.00
```

Let's look at how `transform` works after `groupby`, in graphical form. To make the picture more intuitive, the `company` column is included in it, although with the code above only the `salary` column is actually involved:

The big box in the picture marks where `transform` and `agg` differ. With `agg`, the mean for each of the companies `A`, `B`, and `C` is computed and returned directly, one value per group. With `transform`, **every row gets its own result, and rows in the same group share the same value**: after the mean within each group is computed, the results are returned **in the order of the original index**. If this is unclear, compare this picture with the one for `agg`.
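The difference in output shape can also be checked directly in code. The sketch below assumes the sample rows from this article. The column name `salary_vs_avg` is hypothetical and only shows one typical use of a `transform` result.

```python
import pandas as pd

# Sample rows from this article, entered by hand.
data = pd.DataFrame({
    "company": ["C", "C", "C", "A", "B", "B", "A", "C", "B"],
    "salary":  [43, 17, 8, 20, 10, 21, 23, 49, 8],
    "age":     [35, 25, 30, 22, 17, 40, 33, 19, 30],
})

by_company = data.groupby("company")["salary"]

# agg collapses each group to one value: the result is indexed by company.
agg_result = by_company.agg("mean")

# transform broadcasts each group's value back to its rows: the result
# has the same index, in the same order, as the original data.
transform_result = by_company.transform("mean")

print(agg_result.shape)        # one entry per company
print(transform_result.shape)  # one entry per original row

# Because the index lines up with the original rows, the result can be
# combined with them directly, e.g. each salary minus the company average.
data["salary_vs_avg"] = data["salary"] - transform_result
print(data)
```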

# 4、 Apply

`apply` should be an old friend by now. It is more flexible than `agg` and `transform`: you can pass in any custom function to implement complex data operations. The article "Three axes of pandas data processing" introduced how to use `apply`. How does `apply` after `groupby` differ from what was described there?

There are some differences, but the overall principle is basically the same. The difference is that with `apply` after `groupby`, each sub-`DataFrame` produced by the grouping is passed as the argument to the specified function, so the basic unit of operation is a `DataFrame`, whereas in the `apply` described earlier the basic unit is a `Series`. Let's again use an example to show how `apply` is used after `groupby`.

Suppose we need the record of the oldest employee at each company. How can this be done? It can be implemented with the following code:

```
In [38]: def get_oldest_staff(x):
    ...:     # Sort one company's rows by age, then take the last (oldest) row.
    ...:     df = x.sort_values(by='age', ascending=True)
    ...:     return df.iloc[-1, :]
    ...:
In [39]: oldest_staff = data.groupby('company',as_index=False).apply(get_oldest_staff)
In [40]: oldest_staff
Out[40]:
  company  salary  age
0       A      23   33
1       B      21   40
2       C      43   35
```

In this way, we get the record of the oldest employee at each company. The whole process is illustrated as follows:

As you can see, `apply` here works on basically the same principle as in the previous article, except that the argument passed to the function changes from a `Series` to a sub-`DataFrame`.
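To make the "sub-`DataFrame` as argument" point concrete, here is a minimal sketch of a custom function that needs more than one column of each group at once. It assumes the sample rows from this article. The function `describe_group` and its output names are hypothetical. The value columns are selected explicitly before `apply`, which is an assumption made to keep the sketch independent of how a given pandas version treats the grouping column inside `apply`.

```python
import pandas as pd

# Sample rows from this article, entered by hand.
data = pd.DataFrame({
    "company": ["C", "C", "C", "A", "B", "B", "A", "C", "B"],
    "salary":  [43, 17, 8, 20, 10, 21, 23, 49, 8],
    "age":     [35, 25, 30, 22, 17, 40, 33, 19, 30],
})

def describe_group(sub_df):
    # sub_df is the sub-DataFrame for one company (columns: salary, age).
    youngest = sub_df["age"].idxmin()  # index label of the youngest employee
    return pd.Series({
        "headcount": len(sub_df),
        "salary_range": sub_df["salary"].max() - sub_df["salary"].min(),
        "salary_of_youngest": sub_df.loc[youngest, "salary"],
    })

# Select the value columns explicitly, then apply the function per group.
# Each returned Series becomes one row of the result, indexed by company.
result = data.groupby("company")[["salary", "age"]].apply(describe_group)
print(result)
```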

Finally, a small piece of advice about `apply`: although it is more flexible, it runs slower than `agg` and `transform`. So for problems that `agg` and `transform` can solve after `groupby`, prefer those two methods, and fall back on `apply` only when they cannot do the job.
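As an illustration of this advice, the oldest-employee lookup from the `apply` section can also be written without `apply`. The sketch below is an alternative, not the method used in this article: it relies on the built-in group reduction `idxmax` and assumes the sample rows from this article.

```python
import pandas as pd

# Sample rows from this article, entered by hand.
data = pd.DataFrame({
    "company": ["C", "C", "C", "A", "B", "B", "A", "C", "B"],
    "salary":  [43, 17, 8, 20, 10, 21, 23, 49, 8],
    "age":     [35, 25, 30, 22, 17, 40, 33, 19, 30],
})

# For each company, idxmax returns the index label of the row with the
# largest age (the first such row if there is a tie).
oldest_idx = data.groupby("company")["age"].idxmax()

# Look those rows up in the original data and renumber them from 0.
oldest_staff = data.loc[oldest_idx].reset_index(drop=True)
print(oldest_staff)
```

With the sample rows, this selects the same three employees as `get_oldest_staff`.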


This work is licensed under a CC license. Reprints must credit the author and link to this article.