3-10 Common Pandas Operations

 

1. Constructing the data

In [1]:
import pandas as pd
data=pd.DataFrame({'group':['a','a','a','b','b','b','c','c','c'],
                 'data':[4,1,2,2,3,5,3,5,5]})
data
Out[1]:
 
  group data
0 a 4
1 a 1
2 a 2
3 b 2
4 b 3
5 b 5
6 c 3
7 c 5
8 c 5
 

2. Sorting

In [2]:
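# by lists the columns to sort on; ascending takes one bool per column
# (False = descending for 'group', True = ascending for 'data');
# inplace=True modifies data itself instead of returning a sorted copy
data.sort_values(by=['group','data'],ascending=[False,True],inplace=True)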
data
Out[2]:
 
  group data
6 c 3
7 c 5
8 c 5
3 b 2
4 b 3
5 b 5
1 a 1
2 a 2
0 a 4
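
The call above sorts in place. A minimal sketch of the non-mutating form, assuming pandas 1.0 or later for the ignore_index parameter (the names sorted_data and renumbered are illustrative, not from the original):

```python
import pandas as pd

data = pd.DataFrame({'group': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
                     'data': [4, 1, 2, 2, 3, 5, 3, 5, 5]})

# Without inplace=True, sort_values returns a new DataFrame and leaves data untouched.
sorted_data = data.sort_values(by=['group', 'data'], ascending=[False, True])

# ignore_index=True renumbers the result 0..n-1 instead of keeping the old row labels.
renumbered = data.sort_values(by=['group', 'data'], ascending=[False, True],
                              ignore_index=True)

print(sorted_data)
print(renumbered)
```

Keeping the original frame unchanged is often easier to reason about in a notebook, because re-running a cell does not alter its own input.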
 

3. Sorting by a specified key:

In [3]:
data=pd.DataFrame({'k1':['one']*3+['two']*4,'k2':[3,2,1,3,3,3,4]})
data
Out[3]:
 
  k1 k2
0 one 3
1 one 2
2 one 1
3 two 3
4 two 3
5 two 3
6 two 4
In [4]:
data.sort_values(by='k2')
Out[4]:
 
  k1 k2
2 one 1
1 one 2
0 one 3
3 two 3
4 two 3
5 two 3
6 two 4
 

5. Removing duplicate rows

In [5]:
data.drop_duplicates()  # drop rows that repeat an earlier row in both k1 and k2 (the first occurrence is kept)
Out[5]:
 
  k1 k2
0 one 3
1 one 2
2 one 1
3 two 3
6 two 4
In [6]:
data.drop_duplicates('k1')  # drop rows whose k1 value has already appeared (the first occurrence is kept)
Out[6]:
 
  k1 k2
0 one 3
3 two 3
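
Both calls above keep the first occurrence of each duplicate. As a small sketch of the related options (the variable names dup_mask and last_per_k1 are illustrative), duplicated() shows which rows would be dropped, and keep='last' retains the final occurrence instead:

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4,
                     'k2': [3, 2, 1, 3, 3, 3, 4]})

# duplicated() marks every row that repeats an earlier row; first occurrences are False.
dup_mask = data.duplicated()

# keep='last' keeps the last row seen for each k1 value rather than the first.
last_per_k1 = data.drop_duplicates(subset='k1', keep='last')

print(dup_mask.tolist())
print(last_per_k1)
```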
 

6. Mapping values to new values

In [7]:
data1=pd.DataFrame({'food':['A1','A2','B1','B2','C1','C2','C3'],'data':[1,2,3,4,5,6,7]})
data1
Out[7]:
 
  food data
0 A1 1
1 A2 2
2 B1 3
3 B2 4
4 C1 5
5 C2 6
6 C3 7
 

6-1 Mapping with apply

In [8]:
def food_map(series):
    if series['food']=='A1':
        return 'A'
    elif series['food']=='A2':
        return 'A'
    elif series['food']=='B1':
        return 'B'
    elif series['food']=='B2':
        return 'B'
    elif series['food']=='C1':
        return 'C'
    elif series['food']=='C2':
        return 'C'
    elif series['food']=='C3':
        return 'C'
data1['food_map']=data1.apply(food_map,axis='columns')  # call food_map on each row
data1 
Out[8]:
 
  food data food_map
0 A1 1 A
1 A2 2 A
2 B1 3 B
3 B2 4 B
4 C1 5 C
5 C2 6 C
6 C3 7 C
 

6-2 Mapping with map

In [9]:
food2Upper={
    'A1':'A',
    'A2':'A',
    'B1':'B',
    'B2':'B',
    'C1':'C',
    'C2':'C',
    'C3':'C'}  # dictionary that defines the mapping
data1['upper']=data1['food'].map(food2Upper)  # look up each food value in the dictionary
data1
Out[9]:
 
  food data food_map upper
0 A1 1 A A
1 A2 2 A A
2 B1 3 B B
3 B2 4 B B
4 C1 5 C C
5 C2 6 C C
6 C3 7 C C
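
Two details of map are worth seeing in a small sketch. A dictionary that lacks a key yields NaN for that row, and for this particular data the category is simply the first character of the code, so the .str accessor can derive it directly. The incomplete dictionary partial and the column name partial are hypothetical, introduced only for illustration:

```python
import pandas as pd

data1 = pd.DataFrame({'food': ['A1', 'A2', 'B1', 'B2', 'C1', 'C2', 'C3'],
                      'data': [1, 2, 3, 4, 5, 6, 7]})

# Each category here is the first character of the food code,
# so string indexing gives the same result as the explicit dictionary.
data1['upper'] = data1['food'].str[0]

# map() with a dictionary returns NaN for values the dictionary does not contain.
partial = {'A1': 'A', 'A2': 'A'}  # hypothetical, deliberately incomplete mapping
data1['partial'] = data1['food'].map(partial)

print(data1)
```

The dictionary form remains the better choice when the categories cannot be computed from the value itself.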
 

7. Adding a new column with assign

In [10]:
import numpy as np
df=pd.DataFrame({'data1':np.random.random(5),
                 'data2':np.random.random(5)})
df2=df.assign(ratio=df['data1']/df['data2'])  # assign returns a new frame with the extra column
df2
Out[10]:
 
  data1 data2 ratio
0 0.002526 0.336918 0.007498
1 0.530793 0.549558 0.965854
2 0.527832 0.229412 2.300803
3 0.902357 0.826746 1.091456
4 0.984355 0.372997 2.639041
In [11]:
df2.drop('ratio',axis='columns',inplace=True)  # delete the named column in place
df2
Out[11]:
 
  data1 data2
0 0.002526 0.336918
1 0.530793 0.549558
2 0.527832 0.229412
3 0.902357 0.826746
4 0.984355 0.372997
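
As a hedged sketch of a slightly more flexible form, assign also accepts callables, which lets one new column build on another created in the same call. The seed and the column names ratio and ratio_pct are assumptions made for this example:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the example is repeatable
df = pd.DataFrame({'data1': rng.random(5), 'data2': rng.random(5)})

# assign returns a new DataFrame; df itself is not modified.
# Each callable receives the frame being built, so ratio_pct can use ratio.
df2 = df.assign(ratio=lambda d: d['data1'] / d['data2'],
                ratio_pct=lambda d: d['ratio'] * 100)

# drop without inplace=True likewise returns a new frame.
df3 = df2.drop(columns=['ratio', 'ratio_pct'])

print(df2)
print(df3)
```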
 

8. Replacing values with replace

In [12]:
data=pd.Series([1,2,3,4,5,6,7,8,9])
data
Out[12]:
0    1
1    2
2    3
3    4
4    5
5    6
6    7
7    8
8    9
dtype: int64
In [13]:
data.replace(9,np.nan,inplace=True)
data
Out[13]:
0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
5    6.0
6    7.0
7    8.0
8    NaN
dtype: float64
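
Note that the dtype changed from int64 to float64, because NaN is a floating-point value. A short sketch of other argument forms replace accepts (the result names a, b and c are illustrative):

```python
import numpy as np
import pandas as pd

data = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])

# A list replaces several values with the same substitute.
a = data.replace([8, 9], 0)

# A dictionary gives each value its own substitute.
b = data.replace({1: 100, 2: 200})

# Replacing with NaN upcasts the integer Series to float64.
c = data.replace(9, np.nan)

print(a.tolist())
print(b.tolist())
print(c.dtype)
```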
 

9. Discretization: binning data into ranges with pd.cut

In [14]:
ages=[15,20,18,25,46,89,66,80]
bins=[10,40,90]
bins_res=pd.cut(ages,bins)  # bin the ages into two groups: (10, 40] and (40, 90]
bins_res
Out[14]:
[(10, 40], (10, 40], (10, 40], (10, 40], (40, 90], (40, 90], (40, 90], (40, 90]]
Categories (2, interval[int64]): [(10, 40] < (40, 90]]
In [15]:
bins_res.labels  # fails: a Categorical has no labels attribute; the integer bin codes are available as bins_res.codes
 
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
 in ()
----> 1 bins_res.labels

AttributeError: 'Categorical' object has no attribute 'labels'
In [16]:
pd.value_counts(bins_res)  # show each bin and the number of values that fall into it
Out[16]:
(40, 90]    4
(10, 40]    4
dtype: int64
In [17]:
pd.cut(ages,[10,30,50,90])  # pass the bin edges [10,30,50,90] directly instead of a bins variable
Out[17]:
[(10, 30], (10, 30], (10, 30], (10, 30], (30, 50], (50, 90], (50, 90], (50, 90]]
Categories (3, interval[int64]): [(10, 30] < (30, 50] < (50, 90]]
In [18]:
group_names=['Youth','Middle','Old']
pd.value_counts(pd.cut(ages,[10,30,50,90],labels=group_names))
Out[18]:
Youth     4
Old       3
Middle    1
dtype: int64
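
By default each bin is open on the left and closed on the right. A minimal sketch of the left-closed variant and of the integer codes that replace the old labels attribute (the names left_closed, codes and counts are illustrative):

```python
import pandas as pd

ages = [15, 20, 18, 25, 46, 89, 66, 80]

# right=False makes each bin closed on the left: [10, 30), [30, 50), [50, 90).
left_closed = pd.cut(ages, [10, 30, 50, 90], right=False)

# .codes holds the integer index of the bin each value fell into.
codes = left_closed.codes.tolist()

# Counting through a Series and sorting by index lists the bins in their natural order.
counts = pd.Series(left_closed).value_counts().sort_index()

print(codes)
print(counts)
```

Whether a boundary value such as 30 lands in the lower or the upper bin depends only on the right argument, so it is worth setting explicitly when the edges are meaningful thresholds.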
 

10. Inspecting missing values

In [19]:
df=pd.DataFrame([range(3),[0,np.nan,0],[0,0,np.nan],range(3)])  # build a frame that contains some missing values
df
Out[19]:
 
  0 1 2
0 0 1.0 2.0
1 0 NaN 0.0
2 0 0.0 NaN
3 0 1.0 2.0
In [20]:
df.isnull()  # locate missing values: True marks a missing entry
Out[20]:
 
  0 1 2
0 False False False
1 False True False
2 False False True
3 False False False
In [21]:
df.isnull().any()  # by default checks column by column (axis=0)
Out[21]:
0    False
1     True
2     True
dtype: bool
In [22]:
df.isnull().any(axis=1)  # axis=1 checks row by row
Out[22]:
0    False
1     True
2     True
3    False
dtype: bool
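
Beyond asking whether anything is missing, it is often useful to count how much. A short sketch, with illustrative variable names, that sums the boolean mask:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([range(3), [0, np.nan, 0], [0, 0, np.nan], range(3)])

# True counts as 1 when summed, so these give the number of missing entries.
missing_per_column = df.isnull().sum()
missing_per_row = df.isnull().sum(axis=1)
total_missing = int(df.isnull().sum().sum())

print(missing_per_column.tolist())
print(missing_per_row.tolist())
print(total_missing)
```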
 

11. Filling missing values

In [23]:
df.fillna(5)  # fill every missing value with 5
Out[23]:
 
  0 1 2
0 0 1.0 2.0
1 0 5.0 0.0
2 0 0.0 5.0
3 0 1.0 2.0
In [24]:
df[df.isnull().any(axis=1)]  # select the rows that contain at least one missing value
Out[24]:
 
  0 1 2
1 0 NaN 0.0
2 0 0.0 NaN
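
A constant such as 5 is rarely the most meaningful fill value. As a sketch of two common alternatives (the names filled and dropped are illustrative), the frame can be filled with each column's mean, or the incomplete rows can be dropped:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([range(3), [0, np.nan, 0], [0, 0, np.nan], range(3)])

# df.mean() skips NaN by default; passing it to fillna fills each column with its own mean.
filled = df.fillna(df.mean())

# dropna() removes every row that contains at least one missing value.
dropped = df.dropna()

print(filled)
print(dropped)
```

Neither call modifies df; both return new frames.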
