详解利用Pandas求解两个DataFrame的差集,交集,并集

模拟数据

In [1]:

```import pandas as pd
```

In [2]:

```df1 = pd.DataFrame({"col1":[1,2,3,4,5],
"col2":[6,7,8,9,10]
})

df2 = pd.DataFrame({"col1":[1,3,7],
"col2":[6,8,10]
})
```

In [3]:

```df1
```

Out[3]:

col1 col2
0 1 6
1 2 7
2 3 8
3 4 9
4 5 10

In [4]:

```df2
```

Out[4]:

col1 col2
0 1 6
1 3 8
2 7 10

差集

方法1：concat + drop_duplicates

In [5]:

```df3 = pd.concat([df1,df2])
df3
```

Out[5]:

col1 col2
0 1 6
1 2 7
2 3 8
3 4 9
4 5 10
0 1 6
1 3 8
2 7 10

In [6]:

```# 结果1

df3.drop_duplicates(["col1","col2"],keep=False)
```

Out[6]:

col1 col2
1 2 7
3 4 9
4 5 10
2 7 10

方法2：append + drop_duplicates

In [7]:

```df4 = df1.append(df2)
df4
```

Out[7]:

col1 col2
0 1 6
1 2 7
2 3 8
3 4 9
4 5 10
0 1 6
1 3 8
2 7 10

In [8]:

```# 结果2

df4.drop_duplicates(["col1","col2"],keep=False)
```

Out[8]:

col1 col2
1 2 7
3 4 9
4 5 10
2 7 10

交集

方法1：merge

In [9]:

```# 结果

# 等效：df5 = pd.merge(df1, df2, how="inner")
df5 = pd.merge(df1,df2)

df5
```

Out[9]:

col1 col2
0 1 6
1 3 8

方法2：concat + duplicated + loc

In [10]:

```df6 = pd.concat([df1,df2])
df6
```

Out[10]:

col1 col2
0 1 6
1 2 7
2 3 8
3 4 9
4 5 10
0 1 6
1 3 8
2 7 10

In [11]:

```s = df6.duplicated(subset=['col1','col2'], keep='first')
s
```

Out[11]:

0    False
1    False
2    False
3    False
4    False
0     True
1     True
2    False
dtype: bool

In [12]:

```# 结果
df8 = df6.loc[s == True]
df8
```

Out[12]:

col1 col2
0 1 6
1 3 8

方法3：concat + groupby + query

In [13]:

```# df6 = pd.concat([df1,df2])

df6
```

Out[13]:

col1 col2
0 1 6
1 2 7
2 3 8
3 4 9
4 5 10
0 1 6
1 3 8
2 7 10

In [14]:

```df9 = df6.groupby(["col1", "col2"]).size().reset_index()
df9.columns = ["col1", "col2", "count"]

df9
```

Out[14]:

col1 col2 count
0 1 6 2
1 2 7 1
2 3 8 2
3 4 9 1
4 5 10 1
5 7 10 1

In [15]:

```df10 = df9.query("count > 1")[["col1", "col2"]]
df10
```

Out[15]:

col1 col2
0 1 6
2 3 8

并集

方法1：concat + drop_duplicates

In [16]:

```df11 = pd.concat([df1,df2])
df11
```

Out[16]:

col1 col2
0 1 6
1 2 7
2 3 8
3 4 9
4 5 10
0 1 6
1 3 8
2 7 10

In [17]:

```# 结果

# df12 = df11.drop_duplicates(subset=["col1","col2"],keep="last")
df12 = df11.drop_duplicates(subset=["col1","col2"],keep="first")
df12
```

Out[17]:

col1 col2
0 1 6
1 2 7
2 3 8
3 4 9
4 5 10
2 7 10

方法2：append + drop_duplicates

In [18]:

```df13 = df1.append(df2)

# df13.drop_duplicates(subset=["col1","col2"],keep="last")
df13.drop_duplicates(subset=["col1","col2"],keep="first")
```

Out[18]:

col1 col2
0 1 6
1 2 7
2 3 8
3 4 9
4 5 10
2 7 10

方法3：merge

In [19]:

```pd.merge(df1,df2,how="outer")
```

Out[19]:

col1 col2
0 1 6
1 2 7
2 3 8
3 4 9
4 5 10
5 7 10