--- jupytext: text_representation: format_name: myst kernelspec: display_name: Python 3 name: python3 --- ```{code-cell} ipython3 :tags: [remove_input] import numpy as np import pandas as pd import matplotlib.pyplot as plt np.random.seed(0) plt.rcParams['font.sans-serif'] = ['SimHei'] plt.rcParams['axes.unicode_minus'] = False np.set_printoptions(suppress = True) ``` # 第六章 ## 零、练一练 ```{admonition} 练一练 请写出以下集合A和集合B的笛卡尔积: - $A=\{0\}$;$B=\{0\}$ - $A=\{1, 2, 3\}$;$B=\{4, 5, 6\}$ - $A=\{\{1,2\}, 3\}$;$B=\{4,\{5,6\}\}$ ``` - $\{(0, 0)\}$ - $\{(1,4),(1,5),(1,6),(2,4),(2,5),(2,6),(3,4),(3,5),(3,6)\}$ - $\{(\{1,2\},4),(\{1,2\},\{5,6\}),(3,4),(3,\{5,6\})\}$ ```{admonition} 练一练 假设左表键的所在列元素列表为[1,2,2,3,3,3],右表键与左表相同,请问用4种方式连接得到的结果表分别具有多少行? ``` 4种连接方式都是$1^2+2^2+3^2=14$行 ```{admonition} 练一练 请构造两张表使它们在用merge()合并时能够通过“1:m”的检查但无法通过“m:1”模式的检查。 ``` ```{code-cell} ipython3 df_left = pd.DataFrame({"A":[1,2], "B":["a","b"]}) df_right = pd.DataFrame({"A":[1,1,2], "C":["c","d","e"]}) ``` ```{code-cell} ipython3 df_left.merge(df_right, on="A", validate="1:m") ``` ```{code-cell} ipython3 :tags: [raises-exception] df_left.merge(df_right, on="A", validate="m:1") ``` ```{admonition} 练一练 join()函数没有实现merge()函数中的validate参数,请构造1个join_with_validate函数,其参数包含df1、df2、on、how和validate,完成与merge()类似的功能。(允许在join_with_validate内部调用join()函数) ``` ```{code-cell} ipython3 # is_unique未在教材中提到,读者可以使用集合去重再判断元素数量的方法 def join_with_validate(df1, df2, on, how="left", validate="m:m"): df1_index_not_unique = not df1.index.is_unique df2_index_not_unique = not df2.index.is_unique if validate=="1:1" and (df1_index_not_unique or df2_index_not_unique): raise ValueError( "Join keys are not unique in dataset; " "not a one-to-one merge" ) if validate=="1:m" and df1_index_not_unique: raise ValueError( "Join keys are not unique in left dataset; " "not a one-to-many merge" ) if validate=="m:1" and df2_index_not_unique: raise ValueError( "Join keys are not unique in right dataset; " "not a many-to-one merge" ) return df1.join(df2, on=on, how=how) ``` ```{admonition} 练一练 给定两个具有相同行列索引的DataFrame,设s1和s2分别是传入规则函数的左表列和右表列,请依次根据如下规则更新进行组合: - 当s1中的元素值为0时,使用s2对应位置元素更新,否则保持不变。 - 当s1中的元素值超过s2的均值时,使用s1+s2的对应位置元素更新,否则使用s1-s2的对应位置元素更新。 ``` - 1 ```{code-cell} ipython3 df1 = pd.DataFrame({"A":[0,2],"B":[4,1]}) df2 = pd.DataFrame({"A":[1,0],"B":[3,-1]}) ``` ```{code-cell} ipython3 df1.combine(df2, lambda s1, s2: s1.mask(s1==0, s2)) ``` - 2 ```{code-cell} ipython3 df1.combine(df2, lambda s1, s2: (s1-s2).mask(s1>s2.mean(), s1+s2)) ``` ## 一、合并员工信息表 在data/ch6/employee文件夹下,存放了某公司的员工信息。其中,salary目录下存放了从2018年1月至2020年12月时间段内,每月员工的基本工资数额,award目录下存放了每月员工的奖金数额,员工最终的工资等于基本工资加上奖金。(注:本题中数据均为随机生成,与现实无关) - info_a.csv、info_b.csv和info_c.csv中分别存放了公司员工的不同信息,请提取ID从“ID-000001”至“ID-025000”对应员工的邮箱、性别、年龄和学历,并将它们合并成一张表,索引为员工ID。 - 对所有ID在上题范围内的员工统计从18年1月至20年12月每个季度的工资总数,并将这12个季度的结果作为新的列添加至上一问的结果表中。统计2018年第1季度的结果时,列名即为“2018-Q1”,其他季度对应列的名字以此类推。 ```text 【解答】 ``` - 1 ```{code-cell} ipython res = pd.concat( [ pd.read_csv( "data/ch6/employee/info_%s.csv"%i ).set_index("ID") for i in list("abc") ], axis=1 ) res = res.reindex(["ID-%06d"%(i) for i in range(1, 25001)]) res = res[["邮箱","性别","年龄","学历"]] res.head() ``` - 2 ```{code-cell} ipython data = pd.concat( [ pd.read_csv( "data/ch6/employee/award/%d-%02d.csv"%(y,m), index_col="ID" ) + pd.read_csv( "data/ch6/employee/salary/%d-%02d.csv"%(y,m), index_col="ID" ) for y in range(2018,2021) for m in range(1,13) ], axis=1 ) values = data.values.reshape(25000, -1, 3).sum(-1) res_values = pd.DataFrame( values, index=res.index, columns=["%d-Q%d"%(y,q) for y in range(2018,2021) for q in range(1,5)] ) res = pd.concat([res, res_values], axis=1) res.iloc[:5, :8] # 展示部分 ``` ## 二、实现join函数 请按照如下要求实现join()函数: - 函数的调用方式为join(df1, df2, how, lsuffix, rsuffix) - 传入的df1和df2参数都为单级索引的DataFrame - how参数支持left、right、inner、outer和cross - 给出测试样例,并与pandas中join()的运行结果进行对比 - 在实现过程中允许使用pd.concat() ```{note} 由于合并时可能产生缺失值,导致自定义join函数和pandas内置的join函数在列的dtype上会产生差别,此时使用equals()函数在这种情况下不能进行判定,可以使用pd.testing.assert_frame_equal(my_result, pandas_result, check_dtype=False)来进行对比。assert_frame_equal()通过check_dtype参数可以关闭列类型的检查,它在两个表存在差异时会报错,在两个表相同时不进行任何操作。 ``` ```text 【解答】 ``` ```{code-cell} ipython3 def join(df1, df2, how, lsuffix=None, rsuffix=None): idx1, idx2 = df1.columns, df2.columns idx_intersect = idx1.intersection(idx2) if len(idx_intersect) > 0: if lsuffix==None or rsuffix==None: raise ValueError( "columns overlap but get suffix not specified: " + str(idx_intersect)) df1 = df1.rename(columns={i:i+lsuffix for i in idx_intersect}) df2 = df2.rename(columns={i:i+rsuffix for i in idx_intersect}) idx1, idx2 = df1.index, df2.index idx = idx1.union(idx2).unique().sort_values() columns = pd.Index(df1.columns.tolist() + df2.columns.tolist()) res = pd.DataFrame(columns=columns, index=pd.Index([], name=idx1.name)) for x in idx: _idx1, _idx2 = idx1 == x, idx2 == x in1, in2 = bool(_idx1.sum()), bool(_idx2.sum()) if in1 and in2: _df1, _df2 = df1.loc[[x]], df2.loc[[x]] if how in ["right"]: for i in range(_df2.shape[0]): _res = pd.concat([_df2.iloc[[i]]] * _df1.shape[0]) _res = pd.concat([_df1, _res], axis=1) res = pd.concat([res, _res]) else: for i in range(_df1.shape[0]): _res = pd.concat([_df1.iloc[[i]]] * _df2.shape[0]) _res = pd.concat([_res, _df2], axis=1) res = pd.concat([res, _res]) elif not in1 and how in ["right", "outer"]: _res = df2.loc[[x]].copy() for c in df1.columns: _res[c] = np.nan res = pd.concat([res, _res.reindex(columns, axis=1)]) elif not in2 and how in ["left", "outer"]: _res = df1.loc[[x]].copy() for c in df2.columns: _res[c] = np.nan res = pd.concat([res, _res]) return res ``` ```{code-cell} ipython3 my_res_left = join(df1, df2, how="left", lsuffix="_x", rsuffix="_y") my_res_right = join(df1, df2, how="right", lsuffix="_x", rsuffix="_y") my_res_inner = join(df1, df2, how="inner", lsuffix="_x", rsuffix="_y") my_res_outer = join(df1, df2, how="outer", lsuffix="_x", rsuffix="_y") pd_res_left = df1.join(df2, lsuffix="_x", rsuffix="_y", how="left") pd_res_right = df1.join(df2, lsuffix="_x", rsuffix="_y", how="right") pd_res_inner = df1.join(df2, lsuffix="_x", rsuffix="_y", how="inner") pd_res_outer = df1.join(df2, lsuffix="_x", rsuffix="_y", how="outer") ``` ```{code-cell} ipython3 from pandas.testing import assert_frame_equal assert_frame_equal(my_res_left, pd_res_left, check_dtype=False) assert_frame_equal(my_res_right, pd_res_right, check_dtype=False) assert_frame_equal(my_res_inner, pd_res_inner, check_dtype=False) assert_frame_equal(my_res_outer, pd_res_outer, check_dtype=False) ``` ## 三、条件连接 在本章介绍的关系型连接中,merge()、join()和concat()都是等值连接,即每一个左表键中的label只会与右表键中完全相同的label进行笛卡尔积的匹配。现在,我们希望左表中的键只要与右表中的键满足一定条件就进行匹配,下面给出一种根据大小关系匹配的例子。 假设df1和df2的构造如下: ```{code-cell} ipython3 df1 = pd.DataFrame({"Key":[0,1,1,2], "Col1":[10,20,30,40]}) df1 ``` ```{code-cell} ipython3 df2 = pd.DataFrame({"Key":[1,1,2,3], "Col2":[50,60,70,80]}) df2 ``` 我们希望通过conditional_merge()函数对df1和df2进行左连接,连接规则是左键元素值不得小于右键元素值,即conditional_merge(df1, df2, on="Key", how="left", rule="x>=y")的期望结果如下所示: ```{code-cell} ipython3 :tags: [remove_input] pd.DataFrame( { "Key_x": [0,1,1,1,1,2,2,2], "Col1": [10,20,20,30,30,40,40,40], "Key_y": [np.nan,1,1,1,1,1,1,2], "Col2": [np.nan,50,60,50,60,50,60,70] } ) ``` - 实现上述根据大小关系连接的conditional_merge()函数,其中rule参数可取"x>=y"、"x>y"、"x==y"、"x!=y"、"x<=y"和"x=y") ``` - 2 ```{code-cell} ipython3 df1 = pd.read_csv("data/ch6/left.csv") df2 = pd.read_csv("data/ch6/right.csv") def spherical_merge(df1, df2, d=200, on=["Longitude", "Latitude"]): if on[0] not in df1.columns or on[0] not in df2.columns: raise ValueError("Longitude not in df1's columns or df2's columns.") if on[1] not in df1.columns or on[1] not in df2.columns: raise ValueError("Latitude not in df1's columns or df2's columns.") distance = get_distance(df1, df2) res = pd.concat( [ df1.iloc[[i]].merge(df2.loc[distance[i] <= d], how="cross") for i in range(df1.shape[0]) ] ) return res.reset_index(drop=True) res = spherical_merge(df1, df2, 30) res.head() ```