第三章

零、练一练

练一练

当两个“*”指代的行元素和列元素在DataFrame的行索引和列索引中都为重复值,此时结果是什么类型?

此时结果是DataFrame

df = pd.DataFrame([[1,2],[3,4]], columns=["A"]*2, index=["B"]*2)
df.loc["B", "A"]

练一练

完成以下索引的相关操作:

  • 选出A学校和B学校的大一女生。

  • select_dtypes()是一个实用函数,它能够从表中选出或剔除相应类型的列。若要选出所有数值型的列,只需使用.select_dtypes(include='number');若要剔除所有布尔类型和int64类型的列,只需使用.select_dtypes(exclude=['bool', 'int64'])。请利用布尔列表选择的方法结合DataFrame的dtypes属性在learn_pandas数据集上实现这个功能。

  • 1

df = pd.read_csv('data/learn_pandas.csv')
res = df.loc[(df.School.isin(['A', 'B'])) & (df.Gender == 'Female') & (df.Grade == 'Freshman')]
res.head()
School Grade Name Gender Height Weight Transfer Test_Number Test_Date Time_Record
0 A Freshman Gaopeng Yang Female 158.9 46.0 N 1 2019/10/5 0:04:34
6 A Freshman Qiang Chu Female 162.5 52.0 N 1 2019/12/12 0:03:53
32 B Freshman Gaopeng Shi Female 162.9 48.0 N 1 2019/9/12 0:04:58
45 B Freshman Quan Chu Female 154.7 43.0 N 1 2019/11/28 0:04:47
57 B Freshman Changquan Chu Female 159.6 45.0 N 2 2019/12/9 0:04:18
  • 2

df = pd.DataFrame({"A": [1,2,3], "B":[1.5,2.5,3.5], "C":[True]*3, "D": list("abc")})
df.select_dtypes(exclude=["bool", "int64"])
B D
0 1.5 a
1 2.5 b
2 3.5 c
df.loc[:, (df.dtypes!="bool")&(df.dtypes!="int64")]
B D
0 1.5 a
1 2.5 b
2 3.5 c
df.select_dtypes(include=["number"])
A B
0 1 1.5
1 2 2.5
2 3 3.5
df.loc[:, (df.dtypes=="int32")
         |(df.dtypes=="int64")
         |(df.dtypes=="float32")
         |(df.dtypes=="float64")]
A B
0 1 1.5
1 2 2.5
2 3 3.5

练一练

利用query()选出A学校和B学校的大一女生。

df = pd.read_csv('data/learn_pandas.csv')
res = df.query("(School in ['A', 'B']) and"
               "(Gender == 'Female') and"
               "(Grade == 'Freshman')")
res.head()
School Grade Name Gender Height Weight Transfer Test_Number Test_Date Time_Record
0 A Freshman Gaopeng Yang Female 158.9 46.0 N 1 2019/10/5 0:04:34
6 A Freshman Qiang Chu Female 162.5 52.0 N 1 2019/12/12 0:03:53
32 B Freshman Gaopeng Shi Female 162.9 48.0 N 1 2019/9/12 0:04:58
45 B Freshman Quan Chu Female 154.7 43.0 N 1 2019/11/28 0:04:47
57 B Freshman Changquan Chu Female 159.6 45.0 N 2 2019/12/9 0:04:18

练一练

3.2.1节开头处利用from_product构造了df_multi,请分别利用from_tuples和from_arrays构造一个相同的DataFrame。

只需检查他们构造的索引是否一致

multi_index = pd.MultiIndex.from_product(
    [list('ABCD'),["Female", "Male"]], names=('School', 'Gender'))
multi_index_from_tuples = pd.MultiIndex.from_tuples(
    [
        (i, j)
        for i in list("ABCD")
        for j in ["Female", "Male"]
    ],
    names=('School', 'Gender')
)
multi_index_from_tuples.equals(multi_index)
True
multi_index_from_arrays = pd.MultiIndex.from_arrays(
    [
        list("AABBCCDD"),
        ["Female", "Male"] * 4
    ],
    names=('School', 'Gender')
)
multi_index_from_tuples.equals(multi_index)
True

练一练

尝试在rename_axis()中使用函数完成与例子中一样的功能,即把Upper和Other分别替换为Changed_row和Changed_col。

np.random.seed(0)
L1,L2,L3 = ['A','B'],['a','b'],['alpha','beta']
mul_index1 = pd.MultiIndex.from_product([L1,L2,L3],
                names=('Upper', 'Lower','Extra'))
L4,L5,L6 = ['C','D'],['c','d'],['cat','dog']
mul_index2 = pd.MultiIndex.from_product([L4,L5,L6],
                names=('Big', 'Small', 'Other'))
df_ex = pd.DataFrame(np.random.randint(-9,10,(8,8)),
                        index=mul_index1,
                        columns=mul_index2)
df_ex.head()
Big C D
Small c d c d
Other cat dog cat dog cat dog cat dog
Upper Lower Extra
A a alpha 3 6 -9 -6 -6 -2 0 9
beta -5 -3 3 -8 -3 -2 5 8
b alpha -4 4 -1 0 7 -4 6 6
beta -9 9 -6 8 5 -2 -9 -8
B a alpha 0 -9 1 -6 2 9 -7 -9
df_ex.rename_axis(index=lambda x: "Changed_row" if x=="Upper" else x,
                  columns=lambda x: "Changed_col" if x=="Big" else x,)
Changed_col C D
Small c d c d
Other cat dog cat dog cat dog cat dog
Changed_row Lower Extra
A a alpha 3 6 -9 -6 -6 -2 0 9
beta -5 -3 3 -8 -3 -2 5 8
b alpha -4 4 -1 0 7 -4 6 6
beta -9 9 -6 8 5 -2 -9 -8
B a alpha 0 -9 1 -6 2 9 -7 -9
beta -9 -5 -4 -3 -1 8 6 -5
b alpha 0 1 -8 -8 -2 0 -6 -3
beta 2 5 9 -9 5 -6 3 1

练一练

将上述df_temp的内层行索引合并到与外层行索引(用下划线连接)后删除内层行索引,即结果表压缩为两层列索引,同时再把外层行索引的索引名重命名为“Big_Other”。

df_temp = df_ex.copy()
new_idx = df_temp.index.map(lambda x: (x[0]+'-'+ x[1]+'-'+ x[2]))
df_temp.index = new_idx
new_idx = df_temp.index.map(lambda x:tuple(x.split('-')))
df_temp.index = new_idx
df_temp # 构造方式与书上相同
Big C D
Small c d c d
Other cat dog cat dog cat dog cat dog
A a alpha 3 6 -9 -6 -6 -2 0 9
beta -5 -3 3 -8 -3 -2 5 8
b alpha -4 4 -1 0 7 -4 6 6
beta -9 9 -6 8 5 -2 -9 -8
B a alpha 0 -9 1 -6 2 9 -7 -9
beta -9 -5 -4 -3 -1 8 6 -5
b alpha 0 1 -8 -8 -2 0 -6 -3
beta 2 5 9 -9 5 -6 3 1
df_temp.columns = df_temp.columns.map(lambda x: (x[0]+"_"+x[2], x[1], x[2]))
df_temp.droplevel(-1, axis=1).rename_axis(columns={"Big": "Big_Other"})
Big_Other C_cat C_dog C_cat C_dog D_cat D_dog D_cat D_dog
Small c c d d c c d d
A a alpha 3 6 -9 -6 -6 -2 0 9
beta -5 -3 3 -8 -3 -2 5 8
b alpha -4 4 -1 0 7 -4 6 6
beta -9 9 -6 8 5 -2 -9 -8
B a alpha 0 -9 1 -6 2 9 -7 -9
beta -9 -5 -4 -3 -1 8 6 -5
b alpha 0 1 -8 -8 -2 0 -6 -3
beta 2 5 9 -9 5 -6 3 1

一、实现sample()函数

如果把DataFrame的每一行看作一个样本,或把每一列看作一个特征,再把整个DataFrame看作总体,想要对样本或特征进行随机抽样就可以用sample()函数。有时在拿到大型数据集后,想要对统计特征进行计算来了解数据的大致分布,但是这很费时间。同时,由于许多统计特征在等概率不放回的简单随机抽样条件下,是总体统计特征的无偏估计,比如样本均值和总体均值,那么就可以先从整张表中抽出一部分来做近似估计。

sample()函数中的主要参数为n、axis、frac、replace和weights,前三个分别是指抽样数量、抽样的方向(0为行、1为列)和抽样比例(0.3则为从总体中抽出30%的样本)。replace和weights分别是指是否放回和每个样本的抽样相对概率,replace=True则表示有放回抽样。例如,对下面构造的df_sample以value值的相对大小为抽样概率进行有放回抽样,抽样数量为3。

df_sample = pd.DataFrame({'id': list('abcde'), 'value': [1, 2, 3, 4, 90]})
df_sample
id value
0 a 1
1 b 2
2 c 3
3 d 4
4 e 90
df_sample.sample(3, replace = True, weights = df_sample.value)
id value
4 e 90
4 e 90
4 e 90
  • 请分别利用loc和iloc来实现带有n、axis和frac参数的sample()函数。

  • 请结合numpy中的choice()函数在上一问的基础上实现replace参数和weights参数的功能。

【解答】
def sample(df, n=None, frac=None, replace=None, weights=None, random_state=None, axis=None):
    """
    n:表示抽样个数
    frac:抽样比例
    weights:每个样本抽中的相对概率
    random_state:随机种子
    axis:抽样方向(0表示对行进行抽样,1表示对列进行抽样)
    """
    temp_df = df.copy()
    if n != None and frac != None:
        raise ValueError("输入值n和输入值frac只能存在一个!")
    if n == None:
        n = int(df.shape[0] * frac)
    if isinstance(weights, list):
        weights = np.array(weights)
    if not isinstance(weights, np.ndarray):
        weights = np.ones(df.shape[axis]) / df.shape[axis]
    idx = np.random.choice(range(df.shape[axis]), size=n, replace=replace, p=weights/weights.sum())
    return temp_df.iloc[:, idx] if axis else temp_df.iloc[idx]

二、公司员工数据的索引操作

现有一份公司员工数据集:

df = pd.read_csv('data/ch3/company.csv')
  • 分别只使用query()和loc选出年龄不超过四十岁且工作部门为Dairy或Bakery的男性。

  • 选出员工ID号为奇数所在行的第1、第3和倒数第2列。

  • 按照以下步骤进行索引操作。

    • 把后三列设为索引后交换内外两层

    • 恢复中间层索引

    • 修改外层索引名为Gender

    • 用下划线合并两层行索引

    • 把行索引拆分为原状态

    • 修改索引名为原表名称

    • 恢复默认索引并将列保持为原表的相对位置

  • 1

df = pd.read_csv('data/ch3/company.csv')
dpt = ['Dairy', 'Bakery']
df.query("(age <= 40)&(department == @dpt)&(gender=='M')").head(3)
EmployeeID birthdate_key age city_name department job_title gender
3611 5791 1/14/1975 40 Kelowna Dairy Dairy Person M
3613 5793 1/22/1975 40 Richmond Bakery Baker M
3615 5795 1/30/1975 40 Nanaimo Dairy Dairy Person M
df.loc[
    (df.age <= 40)
    & (df.department.isin(dpt))
    & (df.gender == "M")
].head(3)
EmployeeID birthdate_key age city_name department job_title gender
3611 5791 1/14/1975 40 Kelowna Dairy Dairy Person M
3613 5793 1/22/1975 40 Richmond Bakery Baker M
3615 5795 1/30/1975 40 Nanaimo Dairy Dairy Person M
  • 2

df.iloc[(df.EmployeeID%2==1).values,[0,2,-2]].head()
EmployeeID age job_title
1 1319 58 VP Stores
3 1321 56 VP Human Resources
5 1323 53 Exec Assistant, VP Stores
6 1325 51 Exec Assistant, Legal Counsel
8 1329 48 Store Manager
  • 3

df_op = df.copy()
df_op = df_op.set_index(df_op.columns[-3:].tolist()).swaplevel(0,2,axis=0)
df_op = df_op.reset_index(level=1)
df_op = df_op.rename_axis(index={'gender':'Gender'})
df_op.index = df_op.index.map(lambda x:'_'.join(x))
df_op.index = df_op.index.map(lambda x:tuple(x.split('_')))
df_op = df_op.rename_axis(index=['gender', 'department'])
df_op = df_op.reset_index().reindex(df.columns, axis=1)
df_op.equals(df)
True

三、巧克力评价数据的索引操作

现有一份关于巧克力评价的数据集:

df = pd.read_csv('data/ch3/chocolate.csv')
  • 把列索引名中的换行符替换为空格。

  • 巧克力Rating评分为1至5,每0.25分一档,请选出2.75分及以下且可可含量Cocoa Percent高于中位数的样本。

  • 选出Review Date在2012年之后且Company Location不属于France、Canada、Amsterdam或Belgium的样本。

  • 1

df = pd.read_csv('data/ch3/chocolate.csv')
df.columns = df.columns.map(lambda x: ' '.join(x.split('\n')))
df.head()
Company Review Date Cocoa Percent Company Location Rating
0 A. Morin 2016 63% France 3.75
1 A. Morin 2015 70% France 2.75
2 A. Morin 2015 70% France 3.00
3 A. Morin 2015 70% France 3.50
4 A. Morin 2015 70% France 3.50
  • 2

df['Cocoa Percent'] = df['Cocoa Percent'].apply(lambda x:float(x[:-1])/100)
res = df.query('(Rating<3)&(`Cocoa Percent`>`Cocoa Percent`.median())')
res.head(3)
Company Review Date Cocoa Percent Company Location Rating
33 Akesson's (Pralus) 2010 0.75 Switzerland 2.75
34 Akesson's (Pralus) 2010 0.75 Switzerland 2.75
36 Alain Ducasse 2014 0.75 France 2.75
  • 3

res = df.loc[
    (df["Review Date"]>2012)
    & ~df["Company Location"].isin(
        ["France", "Canada", "Amsterdam", "Belgium"])]
res.head()
Company Review Date Cocoa Percent Company Location Rating
23 Acalli 2015 0.7 U.S.A. 3.75
24 Acalli 2015 0.7 U.S.A. 3.75
40 Alexandre 2017 0.7 Netherlands 3.50
41 Alexandre 2017 0.7 Netherlands 3.50
42 Alexandre 2017 0.7 Netherlands 3.50