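Chapter 2

0. Exercises

Exercise

Remove index=False from the code above, or set index=True, and compare how the result differs.

The index of the original table is written as the first column. If the original index has a name (index.name), that name becomes the column name; otherwise the column is read back as Unnamed: 0. If the original table has a multi-level index (introduced in Chapter 3), the number of added columns equals the number of index levels.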

import numpy as np
import pandas as pd

my_table = pd.DataFrame({"A":[1,2]})
my_table.to_csv("my_csv.csv")
pd.read_csv("my_csv.csv")

	Unnamed: 0	A
0	0	1
1	1	2

index = pd.Series(["a", "b"], name="my_index")
my_table = pd.DataFrame({"A":[1,2]},index=index)
my_table.to_csv("my_csv.csv")
pd.read_csv("my_csv.csv")

	my_index	A
0	a	1
1	b	2

index = pd.MultiIndex.from_tuples([("A", "B"), ("C", "D")], names=["index_1", "index_0"])
my_table = pd.DataFrame({"A":[1,2]},index=index)
my_table.to_csv("my_csv.csv")
pd.read_csv("my_csv.csv")

	index_1	index_0	A
0	A	B	1
1	C	D	2
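The outputs above show the saved index coming back as ordinary columns. A minimal sketch of how the index could be restored on reading, assuming the index_col parameter of read_csv is acceptable here; the in-memory io.StringIO buffer is only an illustration device to avoid writing a file:

```python
import io

import pandas as pd

# Single-level named index: written as one extra column named "my_index".
index = pd.Series(["a", "b"], name="my_index")
my_table = pd.DataFrame({"A": [1, 2]}, index=index)

buffer = io.StringIO(my_table.to_csv())
# index_col=0 asks read_csv to use the first column as the index again.
restored = pd.read_csv(buffer, index_col=0)
print(restored)

# Two-level index: two extra columns are written, so two columns are read back as index.
multi_index = pd.MultiIndex.from_tuples(
    [("A", "B"), ("C", "D")], names=["index_1", "index_0"]
)
multi_table = pd.DataFrame({"A": [1, 2]}, index=multi_index)
restored_multi = pd.read_csv(io.StringIO(multi_table.to_csv()), index_col=[0, 1])
print(restored_multi)
```

With index_col set, the index names survive the round trip and no extra data column appears.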

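Exercise

In the df above, what happens if the 'col_0' key of the data dictionary maps not to a list but to a Series whose index is the same as the index of df? What happens if its index differs from the index of df?

When the indexes match, the values of the Series are filled into the DataFrame directly. When they differ and the Series index is unique, each row label of the DataFrame that appears in the Series index is filled with the corresponding Series element, and every other row is set to a missing value. When they differ and the Series index contains duplicates, an error is raised.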

index = ['row_%d'%i for i in range(3)]
df = pd.DataFrame(
  data={
    'col_0': pd.Series([1,2,3], index=index),
    'col_1':list('abc'),
    'col_2': [1.2, 2.2, 3.2]
  },
  index=index
)
df

	col_0	col_1	col_2
row_0	1	a	1.2
row_1	2	b	2.2
row_2	3	c	3.2

df = pd.DataFrame(
  data={
    'col_0': pd.Series([1,2,3], index=["row_3","row_2","row_1"]),
    'col_1':list('abc'),
    'col_2': [1.2, 2.2, 3.2]
  },
  index=index
)
df

	col_0	col_1	col_2
row_0	NaN	a	1.2
row_1	3.0	b	2.2
row_2	2.0	c	3.2

df = pd.DataFrame(
  data={
    'col_0': pd.Series([1,2,3], index=["row_2","row_1","row_2"]),
    'col_1':list('abc'),
    'col_2': [1.2, 2.2, 3.2]
  },
  index=index
)
df

C:\Users\gyh\AppData\Local\Temp\ipykernel_19428\2736907453.py:1: FutureWarning: reindexing with a non-unique Index is deprecated and will raise in a future version.
  df = pd.DataFrame(

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [7], in <module>
----> 1 df = pd.DataFrame(
 data={
   'col_0': pd.Series([1,2,3], index=["row_2","row_1","row_2"]),
   'col_1':list('abc'),
   'col_2': [1.2, 2.2, 3.2]
 },
 index=index
)
df

File ~\miniconda3\envs\final\lib\site-packages\pandas\core\frame.py:637, in DataFrame.__init__(self, data, index, columns, dtype, copy)
   mgr = self._init_mgr(
       data, axes={"index": index, "columns": columns}, dtype=dtype, copy=copy
   )
elif isinstance(data, dict):
   # GH#38939 de facto copy defaults to False only in non-dict cases
--> 637     mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
elif isinstance(data, ma.MaskedArray):
   import numpy.ma.mrecords as mrecords

File ~\miniconda3\envs\final\lib\site-packages\pandas\core\internals\construction.py:502, in dict_to_mgr(data, index, columns, dtype, typ, copy)
   arrays = [
       x
       if not hasattr(x, "dtype") or not isinstance(x.dtype, ExtensionDtype)
       else x.copy()
       for x in arrays
   ]
   # TODO: can we get rid of the dt64tz special case above?
--> 502 return arrays_to_mgr(arrays, columns, index, dtype=dtype, typ=typ, consolidate=copy)

File ~\miniconda3\envs\final\lib\site-packages\pandas\core\internals\construction.py:125, in arrays_to_mgr(arrays, columns, index, dtype, verify_integrity, typ, consolidate)
       index = ensure_index(index)
   # don't force copy because getting jammed in an ndarray anyway
--> 125     arrays = _homogenize(arrays, index, dtype)
   # _homogenize ensures
   #  - all(len(x) == len(index) for x in arrays)
   #  - all(x.ndim == 1 for x in arrays)
   (...)

else:
   index = ensure_index(index)

File ~\miniconda3\envs\final\lib\site-packages\pandas\core\internals\construction.py:607, in _homogenize(data, index, dtype)
       val = val.astype(dtype, copy=False)
   if val.index is not index:
       # Forces alignment. No need to copy data since we
       # are putting it into an ndarray later
--> 607         val = val.reindex(index, copy=False)
   val = val._values
else:

File ~\miniconda3\envs\final\lib\site-packages\pandas\core\series.py:4669, in Series.reindex(self, *args, **kwargs)
       raise TypeError(
           "'index' passed as both positional and keyword argument"
       )
   kwargs.update({"index": index})
-> 4669 return super().reindex(**kwargs)

File ~\miniconda3\envs\final\lib\site-packages\pandas\core\generic.py:4974, in NDFrame.reindex(self, *args, **kwargs)
   return self._reindex_multi(axes, copy, fill_value)
# perform the reindex on the axes
-> 4974 return self._reindex_axes(
   axes, level, limit, tolerance, method, fill_value, copy
).__finalize__(self, method="reindex")

File ~\miniconda3\envs\final\lib\site-packages\pandas\core\generic.py:4994, in NDFrame._reindex_axes(self, axes, level, limit, tolerance, method, fill_value, copy)
new_index, indexer = ax.reindex(
   labels, level=level, limit=limit, tolerance=tolerance, method=method
)
axis = self._get_axis_number(a)
-> 4994 obj = obj._reindex_with_indexers(
   {axis: [new_index, indexer]},
   fill_value=fill_value,
   copy=copy,
   allow_dups=False,
)
# If we've made a copy once, no need to make another one
copy = False

File ~\miniconda3\envs\final\lib\site-packages\pandas\core\generic.py:5040, in NDFrame._reindex_with_indexers(self, reindexers, fill_value, copy, allow_dups)
   indexer = ensure_platform_int(indexer)
# TODO: speed up on homogeneous DataFrame objects (see _reindex_multi)
-> 5040 new_data = new_data.reindex_indexer(
   index,
   indexer,
   axis=baxis,
   fill_value=fill_value,
   allow_dups=allow_dups,
   copy=copy,
)
# If we've made a copy once, no need to make another one
copy = False

File ~\miniconda3\envs\final\lib\site-packages\pandas\core\internals\managers.py:679, in BaseBlockManager.reindex_indexer(self, new_axis, indexer, axis, fill_value, allow_dups, copy, consolidate, only_slice, use_na_proxy)
# some axes don't allow reindexing with dups
if not allow_dups:
--> 679     self.axes[axis]._validate_can_reindex(indexer)
if axis >= self.ndim:
   raise IndexError("Requested axis not found in manager")

File ~\miniconda3\envs\final\lib\site-packages\pandas\core\indexes\base.py:4107, in Index._validate_can_reindex(self, indexer)
# trying to reindex on an axis with duplicates
if not self._index_as_unique and len(indexer):
-> 4107     raise ValueError("cannot reindex on an axis with duplicate labels")

ValueError: cannot reindex on an axis with duplicate labels
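The traceback shows that the constructor aligns each Series by calling reindex on it. The same behavior can be reproduced directly on the Series, without building a DataFrame. This is a small sketch of that alignment step, not a description of everything the constructor does:

```python
import numpy as np
import pandas as pd

index = ["row_%d" % i for i in range(3)]

# Unique but different labels: matching labels are kept, the rest become NaN.
unique = pd.Series([1, 2, 3], index=["row_3", "row_2", "row_1"])
aligned = unique.reindex(index)
print(aligned)

# Duplicate labels: the alignment is ambiguous, so pandas raises ValueError.
duplicated = pd.Series([1, 2, 3], index=["row_2", "row_1", "row_2"])
try:
    duplicated.reindex(index)
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)
```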

Exercise

What is the difference between the types of the results returned by df['col_0'] and df[['col_0']]?

The former is a Series; the latter is a DataFrame.
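A quick check of this answer on a small hypothetical table (the column values are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"col_0": [1, 2, 3], "col_1": list("abc")})

single = df["col_0"]    # a single label selects one column
double = df[["col_0"]]  # a list of labels selects a sub-table

print(type(single).__name__, single.ndim, single.shape)
print(type(double).__name__, double.ndim, double.shape)
```

The one-dimensional result has no column axis, while the two-dimensional result keeps 'col_0' as its only column.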

Exercise

Given a DataFrame, construct its transpose without using “.T” on the DataFrame.

df = pd.DataFrame({"A": [1,2,3], "B": [4,5,6]}, index=list("abc"))

df_T = pd.DataFrame(df.values.T, index=df.columns, columns=df.index)

df.T.equals(df_T)

True

Exercise

The body mass index (BMI) is weight (in kg) divided by the square of height (in m). Find the name of the student with the highest BMI.

df = pd.read_csv('data/learn_pandas.csv')
df.T[(df.Weight / (df.Height/100) ** 2).idxmax()]["Name"]

'Chengpeng Zhou'

In fact, after studying Chapter 3, you can index directly with loc:

df.loc[(df.Weight / (df.Height/100) ** 2).idxmax(), "Name"]

'Chengpeng Zhou'
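Since the answer depends on data/learn_pandas.csv, here is the same computation on a few hypothetical records (names and measurements are invented), with Height in cm and Weight in kg as the solution above assumes:

```python
import pandas as pd

# Hypothetical records used only to illustrate the computation.
toy = pd.DataFrame({
    "Name": ["Student A", "Student B", "Student C"],
    "Height": [160.0, 175.0, 180.0],
    "Weight": [50.0, 80.0, 70.0],
})

bmi = toy.Weight / (toy.Height / 100) ** 2  # convert cm to m before squaring
best_row = bmi.idxmax()                     # index label of the largest BMI
print(toy.loc[best_row, "Name"], round(bmi[best_row], 2))
```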

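Exercise

clip() can only truncate out-of-bound values to the boundary values. How can values outside the bounds be replaced with custom values instead?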

s = pd.Series(np.arange(5))
s.clip(1, 3)

0    1
1    1
2    2
3    3
4    3
dtype: int32

small, big = -999, 999
s.where(s<=3, big).where(s>=1, small)

0   -999
1      1
2      2
3      3
4    999
dtype: int32
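An equivalent formulation, shown as an alternative rather than the intended answer, uses mask, which replaces values where the condition is True (the complement of where):

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(5))
small, big = -999, 999

# where keeps values that satisfy the condition and replaces the others.
with_where = s.where(s <= 3, big).where(s >= 1, small)
# mask replaces values that satisfy the condition and keeps the others.
with_mask = s.mask(s > 3, big).mask(s < 1, small)

print(with_mask.tolist())
```

Both chains evaluate their conditions on the original s, so the second replacement does not interfere with the first.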

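Exercise

NumPy also has a function with the same name, np.diff(). Does it do the same thing as diff in pandas? Consult the documentation.

No. In NumPy, the second argument n is the order of the difference: np.diff(a, n) applies the first-order difference n times. In pandas, diff(n) subtracts the element n positions earlier: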

s = pd.Series([1,3,7,5,3])
np.diff(s.values, 3)

array([-8,  6], dtype=int64)

s.diff(3).values

array([nan, nan, nan,  4.,  0.])
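To make the relationship concrete, the following sketch checks two facts on the same data: the first-order results agree once the leading NaN is dropped, and the third-order NumPy difference matches three chained first-order differences in pandas.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 7, 5, 3])

# Order 1: both compute x[i] - x[i-1]; pandas pads the first position with NaN.
first_np = np.diff(s.values, 1)
first_pd = s.diff(1).dropna().values

# Order 3 in NumPy equals three chained first-order differences in pandas.
third_np = np.diff(s.values, 3)
third_pd = s.diff().diff().diff().dropna().values

print(first_np, first_pd)
print(third_np, third_pd)
```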

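Exercise

The windows of rolling objects slide downward by default, so each window ends at the current element. In some cases a user needs a window that slides in the opposite direction. For example, a reverse sum with window 2 over [1,2,3] gives [3,5,NaN]. How can this be implemented?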

s = pd.Series([1,2,3])
s[::-1].rolling(2).sum()[::-1]

0    3.0
1    5.0
2    NaN
dtype: float64
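Another way to get the same result, offered as an alternative sketch, is to shift the ordinary rolling result. The reverse window at row t covers rows t to t+n-1, which is exactly the ordinary window that ends at row t+n-1:

```python
import pandas as pd

s = pd.Series([1, 2, 3])
n = 2  # window size

# Approach from the solution: reverse, roll, reverse back.
reversed_twice = s[::-1].rolling(n).sum()[::-1]

# Alternative: move each ordinary window result up by n-1 rows.
shifted = s.rolling(n).sum().shift(-(n - 1))

print(shifted.tolist())
```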

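1. Organizing a clothing store's product data

data/ch2/clothing_store.csv records product information for a clothing store. Each product has a first-level category (type_1), a second-level category (type_2), a purchase price (buy_price), a sale price (sale_price), and a unique product ID (product_id).

1. Profit is the difference between the sale price and the purchase price. Find the average profit of the products.
2. From the original table, construct a Series of the same length whose index is the product ID and whose values are product description strings in the format “商品一级类别为...，二级类别为...，进价和售价分别为...和...。” (that is, "The first-level category is ..., the second-level category is ..., and the purchase and sale prices are ... and ...").
3. One product in the table has a second-level category that clearly does not match its first-level category, for example a first-level category of 上衣 (tops) with a second-level category of 拖鞋 (slippers). Find the product ID of this product.
4. Find the product ID with the highest profit within each second-level category.

[Solution]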

df = pd.read_csv("data/ch2/clothing_store.csv")

1

(df.sale_price - df.buy_price).mean()

24.3481

2

# The * operator unpacks a sequence into positional arguments; readers unfamiliar with it can look it up online
pattern = "商品一级类别为{}，二级类别为{}，进价和售价分别为{:d}和{:d}。"
res = df.apply(
    lambda x: pattern.format(*x.values[:-1]), 1)
res.head()

0    商品一级类别为裤子，二级类别为游泳裤，进价和售价分别为145和154。
1    商品一级类别为鞋子，二级类别为凉鞋，进价和售价分别为98和101。
2    商品一级类别为鞋子，二级类别为拖鞋，进价和售价分别为122和149。
3    商品一级类别为鞋子，二级类别为拖鞋，进价和售价分别为55和74。
4    商品一级类别为鞋子，二级类别为凉鞋，进价和售价分别为79和112。
dtype: object

# Without the * operator, the arguments can be passed one by one, which is fully equivalent
res = df.apply(
    lambda x: pattern.format(
        x['type_1'], x['type_2'], x['buy_price'], x['sale_price']
    ), 1
)
res.head()

0    商品一级类别为裤子，二级类别为游泳裤，进价和售价分别为145和154。
1    商品一级类别为鞋子，二级类别为凉鞋，进价和售价分别为98和101。
2    商品一级类别为鞋子，二级类别为拖鞋，进价和售价分别为122和149。
3    商品一级类别为鞋子，二级类别为拖鞋，进价和售价分别为55和74。
4    商品一级类别为鞋子，二级类别为凉鞋，进价和售价分别为79和112。
dtype: object
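The problem statement asks for the product ID as the index, whereas apply on df returns the default integer index of df. One possible way to meet that requirement, sketched on hypothetical rows that only mimic the columns of clothing_store.csv (the product IDs are invented), is to move product_id into the index before applying:

```python
import pandas as pd

# Hypothetical rows with the same columns as clothing_store.csv.
toy = pd.DataFrame({
    "type_1": ["裤子", "鞋子"],
    "type_2": ["游泳裤", "凉鞋"],
    "buy_price": [145, 98],
    "sale_price": [154, 101],
    "product_id": ["S000001", "S000002"],
})
pattern = "商品一级类别为{}，二级类别为{}，进价和售价分别为{:d}和{:d}。"

# With product_id in the index, every remaining row value belongs in the
# pattern, so the slice [:-1] is no longer needed.
res = toy.set_index("product_id").apply(
    lambda x: pattern.format(*x.values), axis=1
)
print(res)
```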

3

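Dropping duplicates shows that the last category combination (裤子 with 拖鞋, that is, pants with slippers) is clearly wrong: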

df_dup = df.drop_duplicates(["type_1", "type_2"])
df_dup

	type_1	type_2	buy_price	sale_price	product_id
0	裤子	游泳裤	145	154	S007721
1	鞋子	凉鞋	98	101	S007156
2	鞋子	拖鞋	122	149	S002286
5	鞋子	跑鞋	59	88	S004928
7	上衣	冲锋衣	100	144	S003098
8	上衣	T恤	115	157	S006858
9	裤子	长裤	190	202	S001512
12	上衣	羽绒服	84	101	S006706
15	裤子	中裤	141	155	S003019
6023	裤子	拖鞋	155	177	S008754

df_dup.product_id[6023]

'S008754'
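The solution above relies on reading the deduplicated table by eye. A possible programmatic check, assuming the erroneous combination is rare and shares its second-level category with a valid combination, is sketched below on hypothetical data (the product IDs are invented):

```python
import pandas as pd

toy = pd.DataFrame({
    "type_1": ["鞋子", "鞋子", "裤子", "裤子", "裤子"],
    "type_2": ["拖鞋", "拖鞋", "长裤", "长裤", "拖鞋"],
    "product_id": ["S1", "S2", "S3", "S4", "S5"],  # hypothetical IDs
})

pairs = toy.drop_duplicates(["type_1", "type_2"])
# Second-level categories that appear under more than one first-level category.
ambiguous = pairs[pairs.duplicated("type_2", keep=False)]
# Number of products under each (type_1, type_2) combination.
counts = toy.value_counts(["type_1", "type_2"])

print(ambiguous)
print(counts.idxmin())
```

Here the ambiguous rows narrow the search to 拖鞋, and the combination with the fewest products is the suspect.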

4

Method 1:

temp = df.copy() # Copy first so that later code is unaffected; whether to copy is up to the reader
temp["profit"] = df.sale_price - df.buy_price
temp.sort_values(
    ["type_2", "profit"],
    ascending=[True, False]
).drop_duplicates("type_2")[["type_2", "product_id"]]

	type_2	product_id
1405	T恤	S009881
162	中裤	S005119
820	冲锋衣	S009181
664	凉鞋	S001114
858	拖鞋	S002385
492	游泳裤	S009267
515	羽绒服	S003205
1073	跑鞋	S005340
1824	长裤	S005169

Method 2:

# Uses the groupby method; it is worth revisiting this approach after studying Chapter 4
df.set_index("product_id").groupby("type_2")[['sale_price', 'buy_price']].apply(
    lambda x: (x.iloc[:, 0]-x.iloc[:, 1]).idxmax())

type_2
T恤     S009881
中裤     S005119
冲锋衣    S009181
凉鞋     S001114
拖鞋     S002385
游泳裤    S009267
羽绒服    S003205
跑鞋     S005340
长裤     S005169
dtype: object
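A third variant, given here only as a sketch on hypothetical data, combines a profit column with a per-group idxmax and a loc lookup (groupby and loc are covered in later chapters):

```python
import pandas as pd

toy = pd.DataFrame({
    "type_2": ["拖鞋", "拖鞋", "长裤", "长裤"],
    "buy_price": [100, 50, 190, 80],
    "sale_price": [120, 90, 202, 100],
    "product_id": ["S1", "S2", "S3", "S4"],  # hypothetical IDs
})
toy["profit"] = toy.sale_price - toy.buy_price

# idxmax per group returns the row label of the most profitable product;
# loc then looks up the matching category and product ID.
best_rows = toy.groupby("type_2")["profit"].idxmax()
res = toy.loc[best_rows, ["type_2", "product_id"]]
print(res)
```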

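2. Computing students' overall scores for a course

data/ch2/student_grade.csv records how each student performed in a course, including the student ID, midterm exam score, final exam score, number of questions answered, and number of absences. Note that this problem allows only functions that have appeared in this chapter; features or functions introduced in later chapters (for example loc and pd.cut()) must not be used. After studying later chapters, readers may work out alternative solutions on their own.

1. Find the ID of the student who answered the most questions among the students with the fewest absences.
2. Compute each student's overall score by the following rules: (1) the overall score is 40% of the midterm score plus 60% of the final score; (2) each question answered adds 1 point, but at most 10 answers count toward the bonus; (3) each absence deducts 5 points; (4) a student with more than 5 absences receives an overall score of 0; (5) the overall score is at most 100 and at least 0.
3. Add a grade column (“等第”) to the table: the grade is 不及格 (fail) if the overall score is below 60, 及格 (pass) if it is at least 60 and below 80, 良好 (good) if it is at least 80 and below 90, and 优秀 (excellent) if it is at least 90. Report the proportion of students in each grade.

[Solution]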

df = pd.read_csv("data/ch2/student_grade.csv")

1

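Method 1:

# Sort by absences ascending first, then by questions answered descending
s = df.sort_values(["Absence_Times", "Question_Answering_Times"], ascending=[True, False]).Student_ID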
s[s.index[0]]

'S034'

Method 2:

# In terms of running time, method 2 is more efficient because method 1 has to sort
temp = df.loc[df.Absence_Times==df.Absence_Times.min()]
temp = temp.loc[temp.Question_Answering_Times==temp.Question_Answering_Times.max(), "Student_ID"]
temp.iloc[0]

'S034'

2

s = df.Mid_Term_Grade * 0.4 + df.Final_Grade * 0.6 
s += df.Question_Answering_Times.clip(0, 10) - 5 * df.Absence_Times
s = s.where(df.Absence_Times <= 5, 0).clip(0, 100)
df["总评"] = s
df.总评.head()

0    75.0
1    86.2
2     0.0
3    83.8
4    69.0
Name: 总评, dtype: float64
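To see each rule take effect, the same steps can be run on four hypothetical students (all numbers invented), chosen so that the bonus cap, the upper bound, the absence cutoff, and the lower bound each apply once:

```python
import numpy as np
import pandas as pd

# Hypothetical students; column names follow student_grade.csv as used above.
toy = pd.DataFrame({
    "Mid_Term_Grade": [80, 100, 90, 10],
    "Final_Grade": [90, 100, 90, 10],
    "Question_Answering_Times": [12, 5, 0, 0],
    "Absence_Times": [1, 0, 6, 3],
})

score = toy.Mid_Term_Grade * 0.4 + toy.Final_Grade * 0.6  # rule 1: weighted exams
score += toy.Question_Answering_Times.clip(0, 10)         # rule 2: bonus capped at 10
score -= 5 * toy.Absence_Times                            # rule 3: 5 points per absence
score = score.where(toy.Absence_Times <= 5, 0)            # rule 4: more than 5 absences gives 0
score = score.clip(0, 100)                                # rule 5: keep within [0, 100]
print(score.tolist())
```

The first student gets 86 + 10 - 5, the second is capped at 100, the third is zeroed for 6 absences, and the fourth is raised from -5 to 0.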

3

Method 1:

grade_dict = {0:"不及格", 1:"及格", 2:"良好", 3:"优秀"}
# Multiplying by 1 converts the boolean Series to a numeric Series
df["grade"] = ((df.总评 >= 90)*1 + (df.总评 >= 80)*1 + (df.总评 >= 60)*1).replace(grade_dict)
df.grade.head()

0     及格
1     良好
2    不及格
3     良好
4     及格
Name: grade, dtype: object

df.grade.value_counts(normalize=True)

及格     0.50
良好     0.32
优秀     0.10
不及格    0.08
Name: grade, dtype: float64

Method 2:

# Unlike method 1, the grade is generated with apply
df["grade"] = df.总评.apply(
    lambda x: "不及格" if x < 60 else
              "及格" if x < 80 else
              "良好" if x < 90 else
              "优秀"
)
df.grade.head()

0     及格
1     良好
2    不及格
3     良好
4     及格
Name: grade, dtype: object

Method 3:

# See Section 3 of Chapter 9
df["grade"] = pd.cut(
    df.总评,
    bins=[0,60,80,90,np.inf],
    labels=["不及格", "及格", "良好", "优秀"],
    right=False
)
df.grade.head()

0     及格
1     良好
2    不及格
3     良好
4     及格
Name: grade, dtype: category
Categories (4, object): ['不及格' < '及格' < '良好' < '优秀']
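The three methods differ mainly in how they treat the boundaries 60, 80, and 90. The following sketch runs all three on hypothetical boundary scores to confirm they assign the same labels; the scores are invented and the label list mirrors grade_dict above:

```python
import numpy as np
import pandas as pd

scores = pd.Series([0, 59.9, 60, 79.9, 80, 89.9, 90, 100])  # hypothetical boundary values
labels = ["不及格", "及格", "良好", "优秀"]

# Method 1: count how many thresholds are reached, then map the count to a label.
by_count = (
    (scores >= 90) * 1 + (scores >= 80) * 1 + (scores >= 60) * 1
).replace(dict(enumerate(labels)))

# Method 2: chained conditional expression applied element by element.
by_apply = scores.apply(
    lambda x: labels[0] if x < 60 else labels[1] if x < 80 else labels[2] if x < 90 else labels[3]
)

# Method 3: left-closed bins [0, 60), [60, 80), [80, 90), [90, inf).
by_cut = pd.cut(scores, bins=[0, 60, 80, 90, np.inf], labels=labels, right=False)

print(by_count.tolist())
```

right=False matters in the third method: with the default right-closed bins, a score of exactly 60 would fall into the lower grade.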

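3. Implementing an exponentially weighted window

(1) The ewm window as an expanding window

With an expanding window, users can apply various functions to compute cumulative statistics over the history, but these built-in statistics usually give every element in the window the same weight. In fact, the elements in the window can be given different weights, and the exponentially weighted window is one such special expanding window.

Its most important parameter is alpha, which by default sets the window weights to \(w_i = (1 - \alpha)^{t-i}, i\in \{0, 1, ..., t\}\), where \(w_0\) is the weight of the first element \(x_0\) of the sequence and \(w_t\) is the weight of the current element \(x_t\). The weight formula shows that the farther an element is from the current value, the smaller its weight. Writing \(x\) for the original sequence and \(y_t\) for the updated current element, normalizing the weighted sum gives: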

\[\begin{split} \begin{aligned} y_t &=\frac{\sum_{i=0}^{t} w_i x_{i}}{\sum_{i=0}^{t} w_i} \\ &=\frac{x_t + (1 - \alpha)x_{t-1} + (1 - \alpha)^2 x_{t-2} + ... + (1 - \alpha)^{t} x_{0}}{1 + (1 - \alpha) + (1 - \alpha)^2 + ... + (1 - \alpha)^{t}} \end{aligned} \end{split}\]

For a Series, the exponentially smoothed sequence can be computed with an ewm object as follows:

np.random.seed(0)
s = pd.Series(np.random.randint(-1,2,30).cumsum())
s.head()

0   -1
1   -1
2   -2
3   -2
4   -2
dtype: int32

s.ewm(alpha=0.2).mean().head()

0   -1.000000
1   -1.000000
2   -1.409836
3   -1.609756
4   -1.725845
dtype: float64

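Implement this with an expanding window.

(2) The ewm window as a sliding window

As (1) shows, ewm, as a special case of an expanding window, can only start weighting from the first element of the sequence. Now suppose a window limit n is given, so that only the most recent n elements, including the current one, form the window for sliding weighted smoothing. Based on the sliding window function, give the new update formulas for \(w_i\) and \(y_t\), and implement this with a rolling window.

[Solution]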

1

def ewm_func(x, alpha=0.2):
    win = (1 - alpha) ** np.arange(x.shape[0])
    win = win[::-1]
    res = (win * x).sum() / win.sum()
    return res

s.expanding().apply(ewm_func).head()

0   -1.000000
1   -1.000000
2   -1.409836
3   -1.609756
4   -1.725845
dtype: float64
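The expanding apply recomputes every weight for every window. As a side note, the numerator and denominator of the formula in (1) both satisfy a simple recursion, \(N_t = x_t + (1-\alpha)N_{t-1}\) and \(D_t = 1 + (1-\alpha)D_{t-1}\) with \(y_t = N_t / D_t\), so the same values can be obtained in one pass. The function name ewm_recursive below is hypothetical and this is a sketch for checking the formula, not part of the exercise:

```python
import numpy as np
import pandas as pd

def ewm_recursive(x, alpha=0.2):
    """Exponentially weighted mean via a running numerator and denominator."""
    num, den = 0.0, 0.0
    out = np.empty(len(x))
    for t, value in enumerate(x):
        num = value + (1 - alpha) * num  # x_t plus the decayed previous weighted sum
        den = 1.0 + (1 - alpha) * den    # 1 plus the decayed previous weight total
        out[t] = num / den
    return out

np.random.seed(0)
s = pd.Series(np.random.randint(-1, 2, 30).cumsum())
res = pd.Series(ewm_recursive(s.values), index=s.index)
print(np.allclose(res, s.ewm(alpha=0.2).mean()))
```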

2

The weights are \(w_i=(1-\alpha)^{t-i},i\in\{t-n+1,...,t\}\), and \(y_t\) is updated as follows:

\[\begin{split} \begin{aligned} y_t &=\frac{\sum_{i=t-n+1}^{t} w_i x_{i}}{\sum_{i=t-n+1}^{t} w_i} \\ &=\frac{x_t + (1 - \alpha)x_{t-1} + (1 - \alpha)^2 x_{t-2} + ... + (1 - \alpha)^{n-1} x_{t-n+1}}{1 + (1 - \alpha) + (1 - \alpha)^2 + ... + (1 - \alpha)^{n-1}} \end{aligned} \end{split}\]

In fact, the window function needs no changes at all, because it already matches the formula above exactly:

# Assume a window size of 4
s.rolling(window=4).apply(ewm_func).head()

0         NaN
1         NaN
2         NaN
3   -1.609756
4   -1.826558
dtype: float64
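Two optional variations on the call above, given as a sketch: raw=True passes each window to the function as a NumPy array rather than a Series, and min_periods=1 lets the first windows be shorter than 4 elements, in which case they coincide with the expanding ewm values from (1). The function ewm_func is repeated so that the block is self-contained:

```python
import numpy as np
import pandas as pd

def ewm_func(x, alpha=0.2):
    win = (1 - alpha) ** np.arange(x.shape[0])
    win = win[::-1]  # put the largest weight on the newest element
    return (win * x).sum() / win.sum()

np.random.seed(0)
s = pd.Series(np.random.randint(-1, 2, 30).cumsum())

# raw=True: each window arrives as a NumPy array instead of a Series.
rolled = s.rolling(window=4).apply(ewm_func, raw=True)

# min_periods=1: the first three windows hold 1, 2, and 3 elements,
# so they match the expanding ewm result at those positions.
warm_up = s.rolling(window=4, min_periods=1).apply(ewm_func, raw=True)
print(np.allclose(warm_up[:4], s.ewm(alpha=0.2).mean()[:4]))
```

Because ewm_func derives the weights from the length of whatever window it receives, it works unchanged for both full and shortened windows.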