第五章

零、练一练

练一练

除了上述例子之外,请再给出一些长宽表的例子。

  • 某企业员工当年的各月收入情况表,列索引是月份,行索引是员工ID,那么这张表是收入月份的宽表。如果行索引是员工ID,第一列是月份,第二列是该员工在当月的收入,那么这张表是收入月份的长表。

  • 某张表的行索引是月份,列索引是城市名,单元格元素是某城市在某月的发电量,那么这张表是城市的宽表。如果行索引是月份,第一列是城市名,第二列是发电量,那么这张表是城市的长表。

练一练

如果只选取df的部分行使用pivot,那么结果会如何变化?

df = pd.DataFrame({'Class':[1,1,2,2], # 班级
                   'Name':['A','A','B','B'], # 姓名
                   'Subject':['Chinese','Math','Chinese','Math'], # 科目
                   'Grade':[80,75,90,85]}) # 成绩

表中可能会出现缺失值:

df.iloc[[1,2]].pivot(index='Name', columns='Subject', values='Grade')
Subject Chinese Math
Name
A NaN 75.0
B 90.0 NaN

表的形状可能会变化:

df.iloc[[1,3]].pivot(index='Name', columns='Subject', values='Grade')
Subject Math
Name
A 75
B 85
df.iloc[[0,1]].pivot(index='Name', columns='Subject', values='Grade')
Subject Chinese Math
Name
A 80 75
df.iloc[[0]].pivot(index='Name', columns='Subject', values='Grade')
Subject Chinese
Name
A 80

练一练

请修改上表中的元素值以破坏pivot()的唯一性条件,并且在使用pivot()进行强制变形后观察报错信息。

df_unpivot = pd.DataFrame({'Class':[1, 1, 2, 2, 1, 1, 2, 2],
                           'Name':list("AABBAABB"),
                           'Examination': ['Mid', 'Final', 'Mid', 'Final',
                           'Mid', 'Final', 'Mid', 'Final'],
                           'Subject':['Chinese', 'Chinese', 'Chinese', 'Chinese',
                                      'Math', 'Math', 'Math', 'Math'],
                           'Grade':[80, 75, 85, 65, 90, 85, 92, 88],
                           'Rank':[10, 15, 21, 15, 20, 7, 6, 2]})
df_unpivot.loc[0,"Subject"] = "Math"
df_unpivot.pivot(index = ['Class', 'Name'],
                 columns = ['Subject','Examination'],
                 values = ['Grade','Rank'])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [8], in <module>
      1 df_unpivot.loc[0,"Subject"] = "Math"
----> 2 df_unpivot.pivot(index = ['Class', 'Name'],
      3                  columns = ['Subject','Examination'],
      4                  values = ['Grade','Rank'])

File ~\miniconda3\envs\final\lib\site-packages\pandas\core\frame.py:7876, in DataFrame.pivot(self, index, columns, values)
   7871 @Substitution("")
   7872 @Appender(_shared_docs["pivot"])
   7873 def pivot(self, index=None, columns=None, values=None) -> DataFrame:
   7874     from pandas.core.reshape.pivot import pivot
-> 7876     return pivot(self, index=index, columns=columns, values=values)

File ~\miniconda3\envs\final\lib\site-packages\pandas\core\reshape\pivot.py:520, in pivot(data, index, columns, values)
    518     else:
    519         indexed = data._constructor_sliced(data[values]._values, index=multiindex)
--> 520 return indexed.unstack(columns_listlike)

File ~\miniconda3\envs\final\lib\site-packages\pandas\core\frame.py:8407, in DataFrame.unstack(self, level, fill_value)
   8349 """
   8350 Pivot a level of the (necessarily hierarchical) index labels.
   8351 
   (...)
   8403 dtype: float64
   8404 """
   8405 from pandas.core.reshape.reshape import unstack
-> 8407 result = unstack(self, level, fill_value)
   8409 return result.__finalize__(self, method="unstack")

File ~\miniconda3\envs\final\lib\site-packages\pandas\core\reshape\reshape.py:468, in unstack(obj, level, fill_value)
    464 if isinstance(level, (tuple, list)):
    465     if len(level) != 1:
    466         # _unstack_multiple only handles MultiIndexes,
    467         # and isn't needed for a single level
--> 468         return _unstack_multiple(obj, level, fill_value=fill_value)
    469     else:
    470         level = level[0]

File ~\miniconda3\envs\final\lib\site-packages\pandas\core\reshape\reshape.py:437, in _unstack_multiple(data, clocs, fill_value)
    434 dummy = data.copy(deep=False)
    435 dummy.index = dummy_index
--> 437 unstacked = dummy.unstack("__placeholder__", fill_value=fill_value)
    438 if isinstance(unstacked, Series):
    439     unstcols = unstacked.index

File ~\miniconda3\envs\final\lib\site-packages\pandas\core\frame.py:8407, in DataFrame.unstack(self, level, fill_value)
   8349 """
   8350 Pivot a level of the (necessarily hierarchical) index labels.
   8351 
   (...)
   8403 dtype: float64
   8404 """
   8405 from pandas.core.reshape.reshape import unstack
-> 8407 result = unstack(self, level, fill_value)
   8409 return result.__finalize__(self, method="unstack")

File ~\miniconda3\envs\final\lib\site-packages\pandas\core\reshape\reshape.py:478, in unstack(obj, level, fill_value)
    476 if isinstance(obj, DataFrame):
    477     if isinstance(obj.index, MultiIndex):
--> 478         return _unstack_frame(obj, level, fill_value=fill_value)
    479     else:
    480         return obj.T.stack(dropna=False)

File ~\miniconda3\envs\final\lib\site-packages\pandas\core\reshape\reshape.py:505, in _unstack_frame(obj, level, fill_value)
    503     return obj._constructor(mgr)
    504 else:
--> 505     unstacker = _Unstacker(obj.index, level=level, constructor=obj._constructor)
    506     return unstacker.get_result(
    507         obj._values, value_columns=obj.columns, fill_value=fill_value
    508     )

File ~\miniconda3\envs\final\lib\site-packages\pandas\core\reshape\reshape.py:140, in _Unstacker.__init__(self, index, level, constructor)
    133 if num_cells > np.iinfo(np.int32).max:
    134     warnings.warn(
    135         f"The following operation may generate {num_cells} cells "
    136         f"in the resulting pandas object.",
    137         PerformanceWarning,
    138     )
--> 140 self._make_selectors()

File ~\miniconda3\envs\final\lib\site-packages\pandas\core\reshape\reshape.py:192, in _Unstacker._make_selectors(self)
    189 mask.put(selector, True)
    191 if mask.sum() < len(self.index):
--> 192     raise ValueError("Index contains duplicate entries, cannot reshape")
    194 self.group_index = comp_index
    195 self.mask = mask

ValueError: Index contains duplicate entries, cannot reshape

练一练

在上面的边际汇总例子中,行或列的汇总为新表中行元素或者列元素的平均值,而总体的汇总为新表中4个元素的平均值。这种关系一定成立吗?若不成立,请给出1个例子来说明。

不一定成立,例如:

df = pd.DataFrame({'Name':list("AAAABBBBBBAAB"),
                   'Subject':['Chinese', 'Chinese', 'Math', 'Math',
                              'Chinese', 'Chinese', 'Math', 'Math',
                              'Math', "Chinese", "Chinese", "Chinese", "Math"],
                   'Grade':[90, 80, 70, 90, 75, 85, 80, 100, 0, 0, 30, 40, 20]})
df.pivot_table(
    index='Name', columns='Subject', values='Grade',
    aggfunc='mean', margins='True')
Subject Chinese Math All
Name
A 60.000000 80.0 66.666667
B 53.333333 50.0 51.428571
All 57.142857 60.0 58.461538

练一练

在上例中,若想把考试类型(Examination)的Mid和Final转到行索引,Subject中的Chinese和Math保留在列索引上,应当如何操作?

df = pd.DataFrame({'Class':[1,2],'Name':['A', 'B'],
                   'Mid_Chinese':[80, 85], 'Mid_Math':[70, 80],
                   'Final_Chinese':[90, 75], 'Final_Math':[95, 85]})
df
Class Name Mid_Chinese Mid_Math Final_Chinese Final_Math
0 1 A 80 70 90 95
1 2 B 85 80 75 85

方法一:

temp = df.rename(columns=lambda x: "_".join(x.split("_")[::-1]) if "_" in x else x)
res = pd.wide_to_long(temp, stubnames=['Math', 'Chinese'],
                      i=['Class', 'Name'], j='Examination',
                      sep='_', suffix='.+').rename_axis("Subject", axis=1)
res
Subject Math Chinese
Class Name Examination
1 A Mid 70 80
Final 95 90
2 B Mid 80 85
Final 85 75

方法二:

temp = pd.wide_to_long(df, stubnames=['Mid', 'Final'],
                       i=['Class', 'Name'], j='Subject',
                       sep='_', suffix='.+')
temp.columns.name = "Examiniation"
res = temp.unstack(-1).stack(0)
res
Subject Chinese Math
Class Name Examiniation
1 A Final 90 95
Mid 80 70
2 B Final 75 85
Mid 85 80

练一练

第3章中提到的swaplevel()和reorder_levels()是长宽表变形函数吗?为什么?

不是,这两个函数是统一索引内部的层级交换,stack()和unstack()是行列索引间的层级交换,而5.1节的4个长宽表变形函数是索引与列的交换。

练一练

结合unstack()和reorder_levels(),将行索引的第二层插入至列索引的最外层

df = pd.DataFrame(np.ones((4,2)),
                  index = pd.Index([('A', 'cat', 'big'),
                                    ('A', 'dog', 'small'),
                                    ('B', 'cat', 'big'),
                                    ('B', 'dog', 'small')]),
                  columns=['col_1', 'col_2'])
df.unstack(1).reorder_levels([1,0], axis=1)
cat dog cat dog
col_1 col_1 col_2 col_2
A big 1.0 NaN 1.0 NaN
small NaN 1.0 NaN 1.0
B big 1.0 NaN 1.0 NaN
small NaN 1.0 NaN 1.0

练一练

请自行构造一个unstack()因违背唯一性要求而报错的例子。

pd.DataFrame([[0,0]], columns=pd.Index([(0,0),(0,0)])).stack()
C:\Users\gyh\AppData\Local\Temp\ipykernel_9256\2556769686.py:1: FutureWarning: reindexing with a non-unique Index is deprecated and will raise in a future version.
  pd.DataFrame([[0,0]], columns=pd.Index([(0,0),(0,0)])).stack()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [14], in <module>
----> 1 pd.DataFrame([[0,0]], columns=pd.Index([(0,0),(0,0)])).stack()

File ~\miniconda3\envs\final\lib\site-packages\pandas\core\frame.py:8217, in DataFrame.stack(self, level, dropna)
   8215     result = stack_multiple(self, level, dropna=dropna)
   8216 else:
-> 8217     result = stack(self, level, dropna=dropna)
   8219 return result.__finalize__(self, method="stack")

File ~\miniconda3\envs\final\lib\site-packages\pandas\core\reshape\reshape.py:566, in stack(frame, level, dropna)
    563 level_num = frame.columns._get_level_number(level)
    565 if isinstance(frame.columns, MultiIndex):
--> 566     return _stack_multi_columns(frame, level_num=level_num, dropna=dropna)
    567 elif isinstance(frame.index, MultiIndex):
    568     new_levels = list(frame.index.levels)

File ~\miniconda3\envs\final\lib\site-packages\pandas\core\reshape\reshape.py:750, in _stack_multi_columns(frame, level_num, dropna)
    748     chunk = this.loc[:, this.columns[loc]]
    749     chunk.columns = level_vals_nan.take(chunk.columns.codes[-1])
--> 750     value_slice = chunk.reindex(columns=level_vals_used).values
    751 else:
    752     if frame._is_homogeneous_type and is_extension_array_dtype(
    753         frame.dtypes.iloc[0]
    754     ):
    755         # TODO(EA2D): won't need special case, can go through .values
    756         #  paths below (might change to ._values)

File ~\miniconda3\envs\final\lib\site-packages\pandas\util\_decorators.py:324, in rewrite_axis_style_signature.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    322 @wraps(func)
    323 def wrapper(*args, **kwargs) -> Callable[..., Any]:
--> 324     return func(*args, **kwargs)

File ~\miniconda3\envs\final\lib\site-packages\pandas\core\frame.py:4806, in DataFrame.reindex(self, *args, **kwargs)
   4804 kwargs.pop("axis", None)
   4805 kwargs.pop("labels", None)
-> 4806 return super().reindex(**kwargs)

File ~\miniconda3\envs\final\lib\site-packages\pandas\core\generic.py:4974, in NDFrame.reindex(self, *args, **kwargs)
   4971     return self._reindex_multi(axes, copy, fill_value)
   4973 # perform the reindex on the axes
-> 4974 return self._reindex_axes(
   4975     axes, level, limit, tolerance, method, fill_value, copy
   4976 ).__finalize__(self, method="reindex")

File ~\miniconda3\envs\final\lib\site-packages\pandas\core\frame.py:4619, in DataFrame._reindex_axes(self, axes, level, limit, tolerance, method, fill_value, copy)
   4617 columns = axes["columns"]
   4618 if columns is not None:
-> 4619     frame = frame._reindex_columns(
   4620         columns, method, copy, level, fill_value, limit, tolerance
   4621     )
   4623 index = axes["index"]
   4624 if index is not None:

File ~\miniconda3\envs\final\lib\site-packages\pandas\core\frame.py:4664, in DataFrame._reindex_columns(self, new_columns, method, copy, level, fill_value, limit, tolerance)
   4651 def _reindex_columns(
   4652     self,
   4653     new_columns,
   (...)
   4659     tolerance=None,
   4660 ):
   4661     new_columns, indexer = self.columns.reindex(
   4662         new_columns, method=method, level=level, limit=limit, tolerance=tolerance
   4663     )
-> 4664     return self._reindex_with_indexers(
   4665         {1: [new_columns, indexer]},
   4666         copy=copy,
   4667         fill_value=fill_value,
   4668         allow_dups=False,
   4669     )

File ~\miniconda3\envs\final\lib\site-packages\pandas\core\generic.py:5040, in NDFrame._reindex_with_indexers(self, reindexers, fill_value, copy, allow_dups)
   5037     indexer = ensure_platform_int(indexer)
   5039 # TODO: speed up on homogeneous DataFrame objects (see _reindex_multi)
-> 5040 new_data = new_data.reindex_indexer(
   5041     index,
   5042     indexer,
   5043     axis=baxis,
   5044     fill_value=fill_value,
   5045     allow_dups=allow_dups,
   5046     copy=copy,
   5047 )
   5048 # If we've made a copy once, no need to make another one
   5049 copy = False

File ~\miniconda3\envs\final\lib\site-packages\pandas\core\internals\managers.py:679, in BaseBlockManager.reindex_indexer(self, new_axis, indexer, axis, fill_value, allow_dups, copy, consolidate, only_slice, use_na_proxy)
    677 # some axes don't allow reindexing with dups
    678 if not allow_dups:
--> 679     self.axes[axis]._validate_can_reindex(indexer)
    681 if axis >= self.ndim:
    682     raise IndexError("Requested axis not found in manager")

File ~\miniconda3\envs\final\lib\site-packages\pandas\core\indexes\base.py:4107, in Index._validate_can_reindex(self, indexer)
   4105 # trying to reindex on an axis with duplicates
   4106 if not self._index_as_unique and len(indexer):
-> 4107     raise ValueError("cannot reindex on an axis with duplicate labels")

ValueError: cannot reindex on an axis with duplicate labels

练一练

在2.2.2节中提到df.T能够将DataFrame转置,而转置是一种特殊的索引变形,请仅用unstack和stack实现转置功能。

n_row_index = df.index.nlevels
n_col_index = df.columns.nlevels
df.stack(list(range(n_col_index))).unstack(list(range(n_row_index)))
A B
cat dog cat dog
big small big small
col_1 1.0 1.0 1.0 1.0
col_2 1.0 1.0 1.0 1.0

一、某连锁店的库存统计

某连锁店总共有5个大区仓库(First_Area),每个大区仓库内有若干子区仓库(Child_Area),在data/ch5/storage.csv中记录了从2010年至2017年各子区仓库中不同货品(Type)的存量(Storage),其中“Year”、“First_Area”、“Child_Area”和“Type”已为依次排序状态。

df = pd.read_csv("data/ch5/area_storage.csv")
df.head()
Year First_Area Child_Area Type Storage
0 2010 A 4 product_14 1
1 2010 A 4 product_55 2
2 2010 A 7 product_14 1
3 2010 A 7 product_55 2
4 2010 A 8 product_14 31
  • 将数据转为如下格式,其中“First_Area”、“Child_Area”和“Type”三列已为依次排序状态。

First_Area Child_Area Type 2010 2011 2012 2013 2014 2015 2016 2017
0 A 4 product_14 1.0 2.0 1.0 5.0 6.0 3.0 4.0 NaN
1 A 4 product_15 NaN NaN NaN NaN NaN 1.0 NaN NaN
2 A 4 product_27 NaN NaN NaN NaN NaN 1.0 NaN NaN
3 A 4 product_37 NaN NaN NaN NaN NaN 1.0 10.0 6.0
4 A 4 product_42 NaN NaN NaN NaN NaN NaN 1.0 NaN
  • 将上一问中的结果恢复为原表。

  • 按First_Area分别统计每年的库存量总和,并分别将First_Area和Year作为列索引和行索引,要求分别使用pivot_table()函数与groupby+unstack()两种不同的策略实现,并体会变形和分组之间的联系。

【解答】
  • 1

df_ = df.sort_values(['First_Area','Child_Area','Type'],ignore_index=True)
res = df_.pivot(
    index=['First_Area','Child_Area','Type'],
    columns='Year', values='Storage'
).reset_index().rename_axis(columns={'Year':''})
res.head()
First_Area Child_Area Type 2010 2011 2012 2013 2014 2015 2016 2017
0 A 4 product_14 1.0 2.0 1.0 5.0 6.0 3.0 4.0 NaN
1 A 4 product_15 NaN NaN NaN NaN NaN 1.0 NaN NaN
2 A 4 product_27 NaN NaN NaN NaN NaN 1.0 NaN NaN
3 A 4 product_37 NaN NaN NaN NaN NaN 1.0 10.0 6.0
4 A 4 product_42 NaN NaN NaN NaN NaN NaN 1.0 NaN
  • 2

res_melted = res.melt(id_vars = ['First_Area','Child_Area','Type'],
                      value_vars = res.columns[-8:],
                      var_name = 'Year',
                      value_name = 'Storage').dropna(subset=['Storage'])
res_melted = res_melted[df.columns].sort_values(
    ["Year",'First_Area','Child_Area','Type'],
    ignore_index=True).astype({'Year':'int64', 'Storage':'int64'})
res_melted.equals(df)
True
  • 3

df_method1 = df.pivot_table(index='Year', columns='First_Area',values='Storage', aggfunc='sum')
df_method1
First_Area A B C D E
Year
2010 8685 19707 19814 10453 2890
2011 6749 20330 19987 10289 3271
2012 7831 23145 19959 10722 3376
2013 11675 26846 20409 11148 4046
2014 9037 30860 24904 11081 3280
2015 8810 37127 25651 9865 2571
2016 10195 42470 26164 9093 2548
2017 10448 46104 27894 9394 1614
df_method2 = df.groupby(['First_Area', 'Year'])['Storage'].sum()
df_method2 = df_method2.to_frame().unstack(0).droplevel(0,axis=1)
df_method2
First_Area A B C D E
Year
2010 8685 19707 19814 10453 2890
2011 6749 20330 19987 10289 3271
2012 7831 23145 19959 10722 3376
2013 11675 26846 20409 11148 4046
2014 9037 30860 24904 11081 3280
2015 8810 37127 25651 9865 2571
2016 10195 42470 26164 9093 2548
2017 10448 46104 27894 9394 1614

二、整理某地区的化石燃料数据

在data/ch5/fossil_fuel.txt中存放了某地2020年各月化石燃料资源的消耗量和生产量。

df = pd.read_table("data/ch5/fossil_fuel.txt", header=None)

请将上表整理成如下格式,其中“日期”、“统计类别”和“资源名称”三列为已依次排序状态。

资源名称 统计类别 单位 日期 数值
0 01-煤炭 消耗量 万吨 2020-01 14.84
1 02-气化天然气 消耗量 亿立方米 2020-01 0.1201
2 03-油页岩 消耗量 万吨 2020-01 13.82
3 04-石油 消耗量 万吨 2020-01 12.86
4 05-液化天然气 消耗量 万吨 2020-01 10.65
【解答】
df.iloc[0, 2::2] = df.iloc[0, 1::2]
res = pd.DataFrame(
    df.iloc[4:, 1:].values,
    index=pd.Series(df.iloc[4:, 0], name="日期"),
    columns=pd.MultiIndex.from_arrays(df.iloc[:3, 1:].values, names=df.iloc[:3, 0])
).unstack().reset_index().rename(columns={0:"数值"})
res = res.sort_values(["日期", "统计类别", "资源名称"]).reset_index(drop=True)
res.head()
资源名称 统计类别 单位 日期 数值
0 01-煤炭 消耗量 万吨 2020-01 14.84
1 02-气化天然气 消耗量 亿立方米 2020-01 0.1201
2 03-油页岩 消耗量 万吨 2020-01 13.82
3 04-石油 消耗量 万吨 2020-01 12.86
4 05-液化天然气 消耗量 万吨 2020-01 10.65

三、特殊的wide_to_long()方法

从功能上看,melt()方法应当属于wide_to_long()的一种特殊情况,即stubnames只有一类。请使用wide_to_long()生成melt()一节中的df_melted。(提示:对列名增加适当的前缀)

【解答】
df = pd.DataFrame({'Class':[1,2],
                   'Name':['A', 'B'],
                   'Chinese':[80, 90],
                   'Math':[80, 75]})
df = df.rename(columns={'Chinese':'pre_Chinese', 'Math':'pre_Math'})
pd.wide_to_long(df,
                stubnames=['pre'],
                i = ['Class', 'Name'],
                j='Subject',
                sep='_',
                suffix='.+').reset_index().rename(columns={'pre':'Grade'})
Class Name Subject Grade
0 1 A Chinese 80
1 1 A Math 80
2 2 B Chinese 90
3 2 B Math 75