第五章
内容
第五章¶
零、练一练¶
练一练
除了上述例子之外,请再给出一些长宽表的例子。
某企业员工当年的各月收入情况表,列索引是月份,行索引是员工ID,那么这张表是收入月份的宽表。如果行索引是员工ID,第一列是月份,第二列是该员工在当月的收入,那么这张表是收入月份的长表。
某张表的行索引是月份,列索引是城市名,单元格元素是某城市在某月的发电量,那么这张表是城市的宽表。如果行索引是月份,第一列是城市名,第二列是发电量,那么这张表是城市的长表。
练一练
如果只选取df的部分行使用pivot,那么结果会如何变化?
df = pd.DataFrame({'Class':[1,1,2,2], # 班级
'Name':['A','A','B','B'], # 姓名
'Subject':['Chinese','Math','Chinese','Math'], # 科目
'Grade':[80,75,90,85]}) # 成绩
表中可能会出现缺失值:
df.iloc[[1,2]].pivot(index='Name', columns='Subject', values='Grade')
Subject | Chinese | Math |
---|---|---|
Name | ||
A | NaN | 75.0 |
B | 90.0 | NaN |
表的形状可能会变化:
df.iloc[[1,3]].pivot(index='Name', columns='Subject', values='Grade')
Subject | Math |
---|---|
Name | |
A | 75 |
B | 85 |
df.iloc[[0,1]].pivot(index='Name', columns='Subject', values='Grade')
Subject | Chinese | Math |
---|---|---|
Name | ||
A | 80 | 75 |
df.iloc[[0]].pivot(index='Name', columns='Subject', values='Grade')
Subject | Chinese |
---|---|
Name | |
A | 80 |
练一练
请修改上表中的元素值以破坏pivot()的唯一性条件,并且在使用pivot()进行强制变形后观察报错信息。
df_unpivot = pd.DataFrame({'Class':[1, 1, 2, 2, 1, 1, 2, 2],
'Name':list("AABBAABB"),
'Examination': ['Mid', 'Final', 'Mid', 'Final',
'Mid', 'Final', 'Mid', 'Final'],
'Subject':['Chinese', 'Chinese', 'Chinese', 'Chinese',
'Math', 'Math', 'Math', 'Math'],
'Grade':[80, 75, 85, 65, 90, 85, 92, 88],
'Rank':[10, 15, 21, 15, 20, 7, 6, 2]})
df_unpivot.loc[0,"Subject"] = "Math"
df_unpivot.pivot(index = ['Class', 'Name'],
columns = ['Subject','Examination'],
values = ['Grade','Rank'])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [8], in <module>
1 df_unpivot.loc[0,"Subject"] = "Math"
----> 2 df_unpivot.pivot(index = ['Class', 'Name'],
3 columns = ['Subject','Examination'],
4 values = ['Grade','Rank'])
File ~\miniconda3\envs\final\lib\site-packages\pandas\core\frame.py:7876, in DataFrame.pivot(self, index, columns, values)
7871 @Substitution("")
7872 @Appender(_shared_docs["pivot"])
7873 def pivot(self, index=None, columns=None, values=None) -> DataFrame:
7874 from pandas.core.reshape.pivot import pivot
-> 7876 return pivot(self, index=index, columns=columns, values=values)
File ~\miniconda3\envs\final\lib\site-packages\pandas\core\reshape\pivot.py:520, in pivot(data, index, columns, values)
518 else:
519 indexed = data._constructor_sliced(data[values]._values, index=multiindex)
--> 520 return indexed.unstack(columns_listlike)
File ~\miniconda3\envs\final\lib\site-packages\pandas\core\frame.py:8407, in DataFrame.unstack(self, level, fill_value)
8349 """
8350 Pivot a level of the (necessarily hierarchical) index labels.
8351
(...)
8403 dtype: float64
8404 """
8405 from pandas.core.reshape.reshape import unstack
-> 8407 result = unstack(self, level, fill_value)
8409 return result.__finalize__(self, method="unstack")
File ~\miniconda3\envs\final\lib\site-packages\pandas\core\reshape\reshape.py:468, in unstack(obj, level, fill_value)
464 if isinstance(level, (tuple, list)):
465 if len(level) != 1:
466 # _unstack_multiple only handles MultiIndexes,
467 # and isn't needed for a single level
--> 468 return _unstack_multiple(obj, level, fill_value=fill_value)
469 else:
470 level = level[0]
File ~\miniconda3\envs\final\lib\site-packages\pandas\core\reshape\reshape.py:437, in _unstack_multiple(data, clocs, fill_value)
434 dummy = data.copy(deep=False)
435 dummy.index = dummy_index
--> 437 unstacked = dummy.unstack("__placeholder__", fill_value=fill_value)
438 if isinstance(unstacked, Series):
439 unstcols = unstacked.index
File ~\miniconda3\envs\final\lib\site-packages\pandas\core\frame.py:8407, in DataFrame.unstack(self, level, fill_value)
8349 """
8350 Pivot a level of the (necessarily hierarchical) index labels.
8351
(...)
8403 dtype: float64
8404 """
8405 from pandas.core.reshape.reshape import unstack
-> 8407 result = unstack(self, level, fill_value)
8409 return result.__finalize__(self, method="unstack")
File ~\miniconda3\envs\final\lib\site-packages\pandas\core\reshape\reshape.py:478, in unstack(obj, level, fill_value)
476 if isinstance(obj, DataFrame):
477 if isinstance(obj.index, MultiIndex):
--> 478 return _unstack_frame(obj, level, fill_value=fill_value)
479 else:
480 return obj.T.stack(dropna=False)
File ~\miniconda3\envs\final\lib\site-packages\pandas\core\reshape\reshape.py:505, in _unstack_frame(obj, level, fill_value)
503 return obj._constructor(mgr)
504 else:
--> 505 unstacker = _Unstacker(obj.index, level=level, constructor=obj._constructor)
506 return unstacker.get_result(
507 obj._values, value_columns=obj.columns, fill_value=fill_value
508 )
File ~\miniconda3\envs\final\lib\site-packages\pandas\core\reshape\reshape.py:140, in _Unstacker.__init__(self, index, level, constructor)
133 if num_cells > np.iinfo(np.int32).max:
134 warnings.warn(
135 f"The following operation may generate {num_cells} cells "
136 f"in the resulting pandas object.",
137 PerformanceWarning,
138 )
--> 140 self._make_selectors()
File ~\miniconda3\envs\final\lib\site-packages\pandas\core\reshape\reshape.py:192, in _Unstacker._make_selectors(self)
189 mask.put(selector, True)
191 if mask.sum() < len(self.index):
--> 192 raise ValueError("Index contains duplicate entries, cannot reshape")
194 self.group_index = comp_index
195 self.mask = mask
ValueError: Index contains duplicate entries, cannot reshape
练一练
在上面的边际汇总例子中,行或列的汇总为新表中行元素或者列元素的平均值,而总体的汇总为新表中4个元素的平均值。这种关系一定成立吗?若不成立,请给出1个例子来说明。
不一定成立,例如:
df = pd.DataFrame({'Name':list("AAAABBBBBBAAB"),
'Subject':['Chinese', 'Chinese', 'Math', 'Math',
'Chinese', 'Chinese', 'Math', 'Math',
'Math', "Chinese", "Chinese", "Chinese", "Math"],
'Grade':[90, 80, 70, 90, 75, 85, 80, 100, 0, 0, 30, 40, 20]})
df.pivot_table(
index='Name', columns='Subject', values='Grade',
aggfunc='mean', margins='True')
Subject | Chinese | Math | All |
---|---|---|---|
Name | |||
A | 60.000000 | 80.0 | 66.666667 |
B | 53.333333 | 50.0 | 51.428571 |
All | 57.142857 | 60.0 | 58.461538 |
练一练
在上例中,若想把考试类型(Examination)的Mid和Final转到行索引,Subject中的Chinese和Math保留在列索引上,应当如何操作?
df = pd.DataFrame({'Class':[1,2],'Name':['A', 'B'],
'Mid_Chinese':[80, 85], 'Mid_Math':[70, 80],
'Final_Chinese':[90, 75], 'Final_Math':[95, 85]})
df
Class | Name | Mid_Chinese | Mid_Math | Final_Chinese | Final_Math | |
---|---|---|---|---|---|---|
0 | 1 | A | 80 | 70 | 90 | 95 |
1 | 2 | B | 85 | 80 | 75 | 85 |
方法一:
temp = df.rename(columns=lambda x: "_".join(x.split("_")[::-1]) if "_" in x else x)
res = pd.wide_to_long(temp, stubnames=['Math', 'Chinese'],
i=['Class', 'Name'], j='Examination',
sep='_', suffix='.+').rename_axis("Subject", axis=1)
res
Subject | Math | Chinese | ||
---|---|---|---|---|
Class | Name | Examination | ||
1 | A | Mid | 70 | 80 |
Final | 95 | 90 | ||
2 | B | Mid | 80 | 85 |
Final | 85 | 75 |
方法二:
temp = pd.wide_to_long(df, stubnames=['Mid', 'Final'],
i=['Class', 'Name'], j='Subject',
sep='_', suffix='.+')
temp.columns.name = "Examiniation"
res = temp.unstack(-1).stack(0)
res
Subject | Chinese | Math | ||
---|---|---|---|---|
Class | Name | Examiniation | ||
1 | A | Final | 90 | 95 |
Mid | 80 | 70 | ||
2 | B | Final | 75 | 85 |
Mid | 85 | 80 |
练一练
第3章中提到的swaplevel()和reorder_levels()是长宽表变形函数吗?为什么?
不是,这两个函数是统一索引内部的层级交换,stack()和unstack()是行列索引间的层级交换,而5.1节的4个长宽表变形函数是索引与列的交换。
练一练
结合unstack()和reorder_levels(),将行索引的第二层插入至列索引的最外层
df = pd.DataFrame(np.ones((4,2)),
index = pd.Index([('A', 'cat', 'big'),
('A', 'dog', 'small'),
('B', 'cat', 'big'),
('B', 'dog', 'small')]),
columns=['col_1', 'col_2'])
df.unstack(1).reorder_levels([1,0], axis=1)
cat | dog | cat | dog | ||
---|---|---|---|---|---|
col_1 | col_1 | col_2 | col_2 | ||
A | big | 1.0 | NaN | 1.0 | NaN |
small | NaN | 1.0 | NaN | 1.0 | |
B | big | 1.0 | NaN | 1.0 | NaN |
small | NaN | 1.0 | NaN | 1.0 |
练一练
请自行构造一个unstack()因违背唯一性要求而报错的例子。
pd.DataFrame([[0,0]], columns=pd.Index([(0,0),(0,0)])).stack()
C:\Users\gyh\AppData\Local\Temp\ipykernel_9256\2556769686.py:1: FutureWarning: reindexing with a non-unique Index is deprecated and will raise in a future version.
pd.DataFrame([[0,0]], columns=pd.Index([(0,0),(0,0)])).stack()
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [14], in <module>
----> 1 pd.DataFrame([[0,0]], columns=pd.Index([(0,0),(0,0)])).stack()
File ~\miniconda3\envs\final\lib\site-packages\pandas\core\frame.py:8217, in DataFrame.stack(self, level, dropna)
8215 result = stack_multiple(self, level, dropna=dropna)
8216 else:
-> 8217 result = stack(self, level, dropna=dropna)
8219 return result.__finalize__(self, method="stack")
File ~\miniconda3\envs\final\lib\site-packages\pandas\core\reshape\reshape.py:566, in stack(frame, level, dropna)
563 level_num = frame.columns._get_level_number(level)
565 if isinstance(frame.columns, MultiIndex):
--> 566 return _stack_multi_columns(frame, level_num=level_num, dropna=dropna)
567 elif isinstance(frame.index, MultiIndex):
568 new_levels = list(frame.index.levels)
File ~\miniconda3\envs\final\lib\site-packages\pandas\core\reshape\reshape.py:750, in _stack_multi_columns(frame, level_num, dropna)
748 chunk = this.loc[:, this.columns[loc]]
749 chunk.columns = level_vals_nan.take(chunk.columns.codes[-1])
--> 750 value_slice = chunk.reindex(columns=level_vals_used).values
751 else:
752 if frame._is_homogeneous_type and is_extension_array_dtype(
753 frame.dtypes.iloc[0]
754 ):
755 # TODO(EA2D): won't need special case, can go through .values
756 # paths below (might change to ._values)
File ~\miniconda3\envs\final\lib\site-packages\pandas\util\_decorators.py:324, in rewrite_axis_style_signature.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
322 @wraps(func)
323 def wrapper(*args, **kwargs) -> Callable[..., Any]:
--> 324 return func(*args, **kwargs)
File ~\miniconda3\envs\final\lib\site-packages\pandas\core\frame.py:4806, in DataFrame.reindex(self, *args, **kwargs)
4804 kwargs.pop("axis", None)
4805 kwargs.pop("labels", None)
-> 4806 return super().reindex(**kwargs)
File ~\miniconda3\envs\final\lib\site-packages\pandas\core\generic.py:4974, in NDFrame.reindex(self, *args, **kwargs)
4971 return self._reindex_multi(axes, copy, fill_value)
4973 # perform the reindex on the axes
-> 4974 return self._reindex_axes(
4975 axes, level, limit, tolerance, method, fill_value, copy
4976 ).__finalize__(self, method="reindex")
File ~\miniconda3\envs\final\lib\site-packages\pandas\core\frame.py:4619, in DataFrame._reindex_axes(self, axes, level, limit, tolerance, method, fill_value, copy)
4617 columns = axes["columns"]
4618 if columns is not None:
-> 4619 frame = frame._reindex_columns(
4620 columns, method, copy, level, fill_value, limit, tolerance
4621 )
4623 index = axes["index"]
4624 if index is not None:
File ~\miniconda3\envs\final\lib\site-packages\pandas\core\frame.py:4664, in DataFrame._reindex_columns(self, new_columns, method, copy, level, fill_value, limit, tolerance)
4651 def _reindex_columns(
4652 self,
4653 new_columns,
(...)
4659 tolerance=None,
4660 ):
4661 new_columns, indexer = self.columns.reindex(
4662 new_columns, method=method, level=level, limit=limit, tolerance=tolerance
4663 )
-> 4664 return self._reindex_with_indexers(
4665 {1: [new_columns, indexer]},
4666 copy=copy,
4667 fill_value=fill_value,
4668 allow_dups=False,
4669 )
File ~\miniconda3\envs\final\lib\site-packages\pandas\core\generic.py:5040, in NDFrame._reindex_with_indexers(self, reindexers, fill_value, copy, allow_dups)
5037 indexer = ensure_platform_int(indexer)
5039 # TODO: speed up on homogeneous DataFrame objects (see _reindex_multi)
-> 5040 new_data = new_data.reindex_indexer(
5041 index,
5042 indexer,
5043 axis=baxis,
5044 fill_value=fill_value,
5045 allow_dups=allow_dups,
5046 copy=copy,
5047 )
5048 # If we've made a copy once, no need to make another one
5049 copy = False
File ~\miniconda3\envs\final\lib\site-packages\pandas\core\internals\managers.py:679, in BaseBlockManager.reindex_indexer(self, new_axis, indexer, axis, fill_value, allow_dups, copy, consolidate, only_slice, use_na_proxy)
677 # some axes don't allow reindexing with dups
678 if not allow_dups:
--> 679 self.axes[axis]._validate_can_reindex(indexer)
681 if axis >= self.ndim:
682 raise IndexError("Requested axis not found in manager")
File ~\miniconda3\envs\final\lib\site-packages\pandas\core\indexes\base.py:4107, in Index._validate_can_reindex(self, indexer)
4105 # trying to reindex on an axis with duplicates
4106 if not self._index_as_unique and len(indexer):
-> 4107 raise ValueError("cannot reindex on an axis with duplicate labels")
ValueError: cannot reindex on an axis with duplicate labels
练一练
在2.2.2节中提到df.T能够将DataFrame转置,而转置是一种特殊的索引变形,请仅用unstack和stack实现转置功能。
n_row_index = df.index.nlevels
n_col_index = df.columns.nlevels
df.stack(list(range(n_col_index))).unstack(list(range(n_row_index)))
A | B | |||
---|---|---|---|---|
cat | dog | cat | dog | |
big | small | big | small | |
col_1 | 1.0 | 1.0 | 1.0 | 1.0 |
col_2 | 1.0 | 1.0 | 1.0 | 1.0 |
一、某连锁店的库存统计¶
某连锁店总共有5个大区仓库(First_Area),每个大区仓库内有若干子区仓库(Child_Area),在data/ch5/storage.csv中记录了从2010年至2017年各子区仓库中不同货品(Type)的存量(Storage),其中“Year”、“First_Area”、“Child_Area”和“Type”已为依次排序状态。
df = pd.read_csv("data/ch5/area_storage.csv")
df.head()
Year | First_Area | Child_Area | Type | Storage | |
---|---|---|---|---|---|
0 | 2010 | A | 4 | product_14 | 1 |
1 | 2010 | A | 4 | product_55 | 2 |
2 | 2010 | A | 7 | product_14 | 1 |
3 | 2010 | A | 7 | product_55 | 2 |
4 | 2010 | A | 8 | product_14 | 31 |
将数据转为如下格式,其中“First_Area”、“Child_Area”和“Type”三列已为依次排序状态。
First_Area | Child_Area | Type | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | A | 4 | product_14 | 1.0 | 2.0 | 1.0 | 5.0 | 6.0 | 3.0 | 4.0 | NaN |
1 | A | 4 | product_15 | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN |
2 | A | 4 | product_27 | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN |
3 | A | 4 | product_37 | NaN | NaN | NaN | NaN | NaN | 1.0 | 10.0 | 6.0 |
4 | A | 4 | product_42 | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN |
将上一问中的结果恢复为原表。
按First_Area分别统计每年的库存量总和,并分别将First_Area和Year作为列索引和行索引,要求分别使用pivot_table()函数与groupby+unstack()两种不同的策略实现,并体会变形和分组之间的联系。
【解答】
1
df_ = df.sort_values(['First_Area','Child_Area','Type'],ignore_index=True)
res = df_.pivot(
index=['First_Area','Child_Area','Type'],
columns='Year', values='Storage'
).reset_index().rename_axis(columns={'Year':''})
res.head()
First_Area | Child_Area | Type | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | A | 4 | product_14 | 1.0 | 2.0 | 1.0 | 5.0 | 6.0 | 3.0 | 4.0 | NaN |
1 | A | 4 | product_15 | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN |
2 | A | 4 | product_27 | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN |
3 | A | 4 | product_37 | NaN | NaN | NaN | NaN | NaN | 1.0 | 10.0 | 6.0 |
4 | A | 4 | product_42 | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN |
2
res_melted = res.melt(id_vars = ['First_Area','Child_Area','Type'],
value_vars = res.columns[-8:],
var_name = 'Year',
value_name = 'Storage').dropna(subset=['Storage'])
res_melted = res_melted[df.columns].sort_values(
["Year",'First_Area','Child_Area','Type'],
ignore_index=True).astype({'Year':'int64', 'Storage':'int64'})
res_melted.equals(df)
True
3
df_method1 = df.pivot_table(index='Year', columns='First_Area',values='Storage', aggfunc='sum')
df_method1
First_Area | A | B | C | D | E |
---|---|---|---|---|---|
Year | |||||
2010 | 8685 | 19707 | 19814 | 10453 | 2890 |
2011 | 6749 | 20330 | 19987 | 10289 | 3271 |
2012 | 7831 | 23145 | 19959 | 10722 | 3376 |
2013 | 11675 | 26846 | 20409 | 11148 | 4046 |
2014 | 9037 | 30860 | 24904 | 11081 | 3280 |
2015 | 8810 | 37127 | 25651 | 9865 | 2571 |
2016 | 10195 | 42470 | 26164 | 9093 | 2548 |
2017 | 10448 | 46104 | 27894 | 9394 | 1614 |
df_method2 = df.groupby(['First_Area', 'Year'])['Storage'].sum()
df_method2 = df_method2.to_frame().unstack(0).droplevel(0,axis=1)
df_method2
First_Area | A | B | C | D | E |
---|---|---|---|---|---|
Year | |||||
2010 | 8685 | 19707 | 19814 | 10453 | 2890 |
2011 | 6749 | 20330 | 19987 | 10289 | 3271 |
2012 | 7831 | 23145 | 19959 | 10722 | 3376 |
2013 | 11675 | 26846 | 20409 | 11148 | 4046 |
2014 | 9037 | 30860 | 24904 | 11081 | 3280 |
2015 | 8810 | 37127 | 25651 | 9865 | 2571 |
2016 | 10195 | 42470 | 26164 | 9093 | 2548 |
2017 | 10448 | 46104 | 27894 | 9394 | 1614 |
二、整理某地区的化石燃料数据¶
在data/ch5/fossil_fuel.txt中存放了某地2020年各月化石燃料资源的消耗量和生产量。
df = pd.read_table("data/ch5/fossil_fuel.txt", header=None)
请将上表整理成如下格式,其中“日期”、“统计类别”和“资源名称”三列为已依次排序状态。
资源名称 | 统计类别 | 单位 | 日期 | 数值 | |
---|---|---|---|---|---|
0 | 01-煤炭 | 消耗量 | 万吨 | 2020-01 | 14.84 |
1 | 02-气化天然气 | 消耗量 | 亿立方米 | 2020-01 | 0.1201 |
2 | 03-油页岩 | 消耗量 | 万吨 | 2020-01 | 13.82 |
3 | 04-石油 | 消耗量 | 万吨 | 2020-01 | 12.86 |
4 | 05-液化天然气 | 消耗量 | 万吨 | 2020-01 | 10.65 |
【解答】
df.iloc[0, 2::2] = df.iloc[0, 1::2]
res = pd.DataFrame(
df.iloc[4:, 1:].values,
index=pd.Series(df.iloc[4:, 0], name="日期"),
columns=pd.MultiIndex.from_arrays(df.iloc[:3, 1:].values, names=df.iloc[:3, 0])
).unstack().reset_index().rename(columns={0:"数值"})
res = res.sort_values(["日期", "统计类别", "资源名称"]).reset_index(drop=True)
res.head()
资源名称 | 统计类别 | 单位 | 日期 | 数值 | |
---|---|---|---|---|---|
0 | 01-煤炭 | 消耗量 | 万吨 | 2020-01 | 14.84 |
1 | 02-气化天然气 | 消耗量 | 亿立方米 | 2020-01 | 0.1201 |
2 | 03-油页岩 | 消耗量 | 万吨 | 2020-01 | 13.82 |
3 | 04-石油 | 消耗量 | 万吨 | 2020-01 | 12.86 |
4 | 05-液化天然气 | 消耗量 | 万吨 | 2020-01 | 10.65 |
三、特殊的wide_to_long()方法¶
从功能上看,melt()方法应当属于wide_to_long()的一种特殊情况,即stubnames只有一类。请使用wide_to_long()生成melt()一节中的df_melted。(提示:对列名增加适当的前缀)
【解答】
df = pd.DataFrame({'Class':[1,2],
'Name':['A', 'B'],
'Chinese':[80, 90],
'Math':[80, 75]})
df = df.rename(columns={'Chinese':'pre_Chinese', 'Math':'pre_Math'})
pd.wide_to_long(df,
stubnames=['pre'],
i = ['Class', 'Name'],
j='Subject',
sep='_',
suffix='.+').reset_index().rename(columns={'pre':'Grade'})
Class | Name | Subject | Grade | |
---|---|---|---|---|
0 | 1 | A | Chinese | 80 |
1 | 1 | A | Math | 80 |
2 | 2 | B | Chinese | 90 |
3 | 2 | B | Math | 75 |