第九章
内容
第九章¶
零、练一练¶
练一练
请构造一个无序类别元素构成的序列,通过cat对象完成增删改查的操作。
s = pd.Series(list("ABCD")).astype("category")
s
0 A
1 B
2 C
3 D
dtype: category
Categories (4, object): ['A', 'B', 'C', 'D']
s.cat.categories
Index(['A', 'B', 'C', 'D'], dtype='object')
s = s.cat.add_categories('E')
s
0 A
1 B
2 C
3 D
dtype: category
Categories (5, object): ['A', 'B', 'C', 'D', 'E']
s = s.cat.remove_categories('E')
s
0 A
1 B
2 C
3 D
dtype: category
Categories (4, object): ['A', 'B', 'C', 'D']
s.cat.rename_categories({'D':'E'})
0 A
1 B
2 C
3 E
dtype: category
Categories (4, object): ['A', 'B', 'C', 'E']
练一练
请构造两组因不同原因(索引不一致和Series类别不一致)而导致无法比较的有序类别Series。
s1 = pd.Series(list("ABCD")).astype("category")
s2 = pd.Series(list("ABCD"), index=[1,2,3,5]).astype("category")
s1 == s2
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [7], in <module>
1 s1 = pd.Series(list("ABCD")).astype("category")
2 s2 = pd.Series(list("ABCD"), index=[1,2,3,5]).astype("category")
----> 3 s1 == s2
File ~\miniconda3\envs\final\lib\site-packages\pandas\core\ops\common.py:70, in _unpack_zerodim_and_defer.<locals>.new_method(self, other)
66 return NotImplemented
68 other = item_from_zerodim(other)
---> 70 return method(self, other)
File ~\miniconda3\envs\final\lib\site-packages\pandas\core\arraylike.py:40, in OpsMixin.__eq__(self, other)
38 @unpack_zerodim_and_defer("__eq__")
39 def __eq__(self, other):
---> 40 return self._cmp_method(other, operator.eq)
File ~\miniconda3\envs\final\lib\site-packages\pandas\core\series.py:5614, in Series._cmp_method(self, other, op)
5611 res_name = ops.get_op_result_name(self, other)
5613 if isinstance(other, Series) and not self._indexed_same(other):
-> 5614 raise ValueError("Can only compare identically-labeled Series objects")
5616 lvalues = self._values
5617 rvalues = extract_array(other, extract_numpy=True, extract_range=True)
ValueError: Can only compare identically-labeled Series objects
s1 = pd.Series(list("ABCD")).astype("category")
s2 = pd.Series(list("ABCD")).astype("category")
s2 = s2.cat.add_categories('E')
s1 == s2
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Input In [8], in <module>
2 s2 = pd.Series(list("ABCD")).astype("category")
3 s2 = s2.cat.add_categories('E')
----> 4 s1 == s2
File ~\miniconda3\envs\final\lib\site-packages\pandas\core\ops\common.py:70, in _unpack_zerodim_and_defer.<locals>.new_method(self, other)
66 return NotImplemented
68 other = item_from_zerodim(other)
---> 70 return method(self, other)
File ~\miniconda3\envs\final\lib\site-packages\pandas\core\arraylike.py:40, in OpsMixin.__eq__(self, other)
38 @unpack_zerodim_and_defer("__eq__")
39 def __eq__(self, other):
---> 40 return self._cmp_method(other, operator.eq)
File ~\miniconda3\envs\final\lib\site-packages\pandas\core\series.py:5620, in Series._cmp_method(self, other, op)
5617 rvalues = extract_array(other, extract_numpy=True, extract_range=True)
5619 with np.errstate(all="ignore"):
-> 5620 res_values = ops.comparison_op(lvalues, rvalues, op)
5622 return self._construct_result(res_values, name=res_name)
File ~\miniconda3\envs\final\lib\site-packages\pandas\core\ops\array_ops.py:269, in comparison_op(left, right, op)
260 raise ValueError(
261 "Lengths must match to compare", lvalues.shape, rvalues.shape
262 )
264 if should_extension_dispatch(lvalues, rvalues) or (
265 (isinstance(rvalues, (Timedelta, BaseOffset, Timestamp)) or right is NaT)
266 and not is_object_dtype(lvalues.dtype)
267 ):
268 # Call the method on lvalues
--> 269 res_values = op(lvalues, rvalues)
271 elif is_scalar(rvalues) and isna(rvalues): # TODO: but not pd.NA?
272 # numpy does not like comparisons vs None
273 if op is operator.ne:
File ~\miniconda3\envs\final\lib\site-packages\pandas\core\ops\common.py:70, in _unpack_zerodim_and_defer.<locals>.new_method(self, other)
66 return NotImplemented
68 other = item_from_zerodim(other)
---> 70 return method(self, other)
File ~\miniconda3\envs\final\lib\site-packages\pandas\core\arrays\categorical.py:150, in _cat_compare_op.<locals>.func(self, other)
148 msg = "Categoricals can only be compared if 'categories' are the same."
149 if not self._categories_match_up_to_permutation(other):
--> 150 raise TypeError(msg)
152 if not self.ordered and not self.categories.equals(other.categories):
153 # both unordered and different order
154 other_codes = recode_for_categories(
155 other.codes, other.categories, self.categories, copy=False
156 )
TypeError: Categoricals can only be compared if 'categories' are the same.
一、统计未出现的类别¶
crosstab()函数是一种特殊的变形函数。在默认参数下,它能够对两个列中元素组合出现的频数进行统计汇总:
df = pd.DataFrame({'A':['a','b','c','a'],
'B':['cat','cat','dog','cat']})
pd.crosstab(df.A, df.B)
B | cat | dog |
---|---|---|
A | ||
a | 2 | 0 |
b | 1 | 0 |
c | 0 | 1 |
但事实上,有些列存储的是分类变量,列中并不一定包含所有的类别,此时如果想要对这些未出现的类别在crosstab()结果中也进行汇总,则可以指定dropna参数为False:
df.B = df.B.astype('category').cat.add_categories('sheep')
pd.crosstab(df.A, df.B, dropna=False)
B | cat | dog | sheep |
---|---|---|---|
A | |||
a | 2 | 0 | 0 |
b | 1 | 0 | 0 |
c | 0 | 1 | 0 |
请实现1个带有dropna参数的my_crosstab()函数来完成上面的功能。
【解答】
def my_crosstab(s1, s2, dropna=True):
if s1.dtype.name == 'category' and not dropna:
idx1 = s1.cat.categories
else:
idx1 = s1.unique()
if s2.dtype.name == 'category' and not dropna:
idx2 = s2.cat.categories
else:
idx2 = s2.unique()
res = pd.DataFrame(
np.zeros((idx1.shape[0], idx2.shape[0])),
index=idx1,
columns=idx2)
for i, j in zip(s1, s2):
res.at[i, j] += 1
res = res.rename_axis(index=s1.name, columns=s2.name).astype('int')
return res
df = pd.DataFrame({'A':['a','b','c','a'],
'B':['cat','cat','dog','cat']})
df.B = df.B.astype('category').cat.add_categories('sheep')
my_crosstab(df.A, df.B)
my_crosstab(df.A, df.B, dropna=False)
B | cat | dog | sheep |
---|---|---|---|
A | |||
a | 2 | 0 | 0 |
b | 1 | 0 | 0 |
c | 0 | 1 | 0 |
二、钻石数据的类别构造¶
现有一份关于钻石的数据集,其中carat、cut、clarity和price分别表示克拉重量、切割质量、纯净度和价格:
df = pd.read_csv('data/ch9/diamonds.csv')
df.head(3)
carat | cut | clarity | price | |
---|---|---|---|---|
0 | 0.23 | Ideal | SI2 | 326 |
1 | 0.21 | Premium | SI1 | 326 |
2 | 0.23 | Good | VS1 | 327 |
分别对df.cut在object类型和category类型下使用nunique()函数,并比较它们的性能。
钻石的切割质量可以分为五个等级,由次到好分别是Fair、Good、Very Good、Premium、Ideal,纯净度有八个等级,由次到好分别是I1、SI2、SI1、VS2、VS1、VVS2、VVS1、IF,请对切割质量按照由好到次的顺序排序,相同切割质量的钻石,按照纯净度进行由次到好的排序。
分别采用两种不同的方法,把cut和clarity这两列按照由好到次的顺序,映射到从0到n-1的整数,其中n表示类别的个数。
对每克拉的价格分别按照分位数(q=[0.2, 0.4, 0.6, 0.8])与[1000, 3500, 5500, 18000]割点进行分箱得到五个类别Very Low、Low、Mid、High、Very High,并把按这两种分箱方法得到的category序列依次添加到原表中。
在第4问中按整数分箱得到的序列中,是否出现了所有的类别?如果存在没有出现的类别请把该类别删除。
对第4问中按分位数分箱得到的序列,求序列元素所在区间的左端点值和长度。
【解答】
1
df = pd.read_csv('data/ch9/diamonds.csv')
s_obj, s_cat = df.cut, df.cut.astype('category')
%timeit -n 30 s_obj.nunique()
2.57 ms ± 161 µs per loop (mean ± std. dev. of 7 runs, 30 loops each)
%timeit -n 30 s_cat.nunique()
403 µs ± 22.5 µs per loop (mean ± std. dev. of 7 runs, 30 loops each)
2
df.cut = df.cut.astype('category').cat.reorder_categories([
'Fair', 'Good', 'Very Good', 'Premium', 'Ideal'],ordered=True)
df.clarity = df.clarity.astype('category').cat.reorder_categories([
'I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF'],ordered=True)
res = df.sort_values(['cut', 'clarity'], ascending=[False, True])
res.head(3)
carat | cut | clarity | price | |
---|---|---|---|---|
315 | 0.96 | Ideal | I1 | 2801 |
535 | 0.96 | Ideal | I1 | 2826 |
551 | 0.97 | Ideal | I1 | 2830 |
res.tail(3)
carat | cut | clarity | price | |
---|---|---|---|---|
47407 | 0.52 | Fair | IF | 1849 |
49683 | 0.52 | Fair | IF | 2144 |
50126 | 0.47 | Fair | IF | 2211 |
3
df.cut = df.cut.cat.reorder_categories(df.cut.cat.categories[::-1])
df.clarity = df.clarity.cat.reorder_categories(df.clarity.cat.categories[::-1])
# 方法一:利用cat.codes
df.cut = df.cut.cat.codes
clarity_cat = df.clarity.cat.categories
# 方法二:使用replace映射
df.clarity = df.clarity.replace(dict(zip(clarity_cat, np.arange(len(clarity_cat)))))
df.head(3)
carat | cut | clarity | price | |
---|---|---|---|---|
0 | 0.23 | 0 | 6 | 326 |
1 | 0.21 | 1 | 5 | 326 |
2 | 0.23 | 3 | 3 | 327 |
4
q = [0, 0.2, 0.4, 0.6, 0.8, 1]
point = [-np.infty, 1000, 3500, 5500, 18000, np.infty]
avg = df.price / df.carat
df['avg_cut'] = pd.cut(avg, bins=point, labels=['Very Low', 'Low', 'Mid', 'High', 'Very High'])
df['avg_qcut'] = pd.qcut(avg, q=q, labels=['Very Low', 'Low', 'Mid', 'High', 'Very High'])
df.head()
carat | cut | clarity | price | avg_cut | avg_qcut | |
---|---|---|---|---|---|---|
0 | 0.23 | 0 | 6 | 326 | Low | Very Low |
1 | 0.21 | 1 | 5 | 326 | Low | Very Low |
2 | 0.23 | 3 | 3 | 327 | Low | Very Low |
3 | 0.29 | 1 | 4 | 334 | Low | Very Low |
4 | 0.31 | 3 | 6 | 335 | Low | Very Low |
5
df.avg_cut.unique()
['Low', 'Mid', 'High']
Categories (5, object): ['Very Low' < 'Low' < 'Mid' < 'High' < 'Very High']
df.avg_cut.cat.categories
Index(['Very Low', 'Low', 'Mid', 'High', 'Very High'], dtype='object')
df.avg_cut = df.avg_cut.cat.remove_categories(['Very Low', 'Very High'])
df.avg_cut.head(3)
0 Low
1 Low
2 Low
Name: avg_cut, dtype: category
Categories (3, object): ['Low' < 'Mid' < 'High']
6
interval_avg = pd.IntervalIndex(pd.qcut(avg, q=q))
interval_avg.right.to_series().reset_index(drop=True).head(3)
0 2295.0
1 2295.0
2 2295.0
dtype: float64
interval_avg.left.to_series().reset_index(drop=True).head(3)
0 1051.162
1 1051.162
2 1051.162
dtype: float64
interval_avg.length.to_series().reset_index(drop=True).head(3)
0 1243.838
1 1243.838
2 1243.838
dtype: float64
三、有序类别下的逻辑斯蒂回归¶
逻辑斯蒂回归是经典的分类模型,它将无序的类别作为目标值来进行模型的参数训练。对于有序类别的目标值,虽然我们仍然可以将其作为无序类别来输入模型,但这样做显然损失了类别之间的顺序关系,而有序类别下的逻辑斯蒂回归(Ordinal Logistic Regression)就能够对此类问题进行处理。
设样本数据为\(X=(x_1,...,x_n),x_i\in R^d(1\leq i \leq n)\),\(d\)是数据特征的维度,标签值为\(y=(y_1,...,y_n), y_i\in \{C_1,...,C_k\}(1\leq i\leq n)\),\(C_i\)是有序类别,\(k\)是有序类别的数量,记\(P(y\leq C_0\vert x_i)=0(1\leq i\leq n)\)。设参数\(w=(w_1,...,w_d),\theta=(\theta_0,\theta_1,...,\theta_k)\),其中\(\theta_0=-\infty,\theta_k=\infty\),则OLR模型为:
由此可得已知\(x_i\)的情况下,\(y_i\)取\(C_j\)的概率为:
从而对数似然方程为:
mord包能够对上述OLR模型的参数\(w\)和参数\(\theta\)进行求解,可以使用conda install -c conda-forge mord命令下载。
首先,我们读取1个目标值为4个有序类别的数据集:
df = pd.read_csv("data/ch9/olr.csv")
df.head()
X1 | X2 | X3 | X4 | y | |
---|---|---|---|---|---|
0 | -0.937269 | -0.991073 | -0.535304 | 2.004688 | Fair |
1 | -2.491539 | -0.707034 | -1.324849 | 0.782425 | Bad |
2 | -0.149041 | -0.205646 | 0.101816 | 0.530163 | Good |
3 | 0.275750 | 0.320831 | 0.699250 | 1.716456 | Marvellous |
4 | 0.527534 | -0.774606 | 0.614202 | -1.236699 | Fair |
已知y中元素的大小关系为“Bad”\(\leq\)“Fair”\(\leq\)“Good”\(\leq\)“Marvellous”,此时先将y转换至有序类别编号:
df.y = df.y.astype("category").cat.reorder_categories(['Bad', 'Fair',
'Good', 'Marvellous'], ordered=True).cat.codes
df.y.head()
0 1
1 0
2 2
3 3
4 1
Name: y, dtype: int8
从mord中导入LogisticAT进行参数训练:
from mord import LogisticAT
clf = LogisticAT()
X = df.iloc[:,:4]
clf.fit(X, df.y)
LogisticAT()
此时,我们就能通过clf.coef_和clf.theta_分别访问\(w\)和\(\theta\):
clf.coef_ # w1, w2, w3, w4
array([1.73109827, 2.82370372, 3.09811977, 2.26793622])
clf.theta_ # 每一个类别的theta_j
array([-4.11366804, -0.15825334, 3.49922219])
从而就能通过predict_proba()方法对每一个样本的特征\(x_i\)计算\(y_i\)属于各个类别的概率,取\(\arg\max\)后即得到输出的类别。
res = clf.predict_proba(X).argmax(1)[:5] # 取前五个
res
array([1, 0, 2, 3, 1], dtype=int64)
现有1个新的样本点\(x_{new}=[-0.5, 0.3, 0.4, 0.1]\)需要进行预测,请问它的predict_proba()概率预测结果是如何得到的?请写出计算过程。
x_new = np.array([[-0.5, 0.3, 0.4, 0.1]])
clf.predict_proba(x_new)
array([[0.00382917, 0.16333546, 0.71894644, 0.11388892]])
数据集data/ch9/car.csv中,含有6个与汽车有关的变量:“buying”、“maint”、“doors”、“persons”、“lug_boot”和“safety”分别指购买价格、保养价格、车门数量、车载人数、车身大小和安全等级,需要对汽车满意度指标“accpet”进行建模。汽车满意度是有序类别变量,由低到高可分为“unacc”、“acc”、“good”和“vgood”4个类别。请利用有序类别的逻辑斯蒂回归,利用6个有关变量来构建关于汽车满意度的分类模型。
df = pd.read_csv("data/ch9/car.csv")
df.head()
buying | maint | doors | persons | lug_boot | safety | accept | |
---|---|---|---|---|---|---|---|
0 | vhigh | vhigh | 2 | 2 | small | low | unacc |
1 | vhigh | vhigh | 2 | 2 | small | med | unacc |
2 | vhigh | vhigh | 2 | 2 | small | high | unacc |
3 | vhigh | vhigh | 2 | 2 | med | low | unacc |
4 | vhigh | vhigh | 2 | 2 | med | med | unacc |
【解答】
1
x = x_new[0]
theta_ = np.r_[-np.inf, clf.theta_, np.inf]
p = 1 / (1 + np.exp(-theta_+x.dot(clf.coef_)))
p = np.diff(p)
p
array([0.00382917, 0.16333546, 0.71894644, 0.11388892])
2
df = pd.read_csv("data/ch9/car.csv")
X = pd.get_dummies(df.iloc[:, :-1])
y = df.accept.astype("category").cat.reorder_categories(
["unacc", "acc", "good", "vgood"], ordered=True).cat.codes
clf = LogisticAT()
clf.fit(X, y)
LogisticAT()