第九章

零、练一练

练一练

请构造一个无序类别元素构成的序列,通过cat对象完成增删改查的操作。

s = pd.Series(list("ABCD")).astype("category")
s
0    A
1    B
2    C
3    D
dtype: category
Categories (4, object): ['A', 'B', 'C', 'D']
s.cat.categories
Index(['A', 'B', 'C', 'D'], dtype='object')
s = s.cat.add_categories('E')
s
0    A
1    B
2    C
3    D
dtype: category
Categories (5, object): ['A', 'B', 'C', 'D', 'E']
s = s.cat.remove_categories('E')
s
0    A
1    B
2    C
3    D
dtype: category
Categories (4, object): ['A', 'B', 'C', 'D']
s.cat.rename_categories({'D':'E'})
0    A
1    B
2    C
3    E
dtype: category
Categories (4, object): ['A', 'B', 'C', 'E']

练一练

请构造两组因不同原因(索引不一致和Series类别不一致)而导致无法比较的有序类别Series。

s1 = pd.Series(list("ABCD")).astype("category")
s2 = pd.Series(list("ABCD"), index=[1,2,3,5]).astype("category")
s1 == s2
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [7], in <module>
      1 s1 = pd.Series(list("ABCD")).astype("category")
      2 s2 = pd.Series(list("ABCD"), index=[1,2,3,5]).astype("category")
----> 3 s1 == s2

File ~\miniconda3\envs\final\lib\site-packages\pandas\core\ops\common.py:70, in _unpack_zerodim_and_defer.<locals>.new_method(self, other)
     66             return NotImplemented
     68 other = item_from_zerodim(other)
---> 70 return method(self, other)

File ~\miniconda3\envs\final\lib\site-packages\pandas\core\arraylike.py:40, in OpsMixin.__eq__(self, other)
     38 @unpack_zerodim_and_defer("__eq__")
     39 def __eq__(self, other):
---> 40     return self._cmp_method(other, operator.eq)

File ~\miniconda3\envs\final\lib\site-packages\pandas\core\series.py:5614, in Series._cmp_method(self, other, op)
   5611 res_name = ops.get_op_result_name(self, other)
   5613 if isinstance(other, Series) and not self._indexed_same(other):
-> 5614     raise ValueError("Can only compare identically-labeled Series objects")
   5616 lvalues = self._values
   5617 rvalues = extract_array(other, extract_numpy=True, extract_range=True)

ValueError: Can only compare identically-labeled Series objects
s1 = pd.Series(list("ABCD")).astype("category")
s2 = pd.Series(list("ABCD")).astype("category")
s2 = s2.cat.add_categories('E')
s1 == s2
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [8], in <module>
      2 s2 = pd.Series(list("ABCD")).astype("category")
      3 s2 = s2.cat.add_categories('E')
----> 4 s1 == s2

File ~\miniconda3\envs\final\lib\site-packages\pandas\core\ops\common.py:70, in _unpack_zerodim_and_defer.<locals>.new_method(self, other)
     66             return NotImplemented
     68 other = item_from_zerodim(other)
---> 70 return method(self, other)

File ~\miniconda3\envs\final\lib\site-packages\pandas\core\arraylike.py:40, in OpsMixin.__eq__(self, other)
     38 @unpack_zerodim_and_defer("__eq__")
     39 def __eq__(self, other):
---> 40     return self._cmp_method(other, operator.eq)

File ~\miniconda3\envs\final\lib\site-packages\pandas\core\series.py:5620, in Series._cmp_method(self, other, op)
   5617 rvalues = extract_array(other, extract_numpy=True, extract_range=True)
   5619 with np.errstate(all="ignore"):
-> 5620     res_values = ops.comparison_op(lvalues, rvalues, op)
   5622 return self._construct_result(res_values, name=res_name)

File ~\miniconda3\envs\final\lib\site-packages\pandas\core\ops\array_ops.py:269, in comparison_op(left, right, op)
    260         raise ValueError(
    261             "Lengths must match to compare", lvalues.shape, rvalues.shape
    262         )
    264 if should_extension_dispatch(lvalues, rvalues) or (
    265     (isinstance(rvalues, (Timedelta, BaseOffset, Timestamp)) or right is NaT)
    266     and not is_object_dtype(lvalues.dtype)
    267 ):
    268     # Call the method on lvalues
--> 269     res_values = op(lvalues, rvalues)
    271 elif is_scalar(rvalues) and isna(rvalues):  # TODO: but not pd.NA?
    272     # numpy does not like comparisons vs None
    273     if op is operator.ne:

File ~\miniconda3\envs\final\lib\site-packages\pandas\core\ops\common.py:70, in _unpack_zerodim_and_defer.<locals>.new_method(self, other)
     66             return NotImplemented
     68 other = item_from_zerodim(other)
---> 70 return method(self, other)

File ~\miniconda3\envs\final\lib\site-packages\pandas\core\arrays\categorical.py:150, in _cat_compare_op.<locals>.func(self, other)
    148 msg = "Categoricals can only be compared if 'categories' are the same."
    149 if not self._categories_match_up_to_permutation(other):
--> 150     raise TypeError(msg)
    152 if not self.ordered and not self.categories.equals(other.categories):
    153     # both unordered and different order
    154     other_codes = recode_for_categories(
    155         other.codes, other.categories, self.categories, copy=False
    156     )

TypeError: Categoricals can only be compared if 'categories' are the same.

一、统计未出现的类别

crosstab()函数是一种特殊的变形函数。在默认参数下,它能够对两个列中元素组合出现的频数进行统计汇总:

df = pd.DataFrame({'A':['a','b','c','a'],
                   'B':['cat','cat','dog','cat']})
pd.crosstab(df.A, df.B)
B cat dog
A
a 2 0
b 1 0
c 0 1

但事实上,有些列存储的是分类变量,列中并不一定包含所有的类别,此时如果想要对这些未出现的类别在crosstab()结果中也进行汇总,则可以指定dropna参数为False:

df.B = df.B.astype('category').cat.add_categories('sheep')
pd.crosstab(df.A, df.B, dropna=False)
B cat dog sheep
A
a 2 0 0
b 1 0 0
c 0 1 0

请实现1个带有dropna参数的my_crosstab()函数来完成上面的功能。

【解答】
def my_crosstab(s1, s2, dropna=True):
    if s1.dtype.name == 'category' and not dropna:
        idx1 = s1.cat.categories
    else:
        idx1 = s1.unique()
    if s2.dtype.name == 'category' and not dropna:
        idx2 = s2.cat.categories
    else:
        idx2 = s2.unique()
    res = pd.DataFrame(
        np.zeros((idx1.shape[0], idx2.shape[0])),
        index=idx1,
        columns=idx2)
    for i, j in zip(s1, s2):
        res.at[i, j] += 1
    res = res.rename_axis(index=s1.name, columns=s2.name).astype('int')
    return res
df = pd.DataFrame({'A':['a','b','c','a'],
                   'B':['cat','cat','dog','cat']})
df.B = df.B.astype('category').cat.add_categories('sheep')
my_crosstab(df.A, df.B)
my_crosstab(df.A, df.B, dropna=False)
B cat dog sheep
A
a 2 0 0
b 1 0 0
c 0 1 0

二、钻石数据的类别构造

现有一份关于钻石的数据集,其中carat、cut、clarity和price分别表示克拉重量、切割质量、纯净度和价格:

df = pd.read_csv('data/ch9/diamonds.csv')
df.head(3)
carat cut clarity price
0 0.23 Ideal SI2 326
1 0.21 Premium SI1 326
2 0.23 Good VS1 327
  • 分别对df.cut在object类型和category类型下使用nunique()函数,并比较它们的性能。

  • 钻石的切割质量可以分为五个等级,由次到好分别是Fair、Good、Very Good、Premium、Ideal,纯净度有八个等级,由次到好分别是I1、SI2、SI1、VS2、VS1、VVS2、VVS1、IF,请对切割质量按照由好到次的顺序排序,相同切割质量的钻石,按照纯净度进行由次到好的排序。

  • 分别采用两种不同的方法,把cut和clarity这两列按照由好到次的顺序,映射到从0到n-1的整数,其中n表示类别的个数。

  • 对每克拉的价格分别按照分位数(q=[0.2, 0.4, 0.6, 0.8])与[1000, 3500, 5500, 18000]割点进行分箱得到五个类别Very Low、Low、Mid、High、Very High,并把按这两种分箱方法得到的category序列依次添加到原表中。

  • 在第4问中按整数分箱得到的序列中,是否出现了所有的类别?如果存在没有出现的类别请把该类别删除。

  • 对第4问中按分位数分箱得到的序列,求序列元素所在区间的左端点值和长度。

【解答】
  • 1

df = pd.read_csv('data/ch9/diamonds.csv')
s_obj, s_cat = df.cut, df.cut.astype('category')
%timeit -n 30 s_obj.nunique()
2.57 ms ± 161 µs per loop (mean ± std. dev. of 7 runs, 30 loops each)
%timeit -n 30 s_cat.nunique()
403 µs ± 22.5 µs per loop (mean ± std. dev. of 7 runs, 30 loops each)
  • 2

df.cut = df.cut.astype('category').cat.reorder_categories([
        'Fair', 'Good', 'Very Good', 'Premium', 'Ideal'],ordered=True)
df.clarity = df.clarity.astype('category').cat.reorder_categories([
        'I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF'],ordered=True)
res = df.sort_values(['cut', 'clarity'], ascending=[False, True])
res.head(3)
carat cut clarity price
315 0.96 Ideal I1 2801
535 0.96 Ideal I1 2826
551 0.97 Ideal I1 2830
res.tail(3)
carat cut clarity price
47407 0.52 Fair IF 1849
49683 0.52 Fair IF 2144
50126 0.47 Fair IF 2211
  • 3

df.cut = df.cut.cat.reorder_categories(df.cut.cat.categories[::-1])
df.clarity = df.clarity.cat.reorder_categories(df.clarity.cat.categories[::-1])
# 方法一:利用cat.codes
df.cut = df.cut.cat.codes
clarity_cat = df.clarity.cat.categories
# 方法二:使用replace映射
df.clarity = df.clarity.replace(dict(zip(clarity_cat, np.arange(len(clarity_cat)))))
df.head(3)
carat cut clarity price
0 0.23 0 6 326
1 0.21 1 5 326
2 0.23 3 3 327
  • 4

q = [0, 0.2, 0.4, 0.6, 0.8, 1]
point = [-np.infty, 1000, 3500, 5500, 18000, np.infty]
avg = df.price / df.carat
df['avg_cut'] = pd.cut(avg, bins=point, labels=['Very Low', 'Low', 'Mid', 'High', 'Very High'])
df['avg_qcut'] = pd.qcut(avg, q=q, labels=['Very Low', 'Low', 'Mid', 'High', 'Very High'])
df.head()
carat cut clarity price avg_cut avg_qcut
0 0.23 0 6 326 Low Very Low
1 0.21 1 5 326 Low Very Low
2 0.23 3 3 327 Low Very Low
3 0.29 1 4 334 Low Very Low
4 0.31 3 6 335 Low Very Low
  • 5

df.avg_cut.unique()
['Low', 'Mid', 'High']
Categories (5, object): ['Very Low' < 'Low' < 'Mid' < 'High' < 'Very High']
df.avg_cut.cat.categories
Index(['Very Low', 'Low', 'Mid', 'High', 'Very High'], dtype='object')
df.avg_cut = df.avg_cut.cat.remove_categories(['Very Low', 'Very High'])
df.avg_cut.head(3)
0    Low
1    Low
2    Low
Name: avg_cut, dtype: category
Categories (3, object): ['Low' < 'Mid' < 'High']
  • 6

interval_avg = pd.IntervalIndex(pd.qcut(avg, q=q))
interval_avg.right.to_series().reset_index(drop=True).head(3)
0    2295.0
1    2295.0
2    2295.0
dtype: float64
interval_avg.left.to_series().reset_index(drop=True).head(3)
0    1051.162
1    1051.162
2    1051.162
dtype: float64
interval_avg.length.to_series().reset_index(drop=True).head(3)
0    1243.838
1    1243.838
2    1243.838
dtype: float64

三、有序类别下的逻辑斯蒂回归

逻辑斯蒂回归是经典的分类模型,它将无序的类别作为目标值来进行模型的参数训练。对于有序类别的目标值,虽然我们仍然可以将其作为无序类别来输入模型,但这样做显然损失了类别之间的顺序关系,而有序类别下的逻辑斯蒂回归(Ordinal Logistic Regression)就能够对此类问题进行处理。

设样本数据为\(X=(x_1,...,x_n),x_i\in R^d(1\leq i \leq n)\)\(d\)是数据特征的维度,标签值为\(y=(y_1,...,y_n), y_i\in \{C_1,...,C_k\}(1\leq i\leq n)\)\(C_i\)是有序类别,\(k\)是有序类别的数量,记\(P(y\leq C_0\vert x_i)=0(1\leq i\leq n)\)。设参数\(w=(w_1,...,w_d),\theta=(\theta_0,\theta_1,...,\theta_k)\),其中\(\theta_0=-\infty,\theta_k=\infty\),则OLR模型为:

\[P(y_i\leq C_j\vert x_i)=\frac{1}{1+\exp^{-(\theta_j-w^Tx_i)}}, 1\leq i \leq n, 1\leq j \leq k\]

由此可得已知\(x_i\)的情况下,\(y_i\)\(C_j\)的概率为:

\[P(y_i=C_j\vert x_i) = P(y_i\leq C_j\vert x_i)-P(y_i\leq C_{j-1}\vert x_i), 1\leq j\leq k\]

从而对数似然方程为:

\[\log L(w,\theta\vert X,y)=\sum_{i=1}^n\sum_{j=1}^k \mathbb{I}_{\{y_i=C_j\}}\log [\frac{1}{1+\exp^{-(\theta_j-w^Tx_i)}}-\frac{1}{1+\exp^{-(\theta_{j-1}-w^Tx_i)}}]\]

mord包能够对上述OLR模型的参数\(w\)和参数\(\theta\)进行求解,可以使用conda install -c conda-forge mord命令下载。

首先,我们读取1个目标值为4个有序类别的数据集:

df = pd.read_csv("data/ch9/olr.csv")
df.head()
X1 X2 X3 X4 y
0 -0.937269 -0.991073 -0.535304 2.004688 Fair
1 -2.491539 -0.707034 -1.324849 0.782425 Bad
2 -0.149041 -0.205646 0.101816 0.530163 Good
3 0.275750 0.320831 0.699250 1.716456 Marvellous
4 0.527534 -0.774606 0.614202 -1.236699 Fair

已知y中元素的大小关系为“Bad”\(\leq\)“Fair”\(\leq\)“Good”\(\leq\)“Marvellous”,此时先将y转换至有序类别编号:

df.y = df.y.astype("category").cat.reorder_categories(['Bad', 'Fair',
    'Good', 'Marvellous'], ordered=True).cat.codes
df.y.head()
0    1
1    0
2    2
3    3
4    1
Name: y, dtype: int8

从mord中导入LogisticAT进行参数训练:

from mord import LogisticAT
clf = LogisticAT()
X = df.iloc[:,:4]
clf.fit(X, df.y)
LogisticAT()

此时,我们就能通过clf.coef_和clf.theta_分别访问\(w\)\(\theta\)

clf.coef_ # w1, w2, w3, w4
array([1.73109827, 2.82370372, 3.09811977, 2.26793622])
clf.theta_ # 每一个类别的theta_j
array([-4.11366804, -0.15825334,  3.49922219])

从而就能通过predict_proba()方法对每一个样本的特征\(x_i\)计算\(y_i\)属于各个类别的概率,取\(\arg\max\)后即得到输出的类别。

res = clf.predict_proba(X).argmax(1)[:5] # 取前五个
res
array([1, 0, 2, 3, 1], dtype=int64)
  • 现有1个新的样本点\(x_{new}=[-0.5, 0.3, 0.4, 0.1]\)需要进行预测,请问它的predict_proba()概率预测结果是如何得到的?请写出计算过程。

x_new = np.array([[-0.5, 0.3, 0.4, 0.1]])
clf.predict_proba(x_new)
array([[0.00382917, 0.16333546, 0.71894644, 0.11388892]])
  • 数据集data/ch9/car.csv中,含有6个与汽车有关的变量:“buying”、“maint”、“doors”、“persons”、“lug_boot”和“safety”分别指购买价格、保养价格、车门数量、车载人数、车身大小和安全等级,需要对汽车满意度指标“accpet”进行建模。汽车满意度是有序类别变量,由低到高可分为“unacc”、“acc”、“good”和“vgood”4个类别。请利用有序类别的逻辑斯蒂回归,利用6个有关变量来构建关于汽车满意度的分类模型。

df = pd.read_csv("data/ch9/car.csv")
df.head()
buying maint doors persons lug_boot safety accept
0 vhigh vhigh 2 2 small low unacc
1 vhigh vhigh 2 2 small med unacc
2 vhigh vhigh 2 2 small high unacc
3 vhigh vhigh 2 2 med low unacc
4 vhigh vhigh 2 2 med med unacc
【解答】
  • 1

x = x_new[0]
theta_ = np.r_[-np.inf, clf.theta_, np.inf]
p = 1 / (1 + np.exp(-theta_+x.dot(clf.coef_)))
p = np.diff(p)
p
array([0.00382917, 0.16333546, 0.71894644, 0.11388892])
  • 2

df = pd.read_csv("data/ch9/car.csv")
X = pd.get_dummies(df.iloc[:, :-1])
y = df.accept.astype("category").cat.reorder_categories(
    ["unacc", "acc", "good", "vgood"], ordered=True).cat.codes
clf = LogisticAT()
clf.fit(X, y)
LogisticAT()