The source code throws an error here

Source: 2-6 Data processing and model graph construction (2)

慕数据4013138

2019-11-26

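Teacher, a couple of days ago when I ran the Chapter 7 source code, you said it might be a decoding error, so I went back through the Chapter 2 code and tutorial. For the encoding/decoding error I hit back then, the workaround I tried was this one, but I did not understand the principle behind it. I would like to ask you about that, and also about how to fix the encoding/decoding problem in Chapter 7.
The Chapter 2 problem is: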

UnicodeDecodeError: 'ascii' codec can't decode byte 0x8b in position 6: ordinal not in range(128)

The actual error from the code on Colab is:
Editable link: https://colab.research.google.com/drive/1_82Sy4y6JiOF3SFLcDg2ByMO4I8ZAd86

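Link for the text-rnn7 notebook from my previous question, which reports the same encoding/decoding error:
https://colab.research.google.com/drive/1_82Sy4y6JiOF3SFLcDg2ByMO4I8ZAd86
If the link above does not work, this one can be used: https://drive.google.com/file/d/1dZOvfOHn40gN06Pu1HnaN37xdtZC3pVR/view?usp=sharing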
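Teacher, I do not mean to take up your time. I really have tried many approaches myself and still cannot solve it, so I would be grateful for some guidance.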


2 answers

正十七

2019-12-15

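Thinking about it more carefully, this still looks like a mismatch between encodings. In Python 3 strings are Unicode by default, and the two files you read in may be of different types. You could try converting both of them to UTF-8,

using str.encode("utf-8").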
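
As a rough illustration of the advice above (not the course's own code), the sketch below shows the `str`/`bytes` round trip in Python 3, reproduces the same kind of `UnicodeDecodeError` by decoding non-ASCII bytes with the `ascii` codec, and reads a file with an explicit `encoding="utf-8"`. The sample text and the temporary file `sample.txt` are made up for the example; whether the files in the notebook are actually UTF-8 is an assumption that would need checking.

```python
import os
import tempfile

text = "我是谁"

# str -> bytes (encode) and bytes -> str (decode) are inverse operations.
data = text.encode("utf-8")
decoded = data.decode("utf-8")

# Decoding non-ASCII bytes with the ascii codec raises the same
# kind of error as the one reported in the question.
try:
    data.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

# When reading or writing files, pass the encoding explicitly instead of
# relying on the platform default. "sample.txt" is a hypothetical file.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(text)
with open(path, "r", encoding="utf-8") as f:
    round_tripped = f.read()

print(decoded == text, round_tripped == text)
print(type(ascii_error).__name__)
```

If both files are opened with the same explicit encoding, the strings read from them are ordinary `str` objects and can be compared or concatenated without further conversion.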

There is also a handy tool: in Python 3, for data that is not a Unicode string (that is, bytes), you can use

import chardet
chardet.detect(byte_string)  # expects bytes or bytearray, not str

and the output tells you which encoding the data is in.

Python 3.7.5rc1 (default, Oct  2 2019, 04:19:31) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> a = "我是谁"
>>> a.decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'decode'
>>> a.encode("utf-8")
b'\xe6\x88\x91\xe6\x98\xaf\xe8\xb0\x81'
>>> a
'我是谁'
>>> a
'我是谁'
>>> type(a)
<class 'str'>
>>> import chardet
>>> chardet.detect(a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3/dist-packages/chardet/__init__.py", line 34, in detect
    '{0}'.format(type(byte_str)))
TypeError: Expected object of type bytes or bytearray, got: <class 'str'>
>>> a1 = a.encode("utf-8")
>>> chardet.detect(a1)
{'encoding': 'utf-8', 'confidence': 0.87625, 'language': ''}
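
Since `chardet.detect` only returns a guess with a confidence score, a standard-library alternative is to try a short list of candidate encodings in order. The sketch below is only an illustration: the helper name `decode_with_fallback` and the candidate list `("utf-8", "gbk")` are assumptions made for this example, not something taken from the course code.

```python
def decode_with_fallback(raw, candidates=("utf-8", "gbk")):
    """Return (text, encoding) for the first candidate that decodes raw cleanly."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            # This candidate does not match the bytes; try the next one.
            continue
    raise ValueError("none of the candidate encodings could decode the data")


utf8_bytes = "我是谁".encode("utf-8")
gbk_bytes = "我是谁".encode("gbk")

print(decode_with_fallback(utf8_bytes))
print(decode_with_fallback(gbk_bytes))
```

UTF-8 is tried first on purpose: it rejects most byte sequences that are not valid UTF-8, whereas a more permissive codec placed first could silently decode the wrong bytes into garbled text.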