Skip to main content

[转] Python的转码

字符串内码的转换,是开发中经常遇到的问题。
在Java中,我们可以先对某个String调用getByte(),由结果生成新String的办法来转码,也可以用NIO包里面的Charset来实现
在Python中,可以对String调用decode和encode方法来实现转码。
比如,若要将某个String对象s从gbk内码转换为UTF-8,可以如下操作
s.decode('gbk').encode('utf-8')

可是,在实际开发中,我发现,这种办法经常会出现异常:
UnicodeDecodeError: 'gbk' codec can't decode bytes in position 30664-30665: illegal multibyte sequence

这 是因为遇到了非法字符——尤其是在某些用C/C++编写的程序中,全角空格往往有多种不同的实现方式,比如\xa3\xa0,或者\xa4\x57,这些 字符,看起来都是全角空格,但它们并不是“合法”的全角空格(真正的全角空格是\xa1\xa1),因此在转码的过程中出现了异常。

这样的问题很让人头疼,因为只要字符串中出现了一个非法字符,整个字符串——有时候,就是整篇文章——就都无法转码。

幸运的是,tiny找到了完美的解决办法(我因此被批评看文档不仔细,汗啊……)
s.decode('gbk', 'ignore').encode('utf-8')

因为decode的函数原型是decode([encoding], [errors='strict']),可以用第二个参数控制错误处理的策略,默认的参数就是strict,代表遇到非法字符时抛出异常;
如果设置为ignore,则会忽略非法字符;
如果设置为replace,则会用?取代非法字符;
如果设置为xmlcharrefreplace,则使用XML的字符引用。

Comments

Popular posts from this blog

Determine Perspective Lines With Off-page Vanishing Point

In perspective drawing, a vanishing point represents a group of parallel lines, in other words, a direction. For any point on the paper, if we want a line towards the same direction (in the 3d space), we simply draw a line through it and the vanishing point. But sometimes the vanishing point is too far away, such that it is outside the paper/canvas. In this example, we have a point P and two perspective lines L1 and L2. The vanishing point VP is naturally the intersection of L1 and L2. The task is to draw a line through P and VP, without having VP on the paper. I am aware of a few traditional solutions: 1. Use extra pieces of paper such that we can extend L1 and L2 until we see VP. 2. Draw everything in a smaller scale, such that we can see both P and VP on the paper. Draw the line and scale everything back. 3. Draw a perspective grid using the Brewer Method. #1 and #2 might be quite practical. #3 may not guarantee a solution, unless we can measure distances/p...

Installing Linux on Surface Pro 1g

Windows 10 will soon reach its end of life, and my 1-gen Surface Pro is not supported by Windows 11. I (finally) decided to install Linux to it. Fortunately, it's a not-so-easy nice adventure: The device has only one USB port, so I have to bring back my 10+-year old USB hub. My live USB drive cannot boot directly, I have to disable Secure Boot first, by holding Volume Up during boot. I think years ago I learned that booting on USB might not work through a USB hub, but fortunatelly it worked well with my setup. This is done by holding Volume Down during boot. Wifi adapter was detected in the live Linux environment, but not functional. And I don't have a USB-Ethernet adapter. Luckily, nowadays we have USB-tethering from Android phones, which works out-of-the-box. Originally I planned to following this guide to set up root on ZFS, however, the system froze when building the ZFS kernel module. Then I decided to just use EXT4, yet I still learned a lot from the guide about disk par...

Fix Google Security Code

Google Security Code (http://g.co/sc) is one type of 2-step verification. This is particularly useful when security keys and passkeys are not available. I have been using it in my LXC containers, until today I found out that it stopped working. It just kept saying "The code is invalid". It is easy to rule out some factors: The code works on other browsers on my laptop. The code works on other devices that are directly connected to the router. So it appears that Google also checks IP addresses besides the security code. Recently I have IPv6 enabled, so most devices that are directly connected to the router have both IPv4 and IPv6 addresses. But  I only enabled IPv4 for my LXC containers. So I guess when a code is generated by device A and used by device B, Google should be able to check that device A and device B are closely located. But in my case, IPv6 address appears on device A but not on device B, which may look suspicious. To fix the problem, I just needed to disable IPv...