
Making pdftohtml Support Chinese

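Before I had fbgs installed, I found that pdftohtml and pdftotext can convert PDFs to HTML and plain text.
They have some trouble with Chinese, though: for a few PDFs that I made by converting .doc files with Acrobat, pdftohtml recognizes the Chinese fine, but for the ones I wrote myself in LaTeX it does not.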

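The -c option is rather funny: when pdftohtml cannot recognize the Chinese, it turns that text into images and puts them in the web page.
There is also an -enc option, which I don't really know how to use; it keeps saying couldn't find unicodeMap.. I "guessed" that UTF-8 is accepted, but the conversion still does not come out right.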

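My pdftohtml originally came from poppler-utils. I later removed it and installed the xpdf family of packages plus a standalone pdftohtml.
That changed a few things. The PDFs that pdftohtml and pdftotext used to convert correctly with no options (the ones said to be made with Acrobat) now require -enc. Moreover, pdftotext works with -enc GBK, while pdftohtml produces garbled text with -enc GBK and only works with -enc EUC-CN.
In addition, /etc/xpdf contains some unicodeMap files that define several unicodeMaps, but they still did not help with my own PDFs.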
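The encoding trial and error described above can be scripted. This is only a minimal sketch, assuming pdftotext is on the PATH and that the three encoding names mentioned in this post are available to it; `try_encodings` and the output file names are made up for illustration and are not part of xpdf or poppler.

```shell
#!/bin/sh
# Try each candidate output encoding on one PDF and report the result.
# try_encodings is a hypothetical helper, not part of xpdf or poppler.
try_encodings() {
    pdf="$1"
    for enc in UTF-8 GBK EUC-CN; do
        # Hypothetical naming scheme: input.pdf -> input-GBK.txt
        out="${pdf%.pdf}-$enc.txt"
        if pdftotext -enc "$enc" "$pdf" "$out" 2>/dev/null; then
            echo "$enc: wrote $out"
        else
            echo "$enc: failed"
        fi
    done
}

# Usage: pass the PDF as the first argument, e.g. sh try-enc.sh input.pdf
if [ -n "${1:-}" ]; then
    try_encodings "$1"
fi
```

A "wrote" line only means pdftotext exited successfully; the file still has to be opened to see whether the Chinese text is actually correct. The same loop could be pointed at pdftohtml, which also takes -enc.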

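In the end I switched back to poppler-utils; I still find the pdftohtml it provides works better. The -enc option works there too, which is presumably thanks to one of the xpdf packages. (Note that I did not remove the xpdf packages completely, only the ones that conflicted with poppler-utils.)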

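Today I found the cmap package, which is said to make the text in the generated PDF support copy and paste. I tried it, but it had no effect.
Then I found the ccmap package, which targets Chinese and is part of the cct package. It is not in the repositories, so I went to http://lsec.cc.ac.cn/cgi-bin/viewcvs.cgi/cct/ccmap/ and downloaded everything there (Makefile and t1.tex can be skipped; the files in Attic need to be downloaded as well), put it in /usr/share/texmf/tex/latex/ccmap (the exact path depends on where TeX is installed), unpacked the archive, and finally ran texhash.
Now add \usepackage{ccmap} to the .tex file (as I recall, cmap has to be loaded immediately after \documentclass; I don't know whether ccmap needs the same) and compile with pdflatex again. Wow, the Chinese shows up! The output is UTF-8 by default; the -enc GBK option takes care of that.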
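As a rough end-to-end check of the steps above, something like the following could be used. It is a sketch under several assumptions that are not from this post: the document uses the CJK package (CJKutf8) with the gbsn font, the installed .cmap files cover the font encoding that setup uses, and the file name test.tex and the temporary directory are placeholders.

```shell
#!/bin/sh
# Sketch: write a tiny Chinese test document that loads ccmap first,
# then compile it and convert the PDF. File names are placeholders.
workdir=$(mktemp -d)
cat > "$workdir/test.tex" <<'EOF'
\documentclass{article}
\usepackage{ccmap}   % placed first, since cmap is said to require that
\usepackage{CJKutf8} % assumption: CJK package with the gbsn font installed
\begin{document}
\begin{CJK}{UTF8}{gbsn}
中文测试
\end{CJK}
\end{document}
EOF

# Compile and convert only when both tools are installed.
if command -v pdflatex >/dev/null 2>&1 && command -v pdftohtml >/dev/null 2>&1; then
    (cd "$workdir" \
        && pdflatex -interaction=nonstopmode test.tex >/dev/null \
        && pdftohtml -enc GBK test.pdf test)
fi
echo "work directory: $workdir"
```

If the compile step fails, test.log in the work directory shows why. When it succeeds, open the generated HTML and check that the Chinese text is real text rather than an image.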

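Finally, according to a classmate, what really matters is neither cmap nor ccmap but simply having those .cmap files. I tried it and that is indeed the case: \usepackage{cmap} works now as well.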
