Introduction

This post walks through converting PDF to HTML and HTML to PDF in practice.

Environment

Windows 10 Professional
WSL2 (Ubuntu 22.04 LTS)
pdftohtml version 22.02.0
wkhtmltopdf 0.12.6

Installing pdftohtml

pdftohtml converts PDF to HTML. On Ubuntu it ships in the poppler-utils package.

$ sudo apt-get install poppler-utils

Installing wkhtmltopdf

wkhtmltopdf converts HTML to PDF.

$ sudo apt-get install wkhtmltopdf
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
 adwaita-icon-theme at-spi2-core avahi-daemon dconf-gsettings-backend dconf-service fontconfig geoclue-2.0
 glib-networking glib-networking-common glib-networking-services gsettings-desktop-schemas gstreamer1.0-plugins-base
 gtk-update-icon-cache hicolor-icon-theme humanity-icon-theme iio-sensor-proxy libatk-bridge2.0-0 libatk1.0-0
 libatk1.0-data libatspi2.0-0 libavahi-client3 libavahi-common-data libavahi-common3 libavahi-core7 libavahi-glib1
 libcairo-gobject2 libcdparanoia0 libcolord2 libcups2 libdaemon0 libdatrie1 libdconf1 libdouble-conversion3
 libdrm-amdgpu1 libdrm-intel1 libdrm-nouveau2 libdrm-radeon1 libegl-mesa0 libegl1 libepoxy0 libevdev2 libfontenc1
 libgbm1 libgdk-pixbuf-2.0-0 libgdk-pixbuf2.0-bin libgdk-pixbuf2.0-common libgl1 libgl1-amber-dri libgl1-mesa-dri
 libglapi-mesa libglvnd0 libglx-mesa0 libglx0 libgraphite2-3 libgstreamer-plugins-base1.0-0 libgtk-3-0 libgtk-3-bin
 libgtk-3-common libgudev-1.0-0 libharfbuzz0b libhyphen0 libice6 libinput-bin libinput10 libjson-glib-1.0-0
 libjson-glib-1.0-common libllvm15 libmbim-glib4 libmbim-proxy libmd4c0 libmm-glib0 libmtdev1 libnl-route-3-200
 libnotify4 libnss-mdns libogg0 libopus0 liborc-0.4-0 libpango-1.0-0 libpangocairo-1.0-0 libpangoft2-1.0-0
 libpciaccess0 libpcre2-16-0 libpcsclite1 libproxy1v5 libqmi-glib5 libqmi-proxy libqt5core5a libqt5dbus5 libqt5gui5
 libqt5network5 libqt5positioning5 libqt5printsupport5 libqt5qml5 libqt5qmlmodels5 libqt5quick5 libqt5sensors5
 libqt5svg5 libqt5webchannel5 libqt5webkit5 libqt5widgets5 librsvg2-2 librsvg2-common libsensors-config libsensors5
 libsm6 libsoup2.4-1 libsoup2.4-common libtcl8.6 libthai-data libthai0 libtheora0 libvisual-0.4-0 libvorbis0a
 libvorbisenc2 libwacom-bin libwacom-common libwacom9 libwayland-client0 libwayland-cursor0 libwayland-egl1
 libwayland-server0 libwoff1 libx11-xcb1 libxaw7 libxcb-dri2-0 libxcb-dri3-0 libxcb-glx0 libxcb-icccm4 libxcb-image0
 libxcb-keysyms1 libxcb-present0 libxcb-randr0 libxcb-render-util0 libxcb-shape0 libxcb-sync1 libxcb-util1
 libxcb-xfixes0 libxcb-xinerama0 libxcb-xinput0 libxcb-xkb1 libxcomposite1 libxcursor1 libxdamage1 libxfixes3
 libxfont2 libxi6 libxinerama1 libxkbcommon-x11-0 libxkbcommon0 libxkbfile1 libxmu6 libxpm4 libxrandr2 libxshmfence1
 libxslt1.1 libxt6 libxtst6 libxxf86vm1 modemmanager qt5-gtk-platformtheme qttranslations5-l10n session-migration tcl
 tcl8.6 ubuntu-mono usb-modeswitch usb-modeswitch-data wpasupplicant x11-common x11-xkb-utils xfonts-base
 xfonts-encodings xfonts-utils xnest xserver-common
Suggested packages:
 avahi-autoipd gvfs colord cups-common libvisual-0.4-plugins gnome-shell | notification-daemon avahi-autoipd
 | zeroconf opus-tools pcscd qt5-image-formats-plugins qtwayland5 qt5-qmltooling-plugins librsvg2-bin lm-sensors
 tcl-tclreadline comgt wvdial wpagui libengine-pkcs11-openssl
The following NEW packages will be installed:
 adwaita-icon-theme at-spi2-core avahi-daemon dconf-gsettings-backend dconf-service fontconfig geoclue-2.0
 glib-networking glib-networking-common glib-networking-services gsettings-desktop-schemas gstreamer1.0-plugins-base
 gtk-update-icon-cache hicolor-icon-theme humanity-icon-theme iio-sensor-proxy libatk-bridge2.0-0 libatk1.0-0
 libatk1.0-data libatspi2.0-0 libavahi-client3 libavahi-common-data libavahi-common3 libavahi-core7 libavahi-glib1
 libcairo-gobject2 libcdparanoia0 libcolord2 libcups2 libdaemon0 libdatrie1 libdconf1 libdouble-conversion3
 libdrm-amdgpu1 libdrm-intel1 libdrm-nouveau2 libdrm-radeon1 libegl-mesa0 libegl1 libepoxy0 libevdev2 libfontenc1
 libgbm1 libgdk-pixbuf-2.0-0 libgdk-pixbuf2.0-bin libgdk-pixbuf2.0-common libgl1 libgl1-amber-dri libgl1-mesa-dri
 libglapi-mesa libglvnd0 libglx-mesa0 libglx0 libgraphite2-3 libgstreamer-plugins-base1.0-0 libgtk-3-0 libgtk-3-bin
 libgtk-3-common libgudev-1.0-0 libharfbuzz0b libhyphen0 libice6 libinput-bin libinput10 libjson-glib-1.0-0
 libjson-glib-1.0-common libllvm15 libmbim-glib4 libmbim-proxy libmd4c0 libmm-glib0 libmtdev1 libnl-route-3-200
 libnotify4 libnss-mdns libogg0 libopus0 liborc-0.4-0 libpango-1.0-0 libpangocairo-1.0-0 libpangoft2-1.0-0
 libpciaccess0 libpcre2-16-0 libpcsclite1 libproxy1v5 libqmi-glib5 libqmi-proxy libqt5core5a libqt5dbus5 libqt5gui5
 libqt5network5 libqt5positioning5 libqt5printsupport5 libqt5qml5 libqt5qmlmodels5 libqt5quick5 libqt5sensors5
 libqt5svg5 libqt5webchannel5 libqt5webkit5 libqt5widgets5 librsvg2-2 librsvg2-common libsensors-config libsensors5
 libsm6 libsoup2.4-1 libsoup2.4-common libtcl8.6 libthai-data libthai0 libtheora0 libvisual-0.4-0 libvorbis0a
 libvorbisenc2 libwacom-bin libwacom-common libwacom9 libwayland-client0 libwayland-cursor0 libwayland-egl1
 libwayland-server0 libwoff1 libx11-xcb1 libxaw7 libxcb-dri2-0 libxcb-dri3-0 libxcb-glx0 libxcb-icccm4 libxcb-image0
 libxcb-keysyms1 libxcb-present0 libxcb-randr0 libxcb-render-util0 libxcb-shape0 libxcb-sync1 libxcb-util1
 libxcb-xfixes0 libxcb-xinerama0 libxcb-xinput0 libxcb-xkb1 libxcomposite1 libxcursor1 libxdamage1 libxfixes3
 libxfont2 libxi6 libxinerama1 libxkbcommon-x11-0 libxkbcommon0 libxkbfile1 libxmu6 libxpm4 libxrandr2 libxshmfence1
 libxslt1.1 libxt6 libxtst6 libxxf86vm1 modemmanager qt5-gtk-platformtheme qttranslations5-l10n session-migration tcl
 tcl8.6 ubuntu-mono usb-modeswitch usb-modeswitch-data wkhtmltopdf wpasupplicant x11-common x11-xkb-utils xfonts-base
 xfonts-encodings xfonts-utils xnest xserver-common
0 upgraded, 177 newly installed, 0 to remove and 6 not upgraded.
Need to get 98.5 MB of archives.
After this operation, 388 MB of additional disk space will be used.

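Note: this pulls in a lot of dependencies (177 new packages, 98.5 MB to download, 388 MB on disk), so the install is fairly heavy.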
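
After installing both packages, it can be worth confirming that the commands are actually on the PATH. The following is a minimal sketch; the helper name have_cmd is made up for illustration and is not part of either tool.

```shell
#!/bin/sh
# have_cmd is a hypothetical helper: it succeeds if the command is on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Report on each tool used in this article.
for tool in pdftohtml wkhtmltopdf; do
    if have_cmd "$tool"; then
        echo "$tool: found at $(command -v "$tool")"
    else
        echo "$tool: not found"
    fi
done
```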

Converting HTML to PDF

The conversion uses wkhtmltopdf. Its help output lists the available options:

$ wkhtmltopdf --help
Name:
 wkhtmltopdf 0.12.6

Synopsis:
 wkhtmltopdf [GLOBAL OPTION]... [OBJECT]... <output file>

Document objects:
 wkhtmltopdf is able to put several objects into the output file, an object is
 either a single webpage, a cover webpage or a table of contents. The objects
 are put into the output document in the order they are specified on the
 command line, options can be specified on a per object basis or in the global
 options area. Options from the Global Options section can only be placed in
 the global options area.

 A page objects puts the content of a single webpage into the output document.

 (page)? <input url/file name> [PAGE OPTION]...
 Options for the page object can be placed in the global options and the page
 options areas. The applicable options can be found in the Page Options and
 Headers And Footer Options sections.

 A cover objects puts the content of a single webpage into the output document,
 the page does not appear in the table of contents, and does not have headers
 and footers.

 cover <input url/file name> [PAGE OPTION]...
 All options that can be specified for a page object can also be specified for
 a cover.

 A table of contents object inserts a table of contents into the output
 document.

 toc [TOC OPTION]...
 All options that can be specified for a page object can also be specified for
 a toc, further more the options from the TOC Options section can also be
 applied. The table of contents is generated via XSLT which means that it can
 be styled to look however you want it to look. To get an idea of how to do
 this you can dump the default xslt document by supplying the
 --dump-default-toc-xsl, and the outline it works on by supplying
 --dump-outline, see the Outline Options section.

Description:
 Converts one or more HTML pages into a PDF document, *not* using wkhtmltopdf
 patched qt.

Global Options:
 --collate Collate when printing multiple copies
 (default)
 --no-collate Do not collate when printing multiple
 copies
 --copies <number> Number of copies to print into the pdf
 file (default 1)
 -H, --extended-help Display more extensive help, detailing
 less common command switches
 -g, --grayscale PDF will be generated in grayscale
 -h, --help Display help
 --license Output license information and exit
 --log-level <level> Set log level to: none, error, warn or
 info (default info)
 -l, --lowquality Generates lower quality pdf/ps. Useful to
 shrink the result document space
 -O, --orientation <orientation> Set orientation to Landscape or Portrait
 (default Portrait)
 -s, --page-size <Size> Set paper size to: A4, Letter, etc.
 (default A4)
 -q, --quiet Be less verbose, maintained for backwards
 compatibility; Same as using --log-level
 none
 --read-args-from-stdin Read command line arguments from stdin
 --title <text> The title of the generated pdf file (The
 title of the first document is used if not
 specified)
 -V, --version Output version information and exit

Reduced Functionality:
 This version of wkhtmltopdf has been compiled against a version of QT without
 the wkhtmltopdf patches. Therefore some features are missing, if you need
 these features please use the static version.

 Currently the list of features only supported with patch QT includes:

 * Printing more than one HTML document into a PDF file.
 * Running without an X11 server.
 * Adding a document outline to the PDF file.
 * Adding headers and footers to the PDF file.
 * Generating a table of contents.
 * Adding links in the generated PDF file.
 * Printing using the screen media-type.
 * Disabling the smart shrink feature of WebKit.

Contact:
 If you experience bugs or want to request new features please visit
 <https://wkhtmltopdf.org/support.html>

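The "Reduced Functionality" part of the help output says this build cannot run without an X11 server. A common workaround on headless machines is to wrap the command in xvfb-run, which comes from the separate xvfb package and is not something the original setup used. The sketch below assumes xvfb-run is installed; the function names wk_prefix and html2pdf are made up for illustration.

```shell
#!/bin/sh
# wk_prefix (hypothetical helper): prints the wrapper to put in front of
# wkhtmltopdf. If DISPLAY is empty or unset we assume there is no X server
# and fall back to xvfb-run; -a lets it pick a free display number.
wk_prefix() {
    if [ -z "${DISPLAY:-}" ]; then
        echo "xvfb-run -a"
    fi
}

# html2pdf (hypothetical helper): convert one URL or file to a PDF.
html2pdf() {
    # $(wk_prefix) is deliberately unquoted so that it splits into words,
    # or disappears entirely when a display is available.
    $(wk_prefix) wkhtmltopdf "$1" "$2"
}

# Usage (not run here):
#   html2pdf https://blog.k-bushi.com/post/column/pass-fp3/ pass-fp3.pdf
```

Whether the wrapper is needed depends on the environment; where an X display is already available, the plain command is enough.
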
A web page can be converted to PDF:

wkhtmltopdf https://blog.k-bushi.com/post/column/pass-fp3/ pass-fp3.pdf

Note: a page served from a local server should work just as well.

Confirmed that the output is a PDF: pass-fp3
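
The synopsis in the help output accepts a file name as well as a URL, and the Global Options can be combined. As a rough sketch (the function name landscape_pdf and the file names in the usage comment are placeholders, not from the original setup):

```shell
#!/bin/sh
# landscape_pdf (hypothetical helper): render a local HTML file or a URL as
# a landscape A4 PDF with an explicit title, using options from --help.
landscape_pdf() {
    input=$1
    output=$2
    title=$3
    wkhtmltopdf --orientation Landscape --page-size A4 \
        --title "$title" "$input" "$output"
}

# Usage (not run here):
#   landscape_pdf report.html report.pdf "Monthly report"
```

Only options listed under Global Options for this build are used here, since the features under "Reduced Functionality" are unavailable in the packaged version.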

Converting PDF to HTML

$ pdftohtml --help
pdftohtml version 22.02.0
Copyright 2005-2022 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1999-2003 Gueorgui Ovtcharov and Rainer Dorsch
Copyright 1996-2011 Glyph & Cog, LLC

Usage: pdftohtml [options] <PDF-file> [<html-file> <xml-file>]
 -f <int> : first page to convert
 -l <int> : last page to convert
 -q : don't print any messages or errors
 -h : print usage information
 -? : print usage information
 -help : print usage information
 --help : print usage information
 -p : exchange .pdf links by .html
 -c : generate complex document
 -s : generate single document that includes all pages
 -dataurls : use data URLs instead of external images in HTML
 -i : ignore images
 -noframes : generate no frames
 -stdout : use standard output
 -zoom <fp> : zoom the pdf document (default 1.5)
 -xml : output for XML post-processing
 -noroundcoord : do not round coordinates (with XML output only)
 -hidden : output hidden text
 -nomerge : do not merge paragraphs
 -enc <string> : output text encoding name
 -fmt <string> : image file format for Splash output (png or jpg)
 -v : print copyright and version info
 -opw <string> : owner password (for encrypted files)
 -upw <string> : user password (for encrypted files)
 -nodrm : override document DRM settings
 -wbt <fp> : word break threshold (default 10 percent)
 -fontfullname : outputs font full name

For now, let's just run it with the default options.

pdftohtml pass-fp3.pdf pass-fp3.html

This produced quite a few files:

$ ls
pass-fp3-1_1.jpg pass-fp3-3_1.jpg pass-fp3-4_1.jpg pass-fp3-5_1.jpg pass-fp3.pdf pass-fp3s.html
pass-fp3-2_1.jpg pass-fp3-3_2.jpg pass-fp3-4_2.jpg pass-fp3.html pass-fp3_ind.html

The result: pass-fp3-html

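I didn't want the frame on the left, and the file with the s suffix (pass-fp3s.html) appears not to have one.
It looks like both a framed and a frameless version are generated by default.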
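
The help output above lists a -noframes option ("generate no frames"). If the frameset is not wanted at all, something like the following may be enough. This is a sketch that was not part of the original experiment, and the function name pdf_to_plain_html is a placeholder.

```shell
#!/bin/sh
# pdf_to_plain_html (hypothetical helper): convert a PDF to HTML while
# asking pdftohtml not to generate frames.
pdf_to_plain_html() {
    pdf=$1
    html=$2
    pdftohtml -noframes "$pdf" "$html"
}

# Usage (not run here):
#   pdf_to_plain_html pass-fp3.pdf pass-fp3.html
```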

Conclusion

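I tried out converting PDF to HTML and HTML to PDF.
wkhtmltopdf in particular is something I also use at work, and it is very handy: getting the layout right takes some effort, but it is invaluable for generating forms and documents.
Plenty of customers want their output as PDF, after all.
This was my first time using pdftohtml, but it looks useful when you want to take the contents of a PDF as-is into HTML and work with them from there.
Even if you don't use the output unchanged, it comes out in reasonably good shape, so there are probably many cases where a little tweaking of the source is enough.
There are a lot of PDF tools out there, and they are fun to explore.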