Introduction
This post walks through converting PDF to HTML and HTML to PDF in practice.
Environment
Windows 10 Professional
WSL2 (Ubuntu 22.04 LTS)
pdftohtml version 22.02.0
wkhtmltopdf 0.12.6
Installing pdftohtml
pdftohtml is a tool that converts PDF to HTML. It ships as part of the poppler-utils package.
$ sudo apt-get install poppler-utils
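- Note: installing poppler-utils makes the commands listed on the following page available. https://www.mankier.com/package/poppler-utils
- Note: it was already installed in my environment, so I skipped this step.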
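Since the install step was skipped here, it can help to confirm that both tools are actually on the PATH before going further. The following is a minimal sketch; the function name check_tools is made up for this example and is not part of either package.

```shell
# check_tools TOOL...
# Prints whether each named command is on PATH.
# Returns non-zero if at least one is missing.
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found: $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# pdftohtml comes from poppler-utils; wkhtmltopdf from its own package.
check_tools pdftohtml wkhtmltopdf || echo "install the missing packages with apt-get"
```

Once both are reported as found, pdftohtml -v and wkhtmltopdf --version print the installed versions.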
Installing wkhtmltopdf
wkhtmltopdf is a tool that converts HTML to PDF.
$ sudo apt-get install wkhtmltopdf
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
adwaita-icon-theme at-spi2-core avahi-daemon dconf-gsettings-backend dconf-service fontconfig geoclue-2.0
glib-networking glib-networking-common glib-networking-services gsettings-desktop-schemas gstreamer1.0-plugins-base
gtk-update-icon-cache hicolor-icon-theme humanity-icon-theme iio-sensor-proxy libatk-bridge2.0-0 libatk1.0-0
libatk1.0-data libatspi2.0-0 libavahi-client3 libavahi-common-data libavahi-common3 libavahi-core7 libavahi-glib1
libcairo-gobject2 libcdparanoia0 libcolord2 libcups2 libdaemon0 libdatrie1 libdconf1 libdouble-conversion3
libdrm-amdgpu1 libdrm-intel1 libdrm-nouveau2 libdrm-radeon1 libegl-mesa0 libegl1 libepoxy0 libevdev2 libfontenc1
libgbm1 libgdk-pixbuf-2.0-0 libgdk-pixbuf2.0-bin libgdk-pixbuf2.0-common libgl1 libgl1-amber-dri libgl1-mesa-dri
libglapi-mesa libglvnd0 libglx-mesa0 libglx0 libgraphite2-3 libgstreamer-plugins-base1.0-0 libgtk-3-0 libgtk-3-bin
libgtk-3-common libgudev-1.0-0 libharfbuzz0b libhyphen0 libice6 libinput-bin libinput10 libjson-glib-1.0-0
libjson-glib-1.0-common libllvm15 libmbim-glib4 libmbim-proxy libmd4c0 libmm-glib0 libmtdev1 libnl-route-3-200
libnotify4 libnss-mdns libogg0 libopus0 liborc-0.4-0 libpango-1.0-0 libpangocairo-1.0-0 libpangoft2-1.0-0
libpciaccess0 libpcre2-16-0 libpcsclite1 libproxy1v5 libqmi-glib5 libqmi-proxy libqt5core5a libqt5dbus5 libqt5gui5
libqt5network5 libqt5positioning5 libqt5printsupport5 libqt5qml5 libqt5qmlmodels5 libqt5quick5 libqt5sensors5
libqt5svg5 libqt5webchannel5 libqt5webkit5 libqt5widgets5 librsvg2-2 librsvg2-common libsensors-config libsensors5
libsm6 libsoup2.4-1 libsoup2.4-common libtcl8.6 libthai-data libthai0 libtheora0 libvisual-0.4-0 libvorbis0a
libvorbisenc2 libwacom-bin libwacom-common libwacom9 libwayland-client0 libwayland-cursor0 libwayland-egl1
libwayland-server0 libwoff1 libx11-xcb1 libxaw7 libxcb-dri2-0 libxcb-dri3-0 libxcb-glx0 libxcb-icccm4 libxcb-image0
libxcb-keysyms1 libxcb-present0 libxcb-randr0 libxcb-render-util0 libxcb-shape0 libxcb-sync1 libxcb-util1
libxcb-xfixes0 libxcb-xinerama0 libxcb-xinput0 libxcb-xkb1 libxcomposite1 libxcursor1 libxdamage1 libxfixes3
libxfont2 libxi6 libxinerama1 libxkbcommon-x11-0 libxkbcommon0 libxkbfile1 libxmu6 libxpm4 libxrandr2 libxshmfence1
libxslt1.1 libxt6 libxtst6 libxxf86vm1 modemmanager qt5-gtk-platformtheme qttranslations5-l10n session-migration tcl
tcl8.6 ubuntu-mono usb-modeswitch usb-modeswitch-data wpasupplicant x11-common x11-xkb-utils xfonts-base
xfonts-encodings xfonts-utils xnest xserver-common
Suggested packages:
avahi-autoipd gvfs colord cups-common libvisual-0.4-plugins gnome-shell | notification-daemon avahi-autoipd
| zeroconf opus-tools pcscd qt5-image-formats-plugins qtwayland5 qt5-qmltooling-plugins librsvg2-bin lm-sensors
tcl-tclreadline comgt wvdial wpagui libengine-pkcs11-openssl
The following NEW packages will be installed:
adwaita-icon-theme at-spi2-core avahi-daemon dconf-gsettings-backend dconf-service fontconfig geoclue-2.0
glib-networking glib-networking-common glib-networking-services gsettings-desktop-schemas gstreamer1.0-plugins-base
gtk-update-icon-cache hicolor-icon-theme humanity-icon-theme iio-sensor-proxy libatk-bridge2.0-0 libatk1.0-0
libatk1.0-data libatspi2.0-0 libavahi-client3 libavahi-common-data libavahi-common3 libavahi-core7 libavahi-glib1
libcairo-gobject2 libcdparanoia0 libcolord2 libcups2 libdaemon0 libdatrie1 libdconf1 libdouble-conversion3
libdrm-amdgpu1 libdrm-intel1 libdrm-nouveau2 libdrm-radeon1 libegl-mesa0 libegl1 libepoxy0 libevdev2 libfontenc1
libgbm1 libgdk-pixbuf-2.0-0 libgdk-pixbuf2.0-bin libgdk-pixbuf2.0-common libgl1 libgl1-amber-dri libgl1-mesa-dri
libglapi-mesa libglvnd0 libglx-mesa0 libglx0 libgraphite2-3 libgstreamer-plugins-base1.0-0 libgtk-3-0 libgtk-3-bin
libgtk-3-common libgudev-1.0-0 libharfbuzz0b libhyphen0 libice6 libinput-bin libinput10 libjson-glib-1.0-0
libjson-glib-1.0-common libllvm15 libmbim-glib4 libmbim-proxy libmd4c0 libmm-glib0 libmtdev1 libnl-route-3-200
libnotify4 libnss-mdns libogg0 libopus0 liborc-0.4-0 libpango-1.0-0 libpangocairo-1.0-0 libpangoft2-1.0-0
libpciaccess0 libpcre2-16-0 libpcsclite1 libproxy1v5 libqmi-glib5 libqmi-proxy libqt5core5a libqt5dbus5 libqt5gui5
libqt5network5 libqt5positioning5 libqt5printsupport5 libqt5qml5 libqt5qmlmodels5 libqt5quick5 libqt5sensors5
libqt5svg5 libqt5webchannel5 libqt5webkit5 libqt5widgets5 librsvg2-2 librsvg2-common libsensors-config libsensors5
libsm6 libsoup2.4-1 libsoup2.4-common libtcl8.6 libthai-data libthai0 libtheora0 libvisual-0.4-0 libvorbis0a
libvorbisenc2 libwacom-bin libwacom-common libwacom9 libwayland-client0 libwayland-cursor0 libwayland-egl1
libwayland-server0 libwoff1 libx11-xcb1 libxaw7 libxcb-dri2-0 libxcb-dri3-0 libxcb-glx0 libxcb-icccm4 libxcb-image0
libxcb-keysyms1 libxcb-present0 libxcb-randr0 libxcb-render-util0 libxcb-shape0 libxcb-sync1 libxcb-util1
libxcb-xfixes0 libxcb-xinerama0 libxcb-xinput0 libxcb-xkb1 libxcomposite1 libxcursor1 libxdamage1 libxfixes3
libxfont2 libxi6 libxinerama1 libxkbcommon-x11-0 libxkbcommon0 libxkbfile1 libxmu6 libxpm4 libxrandr2 libxshmfence1
libxslt1.1 libxt6 libxtst6 libxxf86vm1 modemmanager qt5-gtk-platformtheme qttranslations5-l10n session-migration tcl
tcl8.6 ubuntu-mono usb-modeswitch usb-modeswitch-data wkhtmltopdf wpasupplicant x11-common x11-xkb-utils xfonts-base
xfonts-encodings xfonts-utils xnest xserver-common
0 upgraded, 177 newly installed, 0 to remove and 6 not upgraded.
Need to get 98.5 MB of archives.
After this operation, 388 MB of additional disk space will be used.
Note: this pulls in a lot of dependencies (177 packages, 98.5 MB to download), so the install is fairly heavy.
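To see how heavy an install will be before committing to it, the summary line of the apt output can be parsed. This is a rough sketch: count_new_packages is a hypothetical helper name, and it assumes the English-language summary line shown in the log above.

```shell
# count_new_packages
# Reads apt-get output on stdin and prints the number of packages
# that would be newly installed, taken from the summary line.
count_new_packages() {
  sed -n 's/^[0-9][0-9]* upgraded, \([0-9][0-9]*\) newly installed.*/\1/p'
}

# Sample: the summary line from the install log in this post.
echo "0 upgraded, 177 newly installed, 0 to remove and 6 not upgraded." | count_new_packages
```

In practice the input would come from a simulated install, for example LC_ALL=C apt-get -s install wkhtmltopdf | count_new_packages, where -s (--simulate) reports what would happen without changing the system.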
Converting HTML to PDF
Use wkhtmltopdf for the conversion. First, the help output:
$ wkhtmltopdf --help
Name:
wkhtmltopdf 0.12.6
Synopsis:
wkhtmltopdf [GLOBAL OPTION]... [OBJECT]... <output file>
Document objects:
wkhtmltopdf is able to put several objects into the output file, an object is
either a single webpage, a cover webpage or a table of contents. The objects
are put into the output document in the order they are specified on the
command line, options can be specified on a per object basis or in the global
options area. Options from the Global Options section can only be placed in
the global options area.
A page objects puts the content of a single webpage into the output document.
(page)? <input url/file name> [PAGE OPTION]...
Options for the page object can be placed in the global options and the page
options areas. The applicable options can be found in the Page Options and
Headers And Footer Options sections.
A cover objects puts the content of a single webpage into the output document,
the page does not appear in the table of contents, and does not have headers
and footers.
cover <input url/file name> [PAGE OPTION]...
All options that can be specified for a page object can also be specified for
a cover.
A table of contents object inserts a table of contents into the output
document.
toc [TOC OPTION]...
All options that can be specified for a page object can also be specified for
a toc, further more the options from the TOC Options section can also be
applied. The table of contents is generated via XSLT which means that it can
be styled to look however you want it to look. To get an idea of how to do
this you can dump the default xslt document by supplying the
--dump-default-toc-xsl, and the outline it works on by supplying
--dump-outline, see the Outline Options section.
Description:
Converts one or more HTML pages into a PDF document, *not* using wkhtmltopdf
patched qt.
Global Options:
--collate Collate when printing multiple copies
(default)
--no-collate Do not collate when printing multiple
copies
--copies <number> Number of copies to print into the pdf
file (default 1)
-H, --extended-help Display more extensive help, detailing
less common command switches
-g, --grayscale PDF will be generated in grayscale
-h, --help Display help
--license Output license information and exit
--log-level <level> Set log level to: none, error, warn or
info (default info)
-l, --lowquality Generates lower quality pdf/ps. Useful to
shrink the result document space
-O, --orientation <orientation> Set orientation to Landscape or Portrait
(default Portrait)
-s, --page-size <Size> Set paper size to: A4, Letter, etc.
(default A4)
-q, --quiet Be less verbose, maintained for backwards
compatibility; Same as using --log-level
none
--read-args-from-stdin Read command line arguments from stdin
--title <text> The title of the generated pdf file (The
title of the first document is used if not
specified)
-V, --version Output version information and exit
Reduced Functionality:
This version of wkhtmltopdf has been compiled against a version of QT without
the wkhtmltopdf patches. Therefore some features are missing, if you need
these features please use the static version.
Currently the list of features only supported with patch QT includes:
* Printing more than one HTML document into a PDF file.
* Running without an X11 server.
* Adding a document outline to the PDF file.
* Adding headers and footers to the PDF file.
* Generating a table of contents.
* Adding links in the generated PDF file.
* Printing using the screen media-type.
* Disabling the smart shrink feature of WebKit.
Contact:
If you experience bugs or want to request new features please visit
<https://wkhtmltopdf.org/support.html>
It can convert a web page to PDF directly:
$ wkhtmltopdf https://blog.k-bushi.com/post/column/pass-fp3/ pass-fp3.pdf
Note: a page served from a local server should work just as well.
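The global options in the help output above can be combined with the same call. The following is a minimal sketch using only flags listed there (--page-size, --orientation, --grayscale, --quiet). The wrapper name html_to_pdf, the DRY_RUN variable, and the output file name are assumptions made for this example.

```shell
# html_to_pdf INPUT OUTPUT
# INPUT may be a URL or a local HTML file.
# With DRY_RUN=1 the command is printed instead of executed.
html_to_pdf() {
  input=$1
  output=$2
  # Build the command line: A4, landscape, grayscale, no progress output.
  set -- wkhtmltopdf --page-size A4 --orientation Landscape \
         --grayscale --quiet "$input" "$output"
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Print the command that would be run for the page used in this post.
DRY_RUN=1 html_to_pdf https://blog.k-bushi.com/post/column/pass-fp3/ pass-fp3-landscape.pdf
```

Calling html_to_pdf without DRY_RUN=1 performs the actual conversion. Features listed under "Reduced Functionality" in the help output, such as headers and footers, are not available in this apt build.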
Converting PDF to HTML
$ pdftohtml --help
pdftohtml version 22.02.0
Copyright 2005-2022 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1999-2003 Gueorgui Ovtcharov and Rainer Dorsch
Copyright 1996-2011 Glyph & Cog, LLC
Usage: pdftohtml [options] <PDF-file> [<html-file> <xml-file>]
-f <int> : first page to convert
-l <int> : last page to convert
-q : don't print any messages or errors
-h : print usage information
-? : print usage information
-help : print usage information
--help : print usage information
-p : exchange .pdf links by .html
-c : generate complex document
-s : generate single document that includes all pages
-dataurls : use data URLs instead of external images in HTML
-i : ignore images
-noframes : generate no frames
-stdout : use standard output
-zoom <fp> : zoom the pdf document (default 1.5)
-xml : output for XML post-processing
-noroundcoord : do not round coordinates (with XML output only)
-hidden : output hidden text
-nomerge : do not merge paragraphs
-enc <string> : output text encoding name
-fmt <string> : image file format for Splash output (png or jpg)
-v : print copyright and version info
-opw <string> : owner password (for encrypted files)
-upw <string> : user password (for encrypted files)
-nodrm : override document DRM settings
-wbt <fp> : word break threshold (default 10 percent)
-fontfullname : outputs font full name
For now, let's just run it with the defaults.
$ pdftohtml pass-fp3.pdf pass-fp3.html
That produced quite a few files:
$ ls
pass-fp3-1_1.jpg pass-fp3-3_1.jpg pass-fp3-4_1.jpg pass-fp3-5_1.jpg pass-fp3.pdf pass-fp3s.html
pass-fp3-2_1.jpg pass-fp3-3_2.jpg pass-fp3-4_2.jpg pass-fp3.html pass-fp3_ind.html
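I didn't want the frame on the left, and the file with the s suffix (pass-fp3s.html) appears not to have one.
So it looks like both a framed and a frameless version are generated by default.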
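The help output above also lists -noframes ("generate no frames"), along with -f and -l for limiting the page range. I did not run this variant for the post, so treat the following as an untested sketch. The wrapper name pdf_to_html_noframes and the DRY_RUN variable are assumptions made for this example.

```shell
# pdf_to_html_noframes PDF [FIRST] [LAST]
# Converts pages FIRST..LAST of PDF to HTML without the frameset.
# FIRST defaults to 1, LAST defaults to FIRST.
# With DRY_RUN=1 the command is printed instead of executed.
pdf_to_html_noframes() {
  pdf=$1
  first=${2:-1}
  last=${3:-$first}
  set -- pdftohtml -noframes -f "$first" -l "$last" "$pdf"
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Print the command for the first two pages of the PDF created earlier.
DRY_RUN=1 pdf_to_html_noframes pass-fp3.pdf 1 2
```

Other flags from the same help text, such as -i to ignore images or -dataurls to embed them in the HTML, can be added to the set -- line in the same way.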
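Conclusion
I tried out both PDF-to-HTML and HTML-to-PDF conversion.
wkhtmltopdf in particular is something I also use at work, and it is very handy: tuning the layout takes effort, but it is invaluable for features that generate forms and other documents.
Plenty of clients want their output as PDF, after all.
This was my first time using pdftohtml for PDF to HTML, but it looks useful when you want to take the contents of a PDF as they are and work with them as HTML.
Even if you don't use the output as-is, it comes out in reasonably good shape, so there are probably many cases where a small tweak to the source is all it takes.
There are a lot of PDF tools out there, and they are fun to explore.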