Converting HTML to PDF and PDF to HTML

Introduction

This post walks through converting PDF to HTML and HTML to PDF in practice.

Environment

Windows 10 Professional
WSL2 (Ubuntu 22.04 LTS)
pdftohtml version 22.02.0
wkhtmltopdf 0.12.6

Installing pdftohtml

pdftohtml converts PDF to HTML. It is installed as part of the poppler-utils package.

$ sudo apt-get install poppler-utils

Installing wkhtmltopdf

wkhtmltopdf converts HTML to PDF.

$ sudo apt-get install wkhtmltopdf
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  adwaita-icon-theme at-spi2-core avahi-daemon dconf-gsettings-backend dconf-service fontconfig geoclue-2.0
  glib-networking glib-networking-common glib-networking-services gsettings-desktop-schemas gstreamer1.0-plugins-base
  gtk-update-icon-cache hicolor-icon-theme humanity-icon-theme iio-sensor-proxy libatk-bridge2.0-0 libatk1.0-0
  libatk1.0-data libatspi2.0-0 libavahi-client3 libavahi-common-data libavahi-common3 libavahi-core7 libavahi-glib1
  libcairo-gobject2 libcdparanoia0 libcolord2 libcups2 libdaemon0 libdatrie1 libdconf1 libdouble-conversion3
  libdrm-amdgpu1 libdrm-intel1 libdrm-nouveau2 libdrm-radeon1 libegl-mesa0 libegl1 libepoxy0 libevdev2 libfontenc1
  libgbm1 libgdk-pixbuf-2.0-0 libgdk-pixbuf2.0-bin libgdk-pixbuf2.0-common libgl1 libgl1-amber-dri libgl1-mesa-dri
  libglapi-mesa libglvnd0 libglx-mesa0 libglx0 libgraphite2-3 libgstreamer-plugins-base1.0-0 libgtk-3-0 libgtk-3-bin
  libgtk-3-common libgudev-1.0-0 libharfbuzz0b libhyphen0 libice6 libinput-bin libinput10 libjson-glib-1.0-0
  libjson-glib-1.0-common libllvm15 libmbim-glib4 libmbim-proxy libmd4c0 libmm-glib0 libmtdev1 libnl-route-3-200
  libnotify4 libnss-mdns libogg0 libopus0 liborc-0.4-0 libpango-1.0-0 libpangocairo-1.0-0 libpangoft2-1.0-0
  libpciaccess0 libpcre2-16-0 libpcsclite1 libproxy1v5 libqmi-glib5 libqmi-proxy libqt5core5a libqt5dbus5 libqt5gui5
  libqt5network5 libqt5positioning5 libqt5printsupport5 libqt5qml5 libqt5qmlmodels5 libqt5quick5 libqt5sensors5
  libqt5svg5 libqt5webchannel5 libqt5webkit5 libqt5widgets5 librsvg2-2 librsvg2-common libsensors-config libsensors5
  libsm6 libsoup2.4-1 libsoup2.4-common libtcl8.6 libthai-data libthai0 libtheora0 libvisual-0.4-0 libvorbis0a
  libvorbisenc2 libwacom-bin libwacom-common libwacom9 libwayland-client0 libwayland-cursor0 libwayland-egl1
  libwayland-server0 libwoff1 libx11-xcb1 libxaw7 libxcb-dri2-0 libxcb-dri3-0 libxcb-glx0 libxcb-icccm4 libxcb-image0
  libxcb-keysyms1 libxcb-present0 libxcb-randr0 libxcb-render-util0 libxcb-shape0 libxcb-sync1 libxcb-util1
  libxcb-xfixes0 libxcb-xinerama0 libxcb-xinput0 libxcb-xkb1 libxcomposite1 libxcursor1 libxdamage1 libxfixes3
  libxfont2 libxi6 libxinerama1 libxkbcommon-x11-0 libxkbcommon0 libxkbfile1 libxmu6 libxpm4 libxrandr2 libxshmfence1
  libxslt1.1 libxt6 libxtst6 libxxf86vm1 modemmanager qt5-gtk-platformtheme qttranslations5-l10n session-migration tcl
  tcl8.6 ubuntu-mono usb-modeswitch usb-modeswitch-data wpasupplicant x11-common x11-xkb-utils xfonts-base
  xfonts-encodings xfonts-utils xnest xserver-common
Suggested packages:
  avahi-autoipd gvfs colord cups-common libvisual-0.4-plugins gnome-shell | notification-daemon avahi-autoipd
  | zeroconf opus-tools pcscd qt5-image-formats-plugins qtwayland5 qt5-qmltooling-plugins librsvg2-bin lm-sensors
  tcl-tclreadline comgt wvdial wpagui libengine-pkcs11-openssl
The following NEW packages will be installed:
  adwaita-icon-theme at-spi2-core avahi-daemon dconf-gsettings-backend dconf-service fontconfig geoclue-2.0
  glib-networking glib-networking-common glib-networking-services gsettings-desktop-schemas gstreamer1.0-plugins-base
  gtk-update-icon-cache hicolor-icon-theme humanity-icon-theme iio-sensor-proxy libatk-bridge2.0-0 libatk1.0-0
  libatk1.0-data libatspi2.0-0 libavahi-client3 libavahi-common-data libavahi-common3 libavahi-core7 libavahi-glib1
  libcairo-gobject2 libcdparanoia0 libcolord2 libcups2 libdaemon0 libdatrie1 libdconf1 libdouble-conversion3
  libdrm-amdgpu1 libdrm-intel1 libdrm-nouveau2 libdrm-radeon1 libegl-mesa0 libegl1 libepoxy0 libevdev2 libfontenc1
  libgbm1 libgdk-pixbuf-2.0-0 libgdk-pixbuf2.0-bin libgdk-pixbuf2.0-common libgl1 libgl1-amber-dri libgl1-mesa-dri
  libglapi-mesa libglvnd0 libglx-mesa0 libglx0 libgraphite2-3 libgstreamer-plugins-base1.0-0 libgtk-3-0 libgtk-3-bin
  libgtk-3-common libgudev-1.0-0 libharfbuzz0b libhyphen0 libice6 libinput-bin libinput10 libjson-glib-1.0-0
  libjson-glib-1.0-common libllvm15 libmbim-glib4 libmbim-proxy libmd4c0 libmm-glib0 libmtdev1 libnl-route-3-200
  libnotify4 libnss-mdns libogg0 libopus0 liborc-0.4-0 libpango-1.0-0 libpangocairo-1.0-0 libpangoft2-1.0-0
  libpciaccess0 libpcre2-16-0 libpcsclite1 libproxy1v5 libqmi-glib5 libqmi-proxy libqt5core5a libqt5dbus5 libqt5gui5
  libqt5network5 libqt5positioning5 libqt5printsupport5 libqt5qml5 libqt5qmlmodels5 libqt5quick5 libqt5sensors5
  libqt5svg5 libqt5webchannel5 libqt5webkit5 libqt5widgets5 librsvg2-2 librsvg2-common libsensors-config libsensors5
  libsm6 libsoup2.4-1 libsoup2.4-common libtcl8.6 libthai-data libthai0 libtheora0 libvisual-0.4-0 libvorbis0a
  libvorbisenc2 libwacom-bin libwacom-common libwacom9 libwayland-client0 libwayland-cursor0 libwayland-egl1
  libwayland-server0 libwoff1 libx11-xcb1 libxaw7 libxcb-dri2-0 libxcb-dri3-0 libxcb-glx0 libxcb-icccm4 libxcb-image0
  libxcb-keysyms1 libxcb-present0 libxcb-randr0 libxcb-render-util0 libxcb-shape0 libxcb-sync1 libxcb-util1
  libxcb-xfixes0 libxcb-xinerama0 libxcb-xinput0 libxcb-xkb1 libxcomposite1 libxcursor1 libxdamage1 libxfixes3
  libxfont2 libxi6 libxinerama1 libxkbcommon-x11-0 libxkbcommon0 libxkbfile1 libxmu6 libxpm4 libxrandr2 libxshmfence1
  libxslt1.1 libxt6 libxtst6 libxxf86vm1 modemmanager qt5-gtk-platformtheme qttranslations5-l10n session-migration tcl
  tcl8.6 ubuntu-mono usb-modeswitch usb-modeswitch-data wkhtmltopdf wpasupplicant x11-common x11-xkb-utils xfonts-base
  xfonts-encodings xfonts-utils xnest xserver-common
0 upgraded, 177 newly installed, 0 to remove and 6 not upgraded.
Need to get 98.5 MB of archives.
After this operation, 388 MB of additional disk space will be used.

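Note: wkhtmltopdf pulls in a lot of dependencies (177 new packages, 388 MB of additional disk space), so the install is fairly heavy.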
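
Before moving on, it can help to confirm that both commands actually ended up on the PATH. The following is a minimal sketch; check_tools is a hypothetical helper name of mine and is not part of either package.

```shell
# Report whether each named command is available on PATH.
# Returns non-zero if at least one of them is missing.
check_tools() {
  local tool
  local missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found: $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# Usage: check the two tools used in this post.
check_tools pdftohtml wkhtmltopdf || echo "install the missing tools with apt-get first"
```

If either tool is reported as missing, rerun the corresponding apt-get install command above.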

Converting HTML to PDF

Use wkhtmltopdf for the conversion. Its help output lists the available options:

$ wkhtmltopdf --help
Name:
  wkhtmltopdf 0.12.6

Synopsis:
  wkhtmltopdf [GLOBAL OPTION]... [OBJECT]... <output file>

Document objects:
  wkhtmltopdf is able to put several objects into the output file, an object is
  either a single webpage, a cover webpage or a table of contents.  The objects
  are put into the output document in the order they are specified on the
  command line, options can be specified on a per object basis or in the global
  options area. Options from the Global Options section can only be placed in
  the global options area.

  A page objects puts the content of a single webpage into the output document.

  (page)? <input url/file name> [PAGE OPTION]...
  Options for the page object can be placed in the global options and the page
  options areas. The applicable options can be found in the Page Options and
  Headers And Footer Options sections.

  A cover objects puts the content of a single webpage into the output document,
  the page does not appear in the table of contents, and does not have headers
  and footers.

  cover <input url/file name> [PAGE OPTION]...
  All options that can be specified for a page object can also be specified for
  a cover.

  A table of contents object inserts a table of contents into the output
  document.

  toc [TOC OPTION]...
  All options that can be specified for a page object can also be specified for
  a toc, further more the options from the TOC Options section can also be
  applied. The table of contents is generated via XSLT which means that it can
  be styled to look however you want it to look. To get an idea of how to do
  this you can dump the default xslt document by supplying the
  --dump-default-toc-xsl, and the outline it works on by supplying
  --dump-outline, see the Outline Options section.

Description:
  Converts one or more HTML pages into a PDF document, *not* using wkhtmltopdf
  patched qt.

Global Options:
      --collate                       Collate when printing multiple copies
                                      (default)
      --no-collate                    Do not collate when printing multiple
                                      copies
      --copies <number>               Number of copies to print into the pdf
                                      file (default 1)
  -H, --extended-help                 Display more extensive help, detailing
                                      less common command switches
  -g, --grayscale                     PDF will be generated in grayscale
  -h, --help                          Display help
      --license                       Output license information and exit
      --log-level <level>             Set log level to: none, error, warn or
                                      info (default info)
  -l, --lowquality                    Generates lower quality pdf/ps. Useful to
                                      shrink the result document space
  -O, --orientation <orientation>     Set orientation to Landscape or Portrait
                                      (default Portrait)
  -s, --page-size <Size>              Set paper size to: A4, Letter, etc.
                                      (default A4)
  -q, --quiet                         Be less verbose, maintained for backwards
                                      compatibility; Same as using --log-level
                                      none
      --read-args-from-stdin          Read command line arguments from stdin
      --title <text>                  The title of the generated pdf file (The
                                      title of the first document is used if not
                                      specified)
  -V, --version                       Output version information and exit

Reduced Functionality:
  This version of wkhtmltopdf has been compiled against a version of QT without
  the wkhtmltopdf patches. Therefore some features are missing, if you need
  these features please use the static version.

  Currently the list of features only supported with patch QT includes:

 * Printing more than one HTML document into a PDF file.
 * Running without an X11 server.
 * Adding a document outline to the PDF file.
 * Adding headers and footers to the PDF file.
 * Generating a table of contents.
 * Adding links in the generated PDF file.
 * Printing using the screen media-type.
 * Disabling the smart shrink feature of WebKit.

Contact:
  If you experience bugs or want to request new features please visit
  <https://wkhtmltopdf.org/support.html>

A web page can be converted to PDF by passing its URL:

$ wkhtmltopdf https://blog.k-bushi.com/post/column/pass-fp3/ pass-fp3.pdf

Note: this should also work for a page served from a local server.
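
When converting several pages, typing the output name each time gets tedious. Here is a rough sketch that derives the PDF name from the last path segment of the URL and passes a few of the global options shown in the help output above (-s, -O, --log-level). The function names pdf_name_from_url and html_to_pdf are my own, and the option values are just examples.

```shell
# Derive a PDF file name from the last path segment of a URL.
pdf_name_from_url() {
  local url="${1%/}"        # drop one trailing slash, if any
  local name="${url##*/}"   # keep only the last path segment
  name="${name%.html}"      # drop a trailing .html extension
  printf '%s\n' "${name:-index}.pdf"
}

# Convert one URL to PDF using options listed in `wkhtmltopdf --help`.
# Global options go before the input URL, as in the Synopsis.
html_to_pdf() {
  local url="$1"
  local out
  out="$(pdf_name_from_url "$url")"
  wkhtmltopdf -s A4 -O Portrait --log-level warn "$url" "$out"
}

# Usage (uncomment to run; needs wkhtmltopdf and network access):
# html_to_pdf https://blog.k-bushi.com/post/column/pass-fp3/
```

For the URL used in this post, the derived file name is pass-fp3.pdf, the same name I typed by hand above.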

↓ Confirmed that the page was converted to a PDF. [Image: pass-fp3]
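
Besides opening the file in a viewer, a quick sanity check from the shell is to look at the first bytes, since a PDF file normally begins with the header "%PDF-". This is only a sketch under that assumption; is_pdf is a hypothetical helper name, and it checks the header only, not whether the rest of the file is valid.

```shell
# Succeed if the file starts with the "%PDF-" header.
is_pdf() {
  [ "$(head -c 5 "$1" 2>/dev/null)" = "%PDF-" ]
}

# Usage:
if is_pdf pass-fp3.pdf; then
  echo "pass-fp3.pdf looks like a PDF"
else
  echo "pass-fp3.pdf is missing or does not start with a PDF header"
fi
```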

Converting PDF to HTML

$ pdftohtml --help
pdftohtml version 22.02.0
Copyright 2005-2022 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1999-2003 Gueorgui Ovtcharov and Rainer Dorsch
Copyright 1996-2011 Glyph & Cog, LLC

Usage: pdftohtml [options] <PDF-file> [<html-file> <xml-file>]
  -f <int>              : first page to convert
  -l <int>              : last page to convert
  -q                    : don't print any messages or errors
  -h                    : print usage information
  -?                    : print usage information
  -help                 : print usage information
  --help                : print usage information
  -p                    : exchange .pdf links by .html
  -c                    : generate complex document
  -s                    : generate single document that includes all pages
  -dataurls             : use data URLs instead of external images in HTML
  -i                    : ignore images
  -noframes             : generate no frames
  -stdout               : use standard output
  -zoom <fp>            : zoom the pdf document (default 1.5)
  -xml                  : output for XML post-processing
  -noroundcoord         : do not round coordinates (with XML output only)
  -hidden               : output hidden text
  -nomerge              : do not merge paragraphs
  -enc <string>         : output text encoding name
  -fmt <string>         : image file format for Splash output (png or jpg)
  -v                    : print copyright and version info
  -opw <string>         : owner password (for encrypted files)
  -upw <string>         : user password (for encrypted files)
  -nodrm                : override document DRM settings
  -wbt <fp>             : word break threshold (default 10 percent)
  -fontfullname         : outputs font full name

For now, let's run it with the default options.

$ pdftohtml pass-fp3.pdf pass-fp3.html

That produced quite a few files:

$ ls
pass-fp3-1_1.jpg  pass-fp3-3_1.jpg  pass-fp3-4_1.jpg  pass-fp3-5_1.jpg  pass-fp3.pdf       pass-fp3s.html
pass-fp3-2_1.jpg  pass-fp3-3_2.jpg  pass-fp3-4_2.jpg  pass-fp3.html     pass-fp3_ind.html

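The result ↓ [Image: pass-fp3-html]

I didn't want the frame on the left, and it turns out the file with the "s" suffix (pass-fp3s.html) doesn't have it.
By default pdftohtml appears to write a frameset: pass-fp3.html is the frame page, pass-fp3_ind.html is the page index shown on the left, and pass-fp3s.html is the content itself. The -noframes option listed in the help should skip the frame files.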
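
If the frames are not needed at all, the -noframes option from the help output above looks like the way to go. The sketch below is untested against this particular PDF; html_name_for_pdf and pdf_to_html_noframes are hypothetical helper names. Images are presumably still written as separate files unless -dataurls is also passed.

```shell
# Build the HTML output name from the PDF name (pass-fp3.pdf -> pass-fp3.html).
html_name_for_pdf() {
  local pdf="$1"
  printf '%s\n' "${pdf%.pdf}.html"
}

# Convert a PDF to HTML without the frame files.
# -noframes is listed in `pdftohtml --help` as "generate no frames".
pdf_to_html_noframes() {
  local pdf="$1"
  pdftohtml -noframes "$pdf" "$(html_name_for_pdf "$pdf")"
}

# Usage (uncomment to run; needs pdftohtml and the PDF created earlier):
# pdf_to_html_noframes pass-fp3.pdf
```

The -f and -l options from the same help output can be added to limit the conversion to a page range.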

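Conclusion

I tried out converting PDF to HTML and HTML to PDF.
I use wkhtmltopdf at work in particular, and it is very handy. Getting the layout right takes some effort, but it is invaluable for features that output reports, forms, and other documents.
After all, many customers want their output as PDF.
This was my first time using pdftohtml, but it looks useful when you want to get the contents of a PDF into HTML as-is and work with them from there.
Even if you don't use the output unchanged, it comes out in reasonably good shape, so in many cases a little tweaking of the source is probably all that's needed.
There are a lot of PDF tools out there, and they are fun to explore.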
