Sunday, 20 September 2009

Resizing images in a pdf document

Today I'm going to resize an image within a PDF document from the command line (batch mode). There is a plethora of free tools available, but I will stick with the following ones:
  • pdfimages
  • convert (from the venerable ImageMagick toolkit)
  • pdflatex
  • a PDF viewer of your choice
Because you can't really edit a PDF document (except with some expensive proprietary tools, e.g. Adobe Acrobat), we have to extract the desired data, process it and output it into a new PDF document.

So for image resizing the workflow will be:
Extract image -> Resize image -> Output to new pdf document

Sounds easy, doesn't it? But there's one caveat:
By resizing I mean changing the physical size of the image as it is output on the printing device.

The output device (printer, screen) has a pixel density associated with it, called dots per inch or dpi (read the definition if you're not familiar with it).
If the dpi values of the display device and the printer don't match, images that don't have a fixed dpi associated with them will have different physical sizes on these devices. In short, understanding dpi is essential.

To avoid this problem we have to determine the dpi value of the image in the PDF document and, after resizing, output it with the same dpi value.

That said, there are actually two methods of changing the physical size of an image:
  1. as mentioned above: keep the dpi (original dpi equals new dpi) and resize the image in pixels
  2. keep the pixel size of the image and change the dpi (original dpi doesn't equal new dpi)
*By original/new dpi I mean the dpi of the image in the original/new PDF document.


The latter is admittedly the worse approach, because output devices have an intrinsic dpi value that yields the best results; for any other dpi value the output device has to scale the data to its intrinsic dpi value.
Of course you have no control over the algorithms applied in this process, in stark contrast to the first method.
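
As a quick sanity check of the two methods, the arithmetic can be sketched in shell. The numbers (an 800 pixel wide image at 160 dpi, shrunk to 75%) are made up for illustration, and the helper name physical_width is hypothetical:

```shell
# physical width in inches = pixel width / dpi
physical_width() {
    awk -v px="$1" -v dpi="$2" 'BEGIN { printf "%.2f\n", px / dpi }'
}

physical_width 800 160      # original image: 5.00 inches
physical_width 600 160      # method 1: 75% of the pixels, same dpi: 3.75 inches
physical_width 800 213.33   # method 2: same pixels, dpi divided by 0.75: 3.75 inches
```

Both methods end up at the same physical width; they differ only in who does the scaling.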

Unfortunately I haven't found an easy way to determine the dpi value of an image within a PDF document.
In Acrobat Professional 6+ there are allegedly tools called "Preflight" and "PitStop" which can extract this information; both are very expensive.

My approach is to guess the dpi value, generate the PDF document, and compare the image size in a PDF viewer with the original PDF document (assuming equal horizontal and vertical dpi values).
This works quite well for most PDF documents, but there are also documents where the horizontal dpi value differs from the vertical one. In this case you first adjust the horizontal dpi value until the widths of both images match, and then adjust the vertical dpi value until the heights match (or the other way round).
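
If your viewer lets you measure the image, the first guess need not be blind: the dpi value is simply the pixel count divided by the size in inches. A small sketch, where the helper name guess_dpi and the numbers are assumptions for illustration:

```shell
# dpi = pixels / inches, rounded to the nearest integer
guess_dpi() {
    awk -v px="$1" -v inches="$2" 'BEGIN { printf "%.0f\n", px / inches }'
}

guess_dpi 800 5     # an 800 pixel wide image shown 5 inches wide: 160
guess_dpi 1200 4    # 300
```

Call it once with the width and once with the height if you suspect that the two dpi values differ.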

Task

Say we've got a PDF file called "a.pdf" with two images in it, and we want a new PDF containing the first image at 75% of its original size.

Steps

To extract the images we call pdfimages:

$ pdfimages a.pdf image
This will create two files: "image-000.ppm" and "image-001.ppm".
"image" is just a prefix for the filenames the extracted images are saved under.

To determine the dpi value of the image we have to create a PDF document containing the image with a guessed dpi value, and adjust that value until the images have the same size in the PDF viewer (as explained in detail above).

To associate a dpi value with the image we have to convert it into a JPEG and write the value into its header, since the \includegraphics command doesn't take a dpi argument and pdflatex doesn't support the "ppm" image type:

$ convert image-000.ppm -quality 100% -density 160x160 image1.jpg
Here we converted the image into a JPEG with a dpi value (density) of 160 and nearly lossless compression (quality 100%).

The LaTeX document for creating the PDF is very simple; it contains just the bare scaffold needed to include an image:

\documentclass{scrartcl}
\usepackage[utf8]{inputenc}
\usepackage{graphicx}

\begin{document}
\begin{figure}
\includegraphics{image1.jpg}
\end{figure}
\end{document}

Save this file as "b.tex". To create the PDF document "b.pdf" simply run:

$ pdflatex b.tex

After you have successfully determined the dpi value(s) of the image you can go on to the final step of resizing it:

$ convert image-000.ppm -quality 100% -density 160x160 -scale 75% image1.jpg
It's the same command as above with a new argument, "-scale 75%", which rescales the image to 75% of its original (pixel) size.

Again run

$ pdflatex b.tex
and you should now have a PDF document with the image at 75% of its original size.

Saturday, 4 July 2009

Shell script pretty printer

If your favorite syntax highlighter doesn't support HTML output and/or your language, chances are Vim does.

Vim is a highly customizable text editor whose functionality is arguably comparable only to Emacs (which I have never tried!).

If your script gets highlighted correctly in Vim you can easily export it to an HTML page for use in your blog with the script Colorize.

By default this script is for Mac OS X and only highlights shell scripts (bash), but it's easily adaptable to every supported language.

It uses the 2html.vim plugin, which features:
  • XHTML/HTML 4 support
  • CSS styles for highlighting
  • Line numbering
  • Folding

The script makes use of the "-s-ex" option of Vim (for more info open Vim and type ":help -s-ex").
If you call 2html.vim interactively and the colors of the output HTML file don't fit, try
":set t_Co=256" before exporting. This tells Vim that the terminal supports 256 colors.

Here's an example of an interactive Vim session, with options set for CSS styles, XHTML, line numbering, unfolding, tab expansion and the unobtrusive colorscheme "peachpuff" (also used on this site). To generate the HTML code simply type ":run! syntax/2html.vim" and then save the new window.

vim -e  \
-S <(
   echo '
      set nocompatible
      set t_Co=256
      let use_xhtml=1
      let html_number_lines=1
      let html_use_css=1
      let html_ignore_folding=1
      set expandtab
      set tabstop=4
      set background=light
      syntax on
      colorscheme peachpuff
      vi
   '
) <your_script_file>

Note: As of Vim 7.0, enabling line numbering causes an error for LaTeX files, which results in incomplete formatting.

Friday, 3 July 2009

Deep file encoding converter

So let's say we copied the directory "Miguel Bosé", containing some mp3 files, from a FAT32 partition to a Linux one (UTF-8).

In the terminal it will display like this:
Miguel Bos�
Miguel Bos�/CD 1
Miguel Bos�/CD 1/14 - Si... Piensa En M�.mp3
Miguel Bos�/CD 1/13 - Never Gonna Fall In Love Again.mp3
Miguel Bos�/CD 1/08 - Te Amar�.mp3
Miguel Bos�/CD 1/09 - M�rchate Ya.mp3
...
Miguel Bos�/CD 2
Miguel Bos�/CD 2/19 - Te Dir�.mp3
Miguel Bos�/CD 2/23 - Voy A Ganar.mp3
Miguel Bos�/CD 2/16 - Sevilla.mp3
Miguel Bos�/CD 2/25 - Se�or Padre.mp3
....
The problem is, as usual, the encoding. Under Windows the names were ISO-8859-1 encoded, but under Linux they are expected to be UTF-8, so we have to convert the file and directory names.

Our first attempt would probably be a simple script like the following one:
find ./Migu* | # We use the * wildcard because we can't easily type the character
while read f; do
    fc=$(echo -n "$f" | iconv -f iso8859-1 -t utf-8) # Convert the filename
    mv "$f" "$fc"
done

This script will fail on the first nested entry with a "file not found" error. Why?
Because first it does: mv "Miguel Bos�" "Miguel Bosé"
Second: mv "Miguel Bos�/CD 1" "Miguel Bosé/CD 1", which must fail because we have just renamed the parent directory. It actually should be: mv "Miguel Bosé/CD 1" "Miguel Bosé/CD 1" (which is, by the way, a no-op and should be filtered out, but I'll come to that in a minute...)

Obviously the problem in the foregoing example is that the renaming of the parent directory isn't propagated to its children (inner files/directories). That's because we piped the output of the find command (which was executed at the beginning) into the loop.
One remedy would be to re-execute find every time we rename a directory. We would then have to keep track of our current position in the directory hierarchy, otherwise we would end up in an endless loop. Not very desirable. If this sounds complicated to you, you're right; there's actually a much simpler solution.

Some handy tools:
dirname ... returns the directory part of the given filepath
e.g.:
$ dirname "Miguel Bosé/CD 1"
Miguel Bosé

basename ... returns the last component of the given filepath
e.g.:
$ basename "Miguel Bosé/CD 1"
CD 1

hexdump ... returns a formatted hexadecimal representation of the file contents.
A badly documented feature is the -e option, which lets you define the format of the output: hexdump -e ' [iterations]/[byte_count] "[format string]" '

Let's say we want to convert a string into its hex code using the format \x??
e.g. "Hello" to "\x48\x65\x6c\x6c\x6f" (one hex representation per byte)
The hexdump command would be:
hexdump -v -e '1/1 "\\\x"' -e '1/1 "%02x"'
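
If hexdump is not at hand, the same conversion can be sketched with od, which is part of POSIX. Note that this swaps in a different tool from the one used in this post, and the function name to_hex_escapes is made up:

```shell
# Print the bytes of $1 as a sequence of \xNN escapes.
to_hex_escapes() {
    printf '%s' "$1" |
        od -An -v -tx1 |          # one two-digit hex value per byte
        tr -s ' \n' '\n' |        # put every value on its own line
        sed -e '/^$/d' -e 's/^/\\x/' |
        tr -d '\n'
    echo
}

to_hex_escapes Hello   # prints \x48\x65\x6c\x6c\x6f
```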


So all we have to do is rename the entries in topological order (parents before children). Thankfully find lists them in exactly that order.

So our next attempt would be something like this:

data=$(find Miguel*) # Fetch the list of filepaths to be processed

while [ "$data" != "" ]; do
    f=$(echo -n "$data" | head -n 1 | tr -d '\n') # extract the first filepath from the list
    dir=$(dirname "$f" | tr -d '\n') # extract dirname
    dirc=$(echo -n "$dir" | iconv -f "iso8859-1") # dirc ... converted dirname
    to=$(echo -n "$f" | iconv -f "iso8859-1") # converted filepath
    from=$(echo -n "$f" | sed -e "s|$dir|$dirc|") # partly converted filepath
    mv "$from" "$to"
    data=$(echo -n "$data" | sed -e '1 d') # remove the processed filepath from the list
done

Unfortunately it will also fail on the second renaming, as before, but for a different reason ;)
The reason is this limited/buggy sed command:
s/Miguel Bos�/Miguel Bosé/ somehow can't match the unrecognized "�" (0xe9 ... é in ISO-8859-1) in "Miguel Bos�", presumably because the lone byte is not a valid character in a UTF-8 locale.

So the next attempt was to convert "Miguel Bos�" into its hex representation to bypass that check. Fiddlesticks!
sed is even so "intelligent" as to interpret the regex meaning of the resulting characters.
So if you write a dot "." in hex as "\x2e" and run "sed -e 's/\x2e/A/'", the dot acts as the regex wildcard and an arbitrary character gets replaced with "A"!!!

Fortunately perl puts things right: perl -pe "s|||" is able to do a binary match with a hex representation as the pattern:
$ echo "This is . hack" | perl -pe "s|\x2e|a|"
This is a hack

The revised script is:

data=$(find Miguel*) # Fetch the list of filepaths to be processed
enc="iso8859-1"

while [ "$data" != "" ]; do
    f=$(echo -n "$data" | head -n 1 | tr -d '\n') # extract the first filepath from the list
    # extract dirname and convert it to its hex representation
    dir=$(dirname "$f" | tr -d '\n' | hexdump -v -e '1/1 "\\\x"' -e '1/1 "%02x"')
    # extract dirname and convert it to utf-8
    dirc=$(dirname "$f" | tr -d '\n' | iconv -f "$enc")
    # converted filepath
    to=$(echo -n "$f" | iconv -f "$enc")
    # from contains the filepath with the dirname part replaced
    from=$(echo -n "$f" | perl -pe "s|$dir|$dirc|")
    if [ "$from" != "$to" ]; then # only rename if the filepath has changed
        mv "$from" "$to"
    fi
    data=$(echo -n "$data" | sed -e '1 d') # remove the processed filepath from the list
done
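
A different route to the same goal, not the one taken above: list the tree with find -depth, which prints the contents of a directory before the directory itself, and rename only the last path component of each entry. Every path in the list then stays valid until it is processed, so no pattern substitution is needed. This is only a sketch under the assumptions that all names are still ISO-8859-1 (running it twice would double-encode them) and that no name contains a newline; the function name convert_names is made up.

```shell
convert_names() {
    # Collect the whole list first; -depth puts children before their parents.
    list=$(find "$1" -depth)
    printf '%s\n' "$list" | while IFS= read -r f; do
        dir=$(dirname "$f")
        base=$(basename "$f")
        # Convert only the last path component; skip it if iconv fails.
        new=$(printf '%s' "$base" | iconv -f ISO-8859-1 -t UTF-8) || continue
        # Skip no-ops such as pure ASCII names.
        [ "$base" = "$new" ] || mv "$f" "$dir/$new"
    done
}

# Usage (the wildcard avoids typing the broken character):
# convert_names Miguel*
```

Because the parent directory is renamed last, "Miguel Bos�/CD 1" is still a valid path at the moment it is handled.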

Wednesday, 29 April 2009

iconv

iconv is a Linux utility which converts data from one encoding to another. Under Linux, textual data is usually stored in a Unicode encoding (UTF-8), because it covers all character sets. So for textual data to be displayed correctly in most programs it has to be converted, if it isn't already, from the source encoding (e.g. cp1251 or cp866, the most popular Cyrillic encodings) to Unicode. This is where iconv comes in handy.

Invocation options:
-l ... lists all available encodings (including aliases like latin1, cyrillic, ...)
-f ... the encoding to convert from
-t ... the encoding to convert to (default: the encoding of the current locale, usually UTF-8)
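
A minimal sketch of -f and -t in action. The byte \361 (octal) is ñ in ISO-8859-1; od is used only to make the resulting bytes visible, and the function name latin1_to_utf8_hex is made up for this example:

```shell
# Convert an ISO-8859-1 string to UTF-8 and show the result as hex bytes.
latin1_to_utf8_hex() {
    printf '%s' "$1" | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1 | tr -d ' \n'
    echo
}

latin1_to_utf8_hex "$(printf '\361')"   # prints c3b1, the two-byte UTF-8 form of ñ
```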


Getting started



Often the ID3 tags of mp3 files are encoded in some strange encoding.
Let's take a Russian mp3 file, "01-Posledny_Zakat.mp3", whose ID3 tag I know is encoded in cp1251 (as mentioned above, a Cyrillic encoding).

"But what if I don't know what encoding it is?" you might ask.
Be patient, we will tackle that in the next chapter.

id3 -Rl 01-Posledny_Zakat.mp3
Filename: 01-Posledny_Zakat.mp3
Title: ��������� �����
Artist: ����
Album: ����������
Year: 2006
Genre: Heavy Metal (137)

Converting it to Unicode (the pipeline prints the id3 command that writes the converted tags):

id3 -R -l 01-Posledny_Zakat.mp3 | grep -i "title\|artist\|album" | iconv -f cp1251 -t utf-8 | cut -d ' ' -f 2- | (read title && read artist && read album && echo "id3 -t '$title' -a '$artist' -A '$album'")

(If you have Cyrillic fonts installed you should now be able to see some cool characters)

id3 -Rl 01-Posledny_Zakat.mp3
Filename: 01-Posledny_Zakat.mp3
Title: Последний закат
Artist: Ария
Album: Армагеддон
Year: 2006
Genre: Heavy Metal (137)


Command for batch converting all mp3-files within a directory:

id3 -R -l *.mp3 | grep -i "filename\|title\|artist\|album" | iconv -f cp1251 -t utf-8 | cut -d ' ' -f 2- | (while read filename; do read title && read artist && read album && id3 -t "$title" -a "$artist" -A "$album" "$filename"; done)



Determine unknown encoding



A) Requires you to be able to recognize correctly encoded data (i.e. you should be able to read the target language)


If you don't know which encoding your data is in, you can try a brute-force method, testing every (reasonable) encoding.
Let's say we want to try all encodings starting with CP\d (where \d is a digit). These include, by the way, the common Cyrillic code pages (cp866, cp1251), so we assume our data is in some Cyrillic encoding.


iconv -l | grep -i "^cp[0-9]" | sed -e 's|//||' | while read i; do
    in=$(find ./ -maxdepth 1 -type d -ctime -1) # sample data: names of directories changed within the last day
    str=$(echo "$in" | iconv -f "$i" -t utf-8 2>/dev/null)
    if [ "$?" -ne "0" ]; then continue; fi # skip encodings that can't decode the data
    echo "$str"
    echo "Encoding: '$i'"
    sleep 1
done

B) Requires you to know exactly which character should be displayed in place of a wrongly decoded one


You've got an mp3 file displayed as:

Medina Azahara - Caravana Espa�ola - 8 - Caravana Espa�ola.mp3

You know that it should be:

Medina Azahara - Caravana Española - 8 - Caravana Española.mp3

Let's dump the binary data: ls *mp3 | hexdump -C
-C ... tells hexdump to output the hex data and the corresponding ASCII characters side by side

Excerpt of output looks like this:
...
000002d0 6f 6d 29 20 2d 20 43 61 72 61 76 61 6e 61 20 45 |om) - Caravana E|
000002e0 73 70 61 a4 6f 6c 61 20 2d 20 38 20 2d 20 43 61 |spa.ola - 8 - Ca|
...

So we see that the byte 0xa4 (hexadecimal) should be displayed as ñ.
(The assumption is of course that it is an 8-bit encoding.)

Now we can search Google for "0xa4 ñ".
With a little luck you get a site that names the corresponding encoding.
(In this case it was cp850, "DOS latin1")
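
Instead of a web search, the lookup can also be sketched with iconv itself: decode the suspicious byte with each candidate encoding and keep those that yield the expected character. The function name matching_encodings and the list of candidates are assumptions for illustration:

```shell
# Print every candidate encoding in which the bytes $1 decode to the UTF-8 text $2.
# All remaining arguments are candidate encoding names.
matching_encodings() {
    raw=$1
    want=$2
    shift 2
    for enc in "$@"; do
        # Skip encodings that iconv doesn't know or that can't decode the bytes.
        got=$(printf '%s' "$raw" | iconv -f "$enc" -t UTF-8 2>/dev/null) || continue
        [ "$got" = "$want" ] && printf '%s\n' "$enc"
    done
    return 0
}

# Byte 0xa4 (octal 244) should become ñ (UTF-8 bytes 0xc3 0xb1, octal 303 261):
matching_encodings "$(printf '\244')" "$(printf '\303\261')" ISO-8859-1 CP850 CP1252
```

With these three candidates only CP850 is printed; in the other two, 0xa4 is the currency sign.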

Comparison to Method A:
Advantage: It is usually faster; Method A can take quite a lot of time if you don't know which encodings to
restrict your search to.
Disadvantage: No guarantee of success.

Library Quirks (on debian)

Installing libraries on exotic paths

Because Debian is very conservative regarding its package management, new packages (or new versions of them) are likely not to be found in the repository.
Chances are also low that you'll find precompiled deb packages on the web.

So you have to compile the packages on your own.

Following are some issues regarding the configure process when compiling from source.

1) Required libraries aren't found even though you've installed them (using aptitude, apt-get, ...)

Solution: The configure script looks for the header files of the libraries (usually located in /usr/include, /usr/local/include). If you install a library with aptitude, apt-get, ... you only install the shared object files *.so (usually located in /usr/lib, /usr/local/lib), because only these files are needed by other programs at run time (shared objects are the counterpart of *.dll files under Windows and contain the actual executable code).
Header files, on the other hand, are only needed for compiling programs that use those libraries, because they contain the structure declarations, data types and function prototypes of the library.
In order to get them for your already installed package, you have to install the package ending in -dev (which stands for development files).
Example:
Say you've got the package 'libglib2.0-0' installed and configure yields the error 'glib2.0 ... not found'.
All you have to do is install 'libglib2.0-dev': 'sudo apt-get install libglib2.0-dev'

2) Required libraries aren't found even though you've installed them and their header files

The problem is probably that you've installed the libraries into a non-standard location (specifically not /usr or /usr/local). This is achieved by passing the argument --prefix= to configure.

Say you want to install your library into /usr/local/exotic; all you've got to do is ./configure --prefix=/usr/local/exotic (and of course make && make install).
This is useful if you don't want to mess up your system with experimental versions of libraries, because the dynamic linker (ld.so) picks the highest minor version of a library available on your system (in theory the API and the behaviour of a library mustn't change between such versions; reality is a bit different).

Back to the problem: look at the lines above the line 'Checking for ... not found'. If there is a 'Checking for pkg-config ...' among them, then you've just found the problem.

Here is an excerpt from 'man pkg-config':

" The pkg-config program is used to retrieve information about installed
libraries in the system. It is typically used to compile and link
against one or more libraries. Here is a typical usage scenario in a
Makefile:

program: program.c
cc program.c `pkg-config --cflags --libs gnomeui`

pkg-config retrieves information about packages from special metadata
files. These files are named after the package, with the extension .pc.
By default, pkg-config looks in the directory prefix/lib/pkgconfig for
these files; it will also look in the colon-separated (on Windows,
semicolon-separated) list of directories specified by the
PKG_CONFIG_PATH environment variable."

Eureka! So the message ' ... not found' actually just means that pkg-config didn't find a corresponding *.pc file for the library. That's because by default pkg-config searches for its *.pc files in the /usr/lib/pkgconfig and /usr/local/lib/pkgconfig directories. But our library has its *.pc files in /usr/local/exotic/lib/pkgconfig, so, as stated in the man page,
we have to tell pkg-config where to look for the *.pc files using the PKG_CONFIG_PATH variable.

So all you have to do is (in bash):

$ export PKG_CONFIG_PATH=/usr/local/exotic/lib/pkgconfig
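
The export above overwrites any value PKG_CONFIG_PATH already had. If you keep several such prefixes around, a variant that prepends the new directory may be handier. A sketch, with the helper name add_pkgconfig_dir made up for illustration:

```shell
# Prepend a directory to PKG_CONFIG_PATH, keeping any existing entries.
add_pkgconfig_dir() {
    PKG_CONFIG_PATH="$1${PKG_CONFIG_PATH:+:$PKG_CONFIG_PATH}"
    export PKG_CONFIG_PATH
}

add_pkgconfig_dir /usr/local/exotic/lib/pkgconfig
echo "$PKG_CONFIG_PATH"
```

The ${VAR:+...} expansion adds the separating colon only when the variable was already set and non-empty.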

3) pkg-config reports the new version, but configure still picks up the old library

checking for GLIB - version >= 2.17.6...
*** 'pkg-config --modversion glib-2.0' returned 2.18.0, but GLIB (2.12.4)
*** was found! If pkg-config was correct, then it is best
*** to remove the old version of GLib. You may also be able to fix the error
*** by modifying your LD_LIBRARY_PATH enviroment variable, or by editing
*** /etc/ld.so.conf. Make sure you have run ldconfig if that is
*** required on your system.
*** If pkg-config was wrong, set the environment variable PKG_CONFIG_PATH
*** to point to the correct configuration files
no
configure: error:
*** GLIB 2.17.6 or better is required. The latest version of
*** GLIB is always available from ftp://ftp.gtk.org/pub/gtk/.

This means the header files were found correctly, but the library itself wasn't (the old one in the standard location /usr/lib was picked up instead). The remedy is to state explicitly where to search for the libraries through the LD_RUN_PATH variable:

export LD_RUN_PATH=/usr/local/exotic/lib