Sunday, 20 September 2009

Resizing images in a pdf document

Today I'm going to resize an image within a pdf document from the command line (batch mode). There is a plethora of free tools for this, but I will stick with the following ones:
  • pdfimages
  • convert (from the venerable ImageMagick toolkit)
  • pdflatex
  • a pdf viewer of your choice
Because you can't really edit a pdf document (aside from some expensive proprietary tools, e.g. Adobe Acrobat Writer), we have to extract the desired data, process it and output it into a new pdf document.

So for image resizing the workflow will be:
Extract image -> Resize image -> Output to new pdf document

Sounds easy, doesn't it? But there's one caveat:
By resizing I mean changing the physical size of the image as it is output on the printing device.

The output device (printer, screen) has a pixel density associated with it, called dots per inch or dpi (read the definition if you're not familiar with it).
If the dpi values of the display device and the printer don't match, images that have no fixed dpi associated with them will have different physical sizes on these devices. In short: understanding dpi is essential.

To avoid this problem we have to determine the dpi value of the image in the pdf document and, after resizing, output it with the same dpi value.

That said, there are actually two methods of changing the physical size of an image:
  1. as already mentioned above: resize the image in pixels and keep the dpi (original dpi equals new dpi)
  2. change the dpi but keep the pixel size of the image (original dpi doesn't equal new dpi)
*By original/new dpi I mean the dpi of the image in the original/new pdf document


The latter is admittedly the worse approach, because output devices have a certain intrinsic dpi value which yields the best results; with other dpi values the output device has to scale the data to its intrinsic dpi value.
Of course you have no control over the algorithms applied in this process, in stark contrast to the first method.
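
Both methods rest on the same arithmetic: physical size = pixel size / dpi. Here is a minimal sketch of that relation; the helper name physical_size and the numbers (an 800 pixel wide image at 160 dpi) are assumptions chosen for illustration only:

```shell
# physical_size PIXELS DPI: print the printed size in inches (pixels / dpi)
physical_size() {
    awk -v px="$1" -v dpi="$2" 'BEGIN { printf "%.2f\n", px / dpi }'
}

# Method 1: scale the pixels to 75% (800 -> 600) and keep the dpi at 160.
physical_size 600 160
# Method 2: keep the 800 pixels and raise the dpi to 160 / 0.75.
physical_size 800 213.33
```

Both calls print 3.75, i.e. 75% of the original 5 inches, but only the first one changes the pixel data.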

Unfortunately I haven't found an easy way to determine the dpi value of an image within a pdf document.
In Acrobat Professional 6+ there is allegedly a tool called "Preflight" and another called "PitStop" that can extract this information, both very expensive.

My approach is to guess the dpi value, generate the pdf document, and compare the image size in a pdf viewer with the original pdf document (assuming equal horizontal and vertical dpi values).
This works quite well for most pdf documents, but there are also documents where the horizontal dpi value differs from the vertical one. In this case you first adjust the horizontal dpi value so that the widths of both images match and then you adjust the vertical dpi value so that the heights match (or the other way round).
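
If you can measure the size of the image in the original document (an assumption: for example with a ruler tool in your viewer at 100% zoom), the first guess can be computed instead of picked at random. A small sketch with the hypothetical helpers dpi_for and density_arg; the result has the "XxY" form that convert's -density option accepts for differing horizontal and vertical values:

```shell
# dpi_for PIXELS INCHES: dpi at which PIXELS pixels span INCHES inches
dpi_for() {
    awk -v px="$1" -v len="$2" 'BEGIN { printf "%.0f\n", px / len }'
}

# density_arg WIDTH_PX HEIGHT_PX WIDTH_IN HEIGHT_IN
# prints "XDPIxYDPI", usable as the value of convert's -density option
density_arg() {
    printf '%sx%s\n' "$(dpi_for "$1" "$3")" "$(dpi_for "$2" "$4")"
}
```

For an 800x600 pixel image that measures 5 by 4 inches, `density_arg 800 600 5 4` gives 160x150. Treat the result as a starting point and still compare the documents in the viewer.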

Task

Say we've got a pdf file called "a.pdf" with two images in it, and we want a new pdf containing the first image at 75% of its original size.

Steps

To extract the images we call pdfimages:

$ pdfimages a.pdf   image
This will create two files: "image-000.ppm" and "image-001.ppm".
"image" is just a prefix for the names of the files the extracted images are saved in.

To determine the dpi-value of the image we have to create a pdf document with the image in it and with a guessed dpi-value and adjust it until the images have the same size in the pdf viewer (as explained in detail above).

To associate a dpi value with the image we have to convert it into a jpeg and write the value into its header, since the \includegraphics command doesn't take a dpi argument and pdflatex doesn't support the "ppm" image type:

$ convert -quality 100% -density 160 image-000.ppm image1.jpg
Here we converted the image into a jpeg with a dpi-value (density) of 160 and nearly lossless compression - quality 100%.

The LaTeX document that creates the pdf is very simple; it contains just the bare scaffold needed to include an image:

\documentclass{scrartcl}
\usepackage[utf8]{inputenc}
\usepackage{graphicx}

\begin{document}
\begin{figure}
\includegraphics{image1.jpg}
\end{figure}
\end{document}

Save this file as "b.tex". To create the pdf document "b.pdf", simply run:

$ pdflatex b.tex
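
Since the dpi value is found by repeated guessing, it can help to generate the wrapper document from a script. A minimal sketch, assuming a hypothetical helper write_tex that writes the same scaffold as above for an arbitrary image file:

```shell
# write_tex IMAGE OUTFILE: write the minimal LaTeX scaffold around IMAGE.
# In an unquoted here-document "\\" yields a single backslash.
write_tex() {
    cat > "$2" <<EOF
\\documentclass{scrartcl}
\\usepackage[utf8]{inputenc}
\\usepackage{graphicx}

\\begin{document}
\\begin{figure}
\\includegraphics{$1}
\\end{figure}
\\end{document}
EOF
}
```

Usage would be something like `write_tex image1.jpg b.tex && pdflatex b.tex`, repeated after every new convert run.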

After you have successfully determined the dpi value(s) of the image, you can go on to the final step of resizing it:

$ convert -quality 100% -density 160 image-000.ppm -resize 75% image1.jpg
It's the same command as above (use the dpi value you determined in place of 160) with one new argument, "-resize 75%", which resizes the image to 75% of its original pixel size.

Again run

$ pdflatex b.tex
and you should now have a pdf document with the image at 75% of its original size.

Saturday, 4 July 2009

Shell script pretty printer

If your favorite syntax highlighter doesn't support html output and/or your language, chances are Vim does.

Vim is a highly customizable text editor whose functionality is arguably only comparable to Emacs (which I have never tried!).

If your script gets highlighted correctly in Vim, you can easily export it to an html page for use in your blog with the script Colorize.

By default this script is written for Mac OS X and only highlights shell scripts (bash), but it's easily adaptable to every supported language.

It uses the 2html.vim plugin, which features:
  • XHTML/HTML 4 support
  • CSS styles for highlighting
  • Line numbering
  • Folding

The script makes use of the "-s-ex" option for vim (for more info open vim and type ":help -s-ex").
If you call 2html.vim interactively and the colors of the output html file don't fit, try setting
":set t_Co=256" before exporting. This tells vim that the terminal supports 256 colors.

Here's an example of an interactive vim session, with options set for css styles, xhtml, line numbering, unfolding, tab expansion and the unobtrusive colorscheme "peachpuff" (also used on this site). To generate the html code simply type ":runtime! syntax/2html.vim" and then save the new window.

vim -e  \
-S <(
   echo '
      set nocompatible
      set t_Co=256
      let use_xhtml=1
      let html_number_lines=1
      let html_use_css=1
      let html_ignore_folding=1
      set expandtab
      set tabstop=4
      set background=light
      syntax on
      colorscheme peachpuff
      vi
   '
) <your_script_file>

Note: As of Vim 7.0, enabling line numbering causes an error for LaTeX files, which results in incomplete formatting.

Friday, 3 July 2009

Deep file encoding converter

So let's say we copied the directory "Miguel Bosé", containing some mp3 files, from a FAT32 partition to a Linux one (utf-8).

In the terminal it is displayed like this:
Miguel Bos�
Miguel Bos�/CD 1
Miguel Bos�/CD 1/14 - Si... Piensa En M�.mp3
Miguel Bos�/CD 1/13 - Never Gonna Fall In Love Again.mp3
Miguel Bos�/CD 1/08 - Te Amar�.mp3
Miguel Bos�/CD 1/09 - M�rchate Ya.mp3
...
Miguel Bos�/CD 2
Miguel Bos�/CD 2/19 - Te Dir�.mp3
Miguel Bos�/CD 2/23 - Voy A Ganar.mp3
Miguel Bos�/CD 2/16 - Sevilla.mp3
Miguel Bos�/CD 2/25 - Se�or Padre.mp3
....
The problem is, as usual, the encoding. Under Windows the names were iso8859-1 encoded, but under Linux they are expected to be utf-8, so we have to convert the file and directory names.

Our first attempt probably would be a simple script like the following one:
find ./Migu* | #We use the * wildcard, for we can't enter the character easily
while read f; do
fc=$(echo -n "$f" | iconv -f iso8859-1); #Convert filename
mv "$f" "$fc";
done;

This script will fail on the first nested entry with a "file not found" error. Why?
Because first it does: mv "Miguel Bos�" "Miguel Bosé"
Second: mv "Miguel Bos�/CD 1" "Miguel Bosé/CD 1", which must fail because we have just renamed the parent directory. It actually should be: mv "Miguel Bosé/CD 1" "Miguel Bosé/CD 1" (which is, by the way, a no-op and should be filtered out, but I'll come to that in a minute...)

Obviously the problem in the foregoing example was that the renaming of the parent directory wasn't propagated to its children (inner files/directories). That's because the loop works on the path list that find produced before any renaming took place.
One remedy would be to re-execute find every time we rename a directory. We would then have to keep track of our current position in the directory hierarchy, otherwise we would end up in an endless loop. Not very desirable. If this sounds complicated to you, you're right; there's actually a much simpler solution.

Some handy tools:
dirname ... returns the directory part of the given filepath
e.g.:
$ dirname "Miguel Bosé/CD 1"
Miguel Bosé

basename ... returns the last component (basename) of the given filepath
e.g.:
$ basename "Miguel Bosé/CD 1"
CD 1

hexdump ... returns a formatted hexadecimal representation of the file contents.
A badly documented feature is the -e option, which lets you define the format of the output: hexdump -e ' [iterations]/[byte_count] "[format string]" '

Let's say we want to convert a string into its hex code using the format \x??
e.g. "Hello" to "\x48\x65\x6c\x6c\x6f" (one-byte hex representation)
The hexdump command would be:
hexdump -v -e '1/1 "\\\x"' -e '1/1 "%01x"'
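
If hexdump is not installed, the same conversion can be sketched with od, tr and sed. This swaps in a different tool chain than the one used in this post, and to_hex is a hypothetical helper name:

```shell
# to_hex STRING: print STRING as a sequence of \xHH escapes, one per byte.
# od -An -v -tx1 dumps every byte as two hex digits without offsets,
# tr removes the separators, sed prefixes each digit pair with "\x".
to_hex() {
    printf '%s' "$1" | od -An -v -tx1 | tr -d ' \n' | sed 's/../\\x&/g'
}

to_hex "Hello"   # prints \x48\x65\x6c\x6c\x6f
```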


So all we have to do is rename the directories in topological order (parents before children). Thankfully find lists them in exactly that order.

So our next attempt would be something like this:

data=$(find Miguel*) #Fetch filepath list to be processed

while [ "$data" != "" ]; do
f=$(echo -n "$data" | head -n 1 | tr -d '\n'); #extract first filepath from the list
dir=$(dirname "$f" | tr -d '\n'); #extract dirname
dirc=$(echo -n "$dir" | iconv -f "iso8859-1"); #dirc .. converted dirname
to=$(echo -n "$f" | iconv -f "iso8859-1"); #converted filepath name
from=$(echo -n "$f" | sed -e "s|$dir|$dirc|"); #partly converted filepath name
mv "$from" "$to";
data=$(echo -n "$data" | sed -e '1 d'); #Remove the currently processed filepath from the list
done

Unfortunately it will also fail on the second renaming as before, but for a different reason ;)
The reason is this limited/buggy sed command:
s/Miguel Bos�/Miguel Bosé/ somehow can't match the unrecognized "�" (0xe9 ... é in iso8859-1) in "Miguel Bos�".

So the next attempt was to convert "Miguel Bos�" into its hex representation to bypass the above check. Fiddlesticks!
Sed is even so "intelligent" that it interprets the regex meaning of the resulting characters.
So if you write a dot "." in hex as "\x2e" and run "sed -e 's/\x2e/A/'", the pattern matches any character instead of just a literal dot!

Fortunately perl puts things right: perl -pe "s|||" is able to match binary data with a hex representation as the pattern:
$ echo "This is . hack" | perl -pe "s|\x2e|a|"
This is a hack

The revised script is:

data=$(find Miguel*) #Fetch filepath list to be processed
enc="iso8859-1"

while [ "$data" != "" ]; do
f=$(echo -n "$data" | head -n 1 | tr -d '\n'); #extract first filepath from the list
#extract dirname and convert to hex representation
dir=$(dirname "$f" | tr -d '\n' | hexdump -v -e '1/1 "\\\x"' -e '1/1 "%01x"');
#extract dirname and convert to utf-8
dirc=$(dirname "$f" | tr -d '\n' | iconv -f "$enc");
#converted filepath name
to=$(echo -n "$f" | iconv -f "$enc");
#from contains replaced dirname-part
from=$(echo -n "$f" | perl -pe "s|$dir|$dirc|");
if [ "$from" != "$to" ]; then #Only try to convert if filepath has changed
#Convert
mv "$from" "$to";
fi
data=$(echo -n "$data" | sed -e '1 d'); #Remove the currently processed filepath from the list
done
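
For comparison, here is a sketch of a different strategy (not the one developed above): let find list children before their parents with -depth and rename only the last path component. A parent directory then still has its old name while its contents are processed, so no partially converted paths are needed. The helper name convert_tree is hypothetical, and the sketch assumes that all names below the root are in the given encoding and contain no newlines:

```shell
# convert_tree ROOT ENCODING: convert all names below ROOT to UTF-8
convert_tree() {
    root=$1 enc=$2
    # -depth prints the contents of a directory before the directory itself
    find "$root" -mindepth 1 -depth | while IFS= read -r path; do
        dir=$(dirname -- "$path")
        base=$(basename -- "$path")
        # convert only the last component; skip names that iconv rejects
        newbase=$(printf '%s' "$base" | iconv -f "$enc" -t UTF-8) || continue
        # pure ASCII names are unchanged and need no mv
        if [ "$base" != "$newbase" ]; then
            mv -- "$path" "$dir/$newbase"
        fi
    done
}
```

It would be called as `convert_tree "./music" iso8859-1`. Running it twice on the same tree would convert already converted names again, so it is meant as a one-shot operation.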

Wednesday, 29 April 2009

iconv

iconv is a Linux utility which converts data from one encoding to another. Under Linux, textual data is usually stored in a Unicode encoding (UTF-8), because it supports all character sets. So for textual data to be displayed correctly in most programs it has to be converted, if it isn't already, from the source encoding (e.g. cp1251 or cp866, the most popular Cyrillic encodings) to Unicode. This is where iconv comes in handy.

Invocation options:
-l ... lists all available encodings (also aliases, like latin1, cyrillic, ...)
-f ... the encoding to convert from
-t ... the encoding to convert to (default: the encoding of the current locale)
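
A minimal sketch of -f and -t in action; to_utf8 is a hypothetical wrapper and the input bytes are written as octal escapes so that the example does not depend on the encoding of your terminal:

```shell
# to_utf8 FROM_ENCODING: convert stdin from FROM_ENCODING to UTF-8
to_utf8() {
    iconv -f "$1" -t UTF-8
}

# The four bytes C0 F0 E8 FF (octal 300 360 350 377) spell "Ария" in cp1251.
printf '\300\360\350\377\n' | to_utf8 CP1251
```

On a UTF-8 terminal this prints Ария.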


Getting started



Often the id3 tags of mp3 files are encoded in some strange encoding.
Let's take a Russian mp3 file "01-Posledny_Zakat.mp3", whose id3 tag I know is encoded in cp1251 (as mentioned above, a Cyrillic encoding).

"But what if I don't know what encoding it is?" you might ask.
Be patient, in the next chapter we will tackle that.

id3 -Rl 01-Posledny_Zakat.mp3
Filename: 01-Posledny_Zakat.mp3
Title: ��������� �����
Artist: ����
Album: ����������
Year: 2006
Genre: Heavy Metal (137)

Converting it to unicode:

id3 -R -l 01-Posledny_Zakat.mp3 | grep -i "title\|artist\|album" | iconv -f cp1251 -t utf-8 | cut -d ' ' -f 2- | (read title && read artist && read album && id3 -t "$title" -a "$artist" -A "$album" 01-Posledny_Zakat.mp3)

(If you have Cyrillic fonts installed you should now be able to see some cool characters)

id3 -Rl 01-Posledny_Zakat.mp3
Filename: 01-Posledny_Zakat.mp3
Title: Последний закат
Artist: Ария
Album: Армагеддон
Year: 2006
Genre: Heavy Metal (137)


Command for batch converting all mp3-files within a directory:

id3 -R -l *.mp3 | grep -i "filename\|title\|artist\|album" | iconv -f cp1251 -t utf-8 | cut -d ' ' -f 2- | (while read filename; do read title && read artist && read album && id3 -t "$title" -a "$artist" -A "$album" "$filename"; done)



Determine unknown encoding



A) Requires you to be able to recognize correctly decoded data (i.e. you should be able to read the target language)


If you don't know which encoding your data is in, you can try a brute-force method, testing every (reasonable) encoding.
Let's say we want to try all encodings starting with CP\d (where \d is a digit). These include, by the way, the common Cyrillic code pages (cp866, cp1251), so we've assumed our data is encoded in some Cyrillic encoding.


iconv -l | grep -i "^cp[0-9]" | sed -e 's|//||' | while read i; do in=$(find ./ -maxdepth 1 -type d -ctime -1); str=$(echo "$in" | iconv -f "$i" -t utf-8 2>/dev/null); if [ "$?" -ne "0" ];then continue; fi; echo "$str"; echo "Encoding: '$i'"; sleep 1; done

B) Requires you to know exactly which character should be displayed in place of the wrongly decoded one


You've got an mp3 file displayed as:

Medina Azahara - Caravana Espa�ola - 8 - Caravana Espa�ola.mp3

You know that it should be:

Medina Azahara - Caravana Española - 8 - Caravana Española.mp3

Let's dump the binary data: ls -l *mp3 | hexdump -C
-C ... tells hexdump to output the hex data and the ASCII representation side by side

Excerpt of output looks like this:
...
000002d0 6f 6d 29 20 2d 20 43 61 72 61 76 61 6e 61 20 45 |om) - Caravana E|
000002e0 73 70 61 a4 6f 6c 61 20 2d 20 38 20 2d 20 43 61 |spa.ola - 8 - Ca|
...

So we see that 0xa4 (hexadecimal) should be decoded as ñ.
(The assumption is of course that it is an 8-bit encoding.)

Now we can use google with the search string "0xa4 ñ".
With a little luck you'll find a page that names the corresponding encoding.
(In this case it was cp850 "DOS latin1")
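
Instead of a web search you can also ask iconv itself which of a few candidate encodings decodes the byte to the expected character. This is an addition to the method described here, not part of it; matching_encodings is a hypothetical helper and the candidate list is something you have to supply:

```shell
# matching_encodings BYTE WANTED ENCODING...
# BYTE is a printf escape such as '\244' (octal for 0xa4), WANTED the
# expected character in UTF-8. Prints every ENCODING that maps BYTE to it.
matching_encodings() {
    byte=$1 want=$2
    shift 2
    for enc in "$@"; do
        # encodings unknown to iconv or unable to decode BYTE are skipped
        got=$(printf "$byte" | iconv -f "$enc" -t UTF-8 2>/dev/null) || continue
        if [ "$got" = "$want" ]; then
            printf '%s\n' "$enc"
        fi
    done
    return 0
}
```

With `matching_encodings '\244' 'ñ' ISO-8859-1 CP850` only CP850 is printed, since 0xa4 is the currency sign in iso8859-1.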

Comparison to Method A:
Advantage: It is faster. Method A can take quite a lot of time if you don't know which encodings to
restrict your search to.
Disadvantage: No guarantee of success.

Library Quirks (on debian)

Installing libraries on exotic paths

Because Debian is very conservative regarding its package management, new packages are likely not to be found in the repository.
Chances are also low that you'll find precompiled deb packages on the web.

So you have to compile the packages on your own.

The following are some issues regarding the configure step when compiling from source.

1) Required libraries aren't found even though you've got them installed (using aptitude, apt-get, ...)

Solution: The configure script looks for the header files of the libraries (usually located in /usr/include, /usr/local/include). If you install a library with aptitude, apt-get, ... you only install the shared object files *.so (usually located in /usr/lib, /usr/local/lib), because only these files are needed by other programs at run time (these shared objects are the counterpart of *.dll files under Windows and contain the actual executable code).
Header files, by contrast, are only needed for compiling programs which use those libraries, because they contain the structure declarations, data types and function prototypes of the library.
In order to get them for your already installed package, you have to install the package ending in -dev (which stands for developer files).
Example:
Say you've got the package 'libglib2.0-0' installed and configure yields the error 'glib2.0 ... not found'.
All you have to do is install 'libglib2.0-dev': 'sudo apt-get install libglib2.0-dev'

2) Required libraries aren't found even though you've installed them and their header files

The problem is probably that you've installed the libraries into a non-standard location (specifically not /usr or /usr/local). This is done by appending the argument --prefix= to configure.

Say you want to install your library into /usr/local/exotic; all you've got to do is run ./configure --prefix=/usr/local/exotic (and of course make && make install).
This is useful if you don't want to mess up your system with experimental versions of libraries, because the dynamic linker (ld.so) chooses the highest sub-version of your library available on your system (in theory the API and the behaviour of a library mustn't change between such versions; reality is a bit different).

Back to the problem: look at the lines above the line 'Checking for ... not found'. If there is a line 'Checking for pkg-config ...', then you've just found the problem.

Here is an excerpt from 'man pkg-config':

" The pkg-config program is used to retrieve information about installed
libraries in the system. It is typically used to compile and link
against one or more libraries. Here is a typical usage scenario in a
Makefile:

program: program.c
cc program.c `pkg-config --cflags --libs gnomeui`

pkg-config retrieves information about packages from special metadata
files. These files are named after the package, with the extension .pc.
By default, pkg-config looks in the directory prefix/lib/pkgconfig for
these files; it will also look in the colon-separated (on Windows,
semicolon-separated) list of directories specified by the
PKG_CONFIG_PATH environment variable."

Eureka! So the message ' ... not found' actually just means that pkg-config didn't find a corresponding *.pc file for the library. That's because pkg-config by default searches for its *.pc files in the /usr/lib/pkgconfig and /usr/local/lib/pkgconfig dirs. But our library has its *.pc files in the /usr/local/exotic/lib/pkgconfig dir, so, as stated in the man page,
all we have to do is tell pkg-config where to look for the *.pc files using the PKG_CONFIG_PATH variable.

So all you have to do is (in bash):

$ export PKG_CONFIG_PATH=/usr/local/exotic/lib/pkgconfig
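
Note that a plain export overwrites any value the variable already had. If you have more than one exotic prefix, a small helper that prepends instead may be handy; add_pkg_config_dir is a hypothetical name used only for this sketch:

```shell
# add_pkg_config_dir DIR: put DIR in front of PKG_CONFIG_PATH.
# ${VAR:+:$VAR} expands to ":old value" only if the variable was non-empty,
# which avoids a stray trailing colon.
add_pkg_config_dir() {
    PKG_CONFIG_PATH=$1${PKG_CONFIG_PATH:+:$PKG_CONFIG_PATH}
    export PKG_CONFIG_PATH
}
```

After `add_pkg_config_dir /usr/local/exotic/lib/pkgconfig` the exotic directory is searched first and any directories set earlier are kept.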

Even with PKG_CONFIG_PATH set, configure may still fail with a message like this:

checking for GLIB - version >= 2.17.6...
*** 'pkg-config --modversion glib-2.0' returned 2.18.0, but GLIB (2.12.4)
*** was found! If pkg-config was correct, then it is best
*** to remove the old version of GLib. You may also be able to fix the error
*** by modifying your LD_LIBRARY_PATH enviroment variable, or by editing
*** /etc/ld.so.conf. Make sure you have run ldconfig if that is
*** required on your system.
*** If pkg-config was wrong, set the environment variable PKG_CONFIG_PATH
*** to point to the correct configuration files
no
configure: error:
*** GLIB 2.17.6 or better is required. The latest version of
*** GLIB is always available from ftp://ftp.gtk.org/pub/gtk/.

This means the header files were found correctly, but the library itself wasn't (here the old one in the standard location /usr/lib was picked up). The remedy is to state explicitly where to search for the libraries through the LD_RUN_PATH variable:

export LD_RUN_PATH=/usr/local/exotic/lib

Tuesday, 1 July 2008

Website extraction tools

I'm evaluating some website extraction tools with commandline support in order to use them with scripts:

DEiXTo, from the Computer Science Department of the Aristotle University of Thessaloniki, is a GPL-based yet very powerful web data extractor.

It consists of two parts. The first is the GUI-based, Windows-only (quite a drawback) query generator, which produces an XML file called a wrapper project file (*.wpf) that describes what should be matched.
The GUI has a built-in web browser for selecting the visible elements of interest. Furthermore it supports regex, neighborhood and a lot more...
It's still beta and has some teething troubles. In some cases I suddenly have two "virtual roots", one of which I can't remove.

The second is the command-line data extractor, which is fed with the WPF file generated by the GUI. The extractor is under the GPL, written in Perl, available for Windows and Linux and runs without installation. Supported output formats are: tab-delimited, XML, RSS, CSV, Excel.
Under the hood it's mostly based on the XML::LibXML, WWW::Mechanize and Tree::Fast Perl modules.

Sunday, 29 June 2008

Batch renaming of files using regular expressions

1. Batch renaming of files

Problem: You've got a lot of mp3-files containing a nasty string like "(www.nastysite.com)" and want to remove it from the name:

Code:

find ./ -iname "*.mp3" |
(
while read i; do
m=$(
echo "$i" | sed -e 's/(www.*\.com) //'
);

mv "$i" "$m";
done
)

or copy-paste version:

find ./ -iname "*.mp3" | (while read i; do m=$(echo "$i" | sed -e 's/(www.*\.com) //' ); mv "$i" "$m"; done)


Breakdown

1. find ./ -iname "*.mp3" ... Writes the full paths of the mp3s below the current directory to stdout (needed as starting point)

2. Pipe the "mp3 list" to a little bash script, which loops through every line of input

First the name is read into variable 'i'.

Second we pipe the name (through echo "$i") to sed, which removes the nasty string using a regex pattern
's/(www.*\.com) //' ... replaces the first occurrence of '(www.nastysite.com) ' with nothing.

Then we store the result in variable 'm'.

Last we rename the file by 'mv "$i" "$m"'
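
The same loop can be wrapped in a function that is a bit more defensive. This is a sketch under the assumption that no file name contains a newline; strip_site_tag is a hypothetical name. It uses read -r so that backslashes in names survive, and it skips files whose name would not change:

```shell
# strip_site_tag DIR: remove "(www...com) " from all mp3 names below DIR
strip_site_tag() {
    find "$1" -iname '*.mp3' | while IFS= read -r old; do
        new=$(printf '%s\n' "$old" | sed -e 's/(www.*\.com) //')
        # only call mv when the substitution actually changed the name
        if [ "$old" != "$new" ]; then
            mv -- "$old" "$new"
        fi
    done
}
```

Calling `strip_site_tag ./` corresponds to the one-liner above.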




2. Batch renaming of id3-tags of mp3-files

Problem: You want your mp3s to be indexed by a program utilizing id3 tags. Sadly many mp3s have messed-up id3 tags or no tags at all.
But we want at least the title to be displayed correctly, so we need to set it.
For that we derive it from the filename:
Say our mp3 files are named like this: 'Triana - El Patio - 3 - Abre la puerta niña.mp3'.
It consists of 4 parts:
1. Artist: Triana
2. Album: El Patio
3. Track#: 3
4. Title: Abre la puerta niña
Now the title should be '3 - Abre la puerta niña'

This little script does the trick. It's just a slight variation of the previous one, namely replacing the renaming part (mv "$i" "$m")
with the retagging part (id3 -t "$m" "$i").
For that we utilize the command-line tool id3, which manipulates the id3 tag info of an mp3 file. The option '-t' sets the title, '-a' the artist, '-A' the album, ...

Code:


find ./ -iname "*.mp3" |
(
while read i; do
m=$(
echo "$i" | sed -e 's/.*\([0-9] - .*\)\.mp3/\1/'
);
#Output status
echo "$i: '$m'";
#Retag the title
id3 -t "$m" "$i";
done
)




or copy-paste version:

find ./ -iname "*.mp3" | (while read i; do m=$(echo "$i" | sed -e 's/.*\([0-9] - .*\)\.mp3/\1/'); echo "$i: '$m'"; id3 -t "$m" "$i"; done)


Breakdown

The regex is a bit more tricky, because it uses grouping:

s/.*\([0-9] - .*\)\.mp3/\1/

.* is greedy, which means it consumes as much as it can, as long as the whole expression still matches.

\(...\) doesn't match anything by itself; it's just a marker (group), so that we can reference the matched content (everything within the brackets) by \1 later.

[0-9] ... this is a character class which matches a single digit from 0 through 9
\. matches a single dot (because . by default matches any single character)
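
One consequence of the greedy .* combined with the single [0-9] is that a two-digit track number loses its first digit. The sketch below shows this together with one possible variant; title_single and title_multi are hypothetical names, and the variant assumes that the track number is always preceded by " - " (i.e. that artist and album are part of the name):

```shell
# title_single: the regex from above; the group starts at ONE digit
title_single() {
    printf '%s\n' "$1" | sed -e 's/.*\([0-9] - .*\)\.mp3/\1/'
}

# title_multi: variant that requires the " - " separator before the number,
# so the greedy .* can no longer swallow leading digits
title_multi() {
    printf '%s\n' "$1" | sed -e 's/.* - \([0-9][0-9]* - .*\)\.mp3/\1/'
}

title_single './Artist - Album - 12 - Song.mp3'   # prints 2 - Song
title_multi  './Artist - Album - 12 - Song.mp3'   # prints 12 - Song
```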