Saturday, July 4, 2009

Shell script pretty printer

If your favorite syntax highlighter doesn't support HTML output and/or your language, chances are Vim does.

Vim is a highly customizable text editor whose functionality is arguably comparable only to Emacs (which I have never tried, though!).

If your script is highlighted correctly in Vim, you can easily export it to an HTML page for use in your blog with the script Colorize.

By default this script targets Mac OS X and highlights only shell scripts (bash), but it is easily adapted to any language Vim supports.

It uses the 2html.vim plugin, which features:
  • XHTML/HTML 4 support
  • CSS styles for highlighting
  • Line numbering
  • Folding

The script makes use of the "-s-ex" option for vim (for more info open vim and type ":help -s-ex").
If you call 2html.vim interactively and the colors of the output html file don't fit, try setting
":set t_Co=256" before exporting. This tells vim that the terminal supports 256 colors.

Here's an example of an interactive Vim session, with options set for CSS styles, XHTML, line numbering, unfolding, tab expansion and the unobtrusive colorscheme "peachpuff" (also used on this site). To generate the HTML code, simply type ":run! syntax/2html.vim" and then save the new window.

vim -e  \
-S <(
   echo '
      set nocompatible
      set t_Co=256
      let use_xhtml=1
      let html_number_lines=1
      let html_use_css=1
      let html_ignore_folding=1
      set expandtab
      set tabstop=4
      set background=light
      syntax on
      colorscheme peachpuff
      vi
   '
) <your_script_file>

Note: As of Vim 7.0, enabling line numbering causes an error for LaTeX files, which results in incomplete formatting.

Friday, July 3, 2009

Deep file encoding converter

So let's say we copied the directory "Miguel Bosé", containing some MP3 files, from a FAT32 partition to a Linux one (UTF-8).

In the terminal it will be displayed like this:
Miguel Bos�
Miguel Bos�/CD 1
Miguel Bos�/CD 1/14 - Si... Piensa En M�.mp3
Miguel Bos�/CD 1/13 - Never Gonna Fall In Love Again.mp3
Miguel Bos�/CD 1/08 - Te Amar�.mp3
Miguel Bos�/CD 1/09 - M�rchate Ya.mp3
...
Miguel Bos�/CD 2
Miguel Bos�/CD 2/19 - Te Dir�.mp3
Miguel Bos�/CD 2/23 - Voy A Ganar.mp3
Miguel Bos�/CD 2/16 - Sevilla.mp3
Miguel Bos�/CD 2/25 - Se�or Padre.mp3
....
The problem is, as usual, the encoding. Under Windows the names were ISO-8859-1 encoded, but under Linux the encoding is UTF-8, so we have to convert the file and directory names.

Our first attempt would probably be a simple script like the following:
find ./Migu* | # Use the * wildcard, since the broken character is hard to type
while read f; do
    fc=$(echo -n "$f" | iconv -f iso8859-1) # Convert the file name
    mv "$f" "$fc"
done

This script will fail on the first nested entry with a "file not found" error. Why?
Because it first does: mv "Miguel Bos�" "Miguel Bosé"
Then it does: mv "Miguel Bos�/CD 1" "Miguel Bosé/CD 1", which must fail because we have just renamed the parent directory. The command should actually be: mv "Miguel Bosé/CD 1" "Miguel Bosé/CD 1" (which, by the way, is a no-op and should be filtered out, but I'll come to that in a minute...).

Obviously the problem in the foregoing example was that the renaming of the parent directory wasn't propagated to its children (inner files/directories). That's because we piped the output of the find command, which lists the paths under their old names, into the loop.
One remedy would be to re-execute find every time we rename a directory. We would then have to keep track of our current position in the directory hierarchy, otherwise we would end up in an endless loop. Not very desirable. If this sounds complicated to you, you're right; there's actually a much simpler solution.

Some handy tools:
dirname ... returns the directory part of the given file path
e.g.:
$ dirname "Miguel Bosé/CD 1"
Miguel Bosé

basename ... returns the last component of the given file path
e.g.:
$ basename "Miguel Bosé/CD 1"
CD 1
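
The two tools split a path into complementary halves, so joining their results with a slash gives back the original path. A minimal sketch with an invented sample path and variable names:

```shell
# Split a path into its directory part and its last component, then rejoin.
path="Miguel Bosé/CD 1/16 - Sevilla.mp3"
dir=$(dirname "$path")     # everything before the last slash
base=$(basename "$path")   # the last path component
rejoined="$dir/$base"

echo "$dir"
echo "$base"
echo "$rejoined"
```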

hexdump ... returns a formatted hexadecimal representation of the file contents.
A poorly documented feature is the -e option, which lets you define the format of the output: hexdump -e ' [iterations]/[byte_count] "[format string]" '

Let's say we want to convert a string into its hex code using the format \x??
e.g. "Hello" to "\x48\x65\x6c\x6c\x6f" (one hex escape per byte)
The hexdump command would be:
hexdump -v -e '1/1 "\\\x"' -e '1/1 "%02x"'
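
If hexdump is not at hand, the same \x?? encoding can be sketched with od, tr and sed instead. This is a swapped-in alternative, not the command used in this post, and the function name to_hex is made up for the example:

```shell
# to_hex (hypothetical helper): print every byte read from stdin as \xNN.
to_hex() {
    # od -An  : no offset column
    # od -v   : do not collapse repeated lines
    # od -tx1 : one two-digit hex value per byte
    # tr removes the separators, sed prefixes each digit pair with \x
    od -An -v -tx1 | tr -d ' \n' | sed 's/../\\x&/g'
}

printf 'Hello' | to_hex   # prints \x48\x65\x6c\x6c\x6f
echo
```

Using printf rather than echo for the input avoids encoding a trailing newline by accident.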


So all we have to do is rename the directories in topological order (parents before children). Thankfully, find lists them in exactly that order.

So our next attempt would be something like this:
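
The ordering is easy to check on a throwaway tree. The sketch below builds a small hierarchy in a temporary directory (all names are invented) and lists it twice: once in find's default order and once with -depth, which reverses the parent/child order:

```shell
# Build a scratch tree and look at the order in which find lists it.
tmp=$(mktemp -d)
mkdir -p "$tmp/artist/CD 1" "$tmp/artist/CD 2"
touch "$tmp/artist/CD 1/01 - a.mp3" "$tmp/artist/CD 2/02 - b.mp3"

# Default traversal: a directory is printed before its entries.
listing=$(cd "$tmp" && find artist)
echo "$listing"

# With -depth, the entries are printed before their directory.
listing_depth=$(cd "$tmp" && find artist -depth)
echo "$listing_depth"

rm -rf "$tmp"
```

The order among siblings is not guaranteed, only the relation between a directory and its contents.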

data=$(find Miguel*) # Fetch the list of file paths to be processed

while [ "$data" != "" ]; do
    f=$(echo -n "$data" | head -n 1 | tr -d '\n')   # Extract the first file path from the list
    dir=$(dirname "$f" | tr -d '\n')                # Extract the dirname
    dirc=$(echo -n "$dir" | iconv -f "iso8859-1")   # dirc: converted dirname
    to=$(echo -n "$f" | iconv -f "iso8859-1")       # Fully converted file path
    from=$(echo -n "$f" | sed -e "s|$dir|$dirc|")   # File path with only the dirname part converted
    mv "$from" "$to"
    data=$(echo -n "$data" | sed -e '1 d')          # Remove the processed file path from the list
done

Unfortunately it will also fail on the second rename, as before, but for a different reason ;)
The culprit is this limited/buggy sed command:
s/Miguel Bos�/Miguel Bosé/ somehow doesn't match the unrecognized "�" (0xe9, which is é in ISO-8859-1) in "Miguel Bos�".

So the next attempt was to convert "Miguel Bos�" into its hex representation to bypass that check. Fiddlesticks!
Sed (at least the version I used) is even "intelligent" enough to interpret the decoded characters as regex metacharacters.
So if you write a dot "." in hex as "\x2e" and run "sed -e 's/\x2e/A/'", the pattern matches any character, not just the dot, and replaces it with "A"!

Fortunately perl puts things right: perl -pe "s|||" can match raw bytes when the pattern is given in its hex representation:
$ echo "This is . hack" | perl -pe "s|\x2e|a|"
This is a hack

The revised script is:

data=$(find Miguel*) # Fetch the list of file paths to be processed
enc="iso8859-1"

while [ "$data" != "" ]; do
    # Extract the first file path from the list
    f=$(echo -n "$data" | head -n 1 | tr -d '\n')
    # Extract the dirname and convert it to its hex representation
    dir=$(dirname "$f" | tr -d '\n' | hexdump -v -e '1/1 "\\\x"' -e '1/1 "%02x"')
    # Extract the dirname and convert it to UTF-8
    dirc=$(dirname "$f" | tr -d '\n' | iconv -f "$enc")
    # Fully converted file path
    to=$(echo -n "$f" | iconv -f "$enc")
    # File path with only the dirname part converted
    from=$(echo -n "$f" | perl -pe "s|$dir|$dirc|")
    if [ "$from" != "$to" ]; then # Only rename if the file path has changed
        mv "$from" "$to"
    fi
    # Remove the processed file path from the list
    data=$(echo -n "$data" | sed -e '1 d')
done
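
For comparison, a different approach (not the one developed above) avoids rewriting paths altogether: walk the tree bottom-up with find -depth and rename only the last component of each path. Children are handled before their parents, so the directory part of every path is still valid when its turn comes. A sketch under the assumption that no file name contains a newline; the function name convert_names is invented here:

```shell
# convert_names ROOT [ENCODING] (hypothetical helper):
# rename ROOT and everything below it from ENCODING to UTF-8.
# The body runs in a subshell, so its variables do not leak out.
convert_names() (
    root=$1
    enc=${2:-ISO-8859-1}
    # -depth lists the contents of a directory before the directory itself
    find "$root" -depth | while IFS= read -r f; do
        dir=$(dirname "$f")
        base=$(basename "$f")
        basec=$(printf '%s' "$base" | iconv -f "$enc" -t UTF-8) || continue
        # Skip no-ops and never overwrite an existing entry
        if [ "$base" != "$basec" ] && [ ! -e "$dir/$basec" ]; then
            mv "$f" "$dir/$basec"
        fi
    done
)
```

It would be called from the parent directory, for example as convert_names Miguel*. Like the script above, it assumes every name really is in the old encoding: ISO-8859-1 decoding never fails, so running it on names that are already UTF-8 would encode them twice.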