Friday, 3 July 2009

Deep file encoding converter

Let's say we copied a directory "Miguel Bosé" containing some MP3 files from a FAT32 partition to a Linux one (UTF-8).

In the terminal it shows up like this:
Miguel Bos�
Miguel Bos�/CD 1
Miguel Bos�/CD 1/14 - Si... Piensa En M�.mp3
Miguel Bos�/CD 1/13 - Never Gonna Fall In Love Again.mp3
Miguel Bos�/CD 1/08 - Te Amar�.mp3
Miguel Bos�/CD 1/09 - M�rchate Ya.mp3
...
Miguel Bos�/CD 2
Miguel Bos�/CD 2/19 - Te Dir�.mp3
Miguel Bos�/CD 2/23 - Voy A Ganar.mp3
Miguel Bos�/CD 2/16 - Sevilla.mp3
Miguel Bos�/CD 2/25 - Se�or Padre.mp3
....
The problem is, as usual, the encoding. Under Windows the names were ISO-8859-1 encoded, but Linux expects UTF-8, so we have to convert the file and directory names.
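
To make the mismatch concrete, here is a small sketch that prints the raw bytes of the same name in both encodings. The helper name show_bytes is made up for this illustration, and the example assumes iconv knows the encoding under the name iso8859-1, as it does in the scripts in this post.

```shell
# Print the bytes of a string as space-separated hex pairs.
# (show_bytes is a hypothetical helper, not a standard tool.)
show_bytes() {
    printf '%s' "$1" | od -An -v -tx1 | tr -s ' \n' ' ' | sed 's/^ //; s/ $//'
}

# "Bosé" as ISO-8859-1: the é is the single byte 0xe9 (octal 351).
latin1_name=$(printf 'Bos\351')
# The same name converted to UTF-8: the é becomes the two bytes 0xc3 0xa9.
utf8_name=$(printf '%s' "$latin1_name" | iconv -f iso8859-1 -t utf-8)

show_bytes "$latin1_name"; echo   # 42 6f 73 e9
show_bytes "$utf8_name"; echo     # 42 6f 73 c3 a9
```

The lone 0xe9 byte is not a valid UTF-8 sequence, which is why the terminal falls back to "�".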

Our first attempt would probably be a simple script like the following:
find ./Migu* |   # we use the * wildcard because we can't easily type the character
while read f; do
    fc=$(echo -n "$f" | iconv -f iso8859-1)   # convert the filename
    mv "$f" "$fc"
done

This script fails on the first nested entry with a "file not found" error. Why?
Because it first runs: mv "Miguel Bos�" "Miguel Bosé"
Then it runs: mv "Miguel Bos�/CD 1" "Miguel Bosé/CD 1", which must fail because we have just renamed the parent directory. The command should actually be: mv "Miguel Bosé/CD 1" "Miguel Bosé/CD 1" (which, by the way, is a no-op and should be filtered out, but I'll come to that in a minute...)
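
The failure has nothing to do with the encoding itself and can be reproduced with plain ASCII names. The following is only a sketch in a throwaway directory; the names old and new are invented for the illustration.

```shell
demo=$(mktemp -d)
mkdir -p "$demo/old/CD 1"

# The first rename succeeds: the parent directory gets its new name.
mv "$demo/old" "$demo/new"

# The second rename still uses the path as it was listed before the
# first rename, so the source no longer exists.
if mv "$demo/old/CD 1" "$demo/new/CD 1" 2>/dev/null; then
    result="renamed"
else
    result="failed: stale path"
fi
echo "$result"   # failed: stale path

rm -rf "$demo"
```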

Obviously the problem in the previous example is that renaming the parent directory is not propagated to its children (the files and directories inside it). That's because we piped the output of the find command, whose paths still contain the old names, into the loop.
One remedy would be to re-run find every time we rename a directory. We would then have to keep track of our current position in the directory hierarchy, otherwise we would end up in an endless loop. Not very desirable. If this sounds complicated to you, you're right; there is actually a much simpler solution.

Some handy tools:
dirname ... returns the directory part of the given filepath
e.g.:
$ dirname "Miguel Bosé/CD 1"
Miguel Bosé

basename ... returns the last component of the given filepath
e.g.:
$ basename "Miguel Bosé/CD 1"
CD 1
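
A short sketch of how the two tools split a path and how the pieces fit back together. The variable names are just for illustration. Note the top-level case, where dirname returns "." because the path contains no slash.

```shell
path='Miguel Bosé/CD 1/08 - Te Amaré.mp3'

dir=$(dirname "$path")         # Miguel Bosé/CD 1
base=$(basename "$path")       # 08 - Te Amaré.mp3
top=$(dirname 'Miguel Bosé')   # . (no slash in the path)

# Joining both parts with a slash gives the original path again.
echo "$dir/$base"
```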

hexdump ... returns a formatted hexadecimal representation of the file contents.
A poorly documented feature is the -e option, which lets you define the format of the output: hexdump -e ' [iterations]/[byte_count] "[format string]" '

Let's say we want to convert a string into its hex code using the format \x??
e.g. "Hello" to "\x48\x65\x6c\x6c\x6f" (one hex escape per byte)
The hexdump command would be:
hexdump -v -e '1/1 "\\\x"' -e '1/1 "%01x"'
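
On systems where hexdump is not installed, the same \x?? string can be built with od, tr and sed. This is a swapped-in alternative, not the command used in the scripts of this post, and to_hex_escapes is a name made up for the sketch.

```shell
# Turn every byte of the argument into a \xNN escape.
# (to_hex_escapes is a hypothetical helper.)
to_hex_escapes() {
    printf '%s' "$1" |
        od -An -v -tx1 |       # one two-digit hex pair per byte
        tr -d ' \n' |          # drop od's spacing and line breaks
        sed 's/../\\x&/g'      # put \x in front of every pair
}

to_hex_escapes 'Hello'; echo   # \x48\x65\x6c\x6c\x6f
to_hex_escapes '.'; echo       # \x2e
```

Unlike a "%01x" format, od always prints two digits per byte, so every escape has the same width.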


So all we have to do is rename the directories in topological order (parents before children). Thankfully, find lists them in exactly that order.
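
The ordering is easy to check in a throwaway directory. This is only a sketch with invented names; each directory has a single entry so that the listing is unambiguous.

```shell
demo=$(mktemp -d)
mkdir -p "$demo/artist/CD 1"
touch "$demo/artist/CD 1/01 - track.mp3"

# Without further options, find prints a directory before its contents.
listing=$(cd "$demo" && find artist)
printf '%s\n' "$listing"
# artist
# artist/CD 1
# artist/CD 1/01 - track.mp3

rm -rf "$demo"
```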

So our next attempt would be something like this:

data=$(find Miguel*)   # fetch the list of filepaths to be processed

while [ "$data" != "" ]; do
    f=$(echo -n "$data" | head -n 1 | tr -d '\n')    # extract the first filepath from the list
    dir=$(dirname "$f" | tr -d '\n')                 # extract the dirname
    dirc=$(echo -n "$dir" | iconv -f "iso8859-1")    # dirc ... converted dirname
    to=$(echo -n "$f" | iconv -f "iso8859-1")        # fully converted filepath
    from=$(echo -n "$f" | sed -e "s|$dir|$dirc|")    # filepath with only the dirname converted
    mv "$from" "$to"
    data=$(echo -n "$data" | sed -e '1 d')           # remove the processed filepath from the list
done

Unfortunately it also crashes on the second rename, as before, but for a different reason ;)
The culprit is this sed command:
s|Miguel Bos�|Miguel Bosé| fails to match the unrecognized "�" (0xe9, which is é in ISO-8859-1) in "Miguel Bos�", presumably because that byte is not a valid character in a UTF-8 locale.

So the next attempt was to convert "Miguel Bos�" into its hex representation to bypass that check. Fiddlesticks!
sed is "intelligent" enough to interpret the regex meaning of the decoded characters.
So if you write a dot "." as hex "\x2e" and run "sed -e 's/\x2e/A/'", the \x2e still acts as the "any character" wildcard (at least with the sed used here), and the first character of the input is replaced with "A", whatever it is!

Fortunately perl puts things right: perl -pe "s|||" can match raw bytes when the pattern is given in its hex representation:
$ echo "This is . hack" | perl -pe "s|\x2e|a|"
This is a hack

The revised script is:

data=$(find Miguel*)   # fetch the list of filepaths to be processed
enc="iso8859-1"

while [ "$data" != "" ]; do
    # extract the first filepath from the list
    f=$(echo -n "$data" | head -n 1 | tr -d '\n')
    # extract the dirname and convert it to its hex representation
    dir=$(dirname "$f" | tr -d '\n' | hexdump -v -e '1/1 "\\\x"' -e '1/1 "%01x"')
    # extract the dirname and convert it to utf-8
    dirc=$(dirname "$f" | tr -d '\n' | iconv -f "$enc")
    # fully converted filepath
    to=$(echo -n "$f" | iconv -f "$enc")
    # "from" is the filepath with its dirname part already converted
    from=$(echo -n "$f" | perl -pe "s|$dir|$dirc|")
    if [ "$from" != "$to" ]; then   # only rename if the filepath has changed
        mv "$from" "$to"
    fi
    # remove the currently processed filepath from the list
    data=$(echo -n "$data" | sed -e '1 d')
done
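
For comparison, here is a sketch of a different approach that avoids the path rewriting altogether: let find list children before their parents (-depth) and rename only the last path component. This is not the script developed above, convert_names is a hypothetical function name, and the sketch assumes that every name under the given root really is in the source encoding and contains no newline.

```shell
# Convert all names below (and including) a root directory to UTF-8.
# $1 = root directory, $2 = source encoding
# (convert_names is a hypothetical helper.)
convert_names() {
    find "$1" -depth | while IFS= read -r path; do
        dir=$(dirname "$path")
        base=$(basename "$path")
        newbase=$(printf '%s' "$base" | iconv -f "$2" -t utf-8)
        # Children come first, so $dir still carries its old name here.
        if [ "$base" != "$newbase" ]; then
            mv -- "$path" "$dir/$newbase"
        fi
    done
}

# Example call (assumes the directory from this post):
# convert_names ./Miguel* iso8859-1
```

Since only the basename changes in each step, a listed path never goes stale, and names that are already plain ASCII are skipped by the comparison.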