Technical stuff mostly Linux related: iconv

iconv is a Linux utility that converts text from one encoding to another. On Linux, text is normally stored as Unicode (UTF-8), because Unicode covers practically all character sets. For text to be displayed correctly in most programs, it therefore has to be converted from its source encoding (e.g. cp1251 or cp866, the most popular Cyrillic encodings) to UTF-8, unless it already is UTF-8. This is where iconv comes in handy.

Invocation options:
-l ... lists all available encodings (including aliases such as latin1, cyrillic, ...)
-f ... the encoding to convert from
-t ... the encoding to convert to (default: the encoding of the current locale, which is UTF-8 on most current Linux systems)
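
As a quick illustration of -f and -t, here is a minimal sketch that needs no input file. It assumes a shell whose printf understands octal escapes and an iconv that knows the name cp1251; the variable name converted is only for illustration.

```shell
# "Ария" in cp1251 is the four bytes c0 f0 e8 ff, written here as octal escapes.
converted=$(printf '\300\360\350\377' | iconv -f cp1251 -t utf-8)
echo "$converted"

# Converting back yields the original four bytes again (shown in hex by od).
printf '%s' "$converted" | iconv -f utf-8 -t cp1251 | od -An -tx1
```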

Getting started

Often the id3 tags of mp3 files are stored in some legacy encoding.
Let's take a Russian mp3 file, "01-Posledny_Zakat.mp3", whose id3 tag I know is encoded in cp1251 (one of the Cyrillic encodings mentioned above).

"But what if I don't know what encoding it is?" you might ask.
Be patient, we will tackle that in the next section.

id3 -Rl 01-Posledny_Zakat.mp3
Filename: 01-Posledny_Zakat.mp3
Title: ��
Artist: ��
Album: ��
Year: 2006
Genre: Heavy Metal (137)

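Converting it to Unicode (the pipeline only prints the id3 command that would write the converted tags; run that command with the file name appended to apply it):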

id3 -R -l 01-Posledny_Zakat.mp3 | iconv -f cp1251 -t utf-8 | grep -i "title\|artist\|album" | cut -d ' ' -f 2- | (read title && read artist && read album && echo "id3 -t '$title' -a '$artist' -A '$album'")

(If you have Cyrillic fonts installed, you should now be able to see some cool characters.)

id3 -Rl 01-Posledny_Zakat.mp3
Filename: 01-Posledny_Zakat.mp3
Title: Последний закат
Artist: Ария
Album: Армагеддон
Year: 2006
Genre: Heavy Metal (137)
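
To see what each stage of the conversion pipeline does without touching a real file, the listing can be simulated. The sketch below is only an illustration: fake_listing is a hypothetical stand-in for `id3 -R -l` that prints the same tags as cp1251 bytes. iconv runs before grep here as a precaution, because some grep versions treat bytes they cannot decode as binary data.

```shell
# Stand-in for `id3 -R -l`: prints a listing whose values are cp1251 bytes.
# (Hypothetical helper; the bytes spell the tags shown above.)
fake_listing() {
  printf 'Filename: 01-Posledny_Zakat.mp3\n'
  printf 'Title: \317\356\361\353\345\344\355\350\351 \347\340\352\340\362\n'
  printf 'Artist: \300\360\350\377\n'
  printf 'Album: \300\360\354\340\343\345\344\344\356\355\n'
  printf 'Year: 2006\n'
}

# Decode first, keep the three tag lines, drop the "Key:" prefix,
# then read the values in the order Title, Artist, Album.
result=$(fake_listing | iconv -f cp1251 -t utf-8 |
  grep -i -E '^(title|artist|album):' | cut -d ' ' -f 2- |
  { read -r title && read -r artist && read -r album &&
    printf '%s / %s / %s\n' "$title" "$artist" "$album"; })
echo "$result"
```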

Command for batch converting all mp3-files within a directory:

id3 -R -l *.mp3 | iconv -f cp1251 -t utf-8 | grep -i "filename\|title\|artist\|album" | cut -d ' ' -f 2- | (while read filename; do read title && read artist && read album && id3 -t "$title" -a "$artist" -A "$album" "$filename"; done)
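
The batch command relies on every file printing its Filename, Title, Artist and Album lines in exactly that order; if one tag is missing, the following reads shift. A per-file sketch that picks each field by name avoids that. It assumes the same id3 tool and the "Key: value" listing layout shown earlier; convert_tags is a hypothetical helper name, and a tag that is absent is written back as an empty string.

```shell
# Convert the cp1251 tags of one file to UTF-8 (sketch, not a tested tool).
convert_tags() {
  file=$1
  # Decode the whole listing first; give up if iconv cannot decode it.
  listing=$(id3 -R -l "$file" | iconv -f cp1251 -t utf-8) || return 1
  # Pick each field by its name rather than by its position in the listing.
  title=$(printf '%s\n' "$listing" | sed -n 's/^Title: //p')
  artist=$(printf '%s\n' "$listing" | sed -n 's/^Artist: //p')
  album=$(printf '%s\n' "$listing" | sed -n 's/^Album: //p')
  id3 -t "$title" -a "$artist" -A "$album" "$file"
}

# Usage (run it on copies of your files first):
#   for f in *.mp3; do convert_tags "$f"; done
```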

Determine unknown encoding

A) Requires that you can recognize correctly decoded text (i.e. you should be able to read the target language)

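If you don't know which encoding your data is in, you can try a brute-force method and test every (reasonable) encoding.
Let's say we want to try all encodings whose name starts with CP\d (where \d is a digit). These include the common Cyrillic code pages such as cp1251 and cp866, so we are assuming that our data is in some Cyrillic encoding. In the command below, the test input is the names of the directories in the current directory that changed within the last day; substitute your own data.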

iconv -l | grep -i "^cp[0-9]" | sed -e 's|//||' | while read i; do in=$(find ./ -maxdepth 1 -type d -ctime -1); str=$(echo "$in" | iconv -f "$i" -t utf-8 2>/dev/null); if [ "$?" -ne "0" ];then continue; fi; echo "$str"; echo "Encoding: '$i'"; sleep 1; done
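
The same brute-force idea can be tried on known data first. The following is a minimal sketch under stated assumptions: the sample bytes are the cp1251 form of "Ария", the four candidate names are a hand-picked list rather than the full output of iconv -l, and the variable names are illustrative.

```shell
# Sample data: the cp1251 bytes for "Ария" (stand-in for your own file).
tmp=$(mktemp)
printf '\300\360\350\377\n' > "$tmp"

report=""
for enc in cp850 cp866 cp1251 koi8-r; do
  # Skip encodings that iconv does not know or that cannot decode the data.
  if out=$(iconv -f "$enc" -t utf-8 "$tmp" 2>/dev/null); then
    report="${report}${enc}: ${out}
"
  fi
done
rm -f "$tmp"
printf '%s' "$report"
```

Only the cp1251 line reads as a Russian word; the other candidates decode without error but produce gibberish, which is why this method needs someone who can read the result.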

B) Requires you to know exactly which character should be displayed in place of the wrongly decoded one

You've got an mp3 file whose name is displayed as:

Medina Azahara - Caravana Espa�ola - 8 - Caravana Espa�ola.mp3

You know that it should be:

Medina Azahara - Caravana Española - 8 - Caravana Española.mp3

Let's dump the binary data: ls -l *mp3 | hexdump -C
-C ... tells hexdump to print each byte in hexadecimal with its ASCII representation alongside

Excerpt of output looks like this:
...
000002d0  6f 6d 29 20 2d 20 43 61  72 61 76 61 6e 61 20 45  |om) - Caravana E|
000002e0  73 70 61 a4 6f 6c 61 20  2d 20 38 20 2d 20 43 61  |spa.ola - 8 - Ca|
...

So we see that the byte 0xa4 (hexadecimal) should be decoded as ñ.
(This assumes, of course, that it is a single-byte (8-bit) encoding.)

Now we can search Google for "0xa4 ñ".
With a little luck you will find a site that names the matching encoding.
(In this case it was cp850, "DOS Latin 1".)
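
The same question can also be put to iconv directly. This is only a sketch: the candidate list is a hand-picked assumption (it could just as well come from iconv -l), and the variable names are illustrative.

```shell
# Byte 0xa4 is octal 244. Check which candidate encodings decode it to "ñ".
matches=""
for enc in cp437 cp850 cp1252 iso-8859-1 iso-8859-15; do
  # Skip names this iconv does not support.
  decoded=$(printf '\244' | iconv -f "$enc" -t utf-8 2>/dev/null) || continue
  if [ "$decoded" = "ñ" ]; then
    matches="$matches $enc"
  fi
done
echo "Candidates:$matches"
```

More than one encoding can map a byte to the same character (cp437 and cp850 agree on 0xa4), so a second known character may be needed to tell the candidates apart.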

Comparison with method A:
Advantage: Method B is usually quicker, since method A can take quite a lot of time if you don't know which encodings to restrict your search to.
Disadvantage: There is no guarantee of success.


Wednesday, 29 April 2009
