Wednesday, 29 April 2009

iconv

iconv is a command-line utility that converts data from one character encoding to another. On Linux, text is usually stored as Unicode in the UTF-8 encoding, because Unicode covers all character sets. For text to be displayed correctly in most programs it therefore has to be converted, if it isn't already, from its source encoding (e.g. cp1251 or cp866, the most popular Cyrillic encodings) to UTF-8. This is where iconv comes in handy.

Invocation options:
-l ... lists all available encodings (including aliases such as latin1, cyrillic, ...)
-f ... the encoding to convert from
-t ... the encoding to convert to (default: the encoding of the current locale, which is UTF-8 on most modern systems)
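
As a quick illustration of -f and -t, here is a minimal sketch that feeds four cp1251-encoded bytes to iconv and gets UTF-8 back. The function name to_utf8 is made up for this example; the bytes are written as octal escapes so that the snippet does not depend on your terminal's encoding.

```shell
# to_utf8 FROM_ENCODING
# Hypothetical helper (not part of iconv): convert stdin from
# FROM_ENCODING to UTF-8.
to_utf8() {
    iconv -f "$1" -t utf-8
}

# The bytes 0xC0 0xF0 0xE8 0xFF (octal 300 360 350 377) spell a
# four-letter Cyrillic word in cp1251.
printf '\300\360\350\377' | to_utf8 cp1251
echo
```

On a UTF-8 terminal this prints Ария, the artist name used in the example further down.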


Getting started



Often the ID3 tags of mp3 files are stored in some unexpected encoding.
Let's take a Russian mp3 file, "01-Posledny_Zakat.mp3", whose ID3 tag I know to be encoded in cp1251 (one of the Cyrillic encodings mentioned above).

"But what if I don't know what encoding it is?" you might ask.
Be patient, we will tackle that in the next chapter.

id3 -Rl 01-Posledny_Zakat.mp3
Filename: 01-Posledny_Zakat.mp3
Title: ��������� �����
Artist: ����
Album: ����������
Year: 2006
Genre: Heavy Metal (137)

Converting it to unicode:

id3 -R -l 01-Posledny_Zakat.mp3 | grep -i "title\|artist\|album" | iconv -f cp1251 -t utf-8 | cut -d ' ' -f 2- | (read title && read artist && read album && echo "id3 -t '$title' -a '$artist' -A '$album'")

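This pipeline only prints the id3 command with the decoded values, so you can check them first. To actually write the tags, run the printed command with the file name appended.
(If you have Cyrillic fonts installed you should now be able to see some cool characters.)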

id3 -Rl 01-Posledny_Zakat.mp3
Filename: 01-Posledny_Zakat.mp3
Title: Последний закат
Artist: Ария
Album: Армагеддон
Year: 2006
Genre: Heavy Metal (137)


Command for batch converting all mp3-files within a directory:

id3 -R -l *.mp3 | grep -i "filename\|title\|artist\|album" | iconv -f cp1251 -t utf-8 | cut -d ' ' -f 2- | (while read filename; do read title && read artist && read album && id3 -t "$title" -a "$artist" -A "$album" "$filename"; done)



Determine unknown encoding



A) Requires you to be able to recognize correctly decoded data (i.e. you should be able to read the target language)


If you don't know which encoding your data is in, you can try a brute-force method and test every (reasonable) encoding.
Let's say we want to try all encodings whose names start with CP followed by a digit. These include the common Cyrillic code pages such as cp866 and cp1251 (but not, for example, KOI8-R or ISO-8859-5), so we are assuming that our data is in one of those code pages.


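iconv -l | grep -i "^cp[0-9]" | sed -e 's|//||' | while read i; do in=$(find ./ -maxdepth 1 -type d -ctime -1); str=$(echo "$in" | iconv -f "$i" -t utf-8 2>/dev/null); if [ "$?" -ne "0" ];then continue; fi; echo "$str"; echo "Encoding: '$i'"; sleep 1; done

(The sample text here is the names of the directories in the current directory that changed within the last day; substitute whatever data you want to test.)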
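
The one-liner is easier to reuse when written out as a function. The following is a sketch, not the original command: try_encodings is a name invented here, it takes the sample text as its first argument instead of reading directory names, and it prints one line for every encoding that iconv could decode without an error.

```shell
# try_encodings SAMPLE ENCODING...
# Hypothetical helper: decode SAMPLE with every given encoding and print
# "encoding<TAB>result" for each one that iconv accepts.
try_encodings() {
    sample=$1
    shift
    for enc in "$@"; do
        # iconv exits non-zero for unknown encodings or invalid byte sequences
        if decoded=$(printf '%s' "$sample" | iconv -f "$enc" -t utf-8 2>/dev/null); then
            printf '%s\t%s\n' "$enc" "$decoded"
        fi
    done
}

# Example: the four bytes 0xC0 0xF0 0xE8 0xFF, tried against two Cyrillic code pages
try_encodings "$(printf '\300\360\350\377')" cp866 cp1251
```

Only the cp1251 line should read as a real word; picking it out is the part that still needs a human. To try every CP encoding as in the one-liner, the list can be passed in unquoted, e.g. try_encodings "$sample" $(iconv -l | grep -i '^cp[0-9]' | sed -e 's|//||'), assuming your iconv -l prints one name per line when piped.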

B) Requires you to know exactly which character should be displayed in place of the wrongly decoded one


You've got an mp3 file whose name is displayed as:

Medina Azahara - Caravana Espa�ola - 8 - Caravana Espa�ola.mp3

You know that it should be:

Medina Azahara - Caravana Española - 8 - Caravana Española.mp3

Let's dump the raw bytes of the directory listing: ls -l | hexdump -C
-C ... tells hexdump to print the hex values and the corresponding ASCII characters side by side

Excerpt of output looks like this:
...
000002d0 6f 6d 29 20 2d 20 43 61 72 61 76 61 6e 61 20 45 |om) - Caravana E|
000002e0 73 70 61 a4 6f 6c 61 20 2d 20 38 20 2d 20 43 61 |spa.ola - 8 - Ca|
...

So we see that the byte 0xa4 (hexadecimal) should be displayed as ñ.
(This assumes, of course, that it is an 8-bit encoding.)

Now we can search the web for "0xa4 ñ".
With a little luck you'll find a page that names the encoding in which 0xa4 stands for ñ.
(In this case it was cp850, "DOS Latin 1".)
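
Instead of a web search, you can also ask iconv directly which encodings map the byte to the character you expect. This is a minimal sketch, assuming a short hand-picked candidate list; decodes_to is a hypothetical helper name, and the byte is given in octal (0xa4 is octal 244) because that is what printf understands portably.

```shell
# decodes_to ENCODING OCTAL_BYTE EXPECTED_CHAR
# Hypothetical helper: succeed if the single byte OCTAL_BYTE, read as
# ENCODING, converts to EXPECTED_CHAR in UTF-8.
decodes_to() {
    got=$(printf "\\$2" | iconv -f "$1" -t utf-8 2>/dev/null) || return 1
    [ "$got" = "$3" ]
}

# Candidate list chosen by hand for this example; extend as needed
for enc in cp437 cp850 cp1252 iso-8859-1; do
    if decodes_to "$enc" 244 'ñ'; then
        echo "candidate: $enc"
    fi
done
```

On a glibc system this should list cp437 and cp850, which both place ñ at 0xa4, so method B narrows the choice down but may still leave more than one candidate.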

Comparison with method A:
Advantage: method A can take quite a lot of time if you don't know which encodings to restrict your search to.
Disadvantage: there is no guarantee of success.

Library Quirks (on Debian)

Installing libraries on exotic paths

Because Debian is very conservative in its package management, recent versions of packages are often not in the repository.
Chances are also low that you'll find precompiled deb packages on the web.

So you have to compile the packages yourself.

Below are some issues that come up in the configure step when compiling from source.

1) Required libraries aren't found even though you've installed them (using aptitude, apt-get, ...)

Solution: The configure script looks for the header files of the libraries (usually located in /usr/include or /usr/local/include). If you install a library with aptitude, apt-get, ... you only get the shared object files *.so (usually located in /usr/lib or /usr/local/lib), because only these files are needed to run other programs (shared objects are the counterpart of *.dll files on Windows and contain the actual executable code).
Header files, by contrast, are only needed to compile programs that use those libraries, because they contain the library's structure declarations, data types and function prototypes.
To get them for a package you have already installed, install the corresponding package ending in -dev (which stands for development files).
Example:
Say you've got package 'libglib2.0-0' installed and configure yields the error 'glib2.0 ... not found'.
All you have to do is install 'libglib2.0-dev': 'sudo apt-get install libglib2.0-dev'
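
To see for yourself whether the headers are there, you can look for the file that configure is after. The sketch below is a hypothetical helper (find_header is not a Debian tool) that searches the two standard include directories named above, or any directories you pass in. The path glib-2.0/glib.h is an assumption about where the GLib development package puts its main header.

```shell
# find_header HEADER [DIR...]
# Hypothetical helper: print the first DIR/HEADER that exists; fail if
# none does. Without DIR arguments the standard include dirs are searched.
find_header() {
    header=$1
    shift
    [ "$#" -gt 0 ] || set -- /usr/include /usr/local/include
    for dir in "$@"; do
        if [ -f "$dir/$header" ]; then
            printf '%s\n' "$dir/$header"
            return 0
        fi
    done
    return 1
}

# Assumed header location for GLib 2.x; adjust to the library you need
find_header glib-2.0/glib.h || echo "glib.h not found: the -dev package is probably missing"
```

On Debian, dpkg -L libglib2.0-dev lists the files an installed -dev package provides, and dpkg -S followed by a path tells you which installed package a file belongs to.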

2) Required libraries aren't found even though you've installed them and their header-files

The problem is probably that you've installed the libraries into a non-standard location (specifically, not /usr or /usr/local). This happens when you pass the argument --prefix= to configure.

Say you want to install your library into /usr/local/exotic: all you have to do is run ./configure --prefix=/usr/local/exotic (and of course make && make install).
This is useful if you don't want to mess up your system with experimental versions of libraries, because the dynamic linker (ld.so) picks the highest minor version of a library available on your system. (In principle the API and the behaviour of a library mustn't change between such versions; reality is a bit different.)

Back to the problem: look at the lines above the line 'Checking for ... not found'. If one of them reads 'Checking for pkg-config ...', you've just found the problem.

Here is an excerpt from 'man pkg-config':

" The pkg-config program is used to retrieve information about installed
libraries in the system. It is typically used to compile and link
against one or more libraries. Here is a typical usage scenario in a
Makefile:

program: program.c
cc program.c `pkg-config --cflags --libs gnomeui`

pkg-config retrieves information about packages from special metadata
files. These files are named after the package, with the extension .pc.
By default, pkg-config looks in the directory prefix/lib/pkgconfig for
these files; it will also look in the colon-separated (on Windows,
semicolon-separated) list of directories specified by the
PKG_CONFIG_PATH environment variable."

Eureka! So the message '... not found' actually just means that pkg-config didn't find a corresponding *.pc file for the library. That's because by default pkg-config searches for its *.pc files in /usr/lib/pkgconfig and /usr/local/lib/pkgconfig. But our library has its *.pc files in /usr/local/exotic/lib/pkgconfig, so, as stated in the man page, all we have to do is tell pkg-config where to look for the *.pc files using the PKG_CONFIG_PATH variable.

So all you have to do is (in bash):

$ export PKG_CONFIG_PATH=/usr/local/exotic/lib/pkgconfig
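
Note that the export above replaces any value PKG_CONFIG_PATH already had. If you use several non-standard prefixes, you may prefer to prepend instead. A small sketch; prepend_dir is a helper name made up here, and /usr/local/exotic is the example prefix from above.

```shell
# prepend_dir NEW_DIR CURRENT_VALUE
# Hypothetical helper: put NEW_DIR in front of a colon-separated list,
# without leaving a stray ':' when the list is empty.
prepend_dir() {
    if [ -n "$2" ]; then
        printf '%s:%s\n' "$1" "$2"
    else
        printf '%s\n' "$1"
    fi
}

PKG_CONFIG_PATH=$(prepend_dir /usr/local/exotic/lib/pkgconfig "${PKG_CONFIG_PATH:-}")
export PKG_CONFIG_PATH

# Check what pkg-config sees now (needs pkg-config and the library's .pc file):
# pkg-config --modversion glib-2.0
```

Directories listed first are searched first, so the exotic prefix wins over anything that was already in the variable.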

With the *.pc file found, configure may still stop with a version mismatch like this one:

checking for GLIB - version >= 2.17.6...
*** 'pkg-config --modversion glib-2.0' returned 2.18.0, but GLIB (2.12.4)
*** was found! If pkg-config was correct, then it is best
*** to remove the old version of GLib. You may also be able to fix the error
*** by modifying your LD_LIBRARY_PATH enviroment variable, or by editing
*** /etc/ld.so.conf. Make sure you have run ldconfig if that is
*** required on your system.
*** If pkg-config was wrong, set the environment variable PKG_CONFIG_PATH
*** to point to the correct configuration files
no
configure: error:
*** GLIB 2.17.6 or better is required. The latest version of
*** GLIB is always available from ftp://ftp.gtk.org/pub/gtk/.

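This means that the header files were found correctly, but the library itself wasn't (the old one in the standard location /usr/lib was picked up instead). The remedy is to state explicitly where to search for the libraries through the LD_RUN_PATH variable, which the linker reads at link time and records in the binary as its run-time library search path: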

export LD_RUN_PATH=/usr/local/exotic/lib