Tuesday, July 1, 2008

Website extraction tools

I'm evaluating some website extraction tools with command-line support so that I can use them in scripts:

DEiXTo, from the Computer Science Department of the Aristotle University of Thessaloniki, is a GPL-licensed yet very powerful web data extractor.

It consists of two parts. The first is the GUI-based query generator, which is Windows-only (quite a drawback). It produces an XML file, called a wrapper project file (*.wpf), that describes what should be matched.
The GUI has a built-in web browser for selecting the visible elements of interest. It also supports regular expressions, neighborhood and a lot more...
It's still in beta and has some teething troubles: in some cases I suddenly end up with two "virtual roots", one of which I can't remove.

The second part is the command-line data extractor, which is fed the WPF file generated with the GUI. The extractor is under the GPL, written in Perl, available for Windows and Linux, and runs without installation. Supported output formats are tab-delimited, XML, RSS, CSV and Excel.
Under the hood it's mostly based on the XML::LibXML, WWW::Mechanize and Tree::Fast Perl modules.
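
Since the point is to use the extractor from scripts, here is a rough idea of how its tab-delimited output could be post-processed. This is only a minimal Python sketch: it assumes the output has already been saved as text with one record per line and no header row, and it does not call DEiXTo itself. The function name `parse_tab_delimited` and the sample records are made up for illustration.

```python
import csv
import io


def parse_tab_delimited(text):
    """Split tab-delimited text into a list of rows (lists of strings)."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    # Skip empty lines so a blank or trailing line does not yield an empty record.
    return [row for row in reader if row]


if __name__ == "__main__":
    # Hypothetical sample: two records with a title and a URL each.
    sample = "Title A\thttp://example.com/a\nTitle B\thttp://example.com/b\n"
    for title, url in parse_tab_delimited(sample):
        print(f"{title} -> {url}")
```

Using the standard `csv` module instead of a plain `split("\t")` means quoted fields are handled too, should the output contain any.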

Sunday, June 29, 2008

Batch renaming of files using regular expressions


What does the blogname mean?

'palehui' is a Nahuatl verb meaning "to help, assist", which describes quite well what this blog is aimed at: helping me to archive minor things related to Linux, my studies (physics, maths) and languages.
...and most of all it's catchy and easy to remember.

And no, I don't speak Nahuatl, but I'm interested in languages, and some day I came across the site http://mexica.ohui.net, which is where I took the word from.