Technical stuff mostly Linux related: July 2008

I'm evaluating some website extraction tools with command-line support so that I can use them from scripts:

DEiXTo, from the Computer Science Department of the Aristotle University of Thessaloniki, is a GPL-licensed and very powerful web data extractor.

It consists of two parts. The first is the GUI-based query generator, which is Windows-only (quite a drawback). It produces an XML file, called a wrapper project file (*.wpf), that describes what should be matched.
The GUI has a built-in web browser for selecting the visible elements of interest. It also supports regular expressions, neighborhood and a lot more.
It is still in beta and has some teething troubles: in some cases I suddenly end up with two "virtual roots", one of which I can't remove.

The second part is the command-line data extractor, which is fed the WPF file generated with the GUI. The extractor is licensed under the GPL, written in Perl, available for Windows and Linux, and runs without installation. Supported output formats are tab-delimited, XML, RSS, CSV and Excel.
Under the hood it is mostly based on the XML::LibXML, WWW::Mechanize and Tree::Fast Perl modules.
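
For scripting, the tab-delimited output is probably the easiest one to consume. Here is a minimal Python sketch of that step, assuming one record per line, no header row, and a column order you already know from your wrapper project. The function name, the field names and the sample data are all made up for illustration, and the sketch deliberately does not show how to invoke the extractor itself.

```python
import csv
import io


def parse_tab_delimited(text, field_names):
    """Turn tab-delimited extractor output into a list of dicts.

    Assumes one record per line and no header row. Both are guesses
    about the output layout, not documented behaviour.
    """
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    records = []
    for row in reader:
        if not any(cell.strip() for cell in row):
            continue  # skip blank lines
        # Pad short rows so every record has the same keys;
        # zip() silently drops any fields beyond field_names.
        padded = row + [""] * (len(field_names) - len(row))
        records.append(dict(zip(field_names, padded)))
    return records


# Hypothetical sample standing in for real extractor output.
sample = (
    "Kernel 2.6.26 released\thttp://example.org/a\n"
    "\n"
    "New Perl module\thttp://example.org/b\n"
)
records = parse_tab_delimited(sample, ["title", "url"])
for record in records:
    print(record["title"], "->", record["url"])
```

In a real script you would read the text from the file the extractor wrote instead of the inline sample. Quoting is switched off on purpose, so quote characters inside scraped text are kept literally.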

Tuesday, 1 July 2008

Website extraction tools
