I'm evaluating some website extraction tools with command-line support so that I can use them in scripts:
DEiXTo, from the Computer Science Department of the Aristotle University of Thessaloniki, is a GPL-licensed yet very powerful web data extractor.
It consists of two parts. The first is the GUI-based query generator, which is Windows-only (quite a drawback). It produces an XML file, called a wrapper project file (*.wpf), which describes what should be matched.
The GUI has a built-in web browser for selecting the visible elements of interest. It also supports regex, neighborhood and a lot more.
It's still in beta and has some teething troubles. In some cases I suddenly end up with two "virtual roots", one of which I can't remove.
The second part is the command-line data extractor, which is fed the WPF file generated with the GUI. The extractor is released under the GPL, written in Perl, available for Windows and Linux, and runs without installation. Supported output formats are tab-delimited, XML, RSS, CSV and Excel.
Under the hood it's mostly based on the XML::LibXML, WWW::Mechanize and Tree::Fast Perl modules.
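To make the "describe what should match, then extract and write records" idea a bit more concrete, here is a rough sketch in Python using only the standard library. It is not DEiXTo's wrapper format or its matching algorithm, and it does not use the Perl modules named above. The sample HTML, the LinkExtractor class and the rule "every link inside an li with class item" are all made up for illustration.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical sample page; a real script would fetch the HTML from a site.
SAMPLE_HTML = """
<ul>
  <li class="item"><a href="/a">First title</a></li>
  <li class="item"><a href="/b">Second title</a></li>
  <li class="other"><a href="/c">Ignored</a></li>
</ul>
"""


class LinkExtractor(HTMLParser):
    """Collect (href, text) for every <a> inside an <li class="item">."""

    def __init__(self):
        super().__init__()
        self.in_item = False   # True while inside a matching <li>
        self.href = None       # href of the <a> currently being read
        self.text = []         # text fragments of the current <a>
        self.records = []      # extracted (href, text) pairs

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "item":
            self.in_item = True
        elif tag == "a" and self.in_item:
            self.href = attrs.get("href")
            self.text = []

    def handle_data(self, data):
        # Only keep text that belongs to a link we are currently recording.
        if self.href is not None:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.href is not None:
            self.records.append((self.href, "".join(self.text).strip()))
            self.href = None
        elif tag == "li":
            self.in_item = False


def extract_records(html):
    """Run the made-up matching rule over an HTML string."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.records


def to_tab_delimited(records):
    """Serialize the records as tab-delimited text, one record per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerows(records)
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_tab_delimited(extract_records(SAMPLE_HTML)), end="")
```

In a script, the tab-delimited text could be piped straight into other command-line tools, which is the main reason I want an extractor with command-line support in the first place.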
Tuesday, 1 July 2008
Sunday, 29 June 2008
Batch renaming of files using regular expressions
Labels:
batch renaming,
regular expressions,
sed
What does the blog name mean?
'palehui' is a Nahuatl verb meaning "to help, to assist", which describes quite well what this blog is aimed at: helping me to archive minor things related to Linux, my studies (physics, maths) and languages.
...and most of all it's catchy and easy to remember.
And no, I don't speak Nahuatl, but I'm interested in languages and one day came across the site http://mexica.ohui.net, where I took the word from.
Labels:
Introduction,
language,
nahuatl,
palehui