Normandie WebLog: Glossary generation

November 15, 2002

Glossary generation

Hello Pascale, Delphine and Karl,

G'day. :) I've generated a glossary from the <nwtitle> and <nwurl> elements, which appear in every XML file except 'data.xml'. Each line of the glossary has the format:

nwtitle @@ nwurl

This file is found at http://www.normandieweb.org/web/glossary.txt.
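To make the line format concrete, here is a minimal sketch of how a consumer might split one glossary line back into its two fields. The helper name split_glossary_line is hypothetical and not part of the scripts in this post; it assumes the title itself never contains ' @@ '.

```shell
#!/bin/sh
# split_glossary_line: hypothetical helper, not part of the original scripts.
# Splits one "title @@ url" line on the first ' @@ ' and prints the title
# and the URL on separate lines.
split_glossary_line() {
    line=$1
    title=${line%% @@ *}   # everything before the first ' @@ '
    url=${line#* @@ }      # everything after the first ' @@ '
    printf '%s\n%s\n' "$title" "$url"
}

# Example, using the sample entry from the Perl script's header comment:
split_glossary_line 'Rouen par Aurore Daubenfeld @@ http://www.normandieweb.org/76/rouen/rouen/rouenaurore.html'
```

Plain parameter expansion keeps this portable to any POSIX sh, with no need for awk or sed.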

The following is the Perl source, which parses each XML file using XML::LibXML and outputs a glossary item:

#!/usr/bin/perl -w
#
# genGlossaryItem.pl
# This script is exclusively for generating a glossary for NormandieWeb.
# It should be called from the shell with the name of an XML file
# as the only argument.
# It parses this file and prints out the contents 
# of <nwtitle> and <nwurl>,
# delimited by ' @@ '.
# Eg:
# Rouen par Aurore Daubenfeld @@ \
#     http://www.normandieweb.org/76/rouen/rouen/rouenaurore.html
#
# Author: Stephanie Troeth
# Date: 14th November 2002
#

use strict;
use XML::LibXML;
local $XML::LibXML::skipDTD = 1; # omit the DTD when a document is serialized

# Bail out if we don't have the single argument we are expecting to receive
if (@ARGV != 1)
{
  print STDERR "Usage: genGlossaryItem.pl <xmlfilename>\n";
  exit 1;   # non-zero status so callers can detect the error
}

# call our main
main();

#~~~~~~~~~~~~~~~
# main function
#~~~~~~~~~~~~~~~
sub main
{
  my $xmlfile = shift(@ARGV);               # get commandline argument
  my $parser = XML::LibXML->new();          # create new instance of parser

  my $tree = $parser->parse_file($xmlfile); # parse
  my $root = $tree->getDocumentElement;     # get the root of our doc tree

  # use XPath to get the title and url
  my $title = $root->findvalue('/nwcity/nwtexte/nwtitle');
  my $url   = $root->findvalue('/nwcity/nwtexte/nwurl');

  print "$title @@ $url\n";
}
# end main

The wrapper shell script looks for all relevant XML files and passes each file through genGlossaryItem.pl:

#!/bin/sh
# genGlossary.sh
#
# This script is exclusively for generating a glossary for NormandieWeb.
# It recursively finds XML files (all except anything with 'data' in
# the file/path name) and passes each file to genGlossaryItem.pl.
# genGlossaryItem.pl parses each file 
# and prints <nwtitle> and <nwurl>
# in the format of:
#
# <nwtitle> @@ <nwurl>
#
# The output of genGlossaryItem.pl is redirected (appended) to a file
# called 'glossary.txt'
#
# Authors: Stephanie Troeth, Karl Dubost
# Date: 14th November 2002
#

touch glossary.txt
XMLDIR=`find  /Users/karl/Sites/NW/nwxml/ -name '*.xml' | grep -v data`

echo Start: `date`
for i in $XMLDIR
do
    ./genGlossaryItem.pl "$i" >> glossary.txt
done
echo End: `date`
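The wrapper above splits the output of find on whitespace, so a path containing a space would be broken into pieces, and because it appends, running it twice duplicates every entry. Here is a minimal sketch of a variant that avoids both problems. The function name gen_glossary and its three parameters are hypothetical, chosen only for illustration.

```shell
#!/bin/sh
# gen_glossary DIR GENERATOR OUTFILE
# Hypothetical variant of the wrapper; the name and parameters are
# illustrative and not part of the original script.
gen_glossary() {
    dir=$1 generator=$2 outfile=$3
    : > "$outfile"    # start from an empty file so reruns do not duplicate entries
    # Read one path per line so names containing spaces stay intact.
    # '! -path' skips anything with 'data' in its path, like 'grep -v data'.
    find "$dir" -name '*.xml' ! -path '*data*' | sort |
    while IFS= read -r xmlfile; do
        # </dev/null keeps the generator from consuming the file list
        "$generator" "$xmlfile" < /dev/null >> "$outfile"
    done
}

# Example call (assumes genGlossaryItem.pl is in the current directory):
# gen_glossary /Users/karl/Sites/NW/nwxml ./genGlossaryItem.pl glossary.txt
```

Sorting the file list is optional; it just makes the glossary order stable between runs.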

Later on, the CMS will reference this file whenever it generates XHTML pages.

Posted by steph at November 15, 2002 05:02 PM

Comments

Hello Steph!

Quite a piece of work... I don't understand a word of English or of Perl, but I can see that this is... great work.

I'm counting on Karl to translate what you wrote ;-) And I hope we'll be able to join your work and mine together easily.

Bye!

Posted by: Pascale on November 16, 2002 03:43 PM