Publicidad:
La Coctelera

un rato de sol

i will work harder, ja ja ja, no ahora en serio.

13 Enero 2008

Stream parsing XML in Ruby with a push-down automaton

This is an example of a small ruby library i've been working today to ease the stream parsing of XML documents.

The aim of the library is easy, automate the process of stream parsing by simply defining a push-down automaton describing the parsing process.
Take a look at the next XML fragment:

<catalog>
       <category name="testCategoryA">
        <game name="test1" id="435">
          <device id="33">
          <description>
            Texto descriptivo.
          </description>
       </device>
       <game name="test2" id="436">
          <devices>
             <device id="34">
             <device id="35">
          </device>
       </device>
      </devices>
     </game>
   </category>
 </catalog>
 

Suppose you are interested in obtaining the games by category, and the devices for each game. To write a SAX-like parser handling code to achieve this task you must take into account the order the parser will be giving you the elements: category -> game -> device, and the context of the element, for instance, the game a certain device element belongs to.

The sequence of elements can be described by the means of a finite state machine, with each node tag signaling the transition from a state to other.

The need to remember the context of the elements being parsed, can be solved with some kind memory mechanism, for example, a stack.

So everything you need is a finite state machine + stack: enters the push-down automaton.

Next, there's an example of a bunch of ruby code that uses the library to write a parser using the push-down automaton formalism.

To use the parser you just describe a set of transition between states like this: from the "initial" state, if a "<category>" start tag is found, go to the state "category". This rules are coded like an array of arrays: [[:init,:category,"categor"],...].

To manage the stack of the automaton you define methods parsed_xxx(xxx) where xxx is the element name: parsed_category(category). The library will call this functions to return the element from the top of the stack when it is popped out of the stack. Alternatively you can also define methods found_xxx(xxx) that will be called when the element is pushed in the stack.
Internally, the library uses REXML stream parser, not SAX compliant, as the backend parsing engine.

And that's all! here is an example of use from a unit test:

 require File.dirname(__FILE__) + '/../test_helper'
 require 'automata_parser/graph'
 require 'automata_parser/parser'
 
 include AutomataParser
 
 class StreamParserTest < ActiveSupport::TestCase
 
   include AutomataParser::Constants
 
   XML = <<__END
 <catalog>
       <category name="testCategoryA">
        <game name="test1" id="435">
          <device id="33">
          <description>
            Texto descriptivo.
          </description>
       </device>
       <game name="test2" id="436">
          <devices>
             <device id="34">
             <device id="35">
          </device>
       </device>
      </devices>
     </game>
   </category>
 </catalog>
 __END
 
   def setup
     @mapping = [[:init,:categ,"category"],
                 [:categ,:game,"game"],
                 [:game,:device,"device"],
                 [:device,:game,"/device"],
                 [:game,:description,"description"],
                 [:description,:game,"/description"],
                 [:game,:categ,"/game"],
                 [:game,:init,"/category"]]
     @parser = AutomataParser::Parser.new(@mapping,:init,self)
     @xml = StringIO.new(XML)
     @categories = Array.new
     @games = Array.new
     @devices = Array.new
     @descriptions = Array.new
   end
 
   def test_parse
     @parser.parse @xml
     
     assert_equal(1,@categories.size)
     assert_equal(2,@games.size)
     assert_equal(3,@devices.size)
     assert_equal(1,@descriptions.size)
     assert_equal("testCategoryA",@categories[0].values["name"])
     assert_equal(12,@categories[0].values["id"])
 
   end
   
   def found_category(category)
     @categories.push category
     category.values["id"]=12
   end
 
   def parsed_game(game)
     @games.push game
     if(game.values["id"]=="435")
       assert_equal("test1",game.values["name"])
     end
     if(game.values["id"]=="436")
       assert_equal("test2",game.values["name"])
     end
     
     assert_equal("category",game.parent.name)
     assert_equal(12,game.parent.values["id"])
   end
 
   def parsed_description(description)
     @descriptions.push description
     assert_not_nil(description.text)
     assert_equal("game",description.parent.name)
     assert_equal("test1",description.parent.attribute["name"])
   end
   
   def parsed_device(device)
     @devices.push device
     assert_equal(true,(device.values["id"]=="33" || devices.values["id"]=="34" || devices.values["id"]=="35"))
     assert_equal("game",device.parent.name)
   end  
 end
 

If you wan to give it a try, you can try the attached download:

pushdown_parser.zip

Tags: ruby

servido por Antonio sin comentarios compártelo

sin comentarios · Escribe aquí tu comentario

Escribe tu comentario


Sobre mí

Avatar de Antonio

un rato de sol

Barcelona/Salamanca, España
ver perfil »
contacto »
Trabajador del metal y del acero, en la gloriosa XING AG, escribo software con el que poder ganarme el jornal. En mi tiempo libre sigo tecleando código de bonitos colores a medio camino entre lo sublime y lo terrible. Últimamente me gustan mucho los gatos.

Fotos

Antonio Garrote Hernández todavía no ha subido ninguna foto.

¡Anímale a hacerlo!

Buscar

suscríbete

Selecciona el agregador que utilices para suscribirte a este blog (también puedes obtener la URL de los feeds):

¿Qué es esto?

Crea tu blog gratis en La Coctelera