Stream parsing XML in Ruby with a push-down automaton
This is an example of a small ruby library i've been working today to ease the stream parsing of XML documents.
The aim of the library is easy, automate the process of stream parsing by simply defining a push-down automaton describing the parsing process.
Take a look at the next XML fragment:
<catalog> <category name="testCategoryA"> <game name="test1" id="435"> <device id="33"> <description> Texto descriptivo. </description> </device> <game name="test2" id="436"> <devices> <device id="34"> <device id="35"> </device> </device> </devices> </game> </category> </catalog>
Suppose you are interested in obtaining the games by category, and the devices for each game. To write a SAX-like parser handling code to achieve this task you must take into account the order the parser will be giving you the elements: category -> game -> device, and the context of the element, for instance, the game a certain device element belongs to.
The sequence of elements can be described by the means of a finite state machine, with each node tag signaling the transition from a state to other.
The need to remember the context of the elements being parsed, can be solved with some kind memory mechanism, for example, a stack.
So everything you need is a finite state machine + stack: enters the push-down automaton.
Next, there's an example of a bunch of ruby code that uses the library to write a parser using the push-down automaton formalism.
To use the parser you just describe a set of transition between states like this: from the "initial" state, if a "
To manage the stack of the automaton you define methods parsed_xxx(xxx) where xxx is the element name: parsed_category(category). The library will call this functions to return the element from the top of the stack when it is popped out of the stack. Alternatively you can also define methods found_xxx(xxx) that will be called when the element is pushed in the stack.
Internally, the library uses REXML stream parser, not SAX compliant, as the backend parsing engine.
And that's all! here is an example of use from a unit test:
require File.dirname(__FILE__) + '/../test_helper' require 'automata_parser/graph' require 'automata_parser/parser' include AutomataParser class StreamParserTest < ActiveSupport::TestCase include AutomataParser::Constants XML = <<__END <catalog> <category name="testCategoryA"> <game name="test1" id="435"> <device id="33"> <description> Texto descriptivo. </description> </device> <game name="test2" id="436"> <devices> <device id="34"> <device id="35"> </device> </device> </devices> </game> </category> </catalog>__END def setup @mapping = [[:init,:categ,"category"], [:categ,:game,"game"], [:game,:device,"device"], [:device,:game,"/device"], [:game,:description,"description"], [:description,:game,"/description"], [:game,:categ,"/game"], [:game,:init,"/category"]] @parser = AutomataParser::Parser.new(@mapping,:init,self) @xml = StringIO.new(XML) @categories = Array.new @games = Array.new @devices = Array.new @descriptions = Array.new end def test_parse @parser.parse @xml assert_equal(1,@categories.size) assert_equal(2,@games.size) assert_equal(3,@devices.size) assert_equal(1,@descriptions.size) assert_equal("testCategoryA",@categories[0].values["name"]) assert_equal(12,@categories[0].values["id"]) end def found_category(category) @categories.push category category.values["id"]=12 end def parsed_game(game) @games.push game if(game.values["id"]=="435") assert_equal("test1",game.values["name"]) end if(game.values["id"]=="436") assert_equal("test2",game.values["name"]) end assert_equal("category",game.parent.name) assert_equal(12,game.parent.values["id"]) end def parsed_description(description) @descriptions.push description assert_not_nil(description.text) assert_equal("game",description.parent.name) assert_equal("test1",description.parent.attribute["name"]) end def parsed_device(device) @devices.push device assert_equal(true,(device.values["id"]=="33" || devices.values["id"]=="34" || devices.values["id"]=="35")) assert_equal("game",device.parent.name) end end
If you wan to give it a try, you can try the attached download:
