Arun Manglick - Technical View: Xml parsers

Thursday, April 30, 2009

XML Parsers

Three traditional techniques for processing XML files are:

More recent and emerging techniques for processing XML files are:

Pull Parsing - System.Xml.XmlReader in .NET, StAX (Streaming API for XML) in Java
Non-Extractive Parsing (i.e. in-situ parsing)
Data binding

SAX Parser –

SAX is a lexical, event-driven interface in which a document is read serially and its contents are reported as "callbacks" to various methods on a handler object of the user's design.
SAX is fast and efficient to implement, but difficult to use for extracting information at random from the XML, because the application author has to keep track of which part of the document is currently being processed.
It is better suited to situations in which certain types of information are always handled the same way, no matter where they occur in the document.
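
As a rough sketch of the callback style described above, here is a small SAX handler using the JDK's built-in parser. The class name SaxTitleCollector and the <title> element are made up for illustration. Note how the handler has to remember for itself whether it is currently inside a <title>, which is exactly the bookkeeping burden mentioned above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: collects the text of every <title> element,
// wherever it occurs in the document.
class SaxTitleCollector extends DefaultHandler {
    private final List<String> titles = new ArrayList<>();
    private StringBuilder current; // non-null only while inside a <title>

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if ("title".equals(qName)) {
            current = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // The parser may deliver text in several chunks, so append rather than assign.
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("title".equals(qName) && current != null) {
            titles.add(current.toString());
            current = null;
        }
    }

    // Parses the given XML string and returns the collected titles in document order.
    static List<String> collect(String xml) {
        SaxTitleCollector handler = new SaxTitleCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("SAX parse failed", e);
        }
        return handler.titles;
    }

    public static void main(String[] args) {
        String xml = "<library><book><title>XML Basics</title></book>"
                   + "<journal><title>Parsing Weekly</title></journal></library>";
        System.out.println(collect(xml));
    }
}
```

The handler treats every <title> the same way no matter where it sits in the tree, which is the kind of job SAX handles comfortably.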

Pull Parser -

A pull parser creates an iterator that sequentially visits the various elements, attributes, and data in an XML document.
Code which uses this 'iterator' can test the current item (to tell, for example, whether it is a start or end element, or text), and inspect its attributes (local name, namespace, values of XML attributes, value of text, etc.), and can also move the iterator to the 'next' item.
The code can thus extract information from the document as it traverses it.
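
The iterator idea can be sketched with StAX's XMLStreamReader from the JDK. This is only a minimal illustration; the class name StaxWalk and the output format are assumptions of mine, not part of any API. The code asks for the next item, tests what kind of item it is, and inspects it.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example class: walks a document with a pull parser and
// records a short description of each item it visits.
class StaxWalk {
    static List<String> describe(String xml) {
        List<String> out = new ArrayList<>();
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Ask the parser to merge adjacent text chunks into one CHARACTERS event.
            factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                int event = reader.next(); // move the iterator to the next item
                switch (event) {
                    case XMLStreamConstants.START_ELEMENT: {
                        StringBuilder sb = new StringBuilder("start " + reader.getLocalName());
                        for (int i = 0; i < reader.getAttributeCount(); i++) {
                            sb.append(' ').append(reader.getAttributeLocalName(i))
                              .append('=').append(reader.getAttributeValue(i));
                        }
                        out.add(sb.toString());
                        break;
                    }
                    case XMLStreamConstants.CHARACTERS:
                        if (!reader.isWhiteSpace()) {
                            out.add("text " + reader.getText());
                        }
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        out.add("end " + reader.getLocalName());
                        break;
                    default:
                        break; // ignore comments, processing instructions, etc.
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("StAX parse failed", e);
        }
        return out;
    }

    public static void main(String[] args) {
        describe("<book id=\"42\"><title>XML</title></book>").forEach(System.out::println);
    }
}
```

Because the application drives the loop, it always knows where it is in the document without keeping separate state.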

Recap of XML Parser Types

Generally, there are two types of XML parsers.
First are the push and pull parsers, which simply read an XML document and return the data and structure of the document (e.g., SAX and StAX).
Both are event-driven parsers because they return events that the developer has to handle.

Push parser implementations like SAX (Simple API for XML) return the data of the whole document in one stream, and the application cannot pause them; in Java, the usual way to stop early is to throw an exception from a handler method.
Pull parsers, on the other hand, only return data when they are asked to read the next node in a document. StAX in Java and System.Xml.XmlReader in .NET are pull parsers.
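
To make the difference concrete, here is a hedged sketch that reads only the first <title> of a document in both styles. The names EarlyExit and StopParsing are hypothetical. With SAX the handler throws a marker exception to abort the parse; with StAX the code simply stops asking for more.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class comparing early exit in push and pull parsing.
class EarlyExit {
    // Marker exception used only to abort the SAX parse on purpose.
    static class StopParsing extends SAXException {
        StopParsing() { super("stop requested by handler"); }
    }

    // Push style: the parser drives, so the handler must throw to stop it.
    static String firstTitleSax(String xml) {
        final StringBuilder title = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            private boolean inTitle;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes a) {
                if ("title".equals(qName)) inTitle = true;
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inTitle) title.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) throws SAXException {
                if ("title".equals(qName)) throw new StopParsing();
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (StopParsing expected) {
            // Parse was aborted deliberately after the first </title>.
        } catch (Exception e) {
            throw new IllegalStateException("SAX parse failed", e);
        }
        return title.toString();
    }

    // Pull style: the application drives, so it just returns when it has enough.
    static String firstTitleStax(String xml) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && "title".equals(reader.getLocalName())) {
                        return reader.getElementText(); // text-only content of <title>
                    }
                }
                return "";
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("StAX parse failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<lib><book><title>First</title></book>"
                   + "<book><title>Second</title></book></lib>";
        System.out.println(firstTitleSax(xml));
        System.out.println(firstTitleStax(xml));
    }
}
```

Both methods return the same text, but the pull version needs no exception and no handler state to stop early.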

The second type of XML parser is the object model parser (e.g., DOM and Apache AXIOM), which not only reads the data but also constructs an in-memory representation of the document that can be altered. Since DOM parsers mostly use SAX parsers to read in the documents, the object model of a document is always built completely. This is a performance limitation if only data at the beginning of a document needs to be read and altered. Newer approaches like Apache's AXIOM use StAX pull-parser implementations to overcome this limitation. AXIOM builds the tree representation of a document only up to the last node that was requested, so it does not need to read the complete document.
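
For the object model side, a minimal sketch with the JDK's DOM parser looks like this. The class name DomEdit and its methods are hypothetical, and AXIOM is not shown because it is a separate library. The whole tree is built by parse() before the first node can be touched, even though only the first <title> is changed.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical example class: builds a DOM tree in memory and alters it.
class DomEdit {
    // Reads the complete document and returns its in-memory tree.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("DOM parse failed", e);
        }
    }

    // Replaces the text of the first <title> and returns how many
    // <title> elements the tree holds in total.
    static int retitleFirst(Document doc, String newTitle) {
        NodeList titles = doc.getElementsByTagName("title");
        if (titles.getLength() > 0) {
            titles.item(0).setTextContent(newTitle);
        }
        return titles.getLength();
    }

    public static void main(String[] args) {
        Document doc = parse("<lib><book><title>Old</title></book>"
                           + "<book><title>Other</title></book></lib>");
        retitleFirst(doc, "New");
        System.out.println(doc.getElementsByTagName("title").item(0).getTextContent());
    }
}
```

Random access and in-place edits are easy here, at the cost of holding the full document in memory.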

Thanks & Regards,

Arun Manglick || Senior Tech Lead
