
Parsing mixed content XML using Python

Python (is definitely) != snake_oil

I wanted to extract the formatted text from an OpenOffice.org Writer document while preserving the position of each word within the surrounding white space.
Most folks are aware that these documents are XML ‘under the hood’: an .odt file is a .zip archive containing a collection of XML files.
Moreover, I wanted to use Python, with a view to re-using the extracted and/or munged text with ReportLab.
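
The ‘.odt == .zip’ point can be checked with the standard zipfile module. What follows is a minimal sketch, not part of the original example: it builds a tiny stand-in archive in memory (a real .odt holds more members than this) and reads content.xml back out. The helper name read_content_xml and the sample paragraph are assumptions for illustration.

```python
import io
import zipfile


def read_content_xml(odt_file):
    """Return the raw bytes of content.xml from an .odt path or file-like object."""
    with zipfile.ZipFile(odt_file) as archive:
        return archive.read("content.xml")


# Stand-in for a real document: an in-memory zip with a single member.
# (Hypothetical sample data; a real .odt contains more files than this.)
sample_xml = b'<text:p text:style-name="P6">A normal line of text.</text:p>'
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("content.xml", sample_xml)
buffer.seek(0)

content = read_content_xml(buffer)
print(content.decode("utf-8"))
```

With a real document, the same helper should accept a file path in place of the in-memory buffer.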

It turns out that handling ‘mixed’ XML content is not that well documented, but I did manage to find one blogger whose short example pointed me in the right direction. The power of the SAX parser left me a little breathless, as I ended up writing far less code than I expected.
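
To make ‘mixed’ content concrete: it is an element whose children are a blend of literal text and other elements. The sketch below deliberately uses ElementTree rather than SAX, purely to show where the text ends up in a tree model: part of it in the parent's .text and the rest in each child's .tail. The sample paragraph is hypothetical, and the namespace declaration was added to it so that the text: prefix resolves (ElementTree rejects unbound prefixes).

```python
import xml.etree.ElementTree as ET

# The ODF text namespace, declared here so the "text:" prefix can be resolved.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

# Hypothetical sample: a paragraph mixing literal text with a <text:s/> child.
sample = (
    '<text:p xmlns:text="%s">would be <text:s text:c="3"/>sad</text:p>' % TEXT_NS
)

para = ET.fromstring(sample)
space = para[0]

# Text before the first child lives on the parent...
print(repr(para.text))
# ...while text after a child lives on that child's .tail.
print(repr(space.tail))

# Attribute names are expanded to the {namespace}local form.
count = int(space.get("{%s}c" % TEXT_NS))

# Stitching the pieces back together by hand restores the spacing.
rebuilt = para.text + " " * count + space.tail
print(repr(rebuilt))
```

Having to stitch .text and .tail back together by hand is the bookkeeping that a SAX handler avoids, since it receives text and tags in document order.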

I present my example code below in its entirety, for the benefit of others looking to solve a similar problem. The output from my example is presented as a screen dump below the code.



#
# Parsing mixed content XML of the type from an OpenOffice.org Writer xml document.
# An example by Joe Brown
# 14th June 2012 @ 9:38am
#
# Based on blog: http://www.knowthytools.com/2010/03/sax-parsing-with-python.html, by Anthony Cramp,
# and modified (by me) to accept strings instead of URI - Thanks Anthony.
#
# [Using Python 2.7 on Windows XP/SP3]
#
# Example handler to parse 'mixed' content (formatting tags) in xml
# The subject of mixed content handling in Python is not covered
# very well (for beginners with the language) and so I thought
# publishing this might help others.
#
# My example is based on parsing lines of text in an OpenOffice.org Writer Doc which show the
# lyrics of a song with chord names above them (as per my 'Songs For Broken Voices' - http://www.joebrown.org.uk/wp/?p=1217#BadToMe )
# White space has been added to position the chords above certain words, and as
# a result, 'mixed content' is present in the xml, making it a pain to read using
# the likes of ElementTree
#
# The strings below demonstrate both 'normal' and 'formatted' lines of text.
# Un-comment the commented-out print calls to see a full report of the parsing.
#

import xml.sax

srctxt1 = """<text:p text:style-name="P6">A normal line of text, which has no formatting tags in it.</text:p>"""
# next: a string containing the names of 4 guitar chords, separated by literal whitespace and <text:s> space-insertion tags,
# some of which carry no attributes at all!
srctxt2 = """<text:p text:style-name="P6"><text:s text:c="5"/>D <text:s text:c="10"/>F#min <text:s text:c="7"/>Bmin <text:s text:c="12"/>A7<text:s /></text:p>"""
# now the lyric line
srctxt3 = """<text:p text:style-name="P6">The birds in the sky would be <text:s text:c="3"/>sad and lonely, </text:p>"""

class ABContentHandler(xml.sax.ContentHandler):
  """ Interface for reading mixed content within xml text tags of the type used
      in OpenOffice.org Writer documents. The text-formatting tags are replaced with
      whitespace, thus retaining the original spacing between the words."""
  # string accumulator
  line = ""

  def __init__(self):
    xml.sax.ContentHandler.__init__(self)

  def startElement(self, name, attrs):
    # print("startElement '" + name + "'")
    if name == "text:s":
      # text:c holds the number of spaces to insert. Some <text:s> elements appear
      # without it; in ODF a missing text:c stands for a single space, not noise.
      # print("\tattribute value='" + attrs.get("text:c", "1") + "'")
      # NOTE: example only caters for the 'text:c' attribute
      self.line += ' ' * int(attrs.get("text:c", 1)) # accumulate white space for cursor value

  def endElement(self, name):
    # print("endElement '" + name + "'")
    pass

  def characters(self, content):
    # print("characters '" + content + "'")
    self.line += content # accumulate literal text

  def printLine(self):
    print(self.line) # parenthesised form works on Python 2.7 and 3

  def clear(self): # clear the accumulator for re-use
    self.line = ""

def main():
  ch = ABContentHandler()
  xml.sax.parseString(srctxt1, ch)
  ch.printLine()
  print("-------------------------------------")
  ch.clear() # clear the string accumulator
  xml.sax.parseString(srctxt2, ch) # parse the chord line
  ch.printLine()
  ch.clear()
  xml.sax.parseString(srctxt3, ch) # now parse the lyric
  ch.printLine()

if __name__ == "__main__":
  main()


Executing the above produced the following (in the IDLE GUI):

Output from sax_ex1.py
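
The example parses one paragraph at a time from hand-picked strings. For a whole content.xml, one possible extension (a sketch only, not the code from this post) is to start a fresh accumulator at each text:p and store the finished line when the paragraph closes, which gives endElement a job to do. The class name ParagraphLines, the <doc> wrapper element and the two-paragraph sample are made up for illustration. A <text:s> with no text:c is treated as one space, which is how the ODF specification defines it.

```python
import xml.sax


class ParagraphLines(xml.sax.ContentHandler):
    """Collect one output line per <text:p>, expanding <text:s> into spaces."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.lines = []      # finished paragraphs, in document order
        self.current = None  # pieces of the paragraph being read, or None

    def startElement(self, name, attrs):
        if name == "text:p":
            self.current = []
        elif name == "text:s" and self.current is not None:
            # A missing text:c counts as a single space.
            self.current.append(" " * int(attrs.get("text:c", 1)))

    def characters(self, content):
        # Ignore text that is not inside a paragraph.
        if self.current is not None:
            self.current.append(content)

    def endElement(self, name):
        if name == "text:p" and self.current is not None:
            self.lines.append("".join(self.current))
            self.current = None


# Hypothetical two-paragraph input; <doc> is a made-up wrapper element.
sample = (
    '<doc>'
    '<text:p><text:s text:c="5"/>D <text:s text:c="10"/>F#min</text:p>'
    '<text:p>The birds in the sky would be <text:s text:c="3"/>sad</text:p>'
    '</doc>'
)

handler = ParagraphLines()
xml.sax.parseString(sample.encode("utf-8"), handler)
for line in handler.lines:
    print(line)
```

Collecting pieces in a list and joining them once per paragraph avoids repeated string concatenation. Nested paragraphs (for example inside frames) are not handled by this sketch.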


Downloads
The Python code for the above is here: http://joebrown.org.uk/images/sax_ex1.py
The XML content (from the original ODT document for ‘Bad To Me’ [http://www.joebrown.org.uk/wp/?p=1217#BadToMe]) is here: http://joebrown.org.uk/images/content.xml
