Python (is definitely) != snake_oil
I wanted to extract the formatted text from an OpenOffice.org Writer document, retaining the positions of words within the white space of the text.
Most folks are aware that these documents are XML ‘under the hood’ – .odt == .zip == collection of XML files etc.
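To make the ‘.odt == .zip’ point concrete, here is a minimal sketch that pulls content.xml out of a Writer document with Python's standard zipfile module. The helper name read_content_xml is my own invention for illustration, and the in-memory archive is only a stand-in so the snippet runs without a real .odt to hand.

```python
import io
import zipfile


def read_content_xml(odt):
    """Return the raw bytes of content.xml from an .odt archive.

    'odt' may be a file name or a binary file-like object.
    (Hypothetical helper, not part of the original example.)
    """
    with zipfile.ZipFile(odt) as archive:
        return archive.read("content.xml")


# Stand-in for a real document: a tiny in-memory zip holding only content.xml.
# A genuine .odt carries further members alongside it (styles.xml, meta.xml, ...).
fake_odt = io.BytesIO()
with zipfile.ZipFile(fake_odt, "w") as archive:
    archive.writestr("content.xml", "<demo>hello</demo>")
fake_odt.seek(0)

print(read_content_xml(fake_odt))
```

With a real file, the call would simply be read_content_xml("song.odt"), where song.odt is a placeholder name.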
Moreover, I wanted to use Python, with a view to re-using extracted and/or munged text with ReportLab.
It turns out that handling ‘mixed’ XML content is not that well documented, but I did manage to find one blogger whose short example pointed me in the right direction. The power behind the SAX parser code left me a little breathless, as I ended up with a lot less slurp than I expected.
I present my example code below in its entirety, for the benefit of others looking to solve a similar problem. The output from my example is presented as a screen dump below the code.
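For anyone wondering what ‘mixed’ content looks like in practice, the sketch below shows how ElementTree splits one paragraph: the text before a child element lands on the parent's .text, while the text after it hangs off the child's .tail. The sample string is my own cut-down version of a lyric line, and I have added an xmlns:text declaration (the ODF text namespace) because ElementTree is namespace-aware and rejects undeclared prefixes.

```python
import xml.etree.ElementTree as ET

# ODF text namespace, declared here so ElementTree accepts the 'text:' prefix.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

sample = (
    '<text:p xmlns:text="%s">would be <text:s text:c="3"/>sad and lonely</text:p>'
    % TEXT_NS
)

para = ET.fromstring(sample)
spacer = para[0]  # the <text:s> child element

print(repr(para.text))    # text before the first child belongs to the parent
print(spacer.tag)         # tag is reported in {namespace}name form
print(repr(spacer.tail))  # text after the child belongs to the child, not the parent
```

Reassembling a line this way means walking every child and stitching .text and .tail back together in order, which is the bookkeeping a SAX handler avoids.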
#
# Parsing mixed content XML of the type from an OpenOffice.org Writer xml document.
# An example by Joe Brown
# 14th june 2012 @ 9:38am
#
# Based on blog: http://www.knowthytools.com/2010/03/sax-parsing-with-python.html, by Anthony Cramp,
# and modified (by me) to accept strings instead of URI - Thanks Anthony.
#
# [Using Python 2.7 on Windows XP/SP3]
#
# Example handler to parse 'mixed' content (formatting tags) in xml
# The subject of mixed content handling in Python is not covered
# very well (for beginners with the language) and so I thought
# publishing this might help others.
#
# My example is based on parsing lines of text in an OpenOffice.org Writer Doc which show the
# lyrics of a song with chord names above them (as per my 'Songs For Broken Voices' - http://www.joebrown.org.uk/wp/?p=1217#BadToMe )
# White space has been added to position the chords above certain words, and as
# a result, 'mixed content' is present in the xml, making it a pain to read using
# the likes of ElementTree
#
# The strings below demonstrate both 'normal' and 'formatted' lines of text.
# Un-comment the commented code to see a full report of parsing
#
import xml.sax

srctxt1 = """<text:p text:style-name="P6">A normal line of text, which has no formatting tags in it.</text:p>"""

# next: a string containing the names of 4 guitar chords, separated by arbitrary whitespace and space-insertion instruction tags.
# some of which can be arbitrarily empty!
srctxt2 = """<text:p text:style-name="P6"><text:s text:c="5"/>D <text:s text:c="10"/>F#min <text:s text:c="7"/>Bmin <text:s text:c="12"/>A7<text:s /></text:p>"""

# now the lyric line
srctxt3 = """<text:p text:style-name="P6">The birds in the sky would be <text:s text:c="3"/>sad and lonely, </text:p>"""


class ABContentHandler(xml.sax.ContentHandler):
    """
    Interface for reading mixed content within xml text tags of the type used in OpenOffice.org Writer documents.
    The text-formatting tags are replaced with whitespace, thus retaining the original spacings between the words.
    """

    # string accumulator
    line = ""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)

    def startElement(self, name, attrs):
        # print("startElement '" + name + "'")
        if name == "text:s":
            # yes, some of these elements appear without any attribute/value pair;
            # the ODF spec reads a missing text:c as a single space
            count = 1
            if attrs.getLength() != 0:
                # print("\tattribute type='" + attrs.getType("text:c") + "'")
                # print("\tattribute value='" + attrs.getValue("text:c") + "'")
                # NOTE: example only caters for 'text:c' attribute
                count = int(attrs.getValue("text:c"))
            self.line += ' ' * count  # accumulate white space for cursor value

    def endElement(self, name):
        # print("endElement '" + name + "'")
        pass

    def characters(self, content):
        # print("characters '" + content + "'")
        self.line += content  # accumulate literal text

    def printLine(self):
        print(self.line)

    def clear(self):
        # clear the accumulator for re-use
        self.line = ""


def main():
    ch = ABContentHandler()
    xml.sax.parseString(srctxt1, ch)
    ch.printLine()
    print("-------------------------------------")
    ch.clear()  # clear the string accumulator
    xml.sax.parseString(srctxt2, ch)  # parse the chord line
    ch.printLine()
    ch.clear()
    xml.sax.parseString(srctxt3, ch)  # now parse the lyric
    ch.printLine()


if __name__ == "__main__":
    main()
Executing the above produced the following (in the Idle GUI):
The Python code for the above is here: http://joebrown.org.uk/images/sax_ex1.py
The XML content (from the original ODT document for ‘Bad To Me’ [http://www.joebrown.org.uk/wp/?p=1217#BadToMe]) is here: http://joebrown.org.uk/images/content.xml
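For a whole content.xml rather than single strings, one possible extension is a handler that collects a finished line each time a paragraph closes. This is only a sketch under assumptions of mine: the class name ParagraphLines is made up, the document is assumed to use the conventional ‘text:’ prefix, and every element other than text:p and text:s (tabs, line breaks, spans and so on) is ignored. The sample string is a small stand-in, not the real file.

```python
import xml.sax


class ParagraphLines(xml.sax.ContentHandler):
    """Collect one plain-text line per <text:p>, expanding <text:s> into spaces.

    Hypothetical extension of the handler above; only text:p and text:s
    are recognised.
    """

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.lines = []      # finished paragraphs, in document order
        self.current = None  # pieces of the paragraph being read, else None

    def startElement(self, name, attrs):
        if name == "text:p":
            self.current = []
        elif name == "text:s" and self.current is not None:
            # The ODF spec reads a missing text:c as a single space.
            self.current.append(" " * int(attrs.get("text:c", 1)))

    def characters(self, content):
        if self.current is not None:
            self.current.append(content)

    def endElement(self, name):
        if name == "text:p" and self.current is not None:
            self.lines.append("".join(self.current))
            self.current = None


# Stand-in for a real content.xml: two paragraphs inside an office:text wrapper.
sample = (
    '<office:text xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0" '
    'xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">'
    '<text:p><text:s text:c="5"/>D <text:s text:c="10"/>F#min</text:p>'
    '<text:p>The birds in the sky</text:p>'
    '</office:text>'
)

handler = ParagraphLines()
xml.sax.parseString(sample.encode("utf-8"), handler)
for line in handler.lines:
    print(line)
```

Gathering the pieces in a list and joining them once per paragraph keeps the handler from rebuilding a long string on every callback, which matters little for a song sheet but is a cheap habit for larger documents.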