
lxml etree iterparse or xpath, xml namespaces, and python generators

I've been going round in circles trying to parse XML in Python for MSL. Everything I've done in the past has been a nasty hack. Today, I tried to figure out how to cleanly deal with the namespaces and was foiled. It seems crazy that lxml doesn't have something that returns the namespace dictionary for a whole document (the nsmap property on an element only covers the namespaces in scope at that element). But maybe that isn't even possible. So, in the past, I've had things like this:
utm_coords = root.xpath('//*/gml:coordinates', namespaces={'gml':"http://www.opengis.net/gml"})[0].text
Ick. I really do not like the whole namespace URL thing. Why does my code care where the gml spec is? Today, I finally found a way to do xpath searches that ignore the namespace. For example, if you want to find all the <Node> entries in a document:
et.xpath("//*[local-name()='Node']")
Out[47]: [<Element {RPK}Node at 0x103c9ca50>]
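For comparison, here is a rough sketch of both approaches side by side: building the namespaces argument from an element's nsmap instead of typing the URL, and matching on local-name(). The sample document is made up for illustration; only the gml namespace URI comes from the snippet above. Note that nsmap only lists namespaces declared on that element or its ancestors, so this is not a general answer for every document.

```python
from lxml import etree

# Hypothetical sample document; only the GML namespace URI is from the post.
SAMPLE = b"""<root xmlns="urn:example:default" xmlns:gml="http://www.opengis.net/gml">
  <gml:Point><gml:coordinates>500000,4649776</gml:coordinates></gml:Point>
</root>"""

root = etree.fromstring(SAMPLE)

# nsmap maps prefix -> URI for the namespaces in scope on this element.
# The default namespace is stored under the key None, which xpath() does
# not accept as a prefix, so drop it before passing the mapping along.
namespaces = {prefix: uri for prefix, uri in root.nsmap.items() if prefix is not None}

# Same element found two ways: by prefix, and by ignoring the namespace.
by_prefix = root.xpath('//gml:coordinates', namespaces=namespaces)[0].text
by_local_name = root.xpath("//*[local-name()='coordinates']")[0].text

print(by_prefix, by_local_name)
```

The local-name() form needs no namespace mapping at all, at the cost of also matching same-named elements from any other namespace.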
This works fine for small documents. But really I should be using iterparse to deal with one chunk at a time. Some documents have one <Node>, but others have tens of thousands of nodes and several hundred thousand child nodes. xpath just gets slow. It turns out that using lxml etree iterparse plus a Python generator makes for compact code that doesn't make my head hurt. I'm sure it's not the easiest to understand if you are not the author, but check it out:
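from lxml import etree

# This is a python generator... note the "yield"
def RksmlNodes(rksml_source):
    for event, element in etree.iterparse(rksml_source, tag='{RPK}Node'):
        # Map each child's Name attribute to its numeric value
        knots = {child.attrib['Name']: float(child.text) for child in element}
        yield knots
        # Drop this Node's children so memory stays flat on big files
        element.clear()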
Then if you want to get all the nodes in a file, you can do something like this:
nodes = list(rksml.RksmlNodes('data/00048/rksml_playback.rksml'))
len(nodes)
nodes[0]
And that gives you something like this:
Out[8]: 20129
Out[9]: {'RSM_AZ': 2.1, 'RSM_EL': 0.321}
And I didn't have to hard code the fields that are the children of each <Node>. This is all fine until somebody changes the namespace in an RKSML file to something other than RPK (in lxml's {RPK}Node notation, the RPK is the namespace URI itself, not just a prefix). In the past, I would read each file and rewrite it without the namespace.
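One way to sidestep the namespace problem without rewriting files might be to compare only the local part of each tag while iterating. The following is a minimal sketch under some assumptions: the function name rksml_nodes_any_ns, the Knot child tag, the Playback root, and the sample namespace are all made up for illustration; only the Name attribute and the RSM_AZ / RSM_EL values come from the example above.

```python
import io

from lxml import etree


def rksml_nodes_any_ns(rksml_source, local_name='Node'):
    """Yield one {Name: value} dict per matching element, whatever its namespace."""
    for _event, element in etree.iterparse(rksml_source, events=('end',)):
        # QName splits '{namespace}Node' into its parts; keep only 'Node'.
        if etree.QName(element).localname != local_name:
            continue
        yield {child.attrib['Name']: float(child.text) for child in element}
        # The dict is built, so the children can be released.
        element.clear()


# Hypothetical RKSML-like sample using a namespace other than RPK.
SAMPLE = b"""<Playback xmlns="urn:example:not-rpk">
  <Node>
    <Knot Name="RSM_AZ">2.1</Knot>
    <Knot Name="RSM_EL">0.321</Knot>
  </Node>
  <Node>
    <Knot Name="RSM_AZ">2.2</Knot>
  </Node>
</Playback>"""

nodes = list(rksml_nodes_any_ns(io.BytesIO(SAMPLE)))
print(len(nodes), nodes[0])
```

The trade-off is that iterparse now reports every element's end event instead of filtering on the tag for you, so this does a little more work in Python per element.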
