XML Walker section
An XML walker source section yields a hierarchy of items by iterating over an `lxml.etree`_ tree of XML elements that match an `XPath`_. This can be used to build a content structure based on the sitemap or navigation of an HTML website.
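As a rough illustration of what "yielding a hierarchy of items" can look like, the sketch below walks a nested list and yields one dictionary per entry, each pointing back at its parent item. It is not the section's implementation: the standard library's ``xml.etree.ElementTree`` stands in for ``lxml.etree``, the ``NAV`` markup and the ``walk`` function are hypothetical, and the ``_path``, ``title`` and ``_parent`` keys are borrowed from the example output further down. The folder and default-page handling of the real section is left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical navigation markup, used only for this sketch.
NAV = """
<ul class="nav">
  <li><a href="a.html">A</a>
    <ul>
      <li><a href="a/one.html">One</a></li>
    </ul>
  </li>
  <li><a href="b.html">B</a></li>
</ul>
"""


def walk(ul, parent=None):
    """Yield one dict per <li>, depth first, each linked to its parent item."""
    for li in ul.findall("li"):
        link = li.find("a")
        item = {
            "_path": link.get("href", ""),
            "title": (link.text or "").strip(),
        }
        if parent is not None:
            item["_parent"] = parent
        yield item
        # Descend into nested lists, passing the current item as the parent.
        for child_ul in li.findall("ul"):
            yield from walk(child_ul, parent=item)


items = list(walk(ET.fromstring(NAV)))
```

Here ``items`` holds the entries in document order, and the nested entry carries a ``_parent`` reference to the item for the list entry that contains it.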
Options starting with ``element-`` may contain expressions whose value will be inserted into the element items. The expressions have access to the following:
==================  ==============================================
``element``         the current walked element
``item``            the current walked element item to be yielded
``source_item``     the original item containing the walked tree
``tree``            the original walked tree
``transmogrifier``  the transmogrifier
``name``            the name of the walker section
``options``         the walker section options
``modules``         ``sys.modules``
==================  ==============================================
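To make the role of these names concrete, here is a minimal sketch of how an ``element-`` option could be turned into a key on the yielded item. It is an assumption-laden stand-in, not the package's expression machinery: the hypothetical ``evaluate`` helper uses plain ``eval`` and only understands the ``python:`` prefix, ``xml.etree.ElementTree`` replaces ``lxml``, and ``element.text`` is used where the configuration below calls lxml's ``text_content()``. The mapping from the option ``element-title`` to the key ``title`` is inferred from the example configuration and its output.

```python
import sys
import xml.etree.ElementTree as ET


def evaluate(expression, **names):
    """Evaluate a ``python:`` expression with the given names in scope.

    Hypothetical helper for illustration; the real section relies on the
    transmogrifier's own expression support.
    """
    prefix = "python:"
    if not expression.startswith(prefix):
        raise ValueError("this sketch only handles python: expressions")
    return eval(expression[len(prefix):], {"modules": sys.modules}, names)


element = ET.fromstring(
    '<a href="../foo-tab/index.html">Foo Tab Default Page</a>')
options = {
    "element-_path":
        "python:element.attrib.get('href', element.attrib.get('src', ''))",
    "element-title":
        "python:element.text.strip() or element.attrib.get('alt', '')",
}

item = {}
for option, expression in options.items():
    # Assumed convention: the part after "element-" becomes the item key.
    key = option[len("element-"):]
    item[key] = evaluate(
        expression, element=element, item=item, options=options)
```

After the loop, ``item`` contains a ``_path`` taken from the ``href`` attribute and a ``title`` taken from the link text.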
Start with an HTML file containing a hierarchical navbar.
>>> import os
>>> html_file = os.path.join(
... os.path.dirname(__file__), 'xmlwalker.html')
>>> infologger = """
... [transmogrifier]
... pipeline =
...     source
...     parse
...     walk
...     defaultpage
...     clean
...     logger
...
... [source]
... blueprint = collective.transmogrifier.sections.tests.rangesource
... size = 1
...
... [parse]
... blueprint = collective.transmogrifier.sections.inserter
... key = string:_trees
... value = python:modules['lxml.html'].parse('{}').xpath(\
...     "//*[contains(@class, 'navbar')]//ul[contains(@class, 'nav')]")
...
... [walk]
... blueprint = collective.transmogrifier.sections.xmlwalker
... element-keys =
...     _path
...     title
... element-_path = python:element.attrib.get(\
...     'href', element.attrib.get('src', ''))
... element-title = python:element.text_content().strip()\
...     or element.attrib.get('alt', '')
...
... [defaultpage]
... blueprint = collective.transmogrifier.sections.inserter
... key = string:_defaultpage
... condition = python:item.get('_parent', dict()).pop('_parent', True)\
...     and item.get('_defaultpage')
... value = exists:item/_defaultpage
...
... [clean]
... blueprint = collective.transmogrifier.sections.manipulator
... delete =
...     _trees
...     _element
...     id
...
... [logger]
... blueprint = collective.transmogrifier.sections.logger
... name = logger
... level = INFO
... """.format(html_file)
>>> registerConfig(u'collective.transmogrifier.sections.tests.xmlwalker',
... infologger)
>>> transmogrifier(u'collective.transmogrifier.sections.tests.xmlwalker')
>>> print(handler)
logger INFO
{}
logger INFO
{'_parent': {}, '_path': '#', '_type': 'Folder', 'title': 'Foo Tab'}
logger INFO
{'_is_defaultpage': True,
'_parent': {'_path': '#', '_type': 'Folder', 'title': 'Foo Tab'},
'_path': '#',
'title': 'Foo Tab'}
logger INFO
{'_parent': {'_path': '#', '_type': 'Folder', 'title': 'Foo Tab'},
'_path': '../foo-tab/index.html',
'title': 'Foo Tab Default Page'}
logger INFO
{'_parent': {'_path': '#', '_type': 'Folder', 'title': 'Foo Tab'},
'_path': '../foo-tab/bar-image.png',
'title': 'Bar Image'}
logger INFO
{'_parent': {'_path': '#', '_type': 'Folder', 'title': 'Foo Tab'},
'_path': '../foo-tab/qux-page.html',
'title': 'Qux Page'}
logger INFO
{'_parent': {}, '_path': '#', '_type': 'Folder', 'title': 'Company'}
logger INFO
{'_is_defaultpage': True,
'_parent': {'_path': '#', '_type': 'Folder', 'title': 'Company'},
'_path': '#',
'title': 'Company'}
logger INFO
{'_parent': {'_path': '#', '_type': 'Folder', 'title': 'Company'},
'_path': '../company/news.html',
'_type': 'Folder',
'title': 'News'}
logger INFO
{'_is_defaultpage': True,
'_parent': {'_path': '../company/news.html',
'_type': 'Folder',
'title': 'News'},
'_path': '../company/news.html',
'title': 'News'}
logger INFO
{'_parent': {'_path': '../company/news.html',
'_type': 'Folder',
'title': 'News'},
'_path': '../company/news.html',
'title': 'News'}
logger INFO
{'_parent': {'_path': '../company/news.html',
'_type': 'Folder',
'title': 'News'},
'_path': '../company/press_releases.html',
'title': 'Press Releases'}
logger INFO
{'_parent': {'_path': '#', '_type': 'Folder', 'title': 'Company'},
'_path': '../company/events.html',
'title': 'Events'}
logger INFO
{'_parent': {'_path': '#', '_type': 'Folder', 'title': 'Company'},
'_path': '../contact_us/contact.html',
'title': 'Contact Us'}
logger INFO
{'_parent': {'_path': '#', '_type': 'Folder', 'title': 'Company'},
'_path': '../company/index.html',
'_type': 'Folder',
'title': 'About Company'}
logger INFO
{'_is_defaultpage': True,
'_parent': {'_path': '../company/index.html',
'_type': 'Folder',
'title': 'About Company'},
'_path': '../company/index.html',
'title': 'About Company'}
logger INFO
{'_parent': {'_path': '../company/index.html',
'_type': 'Folder',
'title': 'About Company'},
'_path': '../company/management.html',
'title': 'Management'}
logger INFO
{'_parent': {'_path': '../company/index.html',
'_type': 'Folder',
'title': 'About Company'},
'_path': '../company/investors.html',
'title': 'Investors'}
logger INFO
{'_parent': {'_path': '../company/index.html',
'_type': 'Folder',
'title': 'About Company'},
'_path': '../company/careers.html',
'title': 'Careers'}
logger INFO
{'_parent': {'_path': '../company/index.html',
'_type': 'Folder',
'title': 'About Company'},
'_path': '../company/company.html',
'title': 'About Us'}