xml2 vs xmltodict
The xml2 package is a binding to libxml2, making it easy to work with HTML and XML from R. The API is somewhat inspired by jQuery.
xml2 can be used to parse HTML documents using XPath selectors and is a successor to R's XML package with a few improvements:
- xml2 takes care of memory management for you. It will automatically free the memory used by an XML document as soon as the last reference to it goes away.
- xml2 has a very simple class hierarchy, so you don't need to think about exactly what type of object you have; xml2 will just do the right thing.
- More convenient handling of namespaces in XPath expressions: see xml_ns() and xml_ns_strip() to get started.
xmltodict is a Python library that allows you to work with XML data as if it were JSON. It allows you to parse XML documents and convert them to dictionaries, which can then be easily manipulated using standard dictionary operations.
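As a minimal sketch of what the resulting dictionaries look like, the example below parses a small, made-up XML document (the library/book structure is purely illustrative). It assumes xmltodict's default conventions: attributes are stored under keys prefixed with "@", and repeated sibling elements are collected into a list. The force_list option is shown as one way to keep the shape predictable when an element may appear only once.

```python
import xmltodict

# Hypothetical sample document, used only for illustration.
xml_string = """
<library>
    <book id="1"><title>Dune</title></book>
    <book id="2"><title>Emma</title></book>
</library>
"""

doc = xmltodict.parse(xml_string)

# Repeated <book> elements are collected into a list of dictionaries.
books = doc["library"]["book"]
titles = [book["title"] for book in books]

# Attributes are stored under keys prefixed with "@".
first_id = books[0]["@id"]

# A single <book> would normally be a plain dictionary, not a list;
# force_list asks xmltodict to always wrap the named elements in a list.
single = "<library><book><title>Dune</title></book></library>"
forced = xmltodict.parse(single, force_list=("book",))
forced_books = forced["library"]["book"]

print(titles, first_id, len(forced_books))
```

Using force_list for elements that can repeat avoids having to check whether a value is a dictionary or a list before iterating over it.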
You can also use the library to convert a dictionary back into an XML document. xmltodict is built on top of the expat parser from Python's standard library (xml.parsers.expat) and provides a simple, intuitive API for working with XML data.
Note that conversion can still be quite slow for large XML documents, so in web scraping it is better used to parse specific snippets rather than whole HTML documents.
xmltodict pairs well with JSON parsing tools like jmespath or jsonpath. Alternatively, it can be used in reverse: convert a JSON document to XML with xmltodict.unparse() and then query it with HTML parsing tools such as CSS selectors and XPath.
It can be installed with pip by running: pip install xmltodict
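The reverse direction can be sketched roughly as follows. The JSON payload and its catalog/book keys are invented for illustration, and the standard library's xml.etree.ElementTree stands in for an XPath-capable parser (it supports only a subset of XPath); parsel or lxml could be swapped in the same way. Note that xmltodict.unparse() expects the dictionary to have exactly one root key.

```python
import json
import xml.etree.ElementTree as ET

import xmltodict

# Hypothetical JSON payload; the single top-level key becomes the XML root.
json_text = """
{"catalog": {"book": [
    {"title": "Dune", "year": "1965"},
    {"title": "Emma", "year": "1815"}
]}}
"""
data = json.loads(json_text)

# Convert the dictionary to an XML string; list items become repeated elements.
xml_text = xmltodict.unparse(data)

# Parse the generated XML (as UTF-8 bytes, matching its declaration)
# and query it with an XPath expression.
root = ET.fromstring(xml_text.encode("utf-8"))
matches = root.findall("./book[year='1965']/title")
titles = [node.text for node in matches]

print(titles)
```

This approach is mostly useful when the selector tooling is already in place for HTML and the JSON is small enough that the extra conversion step is not a concern.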
Example Use
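# R: xml2
library(xml2)

# parse an XML string and inspect it
x <- read_xml("<foo> <bar> text <baz/> </bar> </foo>")
x
xml_name(x)                # name of the root node: "foo"
xml_children(x)            # child nodes of the root
xml_text(x)                # text content of the document
xml_find_all(x, ".//baz")  # XPath query for every <baz> node

# read_html() also copes with unclosed tags
h <- read_html("<html><p>Hi <b>!")
h
xml_name(h)                # "html"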

# Python: xmltodict
import xmltodict
xml_string = """
<book>
    <title>The Great Gatsby</title>
    <author>F. Scott Fitzgerald</author>
    <publisher>Charles Scribner's Sons</publisher>
    <publication_date>1925</publication_date>
</book>
"""
book_dict = xmltodict.parse(xml_string)
print(book_dict)
# {'book': {'title': 'The Great Gatsby',
#           'author': 'F. Scott Fitzgerald',
#           'publisher': "Charles Scribner's Sons",
#           'publication_date': '1925'}}
# and to reverse:
book_xml = xmltodict.unparse(book_dict)
print(book_xml)
# the XML can then be loaded and queried using parsel or BeautifulSoup:
from parsel import Selector
sel = Selector(text=book_xml)
print(sel.css('publication_date::text').get())
# 1925