1. What is XML?
XML stands for eXtensible Markup Language. It is widely used to organize, store, and transmit data between systems, and it follows specific rules for encoding data into a document format. For example, suppose we have a file items.xml with the content below:
<data>
  <items>
    <item name="item1" price="5">book</item>
    <item name="item2" price="15">chair</item>
    <item name="item3" price="20">window</item>
  </items>
</data>
The items.xml file consists of nested tags. Each item tag has the attributes name and price. We will use the items.xml file to demonstrate XML file reading in this article.
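To follow along, the sample file can be generated from Python rather than typed by hand. The following is a minimal sketch; the helper name write_items_xml and the default path are assumptions made for illustration, not part of any library.

```python
from pathlib import Path

# Sample document from this article, stored as a string.
ITEMS_XML = """<data>
  <items>
    <item name="item1" price="5">book</item>
    <item name="item2" price="15">chair</item>
    <item name="item3" price="20">window</item>
  </items>
</data>
"""


def write_items_xml(path="items.xml"):
    """Write the sample XML to `path` (hypothetical helper) and return the path."""
    target = Path(path)
    target.write_text(ITEMS_XML, encoding="utf-8")
    return target


if __name__ == "__main__":
    # Creates items.xml in the current working directory.
    print(write_items_xml())
```

Running the script once leaves an items.xml file next to it, which the examples in the following sections can then read.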
Parsing an XML file means reading and analyzing its contents. In Python, we can parse an XML file using the following libraries:
- BeautifulSoup
- ElementTree
- minidom
2. Read XML file with BeautifulSoup
The BeautifulSoup library can read an XML file in Python by delegating the parsing to the lxml parser. To use it, we need to install both libraries with the following commands:
# install beautifulsoup
pip install beautifulsoup4
# install the lxml parser
pip install lxml
You can refer to the article Installing Python and programming environment with Visual Studio Code for instructions on installing Python libraries in Visual Studio Code.
To read an XML file with BeautifulSoup and lxml, we perform two steps: 1) find the tags in the XML, 2) extract data from the tags.
from bs4 import BeautifulSoup
# reading data in items.xml
with open('items.xml', 'r') as f:
    data = f.read()
# passing the data to the BeautifulSoup parser ("xml" selects lxml's XML parser)
bs_data = BeautifulSoup(data, "xml")
# finding all instances of tag item
bs_item = bs_data.find_all('item')
print(bs_item)
# using find() to get a tag with specified attribute
bs_name = bs_data.find('item', {'name':'item1'})
print(bs_name)
# extracting the text stored in a tag
text = bs_name.get_text()
print(text)
# extracting the data stored in a specific attribute of a tag
value = bs_name.get('price')
print(value)
Result
[<item name="item1" price="5">book</item>, <item name="item2" price="15">chair</item>, <item name="item3" price="20">window</item>]
<item name="item1" price="5">book</item>
book
5
The BeautifulSoup functions most often used to read XML files are:
- find_all() finds all specified tags.
- find() finds the first tag that matches the requirement.
- get_text() retrieves the text of the tag.
- get() retrieves the value of an attribute of a tag.
You can learn more about how to use BeautifulSoup at Beautiful Soup Documentation.
3. Read XML file with ElementTree
The ElementTree module provides many tools for working with XML files. It is part of Python's standard library, so we don’t need to install anything extra to use it.
Because XML is a hierarchical format, it is natural to represent it as a tree. The ElementTree module provides methods to represent the entire XML document as a single tree, which makes it well suited to working with XML files.
The ElementTree module provides the ElementTree.parse() function to start parsing the XML file. Then, the getroot() function returns the XML file’s root tag. The children of a tag can be accessed by index, starting from 0, and each tag has an attrib attribute, a dictionary that holds the tag's attributes.
# importing element tree
import xml.etree.ElementTree as ET
# Pass the path of the xml document
tree = ET.parse('items.xml')
# get the root tag
root = tree.getroot()
# print the root tag along with its memory location
print(root)
# print the text of the first <item> (root[0] is <items>, root[0][0] is its first child)
print(root[0][0].text)
# print the attributes of that same <item> as a dictionary
print(root[0][0].attrib)
Result
<Element 'data' at 0x0000023CC11D8D60>
book
{'name': 'item1', 'price': '5'}
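Positional indexing such as root[0][0] depends on the exact layout of the file. As an alternative, the sketch below loops over every item tag with iter() and reads attributes with get(), both of which are part of xml.etree.ElementTree. To stay self-contained it parses the sample XML from an inline string with ET.fromstring() instead of reading items.xml; the variable names and the conversion of price to int are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Same content as items.xml, inlined so the example runs on its own.
xml_text = """<data>
  <items>
    <item name="item1" price="5">book</item>
    <item name="item2" price="15">chair</item>
    <item name="item3" price="20">window</item>
  </items>
</data>"""

root = ET.fromstring(xml_text)

# iter('item') walks the whole tree and yields every <item> in document order.
rows = [
    (item.get("name"), int(item.get("price")), item.text)
    for item in root.iter("item")
]
for row in rows:
    print(row)

# find() accepts a simple path with an attribute predicate and
# returns the first matching element (or None if nothing matches).
chair = root.find("./items/item[@name='item2']")
print(chair.text)
```

Looking tags up by name or attribute keeps the code working if extra tags are later added to the file.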
You can learn more about how to use ElementTree at The ElementTree XML API.
4. Read XML file with minidom
The minidom module is also part of Python's standard library. We only need to import xml.dom.minidom to use it. The module provides the parse() function to read an XML file in Python. With minidom, each tag is represented as an object, and we can access a tag's attributes and text through that object's attributes.
from xml.dom import minidom
# parse file items.xml
file = minidom.parse('items.xml')
# use getElementsByTagName() to get tags
items = file.getElementsByTagName('item')
# one specific item attribute
print('Value of attribute name of item #2:')
print(items[1].attributes['name'].value)
# all attributes of item tags
print('\nAll values of attribute name:')
for elem in items:
    print(elem.attributes['name'].value)
# one specific item's data
print('\nData of item:')
print(items[1].firstChild.data)
print(items[1].childNodes[0].data)
# all items data
print('\nAll item data:')
for elem in items:
    print(elem.firstChild.data)
Result
Value of attribute name of item #2:
item2
All values of attribute name:
item1
item2
item3
Data of item:
chair
chair
All item data:
book
chair
window
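The same data can also be collected into plain Python dictionaries. This is a minimal sketch that uses getAttribute(), a standard DOM method that minidom elements provide as a shortcut for attributes[...].value. The XML is parsed from an inline string with parseString() so that the example is self-contained; the names xml_text and records are assumptions for illustration.

```python
from xml.dom import minidom

# Same content as items.xml, inlined so the example runs on its own.
xml_text = """<data>
  <items>
    <item name="item1" price="5">book</item>
    <item name="item2" price="15">chair</item>
    <item name="item3" price="20">window</item>
  </items>
</data>"""

doc = minidom.parseString(xml_text)

records = []
for elem in doc.getElementsByTagName("item"):
    records.append({
        "name": elem.getAttribute("name"),         # attribute value as a string
        "price": int(elem.getAttribute("price")),  # converted to int for convenience
        "text": elem.firstChild.data,              # text node inside the tag
    })

for record in records:
    print(record)
```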
You can learn more about how to use minidom at Minimal DOM implementation.