Bash Script – Create List of URLs from WordPress Sitemap

Sometimes you may wish you had a list of all the post, page and category URLs of your WordPress or WooCommerce site. For example, if you want to batch invalidate posts on Facebook for sharing, it is very handy to have a text file of all your URLs for easy copying and pasting. There is already an online URL extractor for XML sitemaps, but I wanted something easier that would leave me with a raw text file.

Create List of URLs from WordPress Sitemap

The xml2 package on Debian and Ubuntu lets you easily scrape the WordPress sitemap using bash. Install it together with wget.

sudo apt update
sudo apt install wget xml2 -y

Now that the basic tools are installed, you can move on to the scripts.
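
Before running the scripts, it can be worth confirming that both tools are actually on your PATH. The following is a minimal sketch; the function name missing_tools is made up for this example and is not part of wget or xml2.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the name of every command that is NOT installed.
missing_tools() {
    for tool in "$@"; do
        # command -v succeeds only if the shell can find the command
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Check the two tools the scripts rely on
missing=$(missing_tools wget xml2)
if [ -n "$missing" ]; then
    # Unquoted on purpose so the names are joined on one line
    echo "Please install:" $missing >&2
fi
```

If nothing is printed, both commands were found.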

Basic Sitemap Scraper

This script downloads post-sitemap.xml and page-sitemap.xml and writes the URLs they contain to text files.

#!/usr/bin/env bash
# Purpose: WordPress URL sitemap scraper
# Author: Mike
# Source: WP Bullet https://guides.wp-bullet.com

# Site to extract sitemaps from
SITE=https://wp-bullet.com

# Download the post sitemap
wget -q "$SITE/post-sitemap.xml" -O postsitemap.xml

# Parse the XML file and put the URLs into posts.txt
# (strip only up to the first "=" so URLs containing "=" stay intact)
xml2 < postsitemap.xml | grep '/url/loc=' | sed 's/^[^=]*=//' > posts.txt

# Download the page sitemap
wget -q "$SITE/page-sitemap.xml" -O pagesitemap.xml

# Parse the XML file and put the URLs into pages.txt
xml2 < pagesitemap.xml | grep '/url/loc=' | sed 's/^[^=]*=//' > pages.txt

You will be left with a posts.txt and pages.txt file.
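
To see what the grep and sed steps are doing, it helps to look at the flattened lines xml2 prints. The sketch below fakes that output with a here-document so it runs without network access or xml2; the example.com URLs and the function names are made up, and the sample lines are only an approximation of real xml2 output. It uses a sed pattern that removes only the text up to the first =, so a URL with a query string such as ?p=42 survives intact, whereas a greedy .*= would cut it down to 42.

```shell
#!/usr/bin/env bash
# Approximation of what "xml2 < postsitemap.xml" prints (made-up URLs)
sample_xml2_output() {
    cat <<'EOF'
/urlset/@xmlns=http://www.sitemaps.org/schemas/sitemap/0.9
/urlset/url/loc=https://example.com/hello-world/
/urlset/url/lastmod=2020-01-01T00:00:00+00:00
/urlset/url
/urlset/url/loc=https://example.com/?p=42
/urlset/url/lastmod=2020-01-02T00:00:00+00:00
EOF
}

# Keep only the loc lines, then drop everything up to the first "="
extract_locs() {
    grep '/url/loc=' | sed 's/^[^=]*=//'
}

sample_xml2_output | extract_locs
```

Only the two loc lines make it through the filter, one URL per line.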

Multiple post-sitemaps

If you have multiple post sitemap pages in the format post-sitemap1.xml then this script will help extract all of the URLs from each sub-sitemap.

You have to set the number of post sitemaps yourself by changing the range {1..6} in the loop.

#!/usr/bin/env bash
# WordPress URL sitemap scraper for multiple post-sitemaps
# Author Mike from WP Bullet https://guides.wp-bullet.com

# Site to extract sitemaps from
SITE=https://wp-bullet.com

# Loop over the numbered post-sitemap files (6 in this example)
for i in {1..6}
do
    wget -q "$SITE/post-sitemap$i.xml" -O postsitemap.xml
    # Parse the XML file and put the URLs into posts-$i.txt
    xml2 < postsitemap.xml | grep '/url/loc=' | sed 's/^[^=]*=//' > "posts-$i.txt"
done
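
Bash expands {1..6} before it substitutes variables, so something like {1..$COUNT} will not produce a range. If you would rather keep the count in a variable, a counting loop is one option. This is a sketch only: COUNT and the function name post_sitemap_urls are invented for the example, and it just prints the sitemap URLs rather than downloading them.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the URL of each numbered post sitemap.
# $1 = site base URL, $2 = number of post sitemaps
post_sitemap_urls() {
    site=$1
    count=$2
    i=1
    while [ "$i" -le "$count" ]; do
        printf '%s/post-sitemap%s.xml\n' "$site" "$i"
        i=$((i + 1))
    done
}

SITE=https://wp-bullet.com
COUNT=3   # assumed number of post sitemaps, adjust for your site
post_sitemap_urls "$SITE" "$COUNT"
```

Each printed URL could then be passed to wget and xml2 in the same way as in the loop above.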

If you want to combine (concatenate) the multiple posts-1.txt, posts-2.txt files into allposts.txt, this will do the trick.

cat posts-*.txt > allposts.txt

You can also use sed to combine the files into one. With -n nothing is printed to the terminal, and the w command writes every input line to allposts.txt.

sed -n w"allposts.txt" posts-{1..6}.txt
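
One thing to keep in mind with the wildcard version is that the shell expands posts-*.txt in alphabetical order, so posts-10.txt comes before posts-2.txt. If order does not matter but duplicates do, sort -u is one way to merge the lists. The sketch below rests on assumptions: the function name merge_url_lists and the output file name allposts-unique.txt are made up for the example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: merge URL lists, dropping duplicate lines.
# $1 = output file, remaining arguments = input files
merge_url_lists() {
    out=$1
    shift
    # sort -u sorts all input lines and keeps one copy of each
    sort -u "$@" > "$out"
}

# Example usage (only runs if matching files exist)
if ls posts-*.txt >/dev/null 2>&1; then
    merge_url_lists allposts-unique.txt posts-*.txt
fi
```

The output name deliberately does not match posts-*.txt, so re-running the command never feeds the merged file back into itself.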

Convert All Sitemaps

This script reads the sitemap index, so it turns every sub-sitemap into a text file no matter how many you have.

#!/usr/bin/env bash
# Purpose: WordPress URL sitemap scraper
# Author: Mike
# Source: WP Bullet https://guides.wp-bullet.com

# Site to extract sitemaps from
SITEMAPBASE=https://guides.wp-bullet.com

# Name of the main sitemap
SITEMAPXML=sitemap_index.xml

# Grab the sitemap index
wget -q "$SITEMAPBASE/$SITEMAPXML" -O /tmp/sitemap.xml

# Turn the sitemap index into an array of sub-sitemap URLs
SITEMAPARRAY=($(xml2 < /tmp/sitemap.xml | grep '/sitemapindex/sitemap/loc=' | sed 's#^[^=]*=##'))

# Loop through the array, grab each sub-sitemap and turn it into a text file
for SITEMAPELEMENT in "${SITEMAPARRAY[@]}"; do
    echo "$SITEMAPELEMENT"
    wget -q "$SITEMAPELEMENT" -O /tmp/tempsitemap.xml
    # Strip the site prefix and the .xml extension to get the text file name
    SITEMAPTXTNAME=$(echo "$SITEMAPELEMENT" | sed "s#$SITEMAPBASE/##" | sed 's#\.xml$##')
    echo "$SITEMAPTXTNAME"
    xml2 < /tmp/tempsitemap.xml | grep '/url/loc=' | sed 's/^[^=]*=//' > "$SITEMAPTXTNAME.txt"
done
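
The file name step in the loop can also be written with shell parameter expansion instead of two sed calls. The sketch below is an alternative, not the author's method: the function name sitemap_txt_name is made up, and unlike the script it strips everything up to the last slash rather than only the SITEMAPBASE prefix.

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn a sub-sitemap URL into a text file name.
sitemap_txt_name() {
    name=${1##*/}      # drop everything up to the last "/"
    name=${name%.xml}  # drop a trailing ".xml" if present
    printf '%s.txt\n' "$name"
}

sitemap_txt_name https://guides.wp-bullet.com/post-sitemap.xml      # prints post-sitemap.txt
sitemap_txt_name https://guides.wp-bullet.com/category-sitemap.xml  # prints category-sitemap.txt
```

This avoids starting two extra processes per sub-sitemap, although for a handful of sitemaps the difference is negligible.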

If you want to combine (concatenate) the multiple post-sitemap text files (post-sitemap1.txt, post-sitemap2.txt and so on) into allposts.txt, this will do the trick.

cat post-sitemap*.txt > allposts.txt

Now you have a list of posts, pages, categories and whatever else you may have a sitemap for to paste or work with as you please.
