Sometimes you may find yourself wishing you had a list of all the post, page and category URLs of your WordPress or WooCommerce site. For example, if you want to batch invalidate posts on Facebook for sharing, it is very handy to have a text file of all your URLs for easy copying and pasting. An online URL extractor for XML sitemaps already exists, but I wanted something simpler that leaves me with a raw text file.
Create List of URLs from WordPress Sitemap
The xml2 package on Debian and Ubuntu makes it easy to scrape the WordPress sitemap using bash. Install it along with wget:
sudo apt update
sudo apt install wget xml2 -y
Now that the basic tools are installed, you can move on to the scripts.
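To confirm that both tools are actually on the PATH before running anything, a quick check along these lines may help. This is only a minimal sketch; the function name require_tools is hypothetical and not part of the original scripts.

```shell
#!/usr/bin/env bash
# require_tools NAME...: report any command that is not on the PATH.
# Returns 0 when every tool is found, 1 otherwise.
require_tools() {
  local tool missing=0
  for tool in "$@"; do
    if ! command -v "$tool" > /dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage: stop early if wget or xml2 is absent
#   require_tools wget xml2 || exit 1
```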
Basic Sitemap Scraper
This script converts your post-sitemap.xml and page-sitemap.xml into text files.
#!/usr/bin/env bash
# Purpose: WordPress URL sitemap scraper
# Author: Mike
# Source: WP Bullet https://guides.wp-bullet.com
# Site to extract sitemaps from
SITE=https://wp-bullet.com
# Download the post sitemap
wget -q "$SITE/post-sitemap.xml" -O postsitemap.xml
# Parse the XML file and write the URLs to posts.txt
# (strip only up to the first "=" so URLs containing "=" stay intact)
xml2 < postsitemap.xml | grep '/url/loc=' | sed 's/^[^=]*=//' > posts.txt
# Download the page sitemap
wget -q "$SITE/page-sitemap.xml" -O pagesitemap.xml
# Parse the XML file and write the URLs to pages.txt
xml2 < pagesitemap.xml | grep '/url/loc=' | sed 's/^[^=]*=//' > pages.txt
You will be left with a posts.txt and pages.txt file.
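xml2 flattens the sitemap into one path=value line per element, and the grep and sed stages then keep only the loc values. The sketch below isolates that filter stage so it can be tried without downloading anything. The function names extract_locs and sample_flat and the example.com URLs are illustrative assumptions, and the sample lines only approximate what xml2 prints for a real sitemap. The sed expression strips only up to the first =, which keeps URLs with query strings (such as ?p=123) intact.

```shell
#!/usr/bin/env bash
# extract_locs: read xml2-style "path=value" lines on stdin
# and print only the <loc> URLs of <url> entries.
extract_locs() {
  grep '/url/loc=' | sed 's/^[^=]*=//'
}

# sample_flat: illustrative flattened sitemap lines (hypothetical URLs).
sample_flat() {
  cat <<'EOF'
/urlset/@xmlns=http://www.sitemaps.org/schemas/sitemap/0.9
/urlset/url/loc=https://example.com/hello-world/
/urlset/url/lastmod=2017-01-01T00:00:00+00:00
/urlset/url
/urlset/url/loc=https://example.com/?p=123
EOF
}

# Print the two URLs contained in the sample
sample_flat | extract_locs
```

With a real sitemap the same filter would be fed from xml2, for example: xml2 < postsitemap.xml | extract_locs > posts.txt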
Multiple post-sitemaps
If you have multiple post sitemap pages in the format post-sitemap1.xml, this script extracts the URLs from each sub-sitemap.
You do have to set the number of sub-sitemaps yourself in the {1..6} range of the for loop.
#!/usr/bin/env bash
# WordPress URL sitemap scraper for multiple post-sitemaps
# Author Mike from WP Bullet https://guides.wp-bullet.com
# Site to extract sitemaps from
SITE=https://wp-bullet.com
# Loop over the numbered post-sitemap files (6 in this example)
for i in {1..6}
do
  wget -q "$SITE/post-sitemap$i.xml" -O postsitemap.xml
  # Parse the XML file and write the URLs to posts-$i.txt
  xml2 < postsitemap.xml | grep '/url/loc=' | sed 's/^[^=]*=//' > "posts-$i.txt"
done
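If you would rather not count the sub-sitemaps by hand, one option is to keep requesting post-sitemapN.xml until the download fails. The sketch below assumes the server answers a missing sitemap with an HTTP error, in which case wget exits non-zero. Some sites return a normal page instead, so a cap on the number of attempts stops the loop. The function names fetch and scrape_numbered and the default cap of 50 are hypothetical choices, not part of the original script.

```shell
#!/usr/bin/env bash
# Assumption: a missing post-sitemapN.xml yields an HTTP error,
# so wget exits non-zero and the loop stops.

# fetch URL FILE: download URL into FILE (hypothetical helper wrapping wget)
fetch() {
  wget -q "$1" -O "$2"
}

# scrape_numbered SITE [MAX]: write posts-1.txt, posts-2.txt, ... and print
# how many sub-sitemaps were converted. MAX (default 50) caps the loop.
scrape_numbered() {
  local site="$1" max="${2:-50}" i=1 tmp
  tmp=$(mktemp)
  while [ "$i" -le "$max" ] && fetch "$site/post-sitemap$i.xml" "$tmp"; do
    xml2 < "$tmp" | grep '/url/loc=' | sed 's/^[^=]*=//' > "posts-$i.txt"
    i=$((i + 1))
  done
  rm -f "$tmp"
  echo "$((i - 1))"
}

# Usage (performs network requests):
#   scrape_numbered https://wp-bullet.com
```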
To combine (concatenate) posts-1.txt, posts-2.txt and so on into allposts.txt, this will do the trick:
cat posts-*.txt > allposts.txt
You can also use sed to combine the files into one:
sed -n w"allposts.txt" posts-{1..6}.txt
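Plain concatenation keeps blank lines and any URL that appears in more than one file. If that matters for your use, a sorted, de-duplicated merge is one alternative. This is a minimal sketch; combine_lists is a hypothetical helper name, and note that sorting changes the original sitemap order.

```shell
#!/usr/bin/env bash
# combine_lists OUT FILE...: merge URL lists into OUT,
# dropping blank lines and duplicate URLs (output is sorted).
combine_lists() {
  local out="$1"
  shift
  cat "$@" | grep -v '^[[:space:]]*$' | sort -u > "$out"
}

# Usage:
#   combine_lists allposts.txt posts-*.txt
```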
Convert All Sitemaps
This script turns every sitemap into a text file, no matter how many sub-sitemaps you have. It downloads the sitemap index, then converts each sub-sitemap listed in it.
#!/usr/bin/env bash
# Purpose: WordPress URL sitemap scraper
# Author: Mike
# Source: WP Bullet https://guides.wp-bullet.com
# Site to extract sitemaps from
SITEMAPBASE=https://guides.wp-bullet.com
# name of the main sitemap
SITEMAPXML=sitemap_index.xml
# Grab the sitemap index
wget -q "$SITEMAPBASE/$SITEMAPXML" -O /tmp/sitemap.xml
# Turn the sitemap index into an array of sub-sitemap URLs
SITEMAPARRAY=($(xml2 < /tmp/sitemap.xml | grep '/sitemapindex/sitemap/loc=' | sed 's#^[^=]*=##'))
# Loop through the array, grab each sub-sitemap and turn it into a text file
for SITEMAPELEMENT in "${SITEMAPARRAY[@]}"; do
  echo "$SITEMAPELEMENT"
  wget -q "$SITEMAPELEMENT" -O /tmp/tempsitemap.xml
  # Strip the base URL and the trailing .xml to get the output file name
  SITEMAPTXTNAME=$(echo "$SITEMAPELEMENT" | sed "s#$SITEMAPBASE/##" | sed 's#\.xml$##')
  echo "$SITEMAPTXTNAME"
  xml2 < /tmp/tempsitemap.xml | grep '/url/loc=' | sed 's/^[^=]*=//' > "$SITEMAPTXTNAME.txt"
done
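The file name step in the loop above uses sed to strip the base URL and the .xml extension. As an alternative sketch, bash parameter expansion can derive the name without spawning sed. The function name sitemap_txt_name is hypothetical. Unlike the sed version, it keeps only the last path segment, so a sub-sitemap in a subdirectory would still produce a flat file name.

```shell
#!/usr/bin/env bash
# sitemap_txt_name URL: print the text file name for a sub-sitemap URL,
# e.g. .../post-sitemap1.xml becomes post-sitemap1.txt
sitemap_txt_name() {
  local url="$1"
  local name="${url##*/}"   # drop everything up to the last slash
  name="${name%.xml}"       # drop a trailing .xml, if present
  printf '%s.txt\n' "$name"
}

# Prints post-sitemap1.txt
sitemap_txt_name "https://guides.wp-bullet.com/post-sitemap1.xml"
```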
To combine (concatenate) the post sitemap files, such as post-sitemap1.txt and post-sitemap2.txt, into allposts.txt, this will do the trick:
cat post-sitemap*.txt > allposts.txt
Now you have lists of your posts, pages, categories and anything else you have a sitemap for, ready to paste or work with as you please.