by Leon Stafford
The Crawling phase of WP2Static requests each detected URL and saves it in a static site friendly format.
URL Detection | Crawling <--- here | Post Processing | Deployment
When you visit a page in your WordPress site, the contents are assembled from PHP scripts, database entries and any manipulations themes and plugins apply. WP2Static saves a copy of the page as your users see it - after all the assembly has been done.
For each detected URL in our Crawl Queue, WP2Static:
Static site friendly HTML pages
In your WordPress site, visiting a path like
https://example.com/posts/my-sample-post/ relies on your webserver and WordPress to route that URL to show the appropriate post content. There is no directory /posts/ within your WordPress installation.
With a static website, we have no PHP or database to magically route such URLs. Instead, we need to create directories to support the full path being requested, ie
/posts/my-sample-post/. Most static site hosting providers work by routing requests to directories to a default document within that directory, usually
index.html. If this wasn’t the case, URLs in static sites would need to look like:
https://example.com/posts/my-sample-post/index.html - not ideal.
When WP2Static crawls a URL like
https://example.com/posts/my-sample-post/, it will thus create the following directories and file within your Generated Static Site directory:
If everytime we change a page in our WordPress site, WP2Static needed to detect, crawl, post-process and deploy every single URL again, it would be very slow (ask anyone who used versions 6 and earlier!).
When we’ve successfully crawled a URL, we add it to WP2Static’s CrawlCache. On subsequent crawls, we skip any URLs already in our CrawlCache.
Caching is hard!
There are only two hard things in Computer Science: cache invalidation, naming things, and off-by-one errors
I need to give website caching it’s own page in our docs, as there are so many layers of cache between your user and your website. For now, I’ll just explain the decisions made around our Crawl Cache.
I opted to aggressively cache all successfully crawled URLs and incrementally fine-tune the cache-invalidation mechanisms. For now, when you edit/delete a post or page, that invalidates the CrawlCache for that URL, allowing it to be freshly crawled in the next Job.
Because of this, you may find that changing a menu, switching themes or other WordPress content updates don’t properly reflect in your next static site export. To alleviate this, you can manually clear the CrawlCache between jobs in WP2Static’s UI or if using WP-CLI, add the
wp wp2static crawl_cache delete command before the
wp wp2static crawl command.
Before the official release of version 7, I’m likely to either:
HTTP basic authentication credentials
In the core WP2Static plugin, the only Crawling option you can set is if you are using HTTP basic authentication on our development site (highly recommended!), you can enter the username and password on the main WP2Static > Options UI screen (or using WP-CLI).
HTTP basic auth is not required to use WP2Static, so you can leave those 2 fields empty if not using HTTP basic auth. When saved, as with other WP2Static sensitive information, we encrypt the basic auth password when saving to the database.
HTTP basic authentication adds an extra login/password to your development WordPress site. Because only you and your team should be able to access this site, it makes sense to lock it down with an extra layer of security, keeping it away from malicious users and bots. Only your deployed static site needs to be publicly accessible. This is one of the key ways in which WP2Static can offer a much more secure setup for WordPress sites.
wp wp2static crawl_cache list
wp wp2static static_site list
wp wp2static crawl
<?php $crawler = new \WP2Static\Crawler(); $crawler->crawlSite( \WP2Static\StaticSite::getPath() );
You generally don’t need to manually clear the Crawl Queue, as WP2Static will clear it before each new detection Job. It’s also a part of the whole detect, crawl, post-process, deployment Job that takes up very little time, even on very large sites (takes only seconds).
wp wp2static crawl_cache delete
wp wp2static static_site delete
The CrawlCache should look identical to your CrawlQueue (Detected URLs) after a full crawl has completed, ie a sampling:
/ /category/uncategorized/ /category/uncategorized/page/1/ /favicon.ico /hello-world/ /page/1/ /pages/page/1/ /robots.txt /sample-page/ /sitemap.xml /wp-content/plugins/wp-search-with-algolia/css/algolia-autocomplete.css /wp-content/plugins/wp-search-with-algolia/css/algolia-instantsearch.css /wp-content/plugins/wp-search-with-algolia/css/algolia-logo.svg /wp-content/plugins/wp-search-with-algolia/includes/admin/css/algolia-admin.css /wp-content/plugins/wp-search-with-algolia/includes/admin/fonts/algolia.eot /wp-content/plugins/wp-search-with-algolia/includes/admin/fonts/algolia.svg /wp-content/plugins/wp-search-with-algolia/includes/admin/fonts/algolia.ttf /wp-content/plugins/wp-search-with-algolia/includes/admin/fonts/algolia.woff ...
This should look similar to your CrawlCache, noting the
index.html preppended to URLs with HTML content:
/category/uncategorized/index.html /category/uncategorized/page/1/index.html /hello-world/index.html /index.html /page/1/index.html /sample-page/index.html /wp-content/plugins/wp-search-with-algolia/css/algolia-autocomplete.css /wp-content/plugins/wp-search-with-algolia/css/algolia-instantsearch.css /wp-content/plugins/wp-search-with-algolia/includes/admin/css/algolia-admin.css /wp-content/plugins/wp-search-with-algolia/includes/admin/fonts/algolia.woff
In a typical workflow, crawled files are saved to the
/wp-content/uploads/wp2static-crawled-site/ directory. This directory then undergoes the Post-processing phase of a WP2Static Job before deployment.
I have some CSS or XML served from
/feed/paths, will that work?
/feed/ URL is common in WordPress, I’m working on a way to support maintaining those URLs. With static hosting providers offering serverless functions or redirects, this should be possible with a bit of work.
For the case of CSS, images or other files served without file extensions in WordPress, it’s less common and more of an exception, so I’d advise to avoid the theme/plugin that’s generating such URLs. I can look into supporting such cases, but it’s going to be further down the list of priorities. You may be able to workaround such a case with a plugin like Autoptimize, which could make troublesome static asset URLs more friendly for static site export.
Back to top