Daily Archives: May 1, 2023

laptop

Scraping a WordPress site for text to use with char-rnn

As with anything else in life, it is possible to change nothing but yourself.
The first step toward making change is simply to change yourself.

In this quick example the gist of scraping a WordPress website using Linux and Lynx will be shown. Wget is great at scraping from the web but, I have found out that it does not always work well with WordPress sites.

Grab some text

Step one of the process is to grab some text to work with. In the example I tried I grabbed all of the text of the posts of this blog by scraping it with wget. I also used the text of the US Constitution to see what the tools would do with it as well. Generally the more text, the better the machine learning code will be at generating something interesting.

Scraping the posts

Using the command line web browser lynx in a script I was able to download the text of the posts on this site. Initially I thought to use wget. But, I remembered that wget will do a good job downloading static sites and sometimes will not do so well with ones like this one that is created in WordPress.

There is probably a way to loop this code in bash, and increment a counter for the pages. But, being that this is a one time thing, I opted for a quick approach instead of thinking too hard on making a loop.

#!/bin/bash
lynx -dump -nolist http://erick.heart-centered-living.org/ > my-posts.txt
lynx -dump -nolist http://erick.heart-centered-living.org/page/2/ >> my-posts.txt
lynx -dump -nolist http://erick.heart-centered-living.org/page/3/ >> my-posts.txt
lynx -dump -nolist http://erick.heart-centered-living.org/page/4/ >> my-posts.txt
lynx -dump -nolist http://erick.heart-centered-living.org/page/5/ >> my-posts.txt
lynx -dump -nolist http://erick.heart-centered-living.org/page/6/ >> my-posts.txt
lynx -dump -nolist http://erick.heart-centered-living.org/page/7/ >> my-posts.txt
lynx -dump -nolist http://erick.heart-centered-living.org/page/8/ >> my-posts.txt
lynx -dump -nolist http://erick.heart-centered-living.org/page/9/ >> my-posts.txt

This code will output a file that contains all of the text from the posts on this site. Up to Fall of 2018 when I ran it.