Tag Archives: wget


Scraping a WordPress site for text to use with char-rnn

As with anything else in life, it is possible to change nothing but yourself.
The first step toward making change is simply to change yourself.

This quick example shows the gist of scraping a WordPress website using Linux and Lynx. Wget is great at scraping the web, but I have found that it does not always work well with WordPress sites.

Grab some text

Step one of the process is to grab some text to work with. In the example I tried, I grabbed all of the text of the posts on this blog by scraping it with Lynx. I also used the text of the US Constitution to see what the tools would do with it. Generally, the more text, the better the machine learning code will be at generating something interesting.

Scraping the posts

Using the command-line web browser Lynx in a script, I was able to download the text of the posts on this site. Initially I thought to use wget, but I remembered that while wget does a good job downloading static sites, it sometimes does not do so well with ones like this that are built on WordPress.

There is probably a way to loop this code in bash and increment a counter for the pages (a sketch follows below), but since this was a one-time thing, I opted for the quick approach instead of thinking too hard about making a loop.

#!/bin/bash
lynx -dump -nolist http://erick.heart-centered-living.org/ > my-posts.txt
lynx -dump -nolist http://erick.heart-centered-living.org/page/2/ >> my-posts.txt
lynx -dump -nolist http://erick.heart-centered-living.org/page/3/ >> my-posts.txt
lynx -dump -nolist http://erick.heart-centered-living.org/page/4/ >> my-posts.txt
lynx -dump -nolist http://erick.heart-centered-living.org/page/5/ >> my-posts.txt
lynx -dump -nolist http://erick.heart-centered-living.org/page/6/ >> my-posts.txt
lynx -dump -nolist http://erick.heart-centered-living.org/page/7/ >> my-posts.txt
lynx -dump -nolist http://erick.heart-centered-living.org/page/8/ >> my-posts.txt
lynx -dump -nolist http://erick.heart-centered-living.org/page/9/ >> my-posts.txt

This code outputs a file that contains all of the text from the posts on this site, up to the fall of 2018 when I ran it.
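For reference, a looped version might look something like the following. This is just a rough sketch that assumes the same URL pattern and nine pages of posts; adjust the page count for your own site.

#!/bin/bash
# Rough sketch of a looped version; assumes the blog has 9 pages of posts.
lynx -dump -nolist http://erick.heart-centered-living.org/ > my-posts.txt
for page in $(seq 2 9); do
    lynx -dump -nolist "http://erick.heart-centered-living.org/page/${page}/" >> my-posts.txt
done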


Wget an ISO or other large file in the background

Let us forget the past. And remember that the past is a gift of the present, not a substitute for the future.

I was trying to download the Debian testing DVD ISO and it looked like it would take many hours, and I wanted to power off the machine. This was back a while ago with slower internet, but the topic is still relevant. Normally I use a torrent for the distro file, but for the testing branch of Debian none were available.

The solution

I have a Raspberry Pi running 24/7, so I let it do the work overnight; that way I could power down my machine and not worry about the download.
Instead of downloading the file directly on my machine, I grabbed the link to the download location and then executed:

wget -c https://gensho.ftp.acc.umu.se/cdimage/buster_di_alpha2/amd64/iso-dvd/debian-buster-DI-alpha2-amd64-DVD-1.iso
Output...
 --2018-02-07 18:15:27-- https://gensho.ftp.acc.umu.se/cdimage/buster_di_alpha2/amd64/iso-dvd/debian-buster-DI-alpha2-amd64-DVD-1.iso
 Resolving gensho.ftp.acc.umu.se (gensho.ftp.acc.umu.se)... 194.71.11.176, 2001:6b0:19::176
 Connecting to gensho.ftp.acc.umu.se (gensho.ftp.acc.umu.se)|194.71.11.176|:443... connected.
 HTTP request sent, awaiting response... 200 OK
 Length: 3864182784 (3.6G) [application/x-iso9660-image]
 Saving to: `debian-buster-DI-alpha2-amd64-DVD-1.iso'

Success!

Now all I have to do is put the task in the background via Ctrl-Z and then bg, detach from the SSH session into the R-Pi, and it will just download in the background to the hard drive tethered to its USB port. When you enter bg it will still print its progress to the screen, but the terminal can be closed out fine.
There is also a -b option for wget that will launch it into the background from the start.
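The sequence looks roughly like this; the disown step is optional and just guards against the shell sending a hangup signal to the job when the SSH session closes:

# Suspend the running wget with Ctrl-Z, then resume it in the background:
bg
# Optionally detach it from the shell so closing the SSH session cannot interrupt it:
disown
# Alternatively, start wget in the background from the beginning;
# progress then goes to wget-log in the current directory:
wget -b -c https://gensho.ftp.acc.umu.se/cdimage/buster_di_alpha2/amd64/iso-dvd/debian-buster-DI-alpha2-amd64-DVD-1.iso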

ps aux|grep wget

…will confirm that it is still running…

 erick 12438 7.0 2.2 13120 10996 ? S 18:15 2:46 wget -c https://gensho.ftp.acc.umu.se/cdimage/buster_di_alpha2/amd64/iso-dvd/debian-buster-DI-alpha2-amd64-DVD-1.iso

Watch

While in the directory where the file is downloading, the watch command can be used to see the progress of the download…

watch ls -l debian-buster-DI-alpha2-amd64-DVD-1.iso


Output…

Every 2.0s: ls -l debian-buster-DI-alpha2-amd64-DVD... Wed Feb 7 18:56:25 2018

-rw-r--r-- 1 erick erick 280608768 Feb 7 18:56 debian-buster-DI-alpha2-amd64-DVD-1.iso

This will show a progressive increase in file size, in case you want to monitor it.


Create a hidden WordPress page using bash on the command line

Recently I was searching around looking for a way to create a hidden page on a WordPress site. It is a hosted site, not on wordpress.com. It is on a Linux server to which I have shell access.

Initially I tried a plugin I found that hides pages and posts. Plugins: you have to love or hate them. Love them when they work great right out of the box, hate them when they take a long time to troubleshoot.

Rather than waste too much time with the plugin, I went straight to the command line.

Screenshot: making the hidden page on the command line

It turns out that if you publish a page and then log into the hosting server, make a directory somewhere under your public_html, change directory into it and execute…

 wget -x -nH your-page-url-to-hide-here


Screenshot: set the page to Draft or Private

…then go back in and make the page a draft or under review, so it “disappears” from the menu structure. It will still work as a “cached” HTML page that has been downloaded to the folder you created; the pictures and whatnot that you loaded into it will be fully functional.
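Putting the steps together, a minimal sketch looks like the following; the directory name and the URL are placeholders for your own site:

#!/bin/bash
# On the hosting server: make a folder under public_html to hold the hidden copy.
mkdir -p ~/public_html/hidden
cd ~/public_html/hidden
# Download the published page into a matching directory structure,
# without the hostname directory (-x forces directories, -nH drops the host prefix).
wget -x -nH http://your-site-here.org/i-am-a-hidden-page/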

Example of a hidden page

http://erick.heart-centered-living.org/hidden/i-am-a-hidden-page/

Once the original page is put into draft/under review or private mode, it is gone…

http://erick.heart-centered-living.org/i-am-a-hidden-page/

Caveat

I have noticed that caching can get in the way. If your server caches pages, wget may not see the page update when you make changes. A quick remedy is to set the page to draft/pending review or private, delete the hidden page (I usually use rm -rf from the directory above it), and then force wget to download the “404” page. Then you can publish the page, re-run wget, and it will fetch the fresh version. Keep note of the size of the file as a hint that it is getting the right one.
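As a simplified outline of that remedy (omitting the intermediate 404 fetch), with the directory name and URL as placeholders:

# With the page set back to draft/private in WordPress, remove the stale copy;
# run this from the directory above the hidden page's folder:
rm -rf i-am-a-hidden-page
# Publish the page again in WordPress, then re-run wget to pull a fresh copy:
wget -x -nH http://your-site-here.org/i-am-a-hidden-page/
# Checking the file size is a quick hint that the right version came down:
ls -l i-am-a-hidden-page/index.html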

Upcoming: Do this with a CGI Script

In an upcoming post, I will cover how to make a CGI script that will allow you to create a hidden page easily without having to use SSH to login to the server.


wget options used in this example, from the man page

-x
–force-directories
The opposite of -nd—create a hierarchy of directories, even if
one would not have been created otherwise.  E.g. wget -x
http://fly.srk.fer.hr/robots.txt will save the downloaded file to
fly.srk.fer.hr/robots.txt.

-nH
–no-host-directories
Disable generation of host-prefixed directories.  By default,
invoking Wget with -r http://fly.srk.fer.hr/ will create a
structure of directories beginning with fly.srk.fer.hr/.  This
option disables such behavior.

Wget Resources

https://www.lifewire.com/uses-of-command-wget-2201085

https://www.labnol.org/software/wget-command-examples/28750/

The Ultimate Wget Download Guide With 15 Awesome Examples

http://www.linuxjournal.com/content/downloading-entire-web-site-wget

U.S.D.A. Forest Service Webcam Image - Cloud Peak, WY

Active Desktop Wallpaper using wget

It is nice to have a desktop wallpaper that is not static; I like to see an outdoor scene with a good view and a dynamic sky. Wyoming certainly has some ever-changing skies and nice terrain, so I have a wallpaper background set to show the Cloud Peak Wilderness in Wyoming that updates every hour.

It is possible to load a JPG file periodically from a source using the wget command built into Linux. In the example below, I am loading a scene from Cloud Peak, Wyoming, captured by a US Forest Service webcam. It loads right into my home folder, but it could be put in any place you prefer.

There is a nice bunch of pictures taken by the Forest Service from all over the country and they provide some nice high resolution scenery. See the links at the bottom of this post.

Code for script file

#!/bin/bash
rm /home/erick/cpwa1_large.jpg
wget http://www.fsvisimages.com/images/photos-large/cpwa1_large.jpg

The code first removes the old copy of the image and then it uses the wget command to fetch a new copy.

.wgetrc

It is not necessary to modify .wgetrc to use wget, but I put this here as an FYI. There is a configuration file for wget, located at /usr/local/etc/wgetrc; more info on wget config file locations can be found in the wget manual. You can make a copy of it as .wgetrc in your home directory, and once it is there, any modifications to it apply to your user profile. I have mine modified to do a few non-standard things. One is to use timestamping, which makes wget download only when the file it is fetching is newer than the local copy.

# Set this to on to use timestamping by default:
timestamping = on

Secondly, I added a line at the end of the file that sets an option to limit the download rate. Otherwise wget runs as fast as possible and uses the entire bandwidth. This option can also be used on a case-by-case basis by passing it on the command line when wget is called. Doing this keeps wget from slowing down your connection to the Internet and from hitting the server hard with high-speed downloads, which is important if you are downloading multiple large files.

limit-rate=20k
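For the case-by-case version, the same limit can be passed directly on the command line, for example:

wget --limit-rate=20k http://www.fsvisimages.com/images/photos-large/cpwa1_large.jpg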

It is also possible to add a bit of a delay between connections when downloading. This avoids hammering the server when downloading multiple files, which makes it easier on the server load and makes your download activity less likely to be obnoxious to the folks running the server. Obnoxious downloaders and site scrapers are more likely to get banned, I would imagine, if someone notices a spike in server load and pins it down to an IP address.

# It can be useful to make Wget wait between connections.  Set this to
# the number of seconds you want Wget to wait.
wait = 1

Some sites go as far as prohibiting downloads unless the user agent contains a particular string. I have not done this yet, as I have not had a problem with this issue, but it is possible to set the user agent via --user-agent="Acceptable String Here".
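For example, with a placeholder string (whatever the site in question expects would go here):

wget --user-agent="Mozilla/5.0 (X11; Linux x86_64)" http://www.fsvisimages.com/images/photos-large/cpwa1_large.jpg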

More on user-agent modification

CRON entry

01 08-22 * * * /home/erick/cpwa1/wget-cpwa1.sh

Using crontab -e, a line can be loaded into your cron table to run the script periodically. The one above runs every hour, 1 minute after the hour, between 8 AM and 10 PM. There is no sense in loading nighttime pictures, so the times are bounded to load pictures during daylight hours (currently Mountain Daylight Time). The picture I load is updated around 59 minutes after the hour, so loading 1 minute after the hour provides a bit of a guard band of time.

USDA Forest Service Webcams

USDA Forest Service Real Time Image Description Page
USDA Forest Service Real Time Image Gallery