So let’s say you want to download a file from a website, or maybe two files, heck maybe you want the whole site and everything in it. In that case wget is your friend. I’m going to assume that if you found this page you can already figure out how to download/install wget.
So let’s start with the basics.
To download a single file:
wget http://example.com/supercool.zip
Multiple files? Just add a space between them
wget https://example.com/supercool.zip https://example.com/superbad.zip
If you have a whole list of files you need to download, throw the URLs in a text file and feed it to wget with -i:
wget -i urls.txt
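The text file is nothing fancy, just one URL per line. For the examples above, urls.txt would look something like this:
https://example.com/supercool.zip
https://example.com/superbad.zip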
Use -O if you want to rename the file you’re downloading (it’ll save the file as homework.zip):
wget -O homework.zip https://example.com/xxx.zip
If you’re getting rate-limited or have a bad connection, you can have it keep retrying (50 times in this case):
wget --tries=50 https://example.com/cool.zip
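If the server is slow or flaky rather than just rate-limiting you, you can also add --waitretry so wget pauses between failed attempts (it backs off linearly up to the number of seconds you give it). The 10 here is just an example value:
wget --tries=50 --waitretry=10 https://example.com/cool.zip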
Did you spill your chocolate milk on the keyboard mid-download and it aborted? No problem, resume downloads with -c
wget -c https://example.com/10000000gig.zip
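For a connection that keeps dropping, one combination worth trying (not part of the command above, just standard wget options) is -c plus infinite retries and a read timeout. --tries=0 means retry forever, apart from fatal errors like a 404:
wget -c --tries=0 --timeout=30 https://example.com/10000000gig.zip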
Change the user agent to pretend to be a regular web browser (some sites refuse wget’s default user agent). Lists of current browser user-agent strings are easy to find online if the one below goes stale:
wget --user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:129.0) Gecko/20100101 Firefox/129.0" https://example.com/cia.zip
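Some sites also check the Referer header. The --header option lets you send any extra header along with the request; the Referer value here is just an example, and you can combine it with the --user-agent option above:
wget --header="Referer: https://example.com/" https://example.com/cia.zip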
Ok, now let’s say you have a directory full of links you want
wget -r -nH --cut-dirs=2 -np -R "index.html*" http://website.com/dir1/dir2/data
- The -r tells it to download recursively
- The -nH tells it not to include the domain name.
- We don’t need those first two directories, so --cut-dirs=2 skips them when creating the local directory structure
- The -np (no parent) makes sure it doesn’t climb back up to a parent directory if there’s a backlink somewhere on the page
- The -R "index.html*" rejects any index.html files, because this is a directory listing page. That means all you actually download are the linked files, and they won’t be buried under two levels of empty directories full of blank HTML files with long URLs for names.
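If the short flags are hard to keep straight, here’s the exact same command spelled out with long options:
wget --recursive --no-host-directories --cut-dirs=2 --no-parent --reject "index.html*" http://website.com/dir1/dir2/data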
Now let’s download the entire website
wget --mirror --page-requisites -k -P TheFilesGo/Here/ https://www.funtimesbiz.com
- The --mirror tells it to download everything with a structure identical to the site’s original layout (it’s shorthand for recursion with infinite depth plus timestamping)
- The --page-requisites tells it to also grab any secondary files like CSS, JS and images
- The -k tells it to convert any links to the local directory structure so they’re not broken in your local copy
- The -P TheFilesGo/Here tells it to download everything into that directory instead of the current one.
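One flag that command doesn’t use, but that’s handy for offline browsing, is -E / --adjust-extension, which tacks .html onto pages that don’t already have the extension so they open cleanly from disk. A variant with it added (and -k spelled out as --convert-links) might look like:
wget --mirror --page-requisites --convert-links --adjust-extension -P TheFilesGo/Here/ https://www.funtimesbiz.com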
Now, let’s say you’re grabbing a lot of files and you don’t want to get rate-limited/banned. You can add -w 3 and it will wait three seconds between transfers (adjust as needed). If you also add --random-wait after that, it will randomize the delay between transfers to somewhere between 0.5 and 1.5 times the value you chose with -w.
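Putting that together with the mirror command from above (the three-second wait is just a starting point, and --limit-rate is another standard option if you also want to cap bandwidth):
wget --mirror --page-requisites -k -w 3 --random-wait --limit-rate=500k -P TheFilesGo/Here/ https://www.funtimesbiz.com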
A quick rundown of other useful wget options:
- -q is quiet mode: it turns off wget’s normal output.
- -A gif,jpg would tell it to only download gif and jpg files.
- -A "resume*" would only download files whose names start with the word resume.
- -R "*cupcakes*" would NOT download anything with the word cupcakes in the name.
- -o logfile.log will put all the info it would have printed to the screen into a log file.
- -b will download files in the background, useful when you have other stuff to do in the terminal.
- --no-check-certificate will skip the certificate check in case the source has an invalid or missing SSL cert.
- --load-cookies cookies.txt will load the cookies from that text file and send them to the site, great for bypassing having to re-login each time.
- --save-cookies file.txt will save any non-expired cookies to that file before exiting.
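As a rough sketch of how the cookie options fit together, you could log in once with --post-data and reuse the session afterwards. The login URL, form field names and credentials below are made up, so check the site’s actual login form first:
wget --save-cookies cookies.txt --keep-session-cookies --post-data "username=me&password=secret" https://example.com/login
wget --load-cookies cookies.txt https://example.com/members/secret-files.zip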
Now you should be ready to scrape that site you’ve been after. A fun site to practice on is The-Eye.eu, as it hosts terabytes of media in open directory listings.
Wget is your friend. DNS is your enemy.