Download offline version of dynamic pages with Wget

24 November, 2006 - 10:43

Reminder, mainly to myself: a short list of useful wget options for recursively downloading dynamic (PHP, ASP, ...) webpages (because wget's man page is too long):

  • --no-clobber: do not redownload pages that already exist locally.
  • --html-extension: append the extension .html to pages whose URL does not end in .html or .htm but in something like .php or .php?q=boo&bar=4.
  • --recursive: turn on recursive downloading.
  • --level=3: set the recursion depth.
  • --convert-links: make the links in downloaded documents point to local files if possible.
  • --page-requisites: download embedded images and stylesheets for each downloaded html document.
  • --relative: only follow relative links, not absolute links (even if in the same domain).
  • --no-parent: do not ascend to parent directory of the given URL while recursively retrieving.
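
As a rough illustration, the options above could be combined into one call along the following lines. This is only a sketch: the function name mirror_site and the example.com URL are made up for illustration, and the default depth of 3 simply mirrors the --level=3 entry in the list. Depending on the wget version, some of these options may be renamed or may not combine well (for instance, recent releases document --html-extension as --adjust-extension), so check wget --help on your system.

```shell
#!/bin/sh
# Sketch only: mirror_site is a hypothetical helper name, not part of wget.
# Usage: mirror_site URL [DEPTH]   (DEPTH defaults to 3, as in the list above)
mirror_site() {
    if [ -z "$1" ]; then
        echo "usage: mirror_site URL [DEPTH]" >&2
        return 1
    fi
    # One option per line, in the same order as the list above.
    wget --no-clobber \
         --html-extension \
         --recursive \
         --level="${2:-3}" \
         --convert-links \
         --page-requisites \
         --relative \
         --no-parent \
         "$1"
}
```

For example, mirror_site "http://example.com/docs/index.php" 2 would fetch that page and follow relative links two levels deep without leaving the docs/ directory.
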
22 December, 2010 - 18:36

Thanks Stefaan for your blog // Mr Mizzen is awesome.

Benjamin Yakubu (not verified)

I used the script provided by Mr Mizzen, and all I can say is this: it gives wget a neat face and makes offline browsing easy; even better than webhttrack.

Great job Mr Mizzen!

3 August, 2010 - 14:52

wget works on http://www.decasasyautos.com

MarkGerard (not verified)

Thanks, we use wget to test http://www.decasasyautos.com and it works.

Mark
http://www.decasasyautos.com

25 March, 2010 - 21:58

Wget recursive

Anonymous (not verified)

I am using wget recursively to download content from websites. However, I want all the files to be saved with their absolute URLs as the file names.
For example http://www.whatever.com/whatever1/whatever2

Can someone help me with this?

Thanks,
M

13 February, 2010 - 13:38

Have you ever wondered how to download photos from a page like...

gggrzesiek (not verified)

:) Have you ever tried to download photos from pages like http://dermatlas.med.jhmi.edu/derm/ using wget?
If yes, please give me a tip :)
Take care

2 March, 2009 - 23:04

Easy when you find it.....

Mr. Mizzen (not verified)

Many thanks for the solution to php links.

I have added your suggestions to my site-ripper script and it works very well.
As a thank you, here is the script with zenity dialog and the desktop file. Very handy to click this, check off what you want, enter the url and let it go....

#!/bin/sh

# export FCBASE=`pwd`
STDOUT=`mktemp`
#
# Place site-ripper.desktop file in /usr/share/applications
# Place site-ripper in /usr/bin
#
################ Begin intro #################################################

zenity --title "Welcome to: Mr. Mizzen's Site Ripper Script" \
--width=700 \
--height=370 \
--list \
--checklist \
--column " " \
--column " Item " \
--column " Description " \
--multiple \
TRUE recurse " -r recursively get files from page(s)" \
TRUE noclobber " -nc Use the No Clobber option" \
TRUE noparent " -np Do not ascend to the parent directory" \
TRUE robots " -e robots=off Ignore the robots instructions" \
FALSE span " -H Span hosts " \
FALSE conver " --convert-links: make the links in downloaded documents point to local files if possible." \
TRUE html " --html-extension: append extension .html to webpages like .php or .php?q=boo&bar=4." \
TRUE page " --page-requisites: download embedded images and stylesheets for each downloaded html document." \
TRUE relative " --relative: only follow relative links, not absolute links (even if in the same domain)." \
> "$STDOUT"
####################### Test for exit #################################
# True = 1, False =0
if [ $? -eq 0 ] ; then
cancelsetup=0 # false
cancelyesno="no do not cancel, continue"
else
cancelsetup=1 # true
cancelyesno="yes cancel"
echo "You selected Cancel"
exit 0
fi
echo "<------------- Here we go! --------------> "
starts=`date +%s`

#####################################################################

levels=$(zenity --entry --text "Levels to drill down? (Default of 1 will get a page, 0 is endless) " --entry-text "1")
site=$(zenity --entry --text "Site URL " --entry-text "")

# Setup all the variables now..............

############################### Update data section #################
if grep recurse $STDOUT > /dev/null ; then
recurse="-r"
else
recurse=""
fi

if grep noclobber $STDOUT > /dev/null ; then
noclobber="-nc"
else
noclobber=""
fi

if grep noparent $STDOUT > /dev/null ; then
noparent="-np"
else
noparent=""
fi

if grep robots $STDOUT > /dev/null ; then
robots="-e robots=off"
else
robots=""
fi

if grep span $STDOUT > /dev/null ; then
span="-H"
else
span=""
fi

if grep conver $STDOUT > /dev/null ; then
conver="--convert-links"
else
conver=""
fi

if grep html $STDOUT > /dev/null ; then
html="--html-extension"
else
html=""
fi

if grep page $STDOUT > /dev/null ; then
page="--page-requisites"
else
page=""
fi

if grep relative $STDOUT > /dev/null ; then
relative="--relative"
else
relative=""
fi

#####################################################################
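echo " This is the command line"
# The option variables are left unquoted on purpose so that empty ones vanish
# and "-e robots=off" splits into two words; the URL and level are quoted.
echo wget $recurse $noclobber $noparent $conver $html $page $relative --tries=2 $span -l "$levels" $robots --user-agent="Microsoft Internet Explorer" "$site"
wget $recurse $noclobber $noparent $conver $html $page $relative --tries=2 $span -l "$levels" $robots --user-agent="Microsoft Internet Explorer" "$site"

# Remove the temporary file that held the checklist selection.
rm -f "$STDOUT"
exit 0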

The desktop file data - Save as site-ripper.desktop

[Desktop Entry]
Encoding=UTF-8
Name=Site-Ripper
Comment=Rip Web Sites
Exec=site-ripper
Icon=gnome-color-browser
OnlyShowIn=GNOME;XFCE;
Terminal=true
Type=Application
StartupNotify=true
Categories=GNOME;GTK;Utility;Site-Ripper
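
The site-ripper script above maps each checklist key to a wget option with one grep block per key. A more compact way to express the same mapping might look like the sketch below. It assumes zenity prints the ticked keys on one line separated by its default "|" separator; the function name selection_to_flags is hypothetical and not part of the original script.

```shell
#!/bin/sh
# Hypothetical helper (not in the original script): translate the
# "|"-separated keys printed by zenity --checklist into wget options.
selection_to_flags() {
    flags=""
    old_ifs=$IFS
    IFS='|'                      # split the selection on zenity's separator
    for key in $1; do
        case "$key" in
            recurse)   flags="$flags -r" ;;
            noclobber) flags="$flags -nc" ;;
            noparent)  flags="$flags -np" ;;
            robots)    flags="$flags -e robots=off" ;;
            span)      flags="$flags -H" ;;
            conver)    flags="$flags --convert-links" ;;
            html)      flags="$flags --html-extension" ;;
            page)      flags="$flags --page-requisites" ;;
            relative)  flags="$flags --relative" ;;
        esac                     # unknown keys are silently ignored
    done
    IFS=$old_ifs
    # Strip the leading space left by the first append.
    printf '%s\n' "${flags# }"
}
```

Inside the script this could be used as flags=$(selection_to_flags "$(cat "$STDOUT")"), after which $flags is passed unquoted to wget so that it splits into separate options. Matching whole keys also avoids the substring matches that a plain grep could produce if a new key ever contained an existing one.
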
