WEBCRAWL(1)		    General Commands Manual		   WEBCRAWL(1)

NAME
	 WebCrawl - download web sites,	following links

SYNOPSIS
       webcrawl	[ options ] host[:port]/filename directory

DESCRIPTION
       WebCrawl	 is  a	program	designed to download an	entire website without
       user interaction	(although an interactive mode is available).

       WebCrawl will download the page given as host[:port]/filename into the
       named directory under the compiled-in server root directory (which can
       be changed with the -o option, see below). The web address should not
       contain a leading http://.

       It works	simply by starting with	a single web page, and	following  all
       links  from that	page to	attempt	to recreate the	directory structure on
       the remote server.

       As well as downloading the pages, it also rewrites them so that URLs
       which would otherwise not work on the local system (e.g. URLs that
       begin with http:// or with a /) point to local copies instead.

       It stores the downloaded files in a directory structure that mirrors
       that of the original site, under a directory called
       server.domain.com:port. This way, multiple sites can all be loaded
       into the same directory structure, and if they link to each other, the
       links can be rewritten to point to the local, rather than remote,
       versions.
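
       For example, an invocation along the lines of the following (using
       www.example.com as a placeholder site):

               webcrawl www.example.com/index.html mirror

       would fetch index.html and the pages linked from it, storing them
       under a directory of the form mirror/www.example.com:port beneath the
       server root directory.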

       Comprehensive URL selection facilities allow you	to describe what docu-
       ments  you  want	to download, so	that you don't end up downloading much
       more than you need.

       WebCrawl	is written in ANSI C, and should work  on  any	POSIX  system.
       With  minor modifications, it should be possible	to make	it work	on any
       operating system	that supports TCP/IP sockets. It has been tested  only
       on Linux.

OPTIONS
       URL selection

       -a      This causes the program to ask the user whether to download a
               page that it has not otherwise been instructed to fetch (by
               default, this means off-site pages).

       -f string
	       This  causes  the  program  to always follow links to URLs that
	       contain the string. You can use this, for example, to prevent a
	       crawl from going	up beyond a single directory  on  a  site  (in
	       conjunction  with  the  -x option below); say you wanted	to get
	       http://www.web-sites.co.uk/jules	but not	any other site located
	       on the same server. You could use the command line:

	       webcrawl	-x -f /jules www.web-sites.co.uk/jules/	mirror

               Another use would be if a site contained links to (e.g.)
               pictures, videos or sound clips on a remote server; you could
               use the following command line to get them:

	       webcrawl	-f .jpg	-f .gif	-f .mpg	-f .wav	-f  .au	 www.site.com/
	       mirror

	       Note that webcrawl always downloads inline images.

       -d string
	       The  opposite  of -f, this option tells webcrawl	never to get a
               URL containing the string. -d takes priority over all other
               URL selection options (except that it will not prevent inline
               images from being downloaded; these are always fetched).
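
               For example, to mirror a site while skipping everything whose
               URL contains /cgi-bin/ (an illustrative command line):

               webcrawl -d /cgi-bin/ www.example.com/ mirror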

       -u filename
	       Causes webcrawl to log unfollowed links to the file filename.

       -x      Causes webcrawl not to automatically follow links to pages on
               the same server. This is useful in conjunction with the -f
               option to specify a subsection of an entire site to download.

       -X      Causes webcrawl not to  automatically  download	inline	images
	       (which  it  would  otherwise do even when other options did not
	       indicate	that the image should be loaded).  This	is  useful  in
	       conjunction  with  the  -f option to specify a subsection of an
	       entire site to download,	when even the  images  concerned  need
	       careful selection.
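
               For example, to follow only pages whose URLs contain /docs and
               to fetch only those images whose URLs contain .jpg, a command
               line along these lines could be used (illustrative only):

               webcrawl -x -X -f /docs -f .jpg www.example.com/docs/ mirror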

       Page rewriting

       -n      Turns off page rewriting	completely.

       -rx     Select which URLs to rewrite. Only URLs that begin with / or
               http: are considered for rewriting; all others are always left
               unchanged. This option selects which of these URLs are
               rewritten to point to local files, depending on the value of x.

	       a   all absolute	URLs are rewritten

	       l   Only	URLs that point	to pages on the	same site are  rewrit-
		   ten.

	       f (default)
                   URLs are rewritten only if the file that the rewritten URL
                   would point to exists. Note that rewriting occurs after
                   all links in a page have been followed (if required), so
                   this is probably the most sensible option, and is
                   therefore the default.
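
               For example, to rewrite only URLs that point to pages on the
               same site (an illustrative command line):

               webcrawl -rl www.example.com/ mirror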

       -k      Keep original filenames. This disables the renaming that
               removes metacharacters which may confuse a web server and that
               ensures the filename ends in .html or .htm whenever the page
               has a text/html content type. (See CONFIGURATION FILES below
               for a discussion of how to achieve this with other file
               types.)

       -q      Disable process ID insertion  into  query  filenames.   Without
	       this flag, and whenever -k is not in use, webcrawl rewrites the
	       filenames  of  queries  (defined	as any fetch from a web	server
	       that includes a '?' character in	the filename) to  include  the
	       process	ID  of	the webcrawl fetching the query	in hexadecimal
	       after the (escaped) '?' in the filename;	this may be  desirable
	       if  performing  the  same query multiple	times to get different
	       results.	 This flag disables this behaviour.
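
               As an illustration (the exact name depends on the quoting
               settings described under CONFIGURATION FILES), a page fetched
               as search?q=1 might be stored under a name roughly of the form
               search@3F1a2bq=1, where @3F is the quoted '?' and 1a2b stands
               in for the process ID in hexadecimal.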

       Recursion limiting

       -l[x] number
	       This option is used to limit the	depth to which	webcrawl  will
               search the tree (forest) of interlinked pages. There are two
               parameters that may be set: with x as l, the initial limit is
               set; with x as r, the limit used after jumping to a remote
               site is set. If x is omitted, both limits are set.
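
               For example, the following (illustrative) command line sets
               the initial limit to 5 and the limit used after jumping to a
               remote site to 2:

               webcrawl -ll 5 -lr 2 www.example.com/ mirror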

       General options

       -v      Increases the program's verbosity. Without this option, no
               status reports are made unless errors occur. Used once,
               webcrawl will report which URLs it is trying to download, and
               also which links it has decided not to follow. -v may be used
               more than once, but this is probably only useful for debugging
               purposes.

       -o dir  Change the server root directory. This is  the  directory  that
	       the  path  specified at the end of the command line is relative
	       to.

       -p dir  Change the URL rewriting	prefix.	This is	prepended to rewritten
	       URLs, and should	be a (relative)	URL that points	to the current
	       server root directory. An example of the	use of the -o  and  -p
	       options is given	below:

               webcrawl -o /home/jules/public_html -p /~jules \
                       www.site.com/page.html mirrors

       HTTP-related options

       -A string
	       Causes webcrawl to send the specified string as the HTTP	'User-
	       Agent' value, rather than the  compiled	in  default  (normally
	       `Mozilla/4.05  [en] (X11; I; Linux 2.0.27 i586; Nav)', although
	       this can	be changed in the file web.h at	compile	time).

       -t n    Specifies a timeout, in seconds.	Default	behaviour is  to  give
	       up  after  this	length of time from the	initial	connection at-
	       tempt.

       -T      Changes the timeout behaviour.  With this flag, the timeout oc-
	       curs only if no data is received	from the server	for the	speci-
	       fied length of time.
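
       For example, the following (illustrative) command line sends a custom
       User-Agent string and gives up if no data is received from the server
       for 30 seconds:

               webcrawl -A "MyMirror/1.0" -t 30 -T www.example.com/ mirror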

CONFIGURATION FILES
       At present, webcrawl uses configuration files to specify rules for
       the rewriting of filenames. It looks for /etc/webcrawl.conf,
       /usr/local/etc/webcrawl.conf, and $HOME/.webcrawl, and processes all
       the files it finds, in that order. Parameters set in one file may be
       overridden by subsequent files. Note that it is perfectly possible to
       use webcrawl without a configuration file; one is required only for
       advanced features that are too complex to configure on the command
       line.

       The overall syntax of the configuration file is a set of sections,
       each headed by a line of the form [section-name].

       At present, only	the [rename] section is	defined. This may contain  the
       following commands:

       meta string
	       Sets  metacharacter  list.  Any character in the	list specified
	       will be quoted in filenames produced (unless filename rewriting
	       is disabled with	the  -k	 option).   Quoting  is	 performed  by
	       prepending the quoting character	(default @) to the hexadecimal
	       ASCII   value  of  the  character  being	 quoted.  The  default
	       metacharacter list is: ?&*%=#
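
               For example, with the default settings a '?' (hexadecimal
               ASCII value 3F) appearing in a filename would be quoted as
               @3F.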

       quote char
	       Sets the	quoting	character, as described	above. The default is:
	       @

       type content/type preferred [extra extra	...]
               Sets the list of acceptable extensions for the specified MIME
	       content	type.  The first item in the list is the preferred ex-
	       tension;	if renaming is not disabled (with the -k  option)  and
	       the  extension  of a file of this type is not on	the list, then
	       the first extension on the list will be appended	to its name.

	       An implicit line	is defined internally, which reads:

	       type text/html html htm

               This could be overridden; if, say, you preferred the 'htm'
               extension over 'html', you could use:

	       type text/html htm html

	       in  a  configuration  file  to cause .htm extensions to be used
	       whenever	a new extension	was added.
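
       As a minimal example, a configuration file along these lines could be
       used (the image/jpeg line is purely illustrative):

               [rename]
               meta ?&*%=#
               quote @
               type text/html htm html
               type image/jpeg jpg jpeg

       This keeps the default metacharacter list and quoting character,
       prefers the .htm extension for HTML pages, and accepts either .jpg or
       .jpeg for JPEG images.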

AUTHOR
       WebCrawl	was written by Julian R. Hall <jules@acris.co.uk> with sugges-
       tions and prompting by Andy Smith.

       Bugs should be submitted	to Julian Hall at the  address	above.	Please
       include	information about what architecture, version, etc, you are us-
       ing.

webcrawl			   webcrawl			   WEBCRAWL(1)
