CRAWL(1)		    General Commands Manual		      CRAWL(1)

NAME
       crawl -- a small and efficient HTTP crawler

SYNOPSIS
       crawl [-v level] [-u urlincl] [-e urlexcl] [-i imgincl] [-I imgexcl]
             [-d imgdir] [-m depth] [-c state] [-t timeout] [-A agent] [-R]
             [-E external] [url ...]

DESCRIPTION
       The crawl utility starts a depth-first traversal of the web at the
       specified URLs.  It stores all JPEG images that match the configured
       constraints.

       The options are as follows:

       -v level     The verbosity level of crawl when printing information
                    about URL processing.  The default is 1.

       -u urlincl   A regex(3) pattern that a URL must match in order to
                    be included in the traversal.

       -e urlexcl   A regex(3) pattern that determines which URLs will be
                    excluded from the traversal.

       -i imgincl   A regex(3) pattern that an image URL must match in
                    order to be stored on disk.

       -I imgexcl   A regex(3) pattern that determines which images will
                    not be stored.

       -d imgdir    Specifies the directory under which the images will be
                    stored.

       -m depth     Specifies the maximum depth of the traversal.  A depth
                    of 0 means that only the URLs specified on the command
                    line will be retrieved.  A depth of -1 stands for
                    unlimited traversal and should be used with caution.

       -c state     Continues a traversal that was interrupted previously.
                    The remaining URLs will be read from the file state.

       -t timeout   Specifies the time in seconds that needs to pass
                    between successive accesses of a single host.  The
                    parameter may be a floating point number.  The default
                    is five seconds.

       -A agent     Specifies the agent string that will be included in
                    all HTTP requests.

       -R           Specifies that the crawler should ignore the
                    robots.txt file.

       -E external  Specifies an external filter program that can refine
                    which URLs are to be included in the traversal.  The
                    filter program reads URLs on stdin and, for each one,
                    outputs a single character on stdout: `y' indicates
                    that the URL may be included, `n' means that it should
                    be excluded.
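                    As an illustration, a minimal filter that admits only
                    URLs on a single host might look like the following
                    Python sketch.  The host name and the one URL per
                    line framing are assumptions of this sketch, not
                    guarantees of the crawl interface:

                      #!/usr/bin/env python3
                      # Hypothetical -E filter: admit only URLs on one
                      # host.  crawl writes URLs to our stdin; we answer
                      # each with a single character on stdout.
                      import sys
                      from urllib.parse import urlparse

                      ALLOWED_HOST = "www.example.com"  # assumed host

                      for line in sys.stdin:
                          host = urlparse(line.strip()).hostname or ""
                          answer = "y" if host == ALLOWED_HOST else "n"
                          sys.stdout.write(answer)
                          sys.stdout.flush()  # crawl waits for the answer

                    Such a filter (here saved as the hypothetical
                    ./filter.py) would then be invoked as:

                      crawl -E ./filter.py http://www.example.com/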

       The source code of existing web crawlers tends to be very
       complicated.  In contrast, crawl has a very simple design and
       simple source code.

       A configuration file can be used instead of the command line
       arguments.  The configuration file contains the MIME-type that is
       used for downloads.  To download objects other than images, the
       MIME-type needs to be adjusted accordingly.  For more information,
       see crawl.conf.

EXAMPLES
       crawl -m	0 http://www.w3.org/

       Searches for images on the index page of the World Wide Web
       Consortium without following any other links.
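
       crawl -d /tmp/images -i '\.jpg$' -t 0.5 -m 2 http://www.example.com/

       A hypothetical invocation (the host, directory, and pattern are
       illustrative): stores images whose URLs end in .jpg under
       /tmp/images, follows links up to two levels deep, and waits half
       a second between successive requests to the same host.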

ACKNOWLEDGEMENTS
       This product includes software developed by Ericsson Radio Systems.

       This product includes software developed by the University of
       California, Berkeley and its contributors.

AUTHORS
       The crawl utility has been developed by Niels Provos.

FreeBSD	ports 15.0		 May 29, 2001			      CRAWL(1)
