rwscan(1)			SiLK Tool Suite			     rwscan(1)

NAME
       rwscan -	Detect scanning	activity in a SiLK dataset

SYNOPSIS
	 rwscan	[--scan-model=MODEL] [--output-path=PATH]
	       [--trw-internal-set=SETFILE]
	       [--trw-theta0=PROB] [--trw-theta1=PROB]
	       [--no-titles] [--no-columns] [--column-separator=CHAR]
	       [--no-final-delimiter] [{--delimited | --delimited=CHAR}]
	       [--integer-ips] [--model-fields]	[--scandb]
	       [--threads=THREADS] [--queue-depth=DEPTH]
	       [--verbose-progress=CIDR] [--verbose-flows]
	       [ {--verbose-results | --verbose-results=NUM} ]
	       [--site-config-file=FILENAME]
	       [FILES...]

	 rwscan	--help

	 rwscan	--version

DESCRIPTION
       rwscan reads sorted SiLK	Flow records, performs scan detection analysis
       on those	records, and outputs textual columnar output for the scanning
       IP addresses.  rwscan writes its output to the --output-path or to the
       standard output when --output-path is not specified.

       The types of scan detection analysis that rwscan	supports are Threshold
       Random Walk (TRW) and Bayesian Logistic Regression (BLR).  Details
       about these techniques are described in the "METHOD OF OPERATION"
       section below.

       rwscan is designed to write its data into a database.  This database
       can be queried using the	rwscanquery(1) tool.  See the "EXAMPLES"
       section for the recommended database schema.

       The input to rwscan should be pre-sorted	using rwsort(1)	by the source
       IP, protocol, and destination IP	(i.e., --fields=sip,proto,dip).
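
       As an illustration of the required ordering (not of rwsort's
       implementation), the following Python sketch sorts a few hypothetical
       (sip, proto, dip) tuples with the same key order, so that all records
       for one source and protocol end up adjacent.  The sample addresses and
       the tuple layout are assumptions made for the example.

```python
import ipaddress
from itertools import groupby

# Hypothetical flow records reduced to (sip, proto, dip); real SiLK
# Flow records carry many more fields.
flows = [
    ("10.0.0.2", 6, "192.0.2.9"),
    ("10.0.0.1", 17, "192.0.2.5"),
    ("10.0.0.1", 6, "192.0.2.7"),
    ("10.0.0.1", 6, "192.0.2.3"),
]


def sort_key(flow):
    """Order by source IP, then protocol, then destination IP."""
    sip, proto, dip = flow
    # Compare addresses numerically rather than as strings.
    return (ipaddress.IPv4Address(sip), proto, ipaddress.IPv4Address(dip))


ordered = sorted(flows, key=sort_key)

# Once sorted, each (sip, proto) pair forms one contiguous run.
groups = {
    key: [dip for _, _, dip in run]
    for key, run in groupby(ordered, key=lambda f: (f[0], f[1]))
}

for key, dips in groups.items():
    print(key, dips)
```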

       rwscan reads SiLK Flow records from the files named on the command line
       or from the standard input when no file names are specified.  To	read
       the standard input in addition to the named files, use "-" or "stdin"
       as a file name.	If an input file name ends in ".gz", the file is
       uncompressed as it is read.

OPTIONS
       Option names may	be abbreviated if the abbreviation is unique or	is an
       exact match for an option.  A parameter to an option may	be specified
       as --arg=param or --arg param, though the first form is required	for
       options that take optional parameters.

       --scan-model=MODEL
	   Select a specific scan detection model.  If not specified, the
	   default value for MODEL is 0.  See the "METHOD OF OPERATION"
	   section for more details.

	    0  Use the Threshold Random	Walk (TRW) and Bayesian	Logistic
	       Regression (BLR)	scan detection models in series.

	    1  Use only	the TRW	scan detection model.

	    2  Use only	the BLR	scan detection model.

       --output-path=PATH
	   Write the textual output to PATH, where PATH	is a filename, a named
	   pipe, the keyword "stderr" to write the output to the standard
	   error, or the keyword "stdout" or "-" to write the output to	the
	   standard output (and	bypass the paging program).  If	PATH names an
	   existing file, rwscan exits with an error unless the	SILK_CLOBBER
	   environment variable	is set,	in which case PATH is overwritten.  If
	   this	switch is not given, the output	is either sent to the pager or
	   written to the standard output.

       --trw-internal-set=SETFILE
	   Specify an IPset file containing all	valid internal IP addresses.
	   This	parameter is required when using the TRW scan detection	model,
           since the TRW model requires the list of targeted IPs (i.e., the
           IPs at which scanning activity is directed).  This switch is ignored
	   when	the TRW	model is not used.  For	information on creating	IPset
	   files, see the rwset(1) and rwsetbuild(1) manual pages.  Prior to
	   SiLK	3.4, this switch was named --trw-sip-set.

       --trw-sip-set=SETFILE
	   This	is a deprecated	alias for --trw-internal-set.

       --trw-theta0=PROB
	   Set the theta_0 parameter for the TRW scan model to PROB, which
	   must	be a floating point number between 0 and 1.  theta_0 is
	   defined as the probability that a connection	succeeds given the
	   hypothesis that the remote source is	benign (not a scanner).	 The
	   default value for this option is 0.8.  This option should only be
	   used	by experts familiar with the TRW algorithm.

       --trw-theta1=PROB
	   Set the theta_1 parameter for the TRW scan model to PROB, which
	   must	be a floating point number between 0 and 1.  theta_1 is
	   defined as the probability that a connection	succeeds given the
	   hypothesis that the remote source is	malicious (a scanner).	The
	   default value for this option is 0.2.  This option should only be
	   used	by experts familiar with the TRW algorithm.

       --no-titles
	   Turn	off column titles.  By default,	titles are printed.

       --no-columns
	   Disable fixed-width columnar	output.

       --column-separator=C
	   Use specified character between columns.  When this switch is not
	   specified, the default of '|' is used.

       --no-final-delimiter
	   Do not print	the column separator after the final column.  Normally
	   a delimiter is printed.

       --delimited
       --delimited=C
	   Run as if --no-columns --no-final-delimiter --column-sep=C had been
	   specified.  That is,	disable	fixed-width column output; if
	   character C is provided, it is used as the delimiter	between
	   columns instead of the default '|'.

       --integer-ips
	   Print IP addresses as decimal integers instead of in	their
	   canonical representation.

       --model-fields
	   Show	scan model detail fields.  This	switch controls	whether
	   additional informational fields about the scan detection models are
	   printed.

       --scandb
           Produce output suitable for loading into a database.  Sample
           database schemas are given below under "EXAMPLES".  This option is
	   equivalent to --no-titles --no-columns --no-final-delimiter
	   --model-fields --integer-ips.

       --threads=THREADS
	   Specify the number of worker	threads	to create for scan detection
	   processing.	By default, one	thread will be used.  Changing this
	   number to match the number of available CPUs	will often yield a
	   large performance improvement.

       --queue-depth=DEPTH
	   Specify the depth of	the work queue.	 The default is	to make	the
	   work	queue the same size as the number of worker threads, but this
	   can be changed.  Normally, the default is fine.

       --verbose-progress=CIDR
	   Report progress as rwscan processes input data.  The	CIDR argument
	   should be an	integer	that corresponds to the	netblock size of each
	   line	of progress.  For example, --verbose-progress=8	would print a
	   progress message for	each /8	network	processed.

       --verbose-flows
	   Cause rwscan	to print very verbose information for each flow.  This
	   switch is primarily useful for debugging.

       --verbose-results
       --verbose-results=NUM
	   Print detailed information on each IP processed by rwscan.  If a
	   NUM argument	is provided, only print	verbose	results	for sources
	   that	sent at	least NUM flows. This information includes scan	model
	   calculations, overall scan scores, etc.  This option	will generate
	   a lot of output, and	is primarily useful for	debugging.

       --site-config-file=FILENAME
	   Read	the SiLK site configuration from the named file	FILENAME.
	   When	this switch is not provided, rwscan searches for the site
	   configuration file in the locations specified in the	"FILES"
	   section.

       --help
	   Print the available options and exit.

       --version
	   Print the version number and	information about how SiLK was
	   configured, then exit the application.
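
       When --integer-ips (or --scandb, which implies it) is in effect,
       addresses appear as decimal integers.  The Python sketch below shows
       one way to convert between that form and dotted-quad notation using
       the standard ipaddress module; the helper names are made up for this
       example and are not part of SiLK.

```python
import ipaddress


def int_to_dotted(value: int) -> str:
    """Convert a decimal IPv4 integer to dotted-quad text."""
    return str(ipaddress.IPv4Address(value))


def dotted_to_int(text: str) -> int:
    """Convert dotted-quad text to its decimal IPv4 integer."""
    return int(ipaddress.IPv4Address(text))


print(int_to_dotted(3232235777))   # 192.168.1.1
print(dotted_to_int("10.0.0.1"))   # 167772161
```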

METHOD OF OPERATION
       rwscan's	default	behavior is to consult two scan	detection models to
       determine whether a source is a scanner.	 The primary model used	is the
       Threshold Random	Walk (TRW) model.  The TRW algorithm takes advantage
       of the tendency of scanners to attempt to contact a large number	of IPs
       that do not exist on the	target network.

       By keeping track	of the number of "hits"	(successful connections) and
       "misses"	(attempts to connect to	IP addresses that are not active on
       the target network), scanners can be detected quickly and with a	high
       degree of accuracy.  Sequential hypothesis testing is used to analyze
       the probability that a source is	a scanner as each flow record is
       processed.  Once	the scan probability exceeds a configured maximum, the
       source is flagged as a scanner, and no further analysis of traffic from
       that host is necessary.
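
       The sequential test can be sketched in Python as a running likelihood
       ratio, which is the textbook form of TRW; rwscan's internal
       bookkeeping may differ.  The defaults for theta0 and theta1 are the
       documented ones.  The two decision thresholds (eta_low, eta_high) and
       the function name are assumptions chosen for illustration, since this
       page does not state the configured minimum and maximum.

```python
def trw_classify(outcomes, theta0=0.8, theta1=0.2,
                 eta_low=0.01, eta_high=100.0):
    """Classify a source from a sequence of connection outcomes.

    outcomes: iterable of booleans, True for a hit (successful
    connection) and False for a miss.
    eta_low / eta_high: assumed thresholds, not rwscan's actual values.
    """
    ratio = 1.0  # likelihood of "scanner" relative to "benign"
    for hit in outcomes:
        if hit:
            # A hit is more likely from a benign source.
            ratio *= theta1 / theta0
        else:
            # A miss is more likely from a scanner.
            ratio *= (1.0 - theta1) / (1.0 - theta0)
        if ratio >= eta_high:
            return "scanner"
        if ratio <= eta_low:
            return "benign"
    # Neither threshold was crossed before the data ran out.
    return "unknown"


print(trw_classify([False, False, False, False]))
print(trw_classify([True, True, True, True]))
print(trw_classify([True, False]))
```

       With these assumed thresholds, each miss multiplies the ratio by
       about 4 and each hit by 0.25, so a short run of misses is enough to
       cross the upper threshold.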

       The TRW model is	not 100% accurate, however, and	only finds scans in
       TCP flow	data. In the case where	the TRW	model is inconclusive, a
       secondary model called BLR is invoked.  BLR stands for "Bayesian
       Logistic	Regression."  Unlike TRW, the BLR approach must	analyze	all
       traffic from a given source IP to determine whether that	IP is a
       scanner.

       Because of this, BLR operates much more slowly than TRW. However, the BLR
       model has been shown to detect scans that are not detected by the TRW
       model, particularly scans in UDP	and ICMP data, and vertical TCP	scans
       which focus on finding services on a single host.  It does this by
       calculating metrics from	the flow data from each	source,	and using
       those metrics to	arrive at an overall likelihood	that the flow data
       represents scanning activity.

       The metrics BLR uses for	detecting scans	in TCP flow data are:

          the ratio of	flows with no ACK bit set to all flows

          the ratio of	flows with fewer than three packets to all flows

          the average number of source	ports per destination IP address

          the ratio of	the number of flows that have an average of 60
	   bytes/packet	or greater to all flows

          the ratio of	the number of unique destination IP addresses to the
	   total number	of flows

          the ratio of	the number of flows where the flag combination
	   indicates backscatter to all	flows
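
       To make a few of the TCP metrics concrete, the Python sketch below
       computes four of them for one source.  The record layout (TcpFlow),
       the flag encoding as a string of letters, and the sample values are
       all assumptions for this example.  How rwscan weights the metrics in
       its regression is not described here and is not modeled.

```python
from typing import NamedTuple


class TcpFlow(NamedTuple):
    """Hypothetical, simplified flow record for one source IP."""
    dip: str
    packets: int
    nbytes: int
    flags: str  # e.g. "S", "SA", "SAF"; "A" means the ACK bit was seen


def tcp_metrics(flows):
    """Compute four of the listed TCP ratios for a non-empty flow list."""
    total = len(flows)
    return {
        "no_ack_ratio": sum("A" not in f.flags for f in flows) / total,
        "small_flow_ratio": sum(f.packets < 3 for f in flows) / total,
        "big_packet_ratio":
            sum(f.nbytes / f.packets >= 60 for f in flows) / total,
        "unique_dip_ratio": len({f.dip for f in flows}) / total,
    }


sample = [
    TcpFlow("192.0.2.1", 1, 40, "S"),
    TcpFlow("192.0.2.2", 1, 40, "S"),
    TcpFlow("192.0.2.3", 2, 80, "S"),
    TcpFlow("192.0.2.3", 10, 5000, "SAF"),
]
metrics = tcp_metrics(sample)
print(metrics)
```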

       The metrics BLR uses for	detecting scans	in UDP flow data are:

          the ratio of	flows with fewer than three packets to all flows

          the maximum run length of IP	addresses per /24 subnet

          the maximum number of unique	low-numbered (less than	1024)
	   destination ports contacted on any one host

          the maximum number of consecutive low-numbered destination ports
	   contacted on	any one	host

          the average number of unique	source ports per destination IP
	   address

          the ratio of	flows with 60 or more bytes/packet to all flows

          the ratio of	unique source ports (both low and high)	to the number
	   of flows

       The metrics BLR uses for	detecting scans	in ICMP	flow data are:

          the maximum number of consecutive /24 subnets that were contacted

          the maximum run length of IP	addresses per /24 subnet

          the maximum number of IP addresses contacted	in any one /24 subnet

          the total number of IP addresses contacted

          the ratio of	ICMP echo requests to all ICMP flows
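
       Three of the address-based ICMP metrics can be sketched in Python as
       follows.  The sketch assumes that "run length" means the longest
       sequence of consecutive addresses contacted within a single /24;
       that reading, the function name, and the sample addresses are
       assumptions rather than details taken from rwscan.

```python
import ipaddress
from collections import defaultdict


def icmp_address_metrics(dips):
    """Summarize the destination addresses contacted by one source."""
    # Map each /24 (upper 24 bits) to the set of host octets seen in it.
    by_subnet = defaultdict(set)
    for text in dips:
        value = int(ipaddress.IPv4Address(text))
        by_subnet[value >> 8].add(value & 0xFF)

    total = sum(len(hosts) for hosts in by_subnet.values())
    max_in_subnet = max((len(h) for h in by_subnet.values()), default=0)

    # Longest run of consecutive host octets inside any one /24.
    max_run = 0
    for hosts in by_subnet.values():
        run = 0
        previous = None
        for host in sorted(hosts):
            if previous is not None and host == previous + 1:
                run += 1
            else:
                run = 1
            max_run = max(max_run, run)
            previous = host

    return {
        "total_ips": total,
        "max_ips_per_24": max_in_subnet,
        "max_run_per_24": max_run,
    }


sample = ["192.0.2.1", "192.0.2.2", "192.0.2.3", "192.0.2.9",
          "198.51.100.7", "192.0.2.2"]
print(icmp_address_metrics(sample))
```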

       Because the TRW model has a lower false positive	rate than the BLR
       model, any source identified as a scanner by TRW	will be	identified as
       a scanner by the	hybrid model without consulting	BLR.  BLR is only
       invoked in the following	cases:

          The traffic being analyzed is UDP or	ICMP traffic, which rwscan's
	   implementation of TRW cannot	process.

          The TRW model has identified	the source as benign.  This occurs
	   when	the scan probability drops below a configured minimum during
	   sequential hypothesis testing.

          The TRW model has identified	the source as unknown (where the scan
	   probability never exceeded the minimum or maximum thresholds	during
	   sequential hypothesis testing).
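
       The decision flow of the hybrid model described above can be
       summarized in a short Python sketch.  The function name, the result
       labels, and the use of a callable for the BLR step are illustrative
       assumptions; only the ordering of the two models comes from the
       description.

```python
TCP = 6  # IANA protocol number for TCP


def hybrid_verdict(proto, trw_result, blr_is_scanner):
    """Combine the two models for one source.

    trw_result: "scanner", "benign", or "unknown" (ignored unless the
    traffic is TCP, since TRW here only handles TCP).
    blr_is_scanner: zero-argument callable that runs the slower BLR
    analysis and returns True or False.
    """
    if proto == TCP and trw_result == "scanner":
        # TRW's verdict is accepted without consulting BLR.
        return ("scanner", "TRW")
    # UDP or ICMP traffic, or a TRW result of benign or unknown.
    if blr_is_scanner():
        return ("scanner", "BLR")
    return ("not flagged", "BLR")


print(hybrid_verdict(6, "scanner", lambda: False))
print(hybrid_verdict(17, None, lambda: True))
print(hybrid_verdict(6, "unknown", lambda: False))
```

       Passing BLR as a callable keeps the expensive analysis from running
       when TRW has already flagged the source.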

       In situations where the use of one model	is preferred, the other	model
       can be disabled using the --scan-model switch.  This may	have an	impact
       on the performance and/or accuracy of the system.

LIMITATIONS
       rwscan detects scans in IPv4 flows only.

EXAMPLES
       In the following	examples, the dollar sign ("$")	represents the shell
       prompt.	The text after the dollar sign represents the command line.
       Lines have been wrapped for improved readability, and the backslash
       ("\") is used to indicate a wrapped line.

   Basic Usage
       Assuming a properly sorted SiLK Flow file as input, the basic usage for
       Bayesian Logistic Regression (BLR) scan detection requires only two
       arguments: the input file, data.rw, and the output file, scans.txt.

	$ rwscan --scan-model=2	--output-path=scans.txt	data.rw

       Basic usage of Threshold	Random Walk (TRW) scan detection requires the
       IP addresses of the targeted network (i.e., the internal	IP space),
       specified in the	internal.set IPset file.

	$ rwscan --trw-internal-set=internal.set --output-path=scans.txt data.rw

   Typical Usage
       More commonly, an analyst uses rwfilter(1) to query the data repository
       for flow	records	within a time window.  First, the analyst has rwset(1)
       put the source addresses	of outgoing flow records into an IPset,
       resulting in the	IPset containing the IPs of active hosts on the
       internal	network.  Next,	the incoming traffic is	piped to rwsort(1) and
       then to rwscan.

	$ rwfilter --start=2004/12/29:00 --type=out,outweb --all-dest=stdout \
	  | rwset --sip=internal.set

	$ rwfilter --start=2004/12/29:00 --type=in,inweb --all-dest=stdout \
	  | rwsort --fields=sip,proto,dip				   \
	  | rwscan --trw-internal-set=internal.set --scan-model=0	   \
	       --output-path=scans.txt

   Storing Scans in a PostgreSQL Database
       Instead of having the analyst run rwscan	directly, often	the output
       from rwscan is put into a database where	it can be queried by
       rwscanquery(1).	The output produced by the --scandb switch is suitable
       for loading into	a database of scans.  The process for using the
       PostgreSQL database is described	in this	section.

       Schemas for Oracle, MySQL, and SQLite are provided below, but the
       details to create users with the proper roles are not included.

       Here is the schema for PostgreSQL:

        CREATE DATABASE scans;

        CREATE SCHEMA scans;

        CREATE SEQUENCE scans_id_seq;

        CREATE TABLE scans (
          id          BIGINT      NOT NULL    DEFAULT nextval('scans_id_seq'),
          sip         BIGINT      NOT NULL,
          proto       SMALLINT    NOT NULL,
          stime       TIMESTAMP without time zone NOT NULL,
          etime       TIMESTAMP without time zone NOT NULL,
          flows       BIGINT      NOT NULL,
          packets     BIGINT      NOT NULL,
          bytes       BIGINT      NOT NULL,
          scan_model  INTEGER     NOT NULL,
          scan_prob   FLOAT       NOT NULL,
          PRIMARY KEY (id)
        );

        CREATE INDEX scans_stime_idx ON scans (stime);
        CREATE INDEX scans_etime_idx ON scans (etime);

       A database user should be created for the purposes of populating	the
       scan database, e.g.:

	CREATE USER rwscan WITH	PASSWORD 'secret';

	GRANT ALL PRIVILEGES ON	DATABASE scans TO rwscan;

       Additionally, a user with read-only access should be created for	use by
       the rwscanquery tool:

	CREATE USER rwscanquery	WITH PASSWORD 'secret';

        GRANT SELECT ON TABLE scans TO rwscanquery;

       To import rwscan's --scandb output into a PostgreSQL database, use a
       command similar to the following:

	$ cat /tmp/scans.import.txt	       \
	  | psql -c			       \
	    "COPY scans			       \
		(sip, proto, stime, etime,     \
		flows, packets,	bytes,	       \
		scan_model, scan_prob)	       \
	    FROM stdin DELIMITER as '|'" scans
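
       The column order in the COPY command above suggests how a line of
       --scandb output can be parsed.  The Python sketch below splits one
       such line into named fields.  The sample line, including its
       timestamp format and values, is invented for illustration, so the
       timestamps are kept as plain strings rather than parsed.

```python
from typing import NamedTuple


class ScanRow(NamedTuple):
    sip: int         # source address as a decimal integer
    proto: int
    stime: str       # kept as text; the exact format is not assumed
    etime: str
    flows: int
    packets: int
    nbytes: int      # the "bytes" column
    scan_model: int
    scan_prob: float


def parse_scandb_line(line, sep="|"):
    """Split one delimited line into a ScanRow."""
    parts = line.rstrip("\n").split(sep)
    if len(parts) != 9:
        raise ValueError(f"expected 9 fields, got {len(parts)}")
    sip, proto, stime, etime, flows, packets, nbytes, model, prob = parts
    return ScanRow(int(sip), int(proto), stime, etime, int(flows),
                   int(packets), int(nbytes), int(model), float(prob))


# Hypothetical sample line, not real rwscan output.
row = parse_scandb_line(
    "3232235777|6|2004-12-29 00:00:05|2004-12-29 00:10:42|120|130|5200|1|0.99\n"
)
print(row)
```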

   Sample Schema for Oracle
        CREATE TABLE scans (
          id          NUMBER(10)          NOT NULL,
          sip         NUMBER(10)          NOT NULL,
          proto       NUMBER(3)           NOT NULL,
          stime       TIMESTAMP           NOT NULL,
          etime       TIMESTAMP           NOT NULL,
          flows       NUMBER(10)          NOT NULL,
          packets     NUMBER(10)          NOT NULL,
          bytes       NUMBER(10)          NOT NULL,
          scan_model  NUMBER(10)          NOT NULL,
          scan_prob   FLOAT               NOT NULL,
          PRIMARY KEY (id)
        );

   Sample Schema for MySQL
	CREATE TABLE scans (
	  id	      integer unsigned	  not null auto_increment,
	  sip	      integer unsigned	  not null,
	  proto	      tinyint unsigned	  not null,
	  stime	      datetime		  not null,
	  etime	      datetime		  not null,
	  flows	      integer unsigned	  not null,
	  packets     integer unsigned	  not null,
	  bytes	      integer unsigned	  not null,
	  scan_model  integer unsigned	  not null,
	  scan_prob   float unsigned	  not null,
	  primary key (id),
	  INDEX	(stime),
	  INDEX	(etime)
        ) ENGINE=InnoDB;

   Sample Schema and Import Command for	SQLite
	CREATE TABLE scans (
	  id	      INTEGER PRIMARY KEY AUTOINCREMENT,
	  sip	      INTEGER		  NOT NULL,
	  proto	      SMALLINT		  NOT NULL,
	  stime	      TIMESTAMP		  NOT NULL,
	  etime	      TIMESTAMP		  NOT NULL,
	  flows	      INTEGER		  NOT NULL,
	  packets     INTEGER		  NOT NULL,
	  bytes	      INTEGER		  NOT NULL,
	  scan_model  INTEGER		  NOT NULL,
	  scan_prob   FLOAT		  NOT NULL
	);
	CREATE INDEX scans_stime_idx ON	scans (stime);
	CREATE INDEX scans_etime_idx ON	scans (etime);

       To import rwscan's --scandb output into a SQLite	database, use the
       following command:

	$ perl -nwe 'chomp;
	    print "INSERT INTO scans VALUES (NULL,",
		  (join	",",map	{ / / ?	qq("$_") : $_ }	split /\|/),
		  ");\n";' \
	scans.txt | sqlite3 scans.sqlite
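
       As an alternative to generating INSERT statements as text, the same
       import can be sketched with Python's standard sqlite3 module and
       parameterized inserts.  The sketch uses an in-memory database and
       two invented sample lines so that it is self-contained.  The
       function name is hypothetical, and real --scandb output may format
       its fields differently.

```python
import sqlite3

SCHEMA = """
CREATE TABLE scans (
  id          INTEGER PRIMARY KEY AUTOINCREMENT,
  sip         INTEGER   NOT NULL,
  proto       SMALLINT  NOT NULL,
  stime       TIMESTAMP NOT NULL,
  etime       TIMESTAMP NOT NULL,
  flows       INTEGER   NOT NULL,
  packets     INTEGER   NOT NULL,
  bytes       INTEGER   NOT NULL,
  scan_model  INTEGER   NOT NULL,
  scan_prob   FLOAT     NOT NULL
);
"""

COLUMNS = ("sip", "proto", "stime", "etime", "flows",
           "packets", "bytes", "scan_model", "scan_prob")


def load_scans(conn, lines, sep="|"):
    """Insert delimited lines into the scans table; return the row count."""
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split(sep)
        if len(fields) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} fields: {line!r}")
        rows.append(fields)
    placeholders = ",".join("?" * len(COLUMNS))
    # The id column is omitted so that SQLite assigns it.
    conn.executemany(
        f"INSERT INTO scans ({','.join(COLUMNS)}) VALUES ({placeholders})",
        rows,
    )
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Hypothetical sample lines, not real rwscan output.
sample = [
    "3232235777|6|2004-12-29 00:00:05|2004-12-29 00:10:42|120|130|5200|1|0.99\n",
    "167772161|17|2004-12-29 01:00:00|2004-12-29 01:05:00|40|40|2400|2|0.87\n",
]
print(load_scans(conn, sample))
```

       To load a file instead, open scans.txt and pass the file object as
       lines, with a path to a database file in place of ":memory:".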

ENVIRONMENT
       SILK_CLOBBER
	   The SiLK tools normally refuse to overwrite existing	files.
	   Setting SILK_CLOBBER	to a non-empty value removes this restriction.

       SILK_CONFIG_FILE
	   This	environment variable is	used as	the value for the
	   --site-config-file when that	switch is not provided.

       SILK_DATA_ROOTDIR
           This environment variable specifies the root directory of the data
           repository.  As described in the "FILES" section, rwscan may use
	   this	environment variable when searching for	the SiLK site
	   configuration file.

       SILK_PATH
	   This	environment variable gives the root of the install tree.  When
	   searching for configuration files, rwscan may use this environment
	   variable.  See the "FILES" section for details.

FILES
       ${SILK_CONFIG_FILE}
       ${SILK_DATA_ROOTDIR}/silk.conf
       /data/silk.conf
       ${SILK_PATH}/share/silk/silk.conf
       ${SILK_PATH}/share/silk.conf
       /usr/local/share/silk/silk.conf
       /usr/local/share/silk.conf
	   Possible locations for the SiLK site	configuration file which are
	   checked when	the --site-config-file switch is not provided.
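
       The search can be pictured with the Python sketch below, which
       builds the candidate list from the environment and returns the
       first file that exists.  It assumes the locations are tried in the
       order listed above and that entries depending on an unset variable
       are skipped.  Both points are assumptions, and the function names
       are invented for this example.

```python
import os


def silk_conf_candidates(environ=os.environ):
    """Return candidate silk.conf paths in the assumed search order."""
    candidates = []
    if environ.get("SILK_CONFIG_FILE"):
        candidates.append(environ["SILK_CONFIG_FILE"])
    if environ.get("SILK_DATA_ROOTDIR"):
        candidates.append(
            os.path.join(environ["SILK_DATA_ROOTDIR"], "silk.conf"))
    candidates.append("/data/silk.conf")
    if environ.get("SILK_PATH"):
        root = environ["SILK_PATH"]
        candidates.append(os.path.join(root, "share", "silk", "silk.conf"))
        candidates.append(os.path.join(root, "share", "silk.conf"))
    candidates.append("/usr/local/share/silk/silk.conf")
    candidates.append("/usr/local/share/silk.conf")
    return candidates


def find_silk_conf(environ=os.environ):
    """Return the first candidate that exists as a file, or None."""
    for path in silk_conf_candidates(environ):
        if os.path.isfile(path):
            return path
    return None


print(silk_conf_candidates({"SILK_DATA_ROOTDIR": "/srv/silk"}))
```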

SEE ALSO
       rwscanquery(1), rwfilter(1), rwsort(1), rwset(1), rwsetbuild(1),
       silk(7)

BUGS
       When used in an IPv6 environment, rwscan	converts IPv6 flow records
       that contain addresses in the ::ffff:0:0/96 prefix to IPv4.  IPv6
       records outside of that prefix are silently ignored.

SiLK 3.22.2			  2025-11-01			     rwscan(1)
