SAMEFILE(1)			   SAMESAME			   SAMEFILE(1)

NAME
       samefile	- find identical files

       samearchive - find identical files, while keeping archives intact

SYNOPSIS
       samefile [-a | -A | -At | -L | -Z | -Zt] [-g size] [-l | -r] [-m
       size] [-S sep] [-0HiqVvx]

       samearchive [-a | -A | -At | -L | -Z | -Zt] [-g size]  [-l  |  -r]  [-m
       size] [-S sep] [-0HiqVv]	dir1 dir2 [...]

DESCRIPTION
       These programs read a list of filenames (one filename per line) from
       stdin and report the identical files on stdout.  samearchive is
       written for the special case where each directory acts as a backup
       archive.  Its output only contains filename pairs that have the same
       relative path from the archive base.  Therefore the output of
       samearchive is a subset of the output of samefile.

       The output consists of six fields: the size in bytes, two filenames
       (with identical contents), the character = if the two files are on the
       same device or X otherwise, and the link counts of the two files.  The
       output is sorted in reverse order by size as the primary key, with a
       secondary key that depends on the sorting option given.
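
       As a minimal sketch of how another program might consume these six
       fields, the following helper splits each line on the default tab
       separator.  The function name describe_pairs is made up for this
       illustration, and the sample line is copied from the example in
       INTERNALS rather than produced by running samefile.

```sh
# describe_pairs: hypothetical helper, not part of samefile.
# Reads samefile-style output on stdin. Fields (tab separated by default):
# size, first name, second name, "=" or "X", link count 1, link count 2.
describe_pairs() {
    awk -F '\t' '
        NF == 6 {
            where = ($4 == "=") ? "same device" : "different devices"
            printf "%s and %s: %d bytes, %s, link counts %d and %d\n",
                   $2, $3, $1, where, $5, $6
        }'
}

# Sample line taken from the INTERNALS example.
printf '100\tfile1\tfile4\t=\t3\t2\n' | describe_pairs
```

       Lines that do not have exactly six fields are ignored, which is an
       assumption of this sketch and not documented behaviour.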

OPTIONS
       -0     Indicates	that the input list of file names is  NUL  terminated,
	      for example as generated by implementations of find(1) that sup-
	      port  the	 -print0  option.  Without this	option,	the file names
	      are assumed to be	newline	terminated.

       -A     Sort filenames alphabetically. (default)

       -At    Sort filenames chronologically using the modification date
              (oldest first).  This option is not available when the
              application was compiled with the low memory profile.

       -a     Do not sort files with the same size alphabetically.

       -g size
	      Compare  only files with size greater than size bytes.  (Default
	      is 0.)

       -H     Print human-friendly statistics when at verbose level 2.

       -i     Allow files with the same	device/i-node pair to be added to  the
	      binary  tree.   This  might be useful if output will be fed into
	      some other program.

       -L     Sort filenames in	reversed natural order	using  the  number  of
	      times the	file was hard linked.

       -l     Do not report whether identical files are	hard linked.  This op-
	      tion reverses the	effects	of the -r option.

       -m size
              Compare only files with size less than or equal to size bytes.
              The default is 0, which indicates there is no limit.

       -q     Keep the information you receive while the program runs to a
              minimum.  (Verbose level 0)

       -r     Report whether identical files are hard linked.	The  separator
	      string  followed	by  the	 [bracketed] link count	is appended to
	      each name	pair if	they are hard links created with ln(1).	  This
	      option  is incompatible with the -l option.  Note	that this kind
	      of output	has only four fields and will appear  unsorted	before
	      the actual output	of samefile.

       -S sep Use  string sep as the output field separator, defaults to a tab
	      character.  Useful if filenames contain tab characters and  out-
	      put must be processed by another program,	say awk(1).

       -V     Print the	version	information and	exit.

       -v     Increase the amount of information you receive while running
              samefile.  At level 0 you will just see the error messages.
              At level 1 you will see warning messages indicating that
              samefile couldn't do something.  At level 2 you will receive
              information about the stages that samefile enters and some
              statistics when samefile finishes.  Defaults to verbose level
              1.

       -x     By default the program prints just 1 x n lines for each set of
              matches, but when this option is used the program prints m x n
              lines for each set of matches.  (E.g. when using the option -i,
              if two files match and one is hard linked twice and the other
              is hard linked three times, you will get 6 lines instead of
              just 2 or 3.)

       -Z     Sort filenames in	reversed alphabetical order.

       -Zt    Sort filenames in reversed chronological order using the
              modification date (youngest first).  This option is not
              available when the application was compiled with the low memory
              profile.
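
       To illustrate the -S option described above: when filenames may
       contain tab characters, a separator that does not occur in any
       filename keeps the fields unambiguous for a downstream tool.  The
       helper name second_names and the sample line are made up for this
       sketch; real input would come from a command such as samefile -S '|'.

```sh
# second_names SEP: hypothetical helper, not part of samefile.
# Reads samefile-style output on stdin and prints the third field
# (the second filename of each pair), splitting on the separator SEP.
second_names() {
    awk -F "$1" '{ print $3 }'
}

# Made-up sample line; the first filename contains a tab character.
printf '100|my\tfile|other file|=|1|1\n' | second_names '|'
```

       With the default tab separator the same line would split into seven
       fields instead of six, so the third field would no longer be the
       second filename.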

INTERNALS
       These programs use two stages to give optimum performance.

       In the first stage, all non-plain files are skipped (directories,
       devices, FIFOs, sockets, symbolic links), as well as files for which
       stat(2) fails and files whose size is less than or equal to the -g
       size or greater than the -m size.

       When the memory is full, samefile will try to store a part of the
       filenames temporarily in /tmp/samefile/<pid>.  When samefile is not
       able to do this, it will raise the minimum size and remove paths from
       memory accordingly.

       In the second stage the filenames that are hard linked are reported
       first, provided option -r was passed to the program.  After this the
       files are compared and identical files are reported.
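
       The two stages can be approximated in a few lines of shell.  This is
       only a rough sketch of the general idea (filter and group by size
       first, compare contents second), not samefile's actual
       implementation: it ignores i-nodes and hard links, the sorting
       options, the -g and -m limits and all memory handling.  The function
       name same_content_pairs and its three-field output are assumptions
       made for the illustration.

```sh
# same_content_pairs: hypothetical sketch, not samefile's actual code.
# Reads one filename per line on stdin; prints "size<TAB>name1<TAB>name2"
# for every pair of plain files with identical contents.
same_content_pairs() {
    list=$(mktemp) || return 1
    tab=$(printf '\t')

    # Stage 1: skip non-plain and unreadable files, record size and name,
    # and sort by size, largest first.
    while IFS= read -r name; do
        [ -f "$name" ] && [ ! -L "$name" ] || continue
        size=$(wc -c < "$name") || continue
        printf '%d\t%s\n' "$size" "$name"
    done | sort -k1,1nr > "$list"

    # Stage 2: compare contents, but only for files of equal size.
    i=0
    while IFS="$tab" read -r size1 name1; do
        i=$((i + 1))
        j=0
        while IFS="$tab" read -r size2 name2; do
            j=$((j + 1))
            [ "$j" -gt "$i" ] && [ "$size1" -eq "$size2" ] || continue
            cmp -s "$name1" "$name2" &&
                printf '%s\t%s\t%s\n' "$size1" "$name1" "$name2"
        done < "$list"
    done < "$list"

    rm -f "$list"
}
```

       It could be tried with ls | same_content_pairs in a small directory.
       The nested loop compares every pair of equally sized files, so it is
       only practical for short lists.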

       For any i-node only one filename	will  be  added	 (unless  -i  was  re-
       quested.)

       For each two i-nodes that match, n lines will be printed that show the
       first filename of the first i-node matched against all the filenames of
       the second i-node.  Note, however, that because only the first filename
       per i-node gets into the second stage, the output for a group of
       identical files with different i-node numbers is also minimized.

       Suppose you have six identical files of size 100 in an i-node group
       consisting of the three i-nodes with numbers 10, 20 and 30 (the term
       i-node group has nothing to do with file systems - it merely refers to
       a set of i-nodes addressing files with identical contents):

       % ls -i
	  10 file1     20 file4	    30 file6
	  10 file2     20 file5
	  10 file3
       % ls | samefile
       100     file1   file4   =       3       2
       100     file1   file6   =       3       1

       The sum of the sizes in the first column is the amount of disk space
       you could gain by making all 6 files links to only one file or by
       removing all but one of the files.  To be precise, disk space is
       allocated in blocks - you will probably gain two blocks here, rather
       than 200 bytes.  Note that it is not enough to just remove file4 and
       file6 (you would gain only 100 bytes because file5 still exists).  The
       proper way is to use the -i option.  The output will look like:

       100     file1   file4   =       3       2
       100     file1   file5   =       3       2
       100     file1   file6   =       3       1

       Removing all files listed in the third field will leave only file1
       and its hard links.  Making all files hard links to file1 is easy: if
       the fourth field is a ``='', do a forced hard link.  If you need to
       know about all combinations of identical files, use both the -i and -x
       options.  This produces:

       % ls | samefile -ix
       100     file1   file4   =       3       2
       100     file1   file5   =       3       2
       100     file2   file4   =       3       2
       100     file2   file5   =       3       2
       100     file3   file4   =       3       2
       100     file3   file5   =       3       2
       100     file1   file6   =       3       1
       100     file2   file6   =       3       1
       100     file3   file6   =       3       1
       100     file4   file6   =       2       1
       100     file5   file6   =       2       1
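
       The forced hard link mentioned above could be scripted roughly as
       follows.  This is a sketch under several assumptions: the input is
       tab-separated output produced with the -i option, no filename
       contains a tab or newline, and the helper name relink_pairs is made
       up.  It replaces files on disk, so it should be tried on a copy
       first.

```sh
# relink_pairs: hypothetical helper, not part of samefile.
# Reads samefile-style output on stdin and replaces the second file of
# each same-device pair ("=" in the fourth field) with a hard link to
# the first file. Pairs on different devices are skipped.
relink_pairs() {
    tab=$(printf '\t')
    while IFS="$tab" read -r size first second device links1 links2; do
        if [ "$device" = "=" ]; then
            ln -f -- "$first" "$second"
        else
            printf 'skipped (different device): %s\n' "$second" >&2
        fi
    done
}
```

       A possible invocation is ls | samefile -i | relink_pairs.  Pairs
       marked X are left alone because hard links cannot cross devices.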

FILES
       /tmp/samefile/<pid>

              When the list is too large to fit into memory, samefile tries
              to temporarily store the paths on disk by creating files within
              the directory /tmp/samefile/<pid>.

       /tmp/samearchive/<pid>

              When the list is too large to fit into memory, samearchive
              tries to temporarily store the paths on disk by creating files
              within the directory /tmp/samearchive/<pid>.

EXAMPLES
       Find all	identical files	in the current working directory:

       % ls | samefile -i

       Find  all  identical  files in my HOME directory	and subdirectories and
       also tell me if there are hard links:

       % find "$HOME" -type f -print | samefile -r

       Find all identical files in the /usr directory tree that are bigger
       than 10000 bytes and write the result to /tmp/usr (that one is for the
       sysadmin folks; you may want to put this command in the background
       with the ampersand & because it takes a few minutes):

       % find /usr -type f -print | samefile -g 10000 > /tmp/usr

       Find all identical files within the system archives that live within
       the current working directory:

       % find /path/to/backup/system-* | samearchive system-*

DIAGNOSTICS
       inaccessible: path
              This is probably due to a 'permission denied' error on files or
              directories within the given path for which you have no read
              permission.

       unreadable: path
              The file could be opened for reading, yet reading it failed.
              You shouldn't encounter such warnings, but if you do, and
              receive more than a few, this could very well be due to a
              failing hard disk.

       <file.cpp>:<line> message
              You can encounter such errors when you've compiled the port
              with debugging information.  Please report such messages to the
              author with some relevant information about how to reproduce
              the bug.

       memory full: written amount path to disk
              The memory was full and a number of paths were temporarily
              written to disk.

       memory full: changed minimum file size to number
              The memory was full and the program couldn't temporarily write
              paths to disk, so it raised the minimum file size to the given
              number.  At a later time you can rerun the program using the
              option -m to check the paths that were skipped as a result.

       memory full: aborting... to manny files with the same size
              There were just too many files with the same size to fit into
              memory from this point on.  Try to split the list up and then
              run the program multiple times.

SEE ALSO
       samearchive-lite(1), sameln(1), samesame(1), find(1), ls(1)

NOTES
       Input  filenames	 must  not have	leading	or trailing white space	unless
       the white space is part of the filename.

HISTORY
       samefile was first written by Jens Schweikhardt in 1996.  It was later
       rewritten by Alex de Kruijff in 2009 in order to improve the
       performance.  In addition, the program became able to handle memory
       allocation problems caused by large lists and gained some additional
       options.

BUGS
       The list is not sorted properly when using the option -x.  This is not
       a bug but a feature.  Proper sorting would consume vast amounts of
       either memory or time.  The sorting options are there just to control
       the output.  (E.g. use -Zt if you intend to link with the file that was
       most recently modified.  You will find that file on the left.)

AUTHOR
       Alex de Kruijff

				 14 APRIL 2009			   SAMEFILE(1)
