How to find duplicate files in GNU/Linux

Last updated: Oct 20, 2024

Thanks to http://tips4linux.com/, I've found out how to track down duplicate files on my GNU/Linux system. I've modified the proposed solution to suit my needs. In short, the command retrieves the size of every file and keeps the sizes that occur more than once. For the files sharing one of those sizes, an MD5 hash is then computed to confirm that their contents are actually identical.

Command

We set a SEARCH variable, which will contain the path where we wish to search for duplicate files:

root@host:~# SEARCH=/data
root@host:~# find $SEARCH -not -empty -type f -printf %s\\n | sort -rn | uniq -d | xargs -I{} -n1 find $SEARCH -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate
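To try the command without touching real data, here is a minimal sketch that runs it against a scratch directory. The file names and contents are hypothetical test fixtures, not part of the original tip. This variant also quotes "$SEARCH" so that a path containing spaces still works, and drops -n1, which GNU xargs should not need once -I is given.

```shell
# Scratch directory with hypothetical test files (names are illustrative only)
SEARCH=$(mktemp -d)
printf 'hello\n' > "$SEARCH/a.txt"
printf 'hello\n' > "$SEARCH/b copy.txt"      # same content as a.txt
printf 'world\n' > "$SEARCH/c.txt"           # same size as a.txt, different content
printf 'something else\n' > "$SEARCH/d.txt"  # unique size

# Same pipeline as above, with "$SEARCH" quoted
dupes=$(find "$SEARCH" -not -empty -type f -printf '%s\n' | sort -rn | uniq -d |
  xargs -I{} find "$SEARCH" -type f -size {}c -print0 |
  xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate)

# Only a.txt and "b copy.txt" are reported: c.txt has the same size but another hash
printf '%s\n' "$dupes"

rm -rf "$SEARCH"
```

The c.txt file shows why the size comparison alone is not enough: it passes the size filter but is eliminated by the hash comparison.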

Explanations

find $SEARCH -not -empty -type f -printf %s\\n

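-not: negates the expression that follows (same as ! expr)
-empty: matches empty files and directories; combined with -not, only non-empty files are kept
-type f: search for regular files only
-printf %s\\n: prints each file's size in bytes, followed by a newline (the shell reduces \\n to \n, which find interprets as a newline)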
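As a small illustration of this first step, the sketch below runs the same find expression on a scratch directory. The file names are hypothetical. The format is written as '%s\n' in single quotes, which should be equivalent to the unquoted %s\\n form.

```shell
demo=$(mktemp -d)
printf 'abc' > "$demo/three.txt"   # 3 bytes
: > "$demo/empty.txt"              # 0 bytes, excluded by -not -empty
mkdir "$demo/subdir"               # a directory, excluded by -type f

sizes=$(find "$demo" -not -empty -type f -printf '%s\n')
echo "$sizes"    # prints 3

rm -rf "$demo"
```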

sort -rn

-rn: sorts numerically (-n) and in descending order (-r), so that identical sizes end up on adjacent lines

uniq -d

-d: print only duplicate lines, one for each group
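The sort and uniq steps can be tried on their own with a made-up list of sizes. uniq only compares adjacent lines, which is why the sort has to come first.

```shell
# Hypothetical file sizes: 6 appears three times, 1024 twice, 15 once
repeated=$(printf '%s\n' 6 15 6 1024 6 1024 | sort -rn | uniq -d)
echo "$repeated"    # prints 1024, then 6 (each repeated size once)
```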

xargs -I{} -n1 find $SEARCH -type f -size {}c -print0

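xargs -I{} -n1: replaces {} in the command with each input line and runs one command per line. With GNU xargs, -I already implies one input line per command, so -n1 is redundant here.
find -type f -size {}c -print0: prints the names of the files whose size is exactly {} bytes (the c suffix means bytes; the value is supplied by xargs), each name terminated by a null character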
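A minimal sketch of this step, assuming a scratch directory with hypothetical file names. For readable output it prints the base names with -printf '%f\n' instead of the null-terminated -print0 used in the real command.

```shell
demo=$(mktemp -d)
printf 'abcdef' > "$demo/six-a.txt"   # 6 bytes
printf '123456' > "$demo/six-b.txt"   # 6 bytes
printf 'abc'    > "$demo/three.txt"   # 3 bytes

# xargs substitutes the size read from stdin for {} in the find command
matches=$(echo 6 | xargs -I{} find "$demo" -type f -size {}c -printf '%f\n' | sort)
echo "$matches"    # prints six-a.txt and six-b.txt

rm -rf "$demo"
```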

xargs -0 md5sum

-0: input elements are terminated by a null character instead of whitespace, which matches the output of find -print0. Useful when file names may contain spaces, quotation marks or backslashes.
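The effect of -0 is easiest to see with a file name containing a space. This is a small sketch with a hypothetical file name; without -print0 and -0, xargs would split the name into two arguments and md5sum would fail to open either of them.

```shell
demo=$(mktemp -d)
printf 'hello\n' > "$demo/my file.txt"

# The null-terminated name reaches md5sum as a single argument
sums=$(find "$demo" -type f -print0 | xargs -0 md5sum)
echo "$sums"    # one line: the hash, two spaces, then the full path

rm -rf "$demo"
```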

uniq -w32 --all-repeated=separate

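-w32: compares only the first 32 characters of each line, i.e. the MD5 hash, which is 32 hexadecimal characters long. The preceding sort places identical hashes on adjacent lines, which uniq requires.
--all-repeated=separate: prints all duplicated lines, with groups separated by a blank line.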
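To see the grouping in isolation, the sketch below feeds uniq some lines shaped like md5sum output. The 32-character values are zero-padded placeholders rather than real hashes, and the /data/... paths are hypothetical.

```shell
# Placeholder "hashes", each exactly 32 characters wide
h1=$(printf '%032d' 1); h2=$(printf '%032d' 2); h3=$(printf '%032d' 3)

# printf reuses the format for each pair of arguments: hash, two spaces, path
groups=$(printf '%s  %s\n' "$h1" /data/a "$h1" /data/b "$h2" /data/c "$h3" /data/d "$h3" /data/e |
  sort | uniq -w32 --all-repeated=separate)

# Two groups (a and b, then d and e) separated by a blank line; c is dropped
echo "$groups"
```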