How to find duplicate files in GNU/Linux

Thanks to this site, I found out how to track down duplicate files on my GNU/Linux system. I adapted the proposed solution to my own needs. In short, the command collects the size of every file and looks for sizes that occur more than once. Files that share a size are then hashed with md5sum to confirm that their contents really are the same.

Configuration

OS : Arch Linux

Command

We set the SEARCH variable to the path in which we want to look for duplicate files:

root@host:~# SEARCH=/data
root@host:~# find $SEARCH -not -empty -type f -printf %s\\n | sort -rn | uniq -d | xargs -I{} -n1 find $SEARCH -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate
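To try the command without touching real data, here is a minimal sketch that runs the same pipeline on a throwaway directory. The file names and contents are made up for the demonstration, $SEARCH is quoted, and -n1 is left out because -I already runs one command per input line; everything else follows the command above.

```shell
# Scratch directory with hypothetical test files (not from the original post).
SEARCH=$(mktemp -d)
printf 'hello\n' > "$SEARCH/a.txt"
printf 'hello\n' > "$SEARCH/b.txt"   # same content as a.txt
printf 'world\n' > "$SEARCH/c.txt"   # same size as a.txt, different content
printf 'x\n'     > "$SEARCH/d.txt"   # unique size, never hashed

# Same stages as the one-liner: sizes -> repeated sizes -> files -> hashes -> groups.
result=$(find "$SEARCH" -not -empty -type f -printf '%s\n' | sort -rn | uniq -d |
  xargs -I{} find "$SEARCH" -type f -size {}c -print0 |
  xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate)

printf '%s\n' "$result"

# Clean up the scratch directory.
rm -r "$SEARCH"
```

Only a.txt and b.txt should be listed: c.txt shares their size but has a different hash, and d.txt is filtered out at the size stage.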

Explanations

find $SEARCH -not -empty -type f -printf %s\\n

-not : negates the test that follows (same as ! expr)
-empty : matches empty files; combined with -not, empty files are skipped
-type f : only look for regular files
-printf %s\\n : print each file's size in bytes, followed by a newline
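As a small illustration of this first stage, the sketch below runs the same find tests on a scratch directory. The directory and file names are assumptions made up for the example.

```shell
# Scratch directory with one 3-byte file and one empty file (hypothetical data).
demo=$(mktemp -d)
printf 'abc' > "$demo/three_bytes"
: > "$demo/empty"

# The empty file is skipped by -not -empty; only the size of the other file is printed.
sizes=$(find "$demo" -not -empty -type f -printf '%s\n')
printf '%s\n' "$sizes"

rm -r "$demo"
```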

sort -rn

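-rn : compare lines by numeric value (-n) and reverse the order (-r), so the largest sizes come first. Sorting also puts identical sizes next to each other, which uniq needs in order to detect them.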

uniq -d

-d : only print duplicate lines, one for each group
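The two steps above can be tried on their own with a hand-written list of sizes. The numbers below are invented for the example; in the real command they come from find.

```shell
# Hypothetical file sizes, one per line, as find -printf '%s\n' would emit them.
sizes=$(printf '%s\n' 120 4096 120 77 4096 4096)

# Sort numerically, largest first, so equal sizes become adjacent,
# then keep one line for every size that occurs more than once.
dup_sizes=$(printf '%s\n' "$sizes" | sort -rn | uniq -d)

printf '%s\n' "$dup_sizes"
```

Here 4096 and 120 are each printed once, and 77 is dropped because no other file has that size.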

xargs -I{} -n1 find $SEARCH -type f -size {}c -print0

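xargs -I{} -n1 : run the command once per input line, replacing {} with that line (here, a file size); -n1 asks for at most one argument per command line.
find -type f -size {}c -print0 : print the names of the files whose size is exactly {} bytes (the value supplied by xargs; the c suffix means bytes), each name terminated by a null character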
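A minimal sketch of this stage, assuming a scratch directory with made-up files. It feeds a single size to xargs by hand and uses -print instead of -print0 so the result is readable on a terminal.

```shell
# Scratch directory: two 5-byte files and one 2-byte file (hypothetical data).
demo=$(mktemp -d)
printf '12345' > "$demo/five_a"
printf 'abcde' > "$demo/five_b"
printf '12'    > "$demo/two"

# xargs substitutes the size read from standard input for {} in the find command.
matches=$(printf '5\n' | xargs -I{} find "$demo" -type f -size {}c -print | sort)
printf '%s\n' "$matches"

rm -r "$demo"
```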

xargs -0 md5sum

-0 : input items are terminated by a null character instead of whitespace. Useful when file names may contain whitespace, quotes or backslashes. md5sum then prints the MD5 hash of each file, followed by its name.
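To see why the null separator matters, the sketch below hashes a file whose name contains a space. The directory and the file name are assumptions for the example.

```shell
# Scratch directory with a file name containing a space (hypothetical data).
demo=$(mktemp -d)
printf 'data\n' > "$demo/my file.txt"

# -print0 and -0 keep the name in one piece, so md5sum receives a single argument.
hashes=$(find "$demo" -type f -print0 | xargs -0 md5sum)
printf '%s\n' "$hashes"

rm -r "$demo"
```

Without -print0 and -0, xargs would split the name at the space and md5sum would be asked to hash two files that do not exist.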

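sort | uniq -w32 --all-repeated=separate

sort : order the lines so that identical hashes end up next to each other
-w32 : compare only the first 32 characters of each line (the length of an MD5 hash in hexadecimal, so file names are ignored)
--all-repeated=separate : print every line of each duplicate group, with groups separated by an empty line.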
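The grouping step can be checked in isolation with stand-in hashes. The 32-character strings and the paths below are invented for the example; they only mimic the layout of md5sum output.

```shell
# Three stand-in "hashes", each exactly 32 characters wide (not real MD5 values).
h1=$(printf '%032d' 1)
h2=$(printf '%032d' 2)
h3=$(printf '%032d' 3)

# Lines shaped like md5sum output: hash, two spaces, file name (hypothetical paths).
lines=$(printf '%s\n' "$h1  /data/one" "$h2  /data/two" "$h1  /data/three" \
                      "$h3  /data/four" "$h2  /data/five")

# Only lines whose first 32 characters repeat are kept, one blank line between groups.
groups=$(printf '%s\n' "$lines" | sort | uniq -w32 --all-repeated=separate)
printf '%s\n' "$groups"
```

The output should contain two groups of two lines each; /data/four is omitted because its hash appears only once.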
