
[ Python method or class to compare two video files? ]

I'm trying to write a Python program that compares files and reports the duplicates. Does anyone know any good functions or methods for this? I'm somewhat lost...

Answer 1

If you're just looking for exact duplicates, hash both files (MD5, SHA-512, or similar) and see if the digests match:

import hashlib

# Open in binary mode so the raw bytes are hashed unchanged
with open('file1.avi', 'rb') as f:
    file1 = f.read()
with open('file2.avi', 'rb') as f:
    file2 = f.read()

if hashlib.sha512(file1).hexdigest() == hashlib.sha512(file2).hexdigest():
    print('They are the same')
else:
    print('They are different')

If you also need to catch videos that are not byte-identical (re-encoded copies, for example), I'd try OpenCV's Python bindings and check whether they match up frame by frame.
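
For the exact-duplicate check, video files can be large enough that reading each one into memory in a single call is wasteful. Here is a minimal sketch of the same hash comparison done in chunks; the helper names `file_digest` and `same_content` and the 1 MiB chunk size are illustrative choices, not part of the original answer.

```python
import hashlib

def file_digest(path, chunk_size=1024 * 1024):
    """Return the SHA-512 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha512()
    with open(path, 'rb') as f:
        # iter() with a sentinel keeps calling f.read until it returns b''
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()

def same_content(path_a, path_b):
    """True if both files produce the same digest."""
    return file_digest(path_a) == file_digest(path_b)
```

Calling `same_content('file1.avi', 'file2.avi')` should give the same answer as the snippet above while holding only one chunk in memory at a time.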

Answer 2

I would start by comparing file names and file sizes. If you find a match, you could then loop through the bytes of both files to compare them, although this is probably expensive for large files.

I do not know of a library that can do this in Python.
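
The size check followed by a byte comparison described in this answer can be sketched with plain file reads. The function name `files_identical` and the 64 KiB chunk size are assumptions made for illustration.

```python
import os

def files_identical(path_a, path_b, chunk_size=64 * 1024):
    """Compare two files by size first, then chunk by chunk."""
    # Different sizes mean different contents, so skip reading entirely
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, 'rb') as fa, open(path_b, 'rb') as fb:
        while True:
            chunk_a = fa.read(chunk_size)
            chunk_b = fb.read(chunk_size)
            if chunk_a != chunk_b:
                return False
            if not chunk_a:  # both files ended at the same point
                return True
```

Because the loop stops at the first differing chunk, files that differ early are cheap to reject. If you'd rather not write the loop yourself, the standard library's `filecmp.cmp(path_a, path_b, shallow=False)` also compares file contents.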

Answer 3

I would use os.walk to go through the file tree.

For each file, I would store its absolute path (directory plus filename), indexed by file size and by a signature (the first 16 bytes? A hash of the first 512 bytes? A hash of the full file?).

When finished, you end up with a dict of file sizes; for each size, a dict of file signatures; for each signature, a list of all files sharing that signature. If your file signature is not based on the full file, or has a significant chance of collisions, you can then do a more in-depth comparison of just those colliding files.
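
The nested dict described above might look roughly like this. It is a sketch under assumptions: the signature is a SHA-256 hash of the first 512 bytes (one of the options floated above), every file under the root is readable, and the names `signature`, `build_index`, and `candidate_duplicates` are made up for the example.

```python
import hashlib
import os
from collections import defaultdict

def signature(path, num_bytes=512):
    """Hash of the first num_bytes bytes; cheap, but not collision-proof."""
    with open(path, 'rb') as f:
        return hashlib.sha256(f.read(num_bytes)).hexdigest()

def build_index(root):
    """Map size -> signature -> list of absolute paths under root."""
    index = defaultdict(lambda: defaultdict(list))
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.abspath(os.path.join(dirpath, name))
            index[os.path.getsize(path)][signature(path)].append(path)
    return index

def candidate_duplicates(index):
    """Yield each group of two or more files sharing size and signature."""
    for by_signature in index.values():
        for paths in by_signature.values():
            if len(paths) > 1:
                yield paths
```

Since this signature only covers the start of each file, the groups it yields are candidates rather than confirmed duplicates; a full hash or byte comparison on each group would settle it.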