Let me start off by saying that in this short posting, I don’t intend to provide or discuss any solutions, that specifically address de-duplication of content in SharePoint. But rather explore the idea, that duplication of content is often mistaken for a problem its not, and that while de-duplication may seem like the most logical solution; it may really cause more problems then it solves.
I believe that when the topic of duplicate content comes up it generally revolves around content discovery, primarily the impact that duplicate content has on search results. While duplicate content is quickly pegged as the culprit, it’s really more of a discoverability issue around authorative content. Removing duplicate content may not really be the solution, but perhaps effectively surfacing authorative content by reconfiguring search result page(s), fine tuning your search configuration, and doing a better job leveraging refiners and scopes.
As a user who sometimes copies documents and presentation from other areas into my own collaboration or personal sites, I wouldn’t be in favor of a solution that automatically removes my copies. I believe that in trying to remove duplicate content, you’ll quickly find many such users, and you may ultimately find yourself trying to change too much about how your users get work done.
Of course, discoverability may not be the issue you are trying to solve with de-duplication, in which case it may ultimately have to do with storage. If so, it may point to a bigger issue around information architecture, lifecycle management, and content expiration… But I’ll leave those topics for another time in the interest of keeping this posting short.