
Yavin4

(35,357 posts)
Mon Nov 7, 2016, 11:25 AM

What you need to know about email de-duplication

I work in a field called Electronic Discovery, and we deal with processing, reviewing, and producing emails all the time. De-duplication of emails has been around for years. There are several existing applications that can quickly de-duplicate an email set. The process is described below:

Yes. De-duplication is usually performed by comparing cryptographic hashes (e.g. MD5, SHA1 etc.) of documents to each other. The calculated hash values are based on the binary contents of documents and do not take into account external metadata that is stored in the file system. Therefore, two files with the same contents but different file names would produce the same hash value.
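
As a rough illustration of the hash comparison described above, here is a minimal Python sketch. It assumes SHA-1 over the raw bytes of each file; the function names and sample file names are made up for the example and are not taken from any particular e-Discovery product.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Hash only the binary contents; file names and other
    file-system metadata never enter the calculation."""
    return hashlib.sha1(data).hexdigest()


def group_duplicates(files: dict[str, bytes]) -> dict[str, list[str]]:
    """Map each content hash to the list of file names that share it."""
    groups: dict[str, list[str]] = {}
    for name, data in files.items():
        groups.setdefault(content_hash(data), []).append(name)
    return groups


# Hypothetical sample set: two files with identical contents but different names.
sample = {
    "memo.eml": b"Subject: lunch\n\nNoon works.",
    "memo (copy).eml": b"Subject: lunch\n\nNoon works.",
    "other.eml": b"Subject: dinner\n\nSeven works.",
}

groups = group_duplicates(sample)
unique_count = len(groups)  # number of distinct documents after de-duplication
```

The two "memo" files land in the same group because only their bytes are hashed, which is the behavior the quoted passage describes.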

Most e-Discovery service providers would allow you to use a custom hash that includes your choice of metadata fields in addition to document contents for de-duplication. For example, you could choose to include the file name field in your custom hash if you would like to make sure that documents can be considered duplicates only when their file names are also identical.
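
A custom hash of the kind described above might look something like the following Python sketch. The field name `file_name`, the separator byte, and the function `custom_hash` are assumptions chosen for illustration; real providers define their own field lists and hashing rules.

```python
import hashlib


def custom_hash(contents: bytes, metadata: dict[str, str], fields: list[str]) -> str:
    """Hash the document contents plus the selected metadata fields.

    With an empty field list this reduces to a plain content hash.
    """
    h = hashlib.sha1()
    h.update(contents)
    for name in fields:
        value = metadata.get(name, "")
        # A zero byte separates field name and value so adjacent strings cannot blur together.
        h.update(b"\x00" + name.encode("utf-8") + b"\x00" + value.encode("utf-8"))
    return h.hexdigest()


body = b"Quarterly numbers attached."
doc_a = {"file_name": "report.doc"}
doc_b = {"file_name": "report_v2.doc"}

# Contents only: the two documents count as duplicates.
same_without_names = custom_hash(body, doc_a, []) == custom_hash(body, doc_b, [])

# Contents plus file name: they no longer count as duplicates.
same_with_names = (
    custom_hash(body, doc_a, ["file_name"]) == custom_hash(body, doc_b, ["file_name"])
)
```

Adding a field to the hash can only make the duplicate test stricter, so fewer documents are dropped as duplicates.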


You can read more about it here:

http://www.meridiandiscovery.com/articles/frequently-asked-questions-about-de-duplication/

--OnEdit--

I post this so that you will have reference materials to combat your right wing relatives at Thanksgiving when they go off about a "rigged" system and claim that you cannot review 650,000 emails.
What you need to know about email de-duplication (Original Post) Yavin4 Nov 2016 OP
I just read a reference to de-duplication at another blog. Demit Nov 2016 #1
Depends on the processing application and other factors Yavin4 Nov 2016 #2
It isn't that complicated. BlueStreak Nov 2016 #3
Horrible image.... freebrew Nov 2016 #6
The 650,000 number to me was the big thing duncang Nov 2016 #10
Stupid Repubs probably think... Wounded Bear Nov 2016 #4
Yes, we're past the Efrem Zimbalast days n/t Yavin4 Nov 2016 #5
Takes a long time to print those documents. bluesbassman Nov 2016 #7

Demit

(11,238 posts)
1. I just read a reference to de-duplication at another blog.
Mon Nov 7, 2016, 11:38 AM

For those of us who don't know much computer science-wise, how long would a search of 650K emails take? Are you able to guesstimate that?

Yavin4

(35,357 posts)
2. Depends on the processing application and other factors
Mon Nov 7, 2016, 11:41 AM

A powerful application like Nuix could probably de-duplicate that many emails in two or three days.

 

BlueStreak

(8,377 posts)
3. It isn't that complicated.
Mon Nov 7, 2016, 11:42 AM

Only a tiny fraction of that 650,000 (well under 0.1%) would be of interest. Most of the pictures were probably Anthony sending out pictures of his junk.

By simply filtering down to the emails sent to and by his wife, and then further selecting emails where Hillary was the author or recipient, you end up with a few hundred emails -- a set that can be manually perused in a few hours.
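
The two-step filter above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Email` record, the `involves` and `narrow` helpers, and the example.com addresses are all hypothetical, and a real review tool would read these fields from parsed message headers.

```python
from dataclasses import dataclass, field


@dataclass
class Email:
    sender: str
    recipients: list[str] = field(default_factory=list)


def involves(email: Email, address: str) -> bool:
    """True when the address is the sender or one of the recipients."""
    return email.sender == address or address in email.recipients


def narrow(emails: list[Email], first: str, second: str) -> list[Email]:
    """Keep only messages that involve both addresses."""
    return [e for e in emails if involves(e, first) and involves(e, second)]


# Hypothetical mailbox contents.
mailbox = [
    Email("aide@example.com", ["principal@example.com"]),
    Email("principal@example.com", ["aide@example.com", "staff@example.com"]),
    Email("newsletter@example.com", ["aide@example.com"]),
    Email("owner@example.com", ["friend@example.com"]),
]

relevant = narrow(mailbox, "aide@example.com", "principal@example.com")
relevant_count = len(relevant)
```

Each message is checked once, so even a very large mailbox is reduced to the small set for manual review in a single pass.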

The question is not how they could get through that process in a week, but why it took more than a day. Why did they wait until 2 days before the election to admit there was nothing there?

duncang

(1,907 posts)
10. The 650,000 number to me was the big thing
Mon Nov 7, 2016, 01:58 PM

Where did that number come from? Was it actually confirmed by the FBI that it was the total number of emails? I bet Flynn heard there were 650,000+ files total on the computer. Since she was Secretary of State for the entire 4 years, that would still be around 450 emails being sent and received every day. That just seems unlikely.
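
For anyone who wants to check the per-day figure, here is the arithmetic as a small Python sketch. It assumes four 365-day years and that all 650,000 items were emails, neither of which is confirmed in this thread.

```python
total_emails = 650_000      # figure quoted in the thread
days_in_office = 4 * 365    # assumption: four years, ignoring the leap day

emails_per_day = total_emails / days_in_office
rounded_per_day = round(emails_per_day)  # roughly 445 sent and received per day
```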

Wounded Bear

(58,440 posts)
4. Stupid Repubs probably think...
Mon Nov 7, 2016, 11:46 AM

that there are 10 agents in a room somewhere comparing paper documents.

Electronically comparing 600k documents would take a few seconds.
