Problem: Extract the sender's e-mail address from each incoming mail.
Analysis:
1. On macOS, the e-mails are stored in .mbox format. The .mbox on a Mac is essentially an archive: it can be opened, and the useful part for this exercise is the file inside called filename.mbox. That file is plain text.
2. E-mails have standard header information, which contains the sender information.
e.g. an e-mail header looks something like this (information masked for privacy):
From sender@example.com Fri Dec 16 00:11:30 2016
Delivered-To: recipient@domain.com
Received: by IP address with SMTP id hm4csp533342wjb;
Fri, 16 Dec 2016 00:11:30 -0800 (PST)
X-Received: by 10.157.51.53 with SMTP id f50mr1243482otc.34.1481875890678;
Fri, 16 Dec 2016 00:11:30 -0800 (PST)
Return-Path:
Received: from gateway21.websitewelcome.com (gateway21.websitewelcome.com. [IP address])
by mx.google.com with ESMTPS id r129892oib.209.2016.12.16.00.11.30
for
(version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128);
Fri, 16 Dec 2016 00:11:30 -0800 (PST)
Received-SPF: neutral (google.com: IP address is neither permitted nor denied by best guess record for domain of sender@example.com) client-ip=IP address;
Authentication-Results: mx.google.com;
spf=neutral (google.com: IP address is neither permitted nor denied by best guess record for domain of sender@example.com) smtp.mailfrom=sender@example.com
Received: from cm2.websitewelcome.com (cm2.websitewelcome.com [IP address])
by gateway21.websitewelcome.com (Postfix) with ESMTP id 47E19987
for; Fri, 16 Dec 2016 02:11:30 -0600 (CST)
Received: from gator4094.hostgator.com ([IP address])
by cm2.websitewelcome.com with
id LYBV1BW5p; Fri, 16 Dec 2016 02:11:30 -0600
Received: from [IP address] (port=19683 helo=ardapc)
by gator4094.hostgator.com with esmtpa (Exim 4.87)
(envelope-from)
id 1cHnc0-0005GH-4L
for recipient@domain.com; Fri, 16 Dec 2016 02:11:28 -0600
From: Sender Name
To:
Subject: Subject of the e-mail.
Date: Fri, 16 Dec 2016 11:11:22 +0300
3. Learning from other e-mail extractors
I have tried several scripts and paid for a programme that claims to extract e-mail addresses. However, as one can see, only the first line is useful. If we parse the whole message, we get many repetitions and also capture the recipient's e-mail and compliance addresses (abuse@, postmaster@ etc.), so the result is largely useless.
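To make the point above concrete, here is a minimal sketch that compares the two approaches on a tiny made-up message. The file names (sample.mbox, all_addresses.txt, envelope_senders.txt), the sample addresses and the address pattern are assumptions for illustration only; the pattern is a rough approximation, not a full e-mail address validator.

```shell
# Build a tiny sample mbox (hypothetical data) to compare the two approaches.
cat > sample.mbox <<'EOF'
From sender@example.com Fri Dec 16 00:11:30 2016
Delivered-To: recipient@domain.com
Received-SPF: neutral (domain of sender@example.com)
From: Sender Name <sender@example.com>
Subject: Hello

Report problems to abuse@example.net
EOF

# Approach A: every address-like string anywhere in the message.
# This also picks up the recipient, repeats and role addresses.
grep -Eo '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}' sample.mbox > all_addresses.txt

# Approach B: only the second field of the "From " line that starts each message.
awk '/^From / { print $2 }' sample.mbox > envelope_senders.txt

echo "All matches:"
cat all_addresses.txt
echo "Sender from the first line only:"
cat envelope_senders.txt
```

On this sample, approach A returns five matches (the sender three times, the recipient and abuse@), while approach B returns the sender once.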
Solution:
1) Upload the mbox file to a Linux shell that has the sed utility
2) Extract the lines containing "From " and discard everything else
sed -n '/From /p' filename.mbox > output1.txt
3) Use Notepad++ to remove the extra lines where "From " was captured from the mail body
In my experience, about 1-2% of the captured lines are there only because the word "From " was used in the mail body, so the sed command above picks them up as well.
From sender1@example.com Fri Dec 16 00:11:30 2016
From sender2@example.com Fri Dec 16 00:11:30 2016
From sender3@example.com Fri Dec 16 00:11:30 2016
From sender4@example.com Fri Dec 16 00:11:30 2016
4) Use Notepad++ to remove "From " from output1.txt
A simple "Find and Replace" does it.
sender1@example.com Fri Dec 16 00:11:30 2016
sender2@example.com Fri Dec 16 00:11:30 2016
sender3@example.com Fri Dec 16 00:11:30 2016
sender4@example.com Fri Dec 16 00:11:30 2016
5) Upload the file back to the Linux shell
6) Remove everything after the first space (which is the date)
sed 's/\s.*$//' output1.txt > output2.txt
sender1@example.com
sender2@example.com
sender3@example.com
sender4@example.com
7) Remove duplicates with Excel
8) Search for and remove e-mail addresses containing "reply", e.g. "no_reply" / "do_not_reply"
9) Optionally, remove entries containing your own domain
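For anyone who prefers to stay in the shell, steps 2 to 9 can probably be approximated in one pipeline. This is only a sketch under stated assumptions, not the method used above: it swaps Notepad++ and Excel for awk, sort and grep, the sample data and the own_domain value (mydomain.com) are made up, and the output name senders.txt is hypothetical. Requiring an "@" in the second field is my assumption for skipping body lines that merely begin with "From ".

```shell
# Sample input (hypothetical data); point the pipeline at the real filename.mbox instead.
cat > filename.mbox <<'EOF'
From sender1@example.com Fri Dec 16 00:11:30 2016
Subject: first

From what I can tell, this body line is not a message separator.
From sender2@example.com Fri Dec 16 00:12:30 2016
Subject: second

From no_reply@example.com Fri Dec 16 00:13:30 2016
Subject: third

From sender1@example.com Fri Dec 16 00:14:30 2016
Subject: fourth

From someone@mydomain.com Fri Dec 16 00:15:30 2016
Subject: fifth
EOF

own_domain='mydomain.com'   # hypothetical: replace with your own domain

# Steps 2-6: keep lines starting with "From " whose second field looks like an
#            address, and print only that field (drops "From " and the date).
# Step 7:    sort -u removes duplicates.
# Step 8:    drop addresses containing "reply" (no_reply, do_not_reply, ...).
# Step 9:    drop addresses at your own domain.
awk '/^From / && $2 ~ /@/ { print $2 }' filename.mbox \
  | sort -u \
  | grep -vi 'reply' \
  | grep -viF "@${own_domain}" \
  > senders.txt

cat senders.txt
```

With the sample data, senders.txt ends up with sender1@example.com and sender2@example.com. It is still worth eyeballing the result on a real mailbox, since the 1-2% of stray lines mentioned in step 3 may not all be caught by the "@" check.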