Email extract script for mbox

Problem: Extract the sender's e-mail address from each incoming mail.

Analysis:
1. On macOS, mail is stored in .mbox format. The .mbox on a Mac is essentially an archive that can be opened; the useful part for this exercise is the file inside it called filename.mbox, which is a plain text file.

2. Every e-mail has standard header information, which contains the sender information.

For example, an e-mail header looks something like this (information masked for privacy):

From sender@example.com Fri Dec 16 00:11:30 2016
Delivered-To: recipient@domain.com
Received: by IP address with SMTP id hm4csp533342wjb;
        Fri, 16 Dec 2016 00:11:30 -0800 (PST)
X-Received: by 10.157.51.53 with SMTP id f50mr1243482otc.34.1481875890678;
        Fri, 16 Dec 2016 00:11:30 -0800 (PST)
Return-Path:
Received: from gateway21.websitewelcome.com (gateway21.websitewelcome.com. [IP address])
        by mx.google.com with ESMTPS id r129892oib.209.2016.12.16.00.11.30
        for
        (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128);
        Fri, 16 Dec 2016 00:11:30 -0800 (PST)
Received-SPF: neutral (google.com: IP address is neither permitted nor denied by best guess record for domain of sender@example.com) client-ip=IP address;
Authentication-Results: mx.google.com;
       spf=neutral (google.com: IP address is neither permitted nor denied by best guess record for domain of sender@example.com) smtp.mailfrom=sender@example.com
Received: from cm2.websitewelcome.com (cm2.websitewelcome.com [IP address])
    by gateway21.websitewelcome.com (Postfix) with ESMTP id 47E19987
    for ; Fri, 16 Dec 2016 02:11:30 -0600 (CST)
Received: from gator4094.hostgator.com ([IP address])
    by cm2.websitewelcome.com with
    id LYBV1BW5p; Fri, 16 Dec 2016 02:11:30 -0600
Received: from [IP address] (port=19683 helo=ardapc)
    by gator4094.hostgator.com with esmtpa (Exim 4.87)
    (envelope-from )
    id 1cHnc0-0005GH-4L
    for recipient@domain.com; Fri, 16 Dec 2016 02:11:28 -0600
From: Sender Name
To:
Subject: Subject of the e-mail.
Date: Fri, 16 Dec 2016 11:11:22 +0300


3. Lessons from other e-mail extractors

I have tried several scripts and paid for a programme that claims to extract e-mail addresses. However, as the header above shows, only the first line (the one starting with "From ") is useful. If we parse the whole message, we get many repetitions and also capture the recipient's e-mail address and complaint addresses (abuse@, postmaster@, etc.), which are largely useless.

Solution:

1) Upload the mbox file to a Linux shell that has the sed utility

2) Extract the lines containing "From " and discard everything else

   sed -n '/From /p' filename.mbox > output1.txt
   
3) Use Notepad++ to remove the extra lines where "From " was matched in the mail body
In my experience, about 1-2% of the captured lines come from mails that use the word "From " in the body, so the sed command above picks them up as well. After removing them, only the sender lines remain:

From sender1@example.com Fri Dec 16 00:11:30 2016
From sender2@example.com Fri Dec 16 00:11:30 2016
From sender3@example.com Fri Dec 16 00:11:30 2016
From sender4@example.com Fri Dec 16 00:11:30 2016  
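
The manual clean-up in step 3 can probably be reduced by tightening the pattern. The following is only a sketch, not what I ran originally: it assumes the separator line always starts at the beginning of a line with "From ", followed by something that looks like an address, as in the sample header above. The function name extract_from_lines is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read an mbox file on stdin and print only the lines
# that begin with "From " followed by a token containing "@" and a space.
# Assumption: separator lines look like the sample in this post,
#   From sender@example.com Fri Dec 16 00:11:30 2016
# Separator lines whose sender has no "@" are dropped by this pattern.
extract_from_lines() {
    grep -E '^From [^[:space:]]+@[^[:space:]]+ '
}

# Example (commented out, needs a real file):
# extract_from_lines < filename.mbox > output1.txt
```

Because the match is anchored to the start of the line, a sentence such as "Hello, From here on..." in the mail body is no longer captured. It is still worth skimming the output, since a body line could in principle have the same shape.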

4) Use Notepad++ to remove the leading "From " from each line in output1.txt
A simple "Find and Replace" does the job:

sender1@example.com Fri Dec 16 00:11:30 2016
sender2@example.com Fri Dec 16 00:11:30 2016
sender3@example.com Fri Dec 16 00:11:30 2016
sender4@example.com Fri Dec 16 00:11:30 2016
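
If you prefer to stay on the Linux shell, the same Find and Replace can be done with sed. This is a minimal sketch under the assumption that every line in output1.txt starts with "From " exactly as shown above; strip_from_prefix is a name I made up for it.

```shell
#!/bin/sh
# Hypothetical helper: remove the leading "From " from each line on stdin.
# Lines that do not start with "From " pass through unchanged.
strip_from_prefix() {
    sed 's/^From //'
}

# Example (commented out, needs a real file). Write to a temporary file
# first, because redirecting onto the input file would truncate it:
# strip_from_prefix < output1.txt > output1.tmp && mv output1.tmp output1.txt
```

Doing it this way also makes the download and upload around Notepad++ unnecessary for this step.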

5) Upload output1.txt back to the Linux shell

6) Remove everything from the first space onwards (which is the date)

   sed 's/\s.*$//' output1.txt > output2.txt

sender1@example.com
sender2@example.com
sender3@example.com
sender4@example.com  
  

7) Remove duplicates with Excel

8) Search for and remove e-mail addresses containing "reply", such as "no_reply" and "do_not_reply"

9) Optionally, remove entries containing your domain
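
Steps 7 to 9 can also be sketched on the shell instead of Excel. This is not the method I used, just one possible equivalent. It assumes one address per line (as in output2.txt) and treats addresses as case-insensitive by lower-casing them first, which is a simplification. The name clean_addresses and the example domain mydomain.com are placeholders.

```shell
#!/bin/sh
# Hypothetical helper: read addresses (one per line) on stdin, then
#   - lower-case them and remove duplicates            (step 7)
#   - drop every address containing "reply"            (step 8)
#   - optionally drop addresses at your own domain     (step 9)
# $1 is the optional own domain, e.g. mydomain.com
clean_addresses() {
    tr '[:upper:]' '[:lower:]' | sort -u | grep -v 'reply' |
    if [ -n "${1:-}" ]; then
        # -F treats the domain as a fixed string, so dots are literal
        grep -v -i -F "@$1"
    else
        cat
    fi
}

# Example (commented out, needs a real file):
# clean_addresses mydomain.com < output2.txt > final.txt
```

Note that matching on "reply" is deliberately broad, exactly like the manual search in step 8, so any legitimate address that happens to contain "reply" is removed as well.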
