Downloading emails with Google Takeout
20 Jan 2020
This is part 1 of a 4-ish (or more? or less??) part series describing how to train a bot to write emails like your friends.
- Part 1: How to download your emails from Google (This post)
- Part 2: How to extract relevant text from the mbox file format
- Part 3: How to train GPT2 on your text (for free!) using Google Colab
- Part 4: Visualizing your email data with ggplot2
Email Archaeology
Looking through old emails, I feel like an archaeologist of myself. I’ll run a search for a specific email, find a bunch of old things, and suddenly be immersed in the hot topics of 2009 and see all the wrong opinions I once held.
My college friends and I have used an ongoing email list for discussion for over a decade. (I’m probably showing my age here; I’m sure the kids these days have group TikToks or something.) I’ve often wanted to download this data to play with it, answering questions about who writes emails when, whose emails are longest, who is most likely to send nothing but a New York Times article, what words are most associated with which writer, and more.
I finally built up the momentum to get all of this data when I decided to fine-tune GPT-2 to mimic my friends’ writing styles. GPT-2 is a pre-trained neural network for text generation that has produced some interesting results over the last year.
This post explains how I obtained emails from a specific email list.
Google Takeout
Google Takeout is a service that allows you to download data that Google has about you. It’s frankly pretty extensive and impressive. On the one hand, the amount of information Google has about my life is terrifying. On the other, it’s pretty easy for me to get this data, at least in bulk.
One issue is that there can be too much data. Takeout offered a way for me to download all of my GMail data, which, after over a decade of use and very little deleting, would have been enormous. I wanted just the text of a specific email list. Fortunately, there’s a simple workaround.
Targeting Specific Emails
Google Takeout allows you to download all emails with a specific label. First, I created a new GMail label, “friends.” Then, I used a GMail filter to label all the emails I wanted. When creating the filter, make sure to select “Apply filter to matching emails.”
The emails I was looking for were first sent to an email list managed by our college’s IT department and later to a Google Group. With some experimentation filtering on the “to:” field, I was reasonably certain I grabbed most of the emails I wanted.
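Since my emails went to two different list addresses over the years, the search I fed into the filter was really several “to:” terms joined with OR. Here is a minimal sketch of building that kind of query string in Python. The addresses and the function name `build_gmail_query` are made up for illustration; substitute whatever addresses your own list used, then paste the result into the GMail search box to check the matches before creating the filter.

```python
def build_gmail_query(list_addresses):
    """Join several recipient addresses into one GMail search query.

    Each address becomes a `to:` term, and the terms are combined with OR,
    so a message sent to any one of the addresses matches.
    """
    terms = [f"to:{address}" for address in list_addresses]
    return " OR ".join(terms)


# Hypothetical addresses: an old college-hosted list and a later Google Group.
query = build_gmail_query([
    "friends-list@college.example",
    "friends-list@googlegroups.example",
])
print(query)
```

Pasting the printed string into the search box shows roughly which emails the filter will label, which makes it easier to experiment before committing to a filter.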
Note that you can use labeling and filtering to download whatever subset of emails you want. Maybe you want to look at the vocab you used when sending emails over a specific period of time. Maybe you want to analyze political fundraising emails. Perhaps you want all the Amazon receipts you’ve ever received. Who knows!
It may take a while for Takeout to generate your data, depending on the volume. You can check back in a few hours or even days (I didn’t see an email notification when my data was ready).
Alternatively, if all the emails you want are in a Google Group and you are the owner of that group, there’s an even easier method. Google Takeout allows you to download all emails for a Google Group directly. On the Google Takeout page, look for “Groups” and select that instead of “Mail.”
Summary and Screenshots
In summation:
- Create a new GMail label.
- Create a GMail Filter that targets the emails you want. Make sure to select “Apply filter to matching emails.”
- Open Google Takeout.
- Select “Mail”, click “All Mail data included,” then deselect everything except the label you just created.
- Wait for your data to be prepared.
- Download.
- Profit.
In the next post, I’ll show my quick-and-dirty script for processing this data using Python’s built-in mailbox library.
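In the meantime, it can be worth a quick sanity check that the download contains what you expected. The sketch below is not the script from the next post, just a rough check that assumes Takeout gave you a single mbox file. It uses the standard library’s `mailbox` module to tally messages per sender; the function name `count_senders` and the file path in the usage note are placeholders of mine.

```python
import mailbox
from collections import Counter
from email.utils import parseaddr


def count_senders(mbox_path):
    """Return a Counter mapping each sender address to a message count."""
    counts = Counter()
    # create=False raises an error on a mistyped path instead of
    # silently creating a new, empty mbox file there.
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for message in box:
            # parseaddr splits "Name <addr>" into (name, addr).
            _, address = parseaddr(str(message.get("From", "")))
            counts[address.lower()] += 1
    finally:
        box.close()
    return counts
```

Calling `count_senders("friends.mbox").most_common(10)` (with the path replaced by wherever your download actually landed) lists the ten most frequent senders. If the names you expect are missing, the filter probably needs another round of experimentation.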
Some screenshots of the process are below.
Google Takeout.
Select Mail.
Don’t get everything, unless you don’t have that much data in GMail!
Select just the label you want.