Discretizing Baby Sleep Timeseries with Pandas
23 Jun 2020

This is part of a series on visualizing baby sleep data with Python and R. All code is in this repository.
- Visualizing Baby Sleep Times in Python
- Recreating the ‘Most Beautiful Data Visualization of All Time’
- Night and Day, Python and R: Baby Sleep Data Analysis with Siuba
- Discretizing Baby Sleep Timeseries with Pandas
This post shows a few methods for turning a representation of time as a series of periods into a binary vector for use in clustering or other data analysis.
Motivation: A newborn’s sleep is chaotic before transitioning to a more regular schedule. You can clearly see this in my child’s sleep behavior in the graph below.
I decided I’d try to quantify this shift from more random to more regular over time. Additionally, I wanted to do some clustering to look at when certain transitions occurred and to make some pretty graphs!
My first benchmark here is to compare days to see how similar they are. I decided to represent each day as a vector and then run cosine similarity. Start simple! To do this, however, I need to represent days as vectors. I settled on breaking days into 15 minute increments. The first item in the vector would be a 1 if the baby was asleep in the first 15 minutes of the day and a 0 otherwise. This would repeat for the rest of the day, so we’ll have a vector of length 24 hours * 4 time-periods / hour = 96. For time periods with some sleep and some awake, we’ll just call it entirely asleep.
Let’s dive into some Pandas code.
Preliminaries
As I showed in my first post, our data looks like this:
Basically, each row is a period of time where the baby was asleep.
First, we import the necessary libraries, define some constants, do some type conversions, and add extra columns for convenience.
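Something along these lines, where the file name and column names are my assumptions rather than the post’s actual schema:

```python
import pandas as pd
import numpy as np

# 15 minute windows; 24 hours * 4 windows/hour = 96 windows per day
WINDOW = "15min"
PERIODS_PER_DAY = 24 * 4

# Hypothetical file and column names, for illustration only
sleep = pd.read_csv("sleep_data.csv")
sleep["start"] = pd.to_datetime(sleep["start"])  # when each sleep period began
sleep["end"] = pd.to_datetime(sleep["end"])      # when it ended
sleep["date"] = sleep["start"].dt.date           # convenience column
```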
Discretize Time Periods
Pandas has a powerful resample function which is great for our use case. However, to use it, we need our data to be formatted as a Series with a time index.
You could think of this next block of code as turning our “wide” dataframe with multiple observations per row (when did baby wake up? When did baby sleep?) into a “long” dataframe, with one observation per row (when did baby do something, and what was it?). As always, Hadley Wickham’s paper is great for thinking about data formats.
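A minimal sketch of that reshaping, assuming the start/end columns from the setup sketch above:

```python
# Each sleep period becomes two events: 1 when the baby falls asleep,
# 0 when the baby wakes up, indexed by the time the event happened.
asleep = pd.Series(1, index=sleep["start"])
awake = pd.Series(0, index=sleep["end"])
all_times = pd.concat([asleep, awake]).sort_index()
```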
Next up, I’ll use the resample method in two different ways. One is easy to write and think about, but much slower.
Method 1: Upsample then Downsample
In this version, we resample to a one second granularity and forward fill. We follow this up by resampling to a 15 minute granularity and taking the max value within that 15 minute time window.
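In code, that could look roughly like this (a sketch built on the all_times Series from above, not the post’s verbatim code):

```python
# Method 1: upsample to one-second granularity, forward fill the
# asleep/awake state, then downsample to 15 minute windows.
method_1 = (
    all_times
    .resample("1S").ffill()      # one row per second, state carried forward
    .resample("15min").max()     # 1 if asleep at any second in the window
    .astype(int)
)
```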
In a bit more detail, all_times.resample("1S") gives us a Series with a row for every second between our start and end times. The values in the series will be 1 or 0 at the timestamps present in all_times, and null everywhere else. The .ffill() call then fills every null with the first non-null value before it. The second resample call gives us a new series with a row for each 15 minute time block. Finally, calling .max() on the resampled series means that if the baby was asleep at any second within a 15 minute window, the whole window is marked as asleep.
A helpful way to think about resample([time granularity]).function() is sort of like a groupby: we group rows together based on the time windows, and then use the function to decide how to go from those rows to either a few rows (like in an aggregation) or many rows (via interpolation or some other method).
Because we’re potentially making a lot of rows, this uses a lot of time and computation. But it’s super quick to write, and oftentimes as data scientists we only have to run code like this once or twice.
Method 2: Resample Twice and Combine
We’re doing two things here.
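Here’s a sketch of those two things, reconstructed from the variable names used below (the window logic is my reading of the explanation, not the post’s exact code):

```python
# Resample the sparse event series into 15 minute bins once, reuse it twice.
r = all_times.resample("15min")

# The last event in each window, pushed forward through the empty windows.
r_last = r.last().fillna(method="ffill")

# Any window that contains an event had the baby asleep for part of it,
# so mark every non-empty window as asleep.
r_max = r.max()
r_max[r_max.notna()] = 1

# Prefer r_max where the window actually contained an event; fall back to
# the forward-filled value everywhere else (like a SQL coalesce).
method_2 = r_max.combine_first(r_last).astype(int)
```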
First, let’s focus on r_last. We could think of the values in all_times as moments when we’re flipping an on/off, awake/asleep switch. Thus, for any time after we’ve set the switch to one, we want the value to be one. If baby falls asleep at 4:58pm and wakes up at 5:28pm, we want the 5:00pm and 5:15pm periods to be asleep as well. r.last() says: when we have multiple rows in a time window, take the value of the last one.

The .fillna(method = "ffill") call then “pushes” that last value forward, replacing all nulls until it hits a non-null value. That’s going to be correct for all the nulls it replaces, but sometimes incorrect for the window in which the value itself occurs. If baby fell asleep at 12:16am and woke up at 12:18am, we want the value to be asleep for the 12:15-12:30am window, but this will take the last value, awake.
So that’s why we need to use r_max. While r_last will be correct for all values that were null in the resampled array, r_max will be correct for all values that were not null in the resampled array. If any value appears in the time window, then the baby was asleep for at least some portion of it. Thus, we set all these values to 1.

Finally, we simply combine these two arrays with the combine_first method, which works similarly to a SQL coalesce, and set the type to int.
Method Comparison
The first method is quick to write but slow to execute. The second method is more cumbersome but faster. How much faster? On my old laptop where I’m running this, method 1 took 3 seconds and method 2 took 68ms, so roughly a 40x difference.
They do give the same result, as we can verify with:
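For example, something like this, using the names from the sketches above:

```python
# Both approaches should produce the exact same 15 minute series
assert method_1.equals(method_2)
```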
Create Feature Vectors
The above code gave us one long time series broken into evenly spaced, 15 minute increments. Since we want to compare days, let’s now break apart that series into separate days. Again, I’ll show two methods.
One option is to stay in “numpy” land and operate directly on the values. We figure out how big our vector should be, and reshape.
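A sketch of that reshape, assuming the method_2 series from above has already been trimmed to whole days:

```python
# One row per day, one column per 15 minute window
values = method_2.to_numpy()
day_vectors = values.reshape(-1, PERIODS_PER_DAY)  # shape: (n_days, 96)
```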
Another option is to turn our timeseries into a DataFrame, add some columns to break apart each day, and use the pivot method to rearrange everything.
The pro of this method is we get to keep our data as a DataFrame with comprehensible column names.
The downside is we’re now in Weird Pandas Multi-Index Multi-Column world, which I really dislike.
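Roughly like this, with column names that are my own rather than the post’s:

```python
# Turn the series into a DataFrame with explicit date / time-of-day columns
df = method_2.to_frame(name="asleep")
df["date"] = df.index.date
df["time_of_day"] = df.index.time

# One row per date, one column per 15 minute window of the day
by_day = df.pivot(index="date", columns="time_of_day", values="asleep")
```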
Again, we can verify the two are the same:
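For instance:

```python
import numpy as np

# The reshaped array and the pivoted DataFrame should hold the same values,
# assuming both cover the same whole days in the same order
assert np.array_equal(day_vectors, by_day.to_numpy())
```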
Distance Metric and Plotting
Finally, we actually compute the cosine similarity, the easiest part of all of this. I did this by zipping the two lists of vectors with a one day offset.
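A sketch of that comparison, pairing each day’s vector with the following day’s:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two day vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Compare each day to the next by zipping the day vectors with a one day offset
similarities = [
    cosine_similarity(today, tomorrow)
    for today, tomorrow in zip(day_vectors[:-1], day_vectors[1:])
]
```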
Next I’ll plot using my favorite plotting library of the moment, plotnine. I use some custom colors because I thought they looked nice in my last post.
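A rough sketch of the plot; the colors here are placeholders rather than the ones from the post:

```python
import pandas as pd
from plotnine import ggplot, aes, geom_line, geom_point, labs, theme_minimal

plot_df = pd.DataFrame({
    "day": range(len(similarities)),
    "similarity": similarities,
})

(
    ggplot(plot_df, aes(x="day", y="similarity"))
    + geom_line(color="#2b8cbe")
    + geom_point(color="#e34a33")
    + labs(x="Day", y="Cosine similarity with previous day")
    + theme_minimal()
)
```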
The trend does indeed appear to go from really dissimilar sleep to much more regular patterns!