Night and Day, Python and R: Baby Sleep Data Analysis with Siuba
10 Jun 2020
This is part of a series on visualizing baby sleep data with Python and R. All code is in this repository.
- Visualizing Baby Sleep Times in Python
- Recreating the ‘Most Beautiful Data Visualization of All Time’
- Night and Day, Python and R: Baby Sleep Data Analysis with Siuba
More Baby Sleep Data!
I’ve continued to have fun graphing my baby’s sleep times with data recorded from her fancy bassinet. We’ve had some… interesting… experiences learning about our infant’s sleep patterns and how they change over time. The plots today show the shift from our baby’s time napping to time sleeping at night.
Previously, I visualized our baby’s sleep times in Python and made a pretty image using R. This week, I’ll confuse the heck out of everyone by using Python libraries that bring R’s great ideas to Python. Will this be a beautiful moment of cross-programming-language unity? Will it just be confusing for everyone? Read on to find out!
Specifically, I’m going to use siuba, a data manipulation library heavily inspired by R’s dplyr. I’ll also use plotnine, a plotting library that clones R’s ggplot2. I’ve always found Python more enjoyable for programming and R more enjoyable for data analysis, and I’m excited to see R’s ideas brought to Python.
This post is broken up into two parts:
- Part 1 shows the plots and describes them, no programming involved.
- Part 2 walks through the code in detail.
Plots
Details
Enough of looking at pretty plots… how did we make them?
Data Manipulation: Pandas, Dplyr, and Siuba
Before we dive into the code, I’ll mention that this code will look weird to Python programmers and eerily familiar to those who use R’s dplyr.
Some day in another post I’ll break down why I like R’s dplyr library and how learning it made me a better data scientist!
The 5 minute summary is that most data manipulation only uses a few crucial operations.
Dplyr (and its Python equivalent, siuba) breaks these operations into a few functions that operate on a dataframe:
- mutate: add a new column
- select: select a subset of columns
- filter: select a subset of rows
- arrange: sort your dataframe
- group_by: apply the next operations only by groups
- summarize: collapse groups into summary statistics
These operations can be done in pandas (there are many websites comparing pandas and dplyr).
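As a rough illustration of that mapping, here is a minimal pandas sketch of one common equivalent for each verb. The toy dataframe and its column names are made up for this example, and pandas offers other ways to express each step.

```python
import pandas as pd

# Toy data; the column names are hypothetical.
df = pd.DataFrame({
    "baby": ["a", "a", "b", "b"],
    "minutes": [30, 90, 45, 15],
})

mutated = df.assign(hours=df["minutes"] / 60)   # mutate: add a new column
selected = df[["baby"]]                         # select: subset of columns
filtered = df[df["minutes"] > 20]               # filter: subset of rows
arranged = df.sort_values("minutes")            # arrange: sort the dataframe
summarized = (
    df.groupby("baby", as_index=False)          # group_by: operate per group
      .agg(total_minutes=("minutes", "sum"))    # summarize: one row per group
)
```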
I like dplyr because a few features make my coding closer to “speed of thought.”
In dplyr and siuba, a dataframe is piped through a series of the above commands, in siuba using the >> operator.
The _ variable is used to refer to the current dataframe in the pipe.
This makes it easy to manipulate a dataframe in multiple ways in one statement without having to make and keep track of multiple variables.
If this all feels a little fuzzy, please take a look at the code below, and stay tuned for some future love-letters to R :).
I should mention that I also attempted to port dplyr to Python with the dplython library. However, due to fatherhood and a full-time job, I haven’t put much time into dplython recently, and I’m happy to see that siuba exists!
Okay, now let’s actually dive into the code.
Preliminaries
First we have imports and some constants.
You’ll notice the * imports, which aren’t typically good Python style.
However, they make the code much less verbose, and I think they make sense for data-analysis-style coding.
We load in our data and do some basic type conversion.
Splitting Time Periods
The split_times function tackles the same challenge we faced in my earlier post: how do we make sure a time period doesn’t cross midnight?
I’ve used roughly the same logic as in the previous post, but with siuba syntax instead of vanilla pandas.
The code does the following:
- Create a new datetime column with date equal to the start time’s date, and time equal to the time we want to split on.
- Break the dataframe into two new dataframes: one with sessions that cross the splitting time, and one with sessions that don’t.
- For sessions that cross the time, create two new dataframes. One will contain the first portion of the time period: original start time, ending at the splitting time. The other will have the second portion: start at the split time, end at the original end time.
- Concatenate all dataframes together.
- Profit.
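The steps above can be sketched in plain pandas. This is not the siuba code from the post: the function name split_sessions and the columns start_time and end_time are hypothetical, chosen only for illustration. The example splits at 8pm; a split at exactly midnight would need the following day's date, which this sketch does not handle.

```python
import datetime as dt
import pandas as pd

def split_sessions(df, split_time):
    """Split any session that crosses `split_time` into two rows."""
    # Step 1: a datetime with the start's date and the splitting time.
    split_dt = df["start_time"].dt.normalize() + pd.Timedelta(
        hours=split_time.hour, minutes=split_time.minute
    )
    crosses = (df["start_time"] < split_dt) & (df["end_time"] > split_dt)

    # Step 2: sessions that do not cross the split are kept unchanged.
    untouched = df[~crosses]

    # Step 3: crossing sessions become a first and a second portion.
    first = df[crosses].assign(end_time=split_dt[crosses])
    second = df[crosses].assign(start_time=split_dt[crosses])

    # Step 4: concatenate everything back together.
    return (
        pd.concat([untouched, first, second])
        .sort_values("start_time")
        .reset_index(drop=True)
    )

# Two made-up sessions; the second one crosses 8pm.
sessions = pd.DataFrame({
    "start_time": pd.to_datetime(["2020-06-01 13:00", "2020-06-01 19:30"]),
    "end_time": pd.to_datetime(["2020-06-01 14:30", "2020-06-01 22:00"]),
})
result = split_sessions(sessions, dt.time(20, 0))
```

The total time asleep is unchanged by the split; only the number of rows grows.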
Labeling Days and Nights
We add a few columns with just the time of day (not the full date and time) and convert to an easier format. Also, crucially, we label each time period with whether it took place at night or during the day. I’ve defined “night” as 8pm to 8am and “day” the opposite. This doesn’t align with most of my behavior most of my life, but works well enough for the baby’s schedule.
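The day/night rule can be sketched with nothing but the standard library. The function name label_period is hypothetical, and the sketch assumes sessions have already been split at 8am and 8pm, so a session's start time alone decides its label.

```python
import datetime as dt

NIGHT_START = dt.time(20, 0)  # 8pm
NIGHT_END = dt.time(8, 0)     # 8am

def label_period(start):
    """Return "night" for start times from 8pm up to (not including) 8am."""
    t = start.time()
    # Night wraps around midnight, so it is an "or", not a "between".
    return "night" if (t >= NIGHT_START or t < NIGHT_END) else "day"

# Made-up start times at 3am, 8am, 1pm and 8pm.
labels = [label_period(dt.datetime(2020, 6, 1, hour)) for hour in (3, 8, 13, 20)]
```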
Check Plot
Next, I recreate a plot from the original post, but color the rectangles by whether the time is day or night. This is mainly to make sure I didn’t have any bugs!
I pulled colors from the excellent Palettable library. As a style point that others might disagree with, I like to remove labels and titles when I think graphs are self-explanatory. The tick marks from the x-axis clearly show this is time, so why waste space writing “Time”? The values clearly show night or day, so why put a title on the legend? Etc.
Sleep Time: Day vs. Night
This is where I think siuba really shines. I have many sleep sessions for each day. My goal is to compute summary statistics for each day. In one statement, I:
- Add a column calculating sleep session length
- Group by day and type of sleep (day or night)
- Summarize these groups, calculating the total amount of sleep.
- Add another column expressing this as an integer instead of a timedelta.
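For comparison, the same four steps can be sketched as a single pandas method chain. This is not the siuba statement from the post; the data is made up and the column names (date, period, start_time, end_time) are hypothetical. I express the total as whole minutes here, which is one possible choice of integer unit.

```python
import pandas as pd

# Made-up sessions, already labeled as day or night.
sessions = pd.DataFrame({
    "date": ["2020-06-01"] * 3 + ["2020-06-02"] * 2,
    "period": ["day", "day", "night", "day", "night"],
    "start_time": pd.to_datetime([
        "2020-06-01 09:00", "2020-06-01 13:00", "2020-06-01 20:00",
        "2020-06-02 10:00", "2020-06-02 20:00",
    ]),
    "end_time": pd.to_datetime([
        "2020-06-01 10:00", "2020-06-01 14:30", "2020-06-01 23:00",
        "2020-06-02 11:00", "2020-06-02 23:30",
    ]),
})

daily = (
    sessions
    # Add a column with each session's length.
    .assign(duration=lambda d: d["end_time"] - d["start_time"])
    # Group by day and by type of sleep.
    .groupby(["date", "period"], as_index=False)
    # Summarize each group as its total sleep.
    .agg(total_sleep=("duration", "sum"))
    # Express the timedelta as an integer number of minutes.
    .assign(minutes=lambda d: (d["total_sleep"].dt.total_seconds() // 60).astype(int))
)
```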
Now let’s plot it, using plotnine:
Here’s the resulting plot:
I love how easy it is to add a per-group trendline by simply adding + geom_smooth() to the plotting code.
Proportion Spent Sleeping
Now, for each day, I calculate the percent of total sleep that happens during the day or at night. Again, I can write this in one statement of chained functions:
I love getting the per-group summary of total sleep time and applying it to get within-group percentages. I’ve typically had to do this by doing some yucky joins with multiple dataframes in pandas (though I imagine there may be smoother ways now).
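One join-free way to do this in pandas is groupby(...).transform, which broadcasts a per-group total back onto every row. The sketch below uses made-up numbers and hypothetical column names, and it is not the siuba code from the post.

```python
import pandas as pd

# Made-up daily totals in minutes, one row per date and period.
daily = pd.DataFrame({
    "date": ["2020-06-01", "2020-06-01", "2020-06-02", "2020-06-02"],
    "period": ["day", "night", "day", "night"],
    "minutes": [150, 450, 120, 480],
})

proportions = daily.assign(
    # Total sleep per date, repeated on each row of that date.
    day_total=lambda d: d.groupby("date")["minutes"].transform("sum"),
    # Share of that date's sleep falling in this period.
    percent=lambda d: 100 * d["minutes"] / d["day_total"],
)
```

Each date's percentages add up to 100, which makes for a quick sanity check.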
Summary
Woof, this post was longer than I was hoping at the outset. For the Python programmers, hopefully you enjoyed this cursory glance into the world of R! For the R programmers, hopefully you saw how the tidyverse’s ideas are slowly moving into Python.
If you enjoyed this post, consider signing up for email updates in the menu on the left or following me on Twitter.