Download the full dataset from utccuip.com:
This block is turned off since it only needs to run once:
url <- "http://cscdc-2019.utccuip.com/CSCDC+Dataset.zip"
dest_file <- "01_data/CSCDC+Dataset.zip"
download.file(url, dest_file)
unzip(zipfile = dest_file, exdir = "01_data/CSCDC+Dataset/")
Load up a few libraries:
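Something like the following, inferred from the functions used later on (so treat the exact list as a best guess):

library(tidyverse)   # read_csv, dplyr, tidyr, ggplot2
library(lubridate)   # mdy_hm, floor_date
library(skimr)       # skim
library(knitr)       # kable
library(janitor)     # clean_names
library(GGally)      # ggpairs
library(ggthemes)    # scale_colour_calc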
Grab the air quality data, then give it a quick skim (from the skimr package):
air_quality <- read_csv("01_data/CSCDC+Dataset/air_quality_csv/all.csv")
## have to force the formatting on the timestamp.
air_quality %>%
mutate( `timestamp-iso` = mdy_hm(`timestamp-iso`)) ->
air_quality
skim(air_quality) %>% kable()
Skim summary statistics
n obs: 105726
n variables: 34
Variable type: character
variable | missing | complete | n | min | max | empty | n_unique |
---|---|---|---|---|---|---|---|
nicename | 0 | 105726 | 105726 | 11 | 12 | 0 | 7 |
Variable type: numeric
variable | missing | complete | n | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|---|
current_dewpoint_f | 0 | 105726 | 105726 | 60.74 | 7.61 | 32 | 56 | 63 | 67 | 74 | ▁▁▁▂▃▃▇▂ |
current_humidity | 0 | 105726 | 105726 | 48.99 | 14.48 | 11 | 37 | 49 | 62 | 80 | ▁▃▅▆▇▅▇▁ |
current_temp_f | 0 | 105726 | 105726 | 83.5 | 7.19 | 59 | 79 | 82 | 88 | 108 | ▁▁▃▇▅▃▂▁ |
lat | 0 | 105726 | 105726 | 35.04 | 0.0021 | 35.04 | 35.04 | 35.04 | 35.05 | 35.05 | ▃▁▅▅▁▅▃▇ |
lon | 0 | 105726 | 105726 | -85.3 | 0.0056 | -85.31 | -85.31 | -85.3 | -85.3 | -85.29 | ▇▃▅▁▅▅▁▃ |
p_0_3_um | 0 | 105726 | 105726 | 1879.3 | 1138.26 | 177.43 | 1081.73 | 1607.55 | 2337.86 | 16531.32 | ▇▂▁▁▁▁▁▁ |
p_0_3_um_b | 0 | 105726 | 105726 | 1883.18 | 1142.33 | 217.46 | 1090.62 | 1613.96 | 2324.64 | 15918.44 | ▇▂▁▁▁▁▁▁ |
p_0_5_um | 0 | 105726 | 105726 | 557.04 | 339.09 | 55.46 | 320.35 | 476.25 | 693.68 | 4845.58 | ▇▂▁▁▁▁▁▁ |
p_0_5_um_b | 0 | 105726 | 105726 | 560.96 | 343.38 | 67.89 | 325.13 | 480.71 | 691.95 | 4746.94 | ▇▂▁▁▁▁▁▁ |
p_1_0_um | 0 | 105726 | 105726 | 90.63 | 75.87 | 4.38 | 42.75 | 70.18 | 110.29 | 1424.29 | ▇▁▁▁▁▁▁▁ |
p_1_0_um_b | 0 | 105726 | 105726 | 89.78 | 74.71 | 4.37 | 42.92 | 69.86 | 108.27 | 1141.65 | ▇▁▁▁▁▁▁▁ |
p_10_0_um | 0 | 105726 | 105726 | 0.18 | 0.28 | 0 | 0 | 0.03 | 0.25 | 3.97 | ▇▁▁▁▁▁▁▁ |
p_10_0_um_b | 0 | 105726 | 105726 | 0.19 | 1.25 | 0 | 0 | 0 | 0.25 | 365.71 | ▇▁▁▁▁▁▁▁ |
p_2_5_um | 0 | 105726 | 105726 | 4.87 | 5.53 | 0 | 1.79 | 3.16 | 5.39 | 161.49 | ▇▁▁▁▁▁▁▁ |
p_2_5_um_b | 0 | 105726 | 105726 | 4.96 | 5.49 | 0 | 1.79 | 3.23 | 5.64 | 107.1 | ▇▁▁▁▁▁▁▁ |
p_5_0_um | 0 | 105726 | 105726 | 0.94 | 1.11 | 0 | 0.26 | 0.62 | 1.14 | 35.89 | ▇▁▁▁▁▁▁▁ |
p_5_0_um_b | 0 | 105726 | 105726 | 0.97 | 3.26 | 0 | 0.25 | 0.61 | 1.19 | 966.92 | ▇▁▁▁▁▁▁▁ |
pm1_0_atm | 0 | 105726 | 105726 | 10.34 | 6.86 | 0.01 | 5.53 | 8.73 | 13.17 | 101.68 | ▇▂▁▁▁▁▁▁ |
pm1_0_atm_b | 0 | 105726 | 105726 | 10.29 | 6.89 | 0.19 | 5.54 | 8.71 | 12.97 | 99.78 | ▇▂▁▁▁▁▁▁ |
pm1_0_cf_1 | 0 | 105726 | 105726 | 10.05 | 6.05 | 0.01 | 5.53 | 8.73 | 13.16 | 68.74 | ▇▆▂▁▁▁▁▁ |
pm1_0_cf_1_b | 0 | 105726 | 105726 | 9.97 | 6.01 | 0.19 | 5.54 | 8.71 | 12.95 | 67.19 | ▇▆▂▁▁▁▁▁ |
pm10_0_atm | 0 | 105726 | 105726 | 15.66 | 11.76 | 0.17 | 7.91 | 12.66 | 19.36 | 200.35 | ▇▁▁▁▁▁▁▁ |
pm10_0_atm_b | 0 | 105726 | 105726 | 15.54 | 11.62 | 0.66 | 7.93 | 12.61 | 19.1 | 144.72 | ▇▂▁▁▁▁▁▁ |
pm10_0_cf_1 | 0 | 105726 | 105726 | 15.48 | 11.12 | 0.17 | 7.91 | 12.66 | 19.36 | 135.6 | ▇▃▁▁▁▁▁▁ |
pm10_0_cf_1_b | 0 | 105726 | 105726 | 15.36 | 10.97 | 0.66 | 7.93 | 12.61 | 19.09 | 104.51 | ▇▅▁▁▁▁▁▁ |
pm2_5_atm | 0 | 105726 | 105726 | 15 | 11.13 | 0.1 | 7.59 | 12.17 | 18.61 | 175.4 | ▇▁▁▁▁▁▁▁ |
pm2_5_atm_b | 0 | 105726 | 105726 | 14.85 | 10.98 | 0.5 | 7.61 | 12.12 | 18.3 | 138.54 | ▇▂▁▁▁▁▁▁ |
pm2_5_cf_1 | 0 | 105726 | 105726 | 14.47 | 9.57 | 0.1 | 7.59 | 12.17 | 18.6 | 257.33 | ▇▁▁▁▁▁▁▁ |
pm2_5_cf_1_b | 0 | 105726 | 105726 | 14.33 | 9.5 | 0.5 | 7.61 | 12.12 | 18.29 | 513.31 | ▇▁▁▁▁▁▁▁ |
timestamp | 0 | 105726 | 105726 | 1.6e+12 | 5.2e+08 | 1.6e+12 | 1.6e+12 | 1.6e+12 | 1.6e+12 | 1.6e+12 | ▇▇▇▇▇▇▇▇ |
Unnamed: 0 | 0 | 105726 | 105726 | 52862.54 | 30520.68 | 0 | 26431.25 | 52862.5 | 79293.75 | 105784 | ▇▇▇▇▇▇▇▇ |
X1 | 0 | 105726 | 105726 | 52862.54 | 30520.68 | 0 | 26431.25 | 52862.5 | 79293.75 | 105784 | ▇▇▇▇▇▇▇▇ |
Variable type: POSIXct
variable | missing | complete | n | min | max | median | n_unique |
---|---|---|---|---|---|---|---|
timestamp-iso | 0 | 105726 | 105726 | 2019-06-01 | 2019-06-22 | 2019-06-11 | 30211 |
Let’s make these field names better so we don’t have to quote them later, using janitor::clean_names() to do the dirty work. We also have a bunch of fields with 2_5 in the name, so let’s look at those:
air_quality %>%
clean_names() %>%
select(
timestamp_iso,
nicename,
p_2_5_um,
p_2_5_um_b,
pm2_5_atm,
pm2_5_atm_b,
pm2_5_cf_1,
pm2_5_cf_1_b,
current_temp_f,
current_humidity
) ->
aq_limited
ggpairs(aq_limited %>% select(-timestamp_iso))
Let’s look at a few of these individually.
ggplot(aq_limited) +
aes(x = timestamp_iso, y = pm2_5_cf_1) +
geom_line(size = 1.92, colour = "#0c4c8a") +
labs(x = "Date & Time", y = "Sensor Level", title = "`pm2_5_cf_1` by Station") +
theme_minimal() +
facet_wrap(vars(nicename))
I don’t know what it means exactly, but I bet there’s hella traffic at Douglas compared to the other stations. The volatility there is off the hook (comparatively).
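Just to put a rough number on that, here’s a quick sketch of the per-station spread of pm2_5_cf_1 (using the aq_limited table from above):

# spread of pm2_5_cf_1 by station -- a bigger sd / IQR means more volatile
aq_limited %>%
  group_by(nicename) %>%
  summarize(sd_pm = sd(pm2_5_cf_1), iqr_pm = IQR(pm2_5_cf_1)) %>%
  arrange(desc(sd_pm))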
Let’s look at the same data layered on the same graph. I can’t tell if they move together… seems like they should to some degree:
ggplot(aq_limited) +
aes(x = timestamp_iso, y = pm2_5_cf_1, color = nicename) +
geom_line(size = 1.2) +
labs(x = "Date & Time", y = "Sensor Level", title = "`pm2_5_cf_1` by Station") +
theme_minimal() +
scale_colour_calc()
Well that’s sort of noisy. I wonder what the correlation is. Let’s reshape the data to look at correlations quickly.
Do a quick reshape of the data to get the stations in columns…
I discovered some time and location combinations in the data that have multiple readings. I would not expect this, but since it’s there, we’ll just average them. The data is also irregular, so we need to put this junk in buckets. I’m thinking maybe 10-minute buckets? Let’s see if we can do that. Feels like a job for grouping by a floor value; we can get the 10-minute interval floors from the lubridate package.
aq_limited %>%
select(timestamp_iso, nicename, pm2_5_cf_1) %>%
group_by(timestamp_iso, nicename) %>%
summarize(pm2_5_cf_1 = mean(pm2_5_cf_1)) %>%
group_by(timemark = floor_date(timestamp_iso, "10 mins"), nicename) %>%
select(-timestamp_iso) %>%
summarize_all(mean, na.rm = TRUE) %>%
arrange(timemark) ->
aq_limited_long
aq_limited_long %>%
spread(nicename, pm2_5_cf_1) ->
aq_limited_wide
kable(head(aq_limited_wide, 20))
timemark | mlk-central | mlk-douglas | mlk-georgia | mlk-houston | mlk-lindsay | mlk-magnolia | mlk-peeples |
---|---|---|---|---|---|---|---|
2019-06-01 04:00:00 | 14.074 | 14.810 | 14.252 | 15.104 | 16.370 | 13.198 | 16.104 |
2019-06-01 04:10:00 | 13.724 | 15.110 | 14.556 | 15.348 | 15.796 | 13.840 | 16.790 |
2019-06-01 04:20:00 | 16.012 | 15.262 | 14.818 | 15.348 | 15.882 | 14.568 | 18.332 |
2019-06-01 04:30:00 | 15.868 | 15.578 | 15.718 | 15.496 | 16.082 | 15.310 | 18.888 |
2019-06-01 04:40:00 | 16.512 | 16.026 | 14.790 | 16.282 | 15.912 | 14.280 | 17.936 |
2019-06-01 04:50:00 | 17.706 | 16.394 | 15.582 | 16.014 | 16.026 | 14.802 | 17.778 |
2019-06-01 05:00:00 | 16.854 | 16.340 | 15.510 | 16.010 | 16.014 | 14.686 | 18.460 |
2019-06-01 05:10:00 | 36.456 | 16.600 | 14.906 | 16.864 | 15.640 | 14.808 | 17.342 |
2019-06-01 05:20:00 | 16.050 | 16.954 | 15.640 | 16.988 | 16.924 | 14.804 | 17.684 |
2019-06-01 05:30:00 | 16.546 | 16.924 | 16.992 | 16.482 | 16.922 | 14.898 | 17.538 |
2019-06-01 05:40:00 | 15.988 | 16.918 | 16.716 | 17.112 | 17.318 | 15.062 | 17.538 |
2019-06-01 05:50:00 | 16.162 | 17.554 | 17.244 | 17.292 | 17.104 | 16.340 | 18.030 |
2019-06-01 06:00:00 | 16.534 | 17.048 | 16.186 | 17.166 | 16.830 | 15.722 | 18.078 |
2019-06-01 06:10:00 | 16.604 | 17.836 | 16.828 | 18.946 | 18.668 | 17.094 | 18.842 |
2019-06-01 06:20:00 | 17.944 | 17.924 | 16.218 | 18.114 | 17.426 | 16.522 | 18.946 |
2019-06-01 06:30:00 | 17.330 | 17.580 | 16.736 | 17.330 | 17.268 | 16.812 | 18.914 |
2019-06-01 06:40:00 | 18.152 | 17.498 | 17.354 | 17.716 | 17.496 | 17.272 | 19.134 |
2019-06-01 06:50:00 | 18.974 | 17.890 | 18.476 | 17.786 | 17.336 | 16.236 | 20.290 |
2019-06-01 07:00:00 | 19.712 | 18.052 | 18.446 | 18.516 | 17.868 | 16.192 | 19.876 |
2019-06-01 07:10:00 | 18.054 | 18.812 | 17.228 | 18.630 | 18.368 | 17.426 | 20.712 |
That looks better. Let’s see what’s going on here…
Let’s plot and look at the pairs plot with correlations:
aq_limited_wide %>%
ungroup() %>%
select(-timemark) %>%
ggpairs()
As I suspected, these are all pretty correlated. But that Douglas intersection though… it’s different. I suspect high traffic (again). It’s still correlated, but it doesn’t drop as low as quickly as the others… I wonder if there’s something dusty near there or if it just gets more traffic?
Since we’re slicing and dicing, let’s look at temp vs. p_2_5_um for each station:
ggplot(aq_limited) +
aes(x = current_temp_f, y = p_2_5_um) +
geom_point(size = 1.92, colour = "#0c4c8a") +
labs(x = "Temp (F)", y = "Sensor Level", title = "`p_2_5_um` vs Temp (F)") +
theme_minimal() +
facet_wrap(vars(nicename))
That’s not as insightful as I had hoped. Let’s look at humidity:
ggplot(aq_limited) +
aes(x = current_humidity, y = p_2_5_um) +
geom_point(size = 1.92, colour = "#0c4c8a") +
labs(x = "Absolute Humidity", y = "Sensor Level", title = "`p_2_5_um` vs Humidity") +
theme_minimal() +
facet_wrap(vars(nicename))
When I first got these data, I didn’t know where they were collected, but I suspected it was around Chattanooga. Let’s plot the locations and see:
library(ggmap)
library(sf)
library(mapview)
library(leaflet)
# grab the coordinates
air_quality %>%
group_by(nicename, lat, lon) %>%
summarize(obs_count = n()) ->
obs_location
# turn them into a spatial sf object for mapping
obs_location_sf <-
st_as_sf(obs_location,
coords = c("lon", "lat"),
crs = 4326)
# build the map
m1 <- mapview(obs_location_sf, legend = FALSE)
# add the names as labels
l1 <- addStaticLabels(
m1,
label = obs_location_sf$nicename,
textsize = "20px",
direction = 'left'
)
l1
Cool, they are all right up one strip in Chattanooga. I guess now we know what the MLK prefix in all the station names stands for. Looks like the stations are named after MLK and the cross street. Makes sense.
We should do something similar to the above with the video data. We’ll start with a skim:
video_events <- read_csv("01_data/CSCDC+Dataset/video_event_csv/all.csv")
skim(video_events) %>% kable()
Skim summary statistics
n obs: 4070343
n variables: 11
Variable type: character
variable | missing | complete | n | min | max | empty | n_unique |
---|---|---|---|---|---|---|---|
camera_id | 0 | 4070343 | 4070343 | 17 | 18 | 0 | 10 |
id | 0 | 4070343 | 4070343 | 36 | 36 | 0 | 4070342 |
label | 0 | 4070343 | 4070343 | 3 | 13 | 0 | 18 |
locations | 0 | 4070343 | 4070343 | 190 | 9897 | 0 | 4070342 |
pole_id | 0 | 4070343 | 4070343 | 13 | 14 | 0 | 10 |
Variable type: logical
variable | missing | complete | n | mean | count |
---|---|---|---|---|---|
intersection | 0 | 4070343 | 4070343 | 0.84 | TRU: 3405701, FAL: 664642, NA: 0 |
Variable type: numeric
variable | missing | complete | n | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|---|
hit_counts | 0 | 4070343 | 4070343 | 5.35 | 4.78 | 2 | 2 | 3 | 7 | 82 | ▇▁▁▁▁▁▁▁ |
timestamp | 0 | 4070343 | 4070343 | 1.6e+12 | 5.3e+08 | 1.6e+12 | 1.6e+12 | 1.6e+12 | 1.6e+12 | 1.6e+12 | ▅▇▇▆▇▆▆▇ |
Unnamed: 0 | 0 | 4070343 | 4070343 | 2e+06 | 1175006.96 | 0 | 1e+06 | 2e+06 | 3052756.5 | 4070342 | ▇▇▇▇▇▇▇▇ |
X1 | 0 | 4070343 | 4070343 | 2e+06 | 1175006.96 | 0 | 1e+06 | 2e+06 | 3052756.5 | 4070342 | ▇▇▇▇▇▇▇▇ |
Variable type: POSIXct
variable | missing | complete | n | min | max | median | n_unique |
---|---|---|---|---|---|---|---|
timestamp-iso | 0 | 4070343 | 4070343 | 2019-06-01 | 2019-06-22 | 2019-06-11 | 3876225 |
head(video_events)
## # A tibble: 6 x 11
## X1 `Unnamed: 0` camera_id hit_counts id intersection label
## <dbl> <dbl> <chr> <dbl> <chr> <lgl> <chr>
## 1 0 0 mlk-cent… 7 fb33… TRUE car
## 2 1 1 mlk-cent… 2 12a6… TRUE car
## 3 2 2 mlk-cent… 2 8430… TRUE car
## 4 3 3 mlk-cent… 14 d5c5… TRUE car
## 5 4 4 mlk-cent… 8 8328… TRUE car
## 6 5 5 mlk-cent… 6 7bc4… TRUE car
## # … with 4 more variables: locations <chr>, pole_id <chr>,
## # timestamp <dbl>, `timestamp-iso` <dttm>
So there’s a lot going on in there… let’s strip out some unique values for a few of the fields so we can see what we’re dealing with.
So there’s an id field that has a unique identifier (it looks like a UUID) for each vehicle… let’s check that out:
video_events %>%
select(id) %>%
distinct(id) %>%
summarize(n())
## # A tibble: 1 x 1
## `n()`
## <int>
## 1 4070342
So there are a little over 4m unique vehicle ids. Interesting.
I wonder if any ids are in there multiple times.
video_events %>%
group_by(id) %>%
summarize(rec_count = n()) %>%
filter(rec_count > 1) ->
has_dupe
has_dupe
## # A tibble: 1 x 2
## id rec_count
## <chr> <int>
## 1 b3cb7029-86ed-47e0-a634-7ae1ba9d3076 2
One… weird. I wonder if that’s a random UUID collision (unlikely).
video_events %>%
inner_join(has_dupe) ->
double_up
## Joining, by = "id"
double_up
## # A tibble: 2 x 12
## X1 `Unnamed: 0` camera_id hit_counts id intersection label
## <dbl> <dbl> <chr> <dbl> <chr> <lgl> <chr>
## 1 4.03e6 4029658 mlk-cent… 8 b3cb… TRUE car
## 2 4.03e6 4032606 mlk-cent… 8 b3cb… TRUE car
## # … with 5 more variables: locations <chr>, pole_id <chr>,
## # timestamp <dbl>, `timestamp-iso` <dttm>, rec_count <int>
That’s just straight up a dupe record. Well, it probably won’t cause us any trouble… but we’ll kick one out manually for OCD reasons.
double_up %>%
summarize(max(X1)) %>%
pluck(1) ->
bad_point
video_events %>%
filter(X1 != bad_point) ->
video_events_nodupes
Let’s look at the cameras and see what’s going on…
video_events_nodupes %>%
select(camera_id) %>%
arrange(camera_id) %>%
distinct
## # A tibble: 10 x 1
## camera_id
## <chr>
## 1 mlk-central-cam-2
## 2 mlk-douglas-cam-1
## 3 mlk-douglas-cam-2
## 4 mlk-georgia-cam-1
## 5 mlk-georgia-cam-3
## 6 mlk-houston-cam-1
## 7 mlk-lindsay-cam-2
## 8 mlk-magnolia-cam-2
## 9 mlk-peeples-cam-1
## 10 mlk-peeples-cam-3
Oh… some intersections have two cameras, but some only have one. Interesting.
I think for now we just want data from one camera per intersection… so let’s pick the camera with the most traffic at each one, and also adjust the names so we can use them later to match to the PM data:
video_events_nodupes %>%
group_by(camera_id) %>%
summarize(rec_count = n()) %>%
separate(camera_id, into = c("street1", "street2", "street3", "street4"), sep="-", remove = FALSE) %>%
mutate(nicename = paste(street1, street2, sep = "-")) %>%
select(nicename, camera_id, rec_count) %>%
group_by(nicename) %>%
top_n(1, rec_count) %>%
ungroup %>%
{.} ->
cameras
cameras
## # A tibble: 7 x 3
## nicename camera_id rec_count
## <chr> <chr> <int>
## 1 mlk-central mlk-central-cam-2 880557
## 2 mlk-douglas mlk-douglas-cam-1 287188
## 3 mlk-georgia mlk-georgia-cam-3 623252
## 4 mlk-houston mlk-houston-cam-1 568195
## 5 mlk-lindsay mlk-lindsay-cam-2 490929
## 6 mlk-magnolia mlk-magnolia-cam-2 188323
## 7 mlk-peeples mlk-peeples-cam-3 448677
Join that back to the video data to get just one cam per intersection:
video_events_nodupes %>%
inner_join(cameras %>% select(camera_id, nicename)) ->
video_events_nodupes_1per
## Joining, by = "camera_id"
In our initial skim we saw a field called label that has 18 unique values… let’s look at that:
video_events_nodupes_1per %>%
group_by(label) %>%
summarize(row_count = n()) %>%
arrange(row_count)
## # A tibble: 17 x 2
## label row_count
## <chr> <int>
## 1 stop sign 3
## 2 parking meter 23
## 3 suitcase 28
## 4 aeroplane 33
## 5 dog 44
## 6 handbag 57
## 7 bird 79
## 8 boat 81
## 9 backpack 178
## 10 umbrella 255
## 11 fire hydrant 951
## 12 bicycle 3690
## 13 motorbike 6509
## 14 bus 17207
## 15 truck 105490
## 16 person 126431
## 17 car 3226062
Oh… that’s a lot of stuff.
While I’m curious about the 83 boats, 33 airplanes, and 1 cat that wandered down MLK, I’m going to assume that’s misclassification noise. So I’m going to keep only the vehicles:
video_events_nodupes_1per %>%
filter(label %in% c("car","bus","truck", "motorbike")) %>%
clean_names() %>%
{.} ->
video_events_nodupes_1per_filtered
So let’s group these data into 10-minute buckets the way we did with the PM data:
video_events_nodupes_1per_filtered %>%
group_by(timemark = floor_date(timestamp_iso, "10 mins"), nicename, label) %>%
summarize(vehicle_count = n()) ->
video_events_summarized_long
head(video_events_summarized_long)
## # A tibble: 6 x 4
## # Groups: timemark, nicename [4]
## timemark nicename label vehicle_count
## <dttm> <chr> <chr> <int>
## 1 2019-06-01 04:00:00 mlk-central car 154
## 2 2019-06-01 04:00:00 mlk-central truck 3
## 3 2019-06-01 04:00:00 mlk-douglas car 29
## 4 2019-06-01 04:00:00 mlk-georgia car 169
## 5 2019-06-01 04:00:00 mlk-georgia truck 5
## 6 2019-06-01 04:00:00 mlk-houston bus 1
That’s cool, but let’s make it wide instead of long:
video_events_summarized_long %>%
group_by(nicename, label) %>%
spread(label, vehicle_count) %>%
replace(is.na(.), 0) ->
video_events_summarized_wide
head(video_events_summarized_wide)
## # A tibble: 6 x 6
## # Groups: nicename [6]
## timemark nicename bus car motorbike truck
## <dttm> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2019-06-01 04:00:00 mlk-central 0 154 0 3
## 2 2019-06-01 04:00:00 mlk-douglas 0 29 0 0
## 3 2019-06-01 04:00:00 mlk-georgia 0 169 0 5
## 4 2019-06-01 04:00:00 mlk-houston 1 151 1 1
## 5 2019-06-01 04:00:00 mlk-lindsay 0 145 0 1
## 6 2019-06-01 04:00:00 mlk-magnolia 0 22 0 0
Let’s join the ‘wide’ video data to the pollution data.
aq_limited_long %>%
inner_join(video_events_summarized_wide) %>%
rowwise() %>%
mutate(total_traffic = sum(bus, car, motorbike, truck) ) ->
pm_traffic
## Joining, by = c("timemark", "nicename")
Do a quick time plot of each intersection (again):
ggplot(pm_traffic) +
aes(x = timemark, y = pm2_5_cf_1) +
geom_line(size = 1, colour = "#0c4c8a") +
labs(x = "Date & Time", y = "Sensor Level", title = "`pm2_5_cf_1` by Station") +
theme_minimal() +
facet_wrap(vars(nicename))
And traffic:
ggplot(pm_traffic) +
aes(x = timemark, y = total_traffic) +
geom_line(size = 1, colour = "#800000") +
labs(x = "Date & Time", y = "Total Vehicles", title = "Total Vehicles by Station") +
theme_minimal() +
facet_wrap(vars(nicename))
Well that’s some regular looking cycles…
Let’s build a super simple linear model:
library(parsnip)
library(broom)
lm(data = pm_traffic, pm2_5_cf_1 ~ bus + car + motorbike + truck) ->
linear_model
summary(linear_model)
##
## Call:
## lm(formula = pm2_5_cf_1 ~ bus + car + motorbike + truck, data = pm_traffic)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.223 -6.621 -2.291 4.122 48.312
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.5887601 0.0993488 156.909 < 2e-16 ***
## bus -0.0301903 0.0555936 -0.543 0.587098
## car -0.0067491 0.0007974 -8.464 < 2e-16 ***
## motorbike -0.1945797 0.0527587 -3.688 0.000226 ***
## truck -0.0022071 0.0147943 -0.149 0.881410
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.412 on 21095 degrees of freedom
## Multiple R-squared: 0.01232, Adjusted R-squared: 0.01213
## F-statistic: 65.79 on 4 and 21095 DF, p-value: < 2.2e-16
So right out of the gate we have issues… the signs on the coefficients are not at all as expected. We would expect more traffic = more particulates… but this model says the opposite. That’s a huge warning sign.
Let’s plot the actual vs. modeled:
pm_traffic %>%
bind_cols(tibble(.pred = predict(linear_model, newdata = pm_traffic))) ->
predicted_lm
ggplot(predicted_lm) +
aes(x = pm2_5_cf_1, y = .pred) +
geom_line(size = 1, colour = "#228B22") +
labs(x = "pm2_5_cf_1", y = "Predicted pm2_5_cf_1", title = "Modeled vs. Actual") +
theme_minimal() +
facet_wrap(vars(nicename))
Well that’s amazingly shitty.
The OLS never predicts under 15 (the intercept), but we have lots of data under 15. That’s weird.
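A quick way to sanity-check that is to compare the spread of the predictions against the actuals (just a sketch on the predicted_lm tibble from above):

# compare the spread of the predictions vs. the actual pm2_5_cf_1 values
summary(predicted_lm$.pred)
summary(predicted_lm$pm2_5_cf_1)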
Let’s look at a quick correlation matrix:
pm_traffic %>%
select(pm2_5_cf_1, bus, car, motorbike, truck, total_traffic) %>%
cor()
Weird… mos def negative correlations.
Maybe if we break it into intersections and just look at the total traffic correlation to PM:
pm_traffic %>%
group_by(nicename) %>%
summarize(correl = cor(pm2_5_cf_1, total_traffic))
Total traffic is always just a little bit negatively correlated…
So clearly something else is going on… most likely the effect is lagged. We can drill into one station and see what’s going on. We’ll normalize so we can plot the two series together:
library(forecast)
## Registered S3 method overwritten by 'xts':
## method from
## as.zoo.xts zoo
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
## Registered S3 methods overwritten by 'forecast':
## method from
## fitted.fracdiff fracdiff
## residuals.fracdiff fracdiff
pm_traffic %>%
filter(nicename == "mlk-central") %>%
ungroup() %>%
na.omit() %>%
mutate_at(vars(pm2_5_cf_1, total_traffic), ~ as.numeric(scale(.))) ->
single_intersection
ggplot(single_intersection) +
geom_line(size = 1,
colour = "#0c4c8a",
aes(x = timemark, y = pm2_5_cf_1)) +
geom_line(size = 1,
colour = "red",
aes(x = timemark, y = total_traffic)) +
labs(x = "Date & Time", y = "Scaled Values", title = "Traffic and PM") +
theme_minimal() +
facet_wrap(vars(nicename))
Oh… look at that… there’s pretty clearly a lag. So in order to get anything out of the modeling, we need to account for the lag.
Let’s work this as a time series decomposition. We’ll lag the PM data 6 hours.
library(tsbox)
##
## Attaching package: 'tsbox'
## The following objects are masked from 'package:skimr':
##
## ts_end, ts_start
library(xts)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## Attaching package: 'xts'
## The following object is masked from 'package:leaflet':
##
## addLegend
## The following objects are masked from 'package:dplyr':
##
## first, last
library(tidyquant)
## Loading required package: PerformanceAnalytics
##
## Attaching package: 'PerformanceAnalytics'
## The following object is masked from 'package:graphics':
##
## legend
## Loading required package: quantmod
## Loading required package: TTR
## Version 0.4-0 included new data defaults. See ?getSymbols.
single_intersection %>%
select(timemark, pm2_5_cf_1, total_traffic) ->
subset_data
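# note: this xts version of the series ends up unused -- the ts object built just below overwrites it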
xts((subset_data %>% select(-timemark)),
order.by = subset_data$timemark,
frequency = 6 * 24 * 7) ->
single_intersect_ts
single_intersection %>%
select(timemark, pm2_5_cf_1, total_traffic) %>%
ts_long %>%
ts_ts %>%
ts(frequency = 6 * 24 * 7) ->
single_intersect_ts
## [time]: 'timemark'
## [time]: 'timemark'
single_intersect_ts[, "total_traffic"] %>%
decompose(type = "additive") %>%
autoplot
single_intersect_ts[, "pm2_5_cf_1"] %>%
decompose(type = "additive") %>%
autoplot
cor(single_intersect_ts[, "total_traffic"], single_intersect_ts[, "pm2_5_cf_1"])
## [1] -0.1563726
# cross correlation
ccf(single_intersect_ts[, "total_traffic"], single_intersect_ts[, "pm2_5_cf_1"], lag.max = 70)
OK, so we’re seeing some lagged correlation. It peaks at about 0.06 of a frequency unit… our frequency is 6 * 24 * 7 = 1008 observations, so the lag is roughly 1008 * 0.06 ≈ 60 observations, i.e. about 10 hours at 10-minute intervals.
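If we want to pull that peak lag out of the ccf object programmatically, a rough sketch:

# find the lag with the strongest (absolute) cross-correlation
xc <- ccf(single_intersect_ts[, "total_traffic"],
          single_intersect_ts[, "pm2_5_cf_1"],
          lag.max = 70, plot = FALSE)
best_lag <- xc$lag[which.max(abs(xc$acf))]  # in frequency units
best_lag * 6 * 24 * 7                       # in 10-minute observations
best_lag * 6 * 24 * 7 / 6                   # roughly hours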
I need to make sense of this lagged effect and I’ve clearly forgotten all my time series heuristics… Time to get Rob Hyndman’s book out and read up… also need to check those time zones, as something seems off…
Let’s bring in some weather data. I ran this previously and it takes a few minutes so I saved the results to a csv:
library(rWind)
# chattanooga is 35.0456° N, 85.3097° W
dt <- seq(ymd_hms(paste(2019, 6, 01, 00, 00, 00, sep = "-")),
ymd_hms(paste(2019, 7, 01, 00, 00, 00, sep = "-")), by = "3 hours")
ww <- wind.dl_2(dt, -85.3097,-85.3097, 35.0456, 35.0456)
tidy(ww) %>%
write_csv("01_data/wind_data.csv")
Read in that CSV:
wind_data <- read_csv("01_data/wind_data.csv")
## Parsed with column specification:
## cols(
## time = col_datetime(format = ""),
## lat = col_double(),
## lon = col_double(),
## ugrd10m = col_double(),
## vgrd10m = col_double(),
## dir = col_double(),
## speed = col_double()
## )
So we have a known date issue… not 100% sure what time zone the traffic data is in… assuming UTC, but I bet it’s Eastern. We’ll have to fix this later.
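When we do, it’ll probably be a lubridate call along these lines (a sketch, assuming the stamps were parsed as UTC but really represent Eastern local time; pm_traffic_est is just a placeholder name):

# hypothetical fix: keep the clock readings but re-label them as US Eastern
pm_traffic %>%
  mutate(timemark = force_tz(timemark, tzone = "America/New_York")) ->
pm_traffic_est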
Now we need to join every intersection to the wind data.
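Since the wind readings are 3-hourly at a single point, joining on a floored timestamp seems like the way to go. A rough sketch (the flooring choice and the pm_traffic_wind name are placeholders, and this ignores the time zone issue above):

# floor each 10-minute timemark to its covering 3-hour window, then join
pm_traffic %>%
  ungroup() %>%
  mutate(wind_time = floor_date(timemark, "3 hours")) %>%
  left_join(wind_data %>% rename(wind_time = time), by = "wind_time") ->
pm_traffic_wind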