Fetch Data

Download the full dataset from utccuip.com:

This block is turned off as it only needs to run once:

url <-"http://cscdc-2019.utccuip.com/CSCDC+Dataset.zip"
dest_file <- "01_data/CSCDC+Dataset.zip"

download.file(url, dest_file)
unzip(zipfile = dest_file, exdir = "01_data/CSCDC+Dataset/")

Load up a few libraries
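The original library chunk isn't shown, so here's an assumed set, inferred from the functions called later in this post (packages loaded explicitly further down, like sf and forecast, are left out):

library(tidyverse)  # read_csv plus dplyr/tidyr/ggplot2
library(lubridate)  # mdy_hm, floor_date
library(janitor)    # clean_names
library(skimr)      # skim
library(knitr)      # kable
library(GGally)     # ggpairs
library(ggthemes)   # scale_colour_calc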

Air Quality Data

Grab the air quality data then give it a quick skim (from the skimr package):

Air Quality Skim

air_quality <- read_csv("01_data/CSCDC+Dataset/air_quality_csv/all.csv")

## have to force the formatting on the timestamp. 
air_quality %>%
  mutate( `timestamp-iso` = mdy_hm(`timestamp-iso`)) ->
  air_quality
  
skim(air_quality)  %>% kable()

Skim summary statistics
n obs: 105726
n variables: 34

Variable type: character

variable missing complete n min max empty n_unique
nicename 0 105726 105726 11 12 0 7

Variable type: numeric

variable missing complete n mean sd p0 p25 p50 p75 p100 hist
current_dewpoint_f 0 105726 105726 60.74 7.61 32 56 63 67 74 ▁▁▁▂▃▃▇▂
current_humidity 0 105726 105726 48.99 14.48 11 37 49 62 80 ▁▃▅▆▇▅▇▁
current_temp_f 0 105726 105726 83.5 7.19 59 79 82 88 108 ▁▁▃▇▅▃▂▁
lat 0 105726 105726 35.04 0.0021 35.04 35.04 35.04 35.05 35.05 ▃▁▅▅▁▅▃▇
lon 0 105726 105726 -85.3 0.0056 -85.31 -85.31 -85.3 -85.3 -85.29 ▇▃▅▁▅▅▁▃
p_0_3_um 0 105726 105726 1879.3 1138.26 177.43 1081.73 1607.55 2337.86 16531.32 ▇▂▁▁▁▁▁▁
p_0_3_um_b 0 105726 105726 1883.18 1142.33 217.46 1090.62 1613.96 2324.64 15918.44 ▇▂▁▁▁▁▁▁
p_0_5_um 0 105726 105726 557.04 339.09 55.46 320.35 476.25 693.68 4845.58 ▇▂▁▁▁▁▁▁
p_0_5_um_b 0 105726 105726 560.96 343.38 67.89 325.13 480.71 691.95 4746.94 ▇▂▁▁▁▁▁▁
p_1_0_um 0 105726 105726 90.63 75.87 4.38 42.75 70.18 110.29 1424.29 ▇▁▁▁▁▁▁▁
p_1_0_um_b 0 105726 105726 89.78 74.71 4.37 42.92 69.86 108.27 1141.65 ▇▁▁▁▁▁▁▁
p_10_0_um 0 105726 105726 0.18 0.28 0 0 0.03 0.25 3.97 ▇▁▁▁▁▁▁▁
p_10_0_um_b 0 105726 105726 0.19 1.25 0 0 0 0.25 365.71 ▇▁▁▁▁▁▁▁
p_2_5_um 0 105726 105726 4.87 5.53 0 1.79 3.16 5.39 161.49 ▇▁▁▁▁▁▁▁
p_2_5_um_b 0 105726 105726 4.96 5.49 0 1.79 3.23 5.64 107.1 ▇▁▁▁▁▁▁▁
p_5_0_um 0 105726 105726 0.94 1.11 0 0.26 0.62 1.14 35.89 ▇▁▁▁▁▁▁▁
p_5_0_um_b 0 105726 105726 0.97 3.26 0 0.25 0.61 1.19 966.92 ▇▁▁▁▁▁▁▁
pm1_0_atm 0 105726 105726 10.34 6.86 0.01 5.53 8.73 13.17 101.68 ▇▂▁▁▁▁▁▁
pm1_0_atm_b 0 105726 105726 10.29 6.89 0.19 5.54 8.71 12.97 99.78 ▇▂▁▁▁▁▁▁
pm1_0_cf_1 0 105726 105726 10.05 6.05 0.01 5.53 8.73 13.16 68.74 ▇▆▂▁▁▁▁▁
pm1_0_cf_1_b 0 105726 105726 9.97 6.01 0.19 5.54 8.71 12.95 67.19 ▇▆▂▁▁▁▁▁
pm10_0_atm 0 105726 105726 15.66 11.76 0.17 7.91 12.66 19.36 200.35 ▇▁▁▁▁▁▁▁
pm10_0_atm_b 0 105726 105726 15.54 11.62 0.66 7.93 12.61 19.1 144.72 ▇▂▁▁▁▁▁▁
pm10_0_cf_1 0 105726 105726 15.48 11.12 0.17 7.91 12.66 19.36 135.6 ▇▃▁▁▁▁▁▁
pm10_0_cf_1_b 0 105726 105726 15.36 10.97 0.66 7.93 12.61 19.09 104.51 ▇▅▁▁▁▁▁▁
pm2_5_atm 0 105726 105726 15 11.13 0.1 7.59 12.17 18.61 175.4 ▇▁▁▁▁▁▁▁
pm2_5_atm_b 0 105726 105726 14.85 10.98 0.5 7.61 12.12 18.3 138.54 ▇▂▁▁▁▁▁▁
pm2_5_cf_1 0 105726 105726 14.47 9.57 0.1 7.59 12.17 18.6 257.33 ▇▁▁▁▁▁▁▁
pm2_5_cf_1_b 0 105726 105726 14.33 9.5 0.5 7.61 12.12 18.29 513.31 ▇▁▁▁▁▁▁▁
timestamp 0 105726 105726 1.6e+12 5.2e+08 1.6e+12 1.6e+12 1.6e+12 1.6e+12 1.6e+12 ▇▇▇▇▇▇▇▇
Unnamed: 0 0 105726 105726 52862.54 30520.68 0 26431.25 52862.5 79293.75 105784 ▇▇▇▇▇▇▇▇
X1 0 105726 105726 52862.54 30520.68 0 26431.25 52862.5 79293.75 105784 ▇▇▇▇▇▇▇▇

Variable type: POSIXct

variable missing complete n min max median n_unique
timestamp-iso 0 105726 105726 2019-06-01 2019-06-22 2019-06-11 30211

Let’s make these field names better so we don’t have to backtick-quote them later, using janitor::clean_names() to do the dirty work. We also have a bunch of fields with 2_5 in the name, so let’s look at those.

air_quality %>%
  clean_names() %>%
  select(
    timestamp_iso,
    nicename,
    p_2_5_um,
    p_2_5_um_b,
    pm2_5_atm,
    pm2_5_atm_b,
    pm2_5_cf_1,
    pm2_5_cf_1_b,
    current_temp_f,
    current_humidity
  ) ->
  aq_limited

ggpairs(aq_limited %>% select(-timestamp_iso))

Air Quality Detailed Plots

Let’s look at a few of these individually.

ggplot(aq_limited) +
 aes(x = timestamp_iso, y = pm2_5_cf_1) +
 geom_line(size = 1.92, colour = "#0c4c8a") +
 labs(x = "Date & Time", y = "Sensor Level", title = "`pm2_5_cf_1` by Station") +
 theme_minimal() +
 facet_wrap(vars(nicename))

I don’t know what it means exactly, but I bet there’s hella traffic at Douglas compared to the other stations. The volatility there is off the hook (comparatively).

Let’s look at the same data layered on the same graph. I can’t tell if they move together… seems like they should to some degree:

ggplot(aq_limited) +
 aes(x = timestamp_iso, y = pm2_5_cf_1, color = nicename) +
 geom_line(size = 1.2) +
 labs(x = "Date & Time", y = "Sensor Level", title = "`pm2_5_cf_1` by Station") +
 theme_minimal() +
 scale_colour_calc()

Well that’s sort of noisy. I wonder what the correlation is. Let’s reshape the data to look at correlations quickly.

Do a quick reshape of the data to get the stations in columns…

I discovered that some time and location combinations in the data have multiple readings. I wouldn’t expect this, but since it’s there, we’ll just average them. The data is also irregular, so we need to put these readings into buckets. I’m thinking maybe 10-minute buckets? Let’s see if we can do that. Feels like a job for grouping by a floor value; we can get the 10-minute interval floors from the lubridate package.

aq_limited %>%
  select(timestamp_iso, nicename, pm2_5_cf_1) %>%
  group_by(timestamp_iso, nicename) %>%
  summarize(pm2_5_cf_1 = mean(pm2_5_cf_1))  %>%
  group_by(timemark = floor_date(timestamp_iso, "10 mins"), nicename) %>%
  select(-timestamp_iso) %>%
  summarize_all(mean, na.rm = TRUE) %>%
  arrange(timemark) ->
  aq_limited_long

aq_limited_long %>%
  spread(nicename, pm2_5_cf_1) ->
  aq_limited_wide

kable(head(aq_limited_wide, 20))
timemark mlk-central mlk-douglas mlk-georgia mlk-houston mlk-lindsay mlk-magnolia mlk-peeples
2019-06-01 04:00:00 14.074 14.810 14.252 15.104 16.370 13.198 16.104
2019-06-01 04:10:00 13.724 15.110 14.556 15.348 15.796 13.840 16.790
2019-06-01 04:20:00 16.012 15.262 14.818 15.348 15.882 14.568 18.332
2019-06-01 04:30:00 15.868 15.578 15.718 15.496 16.082 15.310 18.888
2019-06-01 04:40:00 16.512 16.026 14.790 16.282 15.912 14.280 17.936
2019-06-01 04:50:00 17.706 16.394 15.582 16.014 16.026 14.802 17.778
2019-06-01 05:00:00 16.854 16.340 15.510 16.010 16.014 14.686 18.460
2019-06-01 05:10:00 36.456 16.600 14.906 16.864 15.640 14.808 17.342
2019-06-01 05:20:00 16.050 16.954 15.640 16.988 16.924 14.804 17.684
2019-06-01 05:30:00 16.546 16.924 16.992 16.482 16.922 14.898 17.538
2019-06-01 05:40:00 15.988 16.918 16.716 17.112 17.318 15.062 17.538
2019-06-01 05:50:00 16.162 17.554 17.244 17.292 17.104 16.340 18.030
2019-06-01 06:00:00 16.534 17.048 16.186 17.166 16.830 15.722 18.078
2019-06-01 06:10:00 16.604 17.836 16.828 18.946 18.668 17.094 18.842
2019-06-01 06:20:00 17.944 17.924 16.218 18.114 17.426 16.522 18.946
2019-06-01 06:30:00 17.330 17.580 16.736 17.330 17.268 16.812 18.914
2019-06-01 06:40:00 18.152 17.498 17.354 17.716 17.496 17.272 19.134
2019-06-01 06:50:00 18.974 17.890 18.476 17.786 17.336 16.236 20.290
2019-06-01 07:00:00 19.712 18.052 18.446 18.516 17.868 16.192 19.876
2019-06-01 07:10:00 18.054 18.812 17.228 18.630 18.368 17.426 20.712

That looks better. Let’s see what’s going on here…

Let’s plot and look at the pairs plot with correlations:

aq_limited_wide %>%
  select(-timemark) %>%
  ggpairs()
## Adding missing grouping variables: `timemark`

As I would suspect, these are all pretty correlated. But that Douglas intersection though… it’s different. I suspect high traffic (again). Still correlated but it doesn’t drop as low as quickly as the others… I wonder if there’s something dusty near there or if it just gets more traffic?

Since we’re slicing and dicing, let’s look at temp vs. p_2_5_um for each station:

ggplot(aq_limited) +
 aes(x = current_temp_f, y = p_2_5_um) +
 geom_point(size = 1.92, colour = "#0c4c8a") +
 labs(x = "Temp (F)", y = "Sensor Level", title = "`p_2_5_um` vs Temp (F)") +
 theme_minimal() +
 facet_wrap(vars(nicename))

That’s not as insightful as I had hoped. Let’s look at humidity:

ggplot(aq_limited) +
 aes(x = current_humidity, y = p_2_5_um) +
 geom_point(size = 1.92, colour = "#0c4c8a") +
 labs(x = "Absolute Humidity", y = "Sensor Level", title = "`p_2_5_um` vs Humidity") +
 theme_minimal() +
 facet_wrap(vars(nicename))

Air Quality Station Map

When I first got these data I didn’t know where they were collected, but I suspected it was around Chattanooga. Let’s plot the locations and see:

library(ggmap)
library(sf)
library(mapview)
library(leaflet)
# grab the coordinates
air_quality %>%
  group_by(nicename, lat, lon) %>%
  summarize(obs_count = n()) ->
  obs_location

# turn them into a spatial sf object for mapping
obs_location_sf <-
  st_as_sf(obs_location,
           coords = c("lon", "lat"),
           crs = 4326)

# build the map
m1 <- mapview(obs_location_sf, legend = FALSE)

# add the names as labels
l1  <- addStaticLabels(
  m1,
  label = obs_location_sf$nicename,
  textsize = "20px",
  direction = 'left'
)

l1

Cool, they are all right up one strip in Chattanooga. I guess now we know what the MLK prefix in all the station names stands for. Looks like the stations are named after MLK and the cross street. Makes sense.

Video Events

We should do something similar to the above with the video data.

We’ll start with a skim:

Video Skim

video_events <- read_csv("01_data/CSCDC+Dataset/video_event_csv/all.csv")
skim(video_events)  %>% kable()

Skim summary statistics
n obs: 4070343
n variables: 11

Variable type: character

variable missing complete n min max empty n_unique
camera_id 0 4070343 4070343 17 18 0 10
id 0 4070343 4070343 36 36 0 4070342
label 0 4070343 4070343 3 13 0 18
locations 0 4070343 4070343 190 9897 0 4070342
pole_id 0 4070343 4070343 13 14 0 10

Variable type: logical

variable missing complete n mean count
intersection 0 4070343 4070343 0.84 TRU: 3405701, FAL: 664642, NA: 0

Variable type: numeric

variable missing complete n mean sd p0 p25 p50 p75 p100 hist
hit_counts 0 4070343 4070343 5.35 4.78 2 2 3 7 82 ▇▁▁▁▁▁▁▁
timestamp 0 4070343 4070343 1.6e+12 5.3e+08 1.6e+12 1.6e+12 1.6e+12 1.6e+12 1.6e+12 ▅▇▇▆▇▆▆▇
Unnamed: 0 0 4070343 4070343 2e+06 1175006.96 0 1e+06 2e+06 3052756.5 4070342 ▇▇▇▇▇▇▇▇
X1 0 4070343 4070343 2e+06 1175006.96 0 1e+06 2e+06 3052756.5 4070342 ▇▇▇▇▇▇▇▇

Variable type: POSIXct

variable missing complete n min max median n_unique
timestamp-iso 0 4070343 4070343 2019-06-01 2019-06-22 2019-06-11 3876225
head(video_events)
## # A tibble: 6 x 11
##      X1 `Unnamed: 0` camera_id hit_counts id    intersection label
##   <dbl>        <dbl> <chr>          <dbl> <chr> <lgl>        <chr>
## 1     0            0 mlk-cent…          7 fb33… TRUE         car  
## 2     1            1 mlk-cent…          2 12a6… TRUE         car  
## 3     2            2 mlk-cent…          2 8430… TRUE         car  
## 4     3            3 mlk-cent…         14 d5c5… TRUE         car  
## 5     4            4 mlk-cent…          8 8328… TRUE         car  
## 6     5            5 mlk-cent…          6 7bc4… TRUE         car  
## # … with 4 more variables: locations <chr>, pole_id <chr>,
## #   timestamp <dbl>, `timestamp-iso` <dttm>

So there’s a lot going on in there… let’s strip out some unique values for a few of the fields so we can see what we’re dealing with.

So there’s an id field that has what looks like a UUID for each vehicle… let’s check that out.

video_events %>%
  select(id) %>%
  distinct(id) %>%
  summarize(n())
## # A tibble: 1 x 1
##     `n()`
##     <int>
## 1 4070342

So there are a little over 4m unique vehicle ids. Interesting.

I wonder if any ids are in there multiple times.

video_events %>%
  group_by(id) %>%
  summarize(rec_count = n()) %>%
  filter(rec_count > 1) -> 
  has_dupe

has_dupe
## # A tibble: 1 x 2
##   id                                   rec_count
##   <chr>                                    <int>
## 1 b3cb7029-86ed-47e0-a634-7ae1ba9d3076         2

One… weird. I wonder if that’s a random UUID collision (unlikely).

video_events %>%
  inner_join(has_dupe) ->
  double_up
## Joining, by = "id"
double_up
## # A tibble: 2 x 12
##       X1 `Unnamed: 0` camera_id hit_counts id    intersection label
##    <dbl>        <dbl> <chr>          <dbl> <chr> <lgl>        <chr>
## 1 4.03e6      4029658 mlk-cent…          8 b3cb… TRUE         car  
## 2 4.03e6      4032606 mlk-cent…          8 b3cb… TRUE         car  
## # … with 5 more variables: locations <chr>, pole_id <chr>,
## #   timestamp <dbl>, `timestamp-iso` <dttm>, rec_count <int>

That’s just straight up a dupe record. Well, it probably won’t cause us any trouble… but we’ll kick one out manually for OCD reasons.

double_up %>%
  summarize(max(X1)) %>%
  pluck(1) ->
  bad_point

video_events %>%
  filter(X1 != bad_point) ->
  video_events_nodupes

Let’s look at the cameras and see what’s going on…

video_events_nodupes %>%
  select(camera_id) %>%
  arrange(camera_id) %>%
  distinct
## # A tibble: 10 x 1
##    camera_id         
##    <chr>             
##  1 mlk-central-cam-2 
##  2 mlk-douglas-cam-1 
##  3 mlk-douglas-cam-2 
##  4 mlk-georgia-cam-1 
##  5 mlk-georgia-cam-3 
##  6 mlk-houston-cam-1 
##  7 mlk-lindsay-cam-2 
##  8 mlk-magnolia-cam-2
##  9 mlk-peeples-cam-1 
## 10 mlk-peeples-cam-3

Oh… some intersections have two cameras… but some only have one. Interesting.

I think for now we just want data from one camera per intersection… so let’s pick the camera with the most traffic at each one, and also adjust the names so we can use them later to match to the PM data.

video_events_nodupes %>% 
  group_by(camera_id) %>%
  summarize(rec_count = n()) %>%
  separate(camera_id, into  = c("street1", "street2", "street3", "street4"), sep="-", remove = FALSE)  %>%
  mutate(nicename = paste(street1, street2, sep = "-")) %>%
  select(nicename, camera_id, rec_count)  %>% 
  group_by(nicename) %>%
  top_n(1, rec_count) %>%
  ungroup %>%
  {.} ->
  cameras

cameras
## # A tibble: 7 x 3
##   nicename     camera_id          rec_count
##   <chr>        <chr>                  <int>
## 1 mlk-central  mlk-central-cam-2     880557
## 2 mlk-douglas  mlk-douglas-cam-1     287188
## 3 mlk-georgia  mlk-georgia-cam-3     623252
## 4 mlk-houston  mlk-houston-cam-1     568195
## 5 mlk-lindsay  mlk-lindsay-cam-2     490929
## 6 mlk-magnolia mlk-magnolia-cam-2    188323
## 7 mlk-peeples  mlk-peeples-cam-3     448677

Join that back to the video data to get just one cam per intersection.

video_events_nodupes %>%
  inner_join(cameras %>% select(camera_id, nicename)) ->
  video_events_nodupes_1per
## Joining, by = "camera_id"

In our initial skim we saw we have a field called label that has 18 unique values… let’s look at that (only 17 show up below, since we’re now down to one camera per intersection).

video_events_nodupes_1per %>%
  group_by(label) %>%
  summarize(row_count = n()) %>%
  arrange(row_count)
## # A tibble: 17 x 2
##    label         row_count
##    <chr>             <int>
##  1 stop sign             3
##  2 parking meter        23
##  3 suitcase             28
##  4 aeroplane            33
##  5 dog                  44
##  6 handbag              57
##  7 bird                 79
##  8 boat                 81
##  9 backpack            178
## 10 umbrella            255
## 11 fire hydrant        951
## 12 bicycle            3690
## 13 motorbike          6509
## 14 bus               17207
## 15 truck            105490
## 16 person           126431
## 17 car             3226062

Oh… that’s a lot of stuff.

While I’m curious about the 81 boats, 33 aeroplanes, and 1 cat that wandered down MLK, I’m going to assume that’s misclassification noise. And I’m going to filter down to just the vehicles:

video_events_nodupes_1per %>%
  filter(label %in% c("car","bus","truck", "motorbike")) %>%
  clean_names() %>%
  {.} ->
  video_events_nodupes_1per_filtered

So let’s group these data into 10-minute buckets the way we did with the PM data:

video_events_nodupes_1per_filtered %>%
  group_by(timemark = floor_date(timestamp_iso, "10 mins"), nicename, label) %>%
  summarize(vehicle_count = n())  ->
  video_events_summarized_long

head(video_events_summarized_long)
## # A tibble: 6 x 4
## # Groups:   timemark, nicename [4]
##   timemark            nicename    label vehicle_count
##   <dttm>              <chr>       <chr>         <int>
## 1 2019-06-01 04:00:00 mlk-central car             154
## 2 2019-06-01 04:00:00 mlk-central truck             3
## 3 2019-06-01 04:00:00 mlk-douglas car              29
## 4 2019-06-01 04:00:00 mlk-georgia car             169
## 5 2019-06-01 04:00:00 mlk-georgia truck             5
## 6 2019-06-01 04:00:00 mlk-houston bus               1

That’s cool, but let’s make it wide instead of long:

video_events_summarized_long %>%
  group_by(nicename, label) %>%
  spread(label, vehicle_count) %>%
  replace(is.na(.), 0) ->
  video_events_summarized_wide

head(video_events_summarized_wide)
## # A tibble: 6 x 6
## # Groups:   nicename [6]
##   timemark            nicename       bus   car motorbike truck
##   <dttm>              <chr>        <dbl> <dbl>     <dbl> <dbl>
## 1 2019-06-01 04:00:00 mlk-central      0   154         0     3
## 2 2019-06-01 04:00:00 mlk-douglas      0    29         0     0
## 3 2019-06-01 04:00:00 mlk-georgia      0   169         0     5
## 4 2019-06-01 04:00:00 mlk-houston      1   151         1     1
## 5 2019-06-01 04:00:00 mlk-lindsay      0   145         0     1
## 6 2019-06-01 04:00:00 mlk-magnolia     0    22         0     0

Let’s join the ‘wide’ video data to the pollution data.

aq_limited_long %>%
  inner_join(video_events_summarized_wide) %>%
  rowwise() %>%
  mutate(total_traffic = sum(bus, car, motorbike, truck) ) ->
  pm_traffic
## Joining, by = c("timemark", "nicename")

Do a quick time plot of each intersection (again)

ggplot(pm_traffic) +
 aes(x = timemark, y = pm2_5_cf_1) +
 geom_line(size = 1, colour = "#0c4c8a") +
 labs(x = "Date & Time", y = "Sensor Level", title = "`pm2_5_cf_1` by Station") +
 theme_minimal() +
 facet_wrap(vars(nicename))

And traffic:

ggplot(pm_traffic) +
 aes(x = timemark, y = total_traffic) +
 geom_line(size = 1, colour = "#800000") +
 labs(x = "Date & Time", y = "Total Vehicles", title = "Total Vehicles by Station") +
 theme_minimal() +
 facet_wrap(vars(nicename))

Well that’s some regular-looking cycles…

Let’s build a super simple linear model:

library(parsnip)
library(broom)
lm(data = pm_traffic, pm2_5_cf_1 ~ bus + car + motorbike + truck) ->
  linear_model

summary(linear_model)
## 
## Call:
## lm(formula = pm2_5_cf_1 ~ bus + car + motorbike + truck, data = pm_traffic)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.223  -6.621  -2.291   4.122  48.312 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 15.5887601  0.0993488 156.909  < 2e-16 ***
## bus         -0.0301903  0.0555936  -0.543 0.587098    
## car         -0.0067491  0.0007974  -8.464  < 2e-16 ***
## motorbike   -0.1945797  0.0527587  -3.688 0.000226 ***
## truck       -0.0022071  0.0147943  -0.149 0.881410    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.412 on 21095 degrees of freedom
## Multiple R-squared:  0.01232,    Adjusted R-squared:  0.01213 
## F-statistic: 65.79 on 4 and 21095 DF,  p-value: < 2.2e-16

So right out of the gate we have issues… the signs on the coefficients are not at all as expected. We would expect more traffic = more particulates… but this model says the opposite. That’s a huge warning sign.

Let’s plot the actual vs. modeled

pm_traffic %>%
  # note: stats::predict.lm() expects `newdata`; `new_data` is parsnip's argument
  bind_cols(tibble(.pred = predict(linear_model, newdata = pm_traffic))) ->
  predicted_lm

ggplot(predicted_lm) +
 aes(x = pm2_5_cf_1, y = .pred) +
 geom_line(size = 1, colour = "#228B22") +
 labs(x = "pm2_5_cf_1", y = "Predicted pm2_5_cf_1", title = "Modeled vs. Actual") +
 theme_minimal() +
 facet_wrap(vars(nicename))

Well that’s amazingly shitty.

The OLS never predicts under 15 (the intercept), but we have lots of data under 15. That’s weird.

Let’s look at a quick correlation matrix:

pm_traffic %>%
  select(pm2_5_cf_1, bus, car,  motorbike, truck, total_traffic) %>%
  cor()

Weird… mos def negative correlations

Maybe if we break it out by intersection and just look at the correlation of total traffic to PM:

pm_traffic %>%
  group_by(nicename) %>%
  summarize(correl = cor(pm2_5_cf_1, total_traffic)) 

Total traffic is always just a little bit negatively correlated…

So clearly something else is going on… most likely the effect is lagged. We can drill into one station and see what’s going on. We’ll normalize so we can plot the two series together.

library(forecast)
## Registered S3 method overwritten by 'xts':
##   method     from
##   as.zoo.xts zoo
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
## Registered S3 methods overwritten by 'forecast':
##   method             from    
##   fitted.fracdiff    fracdiff
##   residuals.fracdiff fracdiff
pm_traffic %>%
  filter(nicename == "mlk-central") %>% 
  ungroup() %>%
  na.omit() %>%
  mutate_at(funs(scale), .vars = vars(pm2_5_cf_1, total_traffic)) ->
  single_intersection
## Warning: funs() is soft deprecated as of dplyr 0.8.0
## Please use a list of either functions or lambdas: 
## 
##   # Simple named list: 
##   list(mean = mean, median = median)
## 
##   # Auto named with `tibble::lst()`: 
##   tibble::lst(mean, median)
## 
##   # Using lambdas
##   list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once per session.
ggplot(single_intersection) +
  geom_line(size = 1,
            colour = "#0c4c8a",
            aes(x = timemark, y = pm2_5_cf_1)) +
  geom_line(size = 1,
            colour = "red",
            aes(x = timemark, y = total_traffic)) +
  labs(x = "Date & Time", y = "Scaled Values", title = "Traffic and PM") +
  theme_minimal() +
  facet_wrap(vars(nicename))

Oh… look at that… there’s pretty clearly a lag. So in order to get anything out of the modeling we need to account for the lag.

Let’s work this as a time series decomp. We’ll lag the PM data 6 hours.
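The decomposition chunk below doesn’t apply that shift itself; as a rough sketch (assuming the 6-hour figure, which is 36 of our 10-minute buckets, and a hypothetical pm2_5_cf_1_lag6h column), a dplyr::lag() on the scaled data would look like this — whether a lag or a lead is the right direction depends on the cross-correlation further down:

# hypothetical: shift the PM series by 36 ten-minute buckets (6 hours)
single_intersection %>%
  arrange(timemark) %>%
  mutate(pm2_5_cf_1_lag6h = dplyr::lag(pm2_5_cf_1, n = 36)) ->
  single_intersection_lagged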

library(tsbox)
## 
## Attaching package: 'tsbox'
## The following objects are masked from 'package:skimr':
## 
##     ts_end, ts_start
library(xts)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Attaching package: 'xts'
## The following object is masked from 'package:leaflet':
## 
##     addLegend
## The following objects are masked from 'package:dplyr':
## 
##     first, last
library(tidyquant)
## Loading required package: PerformanceAnalytics
## 
## Attaching package: 'PerformanceAnalytics'
## The following object is masked from 'package:graphics':
## 
##     legend
## Loading required package: quantmod
## Loading required package: TTR
## Version 0.4-0 included new data defaults. See ?getSymbols.
single_intersection %>%
  select(timemark, pm2_5_cf_1, total_traffic) ->
  subset_data

# note: this xts version ends up unused; the ts object built just below overwrites it
xts((subset_data %>% select(-timemark)),
    order.by = subset_data$timemark,
    frequency = 6 * 24 * 7) ->
  single_intersect_ts

single_intersection %>%
  select(timemark, pm2_5_cf_1, total_traffic) %>%
  ts_long %>%
  ts_ts %>%
  ts(frequency = 6 * 24 * 7) ->
  single_intersect_ts
## [time]: 'timemark'
## [time]: 'timemark'
single_intersect_ts[, "total_traffic"] %>%
  decompose(type = "additive") %>%
  autoplot

single_intersect_ts[, "pm2_5_cf_1"] %>%
  decompose(type = "additive") %>%
  autoplot

cor(single_intersect_ts[, "total_traffic"], single_intersect_ts[, "pm2_5_cf_1"])
## [1] -0.1563726
# cross correlation
ccf(single_intersect_ts[, "total_traffic"], single_intersect_ts[, "pm2_5_cf_1"], lag.max = 70)

OK, so we’re seeing some lagged correlation. It peaks at about .06 of a frequency unit… our frequency is 6 * 24 * 7 = 1008 ten-minute buckets (one week), so the lag is 1008 * .06 = 60.48 buckets, or roughly 10 hours.
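A quick sanity check on that arithmetic (the .06 is eyeballed off the ccf plot, so treat it as approximate):

freq_per_week <- 6 * 24 * 7             # 10-minute buckets in a week
lag_buckets   <- 0.06 * freq_per_week   # about 60.5 buckets
lag_hours     <- lag_buckets / 6        # about 10 hours
c(lag_buckets, lag_hours)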

I need to make sense of this lagged effect and I’ve clearly forgotten all my time series heuristics… Time to get Rob Hyndman’s book out and read up… also need to check those time zones as something seems off…

To do: bring in wind data and see if that’s a big factor.

Let’s bring in some weather data. I ran this previously and it takes a few minutes so I saved the results to a csv:

library(rWind)
# chattanooga is 35.0456° N, 85.3097° W

dt <- seq(ymd_hms(paste(2019, 6, 01, 00, 00, 00, sep = "-")),
          ymd_hms(paste(2019, 7, 01, 00, 00, 00, sep = "-")), by = "3 hours")
ww <- wind.dl_2(dt, -85.3097, -85.3097, 35.0456, 35.0456)

tidy(ww) %>%
  write_csv("01_data/wind_data.csv")

Read in that csv

wind_data <- read_csv("01_data/wind_data.csv") 
## Parsed with column specification:
## cols(
##   time = col_datetime(format = ""),
##   lat = col_double(),
##   lon = col_double(),
##   ugrd10m = col_double(),
##   vgrd10m = col_double(),
##   dir = col_double(),
##   speed = col_double()
## )

So we have a known date issue… I’m not 100% sure what time zone the traffic data is in… assuming UTC, but I bet it’s Eastern. Have to fix this later.
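Not fixing it here, but if those timestamps do turn out to be Eastern wall-clock time, a sketch of the fix with lubridate might look like this (hypothetical pm_traffic_est_fixed name, untested):

# hypothetical: re-label the clock times as Eastern, then express them in UTC
pm_traffic %>%
  ungroup() %>%
  mutate(timemark = with_tz(force_tz(timemark, "America/New_York"), "UTC")) ->
  pm_traffic_est_fixed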

Now we need to join every intersection to the wind data.
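Since the wind readings are every 3 hours at a single point, one possible sketch (untested, column names taken from the read_csv output above) is to floor each 10-minute timemark to the 3-hour mark and left-join, so every intersection at a given time gets the same wind observation:

# keep just the wind fields we care about
wind_data %>%
  select(time, dir, speed) %>%
  rename(wind_time = time) ->
  wind_small

# floor the 10-minute buckets to 3-hour marks and join
pm_traffic %>%
  ungroup() %>%
  mutate(wind_time = floor_date(timemark, "3 hours")) %>%
  left_join(wind_small, by = "wind_time") ->
  pm_traffic_wind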