Download the full dataset from utccuip.com:
This block is turned off since it only needs to run once:
url <- "http://cscdc-2019.utccuip.com/CSCDC+Dataset.zip"
dest_file <- "01_data/CSCDC+Dataset.zip"
download.file(url, dest_file)
unzip(zipfile = dest_file, exdir = "01_data/CSCDC+Dataset/")
Load up a few libraries:
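Something like the following, inferred from the functions used later on (so treat the exact list as a best guess):

library(tidyverse)   # read_csv, dplyr, tidyr, ggplot2
library(lubridate)   # mdy_hm, floor_date
library(skimr)       # skim
library(knitr)       # kable
library(janitor)     # clean_names
library(GGally)      # ggpairs
library(ggthemes)    # scale_colour_calc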
Grab the air quality data, then give it a quick skim (from the skimr package):
air_quality <- read_csv("01_data/CSCDC+Dataset/air_quality_csv/all.csv")
## have to force the formatting on the timestamp.
air_quality %>%
mutate( `timestamp-iso` = mdy_hm(`timestamp-iso`)) ->
air_quality
skim(air_quality) %>% kable()
Skim summary statistics
n obs: 105726
n variables: 34
Variable type: character
variable | missing | complete | n | min | max | empty | n_unique |
---|---|---|---|---|---|---|---|
nicename | 0 | 105726 | 105726 | 11 | 12 | 0 | 7 |
Variable type: numeric
variable | missing | complete | n | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|---|
current_dewpoint_f | 0 | 105726 | 105726 | 60.74 | 7.61 | 32 | 56 | 63 | 67 | 74 | ▁▁▁▂▃▃▇▂ |
current_humidity | 0 | 105726 | 105726 | 48.99 | 14.48 | 11 | 37 | 49 | 62 | 80 | ▁▃▅▆▇▅▇▁ |
current_temp_f | 0 | 105726 | 105726 | 83.5 | 7.19 | 59 | 79 | 82 | 88 | 108 | ▁▁▃▇▅▃▂▁ |
lat | 0 | 105726 | 105726 | 35.04 | 0.0021 | 35.04 | 35.04 | 35.04 | 35.05 | 35.05 | ▃▁▅▅▁▅▃▇ |
lon | 0 | 105726 | 105726 | -85.3 | 0.0056 | -85.31 | -85.31 | -85.3 | -85.3 | -85.29 | ▇▃▅▁▅▅▁▃ |
p_0_3_um | 0 | 105726 | 105726 | 1879.3 | 1138.26 | 177.43 | 1081.73 | 1607.55 | 2337.86 | 16531.32 | ▇▂▁▁▁▁▁▁ |
p_0_3_um_b | 0 | 105726 | 105726 | 1883.18 | 1142.33 | 217.46 | 1090.62 | 1613.96 | 2324.64 | 15918.44 | ▇▂▁▁▁▁▁▁ |
p_0_5_um | 0 | 105726 | 105726 | 557.04 | 339.09 | 55.46 | 320.35 | 476.25 | 693.68 | 4845.58 | ▇▂▁▁▁▁▁▁ |
p_0_5_um_b | 0 | 105726 | 105726 | 560.96 | 343.38 | 67.89 | 325.13 | 480.71 | 691.95 | 4746.94 | ▇▂▁▁▁▁▁▁ |
p_1_0_um | 0 | 105726 | 105726 | 90.63 | 75.87 | 4.38 | 42.75 | 70.18 | 110.29 | 1424.29 | ▇▁▁▁▁▁▁▁ |
p_1_0_um_b | 0 | 105726 | 105726 | 89.78 | 74.71 | 4.37 | 42.92 | 69.86 | 108.27 | 1141.65 | ▇▁▁▁▁▁▁▁ |
p_10_0_um | 0 | 105726 | 105726 | 0.18 | 0.28 | 0 | 0 | 0.03 | 0.25 | 3.97 | ▇▁▁▁▁▁▁▁ |
p_10_0_um_b | 0 | 105726 | 105726 | 0.19 | 1.25 | 0 | 0 | 0 | 0.25 | 365.71 | ▇▁▁▁▁▁▁▁ |
p_2_5_um | 0 | 105726 | 105726 | 4.87 | 5.53 | 0 | 1.79 | 3.16 | 5.39 | 161.49 | ▇▁▁▁▁▁▁▁ |
p_2_5_um_b | 0 | 105726 | 105726 | 4.96 | 5.49 | 0 | 1.79 | 3.23 | 5.64 | 107.1 | ▇▁▁▁▁▁▁▁ |
p_5_0_um | 0 | 105726 | 105726 | 0.94 | 1.11 | 0 | 0.26 | 0.62 | 1.14 | 35.89 | ▇▁▁▁▁▁▁▁ |
p_5_0_um_b | 0 | 105726 | 105726 | 0.97 | 3.26 | 0 | 0.25 | 0.61 | 1.19 | 966.92 | ▇▁▁▁▁▁▁▁ |
pm1_0_atm | 0 | 105726 | 105726 | 10.34 | 6.86 | 0.01 | 5.53 | 8.73 | 13.17 | 101.68 | ▇▂▁▁▁▁▁▁ |
pm1_0_atm_b | 0 | 105726 | 105726 | 10.29 | 6.89 | 0.19 | 5.54 | 8.71 | 12.97 | 99.78 | ▇▂▁▁▁▁▁▁ |
pm1_0_cf_1 | 0 | 105726 | 105726 | 10.05 | 6.05 | 0.01 | 5.53 | 8.73 | 13.16 | 68.74 | ▇▆▂▁▁▁▁▁ |
pm1_0_cf_1_b | 0 | 105726 | 105726 | 9.97 | 6.01 | 0.19 | 5.54 | 8.71 | 12.95 | 67.19 | ▇▆▂▁▁▁▁▁ |
pm10_0_atm | 0 | 105726 | 105726 | 15.66 | 11.76 | 0.17 | 7.91 | 12.66 | 19.36 | 200.35 | ▇▁▁▁▁▁▁▁ |
pm10_0_atm_b | 0 | 105726 | 105726 | 15.54 | 11.62 | 0.66 | 7.93 | 12.61 | 19.1 | 144.72 | ▇▂▁▁▁▁▁▁ |
pm10_0_cf_1 | 0 | 105726 | 105726 | 15.48 | 11.12 | 0.17 | 7.91 | 12.66 | 19.36 | 135.6 | ▇▃▁▁▁▁▁▁ |
pm10_0_cf_1_b | 0 | 105726 | 105726 | 15.36 | 10.97 | 0.66 | 7.93 | 12.61 | 19.09 | 104.51 | ▇▅▁▁▁▁▁▁ |
pm2_5_atm | 0 | 105726 | 105726 | 15 | 11.13 | 0.1 | 7.59 | 12.17 | 18.61 | 175.4 | ▇▁▁▁▁▁▁▁ |
pm2_5_atm_b | 0 | 105726 | 105726 | 14.85 | 10.98 | 0.5 | 7.61 | 12.12 | 18.3 | 138.54 | ▇▂▁▁▁▁▁▁ |
pm2_5_cf_1 | 0 | 105726 | 105726 | 14.47 | 9.57 | 0.1 | 7.59 | 12.17 | 18.6 | 257.33 | ▇▁▁▁▁▁▁▁ |
pm2_5_cf_1_b | 0 | 105726 | 105726 | 14.33 | 9.5 | 0.5 | 7.61 | 12.12 | 18.29 | 513.31 | ▇▁▁▁▁▁▁▁ |
timestamp | 0 | 105726 | 105726 | 1.6e+12 | 5.2e+08 | 1.6e+12 | 1.6e+12 | 1.6e+12 | 1.6e+12 | 1.6e+12 | ▇▇▇▇▇▇▇▇ |
Unnamed: 0 | 0 | 105726 | 105726 | 52862.54 | 30520.68 | 0 | 26431.25 | 52862.5 | 79293.75 | 105784 | ▇▇▇▇▇▇▇▇ |
X1 | 0 | 105726 | 105726 | 52862.54 | 30520.68 | 0 | 26431.25 | 52862.5 | 79293.75 | 105784 | ▇▇▇▇▇▇▇▇ |
Variable type: POSIXct
variable | missing | complete | n | min | max | median | n_unique |
---|---|---|---|---|---|---|---|
timestamp-iso | 0 | 105726 | 105726 | 2019-06-01 | 2019-06-22 | 2019-06-11 | 30211 |
Let’s make these field names better so we don’t have to quote them later, using janitor::clean_names() to do the dirty work. We also have a bunch of fields with 2_5 in the name, so let’s look at those:
air_quality %>%
clean_names() %>%
select(
timestamp_iso,
nicename,
p_2_5_um,
p_2_5_um_b,
pm2_5_atm,
pm2_5_atm_b,
pm2_5_cf_1,
pm2_5_cf_1_b,
current_temp_f,
current_humidity
) ->
aq_limited
ggpairs(aq_limited %>% select(-timestamp_iso))
Let’s look at a few of these individually.
ggplot(aq_limited) +
aes(x = timestamp_iso, y = pm2_5_cf_1) +
geom_line(size = 1.92, colour = "#0c4c8a") +
labs(x = "Date & Time", y = "Sensor Level", title = "`pm2_5_cf_1` by Station") +
theme_minimal() +
facet_wrap(vars(nicename))
I don’t know what it means exactly, but I bet there’s hella traffic at Douglas compared to the other stations. The volatility there is off the hook (comparatively).
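Just to put a rough number on that, here’s a quick sketch of the per-station spread of pm2_5_cf_1 (using the aq_limited table from above):

# spread of pm2_5_cf_1 by station -- a bigger sd / IQR means more volatile
aq_limited %>%
  group_by(nicename) %>%
  summarize(sd_pm = sd(pm2_5_cf_1), iqr_pm = IQR(pm2_5_cf_1)) %>%
  arrange(desc(sd_pm))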
Let’s look at the same data layered on the same graph. I can’t tell if they move together… seems like they should to some degree:
ggplot(aq_limited) +
aes(x = timestamp_iso, y = pm2_5_cf_1, color = nicename) +
geom_line(size = 1.2) +
labs(x = "Date & Time", y = "Sensor Level", title = "`pm2_5_cf_1` by Station") +
theme_minimal() +
scale_colour_calc()
Well that’s sort of noisy. I wonder what the correlation is. Let’s reshape the data to look at correlations quickly.
Do a quick reshape of the data to get the stations in columns…
I discovered some time and location combinations in the data that have multiple readings. I would not expect this, but since it’s there, we’ll just average them. The data is also irregular, so we need to put this junk in buckets. I’m thinking maybe 10-minute buckets? Let’s see if we can do that. Feels like a job for grouping by a floor value; we can get the 10-minute interval floors from the lubridate package.
aq_limited %>%
select(timestamp_iso, nicename, pm2_5_cf_1) %>%
group_by(timestamp_iso, nicename) %>%
summarize(pm2_5_cf_1 = mean(pm2_5_cf_1)) %>%
group_by(timemark = floor_date(timestamp_iso, "10 mins"), nicename) %>%
select(-timestamp_iso) %>%
summarize_all(mean, na.rm = TRUE) %>%
arrange(timemark) ->
aq_limited_long
aq_limited_long %>%
spread(nicename, pm2_5_cf_1) ->
aq_limited_wide
kable(head(aq_limited_wide, 20))
timemark | mlk-central | mlk-douglas | mlk-georgia | mlk-houston | mlk-lindsay | mlk-magnolia | mlk-peeples |
---|---|---|---|---|---|---|---|
2019-06-01 04:00:00 | 14.074 | 14.810 | 14.252 | 15.104 | 16.370 | 13.198 | 16.104 |
2019-06-01 04:10:00 | 13.724 | 15.110 | 14.556 | 15.348 | 15.796 | 13.840 | 16.790 |
2019-06-01 04:20:00 | 16.012 | 15.262 | 14.818 | 15.348 | 15.882 | 14.568 | 18.332 |
2019-06-01 04:30:00 | 15.868 | 15.578 | 15.718 | 15.496 | 16.082 | 15.310 | 18.888 |
2019-06-01 04:40:00 | 16.512 | 16.026 | 14.790 | 16.282 | 15.912 | 14.280 | 17.936 |
2019-06-01 04:50:00 | 17.706 | 16.394 | 15.582 | 16.014 | 16.026 | 14.802 | 17.778 |
2019-06-01 05:00:00 | 16.854 | 16.340 | 15.510 | 16.010 | 16.014 | 14.686 | 18.460 |
2019-06-01 05:10:00 | 36.456 | 16.600 | 14.906 | 16.864 | 15.640 | 14.808 | 17.342 |
2019-06-01 05:20:00 | 16.050 | 16.954 | 15.640 | 16.988 | 16.924 | 14.804 | 17.684 |
2019-06-01 05:30:00 | 16.546 | 16.924 | 16.992 | 16.482 | 16.922 | 14.898 | 17.538 |
2019-06-01 05:40:00 | 15.988 | 16.918 | 16.716 | 17.112 | 17.318 | 15.062 | 17.538 |
2019-06-01 05:50:00 | 16.162 | 17.554 | 17.244 | 17.292 | 17.104 | 16.340 | 18.030 |
2019-06-01 06:00:00 | 16.534 | 17.048 | 16.186 | 17.166 | 16.830 | 15.722 | 18.078 |
2019-06-01 06:10:00 | 16.604 | 17.836 | 16.828 | 18.946 | 18.668 | 17.094 | 18.842 |
2019-06-01 06:20:00 | 17.944 | 17.924 | 16.218 | 18.114 | 17.426 | 16.522 | 18.946 |
2019-06-01 06:30:00 | 17.330 | 17.580 | 16.736 | 17.330 | 17.268 | 16.812 | 18.914 |
2019-06-01 06:40:00 | 18.152 | 17.498 | 17.354 | 17.716 | 17.496 | 17.272 | 19.134 |
2019-06-01 06:50:00 | 18.974 | 17.890 | 18.476 | 17.786 | 17.336 | 16.236 | 20.290 |
2019-06-01 07:00:00 | 19.712 | 18.052 | 18.446 | 18.516 | 17.868 | 16.192 | 19.876 |
2019-06-01 07:10:00 | 18.054 | 18.812 | 17.228 | 18.630 | 18.368 | 17.426 | 20.712 |
That looks better. Let’s see what’s going on here…
Let’s plot and look at the pairs plot with correlations:
aq_limited_wide %>%
ungroup() %>%
select(-timemark) %>%
ggpairs()
As I suspected, these are all pretty correlated. But that Douglas intersection though… it’s different. I suspect high traffic (again). It’s still correlated, but it doesn’t drop as low as quickly as the others… I wonder if there’s something dusty near there or if it just gets more traffic?
Since we’re slicing and dicing, let’s look at temp vs. p_2_5_um for each station:
ggplot(aq_limited) +
aes(x = current_temp_f, y = p_2_5_um) +
geom_point(size = 1.92, colour = "#0c4c8a") +
labs(x = "Temp (F)", y = "Sensor Level", title = "`p_2_5_um` vs Temp (F)") +
theme_minimal() +
facet_wrap(vars(nicename))
That’s not as insightful as I had hoped. Let’s look at humidity:
ggplot(aq_limited) +
aes(x = current_humidity, y = p_2_5_um) +
geom_point(size = 1.92, colour = "#0c4c8a") +
labs(x = "Absolute Humidity", y = "Sensor Level", title = "`p_2_5_um` vs Humidity") +
theme_minimal() +
facet_wrap(vars(nicename))
When I first got these data, I didn’t know where they were collected, but I suspected it was around Chattanooga. Let’s plot the locations and see:
library(ggmap)
library(sf)
library(mapview)
library(leaflet)
# grab the coordinates
air_quality %>%
group_by(nicename, lat, lon) %>%
summarize(obs_count = n()) ->
obs_location
# turn them into a spatial sf object for mapping
obs_location_sf <-
st_as_sf(obs_location,
coords = c("lon", "lat"),
crs = 4326)
# build the map
m1 <- mapview(obs_location_sf, legend = FALSE)
# add the names as labels
l1 <- addStaticLabels(
m1,
label = obs_location_sf$nicename,
textsize = "20px",
direction = 'left'
)
l1
Cool, they are all right up one strip in Chattanooga. I guess now we know what the MLK prefix in all the station names stands for. Looks like the stations are named after MLK and the cross street. Makes sense.
We should do something similar to the above with the video data. We’ll start with a skim:
video_events <- read_csv("01_data/CSCDC+Dataset/video_event_csv/all.csv")
skim(video_events) %>% kable()
Skim summary statistics
n obs: 4070343
n variables: 11
Variable type: character
variable | missing | complete | n | min | max | empty | n_unique |
---|---|---|---|---|---|---|---|
camera_id | 0 | 4070343 | 4070343 | 17 | 18 | 0 | 10 |
id | 0 | 4070343 | 4070343 | 36 | 36 | 0 | 4070342 |
label | 0 | 4070343 | 4070343 | 3 | 13 | 0 | 18 |
locations | 0 | 4070343 | 4070343 | 190 | 9897 | 0 | 4070342 |
pole_id | 0 | 4070343 | 4070343 | 13 | 14 | 0 | 10 |
Variable type: logical
variable | missing | complete | n | mean | count |
---|---|---|---|---|---|
intersection | 0 | 4070343 | 4070343 | 0.84 | TRU: 3405701, FAL: 664642, NA: 0 |
Variable type: numeric
variable | missing | complete | n | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|---|
hit_counts | 0 | 4070343 | 4070343 | 5.35 | 4.78 | 2 | 2 | 3 | 7 | 82 | ▇▁▁▁▁▁▁▁ |
timestamp | 0 | 4070343 | 4070343 | 1.6e+12 | 5.3e+08 | 1.6e+12 | 1.6e+12 | 1.6e+12 | 1.6e+12 | 1.6e+12 | ▅▇▇▆▇▆▆▇ |
Unnamed: 0 | 0 | 4070343 | 4070343 | 2e+06 | 1175006.96 | 0 | 1e+06 | 2e+06 | 3052756.5 | 4070342 | ▇▇▇▇▇▇▇▇ |
X1 | 0 | 4070343 | 4070343 | 2e+06 | 1175006.96 | 0 | 1e+06 | 2e+06 | 3052756.5 | 4070342 | ▇▇▇▇▇▇▇▇ |
Variable type: POSIXct
variable | missing | complete | n | min | max | median | n_unique |
---|---|---|---|---|---|---|---|
timestamp-iso | 0 | 4070343 | 4070343 | 2019-06-01 | 2019-06-22 | 2019-06-11 | 3876225 |
head(video_events)
## # A tibble: 6 x 11
## X1 `Unnamed: 0` camera_id hit_counts id intersection label
## <dbl> <dbl> <chr> <dbl> <chr> <lgl> <chr>
## 1 0 0 mlk-cent… 7 fb33… TRUE car
## 2 1 1 mlk-cent… 2 12a6… TRUE car
## 3 2 2 mlk-cent… 2 8430… TRUE car
## 4 3 3 mlk-cent… 14 d5c5… TRUE car
## 5 4 4 mlk-cent… 8 8328… TRUE car
## 6 5 5 mlk-cent… 6 7bc4… TRUE car
## # … with 4 more variables: locations <chr>, pole_id <chr>,
## # timestamp <dbl>, `timestamp-iso` <dttm>
So there’s a lot going on in there… let’s strip out some unique values for a few of the fields so we can see what we’re dealing with.
So there’s an id field that has a unique identifier (it looks like a UUID) for each vehicle… let’s check that out:
video_events %>%
select(id) %>%
distinct(id) %>%
summarize(n())
## # A tibble: 1 x 1
## `n()`
## <int>
## 1 4070342
So there are a little over 4m unique vehicle ids. Interesting.
I wonder if any ids are in there multiple times.
video_events %>%
group_by(id) %>%
summarize(rec_count = n()) %>%
filter(rec_count > 1) ->
has_dupe
has_dupe
## # A tibble: 1 x 2
## id rec_count
## <chr> <int>
## 1 b3cb7029-86ed-47e0-a634-7ae1ba9d3076 2
One… weird. I wonder if that’s a random UUID collision (unlikely).
video_events %>%
inner_join(has_dupe) ->
double_up
## Joining, by = "id"
double_up
## # A tibble: 2 x 12
## X1 `Unnamed: 0` camera_id hit_counts id intersection label
## <dbl> <dbl> <chr> <dbl> <chr> <lgl> <chr>
## 1 4.03e6 4029658 mlk-cent… 8 b3cb… TRUE car
## 2 4.03e6 4032606 mlk-cent… 8 b3cb… TRUE car
## # … with 5 more variables: locations <chr>, pole_id <chr>,
## # timestamp <dbl>, `timestamp-iso` <dttm>, rec_count <int>
That’s just straight up a dupe record. Well, it probably won’t cause us any trouble… but we’ll kick one out manually for OCD reasons.
double_up %>%
summarize(max(X1)) %>%
pluck(1) ->
bad_point
video_events %>%
filter(X1 != bad_point) ->
video_events_nodupes
Let’s look at the cameras and see what’s going on…
video_events_nodupes %>%
select(camera_id) %>%
arrange(camera_id) %>%
distinct
## # A tibble: 10 x 1
## camera_id
## <chr>
## 1 mlk-central-cam-2
## 2 mlk-douglas-cam-1
## 3 mlk-douglas-cam-2
## 4 mlk-georgia-cam-1
## 5 mlk-georgia-cam-3
## 6 mlk-houston-cam-1
## 7 mlk-lindsay-cam-2
## 8 mlk-magnolia-cam-2
## 9 mlk-peeples-cam-1
## 10 mlk-peeples-cam-3
Oh… some intersections have two cameras, but some only have one. Interesting.
I think for now we just want data from one camera per intersection… so let’s pick the camera with the most traffic at each one, and also adjust the names so we can use them later to match to the PM data:
video_events_nodupes %>%
group_by(camera_id) %>%
summarize(rec_count = n()) %>%
separate(camera_id, into = c("street1", "street2", "street3", "street4"), sep="-", remove = FALSE) %>%
mutate(nicename = paste(street1, street2, sep = "-")) %>%
select(nicename, camera_id, rec_count) %>%
group_by(nicename) %>%
top_n(1, rec_count) %>%
ungroup %>%
{.} ->
cameras
cameras
## # A tibble: 7 x 3
## nicename camera_id rec_count
## <chr> <chr> <int>
## 1 mlk-central mlk-central-cam-2 880557
## 2 mlk-douglas mlk-douglas-cam-1 287188
## 3 mlk-georgia mlk-georgia-cam-3 623252
## 4 mlk-houston mlk-houston-cam-1 568195
## 5 mlk-lindsay mlk-lindsay-cam-2 490929
## 6 mlk-magnolia mlk-magnolia-cam-2 188323
## 7 mlk-peeples mlk-peeples-cam-3 448677
Join that back to the video data to get just one cam per intersection:
video_events_nodupes %>%
inner_join(cameras %>% select(camera_id, nicename)) ->
video_events_nodupes_1per
## Joining, by = "camera_id"
In our initial skim we saw a field called label that has 18 unique values… let’s look at that:
video_events_nodupes_1per %>%
group_by(label) %>%
summarize(row_count = n()) %>%
arrange(row_count)
## # A tibble: 17 x 2
## label row_count
## <chr> <int>
## 1 stop sign 3
## 2 parking meter 23
## 3 suitcase 28
## 4 aeroplane 33
## 5 dog 44
## 6 handbag 57
## 7 bird 79
## 8 boat 81
## 9 backpack 178
## 10 umbrella 255
## 11 fire hydrant 951
## 12 bicycle 3690
## 13 motorbike 6509
## 14 bus 17207
## 15 truck 105490
## 16 person 126431
## 17 car 3226062
Oh… that’s a lot of stuff.
While I’m curious about the 83 boats, 33 airplanes, and 1 cat that wandered down MLK, I’m going to assume that’s misclassification noise. So I’m going to keep only the vehicles:
video_events_nodupes_1per %>%
filter(label %in% c("car","bus","truck", "motorbike")) %>%
clean_names() %>%
{.} ->
video_events_nodupes_1per_filtered
So let’s group these data into 10-minute buckets the way we did with the PM data:
video_events_nodupes_1per_filtered %>%
group_by(timemark = floor_date(timestamp_iso, "10 mins"), nicename, label) %>%
summarize(vehicle_count = n()) ->
video_events_summarized_long
head(video_events_summarized_long)
## # A tibble: 6 x 4
## # Groups: timemark, nicename [4]
## timemark nicename label vehicle_count
## <dttm> <chr> <chr> <int>
## 1 2019-06-01 04:00:00 mlk-central car 154
## 2 2019-06-01 04:00:00 mlk-central truck 3
## 3 2019-06-01 04:00:00 mlk-douglas car 29
## 4 2019-06-01 04:00:00 mlk-georgia car 169
## 5 2019-06-01 04:00:00 mlk-georgia truck 5
## 6 2019-06-01 04:00:00 mlk-houston bus 1
That’s cool, but let’s make it wide instead of long:
video_events_summarized_long %>%
group_by(nicename, label) %>%
spread(label, vehicle_count) %>%
replace(is.na(.), 0) ->
video_events_summarized_wide
head(video_events_summarized_wide)
## # A tibble: 6 x 6
## # Groups: nicename [6]
## timemark nicename bus car motorbike truck
## <dttm> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2019-06-01 04:00:00 mlk-central 0 154 0 3
## 2 2019-06-01 04:00:00 mlk-douglas 0 29 0 0
## 3 2019-06-01 04:00:00 mlk-georgia 0 169 0 5
## 4 2019-06-01 04:00:00 mlk-houston 1 151 1 1
## 5 2019-06-01 04:00:00 mlk-lindsay 0 145 0 1
## 6 2019-06-01 04:00:00 mlk-magnolia 0 22 0 0
Let’s join the ‘wide’ video data to the pollution data.
aq_limited_long %>%
inner_join(video_events_summarized_wide) %>%
rowwise() %>%
mutate(total_traffic = sum(bus, car, motorbike, truck) ) ->
pm_traffic
## Joining, by = c("timemark", "nicename")
Do a quick time plot of each intersection (again):
ggplot(pm_traffic) +
aes(x = timemark, y = pm2_5_cf_1) +
geom_line(size = 1, colour = "#0c4c8a") +
labs(x = "Date & Time", y = "Sensor Level", title = "`pm2_5_cf_1` by Station") +
theme_minimal() +
facet_wrap(vars(nicename))
And traffic:
ggplot(pm_traffic) +
aes(x = timemark, y = total_traffic) +
geom_line(size = 1, colour = "#800000") +
labs(x = "Date & Time", y = "Total Vehicles", title = "Total Vehicles by Station") +
theme_minimal() +
facet_wrap(vars(nicename))
Well that’s some regular looking cycles…
Let’s build a super simple linear model:
library(parsnip)
library(broom)
lm(data = pm_traffic, pm2_5_cf_1 ~ bus + car + motorbike + truck) ->
linear_model
summary(linear_model)
##
## Call:
## lm(formula = pm2_5_cf_1 ~ bus + car + motorbike + truck, data = pm_traffic)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.223 -6.621 -2.291 4.122 48.312
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.5887601 0.0993488 156.909 < 2e-16 ***
## bus -0.0301903 0.0555936 -0.543 0.587098
## car -0.0067491 0.0007974 -8.464 < 2e-16 ***
## motorbike -0.1945797 0.0527587 -3.688 0.000226 ***
## truck -0.0022071 0.0147943 -0.149 0.881410
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.412 on 21095 degrees of freedom
## Multiple R-squared: 0.01232, Adjusted R-squared: 0.01213
## F-statistic: 65.79 on 4 and 21095 DF, p-value: < 2.2e-16
So right out of the gate we have issues… the signs on the coefficients are not at all as expected. We would expect more traffic = more particulates… but this model says the opposite. That’s a huge warning sign.
Let’s plot the actual vs. modeled:
pm_traffic %>%
bind_cols(tibble(.pred = predict(linear_model, newdata = pm_traffic))) ->
predicted_lm
ggplot(predicted_lm) +
aes(x = pm2_5_cf_1, y = .pred) +
geom_line(size = 1, colour = "#228B22") +
labs(x = "pm2_5_cf_1", y = "Predicted pm2_5_cf_1", title = "Modeled vs. Actual") +
theme_minimal() +
facet_wrap(vars(nicename))
Well that’s amazingly shitty.
The OLS never predicts under 15 (the intercept), but we have lots of data under 15. That’s weird.
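A quick way to sanity-check that is to compare the spread of the predictions against the actuals (just a sketch on the predicted_lm tibble from above):

# compare the spread of the predictions vs. the actual pm2_5_cf_1 values
summary(predicted_lm$.pred)
summary(predicted_lm$pm2_5_cf_1)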
Let’s look at a quick correlation matrix:
pm_traffic %>%
select(pm2_5_cf_1, bus, car, motorbike, truck, total_traffic) %>%
cor()
Weird… mos def negative correlations.
Maybe if we break it into intersections and just look at the total traffic correlation to PM:
pm_traffic %>%
group_by(nicename) %>%
summarize(correl = cor(pm2_5_cf_1, total_traffic))
Total traffic is always just a little bit negatively correlated…
So clearly something else is going on… most likely the effect is lagged. We can drill into one station and see what’s going on. We’ll normalize so we can plot the two series together:
library(forecast)
## Registered S3 method overwritten by 'xts':
## method from
## as.zoo.xts zoo
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
## Registered S3 methods overwritten by 'forecast':
## method from
## fitted.fracdiff fracdiff
## residuals.fracdiff fracdiff
pm_traffic %>%
filter(nicename == "mlk-central") %>%
ungroup() %>%
na.omit() %>%
mutate_at(vars(pm2_5_cf_1, total_traffic), ~ as.numeric(scale(.))) ->
single_intersection
ggplot(single_intersection) +
geom_line(size = 1,
colour = "#0c4c8a",
aes(x = timemark, y = pm2_5_cf_1)) +
geom_line(size = 1,
colour = "red",
aes(x = timemark, y = total_traffic)) +
labs(x = "Date & Time", y = "Scaled Values", title = "Traffic and PM") +
theme_minimal() +
facet_wrap(vars(nicename))
Oh… look at that… there’s pretty clearly a lag. So in order to get anything out of the modeling, we need to account for the lag.
Let’s work this as a time series decomposition. We’ll lag the PM data 6 hours.
library(tsbox)
##
## Attaching package: 'tsbox'
## The following objects are masked from 'package:skimr':
##
## ts_end, ts_start
library(xts)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## Attaching package: 'xts'
## The following object is masked from 'package:leaflet':
##
## addLegend
## The following objects are masked from 'package:dplyr':
##
## first, last
library(tidyquant)
## Loading required package: PerformanceAnalytics
##
## Attaching package: 'PerformanceAnalytics'
## The following object is masked from 'package:graphics':
##
## legend
## Loading required package: quantmod
## Loading required package: TTR
## Version 0.4-0 included new data defaults. See ?getSymbols.
single_intersection %>%
select(timemark, pm2_5_cf_1, total_traffic) ->
subset_data
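# note: this xts version of the series ends up unused -- the ts object built just below overwrites it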
xts((subset_data %>% select(-timemark)),
order.by = subset_data$timemark,
frequency = 6 * 24 * 7) ->
single_intersect_ts
single_intersection %>%
select(timemark, pm2_5_cf_1, total_traffic) %>%
ts_long %>%
ts_ts %>%
ts(frequency = 6 * 24 * 7) ->
single_intersect_ts
## [time]: 'timemark'
## [time]: 'timemark'
single_intersect_ts[, "total_traffic"] %>%
decompose(type = "additive") %>%
autoplot
single_intersect_ts[, "pm2_5_cf_1"] %>%
decompose(type = "additive") %>%
autoplot
cor(single_intersect_ts[, "total_traffic"], single_intersect_ts[, "pm2_5_cf_1"])
## [1] -0.1563726
# cross correlation
ccf(single_intersect_ts[, "total_traffic"], single_intersect_ts[, "pm2_5_cf_1"], lag.max = 70)
OK, so we’re seeing some lagged correlation. It peaks at about 0.06 of a frequency unit… our frequency is 6 * 24 * 7 = 1008 observations, so the lag is roughly 1008 * 0.06 ≈ 60 observations, i.e. about 10 hours at 10-minute intervals.
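If we want to pull that peak lag out of the ccf object programmatically, a rough sketch:

# find the lag with the strongest (absolute) cross-correlation
xc <- ccf(single_intersect_ts[, "total_traffic"],
          single_intersect_ts[, "pm2_5_cf_1"],
          lag.max = 70, plot = FALSE)
best_lag <- xc$lag[which.max(abs(xc$acf))]  # in frequency units
best_lag * 6 * 24 * 7                       # in 10-minute observations
best_lag * 6 * 24 * 7 / 6                   # roughly hours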
I need to make sense of this lagged effect and I’ve clearly forgotten all my time series heuristics… Time to get Rob Hyndman’s book out and read up… also need to check those time zones, as something seems off…
Let’s bring in some weather data. I ran this previously and it takes a few minutes so I saved the results to a csv:
library(rWind)
# chattanooga is 35.0456° N, 85.3097° W
dt <- seq(ymd_hms(paste(2019, 6, 01, 00, 00, 00, sep = "-")),
ymd_hms(paste(2019, 7, 01, 00, 00, 00, sep = "-")), by = "3 hours")
ww <- wind.dl_2(dt, -85.3097,-85.3097, 35.0456, 35.0456)
tidy(ww) %>%
write_csv("01_data/wind_data.csv")
Read in that CSV:
wind_data <- read_csv("01_data/wind_data.csv")
## Parsed with column specification:
## cols(
## time = col_datetime(format = ""),
## lat = col_double(),
## lon = col_double(),
## ugrd10m = col_double(),
## vgrd10m = col_double(),
## dir = col_double(),
## speed = col_double()
## )
So we have a known date issue… not 100% sure what time zone the traffic data is in… assuming UTC, but I bet it’s Eastern. We’ll have to fix this later.
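When we do, it’ll probably be a lubridate call along these lines (a sketch, assuming the stamps were parsed as UTC but really represent Eastern local time; pm_traffic_est is just a placeholder name):

# hypothetical fix: keep the clock readings but re-label them as US Eastern
pm_traffic %>%
  mutate(timemark = force_tz(timemark, tzone = "America/New_York")) ->
pm_traffic_est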
Now we need to join every intersection to the wind data.
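Since the wind readings are 3-hourly at a single point, joining on a floored timestamp seems like the way to go. A rough sketch (the flooring choice and the pm_traffic_wind name are placeholders, and this ignores the time zone issue above):

# floor each 10-minute timemark to its covering 3-hour window, then join
pm_traffic %>%
  ungroup() %>%
  mutate(wind_time = floor_date(timemark, "3 hours")) %>%
  left_join(wind_data %>% rename(wind_time = time), by = "wind_time") ->
pm_traffic_wind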