Analysis of NYPD 2011: Stop and Frisk

Data

sqf_default <- readRDS("data/sqf_2011.rds")

sqf <- sqf_default %>% 
  mutate(datestop = mdy(datestop),
         month    = month(datestop, label = TRUE),
         day      = day(datestop),
         wday     = wday(datestop, label = TRUE)) %>% 
  select(year, pct, ser_num, datestop, month, day, wday, timestop:detailcm)

Let’s take a preliminary look at this data:

sqf %>%
  count(pct) %>% 
  arrange(desc(n)) %>% 
  slice(1:3)

# A tibble: 3 x 2
    pct     n
  <dbl> <int>
1    75  3107
2    73  2517
3   115  1815

It seems like precincts 075, 073 and 115 had the most sqf incidents.

sqf %>%
  count(race, offunif) %>% 
  group_by(race) %>% 
  mutate(prop_in_unif = n / sum(n)) %>% 
  filter(offunif == "Y") %>% 
  select(race, prop_in_unif) %>% 
  arrange(prop_in_unif)

# A tibble: 8 x 2
# Groups:   race [8]
  race  prop_in_unif
  <chr>        <dbl>
1 W            0.662
2 A            0.736
3 B            0.745
4 P            0.755
5 Q            0.759
6 I            0.785
7 U            0.814
8 Z            0.839

The proportions can be seen in the table above. Proportions of clothed officer arrests were lowest for white suspects. They were lowest for those categorized as “other” race.

Additionally, it seems that “U” is not recorded value in the data dictionary. Due to the absence of the “X” variable in this chart (which is present in the dictionary), I assume that U means “unknown”.

sqf %>%
  group_by(month) %>% 
  filter(month == "Jan" || month == "Aug") %>% 
  summarise(avg_per_stop=mean(perstop))

# A tibble: 2 x 2
  month avg_per_stop
  <ord>        <dbl>
1 Jan           5.39
2 Aug           5.77

It does seem like the average minutes of a stop are higher in August than in January. However, when looking at the data, there are a view observations that list “0” as the period of stop, which seems unlikely to be true. Let’s see what happens when we remove these values

sqf %>%
  group_by(month) %>% 
  filter(month == "Jan" || month == "Aug") %>% 
  filter(perstop != 0) %>% 
  summarise(avg_per_stop=mean(perstop))

# A tibble: 2 x 2
  month avg_per_stop
  <ord>        <dbl>
1 Jan           5.39
2 Aug           5.77

This doesn’t seem to change the results too much. There are also a lot of recorded “999” which could be some kind of default value. Let’s see if ommitting those changes anything:

sqf %>%
  group_by(month) %>% 
  filter(month == "Jan" || month == "Aug") %>% 
  filter(perstop != 0) %>% 
  filter(perstop != 99) %>%
  summarise(avg_per_stop=mean(perstop))

# A tibble: 2 x 2
  month avg_per_stop
  <ord>        <dbl>
1 Jan           5.39
2 Aug           5.77

Again, not much changes. It does seem that the mean period of stop for sqf event periods is generally larger in August than in January.

sqf %>%
  group_by(month) %>% 
  filter(datestop == min(datestop)) %>% 
  filter(timestop == min(timestop)) %>% 
  select(month, perstop) %>% 
  arrange(month)

# A tibble: 14 x 2
# Groups:   month [12]
   month perstop
   <ord>   <dbl>
 1 Jan         2
 2 Feb         5
 3 Feb         5
 4 Mar         5
 5 Apr         2
 6 May         5
 7 Jun         3
 8 Jun         5
 9 Jul        10
10 Aug         2
11 Sep         5
12 Oct         6
13 Nov         1
14 Dec        10

This is a table displaying the amount of time the first stop of every month took place. Some first stops happened at the same time, so I left them in the table (hence there are 14 rows).

sqf %>%
  filter(month == "Mar") %>%
  ggplot(mapping = aes(y = age , x = wday)) +
  geom_boxplot() +
  theme_bw()+
  labs(title = "Avg Ages of Suspects on Different Days of the Week in March")

On first glance it seems like there are some weird ages that are recorded. Let’s take a look at the data

sqf %>%
  count(age)

# A tibble: 137 x 2
     age     n
   <dbl> <int>
 1     0     4
 2     1    86
 3     2    10
 4     3     8
 5     4     1
 6     5    19
 7     6     4
 8     7     2
 9     9     3
10    10     7
# … with 127 more rows

sqf %>%
  count(age) %>% 
  arrange(desc(age))

# A tibble: 137 x 2
     age     n
   <dbl> <int>
 1   999    76
 2   550     1
 3   455     1
 4   416     1
 5   387     1
 6   369     1
 7   365     1
 8   341     1
 9   335     1
10   334     1
# … with 127 more rows

Based on this, it seems there are some anamolies in the entered age. Thus, in the following task, I will only consider ages between 10-90 because anything above or below that doesn’t seem to make sense in terms of what the data describes (i.e. why would the police stop and frisk children or the very elderly?)

sqf %>%
  filter(month == "Mar") %>%
  filter(age<91 & age>9) %>% 
  ggplot(mapping = aes(y = age , x = wday)) +
  geom_boxplot() +
  theme_bw()+
  labs(title = "Ages of Suspects on Different Days of the Week in March", 
       x="Day of the Week (in March)", 
       y="Age of Suspect")

The mean age seems to be slightly higher in the middle of the week (around Wednesday) and then fall towards the weekend (around Saturday).

First, finding the top 10 precincts:

toptenpct <- sqf %>%
  count(pct) %>% 
  arrange(desc(n)) %>% 
  slice(1:10) %>% 
  arrange(pct)

Next, finding the probability of being arrested in a specific precinct and month given you are stopped.

sqfarstprob <- sqf %>%
  mutate(pct = as.factor(pct)) %>% 
  count(pct, month, arstmade) %>% 
  group_by(pct, month) %>%
  mutate(pctmonthtotal = sum(n)) %>%
  mutate(arstpropor = n/pctmonthtotal) %>% 
  filter(pct %in% toptenpct$pct) %>% 
  filter(arstmade == "Y")

sqfarstprob

# A tibble: 120 x 6
# Groups:   pct, month [120]
   pct   month arstmade     n pctmonthtotal arstpropor
   <fct> <ord> <chr>    <int>         <int>      <dbl>
 1 23    Jan   Y           15           339     0.0442
 2 23    Feb   Y           11           177     0.0621
 3 23    Mar   Y           13           113     0.115 
 4 23    Apr   Y            3           112     0.0268
 5 23    May   Y            5           108     0.0463
 6 23    Jun   Y            5           107     0.0467
 7 23    Jul   Y            7           119     0.0588
 8 23    Aug   Y            4           110     0.0364
 9 23    Sep   Y            5           116     0.0431
10 23    Oct   Y            5           221     0.0226
# … with 110 more rows

Now, creating the plot, using the above chart:

sqfarstprob %>%
  ggplot(mapping=aes(x=month, y=pct, fill = arstpropor))+
  geom_tile(color = "grey")+
  coord_equal()+
  scale_fill_gradient(low = "#fee0d2", 
                       high = "#de2d26",
                       breaks = c(0, 0.05, 0.10, 0.15))+
  theme(axis.text.x = element_text(size = 5, color = "grey50"),
        axis.text.y = element_text(size = 5, color = "grey50"),
        plot.title = element_text(size = 7),
        axis.title.y = element_text(size=6,),
        axis.title.x = element_text(size=6),
        legend.position="bottom",
        legend.text = element_text(size=5, color = "grey50"),
        legend.title = element_text(size=6),
        plot.caption = element_text(size = 5))+
  labs(
    title = 
      "There is high variablity in arrest probability within and between precincts",
       x = "Month",
       y= "Precinct",
       fill = "Probablity of an arrest", 
       caption = "Top 10 precincts involved in SQF events")

In the last few days, New York City’s stop and frisk policy has seen a resurgence in public discourse because of former mayor Michael Bloomberg’s statements on the policy (which was used greatly during his tenure).

In a newly released tape recorded in 2015, Bloomberg stated that “Ninety-five percent of your murders — murderers and murder victims — fit one M.O.”. He specifies, “They are male minorities, 16 to 25. That’s true in New York. That’s true in virtually every city.”.

These racially charged comments combined with the 2013 federal court ruling that declared stop and frisk searches unconstitutional make the policy very controversial from the point of view of race.

In the light of recent discussion, it would be interesting to look at the efficacy of this policy, and the treatment of individuals of different racial groups by police officers.

Question One: How many searches actually resulted in arrests?

sqf %>%
  count(arstmade) %>%
  mutate(arstprp = n/sum(n))

# A tibble: 2 x 3
  arstmade     n arstprp
  <chr>    <int>   <dbl>
1 N        64467  0.941 
2 Y         4029  0.0588

It seems like only about 5% of stops resulted in arrests.

Question Two: What proportion of each race Let’s examine the proportion of those arrested in each race.

In both of the following graphics

sqf %>%
  ggplot(mapping = aes(x=race, fill = arstmade))+
  geom_bar()+
  labs(title = "Stops by Race (with Arrest Proportions)", fill = "Arrest Made?")+
  scale_fill_discrete(labels = c("No", "Yes"))+
  coord_flip()+
  scale_x_discrete(
    limits = c("B", "Q", "W", "P", "A", "Z", "U", "I"),
    labels = c("Black", "White-Hispanic", "White", "Black-Hispanic", "Asian", "Other", "Unknown", "Native American"))+
  scale_fill_manual(values=c('#99ccff','#0066cc'))

This graph shows us that Black and White-Hispanic people had the most stops conducted. But how does it line up proportionally?

sqf %>%
  ggplot(mapping = aes(x=race, fill = arstmade))+
  geom_bar(position="fill")+
  labs(title = "Arrest Proportion by Race", fill = "Arrest Made?", subtitle= "Average Shown by Black Line", y = "proportion arrested")+
  scale_fill_discrete(labels = c("No", "Yes"))+
  scale_x_discrete(
    limits = c("B", "Q", "W", "P", "A", "Z", "U", "I"),
    labels = c("Black", "White-Hispanic", "White", "Black-Hispanic", "Asian", "Other", "Unknown", "Native American"))+
  coord_flip()+
  geom_hline(yintercept = 0.04770457)+
  scale_fill_manual(values=c('#99ccff','#0066cc'))

It seems like black and white-hispanic people were more likely to be stopped but not significantly more likely to be arrested. In fact, white people seemed to be more likely to get arrested if they are stopped.

The above graphs both deny and support Bloomberg’s claims. At first glance, there are purely a larger number of black and white-hispanic people getting arrested. However, the proportions of the arrests seem to be closer together. So the higher number of “crimes deterred” that might have been commited by black or hispanic people might simply be because more black or hispanic people were stopped.

Question Three: Which races got searched or frisked more?

The observations in this dataset describe the circumstances of police stops. Usually, when the policy is discusses as a whole, the public discusses stop and frisk. As such it would be interesting to also look at the number of frisks by racial category.

The dataset includes both searches and frisks. Searches are legally defined as being more thorough frisks, which falls into the purview of this question. It would be helpful to create a new variable combining these two:

sqf <- sqf %>% 
  mutate(searchorfrisk = ifelse(frisked == "Y", "Yes",
                                ifelse(searched == "Y", "Yes", "No")))

Now we can graph the proportion of each race that faced a search or a frisk:

sqf %>%
  ggplot(mapping = aes(x=race, fill = searchorfrisk))+
  geom_bar(position="fill")+
  labs(title = "Search/Frisk Proportion by Race", fill = "Searched or Frisked?", y = "proportion searched or frisked")+
    scale_x_discrete(
    limits = c("P", "B", "Q", "Z", "A", "I", "W", "U"),
    labels = c("Black Hispanic", "Black", "White-Hispanic", "Other", "Asian", "Native American", "White", "Unknown"))+
    scale_fill_manual(values=c('#99ccff','#0066cc'))+
  coord_flip()

It seems that Black Hispanic, Black and White Hispanic people were more likely to be searched or frisked. This is interesting given the fact that White people were the most likely to be arrested if they were stopped.

Question Four: What group had the most force used on them?

Thee are 9 categories of force recorded in the data set. We can use them to make one variable that specifies if any force was used at all.

sqf <- sqf %>%
  mutate(forceused = ifelse(pf_hands == "Y" | pf_wall == "Y" | pf_grnd == "Y" | pf_drwep == "Y" | pf_ptwep == "Y" | pf_baton == "Y" | pf_hcuff == "Y" | pf_pepsp == "Y" |pf_other == "Y", "Yes", "No"))

Now we can see the proportion of stops that involved force for each racial group.

sqf %>%
  ggplot(mapping = aes(x=race, fill = forceused))+
  geom_bar(position="fill")+
  labs(title = "Force Use Proportion by Race", fill = "Force Used?", x= "Race", y = "Proportion of Searches Using Force")+
    scale_x_discrete(
    limits = c("P", "Q", "B", "Z", "A", "W", "I", "U"),
    labels = c("Black-Hispanic", "White-Hispanic", "Black", "Other", "Asian", "White", "Native American", "Unknown"))+
    scale_fill_manual(values=c('#99ccff','#0066cc'))+
  coord_flip()

We can see that force is used at a higher rate for Hispanic and Black people. Again, this is interesting given that the arrest rates are so similar amongst racial groups.

Conclusions

The answers to the previous four questions result in some interesting insights. Overall, Black and Hispanic people had the most searches conducted on them. However, amongst Black, Hispanic (Black and White), and White people arrest proportions were similar. Despite this equality in proportion, Black and Hispanic people were still more likely to be searched or frisk and have force used against them.

It seems that even though the proportions of different races breaking the law are similar, some races are searched and violated by force more. For further study, it would be interesting to look at the variables relating to reasons provided for force usage and the reasons for the stops themselves. Perhaps these could help glean further information.