MongoDB via mongolite: Aggregations and Queries

Dan Gray

2019/04/08

Summary

MongoDB offers some advanced data-wrangling capabilities, which I explore via the R package mongolite. Data is sourced from the nycflights13 dataset.

Importing data: JSON versus BSON

The native data format for MongoDB is .bson (Binary JSON), a binary-encoded serialization of JSON-like documents that stores the ObjectId of each document and offers performance advantages over standard JSON files.

This performance boost is already noticeable at the data-import stage. Files stored in BSON format are also about 25% smaller on disk.

library(mongolite)
library(microbenchmark)

setwd("C:/Users/eoedd/Desktop/locstore/projects/mongodb/")

# create two different collections
flt_bson <- mongo(collection = "fBSON", db = "temp")
flt_json <- mongo(collection = "fJSON", db = "temp")

# benchmark a single import
mbm <- microbenchmark(
  "bson" = {
    flt_bson$import(file("flights-dump.bson"), bson = TRUE)
  },
  "json" = {
    flt_json$import(file("flights-dump.json"))
  },
  times = 1
)

# bson import is about 9x faster than json here
mbm
## Unit: seconds
##  expr       min        lq      mean    median        uq       max neval
##  bson  6.954829  6.954829  6.954829  6.954829  6.954829  6.954829     1
##  json 64.428150 64.428150 64.428150 64.428150 64.428150 64.428150     1
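
As a rough check of the on-disk claim, the sizes of the two dump files can be compared directly (a minimal sketch; it assumes both dump files sit in the working directory):

# compare the size of the two dumps on disk
sizes <- file.size(c("flights-dump.bson", "flights-dump.json"))
names(sizes) <- c("bson", "json")
round(sizes / 1e6, 1)                            # sizes in MB
round(1 - sizes[["bson"]] / sizes[["json"]], 2)  # fraction saved by bson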

Aggregations

MongoDB offers two aggregation paradigms: an aggregation pipeline, and a map-reduce approach that relies on JavaScript functions.

The standard aggregation framework is faster and covers most needs of the typical user, whereas the map-reduce option offers more flexibility at the cost of complexity and abstraction.

We will look at simple examples of each in turn.

Aggregation-Pipelines

Here $match is used to first filter the documents on the “origin” field.

Then we provide the grouping field as the "_id" key of the $group stage. The aggregation fields follow this identifier.

  • $sum with a constant value of 1 returns the document count per group
  • $avg returns the average of the field “$distance”

The additional option “allowDiskUse” allows the aggregation to write temporary files to the _tmp directory inside the dbPath directory when it exceeds the memory limit. See the documentation.

# calculate average distance flown per carrier
airline_stats <- flt_bson$aggregate('[
  {"$match" : {"origin" : "LGA"}},
  {"$group" : {"_id" : "$carrier",
               "count" : {"$sum" : 1},
               "average" : {"$avg" : "$distance"}}}
]', options = '{"allowDiskUse" : true}')

# assign some names
names(airline_stats) <- c("carrier", "count", "average")
print(airline_stats)
##    carrier count   average
## 1       OO    26  423.7692
## 2       F9   685 1620.0000
## 3       FL  3260  664.8294
## 4       MQ 16928  620.8494
## 5       9E  2541  621.8304
## 6       AA 15459 1041.4951
## 7       EV  8826  489.0747
## 8       WN  6087  906.4542
## 9       YV   601  375.0333
## 10      DL 23067  904.4054
## 11      US 13136  287.7186
## 12      UA  8044 1150.9544
## 13      B6  6002 1029.9222
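
Pipelines can be extended with further stages; for example, appending a $sort stage orders the carriers by average distance. A sketch (not evaluated here):

# same aggregation, with results sorted by average distance (descending)
airline_sorted <- flt_bson$aggregate('[
  {"$match" : {"origin" : "LGA"}},
  {"$group" : {"_id" : "$carrier",
               "count" : {"$sum" : 1},
               "average" : {"$avg" : "$distance"}}},
  {"$sort" : {"average" : -1}}
]', options = '{"allowDiskUse" : true}')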

Hierarchical grouping and ‘roll-ups’ can also be easily built. Here I first calculate the departure delay per carrier-destination pair. Chaining a second aggregation requires a further $group clause that refers to the primary grouping entity: $avg is called again to compute over the remaining sub-entities.

A prime example of such an aggregation pipeline is aggregation by geographical/functional units: summing for every entity in a geographical zone, then averaging at the regional level.

For example, summing the units sold in each city within a given state, then computing the mean sales per state, as sketched below.
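
As a sketch of that sales example (the collection name sales and the fields state, city, and units are hypothetical, purely for illustration):

# hypothetical: total units per (state, city), then mean city total per state
sales <- mongo(collection = "sales", db = "temp")
state_avg <- sales$aggregate('[
  {"$group": {"_id": {"state": "$state", "city": "$city"},
              "units_city": {"$sum": "$units"}}},
  {"$group": {"_id": "$_id.state",
              "mean_units": {"$avg": "$units_city"}}}
]')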

# group by several variables
group_hier <- flt_bson$aggregate('[
  {"$group": {"_id": {"airline": "$carrier", "destination": "$dest"},
              "delay_AirlineDest": {"$avg": "$dep_delay"}}}
]', options = '{"allowDiskUse" : true}')

# show structure of initial result
str(group_hier)
## 'data.frame':    314 obs. of  2 variables:
##  $ _id              :'data.frame':   314 obs. of  2 variables:
##   ..$ airline    : chr  "UA" "OO" "US" "OO" ...
##   ..$ destination: chr  "IAD" "CLE" "LGA" "DTW" ...
##  $ delay_AirlineDest: num  -10 8.19 NA 61 -6 ...
# roll-up: average the carrier-destination delays per carrier
group_roll <- flt_bson$aggregate('[
  {"$group": {"_id": {"airline": "$carrier", "destination": "$dest"},
              "delay_AirlineDest": {"$avg": "$dep_delay"}}},
  {"$group": {"_id": "$_id.airline",
              "delay_Airline": {"$avg": "$delay_AirlineDest"}}}
]', options = '{"allowDiskUse" : true}')

head(group_roll)
##   _id delay_Airline
## 1  F9     20.215543
## 2  AS      5.804775
## 3  FL     12.519132
## 4  VX      5.016934
## 5  MQ     11.500179
## 6  DL      7.085287

Map-Reduce

A complete description of the map-reduce functionality can be found at the MongoDB Documentation portal.

# inspecting the collection contents
mongo_head <- flt_bson$find('{}', limit = 5)
mongo_head
##   year month day dep_time dep_delay arr_time arr_delay carrier tailnum
## 1 2013     1   1      517         2      830        11      UA  N14228
## 2 2013     1   1      533         4      850        20      UA  N24211
## 3 2013     1   1      542         2      923        33      AA  N619AA
## 4 2013     1   1      544        -1     1004       -18      B6  N804JB
## 5 2013     1   1      554        -6      812       -25      DL  N668DN
##   flight origin dest air_time distance hour minute
## 1   1545    EWR  IAH      227     1400    5     17
## 2   1714    LGA  IAH      227     1416    5     33
## 3   1141    JFK  MIA      160     1089    5     42
## 4    725    JFK  BQN      183     1576    5     44
## 5    461    LGA  ATL      116      762    5     54
# basic map-reduce: average distance per carrier for JFK departures
dist_avg <- flt_bson$mapreduce(
  map = "function() {emit(this.carrier, this.distance)}",
  reduce = "function(key, values) {return Array.avg(values)}",
  query = '{"origin" : "JFK"}'
)

print(dist_avg)
##    _id     value
## 1   9E  553.6672
## 2   AA 1740.9683
## 3   B6 1131.2175
## 4   DL 1633.4662
## 5   EV  228.0000
## 6   HA 4983.0000
## 7   MQ  444.3784
## 8   UA 2532.0919
## 9   US 1077.4331
## 10  VX 2514.5537
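
For comparison, the same per-carrier averages can be reproduced with the aggregation pipeline used earlier, swapping the origin filter to JFK (a sketch, not evaluated here):

# aggregation-pipeline equivalent of the map-reduce above
dist_avg_pipe <- flt_bson$aggregate('[
  {"$match" : {"origin" : "JFK"}},
  {"$group" : {"_id" : "$carrier",
               "value" : {"$avg" : "$distance"}}}
]')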

Math functions and binning can be used to get counts of “bucketed” data.

# bin departure delays into 10-minute buckets and count flights per bucket
# (flights with a missing dep_delay fall into a NaN bucket)
delay_bin <- flt_bson$mapreduce(
  map = "function() {emit(Math.floor(this.dep_delay/10)*10, 1)}",
  reduce = "function(id, counts) {return Array.sum(counts)}"
)

# give other names
names(delay_bin) <- c("departure_delay_buckets", "count_n")
print(delay_bin)
##    departure_delay_buckets count_n
## 1                      NaN    8255
## 2                      -50       1
## 3                      -40       2
## 4                      -30      38
## 5                      -20    6537
## 6                      -10  176997
## 7                        0   59253
## 8                       10   22356
## 9                       20   13924
## 10                      30    9572
## 11                      40    7112
## 12                      50    5670
## 13                      60    4457
## 14                      70    3559
## 15                      80    2982
## 16                      90    2491
## 17                     100    1972
## 18                     110    1710
## 19                     120    1473
## 20                     130    1165
## 21                     140     973
## 22                     150     880
## 23                     160     776
## 24                     170     676
## 25                     180     558
## 26                     190     489
## 27                     200     441
## 28                     210     348
## 29                     220     287
## 30                     230     277
## 31                     240     222
## 32                     250     173
## 33                     260     149
## 34                     270     156
## 35                     280     116
## 36                     290     115
## 37                     300      82
## 38                     310      80
## 39                     320      65
## 40                     330      57
## 41                     340      41
## 42                     350      40
## 43                     360      34
## 44                     370      27
## 45                     380      30
## 46                     390      23
## 47                     400      18
## 48                     410      14
## 49                     420       8
## 50                     430      11
## 51                     440       4
## 52                     450       3
## 53                     460       2
## 54                     470       8
## 55                     480       5
## 56                     490       1
## 57                     500       8
## 58                     510       2
## 59                     520       1
## 60                     530       2
## 61                     540       2
## 62                     570       1
## 63                     580       2
## 64                     590       3
## 65                     600       1
## 66                     610       1
## 67                     620       1
## 68                     630       3
## 69                     650       1
## 70                     660       1
## 71                     680       1
## 72                     690       1
## 73                     700       1
## 74                     740       1
## 75                     750       1
## 76                     760       1
## 77                     780       3
## 78                     790       3
## 79                     800       2
## 80                     810       1
## 81                     820       1
## 82                     840       2
## 83                     850       3
## 84                     870       1
## 85                     890       3
## 86                     910       1
## 87                     960       1
## 88                    1000       1
## 89                    1010       1
## 90                    1120       1
## 91                    1130       1
## 92                    1300       1
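
A quick bar chart of these buckets makes the skew obvious (a sketch in base R; the NaN bucket of flights with a missing dep_delay is dropped first):

# visualize the bucketed delays, dropping the NaN bucket
plot_bin <- delay_bin[is.finite(delay_bin$departure_delay_buckets), ]
barplot(plot_bin$count_n,
        names.arg = plot_bin$departure_delay_buckets,
        las = 2, cex.names = 0.5,
        xlab = "departure delay bucket (minutes)",
        ylab = "number of flights")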

Most flights range from slightly early to slightly delayed, with extreme delays of around one day being exceedingly rare!