Summary
MongoDB offers some advanced data-wrangling capabilities, which I explore here via the R package mongolite. Data is sourced from the nycflights13 dataset.
Importing data: JSON versus BSON
The native data format for MongoDB is .bson, a binary encoding of JSON that stores the ObjectID of each document and offers performance advantages over plain JSON files.
This performance boost can already be noticed at the data-import stage. Files stored in BSON format are also about 25% smaller on disk.
library(mongolite)
library(microbenchmark)
setwd("C:/Users/eoedd/Desktop/locstore/projects/mongodb/")
# create two different collections
flt_bson <- mongo(collection = "fBSON", db = "temp")
flt_json <- mongo(collection = "fJSON", db = "temp")
# benchmark a single import
mbm <- microbenchmark(
  "bson" = {
    flt_bson$import(file("flights-dump.bson"), bson = TRUE)
  },
  "json" = {
    flt_json$import(file("flights-dump.json"))
  },
  times = 1)
# bson import is about 9x faster than json
mbm
## Unit: seconds
## expr min lq mean median uq max neval
## bson 6.954829 6.954829 6.954829 6.954829 6.954829 6.954829 1
## json 64.428150 64.428150 64.428150 64.428150 64.428150 64.428150 1
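As a quick sanity check (a sketch assuming the two collections above were just populated from the same dump), mongolite's count() method can confirm that both imports yielded the same number of documents:

```r
# both collections should contain the full flights dump
n_bson <- flt_bson$count('{}')
n_json <- flt_json$count('{}')
stopifnot(n_bson == n_json)
```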
Aggregations
MongoDB offers two aggregation paradigms: an aggregation-pipeline and a map-reduce approach which relies on JavaScript functions.
The standard aggregation-framework is faster and covers most needs of the typical user whereas the map-reduce option offers more flexibility at the cost of complexity and abstraction.
We will look at simple examples of each in turn.
Aggregation-Pipelines
Here $match
is used to initially filter the documents on the “origin” field.
Then we provide the grouping field "_id" via the $group
stage. The aggregation fields follow this identifier:
$sum
returns the count of documents per group
$avg
returns the average of the field “$distance”
The additional option “allowDiskUse” allows the query to write temporary files to the _tmp
subdirectory of the dbPath directory. See the documentation.
# calculate average distance flown per carrier
airline_stats <- flt_bson$aggregate('[
{"$match" : {"origin" : "LGA"}},
{"$group" : {"_id": "$carrier",
"count" : {"$sum":1},
"average" : {"$avg" : "$distance"}}}
]', options = '{"allowDiskUse" : true}')
# assign some names
names(airline_stats) <- c("carrier", "count", "average")
print(airline_stats)
## carrier count average
## 1 OO 26 423.7692
## 2 F9 685 1620.0000
## 3 FL 3260 664.8294
## 4 MQ 16928 620.8494
## 5 9E 2541 621.8304
## 6 AA 15459 1041.4951
## 7 EV 8826 489.0747
## 8 WN 6087 906.4542
## 9 YV 601 375.0333
## 10 DL 23067 904.4054
## 11 US 13136 287.7186
## 12 UA 8044 1150.9544
## 13 B6 6002 1029.9222
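Further stages can be chained onto the pipeline. For example, a $sort stage (a sketch reusing the collection above) orders the carriers by flight count on the server before the result reaches R:

```r
# same aggregation, sorted by count (descending) inside the pipeline
airline_sorted <- flt_bson$aggregate('[
  {"$match" : {"origin" : "LGA"}},
  {"$group" : {"_id": "$carrier",
               "count" : {"$sum": 1},
               "average" : {"$avg" : "$distance"}}},
  {"$sort" : {"count": -1}}
]', options = '{"allowDiskUse" : true}')
```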
Hierarchical groupings and ‘roll-ups’ can also be easily built. Here I initially calculate the departure delays per carrier-destination pair. Chaining a second aggregation requires a further $group
clause that refers to the primary grouping entity: $avg
is called again to compute over the remaining sub-entities.
A prime example of such an aggregation pipeline is aggregation by geographical or functional units: summing for every entity in a geographical zone, then averaging at the regional level.
I.e. summing units sold in cities within a given state, then computing mean sales for every state.
# group by several variables
group_hier <- flt_bson$aggregate('[
{"$group": {"_id": {"airline": "$carrier", "destination": "$dest"},
"delay_AirlineDest" : {"$avg": "$dep_delay"}}}]',
options='{"allowDiskUse" : true}')
# show structure of initial result
str(group_hier)
## 'data.frame': 314 obs. of 2 variables:
## $ _id :'data.frame': 314 obs. of 2 variables:
## ..$ airline : chr "UA" "OO" "US" "OO" ...
## ..$ destination: chr "IAD" "CLE" "LGA" "DTW" ...
## $ delay_AirlineDest: num -10 8.19 NA 61 -6 ...
# roll-up
group_roll <- flt_bson$aggregate('[
{"$group": {"_id": {"airline": "$carrier","destination": "$dest"},
"delay_AirlineDest" : {"$avg": "$dep_delay"}}},
{"$group": {"_id": "$_id.airline", "delay_Airline": {"$avg": "$delay_AirlineDest"} }}]',
options='{"allowDiskUse" : true}')
head(group_roll)
## _id delay_Airline
## 1 F9 20.215543
## 2 AS 5.804775
## 3 FL 12.519132
## 4 VX 5.016934
## 5 MQ 11.500179
## 6 DL 7.085287
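Note that the roll-up computes an unweighted mean of the per-destination averages, which generally differs from a carrier's overall mean delay when destinations have unequal flight counts. A small base-R sketch (with made-up delay values) illustrates the difference:

```r
# two destinations for one carrier, with unequal flight counts
dest_a <- c(10, 10, 10, 10)   # 4 flights, average delay 10
dest_b <- c(40, 40)           # 2 flights, average delay 40

# overall mean delay, weighted by flight counts
grand_mean <- mean(c(dest_a, dest_b))                 # 20
# what the two-stage roll-up computes: mean of the two averages
mean_of_means <- mean(c(mean(dest_a), mean(dest_b)))  # 25
```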
Map-Reduce
A complete description of the map-reduce functionality can be found at the MongoDB Documentation portal.
# inspecting the collection contents
mongo_head <- flt_bson$find('{}',limit=5)
mongo_head
## year month day dep_time dep_delay arr_time arr_delay carrier tailnum
## 1 2013 1 1 517 2 830 11 UA N14228
## 2 2013 1 1 533 4 850 20 UA N24211
## 3 2013 1 1 542 2 923 33 AA N619AA
## 4 2013 1 1 544 -1 1004 -18 B6 N804JB
## 5 2013 1 1 554 -6 812 -25 DL N668DN
## flight origin dest air_time distance hour minute
## 1 1545 EWR IAH 227 1400 5 17
## 2 1714 LGA IAH 227 1416 5 33
## 3 1141 JFK MIA 160 1089 5 42
## 4 725 JFK BQN 183 1576 5 44
## 5 461 LGA ATL 116 762 5 54
# basic map-reduce
dist_avg <- flt_bson$mapreduce(
map = "function() {emit(this.carrier,this.distance)}",
reduce = "function(key, values) {return Array.avg(values)}",
query = '{"origin" : "JFK"}'
)
print(dist_avg)
## _id value
## 1 9E 553.6672
## 2 AA 1740.9683
## 3 B6 1131.2175
## 4 DL 1633.4662
## 5 EV 228.0000
## 6 HA 4983.0000
## 7 MQ 444.3784
## 8 UA 2532.0919
## 9 US 1077.4331
## 10 VX 2514.5537
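MongoDB has deprecated map-reduce in favour of the aggregation pipeline since version 5.0, and the per-carrier average above can be expressed as a pipeline instead (a sketch against the same collection):

```r
# aggregation-pipeline equivalent of the map-reduce above
dist_avg2 <- flt_bson$aggregate('[
  {"$match" : {"origin" : "JFK"}},
  {"$group" : {"_id": "$carrier", "value": {"$avg": "$distance"}}}
]')
```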
Math functions and binning can be used to get counts of “bucketed” data.
# binning delays and count
delay_bin <- flt_bson$mapreduce(
map = "function(){emit(Math.floor(this.dep_delay/10)*10, 1)}",
reduce = "function(id, counts){return Array.sum(counts)}"
)
# give other names
names(delay_bin) <- c("departure_delay_buckets", "count_n")
print(delay_bin)
## departure_delay_buckets count_n
## 1 NaN 8255
## 2 -50 1
## 3 -40 2
## 4 -30 38
## 5 -20 6537
## 6 -10 176997
## 7 0 59253
## 8 10 22356
## 9 20 13924
## 10 30 9572
## 11 40 7112
## 12 50 5670
## 13 60 4457
## 14 70 3559
## 15 80 2982
## 16 90 2491
## 17 100 1972
## 18 110 1710
## 19 120 1473
## 20 130 1165
## 21 140 973
## 22 150 880
## 23 160 776
## 24 170 676
## 25 180 558
## 26 190 489
## 27 200 441
## 28 210 348
## 29 220 287
## 30 230 277
## 31 240 222
## 32 250 173
## 33 260 149
## 34 270 156
## 35 280 116
## 36 290 115
## 37 300 82
## 38 310 80
## 39 320 65
## 40 330 57
## 41 340 41
## 42 350 40
## 43 360 34
## 44 370 27
## 45 380 30
## 46 390 23
## 47 400 18
## 48 410 14
## 49 420 8
## 50 430 11
## 51 440 4
## 52 450 3
## 53 460 2
## 54 470 8
## 55 480 5
## 56 490 1
## 57 500 8
## 58 510 2
## 59 520 1
## 60 530 2
## 61 540 2
## 62 570 1
## 63 580 2
## 64 590 3
## 65 600 1
## 66 610 1
## 67 620 1
## 68 630 3
## 69 650 1
## 70 660 1
## 71 680 1
## 72 690 1
## 73 700 1
## 74 740 1
## 75 750 1
## 76 760 1
## 77 780 3
## 78 790 3
## 79 800 2
## 80 810 1
## 81 820 1
## 82 840 2
## 83 850 3
## 84 870 1
## 85 890 3
## 86 910 1
## 87 960 1
## 88 1000 1
## 89 1010 1
## 90 1120 1
## 91 1130 1
## 92 1300 1
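The NaN bucket collects flights with a missing dep_delay (e.g. cancelled flights). The bucketing itself is plain integer flooring, which can be reproduced in base R on a few illustrative delay values:

```r
# reproduce the bucket formula from the map function
bucket <- function(delay) floor(delay / 10) * 10
bucket(c(-6, 2, 17, 25, 103))
## [1] -10   0  10  20 100
```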
Most flights seem to range between slightly early and slightly delayed, with only very rare extreme delays of around one day!