Here is a summary of his results:
Test Environment |
median (sec) |
factor slower |
MapD & 8 x Nvidia Pascal Titan X |
0.109 |
1 |
MapD & 4 x AWS g2.8xlarge |
0.2185 |
2 |
MapD & 4 x Nvidia Titan X |
0.285 |
3 |
MapD & 8 x Nvidia Tesla K80 |
0.123 |
1 |
AWS Redshift on 6 x
ds2.8xlarge |
1.905 |
17 |
AWS Redshift on 1 x
ds2xlarge |
73.5 |
674 |
Google BigQuery |
2 |
18 |
Presto & EMR 5 x unknown instance |
79 |
725 |
Presto & EMR 50 x unknown instance |
43.5 |
399 |
Presto & EMR EMRFS 5 x m3.xlarge |
81 |
743 |
Presto & EMR HDFS 5 x m3.xlarge |
51.5 |
472 |
ElasticSearch & 1 x 4 core, 16GB, ssd |
63.2 |
580 |
SparkSQL & EMRFS 5 x unknown |
466.5 |
4280 |
Postgres & 1 x 4 core, 16GB, ssd |
205 |
1881 |
a couple of observations:
- GPU based databases such as MapD will take over the world (if you can fit your data in them)
- GPU based databases can be inexpensive (The Titan X setup is 10x cheaper than the one using the Tesla) yet performance is 400 x faster than the fastest Hadoop cluster in the test
- Google BigQuery and AWS Redshift are very fast as well. BigQuery has the advantage that you don't have to setup a Redshift server farm.
- Spark is slow (4280 times slower than MapD !!!) and should be avoided for SQL only operations (it has other strengths and those can be very well combined with SparkSQL)
- Presto was used because pure Hive was 3-6 x slower.
Please continue the great work Mark, perhaps you can do a test with Apache Drill, Impala or Slam Data on Mongo?