dipe's log: 2016

Recently I am enjoying Mark Litwintschik's no nonsense blog posts on query performance of different big data environments. For this he ingests the records of more than one billion NYC taxi cab rides , measures the time it takes to ingest the data and run 4 different (simple to complex) SQL queries against the database. He tells us what hardware or cloud config his tests run on and how much it costs. His tests are easily reproducible and he invites people to tell him if they think a test cloud be improved to be fair. Pretty awesome !

Here is a summary of his results:

Test Environment	median (sec)	factor slower
MapD & 8 x Nvidia Pascal Titan X	0.109	1
MapD & 4 x AWS g2.8xlarge	0.2185	2
MapD & 4 x Nvidia Titan X	0.285	3
MapD & 8 x Nvidia Tesla K80	0.123	1
AWS Redshift on 6 x ds2.8xlarge	1.905	17
AWS Redshift on 1 x ds2xlarge	73.5	674
Google BigQuery	2	18
Presto & EMR 5 x unknown instance	79	725
Presto & EMR 50 x unknown instance	43.5	399
Presto & EMR EMRFS 5 x m3.xlarge	81	743
Presto & EMR HDFS 5 x m3.xlarge	51.5	472
ElasticSearch & 1 x 4 core, 16GB, ssd	63.2	580
SparkSQL & EMRFS 5 x unknown	466.5	4280
Postgres & 1 x 4 core, 16GB, ssd	205	1881

a couple of observations:

GPU based databases such as MapD will take over the world (if you can fit your data in them)
GPU based databases can be inexpensive (The Titan X setup is 10x cheaper than the one using the Tesla) yet performance is 400 x faster than the fastest Hadoop cluster in the test
Google BigQuery and AWS Redshift are very fast as well. BigQuery has the advantage that you don't have to setup a Redshift server farm.
Spark is slow (4280 times slower than MapD !!!) and should be avoided for SQL only operations (it has other strengths and those can be very well combined with SparkSQL)
Presto was used because pure Hive was 3-6 x slower.

Please continue the great work Mark, perhaps you can do a test with Apache Drill, Impala or Slam Data on Mongo?

dipe's log

Sunday, September 25, 2016

SQL performance of many big data environments

Followers

Blog Archive

About Me