Sunday, September 25, 2016

SQL performance of many big data environments

Recently I am enjoying Mark Litwintschik's no nonsense blog posts on query performance of different big data environments. For this he ingests the records of more than one billion NYC taxi cab rides , measures the time it takes to ingest the data and run 4 different (simple to complex) SQL queries against the database. He tells us what hardware or cloud config his tests run on and how much it costs. His tests are easily reproducible and he invites people to tell him if they think a test cloud be improved to be fair. Pretty awesome !

Here is a summary of his results:

Test Environment

median (sec)

factor slower

MapD & 8 x Nvidia Pascal Titan X

0.109

1

MapD & 4 x AWS g2.8xlarge

0.2185

2

MapD & 4 x Nvidia Titan X

0.285

3

MapD & 8  x Nvidia Tesla K80

0.123

1

AWS Redshift on 6 x  ds2.8xlarge

1.905

17

AWS Redshift on 1 x  ds2xlarge

73.5

674

Google BigQuery

2

18

Presto & EMR 5 x unknown instance

79

725

Presto & EMR 50 x unknown instance

43.5

399

Presto & EMR EMRFS 5 x m3.xlarge

81

743

Presto & EMR HDFS 5 x m3.xlarge

51.5

472

ElasticSearch & 1 x 4 core, 16GB, ssd

63.2

580

SparkSQL & EMRFS 5 x unknown

466.5

4280

Postgres & 1 x 4 core, 16GB, ssd

205

1881



a couple of observations:
  • GPU based databases such as MapD will take over the world (if you can fit your data in them)
  • GPU based databases can be inexpensive (The Titan X setup is 10x cheaper than the one using the Tesla) yet performance is 400 x faster than the fastest Hadoop cluster in the test
  • Google BigQuery and AWS Redshift are very fast as well. BigQuery has the advantage that you don't have to setup a Redshift server farm.
  • Spark is slow (4280 times slower than MapD !!!) and should be avoided for SQL only operations (it has other strengths and those can be very well combined with SparkSQL)
  • Presto was used because pure Hive was 3-6 x slower.   

Please continue the great work Mark, perhaps you can do a test with Apache Drill, Impala or Slam Data on Mongo?