SQL is gaining popularity as more researchers work with structured data. Rather than re-importing data every session, loading it once into a relational database management system (RDBMS), keeping it persistent, and querying it with SQL is a significant improvement.
The problem with standard RDBMS systems is that their algorithms are often serial and constrained by the need to keep transactions consistent (think keeping bank deposits and debits in order), a set of guarantees known as ACID.
In many research cases, though, researchers do not need transactions: they have data they simply want to query, or their data is append-only, such as new measurements. By relaxing the transactional requirements, researchers can use a whole host of new methods that scale far better.
Enter Apache Hive. Hive is a data warehouse tool that lets data on a Hadoop cluster (such as the cluster at ARC-TS) be queried using SQL syntax. Even for large tables running into thousands of gigabytes, performance remains consistent.
In this example I have data in CSV format exported from a database. It has 12 columns and 1,487,169,693 rows, about 880 GB of raw data. Once the data is in Hadoop and I have created a table over it, I can use Hive to query it just like any other SQL table.
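Creating a table over files already sitting in Hadoop takes only a short DDL statement. The sketch below shows the general shape; the HDFS path and column names are illustrative placeholders, not the actual schema of this dataset.

```sql
-- Sketch: expose CSV files already in HDFS as a Hive table.
-- Path and column names are hypothetical examples.
CREATE EXTERNAL TABLE sample_table (
  id          BIGINT,
  measured_at TIMESTAMP,
  value       DOUBLE
  -- ... the remaining columns of the 12-column schema
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/example/sample_data/';
```

Because the table is EXTERNAL, Hive reads the CSV files in place; dropping the table later removes only the metadata, not the data.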
SELECT COUNT(*) FROM sample_table;
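Beyond simple counts, ordinary analytical SQL works the same way. The query below is a hypothetical example assuming the illustrative column names above; Hive compiles it into parallel jobs across the cluster automatically.

```sql
-- Hypothetical aggregate query; column names are illustrative.
SELECT id,
       COUNT(*)   AS n_rows,
       AVG(value) AS mean_value
FROM sample_table
GROUP BY id
ORDER BY n_rows DESC
LIMIT 10;
```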
Researchers who work with a lot of structured data will find SQL on Hive intuitive and very powerful, and it removes many of the limits on query performance and data size imposed by single-node solutions.
To many researchers, working with SQL or Hadoop is new and daunting, but both are part of the growing BigData ecosystem. Please contact ARC-TS at firstname.lastname@example.org and one of our staff can help you with your data.