Abstract: Benchmarks are important tools for evaluating the performance of computing systems. However, benchmarks for big data systems are still scarce: big data is a relatively new field, and researchers who want to understand how big data systems behave, in both hardware and software, lack the data to do so. In this paper, we first devise an approach for developing big data benchmarks. We then present SIAT-Bench, a big data benchmark suite containing five representative workloads from the Shenzhen urban transportation system. To build it, we characterize program behavior and quantify the impact of input data sets by observing metrics at multiple levels, including the microarchitecture, OS, and application layers. We then apply statistical techniques such as Principal Component Analysis (PCA) and clustering to analyze the similarity between workload-input pairs. Finally, we construct SIAT-Bench by selecting representative workloads and their associated input sets according to the clustering results. Experimental results show that SIAT-Bench satisfies the requirements of a benchmark suite.
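
The selection pipeline summarized above (metrics, then PCA, then clustering, then one representative per cluster) can be illustrated with a minimal sketch. It makes several assumptions that are not taken from the paper: the metric columns and their values are made-up placeholders, k-means stands in for whichever clustering algorithm the authors used, the 90% variance threshold is arbitrary, and the function names (`pca`, `kmeans`, `representatives`) are hypothetical.

```python
import numpy as np

def pca(X, var_kept=0.9):
    """Project standardized metrics onto the leading principal components."""
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    sigma[sigma == 0] = 1.0                  # guard against constant metrics
    Z = (X - mu) / sigma
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    ratio = np.cumsum(S**2) / np.sum(S**2)   # cumulative explained variance
    k = min(int(np.searchsorted(ratio, var_kept)) + 1, len(S))
    return Z @ Vt[:k].T

def assign(P, centers):
    """Index of the nearest center for every row of P."""
    dist = np.linalg.norm(P[:, None, :] - centers[None, :, :], axis=2)
    return dist.argmin(axis=1)

def kmeans(P, k, iters=100, seed=0):
    """Plain Lloyd's k-means (an assumed stand-in for the paper's clustering)."""
    rng = np.random.default_rng(seed)
    centers = P[rng.choice(len(P), size=k, replace=False)]
    for _ in range(iters):
        labels = assign(P, centers)
        new_centers = centers.copy()
        for j in range(k):
            members = P[labels == j]
            if len(members) > 0:             # keep the old center if a cluster empties
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return assign(P, centers), centers

def representatives(P, labels, centers):
    """For each cluster, pick the workload-input pair closest to its center."""
    reps = {}
    for j in range(len(centers)):
        idx = np.flatnonzero(labels == j)
        if idx.size:
            d = np.linalg.norm(P[idx] - centers[j], axis=1)
            reps[j] = int(idx[d.argmin()])
    return reps

# Placeholder data: one row per workload-input pair, one column per metric
# (e.g. IPC, cache misses, CPU utilization, throughput). Values are invented.
metrics = np.array([
    [1.0, 10.0, 0.20,  5.0],
    [1.1, 11.0, 0.22,  5.5],
    [0.9,  9.5, 0.19,  4.8],
    [3.0, 40.0, 0.80, 20.0],
    [3.2, 42.0, 0.85, 21.0],
    [2.9, 39.0, 0.78, 19.5],
])
projected = pca(metrics)
labels, centers = kmeans(projected, k=2)
print(representatives(projected, labels, centers))
```

Choosing the pair nearest each cluster center is only one plausible selection rule; a suite designer might equally prefer the pair with the smallest input set or the shortest run time within each cluster.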