Abstract:
With the increasing popularity of the big data processing framework Apache Spark, ensuring its safe and stable use while reducing overhead has become a widely discussed topic in industry. A critical factor influencing Spark’s performance is its configuration parameters. Improper parameter settings can lead to significant performance degradation or even large-scale system failures, resulting in substantial financial losses for users. The key challenge lies in determining the valid range of Spark configuration parameters, which varies with the workload, cluster resources, and input data. Furthermore, there are complex interdependencies among parameters. For instance, memory-related parameter ranges depend on the available cluster memory, while memory settings also affect shuffle performance, indirectly influencing the range of shuffle-related parameters. Identifying the valid ranges of Spark configuration parameters is therefore highly challenging. To tackle this challenge, this study proposes a method to efficiently determine Spark configuration parameter ranges across different application scenarios, with the goal of enhancing the security and stability of Spark applications while indirectly reducing time and cost overhead. Using mathematical modeling, we improve traditional parameter range determination methods in two key aspects. First, for search speed, we employ a dynamic probing method that expands and contracts the search interval to determine an initial range, followed by Fibonacci search, whose convergence rate is fast, to further refine the boundaries. Second, for search conditions, our method requires only that the initial search point be set to Spark’s default parameter values, making it adaptable to various scenarios. Based on these two enhancements, we introduce Composite Search, a practical approach for searching Spark configuration parameter ranges.
Without requiring prior knowledge of parameter values, Composite Search effectively determines parameter ranges under different workloads and cluster conditions, significantly improving speed and robustness over traditional methods. To evaluate the effectiveness of Composite Search, we conducted experiments on a four-node x86 cluster using all 103 TPC-DS Spark queries. The results show that, compared to traditional methods for determining parameter ranges in software systems, Composite Search achieves speedups of 5.5× and 4.9× in program-level and parameter-level searches, respectively. Additionally, the parameter ranges identified by Composite Search increase the average program success rate from 46.5% to 81.7%. When integrated with existing experiment-driven and machine learning-based tuning methods, Composite Search reduces overall tuning time by an average of 30%.
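The abstract describes a two-stage search but gives no concrete procedure, so the following is only a minimal sketch of the idea: dynamic probing expands outward from Spark's default value until a run fails, bracketing one boundary of the valid range, and a Fibonacci-ratio search then narrows that bracket. The predicate `ok` (a stand-in for "does the Spark application succeed with this parameter value?"), the 13.3 GB boundary, and all tolerances are illustrative assumptions, not values from the paper.

```python
def probe_upper(default, ok, growth=2.0):
    """Dynamic probing: expand from the default value until a run
    fails, bracketing the upper boundary of the valid range.
    Returns (last_success, first_failure)."""
    lo, hi = default, default * growth
    while ok(hi):
        lo, hi = hi, hi * growth
    return lo, hi


def fibonacci_boundary(lo, hi, ok, tol=0.1):
    """Refine the boundary inside [lo, hi] (ok(lo) holds, ok(hi)
    fails) by probing at Fibonacci-ratio split points until the
    bracket is narrower than tol; returns the largest value known
    to be valid."""
    # Build Fibonacci numbers large enough to cover (hi - lo) / tol.
    fibs = [1, 1]
    while fibs[-1] < (hi - lo) / tol:
        fibs.append(fibs[-1] + fibs[-2])
    k = len(fibs) - 1
    while hi - lo > tol:
        probe = lo + (hi - lo) * fibs[k - 1] / fibs[k]
        if ok(probe):
            lo = probe        # probe succeeded: boundary lies above
        else:
            hi = probe        # probe failed: boundary lies below
        k = max(k - 1, 2)     # fibs[1]/fibs[2] = 1/2 keeps shrinking
    return lo


# Hypothetical example: executor memory (GB) with default 1.0 and an
# assumed true upper boundary of 13.3 GB for this workload.
succeeds = lambda gb: gb <= 13.3
lo, hi = probe_upper(1.0, succeeds)          # brackets to (8.0, 16.0)
upper = fibonacci_boundary(lo, hi, succeeds)  # refines toward 13.3
```

The same pair of routines can be mirrored (shrinking instead of growing, and negating the ratio) to locate the lower boundary of a parameter's valid range.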