Abstract:
With the increasing popularity of the big data processing framework Apache Spark, ensuring its safe and stable use while reducing overhead has become a widely discussed topic in industry. A critical factor influencing Spark’s performance is its configuration parameters. Improper parameter settings can lead to significant performance degradation or even large-scale system failures, resulting in substantial financial losses for users. The key challenge lies in determining the valid range of Spark configuration parameters, which varies with the workload, cluster resources, and input data. Furthermore, there are complex interdependencies among parameters. For instance, memory-related parameter ranges depend on the available cluster memory, while memory settings also affect shuffle performance, indirectly influencing the range of shuffle-related parameters. Identifying the valid ranges of Spark configuration parameters is therefore highly challenging. To tackle this challenge, this study proposes a method to efficiently determine Spark configuration parameter ranges across different application scenarios, with the goal of enhancing the security and stability of Spark applications while indirectly reducing time and cost overhead. Using mathematical modeling, we improve traditional parameter range determination methods in two key aspects. First, for search speed, we employ a dynamic probing method that expands and contracts the search interval to determine an initial range, followed by Fibonacci search, whose convergence rate is fast, to further refine the boundaries. Second, for search conditions, our method requires only that the initial search point be set to Spark’s default parameter values, making it adaptable to various scenarios. Based on these two enhancements, we introduce Composite Search, a practical approach for searching Spark configuration parameter ranges.
Without requiring prior knowledge of parameter values, Composite Search effectively determines parameter ranges under different workloads and cluster conditions, significantly improving speed and robustness over traditional methods. To evaluate the effectiveness of Composite Search, we conducted experiments on a four-node x86 cluster using all 103 TPC-DS Spark queries. The results show that, compared to traditional methods for determining parameter ranges in software systems, Composite Search achieves speedups of 5.5× and 4.9× in program-level and parameter-level searches, respectively. Additionally, the parameter ranges identified by Composite Search increase the average program success rate from 46.5% to 81.7%. When integrated with existing experiment-driven and machine learning-based tuning methods, Composite Search reduces overall tuning time by an average of 30%.
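The abstract describes a two-stage search but gives no concrete procedure, so the following is only a minimal sketch of the idea: dynamic probing expands outward from Spark's default value until a run fails, bracketing one boundary of the valid range, and a Fibonacci-ratio search then narrows that bracket. The predicate `ok` (a stand-in for "does the Spark application succeed with this parameter value?"), the 13.3 GB boundary, and all tolerances are illustrative assumptions, not values from the paper.

```python
def probe_upper(default, ok, growth=2.0):
    """Dynamic probing: expand from the default value until a run
    fails, bracketing the upper boundary of the valid range.
    Returns (last_success, first_failure)."""
    lo, hi = default, default * growth
    while ok(hi):
        lo, hi = hi, hi * growth
    return lo, hi


def fibonacci_boundary(lo, hi, ok, tol=0.1):
    """Refine the boundary inside [lo, hi] (ok(lo) holds, ok(hi)
    fails) by probing at Fibonacci-ratio split points until the
    bracket is narrower than tol; returns the largest value known
    to be valid."""
    # Build Fibonacci numbers large enough to cover (hi - lo) / tol.
    fibs = [1, 1]
    while fibs[-1] < (hi - lo) / tol:
        fibs.append(fibs[-1] + fibs[-2])
    k = len(fibs) - 1
    while hi - lo > tol:
        probe = lo + (hi - lo) * fibs[k - 1] / fibs[k]
        if ok(probe):
            lo = probe        # probe succeeded: boundary lies above
        else:
            hi = probe        # probe failed: boundary lies below
        k = max(k - 1, 2)     # fibs[1]/fibs[2] = 1/2 keeps shrinking
    return lo


# Hypothetical example: executor memory (GB) with default 1.0 and an
# assumed true upper boundary of 13.3 GB for this workload.
succeeds = lambda gb: gb <= 13.3
lo, hi = probe_upper(1.0, succeeds)          # brackets to (8.0, 16.0)
upper = fibonacci_boundary(lo, hi, succeeds)  # refines toward 13.3
```

The same pair of routines can be mirrored (shrinking instead of growing, and negating the ratio) to locate the lower boundary of a parameter's valid range.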