How Does Apache Spark Accumulator Work?
Introduction to Spark Accumulator

Apache Spark uses shared variables. When the driver sends a task to an executor on the cluster, each node of the cluster receives a copy of the shared variables. Apache Spark supports two basic types of shared variables: accumulators and broadcast variables. Apache Spark is a widely used, open-source cluster computing framework that ships with features such as machine learning, streaming APIs, and graph processing algorithms. Accumulators are variables that are only added to through an associative operation; implementing sums and counters is one typical example of an accumulator task, among many others. Spark natively supports accumulators of numeric types, and programmers can add support for other types.
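Since Spark only supports numeric accumulators out of the box, here is a minimal sketch of adding support for another type with the legacy AccumulatorParam API this article is written against (newer Spark versions replace it with AccumulatorV2). The object name VectorAccumulatorParam is our own invention for illustration, and sc is assumed to be an existing SparkContext:

import org.apache.spark.AccumulatorParam

// Custom accumulator logic for element-wise vector sums (illustrative only)
object VectorAccumulatorParam extends AccumulatorParam[Vector[Double]] {
  // Zero value used when merging partial results: a same-length vector of zeros
  def zero(initialValue: Vector[Double]): Vector[Double] =
    Vector.fill(initialValue.length)(0.0)
  // How two partial results combine: element-wise addition
  def addInPlace(v1: Vector[Double], v2: Vector[Double]): Vector[Double] =
    v1.zip(v2).map { case (a, b) => a + b }
}

// Usage: pass the param explicitly when creating the accumulator
val vecAccum = sc.accumulator(Vector(0.0, 0.0, 0.0))(VectorAccumulatorParam)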
Syntax:

val acc = sc.accumulator(v)

This call on the SparkContext creates an accumulator with the initial value v (PySpark exposes the equivalent pyspark.Accumulator class). Initially v is most often set to zero, as when one performs a sum or a count operation.
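For newer Spark releases (2.0+), where sc.accumulator is deprecated in favor of the AccumulatorV2 API, the minimal equivalent of the call above is:

val acc = sc.longAccumulator("my accumulator")  // naming it makes it visible in the Spark UI
acc.add(1L)   // executors add to it
acc.value     // only the driver reads the running total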
Why do we Use Spark Accumulator?

When a user wants to perform commutative or associative operations on the data, we use the Spark accumulator. Accumulators can be created with or without a name; the Spark UI displays named accumulators, which is useful for understanding the progress of running stages. An accumulator is created by calling SparkContext.accumulator(v) with the initial value v, much as a broadcast variable is created from the SparkContext. Accumulators are used to implement sums and counters, as with counters in MapReduce functions. Note that creating named accumulators is not supported in Python.
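As a concrete illustration of the named form (a sketch assuming an existing SparkContext sc; the name "Blank Line Count" and the sample data are made up for this example):

val blankCount = sc.accumulator(0, "Blank Line Count")  // appears under this name in the Spark UI
sc.parallelize(Seq("ok", "", "ok", "")).foreach { line =>
  if (line.isEmpty) blankCount += 1
}
println(blankCount.value)  // prints 2; only the driver can read the value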
Code:
package org.spark.accumulator.crowd.now.aggregator.sample2

var lalaLines: Int = 0
sc.textFile("some log file", 4)
  .foreach { line =>
    if (line.length() == 0) lalaLines += 1
  }
println(s"Lala Lines are from the above code = $lalaLines")

In the above code, the value printed for blank lines will be zero. Because Spark ships the code to each executor, the variable becomes local to that executor, and the updated value is not sent back to the driver. Making lalaLines an accumulator solves this problem, since every executor's updates to the variable are then relayed back to the driver.
With an accumulator, the above code is written like this:
package org.spark.accumulator.crowd.now.aggregator.sample2

val lalaLines = sc.accumulator(0, "lala Lines")
sc.textFile("some log file", 4)
  .foreach { line =>
    if (line.length() == 0) lalaLines += 1
  }
println(s"lala Lines are from the above code = ${lalaLines.value}")

This code makes sure that the accumulator lalaLines is kept up to date across each executor and relayed back to the driver.
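One caveat worth noting about the example above: Spark guarantees that each task's update to an accumulator is applied exactly once only when the update happens inside an action (such as foreach here); updates made inside transformations such as map can be applied more than once if a task or stage is retried.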
How Does Spark Accumulator Work?

Broadcast variables, the other kind of shared variable, allow Spark developers to keep a read-only variable cached on each node rather than shipping a copy of it with every task. They can be used to give every node a copy of a large input dataset without wasting time on network input and output. Spark distributes broadcast variables using efficient broadcast algorithms, which reduces the cost of communication.
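A minimal sketch of the broadcast mechanism just described (sc is assumed to be an existing SparkContext, and the lookup table contents are invented for illustration):

// The read-only map is cached on each node once, not shipped with every task
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))

// Tasks read the cached value through .value
val resolved = sc.parallelize(Seq("a", "b", "a"))
  .map(key => lookup.value.getOrElse(key, 0))
  .collect()   // Array(1, 2, 1)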
Spark actions are executed through a series of stages, separated by shuffle operations. Within each stage, Spark automatically broadcasts the common data needed by tasks: the data is cached in serialized form and deserialized by each node before each task runs. For this reason, explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data.
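To make the stage boundary described above concrete, here is a small word-count sketch (assuming sc and reusing the hypothetical "some log file" path from earlier): the narrow transformations run in one stage, and reduceByKey forces a shuffle that begins a second stage.

val counts = sc.textFile("some log file")
  .flatMap(line => line.split(" "))   // stage 1: narrow transformations
  .map(word => (word, 1))
  .reduceByKey(_ + _)                 // shuffle boundary: a new stage starts here
counts.collect()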
As mentioned above, an accumulator is created by wrapping an initial value v with the function SparkContext.accumulator; the code for it is:
Code:
scala> val accum = sc.accumulator(0)
accum: spark.Accumulator[Int] = 0

scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)

scala> accum.value
res2: Int = 10

A fuller application skeleton looks like this:

package org.spark.accumulator.crowd.now.aggregator.sample2

import org.apache.spark.{SparkConf, SparkContext}

object Boot {
  def main(args: Array[String]): Unit = {
    val sparkConfiguration = new SparkConf(true)
      .setMaster("local[4]")
      .setAppName("Analyzer Spark")
    val sparkContext = new SparkContext(sparkConfiguration)
    sparkContext.textFile(getClass.getResource("log").getPath)
    println("THE_START")
    println("HttpStatusCodes are going to be printed in result from access the log parse")
    println("Http Status Info")
    println("Status Success")
    println("Status Redirect")
    println("Client Error")
    println("Server Error")
    println("THE_END")
    sparkContext.stop()
  }
}

The same counter written in PySpark:

accum = sc.accumulator(0)
rdd = sc.parallelize([1, 2, 3, 4])

def f(x):
    global accum
    accum += x

rdd.foreach(f)
accum.value

The accumulator command line is given below:
$SPARK_HOME/bin/spark-submit accumulator.py

Output:
Advantages and Uses of Spark Accumulator

Direct memory access.
Low garbage-collection overhead during processing.
A compact, columnar memory format.
Query optimization through the Catalyst optimizer.
Whole-stage code generation.
Compile-time type safety, an advantage of Datasets over DataFrames.
Conclusion

We have seen the concept of the Spark accumulator. Spark uses shared variables for parallel processing, and accumulator variables are used for aggregating information through commutative and associative operations. As in MapReduce, where we use a counter for summing values, an accumulator can be used for sum or count operations. In Spark the accumulator is mutable from tasks, but executors cannot read its value; only the driver program can. The Counter in Java MapReduce is similar to this.