Summary | 10 Annotations
accumulators, which are variables that are only “added” to, such as counters and sums
2018/10/11 14:07
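A minimal PySpark sketch of the accumulator pattern described above, assuming the shell's SparkContext sc; the variable name total is illustrative:

total = sc.accumulator(0)                                        # driver-side accumulator, starts at 0
sc.parallelize([1, 2, 3, 4]).foreach(lambda x: total.add(x))     # tasks can only "add" to it
print(total.value)                                               # only the driver reads the result: 10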
broadcast variables, which can be used to cache a value in memory on all nodes
2018/10/11 14:07
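A small sketch of the same idea, again assuming the shell's sc; the lookup table is made up for illustration:

lookup = sc.broadcast({"a": 1, "b": 2})                 # read-only value cached in memory on every node
rdd = sc.parallelize(["a", "b", "a"])
print(rdd.map(lambda k: lookup.value[k]).collect())     # [1, 2, 1]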
the Spark shell, a special interpreter-aware SparkContext is already created for you, in the variable called sc. Making your own SparkContext will not work.
2018/10/11 14:13
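In other words, inside the PySpark shell you simply use the provided sc; attempting to build a second context there is rejected. A rough sketch:

rdd = sc.parallelize(range(10))        # sc already exists in the shell
print(rdd.count())                     # 10

# from pyspark import SparkContext
# SparkContext("local", "duplicate")   # would fail in the shell: only one active SparkContext is allowed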
The appName parameter is a name for your application to show on the cluster UI
2018/10/11 14:15
master is a Spark, Mesos or YARN cluster URL, or a special “local” string to run in local mode
2018/10/11 14:15
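In a standalone program (rather than the shell) these two parameters are supplied through SparkConf; a sketch, with the application name and local master chosen arbitrarily:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("MyApp").setMaster("local[2]")   # name shown on the cluster UI; local mode with 2 cores
sc = SparkContext(conf=conf)

In practice the master is usually not hardcoded like this but passed to spark-submit on the command line.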
One important parameter for parallel collections is the number of partitions to cut the dataset into
2018/10/12 06:32
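For example (a sketch using the shell's sc; 10 partitions chosen arbitrarily):

data = list(range(100))
rdd = sc.parallelize(data, 10)     # second argument = number of partitions to cut the dataset into
print(rdd.getNumPartitions())      # 10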
By default, Spark creates one partition for each block of the file (blocks being 128MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value. Note that you cannot have fewer partitions than blocks.
2018/10/18 03:32
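A sketch of the same point, assuming a hypothetical input file data.txt:

lines = sc.textFile("data.txt")                     # default: one partition per 128MB HDFS block
more = sc.textFile("data.txt", minPartitions=8)     # asking for a higher number is honoured
print(more.getNumPartitions())                      # typically at least 8; asking for fewer than the block count is not honoured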
All transformations in Spark are lazy
2018/10/26 14:48
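That is, transformations such as map and filter only record the lineage; nothing is computed until an action runs. A sketch using the shell's sc:

nums = sc.parallelize(range(1000))
evens = nums.filter(lambda x: x % 2 == 0)   # no job runs yet
squares = evens.map(lambda x: x * x)        # still nothing computed
print(squares.count())                      # the action triggers the whole chain: 500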
you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicated across multiple nodes.
2018/10/26 14:51
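A sketch, assuming the shell's sc and a hypothetical data.txt:

from pyspark import StorageLevel

errors = sc.textFile("data.txt").filter(lambda l: "ERROR" in l)
errors.persist(StorageLevel.MEMORY_AND_DISK)   # or errors.cache() to keep it in memory only
print(errors.count())                          # first action computes and stores the RDD
print(errors.count())                          # later actions reuse the persisted partitions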
To avoid this issue, the simplest way is to copy field into a local variable instead of accessing it externally:
2018/10/26 16:17
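A sketch along the lines of the example this passage introduces (class and field names illustrative): without the local copy, the lambda would capture self and ship the whole object to the executors.

class MyClass(object):
    def __init__(self):
        self.field = "Hello "

    def doStuff(self, rdd):
        field = self.field                     # local copy; only this string is serialized with the closure
        return rdd.map(lambda s: field + s)    # referencing self.field here would pull in the entire object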