Interpreting combineByKey

The groupByKey call makes no attempt at merging or combining values before the shuffle, so every value is moved across the network, which makes it an expensive operation.

The combineByKey call is exactly that kind of optimization. With combineByKey, values are first merged into one value per key within each partition, and then the per-partition values are merged into a single value per key. Note that the type of the combined value does not have to match the type of the original values, and often it won't. The combineByKey function takes three functions as arguments:

In short: groupByKey does not merge or combine values, so it is a heavy operation.

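So combineByKey can be seen as the optimization. Values are first merged into one value inside each partition, and then the per-partition values are merged into one. The type of the combined value does not need to match the type of the original values.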
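To make the cost difference concrete, here is a minimal sketch in plain Python (not Spark itself). It models partitions as lists of (key, value) pairs and counts how many records would have to cross the shuffle with and without merging inside each partition first. The data and the helper names (`shuffle_without_combining`, `shuffle_with_local_combining`, `final_sums`) are made up for illustration, and the model ignores everything else a real shuffle does.

```python
from collections import defaultdict

# Hypothetical data: two partitions of (key, value) pairs.
partitions = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("a", 4), ("b", 5), ("b", 6)],
]

def shuffle_without_combining(parts):
    """groupByKey-style model: every record crosses the shuffle unchanged."""
    return [record for part in parts for record in part]

def shuffle_with_local_combining(parts):
    """combineByKey-style model: sum values per key inside each partition first."""
    shuffled = []
    for part in parts:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value
        # Only one record per key leaves each partition.
        shuffled.extend(local.items())
    return shuffled

def final_sums(shuffled):
    """Merge whatever arrived from the partitions into one total per key."""
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals)

plain = shuffle_without_combining(partitions)
combined = shuffle_with_local_combining(partitions)

print(len(plain), len(combined))                    # records crossing the shuffle
print(final_sums(plain) == final_sums(combined))    # same answer either way
```

In this toy data the saving is small, but the gap grows with the number of repeated keys per partition, since only one record per key leaves each partition.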

combineByKey takes three functions as arguments:

A function that creates a combiner. In the aggregateByKey function the first argument was simply an initial zero value. In combineByKey we provide a function that will accept our current value as a parameter and return our new value that will be merged with additional values.
The second function is a merging function that takes a value and merges/combines it into the previously collected values.
The third function combines the merged values together. Basically this function takes the new values produced at the partition level and combines them until we end up with one singular value.

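1. The first function creates the combiner. In aggregateByKey the first argument was simply a zero value; in combineByKey it is a function that receives the current value as a parameter and returns a new value, which later values are then merged into.

2. The second function is the merge function: it takes a value and merges it into the previously collected values.

3. The third function combines the merged values. It takes the new values produced at the partition level and combines them until a single value remains.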
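As an illustration of the three functions, here is a small sketch for a per-key average, where the combined type, a (sum, count) pair, differs from the input type, a plain number. The function names and the scores are assumptions chosen for this example. The functions are applied by hand to one key split over two partitions, so nothing here needs Spark; in PySpark the same three functions would be passed to `rdd.combineByKey(create_combiner, merge_value, merge_combiners)`.

```python
# Three functions for a per-key average (names are illustrative).

def create_combiner(value):
    # First value seen for a key in a partition: start a (sum, count) pair.
    return (value, 1)

def merge_value(acc, value):
    # Fold one more value into the partition-local accumulator.
    return (acc[0] + value, acc[1] + 1)

def merge_combiners(acc1, acc2):
    # Merge accumulators that were built in different partitions.
    return (acc1[0] + acc2[0], acc1[1] + acc2[1])

# Hypothetical scores for a single key, split across two partitions.
part1_scores = [90, 80]
part2_scores = [70]

# Partition 1: create the combiner from the first value, then merge the rest.
acc1 = create_combiner(part1_scores[0])
for v in part1_scores[1:]:
    acc1 = merge_value(acc1, v)

# Partition 2: the same steps, independently of partition 1.
acc2 = create_combiner(part2_scores[0])
for v in part2_scores[1:]:
    acc2 = merge_value(acc2, v)

# Final step: combine the two partition-level accumulators.
total, count = merge_combiners(acc1, acc2)
average = total / count
print(acc1, acc2, average)
```

Keeping the count next to the sum is what makes the final merge correct: averaging the two partition averages directly would weight the partitions equally regardless of how many values each one held.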

In other words, to understand combineByKey, it’s useful to think of how it handles each element it processes. As combineByKey goes through the elements in a partition, each element either has a key it hasn’t seen before or has the same key as a previous element.

If the key is new, combineByKey uses a function we provide, called createCombiner(), to create the initial value for the accumulator on that key. It's important to note that this happens the first time a key is found in each partition, rather than only the first time the key is found in the RDD.

If it is a key we have seen before while processing that partition, it will instead use the provided function, mergeValue(), with the current value for the accumulator for that key and the new value.

Since each partition is processed independently, we can have multiple accumulators for the same key. When we are merging the results from each partition, if two or more partitions have an accumulator for the same key we merge the accumulators using the user-supplied mergeCombiners() function.
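The per-element behaviour described above can be sketched as a single-process toy model. This is not Spark's implementation: `combine_by_key` and the `calls` counter are hypothetical names for this example, and the partitions are plain lists. The counter records how often each user-supplied function runs, which makes the "once per key per partition" point visible.

```python
def combine_by_key(partitions, create_combiner, merge_value, merge_combiners):
    """Toy model of the per-element logic; not Spark's real implementation."""
    per_partition = []
    for part in partitions:
        accumulators = {}  # one accumulator per key, local to this partition
        for key, value in part:
            if key not in accumulators:
                # First time this key is seen in THIS partition.
                accumulators[key] = create_combiner(value)
            else:
                accumulators[key] = merge_value(accumulators[key], value)
        per_partition.append(accumulators)

    # Merge the partition-level accumulators key by key.
    result = {}
    for accumulators in per_partition:
        for key, acc in accumulators.items():
            if key in result:
                result[key] = merge_combiners(result[key], acc)
            else:
                result[key] = acc
    return result

# Count how often each user-supplied function is called.
calls = {"create": 0, "merge_value": 0, "merge_combiners": 0}

def create_combiner(value):
    calls["create"] += 1
    return [value]

def merge_value(acc, value):
    calls["merge_value"] += 1
    return acc + [value]

def merge_combiners(a, b):
    calls["merge_combiners"] += 1
    return a + b

# Hypothetical data: key "x" appears in both partitions.
partitions = [
    [("x", 1), ("y", 2), ("x", 3)],
    [("x", 4), ("z", 5)],
]

result = combine_by_key(partitions, create_combiner, merge_value, merge_combiners)
print(result)
print(calls)
```

There are three distinct keys but four createCombiner calls, because "x" gets its own accumulator in each partition; those two accumulators are the only ones that need mergeCombiners. The list-valued combiner is used here only so the result shows which values ended up together.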
