First glance at distributed computing

A recent post on the computer science blog about distributed database management systems fired up my curiosity about distributed computing again.

Even though I attended a parallel computing course while I was still a student, I had always thought cloud computing worked by some sort of magic :P. I learned how to distribute parts of a problem, but I didn't understand how a single infrastructure (the cloud) could solve the many problems thrown at it, balancing between them without letting them interfere with each other, and all of that with no special modification to the hardware.

What also contributed to my confusion is perhaps the name of the course, "parallel", which registers in my brain as a pair of things going on together. In truth, parallel means asynchronous: many things can get done at the same time.

Well, there's no magic after all; it's just our common perception of programming, bent by habit. After reading about MapReduce and watching a lecture from Google Code University, I finally got it.

There are no data types, there's just data!

Similar to what I learned in class, distributed computing is about "divide and conquer". You have to acquire the skill of quantizing complex problems, breaking them down into simpler problems that the two core functions can work with:

  • Map (to process and break something into smaller things). At the core of map there's some sort of halting condition, executed when the problem can't be broken down any further.

Map processes green stuff and creates red stuff (illustration courtesy of the University of Washington)

  • Reduce (to combine multiple small things into more usable things).

"Reduce" combines red stuff and green stuff to produce some result
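To make the two functions concrete, here's a minimal sketch in Python, counting words across documents. The function names and the word-count task are my own illustration, not taken from the lecture:

```python
from collections import defaultdict

def map_phase(document):
    """Break one document into smaller pieces: (word, 1) pairs."""
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    """Combine the small pieces into a more usable result: totals per word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

documents = ["the cloud", "the grid and the cloud"]
pairs = []
for doc in documents:
    # each map call is independent, so documents could be
    # processed on different machines without interfering
    pairs.extend(map_phase(doc))

print(reduce_phase(pairs))  # {'the': 3, 'cloud': 2, 'grid': 1, 'and': 1}
```

The point is that map runs on each piece of input with no shared state, which is exactly what lets the work be spread across machines; reduce is the only step that has to see everything.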

But these two functions alone are not enough. How would different types of data be treated when there's no concept of type? The answer: you write functions that take data and return data, but process it in a specific way.

Take the example from the Google lecture above: to process the characters in a string, a bunch of data that represents a string (but is not a string, since we have no type system here) is passed into a function called explode, which returns data that represents characters. And if the processing is to return a string in the end, a function called implode is needed. This function does the reverse: it takes data that represents characters and returns data that represents strings.
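Here's how I picture explode and implode in Python; the exact signatures are my assumption, since the lecture's code wasn't in Python. Everything is "just data" (here, plain lists and strings), and only the functions give it a specific interpretation:

```python
def explode(string_data):
    """Take data representing a string, return data representing characters."""
    return [ch for ch in string_data]

def implode(char_data):
    """The reverse: take data representing characters, return a string."""
    return "".join(char_data)

chars = explode("cloud")         # ['c', 'l', 'o', 'u', 'd']
print(implode(reversed(chars)))  # a tiny pipeline: prints "duolc"
```

Between explode and implode you can insert any per-character map step, and the "string-ness" of the data only exists at the two ends of the pipeline.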

By solving the data typing and transitioning problem, this paradigm of distributed computing paved the way for cloud computing. A distributed computing system is built to solve a single, specific problem, with its steps well-defined when the system is constructed. Cloud computing, on the other hand, is simply a collection of the most primitive functions described above, leaving the decision of how to use them to the developer.

So the good news is, developers will still have plenty to do creating and maintaining cloud systems, and those systems should be simpler to maintain. The bad news is, our perception of data will need to renew itself to prepare for the future.