I’m wrapping up a bit of the work we’ve been doing on data movement optimizations for cloud computing and the latest set of data yielded some interesting points I thought I’d share. The work done here is not really rocket science but may, in some ways, be slightly counter-intuitive and therefore seemed worthy of posting.
Summary: for those who don’t like to read detailed posts or don’t have time, the synopsis is this: if you are uploading data to Azure, split your data into blocks (even down to 1MB) and upload the blocks in parallel. Set your block size based on your source file size, but if you must choose a fixed value, use 1MB. Doing so yields significant performance gains: transfers in the range of 10x-24x faster and a reduction in overall file transfer time of upwards of 90% (e.g., uploading a 1GB file averaged 46.37 minutes prior to optimizations and 1.86 minutes afterwards).
Detail: For those of you who want more detail, or think that the claims at the end of the preceding paragraph are over-reaching, what follows is information and code supporting these claims. As the title would indicate, these tests were run from our research facility pointing to the Azure cloud (specifically US North Central, as it is physically closest to us) and do not represent intra-cloud results. We have performed intra-cloud tests, and the overall results are similar in notion, but the data rates are significantly different, as are the tipping points for the various block sizes (this will be detailed separately).
We started by building a very simple console application that would loop through a directory and upload each file to Azure storage. This application used the shipping storage client library from the 1.1 version of the Azure tools. The only real variation from the client library is that we added code to collect and record the duration (in ms) and size (in bytes) for each file transferred. The code is available here.
We then created a directory that had a collection of files for the following sizes: 2KB, 32KB, 64KB, 128KB, 512KB, 1MB, 5MB, 10MB, 25MB, 50MB, 100MB, 250MB, 500MB, 750MB, and 1GB (50 files for each size listed). These files contained randomly-generated binary data and do not benefit from compression (a separate discussion topic). Our file generation tool is available here.
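As a rough illustration of the file set described above, the following Python sketch writes files of random (and therefore effectively incompressible) bytes. It is not our actual generation tool; the function names, the 1024-based size units, and the file naming scheme are assumptions made for the example.

```python
import os
import tempfile
from pathlib import Path

KB = 1024          # assumption: binary units (1KB = 1024 bytes)
MB = 1024 * KB


def write_random_file(path, size_bytes, chunk_bytes=MB):
    """Write size_bytes of random data to path, one chunk at a time."""
    remaining = size_bytes
    with open(path, "wb") as handle:
        while remaining > 0:
            count = min(chunk_bytes, remaining)
            handle.write(os.urandom(count))
            remaining -= count


def generate_test_files(directory, sizes, copies):
    """Create `copies` random files for each labelled size; return the paths."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    paths = []
    for label, size_bytes in sizes.items():
        for index in range(copies):
            path = directory / f"{label}_{index:02d}.bin"  # hypothetical naming
            write_random_file(path, size_bytes)
            paths.append(path)
    return paths


if __name__ == "__main__":
    # Small demo only; the full set in the post ran from 2KB to 1GB,
    # with 50 files per size.
    demo_sizes = {"2KB": 2 * KB, "32KB": 32 * KB, "1MB": 1 * MB}
    with tempfile.TemporaryDirectory() as tmp:
        for path in generate_test_files(tmp, demo_sizes, copies=2):
            print(path.name, path.stat().st_size)
```

Writing in chunks keeps memory use flat, which matters once the file sizes reach the hundreds of megabytes.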
The baseline was established by running the application described above against the directory containing all of the data files. This application uploads the files in a random order so as to avoid transferring all of the files of a given size sequentially, and thereby spreads the effects of periodic Internet delays across the collection of results. We then ran some scripts to split the resulting data and generate some reports. The raw data collected for our non-optimized tests is available via the links in the Related Resources section at the bottom of this post.
For each file size, we calculated the average upload time (and standard deviation) and the average transfer rate (and standard deviation). As you likely are aware, transferring data across the Internet is susceptible to many transient delays which can cause anomalies in the resulting data. It is for this reason that we randomized the order of source file processing as well as executed the tests 50x for each file size. We expect that these steps will yield a sufficiently balanced set of results.
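The per-size calculation described above can be sketched in a few lines of Python. This is not the script we actually ran; the record layout (size in bytes, duration in ms, matching what the test harness records) and the function name are assumptions, and the sample records are made up purely to show the shape of the output.

```python
from collections import defaultdict
from statistics import mean, stdev


def summarize(records):
    """Group (size_bytes, duration_ms) records by size and average them."""
    seconds_by_size = defaultdict(list)
    for size_bytes, duration_ms in records:
        seconds_by_size[size_bytes].append(duration_ms / 1000.0)

    summary = {}
    for size_bytes, seconds in sorted(seconds_by_size.items()):
        rates = [size_bytes / s for s in seconds]  # bytes per second
        summary[size_bytes] = {
            "mean_seconds": mean(seconds),
            # Sample standard deviation needs at least two data points.
            "stdev_seconds": stdev(seconds) if len(seconds) > 1 else 0.0,
            "mean_rate": mean(rates),
            "stdev_rate": stdev(rates) if len(rates) > 1 else 0.0,
        }
    return summary


if __name__ == "__main__":
    # Made-up sample records, not measurements from our tests.
    sample = [(1_000_000, 2000), (1_000_000, 4000), (5_000_000, 10000)]
    for size, stats in summarize(sample).items():
        print(size, stats)
```

Note that the transfer rate is averaged per upload rather than computed from the average duration; the two give different numbers when the durations vary widely.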
Once the baseline was collected and analyzed, we updated the test harness application with some methods to split the source file into user-defined block sizes and then to upload those blocks in parallel (using the PutBlock() method of Azure storage). The parallelization was handled by simply relying on the Parallel Extensions to .NET to provide a Parallel.For loop (see linked source for specific implementation details in Program.cs, line 173 and following… less than 100 lines total). Once all of the blocks were uploaded, we called PutBlockList() to assemble/commit the file in Azure storage. For each block transferred, the MD5 was calculated and sent, ensuring that the bits that arrived matched what was intended. The timer for the blocked/parallelized transfer method wraps the entire process (source file splitting, block transfer, MD5 validation, file committal). A diagram of the process is as follows:
We then tested the effects of blocking & parallelizing the transfers by running the updated application against the same source set and did a parameter sweep on the block size including 256KB, 512KB, 1MB, 2MB, and 4MB (our assumption was that anything lower than 256KB wasn’t worth the trouble and 4MB is the maximum size of a block supported by Azure). The raw data for the parallel tests is available via the links in the Related Resources section at the bottom of this post.
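For readers who want the shape of the split / parallel upload / commit flow without opening the linked source, here is a minimal Python sketch. It is not our C# implementation and does not call Azure: InMemoryBlobStore is a hypothetical stand-in whose put_block and put_block_list methods only mimic the roles of PutBlock() and PutBlockList(), and a thread pool stands in for the Parallel.For loop.

```python
import base64
import hashlib
import threading
from concurrent.futures import ThreadPoolExecutor


class InMemoryBlobStore:
    """Hypothetical stand-in for block blob storage (not the Azure API)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.staged = {}      # (blob name, block id) -> bytes
        self.committed = {}   # blob name -> assembled bytes

    def put_block(self, name, block_id, data, md5_b64):
        # Reject the block if it does not match the MD5 the sender computed.
        actual = base64.b64encode(hashlib.md5(data).digest()).decode("ascii")
        if actual != md5_b64:
            raise ValueError(f"MD5 mismatch for block {block_id}")
        with self._lock:
            self.staged[(name, block_id)] = data

    def put_block_list(self, name, block_ids):
        # Assemble the staged blocks in the order given and commit them.
        with self._lock:
            parts = [self.staged.pop((name, b)) for b in block_ids]
            self.committed[name] = b"".join(parts)


def split_blocks(data, block_size):
    """Cut data into consecutive blocks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def upload_blocked(store, name, data, block_size, workers=8):
    """Upload data as parallel blocks, then commit the ordered block list."""
    blocks = split_blocks(data, block_size)
    block_ids = [f"{index:08d}" for index in range(len(blocks))]

    def send(item):
        block_id, block = item
        md5_b64 = base64.b64encode(hashlib.md5(block).digest()).decode("ascii")
        store.put_block(name, block_id, block, md5_b64)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(send, zip(block_ids, blocks)))  # re-raises any failure
    store.put_block_list(name, block_ids)
    return len(blocks)
```

Blocks may arrive in any order; the committed blob is correct because the block list, not the arrival order, determines how the pieces are assembled. The worker count of 8 is an arbitrary default for the sketch, not a value from our tests.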
This data was processed and then compared against the single-threaded / non-optimized transfer numbers and the results were encouraging. The Excel version of the results is available here.
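To make the comparison concrete, the arithmetic behind the 1GB figures quoted in the summary (46.37 minutes before, 1.86 minutes after) works out as follows. The function name is just for illustration.

```python
def improvement(baseline_minutes, optimized_minutes):
    """Return (speedup factor, percent reduction in transfer time)."""
    factor = baseline_minutes / optimized_minutes
    reduction_pct = (1 - optimized_minutes / baseline_minutes) * 100
    return factor, reduction_pct


factor, reduction_pct = improvement(46.37, 1.86)
print(f"{factor:.1f}x faster, {reduction_pct:.1f}% less time")
# prints "24.9x faster, 96.0% less time"
```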
Two semi-obvious points need to be made prior to reviewing the data. The first is that if the block size is larger than the source file size, you will end up with a “negative optimization” due to the overhead of attempting to block and parallelize. The second is that as the files get smaller, the clock-time cost of blocking and parallelizing (overhead) becomes more apparent and can tend towards negative optimizations. For this reason (which is supported by the raw data provided in the linked worksheet), the charts and discussion below ignore source file sizes less than 1MB.
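The two points above suggest a simple guard in application code: work out how many blocks a file would produce and skip the blocking path when there is nothing to parallelize. The sketch below is one way to express that; the rule "only block when the file is larger than one block" and the function names are assumptions for illustration, not something our test harness enforces.

```python
MB = 1024 * 1024

# Block sizes covered by the parameter sweep in the post.
SWEEP_BLOCK_SIZES = [256 * 1024, 512 * 1024, 1 * MB, 2 * MB, 4 * MB]


def block_count(file_size, block_size):
    """Number of blocks needed for file_size bytes (integer ceiling)."""
    return -(-file_size // block_size)


def should_block(file_size, block_size=1 * MB):
    """Assumed rule of thumb: block only files larger than one block."""
    return file_size > block_size


if __name__ == "__main__":
    for file_size in (512 * 1024, 1 * MB, 100 * MB, 1024 * MB):
        counts = {size: block_count(file_size, size) for size in SWEEP_BLOCK_SIZES}
        print(file_size, should_block(file_size), counts)
```

With the 1MB default, a 1GB file becomes 1024 blocks, while a 512KB file is left as a single plain upload.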
The chart above illustrates some interesting points about the results:
The above is another view of the same data as the prior chart, just with the axes changed (the x-axis represents file size and the plotted data shows improvement by block size). It again highlights the fact that the 1MB block size is probably the best overall size, but it also shows the benefits of some of the other block sizes at different source file sizes.
This last chart shows the change in total duration of the file uploads based on different block sizes for the source file sizes. Nothing really new here, other than that this view of the data highlights the negative effects of poorly choosing a block size for smaller files.
Summary
What we have found so far is that blocking your file uploads and uploading them in parallel results in significant performance improvements. Further, utilizing extension methods and the Task Parallel Library (.NET 4.0) makes short work of altering the shipping client library to provide this functionality while minimizing the amount of change to existing applications that might be using the client library for other interactions.
Related Resources
Today marks my 10-year anniversary at eQuest/Planet Technologies. It almost seems odd to write that… 10 years seems like a long time, and particularly to be at the same company in the Internet/Technology field.
During my tenure at Planet, I’ve had the privilege to work and cross paths with a number of great people – too many to name specifically but I do want to mention Scott Tucker, Steve Winter, and Dan Nelson. These men are passionate about their work and a pleasure to work with. I’ve also had the opportunity to hold a number of different roles ranging from “server build guy” to dev team lead, to world traveler, and now get to play around as a research scientist. The work has rarely (read: “never”) been boring and it seems like nearly every day presents another opportunity to solve a difficult challenge.
Today doesn’t represent any sort of change, or adaptation in my career path (I’ve never been particularly good at/fond of career planning). I simply am grateful for the opportunities I have had and am looking forward to what the coming years hold.
I’ve been reading a paper this morning published by Microsoft Research on Quality of Service Aware Clouds. If you are engaged in the cloud computing field, I would suggest that it is worth the time to read (14 pages) if for no other reason than to get your mind rolling (as it did mine) on the topic. Further, I’d be keenly interested in follow-on conversations from the community as to the issues/remedies put forth in that paper.
I’m finding myself split on the topic… academically, there are some interesting points being made:
However, I find myself struggling with a few things:
I think that, in the end, I’m more in favor of simply having more intelligent hypervisors that provide better isolation for VMs, but I’m still thinking this all through. There are some interesting points made in this paper, and intelligent allocations could be interesting…
I am in Redmond this week and am participating in two workshops being hosted by different groups within Microsoft Research. Along with a handful of others, I was asked to participate in a panel discussion on Friday dealing with new experiences that cloud computing would facilitate, as well as things we felt were road blocks to seeing those experiences realized. We were specifically challenged to think "outside the box" and to look beyond (the now typical) conversations surrounding raw performance and to dream a little. I wrote out the following as a means of working through my thoughts for my 5-7 minute portion of the panel discussion and, as it took me longer than 7 minutes to read, I thought I’d post it here as an expansion of the talk and possibly an anchor on which to hang subsequent conversations. Please forgive the casual nature of the talk as it is intended to be, essentially, a script delivered to a group rather than a formal written version of the same.
—
This topic is certainly interesting to me as I am convinced that cloud computing is here to stay and also presents a platform that can be disruptive to the scientific/technical computing industry (although I would qualify this by saying “disruptive in a constructive sense” – meaning that the disruption leads to the additive good and not the removal of existing work). I have spent a considerable amount of time over the past week contemplating this question (how do we imagine cloud computing facilitating new usage scenarios), and have chosen to present my reply by means of a few examples.
The first example is that of Lego MindStorms. Are you familiar with these? They are kits that provide kids (regardless of how old they are) the ability to build robots using a familiar (although slightly altered) Lego metaphor. These kits come with motors, sensors, and a "brain" that is programmable via a drag-and-drop software tool but also supports more complex tools such as Microsoft’s Robotics Studio. Do you know what is so great about these (besides the obvious)? They allow common people, with no prior robotics or electronics experience, to dabble in the field. It is a gateway, if you will, to a much broader field.
The second example is more of an experience that happened to me recently in that I had the privilege of running into my high school science teacher this past weekend – a quiet, rather unassuming fellow named Randy White. Randy’s brilliance is that he has a passion for science and did (at least in my case) an excellent job of transference. If I am ever able to accomplish anything interesting in the scientific domain, a large portion of the credit will lie with him. Probably the most important thing he taught us was how to think about, or to tackle, the complex. I can’t tell you how many times I heard him say, "Start with what you know". The idea being that, most often, incredibly complex problems are composed of nothing more than a series of far simpler, additive problems. He taught us to focus on solving what we could, rather than attempting to "swallow the entire elephant", if you’ll allow me to strain a metaphor.
If you find yourself wondering what these two examples have to do with each other, or more germanely, what do they have to do with my vision for the scenarios that cloud computing will open, let me see if I can explain…
You see, much in the same manner as Lego MindStorms have introduced an otherwise unlikely audience to the world of robotics, I believe that cloud computing (based on its cost model and popular programming paradigms) is a means of introducing normal people (and by this, I mean those not formally trained in scientific or technical computing) to the notion of using computation as a tool for solving complex problems. Possibly to the dismay of some in the field, I think that this will, at least initially, be done in a manner devoid of the topics of MPI or Fortran, much in the same way as a 15 year old "programming" his robot doesn’t have to understand the inner workings of concurrency runtimes nor the physics at work when his robot "walks" for the first time. I will be the first to admit that these (MPI, Fortran, concurrency topics, race conditions) are important topics, but I would submit that they should not be gating factors to one’s ability to explore the arena and determine if he/she is interested in further study in that field. I think we will see paradigms that are far simpler to adopt, such as master-worker, map/reduce, etc. (or even cloud-backed applications that are hidden behind more accessible tools such as Excel, or MatLab) take hold in significant ways and that we will see the development of novel approaches to solving problems using this new platform. The tried-and-true tools will remain, and will be used when necessary and appropriate, but I think if we force them down the throats of the next generation of researchers as "the only way to accomplish science", we are doing them a great disservice.
As to where Mr. White and high-school science come into play – well, this can best be summarized by a comment made by a friend of mine, Wally McClure, when he, almost flippantly, referred to Windows Azure as a "poor man’s supercomputer". Having worked with Azure for quite a bit at the time, I took a little offense at the label due to its semi-pejorative nature, and prefer "common man’s", but the point is the same regardless: cloud computing (at least as currently manifested in both Windows Azure and Amazon’s AWS platform) has great potential to democratize high-performance computing. You see, the high school I grew up in was small… we had 23 in my graduating class. While Randy has moved on, he still teaches in a comparatively small school that certainly has no funds for a cluster on which to run experiments. However, with the advances in cloud computing, Randy could devise a collection of simple experiments and actually execute them as part of a class project. He could have a significant computational cluster for the equivalent of a few dollars. He can present "Scientific" computing as something attainable to his students, and hopefully foster an interest that will develop into the next generation of computational thinkers – solving one problem at a time, incrementally, on the way to solving massive problems that we have trouble even describing today.
It is, in my opinion, incumbent upon us – the current generation of computational researchers and domain-specific scientists – to look at cloud computing not as a threat to the establishment, but as facilitating a new means of scientific discovery. We should consider ways to make large-scale computation more accessible to "normal" people. We should be opening up the community, sharing wherever possible, reducing the barriers to entry. Challenge yourselves and your students to push boundaries, to consider non-traditional approaches, and to enjoy "playing" with computational resources.