random musings and walks through code

06 Jan 11 Book Review: Host Your Web Site In The Cloud

Over the holiday break I spent some time getting ready for the cloud computing Precompiler at CodeMash, and as part of that effort I read Jeff Barr’s Host Your Web Site In The Cloud: Amazon Web Services Made Easy. This book is one of the few physical paper books I’ve gotten recently, and it is unique to me in that it is the only book I have that is signed by the author.

That aside, I’d like to recommend this book to anyone who is looking at Amazon Web Services, or would consider themselves a beginner with AWS. I found the writing style to be very easy to read and, while I’m not a PHP developer, the code samples and walkthroughs were clear and simple to follow.

AWS is a fast-moving target, and even though Jeff is on the team, I’m certain it was difficult to get a book to market that wasn’t outdated by the time it hit the shelves. I think he does a good job of addressing the basics and providing a foundation on which you can build your knowledge. He even slips in a few notes on late-breaking updates (as of press time), such as EC2 instances being bootable from EBS.

In my mind, this book is similar to the Windows Azure Training Kit in that it gives you most everything you need to get your feet wet and get rolling with the technology, and it provides a framework by which you can add to your skills.

04 Jan 11 Speaking at the CodeMash Precompiler

I’m thrilled to be speaking at the CodeMash Precompiler next week. I’m going to be joined by Mike Wood and helped by Brian Prince and Michael Collier. Together, we’ll have nearly 8 hours of instruction and hands-on labs covering both the Amazon and Microsoft cloud computing platforms. Below I’ve listed the abstracts for each of the sessions as well as the prerequisites for those planning on joining us. If you are going to be in Sandusky next Wednesday, be sure to drop by.

An Introduction to Amazon Web Services (half-day, afternoon)

AWS has been in the cloud computing space longer than almost anyone, and they are the de facto standard when it comes to Infrastructure as a Service. While most developers are comfortable with the notion of virtual machines, reviewing the AWS offering can sometimes look like alphabet soup (EC2, S3, SNS, SDB, SQS). Join us to learn the power behind these acronyms and the tools that they can provide your next project. We’ll discuss the major components and some of the trade-offs between different implementation choices (e.g. boot from S3 vs. boot from EBS), and provide you with the opportunity to work through some labs, deploy some code, and begin to experience the Amazon cloud for yourself.

Examples are in .NET, but fundamental concepts apply to all platforms.

 

An Introduction to Windows Azure (half-day, morning)

Steve Ballmer has made it very clear that Microsoft is "all in" when it comes to the cloud and by now most have heard about Microsoft’s Windows Azure platform… but what does that mean for you? Whether you are an experienced .NET developer who is wondering what all this cloud stuff means for how you write code, or maybe you are a traditional *nix developer looking to understand how to integrate your existing code with the Microsoft version of the cloud, join us for an in-depth discussion on what Platform as a Service is, how Microsoft has implemented it, what scenarios it best addresses, and a collection of hands-on-labs to get you started.
Examples are in .NET, but fundamental concepts apply to all platforms.

 

Prerequisites

The sessions will be part presentation, part hands-on labs. While you aren’t required to bring a laptop, you’ll get much more out of the sessions if you have one available to work through the labs (though there might be some people willing to pair as well!). Please make sure to bring your power cord!

Here are the prerequisites to have loaded:

An Introduction to Windows Azure

· Operating Systems Supported: Windows 7 (Ultimate, Professional, and Enterprise Editions); Windows Server 2008; Windows Server 2008 R2; Windows Vista (Ultimate, Business, and Enterprise Editions) with either Service Pack 1 or Service Pack 2

· Microsoft Visual Studio 2010 (full version or the free trial).

· SQL Server 2005 Express Edition (or above) (this is usually installed with Visual Studio)

· Install the Windows Azure Tools for Microsoft Visual Studio (and some hotfixes)

· Install the AppFabric SDK

· Install the Windows Azure Platform Training Kit

An Introduction to Amazon Web Services

· Amazon AWS SDK for .NET

· Requires Microsoft .NET Framework 2.0 or later.

· Use the AWS SDK for .NET with any of the following Visual Studio editions:

o Microsoft Visual Studio 2008 Professional Edition or later

o Microsoft Visual C# 2008 Express Edition (free!)

o Microsoft Visual Web Developer 2008 Express Edition (free!)

You might be thinking, "Hey, wait a second! This is CodeMash, and you just listed all Microsoft tools there!" Just like CodeMash, both Windows Azure and Amazon AWS are happy to mix in multiple development stacks. Our labs and demos will be shown using Visual Studio, but don’t let that stop you from following along or trying out the cloud platforms from your Mac, or using Java, PHP and Ruby on Windows. Below are links to other SDKs for each cloud platform. Please feel free to explore your options and load these SDKs or libraries up if you prefer them.

For Windows Azure

· Windows Azure SDK For Java

o AppFabric: http://www.jdotnetservices.com/

· Windows Azure SDK for PHP

o AppFabric: http://dotnetservicesphp.codeplex.com/

o and tools http://azurephptools.codeplex.com/

o and Companion http://www.interoperabilitybridges.com/projects/windows-azure-companion

o Oh, and some love for Eclipse via a plug in: http://www.windowsazure4e.org/

· Windows Azure AppFabric SDK For Ruby

For Amazon AWS

· AWS Java Developer Center

· AWS PHP Developer Center

· AWS Python Developer Center

· AWS Ruby Developer Center

21 Dec 10 Planet Technologies Launches GovCloud

This is just a quick post to let those interested know that the company I work for – Planet Technologies – has launched a new practice focused on helping public sector/government agencies utilize cloud computing. Services range from high-level assessments to hands-on migration assistance, and most everything in between.

You can read the press release here: http://www.technology-digital.com/planet-technologies-launches-govcloud-new-cloud-practice-designed-specifically-assist-government-ag-0

And visit the website here: http://govcloud.com/

17 Nov 10 Does Amazon’s Cluster Compute Platform Still Represent Cloud Computing?

I’m sitting at the airport in New Orleans, after having attended the first half of the ACM/IEEE 2010 Supercomputing conference. This was the first time I have attended this conference, and it was certainly interesting to participate.

During the workshop I participated in on Sunday (Petascale Data Analytics on Clouds: Trends, Challenges, and Opportunities), a conversation arose regarding the Amazon EC2 “cluster compute instances” and their having reached a spot on the Top 500 list. What surprised me, however, was not that they were mentioned (I actually expected them to receive more attention than they did), but that they were described as not being “real” cloud computing. The point was made that they represented some sort of special configuration that was done just for the tests, and that the offering was somehow significantly different from what the general populace could acquire. The two primary individuals involved in the exchange have significant history in classic HPC and have at least a degree of “anti-cloud” bias, but I am responsible for helping influence the viewpoint of one of these folks, so I’ve been thinking a bit over the past few days about how to properly articulate the inaccuracies of the argument… and wondering if it really matters anyway.

Commodity Hardware – by this I mean that the platforms being utilized could be purchased/deployed by anyone… and, by “anyone”, I am thinking of a moderately skilled computer hobbyist. I’m referring, particularly, to the chip architectures, availability of the motherboards, etc. A quick glance at the specs for a given machine validates that anyone (with enough money) could easily assemble a similarly configured machine. It is simply a quad-core Intel box with 24 GB of RAM and roughly 2 TB of disk. One might argue that the newly announced Cluster GPU Instance is specialized hardware, but then again, anyone with an extra $2,700 to spare could add one of these to their machine. The point is that machines in this class are in the $5K range, not the $50K or $500K range.

Commodity Networking – now to some of you, 10 Gb non-blocking networks might seem specialized or exotic, but – at least in the HPC realm – they aren’t. Most serious HPC platforms utilize a network technology called InfiniBand (usually QDR) or something fancier and more expensive, such as an IBM custom interconnect or Cray’s Gemini. A quick search shows one could purchase 10GbE switches starting in the $2-3K range and going up from there, whereas IB QDR switches are at least double that.

Broad Availability – this point gets a little stickier. The point is that anyone can get access to CCI nodes at any point, simply by using a credit card and visiting the AWS website. However, getting access to 880 of them (the number used in the Top 500 run) is likely to be more difficult. The reason is not an unwillingness on Amazon’s part to provide this (I’m sure, given the proper commitment, this would not be impossible), but rather a question of economics and scale. Their more “general” nodes have a large demand and use case… the scale of demand for CCI nodes is yet to be established, although I’d imagine the sweet spot for these customers is in the 16-64 node range… folks who could really use a cluster some of the time, but certainly don’t need it all of the time. As such, I don’t imagine (and I have no inside knowledge of their supply or demand) that the demand is currently so large that, beyond the currently active nodes, they have ~1000 nodes of this instance type sitting around just waiting for you to request them (this will likely change as demand grows).

Inexpensive + Utility-style Pricing – This is one area where this instance type represents all of the goodness we have become accustomed to in the cloud computing world. These nodes (remember, I listed the hardware above as starting around $5K) are available at $1.60/hour ($2.10/hour for the GPU-enabled nodes). This makes a significant computing platform available to almost anyone. For just over $100/hour, you can have a reasonably well-powered 64-node cluster on which to run your experiments… that is disruptive in my opinion. The best part is that this price is the worst-case scenario: it is the price with no special arrangement, reservation, or volume discount. It represents no long-term commitment… nothing beyond a commitment for the current hour.
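As a quick check of that arithmetic, here is a minimal sketch using only the on-demand rates quoted above. The function name and the 8-hour example are made up for illustration.

```python
# On-demand hourly rates quoted in this post (USD, November 2010).
CCI_RATE = 1.60   # Cluster Compute Instance
GPU_RATE = 2.10   # Cluster GPU Instance

def cluster_cost(nodes, hours, rate=CCI_RATE):
    """Cost in USD of running `nodes` instances for `hours` at `rate` per node-hour."""
    return round(nodes * hours * rate, 2)

print(cluster_cost(64, 1))            # 102.4 -> "just over $100/hour"
print(cluster_cost(64, 8))            # 819.2 for an 8-hour working day
print(cluster_cost(64, 1, GPU_RATE))  # 134.4 per hour for GPU-enabled nodes
```

Compare the 8-hour figure with the roughly $5K purchase price of a single comparable machine and the "some of the time" use case starts to look very attractive.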

So… what is different? – I have spent the majority of this post explaining how I think that these instance types are similar in many ways to other IaaS offerings and thereby deserve categorization as “regular” cloud computing, but that raises the question: what is unique about these nodes that would cause Amazon to promote them as better for HPC workloads? What facts formed the foundation for these rather experienced HPC experts to classify them as different? In my mind, there are really only two or three things here. The first is the networking. Rather than being connected to a shared 1GbE network, you are given access to a 10GbE network and guaranteed full bisection bandwidth node-to-node. It is this fact alone that makes the platform so interesting to the HPC folks, as it makes it actually viable for network-heavy applications (think traditional MPI apps). Secondly, you have clear visibility to the hardware. Amazon tells you exactly what type of processors you are running on, allowing you to optimize your codes for that particular processor (somewhat common in the HPC realm). Tightly coupled with this fact is that you can’t get a “part” of this instance type. You get the entire node (less the hypervisor) and, as such, are not contending with any other customers for node-local resources (RAM, ephemeral disks, network, etc.). Finally, the fact that you can get nodes that have specialized hardware (NVIDIA GPUs) is unique… there are very few cloud providers currently offering this sort of feature set.

In the end, I think the Amazon offerings are very much representative of the “cloud” and, particularly, of where the cloud is going. I think we will continue to see a broad level of homogeneity (basic hardware abstractions) with comparatively small pockets of broad-domain specific assets. The key being that for a large number of researchers, the offerings announced by Amazon this summer (and additionally this week) make the decision as to whether or not to buy that new departmental cluster much more difficult – especially when a true TCO analysis is performed. These are similar to the arguments and justifications for “normal” cloud compute scenarios and as such, should be considered one and the same.

24 Sep 10 Maximizing Throughput in Windows Azure – Part 3

This is the third in a series of posts I’m writing while working on a paper dealing with the issue of maximizing data throughput when interacting with the Windows Azure compute cloud. You can read the first post here and the second here. This post assumes you’ve read the other two, so if you haven’t now might be a good time to at least peruse them.

Summary: Based on the work performed and detailed in the first two parts of this series, we scaled the load tests horizontally to 20 concurrent nodes to ensure that the performance characteristics of the storage platform were not overly degraded. We found that while we were able to move a significant amount of data in a relatively short period of time (500 GB in around 10 minutes – roughly 6.9 Gbps), we experienced less-than-linear scale-up from what a single node could transfer (up to 39% attenuation in our tests).
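For anyone who wants to sanity-check the headline number, the sketch below works through the unit conversion. It assumes decimal units (1 GB = 1000 MB) and the 500 MB test file described in Part 1; the helper names are made up for illustration.

```python
def throughput_gbps(gigabytes, seconds):
    """Average throughput in gigabits per second (decimal units assumed)."""
    return gigabytes * 8 / seconds

def seconds_for(gigabytes, gbps):
    """Time needed to move `gigabytes` at a sustained rate of `gbps`."""
    return gigabytes * 8 / gbps

total_gb = 20 * 50 * 500 / 1000   # 20 nodes x 50 files x 500 MB each
print(total_gb)                                   # 500.0
print(round(throughput_gbps(total_gb, 600), 2))   # 6.67 at exactly 10 minutes
print(round(seconds_for(total_gb, 6.9) / 60, 1))  # 9.7 minutes at 6.9 Gbps
```

In other words, 6.9 Gbps corresponds to a run that finished a little under the 10-minute mark, which is consistent with "around 10 minutes".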

Detail: After the work done in the first two stages, I decided to see what effect horizontal scaling would have on the realized throughput. To test this, I took the test harness (code links below), set it for the “optimal” approach for both upload and download as determined by the prior runs (sub-file chunked & parallelized uploads combined with whole-file parallelized downloads), deployed it to 20 nodes, and did a parameter sweep on node size. I tested a few different methods of starting all of the nodes simultaneously (including Wade Wegner’s “Release the Hounds” approach) but settled on Steve Marx’s “pseudo code” (Wade’s words – not mine) approach, as I had issues with getting multicasting on the service bus to scale using the on-demand payment model. This provided a slightly crude start time (triggers were not *exactly* timed together), but it was more-or-less concurrent.

You can see from the following chart that my overall performance wasn’t too bad – depending on the node type, we saw upwards of 6 Gbps download speed and around 2 Gbps upload. Also consistent with our prior tests, we saw a direct relationship between node size and realized throughput rates.

s3chart1

While the chart above is interesting, the real question is whether or not the effective throughput was linear based on node count. The following charts compare the average results from three runs per node size of 20 concurrent nodes to the average numbers from the prior tests by node size multiplied by 20 (perfect scale). What we see is that uploads demonstrate an attenuation of between 25% and 45% while downloads taper between 18% and 39%.

Note: the XL size uploads actually demonstrated a better-than-linear scale (around 101% of linear) which is attributed to a generally good result for the three test sets for this experiment and a comparatively poor result set (likely network congestion) for the XL nodes in the prior tests. The results in this test are from an average of three runs (each run consisting of 20 nodes transferring 50 files each) – performing more runs would likely render a higher accuracy of trend data. 
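The comparison against "perfect scale" boils down to one formula. Here is a minimal sketch of it; the throughput figures in the example are hypothetical and are not taken from the test data.

```python
def attenuation(measured_aggregate, single_node_rate, nodes=20):
    """Fraction of ideal (linear) throughput lost when scaling out.

    A negative result means better-than-linear scaling.
    """
    ideal = single_node_rate * nodes   # "perfect scale"
    return 1 - measured_aggregate / ideal

# Hypothetical numbers: one node alone sustains 400 Mbps;
# 20 nodes running together sustain 6000 Mbps in aggregate.
loss = attenuation(6000, 400)
print(f"{loss:.0%}")   # 25%
```

A case like the XL uploads, at around 101% of linear, would show up here as a small negative attenuation.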

s3chart2

s3chart3

Looking at the data triggered some follow-on questions, such as what the attenuation curve would look like (at what node count do we stop scaling linearly?) or what the individual transfer times per file look like. This prompted me to dig a bit further into the collected data and generate some additional charts. I’ve not displayed them all here (the entire collection is available in the related resources section below); however, I’ve selected a few that are illustrative of my subsequent line of questioning.

For the downloads, most of the charts looked pretty good and we saw a distribution similar to the following two charts. As you can tell, the transfers are of similar length and the histogram shows a fairly tight distribution curve.

s3chart_L_3_down

s3chart_L_3_down_histogram

Uploads, on the other hand, were all over the board. The following two charts are representative of some of the data inconsistencies we found. What is interesting to note here is that there are three legs that are significantly longer (visually double) than the mean of the remainder. This would cause one to wonder if the storage platform was getting pounded, effectively placing those three nodes on hold until the pressure abated. You can see from the associated histogram that the distribution was much broader, representing less consistency in transfer rates.

s3chart_L_3_up

s3chart_L_3_up_histogram

The previous charts got me wondering further, so I wrote some code to generate timeline charts of the transfers for each node within a size/run collection (one chart for each of the horizontal bars in the chart above). Immediately obvious in the charts below is a bug in my data collection: my log data for the individual files was tracking the total seconds elapsed, but the end time was being recorded in minutes. This results in the oddities (right alignment) in the bar display below and will be fixed in future runs/posts. Ignoring the bug for a minute, the first chart is similar to what you would expect… parallelized transfers that overlap some and stair-step over the elapsed time.

s3chart_L_3_up_RD00155D323E60

The following two charts, however, represent situations different than you would expect and illustrate what would appear to be problems in the network/transfer/nodes/my code (something). In both scenarios there are large blocks of time with apparently nothing happening, as well as individual files that apparently took significantly longer than the rest to transfer. In the next set of tests (and follow on blog post) I’ll be digging into this issue and looking to understand exactly what is happening and, hopefully, be able to explain a little bit of why.

s3chart_L_3_up_RD00155D3243E2

s3chart_L_3_up_RD00155D32444D

 

Related Resources:

 

Research sponsored by the Laboratory Directed Research and Development Program of Oak Ridge National Laboratory, managed by UT-Battelle, LLC, for the U. S. Department of Energy.

13 Sep 10 Maximizing Throughput in Windows Azure – Part 1

[NOTE: Updated 9/23/2010. See the bottom of this post for an explanation of the changes]

I’m working on writing a paper dealing with the issue of maximizing data throughput when interacting with the Windows Azure compute cloud and am drafting some of that work as a couple of blog posts to help me work through my thoughts. I’m still working through some test scenarios and will have more to post later, but I wanted to get this out while it was still fresh.

I’ve posted before that utilizing parallelized file transfers is a great way to increase your overall throughput when externally interacting with Windows Azure, and the unsaid but possibly inferred thought was that it worked well for internal-to-Azure data movements as well. At the time I wrote the initial post I had done some testing of this scenario and had mixed results. A couple of recent papers I’ve read got me thinking about the topic again, so I started testing further with a slightly different approach and a different take on the variables.

Summary: Within the context of the Azure datacenter (intra-datacenter transfers), sub-file parallelization is not always as beneficial as it is outside the datacenter (local to azure or azure to local). Further, the size of the VM host has a significant impact on the realized throughput.

Detail: The key point I pulled from a paper I was reading (I’m sorry, I don’t have the reference at this time) was that another researcher had been doing tests in the Amazon cloud and indicated they were seeing significant deltas in throughput based on the instance size/type they selected. Neither Microsoft nor Amazon lists bandwidth as a variable associated with instance types (with the possible exception of the Amazon Cluster Compute Instance, which boasts a 10 Gbps network), but it stands to reason that given a physical host of a fixed size, an increase in the number of virtual hosts on that box (smaller instances) will result in a decrease in available throughput per virtual host. The inverse (scenarios with larger instances) also follows. This got me thinking about Azure and whether or not the same would hold true, and, if so, how that would impact our recommended approach of splitting your files, transferring them in parallel, and then reassembling them on the other side.
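The "stands to reason" argument can be put in back-of-the-envelope form. The sketch below assumes the host NIC is divided evenly among the VMs on the box, which is a simplification; the 1 Gbps NIC and the instance counts are assumptions for illustration, not published figures from either vendor.

```python
def per_vm_share_mbps(host_nic_mbps, vms_on_host):
    """Upper bound on each VM's bandwidth if the host NIC is split evenly."""
    if vms_on_host < 1:
        raise ValueError("a host must run at least one VM")
    return host_nic_mbps / vms_on_host

HOST_NIC_MBPS = 1000   # assumed 1 Gbps host NIC, for illustration only

for label, vms in [("8 small instances", 8),
                   ("2 large instances", 2),
                   ("1 extra-large instance", 1)]:
    print(f"{label}: up to {per_vm_share_mbps(HOST_NIC_MBPS, vms):.0f} Mbps each")
```

Real hosts will not divide bandwidth this neatly, but the direction of the effect is the same: fewer, larger instances per box should leave more headroom per instance.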

Approach: Rather than doing a parameter sweep on a number of file sizes, I selected a specific file size (500 MB) of randomly generated data and executed my tests with that. For each parameter set, I executed 3 runs of 50 transfers each (150 total per parameter set). I also tore down and re-published my platform between each run to increase my chances of being provisioned to different hardware nodes within the Azure datacenter and, theoretically, a different contention ratio with other instances on the same physical host. Also, I performed a run for all parameter sets before starting subsequent runs to decrease the likelihood that one parameter set would be inappropriately benefited (or harmed) by the time of day in which it was executed. In each test, a single worker role instance was run targeting a single storage account. There were no other applications or activities targeting that storage account during the test runs. All of these tests were performed in the Windows Azure US North Central region between August 27, 2010 and September 2, 2010.
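The run ordering described above (every parameter set completes run 1 before any set starts run 2) can be sketched as follows. This is only an illustration of the ordering, not the actual test harness, and it uses four VM sizes as stand-in parameter sets.

```python
from itertools import product

def schedule(parameter_sets, runs=3):
    """Order work so every parameter set finishes run N before run N+1 starts."""
    return [(run, p) for run, p in product(range(1, runs + 1), parameter_sets)]

# Illustrative parameter sets; the real sweeps also varied approach and direction.
sizes = ["Small", "Medium", "Large", "ExtraLarge"]
plan = schedule(sizes)

print(plan[:5])
# [(1, 'Small'), (1, 'Medium'), (1, 'Large'), (1, 'ExtraLarge'), (2, 'Small')]
print(len(plan) * 50)   # 600 transfers in total at 50 per run
```

Interleaving the runs this way spreads each parameter set across different times of day, so no single set is measured entirely during a quiet or busy period.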

Results: The first sweep was aimed at identifying the impact of VM size on transfer rate using the standard MS-provided storage client library (no modifications). What we found was that, for the most part, there was a clear relationship between the VM size and the realized throughput.

s1chart2

s1chart4

The second sweep had a similar objective as the first, with the only change being that rather than using the standard/single-threaded API calls, we used the parallelized version that we developed for our external-to-Azure tests. The results were similar to the above in that the node size showed (mostly) a consistent impact on the realized throughput (keep reading past the charts if you review the following and think I’m out of my mind).

s1chart1

s1chart3

If you are still with me, you are probably wondering why the numbers for the Parallel Upload by Node Size chart look so off from the assumed behavior… The fact of the matter is that similarly to the small node standard download tests, the third run for the small node parallel upload tests experienced a radically different performance (>75% better) than the prior two runs. This was so jolting to the numbers that I actually prepared another chart showing only the first two runs of this test to illustrate the difference that the last run made in the average results:
 s1chart8

As you can tell from the above, these results are much closer to what you might expect (based on the values from the other tests above). The key take-away at this point, and the reason I am belaboring this aberration, is that in an environment where you are not in complete control, the performance you obtain from shared services (networks, storage clusters, etc.) may vary widely in actual use.
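To see how much a single unusual run can move a three-run average, consider the sketch below. The throughput values are hypothetical, chosen only so that the third run is a bit more than 75% better than the first two; they are not the measured results.

```python
from statistics import mean

# Hypothetical per-run averages in Mbps; the third run is the outlier.
runs_mbps = [100.0, 104.0, 180.0]

all_three = mean(runs_mbps)
first_two = mean(runs_mbps[:2])

print(all_three)   # 128.0
print(first_two)   # 102.0
print(f"{(all_three - first_two) / first_two:.1%}")   # 25.5% shift from one run
```

With only three runs per parameter set, one aberrant run carries a third of the weight, which is why the chart changes so visibly when it is excluded.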

The real question of interest was how the two approaches (standard library vs. parallelized) compare, so one could select the best one for a given scenario. The first chart showed exactly what I expected: the parallelized version was significantly better than the standard approach for all node sizes, although the benefit waned as the node size increased.

s1chart5

The second chart initially caught me off guard as it illustrated that the work being done to block/download/reassemble in parallel was far less efficient than simply downloading the data.

s1chart6

My initial thought was that I was simply using an inefficient mechanism for reassembling the file, and that the parallelized transfer itself was still likely faster than the stock approach, but some additional instrumentation invalidated that thought. For the parallelized version, roughly 50% of the total time per file was spent reassembling it. However, even considering just the 50% spent in network transfer, it was roughly 50% longer than the stock approach (I’ll dig into that a bit more in later posts).
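One way to read those two percentages together is sketched below. It assumes that "50% longer" compares the parallel version's network-transfer time alone with the stock approach's total time; under that reading, the parallelized download takes about three times as long end to end. The function and parameter names are made up for illustration.

```python
def parallel_vs_stock(reassembly_fraction=0.5, transfer_ratio=1.5):
    """Total parallel time as a multiple of the stock download time.

    reassembly_fraction: share of the parallel total spent reassembling the file.
    transfer_ratio: parallel network-transfer time / stock total time.
    """
    transfer_fraction = 1 - reassembly_fraction
    return transfer_ratio / transfer_fraction

print(parallel_vs_stock())   # 3.0
```

Even if reassembly were free, the transfer portion alone would still lose to the stock approach under this reading, so a faster reassembly step would not change the conclusion.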

Therefore, from the data and tests we’ve run so far, using a blocked or chunked approach and parallelized transfers works well for external-to-Azure uploads and downloads as well as uploads (compute to blob storage) for internal-to-Azure movements. Internal-to-Azure downloads (blob storage to compute targets) should be performed using the standard/non-parallelized approach.
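That recommendation can be captured as a small decision helper. This is a sketch of the rule as stated above, with made-up function and label names; it does not call any storage library.

```python
def recommended_approach(internal_to_azure, direction):
    """Pick a transfer strategy based on the findings in this post.

    internal_to_azure: True when both ends are inside the Azure datacenter.
    direction: "upload" (compute to blob storage) or "download" (blob storage to compute).
    """
    if direction not in ("upload", "download"):
        raise ValueError("direction must be 'upload' or 'download'")
    if internal_to_azure and direction == "download":
        return "standard"
    return "parallel-chunked"

for internal in (False, True):
    for direction in ("upload", "download"):
        where = "internal" if internal else "external"
        print(where, direction, "->", recommended_approach(internal, direction))
```

Only one of the four combinations, internal downloads, falls back to the standard library call.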

This last chart is designed to give an idea of the realized throughput by node for both upload and downloads using the “optimal” approach as determined via the tests detailed above.

s1chart7

As you can imagine, the results listed here triggered a number of other questions and tests. Some of these will be addressed in the next post on this topic, which should be available soon.

Related Resources

 

NOTE: This post was updated on 9/23/2010. The changes are both substantial and not at the same time. While working on the other posts in the series, I became concerned that there were too many calculations being performed ad-hoc in Excel to get from the raw data to the charts and conclusions described here. A key goal of mine is for someone who questions my results to be able to re-run them and analyze my analysis of the data. Therefore I stepped back and generated the charts using code that shows each calculation and query. The links to the code are posted above as are links to the raw data. The charts are identical to what were here originally with the exception of some formatting changes due to the differences in generation engines. The charts are also higher-resolution and clicking on them will open the full-size version of the chart.

 

Research sponsored by the Laboratory Directed Research and Development Program of Oak Ridge National Laboratory, managed by UT-Battelle, LLC, for the U. S. Department of Energy.

31 Aug 10 Containerized Computing: Onsite Demo

As part of our cloud computing initiative we have been investigating the use of containerized computing and exploring if and where it might play a role in the computational environment where I work. In the context of this effort I have had the privilege of visiting a few different locations and seeing the containers first hand – an experience which has both answered and generated a number of questions for me.

header_icecube

We have a unique opportunity in that the SGI ICE Cube demo truck is going to be on-site here Thursday and Friday September 9th and 10th. During that time the container will be available both for walk-in traffic as well as pre-scheduled, in-depth presentations. While there are a handful of different container vendors and approaches, seeing one in person will give you a baseline and framework by which to analyze others.

For those of you not familiar with containerized computing, it (as an approach/concept, not necessarily this particular design) is being used in some of the largest datacenters being built and is a key component in Microsoft’s 3rd and 4th generation datacenter designs.

Some interesting characteristics of the SGI design (other vendors have other distinguishing features although there are some common threads such as high density, energy efficiency, etc):

  • Provide the ability to operate at elevated temperatures (ambient – cold aisle – air of 85F)
  • Incredibly high density, surpassing 46,080 cores within one container
  • Reduced cooling cost
  • 20’ or 40’ containers, can be stacked 3 to 5 high
  • Dual row, universal and air-cooled designs
  • Can contain compute, storage, network, and cooling
  • Just plug in power, network, and chilled water (if needed)

SGI has recently added to their suite of designs a totally air cooled unit that simply requires a garden hose for intake water (read: “massive energy savings”).

More information on the SGI container can be found here: http://www.sgi.com/products/data_center/ice_cube/ and a PDF datasheet is here: Datasheet PDF

If any of you live near where I work, and are interested in seeing this in person, contact me and I’ll see what I can work out (note: you must be a US citizen).

07 Jul 10 Amazon Web Services for the .NET Developer

I spoke at CodeStock (http://codestock.org) a few weeks ago and one of my talks was focused on AWS from the perspective of the .NET developer. The slides are available here:

 

And the video of the session is available here:

18 May 10 You Still Have to Plan and Understand Your Toolset

I just finished reading an article (http://searchcloudcomputing.techtarget.com/news/article/0,289142,sid201_gci1512394,00.html) discussing some of the power issues and related outages at one of Amazon’s (http://aws.amazon.com) data centers last week. While much of the article was fine and factual, I take a bit of issue with the way the article wraps up:

Users may not like being told they should fend for themselves on disaster preparedness, but that appears to be part of the price for getting everything else AWS offers.

This highlights a sentiment that is unfortunately pervasive within the community of those evaluating or adopting cloud computing – that of believing that cloud computing is a panacea for all scale and datacenter problems.

What the users of these platforms need to understand is that they are toolkits. While the various cloud computing vendors provide important services and features, the consumer of said platforms must do their homework to understand the technical tradeoffs of various decisions so that they can appropriately reap the benefits of the selected platform. Simply uploading your code/application and expecting it to be always available is unrealistic. The consumer must understand what high availability features are offered by their particular cloud vendor and exploit those features to ensure that their app has the appropriate availability. In the case of the Amazon outage(s), if users had followed the high-availability guidelines provided by Amazon, they would not have experienced any outage at all. Cloud providers such as Microsoft, Amazon, and others provide the notion of availability zones, or regions, and – much like you would if you were hosting the app yourself – you need to distribute your application across such to ensure that a failure in one location doesn’t mean a complete outage for your application.
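As a toy illustration of that last point, the sketch below spreads instances round-robin across zones and counts what is left when one zone goes dark. The zone names and the `spread` helper are made up for illustration; this is not any provider's API.

```python
from collections import Counter
from itertools import cycle

def spread(instance_count, zones):
    """Assign instances to zones round-robin so no single zone holds them all."""
    zone_cycle = cycle(zones)
    return [next(zone_cycle) for _ in range(instance_count)]

placement = spread(6, ["zone-a", "zone-b", "zone-c"])
print(Counter(placement))   # two instances in each zone

# Simulate losing one zone entirely.
survivors = [zone for zone in placement if zone != "zone-a"]
print(len(survivors))       # 4 instances still serving traffic
```

Had all six instances been placed in "zone-a", the same failure would have taken the whole application down.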

Rather than a magic wand that solves all scaling and availability issues, cloud computing provides a democratized toolset that informed consumers can use to develop a highly available, scalable, and fault-tolerant application. The key word here is “democratized” – meaning these features are available to anyone, at a fraction of the cost of doing it yourself. I experience similar frustration when reading complaints from folks about the pricing of Windows Azure (e.g. “Why can’t I host my simple website there for $10/month?”). The question illustrates that the inquirer doesn’t understand the fundamental architecture of the platform (both how it works, and what its primary use cases are). Neither Amazon’s EC2 nor Windows Azure is designed to compete with a low-cost web hoster… rather, they are designed to provide tools for a company that needs features not available from a low-cost hoster but doesn’t have (or wish to spend) the capital to build those features itself.

They are great platforms that provide you the ability to build a very solid offering, but you have to understand how to properly utilize those features. Cloud computing should not be approached with ignorance, or with any less planning than you would apply if you were building out the infrastructure yourself (of course the level of detail will differ).

26 Apr 10 External File Upload Optimizations for Windows Azure

I’m wrapping up a bit of the work we’ve been doing on data movement optimizations for cloud computing and the latest set of data yielded some interesting points I thought I’d share. The work done here is not really rocket science but may, in some ways, be slightly counter-intuitive and therefore seemed worthy of posting.

Summary: for those who don’t like to read detailed posts or don’t have time, the synopsis is that if you are uploading data to Azure, block your data (even down to 1MB) and upload in parallel. Set your block size based on your source file size, but if you must choose a fixed value, use 1MB. Following the above will result in significant performance gains… upwards of 10x-24x and a reduction in overall file transfer time of upwards of 90% (e.g., uploading a 1GB file averaged 46.37 minutes prior to optimizations and averaged 1.86 minutes afterwards).
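As a quick sanity check on the 1GB figures quoted above, the speedup and the percentage reduction follow directly from the two averages. A minimal sketch in Python (the function names are mine and are not part of the test harness):

```python
def speedup(before_min: float, after_min: float) -> float:
    """How many times faster the optimized transfer is."""
    return before_min / after_min


def reduction_pct(before_min: float, after_min: float) -> float:
    """Percentage of the original transfer time that was eliminated."""
    return (1.0 - after_min / before_min) * 100.0


# Average upload times quoted above for the 1GB file, in minutes.
BEFORE_MIN, AFTER_MIN = 46.37, 1.86

if __name__ == "__main__":
    print(f"speedup:   {speedup(BEFORE_MIN, AFTER_MIN):.1f}x")
    print(f"reduction: {reduction_pct(BEFORE_MIN, AFTER_MIN):.1f}%")
```

For the 1GB file this works out to roughly a 25x speedup and a 96% reduction in transfer time.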

Detail: For those of you who want more detail, or think that the claims at the end of the preceding paragraph are over-reaching, what follows is information and code supporting these claims. As the title would indicate, these tests were run from our research facility pointing to the Azure cloud (specifically US North Central, as it is physically closest to us) and do not represent intra-cloud results. We have performed intra-cloud tests and the overall results are similar in notion, but the data rates are significantly different, as are the tipping points for the various block sizes (this will be detailed separately).

We started by building a very simple console application that would loop through a directory and upload each file to Azure storage. This application used the shipping storage client library from the 1.1 version of the Azure tools. The only real variation from the client library is that we added code to collect and record the duration (in ms) and size (in bytes) for each file transferred. The code is available here.

We then created a directory that had a collection of files for the following sizes: 2KB, 32KB, 64KB, 128KB, 512KB, 1MB, 5MB, 10MB, 25MB, 50MB, 100MB, 250MB, 500MB, 750MB, and 1GB (50 files for each size listed). These files contained randomly-generated binary data and do not benefit from compression (a separate discussion topic). Our file generation tool is available here.
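For readers who want to build a comparable data set, here is a rough sketch of a generator for incompressible test files. It is written in Python for brevity and is not our actual file generation tool; the file naming scheme, the `generate_files` name, and the 1KB = 1024 bytes convention are assumptions made for illustration.

```python
import os
from pathlib import Path
from typing import Dict, List

KB, MB, GB = 1024, 1024 ** 2, 1024 ** 3  # assumption: binary units

# The file sizes listed above, in bytes.
SIZES: Dict[str, int] = {
    "2KB": 2 * KB, "32KB": 32 * KB, "64KB": 64 * KB, "128KB": 128 * KB,
    "512KB": 512 * KB, "1MB": 1 * MB, "5MB": 5 * MB, "10MB": 10 * MB,
    "25MB": 25 * MB, "50MB": 50 * MB, "100MB": 100 * MB, "250MB": 250 * MB,
    "500MB": 500 * MB, "750MB": 750 * MB, "1GB": 1 * GB,
}


def generate_files(out_dir: str, sizes: Dict[str, int],
                   copies: int = 50, chunk: int = MB) -> List[Path]:
    """Write `copies` files of random bytes for every entry in `sizes`."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    created: List[Path] = []
    for label, size in sizes.items():
        for i in range(copies):
            path = out / f"{label}_{i:02d}.bin"  # hypothetical naming scheme
            with open(path, "wb") as fh:
                remaining = size
                while remaining > 0:
                    # Write in chunks so large files never sit fully in memory.
                    n = min(chunk, remaining)
                    fh.write(os.urandom(n))
                    remaining -= n
            created.append(path)
    return created
```

Note that calling `generate_files("data", SIZES)` with the full list and 50 copies writes well over 100GB, so try it with a subset of sizes or a small `copies` value first.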

The baseline was established by running the application described above against the directory containing all of the data files. This application uploads the files in a random order so as to avoid transferring all of the files of a given size sequentially, thereby spreading the effects of periodic Internet delays across the collection of results. We then ran some scripts to split the resulting data and generate some reports. The raw data collected for our non-optimized tests is available via the links in the Related Resources section at the bottom of this post.
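The shape of that baseline run can be sketched as follows. This is an illustrative Python outline rather than the console application itself: `run_baseline` is a hypothetical name, and `upload` is whatever callable actually performs the transfer (the real harness used the Azure storage client library).

```python
import random
import time
from pathlib import Path
from typing import Callable, Iterable, List, Optional, Tuple


def run_baseline(paths: Iterable[Path],
                 upload: Callable[[Path], None],
                 rng: Optional[random.Random] = None) -> List[Tuple[int, float]]:
    """Upload every file once, in random order, and time each transfer.

    Returns one (size_bytes, duration_ms) record per file.
    """
    rng = rng or random.Random()
    order = [Path(p) for p in paths]
    rng.shuffle(order)  # avoid sending all files of one size back to back

    records: List[Tuple[int, float]] = []
    for path in order:
        size_bytes = path.stat().st_size
        start = time.perf_counter()
        upload(path)  # the transfer being measured
        duration_ms = (time.perf_counter() - start) * 1000.0
        records.append((size_bytes, duration_ms))
    return records
```

Passing a seeded `random.Random` makes the order repeatable, which can help when comparing two runs.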

For each file size, we calculated the average upload time (and standard deviation) and the average transfer rate (and standard deviation). As you likely are aware, transferring data across the Internet is susceptible to many transient delays which can cause anomalies in the resulting data. It is for this reason that we randomized the order of source file processing as well as executed the tests 50x for each file size. We expect that these steps will yield a sufficiently balanced set of results.
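The per-size statistics described above could be computed along these lines. This is a sketch in Python, not our reporting scripts; it assumes records of `(size_bytes, duration_ms)` with non-zero durations, uses the sample standard deviation (the post does not say which variant was used), and the name `summarize` is hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev
from typing import Dict, Iterable, List, Tuple


def summarize(records: Iterable[Tuple[int, float]]) -> Dict[int, Dict[str, float]]:
    """Group (size_bytes, duration_ms) records by size and compute statistics."""
    by_size: Dict[int, List[float]] = defaultdict(list)
    for size_bytes, duration_ms in records:
        by_size[size_bytes].append(duration_ms)

    summary: Dict[int, Dict[str, float]] = {}
    for size_bytes, durations in sorted(by_size.items()):
        # Transfer rate per upload in bytes per second (duration must be > 0).
        rates = [size_bytes / (ms / 1000.0) for ms in durations]
        many = len(durations) > 1  # stdev needs at least two samples
        summary[size_bytes] = {
            "count": float(len(durations)),
            "avg_ms": mean(durations),
            "sd_ms": stdev(durations) if many else 0.0,
            "avg_rate_bps": mean(rates),
            "sd_rate_bps": stdev(rates) if many else 0.0,
        }
    return summary
```

Averaging the per-upload rates is a choice; dividing total bytes by total time gives a different (time-weighted) figure.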

Once the baseline was collected and analyzed, we updated the test harness application with some methods to split the source file into user-defined block sizes and then to upload those blocks in parallel (using the PutBlock() method of Azure storage). The parallelization was handled by simply relying on the Parallel Extensions to .NET to provide a Parallel.For loop (see linked source for specific implementation details in Program.cs, line 173 and following… less than 100 lines total). Once all of the blocks were uploaded, we called PutBlockList() to assemble/commit the file in Azure storage. For each block transferred, the MD5 was calculated and sent, ensuring that the bits that arrived matched what was intended. The timer for the blocked/parallelized transfer method wraps the entire process (source file splitting, block transfer, MD5 validation, file committal). A diagram of the process is as follows:

ParallelAzureUploadDirect
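The same split / parallel upload / commit flow can be sketched outside of .NET. The Python below is an illustration only: `InMemoryBlobStore` is a hypothetical stand-in for the storage service (it is not the Azure client API), a thread pool takes the place of Parallel.For, and the data is held in memory for simplicity where a real harness would read blocks from the file.

```python
import base64
import hashlib
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def md5_b64(data: bytes) -> str:
    """Base64-encoded MD5 digest of a block."""
    return base64.b64encode(hashlib.md5(data).digest()).decode("ascii")


class InMemoryBlobStore:
    """Hypothetical stand-in for block storage; mimics put-block / put-block-list."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.staged: Dict[str, bytes] = {}
        self.committed: Dict[str, bytes] = {}

    def put_block(self, block_id: str, data: bytes, md5: str) -> None:
        # Reject the block if the bytes received do not match the sender's MD5.
        if md5_b64(data) != md5:
            raise ValueError(f"MD5 mismatch for block {block_id}")
        with self._lock:
            self.staged[block_id] = data

    def put_block_list(self, blob_name: str, block_ids: List[str]) -> None:
        # Assemble the staged blocks in the given order and commit the blob.
        with self._lock:
            self.committed[blob_name] = b"".join(self.staged.pop(b) for b in block_ids)


def upload_in_blocks(store: InMemoryBlobStore, blob_name: str, data: bytes,
                     block_size: int, workers: int = 8) -> List[str]:
    """Split `data` into blocks, upload them in parallel, then commit the list."""
    offsets = range(0, len(data), block_size)
    # Fixed-width ids keep the commit order unambiguous.
    block_ids = [base64.b64encode(f"{i:08d}".encode()).decode("ascii")
                 for i in range(len(offsets))]

    def send(index: int) -> None:
        chunk = data[offsets[index]:offsets[index] + block_size]
        store.put_block(block_ids[index], chunk, md5_b64(chunk))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(send, range(len(block_ids))))  # list() surfaces worker errors

    store.put_block_list(blob_name, block_ids)
    return block_ids
```

Blocks may finish in any order; the committed result is still correct because the final block list, not arrival order, determines how the file is assembled.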

We then tested the effects of blocking & parallelizing the transfers by running the updated application against the same source set and did a parameter sweep on the block size including 256KB, 512KB, 1MB, 2MB, and 4MB (our assumption was that anything lower than 256KB wasn’t worth the trouble and 4MB is the maximum size of a block supported by Azure). The raw data for the parallel tests is available via the links in the Related Resources section at the bottom of this post.

This data was processed and then compared against the single-threaded / non-optimized transfer numbers and the results were encouraging. The Excel version of the results is available here.

Two semi-obvious points need to be made prior to reviewing the data. The first is that if the block size is larger than the source file size you will end up with a “negative optimization” due to the overhead of attempting to block and parallelize. The second is that as the files get smaller, the clock-time cost of blocking and parallelizing (overhead) is more apparent and can tend towards negative optimizations. For this reason (and as supported by the raw data provided in the linked worksheet), the charts and discussion below ignore source file sizes less than 1MB.

RateImprovement


The chart above illustrates some interesting points about the results:

  • When the block size is smaller than the source file, performance increases but as the block size approaches and then passes the source file size, you see decreasing benefit to the point of negative gains (see the values for the 1MB file size)
  • For some of the moderately-sized source files, small blocks (256KB) are best
  • As the size of the source file gets larger (see values for 50MB and up), the smallest block size is not the most efficient (presumably due, at least in part, to the increased number of blocks, increased number of individual transfer requests, and reassembly/committal costs).
  • Once you pass the 250MB source file size, the difference in rate for 1MB to 4MB blocks is more-or-less constant
  • The 1MB block size gives the best average improvement (~16x) but the optimal approach would be to vary the block size based on the size of the source file.
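A block-size picker that encodes only the rules stated in this post might look like the sketch below (Python, hypothetical names). It skips blocking when the file is no larger than the block, defaults to 1MB, and enforces the 4MB block limit. It deliberately does not vary the block size by file size, since the right thresholds should come from the measured data rather than from a guess.

```python
from typing import Optional

MB = 1024 * 1024
DEFAULT_BLOCK = 1 * MB   # best average improvement in these tests
MAX_BLOCK = 4 * MB       # largest block size used in these tests


def choose_block_size(file_size: int, preferred: int = DEFAULT_BLOCK) -> Optional[int]:
    """Return the block size in bytes, or None to upload in a single request."""
    if not 0 < preferred <= MAX_BLOCK:
        raise ValueError("preferred block size must be between 1 byte and 4MB")
    if file_size <= preferred:
        # Blocking a file no larger than one block only adds overhead.
        return None
    return preferred
```

For example, `choose_block_size(512 * 1024)` returns `None`, while `choose_block_size(50 * MB)` returns the 1MB default.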

 

RateImprovement2 

The above is another view of the same data as the prior chart, just with the axes swapped (the x-axis represents file size and the plotted data shows improvement by block size). It again suggests that the 1MB block size is probably the best overall size, but also shows the benefits of some of the other block sizes at different source file sizes.

DurationReduction

This last chart shows the change in total duration of the file uploads based on different block sizes for the source file sizes. Nothing really new here, other than that this view of the data highlights the negative effects of poorly choosing a block size for smaller files.

 

Summary

What we have found so far is that blocking your file uploads and uploading them in parallel results in significant performance improvements. Further, utilizing extension methods and the Task Parallel Library (.NET 4.0) makes short work of altering the shipping client library to provide this functionality while minimizing the amount of change to existing applications that might be using the client library for other interactions.

 

Related Resources