Monday
Jun062011

Cloud Futures 2011–Scaling Document Clustering

I was honored to be able to give a talk at the Microsoft Research Cloud Futures 2011 conference this past week. I joined a number of other researchers and academics from around the world and discussed what folks were doing with the cloud, where issues remained, and where progress was being made. Having been in attendance at last year’s event, I was quite pleased to see the advancements both in logistics and content… the level of material was definitely stronger this year.

The talk I gave was on some early work we are doing in scaling a document clustering algorithm (Piranha) using cloud primitives. The slide deck is below and, if you attended the talk, I’d appreciate it if you’d take a minute to rate the talk using the button below.

Monday
May162011

Windows Azure Handbook: v1

Earlier this year, a fellow Windows Azure MVP, David Pallmann, wrote the first book in a series on Windows Azure. For whatever reason, I started reading the book with a great deal of skepticism (fast moving technology, books are outdated long before they reach press, etc.) but was pleasantly surprised by the strong business relevance of this book.

In fact, I’d go so far as to say that most any consultant working in the Azure field (and other cloud fields as well) should take the time to read this book. There are a number of worksheets and thought processes that David walks through that illustrate a maturity (i.e. lack of hype) to the cloud engagement process.

Of particular interest to me were the later chapters, which moved a bit away from the technology specifics and focused more on the reasoning behind why one would consider a move to the cloud, what should be considered, how to identify good/bad cloud application candidates, how to plan an effective pilot, etc.

Friday
May062011

Hands On with Amazon Web Services

[updated 6/1/2011 with embedded video]

I have the opportunity to talk at StirTrek today and wanted to make the slides available from today's session. I'll update this post a bit more following the session.

Friday
Feb112011

Moving Applications To The Cloud with Windows Azure

I just finished reading a book from the Microsoft Patterns & Practices group called Moving Applications to the Cloud on the Microsoft Windows Azure Platform. I’ve had the book for a few months, and when I first received it, I read the first chapter or two, decided it wasn’t worth the read, and set it aside.

Lately, however, I picked it up again – finished the book, and am glad I did. Don’t get me wrong, it didn’t magically morph into a superb spectacle of literary greatness, but I did find that as I read further, the authors moved further from the very basics of the Windows Azure platform and the content became increasingly interesting.

If you are new (or relatively so) to the Windows Azure platform and contemplating the moving of existing applications to the cloud, this is a worthwhile discussion of a fictitious scenario that did just that. The scenario is slightly on the cheesy side, but realistic enough to help you think through issues you may be facing in your business.

If you are well experienced with the platform, you will likely find this a bit dry – especially the first portions. You’ll also likely be distracted or bothered by the not-so-covert marketing that takes place. That said, the book covers some more complex topics such as multiple tasks/threads sharing the same physical worker role, various optimization topics, and more. In the end, I’m glad I read it and feel that I learned some things from the book.

My last thought has nothing to do specifically with the book, but rather a growing frustration of mine with the Windows Azure platform – the design of the table storage platform. Upon reading books such as this I’m reminded (they stress it *many* times) how important your partition key/row key strategy is, and how literally hosed you are if you get it wrong. This compares with my recent experiences with Amazon’s SimpleDB product, and the delta couldn’t be more striking. Both platforms solve essentially the same problem, but in the case of SDB, it is effortless (at least by comparison). I don’t have to think of partition keys, or be overly concerned with how the underlying storage platform works… I just put data in it. Additionally, *every* column is indexed and performs reasonably under queries. I can’t shake the feeling that the Azure team is missing it here – there has to be a way to get a well-designed, horizontally scaling table structure without placing such a design burden on the users.

Monday
Jan242011

Return of the Windows Azure GAC Viewer

I’m pleased to announce that the excellent utility – the Azure GAC Viewer – is once again online and available for general use. You can access it at http://gacviewer.cloudapp.net. This tool shows you a dynamically generated list of all of the assemblies present in the GAC for an Azure instance. It also allows you to upload your project file (*.csproj or *.vbproj) to have the references scanned, and lets you know if there are any discrepancies between what you are using and what is available (by default) in Azure. You can then adjust your project file (copy-local=true) to ensure your application can run successfully.
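To make the reference check a little more concrete, here is a minimal Python sketch of the general idea. It is not the GAC Viewer's actual code (the tool itself is a .NET site); the function names, the sample project, and the `HYPOTHETICAL_GAC` set are all made up for illustration. It reads the `Reference` items from an MSBuild project file, notes which ones are already marked copy-local (the `Private` metadata element), and reports the references that are neither in the supplied GAC list nor copied locally.

```python
import xml.etree.ElementTree as ET

# Namespace used by classic MSBuild project files (*.csproj / *.vbproj).
MSBUILD_NS = "{http://schemas.microsoft.com/developer/msbuild/2003}"


def referenced_assemblies(project_xml):
    """Map each referenced assembly's simple name to its copy-local flag."""
    root = ET.fromstring(project_xml)
    refs = {}
    for ref in root.iter(MSBUILD_NS + "Reference"):
        # Include may carry ", Version=..." after the name; keep the simple name.
        name = ref.get("Include", "").split(",")[0].strip()
        private = ref.find(MSBUILD_NS + "Private")
        copy_local = private is not None and (private.text or "").strip().lower() == "true"
        refs[name] = copy_local
    return refs


def missing_from_gac(project_xml, gac_assemblies):
    """References that are not in the GAC list and are not set to copy-local."""
    refs = referenced_assemblies(project_xml)
    return sorted(
        name for name, copy_local in refs.items()
        if name not in gac_assemblies and not copy_local
    )


# Hypothetical project file and GAC listing, for illustration only.
SAMPLE_PROJECT = """<Project xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
  <ItemGroup>
    <Reference Include="System.Core" />
    <Reference Include="Contoso.Widgets, Version=1.0.0.0" />
    <Reference Include="Contoso.Logging">
      <Private>True</Private>
    </Reference>
  </ItemGroup>
</Project>"""
HYPOTHETICAL_GAC = {"System", "System.Core"}

print(missing_from_gac(SAMPLE_PROJECT, HYPOTHETICAL_GAC))
```

In this made-up example only `Contoso.Widgets` is reported, since `Contoso.Logging` is already copy-local and `System.Core` is in the assumed GAC list.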


If you are familiar with the tool, you may be thinking “Wait! you aren’t Wayne Berry, and besides, the URL has changed!” – and you would be correct on both counts. Wayne developed the tool and posted about it back in September of last year. Since that time, however, Wayne has accepted a position on the Windows Azure team and is unable to continue maintaining the site full time. As a gesture of kindness to the community, he has passed the source code to me and given me his blessing to re-launch the tool.

As it stands today, the tool is nearly exactly as Wayne developed it, with a few tweaks to have it use Guest OS 2.1 rather than 1.6. I’ve also added a contributors page to give credit to Wayne and to the organizations that are allowing me to maintain and keep the site online.

In the future, I hope to make the source code available on CodePlex as well as to add to the list of tools that live on the site. If you have any bugs with the current site or ideas for future changes, please feel free to contact me.

Thursday
Jan062011

Book Review: Host Your Web Site In The Cloud

Over the holiday break I spent some time getting ready for the cloud computing precompiler at CodeMash and, as part of that effort, I read Jeff Barr’s Host Your Web Site In The Cloud, Amazon Web Services Made Easy. This book is one of the few physical paper books I’ve gotten recently, and is unique to me in that it is the only book I have that is signed by the author.

That aside, I’d like to recommend this book to anyone who is looking at Amazon Web Services, or would consider themselves a beginner with AWS. I found the writing style to be very easy to read and, while I’m not a PHP developer, the code samples and walkthroughs were clear and simple to follow.

AWS is a fast moving target, and even though Jeff is on the team, I’m certain it was difficult to get a book to market that wasn’t completely outdated by the time it hit the shelves, but I think he does a good job of addressing the basics, providing a foundation on which you can build your knowledge, and even slips in a few notes regarding late breaking updates (as of press time) such as EC2 instances being bootable from EBS.

In my mind, this book is similar to the Windows Azure Training Kit in that it gives you most everything you need to get your feet wet and get rolling with the technology, and provides you with the framework by which you can add to your skills.

Tuesday
Jan042011

Speaking at the CodeMash Precompiler

I’m thrilled to be speaking at the CodeMash Precompiler next week. I’m going to be joined by Mike Wood and helped by Brian Prince and Michael Collier. Together, we’ll have nearly 8 hours of instruction and hands on labs covering both the Amazon and Microsoft cloud computing platforms. Below I’ve listed the abstracts for each of the sessions as well as the prerequisites for those planning on joining us. If you are going to be in Sandusky next Wednesday, be sure to drop by.

An Introduction to Amazon Web Services (half-day, afternoon)

AWS has been in the cloud computing space longer than most anyone, and they are the de facto standard when it comes to Infrastructure as a Service. While most developers are comfortable with the notion of virtual machines, reviewing the AWS offering can sometimes look like alphabet soup (EC2, S3, SNS, SDB, SQS). Join us to learn the power behind these acronyms and the tools that they can provide your next project. We'll discuss the major components, some of the trade-offs between different implementation choices (e.g. boot from S3 vs. boot from EBS) and provide you with the opportunity to work through some labs, deploy some code, and begin to experience the Amazon cloud for yourself.

Examples are in .NET, but fundamental concepts apply to all platforms.

 

An Introduction to Windows Azure (half-day, morning)

Steve Ballmer has made it very clear that Microsoft is "all in" when it comes to the cloud and by now most have heard about Microsoft's Windows Azure platform... but what does that mean for you? Whether you are an experienced .NET developer who is wondering what all this cloud stuff means for how you write code, or maybe you are a traditional *nix developer looking to understand how to integrate your existing code with the Microsoft version of the cloud, join us for an in-depth discussion on what Platform as a Service is, how Microsoft has implemented it, what scenarios it best addresses, and a collection of hands-on-labs to get you started.
Examples are in .NET, but fundamental concepts apply to all platforms.

 

Prerequisites

The sessions will be part presentation, part hands on labs.  While you aren't required to bring a laptop, you'll get much more out of the sessions if you have one available to work through the labs with (but, there might be some people willing to pair as well!).  Please make sure to bring your power cord! 

Here are the prerequisites to have loaded:

An Introduction to Windows Azure

· Operating Systems Supported: Windows 7 (Ultimate, Professional, and Enterprise Editions); Windows Server 2008; Windows Server 2008 R2; Windows Vista (Ultimate, Business, and Enterprise Editions) with either Service Pack 1 or Service Pack 2

· Microsoft Visual Studio 2010 (full version or the free trial).

· SQL Server 2005 Express Edition (or above) (this is usually installed with Visual Studio)

· Install the Windows Azure Tools for Microsoft Visual Studio (and some hotfixes)

· Install the AppFabric SDK

· Install the Windows Azure Platform Training Kit

An Introduction to Amazon Web Services

· Amazon AWS SDK for .NET

· Requires Microsoft .NET Framework 2.0 or later.

· Use the AWS SDK for .NET with any of the following Visual Studio editions:

o Microsoft Visual Studio 2008 Professional Edition or later

o Microsoft Visual C# 2008 Express Edition (free!)

o Microsoft Visual Web Developer 2008 Express Edition (free!)

You might be thinking, "Hey, wait a second!  This is CodeMash, you just listed all Microsoft tools there!".  Just like CodeMash, both Windows Azure and Amazon AWS are happy to mix in multiple development stacks.  Our labs and demos will be shown using Visual Studio, but don't let that stop you from following along or trying out the cloud platforms from your Mac, or using Java, PHP and Ruby on Windows.  Below are links to other SDKs for each cloud platform.  Please, feel free to explore your options and load these SDKs or libraries up if you prefer them.

For Windows Azure

· Windows Azure SDK For Java

o AppFabric: http://www.jdotnetservices.com/

· Windows Azure SDK for PHP

o AppFabric: http://dotnetservicesphp.codeplex.com/

o and tools http://azurephptools.codeplex.com/

o and Companion http://www.interoperabilitybridges.com/projects/windows-azure-companion

o Oh, and some love for Eclipse via a plug in: http://www.windowsazure4e.org/

· Windows Azure AppFabric SDK For Ruby

For Amazon AWS

· AWS Java Developer Center

· AWS PHP Developer Center

· AWS Python Developer Center

· AWS Ruby Developer Center

Tuesday
Dec212010

Planet Technologies Launches GovCloud

This is just a quick post to let those interested know that the company I work for - Planet Technologies - has launched a new practice focused on helping public sector/government agencies utilize cloud computing. Services range from high-level assessments to hands-on migration assistance, and most everything in between.

You can read the press release here: http://www.technology-digital.com/planet-technologies-launches-govcloud-new-cloud-practice-designed-specifically-assist-government-ag-0

And visit the website here: http://govcloud.com/

Wednesday
Nov172010

Does Amazon’s Cluster Compute Platform Still Represent Cloud Computing?

I’m sitting at the airport in New Orleans, after having attended the first half of the ACM/IEEE 2010 Super Computing conference. This was the first time I have attended this conference, and it was certainly interesting to participate.

During the workshop I participated in on Sunday (Petascale Data Analytics on Clouds: Trends, Challenges, and Opportunities), there arose a conversation regarding the Amazon EC2 “cluster compute instances” and their having reached a spot on the Top 500 list. What surprised me, however, was not that they were mentioned (I actually expected them to receive more attention than they did), but that they were described as not being “real” cloud computing.  The point was made that they represented some sort of special configuration that was done just for the tests and that the offering was somehow significantly different from what the rest of the general populace could acquire. The two primary individuals involved in the exchange have significant history in classic HPC and have at least a degree of “anti-cloud” bias, but I am responsible for helping influence the viewpoint of one of these folks, so I’ve been thinking a bit over the past few days about how to properly articulate the inaccuracies of the argument… and wondering if it really matters anyway.

Commodity Hardware – by this I mean that the platforms being utilized could be purchased/deployed by anyone… and, by “anyone”, I am thinking of a moderately skilled computer hobbyist. I’m referring, particularly, to the chip architectures, availability of the motherboards, etc. A quick glance at the specs for a given machine validates that anyone (with enough money) could easily assemble a similarly-configured machine. It is simply a quad-core Intel box with 24 GB of RAM and roughly 2TB of disk. One might argue that the newly-announced Cluster GPU Instance is specialized hardware, but then again, anyone with an extra $2,700 to spare could add one of these to their machine. The point is that machines in this class are in the 5K range, not the 50K or 500K range.

Commodity Networking – now to some of you, 10GbE non-blocking networks might seem specialized or exotic, but – at least in the HPC realm – they aren’t. Most serious HPC platforms utilize a network technology called InfiniBand (usually QDR) or something fancier (and more expensive, such as an IBM custom interconnect or CRAY’s Gemini). A quick search shows one could purchase 10GbE switches starting in the 2-3K range and going up from there, whereas IB QDR switches are at least double that.

Broad Availability – this point gets a little stickier. The point is that anyone can get access to CCI nodes at any point – simply using a credit card and visiting the AWS website. However, getting access to 880 of them (the number used in the Top 500 run) is likely to be more difficult. The reason is not an unwillingness on Amazon’s part to provide this (I’m sure, given the proper commitment, this would not be impossible), but rather a question of economics and scale. Their more “general” nodes have a large demand and use case… the scale of demand for CCI nodes is yet to be established although I’d imagine the sweet spot for these customers is in the 16-64 node range… folks who could really use a cluster some of the time, but certainly don’t need it all of the time. As such, I (and I have no inside knowledge of their supply/demand change) don’t imagine that the demand is currently so large that beyond the currently active nodes, they have ~1000 nodes of this instance type sitting around just waiting for you to request them (this will likely change as demand grows).

Inexpensive + Utility-style Pricing – This is one area where this instance type represents all of the goodness we have become accustomed to in the cloud computing world. These nodes (remember I listed the above as starting around 5K) are available at $1.60/hour ($2.10/hour for the GPU-enabled nodes). This makes a significant computing platform available to almost anyone. For just over $100/hour, you can have a reasonably-well powered 64-node cluster on which to run your experiments… that is disruptive in my opinion. The best part about it, is that this price is the worst case scenario – meaning, this is the price with no special arrangement, or reservation, or volume discount, or anything. It represents no long term commitment… nothing beyond a commitment for the current hour.
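The arithmetic behind those numbers is simple enough to sketch in a few lines of Python. The hourly rates and the roughly 5K purchase price come from the figures quoted above; the function names are my own, and the break-even figure deliberately ignores power, cooling, networking, and administration, which is exactly what a real TCO analysis would add back in.

```python
# On-demand prices quoted in this post (USD per node-hour).
CCI_RATE = 1.60      # Cluster Compute Instance
GPU_RATE = 2.10      # Cluster GPU Instance


def cluster_cost(nodes, hours, rate=CCI_RATE):
    """On-demand cost of running `nodes` instances for `hours` hours."""
    return nodes * hours * rate


def break_even_hours(purchase_price, rate=CCI_RATE):
    """Node-hours of rental that cost as much as buying one machine outright.

    Hardware price only: power, cooling, and staff time are ignored here.
    """
    return purchase_price / rate


print(f"64 CCI nodes for one hour: ${cluster_cost(64, 1):.2f}")
print(f"64 GPU nodes for one hour: ${cluster_cost(64, 1, GPU_RATE):.2f}")
print(f"Hours of rental equal to a $5,000 box: {break_even_hours(5000):.0f}")
```

The first line works out to $102.40, which is the "just over $100/hour" figure for a 64-node cluster.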

So… what is different? – I have spent the majority of this post explaining how I think that these instance types are similar in many ways to other IaaS offerings and thereby deserve categorization as “regular” cloud computing, but that begs the question – what is unique about these nodes that would cause Amazon to promote them as better for HPC workloads? What facts formed the foundation for these rather experienced HPC experts to classify them as different? In my mind, there are really only two or three things here. The first is the networking – rather than being connected to a shared 1GbE network, you are given access to a 10GbE network, and guaranteed full bisection bandwidth node-to-node. It is this fact alone that makes the platform so interesting to the HPC folks as it makes it actually viable for network-heavy applications (think traditional MPI apps). Secondly, you have clear visibility to the hardware. Amazon tells you exactly what type of processors you are running on, allowing you to optimize your codes for that particular processor (somewhat common in the HPC realm). Tightly coupled with this fact is that you can’t get a “part” of this instance type. You get the entire node (less the hypervisor) and, as such, are not contending with any other customers for node-local resources (RAM, ephemeral disks, network, etc). Finally, the fact that you can get nodes that have specialized hardware (NVidia GPUs) is unique… there are very few cloud providers currently offering this sort of feature set.

In the end, I think the Amazon offerings are very much representative of the “cloud” and, particularly, of where the cloud is going. I think we will continue to see a broad level of homogeneity (basic hardware abstractions) with comparatively small pockets of broad-domain specific assets. The key being that for a large number of researchers, the offerings announced by Amazon this summer (and additionally this week) make the decision as to whether or not to buy that new departmental cluster much more difficult – especially when a true TCO analysis is performed. These are similar to the arguments and justifications for “normal” cloud compute scenarios and as such, should be considered one and the same.

Friday
Sep242010

Maximizing Throughput in Windows Azure – Part 3

This is the third in a series of posts I’m writing while working on a paper dealing with the issue of maximizing data throughput when interacting with the Windows Azure compute cloud. You can read the first post here and the second here. This post assumes you’ve read the other two, so if you haven’t now might be a good time to at least peruse them.

Summary: Based on the work performed and detailed in the first two parts of this series, we scaled the load tests horizontally to 20 concurrent nodes to ensure that the performance characteristics of the storage platform were not overly degraded. We found that while we were able to move a significant amount of data in a relatively short period of time (500 GB in around 10 minutes – roughly 6.9 Gbps), we experienced something less than a linear scale up from what a single node could transfer (up to 39% attenuation in our tests).
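For anyone who wants to sanity-check the headline number, the conversion from data volume and elapsed time to throughput is sketched below in Python. This assumes decimal units (1 GB = 8 gigabits) and treats "around 10 minutes" as approximate; the function names are purely illustrative.

```python
def throughput_gbps(gigabytes, seconds):
    """Average throughput in gigabits per second (decimal units, 1 GB = 8 Gb)."""
    return gigabytes * 8 / seconds


def seconds_for(gigabytes, gbps):
    """Elapsed time needed to move `gigabytes` at an average of `gbps`."""
    return gigabytes * 8 / gbps


print(round(throughput_gbps(500, 600), 2))  # 6.67 Gbps if it took exactly 10 minutes
print(round(seconds_for(500, 6.9)))         # 580 seconds at the quoted 6.9 Gbps
```

Under those assumptions, the quoted rate corresponds to a transfer that finished a little under the ten minute mark.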

Detail: After the work done in the first two stages, I decided to see what effect horizontal scaling would have on the realized throughput. To test this, I took the test harness (code links below) and set it for the “optimal” approach for both upload and download as determined by the prior runs (sub-file chunked & parallelized uploads combined with whole-file parallelized downloads), deployed it to 20 nodes, and then did a parameter sweep on node size. I tested a few different methods of starting all of the nodes simultaneously (including Wade Wegner’s “Release the Hounds” approach) but settled on Steve Marx’s “pseudo code” (Wade’s words - not mine) approach as I had issues with getting multicasting on the service bus to scale using the on-demand payment model. This provided for a slightly crude (triggers were not *exactly* timed together) start time, but was more-or-less concurrent.

You can see from the following chart that my overall performance wasn’t too bad – based on the node type we saw upwards of 6 Gbps download speed and around 2 Gbps upload. Also consistent with our prior tests, we saw a direct relationship between node size and realized throughput rates.

While the chart above is interesting, the real question is whether or not the effective throughput was linear based on node count. The following charts compare the average results from three runs per node size of 20 concurrent nodes to the average numbers from the prior tests by node size multiplied by 20 (perfect scale). What we see is that uploads demonstrate an attenuation of between 25% and 45% while downloads taper between 18% and 39%.

Note: the XL size uploads actually demonstrated a better-than-linear scale (around 101% of linear) which is attributed to a generally good result for the three test sets for this experiment and a comparatively poor result set (likely network congestion) for the XL nodes in the prior tests. The results in this test are from an average of three runs (each run consisting of 20 nodes transferring 50 files each) – performing more runs would likely render a higher accuracy of trend data. 
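The comparison against "perfect scale" boils down to one ratio, sketched below in Python. The function name and the sample rates are hypothetical (they are not the measured values from these runs); only the 20-node count and the definition of perfect scale as the single-node rate multiplied by the node count come from the description above.

```python
def attenuation(single_node_rate, measured_aggregate, node_count=20):
    """Fraction of ideal (linear) throughput lost when scaling out.

    Ideal throughput is the single-node rate multiplied by the node count.
    A negative result means the run scaled better than linearly.
    """
    ideal = single_node_rate * node_count
    return 1 - measured_aggregate / ideal


# Hypothetical rates in Gbps, chosen only to illustrate the calculation.
print(f"{attenuation(0.5, 6.1):.0%}")   # 39% short of the ideal 10 Gbps
print(f"{attenuation(0.1, 2.02):.0%}")  # -1%, i.e. about 101% of linear
```

A negative value is how a better-than-linear result, like the XL uploads, would show up in this calculation.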

Looking at the data triggered some follow-on questions such as what the attenuation curve would look like (at what node count do we stop scaling linearly) or what the individual transfer times per file look like. This prompted me to dig a bit further into the collected data and generate some additional charts. I’ve not displayed them all here (the entire collection is available in the related resources section below); however, I’ve selected a few that are illustrative of my subsequent line of questioning.

For the downloads, most of the charts looked pretty good and we saw a distribution similar to the following two charts. As you can tell, the transfers are of similar length and the histogram shows a fairly tight distribution curve.

Uploads, on the other hand, were all over the board. The following two charts are representative of some of the data inconsistencies we found. What is interesting to note here is that there are three legs that are significantly longer (visually double) than the mean of the remainder. This would cause one to wonder if the storage platform was getting pounded, effectively placing those three nodes on hold until the pressure abated. You can see from the associated histogram that the distribution was much broader, representing less consistency in transfer rates.
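Spotting those long legs does not have to be done by eye. The Python sketch below flags any transfer that took at least twice as long as the typical one. Note that it swaps in the median as the baseline rather than the mean of the remainder, simply because the median is not dragged upward by the slow transfers themselves. The durations are invented for illustration and are not the data behind the charts.

```python
from statistics import median


def slow_transfers(durations, factor=2.0):
    """Indices of transfers that took at least `factor` times the median duration."""
    baseline = median(durations)
    return [i for i, d in enumerate(durations) if d >= factor * baseline]


# Hypothetical per-node upload times in seconds for one 20-node run.
durations = [31, 29, 30, 64, 28, 32, 30, 66, 31, 29,
             30, 61, 30, 29, 31, 30, 32, 28, 30, 31]

print(slow_transfers(durations))  # [3, 7, 11]
```

With these made-up numbers the median is 30 seconds, so the three nodes that took over a minute are the ones flagged.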

The previous charts got me to wondering further, so I wrote some code to generate charts (timeline) of transfers for each node within size/run collection (one chart for each of the horizontal bars in the chart above). Immediately obvious in the charts below is a bug in my data collection (my log data for the individual files was tracking the total seconds elapsed, but the end time was being recorded in minutes – resulting in the oddities (right alignment) in the bar display below – this will be fixed in future runs/posts). Ignoring the bug for a minute, the first chart is something similar to what you would expect… parallelized transfers that overlap some and stair-step over the elapsed time.

The following two charts, however, represent situations different than you would expect and illustrate what would appear to be problems in the network/transfer/nodes/my code (something). In both scenarios there are large blocks of time with apparently nothing happening, as well as individual files that apparently took significantly longer than the rest to transfer. In the next set of tests (and follow on blog post) I’ll be digging into this issue and looking to understand exactly what is happening and, hopefully, be able to explain a little bit of why.

 

Related Resources:

 

Research sponsored by the Laboratory Directed Research and Development Program of Oak Ridge National Laboratory, managed by UT-Battelle, LLC, for the U. S. Department of Energy.