Wednesday
Apr 07, 2010

Cloud Futures 2010: Panel on Cloud Applications - New Experiences and Expectations

I am in Redmond this week, participating in two workshops hosted by different groups within Microsoft Research. Along with a handful of others, I was asked to participate in a panel discussion on Friday dealing with new experiences that cloud computing would facilitate, as well as the things we felt were roadblocks to seeing those experiences realized. We were specifically challenged to think "outside the box", to look beyond the (now typical) conversations surrounding raw performance, and to dream a little. I wrote out the following as a means of working through my thoughts for my 5-7 minute portion of the panel discussion and, as it took me longer than 7 minutes to read, I thought I'd post it here as an expansion of the talk and possibly an anchor on which to hang subsequent conversations. Please forgive the casual nature of the text; it is essentially a script meant to be read to a group rather than a formal written version of the same.

---

This topic is certainly interesting to me as I am convinced that cloud computing is here to stay and also presents a platform that can be disruptive to the scientific/technical computing industry (although I would qualify this by saying “disruptive in a constructive sense” – meaning that the disruption leads to the additive good and not the removal of existing work). I have spent a considerable amount of time over the past week contemplating this question (how do we imagine cloud computing facilitating new usage scenarios), and have chosen to present my reply by means of a few examples.

The first example is that of Lego MindStorms. Are you familiar with these? They are kits that give kids (regardless of how old they are :) ) the ability to build robots using a familiar (although slightly altered) Lego metaphor. These kits come with motors, sensors, and a "brain" that is programmable via a drag-and-drop software tool but also supports more complex tools such as Microsoft's Robotics Studio. Do you know what is so great about these (besides the obvious)? They allow ordinary people, with no prior robotics or electronics experience, to dabble in the field. They are a gateway, if you will, to a much broader field.

The second example is more of a personal experience: I had the privilege of running into my high school science teacher this past weekend - a quiet, rather unassuming fellow named Randy White. Randy's brilliance is that he has a passion for science and did (at least in my case) an excellent job of passing it on. If I am ever able to accomplish anything interesting in the scientific domain, a large portion of the credit will lie with him. Probably the most important thing he taught us was how to think about, and how to tackle, the complex. I can't tell you how many times I heard him say, "Start with what you know". The idea is that, most often, incredibly complex problems are composed of nothing more than a series of far simpler, additive problems. He taught us to focus on solving what we could, rather than attempting to "swallow the entire elephant", if you'll allow me to strain a metaphor.

If you find yourself wondering what these two examples have to do with each other or, more germanely, what they have to do with my vision for the scenarios that cloud computing will open, let me see if I can explain...

You see, much in the same manner as Lego MindStorms have introduced an otherwise unlikely audience to the world of robotics, I believe that cloud computing (based on its cost model and popular programming paradigms) is a means of introducing normal people (and by this, I mean those not formally trained in scientific or technical computing) to the notion of using computation as a tool for solving complex problems. Possibly to the dismay of some in the field, I think that this will, at least initially, be done in a manner devoid of topics such as MPI or Fortran, much in the same way as a 15-year-old "programming" his robot doesn't have to understand the inner workings of concurrency runtimes or the physics at work when his robot "walks" for the first time. I will be the first to admit that these (MPI, Fortran, concurrency topics, race conditions) are important topics, but I would submit that they should not be gating factors to one's ability to explore the arena and determine if he/she is interested in further study in that field. I think we will see paradigms that are far simpler to adopt, such as master-worker, map/reduce, etc. (or even cloud-backed applications that are hidden behind more accessible tools such as Excel or MATLAB), take hold in significant ways, and that we will see the development of novel approaches to solving problems using this new platform. The tried-and-true tools will remain, and will be used when necessary and appropriate, but I think if we force them down the throats of the next generation of researchers as "the only way to accomplish science", we are doing them a great disservice.

As to where Mr. White and high-school science come into play - well, this can best be summarized by a comment made by a friend of mine, Wally McClure, when he, almost flippantly, referred to Windows Azure as a "poor man's supercomputer". Having worked with Azure for quite a while at the time, I took a little offense at the label due to its semi-pejorative nature, and I prefer "common man's supercomputer", but the point is the same regardless: cloud computing (at least as currently manifested in both Windows Azure and Amazon's AWS platform) has great potential to democratize high-performance computing. You see, the high school I attended was small... we had 23 in my graduating class. While Randy has moved on, he still teaches in a comparatively small school that certainly has no funds for a cluster on which to run experiments. However, with the advances in cloud computing, Randy could devise a collection of simple experiments and actually execute them as part of a class project. He could have a significant computational cluster for the equivalent of a few dollars. He can present "Scientific" computing as something attainable to his students, and hopefully foster an interest that will develop into the next generation of computational thinkers - solving one problem at a time, incrementally, on the way to solving massive problems that we have trouble even describing today.

It is, in my opinion, incumbent upon us - the current generation of computational researchers and domain-specific scientists - to look at cloud computing not as a threat to the establishment, but as facilitating a new means of scientific discovery. We should consider ways to make large-scale computation more accessible to "normal" people. We should be opening up the community, sharing wherever possible, reducing the barriers to entry. Challenge yourselves and your students to push boundaries, to consider non-traditional approaches, and to enjoy "playing" with computational resources.

Friday
Feb 26, 2010

The Danger with using a framework…

The danger with using a framework is that it sometimes does things you aren’t aware of, and those things can send you in circles for quite some time before you figure them out.

I’ve been working on some tests of some “creative” ways to get data in and out of cloud platforms at rates above the norm, and I’ve written a test harness that grabs a bunch of files, one at a time, and records the file size, duration, etc. for each transfer. I’ve been doing this in a single-threaded fashion for quite some time with reasonable success. The problem began when I attempted to run a test that did multi-threaded downloads (multiple threads each grabbing a portion of the file).

NOTE/Crazy Quirk: I don’t yet know why this is the case, but the problem I’m preparing to explain did *not* appear while I had Fiddler running… only when it was *not* running. I’m guessing that this is due to some “magic” that Fiddler does to the HTTP/networking stack.

The behavior I was seeing was that after two threads had executed, all subsequent threads would fail or time out. Obviously, when one is doing a significant amount of data movement, this is sub-ideal. The culprit turned out to be the ServicePointManager’s DefaultConnectionLimit. By default, this is set to 2, which means you can have at most 2 open connections to the same host (more precisely, the same scheme, host, and port) at the same time. When I was doing this in serial, there was no problem, as the connections were managed/re-used on the main (only) thread. When doing a number of operations against the same host from multiple threads (especially when you are setting them up and tearing them down quickly), it appears that the ServicePointManager is unable to re-use the connections (not surprising), but neither is it able to tell that a thread is gone and that its connection should no longer count against the limit (and yes, I was behaving and closing my connections).

The solution I came up with was, first, to shorten the time to live for idle connections; next, to monitor the number of connections currently “consumed” and increase the limit based on how many I needed for the current operations; all while enforcing an upper bound and a back-off mechanism should things get too far out of bounds.

 

// requires: using System; using System.Net; using System.Threading;
// (url and options.ConcurrentThreads are defined elsewhere in the harness)

// ensure that we don't have lingering connections that will hamper our
// ability to continue...
// Start by getting the ServicePoint for our current Url
ServicePoint servicePoint =
    ServicePointManager.FindServicePoint(new Uri(url));

// see how many connections currently exist...
int existingConnections = servicePoint.CurrentConnections;

// if we are above our upper bound, wait a bit to let things settle down...
while (existingConnections >= 64)
{
    Console.WriteLine("Connection count too high. sleeping for a bit.");
    Thread.Sleep(1000);

    // re-read the count each pass; otherwise this loop could never exit
    existingConnections = servicePoint.CurrentConnections;
}

// ensure that we have enough room to do what we need
if ((existingConnections + options.ConcurrentThreads + 1) >
    servicePoint.ConnectionLimit)
{
    servicePoint.ConnectionLimit = existingConnections +
        options.ConcurrentThreads + 1;
}

// only give idle connections a few seconds (5) to time out...
ServicePointManager.MaxServicePointIdleTime = 5000;

Console.WriteLine("Pre-Existing Connections: {0}", existingConnections);
Console.WriteLine("Connection Limit: {0}", servicePoint.ConnectionLimit);



Hopefully, this will be helpful for someone else hitting the same issue.

Tuesday
Feb 23, 2010

Linux Desktop for a Windows Guy

I’ve found myself working quite a bit on “alternative” platforms (various distros of Linux, Mac, etc.) and have been struggling to maintain a bit of simplicity or, at least, consistency across my work environments. I currently have a setup that is working for me and I thought I’d list how I got it working for those who care – hoping that it might help some Windows guy like me who wants to live with a single keyboard/mouse setup. I should also state that I’m certain there is a better way to accomplish this – I welcome suggestions.

The Gear:

  • My main machine at work is an HP workstation with 8 GB of RAM, a bunch of disk space, and three monitors (a 27” in the middle flanked by two 21” screens).
  • My main laptop is an HP tablet running Fedora 12 (this flavor was at the insistence of a certain Mr. Billings who assured me that this was the only real build). However, to Mr. Billings’ chagrin, I’m still running the Gnome desktop and not KDE like “real people” do.

 

The Goal(s):

  • From my main machine, dedicate at various times one full screen (likely the main one) to the Fedora desktop. I specifically wanted the full Gnome desktop and not just a single app forwarded over X11.
  • Mouse/keyboard movement between the Windows desktop and the Fedora desktop should be seamless – whichever app/desktop has focus should receive the input. Specifically, I didn’t want to have to hit a key combination of some sort to “release” the input devices from the Fedora desktop and get back to the Windows desktop.
  • Simple integration points such as copy/paste should work seamlessly between them.

 

The solution:

  • Caveats:
    • Let me preface the following by saying I tried a number of things… a handful of Xservers such as Xming and a couple of commercial servers. I’m certain that they work to varying levels, but I didn’t have much success.
    • I also tried some options such as running DSL via qemu (the screen never looked right), VMware hosting another Linux OS as an X client into the laptop, etc. None of these worked as smoothly as I felt they should, and they all added more overhead to my host box than I was interested in giving up.
  • Current Implementation
    • Installed Cygwin from http://www.cygwin.com
    • Installed Cygwin/X from http://x.cygwin.com
    • Installed PuTTY from http://www.chiark.greenend.org.uk/~sgtatham/putty/
    • Created a PuTTY profile for my laptop and enabled X11 forwarding.
    • Open a Cygwin bash shell and type
      xwin -nodecoration -screen 0 @1
    • At this point, it will look like nothing happened. You can verify that things have started by checking your system tray for the X server icon.
    • As I understand it, this starts the XWin server, tells it not to give “Windows”-like borders to the windows it opens/displays from the laptop, and tells it to use only my first screen (my center one). If you just use the shortcut from the start menu you may (as in my case) get a window that spans all of your screens, which can be unwieldy if they run at different resolutions (the screen would work, but didn’t display properly enough to actually be usable).
    • Now, I establish a PuTTY connection (ssh) to my laptop and, after logging in, am at the bash shell.
    • At the laptop’s bash shell, I type:
      gnome-session
    • With any luck (at least in my case), I get a full Fedora/Gnome desktop on my main screen, ready to go, with all of the goals listed above met.

 

Hope that helps someone!

Tuesday
Feb 16, 2010

Walking the Talk: Cloud Transfer Tests

Last week, I bemoaned the lack of “proof” and data provenance behind scientific work and publication, so I’ve made an effort to change, at least in the way my own work is performed.

To that end, I’ve posted the methodology for the tests I’m currently running and writing about here. If you are at all interested in the results (posted subsequently), please take a moment to read the methodology so that you can interpret the results with understanding. Further, in the interest of improving the work and the resultant technology, I’m very interested in (constructive) critique of the methodology and tools (source code is available per the methodology link).

I’ve also posted the results of a few of the tests:

 

These are introductory and represent the state prior to any optimizations in transfer approaches. It will be interesting to see how these values change as we experiment with different techniques.

Friday
Feb 12, 2010

Data and Published Results in Scientific Research

I've been working on data-intensive projects recently, and I'm sure there is a point in every computational researcher's life when they begin to think about the data they are generating – how are they going to store it, how is it going to be tracked, what code/circumstances were used to create or collect it, how is it going to be associated or linked with the results, how will someone who questions the research reproduce or validate it? Most large projects plan for these sorts of issues early on in their life cycle, but for many smaller projects it seems like an afterthought – if thought about at all.

I recently reviewed a paper that was being submitted for publication. The authors, while on target with their overall thesis, backed it with some broad claims whose veracity was supported only by a few charts (with no units displayed). There was no detail regarding the number of times the tests were run to produce the resultant chart. It wasn't explained whether the values represented an average over many runs, or what level of variance was represented by the result set. There was no pointer to the raw data set, or detailed test archive, etc. The test that they had run had inherent variability in the source (Internet latency/contention), yet no explanation was given as to how this was accounted for in the published results. Essentially, as a reviewer, I was being asked to sign off on a set of assertions for which I had nothing beyond the credibility of the authors as validation. If I were simply a reader of the publication and were skeptical of the view being presented, I would have no means of learning more or accurately countering the authors' claims (assuming that the goal of scientific publication is not only the dissemination of knowledge but the constructive debate of theories leading to a community-refined understanding of reality). Maybe I am naive, but I think we can do better.

I recently attended a workshop where one of the speakers (he was a researcher at Google, but I don't remember his name/position) mentioned, almost in passing, that he and a colleague had been discussing the need for a reality in which every experiment can be reproduced and independently validated at any point in time. He quickly admitted that this was a lofty aspiration and that many hurdles would have to be overcome to make it possible, but I found myself strongly agreeing with the core sentiment. As a relative newcomer to the scientific community, I've been a bit surprised at the shroud of secrecy that most researchers place around the raw data from their work. There seems to be a prevailing desire for self-aggrandizement over fostering collaborative solutions to hard problems. I'm probably somehow missing the boat, but I find myself hoping for a scenario in which data is published early and often – critiqued and validated by others, pointing the community at large towards solutions rather than individuals towards papers.

While thinking about this problem area, I was reminded of Project Trident – an effort by Microsoft Research to solve a similar problem. As I recall, this platform bundles the variables, originating source, and resultant data together in a repository for subsequent validation and archival. I hope that they are successful in this effort and that similar tools are developed in the community. Ideally, the scientific community will embrace the “cloud” not only for large-scale compute, but also as a means to build a platform such as the one referred to earlier in this post, whereby any interested person could browse through existing experiments and re-execute them with constraints similar to the originals. Then, as the collective imagination grows, the community can experiment with other permutations or derivative works.

Friday
Jan 15, 2010

Cloud Development Best Practices and Additional Links

I noticed that Amazon posted a new white paper by Prashant Sridharan yesterday entitled “Architecting for the Cloud: Best Practices”. I pulled this down, read it, and wanted to pass it along to those who might have attended my talk yesterday. While it is somewhat slanted to Amazon’s way of thinking, it puts forth many sound and general concepts and is worth a read by anyone targeting the cloud. I do, however, find myself wondering how long it will be until we can have a paper on cloud computing technologies without feeling the need to spend the first quarter of it justifying the premise (this is not a slam on the paper… more a reflection of the current state of things in the industry).

I also wanted to post a link to Amazon’s AWS Security Whitepaper, as well as to their notice about having completed their SAS 70 Type II audit, in support of a conversation I had prior to the talk with a gentleman looking at cloud computing for some SLG (state and local government) clients he supports. Additionally, I wanted to pass along the link to the government-focused cloud application site (apps.gov), as well as to the introduction to the site provided in the webcast by US CIO Vivek Kundra.

 

Finally, I briefly discussed a whitepaper from MSR titled “The Fourth Paradigm: Data-Intensive Scientific Discovery” and wanted to provide the link to that as well.

Thursday
Jan 14, 2010

CodeMash: Azure – Lessons from the Field

I had the privilege of speaking at CodeMash today and had a blast. The attendance was good, and the conversation both before and after the session was great.

As promised, the following is the slide deck from today’s presentation:

 

And here are some links that may be of interest:

Monday
Dec 21, 2009

Time to do some digging…

I’ve been getting my test harness and reporting tools set up for some performance baselining that I’m doing relative to cloud computing providers, and when I left the office on Friday I set off a test that was uploading a collection of binary files (NetCDF files if you care) to an Azure container. I was doing nothing fancy… looping through a directory and, for each file found, uploading it to the container using the defaults for BlockBlob, then recording the duration (start/finish) and the file size for that file. The source directory contained 144 files representing roughly 58 GB of data. 32 of the files were roughly 1.5 GB each and the remainder were about 92.5 MB.

I came in this morning expecting to find the script long finished with some numbers to start looking at. Instead, what I found is that, after uploading some 70 files (almost 15 GB), every subsequent upload attempt failed with a timeout error – stating that the operation couldn’t be completed in the default 90-second time window. I started doing some digging into what was happening and so far have uncovered the following:

  • By default, the Storage Client that ships with the November CTP breaks your file up into 4 MB blocks (assuming you are using BlockBlob – which you should if your file is over the 64 MB limit).
  • The client then manages 4 concurrent threads uploading the data. As each thread completes, another is started – keeping four active for most of the run.
  • At some point Saturday afternoon (just after 12 noon UTC), the client could no longer successfully upload a 4 MB file (block) in the 90 second window, and all subsequent attempts failed.
  • I initially assumed that my computer had simply tripped up or that a local networking event caused the problem so I restarted the tool – only to find every request continuing to fail.
  • I then began to wonder if the problem was the new storage client library (not sure why), so I pulled out a tool to manage Azure storage – Cloud Storage Studio (http://www.cerebrata.com/Products/CloudStorageStudio/Default.aspx) – and noticed that I was able to successfully upload a file. I remembered that Cloud Storage Studio (by default) splits the file into fairly small blocks, so I cracked open Fiddler and began monitoring what was going on. I learned that it was using 256 KB blocks (this is configurable via settings in the app).
  • I then adjusted my upload script to set the ServiceClient.WriteBlockSizeInBytes property (ServiceClient is a property of the CloudBlockBlob object) to 256 KB and re-ran the script. This time, I had no troubles at all (other than a painfully slow experience).
  • So, I can upload data (not a service outage) but while 256K blocks work, the 4 MB blocks that worked on Friday no longer work – I’m assuming that there’s a networking issue on my end, or something in the Azure platform. To provide more clarity, I adjusted the tool again, this time using a WriteBlockSizeInBytes value of 1MB and re-ran the tool – again, seeing successful uploads.
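The block-size observations in the list above come down to simple arithmetic: a block only succeeds if it can be sent inside the 90-second window. The sketch below (in Python rather than the C# used by the harness) estimates the minimum sustained rate each block size would need. It assumes 1 MB = 2^20 bytes and 1 Mbit = 10^6 bits, and that a block has the connection to itself; with four concurrent block uploads sharing the link, the real requirement would be higher. The function name is hypothetical and is not part of the storage client.

```python
TIMEOUT_SECONDS = 90  # default per-operation window mentioned in the timeout error

KB = 1024
MB = 1024 * KB


def min_rate_mbits(block_bytes, timeout_s=TIMEOUT_SECONDS):
    """Minimum sustained rate (Mbit/s) needed to send one block before the timeout.

    Illustrative helper only: assumes the block has the whole link to itself.
    """
    return block_bytes * 8 / timeout_s / 1_000_000


# Block sizes that appear in the post
for label, size in [("256 KB", 256 * KB), ("1 MB", MB), ("2 MB", 2 * MB), ("4 MB", 4 * MB)]:
    print(f"{label:>6}: needs at least {min_rate_mbits(size):.3f} Mbit/s")
```

By this estimate a 4 MB block needs roughly 0.37 Mbit/s to finish in time, while a 256 KB block needs only about 0.02 Mbit/s, which is at least consistent with the smaller blocks continuing to work over a degraded link.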

 

While this last step was running, I thought it might be good to go back and do some crunching on the data I had so far. The first chart below shows the upload rate for the files that were successfully uploaded on Friday/Saturday, and the second shows the probability density. The mean rate was 2.74 mbits/sec with a standard deviation of 0.1968 mbits/sec. It is interesting to note that the rate showed no gradual drift toward the end of the collection of successful runs, suggesting that the “fault” was caused by something specific rather than being the result of a gradual shift or failure based on usage (imagine a scenario wherein, as more data is populated in a container, indexes slow down, causing upload speeds to trail off).

[Figure: Upload Speeds]

[Figure: Probability Density]

 

I then ran similar reports against the data from this morning’s runs. I’m still in the process of generating a full report on the data, but a representative sample shows the following: the mean upload rate was 0.15 mbits/sec with a standard deviation of 0.0375 mbits/sec. That is roughly 18x slower than Friday (2.74 / 0.15 ≈ 18.3). The data points represented below are from three batches: the first batch used a WriteBlockSizeInBytes of 256K, the second used 1MB, and the third used 2MB (10 points per size). The file upload did not succeed with the 2MB size; it finished only about a quarter of the full file.

 

[Figure: Upload Speeds]

[Figure: Probability Density]
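For readers who want to produce the same kind of summary from their own transfer logs, here is a minimal sketch in Python (not the C# tooling used to produce the charts). It assumes each transfer is recorded as a (bytes, seconds) pair and that 1 Mbit = 10^6 bits. The record layout, function names and sample values are hypothetical and are not the measured data. `statistics.stdev` is the sample standard deviation, which may differ slightly from whatever the original reports used.

```python
import statistics


def rate_mbits(size_bytes, seconds):
    """Transfer rate in Mbit/s for a single upload."""
    return size_bytes * 8 / seconds / 1_000_000


def summarize(transfers):
    """Return (mean, sample standard deviation) of the per-file rates."""
    rates = [rate_mbits(size, secs) for size, secs in transfers]
    return statistics.mean(rates), statistics.stdev(rates)


# Made-up records for illustration: (file size in bytes, duration in seconds)
sample = [(97_000_000, 283.0), (97_000_000, 290.5), (97_000_000, 276.2)]
mean, stdev = summarize(sample)
print(f"mean = {mean:.2f} Mbit/s, stdev = {stdev:.4f} Mbit/s")

# Slowdown between the two mean rates quoted in the post
slowdown = 2.74 / 0.15
print(f"slowdown = {slowdown:.1f}x")
```

The final two lines only divide the two mean rates quoted above; everything else runs on the made-up sample.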

I’ve seen a few comments from others today that indicate the slowdown may be widespread. My next course of action is to run the tests from a few different locations, to hopefully eliminate my local network as the source of the problem and to have more data with which to address the issue.

Friday
Dec 18, 2009

Automated Chart Generation

It’s late on the Friday afternoon before Christmas week, which means things are pretty quiet around the office. This quiet has the net effect of allowing me to get quite a bit done. The last few days have been very productive with respect to our research project and Azure work (more on that coming soon), which is now in full swing. We are currently working on collecting performance data from our codes running in Azure (and soon in the Amazon cloud) and are also doing some testing of transfer speeds of data both to/from the cloud as well as between compute and storage in the cloud.

I’ve been working to automate much of this testing so we can do things in a repeatable fashion as well as have something that others could run (both other users like ourselves and possibly vendors, should we come across something that requires a repro scenario). So far, running tests and generating data in CSV or XML format is pretty simple, but I found myself wanting to automatically generate charts/graphs of the data as part of the test process to allow a quick visualization of how the test performed. I spent a good bit of the day looking at old tools for command-line generation of charts (e.g. RDTool, etc.) and none of them were exactly what I was looking for – not to mention my proclivity for using C# and VS.NET tools and my desire to have something that looked refined/polished and not overly raw.

Thankfully, I stumbled upon something I should have remembered existed but simply hadn’t had the need to use before – the System.Windows.Forms.DataVisualization.Charting namespace. If you aren’t familiar with it, it was released at PDC08 and has a companion Web namespace for performing similar operations in ASP.NET applications. In my basic testing I was able to build a console application that would ingest the CSV output from my testing harness and then generate some fairly nice-looking charts based on that data. The following chart was generated from ~1800 data points and automatically includes a 50% band and a 90% band, allowing the viewer to very easily ascertain the averages and the spread of the data points. It was generated using a combination of the FastPoint and BoxPlot chart types.

[Figure: chart generated from the test data]
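The banding idea is independent of the charting library. As a rough illustration, the Python sketch below (Python rather than C#, purely for brevity) computes a 50% band and a 90% band as the central percentile ranges of the data. That is one common way to define such bands and is an assumption here, not a description of how the chart control computes them. The function name and sample values are made up for the example.

```python
import statistics


def central_band(values, coverage):
    """Return (low, high) bounds enclosing the central `coverage` fraction of values.

    coverage=0.5 gives the 25th..75th percentile range,
    coverage=0.9 gives the 5th..95th percentile range.
    """
    # quantiles(n=100) returns the 99 cut points for percentiles 1..99
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    low_pct = round((1 - coverage) / 2 * 100)
    high_pct = 100 - low_pct
    return cuts[low_pct - 1], cuts[high_pct - 1]


# Made-up sample: 0, 1, ..., 100
data = list(range(101))
print("50% band:", central_band(data, 0.5))
print("90% band:", central_band(data, 0.9))
```

Bounds computed this way could then be drawn as shaded ranges behind the raw points, which is roughly what the two bands in the chart convey.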

Friday
Dec 18, 2009

Windows Azure, Climate Data, and Microsoft Surface

I’ve been working on moving a large collection of data to, from, and around Azure as we test the data profile for scientific computing and large-scale experiment post-processing. In order to verify that the data we uploaded and processed turned out as we wanted it to, I built a simple visualization app that does a real-time query against the data in Azure and displays it. Originally the app was built as a simple WPF desktop application, but I got to thinking that it would be particularly interesting on the Surface and therefore took a day or two to port it over. The video below is a walkthrough of the app – the dialog is a bit cheesy, but the app is interesting as it provides a very tactile means of interacting with otherwise stale data.
