msgbartop
random musings and walks through code
msgbarbottom

16 Feb 10 Walking the Talk: Cloud Transfer Tests

Last week, I bemoaned the issue of “proof” and data provenance behind scientific work and publication and, to that end, I’ve made an effort to change, at least in the way in which my work is performed.

To that end, I’ve posted the methodology to the tests I’m currently running and writing about here. If you are at all interested in the results (posted subsequently) please take a moment to read the methodology so that you can interpret the results with understanding. Further, in the interest of improving the work and resultant technology, I’m very interested in (constructive) critique of the methodology and tools (source code is available per the methodology link).

I’ve also posted the results of a few of the tests:

 

These are introductory and represent the state prior to any optimizations in transfer approaches. It will be interesting to see how these values change as we experiment with different techniques.

15 Jan 10 Cloud Development Best Practices and Additional Links

I noticed that Amazon posted a new white paper by Prashant Sridharan yesterday entitled “Architecting for the Cloud: Best Practices”. I pulled this down, read it, and wanted to pass it along to those who might have attended my talk yesterday as, while it is somewhat slanted to Amazon’s way of thinking, there are many sound and general concepts put forth in this doc and it is worth a read by anyone targeting the cloud. I do, however, find myself wondering, how long it will be until we can have a paper on cloud computing technologies without feeling the need to spend the first quarter of it justifying the premise (this is not a slam on the paper… more a reflection of the current state of things in the industry).

I also wanted to post a link to Amazon’s AWS Security Whitepaper as well as to their notice about having completed their SAS70 Type II Audit in support of a conversation I had prior to the talk with a gentleman looking at cloud computing for some SLG clients he supports. Additionally, the link to the government-focused cloud application site (apps.gov) as well as to the introduction to the site provided in the webcast by US CIO Vivek Kundra.

 

Finally, I briefly discussed a whitepaper from MSR titled “The Fourth Paradigm: Data-Intensive Scientific Discovery” and wanted to provide the link to that as well.

14 Jan 10 CodeMash: Azure – Lessons from the Field

I had the privilege of speaking at CodeMash today and had a blast. The attendance was good, and the conversation both before and after the session was great.

As promised, the following is the slide deck from today’s presentation:

 

And here are some links that may be of interest:

18 Dec 09 Windows Azure, Climate Data, and Microsoft Surface

I’ve been working on moving a large collection data to, from, and around Azure as we are testing the data profile for scientific computing and large-scale experiment post-processing and, in order to verify the data we uploaded and processed turned out as we wanted tit to, I built a simple visualization app that does a real-time query against the data in Azure and displays it. Originally the app was built as a simple WPF desktop application, but I got to thinking that it would be particularly interesting on the Surface and therefore took a day or two to port it over. The video below is a walkthrough of the app – the dialog is a bit cheesy but the app is interesting as it provides a very tactile means of interacting with otherwise stale data.

14 Sep 09 HUNTUG Meeting 090914

I have the distinct privilege of speaking at the Huntsville (AL) New Technology Users Group meeting this evening on the topic of Windows Azure: Lessons from the field. Beyond the content of the talk I’ll be sharing a handful of links during the talk and wanted to ensure that the slides and links are here and available for those who attended the session.

Slides:

 

Links:

03 Sep 09 “Live” Monitoring Of Your Worker Roles in Azure

gauge

I’ve been working for a bit on some larger-scale jobs targeting the Windows Azure platform and early last week had assembled a collection of worker roles that were supposed to be processing my datasets for a number of days moving forward. Unfortunately, they wouldn’t stay running. As always, they “worked on my machine”, so I naturally assumed that the problem was with the Azure platform :) . I then proceeded to do what I thought was the correct action… go to the Azure portal and request that the logs be transferred to my storage account so I could review them and fix the problem. What I learned, is that there were two problems with this solution:

  1. The time delay between requesting the logs and actually being able to review them is prohibitive for productive use. In my experience, the minimum turn around was 20 minutes and was most often 30 or longer. I’m not sure why this was happening – is this by design, or a temporary bug, or an artifact of the actual problem with my code, or what, but I know it was too long.
  2. Logs appear to get “lost”. In my scenario, my worker roles were throwing an exception that was un-caught my by code. Near as I can tell, when this happens, the Azure health monitoring platform assumes that the world has come to an end, shuts down that instance, and spins a new instance. While this (health monitoring and auto-recovery) is a good thing, one side effect (caveat is the fact that this is my experience and may not be reality) is that the logs were stored locally and, when the instance was shutdown/recycled, those log files went to the great bit-bucket in the sky. I was stuck in a failure mode with no visibility as to what was going wrong nor how to fix it.

After pounding my head for a bit, I came up with the following solution – trap every exception possible and use queues. The first aspect allowed my worker roles to stay running. This may not always be the right answer, but for my use case, I adapted my code to handle the error cases and trapping / suppressing all exceptions proved to be a good answer. Further, doing so allowed me to grab the error message and do something interesting with it.

The second step (using queues) solved the (my) impatience problem. I created a local method called WriteToLog that did two things: write to the regular Azure log, and write to a queue I created called status (or something similarly brilliant). I replaced all of my “RoleManager.WriteToLog()” calls with calls to the local method and I then wrote a console app that would periodically (every few seconds) pop as many messages as it could (API-limited to 32) off of the status queue, dump the data as a local csv for logging and write the data to the screen. This allowed me to drastically reduce the feedback loop between my app and me, enabling me to fix the problems quickly.

There are certainly some downsides to this approach (do queues hit a max?, what is the overhead introduced by logging to a queue, once a message is dequeued, it is not available for other clients to read, etc), but it was a nice spot fix. A better implementation would have a flag in the config file or something similar that would control the queue-logging.

As you can see from the image above, I also wrote a little winform app to display the approximate queue length so I’d have an idea of the progress and how much work remained.

20 Aug 09 AtomPub, JSON, Azure and Large Datasets, Part 2

Last Friday I posted some initial results from some simplistic testing I had done comparing pulling data from Azure via ATOM (the ADO.NET data services client) and JSON. I was surprised at the significant difference in payload and time to completion. A little later, Steve Marx questioned my methodology based on the fact that Azure storage doesn’t support JSON. Steve wasn’t being contrary, but rather pushing for clarification to the methodology of my testing as well as a desire to keep people from attempting to exploit the JSON interface of Azure storage when none exists. This post is a follow up to that one and attempts to clarify things a bit and highlight some expanded findings.

The platform I’m working against is an Azure account with a storage account hosting the data (Azure Tables), and a web role providing multiple interaction points to the data, as well as making the interaction point anonymous. Essentially, this web role serves as a “proxy” to the data and reformats it as necessary. After Steve’s question last week, I got to wondering particularly about the overhead (if any) the web role/proxy was introducing and if, esp. in the case of the ATOM data, it was drastically affecting the results. I also got to wondering if the delays I was experiencing in data transmission were, in some part, caused by the fact of having to issue 9 serial requests in order to retrieve the entire 8100 rows that satisfied my query.

To address these issues, I made the following adjustments:

  1. Tweaked my test harness for ATOM to optionally hit the storage platform directly (bypassing the proxy data service).
  2. Tweaked the data service to allow an extra query string parameter to indicate that the proxy service should make as many calls to the data service as necessary to gather the complete result set and then return the results as a single batch to the caller. This allowed me to eliminate the 1000 row limit as well as to issue only a single HTTP request from the client.
  3. I increased the test runs from 10 to 20 – still not scientifically accurate by any means, but a bit longer to provide a little better sense of the average lengths for each request batch.

The results I received as follows and not altogether different than one might expect:

chart01

chart02

 

As you can see from the charts above, the JSON FULL option was the fastest with an average time to completion of 14.4 seconds. When compared to the regular JSON approach, you can infer that the overhead introduced from multiple calls is roughly 4 seconds (18.55 average time to completion).

In the ATOM category, I find it interesting that the difference between the ATOM Direct (directly to the storage service) was only marginally faster (0.2 of a second on average) than the ATOM FULL approach. This would indicate that the network calls between the web role and the storage role are almost a non-factor (hinting at rather good network speeds). Remember, in the case of ATOM Full, the web role is doing the exact same thing as the test client is doing in Atom Direct, but additionally bundling the XML response into a single blob (rather than 9) and then sending it back to the client.

The following chart shows the average payload per request between the test harness and Azure. Atom Full is different then Atom and Atom Direct in that the former is all 8,100 rows whereas the later two represent a single batch of 1000. It is interesting to note that the JSON representation of all 8,100 records is only marginally larger than the ATOM representation of 1,000 records (1,326,298 bytes compared to 1,118,892 bytes).

chart03

At the end of the day, none of this is too surprising. JSON is less verbose in markup than ATOM and would logically be smaller on the wire and therefore complete sooner (although I wouldn’t have imagined it was a factor of 9 difference). What is interesting, is that the transfer of data b/t the data layer and the web role is almost trivially fast (remember, that 9 MB of XML moved between the layers and was then reformatted as JSON and shoved down back to the client in 14 seconds). It further makes you wonder what the performance improvement would/could be if Azure storage exposed a native JSON interface…

14 Aug 09 AtomPub, JSON, Azure, and Large Datasets

UPDATE 8/20/2009, 15:29 EST: There is some confusing content in this post (i.e. Azure storage doesn’t support JSON). A follow up to this post with further explanation/detail is available here

UPDATE 8/14/2009, 17:16 EST: @smarx pointed out that this post is a bit misleading (my word) in that Azure storage doesn’t support JSON. I have a web role in place that serves the data, which, upon reflection could be introducing some time delays into the Atom feed. I will test further and update this post.

—-

I’m just really beginning to scratch the surface on my work on cloud computing and scientific computing but it seems that nearly every day I’m able to spend time on this I come away with something at least moderately novel. Today’s observation is, on reflection, a bit of a no-brainer but it wasn’t immediately obvious to me.

I’m kicking around some scientific data and have a single collection of data, with somewhere north of 40,000 subsets of data, with each subset containing roughly 8,100 rows. I’ve had an interesting time getting this data into Azure tables, but that’s not the point of this post. Once the data resided in Azure, I built a little ADO.NET client to pull the data down based on various queries (in my case, a single “subset” at a time, or 8,100 rows). In case you are wondering, the data is partitioned in Azure based on the key representing each subset, so I know that each query is only hitting a single partition. I proceeded to follow the examples for paging and data calls (checking for continuation tokens, etc.) and it wasn’t long before I had a client that would query for a particular slice of data, and then make however many individual data calls necessary until the complete result set was downloaded and ready for processing. I was, however, disappointed in the time it took to pull down a single slice of data… averages were around 55 seconds. Pulling down a number of slices of the dataset, at nearly a minute each, was a bit slow.

I spent some time poking around with Fiddler and some other tools and discovered that I was suffering from XML bloat. The “convenience” factor of having a common, easy-to-consume format was killing me. Each response coming back from the server (1000 rows) was averaging over 1MB of XML.

After a while of kicking around my options (frankly, too long), I decided to try pulling the data as JSON. I hadn’t used JSON previously, but had heard it touted as being very lightweight. I also found some nice libraries on CodePlex for de-serializing the response so I could use it as I had the results of the Atom feed. Once I made this change, I was shocked to see the amount of improvement (I expected some, but what I saw was much more than anticipated). My average time dropped to around 14 seconds for the entire batch and the average size of the response body dropped to about 163k.

 

I’ve included some charts below showing the results of my tests. I ran the tests 10 times for each of the approaches. Code base for each test harness is identical with the exception of the protocol-specific text handling. Time measurements are from the start of the query through the point that each response has been de-serialized into an identical .NET object (actually, a List<T> objects).

 

These first two charts show the time required to retrieve the entire slice of data. The unit of measure is seconds.

 

chart04

 

chart05

 

These charts show the average payload returned per request for the two different methods. In both cases, the unit of measurement is bytes.

chart06

 

chart07

05 Aug 09 Silverlight and Azure Table Data Paging

I’m playing around with a data visualization app using Silverlight and data hosted in Azure Tables and have been learning quite a bit in the process. Firstly, Azure tables only allows you to return 1000 records in a given query. If you issue a query that has a larger matching result set, Azure will return some extra headers indicating as such (x-ms-continuation-NextPartitionKey and x-ms-continuation-NextRowKey). It wasn’t hard to find an example of data paging using Azure table data, however it used the Execute() method of the DataServiceQuery object. Unfortunately, this isn’t available in Silverlight as you have to use the asynchronous methods (BeginQuery and EndQuery). I’m a bit slow, and for whatever reason translating the MSDN sample for synchronous to the asynchronous model took me longer than it should have. I’m posting this so that maybe the next person will find this, get the answer they need, and move on and not waste the same amount of time I did.

My button event handler looks quite a bit different from the MSDN sample but is pretty easy to figure out:

code01

I instantiate the context, create a query based on that context, cast that query as a DataServiceQuery<t> and then call the BeginExecute method passing my callback method and the query as my state object. (Note: in case you are wondering about the Where clause in the query above, I know that all of the data that matches the first conditional is located in a given partition within the Azure table and have found that specifying the partition greatly increases the performance).

My callback method (ProcessDataRequest) does a bit of recursion to support the unknown number of subsequent calls needed to retrieve all of the matching records.

The contents of the ProcessDataRequest method are listed below:

 

code02

Note that unlike some samples that are simply focusing on async data calls and don’t handle paging via headers, I cast the output of EndExecute as a QueryOperationResponse object which allows me to subsequently access the headers and interrogate them for the continuation keys. If I find the continuation keys, I create a new query object, set the additional query options, and execute it in the same fashion as the original call.  The AddPointsToScreen method simply processes the values and renders them as polygons to the screen. I’ve included it here not because there is anything special in it, but rather for completeness.

code03

30 Jul 09 Azure Blob Storage Blob IDs and “+”

I’ve been kicking the tires on Azure’s blob storage and am working on uploading a 1.2GB+ NetCDF file. I stumbled across a couple of samples online that were very helpful in avoiding the de facto client library that ships with the SDK however I found myself bit by something (likely due to my error somehow) that I thought I’d pass along.

When processing a larger file, my upload process would always fail at block #248. At first, I assumed it was a network transience issue and simply re-ran the upload, however, after having it fail on the exact same block 3 times, I decided that it wasn’t the network. In digging a bit into things, I found that the problem had to do with the encoding of the block IDs. The offending piece of code is here:

code04

 

where i is an integer representing the index of the current block within the file and blockIds is an array of IDs used to build the block ID list as part of a putBlockList operation.

The Azure SDK would indicate that this code snippet is perfectly valid… block IDs need to be a base 64-encoded string uniquely identifying the block within the blob. Further, each ID (within a blob) must be of the same length prior to encoding (same number of bytes). In this scenario, BitConverter.GetBytes returns a 4-byte array of values for all numbers within the range (in my case, 0 – 314). The following is an example of the resulting string for some numbers:

  • 246: 9gAAAA==
  • 247: 9wAAAA==
  • 248: +AAAAA==

There continues about 4 that begin with a ‘+’ sign, and a similar number that begin with ‘\’. Every other index in my collection began with a normal alpha character. After doing some poking around I found some indications that others were having similar problems and went down the path of encoding the line differently (i.e. HttpServerUtility.UrlTokenEncode, etc) to no avail. What I ended up with is simply prefixing my values with a standard “safe” string (“BlockId”)

code05

This yielded a blockId that was unique, consistent length (notice the formatting of the indexer in the ToString() method), and “safe” in that it always began with a URL-safe character.

I’m certain that there is likely a better way to solve this problem, but this did the trick for me and maybe it will be helpful to someone else.