Entries in storage (5)

Friday
Sep242010

Maximizing Throughput in Windows Azure – Part 3

This is the third in a series of posts I’m writing while working on a paper dealing with the issue of maximizing data throughput when interacting with the Windows Azure compute cloud. You can read the first post here and the second here. This post assumes you’ve read the other two, so if you haven’t now might be a good time to at least peruse them.

Summary: Based on the work performed and detailed in the first two parts of this series, we scaled the load tests horizontally to 20 concurrent nodes to ensure that the performance characteristics of the storage platform were not overly degraded. We found that while we were able to move a significant amount of data in a relatively short period of time (500 GB in around 10 minutes – roughly 6.9 Gbs), we experienced something less than a linear scale up from what a single node could transfer (up to 39% attenuation in our tests).

Detail: After the work done in the first two stages, I decided to see what the affect of horizontal scaling would have on the realized throughput. To test this, I took the test harness (code links below) and set it for the “optimal” approach for both upload and download as determined by the prior runs (sub-file chunked & parallelized uploads combined with whole-file parallelized downloads) and then deployed it to 20 nodes and then did a parameter sweep on node size. I tested a few different methods of starting all of the nodes simultaneously (including Wade Wegner’s “Release the Hounds” approach) but settled on Steve Marx’s “pseudo code” (Wade’s words - not mine) approach as I had issues with getting multicasting on the service bus to scale using the on-demand payment model. This provided for a slightly crude (triggers were not *exactly* timed together) start time, but was more-or-less concurrent.

You can see from the following chart that my overall performance wasn’t too bad – based on the node type we saw upwards of 6Gbs download speed and around 2Gbs upload. Also consistent with our prior tests, we saw a direct relationship between node size and realized throughput rates.

While the chart above is interesting, the real question is whether or not the effective throughput was linear based on node count. The following charts compare the average results from three runs per node size of 20 concurrent nodes to the average numbers from the prior tests by node size multiplied by 20 (perfect scale). What we see is that uploads demonstrate an attenuation of between 25% and 45% while downloads taper between 18% and 39%.

Note: the XL size uploads actually demonstrated a better-than-linear scale (around 101% of linear) which is attributed to a generally good result for the three test sets for this experiment and a comparatively poor result set (likely network congestion) for the XL nodes in the prior tests. The results in this test are from an average of three runs (each run consisting of 20 nodes transferring 50 files each) – performing more runs would likely render a higher accuracy of trend data. 

Looking at the data triggered some follow-on questions such as what the attenuation curve would look like (at what node count do we stop scaling linearly) or what do the individual transfer times per file look like. This prompted me to dig a bit further into the collected data and generate some additional charts. I’ve not displayed them all here (the entire collection is available in the related resources section below), however I’ve selected a few that are illustrative of a my subsequent line of questioning.

For the downloads, most of the charts looked pretty good and we saw a distribution similar to the following two charts. As you can tell, the transfers are of similar length and the histogram shows a fairly tight distribution curve.

Uploads on the other hand, were all over the board. The following two charts are representative of some of the data inconsistencies we found. What is interesting to note here, is there there are three legs that are significantly longer (visually double) than the mean of the remainder. This would cause one to wonder if the storage platform was getting pounded, effectively placing those three nodes on hold until the pressure abated. You can see from the associated histogram that the distribution was much broader representing less consistency in transfer rates.

The previous charts got me to wondering further, so I wrote some code to generate charts (timeline) of transfers for each node within size/run collection (one chart for each of the horizontal bars in the chart above). Immediately obvious in the charts below is a bug in my data collection (my log data for the individual files was tracking the total seconds elapsed, but the end time was being recorded in minutes – resulting in the oddities (right alignment) in the bar display below – this will be fixed in future runs/posts). Ignoring the bug for a minute, the first chart is something similar to what you would expect… parallelized transfers that overlap some and stair-step over the elapsed time.

The following two charts, however, represent situations different than you would expect and illustrate what would appear to be problems in the network/transfer/nodes/my code (something). In both scenarios there are large blocks of time with apparently nothing happening, as well as individual files that apparently took significantly longer than the rest to transfer. In the next set of tests (and follow on blog post) I’ll be digging into this issue and looking to understand exactly what is happening and, hopefully, be able to explain a little bit of why.

 

Related Resources:

 

Research sponsored by the Laboratory Directed Research and Development Program of Oak Ridge National Laboratory, managed by UT-Battelle, LLC, for the U. S. Department of Energy.
Monday
Sep202010

Maximizing Throughput in Windows Azure – Part 2

[NOTE: Updated 9/23/2010. See bottom of this post for an explanation of the changes]

This is the second in a series of posts I’m writing while working on a writing a paper dealing with the issue of maximizing data throughput when interacting with the Windows Azure compute cloud. You can read the first part here. I’m still running some different test scenarios so I expect there to be another post or two in the series.

Summary: Our tests confirmed that while within the context of the Azure datacenter (intra-datacenter transfers), sub-file parallelization for downloads (Azure blob storage to Azure Compute) is not recommended (overhead is too high), whole-file-level parallelization (parallelizing the transfer of multiple complete files) does provide a significant increase in overall throughput when compared to transferring the same number of files sequentially. Also, consistent with our prior tests, the size of the VM has a direct correlation to the realized throughput.

Detail: During the testing described in Part 1, we saw that the attempts to parallelize at the sub-file level for downloads within the Azure datacenter was significantly more expensive (on average 76.8% lower throughput) as compared to direct transfers. As such sub-file parallelization is not recommended for downloads within the Azure datacenter.

As I considered these results and thought through where the bottleneck might be, I went back through and re-instrumented the test so I could get a time snap in midst of the parallel download routine at the spot after the file blocks have been downloaded and prior to reassembling the file. What I found was that roughly 50% of the time of an individual operation was consumed in network transfer while the other half was spent assembling the individual blocks into a single file. While this gave me some ideas for further optimization, the 50% time for transfer was still significantly longer than the entire non-blocked operation (by almost 50%). As such, it seemed beneficial to take a completely different approach to improving the transfer speed.

What we came up with was to parallelize at the whole file level rather than at the sub-file level. This effectively eliminated half of the prior parallelization effort cost (no reassembly) and wouldn’t involve the overhead of querying the storage platform for the size, and then issuing a collection of range-gets.

s2chart1

As you can see from the chart above, even in the worse case, there is a significant improvement in the overall throughput when files are transferred simultaneously rather than sequentially. While the individual-file transfer rate dropped (average 40.1% worse), the overall transfer rate averaged 86.21% better.

Consistent with our prior results, instance size plays a role in the bandwidth. Our tests showed an average improvement in realized throughput per step increase in instance size of 14.46% (please see following note)

Note: A review of the chart hints that the small instance size takes a significant hit in the area of total network throughput and, while this accurately reflects the data collected, the third run took abnormally longer than the first two and pulled down the total results. This can be explained by a number of different factors (e.g. heavy contention on the host for network resources). I ran the test for that scenario a 4th time to satisfy my curiosity as to whether or not the third run was reflective of a larger trend and the results of the fourth run were much closer to that of the first two. So much so that if I were to substitute the fourth run results for the third run results, the overall improvement due to parallelism raises to 89.46% and  the average improvement in throughput by step increase in node size goes to 10.75%. It is my belief that if I were to have run these tests/scenarios more, the outliers would have reduced and the results would be closer to those ignoring the 3rd run rather than including it.

Approach: Rather than doing a parameter sweep on a number of file sizes, I selected a specific file size (500 MB) of randomly generated data and executed my tests with that. For each parameter set, I ran executed 3 runs of 50 transfers each (150 total per parameter set). While the transfer time of each file was tracked, the total time transfer time (for all 50 files in the run) was the primary value being collected and represented in the charts above. It should be noted that this total time includes a little bit of time per file for tracing data so, in a scenario wherein that tracing activity was not present, the numbers above might be slightly better. I also tore down and re-published my platform between each run to increase my chances of being provisioned to different hardware nodes within the Azure datacenter and – theoretically, a different contention ratio with other instances on the same physical host. Also, I performed a run for all parameter sets before starting subsequent runs to decrease the likelihood that one parameter set would be inappropriately benefited (or harmed) by the time of day in which it was executed. In each test, a single worker role instance was run targeting a single storage account. There were no other applications or activities targeting that storage account during the tests runs. All of these tests were performed in the Windows Azure US North Central region between the dates of August 27, 2010 and September 2, 2010

Related Resources

 

NOTE: This post was updated on 9/23/2010. The changes are both substantial and not at the same time. While working on the other posts in the series, I became concerned that there were too many calculations being performed ad-hoc in Excel to get from the raw data to the charts and conclusions described here. A key goal of mine is for someone who questions my results to be able to re-run them and analyze my analysis of the data. Therefore I stepped back and generated the charts using code that shows each calculation and query. The links to the code are posted above as are links to the raw data. The charts are identical to what were here originally with the exception of some formatting changes due to the differences in generation engines. The charts are also higher-resolution and clicking on them will open the full-size version of the chart. 

 

Research sponsored by the Laboratory Directed Research and Development Program of Oak Ridge National Laboratory, managed by UT-Battelle, LLC, for the U. S. Department of Energy.
Monday
Sep132010

Maximizing Throughput in Windows Azure – Part 1

[NOTE: Updated 9/23/2010. See the bottom of this post for an explanation of the changes]

I’m working on a writing a paper dealing with the issue of maximizing data throughput when interacting with the Windows Azure compute cloud and am drafting some of that work as a couple of blog posts to help me work through my thoughts. I’m still working through some test scenarios and will have more to post later, but I wanted to get this out while it was still fresh.

I’ve posted before, that utilizing parallelized file file transfers is a great way to increase your overall throughput when externally interacting with Windows Azure, and the unsaid but possibly inferred thought was that it worked well for internal-to-Azure data movements as well. At the time I wrote the initial post I had done some testing of this scenario and had mixed results. A couple of recent papers I’ve read got me thinking about the topic again and so I started testing further with a slightly different approach and a different take on the variables.

Summary: Within the context of the Azure datacenter (intra-datacenter transfers), sub-file parallelization is not always as beneficial as it is outside the datacenter (local to azure or azure to local). Further, the size of the VM host has a significant impact on the realized throughput.

Detail: The key point I pulled from a paper I was reading (I’m sorry, I don’t have the reference at this time) was that another researcher had been doing tests in the Amazon cloud and indicated they were seeing significant deltas in throughput based on the Instance size/type they selected. Neither Microsoft nor Amazon list bandwidth as a variable associated with instance types (with the possible exception of the Amazon Cluster Compute Instance which boasts a 10Gbps network) but it stands to reason that given a physical host of a fixed size, an increase in the number of virtual hosts on that box (smaller instances) will result in a decrease in available throughput per virtual host. The inverse (scenarios with larger instances)also follows. This got me to thinking about Azure and whether or not the same would hold true, and, if so, how that would impact our recommended approach of splitting your files, transferring them  in parallel, and then reassembling them on the other side.

Approach: Rather than doing a parameter sweep on a number of file sizes, I selected a specific file size (500 MB) of randomly generated data and executed my tests with that. For each parameter set, I ran executed 3 runs of 50 transfers each (150 total per parameter set). I also tore down and re-published my platform between each run to increase my chances of being provisioned to different hardware nodes within the Azure datacenter and – theoretically, a different contention ratio with other instances on the same physical host. Also, I performed a run for all parameter sets before starting subsequent runs to decrease the likelihood that one parameter set would be inappropriately benefited (or harmed) by the time of day in which it was executed. In each test, a single worker role instance was run targeting a single storage account. There were no other applications or activities targeting that storage account during the tests runs. All of these tests were performed in the Windows Azure US North Central region between the dates of August 27, 2010 and September 2, 2010

Results: The first sweep was aimed at identifying the impact of VM size on transfer rate using the standard MS-provided storage client library (no modifications). What we found, was that, for the most part, there was a clear relationship between the VM size and the realized throughput.

s1chart2

s1chart4

The second sweep had a similar objective as the first, with the only change being that rather than using the standard/single-threaded API calls, we used the parallelized version that we developed for our external-to-Azure tests. The results were similar to the above in that the node size showed (mostly) a consistent impact on the realized throughput (keep reading past the charts if you review the following and think I’m out of my mind).

s1chart1

s1chart3

If you are still with me, you are probably wondering why the numbers for the Parallel Upload by Node Size chart look so off from the assumed behavior… The fact of the matter is that similarly to the small node standard download tests, the third run for the small node parallel upload tests experienced a radically different performance (>75% better) than the prior two runs. This was so jolting to the numbers that I actually prepared another chart showing only the first two runs of this test to illustrate the difference that the last run made in the average results:
 s1chart8

As you can tell from the above, these results are much closer to what you might expect (based on the values from the other tests above). The key take-away at this point, and the reason I am belaboring this aberration, in an environment where you are not in complete control, the performance you obtain from shared services (networks, storage clusters, etc) may vary widely in actual use.

The real question of interest, was to compare the two approaches (standard library vs. parallelized) so one could select the best one for a given scenario. The first chart showed exactly what I expected – the parallelized version was significantly better than the standard approach for all node sizes although the benefit waned as the node size increased.

s1chart5

The second chart initially caught me off guard as it illustrated that the work being done to block/download/reassemble in parallel was far less efficient than simply downloading the data.

s1chart6 My initial thoughts were that I was simply using an inefficient mechanism for reassembling the file but that the parallelized transfer was still likely faster than the stock approach but some additional instrumentation invalidated that thought. For the parallelized version, roughly 50% of the total time per file was spent in reassembling it, however even considering just the 50% spent in network transfer, it was roughly 50% longer than the stock approach (I’ll dig into that a bit more in later posts).

Therefore, from the data and tests we’ve run so far, using a blocked or chunked approach and parallelized transfers works well for external-to-Azure uploads and downloads as well as uploads (compute to blob storage) for internal-to-Azure movements. Internal-to-Azure downloads (blob storage to compute targets) should be performed using the standard/non-parallelized approach.

This last chart is designed to give an idea of the realized throughput by node for both upload and downloads using the “optimal” approach as determined via the tests detailed above.

s1chart7 As you can imagine, the results listed here triggered a number of other questions and tests. Some of these will be addressed in the next post on this topic which should be available soon.

Related Resources

 

NOTE: This post was updated on 9/23/2010. The changes are both substantial and not at the same time. While working on the other posts in the series, I became concerned that there were too many calculations being performed ad-hoc in Excel to get from the raw data to the charts and conclusions described here. A key goal of mine is for someone who questions my results to be able to re-run them and analyze my analysis of the data. Therefore I stepped back and generated the charts using code that shows each calculation and query. The links to the code are posted above as are links to the raw data. The charts are identical to what were here originally with the exception of some formatting changes due to the differences in generation engines. The charts are also higher-resolution and clicking on them will open the full-size version of the chart.

 

Research sponsored by the Laboratory Directed Research and Development Program of Oak Ridge National Laboratory, managed by UT-Battelle, LLC, for the U. S. Department of Energy.
Monday
Dec212009

Time to do some digging…

I’ve been getting my test harness and reporting tools setup for some performance baselining that I’m doing relative to cloud computing providers and when I left the office on Friday I set off a test that was uploading a collection of binary files (NetCDF files if you care) to an Azure container. I was doing nothing fancy… looping through a directory, for each file found, upload to the container using the defaults for BlobBlock and then record the duration (start/finish) for that file and the file size. The source directory contained 144 files representing roughly 58 GB of data. 32 of the files were roughly 1.5 GB each and the remainder were about 92.5 MB.

I came in this morning expecting to find the script long finished with some numbers to start looking at. Instead, what I found is that, after uploading some 70 files (almost 15 GB), every subsequent upload attempt failed with a timeout error – stating that the operation couldn’t be completed in the default 90-second time window. I started doing some digging into what was happening and so far have uncovered the following:

  • By default, the Storage Client that ships with the November CTP breaks your file up into 4 MB blocks (assuming you are using BlobBlock – which you should if your file is over the 64 MB limit.
  • The client then manages 4 concurrent threads uploading the data. as each thread completes, another is started – keeping four active most the entire time.
  • At some point Saturday afternoon (just after 12 noon UTC), the client could no longer successfully upload a 4 MB file (block) in the 90 second window, and all subsequent attempts failed.
  • I initially assumed that my computer had simply tripped up or that a local networking event caused the problem so I restarted the tool – only to find every request continuing to fail.
  • I then began to wonder if the problem was the new storage client library (not sure why) so I pulled out a tool to manage  Azure storage – Cloud Storage Studio (http://www.cerebrata.com/Products/CloudStorageStudio/Default.aspx) and noticed that I was able to successfully upload a file. I remembered that CSS (by default) splits the file into fairly small blocks, so I cracked open Fiddler and began monitoring what was going on. I learned that it was using 256 KB blocks (this is configurable via settings in the app).
  • I then adjusted my upload script to set the ServiceClient.WriteBlockSizeInBytes property (ServiceClient is a property of the CloudBlockBlob object) to 256k and re-ran the script. This time, I had no troubles at all (other than a painfully slow experience).
  • So, I can upload data (not a service outage) but while 256K blocks work, the 4 MB blocks that worked on Friday no longer work – I’m assuming that there’s a networking issue on my end, or something in the Azure platform. To provide more clarity, I adjusted the tool again, this time using a WriteBlockSizeInBytes value of 1MB and re-ran the tool – again, seeing successful uploads.

 

While this last step was running, I thought it might be good to go back and do some crunching on the data I had so far. The following chart represents the uploads rate from the files that successfully were uploaded on Friday/Saturday followed by the a chart showing the probability density. The mean rate was 2.74 mbits/sec with a standard deviation of 0.1968. It is interesting to note that there was no upward drift at the end of the collection of successful runs, indicating that more than likely, the “fault” was likely caused by something specific rather than being the result of a gradual shift or failure based on usage (imagine a scenario wherein as more data is populated in a container, indexes slow down, causing upload speeds to trail off).

UploadRate

Upload Speeds [click image for full size]

UploadRateStdDev_2

Probability Density [click image for full size]

 

I then ran similar reports against the data I from this morning’s runs. I’m still in the process of generating a full report on the data, but a representative sample shows the following: The mean upload rate was 0.15 mbits/sec with a standard deviation rate of 0.0375. This is over 17x slower than Friday. This data points represented below are for three batches – the first batch used a WriteBlockSizeInBytes of 256K, the second used 1MB, and the third used 2MB (10 points per size). The file upload did not succeed with the 2MB size – only finished about 1/4th of the full file.

 

uploadSpeeds

Upload Speeds [click image for full size]

UploadRateStdDev_3

Probability Density [click image for full size]

I’ve seen a few comments from others today that indicate the slow down may be widespread – My next course of action is to attempt to run the tests from a few different locations to hopefully eliminate my local network as the problem set and have more data with which to address the issue.

Friday
Aug142009

AtomPub, JSON, Azure, and Large Datasets

UPDATE 8/20/2009, 15:29 EST: There is some confusing content in this post (i.e. Azure storage doesn’t support JSON). A follow up to this post with further explanation/detail is available here

--

UPDATE 8/14/2009, 17:16 EST: @smarx pointed out that this post is a bit misleading (my word) in that Azure storage doesn’t support JSON. I have a web role in place that serves the data, which, upon reflection could be introducing some time delays into the Atom feed. I will test further and update this post.

----

I’m just really beginning to scratch the surface on my work on cloud computing and scientific computing but it seems that nearly every day I’m able to spend time on this I come away with something at least moderately novel. Today’s observation is, on reflection, a bit of a no-brainer but it wasn’t immediately obvious to me.

I’m kicking around some scientific data and have a single collection of data, with somewhere north of 40,000 subsets of data, with each subset containing roughly 8,100 rows. I’ve had an interesting time getting this data into Azure tables, but that’s not the point of this post. Once the data resided in Azure, I built a little ADO.NET client to pull the data down based on various queries (in my case, a single “subset” at a time, or 8,100 rows). In case you are wondering, the data is partitioned in Azure based on the key representing each subset, so I know that each query is only hitting a single partition. I proceeded to follow the examples for paging and data calls (checking for continuation tokens, etc.) and it wasn’t long before I had a client that would query for a particular slice of data, and then make however many individual data calls necessary until the complete result set was downloaded and ready for processing. I was, however, disappointed in the time it took to pull down a single slice of data… averages were around 55 seconds. Pulling down a number of slices of the dataset, at nearly a minute each, was a bit slow.

I spent some time poking around with Fiddler and some other tools and discovered that I was suffering from XML bloat. The “convenience” factor of having a common, easy-to-consume format was killing me. Each response coming back from the server (1000 rows) was averaging over 1MB of XML.

After a while of kicking around my options (frankly, too long), I decided to try pulling the data as JSON. I hadn’t used JSON previously, but had heard it touted as being very lightweight. I also found some nice libraries on CodePlex for de-serializing the response so I could use it as I had the results of the Atom feed. Once I made this change, I was shocked to see the amount of improvement (I expected some, but what I saw was much more than anticipated). My average time dropped to around 14 seconds for the entire batch and the average size of the response body dropped to about 163k.

 

I’ve included some charts below showing the results of my tests. I ran the tests 10 times for each of the approaches. Code base for each test harness is identical with the exception of the protocol-specific text handling. Time measurements are from the start of the query through the point that each response has been de-serialized into an identical .NET object (actually, a List<T> objects).

 

These first two charts show the time required to retrieve the entire slice of data. The unit of measure is seconds.

 

chart04

 

chart05

 

These charts show the average payload returned per request for the two different methods. In both cases, the unit of measurement is bytes.

chart06

 

chart07