Saturday
Feb022013

Big Data 101: Handling Millions of Files

I've been talking a bit recently with members of my team about some of the basic tools that need to be in any data scientist's toolbox. Things that, if you want to lay any claim to working with "big data", should be second nature. Many of these things are not terribly complicated, nor does one have to be overly clever to employ them. However, not knowing when to apply them could cost you dearly (lost time, lost data, needless system maintenance, etc.).

The first such topic came up a week or so ago when one of our younger team members mentioned that his machine fell over after he had written around 3,000,000 files to the same directory. This reminded me of a lesson I learned back in late 2000 when I was working with Microsoft on the "Million Mailbox March" and the MCIS mail platform. MCIS contained a mail platform designed by Microsoft for the ISP industry (later replaced by Exchange). This mail platform used an interesting approach to store the potentially millions of mailboxes it housed on the file system. Similar approaches can be (and often have been) applied to modern-day storage problems within the Big Data space.

So, I came up with the following exercise/challenge for my students and colleagues - I hope you find it interesting. If you've faced a similar problem in the past, you are likely jumping to solutions and know exactly how you would solve it. It will be interesting to see the solutions presented by our team. I'll post any particularly interesting ones here.

Challenge #1: Handling Millions of Files.
Design and implement a solution for storing 100,000,000 files on a "normal" file system (NTFS, ext4, etc.). The solution should be tested/verified and should be reasonably balanced. The system should provide a naming convention that ensures against collisions. The solution should function properly on Windows, Mac and Linux. Finally, you need to be able to explain the reasoning behind each design decision implemented in your solution. 

Deliverables:

 

  • A short writeup defining your approach and any incremental steps along the way. Remember: details around interim "failed" attempts are as important as the final solution.
  • All code used in your solution. By "all" I mean everything necessary to recreate your scenario and solution. This includes any means you used to measure and analyze your results.
  • Timing of the overall activity is important. For example: how long did it take for you to create the file set and analyze the results? While speed of operation is not the primary goal for this exercise (efficacy is), timing information is always informative.
  • Extra credit is given for striking a clear balance between robustness and simplicity.

 

Support Files:

 

  • This exercise does not require any initial data sets.

 

 Assumptions:

  • Disk space should not be an issue during this experiment.
  • The solution can (and should) assume that the target file system (NTFS/ext4/etc.) is of sufficient size to house the files/data in a contiguous set and single namespace.
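
For readers who want a starting point, here is a minimal sketch of one common approach: spread files across a fixed tree of subdirectories chosen by hashing the file name. This is not the MCIS design (which isn't described here), and the function names, the two-level layout, the `.dat` suffix, and the use of SHA-256 and random UUIDs are all assumptions made for illustration.

```python
import hashlib
import os
import uuid


def shard_path(root, name, levels=2, width=2):
    """Map a file name to root/ab/cd/name using the hex digest of the name."""
    digest = hashlib.sha256(name.encode("utf-8")).hexdigest()
    # Take `levels` chunks of `width` hex characters as directory names.
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return os.path.join(root, *(parts + [name]))


def new_name(suffix=".dat"):
    """Generate a collision-resistant name from a random UUID (an assumption)."""
    return uuid.uuid4().hex + suffix


def store(root, data, suffix=".dat"):
    """Write `data` under a freshly generated name and return the full path."""
    name = new_name(suffix)
    path = shard_path(root, name)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    # Mode "xb" refuses to overwrite, so a name collision raises an error
    # instead of silently destroying an existing file.
    with open(path, "xb") as f:
        f.write(data)
    return path
```

With two levels of two hex characters there are 256 × 256 = 65,536 leaf directories, so 100,000,000 files would average roughly 1,526 per directory if the hash spreads names evenly. Whether that is the right balance for a given file system is exactly the kind of thing the exercise asks you to measure.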

 

 

Thursday
Nov012012

Would you like a Cassette for your data?

A colleague of mine sent me a link to this story about using cassettes for storing data and asked for my thoughts. This was clearly a good-natured jab at me in light of a prior conversation we had debating the appropriateness of tape or disk for a research data storage platform. I had been arguing against tape as an out-moded and inappropriate storage mechanism.

The problem is… I was wrong.

And so was he.

I am using the word “wrong” not in the moral sense but to mean “less than the ideal”, “unfortunate”, “sad”, “depressing”, <enter your own term here>.

You see, I was wrong in that as he argued (and is well articulated in the article) there is simply too much data being generated to make disks a tractable solution. Even with recent and projected growth in hard drives (60TB expected by 2016) our ability to produce data – particularly by automated means via sensors and scientific instruments – already outpaces, and will continue to outpace, these advancements. Even if hard drives were able to keep up with the space demands, the power required to keep those drives running quickly becomes prohibitive. And let’s not even talk about disk transfer rate issues.

Whether or not he was “wrong” is probably a bit more subjective (an admission I’m certain he’ll enjoy). My frustration with his position is that it tends to be synonymous with “slow” or “laborious”. That said, an unfortunate reality in the digital sciences is that with data sets of any significant size, a researcher often has to plan well in advance of his experiment to stage the data. Data has to be loaded into online storage from some cold storage mechanism (often a tape library). This significantly limits one’s curiosity and causes some questions to go unanswered (i.e. “I wonder if… well, it’s probably not worth the time/effort to load up the data just to see…”).  I suppose that if I’m honest, I’m simply manifesting the “Google effect” – the expectation that I can ask a question of tons of data and get an answer instantly. I’d love for this to be possible for all scientific data – but as any data scientist will tell you, that desire is simply naïve. Providing platforms such as Google’s is hard work, and not achieved without significant planning and effort. Admitting this still doesn’t mean I can’t hope for it…

The real nugget buried in the article and underlying our friendly debate, is that storage technologies are nowhere close to where we need them to be. No matter which option we choose it will be a compromise between a.) discarding data – unfortunate no matter how you look at it, b.) spending astronomical amounts of money on both hardware and power, or  c.) using slow, offline, and deterioration-prone devices such as tapes. Frankly, none of these options are attractive. There is a little part of me that dies when I think of a researcher having to choose to discard data simply because he doesn’t have space to store it and doesn’t currently think it is important to his work. What if he’s wrong? What if that data is (or was) the key to solving part of the problem, he just didn’t know it yet? Or maybe it is the key to solving a problem he doesn’t yet know he has…

So here’s hoping that the researchers working on storage technologies will be successful. That they will develop means and methods for us to store massive amounts of data, access it in increasingly shorter times, and with a power envelope that is reasonable and sustainable.  That’s not asking too much, is it?

Monday
Oct152012

Debugging and Reversing Basics 0.01

[Note: the title is what it is because I consider myself a n00b at this and these are likely things anyone else already knows]

I'm always trying to learn more and recently topics in the infosec world have garnered my attention. To further my understanding of the space, I've been reading a bit and this weekend was reading part of Gray Hat Python: Python Programming for Hackers and Reverse Engineers and found myself purposefully doing that which you shouldn't do when learning a new subject: not following the instructions. You see, the author specifically indicates that the samples were written on/tested on a Windows x86 machine and his assumption is that you will be running on the same. In my case, I haven't run a 32-bit OS in years (since Vista was released) and I made two assumptions: 1.) It probably doesn't matter that much and 2.) even if it does, it's probably a good thing to learn what the differences between 32- and 64-bit debugging/reversing are. Well, after a few hours of playing around, I can tell you the first assumption was flat wrong and the second is probably accurate.

The fun begins in chapter 3 where you build a simple debugger. I got stuck on the very first step which was a simple demonstration of attaching to an existing process (calc.exe). I would run the script and simply get an error: "[*] Unable to attach to the process." I figured I must have done something wrong, so I poked around a bit and even diff'd my code against the reference and still didn't see any important differences. As a side note: If you've not yet looked at the errata for the book, you need to do so. There are a number of code/bug fixes that are required to get things working.

The key came from a blog post I came across written by A. H. where he hints that the problem may have to do with the architecture of the application I am attempting to attach to. He suggests adjusting the error line in the script as follows:

print "[*] Unable to attach to the process. %s" % FormatError(kernel32.GetLastError())
and if the error ends with "The request is not supported" you can rest assured that your problem is a 32/64 bit issue. Unfortunately, A.H.'s solution was to simply use a 32-bit box for the rest of the testing.

What Version of Python am I running?

The next thing that occurred to me was to determine which version of Python (bit-ness) I was running. A simple search brought up Ned Deily's answer on Stack Overflow, which indicated that one simple way to check would be to run the following:

python -c "import struct;print( 8 * struct.calcsize('P'))"

You will get either 32 or 64 as a result - in my case, 32. Great. So I know that my debugging thread is a 32-bit application, what is the image type of calc.exe?

DumpBin

Some poking around led me to an article by Frank Chism on the Windows HPC blog that pointed to being able to run a tool called dumpbin to see if an exe was 32 or 64 bit. I followed the instructions on his post, opened a VS 2010-enabled command shell and typed the following:

dumpbin /headers c:\Windows\system32\calc.exe|findstr "magic machine"

Which resulted in:

14C machine (x86)
    32 bit word machine
10B magic # (PE32)

Ok... so my debugging thread is 32-bit, and the executable that I'm running is 32-bit, so why am I unable to attach to the process?

Process Explorer

At this point, I pulled up the trusty Sysinternals Process Explorer to see if it would shed any light on the issue. From within Process Explorer, if you select the View menu and then click on "Select Columns" you can tick the box for "Image Type", which will show the image type of each running process/executable. And, after quickly checking, I see that calc.exe is running as a 64-bit image. How in the world is this happening?

WOW64

64-bit Windows has a feature called the File System Redirector which seems to be the root of my issues. If I understand how this works (dubious), this is a layer within the OS that "magically" redirects you to the proper version of the application based on the calling process. For example, if a 64-bit process attempts to open the 64-bit image of calc.exe (located in C:\Windows\System32), it will work just fine. However, if a 32-bit process attempts to do the same thing, it will get magically redirected to the 32-bit version of the application which is located in C:\Windows\SysWOW64 (don't even ask why the folders are named the way they are based on the versions of the applications that they house). What this means is that if you simply hit Windows+R and type calc, you are calling it from a 64-bit process (the shell) and therefore you get the 64-bit version of the application. If, however, you reference calc.exe from a 32-bit process (e.g. dumpbin), you get redirected to the 32-bit version.

If you specifically need the 32-bit version (as I did to complete my testing), you can open a command prompt, navigate to c:\Windows\SysWOW64 and then type calc.exe or, you can have it launched from any 32-bit process. To see this second option in action, open a command prompt, navigate to c:\windows\SysWOW64 and then type cmd.exe. Via Process Explorer you can confirm that you are running a 32-bit version of cmd.exe. Then navigate wherever you'd like (e.g. c:\) and then type calc.exe. You will now get the 32-bit version of the application (and can confirm it in Process Explorer).

From here, I can attach to the process (calc.exe as 32-bit) from my python code. This moves me forward a bit but doesn't solve the "how do I bind to the 64-bit image" question. That will be a problem for another day.
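
As a footnote to the dumpbin step, the same two header fields can be read with a few lines of Python, which is handy when no Visual Studio command shell is around. This is a rough sketch: the function and dictionary names are my own, and only the two machine types and two magic values relevant here are decoded. Note that a 32-bit Python asking for C:\Windows\System32\calc.exe is subject to the same redirection that fooled dumpbin.

```python
import struct

# Only the values relevant to this post; anything else is shown as raw hex.
MACHINES = {0x014C: "x86", 0x8664: "x64"}
MAGICS = {0x010B: "PE32", 0x020B: "PE32+"}


def pe_info(path):
    """Return (machine, magic) read from the PE headers of the file at `path`."""
    with open(path, "rb") as f:
        dos = f.read(64)
        if len(dos) < 64 or dos[:2] != b"MZ":
            raise ValueError("not an MZ executable")
        # The offset of the PE signature is stored at 0x3C in the DOS header.
        (pe_offset,) = struct.unpack_from("<I", dos, 0x3C)
        f.seek(pe_offset)
        # 4-byte signature + 20-byte COFF header + 2-byte optional-header magic.
        head = f.read(26)
    if len(head) < 26 or head[:4] != b"PE\0\0":
        raise ValueError("PE signature not found")
    (machine,) = struct.unpack_from("<H", head, 4)
    (magic,) = struct.unpack_from("<H", head, 24)
    return machine, magic


def describe(path):
    """Format the header fields in roughly the way dumpbin reports them."""
    machine, magic = pe_info(path)
    return "%s machine, %s magic" % (
        MACHINES.get(machine, hex(machine)),
        MAGICS.get(magic, hex(magic)),
    )
```

Calling `describe()` on both the System32 and SysWOW64 copies of calc.exe from a 64-bit Python should make the difference between the two images visible.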

 

 

Friday
Aug312012

DevLink: Wireless Network Security

On Wednesday I had the privilege of speaking at DevLink on the topic of wireless network security. I had a great time giving the talk and had great audience participation (including some who were unknowing victims of my man-in-the-middle attack). The slides from the talk are posted below.

 

Tuesday
Jul102012

Windows 8 Release Preview on Samsung Slate

I've been playing with Windows 8 on a Samsung 700T1A slate for a number of months and was quite excited when the Release Preview was announced and attempted to install it straight away. Unfortunately, I was unable to get it installed right away and set it aside for a while, trying occasionally, failing, and setting it aside.

The problem I was having, was that the slate wouldn't boot to the Windows 8 media - DVD, USB, no matter what I burned it to, it wouldn't work. I even verified that the media was valid by using it to install on other machines. 

Tonight, I finally got it working and the problem was both so odd, and simple, that I thought I'd post it here to maybe help someone else who comes along searching for the same problem. 

It seems that the slate, when shipped, has a bios setting that has "Support for Legacy USB" devices enabled. However, as soon as the system is updated with a purchase date, it automatically flips this switch (presumably for faster boots). Unfortunately, this also causes it to not check for bootable USB devices during POST (cf. http://skp.samsungcsportal.com/integrated/popup/FaqDetailPopup3.jsp?cdsite=hk_en&seq=431318)

The post that tipped me off was this: http://skp.samsungcsportal.com/integrated/popup/FaqDetailPopup3.jsp?cdsite=hk_en&seq=431320

Bios

(image courtesy of Samsung)

However, the bios on my slate didn't look like this: there is no Fast BIOS Mode menu item. There is, though, a menu item that says "Support Legacy USB Devices". Based on this article and the previous one, I took a guess that changing this would fix it and, magically, everything worked just as you would have expected.

Sunday
Jun172012

CodeStock 2012: Buffer Overflow Attack

We had a great time at CodeStock a few days ago discussing buffer overflow attacks, showing developers how they are discovered and exploited and a bit about how to avoid creating software that is vulnerable to these types of attacks. Below are the slides and video from the session:





Sunday
Jun172012

CodeStock 2012: You Think Your WiFi is Safe?

This past Friday I had the privilege of speaking on WiFi security at CodeStock 2012. I had a blast both preparing for the talk and delivering it and I hope it was beneficial to some of those who attended.

As promised (although a bit late), the following are the slides and video from the session:



Monday
Apr162012

Manually Interacting with the MSF database

Metasploit


I've been doing some penetration testing and working through a lab full of exercises which has led me to spending some time with the Metasploit framework. I still consider myself somewhat of a novice when it comes to using this venerable tool. That being said, while I appreciate the database-backing of the tool, and the fact that NMAP scans (and other tools) can feed directly into the database (powered by Postgres), I found myself wanting to interact with the database directly. I wanted to write SQL scripts and other things to update entries on servers. In general, I want to be able to use this DB as the main repository for artifacts and documentation for my pentest and to serve as the basis for my report.


To start, I wanted to connect pgAdmin to the database and poke around a bit, but I had difficulty figuring out what the connection details were. After digging a bit, I found the connection details stored in /opt/metasploit/config/database.yml. With this file, I was able to connect and tweak the database to my heart's content.
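
For scripting against the database rather than clicking through a GUI, the connection details can be pulled out of that file programmatically. The sketch below is an assumption-heavy illustration: it assumes the file uses a Rails-style layout with a `production` section and keys named `username`, `password`, `host`, `port` and `database`, and the sample values are made up. A real YAML library would be the more robust choice; this hand-rolled reader only avoids the extra dependency.

```python
from urllib.parse import quote

# Made-up sample in the assumed layout; the real file's values will differ.
SAMPLE = """\
development:
  adapter: postgresql
  database: msf_dev
production:
  adapter: postgresql
  database: msf3
  username: msf3
  password: "s3cret"
  host: 127.0.0.1
  port: 7337
"""


def read_section(text, section="production"):
    """Collect the 'key: value' pairs indented under `section:`."""
    settings = {}
    inside = False
    for raw in text.splitlines():
        stripped = raw.strip()
        if not stripped or stripped.startswith("#"):
            continue
        if raw[0] not in " \t":
            # An unindented line starts a new top-level section.
            inside = stripped == section + ":"
            continue
        if inside and ":" in stripped:
            key, value = stripped.split(":", 1)
            settings[key.strip()] = value.strip().strip("\"'")
    return settings


def connection_uri(settings):
    """Build a postgresql:// URI, percent-encoding the user and password."""
    return "postgresql://%s:%s@%s:%s/%s" % (
        quote(settings["username"], safe=""),
        quote(settings["password"], safe=""),
        settings["host"],
        settings["port"],
        settings["database"],
    )
```

In practice you would pass the contents of /opt/metasploit/config/database.yml to `read_section()` and hand the resulting URI to whatever PostgreSQL client you prefer.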

Thursday
Apr122012

Building My Personal Cloud

This is a bit of an odd post in that it is more a “wondering aloud” than anything specifically prescriptive or informative. I was asked yesterday by a student if I knew of a Windows box he could use to test something on. It was a simple request – he had been building an application on his Mac, and before handing it over to be used, wanted to validate that it worked on Windows. It had the one unique requirement that it had to run on our network at work due to accessing some IP-restricted content.  The “official” answer was to have him contact our help desk and they would set him up, however the answers he got from them were (understandably) geared towards long-term use of a Windows “desktop” … and came with the associated costs (licenses, labor, etc.). What he really needed, was a VM running a trial version of Windows that he could use for two hours and then throw away… What he needed, was a place he could go, select a Windows machine from a catalog, click “go”, run for a few hours and then click “done” and have it go away… What he needed, was access to a cloud platform.

Unfortunately, I didn’t have a good answer for him, and am still a bit uncertain how he is going to go about fixing the issue, but it did get me thinking about what it would take to have/run a “personal” cloud… nothing fancy, but say I have a single “beefy” workstation/server and I’d like to have a thin hypervisor, and a web-based interface to provision/de-provision running instances… maybe set up a catalog of virtual machines that I’ve used before as “starting points” for new machines. The platform should support a wide variety of OS choices… Windows and Linux without noticeable compromises for either. It would be nice if the platform could support various virtual networking (nothing fancy, but the ability, at times, to create a private network that only two machines can talk on – for testing various things). The final two requirements are that it should support *modest* horizontal scaling (say, I add one or two new physical boxes) and it should be free (or very low cost). This is for personal use or a small group test platform… nothing fancy or official… just something that works. As soon as you get into anything significant cost-wise, you raise the attention of corporate IT, project budgets, etc. and your simple idea just ballooned into something with a budget line item, project managers, and no chance of coming to life.

Options/Solutions: The following is a list of some of the options I’ve been considering along with some commentary on each. I’d be very interested in feedback, but I would caveat that I don’t want the platform itself to be a research endeavor… it should basically “just work” out of the box and not require significant fiddling to get it stable/working.

  • I have a leaning towards Microsoft products (mostly due to familiarity) so I considered them first, but quickly ruled them out. VirtualPC is nothing like what I’m looking for, and HyperV/SystemCenterVMM/<insertManyMoreAcronymsHere> seems ok for a larger-scale deployment but seems like *way* too much overhead for what I’m looking for.
  • Eucalyptus – This is a platform I’ve worked with in the past, and is amazingly easy to set up/run but its support for Windows platforms (at least in my recollection) isn’t that great.
  • VMWare Workstation has some interesting things, and to a large degree, this would be my front-runner but for the fact that there is no web UI for a remote user to provision/interact with a machine. This solution works fairly well, however, for a single user directly interacting with the physical host. While not free, the cost seems reasonable enough (~$200).
  • VMWare vSphere Essentials is promising, especially if used in conjunction with VMWare Workstation. It comes with a web-UI, can scale to multiple physical hosts, has broad OS support, and comes from, arguably, one of the longest-standing leaders in this space. It isn’t free, but at <$500, it falls in the reasonable category – esp. considering it supports up to three physical machines. The price does, however, tend to eliminate it from consideration for my home network.
  • CloudStack + <nameYourHypervisor> – CloudStack is another interesting option and it certainly has a nice UI, however our recent experiences have shown that, while very powerful and great for building large cloud deployments, it seems a bit over-complicated for a one or two server installation. Further, we had significant issues with networking performance for our Windows machines (note: this may be an artifact of the hypervisor we chose: Citrix’s XenServer). We also found that a significant amount of “tinkering” had to be done to get it working… i.e. download patch X from this svn repo, only use version such-and-such of python, etc. Most of these issues may have been environmental and/or solvable by using other distros/hypervisors, but it was a non-trivial deployment.
  • OpenStack – OpenStack seems to be what all of the cool kids are using these days, but our experiences once again hinted at a lack of maturity in the platform in the general case (range of OS support), and the lack of a solid web-UI eliminates it from consideration.
  • XenServer – Another option I considered briefly is to use the free XenServer and then the windows desktop client to manage the servers and create new images/machines. This would work (we are actually doing this in one of our lab environments) but fails the self-provisioning portion of the exercise (no web-based portal, etc). Additionally, we’ve had issues with the networking stack for non-Linux machines and have spent a large amount of time tracking down driver issues and performance issues.

At this point, I don’t have a solution… I’m still looking, and am guessing that whatever I end up with will require an amount of compromise. Comments/suggestions are welcome…

Image courtesy of Pixomar.

Monday
Apr092012

Speaking at CodeStock 2012

It’s that time of year again… the CodeStock session line up has been announced and – much like years past, it is looking to be a great conference. If you’ve not attended CodeStock before, or are not familiar with what it is, it’s a great regional developer conference hosted in Knoxville, TN early each summer. You can see a list of the content that will be covered at this year’s conference here: http://codestock.org/Sessions/Default.aspx – that’s a lot of content crammed into two days – and a great value at only $60.

I’m quite excited to have been selected to speak at CodeStock this year… probably more than I have been in the past. The reason is that I chose a topic that I’ve been studying and enjoying, but not one that is typical for this sort of conference. Over the past three years, I’ve spoken at CodeStock on topics including SharePoint, Team System, Amazon Web Services, Windows Azure, and GPGPU computing with CUDA. This year, however, the talks I’m giving (buffer overflow detection/exploitation and wireless network security) affect developers but are often considered topics for security-specific conferences. What is great is that not only did these talks make the cut, but the buffer overflow talk was #12 in the list of top vote getters (CodeStock lets registered attendees vote on what sessions they’d like to see). If you happen to be in the Knoxville area in the middle of June I hope you’ll consider attending the conference.

Anatomy of a Buffer Overflow Attack - You've heard of "buffer overflows" and maybe you've even been the cause of a few, but do you understand why they are bad? Maybe you're a ".NET developer" and you've never really thought about them. In this session we'll discuss how attackers discover buffer overflows, how they interrogate them, and, finally, how they are exploited. We'll walk through a live demonstration from fuzzing through obtaining a remote shell. You'll leave with a better understanding of how they work, and why you should ensure your code is protected from them.
WiFu - so you think your wireless connection is safe? - In this session we'll discuss various wireless security techniques including common misconceptions and mis-configurations. We will demonstrate how easy it is to compromise even "secured" connections and what the implications are for you as an IT professional. Using free software and inexpensive hardware (~$30), we'll demonstrate a number of attacks and highlight the vulnerabilities that are present in the behavior of many wireless devices.