Not too long ago I was stuck waiting for BAM parts, so I spent a little time working on the Raspberry Pi software, and as I think I mentioned at the time, made pretty good progress. The only thing was that once I’d finished writing the program I ended up streamlining it so much that I left myself with a whopping 50% of unused processor time, which seems like such a waste. So I started thinking about what I could add to the program to put this spare processor time to good use.
Advance notice: I spent waaay too long working on this, so excuse me if it goes a little crazy. As frivolous as it seems, though, working through this taught me a lot about big data processing.
So, ever heard of the thought experiment that says if you had an infinite number of monkeys all typing away at typewriters, then one of them would eventually write the complete works of Shakespeare? You may also have heard it as one monkey which, given an infinite amount of time, would produce the same. There are a few variations but the same basic premise: infinite things or time + typewriter = Shakespeare.
Well I wondered, what if I tried to do that experiment using my soon to be army of idle processors.
Obviously I don’t have an infinite number of monkeys, or for that matter any monkeys at all. But what I do have is a program called monk.py, which I wrote to virtually bang its furry self on a virtual keyboard and generate text strings. I think that’s close enough, and it involves less poop (a key consideration in many experiments).
So I wondered: if I put monk.py to work generating thousands of text strings a second on the Raspberry Pi and compared the output to a reference text (e.g. Shakespeare), how close could I get to actually producing the complete works? Obviously I don’t have an infinite amount of time, but I reasoned that perhaps several Raspberry Pis running this code, 24 hours a day for years, might match just a small portion.
To give myself the best chance of this working I did a few things to stack the odds a little more in my favor. Firstly, I ditched the impressively long ‘complete works of Shakespeare’ (884,431 words) and instead used ‘The Hitchhiker’s Guide to the Galaxy’ (~50,000 words) as my source; along with being shorter, I’d also argue it has had a greater impact on my life, so it’s worth experimenting on :). I removed any keys from the monk.py typewriter that aren’t in my source text, including stripping out all punctuation. As I think this blog shows, I’ve never been a stickler for proper punctuation use, so I shouldn’t expect any better from my monk.py.
The monk.py program was designed to generate a 256,569-character string of text, compare that string to a saved copy of HHG2TG, and produce a percentage match for that string. It would then re-run this loop every 0.2 seconds and record the highest ‘guess’ it had achieved.
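For the curious, here’s a minimal sketch of the sort of loop I’m describing. To be clear, the alphabet, the character-by-character scoring, and the reference line are stand-ins of mine, not the real monk.py internals:

```python
import random

# A stand-in 37-key typewriter: 26 letters, space, and 10 digits.
ALPHABET = "abcdefghijklmnopqrstuvwxyz 0123456789"

def bash_keyboard(length, rng=random):
    """Generate one random 'monkey-typed' string."""
    return "".join(rng.choice(ALPHABET) for _ in range(length))

def percent_match(guess, reference):
    """Score a guess as the percentage of characters in the right place."""
    hits = sum(g == r for g, r in zip(guess, reference))
    return 100 * hits / len(reference)

reference = "the ships hung in the sky in much the same way that bricks dont"
best = 0.0
for _ in range(1_000):                  # the real loop re-ran every 0.2 s
    score = percent_match(bash_keyboard(len(reference)), reference)
    best = max(best, score)             # record the highest guess so far
```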
So I set the program up on my much more powerful laptop as a trial and pressed go one evening. I was hoping that by the next day I might have at least managed to randomly guess a single word, a 0.002% match (see results in the screen grab to the right).
Hmm, so after monk.py had typed up ~50 thousand different versions he hadn’t even managed to get a single word right. That’s probably not a good sign. Thankfully math can help and tell me how many possible combinations there are to try. To do this you raise 37 (the number of possible characters to choose from) to the power of 256,569, which tells you how many possible ways those characters can be arranged.
Wow, so after a surprisingly long 3.5 seconds of calculation it gave me an answer of 887446…followed by 402,346 zeros. Just to give a sense of scale, when typed in size 10 font that takes up 1,450 A4 pages and would take about 17 days to write out (I can write about 10 zeros in 3.6 seconds), assuming I don’t eat, sleep, use the bathroom, or pause to attend to the bloody wound that would be my hand by about day 2.
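Incidentally, you don’t need to build the whole ~400,000-digit integer to size it (recent Python versions will actually refuse to convert an integer that long to a string by default). Logarithms give the digit count, and even the leading digits, instantly:

```python
import math

# Digit count and leading digits of 37**256,569 via logarithms,
# rather than materializing the ~400,000-digit integer itself.
exponent = 256_569 * math.log10(37)     # ~402,351.95
digits = math.floor(exponent) + 1       # 402,352 digits in total
leading = 10 ** (exponent % 1)          # ~8.87..., i.e. it starts '887...'
print(digits, leading)
```

402,352 digits total: the ‘887446’ up front plus the 402,346 that follow.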
As this seems pretty unapproachably big, what if I cheat even further? I modified my monk.py engine so that instead of relying on random chance it tried each combination exactly once, working through every possibility in order (aaa, aab, aac… etc). That should massively cut down the time taken to get to the correct answer, as it won’t ever repeat itself, which should make it quite doable?
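A systematic version of that is easy to sketch with itertools.product (again, this is my stand-in for what the modified engine does, not its actual code):

```python
import itertools

ALPHABET = "abcdefghijklmnopqrstuvwxyz 0123456789"  # the 37 keys

def every_combination(length):
    """Yield every possible string of a given length exactly once,
    in order: 'aaa', 'aab', 'aac', ..."""
    for keys in itertools.product(ALPHABET, repeat=length):
        yield "".join(keys)

# Even at length 3 there are 37**3 = 50,653 strings to get through;
# at length 256,569 the loop comfortably outlasts the universe.
print(list(itertools.islice(every_combination(3), 3)))
```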
Well, no. According to some very quick napkin math, in the time it would take monk.py to try every combination a red dwarf star could have been born and died 402,331 times. So it’s probably not going to work, and to be honest I now feel a bit silly for letting it run over the weekend.
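Here’s the napkin math itself, for anyone checking my work. Everything in it is an assumption of mine (one guess per 0.2 s, a red dwarf lifetime of roughly ten trillion years), and since the numbers involved are far too big to hold directly, it’s all done in log10, so the answer comes out as a power of ten:

```python
import math

# Assumed figures: monk.py guesses 5 strings/s, and a red dwarf
# lives on the order of 10 trillion (1e13) years.
GUESSES_PER_SECOND = 5
RED_DWARF_SECONDS = 1e13 * 3.156e7      # years * seconds-per-year

log10_combinations = 256_569 * math.log10(37)
log10_run_seconds = log10_combinations - math.log10(GUESSES_PER_SECOND)
log10_lifetimes = log10_run_seconds - math.log10(RED_DWARF_SECONDS)
print(round(log10_lifetimes))
```

Reassuringly, the exponent that pops out is 402,331, matching the figure above (read as an order of magnitude).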
But I did originally say I was planning on having multiple microscopes running monk.py; might that not make a difference, I hear no one ask! Well, not really, no. If I assume a perfect business model of selling one microscope to every person on the planet, I can shrink my number to 402,319 red dwarf star lifetimes. Which doesn’t help us, but might make the red dwarf star feel a bit better.
But what if my sales exceeded even these perfect levels? How many monk.pys would I need to run through all the possibilities in, say, an hour? Well, it’s approximately 10^402,351, which, considering that the number of atoms in the universe is only around 10^82, would certainly cause some manufacturing issues. Not even mentioning the power drain required (we may have to plant more than a tree to offset the ecological impact on the universe).
At this point I figured it was probably time to give up on my dream of using cloud computing to pointlessly write a book, but one coffee break, while I was rambling on about this, someone asked “well, what if you were trying to guess the human genome instead?”. Given how deep I’d already gone, I was happy to go a little further in aid of this already fairly insane collection of math.
So just like the HHG2TG guessing, all we need to do is work out the possible number of combinations by raising 4 (the number of possible bases in the genome) to the power of 3.2 billion (the length of the human genome). This time I skipped the calculator app and went straight to Python.
I left it running for 2 days before stopping it; I guess calculating things to the power of 3.2 billion is a bit too processor heavy.
But just like before, I can go a little further and work out how long this combination-calculation might actually take my computer. By running a small-scale series of trial power calculations and creating a best fit (using MagicPlot), I worked out that solving 4 to the power of 3.2 billion would take 1,965 days of constant processing. So, just to recap: I can’t calculate how impossibly long it would take to randomly guess the human genome, because the calculation to do that would itself take an impractical 5 years to complete.
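For what it’s worth, the same logarithm trick from earlier sidesteps the 5-year calculation entirely: it gives the size of 4 to the power 3.2 billion without ever building the integer (which, at 2 bits per base, would alone need ~800 MB to store), and it runs in well under a second:

```python
import math

# Size of 4**3,200,000,000 via logarithms -- no 800 MB integer needed.
GENOME_LENGTH = 3_200_000_000
exponent = GENOME_LENGTH * math.log10(4)
digits = math.floor(exponent) + 1
print(f"{digits:,} digits")     # about 1.9 billion digits
```

So the genome answer has roughly 1.9 billion digits, compared to the ~400,000 digits of the HHG2TG number, which is about as far as my laptop and I are willing to take this.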
So after all that math I have concluded one thing: the infinite monkey experiment doesn’t really need an infinite number of monk.pys, just enough to be so impractical that the number looks silly. Then again, it is meant to be guessing at random, so it could hit the right answer tomorrow, but it is really really really really really really really really really unlikely. And even though I know how massively improbable it is, I might leave monk.py running a bit longer, just in case 😉