Those who don’t care about data processing might want to look away now. I’d like to think this post will be engaging for many different people and the work I discuss is highly transferable but let’s be honest – if you aren’t interested in getting more information out of data, then this is probably going to be about as much fun as watching flock wallpaper dry.
I deal with a LOT of spectral data – it is not unusual for an experiment to produce 0.5 GB of data, all of which needs processing into some kind of meaningful result. This is the perennial problem with a lot of current science – computers are great at running automated experiments that generate massive amounts of data but not as smart when it comes to spitting out the results – “It looks like it worked Dave, now try it at 25 degrees”. Although to put my ‘problem’ into some perspective – when running, CERN produces about 15 petabytes annually, which works out at the equivalent of one of my experiments every second. But still, they have a pretty big team, supercomputers and some clever crowdsourcing working on that, whereas I have me and a 2-year-old laptop.
For my work, the key component of the spectral traces is the peaks. In the looping GIF below, there is a typical spectral trace from one of my early experiments (a trial back in 2009 I think). The graph is plotting the intensity of light at the wavelengths being transmitted through a fibre while it is coated with ~15 layers of material.
As you can see, there are a great number of peaks in the graph; some disappear mid-experiment while others appear. There is a clear pattern in the graph as the experiment progresses but what is needed is a simple breakdown of that pattern. The simplest solution to this problem is to pick a number of peaks, record their values as each layer of coating is applied and then plot them. This solution is okay for a quick look but there are ~50 peaks in that data – manually tracking all of them would be highly time-consuming and, in all likelihood, make you go blind from staring at graphs. Tracking only a few is no better: the spectrum may respond differently at different wavelengths, so randomly picking points is a poor approach to take.
The next step is to just have the computer find the peaks for you. Most data analysis software has some kind of peak-finding function and can easily cope with finding the peaks in 500 sequential scans, which would then produce 500 little lists of the ~50 peak positions. That is only slightly more useful than the raw data and still a long way from being of any interest, because the computer skips a step that you would have done without thinking if you did it yourself: tracking the peaks.
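To make that “500 little lists” output concrete, here is a minimal Python sketch of per-scan peak finding. It is not how SIR or any particular analysis package does it: `find_peaks_in_scan`, the `min_height` threshold and the two tiny scans are all made up for illustration, and real spectra would need smoothing and a more careful peak definition.

```python
def find_peaks_in_scan(wavelengths, intensities, min_height=0.0):
    """Return the wavelengths of simple local maxima in one scan."""
    peaks = []
    for i in range(1, len(intensities) - 1):
        # A point counts as a peak if it is higher than its left neighbour
        # and at least as high as its right neighbour.
        is_local_max = intensities[i - 1] < intensities[i] >= intensities[i + 1]
        if is_local_max and intensities[i] >= min_height:
            peaks.append(wavelengths[i])
    return peaks


# Two made-up scans on the same wavelength axis (nm); the peaks drift right.
wavelengths = [600, 610, 620, 630, 640, 650, 660, 670]
scans = [
    [0.1, 0.9, 0.2, 0.1, 0.8, 0.2, 0.1, 0.1],
    [0.1, 0.3, 0.9, 0.1, 0.2, 0.8, 0.1, 0.1],
]

# One list of peak positions per scan, with nothing linking scan to scan.
peak_lists = [find_peaks_in_scan(wavelengths, scan, min_height=0.5) for scan in scans]
print(peak_lists)  # [[610, 640], [620, 650]]
```

Each scan gets its own list, but nothing in the output says that 610 nm in the first scan and 620 nm in the second are the same peak.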
When you watched the GIF I showed at the beginning of the post, your eye would have naturally tracked the peaks as they moved to the right. A computer isn’t that smart – all it can do is scan each spectrum and say “yup I found 50 peaks in this one too!”. This can be perfectly acceptable: if the number of peaks never changes, then peak 6 in one spectrum is likely to be the same as peak 6 in the next, so you can simply plot peak 6 in every one of the 500 spectra and hope that the peaks always correspond. In practice this pretty much never happens, as data is inherently noisy, complicated and generally a pain. Below is a little example of what more commonly happens when attempting to track peaks in this way.
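The same failure can be reproduced in a few lines of Python. This is only a toy illustration with invented peak positions, and `link_by_index` is a hypothetical name for the “peak 6 is always peak 6” approach rather than anything taken from real software.

```python
def link_by_index(peak_lists):
    """Naive linking: track k is simply the k-th peak found in every scan."""
    n_tracks = max(len(peaks) for peaks in peak_lists)
    tracks = [[] for _ in range(n_tracks)]
    for peaks in peak_lists:
        for k in range(n_tracks):
            # Pad with None when a scan has fewer peaks than the others.
            tracks[k].append(peaks[k] if k < len(peaks) else None)
    return tracks


# Made-up peak positions (nm). In the third scan the first peak is missed,
# so every remaining peak slides down one slot.
peak_lists = [
    [610, 640, 670],
    [612, 642, 672],
    [644, 674],
]

tracks = link_by_index(peak_lists)
print(tracks[0])  # [610, 612, 644]
print(tracks[2])  # [670, 672, None]
```

The first “track” jumps by more than 30 nm in a single scan, not because the peak moved but because one missed peak shifted every index after it.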
So the solution is to provide the computer with a few ‘smart’ ways of deciding which peak in one scan should be linked to which peak in the next. While these methods are very unlikely to be better than a human’s natural ability to look for patterns, a computerised method has the added advantage of being impartial and therefore less likely to impose its own conclusions on the data. So to do this I wrote SIR – otherwise known as the Spectrum Interrogation Routine, which is absolutely not awkwardly named after an obscure cartoon character…
SIR is designed to track the movement of spectral peaks with a range of different ‘smart’ peak tracking methods. It was important to give SIR multiple methods as the movement of spectral peaks can vary greatly depending on the experiment, so the program needed to cope with a large range of peak movements. All of the tracking systems below work by building up the data one scan at a time, so the question the program is constantly trying to answer is – “where does this list of peaks fit in relation to the previous n scans?”
- Default tracking – This is an upgrade on the simplistic idea of just listing the peaks. The method I wrote for this uses the three previous points to try and ‘predict’ the position of the 4th. The diagram to the right makes this a little clearer. If the new peak value is within ~5% of the predicted position then it is assumed to belong to that peak; if not, the same algorithm is run on the next possible peak. If the new point doesn’t match any existing peak’s predicted position, the program calls it a new peak. This is a good starting method as it can cope with a wide range of peak movements, assuming a high sampling speed.
- Moving average – Moving averages are pretty common but this method uses one to track the position of the peak, by taking a moving average of the previous n values and comparing the new peak value to it. If the new value is within a chosen multiple of the standard deviation (STDEV) of those n values then it is associated with that peak. If not, the process is again repeated with other nearby peaks until it either finds a match or a new peak is created. This method is a ‘fuzzy’ alternative to the default method and works best for widely separated peaks, which allow for a large STDEV multiple. It is particularly good with ‘noisy data’.
- Linear or Polynomial fit – These are more complex tracking methods as they require some assumptions about the effect that you are expecting to see in the data. For this, the program creates a small model based on a certain number of previous points and then compares the new peak position against the model’s prediction. This model system is highly noise-resistant and allows for tracking of peaks that might disappear for a number of spectra before re-appearing. However, these methods only work where your data actually fits the model. If your data isn’t linear or polynomial then you are out of luck.
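For anyone who thinks better in code, the three ideas above can be sketched in Python roughly as follows. This is emphatically not the SIR source (which is a set of LabVIEW VIs); it is a minimal sketch under several assumptions of my own: the ‘default’ prediction is taken to be a straight line through the last three points, the ~5% is taken relative to the predicted position, the `floor` and `tolerance` values are invented, only the linear case of the model fit is shown, and every function name is hypothetical.

```python
from statistics import mean, stdev


def match_default(track, scan, position, tolerance=0.05):
    """Predict the next point from the last three (assumed: straight-line fit)."""
    values = [p for _, p in track[-3:]]
    if len(values) < 3:
        predicted = values[-1]  # too little history: reuse the last value
    else:
        # Least-squares line through 3 equally spaced points, one step ahead.
        predicted = mean(values) + (values[2] - values[0])
    return abs(position - predicted) <= tolerance * abs(predicted)


def match_moving_average(track, scan, position, n=5, k=3.0, floor=10.0):
    """Accept a peak within k standard deviations of the last n positions."""
    values = [p for _, p in track[-n:]]
    spread = stdev(values) if len(values) > 1 else 0.0
    # 'floor' is an assumed minimum tolerance so that young tracks can grow.
    return abs(position - mean(values)) <= max(k * spread, floor)


def match_linear_fit(track, scan, position, n=5, tolerance=5.0):
    """Fit position against scan number, so gaps in a track are allowed for."""
    points = track[-n:]
    if len(points) < 2:
        predicted = points[-1][1]
    else:
        x_bar = mean(s for s, _ in points)
        y_bar = mean(p for _, p in points)
        slope = sum((s - x_bar) * (p - y_bar) for s, p in points) / sum(
            (s - x_bar) ** 2 for s, _ in points)
        predicted = y_bar + slope * (scan - x_bar)
    return abs(position - predicted) <= tolerance


def track_peaks(peak_lists, matches):
    """Greedy linker: a peak joins the first unclaimed track that accepts it."""
    tracks = []  # each track is a list of (scan number, position) pairs
    for scan, peaks in enumerate(peak_lists):
        claimed = set()
        for position in peaks:
            for index, track in enumerate(tracks):
                if index not in claimed and matches(track, scan, position):
                    track.append((scan, position))
                    claimed.add(index)
                    break
            else:
                # No existing track accepted this peak, so start a new one.
                claimed.add(len(tracks))
                tracks.append([(scan, position)])
    return tracks


# Made-up data (nm): three drifting peaks; the middle one vanishes in scan 3.
peak_lists = [
    [500.0, 700.0, 900.0],
    [502.0, 703.0, 901.0],
    [504.0, 706.0, 902.0],
    [506.0, 903.0],
    [508.0, 712.0, 904.0],
]

for matcher in (match_default, match_moving_average, match_linear_fit):
    tracks = track_peaks(peak_lists, matcher)
    print(matcher.__name__, [[p for _, p in t] for t in tracks])
```

On this toy data all three matchers keep the middle peak as its own track across the scan where it goes missing, instead of letting the 903 nm peak slide into its slot. The first-match rule is the crudest part of the sketch: with closely spaced peaks you would want to pick the best candidate rather than the first acceptable one.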
On the off chance that anyone is interested in using this code for their own spectral tracking, the software is available below.
- SIR (Windows Installer) – This will allow you to run the program, without installing any additional software, on Windows XP or any later version of Windows.
- SIR (source VIs) – This is a ZIP file containing all the VIs I used to make the program. All are free to use under a Creative Commons licence.
Please let me know about any problems, bugs, feature requests or suggestions you think might enhance the program – either through the comments below or via the e-mail address listed in the help files. I would like this program to be as usable as possible, but one of its biggest limitations is that I only had a relatively small sample of data sets to play with. So if you have a dataset that doesn’t work for whatever reason, then please feel free to send it to me and I’ll try to tweak the code and make it work!