Monday, July 13, 2009

Graphs and Automation

Software

Making graphs is a lot of work, even if the subject is bats (and their songs). And while automated systems, like the bat logger, are a great boon in the unattended collection of information, they can pump out data at an alarming rate.

Here's some quick arithmetic:

Hours (8:00 PM – 6:00 AM) - 10

Samples per hour (one sample every 10 seconds, or 6 per minute) - 360

Total - 3600
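The same arithmetic as a quick shell sanity check (assuming one sample every 10 seconds over the 10-hour night):

```shell
# 8:00 PM to 6:00 AM is 10 hours of logging.
hours=10
# One sample every 10 seconds: 3600 seconds per hour / 10.
samples_per_hour=$(( 60 * 60 / 10 ))
total=$(( hours * samples_per_hour ))
echo "$samples_per_hour samples/hour, $total samples per night"
```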

So I can draw a graph like this every day:

But how? Excel can make really pretty graphs, but that's a lot of work. Here's what always happens to me – 2 minutes to make the graph, 15 minutes to make it pretty. In other words, it takes longer to remove lines, edit the legend, choose colors, fix the axes, and so on, by far, than it does to whip out a basic graph in the first place. Fifteen minutes per day, every day is not all that attractive.

The solution to an automated data deluge is automatic graphing, or at least scripted graphing. And Excel doesn't lend itself to easy scripting. But gnuplot does. Here's a link with some examples:

http://nucl.sci.hokudai.ac.jp/~ohnishi/Lib/gnuplot.html

As you can see from the link above, gnuplot will draw a wide variety of graphs from text files, all neatly scriptable with program code. I had to add a bit of external awk scripting to filter the data logger output stream, but the bulk of the work is done by gnuplot. Yes, it took a while to set up and debug, but it saves a ton of time over trying to generate the same graph every day with Excel.

Here's how it works: The bat logger writes the data it collects in ASCII text, comma-delimited files. These land on a USB stick plugged into the logger, and I just unplug it one day during the week (when the logger, like the bats, is sleeping) and transfer the data to my laptop.

Here's a sample:



2,02/25/2009 - 18:32:00,2
2,02/25/2009 - 18:32:02,2
1,02/25/2009 - 18:32:04,5
2,02/25/2009 - 18:32:04,2
3,02/25/2009 - 18:32:06,7465
4,02/25/2009 - 18:32:06,9221
5,02/25/2009 - 18:32:06,7007
6,02/25/2009 - 18:32:06,0
2,02/25/2009 - 18:32:06,2
1,02/25/2009 - 18:32:08,10
2,02/25/2009 - 18:32:08,2
1,02/25/2009 - 18:32:10,40
2,02/25/2009 - 18:32:10,2
1,02/25/2009 - 18:32:12,15


The first column, that is the number before the first comma, is the channel number. I've numbered the channels like this:



#define BAT_ECHO_COUNT_CHANNEL 1
#define WIND_COUNT_CHANNEL 2
#define TEMP_CHANNEL1 3
#define LIGHT_CHANNEL 4
#define BATTERY_CHANNEL 5
#define RAIN_SENSOR_CHANNEL 6
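Since the channel number is the first comma-delimited field on every line, a quick tally of how many samples each channel produced is one pipeline away. Here's a sketch; the file name sample.DAT and the handful of lines in it are made up for illustration, in the same format as the real logger output:

```shell
# A few made-up logger-style lines: channel, timestamp, value.
cat > sample.DAT <<'EOF'
2,02/25/2009 - 18:32:00,2
2,02/25/2009 - 18:32:02,2
1,02/25/2009 - 18:32:04,5
2,02/25/2009 - 18:32:04,2
EOF

# Pull out the first comma-delimited field (the channel number),
# sort it, and count how often each channel appears.
cut -d, -f1 sample.DAT | sort | uniq -c
```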


So the very first two rows shown are from channel 2, and represent a count of anemometer cup rotations, which in turn is a measure of wind speed. (The idea is to collect some data to see if high winds really do deter bat activity, as some have quite logically suggested.)

The next row starts with a 1, so it's a count of bat echolocation clicks from the bat detector. Basically each pulse that makes up a bat echolocation call gets counted this way. Given that each bat call is composed of tens of clicks, this count mounts up rapidly.

Channel 3 (the fifth row) shows a raw temperature value. I keep meaning to calibrate the sensor, but right now all I'm collecting is a count from the logger's Analog to Digital Converter (ADC). It can range from 0 to 4095.

Channel 4 is a measure of daylight from a photoresistor. The bat logger software uses this to decide when to sleep, and when to wake up and collect data.

Channel 5 shows the battery voltage, again uncalibrated.

Finally, Channel 6 is a rain sensor from an irrigation system. It is ON (greater than zero) when it gets wet, and OFF otherwise. The data above show that it was a dry night on February 25th, at least around 6:30 PM.

The first task for my script is to split the raw data file into separate files, one per channel. Using a bash shell either on Linux or on Windows under Cygwin (http://www.cygwin.com), this is really just a grep. (Non-Unix weenies can tune out now).



egrep "^1," $1.DAT >$1_1.tmp
awk -f ./bin/filter_bad_dates2.awk $1_1.tmp >$1_1.DAT
egrep "^2," $1.DAT >$1_2.tmp
awk -f ./bin/filter_bad_dates2.awk $1_2.tmp >$1_2.DAT
egrep "^3," $1.DAT >$1_3.tmp
awk -f ./bin/filter_bad_dates2.awk $1_3.tmp >$1_3.DAT
egrep "^4," $1.DAT >$1_4.tmp
awk -f ./bin/filter_bad_dates2.awk $1_4.tmp >$1_4.DAT
egrep "^5," $1.DAT >$1_5.tmp
awk -f ./bin/filter_bad_dates2.awk $1_5.tmp >$1_5.DAT
egrep "^6," $1.DAT >$1_6.tmp
awk -f ./bin/filter_bad_dates2.awk $1_6.tmp >$1_6.DAT
#awk -f combine_echos.awk $1_1.DAT >$1_1a.DAT


The egrep command (http://unixhelp.ed.ac.uk/CGI/man-cgi?grep) looks for lines starting with the digit 1, and siphons them off into a file whose name ends in .tmp. So when this script is run on the data from February 25th, 2009, the input file name is 20090225.DAT and the samples for channel 1 will end up in 20090225_1.tmp. Samples from channel 2 will end up in 20090225_2.tmp and so on.
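The six nearly identical lines could also be rolled up into a loop. Here's a sketch of that variant, using a made-up demo.DAT with a few logger-style lines in place of the real date-stamped file, and plain grep, which handles this pattern just as well as egrep:

```shell
# A made-up input file in the logger's format: channel, timestamp, value.
cat > demo.DAT <<'EOF'
1,02/25/2009 - 18:32:04,5
2,02/25/2009 - 18:32:04,2
3,02/25/2009 - 18:32:06,7465
1,02/25/2009 - 18:32:08,10
EOF

# One grep per channel, exactly what the unrolled script does.
for ch in 1 2 3 4 5 6; do
    grep "^${ch}," demo.DAT > "demo_${ch}.tmp"
done

# Channels with no samples (4, 5, 6 here) simply end up as empty files.
wc -l demo_1.tmp
```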

The awk command (http://www.hcs.harvard.edu/~dholland/computers/awk.html) that follows performs a bit of data cleanup on the dates. More on that in a later post about the bat logger. For now, just know that I sometimes get some zero dates that need removal to avoid messing up the time scale for the graphs.
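I'm saving filter_bad_dates2.awk itself for that later post, but a minimal stand-in that captures the idea might look like this. The zeroed-out date pattern is my assumption about what a "zero date" looks like, and the file names are made up:

```shell
# Made-up input: one good record, one with a zeroed-out date field.
cat > raw.tmp <<'EOF'
1,02/25/2009 - 18:32:04,5
1,00/00/0000 - 00:00:00,7
EOF

# Keep only records whose date field (field 2) doesn't start with a
# zero month, "00/". Everything else passes through untouched.
awk -F, '$2 !~ /^00\//' raw.tmp > clean.DAT
cat clean.DAT
```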

To graph a single channel of data, each file is complete. Just script gnuplot, point it to the .tmp file and run. But to graph multiple channels on the same graph, so as to show, say, temperature and wind versus echolocation calls, I need to put the relevant files back together again. Gnuplot expects each data series to be grouped together in a single input file, so concatenating the individual channel files back together provides the needed input. Each data series is separated in gnuplot's input file by two blank lines. Hence this odd-looking code:



echo >>$1_1.DAT
echo >>$1_1.DAT
echo >>$1_2.DAT
echo >>$1_2.DAT
echo >>$1_3.DAT
echo >>$1_3.DAT
echo >>$1_4.DAT
echo >>$1_4.DAT
echo >>$1_5.DAT
echo >>$1_5.DAT


This just appends two blank lines to each channel file.



cat $1_1.DAT $1_2.DAT $1_3.DAT $1_4.DAT $1_5.DAT $1_6.DAT >$1_all.DAT


The result of this command is a single output file that consists of 20090225_1.DAT with 20090225_2.DAT, 20090225_3.DAT and so forth appended. This is the data file for gnuplot.
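As a toy check that the two-blank-line separators really do turn into the blocks gnuplot addresses with index, here's the same trick on two tiny made-up series. In the resulting file, gnuplot's index 0 would select the first block and index 1 the second:

```shell
# Two tiny single-channel files in the logger's format (made up).
printf '1,02/25/2009 - 18:32:04,5\n' > day_1.DAT
printf '2,02/25/2009 - 18:32:04,2\n' > day_2.DAT

# Two blank lines after the first series mark an index boundary.
echo >> day_1.DAT
echo >> day_1.DAT

cat day_1.DAT day_2.DAT > day_all.DAT
cat day_all.DAT
```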



gnuplot -background black gnuplot_20090225.plt


At long last we're ready to run the gnuplot script. Here it is:



# Draw bat graphs
set terminal png x000000 xFFFFFF x404040 xffa500 x66cdaa x9500d3
set output './plot/20090225_plot.png'
set datafile separator ","
set xdata time
# Use this format for the data as spat out by the logger, with the hyphen
# between the date and the time.
set timefmt "%m/%d/%Y - %H:%M:%S"
set title "Bat echolocation calls, Temperature and Wind Gusts, beginning 20090225"
#set xtics 1000000
#set ytics 0, 100
#set y2tics 0, 100
set ylabel "Calls"
set y2label "Temp"
unset key
set format x "%H"
set yrange [0:1000]
set y2range [2000:4000]
plot '20090225_all.DAT' index 0 using 2:3 smooth frequency axis x1y1, \
'20090225_all.DAT' index 2 using 2:( 10240 - $3 ) smooth frequency axis x1y2, \
'20090225_all.DAT' index 1 using 2:($3 * 100 ) axis x1y1 ;


Now it took me quite a while, and a lot of peeks at the web, to set this up, but the beauty of the system is that I can produce today's graph with a single bash command. Notice that the output terminal is set to “png” in the second line, so gnuplot draws the graph in 20090225_plot.png.

The result is a graph like this:


The main things I learned from this whole exercise were:

1. Script everything or drown in data from your automation.

2. Clean the data, and remove errors and outliers, before they mess up your graphs and analysis.

3. The simplest tools are the best. It's amazing what you can do in a shell script and with awk.

4. Create output files in ASCII and use simple delimited formats. Yes, it takes up a bit more space than some obtuse binary format, but it facilitates using simple tools and makes the observations easy to view, sanity-check and edit.


I'll post more information about the bat logger in the future.