Optimizing SST for sea turtles

[This post will be updated frequently. If you are interested in the topic please check back and post your comments below.]

Quite a few years ago now, when I was first setting up STAT and Maptool, I recall being unhappy with the publicly available sea surface temperature (SST) data sets at the time. This was probably some time during 2003.

sst24_2005_305.gif

I was able to get access to GOES SST, which was not publicly accessible at the time. This was an excellent data set for the time, with daily 6km resolution SST images. Unfortunately it was not a global data set, only covering parts of the Atlantic and Pacific (mainly areas of US interest). In any case, I included these data in STAT and Maptool because the quality was relatively good.

All of the global SST data sets that I could find at the time either had too many gaps (eg clouds or lack of satellite coverage) or too much smoothing (eg over-interpolating to fill in the gaps), such that the small scale variations in areas such as the Gulf Stream or the Kuroshio Current would be lost.

It was about this time that I discovered the NAVO MCSST data set on the US GODAE web site. This was a nice data set because it was provided in text format, so I could easily parse out the latitude, longitude and temperature data, then process the raw SST data into whatever temporal and spatial resolution I wanted. Based on some educated guessing I selected a spatial resolution of 0.5 degrees and tried temporal resolutions of daily, weekly and monthly. It should be noted that the underlying AVHRR data do not have full spatial coverage on a daily basis. So my daily maps were actually a rolling 7 day average of each day with the previous and next three days of data. I did some comparisons at the time that suggested that you needed at least 7 days of AVHRR data for (nearly) full spatial coverage.
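To make the rolling 7 day average concrete, here is a minimal sketch. It is not the original processing script. It assumes each daily grid is a flat list of temperatures with NaN marking cells that had no data that day, and the function name `rolling_mean` is made up for illustration.

```python
import math

def rolling_mean(daily_grids, day, half_window=3):
    """Average each cell over the days day-half_window .. day+half_window.

    daily_grids is a list of grids, one per day; each grid is a flat list
    of temperatures, with NaN for cells that have no data (assumed layout).
    Cells with no data anywhere in the window stay NaN.
    """
    start = max(0, day - half_window)
    stop = min(len(daily_grids), day + half_window + 1)
    window = daily_grids[start:stop]
    result = []
    for cell in range(len(window[0])):
        values = [grid[cell] for grid in window if not math.isnan(grid[cell])]
        result.append(sum(values) / len(values) if values else math.nan)
    return result
```

With `half_window=3` each output grid draws on up to seven days of data, which is how a cell that is cloudy on the target day can still receive a value.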

So with processes in hand to convert these data into SST grids I happily went on developing STAT and Maptool. Unfortunately after a couple of years I started to run into a new set of problems. Each of my AVHRR SST grids was about 65 MB, and the GOES SST grids were about 24 MB. You do the math. The daily grids alone were taking up more than 32 GB per year, and that's not mentioning all of the other data layers I was adding. I was starting to run out of drive space. Now drive space is cheap and I had no problem buying more or bigger hard drives to store the stuff on at home, but it's a bit more difficult to add drive space to a remote server. I had to start making difficult decisions about how much and which data I was going to leave on the server.
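For anyone who wants the math spelled out, a quick sketch follows. It assumes one AVHRR grid and one GOES grid per day and 1 GB = 1000 MB; the function name is hypothetical.

```python
def yearly_storage_gb(grid_sizes_mb, grids_per_year=365):
    """Storage per year, in GB, for one grid of each listed size per day.

    Assumes decimal units (1 GB = 1000 MB).
    """
    return sum(grid_sizes_mb) * grids_per_year / 1000

# 65 MB (AVHRR) + 24 MB (GOES) = 89 MB per day
print(yearly_storage_gb([65, 24]))
```

That works out to roughly 32.5 GB per year for the daily grids alone.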

The Publication That Never Happened

[Likely shooting myself in the foot here with regards to publication. But the reality is that I hate the peer-reviewed publication process and the chances of me ever formally writing this up are slim (and none). Besides, knowledge should be free. Publishing information in journals that don't make the results freely available to anyone is counter-productive, particularly when dealing with endangered species.]

Then it occurred to me two or three years ago that perhaps I was overdoing things. Did I really need such high spatial and temporal resolutions for my applications? Let's set things up here. What I am interested in is associating various oceanographic parameters with locations received from Argos satellite transmitters placed on sea turtles. The locations received from System Argos have a certain level of inherent error that has to be taken into account. Do we really need hourly 50m SST grids? Does it make sense given the error of Argos location data? How much accuracy do you gain or lose when averaging across time and space?

Another issue is the gaps in SST data. Tracking data are relatively expensive to gather, so you want to avoid throwing out data where possible. Large gaps in SST coverage can render a lot of your tracking data useless, leaving you very little data for comparison or analysis. So it's equally important to minimize SST data gaps.

One last issue is how far back in time do the SST data exist? Or, conversely, how current are they? For example, the really nice GOES data only go back to 2003, so you can't use them with any tracking data collected before that time. Obviously it would be nice to use data source(s) that have been active long enough to retrospectively analyze tracking data into the 1990s. And sensors and satellites fail. What do you do when a data stream ends?

So, what is the best spatial and temporal resolution for sampling oceanographic data that will provide, in this case, a sea surface temperature value as close to reality as possible at each satellite tracking location, taking into account the number of files generated and their size, so as to minimize storage requirements and processing time? No surprise that it also takes more time to create and sample larger grids (doubling the spatial resolution quadruples the file size), which is a critical concern when building interactive tools for use online.
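The quadrupling is easy to check by counting cells. The sketch below assumes a regular global latitude/longitude grid with square cells; actual file size also depends on the storage format, which is not covered here.

```python
def global_grid_cells(cell_size_deg):
    """Number of cells in a regular global lat/lon grid (assumed layout)."""
    cols = round(360 / cell_size_deg)
    rows = round(180 / cell_size_deg)
    return rows * cols

coarse = global_grid_cells(0.5)   # 720 x 360 cells
fine = global_grid_cells(0.25)    # 1440 x 720 cells
print(fine / coarse)
```

Halving the cell size doubles the number of rows and the number of columns, so the cell count (and with it the time needed to build and sample the grid) goes up by a factor of four.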

It was unbelievable to me that no one had tackled at least the biological aspect of the problem yet, at least not that I could find. The closest thing I have found is this Bradshaw et al 2002 paper: The optimal spatial scale for the analysis of elephant seal foraging as determined by geo-location in relation to sea surface temperatures

First question for the comments: Do you know of publications that have addressed this issue of optimal spatial and temporal resolution of oceanographic parameters for biological applications?

So about two years ago I finally set out to answer this question. I generated SST grids of various scales from AVHRR data along with a number of publicly available SST data sets. I compared the SST data sets to surface temperatures collected by drifting buoys. The buoys also use the Argos system to obtain location information, so I thought this would be a relatively good comparison, although the buoy locations are arguably more accurate than those obtained from marine animals that may only come to the surface for a few seconds to a few minutes at a time. I also worked under the assumption that the temperatures reported by the buoys were the "truth". There is probably some error in the buoy temperatures, but we have to start somewhere. The results were quite interesting and not necessarily what you might expect.
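The basic sampling step, pulling the grid value under a buoy (or turtle) location, might look something like the sketch below. The grid layout is an assumption made for illustration: a list of rows, with row 0 starting at 90 degrees south and column 0 starting at 180 degrees west. The function name `sample_grid` is hypothetical.

```python
def sample_grid(grid, lat, lon, cell_size_deg):
    """Return the value of the grid cell containing (lat, lon).

    Assumed layout: grid[row][col], row 0 starts at latitude -90,
    column 0 starts at longitude -180, square cells of cell_size_deg.
    """
    # Clamp so that lat = 90 falls in the last row instead of off the grid.
    row = min(int((lat + 90.0) / cell_size_deg), len(grid) - 1)
    # Wrap longitude so that 180 and -180 land in the same column.
    col = int(((lon + 180.0) % 360.0) / cell_size_deg) % len(grid[0])
    return grid[row][col]
```

Each buoy report then yields a pair of values, the buoy temperature and the satellite temperature from the matching cell, which is what the comparison needs.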

200806061421.jpg

The red bars are the root mean squared error (RMSE) of each satellite SST source compared to buoy SST and the blue bars indicate the percentage of locations for which no SST was obtained from each of the SST sources. The data sources are listed along the bottom as datasource_temporalscale_interpolationmethod. The data source should be obvious. I added SST from MODIS Aqua, which I had not tried before. The AVHRR data were at 0.05 degrees resolution, GOES at 6km and MODIS at 4km. Interpolation methods are xyz, which basically means no interpolation (just filled in grid cells for which data were available), nearest neighbor and surface. The sources are organized as those with least error to the left. So for example the far left source, avhrr_daily_xyz is average daily AVHRR data with no interpolation. Although this had the highest accuracy when compared to buoy temperatures, you can see the problem of spatial coverage in daily AVHRR data (referred to earlier) in the 90% missing values. "roll" refers to the daily rolling average I mentioned earlier, and weekly is a true weekly average (eg one grid per week). "5day" and "7day" refer to a compositing method where SST grids from before and after a given day are used to fill in gaps in each daily grid. Some of the sources do not include an interpolation method because they were obtained in a gridded format.
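For reference, the two quantities plotted for each source can be computed along these lines. This is a sketch, not the script behind the chart: it assumes a non-empty list of (buoy, satellite) temperature pairs, with NaN as the satellite value wherever the grid had a gap.

```python
import math

def rmse_and_percent_missing(pairs):
    """Return (RMSE, percent missing) for (buoy_temp, satellite_temp) pairs.

    A satellite_temp of NaN means no SST was available at that location.
    RMSE is computed only over pairs that have a satellite value.
    """
    valid = [(buoy, sat) for buoy, sat in pairs if not math.isnan(sat)]
    percent_missing = 100.0 * (len(pairs) - len(valid)) / len(pairs)
    if not valid:
        return math.nan, percent_missing
    squared_errors = [(sat - buoy) ** 2 for buoy, sat in valid]
    return math.sqrt(sum(squared_errors) / len(valid)), percent_missing
```

Note that the RMSE only reflects the locations where a satellite value exists, so a source with 90% missing values is being scored on the remaining 10% of locations.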

Step up from avhrr_daily_xyz and all the rest have significantly greater SST error. Also worth noting is that all of the sources that had less than 1 degree C error also had high levels of missing values (eg lots of gaps due to clouds or lack of satellite coverage).

The take home message I got from this was that one of the middle sources would be best, between avhrr_weekly_surf and avhrr_weekly_nn. Of these the two labeled as weekly would take up the least disk space, and avhrr_weekly_nn would take up less processing time than avhrr_weekly_surf.

Second question for the comments: What is the take home message you get from these results?

Third question for the comments: What spatial and temporal resolutions would you like to see and why?

The above results mostly focus on comparing temporal resolution and interpolation methods. I had intended to follow up with a more thorough comparison of spatial resolutions. I also intended to publish the results as I am sure they would be of broad interest to a large number of researchers. Unfortunately life got a bit complicated at about that time and I lost track of it all.

So Why Am I Bringing It Up Now?

About two weeks ago I was moving drives around in my office and accidentally unplugged one. This corrupted the drive and I was no longer able to mount it. As it turns out this was the local drive on which I store ALL of my oceanographic data. It's worth pointing out that this did not affect any data on the seaturtle.org server (it's all still there), but it is the source of any new data that get transferred to the server. So some of you may have started noticing that data available in STAT and Maptool are getting out of date. The data on the now defunct drive take up so much space that it is one of the few things that I do not back up. Besides, all of the scripts that create the data are backed up, so I knew I could download the source data and regenerate all of the files again if needed.

As you might guess, downloading hundreds of gigabytes of data takes time, so I decided to first try recovering the disk or recovering data off the disk. During the first week after the meltdown I purchased and used a number of data recovery tools, all to no avail until I tried a program called Data Rescue. This appeared to recover the data from the drive and happily chugged away for about 12 hours, only to find that most of the recovered files were corrupted when it was all said and done.

Week two post-meltdown, I rolled up my sleeves, dusted off my keyboard, and proceeded to begin re-downloading all of the oceanographic source data. While I was waiting I dug into my scripts that process the data into grids that I can use for mapping and sampling, as this provided a nice opportunity to improve them and make them more efficient. I've learned quite a bit since I first set up many of the scripts.

At this point all of my remote sensing data have been rebuilt (chlorophyll, sea surface height, ocean currents, ocean winds, NDVI, productivity). There's more SST source data than any of the others and it was taking the longest to download, so I left SST to last. Finally, a couple of days ago, enough SST data had been downloaded that I could begin processing some of it. Then it hit me...

What Spatial And Temporal Resolution Should I Use?

All of the old questions came drifting back. A complicating factor is that some of the SST products have been improved and new ones introduced since I carried out my initial analysis. After putting off the inevitable for a couple of days, I decided to revisit the questions.

So I have started working on this and the results are rolling in. I will post them here, but before I do I want your responses to my comment questions above.

Comments

Interesting thoughts. I seem to remember a talk a couple of years back by researchers from the Australian Institute of Marine Science (AIMS) on this topic as well – I will have to try and dig up a reference.

In reference to Q1 my thoughts are that it will really depend on the aims of the project and the scale(s) being addressed (i.e. small scale questions such as home range versus broad scale such as migration) and what is being asked of the remotely sensed data such as SST. Tight association between remotely sensed data and behaviour/ecology requires tight matching of scale & resolution, whereas broad associations can probably get away with a weaker match. With substantial improvements in computing, data storage and increased use of remote sensed data in ecological studies, research into “dealing with uncertainty” is becoming more common. Anyway just some initial thoughts & references.

Baccini, A. (2007) Scaling field data to calibrate and validate moderate spatial resolution remote sensing models. Photogrammetric Engineering and Remote Sensing 73, 945-954.

Root, T. L. (1993) Can Large-Scale Climatic Models Be Linked with Multiscale ecological studies. Conservation Biology 7, 256-270.

Wang, W. (2008) Validating MODIS land surface temperature products using long-term nighttime ground measurements. Remote Sensing of Environment 112, 623-635.

Thanks Mark. Good stuff and references. You are absolutely correct regarding the need to match up the scale of the environmental data to the particular question being asked. In my case I want to find a scale that is as broadly applicable as possible given the types of tools and services that are and will be offered through STAT and Maptool. Obviously there is the mapping aspect (you want a scale that looks good when mapped). Then there is the environmental sampling: sampling the environment under or around each location. I think given the error of Argos locations an argument can be made that very high resolution imagery does not add much value. This obviously gets complicated with the increasing use of higher resolution GPS-enabled tags. I will take a look at these refs and see what I can pull out.
