## 2017/06/19

### Summary

Coming from an atmosphere and ocean sciences background, it's of particular interest to me how weather, climate, and environmental data are collected, organised, and shared. With the rise of citizen science and the “open data” movement, federal and state agencies are opening up their data reservoirs to public scrutiny, and there has been an explosion of access to the core data underpinning their products.

The Deutscher Wetterdienst (DWD) makes its data freely available through its Climate Data Center (CDC).

Below I sketch the approach I used to pull down 175 years of historical monthly precipitation data for all stations across the country.

### Locating the Appropriate Data Sources

I decided to start by browsing the FTP catalogue. To my pleasant surprise, all the directories are well named and logically divided into gridded data, European data, and other products.

Navigating a few sub-directories down, I found the landing page for the individual stations, along with the accompanying metadata description of the data (“RR_Monatswerte_Beschreibung_Stationen.txt”).

If one were interested in only a few specific stations, then browsing the catalogue and manually downloading the zips would not be too much of an issue.

I wanted all the data (~3000 files), so the only reasonable option was a script-based download.

With some helpful advice from the users of the Unix & Linux community on Stack Exchange, I derived the following procedure.

### Extract the Filenames from HTML Source

Starting on the afore-mentioned page:

Right-click –> View Page Source –> Scroll to the Section Listing Filenames –> Copy Text into a File (in this example named source.txt)

The following steps can be run from the command line on UNIX, or, if you are using a recent version of RStudio, by inserting bash code chunks into an R Markdown file (.Rmd) and clicking Run.
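For reference, a bash chunk in an .Rmd file looks like the sketch below (the command inside is just a placeholder; any shell command works there):

````markdown
```{bash}
# RStudio executes this chunk through bash when you click Run
echo "hello from a bash chunk"
```
````

The `{bash}` header is what tells knitr/RStudio to hand the chunk to the shell rather than to R.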

```shell
# trim the source text to keep only the filenames
awk -F\" '{print $2}' source.txt > names.txt
```

Then we just need to prepend the desired prefix to get the full path of each download:

```shell
awk '{print "ftp://ftp-cdc.dwd.de/pub/CDC/observations_germany/climate/monthly/more_precip/historical/"$0;}' names.txt > filelist.txt
```
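To see what the two awk steps actually do, here is a self-contained example on a single fabricated line of the kind found in the FTP index source (the filename is invented):

```shell
# a made-up line mimicking the FTP index HTML
line='<a href="monatswerte_RR_00001_hist.zip">monatswerte_RR_00001_hist.zip</a>'

# splitting on double quotes, field 2 is the href value, i.e. the filename
name=$(echo "$line" | awk -F'"' '{print $2}')
echo "$name"
# → monatswerte_RR_00001_hist.zip

# prepending the FTP prefix yields the full download URL
url=$(echo "$name" | awk '{print "ftp://ftp-cdc.dwd.de/pub/CDC/observations_germany/climate/monthly/more_precip/historical/" $0}')
echo "$url"
```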

We pass filelist.txt to wget 1 with the argument -i, which iterates over the list.

```shell
wget -i filelist.txt
```

### Extract Files and Clean-Up

Now we can use a simple loop to extract the zips and move the files into separate directories by type (i.e. metadata, raw data, original zips, etc.).

```shell
# extract every archive in the current directory
for i in *.zip; do
  unzip "$i"
done

# sort the files into directories by type
mkdir zips data metadata
mv mo* zips        # the original downloaded zips
mv pro* data       # the extracted raw-data files
mv Meta* metadata  # the extracted metadata files
```
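If the prefix globs look opaque, here is a throwaway demo of the sorting step on invented filenames that mimic the DWD naming pattern (zips start with "mo", raw data with "pro", metadata with "Meta"):

```shell
# scratch directory with fake files standing in for the extracted archives
mkdir -p sort_demo && cd sort_demo
touch monatswerte_00001_hist.zip produkt_rr_00001.txt Metadaten_Stationsname_00001.txt

mkdir -p zips data metadata
mv mo* zips        # matches monatswerte_00001_hist.zip
mv pro* data       # matches produkt_rr_00001.txt
mv Meta* metadata  # matches Metadaten_Stationsname_00001.txt

ls zips data metadata
```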

### Outlook

After a few patient hours all the zips were safely downloaded and the real analysis could begin!

1. “GNU Wget is a free software package for retrieving files using HTTP, HTTPS, FTP and FTPS, the most widely-used Internet protocols.”