Archive | July 2012

OSM in CouchDB on Raspberry Pi

The Raspberry Pi is a cheap, credit-card-sized ARM computer with a 700 MHz CPU and 256 MB of RAM that draws only about 3.5 W of power. It costs about 30€ and runs an optimised Debian Linux named “Raspbian”. Ever since I heard about it, I have wanted to see how it performs with a CouchDB installation. CouchDB is a document-based database that should perform well under low-RAM conditions, which makes it a good match for the Raspberry Pi. A spatial extension named Geocouch adds support for a spatial index.

Benefits

What could be the benefits of this kind of system? Clearly, the benefits lie in a system architecture that updates the data only seldom but queries it regularly. Moving an OSM database to autonomous hardware makes the system independent of the main computer. Also, depending on further research into possible queries, such a system could prove to be a versatile information gateway for OSM data and take some load off the heavily used default APIs offered by the OSM project. With a setup like this, anyone could run an information-delivering API that is cost-effective to set up and to maintain.

Set Up

So, it was time to test this configuration against an OpenStreetMap dataset! Setting up CouchDB was not 100% straightforward: after installing the package I had to modify the start-up script because there was a problem with the ownership of an automatically created directory. Geocouch I had to compile myself, and I again had to modify the start-up script, since the method proposed in the readme of the Geocouch project did not work for me. (I will not go into more detail about setting up this environment here, but maybe I will write a post about it later on.)

CouchDB running on Raspberry Pi

On the hardware side, in addition to the 8 GB SD card that held the system, I attached a USB HDD with 400 GB of space. I had to change the CouchDB configuration to relocate the storage of the database and its views to a directory on the USB device.

Used Data

I had two datasets at hand: first, all points in the area of Vienna, which sum up to 445220 entries, and second, the complete OSM dataset for Austria.

Preparing Data

To convert OSM data to JSON format and batch-upload it to the CouchDB I used the method and tools described on the OSMCouch page of the OpenStreetMap Wiki. It uses the great OSMIUM framework in combination with a custom description-file to generate GeoJSON compatible output. Preparing the dataset for the extent of Vienna was not that time-consuming, whereas the one for Austria took quite some time to process and resulted in a 6.1GB JSON file. The pre-processing of these datasets was not done on the Raspberry Pi but on a much more powerful computer.

Upload to CouchDB

Uploading was done with a little script named “chunkybulks.py”, which is also mentioned on the OSMCouch page. All this script does is take a huge JSON file and upload it to a specified CouchDB database in chunks of 10,000 entries (the chunk size can be specified manually). If one tried to upload a big file at once, the server most probably could not cope with it. On the same page there is a note saying that the software ImpOSM will soon come with integrated CouchDB support, but since it was not yet available I stuck to the old but still reliable method.

Bulk uploading with the “chunkybulks.py” script from user “emka” (the display “of 0” appears because the total number of entries could not be determined quickly, since the file is processed sequentially)

This method worked fine when uploading the Viennese points, but with the complete dataset of Austria it crashed after 450,000 inserted entities. I guess this was because the CouchDB server could not respond fast enough with so large a bulk size (10,000). After I changed it to 1,000, entries were uploaded, but it was terribly slow.
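The chunked upload can be sketched roughly as follows. This is not the “chunkybulks.py” script itself but a minimal illustration of the same idea, posting batches to CouchDB's `_bulk_docs` endpoint. It assumes the input holds one JSON document per line, which may not match the actual OSMCouch output; the function names are made up for this example.

```python
import json
import urllib.request
from itertools import islice


def chunks(docs, size):
    """Yield lists of at most `size` documents from any iterable."""
    it = iter(docs)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def bulk_payload(batch):
    """Build the request body that CouchDB expects at /<db>/_bulk_docs."""
    return json.dumps({"docs": batch}).encode("utf-8")


def upload(lines, db_url, size=1000):
    """POST documents to CouchDB in chunks of `size`.

    Assumption: `lines` yields one JSON document per line.
    """
    docs = (json.loads(line) for line in lines if line.strip())
    for batch in chunks(docs, size):
        request = urllib.request.Request(
            db_url.rstrip("/") + "/_bulk_docs",
            data=bulk_payload(batch),  # a request with a body is sent as POST
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(request) as response:
            response.read()
```

Because the documents are read lazily, only one chunk is held in memory at a time, which is why a smaller `size` eases the load on a weak server at the cost of more requests.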

Index Generation, Queries

Now came the critical part. Simply put (maybe too simply, but it is good enough to get the general idea), queries in CouchDB are predefined by JavaScript functions. Such a predefined query is called a “view”. For faster access, CouchDB generates an index per view. Generating the index is very time-consuming, but once the index is available any future query will be very fast. I’m especially interested in the performance of this kind of query, which relies on an already built index.

So, I had to create a view, preferably a spatial view (for a more detailed explanation of this topic, please take a look at the Geocouch documentation), and execute it once to trigger the generation of an index. As expected, generating the index for all 445220 entries in the Vienna dataset took hours. The index generation for the dataset of Austria took days.
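To give an idea of what such a spatial view looks like, here is a small sketch that builds a design document and a bounding-box query URL. It follows the pattern I know from the Geocouch readme of that time (a "spatial" member in the design document, queried through `_spatial` with a `bbox` parameter); the names "main" and "points" and the assumption that every document has a GeoJSON "geometry" member are only illustrative, and other Geocouch versions may use a different endpoint.

```python
import json
from urllib.parse import urlencode

# JavaScript spatial function, stored as a string in the design document.
# Assumption: every document carries a GeoJSON "geometry" member.
SPATIAL_FUNCTION = """
function(doc) {
  if (doc.geometry) {
    emit(doc.geometry, doc._id);
  }
}
"""


def design_document(name="main", view="points"):
    """Return a design document holding one spatial function."""
    return {
        "_id": "_design/" + name,
        "spatial": {view: SPATIAL_FUNCTION.strip()},
    }


def bbox_query_url(db_url, bbox, name="main", view="points"):
    """Build the URL for a bounding-box query (west, south, east, north)."""
    west, south, east, north = bbox
    query = urlencode(
        {"bbox": "%s,%s,%s,%s" % (west, south, east, north)}, safe=","
    )
    return "%s/_design/%s/_spatial/%s?%s" % (
        db_url.rstrip("/"), name, view, query)


if __name__ == "__main__":
    print(json.dumps(design_document(), indent=2))
    print(bbox_query_url("http://localhost:5984/osm", (16.2, 48.1, 16.5, 48.3)))
```

The first request against the query URL is the one that triggers the slow index generation; later requests are served from the index.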

When querying the dataset, a short delay is noticeable, but the response comes quite fast considering the power of the Raspberry Pi and the size of the database.

Going Further

One interesting question is: What will happen if I use the complete world-file? Given enough hard-disk space and time, the upload and generation of indices should be possible – but how fast will the query be?

Another thing which will be interesting is how the import method of the future version of Imposm will perform, especially since it has support for diffs!

July 20, 2012

in GeoData, OpenStreetMap

Preparing the OpenGovernment TreeCadastre of Vienna for OSM-import (2)

As Friedrich Volkmann from the Austrian OSM mailing list pointed out, the entries in the OpenGovernmentDataset “Baumkataster” include not only trees but also shrubs. So this dataset would better be named “Wood Cadastre” than “Tree Cadastre”. The problem is that the definition of the OSM tag “natural=tree” only covers trees, so an additional filtering mechanism has to be applied. Friedrich proposed making the decision based on the height of the tree in relation to its age.

I implemented his proposal by checking whether the tree is no taller than 2 meters while being at least 3 years old:

height <= 2 and (int(datetime.datetime.now().year) - year) >= 3

In addition, it would be best to define separate rules for each of the more than 90 types of trees. But this is a huge amount of work, and in my opinion it is questionable whether it would lead to better results. After some experiments, it all comes down to only a handful of trees that would be excluded. With the current general implementation this comes to 678 trees.

It is important to mention that any tree excluded by this method is not ignored but still imported. The difference is that it is assigned a special tag: “fixme=Baum oder Strauch” (tree or shrub). The real habit of the plant can only be determined by checking manually.
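Wrapped into functions, the check and the resulting tag could look like the sketch below. The function names and the `current_year` parameter (added so that the result can be reproduced for a fixed year) are not part of the original script.

```python
import datetime


def is_possible_shrub(height, year, current_year=None):
    """True if the plant is at most 2 m tall although it is at least 3 years old."""
    if current_year is None:
        current_year = datetime.datetime.now().year
    return height <= 2 and (current_year - year) >= 3


def extra_tags(height, year, current_year=None):
    """Return the fixme tag for doubtful plants, or no extra tags at all."""
    if is_possible_shrub(height, year, current_year):
        return {"fixme": "Baum oder Strauch"}
    return {}


if __name__ == "__main__":
    # A 1.5 m plant from 2005, evaluated for the year 2012.
    print(extra_tags(1.5, 2005, current_year=2012))
```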

Andreas Trawoeger is currently working on a yet-to-be-released live preview map that overlays the trees of the OGD dataset on the ones already in OSM. This will give a better overview of the extent of the dataset in question.

July 13, 2012

in GeoData, OpenStreetMap

Preparing the OpenGovernment TreeCadastre of Vienna for OSM-import (1)

The city of Vienna has opened access to some of its geodata to the public. The license under which it is published is compatible with OpenStreetMap, so there should be no legal reason not to include any of it in the OSM database. One of these datasets is the cadastre of trees. (For the geometrical analysis with maps, scroll down!)

Choosing the Format

The cadastre of trees can be downloaded in various formats, among them GML, JSON, Shapefile, KML, GeoRSS and CSV. At first I went with the Shapefile format, since it is well proven and can be accessed from many programming languages. But for reasons explained later, CSV is the format to go with.

Attribute Data

I used QuantumGIS to inspect the downloaded data.

Looking at the Data-Structures

When looking at the data structure of the file, one can see the following columns:

tree-number (“BAUMNUMMER”): a unique number by which the tree can be identified unmistakably
area (“GEBIET”): the kind of surrounding of the tree
street (“STRASSE”): name of the street where the tree is located
type (“ART”): a string consisting of the Latin name, the cultivar and the German name
year of plantation (“PFLANZJAHR”)
circumference of the stem (“STAMMUMFANG”): apparently in centimeters
diameter of the crown (“KRONENDURCHMESSER”)
height (“BAUMHÖHE”): the height of the tree in meters
geometry: the actual position of the tree in geographical lat-long

A quick glance at the page for the tag “natural=tree” in the OSM wiki gives an overview of the proposed tags for trees:

type: This just distinguishes between “broad_leaved”, “conifer” and “palm” trees. This information has to be derived from the “ART” field of the OGD dataset.
genus: The genus is just the first part of the Latin name and has to be extracted from the “ART” field.
species: Here the complete Latin name is stated.
taxon: The taxon describes the taxonomy of the tree in greater detail. More information about this can be found on this OSM-wiki page.
sex: The sex of the tree
circumference: The circumference of the stem in meters.
height: The height in meters.
name: The name tag should only be used when it describes a very special tree.

Converting the Data-Structures

The tree number may be left out. It would make it possible to identify the tree later on when applying updates to the imported dataset, but since there is no tag recommended for data like this, it would add inconsistency to the OSM database. Also, any later update can identify the tree by its location. There is no information about the sex or the name of the tree, so these tags are left out. The circumference in the OSM database is measured in meters and refers to the stem. So this value is taken from the “STAMMUMFANG” field, which is apparently in centimeters and needs to be converted. Height is the same in both datasets. The diameter of the crown has no appropriate tag in the OSM naming scheme. This is unfortunate, since I see many cartographic possibilities for this value. I decided to still include it in the import by using a tag called “diameter_crown”, as proposed on the tree 3D visualisation page in the OSM wiki.
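The mapping described above can be sketched as a small function. It assumes that each record arrives as a dictionary keyed by the column names of the OGD file (as `csv.DictReader` would deliver it). The function name is hypothetical, and the crown diameter is copied unchanged because its unit is not stated in the dataset description.

```python
def convert_row(row):
    """Map one OGD record (dict keyed by column name) to OSM tags."""
    tags = {"natural": "tree"}

    # STAMMUMFANG is apparently given in centimeters, OSM expects meters.
    circumference_cm = row.get("STAMMUMFANG")
    if circumference_cm not in (None, ""):
        tags["circumference"] = "%g" % (float(circumference_cm) / 100.0)

    # Height is in meters in both datasets and is copied as it is.
    height = row.get("BAUMHÖHE")
    if height not in (None, ""):
        tags["height"] = str(height)

    # No established OSM tag exists; "diameter_crown" is used instead.
    crown = row.get("KRONENDURCHMESSER")
    if crown not in (None, ""):
        tags["diameter_crown"] = str(crown)

    # BAUMNUMMER, GEBIET and STRASSE are deliberately not carried over.
    return tags


if __name__ == "__main__":
    example = {"BAUMNUMMER": "17", "STAMMUMFANG": "120",
               "BAUMHÖHE": "8", "KRONENDURCHMESSER": "5"}
    print(convert_row(example))
```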

Extraction of Genus and Species

The genus and species have to be extracted from the “ART” field. This is done with a Python script. The “ART” field is just a string that contains the complete Latin name, sometimes the cultivar in single quotes, followed by the German name in parentheses. An example: in the string

Tilia cordata 'Greenspire' (Stadtlinde)

“Tilia” corresponds to the genus and the species is “Tilia cordata”. “Greenspire” is the cultivar and “Stadtlinde” is the German name.
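One possible way to take such a string apart is a regular expression, as in the sketch below. This is not the original script, only an illustration that assumes the layout described above (Latin name, optional cultivar in single quotes, optional German name in parentheses); the names `ART_PATTERN` and `parse_art` are made up.

```python
import re

ART_PATTERN = re.compile(
    r"""^\s*
        (?P<latin>[^'(]+?)              # Latin name, up to a quote or bracket
        \s*(?:'(?P<cultivar>[^']*)')?   # optional cultivar in single quotes
        \s*(?:\((?P<german>[^)]*)\))?   # optional German name in parentheses
        \s*$""",
    re.VERBOSE,
)


def parse_art(art):
    """Split an ART string into genus, species, cultivar and German name."""
    match = ART_PATTERN.match(art)
    if match is None:
        return None
    latin = match.group("latin").strip()
    return {
        "genus": latin.split()[0],
        "species": latin,
        "cultivar": match.group("cultivar"),
        "german": match.group("german"),
    }


if __name__ == "__main__":
    print(parse_art("Tilia cordata 'Greenspire' (Stadtlinde)"))
```

Parts that are missing in the input come back as `None`, so entries without a cultivar need no special handling.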

The Cultivar / Taxon

The taxon is a bit more challenging. According to the OSM wiki page for taxon, it may contain any Latin specification of the botanical name, even the cultivar. The botanical name can also be split into its parts by using sub-tags like taxon:cultivar=*. It is a bit unclear to me whether to use the genus/species tags or go with only “taxon:species” and “taxon:genus”. I consider it best practice to stick with plain “genus” and “species” and include the cultivar as “taxon:cultivar”. The taxon itself is also extracted with the help of a Python script. Some entries contain two cultivars separated by a comma. This disturbs the dissection of the “ART” field, and it does not make sense to include two cultivars in the OSM database anyway. Therefore, the values posing problems are identified manually and corrected in the input CSV before it is processed with the Python script. These values and their chosen replacements are:

"Sumach, Essigbaum" -> Essigbaum
"Kiefer, Föhre" -> Kiefer
"Schwarzkiefer, Schwarzföhre" -> Schwarzkiefer

[edit]

There are two more entries that need to be changed:

“Malus spec. ,Apfel” -> Malus spec. (Apfel)

“Juglans nigra, Schwarznuss” -> Juglans nigra (Schwarznuss)
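I made these corrections by hand, but they could just as well be applied in code. The following sketch shows that alternative with a plain replacement table; the function name is made up, and the example strings in the usage part are invented to show the effect.

```python
# Problematic values and their replacements, as listed above.
REPLACEMENTS = {
    "Sumach, Essigbaum": "Essigbaum",
    "Kiefer, Föhre": "Kiefer",
    "Schwarzkiefer, Schwarzföhre": "Schwarzkiefer",
    "Malus spec. ,Apfel": "Malus spec. (Apfel)",
    "Juglans nigra, Schwarznuss": "Juglans nigra (Schwarznuss)",
}


def normalise_art(art):
    """Replace the known problematic parts of an ART string."""
    for old, new in REPLACEMENTS.items():
        art = art.replace(old, new)
    return art


if __name__ == "__main__":
    # Invented example values, only to show the effect.
    print(normalise_art("Pinus nigra (Schwarzkiefer, Schwarzföhre)"))
    print(normalise_art("Juglans nigra, Schwarznuss"))
```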

Determining the Type

The type is not stated explicitly in the OGD dataset but can be determined by looking at the genus of the tree. For this purpose a list of comparisons is used inside the Python script:

 if genus == "": ttype = ""
 if genus == "abies": ttype = "conifer"
 if genus == "acer": ttype = "broad_leaved"
 if genus == "aesculus": ttype = "broad_leaved"
 if genus == "ailanthus": ttype = "broad_leaved"
 if genus == "albizia": ttype = "broad_leaved"
 if genus == "alnus": ttype = "broad_leaved"
 if genus == "amelanchier": ttype = "broad_leaved"
 if genus == "araucaria": ttype = "conifer"
 if genus == "baumgruppe": ttype = ""
 if genus == "betula": ttype = "broad_leaved"
 if genus == "broussonetia": ttype = "broad_leaved"
 if genus == "buxus": ttype = "broad_leaved"
 if genus == "calocedrus": ttype = "conifer"
 if genus == "caragana": ttype = "broad_leaved"
 if genus == "carpinus": ttype = "broad_leaved"
 if genus == "castanea": ttype = "broad_leaved"
 if genus == "catalpa": ttype = "broad_leaved"
 if genus == "cedrus": ttype = "conifer"
 if genus == "celtis": ttype = "broad_leaved"
 if genus == "cercidiphyllum": ttype = "broad_leaved"
 if genus == "cercis": ttype = "broad_leaved"
 if genus == "chamaecyparis": ttype = "conifer"
 if genus == "cladrastis": ttype = "broad_leaved"
 if genus == "cornus": ttype = "broad_leaved"
 if genus == "corylus": ttype = "broad_leaved"
 if genus == "cotinus": ttype = "broad_leaved"
 if genus == "cotoneaster": ttype = "broad_leaved"
 if genus == "crataegus": ttype = "broad_leaved"
 if genus == "cryptomeria": ttype = "conifer"
 if genus == "cupressocyparis": ttype = "conifer"
 if genus == "cupressus": ttype = "conifer"
 if genus == "cydonia": ttype = "broad_leaved"
 if genus == "davidia": ttype = "broad_leaved"
 if genus == "elaeagnus": ttype = "broad_leaved"
 if genus == "eucommina": ttype = "broad_leaved"
 if genus == "exochorda": ttype = "broad_leaved"
 if genus == "fagus": ttype = "broad_leaved"
 if genus == "fontanesia": ttype = "broad_leaved"
 if genus == "frangula": ttype = "broad_leaved"
 if genus == "fraxinus": ttype = "broad_leaved"
 if genus == "ginkgo": ttype = "ginkgo"
 if genus == "gleditsia": ttype = "broad_leaved"
 if genus == "gymnocladus": ttype = "broad_leaved"
 if genus == "hibiscus": ttype = "broad_leaved"
 if genus == "ilex": ttype = "broad_leaved"
 if genus == "juglans": ttype = "broad_leaved"
 if genus == "juniperus": ttype = "conifer"
 if genus == "koelreuteria": ttype = "broad_leaved"
 if genus == "laburnum": ttype = "broad_leaved"
 if genus == "larix": ttype = "conifer"
 if genus == "liquidambar": ttype = "broad_leaved"
 if genus == "liriodendron": ttype = "broad_leaved"
 if genus == "maclura": ttype = "broad_leaved"
 if genus == "magnolia": ttype = "broad_leaved"
 if genus == "malus": ttype = "broad_leaved"
 if genus == "metasequoia": ttype = "conifer"
 if genus == "morus": ttype = "broad_leaved"
 if genus == "nadelbaum": ttype = "conifer"
 if genus == "ostrya": ttype = "broad_leaved"
 if genus == "parrotia": ttype = "broad_leaved"
 if genus == "paulownia": ttype = "broad_leaved"
 if genus == "phellodendron": ttype = "broad_leaved"
 if genus == "photinia": ttype = "broad_leaved"
 if genus == "picea": ttype = "conifer"
 if genus == "pinus": ttype = "conifer"
 if genus == "platanus": ttype = "broad_leaved"
 if genus == "platycladus": ttype = "conifer"
 if genus == "populus": ttype = "broad_leaved"
 if genus == "prunus": ttype = "broad_leaved"
 if genus == "pseudotsuga": ttype = "conifer"
 if genus == "pterocarya": ttype = "broad_leaved"
 if genus == "pyrus": ttype = "broad_leaved"
 if genus == "quercus": ttype = "broad_leaved"
 if genus == "rhamnus": ttype = "broad_leaved"
 if genus == "rhus": ttype = "broad_leaved"
 if genus == "robinia": ttype = "broad_leaved"
 if genus == "salix": ttype = "broad_leaved"
 if genus == "sambucus": ttype = "broad_leaved"
 if genus == "sequoiadendron": ttype = "conifer"
 if genus == "sophora": ttype = "broad_leaved"
 if genus == "sorbus": ttype = "broad_leaved"
 if genus == "tamarix": ttype = "broad_leaved"
 if genus == "taxus": ttype = "conifer"
 if genus == "tetradium": ttype = "broad_leaved"
 if genus == "thuja": ttype = "conifer"
 if genus == "thujopsis": ttype = "conifer"
 if genus == "tilia": ttype = "broad_leaved"
 if genus == "toona": ttype = "broad_leaved"
 if genus == "tsuga": ttype = "conifer"
 if genus == "ulmus": ttype = "broad_leaved"
 if genus == "zelkova": ttype = "broad_leaved"

The list of genera is complete, since I used the “List Individual Values” function of QuantumGIS to get all possible values for genus.
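The same lookup can be written more compactly with a dictionary. The sketch below contains only a few of the genera from the list above to keep it short; the names `GENUS_TYPES` and `tree_type` are not taken from the original script.

```python
# Excerpt only: the complete table would hold every genus listed above.
GENUS_TYPES = {
    "abies": "conifer",
    "acer": "broad_leaved",
    "baumgruppe": "",
    "picea": "conifer",
    "pinus": "conifer",
    "tilia": "broad_leaved",
}


def tree_type(genus):
    """Return the type for a genus, or an empty string if it is unknown."""
    return GENUS_TYPES.get(genus.strip().lower(), "")


if __name__ == "__main__":
    for name in ("Tilia", "Pinus", "Baumgruppe"):
        print(name, "->", repr(tree_type(name)))
```

A dictionary keeps the data apart from the logic, so adding a genus later means adding one entry instead of one more comparison.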

Converting Data to OSM-compatible Format

I tried to make the Python script work with the SHP file, but the Python module “pyshp” apparently has problems with the encoding, and the OGR module quits with a segmentation fault. Currently the script takes the CSV file as input and outputs a newly created SHP file. This file can be imported by JOSM using the “opendata” plugin and then be uploaded to the OSM database. But there is a problem: a shapefile can only hold short attribute names (the field-name length in its attribute table is limited), which truncated some names like “diameter_crown” to “diameter_cr”. So the way to go is to have the script write a CSV file again. This proved to be easy to implement. Sadly, JOSM does not import the CSV file but gets stuck during the process (this is also true for ODS files; in fact every format other than KML had some disadvantage, e.g. unsupported encoding, inclusion of the lat-lon as tags, …). So one can use QuantumGIS to convert the CSV to KML, which JOSM reads without any problems. The output CSV produced by the Python script has to be converted to UTF-8 encoding. Otherwise, QuantumGIS will not display special characters like “ä”, “ü” or “ö” and will remove them from the dataset. The conversion can be done with one of the many text editors available (e.g. on Linux: “Gedit” or “Geany”).
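Instead of a text editor, the conversion to UTF-8 could also be scripted. The sketch below assumes that the output CSV is encoded as Latin-1; I did not verify the actual encoding, so `src_encoding` has to be adjusted if it differs. The function and file names are made up for this example.

```python
def reencode(src_path, dst_path, src_encoding="latin-1"):
    """Copy a text file line by line and write it as UTF-8.

    Assumption: the source file is encoded as `src_encoding`.
    """
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)


if __name__ == "__main__":
    import sys
    # Usage: python reencode.py trees.csv trees_utf8.csv
    reencode(sys.argv[1], sys.argv[2])
```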

Geometry Information

It is important not to replace any existing trees in the OSM-database or create duplicate entries. Therefore, the data is analysed by using QuantumGis.

Coverage

Currently there are 2,996 trees mapped in Vienna, most of them in the 7th and 8th districts.

Many of these are located inside courtyards, so they do not collide with the OGD dataset, which only contains public trees (which in turn tend to stand on open streets rather than on private ground), as the example in the following graphic shows:

The OpenGovernmentDataset contains 120,951 trees located in all areas of Vienna:

It is easy to see that the overall coverage is much better with the OGD dataset. Additionally, the OSM dataset contains no information about tree types, height or other attributes.

Positional Accuracy

As the following exemplary graphic shows, many of the trees that are already mapped are located at nearly the same spot as their OGD counterparts.

This high positional accuracy makes it easy to identify and leave out any already existing trees. These trees will be collected in a separate file for later (manual?) processing. I made a positional check with buffers around the OSM trees, growing in 1-meter steps from 1 meter to 9 meters. The results are presented in the following table. The numbers are the points that overlap with the buffer. The “+ More” column shows how many more trees were selected compared to the buffer one meter smaller.

Buffer Size	# of Trees Contained	+ More
1 meter	1034
2 meter	1124	+ 90
3 meter	1136	+ 12
4 meter	1145	+ 9
5 meter	1158	+ 13
6 meter	1203	+ 45
7 meter	1227	+ 24
8 meter	1239	+ 12
9 meter	1250	+ 11

Increase of Number of Trees when Expanding Search Radius

As can be seen, more and more trees are selected as the search radius expands. Up to a buffer size of 5 meters, the number of additional trees selected is mostly decreasing. Beyond 5 meters it increases again, which may be because trees are counted twice where buffer zones overlap. The 5-meter value therefore defines the upper limit of trees not suitable for import. All trees within a search radius of 5 meters (better to choose the higher value to be sure) will not be imported. This results in a total of 1,158 trees that are not imported but set aside for later manual checking.
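I did this analysis with buffers in QuantumGIS. As a cross-check, the same kind of table could be produced with a brute-force distance test, sketched below. This is a different technique from the buffer overlay: it counts every OGD tree at most once, so the double counting suspected above cannot occur. Points are assumed to be (lat, lon) tuples, and the distance uses a simple equirectangular approximation, which is adequate at a scale of a few meters. All names are made up.

```python
import math

EARTH_RADIUS_M = 6371000.0


def distance_m(a, b):
    """Approximate distance in meters between two (lat, lon) points."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    x = (lon2 - lon1) * math.cos((lat1 + lat2) / 2.0)
    y = lat2 - lat1
    return EARTH_RADIUS_M * math.hypot(x, y)


def count_within(ogd_trees, osm_trees, radius_m):
    """Count OGD trees within radius_m of at least one OSM tree."""
    return sum(
        1 for tree in ogd_trees
        if any(distance_m(tree, other) <= radius_m for other in osm_trees)
    )


def buffer_table(ogd_trees, osm_trees, radii):
    """Return rows of (radius, count, increase over the previous radius)."""
    rows = []
    previous = None
    for radius in radii:
        count = count_within(ogd_trees, osm_trees, radius)
        rows.append((radius, count,
                     None if previous is None else count - previous))
        previous = count
    return rows
```

Comparing every OGD tree with every OSM tree is slow for datasets of this size, so a spatial index would be the next step if this were used in earnest.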

Preparation of Geometry Information

To exclude the unwanted trees and save them in a separate file for later processing, QuantumGIS can be used again.

Manual Refinement

After selecting the trees that can be imported, suspicious values like a ridiculously high “diameter_crown” have to be removed.

In JOSM the data can be refined even more. This step could have been included in the Python script, but it is quite easy to do manually. Some trees carry the tag “species=baumgruppe” (group of trees), which does not make sense. These “baumgruppe” entries will be included in the final upload, but only as “natural=tree” without any additional information. With JOSM we can search for “baumgruppe” and remove the undesired values from all matching trees at once. There are also some empty attributes. They can easily be found and removed with the “validator” plugin: select all elements and perform a validation, then select all reported problems and click on “Fix”. The empty attributes should be deleted automatically. To speed up this process I deactivated all but the needed checks in the options.
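If this refinement were moved into the Python script after all, it could look roughly like the sketch below. The threshold `max_crown_m` is a made-up value, since no concrete limit for a "ridiculously high" crown diameter was fixed, and the function name is hypothetical as well.

```python
def clean_tags(tags, max_crown_m=30.0):
    """Drop empty attributes, strip "baumgruppe" entries, remove odd crowns.

    Assumption: max_crown_m is an arbitrary limit chosen for illustration.
    """
    # Remove attributes without a value.
    cleaned = {key: value for key, value in tags.items()
               if value is not None and str(value).strip() != ""}

    # A "baumgruppe" is kept as a plain tree without further information.
    if str(cleaned.get("species", "")).lower() == "baumgruppe":
        return {"natural": "tree"}

    crown = cleaned.get("diameter_crown")
    if crown is not None:
        try:
            suspicious = float(crown) > max_crown_m
        except ValueError:
            suspicious = True  # non-numeric values are treated as suspicious
        if suspicious:
            del cleaned["diameter_crown"]
    return cleaned


if __name__ == "__main__":
    print(clean_tags({"natural": "tree", "genus": "Tilia",
                      "height": "", "diameter_crown": "250"}))
```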

Upload

By now the OGD tree data should be refined and ready to upload!

GISForge