In a recent project at work I did an analysis of the spread of an alien species in Norway using ESRI ArcGIS 10.1 SP1. In this particular analysis we assumed that the species could swim a certain number of meters in open sea. How would it spread, and to what extent would current protected areas be invaded by this overseas stranger to our environment? The density of islands in the Norwegian archipelago is massive, so the potential for the alien species to spread is rather overwhelming.
As part of the analysis I ended up drawing buffers around islands in the Norwegian archipelago, after which it was necessary to merge and dissolve the objects. This turned out to be problematic: for some of the shapefiles I was working with, ArcGIS (arcpy and Python) simply failed to complete the dissolve operation.
After contacting our local ESRI representative, Geodata AS in Norway, they concluded that this was related to the following error in ArcGIS 10.1: NIM079373: Running a large number of features through Dissolve, or Buffer with the dissolve option, hangs during processing. I have not found any publicly available information with this reference.
One could say that 7283 polygons is a tall order. One could perhaps also say that working with polygons rather than raster in a task like this is asking for problems. Given enough time I will look into it later, in that quiet week when nothing else is going on at work, sometime.
This blog post is about how I came to understand more about the limitations and possibilities of the ESRI arcpy Dissolve_management tool. It also explains how I found a rather surprising way to make it faster.
Buffering around a few thousand islands in different regions and then dissolving them to one object works fine most of the time. But for two of the regions, Dissolve_management simply stopped processing. The same thing happened when I tried the operation from ArcGIS Desktop.
Since I really had to make this work, I tried different ways to fix it. Googling gave me some answers, but my dissolve still hung. So I divided the input file in two halves and tried the dissolve operation on each. Both dissolve operations succeeded. I then merged the resulting files before doing a final dissolve. It worked. So somewhere in that fuzzy code within ArcGIS (arcpy and Desktop) there is a tripwire stopping the operation.
To handle this in general I wrote a function which splits the input file into smaller files of a given group size. The objects in these files are dissolved individually. The resulting files are then merged into one file and dissolved. And just to remind the reader: I was still using the arcpy.Dissolve_management function. The figure below explains the procedure:
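In outline, the splitting step can be sketched in plain Python. This is a minimal sketch, not the original script: the function names in the commented driver (dissolve, merge) are hypothetical stand-ins for wrappers around arcpy.Dissolve_management and arcpy.Merge_management.

```python
def chunked(features, group_size):
    """Split a list of features into consecutive groups of at most group_size."""
    return [features[i:i + group_size]
            for i in range(0, len(features), group_size)]

# Hypothetical driver, assuming dissolve() and merge() wrap the
# corresponding arcpy tools and operate on per-group shapefiles:
#
#   parts = [dissolve(group) for group in chunked(features, 80)]
#   result = dissolve(merge(parts))
```

With 7283 polygons and a group size of 80 this yields 92 groups, the last one holding only the 3 leftover polygons.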
Running the “new” dissolve function I noticed that the time the whole process took varied. I had initially expected the whole process to take longer than it would using the ordinary functionality.
It turned out that varying the group size had a rather big impact on the time the process took, in a positive way. I got curious and added timers around all dissolve operations in the script. My expectation of lower performance was not met: the new procedure was faster. I also prepared a batch script doing the operation on the same input file with group sizes varying from 10 to 600.
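A minimal timing harness for such a batch run could look like the sketch below. Here run_dissolve is a hypothetical stand-in for whatever function performs the grouped dissolve at a given group size; it is not part of the original script.

```python
import time

def time_group_sizes(run_dissolve, group_sizes):
    """Time run_dissolve(size) for each group size; return {size: seconds}."""
    timings = {}
    for size in group_sizes:
        start = time.perf_counter()
        run_dissolve(size)
        timings[size] = time.perf_counter() - start
    return timings

# Example: sweep group sizes from 10 to 600 in steps of 10, then
# pick the fastest one:
#   timings = time_group_sizes(grouped_dissolve, range(10, 601, 10))
#   best = min(timings, key=timings.get)
```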
The result, based on nordland_buffer.shp, was interesting, and to make sure this was not just about that one file I did an additional analysis on a similar file (agder_buffer.shp). The figure below gives you the general idea of what happened:
So what’s at the bottom of the curve? It looked like there is a minimum time for the job when the objects are grouped at a certain size. I was right…
That’s it… the minimum time for these jobs is when the objects are in groups of around 80. The dissolve operations then take only 25 to 28 seconds for the tested files. The required padding (merge and delete operations) around the essential dissolve processes adds marginally to the time used.
How does this compare to using the dissolve function directly, without the above-mentioned function? The Nordland dataset hangs, but fortunately the Agder dataset runs through. The total time used for dissolving that dataset is 149.5 seconds!
Could I be missing something here? Or is the dissolve function from ESRI a rather sub-optimal one, which can be made more efficient and stable simply by grouping the input file in sizes of around 80 objects?
There are a host of reasons which could confuse the above picture. Here are some of them:
- The number of overlapping polygons in the input files can have a big impact. The issue with a sub-optimal dissolve function might not be relevant for an input file with fewer overlapping polygons. This example is extreme, with 50+ overlapping polygons.
- The remainder when the total number of polygons is divided by the group size will vary. This might have implications for the total time of the dissolve operation.
- The files I have used might be very unorthodox.
- The computer was in use while the calculations were made. Other activity might have influenced the time used.
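To illustrate the remainder point: with the 7283 polygons from the example above, the size of the last, partial group depends on the chosen group size, so nearby group sizes can leave quite different leftovers for the final dissolve.

```python
# Size of the final, partial group for a few group sizes,
# given the 7283 polygons in the example:
for size in (70, 80, 90, 110):
    print(size, 7283 % size)
# 70 and 80 leave only 3 polygons over; 90 leaves 83.
```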
Since working through this problem in December last year I have happily concluded my alien species project, so this issue is no longer my concern. If someone finds the above of interest I would be curious to get some feedback. I am always eager to make arcpy scripts go faster. I am of course also interested in other approaches, using for example open source libraries.
If my assumptions hold water, I suggest that the ESRI guys and girls sit down and rework their dissolve function. It is basically sound, but something is amiss. And when the original function hangs, they should give the user some feedback about it. How difficult could it be to implement a failsafe?
To allow for further testing and experimenting I have included the files used in this little experiment:
[wpfilebase tag=list id=9 tpl=table pagenav=1 /]
I would also like to point you to the following discussion on http://gis.stackexchange.com/:
At last I would like to thank my colleagues Johan Danielsen and Martin Bartnes for their contributions and help in the process of understanding the shortcomings of the dissolve functionality in ArcGIS.
Did an update on the article. Please consider if your comment is still valid.
Nice one Ragnvald! Dissolving 7283 polygons is a big job and it’s good to hear about your work-around. But why does dissolving polygons in groups of 10 take longer than dissolving them in groups of 110? I would have guessed small groups take very little time and large groups take longest. Weird??
Only ESRI would consider a dissolve of 7000 buffer polygons “large”!
You might like to have a look at JEQL, which is built on the JTS Topology Suite library for geometry processing. It provides a flexible and performant way of running spatial operations on datasets.
I ran the agder_buffer.shp file through a JEQL script to union all the polygons, and it completed correctly in 1.5 sec. The JEQL script to do this is:
ShapefileReader t file: "agder/agder_buffer.shp";
t = select geomUnionMem(GEOMETRY) g from t;
That’s fast, Martin! Is JEQL available as a library under Python?
JEQL is really its own standalone language and engine at the moment. It runs on the JVM, but really that’s invisible to the user, since they never interact with Java directly.
For integration with Python I’d recommend checking out Sean Gillies’ work on Shapely. It uses the GEOS C++ library as its geometry engine. GEOS is a port of JTS, so it provides similar functionality (and close, but not quite as good, performance).
Wow … 1.5 sec is incredibly fast!
I also just tried this in QGIS (dissolving all the buffers at once) and it did it in ~ 41 seconds (OSX 10.7, QGIS 1.8, 8GB RAM and SSD). It’d be interesting to adapt the script to run in QGIS and compare your results!
YES! The dissolve process is sub-optimal. I’ve encountered issues with the dissolve operation since ArcGIS v 9.0. I haven’t had the opportunity to do any testing since 10.1 – which given some extra time might be worth doing, as it’s been a thorn in my side for some time. The issues I’ve encountered haven’t so much been the process hanging as the introduction of splits in the dissolved geometry. Dissolving geometry over a large extent with a large number of features actually introduces split features. For example, a polygon that didn’t have another feature to dissolve with might end up becoming 2 features. When these split features are examined, it is obvious there is some tiling schema that introduces these splits, resulting in geometries not being dissolved but actually split. Manually splitting up large data sets is one way around it, but is this really acceptable for such basic functionality? Even a feature class with 1000s of records and many overlapping polygons should be processed as expected, but as with ANY ArcGIS tool, careful examination of results is always warranted. Curious about the alternatives to ArcGIS suggested here – maybe someday I can devote the time to explore. Thanks for the tips!
I’m not sure how you would go about “adapting the script”, but I believe QGIS uses GEOS for geometry processing, and GEOS has the same cascaded union functionality as JTS (which is what JEQL uses). So you could probably get this to work in the same way. Perhaps GEOS is exposed to QGIS Python?
Your unary union is speedy. On my old laptop:
>>> import time
>>> from fiona import collection
>>> from shapely.geometry import shape
>>> from shapely.ops import cascaded_union
>>> def dissolve(c):
...     ta = time.time()
...     u = cascaded_union([shape(f['geometry']) for f in c])
...     print time.time() - ta
...     return u
...
>>> c = collection("/Users/seang/Downloads/agder/agder/agder_buffer.shp", "r")
>>> u = dissolve(c)
Well, it is a pretty fast machine – 3.4 GHz 8-core (although the process is single-thread only).
On my creaky old 2 GHZ machine I get about 4.5 sec.
The difference may be the performance penalty for C memory allocation in GEOS, as opposed to Java’s highly optimized memory management.
I meant that I think 8 seconds is fast considering that my laptop is 4 years old. And on top of the slow memory allocation, Fiona reads and formats all attributes as JSON by default (I didn’t read just the geoms).
Right, the main thing is that they both blow the doors off Arc!
Still memory-bound, though – but I have some tricks up my sleeve for dealing with that problem…
I encountered problems with Dissolve_management recently while trying to convert a python script that worked in 9.3 into one that would work in 10.1. The dissolve worked fine in 9.3 but crashed python with the 10.1 version. I then tried the dissolve using ArcMap and the Arc Toolbox and it locked up giving me a dialog to report the bug to ESRI.
The thing is that I was working in a virtual XP VM environment with limited RAM (4GB). The dissolve worked for others testing it for me on better machines with the same layer (only 812 features in my case), so I’m still looking into it. Thanks for the insight. This may be useful for me.
I have uploaded an example using multiple (4) processes at once using FME Desktop 2013. For the record this takes 4.1 seconds on my computer on the Nordland dataset and took 4 minutes to create….
Files and workspace can be downloaded here. http://depositfiles.com/files/tcxuedugu
Looks like there are more people having this issue.
Thanks for your time and effort looking at the Geoprocessing Dissolve tool in 10.1 SP1. We’ve been working hard on this and other tools to improve their performance.
I’m happy to inform you that for 10.2 we continued to work on the performance of the Dissolve tool to maintain the quality of our output while at the same time dramatically improving the performance of cases such as the ones you used in this post. When running a Dissolve all operation in ArcGIS 10.2 against the data you provided we are seeing the following performance of the Dissolve tool:
(Specs – Win7 64; 2x Dual Core 3.14 GHz CPU; 8GB RAM)
Dissolve ALL times:
- agder_buffer.shp: under 7 seconds
- nordland_buffer.shp: under 6 seconds
If anyone finds a Geoprocessing tool that they feel is not performing as expected against their data, please contact us directly and if possible share it with us so we can do all in our power to make things better.
Thank you very much for your feedback! As always I will be looking forward to the next version of ArcGIS. ESRI products are central to the work I am doing and much of it would not be possible without 🙂
I would of course also urge you to use and include relevant open source libraries in your software – for the benefit of your users, ESRI itself, and the communities supporting open source software!
ESRI’s implementation is definitely very slow. I’m not sure why. I was trying to do the buffer and dissolve-all using ESRI’s tools and ended up switching to Shapely. I was taking hundreds of thousands of points, buffering them, and then dissolving them (so 7000 is nothing). In a quick test recently, about 11k points took 3 minutes 59 seconds using ESRI’s buffer function, versus 13 seconds with my Shapely implementation.
So I tried using the arcpy Geometry class to see if I could speed it up, but buffering the geometries individually and combining them with the union function actually took way longer than the built-in buffer. The buffer seemed as quick, but the union was a lot slower. I don’t think there’s any way to speed up ESRI’s implementation.