Selecting Geospatial Software for a Project
IGIS Tech Notes describe workflows and techniques for using geospatial science and technologies in research and extension. They are works in progress, and we welcome feedback and comments below.
The Challenge of Selecting Software
One of the most common questions we get from customers looking for a geospatial solution for a research or extension project is what geospatial tools they / we should use. This conversation often pivots to a discussion about open-source vs. commercial tools. Some clients have strong preferences for one or the other, based on their experiences or perceptions. Inevitably, as we start to discuss the details of the project and the functionality requirements, the fuzziness and overlap between these two paradigms starts to emerge.
In this Tech Note, I discuss the similarities, differences, overlap, and misconceptions about open source and commercial geospatial tools. Part I begins with a review of some of the key characteristics of open source, debunking some of the common misconceptions. Part II looks at the question from a functionality perspective, which we have found to be the best way to think about the optimal set of software components for a specific project.
Part I: The Many Sides of “Open Source”
Open Source ⇒ Open Code
Even for geospatial developers like us who like to get under the hood of the tools we use, we only dig into the source code on rare occasions. We might do this if we need to debug something, use a poorly documented function, figure out a workaround, or build a companion function that calls or uses the input from another function. These are not common, but that being said, when you need to view source code to get over a hump, it’s extremely helpful to us (and by extension our customers).
As an example, when I was developing the Projected Chill Hours online calculator (designed for tree crop growers) using the open-source R ecosystem, one of the dependent packages (chillR) wouldn’t install on the server due to a problem with an underlying dependency. It wasn’t something I could fix, but by digging into the source code I was able to extract the functions required and bypass the troublesome dependency. This was possible not only because I could access the source code, but also because the license it was under allows for copying with attribution (see below). If I had hit that glitch with an ESRI product (commercial GIS software), I would have had to get on the phone with their Tech Support and hope that they knew a workaround or would release a patch.
A huge benefit of being able to view source code is the ability to see how data are being manipulated and reproduce the results. This is particularly important to researchers, as reproducibility is the gold standard in science, and is increasingly expected by reviewers as more and more science relies on complex computational models.
A good example of the value of reproducibility can be seen in the evolving world of drone photogrammetry. Like many groups, our go-to software for stitching drone images into a 3-dimensional model of the landscape is Pix4Dmapper (a commercial and proprietary product). Pix4D was an early leader in drone photogrammetry software, and can generate outputs that look amazing. The software also generates a quality report that provides quantifiable metrics of the stitching process. However, key elements of the underlying algorithms are proprietary and opaque, including the identification of key points that are the foundation of the underlying geometric 3D model, and the blending algorithm used to erase the overlap between images. Because the algorithms have not been published, researchers who want to understand the effects of the various stitching parameters have to resort to trial and error (and indeed many papers have focused exactly on this).
For some research applications, the details of the stitching process are not critical to understand or exactly reproduce. For example when the research output is a 3D model of the landscape or vegetation, and the question being investigated is not terribly sensitive to uncertainty at the scale of the data, it's good enough to report “this set of parameters for this version of the software did the best job in modelling the geometry of our field data”. If someone wants to reanalyze the data, they at least have metrics of spatial accuracy as a reference point.
However in other applications, the opacity of the software can be a deal breaker. UCCE Specialist Alireza Pourreza for example has studied the noise introduced by Pix4D’s black box blending algorithm used to combine overlapping multispectral images, and found it significant enough to seriously affect research conclusions. His team couldn’t diagnose the underlying issue, or recommend an alternative, because the algorithm is hidden. So instead they recommend not stitching images at all, and instead apply corrections and analysis to individual images. For reasons like this, more and more researchers are turning to open source photogrammetry platforms like Open Drone Map, which doesn’t always perform as well but at least you can see what’s going on and share a reproducible workflow.
A key element of what you can and cannot do with open source software is how it's licensed. Just because the code is accessible does not mean there are no rights or rules attached to it.
Most open source geospatial tools have permissive licenses that allow for both reproducibility and adaptation. These licenses generally require copies or adaptation of the code to attribute the source, and may require adaptations to have the same level of permissibility (known as “copyleft”).
It’s worth noting that open source licenses don’t preclude using those tools for commercial purposes. Indeed ESRI software (like many other commercial GIS products) uses open source libraries for core functionality like importing and exporting data, geoprocessing, data transfers, etc. However commercial isn’t the same as proprietary. ESRI can use open source libraries like GDAL and GEOS in their commercial products, and they can even make improvements on them. But they can’t place further restrictions on people who use those components of their software.
Licenses are not a detail most users have to worry about. However they’re an important consideration for researchers who are developing code they want others to use. The majority of research tools start out as open source so that other researchers can review, test, and improve the code base.
A big consideration for many end users in selecting software is whether a cost is involved. Open source software and open source licenses are generally free, which is a big part of their appeal. Indeed many people equate ‘open source’ with ‘free’ without realizing that free is a nice byproduct of open source, but not the soul or even the primary intent.
When selecting software to use, one should not underestimate the hidden costs beyond license fees, the biggest of which is probably the value of your time. Open source software generally has a steeper learning curve than commercial products, although this varies a lot. If your time is limited, or you’re low on the learning curve, you may wind up having to hire someone. Testing and debugging can also be extremely time consuming for both open source and proprietary software.
Open source software may or may not be more buggy than commercial counterparts, and may or may not have better documentation. Well established open source projects like QGIS, R, and Python have phenomenal user communities with lots of tutorials, installers, and development tools. Not surprisingly many of the more mature open source projects have corporate sponsors who contribute staff time to keep the projects coordinated and running. Well established commercial products like ArcGIS are in the same boat. But applications that just came out a year ago from a startup - maybe not. Hence a better way to estimate the hidden cost of software is to look at the strength and maturity of the user community and not simply whether it's proprietary or open source.
The ultimate hidden cost of geospatial software is when you have staff turnover and need to find someone to continue maintaining and developing the application. We’ve all come across dead websites, mapping applications that no longer work, software that no longer runs on modern operating systems or looks like it was designed in the 1980s, etc. A big reason our unit sticks with off-the-shelf products from ESRI and Google, and well-established open source projects like R and Python, is because we know these tools are widely taught and we can survive staff turnover. When we occasionally do foray into a specialized tool for a niche application, it's generally for research and not something we expect to have to maintain for years and years.
Part II: A Functionality Approach
In Part I of this article on geospatial software, we broke down the characteristics of “open source” vs. “commercial”, which are often the poles of discussion about software choices. This is good background info, but we find the most useful way to think about the optimal components for a geospatial “stack” is in terms of the functionality needed for the success of the project. This broadens the conversation beyond the technical characteristics of different options to the specific needs of a project and organization.
One-Stop-Shop vs. a Stack
One of the first questions we ask ourselves when talking to customers is whether there’s an off-the-shelf tool that has everything they need. Examples of “do everything” platforms in the GIS world include the ArcGIS ecosystem and Google Earth Engine. If one of these platforms fits the bill, or the scope of work can be tweaked to work within one of these platforms, that will almost always be our Plan A. Even if we don’t know every nook and cranny of the ecosystem, we know with a fair degree of confidence that we can figure those out.
But sometimes, even oftentimes, the one-stop-shop solution doesn’t cut it. A good example is GIS web apps. Web apps are essentially websites that include dials and levers for user interaction. GIS web apps almost always include a map (or two), and include everything from simple Story Maps, to dashboards, to fancy decision support tools. ArcGIS has templates for a wide range of web apps that are easy to build and fulfil a range of use cases. They work great, until they don’t. If you need to do something that isn’t natively supported, we’ve learned through trial-and-error that you may be better off building your solution from lower level components from the get-go. Yes, it’ll take more work (often a lot), and will be harder to maintain and repair, but if you really need customized functionality, user interface, or interoperability, using a stack of individual components will get you closest to the finish line.
Data access? Analysis? Visualization? All three?
Our bread-and-butter projects typically aim to translate research results into actionable info. The functionality for these projects can typically be clumped into three categories:
- Data. Do we need to provide access to the source data, and if so in what format? This is often contingent on the technical sophistication of the audience, and the degree to which customization is needed.
An example of a project that is centered heavily on facilitating access to research data is caladaptR, which simplifies the process of importing climate data from Cal-Adapt into R. The add-on gives the user several options in terms of formats and level of aggregation, but stops there. Once in R, the user is on their own. In other projects, a data download tool is bundled with visualization or analysis.
- Analysis. Transforming research results into actionable info almost always requires some analysis. In some cases, the researchers have made all the crucial decisions and the user just needs to be able to find and apply them. In other cases, there isn't a single analytical output that meets all needs, and users need access to the dials and levers. A good example of the latter is the Land Conservation Ranking Tool, which provides sliders for the user to tweak the weights of approximately 15 measures of conservation value.
- Visualization. Some people can make decisions from statistics or tables of numbers. The rest of us need things like graphs and maps to make sense of the world. In some projects, the visualization requirement is key and drives the choice of software. A good example of this is the CalLands web app, which makes visible patterns in ag land ownership. The researchers had designed some custom graphs which could only be built with custom programming, which we were able to deliver using a low-level plotting library. In other cases, the off-the-shelf mapping and graphing widgets that exist in almost every platform are sufficient.
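To make the "dials and levers" idea in the Analysis item concrete, here is a minimal sketch of a slider-driven weighted ranking. It is not the code behind the Land Conservation Ranking Tool; the parcel names, measure names, and the assumption that each measure is already normalized to a 0-1 scale are all hypothetical.

```python
def weighted_score(measures, weights):
    """Combine normalized measures (0-1) into one score using user-chosen weights."""
    total_weight = sum(weights.values())
    if total_weight == 0:
        raise ValueError("At least one weight must be greater than zero")
    # Weighted average, so the score stays on the same 0-1 scale as the inputs
    return sum(measures[name] * w for name, w in weights.items()) / total_weight


def rank_parcels(parcels, weights):
    """Return (parcel_id, score) pairs sorted from highest to lowest score."""
    scores = {pid: weighted_score(m, weights) for pid, m in parcels.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical parcels with three made-up measures of conservation value
parcels = {
    "parcel_a": {"habitat": 0.9, "water": 0.2, "farmland": 0.4},
    "parcel_b": {"habitat": 0.3, "water": 0.8, "farmland": 0.7},
}

# Slider positions: this user cares about habitat twice as much as the rest
habitat_first = {"habitat": 2.0, "water": 1.0, "farmland": 1.0}
# A different user ignores habitat entirely
no_habitat = {"habitat": 0.0, "water": 1.0, "farmland": 1.0}

print(rank_parcels(parcels, habitat_first))
print(rank_parcels(parcels, no_habitat))
```

Moving a slider only changes the `weights` dictionary, so the same data can produce a different ranking for each user without rerunning the underlying research analysis.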
Any project that’s public facing will probably need an online front end, which probably means building a web app. This imposes a short-list of compatible technologies right from the get go. Uploading zipped shapefiles to an FTP server doesn’t cut it any more.
Fortunately there are a number of well developed platforms for creating online front ends that can display ready-made or customized maps on-the-fly. These tend to fall into two camps - client side and server side. Client side maps essentially provide the instructions for your web browser to perform like a lightweight GIS program. These work well for small to medium sized maps, but bog down if there’s a lot of intricate data to display (which browsers aren’t built for). Server side web mapping platforms do all the data crunching on a server, and send just the final image to the browser to display. These work well provided you have a fast enough internet connection.
If the data need to be hosted in the cloud, for either online or local consumption, we have to turn to data storage options that work well on the cloud. Fortunately there are several choices for serving GIS data from the cloud, including commercial, open-source, and everything in between. But if the project also requires ongoing ingestion of new data, or data cleaning, we have far fewer options to choose from. We have more than one project that involves an automated program running every day at midnight, downloading some new data to a server, running some custom geoprocessing tools, converting it to a different format, and then uploading it to another server. These work fine once they’re set up, but such custom pipelines can be tricky to build and require constant vigilance in case any link in the chain fails because something got updated.
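A nightly pipeline like the one described above can be sketched as a chain of steps that stops and reports at the first broken link. This is an illustrative skeleton, not our production code: the four step functions are hypothetical stand-ins that work on in-memory data, and scheduling (for example with cron) is assumed to happen outside the script.

```python
import json
import logging

logger = logging.getLogger("nightly_pipeline")


def run_pipeline(steps, payload=None):
    """Run (name, function) steps in order, passing each result to the next.

    Returns (ok, failed_step_name, last_payload) so a caller or monitoring
    job can tell which link in the chain broke.
    """
    for name, func in steps:
        try:
            payload = func(payload)
            logger.info("step %s succeeded", name)
        except Exception:
            logger.exception("step %s failed", name)
            return False, name, payload
    return True, None, payload


# Hypothetical stand-ins for the real download / geoprocess / convert / upload steps
def download(_):
    return [{"id": 1, "temp_f": 68.0}, {"id": 2, "temp_f": 50.0}]


def geoprocess(records):
    # Placeholder "processing": add a Celsius field to each record
    return [{**r, "temp_c": round((r["temp_f"] - 32) * 5 / 9, 1)} for r in records]


def convert(records):
    return json.dumps(records)


def upload(text):
    # Pretend to upload and report how many characters were sent
    return len(text)


STEPS = [
    ("download", download),
    ("geoprocess", geoprocess),
    ("convert", convert),
    ("upload", upload),
]

ok, failed_step, result = run_pipeline(STEPS)
print(ok, failed_step, result)
```

Returning the name of the failed step, rather than just crashing, is what makes the "constant vigilance" manageable: the same tuple can feed an email alert or a status dashboard.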
The majority of our projects have an online component, which means someone needs to host the data and code for analysis and visualization. Some of our favorite platforms are quite good at this but are restricted to a specific company’s infrastructure (e.g., Google Earth Engine). Others can be run from a server anywhere but require commercial software licenses (e.g., ESRI). Other stacks can truly be run from any Linux machine anywhere, whether physical or virtual, using only open source components like Geoserver, PostgreSQL, GDAL, etc.
We have a few options to pick from, but if the customer needs to self-host the application we have to work within their IT infrastructure. A good example of this is the set of GIS projects we support for the Karuk Department of Natural Resources (KDNR). Due to a number of factors including the sensitivity of the data and network connectivity, most KDNR projects need to be entirely offline or hosted from their IT infrastructure. This still leaves a lot of options open, but not every component we might consider can be hosted on their servers.
The stability of the code base and data infrastructure is essential for a platform to be useful for the long-term. This in turn is shaped by the stability of the company or the development community that maintains the underlying code base. Neither open source nor commercial products are intrinsically more stable, and there are solid examples in both camps of tools that are super stable as well as some that are short lived.
As a general rule, off-the-shelf products that have been around for a while tend to be fairly stable. Behemoths like Google and ESRI are big, diversified companies, and while they certainly have beta versions of products that may or may not survive, their flagship products are going to be around for a long, long time. Similarly open source projects supported by groups like the Open Source Geospatial Foundation are well-developed and well-maintained, and indeed commonly used in commercial products (whose vendors in turn support the foundations). Python, R, and QGIS also have strong support communities in industry (Python) and academia (R and QGIS). These are our go-to tools for GIS work. In cutting edge domains, like drone photogrammetry, the jury is still out as to which platforms will still be around in 10 years.
Security and Privacy
Security and privacy requirements are two of the ‘no compromise’ criteria in selecting a platform. In projects where the goal is to make research results available to as wide an audience as possible through a public resource, this isn’t a huge technical requirement. We just have to make sure no sensitive data is exposed. We like those types of projects, because we can focus on the cool stuff.
However many of our most impactful projects require attention to data privacy. At the “easy” end of the spectrum, access to the data is simply “yes” or “no”. For these projects, we can use a platform that has integrated user accounts and authentication. Commercial products from ESRI, Google, MapBox and others can usually handle this pretty easily. For open source projects, you usually need to add another item to the stack, which someone will have to set up and maintain. Projects that aren’t really that sensitive but the client doesn’t want to share it broadly can sometimes get away with simply making the URL unlisted, but that’s not really a form of security.
Trickier still are projects that require multiple levels of permissions. This can include tailored permissions for different user roles (e.g., viewer, content editor, admin), and/or different subsets of the data (e.g., restricted by field and/or record). The tools that limit access to data have to be robust and apply across every component of the project, from data collection all the way to archiving and backups. These are not trivial to set up and require long-term maintenance that in the worst-case scenario (for administrators) may necessitate a system for providing individual user support.
A good example of where permissions can get tricky is a mobile data collection app that involves a mix of sensitive and nonsensitive data. Something as innocent as a pasture monitoring app might involve information a landowner wishes to record for her own purposes, and may even be willing to share with researchers, but doesn’t want to share with nosy neighbors or the general public. Sometimes a workaround might suffice, such as tweaking the visibility of locations in web maps, however if data are truly sensitive, or perceived to be so, you need to use tools that lock up the underlying data.
Well-developed citizen science platforms like iNaturalist give users and administrators the ability to conceal the precise location of recordings of rare or listed species. This solves many of the privacy concerns, but at the cost of the platform now having to maintain two versions of the location - the real one and the obscured one. To complicate matters further, each location needs to get different permission levels for different sets of users. Few platforms provide that level of granularity out-of-the-box, so you have to be prepared to develop a custom solution or use a higher-end product.
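The "two versions of the location" idea can be sketched as follows. This is not how iNaturalist implements obscuring; the grid-snapping method, the 0.2 degree cell size, the role names, and the record fields are all assumptions made for illustration.

```python
import math

# Hypothetical roles that are allowed to see true coordinates
PRECISE_ROLES = {"owner", "researcher"}


def obscure(lat, lon, cell_deg=0.2):
    """Replace a point with the center of the grid cell that contains it."""
    def snap(value):
        return (math.floor(value / cell_deg) + 0.5) * cell_deg
    return round(snap(lat), 6), round(snap(lon), 6)


def location_for(record, role):
    """Return the coordinates a user with the given role is allowed to see."""
    if not record["sensitive"] or role in PRECISE_ROLES:
        return record["lat"], record["lon"]
    return obscure(record["lat"], record["lon"])


# Hypothetical observation of a listed species
record = {"lat": 38.5449, "lon": -121.7405, "sensitive": True}

print(location_for(record, "owner"))   # true location
print(location_for(record, "public"))  # center of the surrounding grid cell
```

Even in this toy version, every code path that returns coordinates (maps, downloads, API responses, backups) has to go through something like `location_for`, which is why this level of granularity is rarely available out-of-the-box.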
Reproducible workflows are not always a high priority feature, but they’re often important in at least two types of applications - research and government applications.
Geospatial pipelines that feed into publishable research should ideally be reproducible. The old “we tweaked the model until the output looked right” approach no longer sits well with many reviewers and editors. Reproducibility is also key to extend research into new questions and datasets.
Government agencies also tend to require data processing pipelines and analysis tools that are as reproducible as possible. Anything that shapes decisions about public resources or informs public policy has to meet higher standards of transparency and accountability as defined by legislation and in some cases court precedents. All stakeholders have the right to dig into the weeds of the data analysis, and interested parties most certainly will. State and federal agencies also tend to value scalability quite strongly, which requires reproducibility to be built in from the beginning.
The most common approach to reproducibility is to work through scripts, which become part of the final product. This generally works but can take a lot more effort and may require skills not readily available. A lot of GIS work involves visual interpretation and judgements using the most powerful computing platform of all - the human brain. Desktop GIS software like ArcGIS Pro also have strong logging capabilities, which can record both manual and automated processes. Logs may not meet the standard of full reproducibility but they certainly make the process more transparent.
A strong predictor of technology adoption is whether the target audience has the background to use your tool. This often is the primary factor in selecting which platform to use.
If for example, we’re developing a data repository for county level planners or agencies, the first thing we need to find out is what tools they’re already using. If it’s mostly spreadsheets, which aren’t equipped to manage spatial data very well, we know we need to handle the mapping visualizations through a web site, and make data available as non-spatial CSV files. If they’re GIS users, they’ll probably want access to the underlying data, maybe as downloads or a REST API endpoint. But we also need to find out which GIS software they use. Universities and government typically have access to ESRI licenses, but if we’re talking landowners, nonprofits and some industries, we need to accommodate users of open-source tools like QGIS. File formats, documentation, tutorials, even the labels on the GUIs are all shaped by the experience of the users. As much as we or our client might get excited about a novel new product or code library, we don’t want to build something that won’t survive staff turnover.
Planning for user familiarity also requires us to look at the pipeline of future users. As academics, we naturally keep tabs on what software is being taught in higher ed through conferences, webinars, publications, etc. We also find out what’s in demand through Office Hours, our own workshop evaluation, and periodic surveys we conduct every couple of years or so. When a project calls for building not just a final product, but the underlying data engine, we pay attention to trends in data science and CS. This is one reason why we generally stick to off-the-shelf tools from well-established product lines, whether commercial or open-source. A space we’re currently watching closely is Artificial Intelligence, where there are a number of competing platforms trying to build market share.
License availability is an issue to consider both for us as developers as well as our end-users. Open source projects tend to have permissive and free licenses, but commercial products and data of all kinds may require licenses which are often linked to an annual subscription fee.
As an R&D unit, we have pretty good access to licenses for our own use, and we feel fairly comfortable committing to hosting and maintaining projects for the lifespan of a typical project (say 5 years). If we need to use a tool or library that has a high annual license fee, we either look for an alternative or build it into the budget. But if a product requires end-users to have their own licenses, say for a data portal we’re building, we need to take that into consideration in the design stage. When in doubt, we share data through open channels such as public REST endpoints and standard file formats like GeoJSON.
Licenses for data are another issue. Being housed in a large university system, we have access to a lot of data from both government and private sources. But this doesn’t mean we can use any dataset for any project. Every case has to be evaluated individually. Many of our projects require real-time imagery, including high resolution satellite data, some of which is free, some of which is not. Datasets funded by government agencies have traditionally been made available to the public for little or no cost, but even that is changing. In recent years the National Agriculture Imagery Program (NAIP) has been looking at adopting a subscription model, and other expensive datasets like LiDAR are behind paywalls until someone needs them.
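As a small illustration of why a standard format lowers the license barrier, the sketch below writes point data as GeoJSON using only Python's standard library, so neither we nor the end-user need a GIS license to produce or read it. The field names and sample rows are hypothetical.

```python
import json


def to_feature_collection(rows):
    """Convert dicts with 'lon' and 'lat' keys into a GeoJSON FeatureCollection."""
    features = []
    for row in rows:
        # Everything that is not a coordinate becomes an attribute of the feature
        props = {k: v for k, v in row.items() if k not in ("lon", "lat")}
        features.append({
            "type": "Feature",
            # GeoJSON coordinate order is longitude first, then latitude
            "geometry": {"type": "Point", "coordinates": [row["lon"], row["lat"]]},
            "properties": props,
        })
    return {"type": "FeatureCollection", "features": features}


# Hypothetical monitoring stations
rows = [
    {"name": "Station A", "lon": -121.74, "lat": 38.54},
    {"name": "Station B", "lon": -122.27, "lat": 37.87},
]

collection = to_feature_collection(rows)
print(json.dumps(collection, indent=2))
```

The resulting text can be saved with a `.geojson` extension and opened directly in QGIS, ArcGIS Pro, or a web map.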
Occasionally we get a project where the use of the tool itself is a subject of interest. This is more likely for decision support web apps, where the explicit goal is to help people make better decisions. In general, the bar for analytics for geospatial tools that go beyond basic page views is surprisingly low. But occasionally a researcher or funder wants evidence that the tool is actually making a difference, and how it can be made better.
The technologies for recording and analyzing usage of a tool are of course extremely well developed. An enormous amount of information can be gleaned from web apps just from web logs and Google Analytics, including characteristics of the users, how they navigate through a site, time on each page, etc. Other libraries have the ability to track “button clicks”, so you can see how your users are using the site. Pop-up surveys are also easy to implement using 3rd party services. On the desktop, ArcGIS Pro and R both have built-in logging capabilities that are easy to turn on and can be extremely detailed.
Web logs are another source of data, which you can tap if you want to know how many times a dataset you’re hosting was downloaded, or how many times a REST endpoint was hit. There are standard tools to harvest these kinds of metrics from logs, but someone has to be tasked to do it. If analytics is an important goal, a general rule of thumb is to add 10% to the budget and timeline both for the development time as well as the ongoing monitoring and analysis.
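For a sense of what harvesting these metrics involves, here is a minimal sketch that counts successful downloads per file from web server log lines. It assumes the logs follow the widely used Common Log Format; the sample lines, paths, and IP addresses are made up, and real deployments would more likely use an existing log analysis tool.

```python
import re
from collections import Counter

# Matches the quoted request and the status code in a Common Log Format line
LOG_PATTERN = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) '
)


def count_hits(lines, prefix):
    """Count successful GET requests for paths that start with the given prefix."""
    counts = Counter()
    for line in lines:
        match = LOG_PATTERN.search(line)
        if not match:
            continue  # skip lines that are not in the expected format
        if (match["method"] == "GET" and match["status"] == "200"
                and match["path"].startswith(prefix)):
            # Drop any query string so /api/points?bbox=... counts as /api/points
            counts[match["path"].split("?")[0]] += 1
    return counts


# Made-up sample log lines
sample_log = [
    '203.0.113.5 - - [10/Oct/2023:13:55:36 -0700] "GET /data/parcels.zip HTTP/1.1" 200 51234',
    '203.0.113.9 - - [10/Oct/2023:14:02:11 -0700] "GET /data/parcels.zip HTTP/1.1" 200 51234',
    '203.0.113.9 - - [10/Oct/2023:14:03:40 -0700] "GET /data/missing.zip HTTP/1.1" 404 512',
    '198.51.100.7 - - [10/Oct/2023:14:05:02 -0700] "GET /api/points?bbox=1,2,3,4 HTTP/1.1" 200 2048',
]

print(count_hits(sample_log, "/data/"))
print(count_hits(sample_log, "/api/"))
```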
We’re fortunate to live at a time when there are so many choices for geospatial software, thanks to so many shoulders to stand on. This can make selecting the tools for your GIS project a bit more involved, but it's a good problem to have.
Proprietary and open-source represent two ends of the spectrum, but there’s a lot of mixing in between. We like to take a functional perspective, which we have found can quickly winnow down the options and reduce the odds that we’ll need to make big changes down the road. Thinking about functionality also helps improve a project design by highlighting what features are mission critical, and what we have some flexibility on.
This work was supported by the USDA - National Institute of Food and Agriculture (Hatch Project 1015742; Powers).