USGS Multimedia Gallery
Multiple Representations of Geospatial Data: A Cartographic Search for the Holy Grail?
Mark DeMulder: It's my pleasure to introduce our final speaker for this morning's plenary. Dr. Barbara P. Buttenfield, known to all of us as Babs, is a professor of Geography at the University of Colorado in Boulder.
She teaches courses in Geographic Information Science, Computer Cartography, Geographic Information Design, and her research interests focus on data delivery on the internet, visualization tools for environmental modeling, map generalization and interface usability testing. She has worked extensively with librarians and information scientists to develop internet-based tools to browse and retrieve information for very large spatial data archives.
She spent several months in residence at the Library of Congress Geography Map Division while on sabbatical at the U.S. Geological Survey in Reston, Virginia. Actually, that's where I met Babs; from '93 to '94 she was at the National Mapping Division of the U.S. Geological Survey.
She was also an original co-principal investigator for the NSF-funded National Center for Geographic Information and Analysis, NCGIA, which many of us are familiar with, and she led research initiatives on multiple representations, formalizing cartographic knowledge and visualizing spatial data quality.
Babs is past President of the American Cartographic Association and a Fellow of the American Congress on Surveying and Mapping.
Please join me in welcoming Babs Buttenfield.
Barbara Buttenfield: I am a professor. I go in 50-minute time chunks.
Barbara Buttenfield: So I've got a script here that I'm going to follow really closely so that we can get out of here in 20 minutes or so.
Thank you so much for inviting me to speak. There's a lot of familiar faces here and I feel like we grew up together, so I'm talking to the family.
I'm going to draw today on my collaboration with USGS staff, Larry Stanislawski and Ellen Finelli, and with another research faculty affiliate of CEGIS, the Center of Excellence for Geospatial Information Science, part of the USGS. That research faculty collaborator is Professor Cindy Brewer at Penn State.
Our work has been tightly coupled over the past few years. We're developing software tools which generalize data for mapping across multiple scales. Our intention is to deliver not only the intermediate scale datasets of the multiple representations but also the software tools so that you all can do your own generalization. You know, the 'Teach a man to fish' and all of that.
Today's talk, I'm going to emphasize NHD hydrography, because we've made the most progress with that data layer. However, I want to stress that at CEGIS, there are a lot of efforts ongoing with other data layers as well: terrain, transportation, imaging.
So, what's this Holy Grail stuff? I want to recap a brainstorming session that Cindy Brewer and I had eight years ago with Charlie Frye and Aileen Buckley from ESRI. And this animation I'm about to show you came out of that brainstorming session.
We were brainstorming about multi-scale, multi-purpose mapping databases. What would it take to make one? Well, you have to start with some capture data, so let's capture data of a very fine resolution. Eight years ago, we were talking about five to 10 meters. You do some processing, that's the little boxes with the diagrams and the diamonds in them, and you come up with a cartographic database.
So the DLM, that's your raw data that says 'captured'. The DCM is your cartographic database. It has your symbology, it has your lettering, it has some processing, so instead of DEM you might have contours, so on and so forth. And from these DCMs, these cartographic databases, you can create general-purpose topographic map products, you can create large scale products, you can create special purpose maps.
You can collect data at other capture resolutions as well. For example, I've got a 90-meter capture and a one-kilometer resolution capture here. You can capture it at any resolution you want. I'm trying to tailor this a little bit to the USGS. And again, in these cases, you can do some processing and create a cartographic database and products from them in both cases.
Now if you look at this, you'll see that there are big gaps. There's the 24K products and then there's the 100K products, but what about in between? And likewise, there's the 100K products, 150K, but then you jump out to 1 million. So would it be possible, and what would it take to fill in those scale gaps, for example at 50K, to make map products and data products, or at 200K? What would it take to create an integrated data suite spanning a continuous range of mapping scales from 24K down past 1 and 2 million?
That's what we're brainstorming about. Fully operational, multiple representation database with fully automated data production and maintenance. That's the Holy Grail.
So, let's start with the simplifying process. I'm a good academic; let's make it as simple as possible. Who needs multiple representations in the first place? Well, for some data layers, a single version will suffice. Placenames, some attributes, revision information, important boundaries, geodetic control: for these things, probably a single version will suffice.
But for other data layers, there are scale sensitivities, and these are going to challenge the ability to have one version captured at five meters go all the way down to all of the scales. And these scale sensitivities, they vary among data layers. So that's part of the challenge, and that's why we need multiple versions for these layers, especially terrain, image base, and hydrography.
So, what about this intermediate scale, these LoDs? Now that acronym, LoD, stands for Level of Detail database, and it is simply an intermediate scale version that you can derive from another cartographic database. So processing these smaller scale versions that fill in the scale gaps is possible.
For analysis purposes, now, Charlie and Aileen and Cindy and I, we were talking about cartographic databases, but geez, if you're going to go through all the trouble to build a cartographic database, you may have to make sure that it's going to fulfill analytic needs as well, for the hydrologists.
So for analysis, digital features must reflect spatial processes which are evident at a finite range of scales. Let me give you some examples.
If I'm looking for evidence of erosion and deposition, I'm going to go to that five-meter capture. I'm not going to go to the 100-meter capture or the 90-meter capture. And if I'm looking for evidence of isostatic rebound following recession of a glacier, I'm sure I'm not going to start in the five-meter database. I'm going to go to the million database. And that's because these spatial processes are evident within a finite range of scales.
If I were looking for isostatic rebound in five-meter resolution, it would be like looking for a needle in a haystack. So this is the analytic issue: these digital features have to reflect spatial processes.
The second issue is that place matters. I'm a geographer; place matters. But place matters to anybody who's using data. Evidence of spatial process is going to vary from one location to another. And so the generalization processing, what you're using to create these intermediate scale databases, has to be tailored to local landscape characteristics.
In one place, you find glacial evidence. In another, you find erosion and deposition following a volcanic eruption. In a third, you're going to find evidence of urbanization. All of these are going to impact the shape of the landscape characteristics. You know this. It's at the heart of what we all do for a living. It's a core of the data that we pull from the National Map.
So these two premises are driving our efforts to create multiple representation, multi-scale versions of NHD data usable from high-res 24K to National-Atlas-res 1 to 2 million or smaller.
Our constraints as we've approached this problem. Now this is Larry and Cindy and Babs and a lot of other people around CEGIS and USGS. The generalization has to preserve at least this much. The topology of the network, full stream reaches and attribution, complete feature codes, and then the local variations that typify these landscape types.
And when I'm generalizing, I'm not thinking about physiographic regimes. I'm thinking about the geometry. So I'm working at the level of preserving local stream density or channel shape and sinuosity. But my goal there, my objective, is to preserve that evidence of process and landscape type.
So this is our framework that we've been working with. And we've been working in all of the areas that you see here. I can't drag you through the whole thing; I mean, I could, but again, this 50-to-75-minute chunk. So I'm going to talk about just three areas and I'm going to organize my comments on the basis of enrichment.
Now, enrichment is literally adding attributes to the NHD schema, and that really drives the process. And when I get all done and come back to the solution of enrichment, I will show you some examples of our generalization: our pruning, which is deleting entire features; simplification, which is deleting selected vertices from the remaining features; and I'll show you some of the work that Cindy's been doing with symbolization.
Some of our solutions are pretty much together and some of them still in progress, but I want to give special thanks for all of this to Larry Stanislawski who's sitting back there trying to not be visible, because he's the fellow who's been doing a lot, most of the heavy lifting in building code and software code. So none of this would be possible without Larry's help.
Let's start with the density partitions. I'm going to give you a cheat sheet. If you look in the lower left-hand corner of each slide, you can see density partitions. So as I go through the three parts of this initial process, you'll see that little icon chain.
Density partitions can be established automatically from catchment areas and channel lengths. You compute a density value for each channel. So this isn't density in kilometers per square kilometer for the basin as a whole. We have it for each channel.
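As a rough illustration of the per-channel density described above, the following is a minimal sketch, not the CEGIS toolbox code. It assumes density is channel length divided by the area of that channel's own catchment, and that classes come from fixed break values; the `Channel` record, the function names, and the break values are all hypothetical.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Channel:
    reach_id: str          # hypothetical identifier for a stream reach
    length_km: float       # channel length
    catchment_km2: float   # area of the catchment draining to this channel


def channel_density(ch: Channel) -> float:
    """Local density in km of channel per km^2 of its own catchment (assumed formula)."""
    if ch.catchment_km2 <= 0:
        raise ValueError(f"catchment area must be positive for {ch.reach_id}")
    return ch.length_km / ch.catchment_km2


def assign_partitions(channels, breaks):
    """Map each reach_id to a density class index; class 0 is the sparsest.

    `breaks` is an ascending list of density thresholds (illustrative values only).
    """
    return {ch.reach_id: bisect_right(breaks, channel_density(ch)) for ch in channels}


channels = [
    Channel("r1", 2.0, 4.0),   # density 0.5
    Channel("r2", 3.0, 2.0),   # density 1.5
    Channel("r3", 4.0, 1.0),   # density 4.0
]
partitions = assign_partitions(channels, breaks=[1.0, 2.0])
```

Because the class is stored per channel rather than per basin, later pruning and simplification steps can look it up feature by feature.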
In some cases, the density differences will preserve evidence of a valid spatial process. Such as here, you see four subbasins in Iowa or a lake bed of a glacial till. The density variations document the remnants of glaciation. And when I simplify this, when we prune this, I want to preserve that evidence so somebody can use this data to do some analysis.
Sometimes, this density differentiation can be a lot more complicated than two density classes in four partitions. This is a case study that I just finished working on and I was so pleased to have the chance to work with Ellen Finelli and Ariel Dumbaya and Larry, of course, on this. But here we have four density partitions in 12 subbasins, and we needed to unify our generalization across the entire state of New Jersey.
We were also working from 2,400 local res down to 24,000. That's a 10x scale jump, for those of you who haven't had enough coffee this morning. And that's often done when you're generalizing in smaller scales, below 1 million. But it's rarely attempted in a single processing step for very large scale data.
Usually, we would go from 2,400 down to 10,000, and then 10,000 to 24,000. And I suggested this to Ellen in the beginning of the process, and she was quiet for a minute. Then she said, "Babs, we're talking weeks, not months." So we did it in one pass, and we could do it because we had access to the information from these density partitions.
So the density stratification is guiding both the pruning and the simplification. Now you can see, this density process, we have very high densities down in the southern part of the state. And you can see the variation in density in other parts of the state. Let me show you how this density stratification helps us with our pruning and simplification.
And these slides are from Ellen. Thank you very much, wherever you are in the room. So what you see here in the upper panel is without density stratification, you have a very dense area of hydrography and a very sparse area.
So without the density partitions, I'd be simplifying with a single uniform simplification tolerance that is just going to marginalize the densities in the sparse and the dense area. And no matter what value I pick, I'm going to lose something. It's just a gamble.
So with the density partitions, what I can do is assign four classes. You can see them: red, orange, gold and yellow. And those density partitions can actually be attributed to the individual channels. Then I simplify the channels in each partition to a unique tolerance, automatically. No manual intervention. The relative densities are protected even for these very large scale ones. Pretty neat. I'm so excited.
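The idea of one tolerance per density class can be sketched as follows. This is an illustration under stated assumptions: the talk does not say which simplification algorithm the toolbox applies, so a plain Douglas-Peucker routine is used here as a stand-in, and the class labels and tolerance values are hypothetical.

```python
import math


def _line_distance(p, a, b):
    """Distance from point p to the line through a and b (to a itself if a == b)."""
    (x, y), (x1, y1), (x2, y2) = p, a, b
    dx, dy = x2 - x1, y2 - y1
    if dx == 0 and dy == 0:
        return math.hypot(x - x1, y - y1)
    return abs(dy * (x - x1) - dx * (y - y1)) / math.hypot(dx, dy)


def douglas_peucker(points, tol):
    """Drop vertices that lie within `tol` of the chord; endpoints are always kept."""
    if len(points) < 3:
        return list(points)
    a, b = points[0], points[-1]
    idx, dmax = 0, 0.0
    for i in range(1, len(points) - 1):
        d = _line_distance(points[i], a, b)
        if d > dmax:
            idx, dmax = i, d
    if dmax <= tol:
        return [a, b]
    left = douglas_peucker(points[: idx + 1], tol)
    right = douglas_peucker(points[idx:], tol)
    return left[:-1] + right  # avoid repeating the split vertex


def simplify_by_partition(channels, tolerances):
    """channels: list of (density_class, points); tolerances: {density_class: tolerance}."""
    return [(cls, douglas_peucker(pts, tolerances[cls])) for cls, pts in channels]


zigzag = [(0, 0), (1, 0.5), (2, 0), (3, 0.5), (4, 0)]
result = simplify_by_partition(
    [("dense", zigzag), ("sparse", zigzag)],
    {"dense": 0.1, "sparse": 1.0},  # illustrative tolerances, in map units
)
```

The same geometry keeps all five vertices under the small tolerance and collapses to its two endpoints under the large one, which is the effect a single uniform tolerance cannot give to dense and sparse areas at the same time.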
Density partitions can also help to identify regions of inconsistent compilation, which happens on occasion across quadrangle boundaries. Ellen Finelli showed an example of this earlier this week for a map region across the state boundary.
Visually, you can tell that the density difference is an artifact of compilation because these layer partitions happen to coincide exactly with the quadrangle boundaries. So the density partitioning can remove these automatically. You're just pruning incrementally until you achieve uniform density for the entire region of the map. We've resolved inconsistencies for subbasins of Nevada, Maine, North Carolina, as well as the examples shown here from Alabama.
And I'd like to acknowledge the students at Colorado, my doctoral student, Chris Anderson-Tarver, who did the processing on this, and the ArcMap display was created by MD Stalton who's an undergraduate at Penn State. So we're trying to involve students in this collaboration, and I think it's working. We're making headway.
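Incremental pruning toward a uniform density could look something like the sketch below. The talk does not state which channels are dropped first, so the `priority` value (for example, upstream drainage area) is an assumption, as are the function name and the sample numbers. A real implementation would also have to respect the network topology constraint mentioned earlier, which this sketch ignores.

```python
def prune_to_target(channels, area_km2, target_density):
    """Remove lowest-priority channels until density is at or below the target.

    channels: list of (reach_id, length_km, priority); higher priority is kept longer.
    area_km2: area of the over-dense region, e.g. one quadrangle.
    target_density: density (km per km^2) of the neighbouring region to match.
    """
    kept = sorted(channels, key=lambda c: c[2], reverse=True)
    total_km = sum(c[1] for c in kept)
    while kept and total_km / area_km2 > target_density:
        total_km -= kept.pop()[1]  # last item has the lowest priority
    return kept


dense_quad = [("a", 10, 100), ("b", 5, 50), ("c", 5, 10), ("d", 5, 5)]
kept = prune_to_target(dense_quad, area_km2=10, target_density=1.6)
kept_ids = [c[0] for c in kept]
```

Here the quadrangle starts at 2.5 km per square kilometer and pruning stops once the two least important channels are gone and the density has dropped to 1.5.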
All right, on to simplification and classifying land types. This is a type of enrichment that's classifying land types which supports differential simplification.
I said earlier that water channels reflect differences in landscape types, which means that the generalization processing has to be tailored to specific regimes. To accomplish that, you have to define the specific regimes.
So we started with the Fenneman and Johnson 1946 Physical Divisions of the Conterminous United States, and we were unsuccessful in trying to create differential simplification, in large part because the Fenneman and Johnson classification is not based on density, and I've already told you that that density stuff is a pretty big deal.
So Larry took another pass at it. He classified the conterminous United States on five variables: average elevation, average slope, standard deviation of elevation, drainage density estimates from the catchments that we have as part of our enrichment, and runoff in millimeters per year. We ran a maximum likelihood classification on these variables from a five-kilometer grid overlaid on a 3-arc-second DEM.
And you can see the classes here. You can see the New Jersey subbasins we worked on as well. The six black subbasins form a representative sample of the five areas. The five-prong solution, however, also puts B, the Okefenokee Swamp or a portion of it, in the same class as the Missouri Highlands, that's C, and also that glaciated area I showed you a minute ago in Iowa. So we need some more specificity.
We added two variables to the original five, bedrock density and area of inland surface water. Maximum likelihood classification gave us seven classes. So we have a lot better specificity in the east, but you can see there's a lot of landscape type missing in the west, and that's complicating our ability to work with some of the arid subbasins. We're going to continue to work with this as we test for generalization sequences.
Here are our six subbasins. The Magic 6. You probably heard about these in yesterday's talk. He's from Penn State.
You can see in the regime column of the table that we are cross-tabulating humid and dry landscapes with mountainous, hilly and flat. And so we're trying to tailor the generalization sequence, the processing sequence, order of operations, tolerance values, and so on and so forth for each one of these, and then try to extend these and see how far, how much of the combination we can use.
I think in the end, we're going to narrow it down to a number of sequences that's closer to 12 to 15. We certainly don't want to end up with a number of sequences that's in the dozens or hundreds. We're trying to minimize this.
There's a white letter K in this map. That's a subbasin in Oklahoma and I'll return to that presently.
So I'll show you very quickly some of the examples here, and then I want to get into the validation. For each one of the subbasins, we address a specific problem.
In West Virginia. And you're going to see this template three times for three different subbasins. On the left you have the 24K high-res NHD, and on the right you have 100K medium-res NHD, and in the middle you have our 50K intermediate scale, our LoD.
So you can see in the 24K panel, there's a difference in density and we need to preserve that, and you can see that that is preserved in our solution. If you look, though, in the 100K, in the medium-res panel, you'll see that all of that density has been homogenized in the compilation. We wanted to preserve it, so we kept it in the LoD.
We had a somewhat different problem in this subbasin. I'm showing you Missouri here just to show you an example of our simplification of water polygons. And again, you can see the same thing, the 24K, the 100K, and our solution.
Texas, third place. The generalization challenge here was to delineate a primary flowline, what a cartographer calls the 'centerline', the primary channel through the braid, and to distinguish that from the rest of the stream braid. That centerline delineation turned out to be a hard problem.
For cartographers, we're trying to find a primary channel through a stream network. I was talking to Jeff Simley earlier this week and he says that hydrologists talk about divergences and it's a real thorn in the paw for hydrologists, and that's the issue with the centerline through a stream braid. And sure enough, we found the same problem here. We haven't fully solved the problem, but I'll show you how far we've come.
Selecting a primary channel simply using the NHD artificial path attribute gives a disconnected solution. In the upper left-hand corner of the slide, you see a pair of panels. In the upper panel, that's the centerline you get in Missouri if you select on artificial path. This is just the flowlines attributed as artificial path, which run through the stream polygons.
We use topological rules. You can see this in the lower panel. Instead of using just the artificial path, we do an intersection or a containment, a topological relation of the flowlines with all of the hydro areas and water bodies, and we do a little better. You can see there's a little bit more continuity there.
Then we traced through the stream node table to see if we can fill in the centerline. To get this solution to work through a braid, we're going to have to work with upstream drainage areas as well. And we think it's going to work, but I'm not sure that I can stand up here and say, "We got it." But stay tuned.
All right. So this is the processing summary table for the six subbasins. And you can see the kinds of parameters we're working with, how many density classes, one for Florida and Georgia and three for Colorado, what was the centerline delineation. You can see that for humid subbasins, we used contained or a mixture of contained and intersect, and so on and so forth.
The continuity check differs quite a bit for the arid or the dry landscapes and humid landscapes. And then down at the bottom you can see our simplification tolerance is 140 meters for polygons in Georgia and Florida. We had to do some smoothing on those, but we didn't have to do that for any other subbasins. You can also see the tolerance values we used for centerlines and flowlines.
So place matters. I can see the results, and you cannot use a single procedure. It's just not going to work. Our country is just too big.
Let's move on to assessment. My time is getting very short here. We worked with three types of assessment in terms of cartographic validation. Cindy Brewer has played a master role in this effort. She takes our generalized data and she drops it into simple schemes that are designed for specific scales of display.
So on the left and right panels, you are looking at exactly the same symbolization. On the left, Cindy dropped in the high-res NHD, and on the right she dropped in our 50K LoD. So you can see the differences there. She also created displays at smaller scales. Here's the same dataset, the 50K LoD, dropped into a 100K symbolization. And here it is dropped into 200K.
So this is a cartographic validation. We simply look at the map and we show it to people and we say, "How does this look to you?" And it's a very important kind of validation.
A second kind is metric validation. This is a measure of conflation, essentially. Which features, which channels, polygons match? So the coefficient of line correspondence is simply a proportion of the length of features that match between our simplified data, our 50K LoD or whatever scale of LoD we're building, and some benchmark. In the case of 50K, our benchmark was the 100K medium-res.
So we have the matches in the numerator. And then in the denominator, we have omissions, commissions, and the matches: the length of features that are present in the benchmark but not in the generalized data, the length of features that are in the generalized data but not in the benchmark, and the matches.
So it's simply a proportion of how many features are still there and still there in a very similar fashion to the way they are benchmarked. So you can compute this for the basin as a whole. And you're looking at the Missouri Basin. You can see down at the bottom, the CLC is 79% of our features matched on that generalized version to 100K. That's pretty good.
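The ratio described above, matches over the sum of matches, omissions, and commissions, can be written as a short function. This is a minimal sketch: the function name is hypothetical, and the lengths in the example are made-up numbers chosen only to reproduce a value of 0.79, not measurements from the Missouri subbasin.

```python
def coefficient_of_line_correspondence(match_len, omission_len, commission_len):
    """CLC = matches / (matches + omissions + commissions), all as feature lengths.

    match_len:      length present in both the generalized data and the benchmark
    omission_len:   length present in the benchmark but not in the generalized data
    commission_len: length present in the generalized data but not in the benchmark
    """
    denominator = match_len + omission_len + commission_len
    if denominator == 0:
        raise ValueError("no feature length to compare")
    return match_len / denominator


# Illustrative lengths in km (not real data).
clc = coefficient_of_line_correspondence(790.0, 130.0, 80.0)
```

A value of 1.0 would mean every kilometer of channel appears in both datasets; omissions and commissions both pull the value down.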
The coefficient of area correspondence, a very similar kind of conflation measure, is a little bit lower. The CLC is the lowest, as it turns out, in the very sparse channel areas. Remember I showed you an example earlier with the 100K data, and I said that the 100K data has marginalized those density differences. We didn't, and it shows up in the CLC.
A third kind, cross-validation, coming to Oklahoma, is that we take one of these subbasins, we took the Missouri subbasin processing sequence and we just poured the Oklahoma data into it.
You can see the results here. You can see the assessment, the CLC and the CAC, so we can start to compare among our solutions. The CLC at 83% matches even better than we did in Missouri, but the CAC is much worse. So we need to do some tailoring.
So summarizing now and bringing this to a close. This is our NHD Generalization Toolbox. I think a lot of people think when you say, "Oh, I do generalization," they think, "Oh, you just run the Douglas-Peucker routine and pick a number and you push the button and you're done." We run through all of these tools to do any of our generalization. And I want to tell you a little bit about three insights that were important as we built this toolbox.
The enrichment tools, they support and inform all of our generalization processing. It's a lot of extra work to do this. But the subsequent processing and the simplification and pruning becomes much more powerful. So it's working.
Secondly, as I said before, these generalization routines tailored to landscape types are really improving our processing results locally and they're preserving geographic logic.
And then the third, the validation, the assessment for cartographic and statistical validity, focuses attention on what needs to be improved. It should reduce editing workloads and manual intervention, and I think that's really the point here. It's not just an intellectual exercise. That's most exciting for me. I'm doing something real.
So what about the Holy Grail? How much of this multiple representation data processing and generalization can be automated? I did my dissertation, well, more than 25 years ago on generalization. And I remember in my dissertation defense, I got into a big argument with my professors because they said, "How much of this do you think you can automate?" and I took a big breath and I said, "75%." I hope it raises.
Barbara Buttenfield: Anyway. Right now, we're at 90% to 95%. I never would've thought this back in, well, whatever date it was.
Barbara Buttenfield: So what about that last 4% or 5%? I have very high hopes.
Barbara Buttenfield: Thank you very much.
Title: The National Map Users Conference: Multiple Representations of Geospatial Data: A Cartographic Search for the Holy Grail?
The U.S. Geological Survey (USGS) sponsored the inaugural The National Map Users Conference (TNM UC) in conjunction with the eighth biennial Geographic Information Science (GIS) Workshop on May 10-13, 2011, in Golden, Colorado. The National Map Users Conference was held directly after the GIS Workshop at the Denver Marriott West on May 12-13. The focus of the Users Conference was on the role of The National Map in supporting science initiatives, emergency response, land and wildlife management, and other activities.
Location: Golden, CO, USA
Date Taken: 5/13/2011
Video Producer: Michael Moore, U.S. Geological Survey
Note: This video has been released into the public domain by the U.S. Geological Survey for use in its entirety. Some videos may contain pieces of copyrighted material. If you wish to use a portion of the video for any purpose, other than for resharing/reposting the video in its entirety, please contact the Video Producer/Videographer listed with this video. Please refer to the USGS Copyright section for how to credit this video.