Under the Hood

Features are the product of several internal tools/services:

  • Data Pipelines: Ingest and standardize data.
  • Spatial Compute Engine: Transform data from the pipelines into features.
  • Aggy: Aggregate data over space, time, and data structures (bands, arrays, etc.). This can include aggregations within polygons (boundaries) or aggregations to points.
  • Proximity: Find the nearest point of interest to any chosen point, or the driving (or walking) distance between any point of origin and destination(s).
  • Quality Control: A process and tool to evaluate and improve data.

Data Pipelines

Pipelines allow us to source and standardize trusted geospatial data from all over the internet. We ingest a wide range of data, including:

  • Businesses, from restaurants, bars, and coffee shops to cultural attractions, museums, and health care;
  • Transportation, like bus and subway stops and roads;
  • Parks, water, and coastlines;
  • Energy infrastructure, like power plants and substations, and Superfund sites;

And many more!

Our data pipelines ingest both vector and raster data.

Spatial Compute Engine

Our Spatial Compute Engine represents a combination of infrastructure, tooling, and data models. It’s used to create summaries and normalizations of geospatial data. This service allows us to take the data from our pipelines and turn them into Features.

Aggy

Aggy computes statistics over boundaries, or at points. Boundaries can be mutually exclusive (non-overlapping) divisions of space (census tracts, or counties, for instance). They can also be potentially overlapping, like the area reachable in a 10-minute walk from two adjacent city blocks.

Features within boundaries

Once we’ve ingested and standardized our data using Pipelines, we can do various things with it. For instance, we can summarize this data within familiar boundaries, such as ZIP codes, census areas (tract and block group), and administrative areas (county and metro). We can also aggregate our data sources to more novel, hyperlocal boundaries, such as the area reachable within a 10-minute walk or 20-minute drive from any given point.

We summarize POIs with counts (number of coffee shops), and in some cases the size or attributes of specific assets (think stadium seats, hospital beds, etc.). We also normalize all count- and size-based measures using known boundary area and population estimates to create per capita and per area metrics.
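
As a rough illustration of the normalization described above, here is a minimal Python sketch. The `Boundary` class, the `summarize_pois` helper, and the output field names are hypothetical and chosen only for this example; they are not Iggy's actual schema or implementation.

```python
from dataclasses import dataclass


@dataclass
class Boundary:
    """Hypothetical boundary record with the estimates needed for normalization."""
    name: str
    area_sqkm: float
    population: int


def summarize_pois(boundary, poi_count, prefix):
    """Return the raw count plus per-area and per-capita versions of it.

    A normalized value is None when its denominator is zero,
    so callers never see a division error.
    """
    per_sqkm = poi_count / boundary.area_sqkm if boundary.area_sqkm else None
    per_capita = poi_count / boundary.population if boundary.population else None
    return {
        f"{prefix}_count": poi_count,
        f"{prefix}_count_per_sqkm": per_sqkm,
        f"{prefix}_count_per_capita": per_capita,
    }


# Illustrative numbers only: 10 coffee shops in a 2.5 sq km tract of 4,000 people.
tract = Boundary("example_tract", area_sqkm=2.5, population=4000)
features = summarize_pois(tract, poi_count=10, prefix="coffee_shop")
print(features)
```

The same pattern would apply to size-based measures (seats, beds) by passing a summed size instead of a count.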

POIs that have a real-world areal extent (like parks and water) are summarized using both counts and spatial intersections. For instance, state_park_intersecting_area_in_sqkm would tell you how much land area is covered by state parks within a given boundary (e.g. within a specific county). As with our POI data, we also normalize all park and water measures, for instance as the percentage of the total boundary area covered. In this example, the field would be called state_park_pct_area_intersecting_boundary.

Data that are represented as lines are treated differently. For example, we compute the total length of coastline within the boundary, as coast_intersecting_length_in_km.

Some of our more richly-attributed datasets allow us to aggregate other important measures – total tonnage imported or exported at ports, for instance, or the net capacity (in MW) of power plants.

In a few cases, we apply Aggy to individual segments of a broader category (wholesale, value, and upscale grocers, for example, or the energy source for generators at power plants: coal, gas, hydro, solar, or wind generators).

In some cases, the features produced by Aggy feed higher-order models (metropolitan activity centers, gentrification, etc.). In turn, these derived measures can inform some of Iggy’s other travel-time and distance-based roll-ups.

Isochrones

One way to create boundaries to summarize within is through something called an isochrone, which is the boundary defined by the area reachable within a certain time from a given starting point.

For example, imagine you centered a map on your home’s location and shaded in all the surrounding area you could reach within a 10-minute walk. The edge of this shape is an isochrone boundary: it marks how far you can get from your starting point, in any direction, by walking for 10 minutes.
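
To make the idea concrete, here is a minimal Python sketch that finds every location reachable within a time budget on a toy walking network, using Dijkstra's algorithm with a cutoff. The graph, place names, and walking times are invented for illustration, and the `reachable_within` helper is hypothetical. A real isochrone would also turn the reachable set into a polygon, which this sketch does not attempt, and nothing here should be read as Iggy's actual method.

```python
import heapq


def reachable_within(graph, start, max_minutes):
    """Return {node: minutes} for every node reachable within the time budget.

    graph maps a node to a list of (neighbor, minutes) edges.
    """
    best = {start: 0.0}
    queue = [(0.0, start)]
    while queue:
        minutes, node = heapq.heappop(queue)
        if minutes > best.get(node, float("inf")):
            continue  # stale queue entry; a shorter time was already found
        for neighbor, edge_minutes in graph.get(node, []):
            total = minutes + edge_minutes
            # Only keep neighbors that fit the budget and improve on the best known time.
            if total <= max_minutes and total < best.get(neighbor, float("inf")):
                best[neighbor] = total
                heapq.heappush(queue, (total, neighbor))
    return best


# Toy walking network; edge weights are walking minutes (made up).
walk_graph = {
    "home": [("corner", 3.0), ("park", 6.0)],
    "corner": [("home", 3.0), ("cafe", 4.0), ("park", 2.0)],
    "park": [("home", 6.0), ("corner", 2.0), ("library", 6.0)],
    "cafe": [("corner", 4.0), ("station", 5.0)],
    "library": [("park", 6.0)],
    "station": [("cafe", 5.0)],
}

isochrone_nodes = reachable_within(walk_graph, "home", max_minutes=10)
print(sorted(isochrone_nodes.items(), key=lambda item: item[1]))
```

In this toy network the library and the station fall outside the 10-minute budget, so they are not part of the reachable set.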

We think that what’s walkable in 10 minutes or drivable in 20 is a common-sense way to understand the area around an address. Unlike a ZIP code, which is a relatively static boundary for any address, isochrones are dynamic boundaries: you can make an isochrone that represents a 10-minute walk, a 12-minute walk, or an x-minute drive.

At Iggy, we use quadkey centroids (see "What is a quadkey?") as the starting points for our isochrones. Each quadkey (at zoom level 19) represents a tile on the Earth's surface with a side length of roughly 75 meters. This means that we’re able to map any address in our coverage region to an isochrone we’ve computed whose starting point is at most ~40 meters away, far less than the distance the average person can walk in 10 minutes.
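
For readers who want to see how a location maps to a quadkey, here is a minimal Python sketch. It assumes the widely used Web Mercator tile scheme popularized by the Bing Maps Tile System; whether Iggy's quadkeys follow exactly this convention is an assumption, and the function names are hypothetical.

```python
import math

EARTH_CIRCUMFERENCE_M = 40_075_016.686  # equatorial circumference used by Web Mercator
MAX_LATITUDE = 85.05112878  # Web Mercator is clipped at this latitude


def latlon_to_tile(lat, lon, zoom):
    """Convert a latitude/longitude to integer tile (x, y) at the given zoom."""
    lat = max(min(lat, MAX_LATITUDE), -MAX_LATITUDE)
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    sin_lat = math.sin(math.radians(lat))
    y = int((0.5 - math.log((1 + sin_lat) / (1 - sin_lat)) / (4 * math.pi)) * n)
    # Clamp so that lon = 180 or the latitude limits still land on a valid tile.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)


def tile_to_quadkey(x, y, zoom):
    """Interleave the bits of x and y into a base-4 quadkey string."""
    digits = []
    for level in range(zoom, 0, -1):
        mask = 1 << (level - 1)
        digit = 0
        if x & mask:
            digit += 1
        if y & mask:
            digit += 2
        digits.append(str(digit))
    return "".join(digits)


def tile_side_m(lat, zoom):
    """Approximate east-west ground width of a tile at this latitude, in meters."""
    return EARTH_CIRCUMFERENCE_M * math.cos(math.radians(lat)) / 2 ** zoom


x, y = latlon_to_tile(40.7128, -74.0060, 19)
print(tile_to_quadkey(x, y, 19))
print(round(tile_side_m(0.0, 19), 1))  # tile width at the equator
```

Under this scheme a zoom-19 tile is widest at the equator (a little over 76 meters) and narrows with the cosine of the latitude, which is why "roughly 75 meters" is best read as an upper bound.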

Features at Points

The world directly underfoot, or overhead. Our aggregations to points return the value associated with any point. An example here is road speeds: it is one thing to know a road is outside a home; it’s another to know how fast cars can travel on that road.

Other examples include flood or fire risk and climate or weather summaries.

Proximity

Nearest Features

Nearest is a service that returns the nearest point of interest to any point. It calculates this using "as the crow flies", or straight-line, distance. This can be useful when simple proximity to something is important, such as for a homebuyer who wants to avoid being near a gas station or funeral home. They can use Nearest to compare the distance to the nearest of these places while evaluating different homes.

Note that there are no boundaries involved in the nearest computation. Nearest has a straightforward mandate: to find and measure the straight-line distance to the closest point of interest (such as a hospital or grocery store) that matches certain criteria. It doesn't determine which is the best, as the nearest may not always be the fastest to reach. However, it provides a simple and fast approximation of what is nearby.
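
A straight-line nearest lookup can be sketched in a few lines of Python. This example uses the haversine great-circle formula and a brute-force scan; both are assumptions made for illustration (a production service would more likely use a spatial index), and the `nearest` helper and sample coordinates are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle ("as the crow flies") distance between two points, in km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest(origin, candidates):
    """Return (name, distance_km) for the closest candidate, or None if there are none.

    origin is (lat, lon); candidates is a list of (name, lat, lon) tuples.
    """
    if not candidates:
        return None
    lat, lon = origin
    distances = ((name, haversine_km(lat, lon, c_lat, c_lon)) for name, c_lat, c_lon in candidates)
    return min(distances, key=lambda pair: pair[1])


# Made-up coordinates for illustration.
gas_stations = [
    ("station_a", 40.7300, -74.0000),
    ("station_b", 40.7150, -74.0100),
    ("station_c", 40.7600, -73.9800),
]
print(nearest((40.7128, -74.0060), gas_stations))
```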

Routing / Network-based Features

This service returns the routed or road-based drive distance and drive time between two points. This can be useful for understanding commuting or errands.
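
The sketch below shows, on a toy road network, how a routed drive time and drive distance can differ from what a straight line would suggest. It runs Dijkstra's algorithm over travel time and reports the distance of the fastest route. The network, speeds, and the `fastest_route` helper are invented for this example; the routing engine Iggy actually uses is not described here.

```python
import heapq


def fastest_route(roads, origin, destination):
    """Return (minutes, km, path) for the fastest route, or None if unreachable.

    roads maps a node to a list of (neighbor, length_km, speed_kph) edges.
    """
    queue = [(0.0, 0.0, origin, [origin])]
    settled = set()
    while queue:
        minutes, km, node, path = heapq.heappop(queue)
        if node in settled:
            continue  # already reached this node by a faster route
        settled.add(node)
        if node == destination:
            return minutes, km, path
        for neighbor, length_km, speed_kph in roads.get(node, []):
            if neighbor not in settled:
                edge_minutes = 60.0 * length_km / speed_kph
                heapq.heappush(
                    queue,
                    (minutes + edge_minutes, km + length_km, neighbor, path + [neighbor]),
                )
    return None


# Toy road network: a short, slow surface route and a longer, faster highway route.
roads = {
    "home": [("junction", 2.0, 40.0), ("highway_ramp", 5.0, 60.0)],
    "junction": [("office", 6.0, 30.0)],
    "highway_ramp": [("office", 10.0, 100.0)],
}

minutes, km, path = fastest_route(roads, "home", "office")
print(minutes, km, path)
```

Here the highway route is almost twice as long as the surface route but several minutes faster, which is the kind of difference a straight-line measure cannot capture.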

With real-world commute information, we can do even more. For example, we can compute variable radius (drive time/distance) roll-ups. Proximity features can use base POI features or any data derived from boundary aggregates (like activity centers).

Check out the complete list of Iggy Features in our Feature Catalog.

Quality Control

One of the most common complaints about POI data is that it’s messy, inaccurate, and/or outdated.

At Iggy, we have a robust internal QC process and in-house tooling to vet every one of our features and ensure that they provide a meaningful description of the world. During this QC process, we use a combination of automated methods and expert human review to evaluate each feature along the following dimensions:

  • Value range: do the feature values returned by our system look reasonable?
  • Spatial distribution: do areas expected to be high value match expectations? Are there gaps where there would reasonably be data?
  • Spatial consistency: do large-scale patterns make sense when comparing different geographic regions?
  • Source data accuracy: are many points missing in the source data? Are many points incorrectly labeled?

Some of the above can be evaluated programmatically, but we’ve found that bringing in human experts to evaluate Iggy features in an area they are intimately familiar with can identify issues that even very sophisticated automated checks would miss.
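
As one example of a check that can be automated, here is a minimal Python sketch of a value-range report for a single feature. The `value_range_report` helper, the sample values, and the 0 to 100 bounds are hypothetical and chosen only to illustrate the idea; they do not describe Iggy's in-house tooling.

```python
def value_range_report(values, lower, upper):
    """Summarize how many feature values are missing or fall outside [lower, upper]."""
    present = [v for v in values if v is not None]
    out_of_range = [v for v in present if not (lower <= v <= upper)]
    return {
        "n_total": len(values),
        "n_missing": len(values) - len(present),
        "n_out_of_range": len(out_of_range),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
    }


# Made-up values for a percentage-style feature, which should stay within 0 to 100.
pct_park_area = [0.0, 12.5, 48.0, None, 103.2, -1.0]
report = value_range_report(pct_park_area, lower=0.0, upper=100.0)
print(report)
```

A report like this can flag suspicious values quickly, but it cannot judge whether a plausible-looking value is actually right for a place, which is where local human review comes in.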