This page gives an overview of the model-ready data and features that Iggy provides. This is meant to accompany the Iggy Feature Catalog.
How Iggy thinks about location features
At Iggy, we think about location-related features in terms of boundaries, data sources, and aggregations. These three components form the core of our data model. Put most simply, each Iggy feature is the result of an aggregation applied to an underlying data source within a boundary.
Boundaries
Many data sets have location fields that link a row of data to a real place on Earth. Depending on the particular location field, that may be a relatively general place (e.g. a metro area or county) or a very specific place (e.g. a quadkey or address). Traditionally, some of the challenge in dealing with location data involves conversion from specific to general places. For example, a dataset may have an address field, while the available economic data is reported only at the county level. How do you link each address to its county in order to add features from the economic dataset?
We use the term boundary to describe the geographic area over which some data is aggregated. Iggy pre-aggregates features to boundary levels ranging from general (metro area) to specific (quadkey) so that users can pull data at exactly the level they need. For example, if your data set includes a zip code field, Iggy provides features that have been pre-aggregated at the zip code level like count of restaurants per capita within each zip.
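As a rough illustration of pulling pre-aggregated features at the level your data already has, the sketch below joins zip-level features onto rows that carry a zip code field. The helper name `add_boundary_features`, the attribute `is_restaurant`, and all feature values are hypothetical and made up for the example; they are not taken from Iggy's data or API.

```python
# Hypothetical zip-level features, keyed by 5-digit zip code.
# The values are invented for illustration only.
zip_features = {
    "94110": {"poi_is_restaurant_count_per_capita": 0.0041},
    "10001": {"poi_is_restaurant_count_per_capita": 0.0125},
}

# Rows from a user's own data set, each with a zip code field.
rows = [
    {"customer_id": 1, "zipcode": "94110"},
    {"customer_id": 2, "zipcode": "10001"},
    {"customer_id": 3, "zipcode": "00000"},  # no matching boundary
]


def add_boundary_features(rows, features_by_boundary, key, feature_names):
    """Left-join boundary-level features onto rows using the boundary id in `key`."""
    enriched = []
    for row in rows:
        feats = features_by_boundary.get(row[key], {})
        merged = dict(row)
        for name in feature_names:
            # Rows whose boundary has no features get None for that feature.
            merged[name] = feats.get(name)
        enriched.append(merged)
    return enriched


enriched_rows = add_boundary_features(
    rows, zip_features, "zipcode", ["poi_is_restaurant_count_per_capita"]
)
```

Because the features are already aggregated per boundary, the join is a plain key lookup and needs no geometry at this step.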
Currently Iggy provides features pertaining to the following boundaries, from general to specific:
- metro: Census Core Based Statistical Area, identified by CBSA FIPS
- county: County, identified by 5-digit FIPS
- locality: City, identified by ID from the Who's on First gazetteer
- zipcode: Zip Code, identified by 5-digit zip code
- census_tract: Census Tract, identified by 11-digit census tract GEOID
- cbg: Census Block Group, identified by 12-digit census block group GEOID
- qk_isochrone_walk_10m: 10-min Walk Isochrone, identified by zoom-19 quadkey identifier
The most fine-grained boundary type we currently offer is the 10-min walk isochrone, which is the boundary that encompasses the walkable area within 10 min of a zoom-19 quadkey (a map tile with side length ~75m). By providing features aggregated at this fine-grained level, users with addresses or geographic coordinates can add hyper-local features to their models.
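To look up features at this level, a user with geographic coordinates first needs the quadkey of the tile containing them. The sketch below assumes that Iggy's quadkeys follow the widely used Bing Maps tile system (Web Mercator tiles, one base-4 digit per zoom level); this page does not confirm that, and the function names are illustrative.

```python
import math


def tile_to_quadkey(tile_x: int, tile_y: int, zoom: int) -> str:
    """Interleave the bits of the tile x/y indices into a base-4 quadkey string."""
    digits = []
    for level in range(zoom, 0, -1):
        mask = 1 << (level - 1)
        digit = 0
        if tile_x & mask:
            digit += 1
        if tile_y & mask:
            digit += 2
        digits.append(str(digit))
    return "".join(digits)


def latlon_to_quadkey(lat: float, lon: float, zoom: int = 19) -> str:
    """Return the quadkey of the Web Mercator tile containing (lat, lon)."""
    # Clamp to the latitude/longitude range covered by Web Mercator.
    lat = max(min(lat, 85.05112878), -85.05112878)
    lon = max(min(lon, 180.0), -180.0)

    n = 2 ** zoom  # number of tiles along each axis at this zoom
    x = (lon + 180.0) / 360.0
    sin_lat = math.sin(math.radians(lat))
    y = 0.5 - math.log((1 + sin_lat) / (1 - sin_lat)) / (4 * math.pi)

    tile_x = min(max(int(x * n), 0), n - 1)
    tile_y = min(max(int(y * n), 0), n - 1)
    return tile_to_quadkey(tile_x, tile_y, zoom)


# Example: a zoom-19 quadkey for a point in Manhattan.
quadkey = latlon_to_quadkey(40.7484, -73.9857)
```

A zoom-19 quadkey always has 19 characters, each in the range 0 to 3, and every prefix of it names the enclosing tile at a coarser zoom.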
Data Sources
A data source describes the underlying geographic data that is aggregated within a boundary. Each data source has rows that represent points, lines, or polygons with geographic coordinates.
Many different types of data can be construed as geographic, such as local businesses, demographics, and topography. Our demo dataset incorporates features computed from the following data sources:
Points of Interest (poi)
- Points of interest are businesses and services with a physical presence, including restaurants, manufacturing sites, and community centers.
- Our poi features are aggregated from an underlying dataset of points, each representing a distinct point of interest and categorized based on the Iggy Feature Catalog.
American Community Survey (acs)
- The U.S. Census ACS data includes information about demographics, household composition, employment, commute patterns, and housing. Iggy currently relies on ACS data collected over the 5-year period 2014-2019. The primary advantage of using multi-year estimates is the increased statistical reliability for less populated areas and small population subgroups.
- Only census-designated boundaries (county, census_tract, and cbg) incorporate features from acs, as these are the levels at which ACS data is reported and provided.
Water (water)
- Iggy produces features that summarize the coastline, rivers, and lakes within a boundary.
- Our water features are aggregated from an underlying dataset that represents coastline as lines, and rivers and lakes as polygons.
Parks (park)
- We also provide features calculated based on national, state, and local parks within a boundary.
- Our underlying park data represents each park as a polygon.
Data Attributes
Each data source also has one or more attributes describing each row that can be used to filter aggregations and derive more interesting features:
poi
poi data attributes indicate the POI category, and whether it is a brand/chain:
- Ontology Top-level Category Attributes (see Iggy Feature Catalog): is_{top_level_category}
- Ontology Sub-level Category Attributes (see Iggy Feature Catalog): is_{sub_level_category}
- Chain: is_brandname indicates whether the POI is a brand or chain (e.g. McDonald’s, Dollar Store, Pep Boys)
acs
acs data attributes indicate a particular Census summary statistic about the relevant boundary (county, census_tract, or cbg). They cover a variety of types of information:
Demographics
Includes attributes related to age (e.g. median_age), gender (e.g. pop_sex_male, pop_sex_female_age_5_to_9), race/ethnicity (e.g. pop_race_asian), and birthplace/citizenship (e.g. pop_citizenship_us_naturalized).
Social
Includes attributes surrounding household composition (e.g. households_female_head_with_children, households_cohabiting_couple), education (e.g. pop_adult_education_less_than_high_school), and veteran status (e.g. pop_veterans).
Economic
Includes attributes indicating income (e.g. households_with_annual_income_200000_or_more, pop_below_100_pct_poverty_level), employment status (e.g. pct_in_labor_force_status_civilian_employed), and employment industry (e.g. pop_works_industry_manufacturing).
Commute
Includes attributes indicating (pre-2020) commute habits, including method (e.g. pop_commutes_by_public_transport_rail), time (e.g. pop_commute_departure_0630_to_0659), and duration (e.g. pop_commute_travel_time_20_to_24_min).
Housing
Includes attributes dealing with housing unit type (e.g. housing_units_boat_rv_van), age (e.g. housing_units_built_1939_or_earlier), ownership status (e.g. housing_units_renter_occupied), size (e.g. housing_units_10_to_19_in_structure), and value (e.g. housing_units_value_150000_to_199999).
water
water data attributes indicate the type of water body.
- Type of water body
is_coastline
is_river
is_lake
parks
Our parks data includes terrestrial and marine protected areas inventoried by the United States Geological Survey in the Protected Areas Database of the United States (PAD-US). Areas in PAD-US are dedicated to preserving biological diversity, and to other natural, recreation, and cultural uses.
Our breakdown of the PAD-US includes aspects of ownership / management (federal, state, local, and private land, for instance), the intended use (recreation vs. agriculture or ranching), and other access-related attributes.
- Park-related attributes:
is_federal_land
is_state_land
is_local_land
is_native_american_land
is_private_land
is_special_district_land
is_easement
is_historic_or_cultural_area
is_agricultural_or_ranching_area
is_conservation_area
is_open_or_limited_access_area
is_open_access_area
is_parks_and_recreation
is_protected_area
Note that a park may have a value of 1 for more than one attribute. For example, a state park might have is_conservation_area=1, is_open_access_area=1, and is_parks_and_recreation=1.
The full set of underlying data sources and attributes is detailed in the Feature Catalog.
Iggy Feature Catalog
The Iggy Feature Catalog is used to organize places in our poi data source. You can find more information, including definitions and examples, in the Iggy Feature Catalog reference page.
Aggregations and Normalizations
Given a boundary (like a zip code) and a data source (like POIs), Iggy produces features by running an aggregation of the data intersecting the boundary. Aggregations range from simple (e.g. counts of items intersecting a boundary) to more complex spatial functions (e.g. square km in the intersection between a boundary and a data source like lakes).
In addition to aggregations, Iggy also provides features that have additional normalization calculated on top of the aggregation, like dividing by the boundary population or area.
The following is a list of the various aggregations and normalizations that are used to produce Iggy features.
Aggregations
- [none]: Features with no aggregation are generated by taking the raw value from the boundary itself, or from a boundary-linked data source like acs.
- count: Count of distinct rows from the underlying data source that intersect a boundary. If the count feature is associated with a data attribute, then the count indicates the number of distinct rows having that particular attribute. For example, the feature poi_is_education_count indicates the number of distinct rows from the poi dataset having the attribute is_education=True.
- intersects: A boolean feature indicating whether the boundary intersects any row within the underlying data source.
- intersecting_area_in_sqkm: A float feature indicating the total area (in sq km) of the intersection between a boundary and any row in the underlying polygon data source. This can only be computed for data sources whose rows are polygons, like park and water.
- intersecting_length_in_sqkm: A float feature indicating the total length (in km) of the intersection between a boundary and any row in the underlying line data source. This can only be computed for data sources whose rows are lines, like water where is_coastline=True.
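The count and intersects aggregations can be sketched for a point data source as follows. This is a minimal illustration, not Iggy's implementation: it assumes planar x/y coordinates, a boundary given as a simple polygon (a list of vertices), and a standard ray-casting point-in-polygon test. The function names and the sample data are hypothetical.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon's vertex ring."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def count_aggregation(points, boundary, attribute=None):
    """Count rows inside the boundary, optionally only those with a truthy attribute."""
    return sum(
        1
        for p in points
        if point_in_polygon(p["x"], p["y"], boundary)
        and (attribute is None or p.get(attribute))
    )


def intersects_aggregation(points, boundary, attribute=None):
    """Boolean feature: does any (optionally filtered) row fall inside the boundary?"""
    return count_aggregation(points, boundary, attribute) > 0


# Made-up boundary (a 4 x 4 square) and made-up POI rows.
boundary = [(0, 0), (4, 0), (4, 4), (0, 4)]
pois = [
    {"x": 1, "y": 1, "is_education": True},
    {"x": 2, "y": 3, "is_education": False},
    {"x": 3, "y": 2, "is_education": True},
    {"x": 5, "y": 5, "is_education": True},  # outside the boundary
]

poi_count = count_aggregation(pois, boundary)
poi_is_education_count = count_aggregation(pois, boundary, "is_education")
```

The area and length aggregations need polygon and line clipping, which is better left to a geometry library and is not shown here.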
Normalizations
- per_sqkm: Divides the aggregated feature value by the boundary area, in sq km
- per_capita: Divides the aggregated feature value by the boundary population
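Both normalizations are simple divisions applied on top of an aggregated value. The sketch below shows one way to write them; returning None for a boundary with zero area or zero population is an assumption made for this example, since this page does not say how such cases are handled.

```python
def per_sqkm(value, boundary_area_sqkm):
    """Divide an aggregated value by the boundary area in sq km."""
    if not boundary_area_sqkm:
        return None  # assumption: undefined for zero or missing area
    return value / boundary_area_sqkm


def per_capita(value, boundary_population):
    """Divide an aggregated value by the boundary population."""
    if not boundary_population:
        return None  # assumption: undefined for zero or missing population
    return value / boundary_population


# Made-up numbers: 12 museums in a boundary of 8 sq km with 4,000 residents.
museums_per_sqkm = per_sqkm(12, 8.0)
museums_per_capita = per_capita(12, 4000)
```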
Interpreting Features
The Feature Catalog provides a complete listing of the available Iggy features at each boundary level.
In general, features are named using the following convention:
{data_source}[_{data_attribute}]_{aggregation}[_{normalization}]
For example, the feature poi_is_museum_count_per_capita is calculated for a particular boundary by taking the data source poi, filtering for rows where is_museum=True, applying the aggregation count within the boundary, and finally applying the per_capita normalization to divide the count by the boundary population.
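The naming convention can be expressed as a small helper that joins the parts with underscores and skips the optional ones. The function `feature_name` is hypothetical and only illustrates the pattern; it does not cover the feature names that deviate from the convention.

```python
def feature_name(data_source, aggregation, attribute=None, normalization=None):
    """Build a name as {data_source}[_{attribute}]_{aggregation}[_{normalization}]."""
    parts = [data_source]
    if attribute:
        parts.append(attribute)
    parts.append(aggregation)
    if normalization:
        parts.append(normalization)
    return "_".join(parts)


name_full = feature_name(
    "poi", "count", attribute="is_museum", normalization="per_capita"
)
name_no_norm = feature_name("poi", "count", attribute="is_education")
```

Here `name_full` is poi_is_museum_count_per_capita and `name_no_norm` is poi_is_education_count, matching the two feature names used as examples on this page.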
Some feature names deviate slightly from this convention in order to make them more interpretable. For example, the feature lake_pct_area_intersecting_boundary is an easier way of expressing the feature generated from the water data source by filtering for rows where is_lake=True, applying the intersecting_area_in_sqkm aggregation, and then applying the per_sqkm normalization. The Feature Catalog is searchable by data source, attribute, aggregation, and normalization as well as feature name.