Reliability of streetscape audits comparing on‐street and online observations: MAPS-Global in 5 countries

Background: Microscale environmental features are usually evaluated using direct on-street observations. This study assessed inter-rater reliability of the Microscale Audit of Pedestrian Streetscapes, Global version (MAPS-Global), in an international context, comparing on-street with more efficient online observation methods in five countries with varying levels of walkability. Methods: Data were collected along likely walking routes of study participants, from residential starting points toward commercial clusters in Melbourne (Australia), Ghent (Belgium), Curitiba (Brazil), Hong Kong (China), and Valencia (Spain). Audits were carried out in person on the street and online using Google Street View by two independent trained raters in each city. The final sample included 349 routes, 1228 street segments, 799 crossings, and 16 cul-de-sacs. Inter-rater reliability analyses were performed using Kappa statistics or Intraclass Correlation Coefficients (ICC). Results: Overall mean assessment times were the same for on-street and online evaluations (22 ± 12 min). Only a few subscales had Kappa or ICC values < 0.70, with aesthetic and social environment variables having the lowest overall reliability values, though still in the “good to excellent” category. Overall scores for each section (route, segment, crossing) showed good to excellent reliability (ICCs: 0.813, 0.929 and 0.885, respectively), and the MAPS-Global grand score had excellent reliability (ICC: 0.861) between the two methods. Conclusions: MAPS-Global is a feasible and reliable instrument that can be used both on-street and online to analyze microscale environmental characteristics in diverse international urban settings.

Observation, or audit, measures have been developed to evaluate different types of built environments (e.g., urban centers, residential neighborhoods, public open spaces) [7][8][9][10]. Studies initially established the inter-rater reliability of these instruments using in-person, on-street measurements [7,11,12]. One such instrument is the Microscale Audit of Pedestrian Streetscapes (MAPS), whose items and subscales mainly had moderate to excellent inter-observer reliability [12] and demonstrated validity through associations with several physical activity (PA) measures in multiple age groups [2].
On-street observations usually require more time and expense than measurements conducted remotely using online imagery, for example Google Street View. Remote online observations reduce travel costs and are particularly useful when evaluating geographically dispersed or international locations [7,10,13]. Several studies documented generally strong agreement between on-street and online observations in the USA, Australia and New Zealand [7][8][9][14]. For example, Wilson et al. [7] reported significant associations between on-street and Google Street View measures for most items in an instrument applied in two US cities. A shorter version of MAPS (i.e., MAPS Abbreviated Online) was shown to be a reliable online audit tool when compared to on-street assessments [15,16].
The MAPS-Global observation instrument was based in part on the original MAPS [2] and designed to be appropriate for international use, providing measures of microscale features that are comparable across countries by drawing on items developed across several continents [17]. Because MAPS-Global is the first audit instrument designed for international use, it is important to evaluate its performance across countries with a range of built environment and cultural characteristics. The present study aimed to assess cross-method reliability of MAPS-Global on an international basis by comparing on-street and online observations in five diverse countries.

Microscale Audit of Pedestrian Streetscapes-Global Version (MAPS-Global)
As described elsewhere [17], MAPS-Global was based on the original MAPS tool developed and validated in the US [2,12]. MAPS-Global was modified substantially by drawing on items from built environment instruments developed on multiple continents: MAPS (US) [12], Bikeability Toolkit (Australia) [18], SPACES (Australia) [19], ALPHA (Europe) [20], REAT (UK) [21], FASTVIEW (UK) [22], school audit tool used in SPEEDY/ISCOLE study (UK/International) [23], EAST_HK (Hong Kong) [11], NEWS-Africa [24], and NEWS-India [25]. Wording and scoring were altered for greater international applicability and consistency within MAPS-Global. Numerous international investigators provided input and pre-tested drafts [17]. A key purpose was to represent PA-relevant streetscape characteristics that are relevant across diverse geographic settings. If important attributes only seemed relevant in a few locations, they were retained. Thus, MAPS-Global was designed to be tailored to most settings with specialized items, while remaining comparable across countries because the same instrument is used everywhere. Examples of items common in a subset of settings include pedestrian streets that are common in Europe but rare in the US, unpaved roads that are common in Africa and India, and cul-de-sacs that are common in the US but not elsewhere [17].
MAPS-Global has 123 items in four sections: overall route, street segments, street crossings, and cul-de-sacs. The route section has three subsections: destinations and land use, streetscape, and aesthetics and social environment. Route items assess characteristics along a short route from a residential starting-point address towards a pre-selected cluster of non-residential land-use destinations (e.g., shopping areas, restaurants). Route items evaluate, for example, presence of non-residential destinations within the short route, aesthetic characteristics, and transit stops. Street segment (defined as the area between street crossings) items measure aspects of sidewalks, bicycle facilities, and pedestrian shortcuts. Crossings items analyze pedestrian protection features and width of crossings. The cul-de-sac section includes size and presence of amenities. The MAPS-Global audit instrument, manual, and training webinars can be found at https://drjimsallis.org/measure_maps.html. MAPS-Global was found to have good inter-rater reliability for on-street observations in 5 countries [17].

Study design and cities
The present study was conducted in five cities: Melbourne (Australia), Ghent (Belgium), Curitiba (Brazil), Hong Kong (China), and Valencia (Spain). Table 1 indicates study locations and summarizes sample sizes for the MAPS-Global evaluation in each country. This study was developed within the framework of the IPEN (International Physical Activity and the Environment Network) Adolescent project (www.ipenproject.org), which had the goal to represent all inhabited continents with the maximum variability in built environments. Cities included in the present reliability study covered diverse contexts from different continents. For instance, Melbourne represented a low population density city, Curitiba a middle-income site, and Hong Kong a high population density and high-income place [26].
Target locations were selected in each city using a geographically stratified sampling design to ensure representation of neighborhoods varying in walkability and socio-economic status (SES). To select high- versus low-walkable neighborhoods, all cities used a GIS-derived macro-level walkability index based on net residential density, intersection density, and mixed land use [27,28]. High and low SES categories were established using census data about household income or education. Deciles were calculated. The lowest five deciles constituted the "low" category and the highest five deciles corresponded with the "high" category in most cities, while more stringent criteria were applied in Curitiba which excluded the highest, lowest and middle deciles of SES scores [26]. As in previous research [27,28], a 2 × 2 matrix was defined by high/low walkability and high/low SES. Participants were recruited from areas that met walkability and SES criteria. For the present study, participant addresses were randomly selected and stratified by quadrant, except for Melbourne where general residential addresses were randomly selected from areas within the 2 × 2 matrix. These addresses served as route starting points. Apart from these residence-based routes, to ensure wide variation of contexts, audits were also conducted on segments near some routes which mainly contained retail destinations. These are referred to as commercial routes. IPEN Adolescent was approved for research with human subjects by the Institutional Review Boards at the authors' universities. Present analyses did not use IPEN Adolescent participant data.
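The 2 × 2 stratification described above can be illustrated with a short sketch. It is not the study's sampling code: the function names and the example addresses are hypothetical, and it assumes that both the walkability index and the SES indicator are available as deciles (1 to 10) and split at the fifth decile. The more stringent Curitiba criteria are not modelled.

```python
import random
from collections import defaultdict


def decile_category(decile):
    """Lowest five deciles -> 'low', highest five deciles -> 'high'."""
    if not 1 <= decile <= 10:
        raise ValueError("decile must be between 1 and 10")
    return "low" if decile <= 5 else "high"


def quadrant(walk_decile, ses_decile):
    """Return the (walkability, SES) cell of the 2 x 2 matrix."""
    return decile_category(walk_decile), decile_category(ses_decile)


def sample_by_quadrant(addresses, n_per_quadrant, seed=0):
    """Randomly select route starting points within each quadrant.

    addresses: iterable of (address_id, walk_decile, ses_decile) tuples
    (hypothetical record layout used only for this illustration).
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    groups = defaultdict(list)
    for address_id, walk_decile, ses_decile in addresses:
        groups[quadrant(walk_decile, ses_decile)].append(address_id)
    return {
        cell: rng.sample(ids, min(n_per_quadrant, len(ids)))
        for cell, ids in sorted(groups.items())
    }


# Invented example data: eight addresses, two per quadrant.
example_addresses = [
    ("A1", 9, 8), ("A2", 7, 10), ("A3", 8, 2), ("A4", 10, 4),
    ("A5", 2, 9), ("A6", 3, 6), ("A7", 1, 1), ("A8", 5, 5),
]
starting_points = sample_by_quadrant(example_addresses, n_per_quadrant=1)
```

Sampling within each cell, rather than from the pooled address list, is what keeps all four walkability-by-SES combinations represented among the route starting points.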

Data collection
MAPS-Global data were collected on-street and online in 2015 by two independent raters in each country to evaluate cross-method reliability. One rater carried out the observations by walking on-street. The other rater, who was also in-country, carried out online audits, using Google Earth and Google Street View imagery. Following previous research [2,12], MAPS-Global observations were conducted along a 400-725 m route from a starting point toward a pre-determined commercial cluster along the street network, to represent a likely walking route. The final sample included 349 routes, 1228 street segments, 799 crossings, and 16 cul-de-sacs (see Table 1). Commercial routes represented approximately 20 % of the final sample.
As mentioned elsewhere [17], a research staff manager from the IPEN Coordinating Center was responsible for training and quality control. Raters were trained in two stages. First, the IPEN Coordinating Center gave remote training to each country's investigative team via a webinar and provided training materials, including a manual with item definitions and photos. Country teams then conducted their own on-street training sessions, sending photos to the coordinating center for clarification. Second, raters were certified by completing at least 5 routes, including at least 2 commercial routes, 5 segments, 5 crossings, and 2 cul-de-sacs/dead ends. When 95 % inter-rater agreement was reached with the trainer at the coordinating center, raters were certified to rate independently. Most raters reached certification during the first round of 5 routes, whilst some required two rounds to reach certification. Investigators were encouraged to hold weekly rater meetings to review questions and concerns, and to minimize rater drift over time.

Scoring and data analysis
MAPS-Global scoring was similar to that of the original MAPS [12]. Items used a variety of response formats; therefore, all items (except for land uses) were dichotomized or trichotomized to provide relatively equal weighting when creating scales. Land use items were scored as 0, 1, 2, 3, 4 or 5+. Subscales were computed by summing related items after they were rescored. The cul-de-sac section was not analyzed due to the small sample size and unclear expected association with PA. Positive and negative valence scores were created by summing subscales based on expected associations with PA. To create "overall" section scores, negative valence scores were subtracted from positive valence scores. Finally, a grand score was calculated by subtracting the overall negative valence score from the overall positive valence score. Three new conceptual subscales were developed for MAPS-Global, drawing from multiple sections: pedestrian infrastructure, pedestrian design, and bicycle facilities [17]. Detailed information about item recodes and subscale creation can be downloaded (https://drjimsallis.org/Documents/Measures_documents/MAPS%20DATA%20DICTIONARY_GLOBAL_090617.pdf).
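The scoring steps above (recode items, sum them into subscales, then subtract negative from positive valence scores) can be sketched as follows. This is a simplified illustration only: the subscale names, the dichotomization threshold, and all values are invented, and the actual recodes are those documented in the MAPS-Global data dictionary.

```python
# Hypothetical illustration of MAPS-Global style scoring; item names,
# thresholds, and values are invented for this example.

def dichotomize(value, threshold=1):
    """Recode a raw item response to 0/1 (the threshold is an assumption)."""
    return 1 if value >= threshold else 0


def score_land_use(count):
    """Land use items are scored 0, 1, 2, 3, 4 or 5+ (5+ is stored here as 5)."""
    return min(max(count, 0), 5)


def subscale(items):
    """Sum already-recoded items into a subscale score."""
    return sum(items)


def overall_section_score(positive, negative):
    """Positive valence score minus negative valence score for one section."""
    return sum(positive.values()) - sum(negative.values())


def grand_score(sections):
    """Overall positive valence minus overall negative valence across sections."""
    overall_positive = sum(sum(pos.values()) for pos, _ in sections.values())
    overall_negative = sum(sum(neg.values()) for _, neg in sections.values())
    return overall_positive - overall_negative


# Each section maps to (positive subscales, negative subscales).
sections = {
    "route": ({"land_use": 7, "streetscape": 4}, {"aesthetics_negative": 2}),
    "segment": ({"sidewalk": 5}, {"obstructions": 1}),
    "crossing": ({"crossing_amenities": 3}, {"road_width": 2}),
}
section_scores = {
    name: overall_section_score(pos, neg) for name, (pos, neg) in sections.items()
}
total = grand_score(sections)  # (11 + 5 + 3) - (2 + 1 + 2) = 14
```

Keeping positive and negative subscales separate until the final subtraction means each valence score can also be reported on its own.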

Analyses
Inter-rater reliability analyses were performed using the Kappa statistic for dichotomous variables and Intraclass Correlation Coefficients (ICCs) for continuous or ordinal variables using the one-way random model for average measures, considering values ≥ 0.60 as "good to excellent" reliability, values between 0.41 and 0.60 as "moderate" reliability and values ≤ 0.40 as "fair to poor" reliability [29]. Items rarely observed and with low variability in scores (i.e., almost all zeros or 'never') but percentage agreement between raters ≥ 75 % were considered to have good reliability irrespective of low ICCs [19].
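For readers who want to reproduce these statistics outside SPSS, the sketch below computes Cohen's Kappa for two raters, a one-way random, average-measures ICC from the between-target and within-target mean squares, and percentage agreement, and then applies the reliability categories given above. It is a plain-Python illustration and not the authors' procedure. The function names and example ratings are hypothetical, and values falling between 0.40 and 0.41 are treated as "moderate" as an assumption, because the stated bands do not cover them.

```python
def cohen_kappa(rater_a, rater_b):
    """Cohen's Kappa for two raters scoring the same items."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    categories = set(rater_a) | set(rater_b)
    expected = sum(
        (rater_a.count(c) / n) * (rater_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return float("nan")  # no variability: Kappa is undefined
    return (observed - expected) / (1 - expected)


def icc_oneway_average(ratings):
    """One-way random, average-measures ICC: (MSB - MSW) / MSB.

    ratings: one row per rated target, one value per rater.
    """
    n, k = len(ratings), len(ratings[0])
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    ms_between = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, row_means) for x in row
    ) / (n * (k - 1))
    if ms_between == 0:
        return float("nan")  # identical target means: ICC is undefined
    return (ms_between - ms_within) / ms_between


def percent_agreement(rater_a, rater_b):
    """Percentage of items on which both raters gave the same score."""
    return 100 * sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)


def classify(value):
    """Map a Kappa or ICC value to the reliability categories used here."""
    if value >= 0.60:
        return "good to excellent"
    if value > 0.40:  # assumption: 0.40 < value < 0.60 is 'moderate'
        return "moderate"
    return "fair to poor"


def item_is_reliable(value, agreement_pct, rarely_observed=False):
    """Rarely observed items with >= 75 % agreement count as good reliability."""
    if rarely_observed and agreement_pct >= 75:
        return True
    return classify(value) == "good to excellent"


# Invented ratings: on-street rater versus online rater.
on_street = [1, 1, 0, 0]
online = [1, 0, 0, 0]
kappa = cohen_kappa(on_street, online)              # observed 0.75, expected 0.50
icc = icc_oneway_average([[1, 2], [2, 3], [3, 4]])  # MSB = 2.0, MSW = 0.5
```

The rare-item rule is kept as a separate function because Kappa and ICC can be very low when almost all scores are zero, even though the raters agree on nearly every observation.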
Analyses were performed using SPSS version 22 (SPSS Inc., Chicago, IL). For each item (both original and recoded), range, frequency and inter-rater reliability were calculated as well as mean and standard deviations for both on-street and online raters.

Results
Figure 1 shows images of a sample residential segment and commercial segment for each of the cities. The number of routes, segments and crossings and average assessment times varied across countries (Table 1).
With the exception of Belgium, online mean assessment times, not including travel, were a little higher than on-street times. However, overall, mean (± SD) assessment time was 22 ± 12 min for both on-street and online route evaluation. Table 2 provides route subscale reliability and descriptive analyses. For the destinations and land use subsection, all subscales showed good to excellent reliability between on-street and online raters, with ICCs ranging from 0.680 to 0.859, including the overall score with an ICC value of 0.856. Items that were thought to positively influence walking in the streetscape subsection (such as street amenities and traffic calming signage) were aggregated into a positive valence score, which showed good to excellent reliability (ICC: 0.742). Aesthetics and social subsection subscales also showed good to excellent reliability, including the overall score (ICC: 0.736).
Segment and crossing subscale reliability and descriptive analyses are shown in Tables 3 and 4, respectively. The majority of subscales had ICCs higher than 0.80 (i.e., excellent reliability), and almost all subscales showed good reliability with ICCs higher than 0.60. Only two single-item indicators (informal path or shortcut positive, and hawkers/shops positive) had low Kappa and ICC values due to insufficient variability, but inter-rater agreement for those items ranged from 93.3 % to 95.7 %. The overall segment score had an ICC value of 0.929, and the overall crossing score had an ICC of 0.885.
Finally, Table 5 shows MAPS-Global grand scores and conceptual scale reliability results. Pedestrian infrastructure, pedestrian design, and bike facilities scores showed good to excellent reliability, with ICC values higher than 0.87. The MAPS-Global overall grand score had similar mean values for the on-street and online raters and demonstrated good to excellent reliability (ICC: 0.861).

Discussion
The present study in five diverse countries examined the reliability between on-street and online observations conducted by different raters using the MAPS-Global tool that was designed for international use. Results showed good to excellent agreement between on-street and online audits for most of the summary scores analyzed. Only a few subscales had Kappa or ICC values < 0.70 (23.3 %), with aesthetic and social environment variables having the lowest overall reliability values, though still in the "good to excellent" category. Present findings of high reliability of different observers across different data collection methods were very similar to a previous report of reliability of MAPS-Global across two independent observers using the on-street method [17]. Present results indicate that MAPS-Global can be used internationally with either the on-street or online method, if online imagery data are available and sufficiently recent. Therefore, the present study adds international data supporting acceptable to high reliability across on-street and online observations.
There is no consensus on the time efficiency when comparing on-street and online environment audits, not including travel time. Studies have reported online time savings [8,10,30], no differences [9] or even longer time to complete online audits [7]. This lack of consensus is also present across countries within our study (see Table 1). These differences could depend on such issues as the complexity of the environment, characteristics of the assessment tool, or even differences in computer speed. However, online assessments eliminate travel time and costs [9,10]. Remote audits also address safety problems associated with dangerous neighborhoods [9] and allow researchers to conduct assessments across multiple sites or vast areas [10]. In general, authors appear to agree that Google Earth and Google Street View can be efficient tools for collecting data on micro-scale neighborhood characteristics [9].
However, online methods present limitations that should be considered. Although coverage is increasing rapidly, imagery is not available in many countries, on some streets, or in rural areas [7,14]. Many of these gaps should be addressed over time, but gaps are likely to remain in the lowest income countries and in some countries that prohibit or greatly restrict image-gathering programs such as Google Street View. Limitations of the online method include the time difference between collection of the imagery and its online observation. A related limitation of the present study was lack of documentation of the date of the imagery and interval between imagery collection and observation. Some characteristics can be difficult to view due to the camera's perspective, resolution, or parked or moving vehicles that could block the view of the sidewalks and buildings [7,9,10,14]. Camera views of tall buildings also are restricted. These limitations might explain lower reliability results for aesthetic and social environment variables in the present study and in others [16]. However, these lower reliability results might also be explained by the transitory and subjective nature of these characteristics [31]. Temporal variability of Google Earth and Google Street View images and acquisition dates across locations should be taken into account when auditing multiple sites [7][8][9][13].
Considering good inter-rater reliability and advantages of online audits, we conclude the MAPS-Global instrument can be used both on-street and online to analyze the micro-scale environment characteristics across diverse countries. The present findings also provide initial evidence to justify combining observations from both data collection methods in the same study due to good overall comparability. Next steps are to evaluate MAPS-Global in more countries, especially low-income countries, identify characteristics of the built environment that may moderate the reliability and validity of online audits (e.g., density), and assess construct validity in relation to physical activity and other outcomes. Further studies with larger samples are needed to examine whether there are differences across countries in reliability across observation methods. It would also be useful to evaluate whether it makes a difference if the rater is familiar with the country and language being observed, as online assessments from a central location could provide more efficient and standardized data collection for international studies. MAPS-Global has been shown to have strong inter-observer agreement with in-person auditing [17], and present results showed acceptable agreement between in-person and online auditing in diverse countries. These results provide reassurance about the international applicability of MAPS-Global and its psychometric qualities. MAPS-Global data have been collected for a subsample of routes beginning at residences of a subset of participants in IPEN-Adolescent in eight countries [26]. These data can now be analyzed to address important questions related to health geography. Streetscape scores can be compared across diverse countries to understand the range and distribution of pedestrian- and bicycle-supportive environments. Differences in streetscape quality across lower- and higher-SES areas can be examined.
Central to the aims of IPEN-Adolescent, the relation of streetscape quality to adolescents' physical activity patterns and weight status can be studied, and differences in associations across countries can be explored. We encourage other investigators to use MAPS-Global to answer a variety of important questions related to health and geography. MAPS-Global data can be used to develop evidence-based built environment recommendations for policies and practices that are either tailored to particular locales or applicable internationally.

Research highlights
• The MAPS-Global streetscape audit tool was evaluated for reliability in 5 countries.
• The tool showed good-to-excellent reliability between on-street and online audits.
• MAPS-Global could be used both on-street and online internationally.