Tuesday, August 14, 2007

Area Weighted Join vs. Standard Join

Students from an advanced suitability analysis course this summer needed to create a report that specified the percent of the join layer that intersected the target layer. For example, they needed to calculate the percent of each use from a land-use shapefile that intersected each ZIP code in Texas. I wrote a quick script for the class that generated the report they needed, but the implications are astounding to me.

Here is what I mean by astounding. As a test, I calculated the median household income and the total population within a 1-mile radius around each dance club in Arlington, TX, comparing the following methods:
  1. Total population using ArcMap's standard spatial join tool
  2. Total population using an area-weighted summation
  3. Average median household income using the standard spatial join tool
  4. Average median household income using an area-weighted average

The following table displays the results.

[Table: Fields A-G, comparing the standard and area-weighted spatial joins for each dance club]

Field A displays the dance club's name. Fields B, C, and D display the difference between using the standard ArcMap spatial join tool and a weighted-average spatial join tool when calculating average median household income. Fields E, F, and G display the differences when calculating the total population.

The differences in both cases are quite high. I am thinking of myself and all of the students whom I have seen naively rely on the standard spatial join tool for these types of calculations. Wow...

Why is there such a difference?

An area-weighted spatial join between two polygon layers comes in two flavors, depending on whether it is calculating an average or a sum.

If it is calculating an average, the formula is [area percent] * [value] + [area percent] * [value] + ..., where each area percent is the share of the target feature's area covered by that join feature. For example, a particular ZIP code might contain 3 block groups. Let's further suppose that block group 1 comprises 50% of the ZIP code's area, block group 2 comprises 35%, and block group 3 comprises 15%. The area-weighted average is then 0.50 * [value 1] + 0.35 * [value 2] + 0.15 * [value 3].
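
As a minimal sketch of this calculation in plain Python (independent of ArcGIS; the incomes are made up for illustration), the weighted average looks like this:

    def area_weighted_average(pairs):
        # Each pair is (fraction of the target's area covered, value).
        total_weight = sum(fraction for fraction, value in pairs)
        return sum(fraction * value for fraction, value in pairs) / total_weight

    # Block groups covering 50%, 35%, and 15% of a ZIP code,
    # with hypothetical median household incomes:
    print(area_weighted_average([(0.50, 42000), (0.35, 55000), (0.15, 31000)]))
    # -> 44900.0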

If it is calculating a sum, the most important consideration is the percent of each join feature's area that actually intersects the target feature. The formula is [% of area intersecting target feature] * [value] + [% of area intersecting target feature] * [value] + ... This is why you will see a much larger error when using the standard spatial join tool for summations than for averages: if only 2% of a block group intersects a ZIP code, the standard tool will include the entire population of the block group instead of only 2% of it.
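
Continuing the sketch, the sum flavor weights each join feature's value by the fraction of that feature falling inside the target, with no division afterward:

    def area_weighted_sum(pairs):
        # Each pair is (fraction of the join feature inside the target, value).
        return sum(fraction * value for fraction, value in pairs)

    # A block group of 4,000 people that is only 2% inside the ZIP code
    # contributes 80 people; a standard join would count all 4,000.
    print(area_weighted_sum([(0.02, 4000), (0.60, 2500)]))  # -> 1580.0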

Is This a Perfect Solution?

No. This method assumes a perfectly even distribution of the attribute (population, income, etc.) within each join feature. It is, however, a huge improvement over the standard join.

Where Can I Get the Script?

Download it here. Extract the compressed archive and you will see three Python scripts and an ArcGIS toolbox. Open ArcMap or ArcCatalog, ensure ArcToolbox is visible, and add the Spatial Join Tools.tbx (single-click).

Caveat: These scripts are first drafts and have not been tested on any systems other than the ArcInfo Desktop 9.1 and 9.2 systems here at UT Arlington. There is no documentation, and the scripts run on the slow side. Eventually they will be optimized, but at the current time they are presented as is.

Description of the three tools:
  1. Average Area Weighted: Use this tool to calculate an area-weighted average spatial join between two polygons.
  2. Sum Area Weighted Join: Use this tool to calculate an area-weighted summation spatial join between two polygons.
  3. Percent Area Report: Use this tool to generate a report that specifies the percent of the join layer that intersected the target layer.

Saturday, August 11, 2007

GIS to Select Foreclosure Residential Properties

This summer I taught a graduate real estate course exploring the ways GIS can be used to select foreclosure listings for possible investment. As far as student satisfaction is concerned, this was the most successful real estate course I have taught. The course was tight and practical, and hopefully everyone will go out and make some big bucks with the skills they learned...well, when the market eventually turns around, eh?

The 9-week course was divided into two sections. First, we went over the fundamentals of performing a comparative market analysis (CMA) using foreclosure listings and multiple listing service (MLS) listings in Tarrant County, TX. Second, we learned how to automate the process using ArcMap's ModelBuilder.

What data did we use?

We exclusively used the HUD foreclosure listings in Texas, available for free from the Southwest Alliance of Asset Managers. This is a fantastic resource, as the listings can be batch downloaded in Excel format. Each week, we were able to download a fresh batch of new foreclosure listings for Tarrant County. As an aside, HUD foreclosure listings are freely available for most states. To locate the agency that manages these listings for any state, see the HUD Homes website.

For MLS, we used data from the North Texas Real Estate Information Systems, Inc. (NTREIS). Unfortunately, batch MLS data is expensive to acquire and is only available to realtors. In May I downloaded approximately 17,000 MLS listings of all statuses, including both sold and active properties. We used this data for the entire course.

How Can a CMA Help to Locate Foreclosure Properties for Investment?

In its most straightforward sense, a CMA compares the selling prices of properties (from the MLS) in the same neighborhood as an active foreclosure listing. The lower the price of the foreclosure compared to the average selling price, the better the deal...on the surface, anyway. There are many factors to take into consideration, but first let's consider this straightforward CMA.

To accomplish this bare-bones CMA, you first need to geocode both the foreclosure and the MLS listings. You then need to generate a buffer that designates the neighborhood surrounding the foreclosure property. Then, spatially join the sold MLS points (containing the selling price) to the buffer, making sure to average the selling price field. Finally, subtract the foreclosure list price from the average selling price, and you are all set.
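
Here is a minimal sketch of that arithmetic in plain Python, assuming the points have already been geocoded to a projected coordinate system measured in feet (the coordinates and prices are hypothetical):

    import math

    def neighborhood_discount(foreclosure, sold_mls, radius=5280.0):
        # foreclosure is (x, y, list price); sold_mls is a list of
        # (x, y, selling price). A radius of 5,280 feet = a 1-mile buffer.
        fx, fy, list_price = foreclosure
        comps = [price for x, y, price in sold_mls
                 if math.hypot(x - fx, y - fy) <= radius]
        if not comps:
            return None  # no sold comparables in the neighborhood
        return sum(comps) / len(comps) - list_price

    foreclosure = (0.0, 0.0, 95000)
    sold_mls = [(1000, 500, 120000), (2000, -1500, 110000), (90000, 90000, 300000)]
    print(neighborhood_discount(foreclosure, sold_mls))  # -> 20000.0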

This is similar to the CMA reports I have seen from many commercial vendors.

Hey, This Oversimplified CMA Is Not Very Useful!

Yes, you are correct. Now, let's take a look at the many, many wrinkles that make this analysis so much fun.
  • In our CMA, we want to compare the price per square foot. The cost per square foot often decreases as the size of the property increases.
    • This causes a large problem, as many foreclosure listings, including the HUD data we used, do not report the square footage of the property. Yes, this data is available in the MLS, but the MLS only includes those properties that have been recently active. The answer is to acquire this data from the local appraisal district; in our case, the Tarrant Appraisal District (TAD). We acquired the complete primary real estate account data from TAD, which includes the square footage (living area), but unfortunately there is no reliable field in common with the foreclosure data, so we could not make a tabular relationship. So we acquired the complete parcel shapefile, tabular joined it to the primary real estate accounts, and then geocoded the foreclosure listings directly to the parcel boundaries. Our geocoding match rates were in the upper 90-percent range. Then we spatially joined the parcel shapefile to the geocoded foreclosure shapefile. Whew...we finally obtained the square footage of each foreclosure and calculated the price per square foot.
    • The MLS data already included the price per square foot.
  • A standard spatial join between the foreclosure buffer and the MLS sales is not appropriate as not all properties are comparable. Differences between the year built, number of bathrooms, etc. can decrease the accuracy of a CMA. For example, consider a foreclosure buffer with 6 recently sold properties, and 2 of the properties were built in 2004 while the other 4 were built in the 1950s. If the foreclosure property was built in the 1950s, it might not be wise to include the two newer properties in the CMA as the prices of those two might be substantially higher than the other four.
    • This caused a major wrinkle, as I know of no way to exclude such properties from a spatial join using ArcGIS's built-in tools. In other words, I know of no way to perform a spatial join filtered by a query based on the values of each feature of the target layer. When planning this class, I knew this would be the major stumbling point, so I created a Python script that does just this (a sketch of the core idea appears after this list). After I clean it up a bit this Fall semester, I will post it here and on ArcScripts.
  • Demographics, especially crime rates and potentially employment outlook, can play a major role in an investor's decision to invest in a property. After the CMA, it is then necessary to filter, rank, or weight the results by these demographic attributes.
    • We used block group data provided by Applied Geographic Solutions (AGS), which the library has been purchasing the last few years. We are shifting to accessing the data via SimplyMap, but this will be a whole other post.
  • A circular buffer is not the most reliable way to define a neighborhood. A better method is to include only those properties in the same subdivision.
    • While spotty subdivision data is included in the MLS, it is not included at all in most foreclosure listings. The solution again was to turn to TAD. If you geocode both the foreclosure listings and the MLS data using a parcel shapefile as the reference, you can spatially join the parcel shapefile back to the two geocoded point layers to obtain the subdivision. Then you need to perform another query-based spatial join (see above), which to the best of my knowledge is not included within the standard ArcGIS tools. The script that I wrote allows for this as well.
  • Yeah, there are other issues, but this is enough for now.
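
To make the query-based spatial join concrete, here is a minimal plain-Python sketch of the filtering and price-per-square-foot logic (this is not my actual script; the field names, values, and ten-year threshold are hypothetical):

    def filtered_comps(foreclosure, comps, max_year_gap=10):
        # Keep only comparables in the same subdivision whose year built
        # is within max_year_gap years of the foreclosure's year built.
        return [c for c in comps
                if c["subdivision"] == foreclosure["subdivision"]
                and abs(c["year_built"] - foreclosure["year_built"]) <= max_year_gap]

    def avg_price_per_sqft(comps):
        return sum(c["price"] / c["sqft"] for c in comps) / len(comps)

    foreclosure = {"subdivision": "OAKWOOD", "year_built": 1954,
                   "list_price": 95000, "sqft": 1400}
    comps = [
        {"subdivision": "OAKWOOD", "year_built": 1951, "price": 110000, "sqft": 1300},
        {"subdivision": "OAKWOOD", "year_built": 2004, "price": 180000, "sqft": 2200},
        {"subdivision": "ELMWOOD", "year_built": 1955, "price": 105000, "sqft": 1350},
    ]
    kept = filtered_comps(foreclosure, comps)  # only the 1951 OAKWOOD sale survives
    print(avg_price_per_sqft(kept))                         # -> ~84.62 per sq ft
    print(foreclosure["list_price"] / foreclosure["sqft"])  # -> ~67.86 per sq ft

In the real workflow, the candidate comparables come from the buffer or subdivision spatial join, and the query runs per foreclosure before the averaging step.
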
How is This Process Automated Without Any Coding?

Using ArcGIS ModelBuilder, this can be automated up to a point, but not as smoothly as I had hoped. The Python script I wrote can be incorporated into a model as a geoprocessing object, so that went fairly smoothly. Two major difficulties arose. First, there are limitations that make it difficult to set parameters for temporary layers that are contingent on other parameters. Second, I just find ModelBuilder clunky and inconsistent in general. While everyone did a fantastic job on the first section, there were varying degrees of success automating the process. I recommended that everyone interested take the Python Scripting for ArcGIS course I am teaching in the Spring 2008 semester.

A Lot of Work for One Class...What Now?

Yes, initial development on these special courses can be quite exhausting. I expect to teach this course once per year for the foreseeable future, and I plan to hold a library workshop this Fall semester showing everyone how GIS can help investors flip houses. If this topic cannot get folks into the library for a workshop, I do not know what will. (Actually, the four workshops held in the Spring 2007 semester averaged over 20 attendees each.) This workshop will be part of our GIS Day activities, but that deserves its own post as well.