Abstract. A significant limitation of conventional appraisal methods is their reliance on outdated
manual techniques, such as matched pair analyses, which bypass property feature value
contributions as if they didn’t exist. In fact, value contributions provide the necessary
basis for mathematical constraints that prevent over and undervaluation by traditional
appraisers.
The Residual Constraint Approach (RCA) advances the traditional Sales Comparison Approach
(SCA) by integrating a multi-phase valuation process that employs Multivariate Adaptive
Regression Splines (MARS) alongside rigorous mathematical constraints on feature value
contributions, in particular unmeasured features such as condition, quality, aesthetics and
design whose values are typically subject to subjective judgment and bias by the traditional
appraiser.
The mathematical constraints in RCA rely on the expert application of MARS regression to estimate
the value contributions of measurable property features to the sale price. However, this estimate
typically accounts for only about 80% of the actual sale price, excluding subjectively assessed features
like condition, quality, design, and functional utility. In the San Francisco Bay Area, the
residual—representing these subjective components—averages around 20% of the total property value.
The
figure reflects the author’s experience in the San Francisco Bay Area under typical data conditions
and should not be read as a market-invariant constant. In markets where the achievable
is
lower, the framework’s response is to identify and measure unmeasured value drivers — distance to
the ocean in coastal communities, elevation in hillside neighborhoods, and similar variables
collected from external sources when MLS data does not supply them — rather than to
accept the limitation and report wider error bands. While MARS residuals are often treated
as estimation errors in other contexts, in real estate valuation they serve as an indirect
measure of subjective feature value. Though this might initially seem impractical, proof
demonstrates that the residual can be meaningfully decomposed into descriptive components,
provided their contributions sum to the residual—without affecting the final valuation
outcome.
This methodology emerges from two decades of empirical application by the author, developed
through iterative MARS implementations in R and Python environments.
This paper presents the logical and mathematical foundation for a modern approach to
accurately estimating the market value of residential and other real estate properties. The
method, developed over 20 years of experience applying Multivariate Adaptive Regression
Splines (MARS) software to Northern California homesparticularly in San Francisco Bay
Area communitiesoffers a significant advancement in valuation precision. MARS enhances
valuation quality by adeptly capturing non-linear relationships between property features
and value, while efficiently identifying and prioritizing the most impactful variables. This
capability allows appraisers to uncover value dynamics in new market areas far more rapidly
than traditional methods permit. MARS forms the backbone of the proposed Residual
Constraint Approach (RCA). The RCA method earns its name through its rigorous framework:
It imposes strict constraints on value contributions from both measured and unmeasured
property features:
1.
All MARS-derived value contributions, including the residual, must sum to the net
sale price of each comparable property.
2.
The residual may be decomposed into sub-components with assigned values, provided
their total equals the comparables residual.
RCA minimizes bias in estimating subjective featuressuch as condition, quality, and aesthetic
appealby tying their total value to the residual, which is accurately quantified when a competent
valuation engineer constructs the MARS model.
While primarily designed for residential real estate, where subjective features significantly influence
market value, RCA holds potential for broader asset valuation applications. Accurate appraisals are
critical given the U.S. residential real estate markets estimated $45 trillion value. Fannie Mae notes
that typical appraisals deviate by about 5% from actual sale prices, though this figure masks bias, as
appraisers often face pressure to align with sale prices. Moreover, since 60%70% of appraisals support
refinancinglacking comparable sale pricesthe need for precision, as offered by RCA, becomes evident. The
Federal Reserves 2012 analysis estimated that 15%20% of pre-2008 crisis appraisals deviated by over
10$%
from sale prices, a range likely still relevant for refinancing appraisals today, varying by property type
and market. Why does accuracy matter? For low loan-to-value (LTV) transactionse.g., a buyer with
strong credit putting $500,000 down on a $1 million homelenders may waive appraisals, assuming
minimal foreclosure risk. However, buyers who skip appraisals risk overpaying, discovering defects only
after purchase. This underscores the value of robust valuation methods like RCA. Quantifying
appraisal accuracy remains challenging. Researchers lack direct metrics for market value
deviations, relying instead on indirect proxies. Traditional appraisal methods, rooted in
subjective judgment and variable appraiser competency, complicate analysis. Strategies
include:
1.
Tzioumis (2016) studied 4,782 appraisers across 259,436 appraisals in 11 states, finding
one-third within 1% of contract price, half between 1%3% above, and a minority averaging
5% abovesuggesting upward bias. The study identified two appraiser types: at market
value (close to contract price) and above market value (consistently higher).
2.
Krivorotov and LaCour-Little (2020) found 90% of purchase appraisals met or
exceeded contract prices, highlighting anchor biasthe tendency to align with a desired
valueespecially in refinancing contexts.
These studies reveal bias in appraisal practices, amplified by the limitations of traditional tools. If
appraisers were uniformly competent and unbiased, their valuations would converge; persistent
overvaluation by some groups signals systemic issues.
The bias studies cited above, and the systemic pattern they point to, reflect a deeper feature of
traditional sales-comparison practice that RCA is designed to address. The traditional methodology
operates on two channels with very different epistemic status. On the measurable-feature channel —
GLA, room counts, lot size, age, location — the appraiser is expected to derive adjustment values
objectively, through matched-pair analysis or local linear regression on comparable sales. Competent
appraisers do this work, and the resulting adjustments are at least roughly defensible against the data.
On the subjective-feature channel — condition, quality of construction, design, functional utility,
aesthetics — the available support procedures range from thin to threadbare. Subjective-feature
adjustments can be supported by matched-pair analysis of properties outside the immediate
comparable set, by appeal to recognized peer practice, or by reference to the appraiser’s
documented experience in the market. All three are formally compliant with professional
standards. None of them rises to the statistical rigor that the measurable channel achieves
routinely.
The result is a procedural asymmetry between what the standards require and what the
methodology can deliver. The standards require support for every adjustment, measurable
or subjective. The methodology delivers robust support on one channel and a spectrum
of weaker procedures on the other, ranging from an anecdotal matched pair of sample
size one to a citation of what other appraisers customarily use. The honest competent
appraiser doing careful matched-pair work on a view adjustment produces meaningfully better
support than one citing peer practice without basis, but a reviewer comparing the two
appraisals on the standard form cannot easily distinguish them, because the form records the
adjustment value, not the rigor of its derivation. The methodology tolerates the asymmetry
because no stronger procedure for subjective features has been generally available to working
appraisers.
RCA improves both channels, but in categorically different ways. On the measurable channel, the
MARS regression captures non-linear relationships, variable interactions, and basis-function structure
that matched-pair analysis and ordinary linear regression cannot represent. Fit against the full set of
qualified sales in the market area — typically several hundred — rather than against the three to nine
comparables of the traditional grid, the MARS-derived value contributions are both objective and
substantially more accurate than the methodology they replace. On the subjective channel, RCA does
something different in kind from any of the traditional support procedures. It does not attempt to
estimate adjustment values for individual subjective features at all. The contributions of all subjective
features collectively appear as the residual: the difference between each comparable’s actual net sale
price and its MARS-predicted price, calculable from data without appraiser input. The
decomposition of that residual into named components — Condition, Quality, Design,
Functional Utility, View, and so on — is constrained to sum to the residual total, and by
Proof II (Section 9) any decomposition satisfying that constraint yields the same value
conclusion. The judgment that remains in the workflow is the placement of the subject in
the ranked residual distribution of the comparables, made against an inspectable ranking
and explicitly documented in the report. A weak RCA appraisal is still possible — an
inexperienced VE can produce an overfit MARS model, place the subject implausibly, or miss
anomalies in the data — but it is weak in ways a reviewer can identify by inspecting the
model and the ranking, rather than weak in ways that a standard-form review cannot
surface.
Beyond refinancing, appraisers owe owners accurate valuations, especially for high-value purchases
where errors can cost tens or hundreds of thousands. Yet, many balk at spending $600$1,000 on
appraisals, doubting their added value. Advanced methods like RCA, while initially resource-intensive,
promise greater precision. With refinement and experience, their efficiencyand adoptionshould
increase.
2 Background
The authors educational background is in mathematics, computer science, and statistics. By
profession, he is a software engineer with secondary experience in appraisal and accounting. He has
been doing appraisals, off and on, since 2002 and has applied MARS since about 2004 for determining
property adjustments and price trends in residential real estate.
Appraising with MARS is deep experience in appraisal. MARS generates models that reflect in greater
detail than any other statistical method both the variables that contribute to market value and how
changes in those values impact sale price. Thus, in appraising with MARS, the appraiser is constantly
investigating causal relationships between variable values and sale prices as generated by MARS
models.
In the interest of transparency, the RCA methodology is best applied to residential areas with these
characteristics:
1.
Complex neighborhoods with properties that are quite different from one another
Older homes
Varied qualities of updating
Many different architectures and designs
Variation in property attributes like GLA, lot size, elevation, and view.
2.
Complex market conditions
Quickly changing market conditions
Many different types of buyers
3.
Good data sources: The RCA protocol is based on Data Mining and needs plentiful quality data.
While it can be used for areas with sparse data, it might not be worth the time and cost
required.
4.
Accurate estimates of market value are desired. RCA should, in most cases, be able to provide
1-2% accuracy. The valuation engineer report should comment on the conditions for accuracy in
the final value conclusion. If data or information is lacking, or if consistent patterns in the market
have not been uncovered through data mining, then the accuracy will suffer, and the client should
be informed of the situation. Specifically, A high R2 and CVR2, a meaningful MARS model, and
a strong pattern in property CQA (condition-quality-appeal) features in the ranked
residuals are the principal means for assessing accuracy and reliability in the value
conclusion.
The development of these RCA protocols over the past 20+ years is based on the availability of MLS
Listings Inc. (Sunnyvale, CA) and other MLS, which are currently accessible through MLS Listings,
which comprises over 95% of the authors data analysis experience. The authors expertise stems from
extensive appraisal work across 15 primary Northern California counties, with additional experience in
10 secondary counties.
The RCA methodology is most effective in markets that meet two key criteria:
1.
Rich availability of high-quality appraisal data
2.
Complex, heterogeneous residential areas characterized by:
Diverse architectural styles
Varying degrees of property updates
Multi-decade development patterns
California County Experience
Statements made in this paper are based on experience valuing properties in the following North
California counties:
SanMateo
Santa Clara
Marin
Napa
Monterey
Santa Rosa
Alameda
San Francisco
Contra Costa
Sacramento
El Dorado
San Joaquin
Stanislaus
Placer
Solano
Mendocino
Lake
Colusa
Merced
Nevada
San Benito
Santa Cruz
Sutter
Yuba
The above areas essentially take in the Silicon Valley and its surrounding counties, extending
outward into more rural areas, including the Central Valley and Sierra foothills. It should be
mentioned that the author has always had access to plentiful and relatively accurate data
compared to what many appraisers in other less developed areas in the US have at their
disposal.
The MLS data used is RESO Certified, complying with the RCF format. Despite RCF certification,
data of usually secondary importance is sometimes missing or incorrect for some properties, either
through original tax assessor or subsequent sales agent MLS input errors. Often these errors follow
a pattern, such as simply not making an input if the feature doesnt exist. For example,
instead of entering 0 for carport spaces, when there is no carport, nothing is entered. In
areas where carports are rare, the analyst might substitute 0 for no data, in order to get
a useful adjustment for carports. While R/earth and other MARS implementations can
handle missing data, it is up to the Valuation Engineer (VE) to decide on how to handle
it.
MLSListings data is also available from the MLSs of some other states, from Florida to Hawaii. This
can be useful for testing how well protocols and code work in different states.
3 Valuation Engineer
Definition 1: Valuation Engineer
To prevent ambiguity, The term Valuation Engineer (or "E"for short) is used to denote
an individual possessing the expertise of a licensed appraiser, coupled with advanced
knowledge, skills, and experience in utilizing data mining, statistics, programming,
detailed protocols, and artificial intelligence to ascertain the market value of complex
properties. It is also important to recognize that valuation is a task that necessitates
some form of licensing due to the established potential for fraud.
Lenders typically offer several valuation options:
Hybrid appraisals - utilizing unlicensed 6inspectors with licensed appraiser oversight
3.
Standard appraisals - performed by licensed appraisers
Absent from typical consideration for appraisal work are Engineer-level Appraisals. There simply
aren’t many appraisers around with the skill set to do these. There should be a need for them. They
can provide significantly higher accuracy and nearly guarantee fairly good objectivity with the
employment of RCA methodology (MARS and constraints).
For conforming properties in newer subdivisions, standard appraisals generally suffice. However,
engineer valuations are recommended for older homes in areas with significant renovation history
spanning 30+ years. The choice between these methods should follow a structured decision process
based on property characteristics and market conditions.
4 Assumptions
Assumption 1: Geographical Area &Property Type
The scope of this paper primarily focuses on single-family homes and condominiums in
California’s metropolitan and urban areas. While the Sales Comparison Approach applies
to other property types and geographical regions, income and cost approaches typically
play a more dominant role when valuing agricultural land, multi-family properties, and
commercial real estate. In such cases, the RCA approach may need modification or have
limited applicability.
Assumption 2: Data Storage
CSV and Excel are heavily used by MLS systems and assumed to be the principal way to
feed data to RCA programs. Although broker feeds can be used to feed data continually
into SQL databases. Backup and intermediate storage often uses SQL databases and two
that play big roles are the free and open source SQLite and PostgreSQL. Text or RTF
files are also used for report snippets and PNG, JPG, TIFF files are used for graphics.
Note that Microsoft’s SQL Server is also used by many, but the free edition "QL Server
Express"is very limited (e.g. only 1 CPU Core can be used) and other versions have
hefty license fees for commercial use. Real Estate databases in California can be very
large and joins can be very time consuming, so a 8+ core computer with 128G of RAM
can be useful for large joins. SQL databases are also important when you need to find
all of the properties in an arbitrarily bounded area. So, if for example, you used Cluster
Analysis to define neighborhoods, you would be inclined to use some SQL database.
However, different databases have different built-in functions for handling geographic
information and thus are not compatible with one another. For this set of articles on
RCA, PostgreSQL will be assumed for complex SQL tasks and SQLite for simple tasks.
Assumption 3: "/earth" Is Used For MARS
processing.
Until 2017, Salford Systems’ MARS software was an affordable solution at $270 per
year. However, after Minitab’s acquisition of Salford Systems, MARS was bundled
into their SPM package at $16,000 annuallya price point prohibitive for most small
businesses. Fortunately, the R programming language’s "arth" package (Milborrow
2024), maintained by Stephen Milborrow, has evolved to match or exceed the capabilities
of Minitab’s SPM MARS. The earth package offers additional advantages: it’s freely
available, integrates seamlessly with R’s comprehensive statistical ecosystem, and can
be accessed through Python using Pandas, providing excellent flexibility for different
workflows. Discussions related to MARS assume the specific use of R/earth.
Assumption 4: Notation
Raw data from external sources like CSV files and Excel spreadsheets is represented
using standard Latin characters. This includes Multiple Listing Service (MLS) property
data, MARS (Multivariate Adaptive Regression Splines) configuration settings using the
R/earth package, and project-specific data. This notation extends to their corresponding
data frame representations in R or Python.
When discussing property features and their various types and subsets, the notation
shifts to Greek letters for enhanced precision and analytical clarity. While this
mathematical notation may appear more rigorous, it serves a crucial purpose:
distinguishing abstract property characteristics from their raw data representations.
As a convention, Greek characters consistently denote property features, while
Latin characters indicate spreadsheet data structures. For instance, the symbol
represents the MLS data matrix, organized with properties as rows and their
corresponding features as columns.
Assumption 5: Programming Languages
In this field R and Python w/Pandas are the dominant languages, with C or C++ being
used when speed is important. R/earth is itself written in R and C.
Assumption 6: Indexing
Indexing can vary across programming languages and mathematics, which can get confusing:
Mathematics and R start indexing at 1.
C, C#, ,
and Python start at 0.
In this paper, to keep things consistent and simple, all indexing starts at 0 (like Python or C).
Here’s how it works:
For lists, matrices, or data frames:
Index 0 is reserved for the "ubject property" (the main property being
studied).
Example: In a table where rows are properties, row 0 is the subject property,
and column 0 holds unique IDs for all properties.
A practical tip: Arrange properties with the subject property first (row 0),
then list others from newest to oldest sale date, numbering them sequentially
(0, 1, 2, etc.).
When theres no subject property:
If a data structure doesnt include properties (e.g., just variables or functions),
index 0 is skipped, and indexing starts at 1.
Special note for R users: Since R uses 1-based indexing, when coding this
papers math in R, add 1 to all indexes to adjust from 0-based to 1-based.
5 RCA Workflow
Understanding the RCA workflow is essential for effective analysis. The process follows a structured
sequence of analytical tasks and decision points
5.1 Types of MARS Analysis?
The workflow for MARS analysis depends on whether only a market analysis is needed or whether the
valuation or appraisal of a given property is required. Note that a specific property appraisal typically
requires an analysis of its market area.
Market Analysis: focuses on broad areas such as counties, cities, or defined regions. This
approach does not involve a specific subject property and thus requires no value conclusion.
The workflow centers on model development, validation, and the generation of supporting
analytics. The process concludes once the optimal model is identified and documented.
Single Property Analysis encompasses all elements of market analysis plus the determination of a
specific propertys value. This differs fundamentally from traditional appraisal methods. In
traditional approaches, value is typically derived by adjusting 3-9 comparable sales and
computing a weighted average. In contrast, RCA determines value primarily through two
components:
1.
The MARS models predicted value
2.
The estimated property-specific residual (with the associated CQA score)
The Sales Grid in RCA serves primarily to explain value differences between the subject and
comparables by illustrating variable contributions. A key advantage of RCA is its stability: while
clients may debate comparable selection, the value conclusion remains consistent regardless of
which comparables are chosen for presentation, as all sales (whether 15 or 600) are adjusted using
the same model. While older sales are automatically adjusted to current market conditions, recent
sales are generally preferable. The models effectiveness in capturing price factors diminishes with
temporal distance, as unmeasured market characteristics become more significant over longer
periods.
5.2 Processing Workflow
The overall processing workflow can be described by these step:
1.
Scope of work.
2.
Setup initial project folders and files.
3.
Download MLS data. Refer to Protocol 1-3 on the number of property records needed and
setting limitations on date ranges.
Run MARS (R/earth) regression to create price model.
6.
Generate property value contributions &residuals (pure residual and residual/SF) from
the MARS model.
7.
Rank properties by residual/SF.
8.
Create CQA scores for each comparable sale.
9.
Assuming a quality ranking of properties has been produced which truly represents the
overall appeal of the properties, find the best position for the subject in the ranking and
assign it a CQA score, by studying the overall pattern trends in the ranking.
10.
Assign the corresponding residual/SF to the subject property as residual/SF
(subject) GLA.
11.
Add the MARS price estimate and assigned residual to arrive at subject value
conclusion.
12.
Break down the subject residual into descriptive components and value contributions.
13.
Calculate comparable model produced value contributions & adjustments, plus all
subtotals and adjusted sale prices.
14.
The adjusted sale prices will be equal to the subject final value conclusion.
15.
Select the 6-12 most similar properties for the Sales Grid.
16.
Break down the total residual amounts for the comparables into residual component value
contributions, setting the residual total value to 0 after the breakdown is completed.
17.
Recalculate all subtotals and sales grid comparable adjusted sale price. Check values.
18.
Upload Sales Grid to Report.
Note that with RCA, the final value conclusion does NOT directly derive from the comparable values
in the Sales Grid. With the RCA method, the Sales Grid becomes a tool only for explaining why
properties that appear similar to the subject, sell for more or less than the final value conclusion or
the subject properties appraised value. This must be viewed a different way: The subject value
conclusion depends on the MARS model price estimate, and the residual estimated by the
valuation engineer. But these two quantities are very dependent on all of the properties that
went into the R/earth regression, which created the ranking of property residuals that
determined the CQA score of the subject property thus determined its residual through
placement in the ranking by the valuation engineer. The adjustments in the Sales Grid are
really determined by the imposed mathematical constraints of RCA, although the residual
breakdown is somewhat at the mercy of the VE, the breakdown value contribution totals must
always add up to the associated residual. We know from Proof 2 that the while the actual
values if the residual components may vary depending on the judgment of the VE, their
totals for a given property are completely constrained and therefore will not change the
adjusted sale price for the associated property, regardless of how the VE alters the residual
breakdown.
5.3 Data Workflow
Note that in this paper, only a high-level workflow is presented for the RCA method. Other articles
will delve into the various stages and substages of RCA analysis.
5.3.1 Data Requirements for MARS
We will need to specify a beginning date for sales transactions and an ending date based on the
effective date of appraisal or valuation. This usually takes into consideration the estimated number of
regression variables and interactions that will be needed for each degree of interaction over
1.
5.3.2 Protocols: Closed Sales Listings
Protocol 1: Determining the Number of MLS Closed Sales Listings
Required
This protocol describes how to calculate the recommended minimum number of closed sales
listings to input into MARS regression. You will need to make a rough estimate, to begin with,
run MARS in first, second, and possibly third-degree runs to generate the initial model,
and then review the number of rows according to the following rules to determine if
more rows are needed. This protocol assumes the independent variables are 100%
independent, which in real estate is rarely the case; for example, GLA and room counts are
usually highly correlated, and GLA is also likely to show correlation with lot size.
This gets into the calculation of Degrees of Freedom (DOF). Each truly independent
variable gets one degree of freedom. But, when there are dependencies between the
so-called independent variables, an accurate calculation of the DOF requires some
analysis that is beyond the scope of this paper. Just keep in mind that due to likely
co-linearity between variables, the estimates below are probably on the high side.
1.
For first-order terms (no interactions), the
rows per variable rule works well
2.
For each allowed two-way interaction, you should add approximately 15-30
properties
3.
For higher-order interactions (3+ variables), each should have 30+ properties.
However, this isn’t a hard rule. The actual number needed can depend on:
The signal-to-noise ratio in your data
The complexity of relationships you’re trying to model
The variability in your response variable
The quality of your data
Example:
So if you have:
10 main variables
5 allowed two-way interactions
2 allowed three-way interactions
You might calculate your minimum number of sales as:
Alternatively, you may do the following: After running R/earth on a set of data, you can
use the number of basis functions in the model, plus one as an estimate for the degrees of
freedom and then take this number times 20-30 to get the number of input records needed.
Protocol 2: Sales Periods Should Encompass Minimal Change in Buyer
Tastes
When collecting historical property data, there is no strict time limitation if changes in market
preferences can be captured through appropriate variables. Consider these key points for
handling architectural evolution in your market: Changes in dominant architectural styles
(such as a shift from California Ranch to Mediterranean) can be modeled using:
Subdivision identifiers
MLS area codes
Construction date ranges
However, if these style changes cannot be adequately captured through specific variables, you
face an important constraint: while Date of Sale is typically used to adjust for market
conditions, using it simultaneously to control for architectural preferences would
confound these two distinct effects. In such cases, the preferred approach is to either:
1.
Limit your data collection to the period after the introduction of the newer
architectural style
2.
Restrict your comparable selection to subdivisions sharing the same architectural
style
This ensures your analysis maintains consistency in property characteristics while still allowing
proper adjustment for market conditions over time. Alternatively, at least in this case,
assuming you can extract the architectural style from the data, you could consider the
interaction between these two variables. However, such extraction is not always possible.
Protocol 3: If In An Area Of Few Sales Then Do The Best You
Can
When working with MARS modeling for sales data, you need approximately 15 transactions for
meaningful analysis. If you have fewer sales (for example, 5 properties), you can duplicate each
record twice to reach the 15-property threshold. Important considerations for this approach:
1.
Use only first-degree regression to reduce over-fitting
2.
Ignore or skip cross-validation since the data set is too limited
3.
While MARS will likely outperform matched pairs or standard linear regression
methods in this scenario, be aware that duplicating data affects the validity of
uncertainty measurements and statistical significance
The best long-term solution is to expand your data-set with additional actual sales. However, if
you must proceed with limited data, data duplication provides a workable temporary solution
while acknowledging its statistical limitations.
5.4 Project Folder Structure
It would be a good idea to discuss project folder structure. It is a good idea to have a separate folder
for Market Analyses and Single Property Appraisals. Both of these folders have a similar subfolder
structure:
RCA_Projects/
__
MarketAnalysis/
__
Burlingame_20240501/
__
Pacifica_20240310/
__
RedwoodCity_20240615/
__
...
__
SingleFamily/
__
Pacific_Rosewood_123_20240510/
__
Code/
__
R/
__
...
__
Python/
__
...
__
Data/
__
MLS_Original.xlsx
__
MLS/
__
VER202402510153248/
__
MLS.xlsx
__
MLS_Stage1.xlsx
__
MLS_Stage2.xlsx
__
MLS_Stage3.xlsx
__
MLS_Stage4.xlsx
__
Documents/
__
LinearModels.pdf
__
Cum_Dist.pdf
__
Model.txt
__
PartContrib.pdf
__
Pairs.pdf
__
Residual1.png
__
Residual2.png
__
ResidualSF.png
__
Rpart.pdf
__
VariablesImp.pdf
__
VariableVsSP.pdf
__
...
Stage 1 processing will create the version folder with its sub-folders. It will copy the
MLS_Original.xls, even though this spreadsheet should never change, but reflect exactly what
was downloaded from the MLS as of a certain date and time; yet there is the possibility
on a lengthier project that a new MLS download will be needed. Next, this stage will
make a copy of the expanded MLS.xlsx workbook with its configuration parameters, into
the version folder. There can be many version folders each created for a complete run of
MARS.
Then, the following stages will generate column additions,sorting and other changes to the workbook
and write an updated version as MLS_Stage(N) to the project data folder when the appraisal is
completed. To reiterate, the root MLS.xlsx folder is undergoing constant change, and so the version
copy is an ongoing mutation of the original MLS workbook, with variable deletions and additions, plus
other changes. The processing R or Python code most likely will run with input from the Data
MLS.xlsx workbook. Thus we always have a copy of all the data behind each version of the RCA
run.
As many versions are created as needed until it is not possible to make improvements, at which point
the best version is selected as the final version. One could change the folder name by adding "Final"to the end of it, so it is apparent when looking through the directory. Some older versions with
interesting or competing models, should usually be retained. Each version folder contains a number of
text, diagram and graph files that can be embedded into a report. The final MLS_Stage4.xlsx
workbook contains as many sheets as necessary for the report. In particular it may have a
spreadsheet for transfer to a Fannie Mae Sales Grid. It will also necessarily contain the
computations or breakdown behind aggregations. This is potentially the most time-consuming
part of the RCA approach - the struggle to find the perfect model in a difficult market
area.
Everything needed to recreate the sales grids needs to be copied to the version folder, including the R
or Python code used - if the code undergoes frequent changes.
5.5 The Core Excel Workbook
Since most MLS services download property data to Excel CVS worksheets, Excel is probably the
most convenient way to store data and various configuration parameters. There is one core Excel
Workbook for each project called MLS.xlsx, which has one sheet called "LSData" to store the closed
property sales listings that were downloaded from the MLS, as well as several configuration and
parameter sheets:
Workbook: MLS.xlsx/
__
Sheet: Project Data
__
Sheet: MLSData
__
Sheet: Regression Variables
__
Sheet: Calculations
__
Sheet: Allowed Interactions
__
Sheet: Aggregations
__
Sheet: (MARS run parameters)
The original data is stored in a workbook named something like MLS_Original.xlsx. A copy
should be made with a different name, let’s say "LS.xlsx,"which will load into the R or
Python RCA program. Both of these workbooks are stored under the Data folder. Now,
the Data/MLS.xlsx workbook will probably go through many iterations of runs, trying
different combinations of variables, MARS parameters and changes to other settings. Each
iteration creates a new input/output folder named with the date and time, with the format of
VER_YYYYMMDDHHMMSS. Each folder will have a copy of the input MLS.xlxs workbook and
contain the generated Excel Workbooks for each of the four stages of processing, as well
as the output text documents, diagrams and graphs. Note that Stage 4, the final stage,
may contain a number of other sheets for the report, depending on the type of client and
SOW.
5.6 Backups
When writing new or modified Excel spreadsheets to the project version folders, backups can also be
made to SQLite or PostgreSQL databases in the project folder. The database files are less likely to
become unintentionally corrupted over in subsequent periods, through review or reference
use.
6 Workbook Notation
6.1 Descriptive Dimensions
There are roughly six descriptive dimensions for a workbook ""
1.
Appraisals ""
2.
Run Versions ""
3.
Workflow Stages ""
4.
Sheets ""
5.
Property Listing ""
6.
Variable Set ""/dd>
So, an individual cell in a particular workbook M can be described as:
(1)
(Note: ""means some valid index)
or:
(2)
where:
= # of rows (depends on sheet)
= # of columns (depends on sheet)
= # of run versions
= # of workflow stages
= # of analysis projects
= # of workbook sheets
Understand that ""references the entire Excel Workbook of five core data sheets that each have
their own table, which on running the associated R or Python program are loaded into separate
internal data frames. Both R and Python have internal functions that can take a name and
find which row or column of the data frame belongs to. For example, if in the subscript
,
("LA" "otSize" "athRms" "edRms" …) and j=0, then
Also note that the MARS run time parameters sheet is not a dimension of W, as W relates to the
version of data being input into MARS, separate from the parameters that determine how
MARS/earth handles that data.
For the sake of clarity, a complex multidimensional notation for the Workbook is presented which
relates more directly to the internal R or Python data frames that the data workbook
sheets are loaded into. It is "eavy"notation. But when we get into proofs and so on, we
throw a lot of the baggage away by making assumptions about what we are working with.
Nonetheless, this notation helps get a better overview of the workflow and data from the
outset.
Let’s start with notation for the set of workbooks we will be processing.
6.1.1 The Project Dimension (""
Once a Scope of Work (SOW) has been approved, a project folder needs to be setup to
store input and output data, including configuration parameters, output graphs, diagrams,
spreadsheets and report snippets. The VE should identify all projects with a simple sequential ID,
such as an integer, so that if a report should go missing, it would be obvious. If projects
are canceled, then there should be a consistent means to indicate the project has been
cancelled within the given project folder, but it should never be deleted. We assume project
IDs start at 0 and incremented by 1, although there are other possibilities. We further
assume all valid project IDs are stored in an ordered set ""created with the processing
program.
The appraisal project data is in the project worksheet of the workbook, but could also be stored in
some central SQL database,
Here are the following suggested project fields:
1.
Local ID: A short integer ID that is sequentially generated for the company. This would
make a great primary key for storage - and could be automatically generated by the database
with each successive project.
2.
Global GUID: If you want, you can create a GUID (Globally Universal ID) using
Second in the workflow, we execute MARS regression on the input. This will likely involve several to
many executions as we alter various parameters to improve the model. Each execution will create a
new version folder for the output, with a name based on the date and time of execution. For example,
r =
("er20241210_083124"
"er20241209_151130"
"er20241208_160944"
Promising versions with competitive models and output, should be kept until a final version
is decided upon, or perhaps longer if they have some utility. You may want to modify
the version folder to indicate it is the final version, by placing "inal"at the end of the
name.
6.1.3 Workflow Stages (""
The RCA Workflow is separated into usually five separate stages, as listed below. Each stage must
successfully complete before starting the next stage. The VE should be able to stop the
processing at the end of a stage to review the results before proceeding. This is needed
because market areas are fairly complex and often quite unique. Unusual problems can be
encountered that take a new approach. The VE must be able to jump into R or Python code to
make necessary changes and even try new approaches. And, in such cases, he may want to
experiment and try different approaches requiring multiple runs of the same stage he is
working on. Although, the VE should also be free to run all stages with one command, - with
some caveats (e.g. 3rd Degree Aggregation must have already been determined through a
previous run Stage 1). This means that when each stage completes, it should store all
of its internal data in files or databases, so that the next stage can be started as if the
previous stage were still in memory, by reloading the data from the previous stage into
memory.
Here are the stages:
w =
("etup"
-Setup
"tage1"
-MARS processing
"tage2"
-Final Sales Grid
"tage3"
-Report snippets
"tage4"
-Review/Cleanup
6.1.4 Excel Workbook Sheets (""
In the RCA Core Excel Workbook, we have the following sheets that are each mapped into a internal
data frame:
s =
("LSData"
"egression Variables"
"alculations"
"nteractions"
"ggregations 1st &3rd Degree",
"ggregations 2nd Degree"
The reason that 2nd degree aggregations are put on a separate sheet is that they can be most easily
defined with a symmetric table that shows all possible pairs of interactions, allowing the specification
of an aggregation variable for each pair’s associated value contributions and adjustments to be
aggregated (added) to. We couldn’t do this with 3rd degree aggregations because that would require a
3 dimensional symmetric table which is not possible in Excel. Further more 20x20x20 cells would 8,000
cells to deal with. It is easier to just list any third degree term names in the table with the 1st degree
aggregations, for example:
Variable
Aggregate To
AboveGradeFinArea
GLA
BelowGradeFinArea
BGLA
UnfinArea
UBA
LotSize
LotSize
...
...
GLA__LotSize
GLA
GLA__DateOfSale
GLA
BathRms__DateOfSale
GLA
Lat__Long
Location
Lat__Long__GLA
Location
...
...
6.1.5 Property Listings ("" - MLSData rows
Property listings form the rows of the MLSData sheet. They are not found in the other sheets, until
Stage 4, where the final report snippets are created with the final selection of comparable properties.
Here is a sample list of property IDs:
p =
("LS823211" "LS823456" "LS823542" "LS823721" ".."
6.1.6 MLS Variables ("" MLSData Columns
More will be said about handling variables later. Suffice it to state at this point that MLSData sheet
typically has 30-40 variables of various types, but usually only 10-20 will be selected as independent
variables for the MARS regression. These variable also appear in other sheets, although typically as a
list of all possible variables that can be used for submission to MARS. An example of a list of
variables is:
v =
("LA""alePrice""edRms""athRms""arageSF"".."
6.1.7 Final Notation Example
Lets pull the above notation into a final example. Using the above lists, let
(3)
Note that we are using 0 indexing here, so that the first elements of lists is assumed to have an index
of 0. Python also uses 0 indexing, by R uses 1 indexing.
So, the above means that we are talking about the cell value for: