The Case of Credit Scoring (2024)

[Note: This article is part of a series on AI Ethics and Regulation. The previous installment can be found here.]

Banks and other lenders first turned to algorithms in the 1960s for the same reason that many organizations seek automation: Fueled by population and business growth, the sheer volume of decisions that had to be made was becoming so large that manual processing was no longer sustainable.1 But they soon found out that algorithms also had a dramatic impact on prediction accuracy. Here’s how Thomas et al. tell the story in their book Credit Scoring and Its Applications:


The arrival of credit cards in the late 1960s made the banks and other credit card issuers realize the usefulness of credit scoring. The number of people applying for credit cards each day made it impossible in both economic and manpower terms to do anything but automate the lending decision. The growth in computing power made this possible. These organizations found credit scoring to be a much better predictor than any judgmental scheme, and default rates dropped by 50% or more—see Myers and Forgy (1963) for an early report on such success, and see Churchill et al. (1977) for one from a decade later. The only opposition came from those like Capon (1982), who argued that “the brute force empiricism of credit scoring offends against the traditions of our society.” He believed that there should be more dependence on credit history and it should be possible to explain why certain characteristics are needed in a scoring system and others are not. (p. 3, my emphasis)

This is mostly on point, but the depiction of Capon’s opposition is glib. Capon’s 1982 article, Credit Scoring Systems: A Critical Analysis, which was published in a business journal and is virtually unknown in the computer science community, is one of the first publications to raise concerns about algorithmic fairness and makes for fascinating reading that is directly relevant to recent debates. Even though its focus was on credit decisions, it brought up essentially the same set of concerns that have been circulating over the last 10 years or so in the AI literature: the fact that ground-truth data is often not quite randomly sampled but instead suffers from various forms of survivorship bias; the issue of facially neutral characteristics that in reality are proxies for protected attributes; limited sample sizes; multicollinearity issues; and so on.

But Capon’s 1982 salvo suffers from the same fundamental problem afflicting the more recent incarnations of these criticisms: It brushes aside the key fact that the alternative, human decision making under uncertainty, is incontrovertibly worse. He writes: “While judgmental systems are based, however imperfectly, upon a credit evaluator’s explanatory model of credit performance, credit scoring systems are concerned solely with statistical predictability” (p. 85).2 But “however imperfectly” does not begin to describe it. Capon suggests that humans have access to “explanatory models” of credit performance, without bothering to specify what these models are or how they are used. Are they bodies of rules specifying what constitutes good or bad credit performance? If so, then we have a de facto algorithmic system codified in rules, and the role of the human decision maker would be reduced to that of a computist following those rules. In reality, these “explanatory models” are private bodies of inchoate intuitions that are largely subjective, inconsistently applied, and exceedingly susceptible to an endless spectrum of cognitive and other biases, as discussed in the preceding article. In Capon’s defense, in 1982 the breadth and depth of these problems with human decision making were not as widely appreciated and as well understood as they are today.3

It is also emphatically not the case that credit scoring systems are “concerned solely with statistical predictability.” If credit scoring systems were only concerned with predictability, then the multicollinearity issues that Capon mentions would hardly be issues, as it is well known that multicollinearity does not pose problems for predictive purposes, particularly in the presence of sufficient amounts of data. Interpretative concerns have been central in credit scoring from the beginning, and certainly after the 1974 passage of ECOA (Equal Credit Opportunity Act), which requires that those affected by adverse credit decisions be given reasons for those decisions.4 This is largely why most statistical approaches to assessing credit risk have been based on “scorecards”5 produced by logistic regression models, rather than, say, SVMs or deep learning systems, even though the latter would deliver considerably better predictive performance—because logistic regression models (and the resulting scorecards) are much easier to understand and interpret.6

Because they impinge on interpretability, multicollinearity issues are worth discussing in slightly more detail. These are issues concerning the degree to which input features correlate with one another, as observed in a correlation matrix, for example. When two or more features correlate strongly,7 it is often advisable to avoid using all of them in a regression model. Such variables exhibit multicollinearity, meaning that one can be more or less linearly predicted from the others (“more or less” depending on the strength of the correlation). This is not a serious problem for predictive purposes, as multicollinearity has little impact on model accuracy. However, multicollinearity can complicate the use of a regression model for explanatory purposes, particularly for studying the effect of individual features on the target variable. In regression analysis, it is customary to analyze the effect of a feature on the target by looking at the regression coefficient of that feature; that coefficient represents the change in the target variable that is brought about by a unit change in the feature value while other features stay constant. This ceteris-paribus interpretation breaks down when two or more features are strongly correlated. In addition, multicollinearity can compromise estimates of the statistical significance of the regression coefficients. These issues are more pressing when regression is used in the social sciences or in the context of public policy, where the purpose of a model is not just to predict but also to explain. And this was part of Capon’s message: that these credit-scoring models were overly concerned with prediction at the expense of explanation. However, as pointed out, this is simply not true, and has not been true since 1974, on pain of violating federal law.
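
To make this concrete, here is a minimal sketch of the standard diagnostics, a correlation matrix plus variance inflation factors (VIFs), on synthetic data. The feature names and numbers are invented for illustration, not drawn from any actual scoring system:

```python
# Synthetic illustration: spotting multicollinearity with a correlation
# matrix and variance inflation factors (VIFs). Names/numbers invented.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
n = 5000
income = rng.normal(60, 15, n)
credit_limit = 0.8 * income + rng.normal(0, 2, n)   # nearly linear in income
utilization = rng.uniform(0, 1, n)                  # roughly independent

X = pd.DataFrame({"income": income,
                  "credit_limit": credit_limit,
                  "utilization": utilization})
print(X.corr().round(2))                            # the correlation matrix

# Rule of thumb: VIF above roughly 5-10 flags problematic collinearity.
Xc = sm.add_constant(X)
for i, col in enumerate(Xc.columns):
    if col != "const":
        print(col, round(variance_inflation_factor(Xc.values, i), 1))
```

In this toy setup, credit_limit is nearly a linear function of income, so both receive very large VIFs, while utilization does not; predictions are unaffected, but the individual coefficients of the collinear pair become unreliable as explanations.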

Machine learning practitioners are aware of dimensionality-reduction techniques such as PCA (Principal Component Analysis), which can transform a set of correlated variables into a smaller set of uncorrelated variables, called “principal components” in the case of PCA. These components are linear combinations of the original variables, and are extracted in such a way that the first principal component accounts for the largest possible variance in the data, the second principal component accounts for the second-largest variance, and so on, subject to the constraint that each component is orthogonal to (uncorrelated with) the others. When carefully implemented, such techniques can mitigate multicollinearity, but interpretability can be compromised because we now have two sets of features, the original and the transformed. Even though we can do prediction with the latter, we still need to understand the model’s behavior in terms of the former, which requires somehow mapping principal components back to the original features, and this is not an easy exercise.
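
A brief sketch shows both the mechanics and the interpretability cost, again on invented data:

```python
# PCA on the same kind of synthetic data: components are uncorrelated
# and ordered by explained variance, but each one mixes all features.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 5000
income = rng.normal(60, 15, n)
credit_limit = 0.8 * income + rng.normal(0, 2, n)
utilization = rng.uniform(0, 1, n)
X = pd.DataFrame({"income": income,
                  "credit_limit": credit_limit,
                  "utilization": utilization})

Z = StandardScaler().fit_transform(X)          # PCA is sensitive to scale
pca = PCA().fit(Z)
print(pca.explained_variance_ratio_.round(3))  # decreasing by construction

# Loadings: each principal component expressed in the original features.
print(pd.DataFrame(pca.components_, columns=X.columns).round(2))
```

The loadings matrix is where interpretability goes to struggle: every component is a blend of all the original features, so “a unit change in component 2” has no direct business meaning.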

However, there are other related techniques that are more effective and widely available in streamlined implementations. For instance, variable clustering algorithms identify groups (clusters) of correlated variables, while also outputting the amount of information in a given cluster that is explained by each variable in that cluster, as well as the distance of each variable from other clusters. One can then make rational variable choices by selecting only one or two variables per cluster to include in the model to be developed, based on the output of the clustering algorithm, e.g., by choosing from each cluster the variable that explains the greatest amount of information in that cluster.8 Such algorithms have off-the-shelf implementations these days, as in the VARCLUS procedure in SAS software. Moreover, these models are developed under the supervision of human experts who have a deep understanding of the credit domain at hand and can reliably guide variable selection.
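
SAS’s VARCLUS is proprietary, but the gist of variable clustering can be approximated with hierarchical clustering on a correlation-based distance. The following is a rough stand-in, not VARCLUS itself, with a deliberately simple pick-one-representative rule and invented data:

```python
# A rough Python stand-in for VARCLUS-style variable clustering (not
# VARCLUS itself): cluster features by correlation distance, then keep
# one representative per cluster. Data and names are invented.
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
n = 5000
income = rng.normal(60, 15, n)
credit_limit = 0.8 * income + rng.normal(0, 2, n)   # clusters with income
utilization = rng.uniform(0, 1, n)
inquiries = rng.poisson(2, n).astype(float)
X = pd.DataFrame({"income": income, "credit_limit": credit_limit,
                  "utilization": utilization, "inquiries": inquiries})

corr = X.corr().abs()
dist = 1 - corr                                     # correlation distance
links = linkage(squareform(dist.values, checks=False), method="average")
labels = fcluster(links, t=0.5, criterion="distance")

for c in sorted(set(labels)):
    members = [f for f, l in zip(X.columns, labels) if l == c]
    # keep the variable with the highest mean |correlation| to its cluster
    rep = max(members, key=lambda m: corr.loc[m, members].mean())
    print(f"cluster {c}: {members} -> keep {rep}")
```

Here income and credit_limit land in one cluster and only one of them is retained, which is exactly the kind of rational variable choice described above.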

Capon acknowledges in passing (footnote 8, p. 88) that PCA-like techniques can address multicollinearity,9 but demurs that this would complicate reason-giving without delving into the rich literature on the topic or acknowledging the fact that, in practice, credit scoring systems have been quite successful in identifying reasons for adverse decisions, and do so much more rationally and consistently than any human could hope to do.10 It should also be kept in mind that, unlike contemporary regulatory demands,11 ECOA’s Regulation B does not stipulate that applicants affected by adverse decisions be given explanations, a concept that is hopelessly labyrinthine, but only reasons. In fact, a creditor “must disclose the principal reasons for denying an application or taking other adverse action” (my emphasis) but is not required to disclose all such reasons, as “disclosure of more than four reasons is not likely to be helpful to the applicant” (see the official interpretation of paragraph 9(b)(2) of the relevant regulation). In addition, “a creditor need not describe how or why a factor adversely affected an applicant. For example, the notice may say ‘length of residence’ rather than ‘too short a period of residence’.” These are modest requirements. It is not necessary to decipher complex relationships among independent variables or even to have definitive knowledge of the relationship between an independent variable and credit risk. Compliance is not trivial, but it is feasible, potentially even for models that are much more opaque than regression or decision trees.
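
For a sense of how such reasons can be generated mechanically, here is a toy version of the familiar “points lost” heuristic for a linear scoring model. Every name and number below is hypothetical; this sketches the general idea, not any lender’s actual method:

```python
# A toy "points lost" computation of principal reasons for an adverse
# decision. Weights, feature names, and values are all hypothetical.
import numpy as np

features  = ["age_of_file", "utilization", "recent_inquiries", "on_time_ratio"]
weights   = np.array([0.9, -1.4, -0.6, 1.8])   # fitted linear-model weights
best      = np.array([1.0,  0.0,  0.0, 1.0])   # value maximizing each contribution
applicant = np.array([0.2,  0.8,  0.5, 0.7])   # this applicant (normalized to [0, 1])

shortfall = weights * (best - applicant)       # points lost per feature
order = np.argsort(shortfall)[::-1]
reasons = [features[i] for i in order[:4] if shortfall[i] > 0]
print("Principal reasons:", reasons)
# -> ['utilization', 'age_of_file', 'on_time_ratio', 'recent_inquiries']
```

The applicant’s largest shortfalls become the principal reasons (here, high utilization and a short credit file), capped at four, in the spirit of Regulation B’s official interpretation.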

Capon’s paper does make some very good points. One is a prescient warning about “brute force empiricism” that has largely gone unheeded in the era of big data, as manifested by frequent failures to make proper distinctions between information and knowledge, and as captured by bold proclamations that “the data deluge makes the scientific method obsolete” and spells “the end of theory.” Such unthinking glorification of big data should bring to mind Popper’s parable from his Conjectures and Refutations:

But in fact the belief that we can start with pure observations alone, without anything in the nature of a theory, is absurd; as may be illustrated by the story of the man who dedicated his life to natural science, wrote down everything he could observe, and bequeathed his priceless collection of observations to the Royal Society to be used as inductive evidence. This story should show us that though beetles may profitably be collected, observations may not. (p. 46)

Chomsky and many others have made similar points about how data is useless without a prior standpoint informed by theory. The radically empiricist notion that data by itself, divorced from theory, is intrinsically valuable for scientific purposes stems from an outdated Baconian view of science and induction and is severely misguided, though it seems to have been popularized in recent times. Capon was making the point about “brute force empiricism” to protest the fact that some early statistical models were using input features that seem unrelated to creditworthiness, such as the age of one’s car or membership in a trade union,12 just because they appeared to have some predictive value, even though their correlation with creditworthiness was almost certainly spurious.

That is admittedly a good point, but there are a number of observations to be made. First, the chief purpose of these statistical models is not to do science. They are not in the business of explaining creditworthiness as a social phenomenon, or conducting academic research in psychology to determine the true underlying causes of what might make one likely to repay or to default. The chief—though not sole—purpose is to accurately predict the applicant’s likelihood to repay, so as to decide whether or not to grant them credit, and if so, on what terms; and to do so as fairly as possible while still making a profit and adhering to the law.

It is well understood, of course, that statistical models can—and often do—latch on to spurious correlations in their input data. This might just be the most fundamental problem facing machine learning. It’s very hard to separate the good (causal) correlations from the bad ones. We are often trying to predict something without being aware of any robust empirical regularities connecting it to the input features, which can indeed produce absurd results. On the bright side, spurious correlations will almost always manifest themselves in the model’s behavior, either via “absurd results” or, more mundanely, by rendering the model brittle, so that it fails to generalize properly to cases that don’t share the peculiarities of its training data. Such a model will also be more vulnerable to data drift, whereby external environmental changes affect the target variable, the input features, or their relationship. These issues can be detected and addressed by proper testing and validation methodologies, performed at a regular cadence as a guard against drift. And these are, in fact, best practices recommended by international banking regulation agreements such as the Basel accords, which have been adopted and legally mandated by many countries. In the U.S., for example, the FDIC provides a number of guidelines for rigorous testing of credit models, including ongoing monitoring, regular validation from independent reviewers, periodic updates and recalibrations, benchmarking and back-testing, and so on; FDIC’s examiners regularly conduct reviews to verify that lenders’ modeling practices comply with such requirements.13 In addition, there are several techniques available, particularly for the sort of simpler and more transparent models that tend to be used for credit scoring, that will reliably weed out features that do not carry substantial information about the target variable. In a flippant footnote, Capon writes that “a logical extension of Mr. Fair’s position would allow the inclusion of such characteristics as color of hair (if any), left or right-handedness, …, if it could be shown that they were statistically related to payment performance.” But no serious practitioner would haphazardly include such noise variables, particularly for the sort of models that are typically used for credit scoring, like logistic regression. If they did, and if by fluke that happened to give some promising results at first, almost certainly via overfitting, proper testing would soon reveal performance declines rather than improvements. And that would only be possible because there are hard quantitative constraints on model performance that can be systematically enforced owing to reproducibility—all courtesy of algorithms. None of it is remotely possible with human underwriting.
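
To illustrate why such noise variables don’t survive honest validation, consider a small synthetic experiment (invented data; five-fold cross-validation of a logistic regression):

```python
# Adding pure-noise features does not improve honest out-of-sample
# performance, which is what proper validation measures. Data invented.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n = 2000
signal = rng.normal(size=(n, 2))                 # genuinely informative features
y = (signal @ np.array([1.0, -1.0]) + rng.normal(size=n) > 0).astype(int)
noise = rng.normal(size=(n, 30))                 # "hair color"-grade noise features

model = LogisticRegression(max_iter=1000)
auc_base = cross_val_score(model, signal, y, scoring="roc_auc", cv=5).mean()
auc_noisy = cross_val_score(model, np.hstack([signal, noise]), y,
                            scoring="roc_auc", cv=5).mean()
print(f"AUC, real features only:     {auc_base:.3f}")
print(f"AUC, with 30 noise features: {auc_noisy:.3f}")   # no gain, typically a loss
```

In-sample fit improves mechanically with every added column, but the cross-validated AUC does not; that gap is exactly what ongoing monitoring and independent validation are designed to catch.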

While we’re on the subject of spurious correlations, to really put things in perspective, let’s look at the information that was routinely used by human credit analysts in the pre-algorithmic era, as detailed in Lauer’s book, Creditworthy:

To judge character, creditors thus looked for clues in an individual’s outward behaviors and relationships: physical appearance and personality; marital stability (or strife); the condition of one’s home; drinking habits; predilections for gambling or philandering; and one’s reputation among neighbors, employers, and business associates. Statements regarding character were integral to credit reports well into the 1960s. They appeared in open-ended remarks sections and gridlike rubrics that required bureau investigators to indicate whether the subject was “well regarded as to character, habits, and morals.”

While information about one’s character circulated in credit reports, the bureau’s subscribers also performed their own character inspections. Whenever a customer applied for credit, whether at a department store or a bank, a credit manager typically met with the individual and coaxed out as much information as possible about the applicant’s finances, domestic arrangements, and personal life. During these probing face-to-face interviews, which many consumers came to dread, credit managers also scrutinized the disposition and manner of applicants and made their own judgments about their honesty. Even as credit management shifted toward systematic record-keeping and quantification, the personal interview remained standard practice. (pp. 20-21)

Credit analysts sought information revolving around the so-called “3 Cs”: character, capital, and capacity (to earn a living), with “character” being the principal trait. By the early 1950s, credit bureaus had amassed so much personal information in their files that in 1953 Life magazine joked that they would make “the head of the Soviet secret police gnash his teeth in envy” (Creditworthy, p. 180).14 Atlanta-based Retail Credit Company, founded in 1899 and now known as Equifax,15 would routinely compile “investigative reports” concerned with “the subject’s character, reputation and mode of living,” which could “contain information on any aspect of one’s personal life, ranging from housekeeping proficiency and yard care, to associates’ reputation, to drinking and sexual habits” (see this 1979 article in the Columbia Law Review, which does a great job explaining the rationale behind the passage of FCRA). There appeared to be “no limit to the detail sought by these reports,” with tidbits such as:

  • “Lives common law.”

  • “Lives with Mr. (different name) but sources do not know the relationship.”

  • “Subject living with woman without benefit of marriage.”

  • “He is divorced because of his association with other women.”

  • “He lives with another man and sources suspect them of living in an immoral relationship.”

Commonly asked questions included the following (listed verbatim, using Retail Credit Company’s exact phrasing):

  • Current marital status

  • If divorced–when, why, whose fault?

  • If separated–how long, cause, divorce planned?

  • Past and Present moral reputation

  • If promiscuous–extent, class of partners?

  • If particular affinity–how long, criticized, partner beneficiary?

  • If living with partner–how long, children, stable home, criticized, is there living undivorced spouse?

  • If illegitimate child–how old, circumstances, favorable reputation regained, living and working conditions?

  • Possible homosexuality: How determined–living together, demonstrates affection for partner in public, dress and/or manner, criticized, associated with opposite sex?

These are some of the “input features” that were taken into account by human credit managers. Left or right-handedness would seem just as relevant by comparison, and a lot less invasive and sinister to boot.

Eventually, the government realized that algorithmic decision making offered compelling advantages for credit underwriting. Here is how the matter is put in a report prepared for Congress in 2007 by the Federal Reserve Board, on the subject of “Credit Scoring and Its Effects on the Availability and Affordability of Credit”:

One feature of credit scoring generally not shared by judgmental underwriting is its objectivity and consistency; judgmental systems are by their nature subjective and may not produce consistent decisions between applicants with substantially similar credit histories. Credit scoring applies an algorithm to standardized credit information, so a given set of such information produces a given credit score no matter when it is prepared or for which borrower it is prepared. In judgmental underwriting, on the other hand, multiple analysts evaluate credit history in different ways, often emphasizing different factors; thus, the same inputs do not always lead to the same interpretation. For a given level of accuracy, improved consistency can lower costs by reducing costly management oversight that is necessary to ensure that different loan underwriters are applying a firm’s lending rules in a manner consistent with company policy and applicable legal requirements. In competitive markets, such cost savings would be expected to be passed on to consumers in the form of reduced loan interest rates or fees.

Moreover:

Adoption of a mechanical, consistent system for credit evaluation reduces the opportunities for engaging in illegal discriminatory behavior. In contrast, judgmental, subjective decisionmaking offers opportunities for discriminatory behavior, whether such behavior is intentional or not. For example, in a judgmental system, a credit rater may assign different credit ratings to two borrowers who pose identical credit risks if one is, say, a friend or member of the rater’s social club, or a credit rater may assign different evaluations to prospective borrowers with identical credit histories on the basis of impermissible extraneous data such as the borrower’s ethnicity, religion, national origin, or sex. Such actions are illegal, but in a judgmental underwriting system they are easier to disguise if deliberate, and they slip through more easily if unconscious.

As a 2020 article in the New York Times put it:16 “Intentionally or unintentionally, [human loan officers] discriminate. When the National Community Reinvestment Coalition sent Black and white ‘mystery shoppers’ to apply for Paycheck Protection Program funds at 17 different banks, including community lenders, Black shoppers with better financial profiles frequently received worse treatment.” By contrast, as the above report by the Federal Reserve Board goes on to point out:

A rule-based system, if applied consistently, works to deter discrimination unless the rules themselves are discriminatory. Credit-scoring systems explicitly avoid making use of impermissible data, a fact that can be readily verified. Moreover, as noted previously, the records maintained by credit-reporting agencies on the credit experiences of individuals do not include information on personal characteristics such as race, ethnicity, sex, and marital status.

In fact, it was government regulation in the form of the FCRA and ECOA, particularly the latter, that really ushered in statistical decision making in the credit industry. Until the early 1970s, algorithm aversion and bureaucratic inertia had created very strong headwinds against algorithmic credit scoring.

While seeking to democratize access to credit, the ECOA had a profound effect on the future of creditworthiness. Ironically, the new law accelerated the adoption of credit scoring systems. What originated as a labor-saving technology to speed credit evaluation was by the late 1970s a tool of legal compliance. Faced with prohibited categories of personal information and rules requiring the disclosure of credit criteria, lenders turned to scoring systems as a shield against discrimination suits. At minimum, such systems enabled a lender to prove that it applied the same standards to all credit applicants and to demonstrate its formal rules for decision making. The ECOA thus accomplished overnight what consultants and industry insiders had failed to do for a decade: convince skeptical credit managers to embrace statistical scoring. (Creditworthy, p. 240)

The aforementioned 2007 report by the Federal Reserve Board cites numerous research findings, such as a 2002 paper out of Freddie Mac and George Washington University that studied the effects of statistically based automated underwriting (AU), akin to credit scoring, in mortgage lending. The authors wrote:

Using information from Freddie Mac’s Loan Prospector AU service, we provide statistics useful in examining these issues [the impact of AU on underserved populations]. The data strongly support our view that AU provides substantial benefits to consumers, particularly those at the margin of the underwriting decision. We find evidence that AU systems more accurately predict default than manual underwriters do. We also find evidence that this increased accuracy results in higher borrower approval rates, especially for underserved applicants.

This was borne out by a more recent report from 2022 on racial bias in mortgage lending, which confirms “significant progress in fair lending” over the last 30 years (coinciding with the wider adoption of algorithmic underwriting). While significant racial gaps in denial rates persist, the authors show that these can be explained by “observable applicant risk factors” and not by differential treatment or discrimination (which they define as “lenders treating applicants with identical observed risk factors differently on the basis of race or ethnicity–including both taste-based and statistical discrimination”).

Side Effects and Some Caveats on “Brute-Force Empiricism”

Side effects can come not only from technology regulation but also from algorithmic success stories. By a number of important standards, credit scoring was a big success story. It greatly scaled up the availability of credit in the U.S. economy, and its mechanical consistency removed the subjectivity and biases that are part and parcel of human decision making.17 But the wide adoption and success of algorithmic credit scoring created many ripple effects, some of which have been rather unsavory.

In the pre-algorithmic era, the credit decision problem was very simple: An applicant would either be deemed creditworthy or not, and that would determine whether or not they were granted credit, at a fixed interest rate that applied to all applicants. But a quantitative statistical model can do a lot more than inform a binary decision. It produces a continuous score that tracks the applicant’s probability of defaulting, and allows lenders to partition applicants not just into two classes, creditworthy and not, but into a much larger number of finer-grained buckets according to their credit risk. Different loan terms can then be used for the different buckets. Applicants at lower risks of default can be given credit at lower interest rates, while riskier applicants can still get credit, but at higher interest rates, with lower credit limits, more onerous late-fee policies, and so on. This is a good thing for low-risk applicants, as it makes credit cheaper for them.
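
Schematically, and with entirely made-up thresholds and terms, the shift from a binary decision to risk-based pricing looks something like this:

```python
# A toy illustration of risk-based pricing; all thresholds and terms
# are made up for this example.
def loan_terms(pd_estimate: float) -> str:
    """Map an estimated probability of default (PD) to a pricing bucket."""
    if pd_estimate < 0.02:
        return "prime: low APR, high credit limit"
    if pd_estimate < 0.08:
        return "near-prime: moderate APR"
    if pd_estimate < 0.20:
        return "subprime: high APR, low limit, stricter late-fee terms"
    return "decline"

for p in (0.01, 0.05, 0.15, 0.30):
    print(f"PD {p:.2f} -> {loan_terms(p)}")
```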

It’s a different story for higher-risk applicants, who tend to be financially insecure people without access to mainstream credit, usually because they have limited or poor credit histories.18 People in this category, who are easy to identify algorithmically, have often become the target of predatory lenders who offer them credit at exorbitant interest rates and with punitive stipulations. One might think that having access to credit on unfavorable terms is better than not having access at all, but in many cases the remedy of predatory credit is worse than the disease, as borrowers become trapped in spiraling cycles of debt that often end up in utter financial ruin. Risk-based pricing, of course, is perfectly legitimate in theory. In practice, it is a slippery slope that frequently slides into fraud, with hidden fees and inadequate disclosures, loan packing, “balloon” repayment schedules, debt churning, and so on. However, predatory lending is a very old phenomenon19 and there are already mechanisms in place to deal with fraudulent lending practices at the individual level, most notably the 1968 TILA (Truth In Lending Act) and subsequent amendments, such as the 2009 CARD Act (Credit Card Accountability Responsibility and Disclosure Act).

The wide adoption and success of credit scoring has also created a number of feedback loops. FICO scores, for example, have become so entrenched that they are used to inform life-altering decisions that determine whether one can purchase insurance, obtain housing, buy a car, and find a job or possibly even a romantic partner. Outcomes from these processes can in turn end up reflected in your credit score. If your credit history is meager or your score is poor, you will have a harder time renting an apartment, and you might need a larger down payment. Finding a good steady job will be more challenging. Your loans will carry a higher interest rate, which will make it harder to keep up with payments. If you do fall behind on your payments, you’ll have to pay even more for car insurance, which will make it even harder to stay afloat, which will push your credit score even lower, further reducing your chances of finding good work, and so forth.

Let me now turn to some caveats. I wrote earlier that no serious practitioner would include noise variables like hair color or left-handedness as input features, “particularly for the sort of models that are typically used for credit scoring, like logistic regression.” If they did, they would likely end up hurting the model’s performance. While this is true, the move towards big data and much more complex models is injecting various shades of grey into the matter. Left-handedness or hair color will not provide any signal no matter what, but if you take a non-linear model (like a deep neural net) and throw a ton of data at it, above and beyond the customary data used in traditional credit-scoring systems, it’s conceivable that the complex model will discover interactions between some of the data inputs that will appear to have predictive value. Such a model might turn out to be more accurate than a traditional system like FICO.

This is not a new idea. It’s something that a number of companies have been trying to do over the last 15 years or so, ever since the notion of big data started to break into the mainstream. Wonga, a former payday-loan company in the U.K., predicted the probability of default via a complicated AI model that used 8,000 different input features about an applicant; they claimed to be doing a “dramatically” and “unbelievably” better job at predicting defaults than FICO.20 In 2012, Douglas Merrill, the then-CEO of ZestCash, another payday-loan company that was owned by ZestFinance,21 made the following declarations in a New York Times article:

“We feel like all data is credit data, we just don’t know how to use it yet,” he says. “This is the math we all learned at Google. A page was important for what was on it, but also for how good the grammar was, what the type font was, when it was created or edited. Everything. What Gil is doing at Factual is the same. Data matters. More data is always better.” [my italics]

More data might or might not be “always better” for purposes of sheer prediction, but deciding what webpages to retrieve in response to a generic Google query is very different from deciding whether to grant a particular person credit or bail, or whether to rent them an apartment or make them a job offer. This point seems to have been overlooked by these entrepreneurs, and also by “social innovators” like Rachel Botsman, who in 2012 was waxing poetic about “reputation capital,” enthusiastically endorsing Wonga’s use of those 8,000 input features and similar approaches used by other startups, like Movenbank, which had been founded by “futurist” Brett King the year before, in 2011, and which is also now extinct. Movenbank used a “concept” named CRED that was intended to determine whether “your behaviour is risky.” To do so,

[CRED] takes into account an individual’s traditional credit score but also aspects such as their level of community involvement, social reputation and trust weighting. Do they have a good eBay rating? Do they send money peer-to-peer? It also measures their social connectivity—how many friends do they have on Facebook? Who are they connected to on LinkedIn? Do they have an influential Klout score? It combines this data, not just to assess their risk, but to measure the potential value of the customer.

While many of these early startups no longer exist,22 the trend towards using big data to assess credit risk is alive and well. This trend should be resisted. It represents the “brute force empiricism” that Capon was right to criticize. Yes, credit-scoring models are in the business of predicting behavior, not explaining it. But that doesn’t mean that any and all data under the sun is fair game as long as it yields some predictive delta. As a society, we have (rightly) decided to preclude certain types of data from being taken into consideration in credit decisions—regardless of whether that data carries any signal about the outcome we are trying to predict. Protected attributes are the obvious example (race, sex, and so on), but not the only one. Private data that is not protected by antidiscrimination law should likewise be off-limits. Health information is an obvious candidate. It is plausible that struggles with physical or mental health might have some correlation with default rates; and information about one’s health history might well be discoverable on the Internet. This does not mean that it should be taken into account by credit-scoring algorithms. New regulation might be needed here. Existing regulation demands that reasons for adverse credit decisions be provided to those customers who request them, but does not impose any restrictions on what those reasons can be. There is nothing legally preventing a lender from telling an applicant that they were rejected because of the school they attended, or the grades they received, or how well or badly rated they are by Uber drivers, or because of unsavory rumors about them on social media, or because they have changed too many phone numbers, or because of their web browsing patterns, or because they scrolled through the lender’s terms of service too fast,23 or because of any one of a myriad dubious factors that might be indiscriminately fed into a “big data” credit-scoring model with the subtlety of a firehose. The vast majority of these are either patently irrelevant to creditworthiness or beyond the applicant’s direct control (often both).

We made serious progress in moving away from the “character” nonsense and the fatuous tattletale gossip that bureaucrats used to obsess over when making credit decisions in the pre-algorithmic era—the sort of intrusive surveillance reports gathered by credit bureaus that “would make the head of the Soviet secret police gnash his teeth in envy.”24 It would be a perverse irony if we allowed big data to sink us back to those depths. A credit-scoring model has a narrow predictive objective, to determine whether the applicant is likely to repay—nothing more and nothing less. That decision should not be informed by a deluge of prying data that essentially gauges general social conformity and obedience in the style of the Chinese “social credit” system, wildly overstepping the narrow predictive mandate of a credit-scoring system. It is neither here nor there whether a credit applicant likes to help old ladies cross the street, whether they are good tippers, whether they are courageous, forgiving, prudent, generous, humble, patient, loyal, fair, dependable, compassionate, respectful, or whatever other traits one might subjectively believe to be constitutive of the sort of “character” that might correlate with debt repayment.

Executives of payday-loan and other assorted fintech companies defend their practices by appealing to the opportunity to score the credit invisibles and the unscored. This prospect seems to have greatly impressed regulatory agencies like the CFPB, who are understandably eager to expand credit access to underserved population segments and hope that fintech’s use of big data might be helpful to that end. But scoring the invisibles and the unscored in a way that does not penalize them for lacking a credit history or, say, for having faced medical-bill collections, does not need big data or social media chatter. It can be done in the same basic framework established by FICO, simply by altering the weights of extant input features or, at most, by incorporating a very small number of new but carefully circumscribed and vetted sources of alternative data. This is what Vantage has been doing, for example. Whereas FICO requires at least 6 months of active credit history in order to produce a score, Vantage only requires one month. Vantage also completely disregards third-party collection debts that have been settled, including, crucially, all medical bill collections, regardless of their balance sizes. Vantage claims that its scoring system is significantly more inclusive, allowing 33 million more U.S. consumers to be scored.25 These are the sort of adjustments that regulatory agencies should be experimenting with in their quest to expand credit access.26 Alternative data that might be able to provide meaningful help on that front and whose provenance and accuracy can be readily verified is quite limited; payment histories for utilities are the prime example.27 There is no need for the unreliable cacophony of social media or browser histories to enter the equation.28

It is worth reiterating that the limited regulation suggested here would aim to circumscribe the data sources that can be taken into account when deciding creditworthiness—regardless of who or what is making the decision. This is not overarching regulation about AI en masse or about algorithmic “explainability” or “transparency.” Such regulation would have been just as useful in the pre-algorithmic era, when it could have prevented human analysts from digging into surveillance dossiers detailing people’s drinking habits and sex lives in order to judge their “character.”29

Credit Scoring and the Subprime Mortgage Crisis

Some authors have implicated algorithmic credit scoring in the financial crisis of 2007-2008. For instance, an informative and otherwise insightful article by Barbara Kiviat claims that the “mortgage lenders’ mass adoption of credit scoring … greased the wheels of private-sector mortgage securitization in the early 2000s—and the housing finance crisis that followed.” But if any credit assessment methodology greased the wheels of private mortgage securitization, it was not the algorithmic credit scoring we have been discussing here, which assesses the creditworthiness of individual borrowers, but rather the risk assessment models developed by credit-rating agencies (CRAs) like Moody’s and Fitch. These are a whole other type of animal, designed to do something very different: to evaluate the risk not of a single borrower but that of a financial instrument issued by a private or public entity (a structured financial product issued by an investment bank, a bond issued by a government or municipality, and so on). That is a very different problem from the typical task of “decision making under uncertainty” that we have been examining. It is also a much more challenging problem, and one whose complexity varies dramatically depending on the complexity of the instrument being rated. Because the charge of having contributed to one of the most cataclysmic financial crises in history is a heavy one, it is worth providing some brief background on that crisis to help clarify the situation. The rest of this section will aim to do just that. The crisis was obviously a complex event, and the technical details of the relevant financial products are intricate, but the basic ideas are easy to understand.

Government-sponsored enterprises (GSEs) like Fannie Mae and Freddie Mac30 had already been in the “securitization” business of buying mortgages from primary lenders (banks) for decades before the crisis. This was done for the sake of injecting liquidity and stability into the mortgage market. When Fannie Mae or Freddie Mac buy loans from lenders, that gives lenders more money that they can use to continue lending, hopefully at affordable rates and with long-term horizons, e.g., with 30-year-long repayment periods.31 The purchased mortgages are then pooled and securitized—they are turned into bond-like financial instruments that the GSEs sell to external investors. These instruments are known as MBSs (mortgage-backed securities); they represent ownership interests in the cash flows from the pooled mortgages. That is, the investors who buy these MBSs receive a portion of what the homeowners continue to pay in loan interest and principal reduction. The same idea can be—and often has been—applied to any future stream of payments (known as “receivables”), not just loan payments (or debt payments in general, e.g., credit card debt or school loan debt), but even proceeds from royalties32 and trade receivables, i.e., the money that your customers will owe you for goods or services that your company will deliver to them. The general technique is known as asset securitization and can be a very convenient tool for companies that need to raise funds, for a number of technical reasons.33 Such a steady stream of income is appealing to investors, particularly in the case of GSE-issued MBSs, which have the weight of the U.S. government behind them, providing an implicit guarantee on the timeliness of payments and a vested interest in preventing GSEs from defaulting on their obligations. To further ensure that the mortgages are healthy and can be sold in the secondary market, GSEs require lenders to adhere to certain underwriting guidelines regarding the borrower’s creditworthiness and the property’s appraisal value.

In 1995, Freddie Mac endorsed the use of FICO scores as the main tool for assessing the creditworthiness of mortgage applicants, in addition to other long-established factors for appraising the viability of a mortgage, such as LTV (loan-to-value) and DTI (debt-to-income) ratios, property appraisals, and so on. Their guidance settled on 620 and 660 as two key score thresholds.34 Loans given to borrowers with FICO scores at or above 660 were considered “prime” and underwriting for those borrowers could proceed swiftly. Scores between 620 and 660 merited a “comprehensive review” by the underwriters, while scores below 620 indicated a strong probability of default, barring “extenuating circumstances”.35 These two numbers were chosen because they were found to be the optimal cutoff points consistent with the agency’s extant standards and rich historical data, which showed, for example, that borrowers with credit scores above 660 were much less likely to default. Fannie Mae followed suit within a month, also adopting FICO scores.

While the adoption of FICO scores by the GSEs is occasionally portrayed in a negative light, it was in fact a positive development. Of course, information about a borrower’s credit record was already taken into account during mortgage underwriting before 1995. But the process for doing that was largely manual and left up to individual real estate brokers, who were financially motivated to improve their client’s profile so as to make a sale and earn a commission. Accordingly, the stringent underwriting standards that the GSEs demanded were often applied in a lax manner, as the brokers were allowed to retrieve credit information about a client from multiple bureaus and then merge those records into a single report to the best of their ability and judgment (e.g., by correcting whatever mistakes they claimed to have identified). Not only did this compromise GSE standards, potentially jeopardizing the soundness of their MBS offerings, but it also resulted in credit grades that were subjective and uncalibrated, so that the same grade could mean different things from case to case. Switching to FICO scores standardized the process. In addition to increased accuracy and consistency, there was also regulatory pressure in the early 1990s to increase access to credit, and FICO’s efficiency aligned with that objective as well, as it allowed lenders to speed up their underwriting process significantly and thus serve more potential borrowers.

The real issue was that investor demand for debt instruments like MBSs was high, and Wall Street sensed a lucrative opportunity. GSEs, after all, were not the only entities that could buy mortgages, package them, and sell them off to investors. Private firms could also do that, though this was uncommon until the 1980s, when a confluence of circ*mstances enabled the rise of so-called private-label MBSs, issued primarily by investment banks like Goldman Sachs, Lehman Brothers, Merrill Lynch, and J.P. Morgan, but also by organizations like GE Capital, the financial-services unit of GE. MBSs issued by GSEs became known as agency MBSs, as opposed to private-label MBSs, which also went by the name non-agency MBSs.

Agency and non-agency MBSs were similarly structured, using a technique known as tranching, a Wall Street innovation from the early 1980s that was first introduced in the context of CMOs (collateralized mortgage obligations).36 A single pool of thousands of loans is divided into a number of different tranches (or slices), characterized in terms of “seniority” levels. Each tranche has its own risk profile and yield. Junior tranches are the riskiest; they pay the highest yields but are the first to take losses in the event of a default. More senior tranches have lower risk and thus lower yields, and receive priority in payments coming from the mortgage holders.37 In terms of structure, then, there is little difference between agency and non-agency MBSs. The key difference is that loans bought by the GSEs had to conform to a number of stringent guidelines pertaining not just to the borrower’s credit score but also to the size of the loan (with a fairly low ceiling for the maximum loan amount, which excluded so-called “jumbo” loans), requirements on down payments and LTV ratios, and so on.38 Initially, most of the loans bought by investment banks were also conforming, but over the years more and more non-conforming loans were included in private-label MBSs. These non-conforming loans fell into three categories: jumbo, Alt-A, and subprime.39 From 2004 to 2006, for example, 73% of GSE loans were made to borrowers with prime credit scores (≥ 660) and LTV ratios less than 80% (down payments of at least 20%); the corresponding fraction for private-label loans was 40%. Another key difference was that the large majority of the GSE loans were fixed-rate mortgages, whereas the situation with private-label MBSs was reversed: the large majority were adjustable-rate mortgages (ARMs) and thus vulnerable to interest-rate increases. As a result, private-label loans were considerably more likely to default than agency loans.40 Non-agency loans exploded from $377 billion in 2000 to roughly $2 trillion in 2007 (Table 2, p. 3), at some point even overtaking agency loans. Between 2001 and 2006, origination (by primary lenders) of subprime mortgages more than tripled, while the overall subprime MBS market more than quintupled, going from $95 billion in 2000 to $483 billion in 2006 (Table 4, p. 6).
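
For intuition about tranching, here is a stripped-down sketch of the loss “waterfall,” with invented tranche names and sizes:

```python
# A stripped-down loss waterfall; tranche names and sizes are invented.
def allocate_losses(loss: float, tranches: list[tuple[str, float]]) -> None:
    """tranches: (name, size in $MM), ordered junior -> senior."""
    for name, size in tranches:
        hit = min(loss, size)       # this tranche absorbs what it can
        loss -= hit                 # the rest rolls up to more senior tranches
        print(f"{name:9s} absorbs {hit:5.1f}, {size - hit:5.1f} remains intact")

# A $100MM pool sliced into three tranches, hit with $12MM of losses.
allocate_losses(12.0, [("junior", 5.0), ("mezzanine", 15.0), ("senior", 80.0)])
```

With $12MM of pool losses, the junior tranche is wiped out and the mezzanine absorbs the remainder, while the senior tranche, the part typically rated AAA, is untouched; the soundness of that senior rating depends entirely on how well the model estimates the chances of losses reaching that deep.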

So how were investment banks able to sell these risky securities to investors, many of whom (such as pension fund managers) had strict institutional guardrails on the level of risk they were allowed to assume,41 particularly in the absence of any sort of government guarantee, explicit or implicit? One reason, of course, had to do with profit. Non-agency MBSs generally offered higher yields than GSE MBSs. Their issuers could afford to do that because of the higher interest rates collected on the underlying, riskier loans. This made them attractive to investors. The share of non-agency MBSs (compared to agency MBSs) rose rapidly in the 2000s. The higher yields by themselves should have been a definitive signal to investors that the underlying securities were risky. It’s a fundamental principle of basic investment economics that higher yields entail higher risk. Nevertheless, investors were seriously misled—or perhaps allowed themselves to be misled—by ratings, which is where the CRAs enter the picture.

CRAs assign “credit ratings” to debt instruments (bonds or more exotic securities such as MBSs) issued by governments or private firms. These ratings assess the issuer’s ability to service the debt (pay the principal and interest). The ratings are letter grades on a scale, e.g., the S&P scale has AAA and D as the highest and lowest ratings (least risky and riskiest), respectively. CRAs are a unique type of organization. While they are privately owned and operated for profit, they are “agencies” rather than firms and have a close relationship to the government, particularly to financial regulators, as they are “Nationally Recognized Statistical Rating Organizations” (NRSROs), a designation created by the SEC in 1975 and bestowed upon the “big three” CRAs: Moody’s, S&P, and Fitch. The three had been around since the early 20th century, but their formal blessing as NRSROs made them the only officially sanctioned sources of credit ratings for regulatory purposes. Thus, for example, if an insurance company wanted to demonstrate that the debt securities in its portfolio adhered to risk regulations, they would need to point to the ratings given to those securities by the big three CRAs. As Lawrence White puts it: “Essentially, the creditworthiness judgments of these third-party raters had attained the force of law” (p. 213).

How do CRAs make money? For a long time their clients were investors, who bought CRA ratings and used them to inform their decisions to buy or sell bonds. That business model changed in the early 1970s, for reasons that remain unclear and controversial (p. 214), transitioning from “investor pays” to “issuer pays.” That is, the debt issuers themselves started paying CRAs to rate their securities. This presented a flagrant conflict of interest, but CRAs insisted (and continue to insist) that inflating their ratings to please a paying issuer would be irrational, as it would adversely impact the credibility of their assessments. To preserve their reputation, they have an incentive to keep their ratings objective. Of course, how much of a reputation there is to preserve can be debated. These are the agencies, after all, that kept rating Enron as investment grade until four days before its bankruptcy in December 2001, even though Enron’s stock had been in severe decline for many months before that. Their subsequent performance during the subprime mortgage crisis was abysmal, and in inverse proportion to their profits. Here are some relevant numbers from the 2010 congressional report on CRAs:

[From] 2002 to 2007, the three top credit rating agencies doubled their revenues, from less than $3 billion to over $6 billion per year. Most of this increase came from rating complex financial instruments. According to Standard & Poor’s, between 2000 and 2006, investment banks underwrote nearly $2 trillion in mortgage-backed securities, $435 billion or 36 percent of which were backed by subprime mortgages. All of those securities needed ratings. Moody’s and S&P each rated about 10,000 RMBS [Residential MBS] securities over the course of 2006 and 2007. Credit rating executives got paid Wall Street-sized salaries.

As this NBER article points out:

The lion’s share of these securities was highly rated by rating agencies. More than half of the structured finance securities rated by Moody’s carried a AAA rating—the highest possible credit rating.

And as the founder of Vanguard put it:

And let’s not forget our credit rating agencies, which happily bestowed AAA ratings on securitized loans in return for enormous fees that were paid in return by the issuers themselves. (It’s called “conflict of interest.”) Yes, there’s plenty of blame to pass around.

The chickens started coming home to roost in June of 2007, when the mass rating downgrades started, sending the financial markets into shock, as the downgraded securities could no longer be sold and their value “dropped like a stone” (p. 5). By early 2008, the big three had downgraded tens of thousands of MBSs.

Besides the crooked incentive structure, there were also modeling issues, starting with the lack of sufficient historical data. MBSs, particularly private-label MBSs comprising loans with poor credit profiles, were a relatively recent development, so there were no sufficiently long sequences of historical data to inform the statistical models developed by the CRAs. Moreover, due to the complexity of the underlying instruments, small errors in modeling assumptions or parameters can be greatly magnified, particularly when the securitization is iterated, as with derivatives like “CDOs squared.” CRA risk models were also affected by a broader issue in financial modeling—the assumption of normal outcome distributions. In practice, the outcomes of interest are more accurately modeled by fat-tail distributions that leave a lot more room for improbable (and potentially catastrophic) events; the reliance on normal distributions is known to underestimate the probability of extreme events. Note, however, that the formal models used by the CRAs were often significantly more accurate (less generous) than the final ratings assigned to the securities, which included various types of “adjustments” made at the discretion of managers; see also this paper.
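
The size of that underestimate is easy to illustrate with generic textbook distributions (these are not the CRAs’ actual models):

```python
# Tail probabilities of a k-sigma loss under a normal model versus a
# fat-tailed Student-t (generic textbook distributions).
from scipy import stats

for k in (3, 5, 8):
    p_normal = stats.norm.sf(k)        # P(X > k) under the normal model
    p_fat = stats.t.sf(k, df=3)        # Student-t with 3 dof: heavy tails
    print(f"{k}-sigma event: normal {p_normal:.2e} vs t(3) {p_fat:.2e}")
```

An 8-sigma event is, for practical purposes, impossible under the normal model but merely rare under the fat-tailed one; a risk estimate built on the former can thus be off by many orders of magnitude in the tail.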

In summary, there were numerous factors at play behind the subprime mortgage crisis. The popularization of ARMs was one factor. The increasing trend towards securitization that turned debts into tradeable assets was another key ingredient. The securitization of mortgages, in particular, completely disrupted the traditional (and often personal) connection between lender and homeowner, turning it into a detached and perfunctory relationship. Because the mortgage originators did not intend to hold on to the loans, they were not incentivized to ensure their soundness—selling them off to investment banks made them someone else’s problem. The investment banks, in turn, promptly packaged the loans into attractive-looking securities that unsuspecting investors would then buy in the eternal search for higher yield, crucially aided by the wildly inaccurate risk ratings given by profit-driven CRAs with blatant conflicts of interest and limited ability to build robust predictive models. It was a typical case of market failure.

Algorithmic credit scoring had little to do with the crisis. The worst that can be said for it is that it catalyzed a mentality shift about risk—it normalized the idea that risk can be safely managed with higher interest rates and fees. This notion helped to lay the groundwork for what eventually became a tsunami of subprime loans, particularly adjustable-rate mortgages originated with high fees and interest rates. While risk might be manageable at the local level of the individual loans under normal circumstances, systemic global risk is another story, particularly under changing macroeconomic conditions, because these loans interact with the greater economy and their performances end up being correlated with one another in a way that can ignite a “default contagion” (whereby the prospect of a default by some borrowers increases the probability that others will also default). This mentality change, along with the push-button streamlining of credit risk assessment (courtesy of using algorithms) and the consequent expansion of credit access to a much broader segment of the population, were indeed contributions to the crisis, if only weak ones. They were mere enabling background conditions and (perhaps) very distal necessary causes of what happened; the crisis had many other much more proximate and direct causes that were orthogonal to credit scoring.

1

In 1974, Rule reported that Bank of America needed about a week to decide whether to issue a single credit card. In 1975, then-Senator Joe Biden was the chairman of the Subcommittee on Consumer Affairs. In a committee discussion of statistical credit scoring systems, he admitted “I guess I just don’t like the point scoring system.” Senator Jack Garn, a committee member, replied: “If you don’t allow the point scoring systems, [on] what basis do you expect people to be able to grant or restrict credit? There is no way they can sit with each individual and go through the personal type of credit granting system. If they can’t use some kind of system, what will they do?” (See Lauer’s book, p. 241.)

2

Recall that “judgmental” refers to systems that rely solely on human decision making.

3

Capon does insert a cursory caveat in the very last paragraph of the paper, to the effect that his analysis “should not be construed as advocacy for traditional judgment systems nor as argument against the thrust towards objectivity and consistency in decision making.” However, like more recent papers that unleash a barrage of scathing commentary on AI but include a near-grudging afterthought about the technology’s “opportunities,” Capon’s perfunctory nod, buried under a mountain of criticism, comes across as a token gesture at balance rather than a sincere concession.

5

A scorecard is essentially a table that maps an attribute (or input feature) X and a certain set of values for X into a number of points to be awarded to the credit applicant. For instance, if X is age, then values in the range 20-30 might be given 100 points, values in the range 30-40 might be given 150 points, and so on. See pp. 5-10 of Siddiqi’s Credit Risk Scorecards. The total number of points awarded to the applicant becomes their credit score. It is computed simply by going through each attribute X used by the model, consulting the scorecard to look up the points that correspond to the applicant’s value for X, and summing everything up. The tables themselves are built using statistical techniques such as logistic regression; see Siddiqi’s text for details.
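
Here is a minimal sketch of that lookup-and-sum computation (the attributes, ranges, and point values are invented for illustration):

```python
# Toy scorecard: attribute -> list of ((low, high) range, points) entries.
# Real scorecards derive these bins and point values from historical data.
SCORECARD = {
    "age":    [((20, 30), 100), ((30, 40), 150), ((40, 120), 180)],
    "income": [((0, 30_000), 80), ((30_000, 80_000), 140),
               ((80_000, float("inf")), 200)],
}

def credit_score(applicant):
    """Look up the points for the applicant's value of each attribute
    and sum them to obtain the total score."""
    total = 0
    for attribute, bins in SCORECARD.items():
        value = applicant[attribute]
        for (low, high), points in bins:
            if low <= value < high:
                total += points
                break
    return total

print(credit_score({"age": 34, "income": 52_000}))  # 150 + 140 = 290
```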

6

The claim that credit-scoring advocates only care about predictability and exhibit “a total unconcern about other issues” is one that Capon makes repeatedly in the paper. He tries to support it by quoting a relevant exchange from the 1979 Senate hearings on credit scoring (Capon himself participated in those hearings). The exchange was between the late Senator Carl Levin and Bill Fair, the late co-founder of FICO (the other co-founder was Earl Isaac; the acronym “FICO” stands for “Fair Isaac Corporation”). It starts with Levin asking Fair: “You feel that you should be allowed to consider race?”, to which Fair replies “That is correct.” Levin goes on to ask “Would the same thing be true with religion?”, to which Fair replies “Yes.” Followed by “Would the same thing be true with sex?”, eliciting another “Yes.” And so on. But if we look at the text preceding the exchange, on p. 220, we see that Capon omitted some important context. Just before the beginning of the quoted exchange, Fair had clarified: “I will speak personally, and not for the many creditors with whom we work. If the object of the exercise is to produce decisions which grant credit to creditworthy and deny it as best we can, however faulty it is—and it is faulty—my answer has to be Yes.” [emphasis added]. The subsequent Yes answers were thus conditional—if the sole objective was naked prediction, it would make sense to consider any and all attributes that might potentially carry information about the outcome of interest. Fair was obviously aware that this objective had legal constraints. When Levin had earlier asked Fair why the law prohibits credit scoring firms from taking protected attributes into account when these might have predictive value, Fair replied: “I can’t answer that. That law was passed by the Congress a couple of years ago. And neither I nor, to the best of my knowledge, any of our customers have any trouble in complying with it. If it’s the law of the land, one obeys the law. One might have some reservations about its serving the interests of the citizens at large, but one obeys the law. Incidentally, it should go on the record that no creditors of whom I am aware record or even ask a person’s race. That disappeared probably 7 or 8 years ago.” He later added: “Please don’t misunderstand me personally as being misogynistic, racist, or any of the other pejorative words that may be attached. That would be a very unfair misreading of my view.”

7

For instance, the length of an applicant’s credit history is highly correlated with their age.

8

See Chapter 6 of Siddiqi’s Credit Risk Scorecards. In practice, even simpler techniques such as stepwise regression (or even exhaustive combinatorial analysis, if the set of original input features is small enough) are often sufficient for variable selection; see Chapter 5 of Credit Risk Analytics by Baesens et al.
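
For a flavor of what such variable selection looks like in code, here is a sketch using scikit-learn’s SequentialFeatureSelector on synthetic data (classical stepwise regression adds or drops variables by p-value or AIC; this greedy, cross-validated variant is analogous in spirit):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic data: 20 candidate features, only 5 of which carry signal.
X, y = make_classification(n_samples=5000, n_features=20,
                           n_informative=5, random_state=0)

# Forward selection: greedily add whichever feature most improves
# cross-validated performance, until 5 features are chosen.
selector = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                     n_features_to_select=5,
                                     direction="forward")
selector.fit(X, y)
print("selected feature indices:", selector.get_support(indices=True))
```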

9

He mentions factor analysis, which is somewhat similar except that it focuses on identifying latent factors that cannot be directly observed. Instead, they are inferred from the correlations among the input variables.
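
A quick synthetic illustration of that inference (the two-factor, five-variable setup is hypothetical): observed variables are generated from unobserved latent factors plus noise, and factor analysis recovers the loading structure from the correlations alone.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
latent = rng.normal(size=(1000, 2))    # unobserved factors
loadings = rng.normal(size=(2, 5))     # how the factors drive the observables
observed = latent @ loadings + 0.3 * rng.normal(size=(1000, 5))

fa = FactorAnalysis(n_components=2).fit(observed)
print(fa.components_.round(2))  # estimated loadings (up to rotation and sign)
```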

10

Although, again in Capon’s defense, the landscape was different in 1982.

11

In their 2018 report, the UK House of Lords Select Committee on Artificial Intelligence submits that “it is not acceptable to deploy any artificial intelligence system which could have a substantial impact on an individual’s life, unless it can generate a full and satisfactory explanation for the decisions it will take.” (p. 40). They neglect to specify what constitutes “a full and satisfactory explanation.”

12

See p. 239 of Lauer’s Creditworthy.

13

But note that many statistical models are regularly tested and validated as new data becomes available, even when this is not legally mandated. VRAG, for example, is frequently tested by a variety of academic researchers, government agencies, and clinical professionals such as forensic psychologists.

14

That information was routinely shared with government agencies:

The Credit Bureau of Greater New York “set aside a table for F.B.I. agents, Treasury men and the New York Police Department” who came each day to fill gaps in their own dossiers. And in Houston, the city’s leading bureau had a contract not only with the FBI, but also with the Internal Revenue Service (IRS), to which it sold discounted reports for the purpose of hunting down delinquent taxpayers.

Disclosure of credit information to outside agencies was also outlawed by FCRA.

15

The name was changed in 1975.

16

That article stands out for painting algorithmic decision making in a much less negative light than usual.

17

As Alan Greenspan put it in 2002, credit-scoring models “have sharply reduced the cost of credit evaluation and improved the consistency, speed, and accuracy of credit decisions.” Again, this is not to say that the input signals flowing into credit-scoring models are not affected by systemic forms of bias; that, unfortunately, is inevitable for any decision-making system, human or not, that must predict future behavior on the basis of imperfect observable proxies. The claim is only that these models are an improvement over judgmental systems, because they keep at bay a tremendous range of biases, cognitive and emotional, that plague human decision making.

18

People with no credit histories at all are called “credit invisibles” (CFPB’s terminology), while those with limited or spotty credit histories that cannot be assigned meaningful credit scores are called “unscorables” or “unscored.” They make up roughly 11 and 9 percent of the U.S. adult population, respectively (pp. 11-12). Membership in these categories is obviously correlated with young age (and also somewhat with very old age), but it is also strongly correlated with residence in low-income minority neighborhoods, and thus with race and ethnicity. Credit scoring relies heavily on the values of specific input features, and if these values are largely missing then an output score will either be impossible to compute or will provide a noisy signal with little predictive value.

Of course, lack of vital relevant information would present a problem for nonalgorithmic approaches as well. There is strong interest in scoring underserved applicants through the use of alternative data sources that are not commonly used by conventional credit-scoring models (such as payment histories for rent and utilities), as well as expanding educational efforts aimed at helping young people to develop credit histories that will make them scorable by conventional models.

19

In 1911 it was determined that 35% of New York City’s employees owed money to loan sharks (p. 118). In 1908, Harper’s Weekly was publishing articles with titles like “Loan Sharks: The Scourge of the Deep Waters of City Life.” In the Reconstruction era, Mark Twain was satirizing “beautiful credit” (“the foundation of society”) with lines like “I wasn’t worth a cent two years ago, and now I owe two millions of dollars.” And of course, going back further, charging interest on loans made to poor people repeatedly pops up (and is condemned) in the Old Testament.

20

Errol Damelin, a co-founder of Wonga, had stated that “we’ve built an engine that is dramatically more predictive for what we do than FICO, dramatically on a scale that’s unbelievable.” Numerous similar claims have been made by fintech companies over the last 10-15 years, but on the basis of little or no evidence. Such claims are hard to verify for a number of reasons. First, there are no standard public test sets that are universally recognized as benchmarks for the performance of credit-scoring models; models are typically trained and tested on internally held historical loan performance data that differs from company to company. Second, such claims are rarely accompanied by supporting evidence. Occasionally some aggregated results comparing the performance of big-data models to traditional scoring methods might be released, but without the detailed data and methodology needed for a thorough independent evaluation. Third, traditional scoring methods like FICO have been extensively tuned and already have high accuracy; improvements are always possible, but “dramatically” better performance is unlikely even in terms of relative (rather than absolute) improvement. Finally, the way in which a comparison of one model to another is conducted can have a dramatic impact on the results. For instance, performance that is only measured on a segment of the underlying population defined by a range of scores produced by the benchmark model (as opposed to the new, “challenger” model) can artificially inflate the improvement realized by the new model.
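
That last effect is easy to demonstrate on synthetic data. In the sketch below (entirely made up; the noise levels are arbitrary), an incumbent score A and a modestly better challenger B are compared both on the full population and on a narrow band of A’s own scores:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 100_000
latent = rng.normal(size=n)                             # true creditworthiness
default = rng.random(n) < 1 / (1 + np.exp(latent + 2))  # better latent -> fewer defaults

score_a = latent + rng.normal(scale=1.0, size=n)  # incumbent model, noisier
score_b = latent + rng.normal(scale=0.7, size=n)  # challenger, modestly better

def auc(score, mask):
    # Low scores should indicate high risk, hence the sign flip.
    return roc_auc_score(default[mask], -score[mask])

everyone = np.ones(n, dtype=bool)
band = (score_a > np.quantile(score_a, 0.4)) & (score_a < np.quantile(score_a, 0.6))

print(f"full population: A={auc(score_a, everyone):.3f}, B={auc(score_b, everyone):.3f}")
print(f"A-defined band:  A={auc(score_a, band):.3f}, B={auc(score_b, band):.3f}")
```

Within the band, A’s scores barely vary (range restriction), so its measured AUC collapses toward 0.5, while B retains its independent signal; the challenger’s headline advantage is largely an artifact of the evaluation design.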

21

ZestFinance is now known as Zest AI, after facing a number of legal troubles over the years, such as a 2018 lawsuit alleging that Merrill and his company “worked with a North Dakota tribe so they could hide behind its sovereign immunity while issuing illegal short-term loans with exorbitant interest rates.”

22

Robinson and Yu observed in 2014 that “These companies come and go quickly” (p. 14).

23

ZestFinance tracked cookies to estimate how carefully a borrower was reading its terms, interpreting that as an indication of how seriously the borrower was taking the loan.

24

The contemporary equivalents of those earlier bureaus are “big data brokers” like eBureau and Intelius. And just like the earlier bureaus were cozy with agencies like the FBI and the IRS (see footnote 14), today’s data brokers often sell their data to intelligence agencies.

25

Vantage 4.0, in particular, is said to “expand the scoreable universe to 96% of the U.S. adult population,” much higher than the 80.7% that was estimated by the Congressional Research Service in 2019 (see pp. 11-12; the total estimate is 11% for invisibles and 8.3% for unscorables, for a total of 19.3%, so 80.7 = 100 − 19.3).

26

An issue with Vantage is that the scoring model is owned by the three credit bureaus themselves (Experian, Equifax, and TransUnion), unlike FICO, which is a separate business entity that relies on data from the credit bureaus but generates its scores independently. Ownership of the scoring model by the same firms that generate the data that flows into the model raises some uneasy consolidation questions, but these are orthogonal to the conceptual question of how to expand credit access without opening big data’s Pandora’s box. (FICO has also experimented with alternative scoring methodologies in the interest of expanding credit access, such as FICO XD, introduced as early as 2015 and still pursued; the range of “alternative data” used by XD is not clear, though it’s almost certainly nowhere near as wide as that of the proprietary models built by big-data evangelists.)

27

In fact, both FICO and Vantage already include such data in their models and use it when available. The data usually comes from the National Consumer Telecom & Utilities Exchange (NCTUE), a consumer reporting agency managed by Equifax, which maintains a database of account- and payment-related information, mostly for telecom and utility services, to which companies like Verizon have read and write access. But participation by such companies is voluntary, and the information is very sparse, so few credit profiles contain such data. That is something that regulation can change, by mandating that payment histories be standardized and routinely provided by utility or telecom companies to credit bureaus (or to an organization such as NCTUE) in an accurate and secure manner. The Credit Access and Inclusion Act of 2019, S. 1828, attempted to amend the FCRA to allow the reporting of positive consumer-credit information relating to lease or utility payments but was not enacted into law. The bill was reintroduced last year but again stalled. It does not go far enough anyway; the reporting of such information needs to be systematized and mandated, not merely permitted.

28

There have been a few reports indicating that big-data approaches might indeed facilitate the expansion of credit access, though there is not enough evidence yet to settle that question. But the real question is not whether the use of big data can expand credit access; perhaps it can. The question is whether it is necessary. Instead of expanding access, big data might end up restricting it. Features like the school that one attended or the grades received, geolocation data from online tracking, the use of banking services, and particularly aggregate input features (derived, e.g., by averaging certain quantities over entire groups of people) seem likely to have a disparate impact on minorities, which might make such systems more vulnerable to charges of disparate-impact discrimination.

29

As usual, of course, such regulation is more difficult to enforce when the decision makers are humans. You can easily verify the inputs flowing into an algorithm, but it is much harder to determine what information a human used (consciously or not) to arrive at a decision.

30

Fannie Mae (a backronym that pronounces FNMA, itself an acronym for the Federal National Mortgage Association) is the older of the two, having been founded in 1938 to avoid a repeat of the housing debacle of the Great Depression, when about a quarter of all mortgaged homes were being foreclosed each year. Initially, Fannie Mae could only buy loans that were insured by the FHA (the Federal Housing Administration, established in 1934 with the aim of insuring mortgages, thereby significantly derisking them and enabling lenders to make more loans), and later on mortgages guaranteed by the Veterans Administration (VA), as authorized by the Servicemen’s Readjustment Act of 1944 (commonly known as the GI Bill). Fannie Mae became able to buy arbitrary loans from the broader mortgage market (not just FHA-insured or VA-backed loans) in 1970, which is also when Freddie Mac was created (another dubious backronym, for FHLMC, the Federal Home Loan Mortgage Corporation), with the express purpose of preventing Fannie Mae from becoming a monopoly. Two years before that, in 1968, the older Fannie Mae was split into two: a new GSE incarnation that still went by the name Fannie Mae but became a publicly traded company that would remove its loans from the government’s balance sheet; and a brand new GSE by the name of Ginnie Mae (GNMA, for Government National Mortgage Association). The latter remained a government entity and guaranteed payments (on both principal and interest) for those loans that were FHA-insured or VA-backed.

31

Mortgages before World War II had much shorter repayment periods, usually 3-5 years. Borrowers would typically refinance at the end of the loan term, since at that point they faced “balloon” payments for the principal that few could afford (the bulk of the preceding payments having gone to interest). Banks were reluctant to commit to a fixed interest rate for a long period, since higher rates in the future would present greater profit opportunities. (Adjustable-rate mortgages—ARMs—addressed that concern later on, but these were not popularized until the historically high inflation of the late 1970s.) Liquidity was another concern, as longer-term loans tie up a bank’s capital for extended periods. GSEs revolutionized this landscape by developing and standardizing a secondary mortgage market that allowed lenders to sell loans and free up capital. Initially, mortgages were predominantly sold to GSEs, but, as we will see, over time direct sales to private firms such as investment banks became very common.

Note, however, that if you go back earlier in time, before the Great Depression, you’ll see that MBSs were widely sold by private firms and bought by ordinary citizens in the wake of World War I, particularly in the 1920s, albeit packaged in a much less sophisticated form than the private MBSs that emerged in the 1980s. Those earlier MBSs were simple mortgage bonds. Many of the iconic skyscrapers in U.S. cities were financed by such bonds, which were bought by ordinary American families. As Sarah Quinn puts it in chapter 6 of her book: “For as little as $100—or $10, if you paid in installments—a person could own a share of the mortgage on the Chrysler Building or the Waldorf Astoria, although the unlucky investors in the latter lost their money when its loan defaulted.” (p. 106).

32

David Bowie is said to be the first artist to have used this technique to obtain funding from his future music royalties, but James Brown, the Isley Brothers, and Rod Stewart have also done it (p. 3).

33

One chief reason is that the asset security can be appraised independently of—and usually more favorably than—the general financial situation of the company (also known as the “originator”); see p. 135 of this paper, and also chapter 2 of this book.

34

If lenders wanted to sell their loans to a GSE, they were contractually obligated to adhere to the underwriting guidelines set by that GSE.

35

Importantly, however, neither number was a “hard” cutoff, e.g., in the presence of compensating factors, even a loan to a borrower with a FICO score below 620 could still be eligible for GSE purchase.

36

CMOs, created in 1983 by Salomon Brothers and First Boston for Freddie Mac, were the first MBS products to use tranching. Several other types of MBS products appeared in subsequent years, such as CDOs (collateralized debt obligations), but all shared the same basic ideas and all made extensive use of tranching.

37

Payments are made on a waterfall model: Incoming cash from the mortgage holders is first distributed to the most senior tranches at the top, and what remains flows down to the more junior (“subordinate”) tranches.
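
A minimal sketch of that logic (the tranche names and dollar amounts are hypothetical):

```python
def waterfall(cash, tranches):
    """Distribute incoming cash to tranches in order of seniority.
    tranches: list of (name, amount_due) pairs, most senior first."""
    payments = {}
    for name, due in tranches:
        paid = min(cash, due)
        payments[name] = paid
        cash -= paid
    return payments

# $70 comes in against $100 of scheduled payments; juniors absorb the shortfall.
print(waterfall(70, [("senior", 60), ("mezzanine", 25), ("equity", 15)]))
# {'senior': 60, 'mezzanine': 10, 'equity': 0}
```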

38

Thus, all conforming loans were by definition prime, but not all prime loans were conforming.

39

Jumbo loans went to borrowers who had prime credit scores but whose principal balances exceeded the GSE caps. Alt-A borrowers also had prime credit scores but did not satisfy the underwriting requirements for documentation on their mortgage applications, particularly documentation of income sources; typically this was because they were self-employed or had varying incomes. Finally, subprime loans were those made to borrowers with credit scores under 660.

40

In the 2004-2006 period, for example, non-agency loans defaulted three times as often as agency loans.

41

Various regulations prohibit certain institutions from making high-risk investments. Starting in the 1930s, for instance, the government prohibited banks from investing in “speculative investment securities” (bonds that would be rated below investment grade, in contemporary terms). In subsequent decades, many states in the U.S. introduced legislation that curbed the investment risk that banks and insurance companies could take on. In the 1970s, federal pension funds were likewise constrained. Similar measures have been implemented globally, e.g., most countries these days impose numerous quantitative restrictions on the investment portfolios of pension funds.
