2016-11-01

Human analysts can’t agree on sustainability ratings or much else, either

In Indonesia, Latin America, and Africa, farmers set forests ablaze every day, releasing CO₂, displacing endangered species, and polluting the air. Once the forests are razed, newly planted farms can produce palm oil, the world’s most popular vegetable oil.

It’s a problem that most of the world’s largest companies have a hand in, and one that environmentalists say is getting worse. So several authoritative groups, including Greenpeace, the World Wildlife Fund, and the Union of Concerned Scientists, set out to hold these firms to account by rating their transparency and conduct related to palm oil sourcing.

What did they find? Unfortunately, they drew very different conclusions. Their most recent reports, from 2015 and 2016, don’t offer a clear picture of which companies consumers and investors should pressure for better behavior.

The disagreement among the scores, which stems from different methodologies and different human biases, illustrates a crucial problem that confronts anyone who hopes to measure companies’ environmental, social, and corporate governance (ESG) performance.

The scientific community has a name for the problem: inter-rater reliability. In the case of palm oil studies or ESG data for investing, it means that highly qualified analysts can make radically different assessments due to inherent human bias.

The disagreement in palm oil conduct ratings for 30 major companies can be seen in the chart above, which compares three sets of scores. If the raters had agreed, scores would be bunched together on each line. As it is, 60% of the companies were scored by two raters within three points of each other on a 10-point scale. Forty percent of the companies were rated four points or more apart.
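The agreement figure is simple arithmetic, and a short sketch makes the definition concrete. The scores below are hypothetical, not the ratings from the reports, and the function name `agreement_rate` and its `tolerance` parameter are assumptions chosen to mirror the "within three points" description.

```python
def agreement_rate(scores_a, scores_b, tolerance=3):
    """Share of companies whose two scores differ by at most `tolerance` points."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and the same length")
    close = sum(1 for a, b in zip(scores_a, scores_b) if abs(a - b) <= tolerance)
    return close / len(scores_a)


# Hypothetical scores for five companies on a 10-point scale (illustrative only).
rater_a = [9, 8, 2, 7, 5]
rater_b = [8, 9, 7, 3, 5]

rate = agreement_rate(rater_a, rater_b)
print(f"{rate:.0%} of companies scored within three points")  # prints 60%
```

With three raters, the same function could be applied to each pair of raters in turn. How the pairs should be combined into one headline figure is a judgment call the sketch leaves open.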

While the scorers agree that Kellogg’s and Danone are leaders, they have polar opposite views on Avon, Kraft, and others.

That 60% rate of agreement is a little worse than the standard expectation when humans with similar qualifications evaluate information. Numerous studies show that at best, trained experts agree about 80% of the time.

Inter-rater reliability

When scientists and academics study inter-rater reliability, it’s usually to figure out how much they can trust medical and psychiatric diagnoses made by doctors, nurses, psychologists, and people training for one of those roles. Other times, the phenomenon is studied to determine whether job performance ratings are valid for teachers and other hard-to-measure professions.

As illustrated above in the case of palm oil, the problem matters for environmentalists and consumers. It also extends to investment scoring and rating, including those for ESG and sustainability.

“The ratings of six major social raters—KLD, Asset4, Calvert, FTSE4Good, DJSI, and Innovest—have little overlap in membership and fairly low correlations with each other. Our results imply that SRI raters not only do not agree on a one definition of sustainability … but also that raters may measure the same construct in different ways…. The validity of social ratings is a serious concern not just for academics, but also for investors, activists, and policymakers.”

— “Do ratings of firms converge? Implications for managers, investors and strategy researchers”

That finding comes from a study published in the Strategic Management Journal by professors from the schools of business at Duke, UC Berkeley, and IPAG, a French institution. It should be noted that since the study, KLD and Innovest consolidated to become MSCI ESG Research.

Corporate Social Responsibility, or CSR, is a close cousin to ESG, and the same problem is apparent in CSR ratings. A study comparing Fortune’s ranking of “America’s Most Admired Companies” with the companies’ scores from KLD found hardly any correlation. That’s seen in the scatterplot chart below.

The chart has one axis dedicated to Fortune’s company ratings, and the other to KLD’s. If KLD and Fortune agreed, the dots would form a diagonal pattern, with zero scores at the bottom and left corner, and perfect scores positioned in the top right corner.
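One common way to put a number on how closely the dots follow that diagonal is the Pearson correlation coefficient: a value near 1 means the two raters rank companies alike, and a value near 0 means there is no linear relationship. The sketch below uses hypothetical scores, not the study’s data, and the names `pearson`, `fortune_scores`, and `kld_scores` are illustrative assumptions. It does not claim to reproduce the study’s own method.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least two scores")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one rater gives a constant score")
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for four companies from two raters (illustrative only).
fortune_scores = [2, 4, 6, 8]
kld_scores = [5, 3, 6, 4]

r = pearson(fortune_scores, kld_scores)
print(f"correlation: {r:.2f}")
```

In this made-up example the second rater’s scores move independently of the first’s, so the coefficient comes out at zero, the numerical counterpart of a scatterplot with no diagonal pattern.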

KLD and Fortune, and the six social raters in the previous example, aren’t negligent or incompetent. In fact, they likely intend to measure different things. Each rater constructs its own methodology to examine the factors it thinks matter. Even so, human judgment produces methods different enough that the results at times can’t be reconciled.

Moreover, investors use these all-inclusive scores or ratings as a comprehensive, portable metric and treat it as totally representative of a company’s performance. End users rarely dive into the methodology behind ratings; they’re too busy making investment decisions to consider on a daily basis how the sausage is made.

ESG data are the new investing fundamentals

So if existing ESG data and ratings aren’t totally reliable, why should mainstream investors use them?

A body of research that’s now several decades old has repeatedly validated that ESG has a measurable effect on investment risk and returns. A review of 2,200 academic studies on the subject found that a clear majority identified a positive relationship between ESG criteria and corporate financial performance.

For most of the last century, traditional balance sheet fundamentals were sufficient to examine a company’s performance. The types of data that are critical for investors to get a complete picture of a company are expanding, though. Today, investors who don’t pay attention to ESG and other so-called “intangible” qualitative factors are missing out on a wealth of relevant, material information.

The ratio of tangible company assets to intangible ones that give the S&P 500 firms their value has completely flipped in the last four decades, as Bruno Bertucci, UBS’s head of sustainable investors, pointed out in a presentation with the Sustainability Accounting Standards Board.

As late as 1975, 83% of assets were tangible, 17% intangible. By 2015, 84% of assets were intangible, 16% tangible.

UBS builds its modern investment processes to include sustainability because today’s companies are “asset-light,” and book value no longer equals market value, Bertucci said.

The ascendance of intangible value means that non-financial yet material factors are crucial to a company’s market value. “Intangibles” like ESG are an important means to examine the qualitative company assets that underpin valuations today.

ESG factors cover a range of business practices, strategies, and relationships with communities and the environment. They measure types of company performance that are essential to reputation, customer relations, and yes, long-term earnings.

Handling rater reliability and other ESG data issues

ESG data are not standardized, and that’s a problem even earlier in the analysis process than inter-rater reliability. Raw numbers for carbon output and gender diversity are reported in different formats by different companies.

That’s partly because no standard format exists for publishing ESG information. Not only that, the data quickly become outdated in many cases, when annual reports are the only ones available. Perhaps most important, analysts who compile scores rely on company-reported data, which is vulnerable to reporting bias.

So how can investors who consider ESG performance assess the validity of ratings? Artificial intelligence, or AI, can act as a hedge against human rater error.

TruValue Labs, a San Francisco-based startup, is harnessing AI to help solve the inter-rater reliability problem for ESG investors.

Metrics created by artificial intelligence can serve as a benchmark to validate human ratings, or data self-reported by companies. AI may make errors, but even those errors are consistent, not the result of human confirmation bias. A machine holds no prior beliefs about company performance.

The company’s flagship product, Insight360, applies AI to the wealth of semantic data on the internet, including news stories, regulatory filings, and industry coverage. This service offers current, comprehensive data on ESG performance for more than 8,500 publicly listed companies.

Multiple times a day, Insight360 scans and scores data from more than 75,000 sources around the English-language web. Algorithms analyze natural language data from sources ranging from the New York Times to the Hindustan Times, and from the U.S. Department of Labor to China Labor Watch.

Data is categorized, ranked for sentiment, and incorporated into a company’s score. Each company is scored on the Insight360 index—a 100-point scale with positives rated above 50, and negatives below. That data is transparent, because subscribers can drill down to examine news stories that cause score changes.

Insight360 learns as humans do, but considers far more data. The system is designed to catch important events. Ratings are more consistent than those created by human analysts, because the system assigns identical values to identical data points.
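The consistency argument can be illustrated with a toy scoring rule. The sketch below is not Insight360’s algorithm, which this article does not describe. It assumes, purely for illustration, that each data point carries a sentiment value between -1 (negative) and 1 (positive), and that the average is mapped onto a 100-point scale with 50 as neutral. The function name `score_company` is hypothetical.

```python
def score_company(sentiments):
    """Map sentiment values in [-1, 1] onto a 0-100 scale, with 50 as neutral.

    Hypothetical rule for illustration only, not a vendor's actual method.
    """
    if not sentiments:
        return 50.0  # no data: stay at the neutral midpoint
    for s in sentiments:
        if not -1.0 <= s <= 1.0:
            raise ValueError("sentiment values must lie between -1 and 1")
    mean = sum(sentiments) / len(sentiments)
    return 50.0 + 50.0 * mean


# Hypothetical sentiment values for one company's recent coverage.
coverage = [0.5, -0.25, 0.75, 0.0]

first_pass = score_company(coverage)
second_pass = score_company(coverage)
print(first_pass, second_pass)  # identical inputs always yield identical scores
```

A rule like this can still be wrong, but it is wrong in the same way every time, which is what makes it usable as a benchmark against human ratings.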

TruValue Labs sells subscription licenses to financial professionals for its SaaS-based Insight360 application, both directly and in the Thomson Reuters Eikon App Store.


Lipperalpha.refinitiv.com

Failing Scores for ESG Raters