Disclosure: I’m long Acacia Research Corporation (ACTG)
This post describes how I used correlation and regression to investigate an aspect of Acacia Research’s performance. The example is live enough that I need the disclosure at the top and the disclaimer at the bottom. Spreadsheets make statistical analysis more accessible, but they don’t eliminate the pitfalls that can catch an uninformed user. I’ve tried to map the pitfalls surrounding the interpretation of results that appear in a real case, while investigating a possible risk to Acacia’s value. I assume little knowledge of spreadsheets or statistics, and with the subtleties of interpretation, the result is a long post. If you just want to number-crunch in Excel, go here: How to Run Regression Analysis in Microsoft Excel http://www.wikihow.com.
The analysis is based on nine quarterly results, a small sample size which could raise objections. While that’s an unusually small sample size, phase 0 clinical trials (a recent development in oncology) base statistical measures on samples of 6 to 15 patients. See page 10 of the pdf ‘Phase 0 Clinical Trials in Cancer Drug Development: From Concept to Practice” James H Doroshow M.D., National Cancer Institute (sensitive readers should avoid page 8).
For convenience, I’ve put small copies of the images in this post at the end, so they can be opened from one place.
My spreadsheet for correlation and regression can be downloaded in two versions, saved from OpenOffice Calc.
WARNING – download sites are risky. I use mediafire, a site which does not use download wrappers, so you’ll get spreadsheets, not executable files. Even so, the ads could be aggressive. Currently, a new window opens with ads. On a PC, I believe the safest way to close the window after downloading is to right-click on the tab in the taskbar and click on ‘x Close’, or any way of closing the window without clicking on it. Or get an ad-blocker. If the official green button doesn’t work, give up.
OpenOffice (ods) File size: 51.06 KB
Excel (xls) File size: 92.5 KB
Spreadsheets and reliability
There are dire warnings that spreadsheets can’t be relied on for stats. According to “Statistics The Risks of Using Spreadsheets for Statistical Analysis” IBM, 2010, 5.2% of spreadsheet cells contain errors, although IBM could be biased and the analysis may have been done on a spreadsheet :). There’s even a European Spreadsheet Risks Interest Group. Their sponsors are fairly diverse and EuSpRiG seem genuine. The horror stories they list include the well publicized slip by economists Reinhart and Rogoff regarding the effect of debt on the economy, and Britain’s spies snooping on 134 phones by mistake (“MI5 makes 1,061 bugging errors” Identifier: FH1215). Most of the errors seem to involve human error, and it might be convenient to blame the spreadsheet.
There’s a more relevant problem when a regression line (explained later) is forced to pass through the origin (the point where two variables both equal zero), which can produce less reported error while being more wrong, sometimes wildly wrong. This excellent guide “Quick and Dirty Regression Tutorial” advises against forcing the regression line through the origin without a good theoretical reason.
From wikipedia’s entry for Microsoft Excel: “The accuracy and convenience of statistical tools in Excel has been criticized, as mishandling missing data, as returning incorrect values due to inept handling of round-off and large numbers, as only selectively updating calculations on a spreadsheet when some cell values are changed, and as having a limited set of statistical tools. Microsoft has announced some of these issues are addressed in Excel 2010.”
I suggest checking that results are reasonable and showing them in a chart when possible.
Short company background
Acacia Research makes money from patents, either acquired outright or obtained through revenue sharing agreements with patent owners, who are often paid an advance. Earnings are from pre-trial settlements, license fees, and awards (usually from jury trials). Earnings are volatile but not seasonal.
A few fairly bullish articles were published on Seeking Alpha after results were released on April 18 2013, and I’ve blogged about Acacia (at length).
Two of their costs are unavoidable: contingent legal fees (the lawyers’ share when a case is won or settled), and inventor’s royalties. I’ve looked at quarterly Revenue Net of Royalties and Contingent Fees, which I call Revenue NRCF. Management report this as ‘total revenues, less inventor royalties expense and contingent legal fees expense’. For Q3 2011 they include verdict insurance proceeds and related costs to get $21,117,000, whereas I don’t and calculate $35,116,000. Revenue NRCF is not one of management’s preferred metrics for assessing performance, which are annual growth of: patent portfolios under management, new licensing programs, revenues and profits, with trailing twelve month measures and non-GAAP preferred.
Revenue NRCF and Patent acquisition costs
Plotting Revenue NRCF against Patent acquisition costs shows a relationship which might look worrying (find ‘chart for Acacia’). It could show that expensive patent acquisition boosts Revenue NRCF in the same quarter, with little sign of a more sustained contribution. This is subjective, and other investors might see little to worry about. To investigate this, I first looked at correlation.
A correlation of 1 between two variables means that if you know one of them, you know the other, and a graph of one against the other will show a straight line. For instance, if a website sells gizmos for $100 dollars each plus a fixed $5 delivery charge, there will be a ‘perfect correlation’ of 1 between the bill for an order and the number of gizmos in the order. Sometimes one number goes down when another number goes up. This is called ‘negative correlation’ and a perfect negative correlation gets a score of -1. A correlation of 0 indicates there’s no correlation. There are charts illustrating different kinds of correlation on wikipedia (top right).
Imagine that one year a generous government gave each family $10,000 plus $100 for each child in the family. There ought to be a ‘perfect’ correlation of 1 between the number of children in a family and the amount received, but the correlation could be lower due to fraud or mistakes. Suppose in the next year, the government gave each family $5,000 plus $200 for each child in the family. If the level of fraud and mistakes don’t change, the correlation between children in a family and the amount received should be the same as in the previous year, except for some random variation as fraud and mistakes don’t repeat exactly. More generally, you can add, subtract, and multiply as much as you like without changing the correlation, so long as it’s consistent for each variable (like number of children, or amount received), and the multiplication is not by a negative number, which would change the sign of the correlation coefficient. That’s because correlation depends on how much of the variation in one quantity can be explained by variation in the other, which is different to sensitivity.
This might be old fashioned, but imagine meters with a pointer and a dial, for measuring electric current. Any such meter should have a correlation very close to one, for the angle of the pointer and the actual current, within its range of operation. A meter for measuring tiny electric currents would be much more sensitive than a meter used in a power station, comparing their degrees of pointer movement per amp. If a spreadsheet had a column for current and another column for pointer angle, multiplying the pointer angles by two would mimic a doubling of sensitivity, and would not affect the correlation.
Whatever number you get for correlation, in statistics there’s a chance that a sample is unrepresentative. That problem is often minimized by trying to avoid any cause of bias, and by using a large sample size. Correlation is symmetric, meaning if rainfall and umbrella sales are correlated with a coefficient = 0.6, then umbrella sales and rainfall are also correlated with a coefficient = 0.6. The proper name for the usual measure of correlation is the ‘Pearson correlation coefficient’ or the less catchy ‘Pearson Product-Moment Correlation’ (PPMC).
Tables and charts
My tables have data in rows, which fits the post, but columns are more usual. The downloadable spreadsheet has a sheet with the data in columns. I show tables with ‘formula view’ first, then results. To toggle formula view on and off in a spreadsheet, press the Control key and ` (on the top left, usually above the tab key and left of 1). Sometimes formulas don’t show because columns aren’t wide enough, and it can be worth copying to a new sheet before making columns wider. I improved most of the charts using GIMP, and making the lines thicker might have reduced the accuracy by a pixel or two.
Spreadsheet and chart for Acacia’s Patent acquisition costs and Revenue NRCF
The correlation function
I used the formula:
‘=’ means, a formula follows
‘CORREL’ is the correlation function. Functions are followed by ‘arguments’ in brackets. The function ‘CORREL’ takes two arguments, each of which is a data range.
‘C9:K9’ is the first data range, which specifies the nine cells where I’ve put Revenue NRCF (The data range C9:K9 means the cells in row nine, from column C to column K).
‘C5:K5’ is the other data range, which specifies the nine cells where I’ve put Patent acquisition costs.
Normal spreadsheet functions have individual cells or constants as their arguments. Functions like CORREL which have data ranges for their arguments, are called ‘array functions’. If you use the Function Wizard to get an array function, you need to tick the ‘Array’ box, which OpenOffice puts in the bottom left.
The coefficient of correlation for Revenue NRCF and Patent acquisition costs is 0.83 (actually, it’s 0.833016, but there’s no need here for that much accuracy). A correlation of 0.8 or above is usually considered high enough to be taken seriously in the social sciences. A low correlation would have meant there was little point in further investigation. The relatively high correlation would usually mean that the two variables have moved in line with each other fairly closely, but there are complications which I describe later.
If your spreadsheet lacks a statistics add-on, you can check the significance of the correlation courtesy of the Department of Obstetrics and Gynaecology at The Chinese University of Hong Kong, here. Unfortunately the site is likely to redirect you to it’s home page, if so you navigate through:
Statistics Tool Box / Statistical Tests / Correlation / Minimum r to be signif
I put the number 9 in the box labeled ‘Sample size (n)’, and clicked the ‘Minimum coefficient’ button. The results were:
Minimum r significant at p=0.05 is 0.6664
Minimum r significant at p=0.01 is 0.7977
Minimum r significant at p=0.001 is 0.8983
Bearing in mind that those numbers are only relevant to a sample size of nine, what they mean is:
line 1 – Minimum r significant at p=0.05 is 0.6664
If the variables are fundamentally uncorrelated, the chance that a sample of nine data points will give a correlation greater than or equal to 0.6664 is five times out of a hundred, or 5% of the time. (The p=0.05 translates to a 5% chance). That is often expressed as meeting a 95% confidence threshold (as 100% – 5% = 95%). Any correlation higher than 0.6664 exceeds the 95% confidence threshold and is even less likely to occur by chance. The 95% confidence threshold is traditionally taken to be a test of ‘statistical significance’, although it’s an arbitrary number and not appropriate to all cases.
line 2 – Minimum r significant at p=0.01 is 0.7977
If the variables are fundamentally uncorrelated, the chance that a sample of nine data points will give a correlation greater than or equal to 0.7977 is one time out of a hundred, or 1% of the time, (The p=0.01 translates to a 1% chance) and the 99% confidence threshold is met. Any correlation higher than 0.7977 exceeds the 99% confidence threshold and is even less likely to occur by chance.
line 3 – Minimum r significant at p=0.001 is 0.8983
If the variables are fundamentally uncorrelated, the chance that a sample of nine data points will give a correlation greater than or equal to 0.8983 is one time out of a thousand, or 0.1% of the time. (The p=0.001 translates to a 0.1% chance). That could be expressed as meeting a 99.9% confidence threshold. Any correlation higher than 0.8983 exceeds the 99.9% confidence threshold and is even less likely to occur by chance.
(To test the confidence of a negative correlation, you leave out the minus sign when you type the number into the box.)
The correlation coefficient found above was 0.83, which does not qualify for 99.9% confidence (as 0.83 is less than the 0.8983 threshold), but it qualifies for 99% confidence (as 0.83 is more than the 0.7977 threshold). Since 99% confidence is five times as strict as the common 95% threshold for statistical significance, and the correlation between Revenue NRCF and Patent acquisition costs is plausible, the correlation would usually be taken seriously.
If the Chinese obstetrics site ever folds, you can use tables instead which you find by searching for ‘Pearson Product-Moment Correlation (PPMC) Coefficient Table of Critical Values’, but you don’t usually look up the sample size, instead you look up the degrees of freedom which you get by subtracting 2 from the sample size (for correlating two variables, as here). Note that df or d.f. stands for degrees of freedom, and n is usually the sample size (so df = n – 2). This is how I explain degrees of freedom. If you can only sample two points from a set of points, you have no way of knowing if the set of points is scattered like pellets from a shotgun (uncorrelated), or from a set of points that are all close to a straight line (correlated). Adding a third point gives at least some indication of how scattered the set of points is. The third point sampled is the first that gives useful information. If there are n points, there is only n-2 points-worth of information useful for calculating the correlation, because discarding n-2 points makes the calculation impossible.
Statisticians won’t tell you in simple language how to calculate the confidence level for a given coefficient and sample size, because they insist you should set the confidence level before making a test. However, the Chinese site gives you the probability needed to calculate the confidence level, here, although you might have to navigate:
Statistics Tool Box / Statistical Tests / Correlation / Stat sig of r
Entering 9 for Sample size (n), and 0.83 for Correlation coefficient (r), got Results
t=3.9837; df=7; p=0.0053
which gives 99.47% confidence (from 100 * (1 – 0.0053)).
Before looking into the consequences of the correlation, there are:
Five points to bear in mind:
1) Correlation is not causation. In ‘The Mismeasure of Man’, Stephen Jay Gould listed “…my age, the population of México, the price of Swiss cheese, my pet turtle’s weight, and the average distance between galaxies” as variables which increased over time, and so had a high positive correlation with each other, although none of the variables causes any of the others. The wikipedia article about Gould’s book describes an old intellectual dispute with much argument about the role of correlation, and plenty of accusation about bias, which shows that statistics is about interpretation as well as number-crunching.
One way to misuse statistics is to claim that X causes Y when the only evidence is that X and Y are correlated. For instance, a significant correlation between eating organically produced vegetables and a long lifespan MIGHT mean that such vegetables are good for you, but one alternative is that poor people can’t afford organic veg and it’s their lack of wealth that affects their health. That’s why serious studies take care to exclude or correct for other factors than those being tested, but that’s not usually an option for company-specific tests.
If you search for ‘examples of spurious correlation’, apart from irrelevant results, the examples found are nearly all cases where a third variable is causing the variation in the two with the observed correlation. I would not usually call such correlation ‘spurious’, although an assumed cause and effect could be spurious. Sometimes the third variable is hard to measure. In the case I’m investigating here, it’s reasonable to assume that the quality and quantity of patents acquired affects both Patent acquisition costs (at the time of acquisition) and various measures of revenue including Revenue NRCF (without assuming which quarters that occurs in). There is no direct objective measure of the quality and quantity of patents acquired, so it’s reasonable to use Patent acquisition costs as a surrogate. If anyone objected to the statement ‘Patent acquisition costs cause Revenue NRCF’ I would regard the objection as technically correct but possibly pedantic.
2) Sample size and data quality. I use data from nine quarters, which is a small sample size. The sample size is taken into account in the confidence test. Even so, I suggest caution. Surveys can draw on thousands of samples, and the ‘big data’ companies can analyze millions of data points. The problem with having only nine quarters is compounded by the uneven spread of Patent acquisition costs, with two quarters having relatively high values. The Pearson correlation coefficient is not a ‘robust estimator’, because it’s less reliable when the sample size is small, and is particularly susceptible to ‘outliers’. An outlier is a point which does not fit the general pattern of other points, for instance if 47 of 50 points were close to a line when graphed, but the other three were a long way from the line. (See ‘Outlier test’ at the end of my chart-making instructions.) The data I’m looking into does not contain outliers (find ‘Spreadsheet and chart for regression’ below), but the correlation relies disproportionately on the two quarters with high Patent acquisition costs.
A scatter-plot of the Patent acquisition costs and Revenue NRCF could be seen as a cluster and two points, with seven points clustered in the bottom left, and the two points outside the cluster in the top right. With that approach, the two points and the center of the cluster lie more or less in a line. It’s tempting to replace the cluster with the point at its center, but there are two reasons for not doing that. 1) A sample size of 3 is far too small. 2) Replacing a cluster by its center loses information about the variation from its center, including variation that would reduce the correlation. (But in case you’re curious, doing that gives a coefficient of correlation of 0.9625 and confidence of 82.5% which is not statistically significant.) In any case, it’s prudent to be wary of the results for the correlation using nine points. Investors need to decide which risk matters to them:
1) The statistical risk, of relying on a suspect statistic to inform a ‘sell’ or ‘don’t buy’ decision.
2) The monetary risk, of losing money by ignoring a statistic which is suspect but could be innocent or only partly guilty (producing an exaggerated confidence level). Even if the ‘true’ confidence level is ‘only’ 90% instead of at least 99%, the odds are still in favor of the observed correlation not existing through sheer chance, unless you have reason to believe the correlation is unlikely.
While a statistician might regard the data as inadequate for correlation and regression, it’s not a hypothetical statistician who might be risking their money. Some statistics experts have failed spectacularly at investment (find ‘Long Term Capital Management’ below). Using more quarters might make the analysis more respectable, but risks drawing conclusions based on old data, when the company or conditions it operates in may have changed. Investors often have to make do with partial information and can’t simply increase the sample size. When a company beats or misses target growth of 10%, investors rarely care if 12% or 8% growth supports a hypothesis with statistical significance or not.
With a small sample size, the correlation could be blown away by one quarter with high Revenue NRCF and low Patent acquisition costs (which would be good news for shareholders).
3) Flukes. It’s possible to find a numeric correlation which only exists by chance, and is therefore unlikely to continue. It’s a similar situation to tossing a coin seven times and getting heads each time. If it’s a fair coin, the result is unusual, with one chance in 128 (and the same for getting seven ‘tails’), but if it’s really a fair coin with no tricks being played, there’s no reason to expect the pattern to continue. If you can find a large enough number of diverse data sets and start looking for correlations, you will eventually find numeric correlations that make no sense at all. A correlation between UFO sightings in California and the French stock index would be described as ‘spurious’ by statisticians as there is no plausible cause and effect between those variables, or a third variable that could cause the correlated variation in the first two (but that’s hard to actually demonstrate). The ‘fluke’ problem is less likely the higher the confidence threshold is set, but never disappears entirely. (See also ‘Negative results’ near the end.) The correlation between Revenue NRCF and Patent acquisition costs is doubtful due to the issues concerning robustness, and the present correlation might appear spurious if it fails in the future, but should not be dismissed as spurious now, because the revenue could plausibly be related to the costs.
“Predicting the Presidential Election with Baseball” By Martin Kelly, September 08, 2008. About.com Guide.
4) It’s only an estimate. The result of calculating the correlation coefficient is at best an estimate of the true correlation. At worst, the calculated coefficient is worthless due to the ‘fluke’ case, which could be more likely than the relatively high coefficient suggests due to the uneven distribution of the data.
5) Correlations can change. If social scientists tracked the correlation between IQ score and earnings, they might find a high correlation which changed little from one year to the next, but over decades a decreasing coefficient might reveal decreasing social mobility. In contrast, the correlation between leverage and earnings growth for comparable companies might quickly turn from positive to negative in a recession.
So what’s the problem?
When Acacia gets a high return on Patent acquisition costs in the same quarter as they occur, that’s good, not bad. The problem is that the immediate return is nowhere near big enough to cover the costs, so shareholders need the immediate return AND sufficient future returns.
The problem with the problem
It’s really the correlation between a quarter’s Patent acquisition costs and future Revenue NRCF that matters. Looking at the chart, I see little evidence of Patent acquisition costs producing much Revenue NRCF in later quarters, but that is subjective. Unfortunately, the data is not suitable for the kind of analysis that would measure correlation between a quarter’s Patent acquisition costs and future Revenue NRCF.
Suppose the top row here shows a company’s annual investment, and the row under shows income, for five years:
2 6 4 8 6
6 4 5 6 9
A correlation could be calculated for annual investment and income, using a spreadsheet as described above.
A correlation could also be calculated between investment and the next year’s income, by changing the data ranges, which can be illustrated by shifting the bottom row one to the left, like this:
2 6 4 8
4 5 6 9
That loses a precious data point, and it’s worse when correlating investment with income from two years later:
2 6 4
5 6 9
In my opinion, the problem of losing recent data points makes it impractical to measure the correlation between Acacia’s recent Patent acquisition costs and Revenue NRCF in later quarters. Some investors might be happy to rely on Acacia’s management’s statement that older investment (with enough time to measure its return) has produced a return on investment of 44%. Near the end, I see if Patent acquisition costs causing some Revenue NRCF in later quarters, is consistent with the data (under the heading ‘Controls’).
Correlation and Regression
The correlation coefficient of 0.83 (calculated above) gives me enough confidence to fit a line to the variables, while still being wary of the problems listed above in ‘Five points to bear in mind’.
Regression analysis is about fitting a curve so that one quantity can be estimated when others are known. I’ll be using ‘regression’ to mean fitting a straight line to two variables using the default method (Ordinary Least Squares), until I say otherwise about curves and the number of variables. Regression is a form of generalization, sacrificing detail in favor of a simple equation or line (or ‘rule’, to make an awful pun).
I mentioned earlier that correlation is symmetric, meaning that X correlates with Y in exactly the same way that Y correlates with X. Regression is different. Regressing Y on X is about fitting a line so you can read-off a Y for any X, and the way it works you get a different line if you regress X on Y (the two lines get more similar when the correlation gets higher). For regression, if the variables are called P and Q, you need to decide what’s more likely, P causes Q or Q causes P. If you decide it’s P causes Q, then you call P the independent variable and Q the dependent variable. That makes P like X and Q like Y on a graph, where the X axis is horizontal and the Y axis is vertical.
I hope these two steps simplify things:
Step 1: Decide what’s more likely, P causes Q or Q causes P.
If it’s P causes Q:
Step 2: Use the regression functions in a spreadsheet to regress Q, the dependent variable, on P, the independent variable. Q is the first data range inside the functions’ brackets.
If it’s Q causes P:
Step 2: Use the regression functions in a spreadsheet to regress P, the dependent variable, on Q, the independent variable. P is the first data range inside the functions’ brackets.
If you don’t think P causes Q or Q causes P, ask ‘Is there an R that causes both P and Q?’. If ‘yes’, regress P on R or Q on R. If ‘no’, there might be no point in doing regression.
Just because you decided that ‘P causes Q’ is more likely than ‘Q causes P’, don’t believe that P must really be causing Q.
It’s more reasonable to suppose that Patent acquisition costs cause Revenue NRCF than the other way round, although I’m using Patent acquisition costs as a surrogate for the unmeasurable quality and quantity of patents acquired. That means Patent acquisition costs should be the independent variable, making Revenue NRCF the dependent variable, and I regressed Revenue NRCF on Patent acquisition costs.
If you don’t use the function wizard, it can be worth checking the spreadsheet help before entering a regression function, for instance OpenOffice help has ‘SLOPE(data_Y; data_X)’ to tell you that the Y data goes on the left in the SLOPE function. Any guide to regression with two variables on a spreadsheet is likely to match this order by showing two columns with Y values on the left and X values on the right.
I use the basic regression functions that should be available in any spreadsheet. If you have the Analysis toolpack for Excel or don’t mind following the instructions to activate it, you can follow this instead:
“How to Run Regression Analysis in Microsoft Excel” wikihow.com (repeat of link near the top)
and then skip to ‘COEFFICIENT OF DETERMINATION’ further down this post. There are add-ons for OpenOffice Calc, but I haven’t found a good alternative to reading on.
The results of the functions SLOPE and INTERCEPT define the line which best fits the data points.
The slope (or gradient) of the line is calculated with:
‘SLOPE’ is the function, which takes two arguments, each of which is a data range.
‘C9:K9’ is the first data range, which specifies the nine cells where I’ve put the Revenue NRCF.
‘C5:K5’ is the other data range, which specifies the nine cells where I’ve put the Patent acquisition costs.
The result is 0.35, which suggests that on average every $1 of Patent acquisition costs results in an extra $0.35 of Revenue NRCF in the same quarter.
The intercept is what y is when x = 0, or where a line on a graph cuts the vertical ‘y’ axis. It’s calculated with:
‘INTERCEPT’ is the function, which takes two arguments, each of which is a data range.
The data ranges are the same as for SLOPE above.
The result is 24697418.59, which suggests that if nothing was spent on Patent acquisition costs, Revenue NRCF would be roughly $25,000,000, on average. It’s worth asking if this makes sense. In the long term, it depends on whether or not Acacia can get revenue sharing agreements that can generate $25m Revenue NRCF without incurring costs under the ‘Patent acquisition costs’ heading, which isn’t worth thinking hard about as management are committed to paying advances to share revenue on quality patent portfolios. In the short term, Acacia can be expected to collect revenue from patents acquired in previous quarters.
The intercept estimates ‘residual Revenue NRCF’ (my term), the Revenue NRCF in a quarter that is not accounted for by the quarter’s Patent acquisition costs, and is therefore the result of previous patent acquisition. Residual Revenue NRCF ought to be higher after quarters with high Patent acquisition costs, so the intercept can only estimate the average residual Revenue NRCF in the nine quarters. The higher the intercept the better, and $25m estimated average residual Revenue NRCF gives some hope that Patent acquisition costs are providing worthwhile returns in later quarters. However this is substantially undercut later when a control-case is constructed.
COEFFICIENT OF DETERMINATION
‘RSQ’ is the function that gives the coefficient of determination. (RSQ probably stands for ‘r squared’, where r is the coefficient of correlation).
The data ranges are the same as for SLOPE and INTERCEPT above.
The result is 0.69 (to two decimal places), and that’s a measure of how much variation in Revenue NRCF is estimated to be the result of variation in Patent acquisition costs. The coefficient of determination always equals the square of the coefficient of correlation, and as 0.83 * 0.83 = 0.6889 which is close enough to 0.69, that’s a useful check that there’s no mistake that would only affect one of the numbers.
Note that the variation of Revenue NRCF not explained by variation of Patent acquisition costs is made up about equally of points above the line and points below the line. That’s a result of fitting a line to the points, and not a feature of Acacia.
Reasons to chart regression
It’s easy to mix up the dependent variable and the independent variable when using spreadsheet functions. This can be discovered by making a regression chart with an equation as part of the chart, which can be checked against the results of the SLOPE and INTERCEPT functions. It’s worth seeing how well the line fits the points, and the chart means you can see if any points are outliers. See ‘How to make a chart for regression’ below.
The only data series I used to make the regression chart were Patent acquisition costs and Revenue NRCF.
Spreadsheet and chart for regression of Acacia’s Revenue NRCF on Patent acquisition costs
(formula view then results)
What does it all mean?
Bearing in mind that the data might not have produced a reliable correlation coefficient, you might want to look at the chart with the regression line and judge for yourself how well the line fits the points. Maybe I ought to say that the regression line should be ignored as the data is inadequate. In my opinion, that would be to ignore the importance of considering the costs associated with the risks of false positives and false negatives.
Outcomes and costs
If you find mushrooms growing outdoors but are only 90% sure that they are safe to eat, you don’t eat them. This demonstrates that the costs associated with outcomes need to be considered, as well as the probabilities. In traditional statistics, this is dealt with by setting a very high confidence limit. A forager might want to be at least 99.999% confident in the edibility of their mushrooms. If there’s a 90% chance of the mushrooms being safe, there is a 10% chance that they are unsafe, and since 10% exceeds the low 0.001% threshold for the hypothesis that the mushrooms are unsafe, turning the question around gives the same practical result of not eating the mushrooms. Now suppose a statistician says “There is reason to doubt the probabilities, so the hypothesis ‘The mushrooms are unsafe’ should be rejected.”. So the forager eats the mushrooms and gets poisoned. This illustrates that when there are doubts about the data, it’s important to default to the appropriate hypothesis. In the case of Acacia, an investor who is mostly concerned with not losing money could reasonably make ‘the regression line reflects reality’ the default to hang on to, whereas an investor more concerned with not missing an opportunity for profit could reasonably make ‘the regression line does not reflect reality’ the default.
Climate activists have advocated the precautionary principle, that it’s better to risk cutting CO2 emissions unnecessarily than to risk global warming. That was arguably a case of choosing the appropriate default. Although the data would put my nine points in the shade, there was much disagreement about appropriate models and other issues.
This article on six-sigma-material.com gives the traditional approach to the risk of false positives and false negatives, known as alpha risk and beta risk. I’m not surprised that the emphasis is on probability and theoretical correctness with no explicit mention of the associated costs.
The alpha and beta risks associated with believing the regression line I’ve plotted for Acacia, cannot be calculated, since the problem is the uncertain effect of the questionable data quality.
Using a traditional statistical test is like trying to convince someone who is skeptical but not dogmatic. The ‘null hypothesis’ (such as no correlation exists) is usually preferred over the hypothesis until substantial evidence supports the hypothesis. When statistics is relevant to an academic subject, that kind of gatekeeping is necessary to avoid a flood of weakly supported studies. Such bias against the hypothesis is often out of place in real situations. If an audience is at risk, proof that the risk is likely is not required before evacuating the theater.
If it’s real
Suppose that the regression line illustrates a real underlying cause-and-effect. That might not necessarily continue, as businesses and markets change, but as the shares aren’t absolutely dirt cheap, in that case I wouldn’t want to bet on the quick collapse of a cause-and-effect that I don’t properly understand. If future Revenue NRCF stays as close to the regression line as it has in the nine quarters considered, Acacia’s prospects are not good. Quarters falling near the top-right of the line will have Patent acquisition costs that are mostly not recovered in the same quarter, while quarters falling nearer the bottom-left of the line show less Revenue NRCF, out of which substantial costs have to be paid. You can see charts with cost breakdowns in my blog about Acacia, find ‘The bottom row’ and ‘raised to fund’. In the second chart I see possible ‘green shoots’ of past investment starting to pay off in the most recent quarter, (even though results disappointed the market and the shares were hammered). Those are subjective ‘green shoots’, not substantial enough to shift the regression line out of worry-land.
Estimated Revenue NRCF
For each of the nine Patent acquisition costs, I need to calculate a quantity I call ‘estimated Revenue NRCF’. This is given by the formula:
estimated Revenue NRCF = Patent acquisition costs * slope + intercept
where ‘*’ means multiply, ‘slope’ is the result of the SLOPE function, and ‘intercept’ is the result of the INTERCEPT function.
The formulas I used to calculate estimated Revenue NRCF are:
That’s like, y = c + m * x, which can be written as y = mx + c, which might be familiar as the formula for a straight line, where m is the gradient and c is a constant (on a graph, c is the y intercept). In my spreadsheet, the cell C19 holds the y intercept, and C18 holds the gradient (from the SLOPE function). The cells C5, D5, E5 etc hold the Patent acquisition costs. (The dollar signs in the formulas were useful when extending the formula by dragging.)
Charting the regression estimates
This repeats the first chart, with the addition of the Revenue NRCF estimated by the regression, and the amount that actual Revenue NRCF exceeded the prediction by.
Chart for Acacia’s Patent acquisition costs, Revenue NRCF, estimated Revenue NRCF and a difference
The difference ‘Revenue NRCF minus estimated Revenue NRCF’ is expected to fluctuate around zero in the quarters shown (as the result of regression fitting a line, as shown in the previous chart). Shareholders would want that difference to grow, or for the dashed orange line to bend up in future. There might be a little hope in the recent upward movement, which could be an early sign that previous acquisition is producing results in later quarters. That has no statistical significance.
One way to test an anti-aging cream is to treat only one half of someone’s face (aside from ethical issues). The untreated half is an experimental control. Many tests use a control-population, including randomized double-blind clinical trials where no-one knows who gets the placebo until all the data is in. Acacia is not an experimental subject and I can’t use a control comparable to those examples. What I can do is to plot an alternative scenario, which attempts to answer the question “If Acacia’s Patent acquisition costs produced additional Revenue NRCF in later quarters, would that show up on the charts shown?” In other words, is a better scenario consistent with the data. Obviously, if additional Revenue NRCF is produced nine quarters after patent acquisition, that won’t show up in data covering nine quarters, which allows better scenarios consistent with the data. The effect of a delay of 5 quarters might be mostly off the charts and into the future, as Patent acquisition costs shown were minimal until Q1 2012. Such happy scenarios are not impossible, but in my opinion would be too much like arbitrary constructions designed to dodge the data, and my control case assumes more even returns.
What I’ve done is:
1) Assume an overall 20% return of Revenue NRCF on Patent acquisition costs. Patent acquisition costs of $100 million in a quarter would provide that return if they produced an additional $120 million of Revenue NRCF four quarters later, or $144 million eight quarters later, although I go on to specify that the returns are more evenly spread. 20% might seem like a high bar to clear, but this is not earnings or cash flow, and substantial costs have to be paid out of Revenue NRCF.
2) Assume that only 20% of Patent acquisition costs contribute to Revenue NRCF in the current quarter. This contradicts the 35% estimated by the regression, but some reduction is necessary to offset the estimated Revenue NRCF that the control is going to add in.
3) Assume an ‘economic life’ of six years (or 24 quarters). Management have indicated the average useful economic life of Acacia’s patents is six to seven years.
4) Assume that Patent acquisition costs in Quarter n produce a return in the subsequent quarters n+1, n+2, n+3,.. n+22, n+23 (the same in each of those quarters).
5) A tedious calculation showed that $100 million Patent acquisition costs in Quarter n had to produce additional Revenue NRCF of $5.74 million in the subsequent quarters (n+1 to n+23) to produce a return of Revenue NRCF very close to 20% pa. (I used an iterative method, and got a net present value of 99.94 when 100 would have implied a return of exactly 20%.)
6) The contribution from previous Patent acquisition costs was calculated for each quarter. For instance taking Q4 2011, the contribution equals the sum of Patent acquisition costs from the previous three quarters (Q1, Q2 and Q3 2011) multiplied by 5.74% (or 0.0574).
7) A ‘control estimate’ was calculated for each quarter by adding the contribution from step 6 (previous quarters), the contribution from step 1 (current quarter), and the Y-intercept obtained by regression previously. That Y-intercept represents the Revenue NRCF estimated for a quarter with zero Patent acquisition costs, by the regression (where only the same quarter’s Patent acquisition costs affect the estimate).
8) Lines representing the contribution calculated in step 6, and the control estimate from step 7, were added to the previous chart.
Chart for Acacia’s Patent acquisition costs and Revenue NRCF with control-case
The chart shows that the ‘control contribution’ from previous Patent acquisition costs is not massive, but is not insignificantly small (it’s the dark blue line which starts near zero then turns positive from 2012 Q2).
The control estimate (of Revenue NRCF, in orange) tracks Revenue NRCF (green) fairly closely, with a correlation of 0.7255. Correlation is not the best measure of how closely the control estimate tracks Revenue NRCF, as the correlation coefficient would not change even if the control estimate was shifted up by adding a massive constant, or if each control estimate was multiplied by two. The tracking errors are calculated soon, but it can be seen that tracking is reasonably close by comparing the orange line for the control estimate to the green line for Revenue NRCF. Referring to the results under the heading ‘Statistical significance’, the correlation of 0.7255 exceeds the 95% confidence threshold of 0.6664. By that measure the control scenario is feasible, at least for explaining variation in Revenue NRCF, but is statistically less likely than the pure ‘regression’ scenario, exceeding 95% confidence compared to 99% confidence. (The regression estimates and the actual Revenue NRCF have a correlation coefficient of 0.8330, the same as between Patent acquisition costs and Revenue NRCF).
While those correlations have the advantage that they can be compared to confidence thresholds, a better measure of how closely the estimates follow Revenue NRCF is to accumulate the errors. The simplest way to do this is to use absolute errors, so +4 and -5 sum as +4 + +5 giving 9 rather than -1 ( because you don’t want errors in different directions to cancel out). After averaging the absolute tracking errors by summing then dividing by nine, regression gives $11,249,965 and the control gives $12,697,213 as the average absolute tracking error, showing that the control tracks the observed Revenue NRCF with 12.86% more absolute tracking error than the regression-based estimates.
Statisticians like to square errors, sum the squares, divide by the sample size and take the square root, calling the result the ‘root mean square’ or ‘RMS’ error. Regression gives $12,252,055 and the control gives $15,538,282 for the RMS tracking error, showing that the control tracks the observed Revenue NRCF with 26.82% more RMS error than the regression-based estimates. On this measure, the control is well behind the regression, producing estimates a lot less consistent with actual Revenue NRCF. This dims the hope that the control case might have offered. The regression line is better at estimating recent Revenue NRCF than the control scenario. The side of the face with the anti-aging cream really does look younger than the other side, even though the other side has few wrinkles. The failure of the control is not encouraging for shareholders. I’ve tried shifting the control estimates down slightly (by $320,879) to balance the ‘too high’ and ‘too low’ tracking errors, but it makes very little difference to either the average absolute error or the RMS error.
I mentioned earlier that the Y intercept of $25 million produced by the regression, could be seen as representing an estimate of average ‘residual Revenue NRCF’ (Revenue NRCF not accounted for by multiplying Patent acquisition costs in the same quarter by the regression’s ‘slope’ of 0.35). Since residual Revenue NRCF is the result of previous patent acquisition, that $25 million gave some hope that Patent acquisition costs might provide a worthwhile return in later quarters. That hope has been been tested by the control case, which failed to explain actual Revenue NRCF better than the regression.
Alternative scenarios to the regression and the control-case could be considered. Anyone determined to find a scenario which fits the data well, can do so with enough fiddling about. Anyone determined to build a highly profitable scenario can assume big returns after nine quarters. That would not provide reasonable reassurance. Anyone who finds an encouraging and reasonable scenario should record the steps taken to find it. As a variation on Occam’s Razor, I would say that unless it explains the facts better, the harder it is to find a theory which fits the facts and has a desirable outcome, the less likely it is to be realistic. It’s possible that two aggregates for the nine quarters limit the scope for finding a happy alternative scenario. Total Patent acquisition costs $347 million, total Revenue NRCF $344 million. This highlights the situation where Revenue NRCF (and income) needs to increase substantially, and the signs of it happening so far are not definite, although the major patent acquisitions only started in Q1 2012, while the latest results are for Q1 2013.
The ‘gloomy view’ is that the control-case failed, and no cheery alternative scenario is reasonable, leaving the regression line as the best description of the relation between Patent acquisition costs and Revenue NRCF, with no compelling reason to believe the situation will change soon. According to that view, Patent acquisition costs are not returning the investment, and the lower they are, the better.
The control case was based on a 20% return of Revenue NRCF from previous Patent acquisition costs, and while that’s better than the regression scenario (which implies that high costs are never fully recovered), further analysis is required to see how well the company could perform with that 20% return, bearing in mind that the return is before costs (except for royalties and contingency fees).
In favor of the control estimate of Revenue NRCF, it was from my first attempt at reasonable though highly simplified assumptions. There are puzzles where you try many solutions and think “Aha!” when you find the right one, but in statistics, usually the more you have to fiddle about with assumptions to get an acceptable result, the less useful the assumptions are, and either the model is useless or it’s trying to tell you something you don’t want to know. One case is backtesting share-picking criteria. If you keep putting new variables in, you can thrash the index, but only in the past.
It’s worth asking if the results could be introduced by the method rather than truly reflect the data. I’ve seen old Westerns on TV where the wagon wheels looked as if they were turning the wrong way, and that was an artifact of the media technology rather than how wagon wheels behave. You could take any company and derive a regression line supposed to show how quarterly revenue varied according to investment in the same quarter, with about as much error above the line as below the line. One way to check that the regression chart for Acacia is not an artifact involves making comparable charts for other companies, but that’s complicated by Revenue NRCF being specific to the business of exploiting intellectual property.
If the regression line was only an artifact, there would be no reason to expect the regression estimates to track Revenue NRCF at all well. The failure of the control scenario to outshine the regression is also evidence against the regression line being a mere artifact.
What shorters could say
The less-inhibited share-shorter could write an article featuring charts like mine, claiming that Acacia is a ponzi scheme, raising cash for patent acquisition that never returned all the cash, while management claimed performance was excellent and paid dividends to keep the share price up for future capital raising. Please note I believe that’s unlikely. If you think the statement is excessive, maybe you haven’t seen share-shorters in full flood.
Any Acacia shareholders thinking of selling after reading this, should check my work or get someone qualified to check it, and check that I’ve entered the numbers correctly. The GIGO principle (Garbage In, Garbage Out) doesn’t respect care taken after data entry.
I mentioned earlier that the correlation between leverage and earnings growth for comparable companies might quickly turn from positive to negative in a recession. The same applies to regression. A regression line for leverage and earnings growth (for comparable companies) might hold steady in the boom years, but shift dramatically in a recession. Predictions can be based on regression lines, and the example shows that the likely stability of the regression line should be considered when thinking about such predictions.
Why do I hold the shares?
I regard Acacia as a risky speculation and it’s only a small part of my portfolio. Many articles list the attractions, including the historic 44% return on investment claimed by management, and good though volatile growth in the past. The various reasons-to-invest suggest the regression line won’t hold for much longer. However the reasons-to-invest depend to some extent on information from management, and if the regression line holds, those reasons fail. If the regression line holds, I might feel like giving myself a hard kick for not selling.
I find 99% confidence hard to ignore, even with some doubts based on the distribution of the data, and if future quarters land around the regression line the company is likely to perform poorly. And yet, I think there’s a lot more than a 1% chance of Acacia’s next quarter being further off the regression line than any of the points in the regression chart (although I can’t quantify ‘a lot’), based on the general investment case. Many investors would prefer the objectivity of the regression line, but they might not be following Acacia.
I’ll repeat that investors more concerned with not losing money could reasonably assume that the regression line will hold, as it’s the safest assumption in a monetary sense, though not in a statistical sense. Investors more concerned with not missing an opportunity for profit can more reasonably reject the regression line on statistical grounds, due to the possibility that the Pearson correlation coefficient is not a sufficiently robust estimator for the nine quarters of unevenly distributed data. Investors who weigh reward against risk need to make their own adjustment to the risk side.
Investors also need to consider all the issues that they normally consider before making an investment decision. Recent developments include a likely investigation into patent trolls, new patent acquisition and the CEO retiring soon. This investigation on behalf of investors is dated Jul. 16, 2013 but I won’t trust the date without confirmation.
year 2000 2001 2002 2003
cats 1001 1003 1005 1007
rats 1111 2222 3333 4444
has its data (or data series) in rows, because items in a row are distinguished by the year. A table with times, dates, or serial numbers as column headings would generally have data series in rows, and if transposed would have data series in columns. When regressing P on Q, if P is a row in the table then Q should be a row in the table, and the data series are in rows. I hope that explains what I mean when I say that in spreadsheets, the data series are usually in columns, but in this post the spreadsheets fitted better with the data series in rows. When you make a chart (select the table and choose ‘Chart…’ from the ‘Insert’ menu), you need to select either ‘Data series in rows’ or ‘Data series in columns’. A data series has its data in a data range, for instance I’ve put Patent acquisition costs (a data series) in cells C5:K5 (a data range, in the fifth spreadsheet row, from column C to column K).
List of steps
This list is not meant to be definitive, and is more applicable when there are concerns about the data. With plenty of well distributed data, no outlier problem, a high coefficient of correlation and a chart showing points close to the regression line, the task is simpler, but it’s still worth remembering that correlation does not prove causation.
To investigate two variables that change over time.
Give up if correlation isn’t plausible.
Decide on data series in rows or columns (usually columns).
Decide on independent and dependent variable.
Collect the data in a spreadsheet table, with the independent variable before the dependent variable
(i.e. independent variable in a row above or a column left of the dependent variable).
Chart the variables over time.
Get correlation coefficient.
Decide on confidence threshold.
Test correlation coefficient against confidence threshold (to test significance).
Give up if test failed.
Think about ‘Five points to bear in mind’ to avoid jumping to conclusions:
Correlation is not causation.
Sample size, data quality, robustness of the correlation coefficient.
It’s only an estimate.
Correlations can change.
Run regression (dependent variable on independent variable).
Check regression r-squared = square of correlation coefficient (if it isn’t, somethings gone wrong)
Calculate the regression estimates.
Draw the regression chart (with the independent variable on the X axis).
Check how well the regression line generalizes the points.
Think about what it means.
Consider outcomes and costs.
Think about consequences if the regression line holds in the future.
Calculate variation from estimates (dependent variable – regression estimate).
Make chart, of both variables, regression estimates, and variation from estimates, with time on X axis.
Think, would a control case be possible and relevant?
Construct control case.
Calculate control estimates.
Add control estimates to previous chart.
Stop if control estimates look miles off the dependent variable.
Get correlation coefficient for control estimates and dependent variable.
Find a confidence threshold for the correlation coefficient.
Compare correlation coefficients for control and regression (each is with the dependent variable).
Compare the highest confidence found for the two correlation coefficients.
Calculate average tracking errors (absolute and root mean squared, for control and regression).
Compare the tracking errors.
Think about relative likelihood of the regression results and the control case.
Think about if the results could be an artifact of the method.
If the regression line seems to be a plausible generalization, think about how stable it might be.
If you have the time, sleep, think, and write it up. Imagine a critical reader who will check everything.
Remember it all depends on the accuracy of the numbers.
Condensed list of steps
the ‘Five points to bear in mind’
which variable is independent?
chart with regression-estimates, in date order
does the control-case fit better than the regression?
How to make a chart for regression
A scatter-plot is made, then the chart is positioned and sized. These are added to the chart: the regression line, the equation of the line, and the r squared value. The chart can be edited and should be checked. If the SLOPE function and the INTERCEPT function have been used, their results can be checked against the equations in the chart.
If you have Excel
The method is broadly the same in Excel as in OpenOffice, but you might find one of these useful:
“Quick and Dirty Regression Tutorial” udel.edu (repeat of link near the top)
The tutorial above includes how to activate the Analysis ToolPak, which is needed for the regression utility.
“How to Run Regression Analysis in Microsoft Excel” wikihow.com (repeat of link near the top)
The guide above has more screenshots, and uses some to explain how to activate the Analysis ToolPak, ‘with the ribbon’ and ‘without the ribbon’.
“Doing a Linear Regression Analysis, Using Excel (version from Office ’97)” Laboratory in Cognitive Psychology
Rows or columns
I’ve split the explanation into two cases, data series in rows, and data series in columns, with a piece common to both cases at the end. Having the series in columns is more usual, and is better when the series are long.
How to make a chart for regression, data series in rows
I’ll be using the Acacia regression as an example, and it will help if you can refer to the relevant images, or have the spreadsheet open.
The X axis
The top row selected will define the quantity along the X axis, and you want this to be the independent variable, which in this case is Patent acquisition costs.
Selecting from the table when data series are in rows
In the Acacia regression I put Patent acquisition costs above other series, just under the column headings (quarters Q1 2011 etc). The quarters aren’t needed for the regression, so Patent acquisition costs are the top row to select, which makes them the quantity measured along the X axis. The area of the table to select includes rows that aren’t wanted, but they are easily removed from the chart. Include the row headings (on the left) in the selection. These are used to label the X axis and, in the key or ‘legends’ area, the points drawn. The area to select is B5:K9. That area includes Patent acquisition costs in row 5, and Revenue NRCF in row 9, with three rows in between to remove later.
In the Insert menu, click on Chart…, and the Chart Wizard starts. You’ll see a chart made with default settings, usually too small and sometimes all you can see is the key with labels taken from headings. There’s no point in clicking on the chart until you’ve exited the wizard.
Step 1. Chart Type
Choose XY (Scatter) and Points Only style.
Step 2. Data Range
Choose Data series in rows. Tick First column as label. This means that ‘Revenue NRCF’ which is in the first column, will be used to label the points, in the key or ‘legend’ area on the right.
Step 3. Data Series
Remove rows you don’t want. Just click on a row (in the list under ‘Data series’) that you don’t want to include, and click Remove. That’s the three that aren’t ‘Revenue NRCF’.
Now skip to ‘How to make a chart for regression, finishing’.
How to make a chart for regression, data series in columns
In this made-up table…
year.. | animal population
——– | cats | dogs | birds
2001 | 1987 | 1230 | 2845
2002 | 2120 | 1287 | 2757
2003 | 2586 | 1321 | 2702
2004 | 2465 | 1303 | 2731
2005 | 2630 | 1296 | 2664
…the table is suitable for charting with a regression line for ‘birds’ on ‘cats’, because the independent variable ‘cats’ is to the left of ‘birds’.
The X axis
The leftmost column selected will define the quantity along the X axis, and you want this to be the independent variable, which in this case is ‘cats’.
Selecting from the table when data series are in columns
The part of the table to select is:
cats | dogs | birds
1987 | 1230 | 2845
2120 | 1287 | 2757
2586 | 1321 | 2702
2465 | 1303 | 2731
2630 | 1296 | 2664
The X axis will represent cats, as the column ‘cats’ is on the left of the selection. The column ‘dogs’ is easily removed later. (If you want to practice using the table, find ‘Text Import’ below.)
In the Insert menu, click on Chart…, and the Chart Wizard starts. You’ll see a chart made with default settings, usually too small and sometimes all you can see is the key with labels taken from headings. There’s no point in clicking on the chart until you’ve exited the wizard.
Step 1. Chart Type
Choose XY (Scatter) and Points Only style.
Step 2. Data Range
Choose Data series in columns. Tick First row as label. This means that ‘birds’ which is in the first row, will be used to label the points, in the key or ‘legend’ area on the right.
Step 3. Data Series
Remove columns you don’t want. Just click on a column (in the list headed ‘Data series’) that you don’t want to include, and click Remove. In this case there’s only ‘dogs’ to remove.
How to make a chart for regression, finishing
This part applies to both cases, data series in rows and data series in columns.
Step 4. Chart Elements
You can give the chart a heading, put names on the axes, and choose the kind of grid you want in the background. I kept the default grid of horizontal, and did not add titles etc.
Move and stretch
Click OK to exit the Chart Wizard. The chart will have a gray border, which shows it’s in edit-mode. Press the Escape key to leave edit-mode so you can move the chart to where you want it. An alternative to the Escape key is to select a cell and click back on the chart. Copy and paste might be quicker than dragging a long way. Then you need to stretch the chart to make it the right size and shape.
Double-click the chart for edit-mode (or right-click over it and click Edit at the bottom). This gets options under the Format menu which are very like the steps in the Chart Wizard.
With the chart in Edit mode, right-click over a data point, and click on ‘Insert Regression Curve’ in the menu. That should add the regression line. Right-click on the regression line and choose ‘Object Properties’, which gets a dialog with a tab for ‘Equation’. Under the ‘Equation’ tab, tick ‘Show equation’ and tick ‘Show correlation coefficient (R squared)’. (BTW clicking ‘Insert Regression Curve Equation’ won’t get the R squared value.)
The equation, the key (or ‘legend’) and sometimes other parts of the chart can be dragged to new positions. Double-clicking the parts brings up some options, and right-clicking then clicking ‘Object Properties’ usually gets the same options. Right-clicking also gets ‘Position and Size’ in the dropdown menu. When part of a chart is selected, the Escape key de-selects it, and pressing Escape again will exit edit-mode.
The Statistics tab
The Object Properties for the data points usually includes a Statistics tab. If you can’t see it, try exiting edit-mode, then double-click to get back to edit-mode, then right-click over a data point (without selecting it first). The Statistics tab includes a choice of regression curves. The initial default (on the left) when you make the chart is ‘no regression line’. After clicking the ‘Insert Regression Curve’ option, you get a straight regression line, but there’s also a choice of logarithm, exponential and power. In principle, an exponential regression line could be useful for fitting a trendline to a growing company’s revenue or profit, but that would be on a graph with time on the X axis, not on a scatter-plot. For each curve, there should be an equivalent regression function for numeric results. It helps if there’s a good theoretical reason for a curve, such as the exponential curve for a company growing at a fairly steady rate. When projecting a trendline into the future, remember that things change. Imagine doing a projection for Kodak not long before they were walloped by digital technology.
There’s no good reason to fit an exponential curve to the points I scatter-plotted for Acacia, so naturally, I had to try it. The fit did not look unreasonable, but the R-squared value dropped from 0.69 to 0.43, a statistically insignificant value (taking the root, r = 0.6557 which is just short of the 95% confidence threshold). Part of the reduction of R-squared will be due to a reduction in the degrees of freedom, because curves need more numbers to define them than a straight line does (an exception is ‘lines’ of latitude or longitude), for instance many circular arcs pass through any two points, but only one straight line.
Always get the R-squared value. I can’t help wondering how many spreadsheet users get a trendline, and maybe click “Insert Regression Curve Equation”, but don’t bother to get R-squared. I plotted the variables in my table with the quarters on the X axis. It was too cluttered to see trends, and when I made a regression line for Revenue NRCF, the healthy uptrend was not obviously wrong. At best, the trend is slightly less than probable, because including R-squared showed it to be a low 0.07, square-rooting for the correlation coefficient got 0.26, and a confidence level of p=0.4993 or almost an even chance. If there is a trend in Revenue NRCF, then nine quarters is not long enough to pick it out of the volatility with much confidence.
Another option on the statistics tab is the different kinds of error margins that can be shown. Selecting ‘Standard deviation‘ (a measure of variation in a population) and the little box with a bit of line above and below, shows that the Acacia regression line is within one standard deviation of each point, although “each point is within one standard deviation of the line” sounds better. That could be used as an objective test that there are no ‘outliers’ (which affect non-robust estimators like the Pearson correlation coefficient).
Check the chart
If it’s not what you expected, the error could be your expectations or anywhere from the data source to making the chart, depending on the observed error. If the SLOPE function and the INTERCEPT function have been used, check their results against the equations in the chart. If the coefficient of correlation has been calculated, square it and check that it equals R-squared in the chart.
year.. | animal population
——– | cats | dogs | birds
2001 | 1987 | 1230 | 2845
2002 | 2120 | 1287 | 2757
2003 | 2586 | 1321 | 2702
2004 | 2465 | 1303 | 2731
2005 | 2630 | 1296 | 2664
To import the table above, first copy it into an editor which allows search and replace (e.g. Microsoft Word). Then replace each ‘ | ‘ with a space, copy the whole table, pick a cell in a spreadsheet, and paste.
If you get a dialog headed ‘Text Import’, then in OpenOffice, under ‘Separator options’ make sure ‘Separated by’ is selected, and tick the ‘space’ box. A small preview in the bottom left should show that the table will cover a range of rows and columns. Click OK.
If instead the text is pasted into a single column, select the cells in the column with the text, and in the Data menu, click on ‘Text to columns…’, and the dialog you see is the same as above except for the heading.
I used the space character as the separator. Other standard separators are tab, comma and semicolon, but you can use any character, tick ‘Other’ and enter it. Use a separator that doesn’t appear in the text.
Transposing means swapping rows and columns of a table, like from this:
1 2 3
4 5 6
Tables usually have headings at the top, with times or dates listed down the left, which makes mine the wrong way round. The second sheet in my downloadable spreadsheet has the table the usual way round, so you shouldn’t need to transpose the table, but it’s handy to know how anyway. This is a short guide with good screen-shots:
“Transpose Excel data from rows to columns, or vice versa” By Jeff Davis, 2009. techrepublic.com
I mentioned near the top that Acacia’s business was not seasonal. When there’s a seasonal effect (as for ice cream sales) and you need to look at quarterly data, you have a problem, and it’s a similar problem if there’s a weekly cycle and you have daily data, etc. The first thing to do is to get a chart so you have an idea of the situation. For correlation or regression, get a correlation on the raw data, and if it’s high enough, you might not need to de-seasonalize. This is the best relatively short guide I found for isolating and removing cyclical effects:
“How to Analyze Seasonal Sales in Excel” by Charley Kyd. exceluser.com
The article is OK at describing features that might not be familiar, like using the COUNT function to count the entries in a column, and ‘IF’ to stop calculating at the end of the data. It tells you when to use the Create Names dialog but doesn’t take you through it step by step. ‘Create Names’ is used for instance so the heading ‘SlsRatio’ can be used as a name in an array formula, instead of specifying the data series using row and column ID as in D12:D64.
This pdf was linked to above as a way to make a scatter plot with a line fitted. It leads up to estimating lost income for insurance cases, which involves trends and seasonality. It’s good if you have the time to read 70 pages, have Excel and can get to Regression via the Data tab on the ribbon and Data Analysis in the Analysis group.
When looking at data over several years with high inflation, you might want to produce figures adjusted for inflation before throwing data into stats functions. Use common sense about it. In this post, apart from the control case, the tests only involved revenue and costs in the same quarter, so inflation was not an issue.
Skip to ‘Math and sense’ if you want to dodge the complicated bit.
I’ve already mentioned exponential growth because it’s possible to fit an exponential trendline to points. The R-squared value and some of the error indicators can show how well the trendline fits. Another way is to calculate the logarithm of each value (nearly always the dependent variable, on the Y axis), and make a chart using the logarithms. A perfect exponential series (like 1, 2, 4, 8, 16 or 1, 10, 100, 1000) will be transformed to a perfectly straight line, and if the result looks nearly straight or like a straight line with noise, fitting an exponential trendline will be more reasonable. The equation for 1, 2, 4, 8, 16 is f(x) = 0.5 * 2.00^x (after choosing ‘number’ and two decimal places in the equation’s Object Properties). The 0.5 is just an initial value for when x = 0. The 2.00 corresponds to the growth rate, and if x is time in years, then the amount doubles each year. To convert that to percentage growth, it’s g = (2.00 – 1) * 100. For 1.06^x in the equation, it’s (1.06 – 1) * 100, which gives 6% growth pa.
This next link has graphs showing how taking the log can put points almost in a straight line, and is generally relevant. (Don’t worry if you can’t follow the math. I think ln y = 0.016 + 2.64x should be ln y = 0.016x + 2.64).
“Exponential Regression” by Charles Zaiontz, 2013
If you find Figure 4, you’ll see how LOGEST produces two columns of numbers. The top left is a slope, and the top right is an intercept. These are different to the slope and intercept of linear regression, which was like
LINEAR: y = slope * x + intercept
With LOGEST it’s:
EXPONENTIAL: y = slope ^ x * intercept
You can also look up LOGEST in the help, or try this exercise based on a downloadable spreadsheet:
It’s well known that a company can’t grow at double digits for too long, or it would outgrow the world economy. However Moore’s Law has seen exponential growth in computing power for decades, which has some relevance to business.
I’ve kept this post simple (relatively), and the Acacia problem only involved two variables. When more variables are involved you need ‘multivariate analysis’. Have the data series in columns and use LINEST which will be described in the spreadsheet Help. Unfortunately I can’t even get a three dimensional scatter-plot so visualization is hard and verification from the chart’s equation and R-squared is impossible, at least with my version of OpenOffice.
Umbrella sales might be high on a wet day, but low on the third day of rain due to fewer pedestrians without umbrellas. If enough data for daily umbrella sales and rainfall is available, a simple model would be more accurate than regression. A model could estimate the number of pedestrians without umbrellas, and update the estimate according to the day’s weather.
That’s just to alert you to the possibility that modeling could be better than regression in some cases. I can’t find a good link for that level of modeling.
Math and sense
For every stats function in a spreadsheet, there’s math behind it that most investors won’t want to know about. What you need most is judgment and common sense. Investors do better with sense and no math than with math and no sense. You might have heard about the Nobel Prize winning geniuses at Long Term Capital Management (wikipedia) who got too big to fail and had to be bailed out.
The ‘Big Data’ trend would be impossible if a human had to think deeply about the results of every statistical test. Thought is needed early in the process, after which automation takes over. In “Big Data: A Revolution That Will Transform How We Live, Work and Think” by Viktor Mayer-Schonberger and Kenneth Cukier, the authors claim that we can’t always obsess about causality. They are quoted in a piece about the limits of Big Data:
“Big Data Hasn’t Changed Everything” By PHILIP DELVES BROUGHTON, July 1, 2013. The Wall Street Journal.
Bayes theorem and probability
Bayes theorem could be seen as a systematic method for updating opinions as new information arises, using equations for conditional probability. Conditional probability is like the probability that you’ll eat ice cream today if the weather is hot, whereas plain vanilla probability is just the chance that you’ll eat ice cream today.
The first half of this might be too simple:
“The Bayes theorem, explained to an above-average squirrel” Michele Reister and Bob Lucas. blogs.sas.com
The second half touches on the debate about Bayesian versus ‘frequentist’ or traditional statistics. In my opinion, the theory of traditional statistics has enshrined the customs surrounding its use. The word ‘hypothesis’ suggests a new idea that needs substantial evidence before including it in a body of knowledge. When you only care if a proposition is more likely to be true than false, it’s logical to apply a 50% confidence threshold. Professionals who publish that they’ve used a 50% confidence threshold risk mockery, but if probability theory gives a probability of 0.51 (or 51%) for an outcome or inference, that’s normal. Big Data could be using low probabilities for small decisions until the data has built up. Sales are not canceled if they are made before an advert or offer has been optimized, so there is no clear division between experiment and application. But it works. The point I’m making is that sometimes the balance of probability matters, but the customs of traditional statistics won’t let most of the merely probable through the door.
I couldn’t find examples of spurious correlations that were obviously worth browsing to. I’m not surprised that various preferences and opinions are correlated, but you can find ten here if you like, and contribute by answering trivial questions.
If you dare you can try this on businessinsider.com:
AD-INFESTED CPU HOG “10 Crazy Correlations Between Unrelated Things” Gus Lubin and Tony Manfred, Jun. 17, 2011.
There might be some good examples on New Scientist if you subscribe.
In contrast, claimed coincidences are easy to find (on understandinguncertainty.org), just scroll down to the list.
The ‘hemline indicator’ theory is that if women wear shorter skirts, shares go up. It seems more likely that a booming economy might affect fashion, but it’s hard to find a serious study that’s publicly available. The hemline indicator won’t go away as it’s an easy way to fill space in publications. Other off-beat indicators include the lipstick indicator and the skyscraper index. In my opinion, none of those should be dismissed as spurious without investigation, just because they seem unconventional or might get too much publicity.
Composition or aggregation can affect plausibility. ‘High income causes high spending’ is plausible for individuals and for an economy. Most readers might agree that ‘High spending causes high income’ is not plausible as a general statement about individuals, but for an economy, one person’s spending is likely to be someone’s or something’s income, making the statement plausible.
If you google this:
“But if you look at university parking lots, you don’t see a lot of Rolls Royces”
you’ll find a story that had a lot of coverage. Research scientist Jay Zagorsky conducted a study and concluded that higher IQ does not predict more wealth, or less chance of financial difficulty. The study may be impeccable, but the remark about professors’ “old, low-value vehicles” seems careless. I’m not the first to point out that professors might feel less need to display material status symbols. That’s just to show how you need to be careful about assuming causation.
Suppose there is a scientific reason to suspect that some item could cause cancer. A study that fails to give evidence of a cancer risk might fade into obscurity. The study is likely to be repeated, as few researchers know the previous negative result. The more times the study is repeated, the greater the chance of a ‘fluke’ giving evidence that the item is a cancer risk. The problem is recognized and there are efforts to minimize ‘publication bias’: “Science Journal Says It’ll Publish Negative Results For Free” By Francie Diep, 05.30.2013. popsci.com. In science, researchers attempt to replicate other researchers’ results, and the consequence of forgotten negative results is a waste of time. In commerce and politics, results can be hidden or publicized according to their propaganda value, although I don’t mean to imply that scientists are always rational and dispassionate.
An infamous outlier
In full: “An undeniable problem in Antarctica” berkeley.edu
My short version from various sources: In the early 1980s, scientists from the British Antarctic Survey discovered the hole in the ozone layer over the Antarctic during winter. They used an old spectrophotometer which was difficult to keep going in the extreme cold, and at first they thought the instrument was malfunctioning, especially since NASA’s satellites should have picked up a real drop in ozone concentration. When their data was good enough for publication, it emerged that NASA’s satellites had collected the relevant data, but automated filtering had rejected the outliers. After realizing their mistake, NASA had another look at their old data and found an enormous hole in the ozone layer.
There’s an add-on for Excel, available free to all from a school. I can’t vouch for it, but it might be worth trying or looking into. Features include: ‘Linear Regression (linear slope and intercept including 95% confidence intervals)’.
The legendary British sports commentator Brian Johnston used to give a radio commentary on the Oxford vs Cambridge boat race. Often, he couldn’t get a good view, but he noticed a flagpole on a hill where the height of a blue flag and a white flag indicated the relative positions of the boats. After relying on these flags for much of the time during his forty-two years of broadcasting from the race, he was at a party where a friend pointed out the man who raised and lowered the flags. Brian Johnston asked him “How did you know where the boats where?”. “That’s easy,” the man replied, “I listened to Brian Johnston on the radio.”.
Infrequently Asked Questions
IS THE WRITING ABOUT ACACIA DISPROVABLE?
No. (Scientists prefer theories that can be disproved if they’re wrong.) The correlation and regression line will soon be tested by Acacia’s Q2 2013 results (released on July 18 2013), but I have not produced strong enough evidence to make a confident prediction myself.
DOES THE ANALYSIS HELP ACACIA SHAREHOLDERS?
I think so. It helps to look at an investment from different viewpoints, especially as many investors are looking at the same information in the same way.
DO YOU EXPECT ME TO CORRELATE AND REGRESS QUARTERLY INVESTMENT AND REVENUE EVERY DAY?
No. I hope I’ve added tools to some readers’ investment toolbox, not for everyday use, but to aid investigation when two variables might be related. Some readers may be better equipped to be critical about a correlation or regression in a presentation.
WAS IT ABOUT ACACIA OR STATS?
Stats. I didn’t see any point in writing an abstract lesson or one with simple made-up examples, as there are plenty of those already.
Added September 03, 2013
If you take this chart:
… and remove the quarters with high Patent acquisition costs (Q1 2012 and Q4 2012):
… it looks as if growth tailed off some time in 2010. Since then there is no particular trend, and it’s possible to pick a starting quarter that shows a downtrend when a trend line is fitted, although with an R-squared coefficient of 0.004 (shown as 0) indicating no meaningful significance.
There will be an average, even though the quarterly NRCFs are so volatile that the variation is not a lot less than the average. Now suppose the large Patent acquisition costs were accompanied by licensing deals which added about a third of the Patent acquisition costs to Revenue NRCF in the same quarter. That could be enough to make the regression chart look as it does.
Now suppose that Revenue NRCF continues to stagnate in the absence of large Patent acquisition costs, and management have a policy of only incurring large Patent acquisition costs when they can recover a substantial proportion in the same quarter. Those conditions could be enough to keep Revenue NRCF in a similar relationship with Patent acquisition costs, as in recent quarters. That is not a prediction, due to the conditional statements and the variables involved. Growth might resume, or management might be able to license an even greater proportion of large Patent acquisition costs in a quarter, or they might not want to miss an opportunity to acquire attractive patents even if they can’t license them in the same quarter. However, the equation beat me in Q2 2013. I mostly expected Acacia’s results to improve sooner rather than later, due to the usual reasons concerning marquee portfolios, expertise, relationships etc. The result was at the bottom of the cluster, and in the charts above is clearly below the average, although too much should not be read into a single quarter’s result. While management defended the Q1 2013 result by saying they achieved the second highest revenue in the company’s history, the attitude now is more that everything is in place, and profit will come, but with no set schedule.
I’ve recently added charts to my ‘walk’ post, which show that Acacia did well from 2009 to 2011 on charts showing Net cash from operations against Net cash into investing, and I believe that 2012 would have looked good at the time.
This addition is tenuous in places, and I don’t have the time to check everything as much as I would like. The 2009 and 2010 data is from the 2010 10-K, on http://www.sec.gov/Archives/edgar/data/934549/000093454911000005/actg2010123110k.htm
Original post on July 16 2013.
Stated Acacia’s Q2 2013 release date (July 18 2013). July 17 2013
Made clear I wanted to regress ‘birds’ on ‘cats’ in a chart example. July 17 2013
Fixed a thumbnail image. July 17 2013
Deleted a redundant thumbnail image. July 17 2013
Changed R-squared of zero to R-squared of 0.004, September 03 2013.
Images You can open all the images from here, or right click on each and choose ‘Open Link in New Tab’ to keep them handy for reference.
Copyright ©2013 sinksmith
DISCLAIMER: Your investment is your responsibility. It is your responsibility to check all material facts before making an investment decision. All investments involve different degrees of risk. You should be aware of your risk tolerance level and financial situations at all times. Furthermore, you should read all transaction confirmations, monthly, and year-end statements. Read any and all prospectuses carefully before making any investment decisions. You are free at all times to accept or reject all investment recommendations made by the author of this blog. All Advice on this blog is subject to market risk and may result in the entire loss of the reader’s investment. Please understand that any losses are attributed to market forces beyond the control or prediction of the author. As you know, a recommendation, which you are free to accept or reject, is not a guarantee for the successful performance of an investment.
ADDITIONAL DISCLAIMER: While I’ve taken care, I could have made mistakes in entering the data, in the analysis, or anywhere else, and cannot be held responsible for them. When I comment on the quality of the data, I refer to statistical features and do not imply any inaccuracy in published figures.