The least squares method in Excel: charts and simple (pairwise) linear regression analysis

The least squares method (abbreviated LSM, or OLS for ordinary least squares) belongs to the field of regression analysis. It has many applications, because it allows a given function to be represented approximately by simpler ones. LSM is extremely useful in processing observations, and it is widely used to estimate some quantities from measurements of others that contain random errors. In this article, you will learn how to carry out least squares calculations in Excel.

Statement of the problem on a specific example

Suppose there are two indicators, X and Y, and Y depends on X. Since OLS interests us here from the point of view of regression analysis (in Excel, its methods are implemented as built-in functions), let us proceed straight to a specific problem.

So, let X be the retail floor area of a grocery store, measured in square meters, and Y its annual turnover, in millions of rubles.

The task is to forecast the turnover (Y) a store will have for a given retail area. The function Y = f(X) can be expected to be increasing, since a hypermarket sells more goods than a stall.

A few words about the correctness of the initial data used for prediction

Let's say we have a table built with data for n stores.

According to mathematical statistics, the results will be more or less reliable only if data on at least 5-6 objects are examined. In addition, "anomalous" results should be excluded. For example, a small elite boutique can have a turnover many times greater than that of large mass-market outlets.

The essence of the method

The table data can be plotted on the Cartesian plane as points $M_1(x_1, y_1), \ldots, M_n(x_n, y_n)$. The problem then reduces to selecting an approximating function $y = f(x)$ whose graph passes as close as possible to the points $M_1, M_2, \ldots, M_n$.

Of course, you could use a high-degree polynomial, but that option is not only hard to implement but simply inappropriate, since it would not reflect the main trend that needs to be detected. The most reasonable solution is to look for the straight line $y = ax + b$ that best approximates the experimental data, or more precisely, for its coefficients $a$ and $b$.

Accuracy score

For any approximation, assessing its accuracy is of particular importance. Denote by $e_i$ the difference (deviation) between the experimental and the fitted values at the point $x_i$, i.e. $e_i = y_i - f(x_i)$.

At first glance, the accuracy of the approximation could be assessed by the sum of the deviations: when choosing a straight line to represent the dependence of Y on X, preference would go to the line with the smallest value of $\sum e_i$ over all points considered. However, things are not that simple, because in practice negative deviations occur alongside positive ones, and the two cancel each other out.

The problem can be solved by using the absolute values of the deviations or their squares. The latter approach is the most widely used. It is applied in many areas, including regression analysis (in Excel it is implemented by two built-in functions), and has long proved its effectiveness.
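Why the plain sum of deviations is a poor criterion can be seen from a small numerical sketch. The data and the two candidate lines below are hypothetical and chosen only for illustration; the helper names `deviations` and `summarize` are not part of the article.

```python
# Hypothetical data lying exactly on the line y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0]


def deviations(a, b):
    """Return e_i = y_i - (a*x_i + b) for every data point."""
    return [y - (a * x + b) for x, y in zip(xs, ys)]


def summarize(a, b):
    """Return (sum of deviations, sum of absolute values, sum of squares)."""
    e = deviations(a, b)
    return sum(e), sum(abs(v) for v in e), sum(v * v for v in e)


good = summarize(2.0, 0.0)  # the line the data actually follow
bad = summarize(0.0, 6.0)   # a horizontal line through the mean of y

print("good line:", good)
print("bad line: ", bad)
```

For the horizontal line the positive and negative deviations cancel, so the plain sum is zero even though the line ignores the trend; the absolute values and the squares both expose the poor fit.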

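Least squares method

Excel, as you know, has a built-in AutoSum function that calculates the sum of all values in a selected range. So nothing prevents us from calculating the value of the expression $e_1^2 + e_2^2 + e_3^2 + \ldots + e_n^2$.

In mathematical notation it looks like this:

$$\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \bigl(y_i - f(x_i)\bigr)^2.$$

Since we decided at the outset to approximate with a straight line, we have:

$$\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \bigl(y_i - (a x_i + b)\bigr)^2.$$

Thus, the task of finding the straight line that best describes the relationship between X and Y amounts to finding the minimum of a function of two variables:

$$F(a, b) = \sum_{i=1}^{n} \bigl(y_i - (a x_i + b)\bigr)^2 \to \min.$$

This requires setting the partial derivatives with respect to the new variables a and b equal to zero and solving a simple system of two equations in two unknowns:

$$\begin{cases} \dfrac{\partial F}{\partial a} = -2 \sum_{i=1}^{n} \bigl(y_i - (a x_i + b)\bigr) x_i = 0, \\[2mm] \dfrac{\partial F}{\partial b} = -2 \sum_{i=1}^{n} \bigl(y_i - (a x_i + b)\bigr) = 0. \end{cases}$$

After simple transformations, including dividing by 2 and rearranging the sums, we get:

$$\begin{cases} a \sum_{i=1}^{n} x_i^2 + b \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} x_i y_i, \\[2mm] a \sum_{i=1}^{n} x_i + b\,n = \sum_{i=1}^{n} y_i. \end{cases}$$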

Solving it, for example by Cramer's rule, we obtain a stationary point with coefficients $a^*$ and $b^*$. This point is the minimum, so to predict the turnover of a store with a given area we can use the straight line $y = a^* x + b^*$, which is the regression model for the example in question. Of course, it will not give you an exact result, but it will help you judge whether buying a store of a particular area on credit will pay off.
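The solution by Cramer's rule can be sketched in a few lines of Python. This is only an illustration: the function name `fit_line_cramer` and the store table are hypothetical, and the two equations in the docstring are the standard normal equations for a straight line, assumed here to be the system the text refers to.

```python
def fit_line_cramer(xs, ys):
    """Solve the normal equations for y = a*x + b by Cramer's rule.

    a*sum(x^2) + b*sum(x) = sum(x*y)
    a*sum(x)   + b*n      = sum(y)
    """
    n = len(xs)
    sx = sum(xs)
    sy = sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))

    det = sxx * n - sx * sx      # determinant of the system matrix
    if det == 0:
        raise ValueError("all x values are equal; the line is not unique")
    det_a = sxy * n - sx * sy    # first column replaced by the right-hand side
    det_b = sxx * sy - sx * sxy  # second column replaced by the right-hand side
    return det_a / det, det_b / det


# Hypothetical table: retail area (m^2) and annual turnover (million rubles).
area = [50.0, 100.0, 150.0, 200.0, 250.0]
turnover = [12.0, 21.0, 33.0, 39.0, 52.0]

a_star, b_star = fit_line_cramer(area, turnover)
print("a* =", a_star, "b* =", b_star)
print("forecast for 180 m^2:", a_star * 180 + b_star)
```

The determinant is zero only when all x values coincide, in which case no unique line exists; the sketch raises an error rather than dividing by zero.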

How to implement the least squares method in Excel

Excel has a function that performs the least squares calculation. It has the following form: TREND(known Y values; known X values; new X values; const). Let's apply it to our table.

To do this, in the cell in which the result of the calculation by the least squares method in Excel should be displayed, enter the “=” sign and select the “TREND” function. In the window that opens, fill in the appropriate fields, highlighting:

  • the range of known values of Y (in this case, the turnover data);
  • the range $x_1, \ldots, x_n$, i.e. the retail areas;
  • the new values of x for which the turnover is to be predicted (for their location on the worksheet, see below).

In addition, the formula has a logical argument "Const". If you enter 0 (FALSE) in the corresponding field, the calculation is carried out assuming that b = 0; if you enter 1 (TRUE) or leave the field empty, b is calculated in the usual way.

If you need the forecast for more than one x value, then after entering the formula do not press "Enter"; instead press the key combination "Ctrl" + "Shift" + "Enter".
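To make the role of the arguments concrete, here is a minimal Python sketch of a TREND-like calculation for a single explanatory variable. It is not Excel's implementation; the function name `trend` and the sample numbers are assumptions made for illustration, and the sketch only mirrors the documented meaning of the const argument (FALSE forces b = 0).

```python
def trend(known_y, known_x, new_x, const=True):
    """Least squares prediction for one explanatory variable.

    const=True  fits y = a*x + b;
    const=False forces b = 0 and fits y = a*x.
    """
    n = len(known_x)
    if const:
        mean_x = sum(known_x) / n
        mean_y = sum(known_y) / n
        sxx = sum((x - mean_x) ** 2 for x in known_x)
        sxy = sum((x - mean_x) * (y - mean_y)
                  for x, y in zip(known_x, known_y))
        a = sxy / sxx
        b = mean_y - a * mean_x
    else:
        # With b fixed at 0 the minimum of sum((y - a*x)^2) is at
        # a = sum(x*y) / sum(x^2).
        a = (sum(x * y for x, y in zip(known_x, known_y))
             / sum(x * x for x in known_x))
        b = 0.0
    return [a * x + b for x in new_x]


# Hypothetical data that follow y = 2x + 1 exactly.
ys = [3.0, 5.0, 7.0, 9.0]
xs = [1.0, 2.0, 3.0, 4.0]

with_intercept = trend(ys, xs, [5.0, 6.0])
through_origin = trend(ys, xs, [5.0, 6.0], const=False)
print(with_intercept)
print(through_origin)
```

Passing several new x values at once corresponds to entering TREND as an array formula in Excel.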

Some Features

Regression analysis is accessible even to beginners. The Excel formula for predicting an array of unknown values, TREND, can be used even by those who have never heard of the least squares method. It is enough to know a few features of how it works. In particular:

  • If the range of known y values is arranged in a single column, each column of known x values is treated as a separate variable; if it is arranged in a single row, each row of known x values is treated as a separate variable.
  • If the range of known x values is not specified in the TREND window, the program treats it as the array of integers 1; 2; 3; … of the same size as the range of known y values.
  • To output an array of "predicted" values, the TREND expression must be entered as an array formula.
  • If no new x values are specified, TREND assumes they are the same as the known x values. If the known x values are not specified either, the array 1; 2; 3; 4; … of the same size as the range of known y values is used for both.
  • The range of new x values must contain a column (or row) for each independent variable, just like the range of known x values. So if the known y values are in one column, the known and new x ranges must have the same number of columns; if they are in one row, the same number of rows.
  • The array of known x values can contain several variables. If there is only one, the ranges of known x and y values may have any shape as long as their dimensions match. With several variables, the range of known y values must fit in one column or one row.

FORECAST function

Regression analysis in Excel is implemented by several functions. One of them is FORECAST. It is similar to TREND in that it returns the result of a least squares calculation, but only for a single X whose Y value is unknown.

Now you know the Excel formulas that let even a beginner predict a future value of an indicator from a linear trend.

The least squares method (LSM) is based on minimizing the sum of squared deviations of the chosen function from the data under study. In this article, we approximate the available data with the linear function y = ax + b.

The least squares method (Ordinary Least Squares, OLS) is one of the basic methods of regression analysis for estimating the unknown parameters of regression models from sample data.

Consider approximation by functions of a single variable:

  • Linear: y = ax + b (this article)
  • Logarithmic: y = a*LN(x) + b
  • Power: y = a*x^m
  • Exponential: y = a*EXP(b*x) + c
  • Quadratic: y = ax^2 + bx + c

Note: Approximation by polynomials of degree 3 to 6 and by a trigonometric polynomial is covered in separate articles.

Linear dependence

We are interested in the relationship between two variables, x and y. It is assumed that y depends on x according to the linear law y = ax + b. To determine the parameters of this relationship, the researcher made observations: for each value $x_i$, a measurement $y_i$ was taken (see the example file). Suppose there are 20 pairs of values $(x_i; y_i)$.

Note: If the step in x is constant, a Line chart can be used to plot the data; otherwise the Scatter chart type must be used.

It is obvious from the diagram that the relationship between the variables is close to linear. To understand which of the many straight lines most "correctly" describes the relationship between variables, it is necessary to determine the criterion by which the lines will be compared.

As such a criterion, we use the expression:

$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2,$$

where $\hat{y}_i = a x_i + b$ and n is the number of pairs of values (in our case n = 20).

This expression is the sum of the squared distances between the observed values $y_i$ and the fitted values $\hat{y}_i$, and is often denoted SSE (Sum of Squared Errors, or residuals).

The least squares method consists in choosing the line $\hat{y} = ax + b$ for which this expression takes its minimum value.

Note: Any such line in two-dimensional space is uniquely determined by the values of 2 parameters: a (slope) and b (intercept).

It is assumed that the smaller the sum of squared distances, the better the line approximates the available data and the better it can be used to predict y from x. Note that even if in reality there is no relationship between the variables, or the relationship is nonlinear, least squares will still select a "best" line. Thus LSM says nothing about whether a real relationship exists; the method simply selects the parameters a and b for which the above expression is minimal.

After some fairly simple mathematical operations, the parameters a and b can be calculated:

$$a = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad b = \bar{y} - a\,\bar{x},$$

where $\bar{x}$ and $\bar{y}$ are the sample means.

As the formula shows, the parameter a is the ratio of the covariance of x and y to the variance of x, so in MS EXCEL the parameter a can be calculated with either of the following formulas (see the Linear sheet of the example file):

= COVAR(B26:B45;C26:C45)/VAR.P(B26:B45) or

= COVARIANCE.S(B26:B45;C26:C45)/VAR.S(B26:B45)

The parameter a can also be calculated with the formula = SLOPE(C26:C45;B26:B45). For the parameter b, use the formula = INTERCEPT(C26:C45;B26:B45).

Finally, the LINEST() function calculates both parameters at once. To enter the formula = LINEST(C26:C45;B26:B45), select 2 cells in a row and press CTRL + SHIFT + ENTER. The left cell will return the value of a, the right one b.

Note: To avoid entering array formulas, you can additionally use the INDEX() function. The formula = INDEX(LINEST(C26:C45;B26:B45);1), or simply = LINEST(C26:C45;B26:B45) entered in a single cell, returns the parameter responsible for the slope of the line, i.e. a. The formula = INDEX(LINEST(C26:C45;B26:B45);2) returns the parameter responsible for the intersection of the line with the Y axis, i.e. b.
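The covariance-to-variance ratio behind these formulas can be checked with a short Python sketch. The data and the function name `slope_intercept` are hypothetical; the sketch assumes the usual definitions of population (divide by n) and sample (divide by n - 1) covariance and variance, and shows that both pairs give the same slope because the divisor cancels.

```python
def slope_intercept(xs, ys):
    """Return (a from population stats, a from sample stats, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)

    a_population = (sxy / n) / (sxx / n)          # population covariance / population variance
    a_sample = (sxy / (n - 1)) / (sxx / (n - 1))  # sample covariance / sample variance
    b = mean_y - a_population * mean_x            # intercept from the means
    return a_population, a_sample, b


# Hypothetical observations.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

a_pop, a_smp, b = slope_intercept(xs, ys)
print("slope:", a_pop, a_smp)
print("intercept:", b)
```

Mixing the two conventions (for example population covariance with sample variance) would scale the slope by (n - 1)/n, so the two functions in each formula must be of the same kind.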

After the parameters have been calculated, the corresponding line can be drawn on the scatter plot.

Another way to draw the least squares line is the chart's Trendline tool. To use it, select the chart, go to the Layout tab, click Trendline in the Analysis group, and then choose Linear Trendline.

By checking the "Display Equation on chart" box in the dialog box, you can verify that the parameters found above match the values on the chart.

Note: For the parameters to match, the chart type must be Scatter. The point is that in a Line chart the x-axis values cannot be set by the user (the user can only specify labels, which do not affect the position of the points). Instead of the X values, the sequence 1; 2; 3; … (the category numbers) is used. Therefore, if a trendline is built on a Line chart, the values of this sequence are used instead of the actual X values, which leads to an incorrect result (unless, of course, the actual X values coincide with the sequence 1; 2; 3; …).
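The effect described in this note can be reproduced outside Excel. In the sketch below (hypothetical data, helper name `fit_line` assumed), the same y values are fitted once against the actual x values and once against the category numbers 1; 2; 3; …, as a Line chart would do.

```python
def fit_line(xs, ys):
    """Return the least squares slope and intercept for y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    return a, mean_y - a * mean_x


# Hypothetical data with unevenly spaced x; the points lie on y = 2x + 5.
x_actual = [10.0, 20.0, 40.0, 80.0]
y_values = [25.0, 45.0, 85.0, 165.0]

# What a category axis effectively uses instead of the x values.
x_category = [float(i) for i in range(1, len(y_values) + 1)]

scatter_fit = fit_line(x_actual, y_values)
line_chart_fit = fit_line(x_category, y_values)
print("fit against actual x:   ", scatter_fit)
print("fit against 1; 2; 3; …: ", line_chart_fit)
```

The second pair of parameters describes a different line altogether, which is why the trendline equation shown on a Line chart does not match SLOPE() and INTERCEPT() computed from the real x values.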


The least squares method is a mathematical procedure for constructing the linear equation that most closely fits two paired series of numbers. Its purpose is to minimize the total squared error. Excel has tools for applying this method in calculations. Let's see how it is done.

The method of least squares (LSM) gives a mathematical description of the dependence of one variable on another. It can be used for forecasting.

Enable the Solver add-in

To use OLS in Excel in this way, you need to enable the Solver add-in, which is disabled by default.

Once it is enabled, the Solver function is available in Excel and its tools appear on the ribbon.

Conditions of the problem

Let us describe the application of LSM with a specific example. We have two series of numbers, x and y, shown in the image below.

This dependence can be described most accurately by the function:

y = a + nx

At the same time, it is known that for x = 0, y is also equal to 0. Therefore this equation reduces to the dependence y = nx.

We have to find the minimum of the sum of squared differences.
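For the one-parameter model y = nx the minimum can also be written in closed form, which is a convenient check on whatever Solver returns. The sketch below uses that closed form rather than a numerical search; the data and the names `sse` and `n_best` are hypothetical. Setting the derivative of the sum of squares with respect to n to zero gives n = Σxy / Σx².

```python
# Hypothetical series; the article's own numbers were shown in an image.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.2, 3.9, 6.1, 8.0]


def sse(n):
    """Sum of squared differences between y and the model n*x."""
    return sum((y - n * x) ** 2 for x, y in zip(xs, ys))


# d/dn sum((y - n*x)^2) = -2 * sum(x * (y - n*x)) = 0
# => n = sum(x*y) / sum(x^2)
n_best = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

print("n =", n_best)
print("SSE at n:", sse(n_best))
print("SSE slightly off:", sse(n_best - 0.01), sse(n_best + 0.01))
```

Moving n in either direction increases the sum of squares, which is the behaviour a numerical minimizer such as Solver relies on.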

Solution

Let us proceed to the description of the direct application of the method.


As you can see, the application of the least squares method is a rather complicated mathematical procedure. We have shown it in action with the simplest example, but there are much more complex cases. However, the Microsoft Excel toolkit is designed to simplify the calculations as much as possible.

Least squares method (LSM)

A system of m linear equations with n unknowns has the form:

$$\begin{cases} a_{11} x_1 + a_{12} x_2 + \ldots + a_{1n} x_n = b_1, \\ a_{21} x_1 + a_{22} x_2 + \ldots + a_{2n} x_n = b_2, \\ \ldots \\ a_{m1} x_1 + a_{m2} x_2 + \ldots + a_{mn} x_n = b_m. \end{cases}$$

Three cases are possible: m < n, m = n and m > n. The case m = n was considered in the previous paragraphs. For m < n, a consistent system is underdetermined and has infinitely many solutions.

If m > n and the system is consistent, then the matrix A has at least m - n linearly dependent rows. Here the solution can be obtained by selecting any n linearly independent equations (if they exist) and applying the formula $X = A^{-1}B$, that is, by reducing the problem to the one solved earlier. The resulting solution will then always satisfy the remaining m - n equations.

However, on a computer it is more convenient to use a more general approach, the method of least squares.

Algebraic Least Squares

The algebraic method of least squares is a method for solving systems of linear equations

$$Ax = b$$

by minimizing the Euclidean norm

$$\|Ax - b\| \to \inf. \qquad (1.2)$$

Experimental Data Analysis

Consider an experiment in which, at the instants of time

$$t_1, t_2, \ldots, t_m,$$

some quantity, for example the temperature Q(t), is measured. Let the measurement results be given by the array

$$b_1, b_2, \ldots, b_m.$$

Assume that the conditions of the experiment are such that the measurements carry a known error. In such cases the law of temperature change Q(t) is sought in the form of a polynomial

$$P(t) = x_1 + x_2 t + x_3 t^2 + \ldots + x_n t^{n-1},$$

whose unknown coefficients $x_1, \ldots, x_n$ are chosen so that the quantity $E(x_1, \ldots, x_n)$ defined by

$$E(x_1, \ldots, x_n) = \sum_{i=1}^{m} \bigl(P(t_i) - b_i\bigr)^2$$

takes its minimum value. Since a sum of squares is minimized, this method is called a least squares fit to the data.

Substituting the expression for P(t), we get

$$E(x_1, \ldots, x_n) = \sum_{i=1}^{m} \Bigl(\sum_{j=1}^{n} x_j t_i^{\,j-1} - b_i\Bigr)^2.$$

The task is to determine the array $x_1, \ldots, x_n$ so that E is minimal, i.e. to determine it by the least squares method. To do this, we set the partial derivatives equal to zero:

$$\frac{\partial E}{\partial x_k} = 2 \sum_{i=1}^{m} \Bigl(\sum_{j=1}^{n} x_j t_i^{\,j-1} - b_i\Bigr) t_i^{\,k-1} = 0, \qquad k = 1, 2, \ldots, n.$$

If we introduce the m × n matrix $A = (a_{ij})$, where

$$a_{ij} = t_i^{\,j-1}, \qquad i = 1, 2, \ldots, m; \; j = 1, 2, \ldots, n,$$

then this equality takes the form

$$\sum_{i=1}^{m} a_{ik} \Bigl(\sum_{j=1}^{n} a_{ij} x_j - b_i\Bigr) = 0, \qquad k = 1, 2, \ldots, n.$$

Let us rewrite it in terms of matrix operations. By definition, multiplying a matrix by a column gives

$$(Ax)_i = \sum_{j=1}^{n} a_{ij} x_j, \qquad i = 1, 2, \ldots, m.$$

For the transposed matrix, the analogous relation is

$$(A^T y)_k = \sum_{i=1}^{m} a_{ik} y_i, \qquad k = 1, 2, \ldots, n.$$

Denoting by $(Ax)_i$ the i-th component of the vector Ax, the matrix equalities above give

$$\sum_{i=1}^{m} a_{ik} (Ax)_i = \sum_{i=1}^{m} a_{ik} b_i, \qquad k = 1, 2, \ldots, n.$$

In matrix form, this equality can be rewritten as

$$A^T A x = A^T B \qquad (1.3)$$

Here A is a rectangular m×n matrix. Moreover, in problems of data approximation, as a rule, m > n. Equation (1.3) is called the normal equation.
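As an illustration of the normal equation (1.3), the following Python sketch fits a polynomial to measurements. It assumes the polynomial is written as P(t) = x_1 + x_2 t + … + x_n t^(n-1), so that the matrix entries are a_ij = t_i^(j-1); this notation, the function names and the test data are assumptions for the example, and Gaussian elimination is just one possible way to solve the resulting square system.

```python
def solve_linear(matrix, rhs):
    """Solve a square system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Work on an augmented copy so the inputs are left untouched.
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("normal equations are singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution on the upper triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def polyfit_normal(ts, bs, n):
    """Fit P(t) = x_1 + x_2*t + ... + x_n*t^(n-1) via A^T A x = A^T B."""
    m = len(ts)
    a = [[t ** j for j in range(n)] for t in ts]  # a_ij = t_i^(j-1)
    ata = [[sum(a[i][p] * a[i][q] for i in range(m)) for q in range(n)]
           for p in range(n)]
    atb = [sum(a[i][p] * bs[i] for i in range(m)) for p in range(n)]
    return solve_linear(ata, atb)


# Hypothetical measurements generated from 1 + 2t + 0.5t^2 without noise.
ts = [0.0, 1.0, 2.0, 3.0, 4.0]
bs = [1.0 + 2.0 * t + 0.5 * t * t for t in ts]

coeffs = polyfit_normal(ts, bs, 3)
print(coeffs)
```

With five measurements and three coefficients this is the case m > n; because the sample data contain no noise, the fit recovers the generating coefficients up to rounding.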

The problem could have been written from the very beginning in an equivalent matrix form, using the Euclidean norm of vectors:

$$E(x) = \|Ax - B\|^2 = (Ax - B,\; Ax - B) \to \min.$$

Our goal is to minimize this function with respect to x. For a minimum to be attained at the solution point, the first derivatives with respect to x must vanish there. The derivatives of this function are

$$-2A^T B + 2A^T A x,$$

and therefore the solution must satisfy the system of linear equations

$$(A^T A)\,x = A^T B.$$

These equations are called the normal equations. If A is an m × n matrix, then $A^T A$ is an n × n matrix, i.e. the matrix of the normal equations is always a square symmetric matrix. Moreover, it has the property of positive definiteness in the sense that $(A^T A x, x) = (Ax, Ax) \ge 0$.
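Both properties are easy to verify numerically. The following sketch uses a hypothetical 3 × 2 matrix and an arbitrary vector x; the variable names are chosen only for this example.

```python
# Hypothetical 3 x 2 matrix A and an arbitrary vector x.
a = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0]]
x = [0.5, -1.5]

m, n = len(a), len(a[0])

# A^T A, computed entry by entry.
ata = [[sum(a[i][p] * a[i][q] for i in range(m)) for q in range(n)]
       for p in range(n)]

# The vectors A x and (A^T A) x.
ax = [sum(a[i][j] * x[j] for j in range(n)) for i in range(m)]
atax = [sum(ata[p][q] * x[q] for q in range(n)) for p in range(n)]

is_symmetric = all(ata[p][q] == ata[q][p] for p in range(n) for q in range(n))
lhs = sum(atax[p] * x[p] for p in range(n))  # (A^T A x, x)
rhs = sum(v * v for v in ax)                 # (Ax, Ax)

print("A^T A =", ata)
print("symmetric:", is_symmetric)
print("(A^T A x, x) =", lhs, " (Ax, Ax) =", rhs)
```

The two scalar products agree, and since (Ax, Ax) is a sum of squares it cannot be negative.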

Remark. A solution of an equation of the form (1.3) is sometimes called a least squares solution of the system Ax = B, where A is a rectangular m × n matrix (m > n).

The least squares problem can be graphically interpreted as minimizing the vertical distances from the data points to the model curve (see Figure 1.1). This idea is based on the assumption that all approximation errors correspond to observational errors. If there are also errors in the explanatory variables, then it may be more appropriate to minimize the Euclidean distance from the data to the model.

OLS in Excel

The algorithm for implementing OLS in Excel given below assumes that all the initial data are already known. Multiply both sides of the matrix equation A·X = B of the system on the left by the transposed matrix of the system, $A^T$:

$$A^T A X = A^T B$$

Then multiply both sides of the equation on the left by the matrix $(A^T A)^{-1}$. If this matrix exists, the system is determined. Taking into account that

$$(A^T A)^{-1} (A^T A) = E,$$

we get

$$X = (A^T A)^{-1} A^T B.$$

The resulting matrix expression is the solution of a system of m linear equations with n unknowns for m > n.

Consider the application of the above algorithm on a specific example.

Example. Let it be necessary to solve the system

In Excel, the solution sheet in formula display mode for this problem looks like this:


Calculation results:

The desired vector X is located in the range E11:E12.

When solving a given system of linear equations, the following functions were used:

1. MINVERSE - returns the inverse matrix for a matrix stored in an array.

Syntax: MINVERSE(array).

Array is a numeric array with an equal number of rows and columns.

2. MMULT - returns the product of two matrices (the matrices are stored in arrays). The result is an array with the same number of rows as array1 and the same number of columns as array2.

Syntax: MMULT(array1, array2).

Array1, array2 are the arrays to be multiplied.

After entering the function in the upper left cell of the array range, select the array, starting from the cell containing the formula, press the F2 key, and then press CTRL+SHIFT+ENTER.

3. TRANSPOSE - converts a vertical set of cells into a horizontal one, or vice versa. The result is an array whose number of rows equals the number of columns of the original array and whose number of columns equals the number of rows of the original array.
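The same chain of operations, X = (AᵀA)⁻¹AᵀB, can be sketched in Python with three small helpers named after the worksheet functions. They are illustrative stand-ins, not Excel's implementations, and the overdetermined system below is hypothetical, since the article's own example system is not reproduced in the text.

```python
def transpose(a):
    """Rows become columns, like TRANSPOSE."""
    return [list(col) for col in zip(*a)]


def mmult(a, b):
    """Matrix product, like MMULT: rows of a times columns of b."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def minverse(a):
    """Inverse of a square matrix by Gauss-Jordan elimination, like MINVERSE."""
    n = len(a)
    # Augment with the identity matrix and reduce the left half to identity.
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


# Hypothetical system of 3 equations with 2 unknowns (m > n).
a = [[1.0, 1.0],
     [1.0, 2.0],
     [1.0, 3.0]]
b = [[1.0],
     [2.0],
     [2.0]]

at = transpose(a)
x = mmult(mmult(minverse(mmult(at, a)), at), b)
print(x)
```

In a worksheet the corresponding formula would nest the three functions in the same order; for larger or ill-conditioned systems, forming the explicit inverse is usually avoided in favour of solving the normal equations directly.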

 

