The small-sample method. Small-sample statistics. Methods for selecting units from the general population

Small-sample statistics

It is generally accepted that small-sample statistics, or, as it is often called, the statistics of "small n", began in the first decade of the twentieth century with the publication of a paper by W. Gosset, in which he presented the t-distribution; a little later, this result gained worldwide fame. At the time, Gosset worked as a statistician at the Guinness brewery. One of his duties was to analyze incoming consignments of kegs of freshly brewed porter. For a reason he never really explained, Gosset experimented with the idea of dramatically reducing the number of samples taken from the very large number of barrels in the brewery's warehouses for selective quality control of the porter. This led him to postulate the t-distribution. Since the Guinness charter prohibited its employees from publishing research results, Gosset published the results of his experiment, which compared sampling quality control based on the t-distribution for small samples with the traditional z-distribution (the normal distribution), anonymously under the pseudonym Student, which is where the name Student's t-distribution comes from.

t-distribution. Like z-distribution theory, t-distribution theory is used to test the null hypothesis that two samples are simply random samples from the same population, and that the computed statistics (e.g., the mean and standard deviation) are therefore unbiased estimates of the population parameters. However, unlike normal-distribution theory, the theory of the t-distribution for small samples does not require a priori knowledge or accurate estimates of the population mean and variance. Moreover, whereas testing the difference between the means of two large samples for statistical significance requires the fundamental assumption that the characteristic is normally distributed in the population, the theory of the t-distribution makes no assumptions about parameters.

It is well known that normally distributed characteristics are described by one single curve, the Gaussian curve, which satisfies the following equation:

$$y = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2 / 2\sigma^2}$$

For the t-distribution, a whole family of curves is represented by the following formula:

$$f(t) = \frac{\Gamma\!\left(\frac{n+1}{2}\right)}{\sqrt{n\pi}\,\Gamma\!\left(\frac{n}{2}\right)}\left(1 + \frac{t^2}{n}\right)^{-(n+1)/2}$$

This is why the equation for t includes the gamma function; in mathematical terms it means that as n changes, a different curve will satisfy the given equation.

Degrees of freedom

In the equation for t, the letter n denotes the number of degrees of freedom (df) associated with the estimate of the population variance (s²), which is the second moment of the corresponding moment generating function, such as that for the t-distribution. In statistics, the number of degrees of freedom indicates how many values remain free to vary after some have already been used in a given type of analysis. In the t-distribution, one of the deviations from the sample mean is always fixed, since the sum of all such deviations must equal zero. This affects the sum of squares when computing the sample variance as an unbiased estimate of the population variance, and leads to df being equal to the number of measurements minus one for each sample. Hence, in the formulas and procedures for computing the t-statistic for testing the null hypothesis with two samples, df = N − 2, where N is the total number of measurements in both samples.

F-distribution. The null hypothesis tested by the t-test is that two samples were randomly drawn from the same population, or were randomly drawn from two different populations with the same variance. But what if more groups need to be analyzed? The answer to this question was sought for twenty years after Gosset discovered the t-distribution. Two of the most prominent statisticians of the twentieth century were directly involved in obtaining it. One was the eminent English statistician R. A. Fisher, who proposed the first theoretical formulations whose development led to the F-distribution; his work on small-sample theory, developing Gosset's ideas, was published in the mid-1920s (Fisher, 1925). The other was George Snedecor, one of the earliest American statisticians, who developed a way to compare two independent samples of any size by computing the ratio of two variance estimates. He named this ratio the F-ratio, after Fisher. Snedecor's research led to the F-distribution being defined as the distribution of the ratio of two χ² statistics, each divided by its own degrees of freedom:

$$F = \frac{\chi^2_1 / df_1}{\chi^2_2 / df_2}$$

From this came Fisher's classic work on analysis of variance, a statistical method explicitly focused on the analysis of small samples.

The sampling distribution of F (where n = df) is represented by the following equation:

$$f(F) = \frac{\Gamma\!\left(\frac{n_1+n_2}{2}\right)}{\Gamma\!\left(\frac{n_1}{2}\right)\Gamma\!\left(\frac{n_2}{2}\right)}\left(\frac{n_1}{n_2}\right)^{n_1/2} F^{(n_1-2)/2}\left(1 + \frac{n_1}{n_2}F\right)^{-(n_1+n_2)/2}$$

As with the t-distribution, the gamma function indicates that there is a family of distributions that satisfy the equation for F. In this case, however, the analysis includes two df values: the number of degrees of freedom for the numerator and for the denominator of the F-ratio.

Tables for evaluating t- and F-statistics. When testing the null hypothesis with statistics based on large-sample theory, only one look-up table is usually required: the table of standard deviations (z), which gives the area under the normal curve between any two z-values on the abscissa. However, the tables for the t- and F-distributions are, of necessity, sets of tables, since they are based on families of distributions obtained by varying the number of degrees of freedom. Although the t- and F-distributions are probability density functions, like the normal distribution for large samples, they differ from it with respect to the four moments used to describe them. The t-distribution, for example, is symmetric (note the t² in its equation) for all df, but it becomes more peaked as the sample size decreases. Peaked curves (those with greater-than-normal kurtosis) tend to be less asymptotic (that is, their tails lie farther from the abscissa at the ends of the distribution) than curves with normal kurtosis, such as the Gaussian curve. This difference leads to noticeable discrepancies between the points on the abscissa corresponding to the t- and z-values. With df = 5 and a two-tailed level α of 0.05, t = 2.57, whereas the corresponding z = 1.96. Therefore, t = 2.57 indicates statistical significance at the 5% level. In the case of the normal curve, however, z = 2.57 (more precisely, 2.58) already indicates significance at the 1% level. Similar comparisons can be made with the F-distribution, since F equals t² when the number of groups compared is two.

What constitutes a “small” sample?

At one time, the question was raised of how large a sample must be in order to be considered small. There is simply no definite answer to this question. However, df = 30 is conventionally taken as the boundary between a small and a large sample. The basis for this somewhat arbitrary decision is the result of comparing the t-distribution with the normal distribution. As noted above, the discrepancy between the values of t and z tends to increase as df decreases and to decrease as df increases. In fact, t begins to approach z closely long before the limiting case t = z at df = ∞. A simple visual examination of the tabulated t-values shows that this convergence becomes quite fast from df = 30 upward. The comparative values of t (at df = 30) and z are, respectively, 2.04 and 1.96 for p = 0.05; 2.75 and 2.58 for p = 0.01; 3.65 and 3.29 for p = 0.001.
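These tabulated values are easy to check numerically; a minimal sketch assuming SciPy is available:

```python
# Compare two-tailed critical values of the t-distribution with the
# corresponding z (normal) critical values quoted in the text.
from scipy import stats

for df in (5, 30):
    for alpha in (0.05, 0.01, 0.001):
        t_crit = stats.t.ppf(1 - alpha / 2, df)   # two-tailed t critical value
        z_crit = stats.norm.ppf(1 - alpha / 2)    # corresponding z critical value
        print(f"df={df:2d}  alpha={alpha:.3f}  t={t_crit:.2f}  z={z_crit:.2f}")

# For df=5, alpha=0.05 this prints t = 2.57 vs z = 1.96;
# for df=30, alpha=0.05 it prints t = 2.04 vs z = 1.96, as in the text.
```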

Other statistics for "small" samples

Although statistical tests such as t and F were specifically designed for small samples, they are equally applicable to large samples. There are, however, many other statistical methods for analyzing small samples, often used for precisely this purpose: the so-called nonparametric, or distribution-free, methods. These statistics are intended primarily for measurements obtained on scales that do not satisfy the definition of ratio or interval scales; most often these are ordinal (rank) or nominal measurements. Nonparametric statistics require no assumptions about distribution parameters, in particular about variance estimates, because ordinal and nominal scales exclude the very concept of variance. For this reason, nonparametric methods are also used with interval- and ratio-scale measurements when small samples are analyzed and there is a possibility that the basic assumptions required by parametric methods are violated. The statistics that can reasonably be applied to small samples include: Fisher's exact probability test, Friedman's two-way analysis of variance by ranks, Kendall's rank correlation coefficient, Kendall's coefficient of concordance (W), the Kruskal-Wallis H test for one-way analysis of variance by ranks, the Mann-Whitney U test, the median test, the sign test, Spearman's rank correlation coefficient, and the Wilcoxon signed-rank T test.

In the process of assessing how representative the sample observation data are, the question of the sample size becomes important.

It affects not only the limits that the sampling error will not exceed with a given probability, but also the ways of determining these limits.

With a large number of units in the sample population (n > 100), the distribution of random errors of the sample mean is, in accordance with Lyapunov's theorem, normal or approaches normal as the number of observations increases.

The probability that an error will go beyond certain limits is estimated from tables of the Laplace integral. The calculation of the sampling error is based on the general variance, since for large n the coefficient by which the sample variance is multiplied to obtain the general variance plays no significant role.

In the practice of statistical research, one often has to deal with so-called small samples.

A small sample is understood to be a sample observation whose number of units does not exceed 30.

The development of small-sample theory was begun by the English statistician W. S. Gosset (who published under the pseudonym Student) in 1908. He proved that the estimate of the discrepancy between the mean of a small sample and the general mean follows a special distribution law.

To determine the possible error limits, the so-called Student's t-criterion is used, defined by the formula

$$t = \frac{\tilde{x} - \bar{x}}{\mu}$$

where $\tilde{x}$ is the sample mean, $\bar{x}$ is the general mean, and $\mu$ is the measure of random fluctuations of the sample mean in a small sample.

The value of $\mu$ is calculated on the basis of the sample observation data:

$$\mu = \sqrt{\frac{\sigma^2}{n-1}}$$

where $\sigma^2$ is the variance of the small sample. This value is used only for the studied population, and not as an approximate estimate of $\sigma^2$ in the general population.

With a small sample size, the Student distribution differs from the normal one: large values of the criterion t have a higher probability here than under a normal distribution.

The limiting error of a small sample, as a function of its mean error, is written as

$$\Delta = t\mu$$

But in this case the value of t is related to the probability estimate differently than in a large sample.

According to the Student distribution, the probability estimate depends both on the value of the confidence coefficient t and on the sample size n, for the case when the limiting error does not exceed the t-fold mean error in small samples.

Table 3.1 Probability distribution in small samples depending on the confidence coefficient t and the sample size n

[table not reproduced in the source]

As can be seen from Table 3.1, as n increases this distribution tends to the normal one, and at n = 20 it already differs little from it.

Let's show how to use the Student's distribution table.

Suppose that a sample survey of workers at a small enterprise recorded the time (min) each worker spent performing one of the production operations. Let us find the average sample time:

$$\tilde{x} = \frac{\sum x}{n}$$

The sample variance is

$$\sigma^2 = \frac{\sum (x - \tilde{x})^2}{n}$$

Hence the mean error of the small sample is

$$\mu = \sqrt{\frac{\sigma^2}{n-1}}$$

From Table 3.1, for the resulting confidence coefficient t and the size of the small sample, we find the corresponding probability P.

Thus, it can be asserted with probability P that the discrepancy between the sample mean and the general mean lies between $-t\mu$ and $+t\mu$, i.e., that the difference does not exceed $\Delta = t\mu$ in absolute value.

Consequently, the average time spent in the entire population lies in the range from $\tilde{x} - \Delta$ to $\tilde{x} + \Delta$.

The probability that this assertion is actually incorrect, and that the error for random reasons exceeds $\Delta$, is $1 - P$.

The Student probability table is often given in a different form than in Table 3.1. It is believed that in some cases this form is more convenient for practical use (Table 3.2).

From Table 3.2 it follows that for each number of degrees of freedom a limiting value t is indicated which, with the given probability, will not be exceeded owing to random fluctuations in the sample results.

On the basis of Table 3.2, the confidence intervals are determined: from $\tilde{x} - t\mu$ to $\tilde{x} + t\mu$.

This is the region of values of the general mean; the probability of falling outside it is very small and equals

$$\alpha = 1 - P$$

As the confidence probability in a two-sided test, 0.95 or 0.99 is used as a rule, which does not, however, exclude the choice of other values given in Table 3.2.

Table 3.2 Some values of the Student t-distribution

[table not reproduced in the source]

The probabilities of the estimated mean value randomly falling outside the confidence interval will then be, respectively, 0.05 and 0.01, i.e., very small.

The choice among these probabilities is, to a certain extent, arbitrary and is largely determined by the content of the problems for which the small sample is used.

In conclusion, we note that the calculation of errors in a small sample differs little from the analogous calculations for a large sample. The difference is that with a small sample the confidence level of our assertion is somewhat lower than with a larger sample.

However, all this does not mean that a small sample can be used whenever a large sample is needed. In many cases the discrepancies between the limits found can be considerable, which is hardly satisfactory for researchers. Therefore, a small sample should be used in statistical studies of socio-economic phenomena with great caution and with appropriate theoretical and practical justification.

Thus, conclusions based on small-sample results are of practical importance only if the distribution of the trait in the general population is normal or asymptotically normal. One must also take into account that the accuracy of small-sample results is still lower than that of a large sample.


Bootstrap, small samples, data analysis applications

Main idea

The bootstrap method was proposed by B. Efron in 1979 as a development of the jackknife method.

Let us describe the main idea of the bootstrap.

The purpose of data analysis is to obtain the most accurate sample estimates and to generalize the results to the entire population.

The technical term for numerical characteristics computed from a sample is sample statistics.

The main descriptive statistics are the sample mean, median, standard deviation, etc.

Summary statistics such as sample mean, median, and correlation will vary from sample to sample.

The researcher needs to know the size of these deviations relative to the population; the margin of error is calculated on this basis.

The full picture of all possible values of a sample statistic, in the form of a probability distribution, is called the sampling distribution.

The key issue is the sample size. What if the sample size is small? One clever approach is to repeatedly draw data at random from the available sample.

The idea behind the bootstrap is to use the results of computations on such samples as a "fictitious population" in order to determine the sampling distribution of a statistic. In effect, it analyzes a large number of "phantom" samples, called bootstrap samples.

Usually, several thousand samples are generated at random; from this set one can find the bootstrap distribution of the statistic of interest.

So, suppose we have a sample. At the first step, we randomly select one of its elements, return this element to the sample, again randomly select an element, and so on.

We repeat the described random-selection procedure n times.

The bootstrap thus performs random selection with replacement: a selected member of the original sample is returned to the sample and may be selected again.

Formally, at each step, we choose an element of the original sample with a probability of 1 / n.

Since there are n elements in the original sample, the probability of obtaining a bootstrap sample with element frequencies (N₁, ..., Nₙ), where each Nᵢ varies from 0 to n, is described by the multinomial distribution.
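In standard notation (the formula itself is not reproduced in the source), with $k_i$ the number of times the $i$-th original element appears in the bootstrap sample, this probability is

$$P(N_1 = k_1, \ldots, N_n = k_n) = \frac{n!}{k_1!\,k_2!\cdots k_n!}\left(\frac{1}{n}\right)^{n}, \qquad \sum_{i=1}^{n} k_i = n.$$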

Several thousand such samples are generated, which is quite feasible for modern computers.

For each sample, an estimate of the quantity of interest is constructed, then the estimates are averaged.

Since there are many samples, one can construct an empirical distribution function of the estimates, then calculate quantiles and a confidence interval.
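A minimal sketch of this procedure in Python, assuming NumPy; the data vector is illustrative and the median is taken as the statistic of interest:

```python
import numpy as np

rng = np.random.default_rng(0)
sample = np.array([8.5, 7.1, 4.7, 3.9, 1.9, 14.8, 16.7, 20.6, 24.7, 25.0])  # illustrative data
B = 5000  # number of bootstrap samples

# Draw B bootstrap samples with replacement and compute the statistic on each.
boot_medians = np.array([
    np.median(rng.choice(sample, size=sample.size, replace=True))
    for _ in range(B)
])

# The empirical distribution of boot_medians approximates the sampling
# distribution of the median; its quantiles give a confidence interval.
lo, hi = np.quantile(boot_medians, [0.025, 0.975])
print(f"average of bootstrap medians: {boot_medians.mean():.2f}")
print(f"95% percentile interval: [{lo:.2f}, {hi:.2f}]")
```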

It is clear that the bootstrap method is a modification of the Monte Carlo method.

If the samples are generated without returning elements to the sample, the well-known jackknife method is obtained.

The question is: why do this and when is the method reasonable to use in real data analysis?

With the bootstrap we do not obtain new information; rather, we make sensible use of the available data in light of the task at hand.

For example, the bootstrap can be used for small samples, for estimating the median or correlations, for constructing confidence intervals, and in other situations.

Efron's original work considered estimates of a pairwise correlation for a sample of size n = 15.

B = 1000 bootstrap replications are generated.

On the basis of the coefficients obtained, $\hat\rho_1, \ldots, \hat\rho_B$, an overall estimate of the correlation coefficient and an estimate of its standard deviation are constructed.

The standard error of the sample correlation coefficient, calculated using the normal approximation, is

$$\widehat{\mathrm{SE}} = \frac{1 - \hat\rho^2}{\sqrt{n - 3}}$$

where the correlation coefficient $\hat\rho$ is 0.776 and the size of the original sample is n = 15.

The bootstrap estimate of the standard error is 0.127; see Efron and Gong (1982).
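A sketch of this computation, assuming NumPy; the paired data below are synthetic stand-ins (the original pairs are not reproduced in the source), so the printed numbers will not match Efron's exactly:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic paired data standing in for the original n = 15 sample.
n = 15
x = rng.normal(size=n)
y = 0.8 * x + 0.4 * rng.normal(size=n)

B = 1000  # bootstrap replications, as in the text
boot_r = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)          # resample pairs with replacement
    boot_r[b] = np.corrcoef(x[idx], y[idx])[0, 1]

se_boot = boot_r.std(ddof=1)                  # bootstrap estimate of the standard error
r_hat = np.corrcoef(x, y)[0, 1]
se_normal = (1 - r_hat**2) / np.sqrt(n - 3)   # normal-approximation standard error
print(f"r = {r_hat:.3f}, SE_normal = {se_normal:.3f}, SE_boot = {se_boot:.3f}")
```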

Theoretical background

Let $\theta$ be the target parameter of the study, for example the average income in the society under consideration.

For an arbitrary sample of size $n$ we obtain a data set $x = (x_1, x_2, \ldots, x_n)$. Let the corresponding sample statistic be $\hat\theta_n = \hat\theta_n(x)$.

For most sample statistics, at large $n$ ($n > 30$) the sampling distribution is a normal curve with center $\theta$ and standard deviation $\tau/\sqrt{n}$, where the positive parameter $\tau$ depends on the population and on the type of statistic $\hat\theta_n$.

This classical result is known as the central limit theorem.

There are often serious technical difficulties in estimating the required standard deviation from the data.

This is the case, for example, when the statistic is the median or the sample correlation.

The bootstrap method overcomes these difficulties.

The idea is simple: denote by $\hat\theta^*_n$ the value of the same statistic computed from a bootstrap sample drawn from the original sample $x$.

What can be said about the sampling distribution of $\hat\theta^*_n$ if the "original" sample is fixed?

In the limit, this distribution is also bell-shaped, with parameters $\hat\theta_n$ and $\hat\tau/\sqrt{n}$.

Thus, the bootstrap distribution of $\hat\theta^*_n - \hat\theta_n$ is a good approximation of the sampling distribution of $\hat\theta_n - \theta$.

Note that as we pass from one sample to another, only $\hat\theta_n$ changes in this expression, since it is calculated from $x$.

This is essentially a bootstrap version of the Central Limit Theorem.

It has also been found that when the limiting sampling distribution of a statistical function does not involve unknown population quantities, the bootstrap distribution provides a better approximation to the sampling distribution than the central limit theorem does.

In particular, this is the case when the statistical function has the form $t = (\hat\theta_n - \theta)/\widehat{se}$, where $\widehat{se}$ denotes the true or the sample estimate of the standard error; the limiting sampling distribution is then usually standard normal.

This effect is called the second-order correctness of the bootstrap.

Let $\mu$ be the mean of the population and $\bar{x}$ the mean of the sample; let $\sigma$ be the standard deviation in the population, $s$ the sample standard deviation calculated from the initial data, and $s^*$ the one calculated from a bootstrap sample.

Then the sampling distribution of the quantity $t = \sqrt{n}(\bar{x} - \mu)/s$ will be approximated by the bootstrap distribution of $t^* = \sqrt{n}(\bar{x}^* - \bar{x})/s^*$, where $\bar{x}^*$ is the mean of the bootstrap sample.

Similarly, the sampling distribution of $\sqrt{n}(\bar{x} - \mu)/\sigma$ will be approximated by the bootstrap distribution of $\sqrt{n}(\bar{x}^* - \bar{x})/s$.

The first second-order correctness results were published by Babu and Singh in 1981-83.

Bootstrap applications

Approximation of the standard error of the sample estimate

Suppose that $\theta$ is a parameter defined for the population.

Let $\hat\theta$ be an estimate constructed on the basis of a random sample of size $n$, i.e., a function of $x_1, \ldots, x_n$. Since the sample varies over the set of all possible samples, the following approach is used to estimate the standard error of $\hat\theta$:

We calculate $\hat\theta^*_1, \ldots, \hat\theta^*_B$ using the same formula as for $\hat\theta$, but this time based on $B$ different bootstrap samples, each of size $n$. Roughly speaking, $B$ can be taken comparable to $n^2$, if $n$ is not very large; in that case it can be reduced to about $n \ln n$. The standard error is then determined, in essence, from the very idea of the bootstrap method: the population (sample) is replaced by the empirical population (the sample).
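Written out in the usual notation (the formula is not shown in the source), the bootstrap estimate of the standard error is the standard deviation of the bootstrap replications:

$$\widehat{\mathrm{SE}}_{\mathrm{boot}} = \sqrt{\frac{1}{B-1}\sum_{b=1}^{B}\left(\hat\theta^{*}_{b} - \bar\theta^{*}\right)^{2}}, \qquad \bar\theta^{*} = \frac{1}{B}\sum_{b=1}^{B}\hat\theta^{*}_{b}.$$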

Bias correction using the bootstrap method

The mean of the sampling distribution of $\hat\theta$ often depends on $n$, usually as $\theta + a/n$ for large $n$; the bootstrap approximation of this bias is

$$\widehat{\mathrm{bias}} = \bar\theta^* - \hat\theta, \qquad \bar\theta^* = \frac{1}{B}\sum_{b=1}^{B}\hat\theta^*_b,$$

where $\hat\theta^*_1, \ldots, \hat\theta^*_B$ are the bootstrap copies. The corrected value will then be

$$\hat\theta_{\mathrm{corr}} = 2\hat\theta - \bar\theta^*.$$

It is worth noting that the earlier resampling method, the jackknife, is more popular for this purpose.

Confidence intervals

Confidence intervals (CIs) for a given parameter $\theta$ are sample-based ranges.

Such a range has the property that $\theta$ belongs to it with a very high (predetermined) probability, called the confidence level. Of course, this probability should apply to any of the possible samples, since each sample contributes to determining the confidence interval. The two most commonly used confidence levels are 95% and 99%. Here we confine ourselves to 95%.

Traditionally, CIs depend on the sampling distribution of $\hat\theta - \theta$, or more precisely on its limit. There are two main types of confidence intervals that can be constructed using the bootstrap.

Percentile method

This method was already mentioned in the introduction; it is very popular because of its simplicity and naturalness. Suppose we have 1000 bootstrap copies, denoted $\hat\theta^*_1, \ldots, \hat\theta^*_{1000}$. Then the values lying between the 2.5th and 97.5th percentiles of their empirical distribution fall into the confidence interval. Returning to the theoretical justification of the method, it should be noted that it requires symmetry of the sampling distribution of $\hat\theta - \theta$ around zero. The reason is that the method approximates the sampling distribution of $\hat\theta - \theta$ by the bootstrap distribution of $\hat\theta^* - \hat\theta$, whereas logically it should be approximated by the distribution of the value opposite in sign, $\hat\theta - \hat\theta^*$.
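A minimal sketch of the percentile method, assuming NumPy; `percentile_ci` is a hypothetical helper name and the data are illustrative:

```python
import numpy as np

def percentile_ci(sample, statistic, B=1000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for `statistic`."""
    rng = np.random.default_rng(seed)
    reps = np.array([
        statistic(rng.choice(sample, size=len(sample), replace=True))
        for _ in range(B)
    ])
    alpha = 1 - level
    # The interval is bounded by the empirical percentiles of the replications.
    return np.quantile(reps, [alpha / 2, 1 - alpha / 2])

# Example: 95% interval for the mean of illustrative data.
data = np.array([2.5, 4.0, 6.2, 7.1, 9.3, 12.0, 14.4, 15.2, 17.1, 17.9])
print(percentile_ci(data, np.mean))
```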

Centered bootstrap percentile method

Suppose that the sampling distribution of $\hat\theta - \theta$ is approximated by the bootstrap distribution of $\hat\theta^* - \hat\theta$, as originally assumed in bootstrapping. Denote the 100p-th percentile of the bootstrap copies by $q_p$. Then the assumption that $\hat\theta - \theta$ lies in the range from $q_{2.5} - \hat\theta$ to $q_{97.5} - \hat\theta$ will be correct with probability 95%. This is easily transformed into a statement about $\theta$: with probability 95%, $\theta$ lies in the range from $2\hat\theta - q_{97.5}$ to $2\hat\theta - q_{2.5}$. This interval is called the centered bootstrap percentile confidence interval (at the 95% confidence level).

Bootstrap-t test

As already noted, this method uses a function of the form $t = (\hat\theta - \theta)/\widehat{se}$, where $\widehat{se}$ is a sample estimate of the standard error of $\hat\theta$.

This gives additional precision.

Let us take the standard t-statistic as the basic example (hence the name of the method): $t = \sqrt{n}(\bar{x} - \mu)/s$, that is, the special case where $\theta = \mu$ (the population mean), $\hat\theta = \bar{x}$ (the sample mean), and $s$ is the sample standard deviation. The bootstrap analogue of this function is $t^* = \sqrt{n}(\bar{x}^* - \bar{x})/s^*$, where $\bar{x}^*$ and $s^*$ are calculated in the same way, but from the bootstrap sample.

Let us denote the 100p-th bootstrap percentile of $t^*$ by $q_p$ and assume that the value $\sqrt{n}(\bar{x} - \mu)/s$ lies in the interval $[q_{2.5},\, q_{97.5}]$.

Rewriting this inequality for $\mu$, we obtain that $\mu$ lies in the interval $[\bar{x} - q_{97.5}\,s/\sqrt{n},\ \bar{x} - q_{2.5}\,s/\sqrt{n}]$.

This interval is called the bootstrap-t confidence interval for $\mu$ at the 95% level.

In the literature it is used to achieve greater accuracy than the previous approach provides.
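A sketch of the bootstrap-t interval for the mean along the lines just described, assuming NumPy; `bootstrap_t_ci` is a hypothetical helper and the data are illustrative:

```python
import numpy as np

def bootstrap_t_ci(sample, B=2000, level=0.95, seed=0):
    """Bootstrap-t confidence interval for the population mean."""
    rng = np.random.default_rng(seed)
    n = len(sample)
    mean, se = sample.mean(), sample.std(ddof=1) / np.sqrt(n)

    t_star = np.empty(B)
    for b in range(B):
        bs = rng.choice(sample, size=n, replace=True)
        se_star = bs.std(ddof=1) / np.sqrt(n)
        t_star[b] = (bs.mean() - mean) / se_star   # bootstrap analogue of the t-statistic

    alpha = 1 - level
    q_lo, q_hi = np.quantile(t_star, [alpha / 2, 1 - alpha / 2])
    # Note the reversal of the percentiles when inverting the pivot.
    return mean - q_hi * se, mean - q_lo * se

data = np.array([8.5, 7.1, 4.7, 3.9, 1.9, 14.8, 16.7, 20.6, 24.7, 25.0])
print(bootstrap_t_ci(data))
```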

Real data example

For a first example, take the data from Hollander and Wolfe (1999, p. 63) representing the effect of light on hatch rate.

The standard box plot suggests that the population data are not normal. We performed a bootstrap analysis of the median and of the mean.

It should be noted separately that the bootstrap-t histogram lacks symmetry, differing in this from the standard limit curve. The 95% confidence intervals for the median and the mean (calculated by the bootstrap percentile method) cover roughly the same range.

This range represents the overall difference (increase) in hatch rate depending on lighting.

As a second example, consider the data from Devore (2003, p. 553) on the correlation between biochemical oxygen demand (BOD) and hydrostatic weighing (HW) measurements of professional soccer players.

The two-dimensional data consist of pairs, and whole pairs are selected at random during bootstrap resampling: for example, one pair is drawn first, then another, and so on.

In the figure, the box-and-whisker plot shows the lack of normality in the underlying populations. The correlation histograms computed from the two-dimensional bootstrap data are skewed (shifted to the left).

For this reason, the centered bootstrap percentile method is more appropriate in this case.

The analysis showed that the correlation between the measurements in the population is at least 0.78.

Data for example 1:

8.5 -4.6 -1.8 -0.8 1.9 3.9 4.7 7.1 7.5 8.5 14.8 16.7 17.6 19.7 20.6 21.9 23.8 24.7 24.7 25.0 40.7 46.9 48.3 52.8 54.0

Data for example 2:

2.5 4.0 4.1 6.2 7.1 7.0 8.3 9.2 9.3 12.0 12.2 12.6 14.2 14.4 15.1 15.2 16.3 17.1 17.9 17.9

8.0 6.2 9.2 6.4 8.6 12.2 7.2 12.0 14.9 12.1 15.3 14.8 14.3 16.3 17.9 19.5 17.5 14.3 18.3 16.2

The literature often proposes different bootstrapping schemes that could give reliable results in different statistical situations.

What was discussed above covers only the most basic elements; in reality there are many other variants of the scheme. For example, which method is best for two-stage or stratified sampling?

In such cases it is not difficult to devise a natural scheme. Bootstrapping of regression-model data generally receives a great deal of attention. There are two main methods: in the first, the covariates and the response variable are resampled together (paired bootstrap); in the second, the residuals are resampled (residual bootstrap).

The pairwise method remains correct even if the error variances in the model are unequal; the residual method is incorrect in this case. This disadvantage is compensated by the fact that the residual scheme provides additional precision in estimating the standard error.

It is much more difficult to bootstrap time series data.

Time-series analysis, however, is one of the key areas of econometrics. Two main difficulties can be distinguished here. First, time-series data are serially dependent: the current value depends on the preceding ones, and so on.

Second, the statistical population changes over time, that is, nonstationarity appears.

For such data, methods have been developed that carry the dependence in the initial data over to the bootstrap samples, in particular the block scheme.

Instead of resampling individual observations, the bootstrap sample is built directly from blocks of data that retain the dependence structure of the original sample.
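A minimal sketch of the block idea, assuming NumPy; the moving-block variant and the block length of 5 are assumptions of the illustration, not prescriptions from the text:

```python
import numpy as np

def block_bootstrap(series, block_len=5, seed=0):
    """One moving-block bootstrap replicate of `series`."""
    rng = np.random.default_rng(seed)
    n = len(series)
    n_blocks = int(np.ceil(n / block_len))
    starts = rng.integers(0, n - block_len + 1, size=n_blocks)  # random block starts
    blocks = [series[s:s + block_len] for s in starts]          # contiguous blocks keep dependence
    return np.concatenate(blocks)[:n]

ts = np.cumsum(np.random.default_rng(2).normal(size=100))  # illustrative dependent series
print(block_bootstrap(ts)[:10])
```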

Quite a lot of research is currently being carried out on applying the bootstrap in various areas of econometrics; on the whole, the method is being actively developed.

In the practice of statistical research, one often has to deal with small samples, i.e., samples of fewer than 30 units. Samples of more than 100 units are usually called large.

Typically, small samples are used in cases where it is impossible or impractical to use a large sample. One has to deal with such samples, for example, when polling tourists and hotel visitors.

The magnitude of the error of a small sample is determined by formulas that differ from those used for a relatively large sample size (n > 100).

With a small sample size n, the relationship between the sample variance and the general variance should be taken into account:

$$\sigma^2_{\text{general}} = \sigma^2_{\text{sample}} \cdot \frac{n}{n - 1}$$

Since the fraction n/(n − 1) is substantial for a small sample, the variance is calculated taking into account the so-called number of degrees of freedom, understood as the number of values that can take arbitrary values without changing the value of the mean.

The mean error of a small sample is determined by the formula:

$$\mu = \sqrt{\frac{\sigma^2}{n - 1}}$$

The marginal sampling error for the mean and for the proportion is found in the same way as in the case of a large sample:

$$\Delta = t\mu$$

where t is the confidence coefficient, which depends on the adopted significance level and the number of degrees of freedom (Appendix 5).

The values of the coefficient t depend not only on the given confidence level but also on the sample size n. For individual values of t and n, the confidence level is determined from the Student distribution, which contains the distribution of the standardized deviations:

$$t = \frac{\tilde{x} - \bar{x}}{\mu}$$

Comment. As the sample size increases, the Student distribution approaches the normal distribution: at n = 20 it already differs little from the normal distribution. When conducting small-sample surveys, it should be borne in mind that the smaller the sample size n, the greater the difference between the Student t-distribution and the normal distribution. For example, at n = 4 this difference is very significant, which indicates the reduced accuracy of small-sample results.
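A sketch of such a small-sample confidence interval, assuming SciPy; the data are illustrative:

```python
import numpy as np
from scipy import stats

data = np.array([20.1, 19.4, 22.3, 18.7, 21.0, 20.5, 19.9, 21.8])  # illustrative small sample
n = data.size
mean = data.mean()
# Mean error of the small sample; std(ddof=1)/sqrt(n) is algebraically
# identical to sqrt(sigma^2/(n-1)) with sigma^2 the uncorrected variance.
mu = data.std(ddof=1) / np.sqrt(n)
t = stats.t.ppf(0.975, df=n - 1)   # confidence coefficient for P = 0.95
print(f"mean = {mean:.2f}, limits: [{mean - t * mu:.2f}, {mean + t * mu:.2f}]")
```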

  • 18. The theory of small samples.

    With a large number of sampling units (n> 100), the distribution of random errors in the sample mean in accordance with A.M. Lyapunov's theorem is normal or approaches normal as the number of observations increases.

    However, in the practice of statistical research in a market economy, it is increasingly necessary to deal with small samples.

    A small sample is a sample observation whose number of units does not exceed 30.

    When evaluating the results of a small sample, the size of the general population is not used. To determine the possible error limits, the Student's t test is used.

    The value of σ is calculated on the basis of sample observation data.

    This value is used only for the studied population, and not as an approximate estimate of σ in the general population.

    The probabilistic estimate of the results of a small sample differs from the estimate in a large sample in that, with a small number of observations, the probability distribution for the mean depends on the number of selected units.

    However, for a small sample the value of the confidence coefficient t is related to the probabilistic estimate differently than for a large sample (since the distribution law differs from the normal one).

    According to the distribution law established by Student, the probable error depends both on the value of the confidence coefficient t and on the sample size n.

    The average error of a small sample is calculated by the formula:

$$\mu = \sqrt{\frac{S^2}{n}}$$

    where S² is the variance of the small sample.

    In a small sample, the coefficient n/(n − 1) must be taken into account as a correction. When determining the variance S², the number of degrees of freedom is taken equal to n − 1:

$$S^2 = \frac{\sum (x - \tilde{x})^2}{n - 1}$$

    The limiting error of a small sample is determined by the formula:

$$\Delta = t\mu$$

    In this case, the value of the confidence coefficient t depends not only on the given confidence probability but also on the number of sample units n. For individual values of t and n, the confidence probability of a small sample is determined from special Student tables, which give the distribution of the standardized deviations:

$$t = \frac{\tilde{x} - \bar{x}}{\mu}$$


    19. Methods for selecting units in the sample.

    1. The sample must be large enough in size.

    2. The structure of the sample should reflect the structure of the general population as closely as possible.

    3. The selection method must be random

    Depending on whether selected units participate in further selection, a distinction is made between non-repetitive and repeated selection.

    Non-repetitive selection is selection in which a unit that has entered the sample is not returned to the population from which further selection is carried out.

    Calculation of the mean error of non-repetitive simple random sampling:

$$\mu = \sqrt{\frac{\sigma^2}{n}\left(1 - \frac{n}{N}\right)}$$

    Calculation of the marginal error of non-repetitive random sampling:

$$\Delta = t\mu = t\sqrt{\frac{\sigma^2}{n}\left(1 - \frac{n}{N}\right)}$$

    In repeated selection, a unit that has entered the sample is, after its observed features are recorded, returned to the original (general) population to participate in the further selection procedure.

    The mean error of repeated simple random sampling is calculated as follows:

$$\mu = \sqrt{\frac{\sigma^2}{n}}$$

    Calculation of the marginal error of repeated random sampling:

$$\Delta = t\mu = t\sqrt{\frac{\sigma^2}{n}}$$
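    A sketch of both error formulas in code, assuming the notation above (σ² the variance, n the sample size, N the population size, t the confidence coefficient); the numbers are illustrative:

```python
import math

def mean_error_repeated(var, n):
    """Mean error of repeated (with-replacement) simple random sampling."""
    return math.sqrt(var / n)

def mean_error_nonrepetitive(var, n, N):
    """Mean error of non-repetitive (without-replacement) sampling,
    with the finite-population correction (1 - n/N)."""
    return math.sqrt(var / n * (1 - n / N))

def marginal_error(mu, t=2):
    """Marginal error: the confidence coefficient times the mean error."""
    return t * mu

# Illustrative numbers: variance 25, sample of 100 from a population of 2000.
mu_rep = mean_error_repeated(25, 100)
mu_non = mean_error_nonrepetitive(25, 100, 2000)
print(mu_rep, mu_non, marginal_error(mu_non))
```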

    By the type of formation of the sample population, selection is subdivided into individual, group, and combined.

    The selection method defines the specific mechanism for selecting units from the general population; it is subdivided into properly random, mechanical, typical, serial, and combined selection.

    Properly random selection is the most common method of selection in a random sample. It is also called the lottery method: a ticket with a serial number is prepared for each unit of the statistical population, and the required number of units is then drawn at random. Under these conditions, each unit has the same probability of being included in the sample.

    Mechanical sampling is used when the general population is ordered in some way, that is, there is a certain sequence in the arrangement of its units.

    To determine the average error of mechanical sampling, the formula for the average error of properly random non-repetitive sampling is used.

    Typical selection is used when all units of the general population can be divided into several typical groups. It involves selecting units from each group in a properly random or mechanical way.

    For a typical sample, the value of the standard error depends on the accuracy with which the group means are determined. Thus, the formula for the marginal error of a typical sample takes into account the average of the group (within-group) variances:

$$\Delta = t\sqrt{\frac{\bar{\sigma}^2_i}{n}}$$

    Serial selection is used when the units of the population are combined into small groups or series. The essence of serial sampling is properly random or mechanical selection of series, within which a continuous survey of all units is carried out.

    With serial sampling, the magnitude of the sampling error depends not on the number of units studied but on the number of series examined (s) and on the magnitude of the intergroup variance:

$$\mu = \sqrt{\frac{\delta^2}{s}}$$

    Combined selection can proceed in one or more stages. A sample is called one-stage if the units of the population, once selected, are examined directly.

    A sample is called multistage if the selection of the population passes through successive stages, each stage having its own selection unit.

    "
     


    Read:



    Scholarship of the government of the Russian Federation in priority areas of modernization and technological development of the Russian economy

    Scholarship of the government of the Russian Federation in priority areas of modernization and technological development of the Russian economy

    The presidential scholarship received legislative approval even during the time of the first ruler of Russia B.N. Yeltsin. At that time, she was appointed only to ...

    Help for applicants: how to get a targeted referral to study at a university

    Help for applicants: how to get a targeted referral to study at a university

    Hello dear readers of the blog site. Today I would like to remind or tell applicants about the target direction, its pros and cons ...

    Preparing for an exam for admission to mithi

    Preparing for an exam for admission to mithi

    MEPhI (Moscow Engineering Physics Institute) is one of the first research educational institutions in Russia. For 75 years MEPhI ...

    Online interest calculator

    Online interest calculator

    The built-in math calculator will help you carry out the simplest calculations: multiplication and addition, subtraction, and division ...

    feed-image Rss