Simple data processing

Developer https://www.devze.com 2023-02-09 05:19 Source: Internet


Let's say I have this set of data. After sorting, the distribution can be drawn as shown below.

M=[-99 -99 -44.5 -7.375 -5.5 -1.666666667 -1.333333333 -1.285714286 0.436363636 2.35 3.3 4.285714286 5.052631579 6.2 7.076923077 7.230769231 7.916666667 9.7 10.66666667 16.16666667 17.4 19.2 19.6 20.75 24.25 34.5 49.5]

[Figure: plot of the sorted values of M]

My question is: how do I find the values that lie in the middle range and record their indices? Should I use a normal distribution or something else? I appreciate your help!

Picture for Jonas:

[Figure]

Assuming your mid range is [-10 10] then the indices would be:

> find(-10< M & M< 10)
ans =

    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18

Please note that you can also access the values directly by logical indexing, like:

> M(-10< M & M< 10)
ans =

 Columns 1 through 15:

  -7.37500  -5.50000  -1.66667  -1.33333  and so on ...
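As a small sketch of how the two forms relate (the variable names mask, idx and vals are only illustrative, and the bounds -10 and 10 are the assumed mid range from above): the comparison produces a logical vector, find turns it into positions, and indexing with either one returns the same values.

```matlab
M = [-99 -99 -44.5 -7.375 -5.5 -1.666666667 -1.333333333 -1.285714286 ...
     0.436363636 2.35 3.3 4.285714286 5.052631579 6.2 7.076923077 ...
     7.230769231 7.916666667 9.7 10.66666667 16.16666667 17.4 19.2 19.6 ...
     20.75 24.25 34.5 49.5];

mask = (-10 < M) & (M < 10);   % logical vector, same size as M
idx  = find(mask);             % positions where mask is true
vals = M(mask);                % logical indexing, no find needed

sameValues = isequal(vals, M(idx));   % both ways of indexing agree
```

If only the values are needed, the mask alone is enough; find is only required when the positions themselves have to be recorded.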

And to get your mid range from the data itself, just take the quartiles:

> q= quantile(M(:), [.25 .75])
q =

   -1.3214
   17.0917

> find(q(1)< M & M< q(2))
ans =

    8    9   10   11   12   13   14   15   16   17   18   19   20

Note also that M(:) is used here to ensure that quantile treats M as a vector. You may adopt the convention that all vectors in your programs are column vectors; then most functions automatically treat them correctly.

Update:
A very short description of quantiles: they are points taken from the inverse of the cumulative distribution function (cdf) of a random variable. (Your sorted M, plotted against its index, can be read as an empirical version of that inverse cdf, since it is nondecreasing.) Put simply, the .5 quantile of your data is the value below which about 50% of the values lie. More details on quantiles can be found, for example, here.
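To make the idea concrete, here is a minimal sketch that computes the two quartiles by hand, without calling quantile. It assumes the interpolation convention that MATLAB documents for quantile (the k-th sorted value is treated as the (k-0.5)/n quantile, with linear interpolation in between); the variable names s, pos, lo and hi are only illustrative.

```matlab
M = [-99 -99 -44.5 -7.375 -5.5 -1.666666667 -1.333333333 -1.285714286 ...
     0.436363636 2.35 3.3 4.285714286 5.052631579 6.2 7.076923077 ...
     7.230769231 7.916666667 9.7 10.66666667 16.16666667 17.4 19.2 19.6 ...
     20.75 24.25 34.5 49.5];

s = sort(M(:));                      % sorted column vector
n = numel(s);
p = [0.25 0.75];                     % requested probabilities

% the k-th sorted value sits at probability (k-0.5)/n, so probability p
% corresponds to the fractional position n*p + 0.5 (clamped to [1, n])
pos = min(max(n*p + 0.5, 1), n);
lo  = floor(pos);
hi  = ceil(pos);

% linear interpolation between the two neighbouring sorted values
q = s(lo) + (pos(:) - lo(:)) .* (s(hi) - s(lo));

% indices of the values strictly between the two quartiles
idx = find(q(1) < M & M < q(2));
```

For the M above this reproduces the values shown earlier (about -1.3214 and 17.0917) and the indices 8 through 20.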

If you don't know a priori what your middle range is, but you know that you want to discard the outliers at both the start and the end of your curve, and you have the Statistics Toolbox, you can fit a robust linear regression to your data using ROBUSTFIT and keep only the inliers.

M=[-99 -99 -44.5 -7.375 -5.5 -1.666666667 -1.333333333 -1.285714286 0.436363636 2.35 3.3 4.285714286 5.052631579 6.2 7.076923077 7.230769231 7.916666667 9.7 10.66666667 16.16666667 17.4 19.2 19.6 20.75 24.25 34.5 49.5];

%# robust linear regression
x = find(isfinite(M)); %# eliminate NaN or Inf
[u,s]=robustfit(x,M(x));

%# inliers have a weight > 0.25 (raise this value to be stricter)
inlierIdx = s.w > 0.25;
middleRangeX = x(inlierIdx)
middleRangeValues = M(x(inlierIdx))

%# plot with the regression in red and the good values in green
plot(x,M(x),'-b.',x,u(1)+u(2)*x,'r')
hold on,plot(middleRangeX,middleRangeValues,'*g')

[Figure: the data in blue, the robust regression line in red, and the inliers in green]

middleRangeX =
  Columns 1 through 21
     4     5     6     7     8     9    10    11    12    13    14    15    16    17    18    19    20    21    22    23    24
  Column 22
    25
middleRangeValues =
  Columns 1 through 10
       -7.375         -5.5      -1.6667      -1.3333      -1.2857      0.43636         2.35          3.3       4.2857       5.0526
  Columns 11 through 20
          6.2       7.0769       7.2308       7.9167          9.7       10.667       16.167         17.4         19.2         19.6
  Columns 21 through 22
        20.75        24.25
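If the Statistics Toolbox is not available, a rough alternative is a cutoff based on the median absolute deviation (MAD). This is a different technique from the robust fit above: it measures how far each value is from the median rather than from a fitted line, so it need not select the same points. The cutoff factor k = 3 and the variable names are assumptions chosen for illustration; 1.4826 is the usual factor that makes the MAD comparable to a standard deviation for normally distributed data.

```matlab
M = [-99 -99 -44.5 -7.375 -5.5 -1.666666667 -1.333333333 -1.285714286 ...
     0.436363636 2.35 3.3 4.285714286 5.052631579 6.2 7.076923077 ...
     7.230769231 7.916666667 9.7 10.66666667 16.16666667 17.4 19.2 19.6 ...
     20.75 24.25 34.5 49.5];

med       = median(M);                        % robust centre of the data
madScaled = 1.4826 * median(abs(M - med));    % scaled median absolute deviation
k         = 3;                                % assumed cutoff; lower it to be stricter

% keep the values within k scaled MADs of the median
inlierIdx    = find(abs(M - med) <= k * madScaled);
inlierValues = M(inlierIdx);
```

For the M above this keeps indices 4 through 26, that is, it drops the two -99 values, -44.5 and 49.5.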