
Can managed code impact instruction level parallelism?

Source: https://www.devze.com, 2023-04-11 02:07 (via the web)

Is there any way I can impact Instruction Level Parallelism writing C# code? In other words, is there a way I can "help" the compiler produce code that best makes use of ILP? I ask this because I'm trying to abstract away from a few concepts of machine architecture, and I need to know if this is possible. If not, then I will be warranted to abstract away from ILP.

EDIT: you will notice that I do not want to exploit ILP from C# in any way. My question is exactly the opposite. Paraphrasing: "I hope there's no way to exploit ILP from C#."

Thanks.


ILP is a feature of the CPU. You have no way to control it. Compilers try their best to exploit it by breaking dependency chains.

This may include the .NET JIT compiler, although I have no evidence that it does so.


You are at the mercy of the JIT when it comes to instruction-level parallelism; who knows which optimisations the JIT actually performs? I would choose another language, such as C++, if I really needed this level of control.

To best exploit ILP you need to break dependency chains, and that advice should still apply in C#. See this thread.
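To make "breaking dependency chains" concrete, here is a minimal sketch of the classic multiple-accumulator pattern. It is written in Java rather than C# so it can be run as-is, but the syntax and the idea carry over directly; the class and method names are mine, not from the thread:

```java
import java.util.Arrays;

// A single accumulator forms one long dependency chain: each add must wait
// for the previous one. Two independent accumulators give the CPU two chains
// it can execute in parallel, even though both methods compute the same sum.
public class IlpSum {
    static double sumSingle(double[] a) {
        double acc = 0.0;
        for (int i = 0; i < a.length; i++) {
            acc += a[i]; // depends on the previous iteration's result
        }
        return acc;
    }

    static double sumDual(double[] a) {
        double acc1 = 0.0, acc2 = 0.0;
        // Two independent chains; assumes an even length for brevity.
        for (int i = 0; i < a.length; i += 2) {
            acc1 += a[i];
            acc2 += a[i + 1];
        }
        return acc1 + acc2;
    }

    public static void main(String[] args) {
        double[] data = new double[1_000_000];
        Arrays.fill(data, 1.0);
        System.out.println(sumSingle(data)); // 1000000.0
        System.out.println(sumDual(data));   // 1000000.0
    }
}
```

Whether the dual-accumulator version is actually faster depends on the hardware and on what the JIT does with the loop, which is exactly the uncertainty the answers below describe.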

However, with all the abstraction involved, I doubt it is still possible to exploit this effectively in all but the most extreme cases. What examples do you have where you need this?


There is NO explicit or direct way to influence or hint to the .NET compiler, in IL or C#, to do this. This is entirely the compiler's job.

The only influence you can have is to structure your program so that the compiler is more likely (although not guaranteed) to do this for you, and it would be difficult to know whether it even acted on that structure. This is well abstracted away from the .NET languages and IL.


You are able to exploit ILP from code targeting the CLI. So the short answer is no.

A bit longer:

I once wrote code for a simple image-processing task and used this kind of optimization to make my code a "bit" faster.

A "short" example:

using System;
using System.Diagnostics;
using System.Linq;

static void Main(string[] args)
{
  const int ITERATION_NUMBER = 100;

  TimeSpan[] normal = new TimeSpan[ITERATION_NUMBER];
  TimeSpan[] ilp = new TimeSpan[ITERATION_NUMBER];

  int SIZE = 4000000;
  float[] data = new float[SIZE];
  float safe = 0.0f;

  //Normal for
  Stopwatch sw = new Stopwatch();

  for (int iteration = 0; iteration < ITERATION_NUMBER; iteration++)
  {
    //Initialization
    for (int i = 0; i < data.Length; i++)
    {
      data[i] = 1.0f;
    }

    sw.Start();
    for (int index = 0; index < data.Length; index++)
    {
      data[index] /= 3.0f * data[index] > 2.0f / data[index] ? 2.0f / data[index] : 3.0f * data[index];
    }
    sw.Stop();
    normal[iteration] = sw.Elapsed;

    safe = data[0];

    //Initialization
    for (int i = 0; i < data.Length; i++)
    {
      data[i] = 1.0f;
    }

    sw.Reset();

    //ILP For
    sw.Start();
    float ac1, ac2, ac3, ac4;
    int length = data.Length / 4;
    for (int i = 0; i < length; i++)
    {
      int index0 = i << 2;

      int index1 = index0;
      int index2 = index0 + 1;
      int index3 = index0 + 2;
      int index4 = index0 + 3;

      ac1 = 3.0f * data[index1] > 2.0f / data[index1] ? 2.0f / data[index1] : 3.0f * data[index1];
      ac2 = 3.0f * data[index2] > 2.0f / data[index2] ? 2.0f / data[index2] : 3.0f * data[index2];
      ac3 = 3.0f * data[index3] > 2.0f / data[index3] ? 2.0f / data[index3] : 3.0f * data[index3];
      ac4 = 3.0f * data[index4] > 2.0f / data[index4] ? 2.0f / data[index4] : 3.0f * data[index4];

      data[index1] /= ac1;
      data[index2] /= ac2;
      data[index3] /= ac3;
      data[index4] /= ac4;
    }
    sw.Stop();
    ilp[iteration] = sw.Elapsed;

    sw.Reset();
  }
  Console.WriteLine(data.All(item => item == data[0]));
  Console.WriteLine(data[0] == safe);
  Console.WriteLine();

  double normalElapsed = normal.Max(time => time.TotalMilliseconds);
  Console.WriteLine(String.Format("Normal Max.: {0}", normalElapsed));
  double ilpElapsed = ilp.Max(time => time.TotalMilliseconds);
  Console.WriteLine(String.Format("ILP    Max.: {0}", ilpElapsed));
  Console.WriteLine();
  normalElapsed = normal.Average(time => time.TotalMilliseconds);
  Console.WriteLine(String.Format("Normal Avg.: {0}", normalElapsed));
  ilpElapsed = ilp.Average(time => time.TotalMilliseconds);
  Console.WriteLine(String.Format("ILP    Avg.: {0}", ilpElapsed));
  Console.WriteLine();
  normalElapsed = normal.Min(time => time.TotalMilliseconds);
  Console.WriteLine(String.Format("Normal Min.: {0}", normalElapsed));
  ilpElapsed = ilp.Min(time => time.TotalMilliseconds);
  Console.WriteLine(String.Format("ILP    Min.: {0}", ilpElapsed));
}

Results are (.NET Framework 4.0 Client Profile, Release build):

On a Virtual Machine (I think with no ILP):

True
True

Normal Max.: 111,1894
ILP    Max.: 106,886

Normal Avg.: 78,163619
ILP    Avg.: 77,682513

Normal Min.: 58,3035
ILP    Min.: 56,7672

On a Xeon:

True
True

Normal Max.: 40,5892
ILP    Max.: 30,8906

Normal Avg.: 35,637308
ILP    Avg.: 25,45341

Normal Min.: 34,4247
ILP    Min.: 23,7888

Explanation of Results:

In Debug builds the compiler applies no optimization, but the second for loop is structured more favourably than the first, so there is a significant difference.

The answer seems to lie in the results from assemblies built in Release mode. The IL compiler/JIT does its best to minimize performance cost (I believe this may even include ILP). Still, if you write code like the second for loop, you can get better results in special cases, and the second loop can outperform the first on some architectures. But

You are at the mercy of the JIT

as mentioned, sadly. It is a pity the specification never mentions that an implementation may apply further optimizations such as ILP (a short paragraph could be added). But the specification cannot enumerate every form of architectural code optimization, and the CLI sits at a higher level:

This is well abstracted away from the .NET languages and IL.

This is a very complex problem that can only be answered experimentally, and I don't think we can get a much more precise answer this way. I also think the question is misleading, because the answer does not depend on C#: it depends on the implementation of the CLI.

There are many influencing factors, which makes it hard to answer a question like this correctly as long as we treat the JIT as a black box.

I found material about loop vectorization and auto-threading on pages 512-513 of the CLI specification: http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-335.pdf

I don't think the specification states explicitly how the JIT must behave in cases like this, so implementers can choose how to optimize. So I think you can have an impact: if you write more optimal code, the JIT will try to use ILP where it is possible and implemented.

And because the specification does not pin this down, the possibility exists.

So the answer seems to be no: I believe you cannot abstract away from ILP in the case of the CLI, since the specification does not rule it out.

Update:

I had seen a blog post about this before but only found it again now: http://igoro.com/archive/gallery-of-processor-cache-effects/ Example four contains a short but proper answer to your question.
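Example four of that post contrasts a loop that increments one variable twice (a single dependency chain) with a loop that increments two different variables (two chains the CPU can overlap). A sketch of the same experiment follows, in Java rather than C# so it is runnable as-is; the class and method names are mine, and the timings will vary by machine and JIT:

```java
// Two loops doing the same number of increments: one serializes on a single
// variable, the other splits the work across two independent variables.
public class Ex4 {
    static volatile int sinkA, sinkB; // volatile so the JIT cannot discard the work

    // Returns elapsed nanoseconds for `steps` iterations of the chosen loop.
    static long run(boolean independent, int steps) {
        int a = 0, b = 0;
        long t0 = System.nanoTime();
        if (independent) {
            for (int i = 0; i < steps; i++) { a++; b++; } // two chains
        } else {
            for (int i = 0; i < steps; i++) { a++; a++; } // one chain
        }
        sinkA = a;
        sinkB = b;
        return System.nanoTime() - t0;
    }

    public static void main(String[] args) {
        int steps = 100_000_000;
        run(false, steps); // warm up the JIT
        run(true, steps);
        System.out.println("dependent:   " + run(false, steps) / 1_000_000 + " ms");
        System.out.println("independent: " + run(true, steps) / 1_000_000 + " ms");
    }
}
```

On hardware with enough execution units the independent version tends to be faster, which is the effect the blog post measures; no particular ratio is guaranteed.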

