开发者

What can cause a program to run much faster the second time?

开发者 https://www.devze.com 2023-04-08 00:55 出处:网络
Something I\'ve noticed when testing code I write is that long-running operations tend to run much longer the first time a program is run than on subsequent runs, sometimes b开发者_如何学运维y a facto

Something I've noticed when testing code I write is that long-running operations tend to run much longer the first time a program is run than on subsequent runs, sometimes b开发者_如何学运维y a factor of 10 or more. Obviously there's some sort of cold cache/warm cache issue here, but I can't seem to figure out what it is.

It's not the CPU cache, since these long-running operations tend to be loops that I feed a lot of data to, and they should be fully loaded after the first iteration. (Plus, unloading and reloading the program should clear the cache.)

Also, it's not the disc cache. I've ruled that out by loading all data from disc up-front and processing it afterwards, and it's the actual CPU-bound data processing that's going slowly.

So what can cause my program to run slow the first time I run it, but then if I close it and run it again, it runs dramatically faster? I've seen this in several different programs that do very different things, so it seems to be a general issue.

EDIT: For clarification, I'm writing in Delphi, though I don't really think this is a Delphi-specific issue. But that means that whatever the problem is, it's not related to JIT issues, garbage collection issues, or any of the other baggage that managed code brings with it. And I'm not dealing with network connections. This is pure CPU-bound processing.

One example: a script compiler. It runs like this:

  • Load entire file into memory from disc
  • Lex the entire file into a queue of tokens
  • Parse the queue into a tree
  • Run codegen on the tree to produce bytecode

If I feed it an enormous script file (~100k lines,) after loading the entire thing from disc into memory, the lex step takes about 15 seconds the first time I run, and 2 seconds on subsequent runs. (And yes, I know that's still a long time. I'm working on that...) I'd like to know where that slowdown is coming from and what I can do about it.


Three things to try:

  • Run it in a sampling profiler, including a "cold" run (first thing after a reboot). Should usually be enough.
  • Check memory usage, does it grow so high (even transiently) the OS would have to swap things out of RAM to make room for your app? That alone could be an explanation for what you're seeing. Also look at the amount of free RAM you have when you start your app.
  • Enable system performance tools and check the I/O counters or file accesses, and make sure under FileMon / Process Explorer that you don't have some file or network accesses you've forgotten about (leftover log/test code)


Even (especially) for very small command-line program, the issue can be the time it takes to load the process, link to dynamically-linked libraries etc. I believe modern operating systems avoid repeating a lot of this work if the same program is run twice at once, or repeatedly.

I wouldn't dismiss CPU cache so easily, as well. Level 0 cache is very relevant for inner loops, but much less so for a second run of the same application. On my cheap Athlon 2 X4 645 system, there's 64K + 64K (data + instruction) level 0 cache per core - not exactly a huge amount of memory. Level 1 cache is IIRC 512K per core, so less likely to be dirtied to complete irrelevance by the O/S code needed to start up a new run of the program, calls to operating system services and standard libraries, etc. Level 2 cache (on CPUs that have it - my Athlon 2 doesn't, IIRC) is larger still, and there may be some even higher level and larger cache provided by the motherboard/chipset.

There's at least one other kind of cache - branch prediction tables. Though I'd have thought they'd be dirtied to irrelevance even quicker than the level 0 cache.

I generally find that unit test programs run many times slower the first time. However, the larger and more complex the program, the less significant the effect.

For some time now, performance of applications has often been considered non-deterministic. Although it isn't strictly true, the performance is determined by so many hard-to-predict factors that it's a good model. For example, if the CPU is a bit warm, the clock speed may be reduced to prevent overheating. And the temperature varies at different parts of the chip, with changes conducting across the chip in complex ways. As changes in clock speed and the different demands of different pieces of code alter the patterns of changing temperature, there's a clear potential for chaotic (as in chaos theory) behaviour.

On some platforms, I wouldn't be surprised if the first run of the program got the processor to run if it's "fast" (rather than cool/quiet) mode, and that meant that the beginning of the second run benefitted from that speed boost as well as the end. However, this would be a tricky one - it would have to be a CPU-intensive program, and if your cooling is inadequate, the processor may then slow down again to avoid overheating.


I'd guess it's all your libraries/DLLs. These are usually loaded on-demand at run-time, so the first time your program runs the OS will have to read them all from disk. Once read, though, they'll stay loaded unless your system starts running low on memory. So if you run the same program several times in succession, the first run takes the brunt of the load time, and the other runs benefit from the pre-loaded libraries.


I usually experienced the contrary: for computation intensitive work (if anti virus is not working), I only have a 5-10% diff between calls. For instance, the 6,000,000 regression tests run for our framework have a very constant time of running, and it's very disk and CPU intensive work.

I really don't believe of a CPU cache or pipelining / branch prediction issue either, since both processed data and code seem to be consistent, as you wrote. If anti virus is off, it may be about OS thread settings: did you try to change the process CPU affinity and priority?

This should be very specific to the process you are running. Without any actual source code to reproduce it, it's almost impossible to tell what's happening with you. How many threads are there? What is the HW configuration (isn't there any Intel CPU boost there - are you using a laptop, and what are your energy settings)? Is it using CPU/FPU/MMX/SSE2 (e.g. MMX and FPU do not mix)? Does it move a lot of data, or process some existing data? Does your SW depends on external libraries (even some Windows libraries may need some time to initialize)? How do you use memory (did you try to pre-allocate the memory; or on a multi-threaded application, did you try using a scaling MM instead of FastMM4)?

I think using a sample profiler may not help so much, since it will change the general CPU core use, but it's worth trying in all cases. I'd better rely on logging profiling - see e.g. this class or you may write your own timestamps to find where the timing changes in your app.

AFAIK it has always been written that, when benchmarking, the first run of an application shall never be taken in account. Computer systems are so complex nowadays, that the first time, all the internal (SW and HW) plumbing is to be purged - so you shall not drink the first water coming out of your tap when you come back from 1 month of travel. ;)


Other factors I can think of would be memory-alignment (and the subsequent cache line fills), but say there are 2 types : perfect alignment (being fastest) and imperfect (being slower), one would expect it to occur irregularly (depending on how memory is laid out).

Perhaps it has something to do with physical page layout? As far as I know, each memory-access goes through the MMU page table entries, so dispersed physical pages could be slower than consecutive pages. (Just a wild guess, this one)

Another thing I haven't seen mentioned yet, is on which core(s) your process is running - especially on hyper-threaded CPU's, running on the slower of the two cores might have a negative impact. Try setting the processor affinity mask on one and the same core for every run, and see if that impacts the measured runtime differences between first and subsequent runs.

By the way - how do you define 'first run'? Could it be that you've just compiled the executable? In that case (and I'm just guessing again here), some process (either the OS, a virus-scanner, or even some root-kit) might be busy analyzing your executable's behaviour, which might be skipped once the executable has been analyzed before. You could try to prove that by changing some random unimportant byte of your executable between runs, and see if that impacts the runtime negatively again?

Please post a summary once you figured out the cause(s) - this might help others too. Cheers!


Just a random guess...

Does your processor support adaptive frequency? Maybe it's just the processor that doesn't have time to adapt its frequency on the first run, and is running full speed on second one.


There are lots of things that can cause this. Just as one example: if you're using ADO.NET for data access with connection pooling turned on (which is the default), the first time your application runs it will take the hit of creating the database connection. When your app is closed, the connection is maintained in its open state by ADO.NET, so the next time your app runs and does data access it will not have to take the hit of instantiating the connection, and thus will appear faster.


Guessing your using .net if im wrong you could ignore most of my ideas...

Connection pooling, JIT compilation, reflection, IO Caching the list goes on and on....

Try testing smaller portions of the code to see what parts change performance the most...

You could try ngen'ing your assemblies as this removes the JIT compilation.


where that slowdown is coming from and what I can do about it.

I would speak about quick execution the next times can from from performance caching

  • Disk internal cache (8MB or more)
  • Windows applicationDependencies (as DLL)/Core cache
  • CPU cache L3 (or L2 if some programming loop are small enough)

So you see that the first time you do not benefits from these caching systems.

0

精彩评论

暂无评论...
验证码 换一张
取 消

关注公众号