Timing your program

How long does it take to execute the program

Timing is always a point of attention in embedded programs. If you are running a 8051 on 1 Mips or if it's the LPC2106 running alomst 60 Mips, at a moment in time you will meet its limitations. Measuring time critical parts of your code is one way to make sure that the application will not run into timing problems and knowing where timing problems arise may help solve them.

This article will show you how to measure the timing in an interrupt routine or a running application on the LPC2106 QuickStart Board from Embedded Artists.

Note: this article will assume the use of the EmbeddedArtists Infrabed build environment. Some changes to this environment have been made to make it possible to run code from RAM using the lpc21isp program I adapter to allow RAM downloads.

Using timers

There are different ways to measure program timing, on the ARM processor series a specific ETM (Embedded Trace Module) is available that can be used together with a special debugger pod. You can also set/reset a GPIO pin and measure timing using an oscilloscope or a logic analyzer.
But it is also possible to perform these measurements using an internal timer with one drawback; the timer has to be availalbe for measurements.

The LPC21xx has 2 timers available with one of these timers being used as timer tick for your favorite Real Time OS. When the OS uses a free running timer with an interrupt on a Match without reset (see the article on porting microC/OS II) this timer can also be used for micro-second acurate delays and time measurements.
In simpler test applications it is even possible to measure timing with an even higher resolution by letting the timer run on full processor speed. As a first example I will show how to setup a timer on full processor speed and measure the execution time of a specific progam section or to measure interrupt latencies.

The timer (as all peripherals in the LPC21xx) uses the PCLK - or Peripheral bus Clock - as its clock source. The clock frequency is depending on crystal frequency, PLL settings and the VPB divider settings. The actual processor clock is determined by the formula: CCLK = FOSC * PLL_MUL and this clock is then used by the VPD divider giving: PCLK = FOSC * PLL_MUL / PBSD. In the startup code from Embedded Artists the PLL_MUL and PBSD are both set to 4 resulting in a 14.7456 MHz PCLK (1/4 of CCLK), to perform more exact timing I change the value for PBSD into 1, resulting in a counter resolution that equals the processor clock. In this way a counter value of 10 means 10 CPU clock cycles.

Example 1: Measuring execution time of a piece of code

Before any measurements can be performed the timer has to be initialized, it should be setup as a free-running timer:

    TIMER0_TCR = 2; // Stop and reset timer
    TIMER0_PR  = 0; // increment TC every PCLK
    TIMER0_MCR = 0; // Do not stop, reset or interrupt on any Match

    TIMER0_TCR = 1; // Start timer

The first line sets bit 1 of the TCR , this will reset both the TimerCounter and PrescaleCounter. On the last line, resetting bit 1 while setting bit 0 will start the timer.

To measure the time, the TIMER0_TC register has to be read. Care should be taken, the C compiler may optimize the program size or speed by reshuffling the order of instructions. For example:

    start = TIMER0_TC;
    temp  = function();
    end   = TIMER0_TC;

May result in the call to function() being placed after or before the assignments and after compilation the code may be executed as:

    temp  = function();
    start = TIMER0_TC;
    end   = TIMER0_TC;

This is a valid optimization, since the function does not rely on the value of start (or end) the functioncall may be performed at a different moment in time. In this situation this optimization will be performed by the compiler because it results in compacter (and faster) code. Therefor the reading of the timer will be done in a function getTimer(); the order in which functioncalls are executed will never be reshuffled by the compiler.

In the main() function the measurements are placed, first I will start with the easiest possible measurement: Initialize the timer and read it twice without doing anything else, this will show how many time it takes to perform the measurement itself:

    initTimer();
    start = getTimer();
    end   = getTimer();

After execution of this code, we print the contents of these variables: start = 15 and end = 33. This shows that it takes 18 cycle from starting the timer up to the point where the time is read. Each cycle takes 16.9 ns to complete (with a14.7456 MHz crystal with a PLL mutliplier of 4) so this is 305 ns.

Next step is to do something in between these two lines. Just to get the hang of this, let's assign a value to a variable. To make sure this variable is stored in memory, define it as being volatile - this will prevent the compiler from optimizing it as a register variable or optimizing it away since it is never read:

    volatile tU32 temp;


    start = getTimer();
    temp  = 1;
    end   = getTimer();

The test program shows values for the timing variables like: start = 204773 and end = 204794. So it takes only 3 cycles to execute the code for assigning a value to a variable, but this does not mean that the value is indeed placed in memory at the same time - memory operations may be affected by write-back or read-ahead buffers.
In fact, this example is a bit to short to put real figures on but it does show that some time is taken by the code that is executed.
What we can see, is the amount of time taken by the printf() that can be found between the previous measurement and this one: end=33 and start = 204773 gives a time of: (start - end) * 16.9 ns = 3.4 ms. This shows you should not use the printf() in time critical functions, timing data should therefor be stored and only printed in non critical code.

Now let's do some real stuff and call a function:

    start = getTimer();
    temp2 = function(string);
    temp  = getTimer();
    temp2 = strlen(string);
    end   = getTimer();

Function() just calculates the string length (as does strlen). Here we get values of 327 cycles (5.5 µs) for function() and 181 (3 µs) for strlen() on the same string.

For the examples above this sounds a bit too much. Why bother if a function takes 3 or 5.5 µs?
But things do matter when you start to control I/O devices that need input within a certain amount of time - or the other way around, if there must be a minimum delay. Then we also may execute code within an interrupt function which may be a recurring event (e.g. collecting data from an A/D converter). The system has to complete it's interrupt function before the next interrupt occurs otherwise data will be lost.

RAM or Flash: does it differ where my code resides?

RAM and Flash memory cells both have different ways to access the data. Flash memories are most of the times slower than RAM, but in the LPC21xx controllers the Flash memory is 128 bits wide. This makes flash look faster than on other controllers since now 4 instruction words are read at the same time. It is however still worthwhile to investigate what happens when the code from the example above runs from Flash:

The function() and strlen() examples show 345 and 199 cycles when run from RAM and 364 plus 226 when from Flash.
So flash is still a bit slower (5-13% in this example).

Using THUMB code

Thumb mode has been discussed in the Managing code size article, but what does it do for program speed?

Since each instruction takes (at least) 1 clock cycle to complete and Thumb uses more instructions, your program will take more time than in ARM (32 bits) mode. The function() and strlen() examples now take 386 and 207 cycles (5-11% slower than 32 bits code) when run from RAM.
From Flash there is an even larger penalty, the same example now takes 11-13% more time to execute.

What if ...

What to do if your program is taking too much time?
That is a bit difficult to say, first you have to analyze your code by checking the critical sections by review: Looping constructions using a for() , do while() or while() loop are to be examined carefully.
Are you waiting for an I/O event to occur while in your critical code section (i.e. waiting for a buffer or waiting for an I2C or SPI transaction to complete? If that is the case, it may be easy to remove this action from the critical section.
Hidden time spenders are functions to calculate CRC or checksum or rearrange data, get these out of your critical section also.
Then, if you see no other way of shrinking the time your program takes, try to move your code to RAM.

Hidden users of the sparse CPU time may be the interrupt routine that is not related to the code you are looking at right now. All these interrupts may consume almost all time leaving no room for the rest of your application.

We will have a closer look at timing interrupt routines in a follow-up of this article.

Here is the code

Time to get your hands dirty. Grab the code, compile, change the code and see what happens.

The example