Friday Q&A 2016-04-15: Performance Comparison of Common Operations, 2016 Edition
Back in the mists of time, before Friday Q&A was a thing, I posted some articles that ran performance tests on common operations and discussed the results. The last one was from 2008, running on 10.5 and the original iPhone OS, so it's long past time for an update.
If you want to compare with the releases of yesteryear, here are links to the previous articles:
(Note that Apple's mobile OS wasn't named "iOS" until 2010.)

Overview
Performance numbers can be dangerous. Benchmarks are usually highly artificial unless you have a specific program with a real-world workload you can test. Tests like these are definitely artificial, and the results may not reflect how things actually perform in your own programs. The idea is just to give you a rough sense of the orders of magnitude involved, not exact numbers for everything.
It's particularly difficult to measure extremely fast operations, such as an Objective-C message send or a simple arithmetic operation. Modern CPUs are heavily pipelined and parallel, and the time such an operation takes in isolation may not match the time it takes in the context of a real program. Adding one of these operations to the middle of other code may not increase that code's runtime at all, if it's sufficiently independent that the CPU can run it in parallel. On the other hand, it can increase the runtime a lot if it ties up important resources.
Performance also depends on external factors. Many modern CPUs run faster when they're cool and throttle down as they heat up. File system performance depends on the storage hardware and the state of the file system. Even relative performance can vary.
If something is performance-critical, you should always measure and profile it so you can see exactly what takes time in your code and know where to concentrate your efforts. It can and will surprise you to find out what's actually slow in your working code.
All that said, it's still really useful to have a rough idea of how fast different things are relative to each other. It's worth a bit of effort to avoid writing tons of data to the file system if you don't need to. It's probably not worth any effort to avoid a single message send. In between, it depends.
The code used for these tests is available at GitHub:
The code is written in Objective-C++, with the core performance measurement code written in C. I don't yet have a good enough handle on how Swift performs to feel like I could do a good job with this in Swift.
The basic technique is simple: run the operation in question in a loop for a few seconds, then divide the total running time by the number of loop iterations to get the time per operation. The number of iterations is hard-coded, and I chose the numbers experimentally so that each test runs for a reasonable amount of time.
I try to account for the overhead of the loop itself. This overhead is completely insignificant for the slower operations, but is substantial for the faster ones. To do this, I time an empty loop, then subtract that time per iteration from the times measured for the other tests.
For some tests, the loop overhead dominates the time spent in the test code itself. This produces impossibly low times for those tests, and the results are bogus. To compensate, all of the fast operations are manually unrolled so that a single loop iteration performs the test ten times, which I hope produces more realistic results.
The tests are compiled and run without optimizations. This is contrary to what we'd normally do in the real world, but I think it's the best choice here. For operations that mostly depend on external code, like working with files or decoding JSON, it makes little difference. For short operations like arithmetic or method calls, it's difficult to write a test that isn't simply optimized away entirely, since the compiler realizes the test doesn't do anything externally visible. Optimizations would also change how the loop itself is compiled, making it hard to account for loop overhead.
The Mac tests were run on my 2013 Mac Pro.
Here are the Mac results. Each test shows what was tested, how many iterations the test ran, the total time it took to run the test, and the per-operation time.
The first thing that stands out in this table is the first entry: a 16-byte memcpy takes less than one nanosecond per call. Looking at the generated code, the compiler is smart enough to turn the call to memcpy into a sequence of mov instructions, even with optimizations off. This is an interesting lesson: just because you write a function call doesn't mean the compiler has to generate one.
A C++ virtual method call and an Objective-C message send with a cached IMP both take about the same time. They're essentially the same operation: an indirect function call through a function pointer.
A regular Objective-C message send is a bit slower, as we'd expect. Still, the speed of objc_msgSend continues to amaze me. Considering that it performs a full hash table lookup followed by an indirect jump to the result, the fact that it runs in about 2.6 nanoseconds is astounding. That's about 9 CPU cycles. Back in the 10.5 days it was a dozen or more, so we've seen a nice improvement. To flip this number around, you can do about 400 million of them per second on this computer.
Using NSInvocation to call a method is much slower, as expected. NSInvocation has to construct the message at runtime, doing the work the compiler does at compile time, for every call. Fortunately, NSInvocation is rarely a bottleneck in real programs. It appears to have slowed down since 10.5, with an NSInvocation call taking about twice as much time in this test compared to the old one, even though this test is running on faster hardware.
A retain and release pair together take about 23 nanoseconds. Changing an object's reference count must be thread-safe, so it requires an atomic operation, which is relatively expensive when we're down at the level where individual CPU cycles count.
Autorelease pools have gotten quite a bit faster than they used to be. In the old test, it took well over 300ns to create and destroy an autorelease pool. Here it shows up at about 25ns. The implementation of autorelease pools has been completely redone, and the new implementation is much faster, so this is no surprise. Pools used to be instances of the NSAutoreleasePool class, but now they're handled with runtime functions that just do some pointer manipulation. At 25ns, you can afford to put @autoreleasepool anywhere you suspect you might accumulate some autoreleased objects.
Allocating and freeing 16 bytes costs about the same as before, but larger allocations have gotten significantly faster. Allocating and freeing 16MB took 4.5 microseconds back in the day, but only takes around 300 nanoseconds here. Typical apps make tons of memory allocations, so this is a great improvement.
Objective-C object creation also got a nice speedup, from almost 300ns to about 100ns. The typical app obviously creates and destroys a lot of Objective-C objects, so this is really useful. On the flip side, consider that you can send an existing object about 40 messages in the time it takes to create and destroy a new object, so it's still a significantly more expensive operation, especially considering that most objects will take more time to create and destroy than a simple NSObject instance does.
The dispatch_queue tests show an interesting contrast between the different operations. A dispatch_sync on an uncontended queue is extremely fast, under 30ns. GCD is smart and doesn't do any cross-thread calls for this case, so it ends up just acquiring and then releasing a lock. dispatch_async takes a lot longer, since it has to find a worker thread to use, wake it up, and get the call over to it. Creating and destroying a dispatch_queue is pretty cheap, with a time comparable to creating an Objective-C object. GCD is able to share all the heavyweight stuff, so individual queues don't contain much.
I added tests for JSON and property list serialization and deserialization, which I didn't test last time. With the rise of the iPhone, these things became much more prominent. These tests encode or decode a simple three-element dictionary. As expected, they're relatively slow compared to simple, low-level things like message sends, but they still come in at microseconds. It's interesting that JSON outperforms property lists, even binary property lists, which I expected to be the fastest. This may be because JSON sees more use and so gets more attention, or it may just be that the JSON format is actually faster to parse. Or it may be that testing with a three-element dictionary isn't realistic, and the relative speeds would look different with something bigger.
A delayed perform with a delay of zero comes in fairly heavy, at roughly twice the cost of a dispatch_async. Runloops have a lot of work to do, it seems.
Creating a pthread and then waiting for it to terminate is another relatively heavyweight operation, taking just under 30 microseconds. We can see why GCD uses a thread pool and tries not to create new threads unless needed. However, this is one test that has gotten much faster since the old days. The same test took well over 100 microseconds back then.
Creating an NSView instance is fast, at about 3 microseconds. Creating an NSWindow, on the other hand, is much slower, taking about 10 milliseconds. An NSView is really a relatively lightweight structure that represents an area of a window, whereas an NSWindow is backed by a pixel buffer in the window server. Creating one involves communicating with the window server to have it create the necessary structures, and it also requires a fair amount of work to set up all the various internal objects an NSWindow needs, such as the views for the title bar. Feel free to go crazy with views, but you may want to go easy on windows.
File access is, as always, slow. SSDs make it a lot faster, but there's still a lot of machinery involved. Do it if you must, but avoid it if you don't need it.
Here are the iOS results.
The most remarkable thing about these numbers is how similar they are to the Mac results above. Looking back at the old tests, the iPhone was orders of magnitude slower. An Objective-C message send, for example, was about 4.9ns on the Mac, but took an eternity on the iPhone at almost 200ns. A simple C++ virtual method call took just over a nanosecond on the Mac, but 80ns on the iPhone. A small malloc/free took about 50ns on the Mac, but about 2 microseconds on the iPhone.
Comparing the two today, things have clearly changed a lot in the mobile world. Most of these numbers are only a bit worse than the Mac numbers. Some are actually faster! For example, autorelease pools are significantly faster on the iPhone. I assume ARM64 is better at doing the things the autorelease pool code does.
Reading and writing small files stands out as one area where the iPhone is significantly slower. The 16MB file tests are comparable to the Mac's, but the iPhone takes almost ten times longer on the 16-byte file tests. It appears that iPhone storage has good throughput but suffers on latency compared to the Mac.
An excessive focus on performance can interfere with writing good code, but it's good to keep in mind the rough performance of the common operations we perform in our programs. That performance changes as software and hardware improve. The Mac has seen some nice improvements over the years, but the improvement on the iPhone is remarkable. In eight years, it's gone from being almost a hundred times slower to being roughly on par with the Mac.
That's it for today. Come back next time for more fun stuff. Friday Q&A is driven by reader suggestions, so if you have a topic you'd like to see covered next time or some other time, please send it in!