This page was created following a discussion (started by Dan Weber) about the feasibility of scaling repro into the millions-of-users range. Although the original discussion covered more than performance alone (it also touched on reliability, security, ...), this page currently focuses on how to improve the performance of the stack and DUM.
The purpose of this page is to consolidate the ideas and experiences put forth in that discussion and to develop them further.
Areas of Interest
Transport Event Selection
- Move away from select() to each platform's best IO selection API (epoll, ...). This will help a lot for TCP/TLS and will also enable the handling of more than 1024 sockets (the usual FD_SETSIZE limit of select())
- Leverage libevent so that the platform IO selection API (pre-notification) is abstracted (it should provide the best API per platform). Can libevent also provide kernel-level async?
- Leverage asio to move to an asynchronous IO post-notification model
- Consider using kernel-level asynchronous IO APIs (rather than current synchronous ones) to be able to handle multiple events at the same time (asio might provide this)
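As an illustration of the first item, here is a minimal sketch of readiness-based (pre-notification) selection with epoll. It is Linux-only and is not taken from the resiprocate code base: the class name ReadinessPoller and its methods are hypothetical, and a real transport would also need to handle write readiness, errors and descriptor removal.

```cpp
#include <sys/epoll.h>
#include <unistd.h>
#include <cassert>
#include <vector>

// Hypothetical helper (not part of resiprocate): wraps one epoll instance
// and reports which registered descriptors are currently readable.
class ReadinessPoller {
public:
    ReadinessPoller() : epfd_(epoll_create1(0)) { assert(epfd_ >= 0); }
    ~ReadinessPoller() { if (epfd_ >= 0) close(epfd_); }
    ReadinessPoller(const ReadinessPoller&) = delete;
    ReadinessPoller& operator=(const ReadinessPoller&) = delete;

    // Register fd for read-readiness notification. Unlike select(), the
    // descriptor value is not limited by FD_SETSIZE.
    bool watchReadable(int fd) {
        epoll_event ev{};
        ev.events = EPOLLIN;
        ev.data.fd = fd;
        return epoll_ctl(epfd_, EPOLL_CTL_ADD, fd, &ev) == 0;
    }

    // Wait up to timeoutMs milliseconds and return the readable descriptors.
    // Returns an empty vector on timeout or error.
    std::vector<int> wait(int timeoutMs) {
        epoll_event events[64];
        const int n = epoll_wait(epfd_, events, 64, timeoutMs);
        std::vector<int> ready;
        for (int i = 0; i < n; ++i) {
            if (events[i].events & EPOLLIN) {
                ready.push_back(events[i].data.fd);
            }
        }
        return ready;
    }

private:
    int epfd_;
};
```

A transport loop would register each socket once with watchReadable() and then call wait() repeatedly, reading only from the descriptors it returns. The cost per call depends on the number of ready descriptors rather than on the number of registered ones, which is where the gain over select() comes from with many TCP/TLS connections.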
Object Allocation and Destruction
- Use a different memory allocator than the default operator new (tcmalloc, ..., maybe a memory pool?)
- Object size could be reduced (SipMessage)
- Current object memory consumption is large and causes fragmentation (performance hit)
- [Dan Weber] Apparently SipMessage instances use much more memory, in terms of heap space and number of allocations, than their pure encoded counterparts. If you test with UDP or INVITE transactions, you'll find that the stack can run out of memory fairly often. Simply storing the encoded bytes (say, in a vector) instead of SipMessage instances in the retransmission parts of the code would reduce heap fragmentation significantly, and may increase performance.
- Current implementation of object streaming is costly and could be improved (boost::spirit::karma could help and some work has been done with resip faststream)
- [Dan Weber] This may be because a SipMessage's components are allocated all over the heap. A per-message allocator that works somewhat like a deque, using chunks of, say, 512 bytes, would keep memory more contiguous and make it easier to encode the SipMessage to the stream. In addition, with something like boost::fusion adapters with a bit vector and some set fields for things like parameters, far fewer allocations would occur on the heap. Consider std::string vs. resip::Data. In copy-on-write implementations (such as older versions of libstdc++), std::string is reference counted using atomic instructions. So if you initialize all the parameter values from the first std::string, they will share its buffer until they are filled in. The first one takes up 8 bytes on the stack plus 64 bytes or so on the heap, and each unfilled one takes 8 bytes on the stack until it is filled in.
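The per-message chunked allocator suggested above could look roughly like the following. This is only a sketch under stated assumptions: the name ChunkArena, its interface and the decision to reject requests larger than one chunk are all hypothetical, and nothing here reflects how SipMessage is actually implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical per-message arena: hands out memory from 512-byte chunks so
// that the parts of one message sit close together. Everything is released
// at once when the arena is destroyed.
class ChunkArena {
public:
    static constexpr std::size_t kChunkSize = 512;

    // Returns storage for `size` bytes aligned to `align`, or nullptr if the
    // request is empty, larger than one chunk, or has an unsupported alignment.
    void* allocate(std::size_t size,
                   std::size_t align = alignof(std::max_align_t)) {
        if (size == 0 || size > kChunkSize) return nullptr;
        if (align == 0 || (align & (align - 1)) != 0) return nullptr;
        if (align > alignof(std::max_align_t)) return nullptr;

        if (!chunks_.empty()) {
            // Round the current offset up to the requested alignment.
            const std::size_t aligned = (used_ + align - 1) & ~(align - 1);
            if (aligned + size <= kChunkSize) {
                used_ = aligned + size;
                return chunks_.back().get() + aligned;
            }
        }
        // The current chunk is full (or there is none yet): start a new one.
        chunks_.push_back(
            std::unique_ptr<unsigned char[]>(new unsigned char[kChunkSize]));
        used_ = size;
        assert(used_ <= kChunkSize);
        return chunks_.back().get();
    }

    std::size_t chunkCount() const { return chunks_.size(); }

private:
    std::vector<std::unique_ptr<unsigned char[]>> chunks_;
    std::size_t used_ = 0;  // bytes consumed in the most recent chunk
};
```

With such an arena, many small header and parameter objects of one message cost a handful of chunk allocations instead of one heap allocation each. Objects with non-trivial destructors placed in the arena would still need their destructors run explicitly before the arena goes away.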
Multicore/Multi CPU and Threading
- Current model (stack and DUM threading) makes little use of multiple cores
- Adding more threads does not always improve performance (especially when inter-thread communication is based on locking)
- [Dan Weber] Any time a lock or critical section is used, you are serializing access to the data so that only one thread is utilized. The truth of the matter for resiprocate and repro in general is that few collisions arise from the actual data itself (e.g. receiving an INVITE from effectively the same participant in two different locations at the same time). This means that all collisions that are likely to occur are simply caused by race conditions in memory, not because there is conflicting state. The last note on this page is about improving DUM's performance with Software Transactional Memory. This applies to resiprocate in general, since it is designed to handle those cases. What we're looking for primarily is a good replacement concurrent hash map that we can redistribute for free.
- Trying to put the transports into their own thread didn't seem to pay off
- Investigate lock-free FIFOs between stack and DUM to see how performance improves on multicore
- [Dan Weber] Benchmarks of AbstractFifo showed that it took on average 1.1 seconds for 1 million concurrent reads/writes on a quad-core Intel i7. TBB's implementation and my own perform the same task in about 100 milliseconds, and mine performs even better when used with a specialized allocator. On average, this is more than a 10x improvement in performance.
- Threading the transaction layer did improve the performance (Dan has an implementation of hash-based concurrency in svn)
- DUM's performance could be improved using Software Transactional Memory
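To make the lock-free FIFO idea above more concrete, here is a minimal single-producer/single-consumer ring buffer. It is a sketch, not the implementation benchmarked above and not TBB's queue: the name SpscFifo is hypothetical, and it assumes exactly one thread calls push() (for example the stack) and exactly one other thread calls pop() (for example DUM).

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical bounded FIFO for exactly one producer thread and one consumer
// thread. No mutex is taken; the two sides synchronize through two atomic
// indices. One slot is kept empty, so at most Capacity - 1 items are stored.
template <typename T, std::size_t Capacity>
class SpscFifo {
    static_assert(Capacity >= 2, "need at least one usable slot");

public:
    // Producer side. Returns false if the FIFO is full.
    bool push(const T& value) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t next = (head + 1) % Capacity;
        if (next == tail_.load(std::memory_order_acquire)) {
            return false;  // full
        }
        buffer_[head] = value;
        // Publish the item: the consumer's acquire load of head_ sees it.
        head_.store(next, std::memory_order_release);
        return true;
    }

    // Consumer side. Returns false if the FIFO is empty.
    bool pop(T& out) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire)) {
            return false;  // empty
        }
        out = buffer_[tail];
        // Release the slot so the producer may reuse it.
        tail_.store((tail + 1) % Capacity, std::memory_order_release);
        return true;
    }

private:
    std::array<T, Capacity> buffer_{};
    std::atomic<std::size_t> head_{0};  // next slot to write (producer-owned)
    std::atomic<std::size_t> tail_{0};  // next slot to read (consumer-owned)
};
```

Because each index is written by only one thread, neither side ever blocks the other. A FIFO with several producers or consumers needs a different design, and whether this approach helps between the stack and DUM would have to be measured.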