After C10k (10 000 concurrency connections), C10M (10 000 000 concurrency connections), now C10B is near (10 Billions concurrent connections, 10Tbps, 10 billions packets per seconds, 1 billion connections/seconds). In my case: 1 Billion concurrent connections, 1Tbps, 1 billions packets per seconds, 1 millions connections/seconds)

I have test it with CatchChallenger, with 32 core Threadripper 2990WX , 256GB DDR4, Radeon RX Vega 64.

This experiment can be used for high speed network packet processing for ethernet 100G+ or infinityband device. I have do as R&D for my company router with 1Tbps (at 64B packet size) routing capability certified with benchmark.

I have do stateless filter on IPv6 input, each processing unit after do the state-full work, but with an special dispatching memory access to reduce the cache miss. Mean: IP/TCP processing GPU, CatchChallenger processing CPU.

Difficulty: Very limited into memory, I had used specialized swap technique with barer to delay some class of trafic (move on map: 95% of move, + not published protocol with another vector move to prevent memory access and only parse last pos with need to use previous data), bulk processing of this.

59ms average reply time
95% 342ms reply time
Burn 1.2Tbps of network bandwidth

Ref: https://link.springer.com/chapter/10.1007/978-3-642-24999-0_44

Ref2: https://shader.kaist.edu/packetshader/