iceman7311
OSNN Senior Addict
- Joined
- 4 Feb 2005
- Messages
- 311
are the new amd x2's going to be just for worksatations and servers or are the for home computers?
Sazar said:Wrt gaming (and other similar products), most games are still single threaded so a dual core processor will not make a difference. Only games programmed with multi-threading in mind will see boosts in performance.
I recently set out to start implementing the dual-processor acceleration
for QA, which I have been planning for a while. The idea is to have one
processor doing all the game processing, database traversal, and lighting,
while the other processor does absolutely nothing but issue OpenGL calls.
This effectively treats the second processor as a dedicated geometry
accelerator for the 3D card. This can only improve performance if the
card isn't the bottleneck, but voodoo2 and TNT cards aren't hitting their
limits at 640*480 on even very fast processors right now.
For single player games where there is a lot of cpu time spent running the
server, there could conceivably be up to an 80% speed improvement, but for
network games and timedemos a more realistic goal is a 40% or so speed
increase. I will be very satisfied if I can makes a dual pentium-pro 200
system perform like a pII-300.
I started on the specialized code in the renderer, but it struck me that
it might be possible to implement SMP acceleration with a generic OpenGL
driver, which would allow Quake2 / sin / halflife to take advantage of it
well before QuakeArena ships.
It took a day of hacking to get the basic framework set up: an smpgl.dll
that spawns another thread that loads the original oepngl32.dll or
3dfxgl.dll, and watches a work que for all the functions to call.
I get it basically working, then start doing some timings. Its 20%
slower than the single processor version.
I go in and optimize all the queing and working functions, tune the
communications facilities, check for SMP cache collisions, etc.
After a day of optimizing, I finally squeak out some performance gains on
my tests, but they aren't very impressive: 3% to 15% on one test scene,
but still slower on the another one.
This was fairly depressing. I had always been able to get pretty much
linear speedups out of the multithreaded utilities I wrote, even up to
sixteen processors. The difference is that the utilities just split up
the work ahead of time, then don't talk to each other until they are done,
while here the two threads work in a high bandwidth producer / consumer
relationship.
I finally got around to timing the actual communication overhead, and I was
appalled: it was taking 12 msec to fill the que, and 17 msec to read it out
on a single frame, even with nothing else going on. I'm surprised things
got faster at all with that much overhead.
What is really needed for this type of interface is a streaming read cache
protocol that performs similarly to the write combining: three dedicated
cachelines that let you read or write from a range without evicting other
things from the cache, and automatically prefetching the next cacheline as
you read.
Intel's write combining modes work great, but they can't be set directly
from user mode. All drivers that fill DMA buffers (like OpenGL ICDs...)
should definately be using them, though.
Prefetch instructions can help with the stalls, but they still don't prevent
all the wasted cache evictions.
It might be possible to avoid main memory alltogether by arranging things
so that the sending processor ping-pongs between buffers that fit in L2,
but I'm not sure if a cache coherent read on PIIs just goes from one L2
to the other, or if it becomes a forced memory transaction (or worse, two
memory transactions). It would also limit the maximum amount of overlap
in some situations. You would also get cache invalidation bus traffic.
I could probably trim 30% of my data by going to a byte level encoding of
all the function calls, instead of the explicit function pointer / parameter
count / all-parms-are-32-bits that I have now, but half of the data is just
raw vertex data, which isn't going to shrink unless I did evil things like
quantize floats to shorts.
Too much effort for what looks like a reletively minor speedup. I'm giving
up on this aproach, and going back to explicit threading in the renderer so
I can make most of the communicated data implicit.