Good starter Nvidia Card for Scientific Computation

I do quite a lot of FFT computation with OpenCV in C# and recently I've started looking into CUDA, since my computations are not GPU-accelerated so far and are done purely on the CPU (which takes for eeeeever for about 10k 1024x1024 images).
And since I've seen quite a lot of papers about implementations in CUDA, I thought about doing this as well.

So my question is: what would be a good mid-range Nvidia card to see how much the GPU acceleration helps? I want to start out with something for at most 200€, just to get a feel for it and try out CUDA. I would assume I need a card with a lot of CUDA cores to maximize the parallelization, right? Is that the only aspect that matters for me, or are there other features that can help?

It's really just some basic Fourier transforms that I want to do as fast as possible.
So a lot of video RAM and a lot of CUDA cores are my first thought.

There is only one: the GTX 1060. The 3GB version is a cut-down GPU, which means you will lose raw performance. Check the CUDA core count difference between the 3GB and 6GB versions.

https://de.pcpartpicker.com/product/4Np323/msi-geforce-gtx-1060-3gb-3gb-oc-video-card-geforce-gtx-1060-3gt-oc

Below the GTX 1060 are the more... entry-level cards: GTX 1050 Ti and below.

A few months ago these were cheaper, I believe; crypto mining has shot prices up.

With waifu2x on an 8350, a given 720p image takes about 30-40 seconds to process. With CUDA/cuDNN, that dropped to under 1 second on an older GTX 660, same algorithm.

If you happen to be able to use OpenCL instead, I would recommend it over CUDA, paired with either an RX 460 ($100) or an RX 470 (when they come back in stock).

http://videocardbenchmark.net/gpu_value.html

The next generation of GPUs (next year) will focus heavily on machine learning, and will likely be dramatically better than any current generation card at GPU accelerated workloads.

Ah, the 3GB version flew under my radar, since I started out looking only at >4GB cards. But 3GB should be completely fine for my purpose. Thank you, that is a very good starting point!


Since I've seen that the scientific world very often uses CUDA for the kind of work I'm doing, I was leaning towards CUDA. But I'm curious: are OpenCL and CUDA really comparable?

I also wanted to start implementing my algorithms in OpenCV with OpenCL acceleration, but I have to do it in my spare time, which is why it's going quite slowly for me right now. But if CUDA is better optimized, since it's bound much more tightly to specific hardware than OpenCL is, then CUDA would be my choice. For starters I'll have to do a benchmark anyway and compare the two.

Is the next generation already announced for next year? I remember the GTX 10 series being relatively new.

CUDA is Nvidia-only. Only CL can be used on both Nvidia and ATI graphics cards. They can both be thought of as very high-level graphics/compute libraries.

The next-gen low-level graphics APIs are Vulcan and DX12; Vulcan is cross-platform, DX12 is not, and neither is compute-centric, they are graphics-centric. CUDA also has cuDNN, a compute-centric library specific to Nvidia hardware.

So basically, OpenCL is fine for number-crunching on GPUs unless you happen to be able to use cuDNN.

As a rule, ATI graphics cards are beefier for the price and can process more raw numbers using OpenCL than an Nvidia equivalent also using OpenCL. So as long as you can guarantee that you can do any workload in OpenCL, buying an ATI card makes sense. If you cannot, Nvidia.

If your workload is specifically optimized for CUDA/cuDNN and does not have an OpenCL option, then get an Nvidia card. If it has both, then it is a tossup. An Nvidia card does run CUDA loads faster than OpenGL ones, but ATI cards will run the same OpenGL workload much faster... soo... tossup.

The new architecture is called "Volta"; it was announced a bit over a month ago and will be out sometime in 2018.

Edit: reasons


I've never used CUDA, but I can tell you that OpenCL is very problematic, especially if you happen to be on Linux. So if you want an easy solution and are OK with it being proprietary and bound to Nvidia, CUDA is probably the way to go.

The next generation of nvidia GPUs ("Volta") is supposed to launch Q3 2017. EDIT: I've found some conflicting information about this. Expect it to launch anywhere from this fall to early 2018.

Regarding the 1060: 3GB of memory is very little at this point. If you are only going to use the card as you described, it's probably fine, but 3GB is tight for pretty much anything else.

Lastly, I'm not sure you even need a GPU. It certainly will speed up your computations, but I find it hard to believe that a graphics card is necessary for something like FFT. I've written a short test script and was able to crunch 1000 1024x1024 images in 37 seconds.
In Octave.
On a five-year-old laptop.
Single-threaded.
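
If you want to run the same kind of test yourself, something like this NumPy sketch does the job (just the idea, not my actual Octave script):

```python
# Rough CPU-only FFT timing sketch (NumPy): ballpark how long 1000
# forward 2D FFTs of a 1024x1024 image take on one CPU core.
import time
import numpy as np

n_images, size = 1000, 1024
rng = np.random.default_rng(0)
img = rng.random((size, size), dtype=np.float32)  # stand-in for a real image

start = time.perf_counter()
for _ in range(n_images):
    spectrum = np.fft.fft2(img)  # forward 2D FFT
elapsed = time.perf_counter() - start

print(f"{n_images} FFTs of {size}x{size} in {elapsed:.1f} s "
      f"({1000 * elapsed / n_images:.2f} ms per image)")
```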

Are you sure your code isn't simply borked?

Vulkan is just as much a compute API as it is a graphics API. Khronos is even considering merging OpenCL into Vulkan and is known to be working on an OpenCL-over-Vulkan implementation. However, Vulkan is a very complex, low-level API and requires serious effort to get into.

If and only if your problem can make use of all those tensor cores. A simple FFT will spend more time transferring data between CPU and GPU than on any computation.
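
Back-of-the-envelope, with rough numbers of my own: a 1024x1024 single-precision complex image is about 8 MB, so at a realistic ~12 GB/s over PCIe 3.0 x16 you pay roughly 0.7 ms per direction, while the transform itself takes well under a millisecond on a mid-range card. Unless you batch many images per transfer or keep intermediate results on the GPU, the copies dominate.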

How can you even compare a CUDA workload to an OpenGL workload? Those are two entirely different things.

No they won't. AMD's OpenGL support is pretty bad. In fact that is most likely the reason they've been pushing Mantle/Vulkan/DX12.

And while I'm at it: It's Vulkan, not Vulcan.

Volta does sound interesting.

To be honest, I'm not just doing a few simple FFTs. It's an iterative algorithm with a bunch of forward and inverse transforms and substitutions of matrices, in order to calculate a computer-generated hologram (CGH). The algorithms are written in C# using an OpenCV wrapper. A cycle of 50 iterations for one 1024x1024 image takes up to 20 seconds on an Intel Core i5-6500, and I've got quite a lot of images to handle.
It is possible that the C# code is not very well optimized; I haven't written all of it myself. But as I mentioned, the scientific world of optics is all about using GPU acceleration for CGH computation. So if I want to do active feedback calculation with a live feed, I will need a fast GPU at some point, I think.
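
To give a rough idea of the structure, here is a stripped-down NumPy sketch of a plain Gerchberg-Saxton-style loop (not my actual C# code, and without any of the application-specific constraints):

```python
# Stripped-down Gerchberg-Saxton-style phase retrieval loop (NumPy sketch).
import numpy as np

def gs_phase(target_amplitude, n_iter=50, seed=0):
    """Return a phase-only hologram whose far field approximates the target."""
    rng = np.random.default_rng(seed)
    shape = target_amplitude.shape
    source_amplitude = np.ones(shape)            # uniform illumination assumed
    phase = rng.uniform(0, 2 * np.pi, shape)     # random initial phase
    field = source_amplitude * np.exp(1j * phase)

    for _ in range(n_iter):
        far = np.fft.fft2(field)                                  # propagate to image plane
        far = target_amplitude * np.exp(1j * np.angle(far))       # impose target amplitude
        field = np.fft.ifft2(far)                                 # propagate back
        field = source_amplitude * np.exp(1j * np.angle(field))   # impose source amplitude

    return np.angle(field)                       # phase pattern for the modulator

# Example: a 1024x1024 target with a bright square in the middle
target = np.zeros((1024, 1024))
target[384:640, 384:640] = 1.0
hologram_phase = gs_phase(target, n_iter=50)
```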

And in the end, that's why I'm asking here for a good mid-range starting point, so I don't throw my money at some high-end graphics card without really gaining that much speed. I want to get into it, test things out, and do some comparisons :slight_smile:

Also, now I'm inspired to test out OpenCL on an AMD card and compare it to Nvidia with both OpenCL and CUDA implementations. As it happens, I have a spare Radeon HD 7770 lying around; not the case for Nvidia cards, though.

If you want to get your feet wet go for it. There's no better way to learn.

Here's the but: GPUs are complex, and you probably shouldn't be using them unless you are willing to optimize your code. You can't just throw something together and expect it to run fast because it's being executed by an RGB-lit graphics card. If you've ever worked with threads you'll know that synchronizing them is a pain, and when done badly performance will decrease, not increase. GPUs are that, just with thousands of threads, different kinds of memory, and ugly APIs to go with them.

The most pragmatic solution would probably be to stay on the CPU and use C++ and OpenMP.

With that in mind, if you want to learn about GPUs, grab your old card (or just some integrated graphics with OpenCL support, don't forget about those!) and get started. Python's OpenCL bindings are supposed to be great btw.
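
To give an idea of what that looks like, here is roughly the standard pyopencl "square an array" demo (a sketch from memory, just to show the shape of the API, so treat it as a starting point rather than gospel):

```python
# Minimal PyOpenCL sketch: square a float array on whatever OpenCL device is found.
import numpy as np
import pyopencl as cl

a = np.arange(16, dtype=np.float32)

ctx = cl.create_some_context()          # picks a platform/device (GPU, iGPU or CPU)
queue = cl.CommandQueue(ctx)

mf = cl.mem_flags
a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)

program = cl.Program(ctx, """
__kernel void square(__global const float *a, __global float *out) {
    int gid = get_global_id(0);
    out[gid] = a[gid] * a[gid];
}
""").build()

program.square(queue, a.shape, None, a_buf, out_buf)   # one work-item per element

result = np.empty_like(a)
cl.enqueue_copy(queue, result, out_buf)                # copy result back to the host
print(result)
```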

I also happen to be working in the CGH field. The solution that I found worked best for me was to just use the built-in libraries in MATLAB. If you're at a uni I'm sure you already have a license. I was then able to get vastly reduced calculation times on my GTX 970 graphics card compared to my 4690K CPU.

It's likely that this isn't quite the optimal solution, but it was very fast to code and works well. Also, running 10k iterations of Gerchberg-Saxton (which I assume is what you're using) is probably overkill (although I don't quite know what you're doing).

Also I'd be interested in knowing what you're doing exactly (assuming you can say at this stage).


@j1018 that's some good information right there.

  1. MATLAB is a great idea for doing this kind of computation, assuming:
    • You are happy to deal with what has to be one of the worst successful programming languages, surrounded by the most fitting (=terrible) GUI
    • you've got a license (free/dirt cheap for students)
    • You are OK with not doing the GPU part yourself. MATLAB takes away the burden of GPU programming, but that implies you won't be learning CUDA/OpenCL or anything.
  2. The use of a 970 implies that not much memory is needed.

That's great to hear!
Yes, I'm using an IFTA (iterative Fourier transform algorithm) for calculating CGHs. I can't say too much about my work, because it's my master's thesis and I'm writing it at a company, so most of it is under a confidentiality agreement. But my work involves laser ablation, and for that I need to test a lot of parameters and therefore have a lot of data to calculate; for the most part I'm using about 30-40 iterations for now. Also, as I mentioned, implementing live feedback is one of my goals someday, and for that I'd need as much computation power as possible.

So I'm still a student, but not writing at a university. A MATLAB solution would really only be temporary; since I'll continue my work after my thesis is done, I wanted something to work with for the long run.

@j1018 may I ask what you are doing with your CGHs? :slight_smile:
I'd assume something with optical metrology?

Fair enough on the confidentiality agreement. Are you able to tell me the name of the company? I'm just interested really.

I think live-ish feedback should definitely be possible right now with a resolution of 1024x1024 and using FFTs. By live-ish I mean around a second of delay. I've not tested this specifically though.

That's a reasonable concern. But if you're looking to get a Gerchberg-Saxton algorithm working with GPU acceleration in 30 mins or so it's not a bad solution.

I use CGHs to design visible range metasurface holograms for security and sensing applications. My particular focus is on tunability of the optical response.

Of course, the company I'm working at is Pulsar Photonics.
I recommend taking a look at the Flexible Beam Shaper (FBS) that I'm working on.
Your field of work sounds pretty interesting! It's really astonishing what you can do with light.

Getting CGH computation at a 1Hz rate would be an amazing improvement for now. I've spoken with a sales guy from Hamamatsu (one of the few manufacturers of spatial phase modulators) and he told me he had read a paper where some people were generating CGHs with the GS algorithm at a rate of 100Hz, with the computation being done on a GPU. So I'm really curious what I can do with optimized code and will definitely take a closer look into it, or get a computer science student to help me with it (I actually studied mechanical engineering). And I have a friend who worked at an institute for applied optics, and they have whole PhD projects around the topic of GPU-accelerated CGH computation optimized in CUDA.

Thanks for all the info! Very informative. And also for pointing out all of my typos ~OpenCGL and Vulcknan.

What might actually be a good idea is to grab a GTX 980 Ti off of eBay for a lot of CUDA cores. It has more than 2x the CUDA cores of a 1060, and it would be a lot faster for your purposes. The prices seem to have risen slightly (cough cough fucking miners cough cough), but there are some close to the ~$300 point if you shop around. Here's one for $330:

I'll add some below if I find any others; haven't looked much yet.


Be aware that while Nvidia supports both CUDA and OpenCL, and AMD only supports OpenCL, the performance of OpenCL on AMD cards is significantly better than it is on Nvidia cards.

If you are writing something that is portable, then OpenCL is probably a better option than CUDA, which limits you to Nvidia cards only. If it is something specialist, then it probably doesn't matter either way; you can choose based on the best absolute performance for your application.

Be aware that applications like Premiere Pro, which leverage GPGPU as an assistant to CPU computation, will run just as well with something like a GTX 660 as they will with a GTX 1080. GPGPU computation power/efficiency doesn't always follow gaming performance. That is why the AMD cards are so popular for the current mining boom. The GTX 1070 seems to do a better job than the GTX 1080 in the mining GPGPU implementations as well.

With the mining boom pushing prices for modern GPUs up to stupid levels at the moment, unless you want a card to do double duty for gaming etc., you may actually be better served looking for a used GTX 770/780/780 Ti. The used prices for those are cheap, the miners are not looking for them, and they should be sufficient for testing your application with a reasonable level of GPGPU performance.


Thank you, that's helpful.

I'm glad I asked here in the forum first. Now I'm seriously considering going the AMD + OpenCL route instead, because it's just much more flexible in general, even if I want to go with an FPGA in the future. The answers here were very insightful :slight_smile:

Note that you can try out OpenCL without any special hardware. Intel has a CPU-only implementation, and there's also POCL.
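
For example, listing which OpenCL platforms and devices are visible takes only a few lines with the Python bindings (a sketch assuming pyopencl is installed; Intel's CPU runtime or POCL will show up here even without a discrete GPU):

```python
# Quick check of which OpenCL platforms/devices are visible on this machine.
import pyopencl as cl

for platform in cl.get_platforms():
    print(platform.name)
    for device in platform.get_devices():
        print("  ", device.name, "-", cl.device_type.to_string(device.type))
```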