I do quite a lot of FFT computation with OpenCV in C#, and recently I've started looking into CUDA, since my computations aren't GPU-accelerated so far and are done purely on the CPU (which takes foreeeever for about 10k 1024x1024 images). And since I've seen quite a lot of papers about CUDA implementations, I thought about trying it as well.
So my question is: what would be a good mid-range Nvidia card to see how much GPU acceleration actually helps? I want to start out with something for max. 200€, just to get a feel for it and try out CUDA. I assume I need a card with a lot of CUDA cores to maximize parallelization, right? Is that the only aspect that matters to me, or are there other features that can help?
It's really just some basic Fourier transforms that I want to do as fast as possible. So a lot of memory and a lot of CUDA cores are my first thought.
Since I've seen that the scientific world very often uses CUDA for the kind of work I'm doing, I was leaning towards CUDA. But I'm curious: how well do OpenCL and CUDA compare?
I also wanted to start implementing my algorithms in OpenCV with OpenCL acceleration, but I have to do it in my spare time, which is why it's going quite slowly for me right now. But if CUDA is better optimized, since it's bound to specific hardware in contrast to OpenCL, then CUDA would be my choice. For starters I'll have to do a benchmark anyway and compare the two.
Has the next generation been announced for next year already? I remember the GTX 10 series being relatively new.
CUDA is Nvidia-only. Only OpenCL can be used on both Nvidia and AMD graphics cards. They can both be thought of as very high-level graphics/compute libraries.
The next-gen low-level graphics APIs are Vulkan and DX12; both are cross-vendor, but neither is compute-centric, they are graphics-centric. On top of CUDA, Nvidia also offers cuDNN, a compute library of deep-learning primitives specific to Nvidia hardware.
So basically, OpenCL is fine for number-crunching on GPUs unless you happen to be able to use cuDNN.
As a rule, AMD graphics cards are beefier for the price and can crunch more raw numbers using OpenCL than an equivalent Nvidia card also using OpenCL. So as long as you can guarantee that your whole workload can run in OpenCL, buying an AMD card makes sense. If you can't, go Nvidia.
If your specific workload is optimized for CUDA/cuDNN and has no OpenCL option, then get an Nvidia card. If it has both, then it's a toss-up. An Nvidia card does run CUDA loads faster than OpenGL ones, but AMD cards will run the same OpenGL workload much faster... so... toss-up.
The new architecture is called "Volta" and will be out sometime in 2018, and was announced a bit over a month ago.
I've never used CUDA, but I can tell you that OpenCL is very problematic, especially if you happen to be on Linux. So if you want an easy solution and are OK with it being proprietary and bound to Nvidia, CUDA is probably the way to go.
The next generation of nvidia GPUs ("Volta") is supposed to launch Q3 2017. EDIT: I've found some conflicting information about this. Expect it to launch anywhere from this fall to early 2018.
Regarding the 1060: 3GB of memory is very little at this point. If you're only going to use the card as you described, it's probably fine, but 3GB is tight for pretty much anything else.
Lastly, I'm not sure you even need a GPU. It would certainly speed up your computations, but I find it hard to believe that a graphics card is necessary for something like FFT. I've written a short test script and was able to crunch 1000 1024x1024 images in 37 seconds. In Octave. On a 5-year-old laptop. Single-threaded.
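For comparison, a CPU baseline along those lines is a few lines of Python/NumPy (the original Octave script isn't shown, so this is just an equivalent-idea sketch; sizes are scaled down here so it runs quickly):

```python
import time
import numpy as np

# Rough CPU baseline: batch of 2D FFTs over random "images".
# Bump n_images/size to 1000 and 1024 to reproduce the full-size test.
rng = np.random.default_rng(0)
n_images, size = 10, 256
images = rng.random((n_images, size, size))

t0 = time.perf_counter()
# fft2 transforms the last two axes, so one call handles the whole batch.
spectra = np.fft.fft2(images)
elapsed = time.perf_counter() - t0

print(f"{n_images} images of {size}x{size}: {elapsed:.3f} s")
print(spectra.shape)  # complex array, same shape as the input batch
```

Timing a batch like this first gives you a number to beat before buying any hardware.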
Vulkan is just as much a compute API as it is a graphics API. Khronos is even considering merging OpenCL into Vulkan and is known to be working on an OpenCL-over-Vulkan implementation. However, Vulkan is a very complex, low-level API and requires serious effort to get into.
If and only if your problem can make use of all those tensor cores. A simple FFT will spend more time transferring data between CPU <-> GPU than on any computations.
How can you even compare a CUDA workload to an OpenGL workload? Those are two entirely different things.
No they won't. AMD's OpenGL support is pretty bad. In fact that is most likely the reason they've been pushing Mantle/Vulkan/DX12.
To be honest, I'm not just doing a few simple FFTs. It's an iterative algorithm with a bunch of forward and inverse transforms and substitutions of matrices, in order to calculate a computer-generated hologram (CGH). My algorithms are in C#, using an OpenCV wrapper for C#. A cycle of 50 iterations for one 1024x1024 image takes up to 20 seconds on an Intel Core i5-6500, and I've got quite a lot of images to handle. It's possible that the C# code isn't very well optimized; I haven't written all of it myself. But as I mentioned, the scientific world of optics is all about GPU acceleration for CGH computation. So if I ever want to do active feedback calculation with a live feed, I'll need a fast GPU at some point, I think.
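For anyone unfamiliar with this kind of loop: the forward/inverse structure is roughly that of the Gerchberg-Saxton algorithm mentioned later in the thread. The actual code isn't posted, so this is only a minimal textbook-style NumPy sketch; the function name and the toy target pattern are made up for illustration:

```python
import numpy as np

def gerchberg_saxton(target_amp, n_iter=50, seed=0):
    """Minimal Gerchberg-Saxton loop: find a phase-only hologram whose
    far field (FFT) approximates target_amp. Illustrative sketch only."""
    rng = np.random.default_rng(seed)
    # Start from a random phase in the hologram (SLM) plane, unit amplitude.
    field = np.exp(1j * rng.uniform(0, 2 * np.pi, target_amp.shape))
    for _ in range(n_iter):
        far = np.fft.fft2(field)                       # forward transform to image plane
        far = target_amp * np.exp(1j * np.angle(far))  # impose target amplitude, keep phase
        field = np.fft.ifft2(far)                      # inverse transform back
        field = np.exp(1j * np.angle(field))           # phase-only constraint (SLM plane)
    return np.angle(field)  # the computed hologram phase

# Toy target: a bright square in a 64x64 plane.
target = np.zeros((64, 64))
target[24:40, 24:40] = 1.0
phase = gerchberg_saxton(target, n_iter=30)
print(phase.shape)
```

The two FFTs per iteration are exactly the part a GPU FFT library would accelerate; everything else is elementwise.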
And in the end, that's why I'm asking here for a good mid-range starting point, so I don't throw my money at some high-end graphics card without really gaining that much speed. I want to get into it, test things out, and do some comparisons.
Now I'm also inspired to test out OpenCL on AMD and compare it against Nvidia with both an OpenCL and a CUDA implementation. As it happens, I have a spare Radeon HD 7770 lying around. Not the case for Nvidia cards, though.
If you want to get your feet wet go for it. There's no better way to learn.
Here's the but: GPUs are complex, and you probably shouldn't be using them unless you're willing to optimize your code. You can't just throw something together and expect it to run fast because it's being executed by an RGB-lit graphics card. If you've ever worked with threads, you'll know that synchronizing them is a pain and, when done badly, performance will decrease, not increase. GPUs are that, just with thousands of threads, different kinds of memory, and ugly APIs to go with them.
The most pragmatic solution would probably be to stay on the CPU and use C++ and OpenMP.
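The same per-image parallelism the OpenMP suggestion is getting at can be sketched in Python too (which the thread also brings up), since each image is an independent job. `process_image` here is a hypothetical stand-in for a real per-image pipeline:

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def process_image(seed):
    """Stand-in per-image job: FFT one random image and return its total
    spectral magnitude. (Placeholder for a real iteration loop.)"""
    rng = np.random.default_rng(seed)
    img = rng.random((256, 256))
    return float(np.abs(np.fft.fft2(img)).sum())

# Images are independent, so the batch parallelizes trivially across CPU
# cores -- the same idea as an OpenMP 'parallel for' over the image list.
# NumPy's FFT releases the GIL, so even a thread pool gets real parallelism.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process_image, range(8)))
print(len(results))
```

For heavy pure-Python per-image work you'd want a process pool instead, but for NumPy-dominated jobs threads avoid the pickling overhead.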
With that in mind, if you want to learn about GPUs grab your old card (or just some integrated graphics with OpenCL support, don't forget about those!) and get started. Python's OpenCL bindings are supposed to be great btw.
I also happen to be working in the CGH field. The solution that worked best for me was simply to use the built-in libraries in MATLAB. If you're at a uni, I'm sure you already have a license. I was able to get vastly reduced computation times on my GTX 970 graphics card compared to my 4690K CPU.
It's likely that this isn't quite the optimal solution, but it was very fast to code and works well. Also, running 10k iterations of Gerchberg-Saxton (which I assume is what you're using) is probably overkill (although I don't know exactly what you're doing).
Also I'd be interested in knowing what you're doing exactly (assuming you can say at this stage).
That's great to hear! Yes, I'm using IFTA for calculating CGHs. I can't say too much about my work, because it's my master's thesis and I'm writing it at a company, so most of it is under a confidentiality agreement. But my work involves laser ablation, and for that I need to test a lot of parameters and therefore have a lot of data to calculate; for the most part I'm using about 30-40 iterations for now. Also, as I mentioned, implementing live feedback is one of my goals someday, and for that I'd need as much computational power as possible.
So I'm still a student, but not writing at a university. A MATLAB solution would really only be temporary. Since I'll continue my work after my thesis is done, I wanted something to work with for the long run.
@j1018 may I ask what you are doing with your CGHs? I'd assume something with optical metrology?
Of course, the company I'm working at is Pulsar Photonics. I recommend taking a look at the Flexible Beam Shaper (FBS) that I'm working on. Your working field sounds pretty interesting! It's really astonishing what you can do with light.
Getting CGH computation at a 1Hz rate would be an amazing improvement for now. I've spoken with a sales guy from Hamamatsu (one of the few manufacturers of spatial phase modulators) and he told me he'd read a paper where some people were generating CGHs with the GS algorithm at a rate of 100Hz, with the computation being done on a GPU. So I'm really curious what I can do with optimized code and will definitely take a closer look into it, or get a computer science student to help me with it (I actually studied mechanical engineering). And I have a friend who worked at an institute for applied optics, and they have whole Ph.D. projects around the topic of GPU-accelerated CGH computation optimized in CUDA.
What might actually be a good idea is to grab a GTX 980 Ti off eBay for a lot of CUDA cores. It has more than 2x the CUDA cores of a 1060 and would be a lot faster for your purposes. The prices seem to have risen slightly (cough cough fucking miners cough cough), but there are some close to the ~$300 point if you shop around. Here's one for $330:
I'll add some below if I find any others, haven't looked much yet
Be aware that while Nvidia supports both CUDA and OpenCL and AMD only supports OpenCL, the performance of OpenCL on AMD cards is significantly better than it is on Nvidia cards.
If you're writing something portable, then OpenCL is probably a better option than CUDA, which limits you to Nvidia cards only. If it's something specialist, then it probably doesn't matter either way; you can choose based on the best absolute performance for your application.
Be aware that applications like Premiere Pro, which leverage GPGPU as an assistant to CPU computation, will run just as well on something like a GTX 660 as on a GTX 1080. GPGPU compute power/efficiency doesn't always follow gaming performance. That's why AMD cards are so popular in the current mining boom; a GTX 1070 seems to do a better job than a GTX 1080 in the mining GPGPU implementations as well.
With the mining boom pushing prices for modern GPUs up to stupid levels at the moment, unless you want a card to do double duty for gaming etc., you may actually be better served looking for a used GTX 770/780/780 Ti. The used prices for those are cheap, the miners aren't looking for them, and they should be sufficient for testing your application with a reasonable level of GPGPU performance.
I'm glad I asked here in the forum first. Now I'm seriously considering going the AMD + OpenCL route instead, because it's just much more flexible in general, even if I want to go with FPGAs in the future. The answers here were very insightful.