How to troubleshoot an application performing worse on better hardware?

I have an application/service that I use for indexing. (It uses Lucene.)

On an i7 2600K (@ stock) with 16GB DDR3 1600, it processes a 1GB index within 10 min.
On an i7 4930K (@ 4.2GHz) with 32GB DDR3 1600, it processes the same index within 40 min.
On a dual X5570 (@ 3.2GHz) with 32GB DDR3 1600, it processed the same index within 1h.
(The dual X5570 was able to process that same file within 24 min before, and nothing has changed.)

On the i7 2600K, my indexing service uses ~70% of the CPU and around 4GB of memory.
On the i7 4930K and the X5570, it takes 20-30% of the CPU... (the application is built to work on X threads, where X is what I specify)
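
A rough way to put those utilization figures on the same scale is to convert the percentages into an approximate number of busy logical CPUs. This is only a sketch: it assumes Hyper-Threading is enabled everywhere (8 logical CPUs on the i7 2600K, 12 on the i7 4930K, 16 on the dual X5570) and that the percentages are whole-machine utilization. The helper name busy_logical_cpus is made up for illustration.

```python
def busy_logical_cpus(utilization_pct, logical_cpus):
    """Convert whole-machine CPU utilization into an equivalent
    number of fully busy logical CPUs."""
    return utilization_pct / 100.0 * logical_cpus


# Logical CPU counts are an assumption (Hyper-Threading on);
# utilization figures are the ones reported above.
machines = {
    "i7 2600K": (70, 8),
    "i7 4930K": (25, 12),    # midpoint of the reported 20-30%
    "dual X5570": (25, 16),  # midpoint of the reported 20-30%
}

for name, (pct, cpus) in machines.items():
    print(f"{name}: ~{busy_logical_cpus(pct, cpus):.1f} logical CPUs busy")
```

Under those assumptions the i7 2600K keeps roughly 5.6 logical CPUs busy, against roughly 3 on the i7 4930K and 4 on the dual X5570. So the slower machines would not just have more idle cores, they would be keeping fewer threads busy in absolute terms, which hints at threads waiting on something rather than at raw CPU speed.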

How do I find the bottleneck in this application? I don't have access to the source code, though.
So far I've tried resource monitors and Performance Monitor counters, and I'm really running out of options... I don't get why it's taking longer on a faster machine, or why it's not utilizing the hardware to its full potential.

(It has a debugging option, so I can see items as they are imported and merged in memory. Overall, every item takes longer to process; the slowdown isn't tied to anything obvious or specific.)

These are the steps the software takes:
1) Reads the XML file from the queue into memory (caches 50k items out of my 125k -- it keeps a buffer of 50k; changing its size doesn't have any impact on performance).
2) Starts the import and merge process in memory; this sends tasks to X+1 threads (where X is specified by me).
This is where the slowness occurs.
3) Waits for the threads to finish their work and starts writing the index to files. This is quite fast.
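
Since step 2 is the slow one and the thread count X is configurable, one black-box experiment is to run the same index with several values of X on each machine and see how well the run time scales. The sketch below only does the arithmetic on timings that have been measured by hand; scaling_report is a hypothetical helper name, and it contains no real measurements.

```python
def scaling_report(minutes_by_threads):
    """Return (threads, speedup, efficiency) rows, relative to the
    smallest thread count that was measured.

    minutes_by_threads maps a thread setting X to the wall-clock
    minutes the indexing run took with that setting.
    """
    if not minutes_by_threads:
        raise ValueError("need at least one measurement")
    base_threads = min(minutes_by_threads)
    base_minutes = minutes_by_threads[base_threads]
    rows = []
    for threads in sorted(minutes_by_threads):
        speedup = base_minutes / minutes_by_threads[threads]
        # Ideal speedup is threads / base_threads, so an efficiency
        # close to 1.0 means the extra threads are actually helping.
        efficiency = speedup / (threads / base_threads)
        rows.append((threads, speedup, efficiency))
    return rows
```

If efficiency collapses at a lower thread count on the slow machines than on the i7 2600K, that would suggest the threads are contending for something shared (a lock, the garbage collector, memory bandwidth) rather than being limited by per-core speed.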

Let me know if you know of any tools/software I can use to troubleshoot this and find out where the bottleneck is.
I simply don't get why the weaker hardware was faster, and 4x faster at that...

I'd try looking at the performance of the drive the XML file is read from.

Not an issue; I have verified that.

I ran it from a ramdisk a couple of times to test performance; previously I ran it from SSDs and from RAID 5 SAS 15k drives.
There was no difference in processing time, even when I cached the whole file in memory (the reading in is quite fast), and there are minimal reads/writes on the drive (200k/300k). It's basically just keeping a lock on the files; it's not really writing or reading until it's done with its items.

It might be utilizing an older CPU extension that may have changed since Sandy Bridge. That's my only guess.
The only built-in technologies that the 2600K has over the other two CPUs listed are:

Intel Fast Memory Access (Updated Memory Controller Hub (MCH) to increase performance and reduce latency.)

Intel Flex Memory Access (Allows different memory sizes in dual-channel mode.)

Is this a remote share you're indexing?

Not sure if this applies to you, but I found this.

ImproveIndexingSpeed

I'm just throwing spaghetti at a wall at this point.

I'll go with @KenPC on this; it seems to be using an old extension. Though I can't see why they would remove it, honestly...

The code is not utilizing any special features, and I don't see how an Ivy Bridge-E (enthusiast) CPU wouldn't have features that Sandy Bridge has. Then again, the application has not been modified since 2014, and the BIOS settings on the dual X5570 are the same as they were.

Yeah, I've read this already. It's related, but I'm not maintaining the code I'm running. (I'm planning to move to Elastic in the next few months, but I need this troubleshot now...)

I'm not really trying to solve the issue; I'm just trying to find what the bottleneck is.

Any idea what version of Lucene it is using?

Also, is it using the original Java version or a ported version?

No idea, unfortunately. It's not using Java; it was ported to .NET (I think).

I would start again with Perfmon counters, this time using a data collector set created from a template generated by PAL.

Then use PAL to analyse the collected data and to generate a report. Do this on both machines to compare.
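
If you want a quick side-by-side before reading the full PAL reports, something like the sketch below could rank the counters by how much they differ between the two machines. It assumes each capture has been converted to CSV first (for example with relog capture.blg -f CSV -o capture.csv), that the first column is the timestamp, and that the remaining column headers are counter paths starting with \\MACHINE. The function names and the utf-8-sig encoding are my assumptions, not anything PAL provides.

```python
import csv
from statistics import mean


def counter_means(csv_path):
    """Average every counter column in a perfmon CSV export.
    The first column is assumed to be the timestamp."""
    with open(csv_path, newline="", encoding="utf-8-sig") as f:
        reader = csv.reader(f)
        header = next(reader)
        samples = {name: [] for name in header[1:]}
        for row in reader:
            for name, value in zip(header[1:], row[1:]):
                try:
                    samples[name].append(float(value))
                except ValueError:
                    pass  # blank or non-numeric sample, skip it
    return {name: mean(vals) for name, vals in samples.items() if vals}


def strip_machine(counter_path):
    r"""Drop the leading \\MACHINE part so counters from two hosts line up."""
    if counter_path.startswith("\\\\"):
        # Splitting on a single backslash gives ['', '', 'MACHINE', ...].
        return "\\".join(counter_path.split("\\")[3:])
    return counter_path


def compare(fast_csv, slow_csv):
    """Return (counter, fast_mean, slow_mean) rows for counters present in
    both files, largest relative difference first."""
    fast = {strip_machine(k): v for k, v in counter_means(fast_csv).items()}
    slow = {strip_machine(k): v for k, v in counter_means(slow_csv).items()}
    rows = [(name, fast[name], slow[name]) for name in fast.keys() & slow.keys()]

    def relative_gap(row):
        biggest = max(abs(row[1]), abs(row[2])) or 1.0
        return abs(row[1] - row[2]) / biggest

    return sorted(rows, key=relative_gap, reverse=True)
```

Calling compare on the CSV from the i7 2600K and the CSV from one of the slow machines puts the most divergent counters at the top, which at least tells you where to look first in the PAL report.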

If you also want to override the CPU affinity to experiment with how many cores Lucene is permitted to use, you should be able to do that from the command line: https://blogs.msdn.microsoft.com/santhoshonline/2011/11/24/how-to-launch-a-process-with-cpu-affinity-set/
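
The affinity value that start /affinity takes is a hexadecimal bit mask, where bit N set means logical CPU N is allowed. A small helper to work the mask out, as a sketch (affinity_mask is just a name I picked):

```python
def affinity_mask(logical_cpus):
    """Build the hexadecimal mask for `start /affinity`:
    bit N set means logical CPU N may be used."""
    mask = 0
    for cpu in logical_cpus:
        if cpu < 0:
            raise ValueError("CPU indexes start at 0")
        mask |= 1 << cpu
    return format(mask, "X")


# Allow only the first four logical CPUs (0-3).
print(affinity_mask(range(4)))  # F
```

With that mask the launch would look like start /affinity F indexer.exe, where indexer.exe is a placeholder for whatever the executable is actually called. If it runs as a Windows service rather than being started from a console, the affinity would have to be set some other way, for example from Task Manager.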

EDIT: One more thought: are the Windows and .NET versions identical on all the machines? Use the SystemInfo command if you need to see which KBs are installed.
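
To compare the installed updates without eyeballing two long lists, you could save the output on each machine (systeminfo > machineA.txt) and diff the KB numbers. A minimal sketch, assuming the hotfixes appear in the output as KB followed by digits; the function names are hypothetical:

```python
import re

# Matches update identifiers such as KB1234567.
KB_PATTERN = re.compile(r"\bKB\d+\b", re.IGNORECASE)


def installed_kbs(systeminfo_text):
    """Collect every KB identifier found in saved systeminfo output."""
    return {kb.upper() for kb in KB_PATTERN.findall(systeminfo_text)}


def kb_diff(text_a, text_b):
    """Return (only_on_a, only_on_b) as sorted lists of KB identifiers."""
    a = installed_kbs(text_a)
    b = installed_kbs(text_b)
    return sorted(a - b), sorted(b - a)
```

Read each saved file into a string (opening it with errors="replace" avoids trouble with the console code page) and pass the two strings to kb_diff. Any update that exists on the slow machines but not on the i7 2600K would be worth a closer look.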