Adventures in coding using AI


Disclaimer:

  1. This is just for fun. Not looking to do any serious coding/development work.

  2. There are probably a LOT better models/methods out there (cf. EvalPlus Leaderboard), but I am just trying this out for the first time, over the last two weeks or so.

  3. I am NOT a programmer by trade nor by training. I don’t really understand code per se; as a mechanical engineer, I’m used to people explaining what each part of a line is or does (the way that formulas/equations are taught), which, in my experience, is not how programming is taught.

With that out of the way – it’s been a fun week or two of playing with AI.

I started off with text-generation-webui (GitHub - oobabooga/text-generation-webui: A Gradio web UI for Large Language Models. Supports transformers, GPTQ, AWQ, EXL2, llama.cpp (GGUF), Llama models.) and TheBloke’s Mistral AI Mixtral-8x7B-Instruct-v0.1-GGUF (TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF · Hugging Face), where I was asking it some basic questions and also showing tiny human #1 why AI can’t be relied upon, because it can give wrong answers.

But as I started playing with it more and more, I made it to a 2-parameter regression analysis, where I later learned about a concept that Shih-Wen Chen and Hsien-Jung Hsu called “numerical hallucination” in regards to the lack of precision and accuracy when it comes to LLMs doing math (even just basic arithmetic).

Then I moved on to getting it to write a MATLAB script to approximate pi using the Gregory-Leibniz series, where pi = 4*atan(1). (I actually entered it as pi = 4/atan(1) at first, on purpose, to see if it would catch my error. It did.)
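
(For reference, the Gregory-Leibniz series is the arctangent expansion evaluated at 1:

pi/4 = atan(1) = 1 - 1/3 + 1/5 - 1/7 + ...

so multiplying the partial sums by 4 approximates pi, albeit very slowly.)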

And then that led me to the Nilakantha method as an alternative for approximating pi, and now I am using that to test both Ollama/Codellama 70b and the Mistral LLMs to see which is better at writing this code/program (following this topic and the video that accompanied it: Ubuntu 22.04 - From Zero to 70b Llama (with BOTH Nvidia and AMD 7xxx series GPUs)).
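
(The Nilakantha series, for reference, starts at 3 and adds alternating correction terms:

pi = 3 + 4/(2*3*4) - 4/(4*5*6) + 4/(6*7*8) - ...

which converges considerably faster than Gregory-Leibniz.)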

Ollama/open-webui originally didn’t work for me on my Proxmox server, which was running Ubuntu 22.04 LTS in an LXC container. My Proxmox server only has a single RTX A2000 6 GB, which is shared between different LXC containers.

And then I switched over to my 6700K, which has 64 GB of RAM, and a 3090.

That worked better/faster. (I was getting around 4-4.5 tokens/s with the Mistral LLM.)

As I started playing around with it more and then trying to load the bigger models in, I was running out of VRAM.

So, I ended up re-purposing some of the cryptomining hardware that I had to use for this instead.

Unfortunately, my 6500T’s motherboard only has two DIMM slots on it, and out of the readily available RAM that I have just lying around, I could only put in 16 GB (2x 8 GB sticks). So, the system ran out of RAM.

So I ended up taking out my Asus Z170-E motherboard with the 6700K, putting it into my mining frame, and hooking up two 3090s to it. Now the system has (more or less) enough system RAM and VRAM to be able to load the 70b-parameter models.

This is what I would ask in the prompt:

Write a fortran77 program to calculate pi using the Nilakantha method with a user input and a double precision elapsed timer

My results so far:

Codellama has failed in this task in a few ways:

  1. It wrote a fortran90 program, not a fortran77 program.

  2. It kept using the Gregory-Leibniz method to approximate pi even though I asked for the Nilakantha method. (I was able to cross-check this by using Mistral’s LLM.)

  3. Now that I have finally gotten it to give me a fortran77 program, the (tab) spacing of the program is all messed up, such that fort77 (running via Ubuntu 22.04 LTS WSL2 on Windows) won’t compile it (see the fixed-format notes below).

In fact, neither the program that Codellama wrote nor the one that the Mistral LLM wrote would compile.
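
For anyone unfamiliar with why the spacing matters: fort77 expects classic fixed-format (punched-card era) source, where columns 1-5 are reserved for statement labels, column 6 is the continuation marker, statements live in columns 7-72, and a C in column 1 marks a comment. A minimal valid layout looks like this (an illustrative sketch, not either model’s output):

C     COLUMNS 1-5: LABEL / COLUMN 6: CONTINUATION / COLUMNS 7-72: CODE
      PROGRAM DEMO
      WRITE(*,*) 'HELLO'
   10 CONTINUE
      END

Anything that puts a non-blank character in column 6, or a non-digit in columns 1-5, trips the compiler – which is exactly what the error messages further down show.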

I’ve read that there are better models out there (with higher HumanEval scores), and I haven’t tried them (all) out yet, but this has been a good, fun test for me to throw at AI, to see what it can and can’t do.

And apparently, this is still a little bit too much for AI (at least the models that I’ve tested so far), for now.

This is the program that Codellama-70b produced:


PROGRAM PI_APPROXIMATION

IMPLICIT NONE

INTEGER, PARAMETER :: N_MAX = 1000000
REAL(DP), DIMENSION(N_MAX) :: series
REAL(DP)                      :: pi_approx, start_time, end_time, elapsed_time
INTEGER                       :: i

FUNCTION nilakantha (n_terms) RESULT(result)
  IMPLICIT NONE
  INTEGER, INTENT(IN)          :: n_terms
  REAL(DP), INTENT(OUT)        :: result

  result = 0.0d0
  DO i = 1, n_terms
    IF (MOD(i, 2) /= 0) THEN
      result = result + 1.0d0 / REAL(2 * i - 1)
    END IF
  END DO
END FUNCTION nilakantha

WRITE(*,*) "Enter the number of terms for the Nilakantha series:"
READ(*,*) pi_approx

CALL TIMER()
start_time = TIMER()

DO i = 1, pi_approx
  pi_approx = pi_approx + nilakantha(i)
END DO

CALL TIMER()
end_time = TIMER()
elapsed_time = end_time - start_time
WRITE(*,*) "Approximated value of Pi: ", pi_approx, "(using", pi_approx, "terms)"
WRITE(*,*) "Elapsed time for the calculation: ", elapsed_time, " seconds."

END PROGRAM PI_APPROXIMATION

And here are the errors that I get when I try to compile it using fort77:

Error on line 1: illegal continuation card (starts "PROGRA")
Error on line 3: illegal continuation card (starts "IMPLIC")
Error on line 5: illegal continuation card (starts "INTEGE")
Error on line 6: illegal continuation card (starts "REAL(D")
Error on line 7: illegal continuation card (starts "REAL(D")
Error on line 8: illegal continuation card (starts "INTEGE")
Error on line 10: illegal continuation card (starts "FUNCTI")
Error on line 11: illegal continuation card (starts "  IMPL")
Error on line 12: illegal continuation card (starts "  INTE")
Error on line 13: illegal continuation card (starts "  REAL")
Error on line 15: illegal continuation card (starts "  resu")
Error on line 16: illegal continuation card (starts "  DO i")
Error on line 17: illegal continuation card (starts "    IF")
Error on line 19: labeled continuation line (starts "    EN")
Error on line 20: nondigit in statement label field "  END"
   MAIN:
Error on line 18: syntax error
Error on line 21: labeled continuation line (starts "END FU")
Error on line 23: labeled continuation line (starts "WRITE(")
Error on line 24: labeled continuation line (starts "READ(*")
Error on line 27: labeled continuation line (starts "start_")
Error on line 29: labeled continuation line (starts "DO i =")
Error on line 30: labeled continuation line (starts "  pi_a")
Error on line 31: labeled continuation line (starts "END DO")
Error on line 34: labeled continuation line (starts "end_ti")
Error on line 35: labeled continuation line (starts "elapse")
Error on line 36: labeled continuation line (starts "WRITE(")
Error on line 37: labeled continuation line (starts "WRITE(")
Error on line 20: unbalanced quotes; closing quote supplied
Error on line 20: unbalanced parentheses, statement skipped
Warning on line 20: missing final end statement
/usr/bin/fort77: aborting compilation
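
To decode the first of those errors: “PROGRAM PI_APPROXIMATION” starting in column 1 puts the ‘A’ of PROGRAM in column 6 – the continuation-marker column – so fort77 reads the line as a stray continuation card, hence illegal continuation card (starts “PROGRA”). The slightly indented lines fail the same way (in “  IMPLICIT”, the ‘L’ lands in column 6), and “  END” trips the nondigit in statement label field error because letters spill into columns 1-5. Shifting every statement over to column 7 or beyond would clear this whole class of errors.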

And this is the program that Mistral came up with:


program pi_approximation
  implicit none
  double precision :: pi, error, term, a, b, n, timer
  integer :: i

  call timer_on(timer)

  print *, "Enter the number of terms for the Nilakantha series:"
  read *, n

  a = 0.0d0
  b = 1.0d0
  pi = 0.0d0
  term = 1.0d0

  do i = 1, n
    pi = pi + term
    error = abs(term)
    term = term * ((a + b) / 2.0d0)
  end do

  pi = pi * 4.0d0

  call timer_off(timer)
  print *, "Pi approximated with ", n, " terms: ", pi
  print *, "Elapsed time: ", timer

end program pi_approximation

And it suffers from the same spacing issue, and neither model is able to successfully correct it.
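
For reference, here is a minimal hand-written sketch of what a fixed-format F77 version could look like. (This is an untested illustration, not output from either model. It assumes the f2c runtime behind fort77 provides the non-standard ETIME function – standard F77 has no timer intrinsic at all, let alone a double precision one, so the elapsed time here is only single precision.)

      PROGRAM NILAK
C     APPROXIMATE PI WITH THE NILAKANTHA SERIES:
C     PI = 3 + 4/(2*3*4) - 4/(4*5*6) + 4/(6*7*8) - ...
C     ETIME IS NON-STANDARD; ASSUMED PROVIDED BY THE F2C RUNTIME
      DOUBLE PRECISION PI, SGN, DK
      REAL TARR(2), T0, T1, ETIME
      INTEGER N, K
      WRITE(*,*) 'ENTER THE NUMBER OF TERMS:'
      READ(*,*) N
      T0 = ETIME(TARR)
      PI = 3.0D0
      SGN = 1.0D0
      DO 10 K = 1, N
         DK = DBLE(2*K)
         PI = PI + SGN*4.0D0/(DK*(DK+1.0D0)*(DK+2.0D0))
         SGN = -SGN
   10 CONTINUE
      T1 = ETIME(TARR)
      WRITE(*,*) 'APPROXIMATED PI: ', PI
      WRITE(*,*) 'ELAPSED TIME: ', T1 - T0, ' SECONDS'
      END

Note the layout: comments flagged with C in column 1, every statement starting at column 7, and the label 10 sitting in columns 4-5. If ETIME isn’t available in your fort77 setup, the two timing lines are the part to swap out.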

(And yes, I am purposely using Fortran77 to test.)

And yes, I am also purposely testing with the Nilakantha method rather than the Gregory-Leibniz method.

I ran this same comparison/test with MATLAB as well, and the results didn’t really fare that much better, especially for the Nilakantha method.

I think that the last answer that I got from the MATLAB script that these AI models generated calculated pi to be -6.somethingE+014.

So, it’s a wildly incorrect answer.


Overall, very interesting that Mistral/Codellama get the structure correct but generate code that does not compile. I wonder if FORTRAN/MATLAB code was used as part of the training dataset. If not, it might be a good use case for fine-tuning.

I found that openhermes on my 2070 worked well for simple Python and Linux commands.

Recently I came across twinny, a VSCode/Codium extension that gives me a GitHub Copilot-like interface and uses Ollama as the LLM.

I like to use Jupyter notebooks to test and write code, so any extension that replicates how GitHub Copilot chat works is ideal.

Interesting use case.

Fortran must have been in the training set, otherwise these LLMs wouldn’t have produced what they did. But it seems like they don’t know Fortran 77. LLMs can do Python without issues, so in principle there’s no barrier to understanding spacing?

You could try asking for a second attempt, clarifying the difference between F77 and F90?

Or have a clearer prompt from the start? E.g. it must be fixed-format Fortran, must compile with fort77, etc.?

Perhaps the webui is screwing with the spacing?

To be fair, there are a lot of mixed f77/f90 code bases out there, and f77 compiles with no issue under f90 (pretty much), so it is perhaps to be expected that an LLM plays loose with the distinction if it is not extra strongly emphasised.


The amount of Fortran training data in Llama will amount to a rounding error, and early open-weight models failed on even the most basic tests, e.g. FizzBuzz in Python.

While there’s no logic in transformers, the best bet is LoRA/fine-tuning on updated examples, like this project has done:


Or QLoRA


No idea.

From my very limited experience with it so far, I think that, with Python being used more often now than your “hard core” C/C++/Fortran (77/90/95/etc.), the model will reflect what it was trained on.

I just followed the instructions from Wendell’s “Ollama” thread/post.

Sort of?

I am not sure if the specific f77 compiler makes (much of) a difference, if any at all.

My vague recollection is that


REAL :: var

should have been a proper variable declaration, but I don’t know if the fort77 compiler that I am using now understands that (since gfortran isn’t available/a thing anymore).
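
(As far as I can tell, the double-colon form is actually Fortran 90 syntax; strict Fortran 77 drops the colons, so what a strict F77 compiler would want is:

      REAL VAR

i.e. just the type followed by the name, starting at column 7.)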

So, interestingly and ironically enough, I actually started with Codellama (per Wendell’s post/thread) as mentioned above, and it originally returned an F90 program.

But I also only “knew” this because I was cross-checking the validity of the generated code with Mistral AI’s LLM. I would copy-and-paste the entire code over (the nice thing about this example is that, as programs go, it is very small), so it didn’t take long for it to parse through, and that was one of the ways I found out that it was NOT a valid fortran77 program: it was calling the cpu_time() function to time how long the program takes to run, and that intrinsic was only standardized in Fortran 95 – it isn’t available in Fortran77.

So I would write back, in the Codellama prompt:
cpu_time() is not available in Fortran77

And then it would apologise and rewrite the timer part of the program.

So… I actually only found out about fort77 after the original instructions that came back wanted me to compile the program with gfortran. But when I actually tried that, I found out that gfortran wasn’t available, so I had to find alternatives and then feed that information back to it. That’s how I ended up with the fort77 compiler, but I didn’t start this adventure that way.

Yes and no.

There were versions of the f77 code where it would use C at the beginning of the line for comments, so I know that fixed formatting IS possible for it, but it is VERY inconsistent about it.

Agreed.

Unfortunately, now that gfortran isn’t a thing anymore, I didn’t spend much time trying to look for a Fortran compiler that can do both.

I haven’t gotten into LoRAs yet. (I will have to google what “LoRA” actually stands for.)

And for more backstory – I actually gave these LLMs the test/task of writing the F77 program because I started all of this by asking for a MATLAB script to do the same thing.

But MATLAB has the tic/toc functions for the timing piece of it, so that’s easy (for MATLAB).

And I was actually testing the differences between the Gregory-Leibniz algorithm for approximating pi vs. the Nilakantha algorithm, and found that the Nilakantha method (via the LLM-generated MATLAB script) a) gave a SIGNIFICANTLY wrong result, and b) for 1e9 iterations, took a VERY, VERY long time to compute that VERY, VERY wrong answer.
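
(For a sanity check: implemented correctly, the Nilakantha series converges quickly. The first four terms alone give

3 + 4/24 - 4/120 + 4/336 = 3.1452...

which is already within 0.004 of pi, so a SIGNIFICANTLY wrong result points at the generated code, not the method.)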

So, that’s what led me to “okay, let’s see if just a simple, and relatively basic f77 program would be able to speed this thing up”.

And that’s when I encountered all these issues with trying to use a LLM to code/program.

(But I am also using it to teach my tiny human #1 about NOT trusting/depending on AI to do tiny human #1’s homework – because it can give some very, very wrong answers, and why it is still very important for tiny human #1 to be able to tell when something is wrong, and to know how to fix it (or just do the homework manually in the first place.))

We might have been lucky enough to grow up not really worrying about that.

But with AI and chatbots, it is tempting for this generation of kids to crib their homework with an AI chatbot, a la “why are you making me do it the hard way when I can just ask Google/Alexa/Siri/etc.?” I told tiny human #1 that I could program a calculator to purposely give tiny human #1 the wrong answers, but with this, since I don’t program, I don’t have to.

I can just type in a relatively simple question (obviously not about how to approximate pi) and then show how it can give the wrong answer as the result. And once tiny human #1 can see that, first hand, and I’m like “I’m not asking it a trick question, right?” – then it helps drive this point home, and I didn’t have to write a single line of code. :smiley: win win.

But since I don’t program, my next thought was “maybe I can use it to help me learn?”

Well…lo and behold – not really.

(And yes, I know that F77 really isn’t used much now, though Fortran is still used for HPC/CFD/FEA, which is some of the stuff that I do. But I figure that if I can use it to program in these older languages, then using it to teach me how to write Python code would be easier.)

But for this group here – I thought that it might be interesting, at a higher, more technical level, how something that would seem, on the surface, like it should be pretty easy to do (approximate pi using a known method/algorithm) turns out to be something that the LLMs struggle with, given the relatively “simple” task.

(I mean, it’s not like I am trying to use LLM to write CFD code or to code up the Black-Scholes pricing model.)
