Not directly relevant to your question, but you may want to pay attention to the long-context performance of the model you use (I assume it is some quantization of MiniMax or Llama 4, given the 1M context).
Context Bench is an interesting source of such benchmarks, as is Fiction.liveBench, linked therein.
Under such standards, most open-weight models do not hold up well past 32K, and all but the absolute SOTA fall apart past 100K. Beyond that, using the full 1M context and waiting 4 weeks for 1.5k tokens of output… I can only say I'm impressed by your patience.
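If you want to sanity-check that kind of turnaround on your own hardware, a quick back-of-envelope helps. The throughput figures below are pure assumptions, so plug in whatever prompt-processing and generation speeds your setup actually reports:

```python
# Back-of-envelope: how long a single huge-context request takes end to end.
# The tokens/s figures are assumptions -- substitute your own measurements.
prompt_tokens = 1_000_000
output_tokens = 1_500

prefill_tok_per_s = 20.0   # assumed prompt-processing speed
decode_tok_per_s = 3.0     # assumed generation speed

prefill_s = prompt_tokens / prefill_tok_per_s
decode_s = output_tokens / decode_tok_per_s
total_h = (prefill_s + decode_s) / 3600

print(f"prefill: {prefill_s / 3600:.1f} h")
print(f"decode:  {decode_s / 3600:.2f} h")
print(f"total:   {total_h:.1f} h (~{total_h / 24:.1f} days) per request")
```

The point isn't the exact number, it's that prefill time scales with the full prompt length on every single request, so a 1M-token context gets punished even when the output is tiny.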
You may wish to look into a RAG setup rather than trying to fit everything into a single context, especially for 100K+ words of worldbuilding background, of which rarely more than 5% is directly relevant at any given moment. A rough sketch of what that looks like is below.
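It needs less machinery than it sounds. Here is a minimal sketch using sentence-transformers for embeddings; the model name, chunk size, and top_k are placeholder choices, swap in whatever you already run locally:

```python
# Minimal RAG sketch: chunk the worldbuilding notes, embed them once,
# and pull only the handful of chunks relevant to the current scene.
from sentence_transformers import SentenceTransformer
import numpy as np

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # any local embedding model works

def chunk(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into overlapping character windows."""
    return [text[i:i + size] for i in range(0, len(text), size - overlap)]

def build_index(text: str):
    chunks = chunk(text)
    vecs = embedder.encode(chunks, normalize_embeddings=True)
    return chunks, vecs

def retrieve(query: str, chunks, vecs, top_k: int = 5) -> list[str]:
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = vecs @ q                       # cosine similarity (vectors are normalized)
    best = np.argsort(scores)[::-1][:top_k]
    return [chunks[i] for i in best]

# Usage: feed only the retrieved chunks plus the recent chapters into the
# model's context instead of the full 100K+ words of background.
# notes = open("worldbuilding.md").read()
# chunks, vecs = build_index(notes)
# context = "\n\n".join(retrieve("Who rules the northern duchy?", chunks, vecs))
```

That way the model only ever sees the few percent of the background that matters for the scene at hand, and you can stay inside the 32K window where these models actually behave.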
As for the setup itself, maybe look into the kinds of rigs people use to run things like DeepSeek-R1. Most of them are costly, but probably not prohibitively so if the M4 Max 512GB is within the realm of possibility.
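For a rough sense of whether a given box can hold one of these big models at all, a memory ballpark helps. Everything below is an illustrative assumption (parameter count, quantization, layer/head dimensions); read the real values from the checkpoint's config before trusting the output:

```python
# Back-of-envelope memory check for running a huge model locally.
# All numbers here are placeholder assumptions -- adjust to the actual checkpoint.
def weights_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Memory for the quantized weights alone."""
    return n_params_billion * 1e9 * bits_per_weight / 8 / 1e9

def kv_cache_gb(ctx_len: int, n_layers: int, n_kv_heads: int,
                head_dim: int, bytes_per_elem: int = 2) -> float:
    """Naive K+V cache size; models with MLA-style cache compression use far less."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * ctx_len / 1e9

# e.g. a ~670B-parameter model at ~4 bits/weight:
print(f"weights : ~{weights_gb(670, 4.0):.0f} GB")
# plus cache for a 32K context with placeholder dims (61 layers, 8 KV heads, 128-dim heads):
print(f"kv cache: ~{kv_cache_gb(32_768, 61, 8, 128):.0f} GB")
```

With assumptions in that range, the quantized weights alone land around 300–350 GB, which is why the 512 GB class of unified-memory machines keeps coming up in those threads.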