Getting started with LM Studio

4 min | Bryan Nehl | 2024-05-24

Installing

Go to https://lmstudio.ai and download the version for your computer. I have used it on Linux, Windows, and Mac. It is more of a wrapper around whatever AI model you load than an AI in its own right.

  • Once you have it installed, run it.
    • You’ll be greeted with a search bar.
      This is for finding the LLMs that can be downloaded and subsequently run on the computer. It defaults to having gguf in the search because GGUF is the model file format that LM Studio works with.
  • Add phi3 to the search so you have gguf phi3 in the search bar.
  • Hit return.
    • Included in the search results will be: microsoft/Phi-3-mini-4k-instruct-gguf
    • Under it you will have two options.
      • If you have a machine with 16G of RAM or more, consider choosing the larger model (7.6G).
      • Otherwise, go with the smaller one (2.4G). (For rough memory math, see the sketch after this list.)
  • There will be a download progress bar at the bottom of the LM Studio window.
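
As a rough rule of thumb (my own assumption, not an official LM Studio figure), a GGUF model needs at least its file size in free RAM, plus headroom for the context window and the operating system. A quick back-of-the-envelope check in Python:

    # Back-of-the-envelope RAM check for a GGUF model file.
    # The 1.2 overhead factor is a guess to cover the KV cache and runtime;
    # actual usage varies with context length and quantization.
    def fits_in_ram(model_size_gb: float, total_ram_gb: float,
                    os_reserved_gb: float = 4.0, overhead: float = 1.2) -> bool:
        """Return True if the model plausibly fits alongside the OS."""
        return model_size_gb * overhead <= total_ram_gb - os_reserved_gb

    print(fits_in_ram(2.4, 16))    # smaller Phi-3 build -> True
    print(fits_in_ram(7.6, 16))    # larger Phi-3 build -> True
    print(fits_in_ram(26.44, 32))  # the Mixtral file mentioned below -> False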

Hardware

Running AI models can be very resource intensive. A computer with a minimum of 16G of RAM is suggested to get started. Even just to use models, you may find that you can utilize 32G or even 64G of RAM. For instance, I have TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF/mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf downloaded, and at 26.44G on disk it pushes the limit of what can be done with 32G of RAM.

The other factor is whether you have a GPU or NPU. If you do, some or all of the processing can be offloaded to it. Use the settings panel to adjust how the work is split between GPU, CPU, and RAM.
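
LM Studio exposes this as a setting, but to make “offloading layers” concrete, here is a minimal sketch using the llama-cpp-python library directly (a separate install; the model path and layer count are placeholders). A GGUF model is a stack of transformer layers, and you choose how many of them live on the GPU:

    from llama_cpp import Llama  # pip install llama-cpp-python

    # n_gpu_layers sets how many transformer layers are pushed to the GPU;
    # the rest run on the CPU. 0 means CPU only; -1 offloads everything.
    llm = Llama(
        model_path="Phi-3-mini-4k-instruct-q4.gguf",  # placeholder path
        n_gpu_layers=16,  # e.g. the 16 layers offloaded on the M1 below
        n_ctx=4096,       # context window; bigger contexts need more memory
    )

    out = llm("Explain GGUF in one sentence.", max_tokens=64)
    print(out["choices"][0]["text"])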

My best machine for this has a Ryzen 9 7940HS (8C/16T) with an NPU, an RTX 4060 with 8G of VRAM, and a 2T SSD.

I’m considering one of the new ThinkPad T14s models coming out with the Snapdragon X Elite (with NPU) and 64G of RAM. Only a single pre-order configuration is available now, and it “only” has 32G of RAM.

Querying

  • Now that you have a model, click the icon in LM Studio’s left nav bar that looks like a chat bubble.
  • Choose the model to load from the drop-down at the top.
  • The system prompt/pre-prompt can be configured in the right side settings box.
  • If you have a GPU, you can adjust its settings there too.
  • In the middle panel is a “USER” input where you interact with the LLM. (A programmatic equivalent is sketched just after this list.)
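
The chat window is not the only way in. LM Studio also includes a local server mode that mimics the OpenAI API, so the same conversation can be driven from code. A minimal sketch, assuming the server is running on its default port of 1234 and a model is loaded (the prompts are placeholders, and LM Studio ignores the API key):

    from openai import OpenAI  # pip install openai

    # Point the OpenAI client at LM Studio's local server instead of api.openai.com.
    client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

    resp = client.chat.completions.create(
        model="local-model",  # LM Studio answers with whichever model is loaded
        messages=[
            {"role": "system", "content": "You are a concise assistant."},  # the pre-prompt
            {"role": "user", "content": "What is a GGUF file?"},            # the USER box
        ],
    )
    print(resp.choices[0].message.content)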

Response Times

When I did the chat about PIC programming, I was using the application installed in the Linux environment of a ThinkPad C13 Chromebook. The laptop has a Ryzen 7 3700C (4C/8T) with 16G of RAM and an SSD. It took 73 seconds before the first output appeared. Output then came at a rate of ~2.6 tokens/second. A token is roughly a word. Overall run time was ~2 minutes.
However, on an M1 Mac mini with 16 layers offloaded to the GPU, the initial response time was around 7 seconds, with a rate of ~15 tokens/second!

Using an LLM can be very resource intensive. For example, I was running a 13B model in the 9G size range on a computer with an AMD Ryzen 7 7730U (8C/16T) and 16G of RAM. All 16 threads went to 100% and memory was pegged for several minutes before I started getting the first output. It ran at around 1 token/second. An Apple Mac mini with an M1 and 16G of RAM ran about 5 tokens/second. The Ryzen 9 machine with 32G of RAM and the RTX 4060’s 8G of VRAM ran the same workload at about 11 tokens/second.
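
For reference, those throughput numbers are just tokens divided by generation time, measured after the time-to-first-token delay. Plugging in the Chromebook run above (the token count is implied by the figures, not something LM Studio reported):

    # Rough tokens/second arithmetic for the Chromebook run above.
    time_to_first_token = 73   # seconds before any output appeared
    total_run_time = 120       # ~2 minutes overall
    generation_time = total_run_time - time_to_first_token  # 47 s

    rate = 2.6                 # observed tokens/second
    tokens_generated = rate * generation_time
    print(f"~{tokens_generated:.0f} tokens in {generation_time}s")  # ~122 tokens in 47s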

Prompt Engineering

To improve your experience, do some research on techniques for interacting with Large Language Models. This is referred to as prompt engineering.
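
As a quick illustration (my own made-up example, not from any particular guide), giving the model a role, constraints, and an output format usually beats a bare question:

    # A bare prompt versus one with a role, constraints, and a format.
    bare_prompt = "Tell me about PIC microcontrollers."

    system_prompt = (
        "You are an embedded-systems tutor. Answer in at most five bullet "
        "points and assume the reader knows basic C."
    )
    user_prompt = "Explain what a PIC microcontroller is and what it is used for."

    # Sent as the system prompt + USER input in LM Studio (or as the messages
    # list in the server sketch above), the second version tends to produce
    # tighter, better-structured answers.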

Other models

When you get comfortable with everything, research and check out other models. Look for models labeled “instruct,” as they are tuned for taking instructions and generating responses. LM Studio has features that tell you what the labels mean, along with other information about the models.

Community

There is an active Discord community for LM Studio. The link is at the bottom of their web page.