Chatbots have been around for a very long time. But because of their preprogrammed responses and low-quality engagement, they never gained traction with the public. Since ChatGPT was released, however, the chatbot landscape has been turned upside down.

Even though ChatGPT makes many mistakes, the long, open-ended conversations one can have with it on all sorts of topics have massively shifted public interest in AI. The underlying technology behind systems like ChatGPT is a class of machine learning models called large language models, also known as LLMs.

Following the success of ChatGPT, many developers and organizations started developing their own alternatives. The open source community followed suit. The release of Meta’s LLaMA series of models triggered a chain reaction in the OSS community, and we now have a big ecosystem of LLMs, some of which compete with ChatGPT in many tasks.

These LLMs have very general-purpose natural language capabilities and can solve a very diverse set of tasks. This makes them perfect candidates for creating new types of intelligent applications. Many companies have also started using LLMs to add intelligent capabilities to their existing applications.

We have recently released a feature called “Step Suggestion” that uses the power of LLMs to help our customers create test scenarios with ease. In this essay, we will explore what LLMs are, how we integrated ChatGPT into our product, and how you can add the power of LLMs to your own product for intelligent features.

What is an LLM?

LLM stands for large language model. It is a special class of models that have the following properties:

  • based on the transformer architecture
  • trained on a language modeling objective
  • usually more than 10B parameters in size, although models with more than 1B parameters are also considered LLMs

These models are trained on massive datasets, usually containing terabytes of text. As a result, they compress a massive corpus of human knowledge, along with some linguistic and reasoning skills, within themselves. These skills can be triggered by prompting the models in specific ways.

Most of the powerful LLMs have been developed as closed source. Models like ChatGPT can be accessed via a web interface. Even though such models can be costly in the long run, working with closed source models saves the effort of setting up these massive models locally.

Context window as the main limitation of LLMs

In order to work with LLMs, it is important to understand one of their major limitations: the maximum context size. Context is the amount of text that is instantly available in the working memory of a transformer model. Ideally, we would love to have an LLM with infinite context size, and some efforts have been made in that direction, but at the moment we have to work with a limited context window.

If some information falls out of the context window, the model loses access to it. Imagine you are chatting with an LLM and you give it an API key. If you keep chatting until the key falls outside the context window and then ask the model to tell you the key, it won't be able to do so because it no longer has access to it.
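The effect can be illustrated with a minimal sketch. It assumes a hypothetical chat client that keeps only the most recent messages that fit in a fixed token budget, and it approximates token counts by splitting on whitespace; a real system would use the model's own tokenizer.

```python
def count_tokens(text: str) -> int:
    """Rough stand-in for a real tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def fit_to_context(messages: list[str], max_tokens: int) -> list[str]:
    """Keep the most recent messages whose combined size fits in max_tokens."""
    kept: list[str] = []
    used = 0
    # Walk backwards so that the newest messages are kept first.
    for message in reversed(messages):
        cost = count_tokens(message)
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = [
    "My API key is abc-123",
    "Tell me about transformers",
    "Now explain attention in detail",
    "What was my API key?",
]

# With a budget of 14 "tokens", the oldest message no longer fits.
visible = fit_to_context(history, max_tokens=14)
print(visible)
```

In this toy run, the message that contains the key is the one that gets dropped, so a model that only sees `visible` has no way to answer the final question.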

Usually, the context size is 2048 or 4096 tokens for open source models. The latest version of GPT-4 has a context window of 128k tokens, which should suffice for a very large number of tasks. For cases where it does not, clever solutions for working around the context window limitation have been implemented in tools like LangChain.

For our Step Suggestion feature, we use multiple techniques, such as reducing the size of the HTML page, to work around very long pages.
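As an illustration of what reducing an HTML page can look like, here is a minimal sketch built on Python's standard `html.parser` module. It is not the implementation used in Step Suggestion: the lists of skipped tags and kept attributes are assumptions chosen for the example, and a production version would need to be tuned to the pages it sees.

```python
from html.parser import HTMLParser

# Assumed for this example: content that rarely helps the model, and the
# attributes that are most useful for identifying an element.
SKIPPED_TAGS = {"script", "style", "svg", "noscript"}
KEPT_ATTRS = {"id", "name", "type", "href", "placeholder", "aria-label"}
VOID_TAGS = {"input", "img", "br", "hr", "meta", "link"}


class HtmlReducer(HTMLParser):
    """Rebuilds a smaller version of the page while it is being parsed."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []
        self.skip_depth = 0  # > 0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        kept = "".join(
            f' {name}="{value}"'
            for name, value in attrs
            if name in KEPT_ATTRS and value is not None
        )
        self.parts.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if not self.skip_depth and tag not in VOID_TAGS:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and not self.skip_depth:
            self.parts.append(text)

    # Comments are dropped because handle_comment is not overridden.


def reduce_html(html: str) -> str:
    reducer = HtmlReducer()
    reducer.feed(html)
    reducer.close()
    return "".join(reducer.parts)


page = """
<html><head><style>body { margin: 0 }</style><script>track();</script></head>
<body>
  <!-- tracking comment -->
  <button id="login" class="btn btn-primary" data-x="1">  Log   in </button>
  <input type="email" class="wide" placeholder="Email">
</body></html>
"""
reduced = reduce_html(page)
print(reduced)
```

Even on this tiny page, the reduced version is much shorter than the original while keeping the elements a test step would refer to.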

How to communicate with LLMs

Communicating with LLMs is a skill. Some companies are now hiring prompt engineers whose main job is to communicate with LLMs.

Generally, we can use natural language to communicate with LLMs. Smaller models require more prompt tuning because of their limited reasoning capabilities, but larger models can easily understand the user's intention even if the language is not very precise.

To connect traditional applications with LLMs, we can use structured data formats such as JSON or YAML. Asking the LLM to respond in YAML that a traditional application can parse is one of the easiest ways to add the power of an LLM to any application.

We used a similar strategy when developing the Step Suggestion feature in Autify, which provides various suggestions to guide users in testing their applications. We communicate with ChatGPT using YAML, but within that YAML there are fields that contain natural language information.
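The idea of mixing a machine-readable structure with natural language fields can be sketched as follows. The example uses JSON rather than YAML only because Python's standard library can parse it without extra dependencies; the prompt wording and the field names (`action`, `target`, `reason`) are hypothetical and are not the ones used in Step Suggestion.

```python
import json

# Hypothetical prompt: asks for a fixed structure with one free-text field.
PROMPT_TEMPLATE = """You are a testing assistant.
Suggest the next test step for the page below.
Respond with a single JSON object and nothing else, using these keys:
  "action": one of "click", "type", "assert"
  "target": a CSS selector
  "reason": one sentence in plain English

Page:
{page}
"""


def parse_suggestion(raw: str) -> dict:
    """Parse the model's reply; raises ValueError if it is not a JSON object."""
    suggestion = json.loads(raw)  # json.JSONDecodeError is a ValueError
    if not isinstance(suggestion, dict):
        raise ValueError("expected a JSON object")
    return suggestion


prompt = PROMPT_TEMPLATE.format(page='<button id="login">Log in</button>')

# A reply the model might return (hard-coded here; no API call is made).
reply = (
    '{"action": "click", "target": "#login", '
    '"reason": "The login button starts the sign-in flow."}'
)
suggestion = parse_suggestion(reply)
print(suggestion["action"], suggestion["target"])
print(suggestion["reason"])
```

The `action` and `target` fields are meant for the application, while `reason` is natural language that can be shown to the user as is.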

Of course, due to the stochastic nature of LLMs, some errors are always expected. Sometimes the LLM outputs invalid YAML; other times some values or parameters are invalid. But if the system works more than 95% of the time, workarounds can be created for the remaining 5% of cases.
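One possible workaround is to validate every reply and ask again when validation fails. The sketch below assumes JSON output (parsed with the standard library), a hypothetical set of allowed actions, and an `ask_llm` callable that stands in for whatever client sends the prompt to the model; the fake model at the bottom exists only to make the example runnable.

```python
import json
from typing import Callable, Optional

ALLOWED_ACTIONS = {"click", "type", "assert"}  # hypothetical schema


def validate(raw: str) -> dict:
    """Return the parsed reply, or raise ValueError describing the problem."""
    data = json.loads(raw)  # malformed JSON raises a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if data.get("action") not in ALLOWED_ACTIONS:
        raise ValueError(f"invalid action: {data.get('action')!r}")
    if not isinstance(data.get("target"), str) or not data["target"]:
        raise ValueError("missing target")
    return data


def suggest_with_retry(
    ask_llm: Callable[[str], str], prompt: str, max_attempts: int = 3
) -> Optional[dict]:
    """Ask until a reply validates; return None if every attempt fails."""
    for _ in range(max_attempts):
        raw = ask_llm(prompt)
        try:
            return validate(raw)
        except ValueError as error:
            # Tell the model what went wrong before the next attempt.
            prompt = (
                f"{prompt}\n\nYour previous reply was rejected ({error}). "
                "Reply with valid JSON only."
            )
    return None


# Fake model: fails twice in different ways, then succeeds.
replies = iter([
    "action: click",
    '{"action": "hover", "target": "#a"}',
    '{"action": "click", "target": "#a"}',
])


def fake_llm(prompt: str) -> str:
    return next(replies)


result = suggest_with_retry(fake_llm, "Suggest a step.")
print(result)
```

Returning `None` after the last attempt lets the calling code decide what to do in the rare failing case, for example hiding the suggestion instead of showing a broken one.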

Most common pitfalls & solutions

Start with a proprietary solution: If you are developing an intelligent app or trying to add intelligence to an existing app, start with solutions that provide API access to LLMs. These models are very accurate, and they save you the hassle of setting up your own LLM at scale, which can be very time-consuming and costly in the early stages. This is why we decided to go with ChatGPT instead of building an in-house LLM for Step Suggestion.

Don’t assume perfection: Just as humans are bound to make mistakes, AI/ML systems can’t be 100% accurate on most problems. Keep this in mind and design solutions that account for this limitation.

Prompt engineering: While this is less of an issue for bigger models, sometimes a model consistently makes the same mistake or produces output that is not what you intended. Try updating the prompt, or explicitly state what you want and what you do not want. Adding a good example is far more effective than just describing the task in natural language. For the Step Suggestion feature, we did a lot of prompt engineering to optimize the type of output we provide to our customers.
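A prompt that combines an explicit instruction with a worked example can be assembled with a small helper like the one below. This is only a sketch: the `Input:`/`Output:` layout and the example texts are assumptions made for illustration, not the prompts used in Step Suggestion.

```python
def build_prompt(
    instruction: str, examples: list[tuple[str, str]], query: str
) -> str:
    """Build a few-shot prompt: instruction, worked examples, then the query."""
    lines = [instruction, ""]
    for example_input, example_output in examples:
        lines += [f"Input: {example_input}", f"Output: {example_output}", ""]
    # End with an open "Output:" so the model continues from there.
    lines += [f"Input: {query}", "Output:"]
    return "\n".join(lines)


prompt = build_prompt(
    instruction=(
        "Describe the test step for the given element in one short sentence. "
        "Do not mention CSS classes."
    ),
    examples=[
        ('<button id="login">Log in</button>', 'Click the "Log in" button.'),
    ],
    query='<input type="email" placeholder="Email">',
)
print(prompt)
```

The instruction states both what is wanted and what is not, and the single example shows the expected tone and length more precisely than a description could.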

Self-evaluation: LLMs are generally much better at evaluating an output than at generating a specific output. You can ask the same LLM to evaluate the output it just generated to get a confidence estimate.
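A minimal sketch of this second pass is shown below. The 1 to 5 scale and the wording of the evaluation prompt are assumptions, and `ask_llm` is a hypothetical callable standing in for the real model client; a fake one is used so that the example runs on its own.

```python
from typing import Callable, Optional

# Hypothetical evaluation prompt with a fixed, easy-to-parse answer format.
EVAL_TEMPLATE = """Task:
{task}

Candidate answer:
{answer}

On a scale from 1 (wrong) to 5 (fully correct), how well does the candidate
answer solve the task? Reply with the number only."""


def self_evaluate(
    ask_llm: Callable[[str], str], task: str, answer: str
) -> Optional[int]:
    """Ask the model to grade an answer; return None if the grade is unusable."""
    reply = ask_llm(EVAL_TEMPLATE.format(task=task, answer=answer)).strip()
    if reply in {"1", "2", "3", "4", "5"}:
        return int(reply)
    return None


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call.
    return " 4\n"


score = self_evaluate(
    fake_llm,
    task="Suggest the next test step for a login page.",
    answer='Click the "Log in" button.',
)
print(score)
```

The grade is itself model output, so it is parsed defensively, and an unusable reply is reported as `None` rather than guessed.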

Monitor the output: Just as LLMs can sometimes surprise us with how good they are, there are times when they make the stupidest of mistakes. So it is important to actively monitor the output once the models are in production and to find solutions proactively before mistakes affect the user experience. The Step Suggestion feature uses the W&B Weave integration to monitor the usage of our system in production.
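To show the kind of information worth capturing, here is a small stand-in written with the standard library only. It does not use or imitate the W&B Weave API; the wrapper, the record fields, and the in-memory list are all assumptions made for this example, and a real setup would send the records to a monitoring tool.

```python
import time
from typing import Callable


def monitored(
    ask_llm: Callable[[str], str], records: list[dict]
) -> Callable[[str], str]:
    """Wrap a model call so that every request is recorded."""

    def wrapper(prompt: str) -> str:
        started = time.perf_counter()
        output = ""
        error = None
        try:
            output = ask_llm(prompt)
            return output
        except Exception as exc:
            error = repr(exc)
            raise
        finally:
            # Runs on success and on failure, so no call goes unrecorded.
            records.append({
                "prompt": prompt,
                "output": output,
                "error": error,
                "seconds": time.perf_counter() - started,
            })

    return wrapper


records: list[dict] = []
llm = monitored(lambda prompt: prompt.upper(), records)  # fake model
answer = llm("suggest a step")
print(records[0]["output"], records[0]["error"])
```

Recording failures as well as successes matters here, since the surprising mistakes are exactly the cases one wants to review later.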

Never exceed the context length: As mentioned above, the size of the input must stay below the maximum context length. For example, if you are using a ChatGPT model with a 4k context length and the input becomes larger than that, either switch to the 16k version or trim the input in a way that does not harm the output. The second option requires a lot of engineering effort.
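The decision described here can be sketched as a small planning function. Several things in it are assumptions: the estimate of roughly four characters per token is only a coarse rule of thumb for English text (a real system should count with the model's tokenizer), the `"4k"` and `"16k"` labels are placeholders rather than real model names, and the final truncation is a naive fallback, not the careful trimming that preserves output quality.

```python
def estimate_tokens(text: str) -> int:
    """Coarse estimate: about 4 characters per token for English text."""
    return max(1, len(text) // 4)


def plan_request(
    prompt: str,
    small_limit: int = 4096,
    large_limit: int = 16384,
    reserved_for_output: int = 512,
) -> tuple[str, str]:
    """Pick a context size for the prompt, trimming only as a last resort."""
    needed = estimate_tokens(prompt) + reserved_for_output
    if needed <= small_limit:
        return "4k", prompt
    if needed <= large_limit:
        return "16k", prompt
    # Naive fallback: cut the prompt so that it fits the larger window.
    budget_chars = (large_limit - reserved_for_output) * 4
    return "16k", prompt[:budget_chars]


for size in (400, 40_000, 100_000):
    variant, final_prompt = plan_request("x" * size)
    print(size, variant, len(final_prompt))
```

Part of the window is reserved for the reply because the context length covers the model's output as well as the input.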

Prompts are source code: Articulating a solution to a problem with an LLM is important work, and that work is expressed as prompts. It is therefore important to keep prompts and prompt templates as safe as source code. This is why we moved our prompts from the extension source code to the backend API. We also had to implement a complex data structure to represent the application state and to communicate properly with the backend.
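The general idea of keeping templates on the server can be sketched as follows. This is a simplified illustration, not our actual backend: the template text, the `step_suggestion` key, and the variable names are hypothetical, and the application state is reduced to two strings.

```python
from string import Template

# Hypothetical templates that live only on the backend. The client sends
# the variables and never sees the template text.
PROMPT_TEMPLATES = {
    "step_suggestion": Template(
        "Suggest the next test step for this page.\nURL: $url\nPage:\n$page"
    ),
}


def render_prompt(name: str, variables: dict[str, str]) -> str:
    """Fill a server-side template; raises KeyError if a variable is missing."""
    return PROMPT_TEMPLATES[name].substitute(variables)


# What a client request might carry: only the state, not the prompt.
client_payload = {
    "url": "https://example.com/login",
    "page": '<button id="login">Log in</button>',
}
prompt = render_prompt("step_suggestion", client_payload)
print(prompt)
```

Because the client only supplies data, the prompt wording can be changed and versioned on the backend without shipping a new client.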


In this essay, we presented some background information on LLMs, explained how we power our product with this new class of AI models, and showed how you can gain an edge over your competitors by adopting this cutting-edge technology. We also shared some of the technical insights we gained while developing our own LLM-powered feature.
