At Kaizan, we aim to improve our clients’ working lives using the latest advancements in natural language processing (NLP). More specifically, we do this by using large language models (LLMs) to reduce administrative tasks, helping clients push ever closer to the utopia of admin zero.
Given how fast NLP advances, we need a robust process to identify new language models, train them, and evaluate their usefulness in achieving our goals.
What does that look like in practice? Let’s use a real-life example to find out…
The first step is to define your goal.
One of the key problems we’re tackling at Kaizan revolves around wasted time. Many people spend their working lives in back-to-back meetings. They fear that, the one time they don’t turn up, they’ll miss a critical piece of information. This can mean sitting through a one-hour call when only two minutes are relevant to them.
The right use of language models could save people from some of that waste. One example could be processing transcripts of meetings and providing summaries that people can read afterwards, instead of attending every meeting. This would let employees focus on productive work, without fear that they’ve missed something crucial.
The problem here is wasted time, and we want to solve it using an LLM that produces high-quality summaries.
To do this, we need to find the right LLM. This is done by testing models on a suitable dataset.
One of the best places to find such data is the archive of scientific papers at arXiv. Search through the papers there, looking for ones that introduce a new open source dataset related to your problem. If you can’t find something suitable on arXiv, then PapersWithCode or HuggingFace’s Datasets are also good sources.
After searching for keywords like “meeting summarisation” and “abstractive dialogue summarisation”, we found two papers with datasets relevant to our problem: SamSum and QMSum. One relates to summarising dialogue and the other to summarising meetings.
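If you prefer to search programmatically, the HuggingFace Hub also exposes a simple search API. Here’s a minimal sketch; the keywords are just illustrative:

```python
# Sketch: search the HuggingFace Hub for candidate datasets.
# The keywords below are illustrative, not exhaustive.
from huggingface_hub import list_datasets

for keyword in ["meeting summarization", "dialogue summarization"]:
    for dataset_info in list_datasets(search=keyword):
        print(keyword, "->", dataset_info.id)
```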
A random datapoint from the SamSum dataset shows its similarity to the problem we’re trying to solve:
Conversation:
Hannah: Hey, do you have Betty’s number?
Amanda: Lemme check
Amanda: Sorry, can’t find it.
Amanda: Ask Larry
Amanda: He called her last time we were at the park together
Hannah: I don’t know him well
Amanda: Don’t be shy, he’s very nice
Hannah: If you say so..
Hannah: I’d rather you texted him
Amanda: Just text him 🙂
Hannah: Urgh.. Alright
Amanda: Bye bye

Summary: Hannah needs Betty’s number but Amanda doesn’t have it. She needs to contact Larry
This dataset is focused on summarising dialogues from a messaging application. Each data point presents both a transcript and a summary: the input we’re going to give to our AI, and the output we want it to produce. Together, these pairs give the model a template for transforming one into the other.
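To get a feel for the data yourself, you can pull it down with HuggingFace’s datasets library. A minimal sketch, assuming the dataset is published on the Hub under the samsum identifier (loading it may also require the py7zr package, since the raw files ship as a .7z archive):

```python
# Sketch: load the SamSum dataset and inspect one transcript/summary pair.
from datasets import load_dataset

samsum = load_dataset("samsum")   # splits: train / validation / test
example = samsum["train"][0]

print(example["dialogue"])        # the input we give the model
print(example["summary"])         # the output we want it to produce
```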
Even though this dataset differs in style and format from the transcripts we’ll be working with, the structure of the summarisation task is similar. We therefore assume that the models producing the best summaries on this dataset should eventually produce the best summaries for any meeting transcript.
That gives us an initial dataset to test our LLM on. Now we need to pick the best model for the task at hand.
Once you’ve identified your proxy dataset, it’s time to look at suitable LLMs.
Start by searching on PapersWithCode, a site that collects code for machine learning research, to find the models best suited to your task. Ideally, you want an open-source model, so that you can quickly clone the repository and test it on your dataset.
In some cases, the best model for your problem might not be open source. If so, you might be better off building the model yourself. Doing this requires much more time and effort, and whether it’s worth it depends on how far that model outperforms the most advanced open-source models on the task at hand. We’ll come back to the topic of building a model from a research paper in a future blog.
Assuming that the best-performing model is open source, you can test its capabilities on your chosen dataset. At the time of running this experiment, bart-large-xsum-samsum topped the leaderboard of ROUGE-1 scores for the SamSum dataset. If we run that model on a segment of one of our calls at Kaizan, we get the following output:
Nicolas Blanchot and Karim Foda and Pravin Paratey need someone to coordinate the process and make sure that this is happening.
As summaries go this isn’t bad, but it’s not exactly what we’re looking for either. The summary is coherent, but very generic. It doesn’t provide specifics about what was discussed on the call. That’s fine: at this stage, we’re not looking for perfect, but for the best option out of those available to us.
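For reference, this is roughly how you can run such a baseline yourself, using the transformers summarisation pipeline. The checkpoint id below is our best guess at the Hub name; substitute whichever SamSum-fine-tuned checkpoint tops the leaderboard when you read this:

```python
# Sketch: run an off-the-shelf SamSum checkpoint on one segment of a call.
# The checkpoint id and file name are placeholders.
from transformers import pipeline

summariser = pipeline("summarization", model="lidiya/bart-large-xsum-samsum")

call_segment = open("call_segment.txt").read()   # one chunk of a meeting transcript
result = summariser(call_segment, max_length=128, min_length=16, do_sample=False)
print(result[0]["summary_text"])
```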
It would be unfair to expect perfect summaries straight out of the box, since the model has never seen an example of our meeting transcripts or summaries. We need to train the model on a dataset of our own. But first, we need to assemble that data.
For a task as complex as summarising meetings, you need roughly 5,000 training data points in your dataset. That should provide enough information for your large language model to start to learn how to perform well on the task.
Because of the size of this task, data labelling can be time-consuming, repetitive, and de-motivating. That’s why it’s important to make it as fun as possible.
At Kaizan, we try to run labelling parties every other week, getting the team together in a virtual room for 60 minutes and labelling as many data points as possible. That might not sound like long, especially when you’re reading through 30-minute calls, but when combined with weak labels generated by the base model and by prompt-based models like GPT3, you can produce a summary of a 30-minute meeting surprisingly fast. We also use freelance annotators and data-augmentation techniques to reduce the annotation burden even further. A company-wide dashboard showing how close you are to your labelling target is a great way to keep your team’s eyes on the prize, and to motivate them during a labelling session.
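Those weak labels can be produced in batch ahead of a labelling session, so annotators only have to correct a draft rather than write a summary from scratch. A minimal sketch, with hypothetical file names and a placeholder checkpoint:

```python
# Sketch: pre-fill draft ("weak") summaries for annotators to correct.
# The checkpoint id and file names are placeholders.
import json
from transformers import pipeline

summariser = pipeline("summarization", model="lidiya/bart-large-xsum-samsum")

with open("unlabelled_segments.jsonl") as src, open("draft_labels.jsonl", "w") as dst:
    for line in src:
        record = json.loads(line)
        draft = summariser(record["text"], max_length=128, do_sample=False)
        record["draft_summary"] = draft[0]["summary_text"]   # annotators edit this field
        dst.write(json.dumps(record) + "\n")
```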
As well as labelling your data, you need to break call transcripts into segments that large language models can process. The LLM can then summarise each segment and present it to users. Most LLMs can only handle inputs of 1024–2048 tokens, which sets a maximum segment size.
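One simple way to do that segmentation is to pack whole utterances greedily until you approach the model’s context limit. A minimal sketch; the 900-token budget leaves headroom for special tokens and is just an illustrative choice:

```python
# Sketch: split a transcript into chunks that fit the model's context window.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large")
MAX_TOKENS = 900  # headroom below a 1024-token limit; illustrative choice

def chunk_transcript(utterances):
    """Greedily pack whole utterances into chunks of at most MAX_TOKENS each."""
    chunks, current, current_len = [], [], 0
    for utterance in utterances:
        n_tokens = len(tokenizer.encode(utterance, add_special_tokens=False))
        if current and current_len + n_tokens > MAX_TOKENS:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(utterance)
        current_len += n_tokens
    if current:
        chunks.append(" ".join(current))
    return chunks
```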
As you label the data, publish it on HuggingFace’s Datasets hub. Once you have at least 2,000 data points, you can start to fine-tune your model on your unique dataset and test the results. HuggingFace’s example scripts are great for running fine-tuning experiments. Combined with wandb’s hyperparameter sweep capabilities, you can run multiple experiments using Bayesian, random, or grid searches to optimise your hyperparameters and improve performance against your evaluation metric.
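For a rough idea of what that fine-tuning step looks like without the example scripts, here is a minimal Seq2SeqTrainer sketch. The dataset id, column names and hyperparameters are placeholders; in practice the learning rate and batch size are exactly the kind of thing we let a wandb sweep explore:

```python
# Sketch: fine-tune a summarisation checkpoint on your own labelled dataset.
# Dataset id, column names and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

checkpoint = "lidiya/bart-large-xsum-samsum"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

dataset = load_dataset("your-org/meeting-summaries")   # your labelled data on the Hub

def preprocess(batch):
    inputs = tokenizer(batch["transcript"], max_length=1024, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=128, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenised = dataset.map(preprocess, batched=True,
                        remove_columns=dataset["train"].column_names)

args = Seq2SeqTrainingArguments(
    output_dir="meeting-summariser",
    learning_rate=3e-5,              # in practice, chosen by a wandb sweep
    per_device_train_batch_size=4,
    num_train_epochs=3,
    predict_with_generate=True,
    report_to="wandb",               # log each run so sweeps can compare them
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenised["train"],
    eval_dataset=tokenised["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```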
Once you’re happy with the performance of your fine-tuned model, it’s time to compare its outputs to those of commercially available LLMs, such as OpenAI’s GPT3 or Cohere’s X-Large Generation model. This will help you to evaluate the model’s performance.
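Getting a comparison summary out of GPT3 can be as simple as prompting the completions endpoint with the same call segment. A minimal sketch using the pre-v1 openai Python client as it existed at the time of this experiment; the prompt wording and model name are illustrative:

```python
# Sketch: generate a comparison summary from GPT3 for the same call segment.
# Uses the pre-v1 openai client; prompt wording and model name are illustrative.
import openai

openai.api_key = "YOUR_API_KEY"
call_segment = open("call_segment.txt").read()

response = openai.Completion.create(
    model="text-davinci-002",
    prompt=f"Summarise the following meeting transcript:\n\n{call_segment}\n\nSummary:",
    max_tokens=128,
    temperature=0.3,
)
print(response["choices"][0]["text"].strip())
```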
Here are the results from one of our examples, using a fine-tuned Pegasus model:
GPT3 model output:
The meeting discuss the idea of using a one-pager model to explain and rank different research efforts.
Fine-tuned PEGASUS model output:
Karim, Pravin and Nicolas discuss the need to organize the process of planning and executing a project. They agree that it’s important to have a clear way to note ideas and then rank them based on two things: effort, value and user feedback.
The fine-tuned Pegasus model has produced a more specific output than GPT3, summarising who attended the call and what was agreed. This would be more useful for users, so we decided to deploy this model, and to keep training it as we labelled more data.
Like any human being, your LLM won’t be perfect, and it will sometimes make mistakes. Research suggests we should adopt a healthier mindset when dealing with our children’s failures, and it makes sense to try to do the same with our LLMs.
Make sure to celebrate when you find faulty model outputs. Recognising a gap in one’s knowledge is a good thing, as it offers an opportunity to learn and grow. Build a culture within your organisation and your user base of identifying and celebrating incorrect model outputs. The faster you discover these, the faster you can rectify them and feed examples back into the model, to prevent them occurring again.
At Kaizan, we ask all our users to flag and correct any incorrect output as soon as they find it. This improves the quality of the content they see in the future.
Your final step is to build a continuous training loop. Feed your annotated data and user-driven error corrections back in, re-running your training process whenever you pass a certain number of data points. This creates a virtuous circle that constantly feeds the model with examples of how it should perform, training it to correct errors in its own summarisation process.
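In practice, that loop can be as simple as a scheduled job that checks how many new examples have accumulated since the last run and kicks off retraining once a threshold is crossed. A minimal sketch; count_new_examples and run_fine_tuning are hypothetical stand-ins for your own data store query and training entry point:

```python
# Sketch: retrain whenever enough new labelled or corrected examples accumulate.
# count_new_examples() and run_fine_tuning() are hypothetical stand-ins for
# your own data store and training script.
RETRAIN_THRESHOLD = 500   # illustrative choice

def maybe_retrain(count_new_examples, run_fine_tuning):
    new_examples = count_new_examples()   # annotations + user corrections since last run
    if new_examples >= RETRAIN_THRESHOLD:
        run_fine_tuning()                 # e.g. re-run the fine-tuning sketch above
        return True
    return False
```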
Building large language models that aim to reduce admin overhead is a project that never ends. There will always be more data to add and more nuance to develop in the model’s understanding. That’s not a drawback, but one of the great benefits of this approach: you can always keep making things better, always keep moving closer to the dream of reaching zero admin.