One part of my work involves interviewing people, analysing the information gathered, and then writing on the basis of the conversation and analysis. The overall tasks and processes involved include:
- Background information gathering (research)
- Planning (initial questions, conversation approach)
- Conducting the Interview (talking, listening, debating, discussing)
- Taking notes during the interview (writing, recording)
- Transcribing the recording (listening to the interview and writing/typing everything ‘as is’)
- Writing the first draft of the story/article/research paper incorporating elements from the interview
- Editing (fact-checking, grammar checks, rewriting, re-sequencing)
- Proofreading
- Publishing/submitting
Until a few years ago, depending on the complexity of the subject, a professional with about two years of experience could take anywhere from 2-3 days to a few weeks to submit their final draft. Real pros – people who had done the exercise more than 100 times – could submit the draft within a day, two at most.
Which of these tasks was the most time-consuming? In my opinion, the transcribing. I had to listen to the tape over and over again to make sure I typed out everything accurately. To give a clearer picture: in 2022, if I had to write a 1,000-word article based on a 60-minute interview, and it took me 12 hours, the breakdown would be as follows:
- Research and Prep – 2 hours
- Interview – 1 hour
- Transcribing – 5 hours
- First Draft – 1 hour
- Editing – 2 hours
- Final Draft – 1 hour
Today, I can do the same article in about four hours. How? AI can help me do the research and prep in 30 minutes (that’s an hour and a half saved). I would still need an hour for the actual interview. AI can transcribe the interview in seconds – that’s the biggest time saver: almost all of the 5 hours. Writing would still take an hour, and I would still spend the 2 hours on editing and finalising, using AI to consider multiple structuring options.
Writers, journalists and content creators are still as relevant as they were, but with the advent of generative AI, tasks are changing, roles are being modified and skills are being upgraded. More pertinently, core skills are coming into focus while ancillary or peripheral skills are becoming redundant. In the scenario described above, for example, interviewing, writing and editing are the core skills, while transcribing – crucial as it is – is a laborious process of listening and noting that is best automated, and automating it takes nothing away from core journalistic skill.
Transcribing is done by a generative AI category called Automatic Speech Recognition (ASR). A more familiar application of ASR is the speech-to-text feature on our phones; other examples include dictation apps and voice-controlled bots. Adapting to this evolving world means understanding and adopting generative AI.
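To make this concrete, here is a minimal sketch of what AI-assisted transcription can look like in code, assuming the open-source Whisper model accessed through the Hugging Face transformers library (the model name and the file name interview.mp3 are placeholders, and decoding an mp3 also assumes ffmpeg is installed):

```python
# A minimal ASR sketch, assuming the `transformers` library and an
# open Whisper checkpoint; "interview.mp3" is a placeholder file name.
from transformers import pipeline

# Load an automatic-speech-recognition pipeline backed by Whisper.
# chunk_length_s lets long recordings be processed in 30-second chunks.
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",
    chunk_length_s=30,
)

# Transcribe a recorded interview; the result is a dict with a "text" field.
result = asr("interview.mp3")
print(result["text"])
```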
Generative AI Categories: A High-Level Overview
Moving beyond content creation and journalism, a good starting point for someone aiming to get a deeper understanding is to learn to classify: what are the different categories of generative AI (as of now), what are their underlying architectures and models, and what are their main uses? The following table provides an overview.
| Category | Model Architecture | Examples | Student Uses |
| --- | --- | --- | --- |
| Text Generation | Transformers, LLMs (Large Language Models) | GPT-4, Claude, Gemini, LLaMA | Language tasks, writing, structuring, study help |
| Image Generation | Diffusion Models, GANs (Generative Adversarial Networks) | DALL·E, Midjourney, Stable Diffusion | Visual aids, conceptual art, visual modelling |
| Audio Generation | Diffusion, Autoregressive models, Neural Vocoders | AudioLM, ElevenLabs, Voicebox | Podcasts, music creation, voiceovers |
| Video Generation | Diffusion, Transformers, GANs | Runway, Sora, Pika | Educational and infotainment videos, explainer animations |
| Code Generation | LLMs focused on code | GitHub Copilot | Writing code, debugging |
| Virtual Environments | Diffusion, Neural Radiance Fields (NeRF) | Omniverse, Luma AI | VR, architecture modelling, game dev |
| Audio-to-Text (ASR) | Transformers, Conformers, Recurrent Neural Networks + Connectionist Temporal Classification | Otter.ai, Whisper | Transcription, note-taking, podcast-to-text |
| Multimodal Models | Multimodal Transformers | GPT-4o, Gemini 1.5, LLaVA | Analysis, diagrams, cross-modal learning |
In the context of AI, architecture means the design and organisation of a model’s parts. An AI model has layers, connections and algorithms, and together they help the machine (computers, really) process data and learn patterns. Broadly, the main architectures are Transformers, GANs, Diffusion models, and RNNs/CTC and Conformers. But what are these, and how do they work?
Transformers are the foundation of many AI tools such as ChatGPT, Claude and Gemini. They are very good at decoding and translating languages, summarising text, and writing, among other tasks. (Okay, so they can’t really write, they simulate writing, but that is a different debate.) Transformers consider chunks of information and predict what is likely to come next, and they can do this not just for short articles but for large bodies of text. They do this by deciding which parts of the input are important at any given step of the process. What makes them powerful is that Transformers can process large quantities of information in parallel, which makes them fast and scalable.
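To make “deciding which part of the input is important” a little more concrete, here is a minimal sketch of the scaled dot-product attention at the heart of a Transformer, written in plain NumPy. The three-token example and the random vectors are made up purely for illustration:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Core Transformer step: each position decides how strongly to
    'attend' to every other position, then mixes their values accordingly."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # similarity of each query with each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax -> attention weights
    return weights @ V, weights                     # weighted mix of the value vectors

# Toy example: 3 tokens, each represented by a 4-dimensional vector.
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))

# Self-attention: queries, keys and values all come from the same input.
# Note that all rows are computed at once -- this is the "in parallel" part.
output, weights = scaled_dot_product_attention(x, x, x)
print(weights.round(2))   # each row sums to 1: how much each token attends to the others
```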
Generative Adversarial Networks or GANs have two parts: one part (the generator) tries to create fake content (for example, an image) and the other part (the discriminator) tries to catch it (the fakeness, so to say). As they train, both get better at their jobs and the generator’s output gets closer to the desired outcome. GANs are good at creating realistic faces, generating photorealistic art, simulating artistic styles, creating art from photos and so on.
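The generator-versus-discriminator contest can be shown in a few lines of code. The sketch below, assuming PyTorch, uses toy one-dimensional data rather than images (real face-generating GANs use much larger convolutional networks), but the two-player training loop is the same idea:

```python
import torch
import torch.nn as nn

# Toy "real" data: samples from a normal distribution centred at 4.
def real_batch(n=64):
    return torch.randn(n, 1) + 4.0

G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))                 # generator
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())   # discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(2000):
    # 1) Discriminator: learn to label real samples 1 and fakes 0.
    real = real_batch()
    fake = G(torch.randn(64, 8)).detach()
    loss_d = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # 2) Generator: try to fool the discriminator into calling fakes "real".
    fake = G(torch.randn(64, 8))
    loss_g = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

# As both improve, generated samples should drift towards the real mean (~4).
print(G(torch.randn(1000, 8)).mean().item())
```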
Diffusion models start with noise and then progressively remove it so that a clear form emerges, creating a recognisable image. They are very good at creating artistic images, generating video frames, and producing high-quality backgrounds and textures.
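Here is a small, simplified sketch of the forward half of that process (adding noise step by step), using a one-dimensional signal as a stand-in for an image; the variance schedule is invented for illustration. The reverse, learned half is described in the closing comment:

```python
import numpy as np

# Forward diffusion: gradually mix a clean signal with Gaussian noise.
rng = np.random.default_rng(0)
clean = np.sin(np.linspace(0, 2 * np.pi, 100))   # stand-in for a clean image

def add_noise(x, t, T=1000):
    """Return x after t of T noising steps (a simplified noise schedule)."""
    alpha = 1.0 - t / T          # how much of the original signal survives
    noise = rng.normal(size=x.shape)
    return np.sqrt(alpha) * x + np.sqrt(1 - alpha) * noise

for t in (0, 250, 500, 999):
    noisy = add_noise(clean, t)
    print(f"t={t:4d}  correlation with clean signal: {np.corrcoef(clean, noisy)[0, 1]:.2f}")

# Generation runs this in reverse: a neural network is trained to predict the
# noise added at each step, so sampling can start from pure noise and denoise
# step by step until a clear image emerges.
```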
RNNs, CTC and Conformers are good at recognising spoken language, including in noisy environments and across multiple languages. Specifically, RNNs (Recurrent Neural Networks) handle time-based data like audio one step at a time, while CTC (Connectionist Temporal Classification) aligns audio with words without the need for precise timing. Conformers combine RNN-like time awareness with Transformer power, improving accuracy.
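“Without the need for precise timing” is the key trick of CTC: it sums over every possible alignment between a short transcript and a longer stream of audio frames. A minimal sketch of how a CTC loss is used during training, assuming PyTorch and entirely made-up numbers:

```python
import torch
import torch.nn as nn

# Suppose an acoustic model outputs a score for each of C symbols
# (26 letters + space + a special "blank") at each of T audio frames.
T, N, C = 50, 1, 28                                     # 50 frames, batch of 1
log_probs = torch.randn(T, N, C).log_softmax(dim=-1)    # untrained, random scores

# Target transcript encoded as character indices (much shorter than the audio).
target = torch.randint(low=1, high=C, size=(N, 10))     # 10 characters
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 10, dtype=torch.long)

# CTC sums over every valid alignment of 10 characters to 50 frames,
# so the model never needs to know exactly *when* each character was spoken.
ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, target, input_lengths, target_lengths)
print(loss.item())
```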
The way I understand the generative AI world, it is not a single technology or model. It is a collection of many smaller machine-learning abilities, each driven by its own class or category of models. The AI tools we use are often built on the foundation of one or more of these models. For example, ChatGPT is based on the Transformer architecture, specifically a Large Language Model (LLM). Advanced versions of ChatGPT are evolving into multimodal systems because they also combine other components, such as an audio encoder based on Whisper (a Transformer model for Automatic Speech Recognition) and a vision encoder such as CLIP (which uses contrastive learning with dual Transformer encoders for images and text).
When I first started trying to figure out generative AI, the deeper I went into the technical aspects, the more complex it started sounding, but that’s just because I was new to the terminology. For example, a vision encoder is just a model that converts visual data into a numerical representation. The purpose of encoding is to convert information into a form that machines and AI systems can work with. That converted form is vectors (which are lists of numbers). Similarly, a text encoder is a model that converts text (words, documents, etc.) into numerical representations called embeddings, and an audio encoder is a model that converts audio into a numerical representation (an audio embedding). It took me a lot of time to understand that Transformers can be Encoders, Decoders or Encoder-Decoders.
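A short sketch of what encoders actually produce, assuming the Hugging Face transformers library and the public CLIP checkpoint openai/clip-vit-base-patch32 (the blank white image is just a placeholder for a real photo):

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Text encoder: sentences -> embeddings (lists of numbers).
text_inputs = processor(text=["a cat on a sofa", "a dog in a park"],
                        return_tensors="pt", padding=True)
with torch.no_grad():
    text_emb = model.get_text_features(**text_inputs)
print(text_emb.shape)        # e.g. (2, 512): two sentences, 512 numbers each

# Vision encoder: an image -> an embedding in the same vector space.
image = Image.new("RGB", (224, 224), "white")   # placeholder image
image_inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    image_emb = model.get_image_features(**image_inputs)
print(image_emb.shape)       # e.g. (1, 512)
```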

Like many people, when I start out in a new area or subject, I use terms without necessarily understanding what they mean at the beginning. For example, I understand ‘architecture’ in normal language as a design or blueprint. But what is architecture in AI? And what is a model? How are the two related?
While an architecture is a blueprint, a model is a specific instance of that architecture, trained on data. Take the Transformer, an architecture designed to handle sequences of data, with the ability to focus on different parts of the input sequence when making predictions. An LLM is a Transformer-based model that is trained on billions of words to predict the next word given the prior context. LLMs are powerful AI models that use the Transformer architecture to understand and generate human language at large scale.
In practical everyday terms, Transformer is the architecture, LLM is one class or category of models based on that architecture, ChatGPT is a tool built using the LLM class of models, and GPT-4 is a specific trained model within that product.
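A small sketch of that hierarchy in code, assuming the transformers library and using the openly downloadable GPT-2 checkpoints rather than GPT-4 (which is not publicly available as weights): the architecture is the same class, while the models differ only in their trained parameters.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Two *different trained models* that share the same Transformer architecture:
# the blueprint is identical, the learned parameters are not.
small = AutoModelForCausalLM.from_pretrained("gpt2")          # ~124M parameters
medium = AutoModelForCausalLM.from_pretrained("gpt2-medium")  # ~355M parameters

for name, model in [("gpt2", small), ("gpt2-medium", medium)]:
    n_params = sum(p.numel() for p in model.parameters())
    print(name, type(model).__name__, f"{n_params / 1e6:.0f}M parameters")

# A "tool" like a chatbot then wraps one trained model behind an interface:
# tokenize the prompt, let the model predict the next tokens, decode them.
tok = AutoTokenizer.from_pretrained("gpt2")
inputs = tok("A model is a trained instance of an", return_tensors="pt")
output = small.generate(**inputs, max_new_tokens=10, pad_token_id=tok.eos_token_id)
print(tok.decode(output[0]))
```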
So are there other classes of models (apart from LLMs) based on the Transformer architecture? Yep, and the following table gives an overview of Transformer-based model classes.
| Category/Class | Examples | Description |
| --- | --- | --- |
| LLMs | GPT-4, Claude, LLaMA | Generate and understand natural language |
| Vision Transformers (ViTs) | ViT, BEiT, DINO | Process and classify images |
| Speech Models | Whisper, Wav2Vec2 | Audio to text, understand speech |
| Multimodal Models | GPT-4o, CLIP, Gemini | Process, analyse and align multiple data types |
| Encoder-Decoder Models | T5, BART | Translate, summarise, transcribe |
| Graph Transformers | Graphormer, SAN | Model relationships in graph data |
Just as Transformers support LLMs, they also power vision, speech and multimodal models. And just as Transformers are an architecture, Diffusion models, GANs and VAEs are also architectures, each with its own list of model classes and specific examples within them. Diffusion models aren’t just for text-to-image: they now support video, audio and even 3D generation. GANs and VAEs have multiple specialised versions for translation, enhancement or stylisation. Each architecture spawns its own ecosystem of model types based on application needs. The following diagram gives an idea of the relationship between four architectures and the hierarchy of some of the models based on them.

As I worked through all this, I realised that models are called models in AI because they are mathematical representations designed to capture patterns in data using equations, transformations and optimisation techniques. That’s the core of how AI ‘learns’. In statistics and maths, a model is a formal system (like an equation) that describes or predicts phenomena. So in AI:
A Transformer model is a mathematical structure (with matrices, attention weights, etc.) built to process sequences. A Diffusion model uses probability distributions and denoising functions to learn how to generate structured data.
Every “model” is really just a set of equations + learned values (parameters). And these models can then encode language, vision, audio, and more—all through math.
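To see what “a set of equations + learned values” looks like in practice, here is the central self-attention equation inside a Transformer, written out. The input sequence is X, and the matrices W_Q, W_K and W_V are the learned parameters; d_k is the dimension of the key vectors.

```latex
% Scaled dot-product self-attention: the core equation inside a Transformer.
% X is the input sequence; W_Q, W_K, W_V are learned parameter matrices.
\[
Q = X W_Q, \qquad K = X W_K, \qquad V = X W_V
\]
\[
\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
\]
```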
So how do AI, ML, architectures and models come together? It’s all a hierarchy. Artificial Intelligence (AI) is the broadest subject area: the science of building systems that mimic human intelligence and behaviour. Machine Learning (ML) is a subset of AI: the set of techniques that allow computers to learn patterns from data, without being explicitly hard-coded (as in earlier programming).
In practical terms, AI is the goal (intelligence), ML is a method to achieve it (learning from data), Architectures are the designs used in ML (e.g., Transformers, GANs), and Models are trained implementations of those architectures (e.g., GPT-4, Whisper).
Study AI with Goseeko: https://www.goseeko.com/landing/international-students/