AudioTags: Automatic Content Tagging for Audio and Video Files

AudioTags scans an audio file or the audio track of a video file to identify keywords, nouns, places and brands to create a tag list.

AudioTags analyzes the speech in the file, checking every word that is spoken. From the most frequently spoken words it generates a list of keywords that can be associated with the file.

It is perfect for categorizing files to be published on the Internet, social media, podcasts and so on.

Do you think AudioTags is cool (we do)? Well, FCP Video Tag is super!

How it works

Using it is simple: choose the language, drag the file you want to parse into the main window pane of the program and press Analyze.

The analysis takes some time, depending on the speed of your Internet connection and the power of your computer. When it finishes, the file has been transcribed and a window presents the most used keywords.

In this window you can choose to delete some keywords if they are not relevant.

Finally, you can copy the contents of the lower pane of this window to get the tags (keywords separated by commas).

Discover how AudioTags was built: a technical explanation by the developer

The Goal

The purpose of AudioTags is to understand the content of an audio file. We designed it specifically for podcasts, but it also works for interviews, storytelling, and even the audio track of a video file.

Of course, understanding what a piece of content is about belongs to the branch of science called “semantic analysis”, which is still largely unexplored territory and gives results that are not particularly deterministic.

AudioTags tries to solve the problem with a very simple process: first it performs the transcription (speech-to-text) of the audio file, and then it analyzes this transcription to find out what the keywords are.


AudioTags performs a very simple task once the transcription is done: it counts all the repetitions of all the words in the text.

For example, suppose we analyze this text with AudioTags:

Today I went home. My home is yellow. I feel very good at home.

We will see that the word “home” occurs three times.
Therefore the word “home” is most probably one of the topics of this text.

AudioTags does just that: it takes all the words in the text, counts the repetitions, and highlights those that are repeated most often.
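The counting step can be sketched in a few lines of Python. This is only an illustration of the idea, not the code used inside AudioTags: the function name `count_words` and the tokenizing rule (lowercase, letters only, apostrophes kept inside words) are our own assumptions.

```python
import re
from collections import Counter

def count_words(transcript: str) -> Counter:
    """Count how many times each word occurs, ignoring case and punctuation."""
    # Assumption: a "word" is a run of letters, optionally joined by apostrophes.
    words = re.findall(r"[^\W\d_]+(?:['’][^\W\d_]+)*", transcript.lower())
    return Counter(words)

counts = count_words("Today I went home. My home is yellow. I feel very good at home.")
print(counts["home"])          # 3
print(counts.most_common(1))   # [('home', 3)]
```

With the example sentence above, “home” comes out on top with three occurrences, which is why it is suggested as a tag.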


Of course, the most frequently spoken words in any language are pronouns, conjunctions, adverbs and similar common words, such as “the”, “me”, “many”.

For this reason AudioTags needs a little training the first few times it is used: you have to tell the application which words must be ignored.

Once a word is marked as ignored, it will never be shown again in the list of suggested tags.

To tell AudioTags to ignore a word, simply press the X button in the list of found words.
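As a rough sketch of how an ignore list behaves, the snippet below drops ignored words from a table of word counts. The name `remove_ignored` and the sample data are hypothetical, and the sketch keeps the list in memory only, whereas the application remembers it between sessions.

```python
from collections import Counter

def remove_ignored(counts: Counter, ignored: set) -> Counter:
    """Return a copy of the counts without any word from the ignore list."""
    return Counter({word: n for word, n in counts.items() if word not in ignored})

# Hypothetical counts and ignore list, for illustration only.
counts = Counter({"the": 12, "home": 3, "me": 5, "yellow": 1})
ignored = {"the", "me", "many"}

filtered = remove_ignored(counts, ignored)
print(sorted(filtered.items()))  # [('home', 3), ('yellow', 1)]
```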

Filter by number of occurrences

Once the file is analyzed, a window shows all the words that were found, together with the number of times each one is repeated.

The words are sorted from the most repeated down to those that occur only once.

With the slider you can set the minimum number of repetitions for a tag to be included in the list.

Once the selector is moved, the number of tags actually found is also displayed.
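The effect of the slider can be illustrated with a small Python sketch. The function name `tags_with_min_count` and the alphabetical tie-break between equally frequent words are assumptions made for the example.

```python
def tags_with_min_count(counts: dict, minimum: int) -> list:
    """Return words occurring at least `minimum` times, most repeated first."""
    # Sort by descending count; ties are broken alphabetically (an assumption).
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, n in ranked if n >= minimum]

# Hypothetical word counts, for illustration only.
counts = {"home": 3, "yellow": 1, "podcast": 2, "garden": 1}

tags = tags_with_min_count(counts, 2)
print(tags)       # ['home', 'podcast']
print(len(tags))  # 2, the number of tags shown next to the slider
```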


In many cases a tag needs to be added automatically even when it occurs only once.

This is what the Triggers window is for: it holds keywords that are highlighted in the tag list even when they occur only once.

Triggers are also important because you can specify not only single words, but also groups of words, such as a name followed by a surname, a brand, a product, even a whole sentence.

When a trigger is found, it is highlighted in the results window with its own icon.

You can add a trigger directly, or from the found tags window by pressing the tag icon.
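Because a trigger can be a group of words, it cannot be found by counting single words alone. One possible way to detect triggers, shown here as a sketch under our own assumptions (the names `tokenize` and `find_triggers`, case-insensitive matching on whole words), is to look for the trigger's words as a consecutive run in the transcript.

```python
import re

def tokenize(text: str) -> list:
    """Split text into lowercase words, dropping punctuation."""
    return re.findall(r"[^\W_]+", text.lower())

def find_triggers(transcript: str, triggers: list) -> list:
    """Return the triggers whose words appear consecutively at least once."""
    tokens = tokenize(transcript)
    found = []
    for trigger in triggers:
        phrase = tokenize(trigger)
        n = len(phrase)
        # Slide a window of n words over the transcript and compare.
        if n and any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1)):
            found.append(trigger)
    return found

# Hypothetical transcript and trigger list, for illustration only.
text = "Yesterday I talked to Ada Lovelace about Final Cut Pro."
print(find_triggers(text, ["Ada Lovelace", "Final Cut Pro", "Logic Pro"]))
# ['Ada Lovelace', 'Final Cut Pro']
```

Matching on whole words avoids false hits inside longer words, so a trigger such as “cut” would not match “shortcut”.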

Manually delete a tag

In the lower part of the found tags window there are the keywords that have been found by AudioTags.

If you select one of these keywords and press the “delete” button, the keyword is deleted from the tags of the audio file, but it is not added to the list of words to ignore.

To make AudioTags ignore a word permanently, press the X button in the list of found words instead.


In some cases a keyword should not appear in the tag list as it is, but be transformed into a different tag.

The Shifters window allows you to specify one or more keywords that, when found, must be transformed into other tags.

This is very convenient to identify keywords that must generate other keywords: for example if we specify a list of sports or words that refer to sports events as source keywords, we can indicate “sports” as destination keyword.
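A shifter can be thought of as a lookup table from source keywords to a destination keyword. The sketch below is a minimal illustration, not the actual implementation: the name `apply_shifters` is hypothetical, and removing duplicates after the transformation is our assumption.

```python
def apply_shifters(tags: list, shifters: dict) -> list:
    """Replace each source keyword with its destination keyword."""
    result = []
    for tag in tags:
        shifted = shifters.get(tag, tag)  # unchanged when no shifter applies
        if shifted not in result:         # assumption: keep each tag only once
            result.append(shifted)
    return result

# Hypothetical shifters: several sports all become the tag "sports".
shifters = {"football": "sports", "tennis": "sports", "marathon": "sports"}

print(apply_shifters(["football", "home", "tennis"], shifters))
# ['sports', 'home']
```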

Semantic Analysis

Apple provides some features that try to identify names of people, places and brands within a text.

This function is implemented in AudioTags through the Semantic Analysis Flag.

The quality of the recognition ranges from “good” to “mediocre“.

If the option is selected, the tags found by semantic analysis are highlighted in the list of found tags with the light bulb icon.

By default this option is deactivated because the results are not of particularly good quality, but you can always activate it to see whether it finds keywords you had not thought of.

Analysis of Multiple Files

If multiple files are dragged into the drag and drop area, analysis is done on all files, but a single list of keywords is created.
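One simple way to obtain a single list from several files is to add up the word counts of each file. The sketch below assumes this is how the lists are combined, which the description above does not spell out; the name `merge_counts` and the sample data are hypothetical.

```python
from collections import Counter

def merge_counts(per_file_counts: list) -> Counter:
    """Sum the word counts of several files into one table."""
    total = Counter()
    for counts in per_file_counts:
        total.update(counts)  # update() adds counts instead of replacing them
    return total

# Hypothetical counts for two episodes of the same podcast.
episode_1 = Counter({"home": 3, "garden": 1})
episode_2 = Counter({"home": 2, "kitchen": 4})

merged = merge_counts([episode_1, episode_2])
print(merged["home"])          # 5
print(merged.most_common(1))   # [('home', 5)]
```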


The transcription of the audio is done through the Apple framework which is available in the following languages:

Arabic (Saudi Arabia), Catalan (Spain), Chinese (China mainland), Chinese (Hong Kong), Chinese (Taiwan), Croatian (Croatia), Czech (Czechia), Danish (Denmark), Dutch (Belgium), Dutch (Netherlands), English (Australia), English (Canada), English (India), English (Indonesia), English (Ireland), English (New Zealand), English (Philippines), English (Saudi Arabia), English (Singapore), English (South Africa), English (United Arab Emirates), English (United Kingdom), English (United States), Finnish (Finland), French (Belgium), French (Canada), French (France), French (Switzerland), German (Austria), German (Germany), German (Switzerland), Greek (Greece), Hebrew (Israel), Hindi (India), Hindi (Latn), Hungarian (Hungary), Indonesian (Indonesia), Italian (Italy), Italian (Switzerland), Japanese (Japan), Korean (South Korea), Malay (Malaysia), Norwegian Bokmål (Norway), Polish (Poland), Portuguese (Brazil), Portuguese (Portugal), Romanian (Romania), Russian (Russia), Slovak (Slovakia), Spanish (Chile), Spanish (Colombia), Spanish (Latin America), Spanish (Mexico), Spanish (Spain), Spanish (United States), Swedish (Sweden), Thai (Thailand), Turkish (Turkey), Ukrainian (Ukraine), Vietnamese (Vietnam), wu (-CN), yu (-CN)

Each audio file is segmented into pieces with a maximum duration of one minute, and the analysis is performed on Apple’s servers.

For this reason it is necessary that the computer on which AudioTags is run is connected to the Internet.

The audio is sent only to Apple’s servers; no files are sent to Ulti.Media’s servers, out of respect for your privacy and personal data.
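The one-minute segmentation mentioned above can be illustrated by computing the start and end time of each piece. This is only a sketch of the arithmetic: the name `segment_bounds` is hypothetical, and the actual cutting and uploading of the audio is not shown.

```python
def segment_bounds(duration_seconds: float, max_segment: float = 60.0) -> list:
    """Split a duration into consecutive (start, end) pieces of at most max_segment seconds."""
    if max_segment <= 0:
        raise ValueError("max_segment must be positive")
    bounds = []
    start = 0.0
    while start < duration_seconds:
        end = min(start + max_segment, duration_seconds)
        bounds.append((start, end))
        start = end
    return bounds

# A file of 2 minutes and 30 seconds gives two full pieces and a shorter last one.
print(segment_bounds(150.0))
# [(0.0, 60.0), (60.0, 120.0), (120.0, 150.0)]
```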

Transcription Quality

The quality of the transcription of the spoken text depends on many factors:

  • Speaker’s skill
  • Language
  • Recording quality
  • Background noises
  • Use of jargon
  • Number of people talking

Depending on all these parameters, the share of words that Apple’s framework recognizes correctly varies from a satisfactory 98% to an unsatisfactory 40%.

This means that, on average, the quality is not sufficient to produce a true transcript.

However, given the length of typical audio files and the purpose of the application, the words that are recognized are quite likely to include enough keywords to identify, more or less, the topics covered in the audio.

For this reason AudioTags does not aim to be the ultimate solution. It is an aid that helps you identify keywords that you might otherwise have overlooked.

Supported file types

Audio Files

Microsoft Wave (WAV)
MPEG-4 Audio (AAC, M4A)

Video Files

QuickTime (MOV)
MPEG-4 Video (MP4, M4V)

AudioTags requires macOS 10.15
