Using AWS Transcribe in CFML: Starting a Transcribe Job
Posted 18 September 2018
There are three distinct phases in working with AWS Transcribe: starting a job, waiting for the job to complete, and parsing the results of the job. This is because Transcribe is a fully asynchronous service. Transcribe cannot transcribe an hour-long video in milliseconds. It takes time to do the work.
In this post, I’ll show you how to start a Transcribe job and explore some of the options available to you.
As always when working with an AWS service via the AWS Java SDK, there’s a basic pattern that you follow:
- Get a copy of the client that’s making a connection to the service you want to use.
- Create a “request” object.
- Fill the “request” object with the parameters (or other objects) you need to supply.
- Tell the client to make the request.
- Get back a “response” object.
As I wrote in the series on using Simple Notification Service (SNS) from CFML, the JavaDocs for the AWS Java SDK are comprehensive and always up-to-date. They are, alas, also just JavaDocs. You’re not going to find detailed examples of how to complete full tasks in these docs. If you look at the documentation for com.amazonaws.services.transcribe.AmazonTranscribeClient, you can see all the things that you can do with Transcribe via the Java SDK.
Here are the steps to starting a Transcribe job via the AWS Java SDK:
- Get a copy of the Transcribe client we created in the first part of this series.
- Create a “Media” object that specifies the path to the media file on S3 that you want to transcribe.
- Create a Transcribe job Settings object that details any custom settings that you want to use.
- Create a “StartTranscriptionJobRequest” object.
- Give the job a unique name.
- Set the Media object into the StartTranscriptionJobRequest object.
- Set the Settings object and all other required settings into the StartTranscriptionJobRequest.
- Run the StartTranscriptionJobRequest.
- Get back a StartTranscriptionJobResult object.
If you’ve read the other posts in this larger series on using AWS from CFML, you’ll notice step 7 in the list above is a bit vague. I’ll explain why below.
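Taken together, the nine steps might be sketched roughly like this in cfscript. This is not the AWSPlaybox code itself: it assumes the AWS Java SDK jars are already on your CFML engine's classpath, that transcribeClient is an AmazonTranscribe client created elsewhere, and the function and argument names are hypothetical.

```cfml
<cfscript>
// Sketch only. Assumes the AWS Java SDK is on the classpath.
// Steps 2-7: build the request. Function and argument names are hypothetical.
function buildTranscriptionJobRequest(required string mediaFileUri, required string mediaFormat, required string jobName) {
    var modelPath = "com.amazonaws.services.transcribe.model.";

    // Media object pointing at the source file on S3
    var media = createObject("java", modelPath & "Media").init();
    media.setMediaFileUri(arguments.mediaFileUri);

    // Optional custom settings for the job
    var jobSettings = createObject("java", modelPath & "Settings").init();
    jobSettings.setShowSpeakerLabels(javaCast("boolean", true));
    jobSettings.setMaxSpeakerLabels(javaCast("int", 5));

    // The request itself: name, media, and the remaining required values
    var jobRequest = createObject("java", modelPath & "StartTranscriptionJobRequest").init();
    jobRequest.setTranscriptionJobName(arguments.jobName);
    jobRequest.setMedia(media);
    jobRequest.setMediaFormat(lCase(arguments.mediaFormat)); // "mp3", "mp4", "wav" or "flac"
    jobRequest.setLanguageCode("en-US");
    jobRequest.setSettings(jobSettings);
    return jobRequest;
}

// Steps 8-9: run the request and hand back the StartTranscriptionJobResult.
// transcribeClient is assumed to be an AmazonTranscribe client created elsewhere.
function startTranscribeJob(required any transcribeClient, required any jobRequest) {
    return arguments.transcribeClient.startTranscriptionJob(arguments.jobRequest);
}
</cfscript>
```

The result object exposes the newly created job through getTranscriptionJob(), but the job name you passed in is still the handle you will use for every later call.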
The Code to Start a Transcribe Job
Here’s how we do this in the AWSPlaybox app:
If you haven’t already read the entry on the basic setup needed to access AWS from CFML, please do so now.
As there are three basic actions when working with a Transcribe job, I’ve broken those out into three separate code blocks in /transcribe.cfm. The first, containing the code to start a Transcribe job, starts with:
If you fill out the form that is rendered by default when /transcribe.cfm loads, and enter a valid path, this code block will execute. The code does a basic check to make sure that the path points to either an MP3 or MP4 file:
Transcribe supports the following file formats for processing: MP3, MP4, WAV, and FLAC (see com.amazonaws.services.transcribe.model.MediaFormat).
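A check that covers all four supported formats could look something like the sketch below. The helper name and the exception type are hypothetical, and it only inspects the file extension, not the file contents.

```cfml
<cfscript>
// Hypothetical helper: returns the lower-case media format for a path,
// or throws if the extension is not one Transcribe can process.
function getTranscribeMediaFormat(required string mediaPath) {
    var supportedFormats = "mp3,mp4,wav,flac";
    // Everything after the last "." is treated as the extension
    var extension = lCase(listLast(arguments.mediaPath, "."));
    if (!listFind(supportedFormats, extension)) {
        throw(type="UnsupportedMediaFormat", message="Transcribe cannot process .#extension# files.");
    }
    return extension;
}
</cfscript>
```

The returned string can be passed straight to setMediaFormat() on the request, since the MediaFormat values are the same lower-case strings.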
If you continue on in this code block, you’ll see the nine steps listed above translated into code, with a few detours:
The code above should be fairly self-explanatory with the comments, but I do want to talk about some of the options available to you when you start a Transcribe job.
First, you need to set the language of the source media file. Currently, Transcribe supports US English (en-US) and US Spanish (es-US). Given that Transcribe uses the same natural language AI used by Alexa, I expect we’ll see other languages supported in the future.
Next, the Transcribe Settings object contains a lot of options which you can enable depending on your specific transcription needs. In this case, I have enabled speaker labels and set the maximum number of identified speakers to five:
Enabling “show speaker labels” has the Transcribe engine do its best to identify every unique voice in the source media object. Transcribe obviously doesn’t know that the first person speaking is Jill, the second Tom, and so on. Instead, it identifies each speaker with a number. It’s up to you to know what to do with that information. Enabling speaker labels is optional. I enable it here simply to help you understand how to work with the Settings object in Transcribe.
Setting the maximum number of speaker labels is only useful if you enable speaker labels in the first place. You’d want to set this to a number that’s appropriate for your source media. You could use this to focus on the two or three people who are the primary speakers in your source media.
Additional options in the Settings object involve identifying the channel in your multi-channel audio where your speaker’s audio came from, and using custom vocabularies. Custom vocabularies can be very powerful in helping the Transcribe engine recognize highly domain-specific words, such as medical terminology.
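As a hedged illustration of those other options, a Settings object for multi-channel audio with a custom vocabulary might be built like this. The vocabulary name "medical-terms" is hypothetical and would have to exist in your account already, and the sketch assumes an SDK version recent enough to include setChannelIdentification().

```cfml
<cfscript>
// Sketch only: "medical-terms" is a hypothetical custom vocabulary name.
transcribeSettings = createObject("java", "com.amazonaws.services.transcribe.model.Settings").init();

// Transcribe each audio channel separately and label the output by channel
transcribeSettings.setChannelIdentification(javaCast("boolean", true));

// Use a custom vocabulary that was created in the account beforehand
transcribeSettings.setVocabularyName("medical-terms");
</cfscript>
```

Note that the API documentation says speaker labels and channel identification cannot both be enabled in the same request, so this is an alternative to the speaker label settings rather than an addition to them.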
The Destination of Your Transcribe Job
One option that I did not set in this example is the very important OutputBucketName. This setting was added to the service in June 2018.
By default, Transcribe will put the output of the Transcribe job into a bucket owned by AWS itself. You have no direct access to this bucket, except through a special, time-sensitive URL given to you when a Transcribe job completes. I’ll explore this more in the next post.
If you want Transcribe to put the output of the Transcribe job into an S3 bucket that you own, you can do that. You need to call setOutputBucketName() on the StartTranscriptionJobRequest object and pass in the name of the S3 bucket in your account that you want to use. Note that if you choose to go this route, the IAM user that is making the request to Transcribe to start the transcription job must also have permission to access this S3 bucket, and Transcribe itself needs to have permission to access this S3 bucket.
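In cfscript, that call might look like the following sketch. The bucket name is hypothetical, and the rest of the request (job name, media, format, language) is left out here for brevity.

```cfml
<cfscript>
// Sketch only: "example-transcripts-bucket" is a hypothetical bucket in your own account.
jobRequest = createObject("java", "com.amazonaws.services.transcribe.model.StartTranscriptionJobRequest").init();

// Pass the bucket name only, not an s3:// or https:// URL
jobRequest.setOutputBucketName("example-transcripts-bucket");
</cfscript>
```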
The Importance of the Job Name
Although it seems like a throwaway requirement in starting a Transcribe job, the job name is very important. Once you start a Transcribe job, the only way you can get information about the job or the result of the job is by using the job name that you created. As such, it’s important that job names are unique (unique within your AWS account, not across all of AWS), and meaningful in some way to you.
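One way to get names that are both unique and meaningful is to combine a readable label with a UUID. The helper below is a sketch with a hypothetical name. It relies on the documented constraints that a job name may contain only letters, digits, periods, underscores, and hyphens, and may be at most 200 characters long.

```cfml
<cfscript>
// Hypothetical helper: builds a readable, unique job name from a label.
function makeTranscribeJobName(required string label) {
    // Replace anything outside the allowed character set with a hyphen
    var safeLabel = reReplace(arguments.label, "[^0-9a-zA-Z._-]", "-", "all");
    // Trim the label so label + "-" + 35-character CFML UUID stays under 200 characters
    safeLabel = left(safeLabel, 150);
    return safeLabel & "-" & createUUID();
}
</cfscript>
```

Calling makeTranscribeJobName("weekly standup.mp4") yields a name that starts with "weekly-standup.mp4-" followed by a UUID.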
When you start a Transcribe job, you will get a StartTranscriptionJobResult object back. Even so, that job name is what you need to do any future work with your Transcribe job.
In the AWSPlaybox app, I take the job name and the start time of the job, and add those to an application-scoped array of current Transcribe jobs:
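As a rough illustration of that kind of tracking (not the AWSPlaybox code itself), the sketch below appends each job to an application-scoped array. The variable name application.currentTranscribeJobs and the function name are assumptions, and it presumes the application scope is enabled for your app.

```cfml
<cfscript>
// Sketch only: application.currentTranscribeJobs is an assumed variable name.
function rememberTranscribeJob(required string jobName) {
    // Lock so that concurrent requests do not clobber the shared array
    lock scope="application" type="exclusive" timeout="5" {
        if (!structKeyExists(application, "currentTranscribeJobs")) {
            application.currentTranscribeJobs = [];
        }
        arrayAppend(application.currentTranscribeJobs, {
            jobName = arguments.jobName,
            timeStarted = now()
        });
    }
}
</cfscript>
```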
In a real, production application, you would want to store this information in a database or some other persistent storage mechanism. It can take minutes or hours for a Transcribe job to complete, and you’re going to need to check on a regular basis to see whether the job is done. That’s exactly what we’ll do in the next post.