Documentation: Valossa Core API

Version of documentation: 2018-02-05

Current version of Valossa Core metadata JSON format: 1.3.3 (changelog).


What does the Valossa Core API detect?

Call the API:

API overview

Input data formats

Usage examples

Creating a new video analysis job

Error responses from the API

Getting status of a single job

Getting the results of a job

Listing statuses of all your jobs

Cancel a job

Valossa Search API for your analyzed videos (coming soon)

View, quality-check and understand the API results in an easy way:

Visualization of your results in a Valossa Report

Use the machine-readable metadata provided by the API about your videos:

General notes about metadata

Output metadata JSON format   ← Understand Valossa Core metadata

Basics: How (and why) to read Valossa Core metadata in your application code

The main JSON structure


Detection types


Other optional data fields of a detection

Tips for reading some detection types

Detection data examples


Code examples for reading metadata

Metadata JSON format version changelog

Persons in the celebrity face gallery

What does the Valossa Core API detect from my videos?

The Valossa Core API is a REST API for automatic video content analysis and metadata creation. Metadata contains detections of concepts from the video.

Metadata describes the following detections:

  • humans with visible faces (with attributes such as gender and age information and the possibly detected similarity to persons in our celebrity face gallery)
  • visual context (such as objects, animals, scenes and styles)
  • audio context (music styles and musical moods, instruments, sounds of human actions and animals and machines etc., sound effects)
  • audio speech-to-text
  • topical keywords from speech (speech source being either the audio track or the SRT transcript file of the video)
  • topical keywords from video description
  • face groups of co-occurring faces
  • IAB categories for content related advertising
  • detected shot boundaries
  • explicit content detection (visual nudity detection and offensive words detection from audio and transcript)

Detections are also provided in two practical groupings: grouped by detection type (best concepts first, including time-coded occurrences of the concept when applicable) and grouped by second ("What is detected at 00:45 in the video?"). See the explanation of the practical detection groupings.

API overview

The REST API is invoked using HTTP (HTTPS) requests. You can also assign new video analysis jobs to the API on the easy API call page. The responses from the REST API are in the JSON format. The Valossa Report tool helps you to visualize the results. Please use the secure HTTPS transport in all API calls to protect your valuable data and confidential API key: unencrypted HTTP transport is not allowed by the API.

REST API basic communication

Get your API key from under "My account" - "API keys" in Valossa Portal. If you have several applications, please create a different API key for each of them, on the "API keys" page. Give a descriptive name for each API key, and if necessary, give access rights of the API key to different users of your organization.

Note: As the administrator user of your organization, you can create new users under "My account" - "Manage users". If several people in your organization need to use the Portal, add them in "Manage users" so that they are all mapped to your organization and can view analysis results and post new jobs (if you grant them the rights). The permissions are mappings between users and API keys ("this user has read-write access to this API key, so she can both view results and make new job requests"), so please configure the permissions with this in mind; the API key permissions per user can be edited in "Manage users". Create only one customer account for your company/organization; there can be as many users under the customer account as you wish!

The API consists of 5 different functions:

  1. new_job [HTTP POST]
    This function is used to create a new video analysis job in our system. The job (including e.g. the URL of the video file to be analyzed) is defined by using a JSON formatted data structure that is included as the body of the HTTP POST request. This function returns the job_id (UUID value) of the created job. The job_id is used after this as the identifier when querying the status and the results of the job.
  2. job_status [HTTP GET]
    This function is used to query (poll) the status of a specific job, based on its job_id.
  3. job_results [HTTP GET]
    This function is used to fetch the resulting metadata JSON of a finished analysis job identified by its job_id.
  4. list_jobs [HTTP GET]
    This function lists all the jobs for a given API key.
  5. cancel_job [HTTP POST]
    This function cancels a job, if it is in a cancellable state.

You can conveniently monitor the status of your jobs in Valossa Portal. There you can also call the new_job function of the API with an easy API request generator.

Your API key is shown in Valossa Portal on the request generator page and job results page. Keep the key confidential.

Please note regarding speech analysis:

  • If you already have the speech transcript of your video in the SRT format (for example the subtitles of your movie), please specify the transcript URL in the request, along with the video URL. The transcript content will be analyzed, and the detected concepts will be included in the "transcript" part of the metadata JSON.
  • Your existing transcript is, obviously, a more reliable source for speech information than audio analysis. So, if you have the transcript, please use it – it’s a valuable asset!
  • Audio keyword detection and audio speech-to-text will be performed only if you did not provide the SRT transcript (however, providing or omitting the SRT transcript does not affect the audio.context detections).
  • The audio-related metadata generated by us will not contain an actual audio transcript. Instead, we provide you a uniquely descriptive set of keywords extracted from the speech. Whether the source of speech information is audio itself or your transcript file, the output format of the detected keywords is similar in the metadata.

Input data formats

Supported video formats: we support most typical video formats, including but not limited to MP4, MPEG, AVI, FLV, WebM, with various codecs. Currently, we cannot provide a fixed list of supported formats and codecs, but for example MP4 with the H.264 codec works.

Video file size limit: 5GB per video file.

Video duration limit: 7 hours of playback time per video file.

Video vertical resolution limit: 1080 pixels.

Currently, the only supported languages for speech-based detections are English and French. By default, speech is analyzed as English language. See more information about language selection.

If the video file contains several video streams, only the first one is analyzed.

If the video file contains several audio streams, only the first one is analyzed. (Please note that audio keyword detection and audio speech-to-text will be performed only if you did not provide your own SRT-based speech transcript; however, providing or omitting the SRT transcript does not affect the audio.context detections.) The audio stream can be either mono or stereo.

Supported transcript format: SRT only.

File size limit: 5MB per SRT file.

Currently, the only supported transcript language is English.

Usage examples

Creating a new video analysis job

You must pay for the video analysis jobs that you run, unless you have enough existing positive payment balance that was added by Valossa to your account as a result of (for example) a free-usage campaign. Keep a working credit card in your billing information in Valossa Portal, and keep your suitable service subscription (such as the Recognition plan) active!

Start a new subscription on the Valossa product purchase page and manage your existing subscriptions on the plans management page. Manage your billing information on the payment profile page. You can add one or more credit cards and select one of them as the default card, on which the payments are charged. Payments are charged either at the turn of each month, or when a sufficiently large outstanding balance has accumulated. If payments cannot be processed on the credit card, processing of your jobs will cease. Valossa may skip charging the card when the amount to be charged would be very small; a skipped charge may be collected as part of a later charging event on a credit card. Receipts of credit card charges will be sent to your email if you are an administrator user of your organization's account. If a payment fails due to expired, faulty or missing credit card information, you must add a working credit card as soon as possible, and Valossa has the right to charge the outstanding amount from the working credit card at any time or to collect the amount from you by other means.

(The old payment system, retired in December 2017, used prepaid Valossa Credit in order to run a video analysis job. If you had some Valossa Credit before the system change, your Valossa Credit that was remaining at the time of the change has been converted into a corresponding positive payment balance item in the new system.)

Send an HTTP POST to the URL:

Example new_job request body in JSON format:

{
  "api_key": "kkkkkkkkkkkkkkkkkkkk",
  "media": {
    "title": "The Action Movie",
    "description": "Armed with a knife, Jack Jackson faces countless dangers in the jungle.",
    "video": {
      "url": ""
    },
    "transcript": {
      "url": ""
    },
    "customer_media_info": {
      "id": "469011911002"
    },
    "language": "en-US"
  }
}

The video URL and transcript URL can be either http:// or https:// or s3:// based. If the URL is s3:// based, you should first communicate with us to ensure that our system has read access to your S3 bucket in AWS (Amazon Web Services).

The video URL is mandatory. The URL must directly point to a downloadable video file. Our system will download the file from your system.

The transcript URL is optional – but recommended, because an existing SRT transcript is a more reliable source of speech information than audio analysis. The URL must directly point to a downloadable transcript file. Our system will download the file from your system.

The title is optional – but recommended: a human-readable title makes it easy for you to identify the video on the results page of Valossa Portal, and will also be included in the metadata file.

The description is optional. Description is any freetext, in English, that describes the video.

If title and/or description are provided in the call, the text in them will also be analyzed, and the detected concepts will be included in the analysis results (the "external" concepts in the metadata JSON).

The customer media info is optional. If you provide a customer media ID in the "id" field inside the "customer_media_info" field, you may use the customer media ID (a string from your own content management system) to refer to the specific job in the subsequent API calls, replacing the "job_id" parameter with a "customer_media_id" parameter in your calls. Note: Our system will NOT ensure that the customer media ID is unique across all jobs. Duplicate IDs will be accepted in new_job calls. It is the responsibility of your system to use unique customer media IDs, if your application logic requires customer media IDs to be unique. If you use duplicate customer media IDs, then the latest inserted job with the specific customer media ID will be picked when you use the "customer_media_id" parameter in the subsequent API calls.

The language is optional. It specifies the language model to be used for analyzing the speech in the audio track of your video. The allowed values are "en-US" (US English), "fi-FI" (Finnish) and "fr-FR" (French). More languages will be supported in the future. If the language parameter is not given, the default "en-US" will be used so the speech in the video is assumed to be in US English.

If the analysis is technically successful (i.e. if the job reaches the "finished" state), the price of the job will be added to the amount to be charged on your credit card. The price is based on the number of started minutes of video playback time, so each partial minute counts as a full minute. Example: a video of length 39 minutes 20 seconds is billed as 40 minutes. Always keep a working credit card in your payment profile, and if you don't yet have an active subscription, please start a new subscription.
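The rounding rule above can be sketched in a few lines of Python (the helper name is ours, not part of the API):

```python
import math

def billed_minutes(duration_seconds: float) -> int:
    """Round playback time up to the next started minute: a video of
    39 min 20 s (2360 s) is billed as the equivalent of 40 minutes."""
    return math.ceil(duration_seconds / 60)
```

For example, `billed_minutes(39 * 60 + 20)` returns 40.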

Here is an example new_job request body with only the mandatory fields present:

{
  "api_key": "kkkkkkkkkkkkkkkkkkkk",
  "media": {
    "video": {
      "url": ""
    }
  }
}

The response of a successful new_job call always includes the job_id of the created job.

Example response in an HTTP 200 OK message:

{
  "job_id": "6faefb7f-e468-43f6-988c-ddcfb315d958"
}

Jobs are identified by UUIDs, which appear in "job_id" fields in various messages. Your script that calls the API must, of course, save the job_id from the new_job response in order to be able to query for the status and results later.

Example test call with curl on the command line, assuming your request JSON is in a file that you have created:

curl --header "Content-Type:application/json" -X POST -d @your_request.json
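As a sketch of assembling the request body in application code, you could build the JSON like this (the helper function and example URL below are ours; the endpoint URL, omitted above, would be given to your HTTP client separately):

```python
import json

def build_new_job_request(api_key, video_url, title=None, transcript_url=None,
                          customer_media_id=None, language=None):
    """Assemble a new_job request body. Only the API key and the video
    URL are mandatory; all other fields are optional."""
    media = {"video": {"url": video_url}}
    if title is not None:
        media["title"] = title
    if transcript_url is not None:
        media["transcript"] = {"url": transcript_url}
    if customer_media_id is not None:
        media["customer_media_info"] = {"id": customer_media_id}
    if language is not None:
        media["language"] = language
    return {"api_key": api_key, "media": media}

# Serialize this and POST it with the Content-Type: application/json header.
body = json.dumps(build_new_job_request("kkkkkkkkkkkkkkkkkkkk",
                                        "https://example.com/video.mp4"))
```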

If you want an HTTP POST callback and/or an email notification when your video analysis job reaches an end state, you may specify one or both of them in the new_job request. The HTTP POST callback mechanism in our system expects your system to send a 200 OK response for the request (callback) initiated by our system. The request will be retried once by our system, if the first attempt to access your specified callback URL returns a non-200 code from your system or times out. Due to the possibility of network problems and other reasons, you should not rely on the HTTP POST callback being received by your system. In any case, whether the HTTP POST callback event was received or not, your system can always check the status of the job using the job_status function in the REST API. The email notification will be sent to those users that have the permission to view job results for the chosen API key.

Example of a job request with an HTTP POST callback:

{
  "api_key": "kkkkkkkkkkkkkkkkkkkk",
  "callback": {
    "url": ""
  },
  "media": {
    "title": "Lizards dancing",
    "video": {
      "url": ""
    }
  }
}

The HTTP POST callback message is formatted as JSON, and contains the job ID in the "job_id" field and the reached end status of the job in the "status" field. It also contains the customer media ID in the "customer_media_id" field, if you had given a customer media ID for the job. Here is an example of an HTTP POST callback message body:

{
  "job_id": "ad48de9c-982e-411d-93a5-d665d30c2e92",
  "status": "finished"
}

Example of a job request with an email notification specified:

{
  "api_key": "kkkkkkkkkkkkkkkkkkkk",
  "email_notification": {
    "to_group": "users_with_access_to_api_key_results"
  },
  "media": {
    "title": "Lizards dancing",
    "video": {
      "url": ""
    }
  }
}

The generated email notification message is intended for a human recipient. So, unlike the HTTP POST callback, the email notification message is not intended for machine parsing.

Error responses from the API

The following pertains to the HTTP error responses, which are returned immediately for your API call if your request was malformed or missing mandatory fields. In other words, the following does not pertain to the separate HTTP callback messages, which were discussed above. (Callback events are not even generated for the errors that are returned immediately in the HTTP error response of an API call.)

Error responses from the API calls (new_job calls or any other calls) contain an error message, and can be automatically separated from 200 OK responses, because error responses are sent along with an HTTP error code (non-200). Error responses are also formatted as JSON, and they contain an "errors" array, where one or more errors are listed with the corresponding error messages.

Example error response in an HTTP 400 message:

{
  "errors": [
    {
      "message": "Invalid API key"
    }
  ]
}
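A minimal sketch of separating error responses from successes in client code (the function name is ours, not part of the API):

```python
import json

def extract_error_messages(http_status: int, response_body: str):
    """Return the error messages from a non-200 API response;
    a 200 OK response carries no "errors" array."""
    if http_status == 200:
        return []
    payload = json.loads(response_body)
    return [err["message"] for err in payload.get("errors", [])]
```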

Getting status of a single job

The status of a single analysis job is polled using HTTP GET.

Example request:

Example response in an HTTP 200 OK message:

{
  "status": "processing",
  "media_transfer_status": "finished",
  "details": null,
  "poll_again_after_seconds": 600
}

Possible values for the "status" field: "queued", "on_hold", "preparing_analysis", "processing", "finished", and "error". More status values may be introduced in the future.

If the job status is "error", something went wrong during the analysis process. If there is an explanation of the error in the "details" field, please see if the cause of the error is something you can fix for yourself (such as a non-video file in the video URL of the job request). Otherwise, contact us in order to resolve the issue.

If the job status is "on_hold", it means that there is a problem with your billing information and your jobs cannot proceed because of that. For example, a credit card charge may have failed, or you might not currently have an active service subscription. You need to make sure that you have at least one working credit card in your payment profile and that you have selected one of the cards as the default card, on which the payments are charged. If you don't have an active subscription please start a new subscription.

If the job status is "queued" or "processing", you should poll the status again after some time.

If the job status is "finished", you can fetch the job results using the job_results function.

The "details" field may contain some additional details about the status of the job.

The "media_transfer_status" field indicates whether the media to be analyzed has been transferred from your system to our system. Possible values for the "media_transfer_status" field: "queued", "downloading", "finished" and "error". If "media_transfer_status" is "finished", your video (and the transcript if you provided it) have been successfully transferred to our system.

The value in "poll_again_after_seconds" is just a suggestion about when you should poll the job status again (expressed as seconds to wait after the current job_status request).
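The status-handling rules above can be condensed into a small decision helper (a sketch with our own naming, not part of the API):

```python
def next_action(status_response: dict):
    """Map a job_status response to the client's next step:
    ("wait", seconds), ("fetch_results", None) or ("attention", details)."""
    status = status_response["status"]
    if status in ("queued", "preparing_analysis", "processing"):
        # Poll again after the suggested delay (fall back to 60 s).
        return ("wait", status_response.get("poll_again_after_seconds") or 60)
    if status == "finished":
        return ("fetch_results", None)
    # "error", "on_hold" or an unknown future status needs attention.
    return ("attention", status_response.get("details"))
```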

If there was a problem with the job_status query itself, the error will be indicated in an HTTP non-200 response with a JSON body, similar to the error responses of the new_job function.

Getting the results of a job

After a job has been finished, the resulting video metadata can be fetched using HTTP GET.

Example request:

Response data is in the JSON format. For more details, see chapter "Output metadata JSON format".

Save the metadata and use it from your own storage disk or database for easy and quick access. We will not necessarily store the results perpetually.

If there was a problem with the job_results query itself, the error will be indicated in an HTTP non-200 response with a JSON body, similar to the error responses of the new_job function.

Listing all your jobs and their statuses

This convenience function lists all your jobs, optionally with their statuses (add the optional parameter "show_status" with the value "true"), using HTTP GET:

Example request:

Example response in an HTTP 200 OK message:

{
  "jobs": [
    {
      "job_id": "6faefb7f-e468-43f6-988c-ddcfb315d958",
      "status": {
        "status": "finished",
        "media_transfer_status": "finished",
        "details": null,
        "poll_again_after_seconds": null
      }
    },
    {
      "job_id": "36119563-4b3f-44c9-83c6-b30bf69c1d2e",
      "customer_media_id": "M4070117",
      "status": {
        "status": "processing",
        "media_transfer_status": "finished",
        "details": null,
        "poll_again_after_seconds": 600
      }
    }
  ]
}

If you had given a customer media ID when creating the job, the "customer_media_id" field exists and contains the customer media ID value.

Showing video titles and other media information in the job listing is often useful. This can be done by using the optional GET parameter "show_media_info" with the value "true". Example request:

Example response in an HTTP 200 OK message:

{
  "jobs": [
    {
      "job_id": "36119563-4b3f-44c9-83c6-b30bf69c1d2e",
      "customer_media_id": "M4070117",
      "status": {
        "status": "finished",
        "media_transfer_status": "finished",
        "details": null,
        "poll_again_after_seconds": null
      },
      "media_info": {
        "title": "Birds clip #22",
        "description": "Birds having a bath",
        "video": {
          "url": ""
        }
      }
    },
    {
      "job_id": "6faefb7f-e468-43f6-988c-ddcfb315d958",
      "status": {
        "status": "finished",
        "media_transfer_status": "finished",
        "details": null,
        "poll_again_after_seconds": null
      },
      "media_info": {
        "video": {
          "url": ""
        }
      }
    }
  ]
}

By adding the optional GET parameter "n_jobs" to the request (example: n_jobs=500), you can control how many of your jobs will be listed if your job list is long. The default is 200. The maximum possible value for "n_jobs" is 10000.

If there was a problem with the list_jobs query itself, the error will be indicated in an HTTP non-200 response with a JSON body, similar to the error responses of the new_job function.

Cancel a job

Cancel a job by sending an HTTP POST to the URL:

Example cancel_job request body:

{
  "api_key": "kkkkkkkkkkkkkkkkkkkk",
  "job_id": "be305b1e-3671-45b1-af88-1f052db3d1bb"
}

Example response in an HTTP 200 OK message:

{
  "job_status": "canceled"
}

The job must be in a cancellable state for this function to succeed. For example, a finished job is not cancellable.

If there was a problem with the cancel_job query itself, the error will be indicated in an HTTP non-200 response with a JSON body, similar to the error responses of the new_job function.

If you are interested in deleting job media from the service, contact us.

Valossa Search API for your analyzed videos (coming soon)

Coming soon! Before the release of the REST API for Search, you can use the Search functionality in the Portal with the Video Insight Tools plan.

Visualization of your results in a Valossa Report

Valossa Portal provides an easy-to-use visualization tool, called the Valossa Report, for you to get a quick visual overview of the most prominent detections, and also a more detailed heatmap for browsing the results. On the home page, each displayed information box that is related to a successfully analyzed video contains a link to the Valossa Report of the video analysis results. To see examples of Valossa Report, click "Demos" on the home page (you must be logged in to Valossa Portal in order to do this).

Below you'll find example screenshots of Valossa Report.

(Actually the Valossa Report is a tool for viewing your Valossa Core metadata in an easy way for humans. When you're ready to integrate Valossa Core metadata to your application, please see the instructions for machine-reading the Valossa Core metadata.)


The Valossa Report's Overview gives you a quick visual overview of the analyzed video content.

Charade Valossa Report

The tags are an overview of the detected concepts. By clicking the arrows, you can browse through the detections in the video. You can also search for a concept within the video by clicking the magnifying glass symbol.

Charade Valossa Report


The Valossa Report's Heatmap displays the timeline of a video, and detections of concepts are placed on the timeline. Each detection is shown on its own row (its own timeline). Detections are grouped by their detection type such as human.face, visual.context, audio.context, etc. Please note that different colors are given to different detection types for distinguishing them visually.

Within a detection type, detections are grouped by prominence. For example, the most prominent faces are shown first.

Charade Valossa Report

With the Valossa Report controls, you can change the resolution of the timeline (how many distinct timeslots are shown) and the number of detections shown. You can also adjust the confidence threshold for several detection types. The detections below the chosen threshold are hidden.

The depth of color in the colored spots on the timeline for a detection shows how many detections of that concept are in that timeslot and/or how confident the detections are. Click on a colored spot, and the video player on the Valossa Report page will playback the video from the corresponding timecode. Thus, you are able to see the main concepts of the video arranged by time and prominence, and verify their correctness. With the main timeline and the seek bar under the video player, you can also move to any time-position in the video.

Charade Valossa Report

Tag & Train

The Tag & Train naming tool can be used to edit the names and genders of the detected faces. Changes are saved to the metadata of the video analysis job and indexed into the search automatically. Training functionality that allows the AI to learn from the changes will become available during the Video Insight Tools Preview program.

Click the "Tag & Train" button above the face detections or the pencil next to a person name to open the tool.

Charade Valossa Report

General notes about metadata

Valossa Core metadata is provided in downloadable JSON files, which are available via the REST API (function job_results) or via the results page in Valossa Portal that shows the results and other info about your most recent jobs.

The sizes of the JSON files vary depending on the size of the videos and the number of detections, ranging from a few kilobytes to several megabytes. You should save the metadata JSON in your local database or file system. The metadata will not necessarily be stored perpetually in our system, download count limits may be imposed in the future, and it is also faster for your application to access the metadata from your local storage space.

The version number of the metadata format will be updated in the future, when the format changes (version changelog). If only the minor version number (z in x.y.z) changes, the changes are purely additions to the structure i.e. they can't break the parsing code.
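A parser can exploit this guarantee by checking only the x.y part of the version string; a sketch, assuming your code was written against format 1.3:

```python
def is_compatible(metadata_version: str, supported=(1, 3)):
    """Patch-level (z) changes are purely additive, so a parser written
    for format x.y can safely read any x.y.z."""
    major, minor, _patch = (int(part) for part in metadata_version.split("."))
    return (major, minor) == supported
```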

Output metadata JSON format

Basics: How (and why) to read Valossa Core metadata in your application code

Valossa Core metadata has been designed to address several needs. It answers questions such as:

  1. What does the video contain?
  2. Where — or when — are all the instances of X in the video?
  3. What is in the video at any specific time position?
  4. What is the video predominantly about?

Please see the images below for a quick explanation of how to read these things from the metadata.

Valossa Core AI addresses needs 1 and 4 by detecting a variety of things and then ranking the most dominant detections from the video, so that the Valossa Core metadata can be used for answering questions such as "What are the visuals about?", "Who are the faces appearing in the video?", "What sounds are in the audio track?", "What are the spoken words about?", "What is the entire video about?", etc. The detections are grouped conveniently by detection type; see more below. Needs 2 and 3 are addressed by Valossa Core AI with a smart time-coding logic that makes it easy to read either all the temporal occurrences of a specific detection or all the detections at a specific time position, whichever way is the most useful for your application.


A more detailed explanation of the fields "detections" and "by_detection_type" can be found in the subchapter Detections.

Detections are grouped by Valossa Core AI in a way that makes it easy for your application code to iterate over all instances (occurrences) of, for example, cats:


by_second field

By reading the "by_second" field, your application code can easily list everything at a given time position. More details about the "by_second" field are provided in the subchapter Detections.

Using IAB categories, the metadata tells the topics of the video to your application code:

IAB categories

The main JSON structure

Valossa Core metadata about your videos is hierarchical and straightforward to parse for your application code. High-level structure of the current Valossa Core video metadata JSON format, not showing detailed subfields:

{
  "version_info": { "metadata_format": "...", "backend": "..." },
  "job_info": { "job_id": "...", "request": {...} },
  "media_info": { ... },
  "detections": { ... },
  "detection_groupings": {
    "by_detection_type": { ... },
    "by_second": [ ... ]
  },
  "segmentations": { ... }
}

You will best understand the details of the metadata structure by viewing an actual metadata JSON file generated from one of your own videos. You will probably want to view your results first with the easy Valossa Report visualization tool.

Note: In order to save storage space, JSON provided by the API does not contain line-breaks or indentations. If you need to view JSON data manually during your software development phase, you can use helper tools in order to get a more human-readable (pretty-printed) version of the JSON. For example, the JSONView plugin for your browser may be of help, if you download JSON metadata from the Portal: the browser plugin will display a pretty-printed, easily navigable version of the JSON. In command-line usage, you can use the "jq" tool or even Python: cat filename.json | python -m json.tool > prettyprinted.json

In the following subchapters, the JSON metadata format is described in more detail.


Detections

All concept detections from the video are listed in the field "detections". This is an associative array, where the key is a detection ID and the value is the corresponding detection.

The detection IDs are used in "detection_groupings" to refer to the specific detection, so the detailed information about each detection resides in one place in the JSON but may be referenced from multiple places using the ID. Inside the field "detection_groupings", two practical groupings of detections are given for you:

  • The subfield "by_detection_type" has detection type identifiers as the keys, and each value is an array of detection IDs; the array is sorted by relevance, most relevant detections first! Using "by_detection_type", you can easily list, for example, all the detected faces, or all the detected audio-based keywords. Want to find out whether there's a cat somewhere in your video? Just loop over the visual.context detections and match detections against the Google Knowledge Graph concept identifier of "cat" ("/m/01yrx"), or against the human-readable concept label if you're adventurous; see details below. (Note: Because some identifiers were inherited from the Freebase ontology to Google Knowledge Graph, some of the identifiers are Freebase-compatible, such as in the example case of "cat", but not all identifiers are from Freebase; so you should use Google Knowledge Graph as the reference ontology.)
  • The subfield "by_second" contains an array, where each item corresponds to one second of the video. Using this array you can answer questions such as "What is detected at 00:45 in the video?". Under each second, there is an array of objects which contain at least the field "d" (detection ID). Using the detection ID as the index, you will find the detection from the "detections" list. If applicable, there is also the field "c" (confidence), currently available only for visual.context and audio.context detections. If the field "o" exists, it contains an array of occurrence identifiers that correspond to this detection in this second.
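Reading the "by_second" grouping then reduces to resolving the "d" references against the "detections" map; a minimal sketch:

```python
def detections_at_second(metadata: dict, second: int):
    """List (label, confidence) pairs for everything detected at the
    given second; "c" exists only for visual.context and audio.context."""
    results = []
    for item in metadata["detection_groupings"]["by_second"][second]:
        detection = metadata["detections"][item["d"]]
        results.append((detection["label"], item.get("c")))
    return results
```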

The following image helps understand the usage of detection IDs as references within the JSON data:

Detection IDs

How to get an overview of the most prominent detections? That's easy: in "by_detection_type", start reading detections from the beginning of the lists under each detection type. Because the detections are sorted with the most relevant ones first, reading e.g. the first 20 detections from "human.face" gives you an overview of the most prominent faces in the video. For an easy and quick overview of detections, you may view the Valossa Report (visualization of detections) of the video in Valossa Portal.
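In code, reading the most prominent detections of a type is a simple slice, since the ID lists are pre-sorted by relevance (a sketch, with our own function name):

```python
def most_prominent(metadata: dict, detection_type: str, n: int = 20):
    """Return the labels of the n most relevant detections of a type;
    the ID lists in "by_detection_type" are already sorted by relevance."""
    by_type = metadata["detection_groupings"]["by_detection_type"]
    ids = by_type.get(detection_type, [])
    return [metadata["detections"][det_id]["label"] for det_id in ids[:n]]
```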

However, please note that the "audio.speech" detections are not ordered by prominence, as they are just raw snippets of speech detected from a specific time-range from within the video's audio track.
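Because "audio.speech" snippets carry their own occurrence times, you can put them into temporal order yourself. A minimal sketch, using a synthetic metadata fragment (not real API output), sorts the snippets by the start time of their first occurrence to reconstruct a rough transcript:

```python
# Synthetic example data: two speech snippets listed out of temporal order
metadata = {
    "detections": {
        "3": {"t": "audio.speech", "label": "world",
              "occs": [{"ss": 5.0, "se": 6.0, "id": "2"}]},
        "9": {"t": "audio.speech", "label": "hello",
              "occs": [{"ss": 1.0, "se": 2.0, "id": "1"}]},
    },
    "detection_groupings": {"by_detection_type": {"audio.speech": ["9", "3"]}},
}

speech_ids = metadata["detection_groupings"]["by_detection_type"].get("audio.speech", [])
snippets = [metadata["detections"][det_id] for det_id in speech_ids]
# Order the snippets by the start second of their first occurrence
snippets.sort(key=lambda det: det["occs"][0]["ss"])
transcript = " ".join(det["label"] for det in snippets)
print(transcript)  # prints hello world
```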

Every detection in the JSON has, at minimum, the fields "t" (detection type identifier) and "label". The "label" is just the default human-readable label of the detected concept, and for many detection types, more specific information is available in additional data fields. The list of currently supported detection type identifiers is given in the "Detection types" section below.

Fields that exist or don't exist in a detection, depending on the detection type and situation, include "occs", "a", "ext_refs", "categ" and "cid".

Detection types

Currently, the following detection types are supported.


The identifiers are mostly self-explanatory. Please note that "visual.context" offers a broad range of visual detections such as objects; "audio.context" offers a broad range of audio-based detections; "topic.iab.*" detections are IAB categories for the entire video; "external.keyword.*" refers to keywords found in the video description or title; and "human.face_group" detections are groups of faces whose temporal co-occurrence is high enough that the people probably have meaningful interaction with each other.


The field "occs" contains the occurrence times of the detection. There is a start time and an end time for each occurrence. For example, a visual object "umbrella" might be detected 2 times: first occurrence from 0.3 seconds to 3.6 seconds, and another occurrence from 64.4 seconds to 68.2 seconds — so there would be 2 items in the "occs" array. Time values are given as seconds "ss" (seconds start) and "se" (seconds end), relative to the beginning of the video.
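From the "occs" array you can derive, for example, the total detected duration of a concept. A minimal sketch, using the "umbrella" example above as synthetic data:

```python
# Synthetic "umbrella" detection matching the example in the text
detection = {
    "t": "visual.context",
    "label": "umbrella",
    "occs": [
        {"ss": 0.3, "se": 3.6, "id": "1"},
        {"ss": 64.4, "se": 68.2, "id": "2"},
    ],
}

# Sum the length of each occurrence; .get() guards against
# detection types that have no "occs" at all
total_seconds = sum(occ["se"] - occ["ss"] for occ in detection.get("occs", []))
print(round(total_seconds, 1))  # prints 7.1
```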

Detections that are not time-bound (e.g. topic.iab.* and external.keyword.*) cannot contain "occs".

If applicable to the detection type, occurrences have a maximum confidence ("c_max") detected during the occurrence period. (Because confidence varies at different moments during the occurrence, it makes sense to provide just the maximum value here. To find out the confidence during a particular moment, check out the "c" field of each second in the "by_second" data.) Currently, only visual.context and audio.context detections have "c_max".

Please note that if you want to answer the question "What is in the video at time xx:xx?", then you should see the "by_second" array in the "detection_groupings". Occurrences, on the other hand, are good when you want to answer the question "At what time-sections is Y detected?"

Other optional data fields of a detection

As you remember, "t" and "label" are always given for a detection. The field "occs" might not be there. Besides "occs", a detection can have other optional fields: "a", "ext_refs", "categ" and "cid".

If present, the object-field "a" contains attributes of the detection. For example, "human.face" detections may have the attributes "gender" (the detected gender), "s_visible" (the total screen-time of the face) and "similar_to" (possible visual similarity matches to celebrities in the face gallery). The "gender" structure also contains the confidence field "c" (0.0 to 1.0).

If present, the array-field "ext_refs" contains references to the detected concept in different ontologies. Most visual.context detections have "ext_refs", expressing the concept identity in the Google Knowledge Graph ontology. Inside "ext_refs", the ontology identifier for Google Knowledge Graph is "gkg" (see examples). As noted above, some of the concept IDs originate from the Freebase ontology, and Freebase concept IDs are contained as a subset of the Google Knowledge Graph ontology.

If present, the string-field "cid" contains the unique identifier of the concept in the Valossa Concept Ontology. All visual.context detections and audio.context detections have "cid". However, for example audio.speech detections don't have "cid".

If present, the object-field "categ" contains the key "tags", under which there is an array of one or more category identifier tags (string-based identifiers such as "flora" or "fauna") for the concept of the detection. For example, a "dog" detection has the category tag array ["fauna"]. As another example, an "amusement park" detection has the category tag array ["place_scene", "nonlive_manmade"]. Many visual.context detections and some audio.context detections have "categ". Note! This describes the categories of a specific detection (a single concept), which is a completely different thing from the categories of the entire video (such as IAB categories).
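Category tags make it easy to filter detections by theme. A minimal sketch (the metadata fragment is synthetic, and labels_with_tag is an illustrative helper, not part of the API):

```python
# Synthetic detections using the category tag examples from the text
metadata = {
    "detections": {
        "1": {"t": "visual.context", "label": "dog",
              "categ": {"tags": ["fauna"]}},
        "2": {"t": "visual.context", "label": "amusement park",
              "categ": {"tags": ["place_scene", "nonlive_manmade"]}},
    }
}

def labels_with_tag(metadata, tag):
    """Collect labels of all detections carrying the given category tag."""
    return [det["label"] for det in metadata["detections"].values()
            if tag in det.get("categ", {}).get("tags", [])]

print(labels_with_tag(metadata, "fauna"))  # prints ['dog']
```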

Tips for reading some detection types

For "audio.speech" (speech-to-text) detections, the detected sentences/words are provided as a string in the "label" field of the detection.

Information related to "human.face" detections: Use the detection ID as the unique identifier of a face within each video. If and only if a face is similar to one or more faces in our face gallery, the "a" field of the specific face object contains a "similar_to" array of the closely matching faces (for example, a specific celebrity). Each item in the "similar_to" array has a "name" field with the name of the visually similar person and a "c" field with the confidence (0.0 to 1.0) that the face is actually the named person. The matches in "similar_to" are sorted with the best match first. Please note that a face occurrence doesn't directly contain "c"; confidence for faces is only available in the "similar_to" items.
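As a sketch of reading "similar_to", the following picks the top celebrity match per face above a confidence threshold. The metadata fragment is synthetic and best_celebrity_matches is a hypothetical helper name:

```python
# Synthetic metadata: one face with a gallery match, one without
metadata = {
    "detections": {
        "64": {"t": "human.face", "label": "face",
               "a": {"gender": {"value": "female", "c": 0.929},
                     "similar_to": [{"name": "Molly Shannon", "c": 0.92775}]}},
        "65": {"t": "human.face", "label": "face",
               "a": {"gender": {"value": "male", "c": 0.88}}},  # no similar_to
    },
    "detection_groupings": {"by_detection_type": {"human.face": ["64", "65"]}},
}

def best_celebrity_matches(metadata, min_confidence=0.9):
    """Map face detection ID to its top similar_to name, if confident enough."""
    matches = {}
    by_type = metadata["detection_groupings"]["by_detection_type"]
    for det_id in by_type.get("human.face", []):
        attrs = metadata["detections"][det_id].get("a", {})
        similar = attrs.get("similar_to", [])
        # Matches are sorted best first, so item 0 is the top candidate
        if similar and similar[0]["c"] >= min_confidence:
            matches[det_id] = similar[0]["name"]
    return matches

print(best_celebrity_matches(metadata))  # prints {'64': 'Molly Shannon'}
```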

The detection type "explicit_content.nudity" has two possible detections: "bare skin" and "nsfw".

Detection data examples

An example of what is found in "detections", a visual.context detection:

"86": {
  "t": "visual.context",
  "label": "hair",
  "cid": "lC4vVLdd5huQ",
  "ext_refs": {
    "gkg": {
      "id": "/m/03q69"
  "categ": {
    "tags": [
  "occs": [
      "ss": 60.227,
      "se": 66.191,
      "c_max": 0.80443,
      "id": "267"
      "ss": 163.038,
      "se": 166.166,
      "c_max": 0.72411,
      "id": "268"

Another example from "detections", a human.face detection:

"64": {
  "t": "human.face",
  "label": "face",
  "a": {
    "gender": {
      "c": 0.929,
      "value": "female"
    "s_visible": 4.4,
    "similar_to": [
        "c": 0.92775,
        "name": "Molly Shannon"
  "occs": [
      "ss": 28.333,
      "se": 33.567,
      "id": "123"

An example of an audio.context detection:

"12": {
  "t": "audio.context",
  "label": "exciting music",
  "cid": "o7WLKO1GuL5r"
  "ext_refs": {
    "gkg": {
      "id": "/t/dd00035"
  "occs": [
      "ss": 15,
      "se": 49
      "c_max": 0.979,
      "id": "8",

An example of an IAB category detection:

"173": {
  "t": "topic.iab.transcript",
  "label": "Personal Finance",
  "ext_refs": {
    "iab": {
      "labels_hierarchy": [
        "Personal Finance"
      "id": "IAB13"

An example of keyword detection:

"132": {
  "t": "",
  "label": "Chillsbury Hills",
  "occs": [
      "ss": 109.075,
      "se": 110.975,
      "id": "460"

Please note that transcript keyword occurrence timestamps are based on SRT timestamps. In the future, if a non-timecoded transcript is supported, transcript keywords might not have occurrences/timecoding.


In "segmentations", the video is divided into time-based segments using different segmentation rules.

Currently we support automatically detected shot boundaries, hence "segmentations" contains "detected_shots". The "detected_shots" field in segmentations provides shot boundaries with seconds-based start and end timepoints ("ss", "se") and with start and end frame numbers ("fs", "fe"). Note: frame-numbers are 0-based, i.e. the first frame in the video has the number 0.
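A small sketch of using "detected_shots", for example to find the shot that contains a given timepoint. The segmentation fragment below is synthetic, and shot_at_time is an illustrative helper, not part of the API:

```python
# Synthetic segmentation data with two detected shots
metadata = {
    "segmentations": {
        "detected_shots": [
            {"ss": 0.0, "se": 4.2, "fs": 0, "fe": 104},
            {"ss": 4.2, "se": 9.5, "fs": 105, "fe": 237},
        ]
    }
}

def shot_at_time(metadata, seconds):
    """Return the detected shot whose time range contains the given second."""
    for shot in metadata["segmentations"]["detected_shots"]:
        if shot["ss"] <= seconds < shot["se"]:
            return shot
    return None  # timepoint is outside all detected shots

shot = shot_at_time(metadata, 5.0)
print(shot["fs"])  # prints 105 (frame numbers are 0-based)
```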

Code examples for reading metadata

Example code snippet (in Python) that illustrates how to access the data fields in Valossa Core metadata JSON:

import json

with open("your_core_metadata.json", "r") as jsonfile:
        metadata = json.load(jsonfile)

# Loop over all detections so that they are grouped by the type
for detection_type, detections_of_this_type in metadata["detection_groupings"]["by_detection_type"].items():
    print("----------")
    print("Detections of the type: " + detection_type + ", most relevant detections first:")
    for det_id in detections_of_this_type:
        print("Detection ID: " + det_id)
        detection = metadata["detections"][det_id]
        print("Label: " + detection["label"])
        print("Detection, full info:")
        print(detection)

        # Example of accessing attributes (they are detection type specific)
        if detection_type == "human.face":
            attrs = detection["a"]
            print("Gender is " + attrs["gender"]["value"] + " with confidence " + str(attrs["gender"]["c"]))
            if "similar_to" in attrs:
                for similar in attrs["similar_to"]:
                    print("Face similar to person " + similar["name"] + " with confidence " + str(similar["c"]))

        # More examples of the properties of detections:

        if detection_type in ("visual.context", "audio.context"):
            if "ext_refs" in detection and "gkg" in detection["ext_refs"]:
                print("Concept ID in GKG ontology: " + detection["ext_refs"]["gkg"]["id"])

        if "occs" in detection:
            for occ in detection["occs"]:
                print("Occurrence starts at " + str(occ["ss"]) + "s from beginning of video, and ends at " + str(occ["se"]) + "s")
                if "c_max" in occ:
                    print("Maximum confidence of detection during this occurrence is " + str(occ["c_max"]))
                    # If you need the confidence for a particular time at second-level accuracy, see the by_second grouping of detections


# Example of listing only audio (speech) based word/phrase detections:
for detection_type, detections_of_this_type in metadata["detection_groupings"]["by_detection_type"].items():
    if detection_type.startswith("audio.keyword."):
        for det_id in detections_of_this_type:
            detection = metadata["detections"][det_id]
            print("Label: " + detection["label"])  # etc... You get the idea :)

# Example of listing only detections of a specific detection type:
if "human.face" in metadata["detection_groupings"]["by_detection_type"]:
    for det_id in metadata["detection_groupings"]["by_detection_type"]["human.face"]:
        detection = metadata["detections"][det_id]  # etc...

# Example of listing IAB categories detected from different modalities (visual/audio/transcript) of the video
for detection_type, detections_of_this_type in metadata["detection_groupings"]["by_detection_type"].items():
    if detection_type.startswith("topic.iab."):
        for det_id in detections_of_this_type:
            detection = metadata["detections"][det_id]  # etc...
            print("IAB label, simple: " + detection["label"])
            print("IAB ID: " + detection["ext_refs"]["iab"]["id"])
            print("IAB hierarchical label structure:")
            print(detection["ext_refs"]["iab"]["labels_hierarchy"])

# Time-based access: Loop over time (each second of the video) and access detections of each second
for sec_index, secdata in enumerate(metadata["detection_groupings"]["by_second"]):
    print("----------")
    print("Detected at second " + str(sec_index) + ":")
    for detdata in secdata:
        det_id = detdata["d"]
        if "c" in detdata:
            print("At this second, detection has confidence " + str(detdata["c"]))
        if "o" in detdata:
            # If for some reason you need to know the corresponding occurrence
            # (time-period that contains this second-based detection)
            print("The detection at this second is part of one or more occurrences. The occurrence IDs, suitable for searching within the 'occs' list of the 'detection' object, are:")
            for occ_id in detdata["o"]:
                print(occ_id)
        print("Detection ID: " + det_id)
        detection = metadata["detections"][det_id]
        print("Label: " + detection["label"])
        print("Detection of the type " + detection["t"] + ", full info:")
        # Also here you can access attributes, cid, occurrences etc. through the "detection" object,
        # just like when you listed detections by their type. In other words, when you know the ID
        # of the detection, it's easy to read the information about the detection by using the ID.
        print(detection)

Metadata JSON format version changelog

1.3.3: categ added to relevant visual.context and audio.context detections

1.3.2: similar_to in human.face detections supports role names

1.3.1: added metadata type field (supports distinguishing between different types of Valossa metadata in the future)

1.3.0: improved speech-to-text format

1.2.1: speech-to-text

1.2.0: field naming improved

1.1.0: more compact format

1.0.0: large changes, completely deprecated old version 0.6.1.

Persons in the celebrity face gallery

Aaron Staton
Aarti Mann
Adam Baldwin
Adam Goldberg
Adam Sandler
Adriana Lima
Aidan Gillen
Aidy Bryant
Alanna Masterson
Alan Tudyk
Albert Finney
Alec Baldwin
Alfie Allen
Al Gore
Alicia Keys
Alison Brie
Al Pacino
Amanda Schull
Amy Poehler
Anabelle Acosta
Andrew J. West
Andrew Lincoln
Andrew Scott
Andy Samberg
Angela Merkel
Angela Sarafyan
Angelina Jolie
Anna Faris
Anna Paquin
Anthony Bourdain
Anthony Hopkins
Antonio Banderas
Ariel Winter
Arnold Schwarzenegger
Art Parkinson
Aubrey Anderson-Emmons
Audrey Hepburn
Aung San Suu Kyi
Aunjanue Ellis
Austin Nichols
Ban Ki-moon
Barack Obama
Barry Pepper
Bashar Assad
Beck Bennett
Ben Affleck
Ben Barnes
Benedict Cumberbatch
Ben Feldman
Benjamin Netanyahu
Ben Kingsley
Ben Mendelson
Ben Stiller
Bill Gates
Bill Hader
Bill Murray
Billy Boyd
Billy Crystal
Bobby Moynihan
Bob Newhart
Brad Pitt
Britney Spears
Brooks Wheelan
Bruce Springsteen
Bruce Willis
Bryan Batt
Caleb McLaughlin
Cameron Diaz
Cara Buono
Carice van Houten
Carrie-Anne Moss
Carrie Fisher
Cary Grant
Cate Blanchett
Catherine Zeta-Jones
Cecily Strong
Chad Coleman
Chandler Riggs
Charles Chaplin
Charles Dance
Charlie Heaton
Charlie Sheen
Charlie Watts
Chevy Chase
Christian Bale
Christian Serratos
Christina Hendricks
Christine Woods
Christopher Lee
Christopher Stanley
Christopher Walken
Christoph Waltz
Clifton Collins Jr.
Clint Eastwood
Conan O'Brien
Conleth Hill
Constance Zimmer
Corey Stoll
Cristiano Ronaldo
Curtiss Cook
Daisy Ridley
Dalai Lama
Dana Carvey
Danai Gurira
Dan Aykroyd
Daniel Portman
Darrell Hammond
David Beckham
David Cameron
David Copperfield
David Harbour
David Letterman
David Morrissey
David Spade
David Strathairn
Dean-Charles Chapman
Denzel Washington
Derek Cecil
Desmond Tutu
Diana Rigg
Diego Luna
Dominic Monaghan
Donald Trump
Donnie Yen
Don Pardo
Dr. Dre
Drew Barrymore
Dr. Phil McGraw
Dustin Hoffman
Dwayne Johnson
Eddie Murphy
Ed Harris
Ed O'Neill
Édgar Ramírez
Edward Burns
Edward Snowden
Elijah Wood
Elisabeth Moss
Eliza Coupe
Elizabeth Marvel
Elizabeth Norment
Ellen Burstyn
Elon Musk
Elsa Pataky
Elton John
Emilia Clarke
Emily Kinney
Emma Bell
Emma Thompson
Emma Watson
Eric Stonestreet
Eugene Simon
Evan Rachel Wood
Ewan McGregor
Felicity Jones
Fernando Alonso
Fidel Castro
Finn Jones
Finn Wolfhard
Forest Whitaker
Francois Hollande
Fred Armisen
Gabriel Macht
Gal Gadot
Gary Oldman
Gaten Matarazzo
Gemma Whelan
George Bush Jr.
George Clooney
Gerald McRaney
Gilbert Gottfried
Gilda Radner
Gina Torres
Gordon Ramsay
Graham Rogers
Gwendoline Christie
Halle Berry
Hannah Murray
Harrison Ford
Heath Ledger
Heidi Klum
Hillary Clinton
Horatio Sanz
Howard Stern
Hugh Grant
Hugh Jackman
Hugh Laurie
Hugo Weaving
Iain Glen
Ian Holm
Ian McKellen
Ian Whyte
Indira Varma
Ingrid Bolsø Berdal
IronE Singleton
Isaac Hempstead Wright
Ivanka Trump
Iwan Rheon
J. Mallory McCree
Jack Gleeson
Jackie Chan
Jack Nicholson
Jacob Anderson
John Wayne
Jake McLaughlin
James Corden
James Marsden
James Mattis
Jamie Foxx
Jamie Lee Curtis
Jan Hooks
January Jones
Jared Harris
Jared Kushner
Jason Momoa
Jason Statham
Jason Sudeikis
Jay Pharoah
Jay R. Ferguson
Jeff Bezos
Jeffrey Dean Morgan
Jeffrey DeMunn
Jeffrey Wright
Jennifer Aniston
Jennifer Lawrence
Jennifer Lopez
Jennifer Love Hewitt
Jeremy Bobb
Jeremy Holm
Jeremy Maguire
Jerome Flynn
Jerry Seinfeld
Jerry Springer
Jesse Tyler Ferguson
Jessica Alba
Jessica Lange
Jessica Paré
Jewel Staite
Jiang Wen
Jim Carrey
Jimmi Simpson
Jimmy Carter
Jimmy Fallon
Jimmy Kimmel
Jim Parsons
Joan Allen
Jodie Foster
Joe Dempsie
Joe Keery
Johanna Braddy
John Belushi
John Boyega
John Bradley-West
John Cena
John Goodman
Johnny Depp
Johnny Galecki
John Ross Bowie
John Slattery
John Travolta
Jonathan Pryce
Jon Bernthal
Jon Hamm
Jon Stewart
Jordana Brewster
Joseph Gordon-Levitt
Josh Hopkins
Josh McDermitt
Jude Law
Juha Sipilä
Julia Louis-Dreyfus
Julia Stiles
Julian Assange
Julian Glover
Julianne Moore
Julia Roberts
Julie Bowen
Justin Bieber
Justin Doescher
Justin Timberlake
Kaley Cuoco
Kanye West
Kate Hudson
Kate Mara
Kate McKinnon
Kate Micucci
Kate Winslet
Katy Perry
Keanu Reeves
Keith Richards
Kenan Thompson
Ken Arnold
Kevin Nealon
Kevin Rahm
Kevin Spacey
Kevin Sussman
Khloé Kardashian
Kiefer Sutherland
Kiernan Shipka
Kim Jong Un
Kim Kardashian
King Salman bin Abdulaziz al Saud
Kit Harington
Kourtney Kardashian
Kristen Connolly
Kristen Wiig
Kristian Nairn
Kristofer Hivju
Kunal Nayyar
Kurt Russell
Kyle Mooney
Kylie Minogue
Lady Gaga
Larry Ellison
Larry King
Larry Page
Laura Spencer
Lauren Cohan
Laurie Holden
Lena Headey
Lennie James
Leonardo DiCaprio
Leonardo Nam
Liam Cunningham
Linus Torvalds
Lionel Messi
Liv Tyler
Lorne Michaels
Louise Brealey
Lucas Black
Luke Hemsworth
Madison Lintz
Mads Mikkelsen
Maggie Siff
Mahershala Ali
Maisie Williams
Mao Zedong
Marcia Cross
Mariah Carey
Marilyn Monroe
Mark Falvo
Mark Gatiss
Mark Hamill
Mark Pellegrino
Mark Wahlberg
Mark Zuckerberg
Marlon Brando
Martha Stewart
Martin Freeman
Martin Sheen
Martti Ahtisaari
Mason Vale Cotton
Matt Damon
Matthew McConaughey
Maureen O'Hara
Max von Sydow
Mayim Bialik
Meghan Markle
Melania Trump
Mel Gibson
Melissa McBride
Melissa Rauch
Meryl Streep
Michael Caine
Michael Cudlitz
Michael Douglas
Michael Gladis
Michael Jackson
Michael Jordan
Michael Kelly
Michael McElhatton
Michael Phelps
Michael Rooker
Michael Warner
Michel Gill
Michelle Fairley
Michelle Obama
Michelle Rodriguez
Michiel Huisman
Mick Jagger
Mike Myers
Mike O'Brien
Mike Pence
Miley Cyrus
Millie Bobby Brown
Molly Parker
Molly Shannon
Morena Baccarin
Morgan Freeman
Muhammad Ali
Naomi Watts
Nasim Pedrad
Natalia Dyer
Natalia Tena
Natalie Dormer
Natalie Portman
Nathalie Emmanuel
Nathan Darrow
Nathan Fillion
Nathan Lane
Nelson Mandela
Neve Campbell
Nicholas Cage
Nicki Minaj
Nicole Kidman
Nikolaj Coster-Waldau
Noah Schnapp
Noël Wells
Nolan Gould
Norman Reedus
Oprah Winfrey
Orlando Bloom
Paddy Considine
Patrick J. Adams
Patrick Wayne
Paul Giamatti
Paul McCartney
Paul Sparks
Paul Walker
Penélope Cruz
Pete Davidson
Peter Dinklage
Phil Hartman
Pope Francis
Prince Charles
Prince William
Priyanka Chopra
Ptolemy Slocum
Queen Elizabeth II
Rachel Brosnahan
Reese Witherspoon
Reg E. Cathey
Reid Ewing
Reince Priebus
Richard Madden
Rich Sommer
Rick Cosnett
Rick Hoffman
Rico Rodriguez
Riz Ahmed
Robert De Niro
Robert Downey, Jr.
Robert Morse
Robin Williams
Robin Wright
Rodrigo Santoro
Ronda Rousey
Ron Glass
Ronnie Wood
Rory McCann
Rose Leslie
Ross Marquand
Rupert Graves
Russell Crowe
Russell James
Ryan Gosling
Ryan Reynolds
Saara Aalto
Sacha Baron Cohen
Sakina Jaffrey
Samuel L. Jackson
Sandra Bullock
Sandrine Holt
Sara Gilbert
Sarah Brown
Sarah Hyland
Sarah Rafferty
Sarah Wayne Callies
Sasheer Zamata
Sauli Niinistö
Scarlett Johansson
Scott Glenn
Scott Wilson
Sean Astin
Sean Bean
Sean Connery
Sean Maher
Sean Penn
Sean Spicer
Sebastian Arcelus
Sebastian Vettel
Serena Williams
Seth Gilliam
Seth Meyers
Shannon Woodward
Sharon Osbourne
Shinzo Abe
Sibel Kekilli
Sidse Babett Knudsen
Sigourney Weaver
Silvia Reis
Silvio Berlusconi
Simon Cowell
Simon Helberg
Simon Pegg
Sir John
Sofía Vergara
Sonequa Martin-Green
Sophie Turner
Spencer Garrett
Stefanie Powers
Stephen Colbert
Stephen Dillane
Stephen Hawking
Steve Bannon
Steve Jobs
Steven Yeun
Summer Glau
Susan Sarandon
Sylvester Stallone
Talulah Riley
Taran Killam
Tate Ellington
Taylor Swift
Ted Danson
Teemu Selänne
Tessa Thompson
Thandie Newton
Theresa May
Tiger Woods
Tina Fey
Tom Cruise
Tom Hanks
Tommy Lee Jones
Tom Payne
Tom Sizemore
Tom Wlaschiha
Tracy Morgan
Ty Burrell
Tyler James Williams
Tyra Banks
Tyrese Gibson
Uma Thurman
Una Stubbs
Usain Bolt
Vanessa Bayer
Viggo Mortensen
Vincent Kartheiser
Vin Diesel
Vladimir Putin
Warwick Davis
Will Ferrell
Will Forte
Will Smith
Winona Ryder
Woody Allen
Yasmine Al Massri
Yvonne De Carlo
Zooey Deschanel