Documentation: Valossa Core API and Valossa Training API

Version of documentation: 2018-07-27

What does the Valossa Core API detect?

Valossa Core API overview

Input data formats

Usage examples

Valossa Search API for your analyzed videos (coming soon)

Valossa Training API overview

Uploading images for training

Training face identities and managing the trained identities

Visualization of your results in a Valossa Report

General notes about metadata

Output metadata JSON format   ← Understand Valossa Core metadata

Current version of the Valossa Core metadata JSON format: 1.3.6 (changelog).

Current version of the "frames_faces" metadata (face bounding boxes) JSON format: 1.0.1.

Current version of the "seconds_objects" metadata (object bounding boxes) JSON format: 1.0.0.

What does the Valossa Core API detect from my videos?

The Valossa Core API is a REST API for automatic video content analysis and metadata creation. In other words, the Valossa Core API is a video recognition API. Metadata contains detections of concepts from the video.

Metadata describes the following detections:

  • humans with visible faces (with attributes such as gender information and possible similarity to persons in your custom face gallery created with the Valossa Training API)
  • visual context (such as objects, animals, scenes and styles)
  • audio context (music styles and musical moods, instruments, sounds of human actions and animals and machines etc., sound effects)
  • audio speech-to-text
  • topical keywords from speech (speech source being either the audio track or the SRT transcript file of the video)
  • topical keywords from video description
  • face groups of co-occurring faces
  • IAB categories for content related advertising
  • detected shot boundaries
  • explicit content detection (visual nudity detection, visual violence detection as well as offensive words detection from audio and transcript)
  • dominant colors in the video

Detections are also provided in different practical groupings: grouped by detection type (best concepts first, and including time-coded occurrences of the concept when applicable) and grouped by second ("What is detected at 00:45 in the video?"). See the explanation of the practical detection groupings.

Valossa Core API overview

The REST API is invoked using HTTP (HTTPS) requests. You can also assign new video analysis jobs to the API on the easy API call page. The responses from the REST API are in the JSON format. The Valossa Report tool helps you to visualize the results. Please use the secure HTTPS transport in all API calls to protect your valuable data and confidential API key: unencrypted HTTP transport is not allowed by the API.

REST API basic communication

Get your API key under "My account" - "API keys" in Valossa Portal. If you have several applications, please create a separate API key for each of them on the "API keys" page. Give a descriptive name to each API key and, if necessary, grant different users of your organization access rights to the API key.

Note: As the administrator user of your organization, you can create new users under "My account" - "Manage users". If your organization has several people who need to use the Portal, you should add them manually in "Manage users", so that they are all mapped to your organization and may view analysis results (if you grant them the rights in Portal) and post new jobs (likewise, if you grant the rights). The permissions are mappings between users and API keys ("this user has read-write access to this API key, so she can both view results and make new job requests"), so please configure the permissions with this in mind; the API key permissions per user can be edited in "Manage users". Create only one customer account for your company/organization; there can be as many users under the customer account as you wish!

The API consists of 5 different functions:

  1. new_job [HTTP POST]
    This function is used to create a new video analysis job in our system. The job (including e.g. the URL of the video file to be analyzed) is defined by using a JSON formatted data structure that is included as the body of the HTTP POST request. This function returns the job_id (UUID value) of the created job. The job_id is used after this as the identifier when querying the status and the results of the job.
  2. job_status [HTTP GET]
    This function is used to query (poll) the status of a specific job, based on its job_id.
  3. job_results [HTTP GET]
    This function is used to fetch the resulting metadata JSON of a finished analysis job identified by its job_id.
  4. list_jobs [HTTP GET]
    This function lists all the jobs for a given API key.
  5. cancel_job [HTTP POST]
    This function cancels a job, if it is in a cancellable state.

You can conveniently monitor the status of your jobs in Valossa Portal. There you can also call the new_job function of the API with an easy API request generator.

Your API key is shown in Valossa Portal on the request generator page and job results page. Keep the key confidential.

Please note regarding speech analysis:

  • If you already have the speech transcript of your video in the SRT format (for example the subtitles of your movie), please specify the transcript URL in the request, along with the video URL. The transcript content will be analyzed, and the detected concepts will be included in the "transcript" part of the metadata JSON.
  • Your existing transcript is, obviously, a more reliable source for speech information than audio analysis. So, if you have the transcript, please use it – it’s a valuable asset!
  • Audio keyword detection and audio speech-to-text will be performed only if you did not provide the SRT transcript (however, providing or omitting the SRT transcript does not affect the audio.context detections).
  • The audio-related metadata generated by us will not contain an actual audio transcript. Instead, we provide you a uniquely descriptive set of keywords extracted from the speech. Whether the source of speech information is audio itself or your transcript file, the output format of the detected keywords is similar in the metadata.

Input data formats

Supported video formats: we support most typical video formats, including but not limited to MP4, MPEG, AVI, FLV, WebM, with various codecs. Currently, we cannot provide a fixed list of supported formats and codecs, but for example MP4 with the H.264 codec works.

Video file size limit: 25GB per video file.

Video duration limit: 7 hours of playback time per video file.

Video vertical resolution limit: 2160 pixels.

Currently, the only supported languages for speech-based detections are English and French. By default, speech is analyzed as English language. See more information about language selection.

If the video file contains several video streams, only the first one is analyzed.

If the video file contains several audio streams, only the first one is analyzed. (Please note that audio keyword detection and audio speech-to-text will be performed only if you did not provide your own SRT-based speech transcript; however, providing or omitting the SRT transcript does not affect the audio.context detections.) The audio stream can be either mono or stereo.

Supported transcript format: SRT only.

File size limit: 5MB per SRT file.

Currently, the only supported transcript language is English.

Usage examples

Creating a new video analysis job

You must pay for the video analysis jobs that you run, unless you have enough existing positive payment balance that was added by Valossa to your account as a result of (for example) a free-usage campaign. Keep a working credit card in your billing information in Valossa Portal, and keep your suitable service subscription (such as the Recognition plan) active!

Start a new subscription on the Valossa product purchase page and manage your existing subscriptions on the plans management page. Manage your billing information on the payment profile page. You can add one or more credit cards and select one of them as the default card, on which the payments are charged. Payments are charged either at the turn of each month, or when a large enough outstanding amount has accumulated. If payments cannot be processed on the credit card, processing of your jobs will cease. Valossa may skip charging the card in situations where the amount to be charged would be very small; a skipped charge may be included as part of a later charging event on the credit card. Receipts of credit card charges will be sent to your email, if you are an administrator user of your organization's account. If a payment fails due to expired, faulty or missing credit card information, you must add a working credit card as soon as possible, and Valossa has the right to charge the outstanding amount from the working credit card at any time or to collect the amount from you by other means.

(The old payment system, retired in December 2017, used prepaid Valossa Credit in order to run a video analysis job. If you had some Valossa Credit before the system change, your Valossa Credit that was remaining at the time of the change has been converted into a corresponding positive payment balance item in the new system.)

Send an HTTP POST to the URL:

https://api.valossa.com/core/1.0/new_job

Example new_job request body in JSON format:

{
  "api_key" : "kkkkkkkkkkkkkkkkkkkk",
  "media": {
    "title": "The Action Movie",
    "description": "Armed with a knife, Jack Jackson faces countless dangers in the jungle.",
    "video": {
      "url": "https://example.com/content/Action_Movie_1989.mp4"
    },
    "transcript": {
      "url": "https://example.com/content/actionmovie.srt"
    },
    "customer_media_info": {
      "id": "469011911002"
    },
    "language": "en-US"
  }
}

The video URL and transcript URL can be either http:// or https:// or s3:// based. If the URL is s3:// based, you should first communicate with us to ensure that our system has read access to your S3 bucket in AWS (Amazon Web Services).

The video URL is mandatory. The URL must directly point to a downloadable video file. Our system will download the file from your system.

The transcript URL is optional – but recommended, because an existing SRT transcript is a more reliable source of speech information than audio analysis. The URL must directly point to a downloadable transcript file. Our system will download the file from your system.

The title is optional – but recommended: a human-readable title makes it easy for you to identify the video on the results page of Valossa Portal, and will also be included in the metadata file.

The description is optional. Description is any freetext, in English, that describes the video.

If title and/or description are provided in the call, the text in them will also be analyzed, and the detected concepts will be included in the analysis results (the "external" concepts in the metadata JSON).

The customer media info is optional. If you provide a customer media ID in the "id" field inside the "customer_media_info" field, you may use the customer media ID (a string from your own content management system) to refer to the specific job in the subsequent API calls, replacing the "job_id" parameter with a "customer_media_id" parameter in your calls. Note: Our system will NOT ensure that the customer media ID is unique across all jobs. Duplicate IDs will be accepted in new_job calls. It is the responsibility of your system to use unique customer media IDs, if your application logic requires customer media IDs to be unique. If you use duplicate customer media IDs, then the latest inserted job with the specific customer media ID will be picked when you use the "customer_media_id" parameter in the subsequent API calls.

The language is optional. It specifies the language model to be used for analyzing the speech in the audio track of your video. The allowed values are "en-US" (US English), "es-ES" (Spanish), "fi-FI" (Finnish), "fr-FR" (French) and "it-IT" (Italian). More languages will be supported in the future. If the language parameter is not given, the default "en-US" will be used so the speech in the video is assumed to be in US English.

Please note that for other languages than US English, the following exceptions apply.

  • Audio-based detections of named entities ("audio.keyword.name.person", "audio.keyword.name.location", "audio.keyword.name.organization", "audio.keyword.name.general" detections) and offensive words ("explicit_content.audio.offensive" detections) are available, but audio-based detections of novelty words ("audio.keyword.novelty_word" detections) are not available.
    • However, for the French language, "audio.keyword.novelty_word" detections are available (as English translations).
  • Audio-based IAB categories are not available.

The language-specific details are subject to change in the future.

If the analysis is technically successful (i.e. if the job reaches the "finished" state), the price of the job will be added to the amount to be charged on your credit card. The price is based on the count of beginning minutes of video playback time. Example: a video of length 39 minutes 20 seconds will be billed as equivalent of 40 minutes. Always keep a working credit card in your payment profile, and if you don't yet have an active subscription please start a new subscription.

Here is an example new_job request body with only the mandatory fields present:

{
  "api_key": "kkkkkkkkkkkkkkkkkkkk",
  "media": {
    "video": {
      "url": "https://example.com/my-car-vid.mpg"
    }
  }
}

The response of a successful new_job call always includes the job_id of the created job.

Example response in an HTTP 200 OK message:

{
  "job_id": "6faefb7f-e468-43f6-988c-ddcfb315d958"
}

Jobs are identified by UUIDs, which appear in "job_id" fields in various messages. Your script that calls the API must, of course, save the job_id from the new_job response in order to be able to query for the status and results later.

Example test call with Curl on the command line, assuming your request JSON is saved in a file:

curl --header "Content-Type:application/json" -X POST -d @your_request.json https://api.valossa.com/core/1.0/new_job
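The same request can also be sent from application code. The following is a minimal Python sketch, assuming the third-party requests library is installed; the API key and video URL are placeholders:

import requests

# Placeholder values: replace with your own API key and video URL.
API_KEY = "kkkkkkkkkkkkkkkkkkkk"

request_body = {
    "api_key": API_KEY,
    "media": {
        "title": "The Action Movie",
        "video": {"url": "https://example.com/content/Action_Movie_1989.mp4"},
    },
}

# POST the job request; the response body is JSON.
response = requests.post(
    "https://api.valossa.com/core/1.0/new_job",
    json=request_body,
    timeout=60,
)
response.raise_for_status()  # non-200 responses carry an "errors" array

job_id = response.json()["job_id"]
print("Created job:", job_id)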

If you want an HTTP POST callback and/or email notification when your video analysis job reaches an end state, you may specify one or both of those in the new_job request. The HTTP POST callback mechanism in our system expects your system to send a 200 OK response for the request (callback) initiated by our system. The request will be retried once by our system, if the first attempt to access your specified callback URL returns a non-200 code from your system or times out. Due to the possibility of network problems and other reasons, you should not rely on the HTTP POST callback to be received by your system. In any case, whether the HTTP POST callback event was received or not, your system can always check the status of the job using the job_status function in the REST API. The email notification will be sent to those users that have the permission to view job results for the chosen API key.

Example of a job request with an HTTP POST callback:

{
  "api_key": "kkkkkkkkkkkkkkkkkkkk",
  "callback": {
    "url": "https://example.com/your_callback_endpoint"
  },
  "media": {
    "title": "Lizards dancing",
    "video": {
      "url": "https://example.com/lizards_dancing.mkv"
    }
  }
}

The HTTP POST callback message is formatted as JSON, and contains the job ID in the "job_id" field and the reached end status of the job in the "status" field. It also contains the customer media ID in the "customer_media_id" field, if you had given a customer media ID for the job. Here is an example of an HTTP POST callback message body:

{
  "job_id": "ad48de9c-982e-411d-93a5-d665d30c2e92",
  "status": "finished"
}
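For illustration, here is a minimal callback endpoint sketch in Python, assuming the Flask web framework (an assumption; any framework that parses the JSON body and returns 200 OK works the same way):

from flask import Flask, request, jsonify

app = Flask(__name__)

# The route path must match the "url" you gave in the "callback" field.
@app.route("/your_callback_endpoint", methods=["POST"])
def valossa_callback():
    payload = request.get_json(force=True)
    job_id = payload["job_id"]
    status = payload["status"]
    # "customer_media_id" is present only if you supplied one in new_job.
    customer_media_id = payload.get("customer_media_id")
    print(f"Job {job_id} reached end state: {status} ({customer_media_id})")
    # Respond with 200 OK so the callback is not retried.
    return jsonify({}), 200

if __name__ == "__main__":
    app.run(port=8000)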

Example of a job request with an email notification specified:

{
  "api_key": "kkkkkkkkkkkkkkkkkkkk",
  "email_notification": {
    "to_group": "users_with_access_to_api_key_results"
  },
  "media": {
    "title": "Lizards dancing",
    "video": {
      "url": "https://example.com/lizards_dancing.mkv"
    }
  }
}

The generated email notification message is intended for a human recipient. So, unlike the HTTP POST callback, the email notification message is not intended for machine parsing.

Error responses from the API

The following pertains to the HTTP error responses, which are returned immediately for your API call if your request was malformed or missing mandatory fields. In other words, the following does not pertain to the separate HTTP callback messages, which were discussed above. (Callback events are not even generated for the errors that are returned immediately in the HTTP error response of an API call.)

Error responses from the API calls (new_job calls or any other calls) contain an error message, and can be automatically separated from 200 OK responses, because error responses are sent along with an HTTP error code (non-200). Error responses are also formatted as JSON, and they contain an "errors" array, where one or more errors are listed with the corresponding error messages.

Example error response in an HTTP 400 message:

{
  "errors": [
    {
      "message": "Invalid API key"
    }
  ]
}

Getting status of a single job

The status of a single analysis job is polled using HTTP GET.

Example request:

https://api.valossa.com/core/1.0/job_status?api_key=kkkkkkkkkkkkkkkkkkkk&job_id=6faefb7f-e468-43f6-988c-ddcfb315d958

Example response in an HTTP 200 OK message:

{
  "status": "processing",
  "media_transfer_status": "finished",
  "details": null,
  "poll_again_after_seconds": 600
}

Possible values for the "status" field: "queued", "on_hold", "preparing_analysis", "processing", "finished", and "error". More status values may be introduced in the future.

If the job status is "error", something went wrong during the analysis process. If there is an explanation of the error in the "details" field, please see if the cause of the error is something you can fix for yourself (such as a non-video file in the video URL of the job request). Otherwise, contact us in order to resolve the issue.

If the job status is "on_hold", it means that there is a problem with your billing information and your jobs cannot proceed because of that. For example, a credit card charge may have failed, or you might not currently have an active service subscription. You need to make sure that you have at least one working credit card in your payment profile and that you have selected one of the cards as the default card, on which the payments are charged. If you don't have an active subscription please start a new subscription.

If the job status is "queued" or "processing", you should poll the status again after some time.

If the job status is "finished", you can fetch the job results using the job_results function.

The "details" field may contain some additional details about the status of the job.

The "media_transfer_status" field indicates whether the media to be analyzed has been transferred from your system to our system. Possible values for the "media_transfer_status" field: "queued", "downloading", "finished" and "error". If "media_transfer_status" is "finished", your video (and the transcript if you provided it) have been successfully transferred to our system.

The value in "poll_again_after_seconds" is just a suggestion about when you should poll the job status again (expressed as seconds to wait after the current job_status request).

If there was a problem with the job_status query itself, the error will be indicated in an HTTP non-200 response with a JSON body, similar to the error responses of the new_job function.
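For illustration, a simple polling loop in Python (assuming the requests library; the API key and job ID are placeholders) that honors "poll_again_after_seconds" and stops at an end state:

import time
import requests

API_KEY = "kkkkkkkkkkkkkkkkkkkk"  # placeholder
JOB_ID = "6faefb7f-e468-43f6-988c-ddcfb315d958"  # placeholder

STATUS_URL = "https://api.valossa.com/core/1.0/job_status"

while True:
    response = requests.get(
        STATUS_URL,
        params={"api_key": API_KEY, "job_id": JOB_ID},
        timeout=60,
    )
    response.raise_for_status()
    body = response.json()

    status = body["status"]
    if status in ("finished", "error"):
        print("Job ended with status:", status, "details:", body.get("details"))
        break
    if status == "on_hold":
        print("Job on hold; check billing information in Valossa Portal.")
        break

    # Wait for the suggested interval before polling again (fall back to 60 s).
    time.sleep(body.get("poll_again_after_seconds") or 60)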

Getting the results of a job

After a job has been finished, the resulting video metadata can be fetched using HTTP GET.

Example request:

https://api.valossa.com/core/1.0/job_results?api_key=kkkkkkkkkkkkkkkkkkkk&job_id=6faefb7f-e468-43f6-988c-ddcfb315d958

Response data is in the JSON format. For more details, see chapter "Output metadata JSON format".

Save the metadata and use it from your own storage disk or database for your easy and quick access. We will not necessarily store the results perpetually.

If there was a problem with the job_results query itself, the error will be indicated in an HTTP non-200 response with a JSON body, similar to the error responses of the new_job function.
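Because the results are not necessarily stored perpetually on our side, it is a good idea to persist the metadata as soon as the job is finished. A minimal Python sketch (requests library assumed; the API key and job ID are placeholders):

import json
import requests

API_KEY = "kkkkkkkkkkkkkkkkkkkk"  # placeholder
JOB_ID = "6faefb7f-e468-43f6-988c-ddcfb315d958"  # placeholder

response = requests.get(
    "https://api.valossa.com/core/1.0/job_results",
    params={"api_key": API_KEY, "job_id": JOB_ID},
    timeout=120,
)
response.raise_for_status()
metadata = response.json()

# Store the Valossa Core metadata locally for fast, repeated access.
with open(f"{JOB_ID}.core.json", "w", encoding="utf-8") as f:
    json.dump(metadata, f)

print("Saved metadata format version:", metadata["version_info"]["metadata_format"])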

Listing all your jobs and statuses of all your jobs

A convenience function for listing all your jobs, optionally including their job statuses (optional parameter "show_status" with the value "true"), using HTTP GET:

Example request:

https://api.valossa.com/core/1.0/list_jobs?api_key=kkkkkkkkkkkkkkkkkkkk&show_status=true

Example response in an HTTP 200 OK message:

{"jobs": [
  {
    "job_id": "6faefb7f-e468-43f6-988c-ddcfb315d958",
    "job_status":
    {
      "status": "finished",
      "media_transfer_status": "finished",
      "details": null,
      "poll_again_after_seconds": null
    }
  },{
    "job_id": "36119563-4b3f-44c9-83c6-b30bf69c1d2e",
    "customer_media_id": "M4070117",
    "job_status":
    {
      "status": "processing",
      "media_transfer_status": "finished",
      "details": null,
      "poll_again_after_seconds": 600
    }
  }
]}

If you had given a customer media ID when creating the job, the "customer_media_id" field exists and contains the customer media ID value.

Showing video titles and other media information in the job listing is often useful. This can be done by using the optional GET parameter "show_media_info" with the value "true". Example request:

https://api.valossa.com/core/1.0/list_jobs?api_key=kkkkkkkkkkkkkkkkkkkk&show_status=true&show_media_info=true

Example response in an HTTP 200 OK message:

{"jobs": [
  {
    "job_id": "36119563-4b3f-44c9-83c6-b30bf69c1d2e",
    "customer_media_id": "M4070117",
    "job_status":
    {
      "status": "finished",
      "media_transfer_status": "finished",
      "details": null,
      "poll_again_after_seconds": null,
      "media_info":
      {
        "title": "Birds clip #22",
        "description": "Birds having a bath",
        "video":
        {
          "url": "https://example.com/contentrepository/project1/aabhk-gg4rt-gi5aq-jjv6t/birds_22_original.mp4"
        }
      }
    }
  },{
    "job_id": "6faefb7f-e468-43f6-988c-ddcfb315d958",
    "job_status":
    {
      "status": "finished",
      "media_transfer_status": "finished",
      "details": null,
      "poll_again_after_seconds": null,
      "media_info":
      {
        "video":
        {
          "url": "https://example.com/my-car-vid.mpg"
        }
      }
    }
  }
]}

By adding the optional GET parameter "n_jobs" to the request (example: n_jobs=500), you can control how many of your jobs will be listed if your job list is long. The default is 200. The maximum possible value for "n_jobs" is 10000.

If there was a problem with the list_jobs query itself, the error will be indicated in an HTTP non-200 response with a JSON body, similar to the error responses of the new_job function.

Cancel a job

Cancel a job by sending an HTTP POST to the URL:

https://api.valossa.com/core/1.0/cancel_job

Example cancel_job request body:

{
  "api_key": "kkkkkkkkkkkkkkkkkkkk",
  "job_id": "be305b1e-3671-45b1-af88-1f052db3d1bb"
}

Example response in an HTTP 200 OK message:

{
  "job_status": "canceled"
}

The job must be in a cancellable state for this function to succeed. For example, a finished job is not cancellable.

If there was a problem with the cancel_job query itself, the error will be indicated in an HTTP non-200 response with a JSON body, similar to the error responses of the new_job function.

If you are interested in deleting job media from the service, contact us.

Valossa Search API for your analyzed videos (coming soon)

Coming soon! Before the release of the REST API for Search, you can use the Search functionality in the Portal with the Video Insight Tools plan.

Valossa Training API overview

The Valossa Training API is part of Video Insight Tools, thus the use of this API requires an active subscription of Video Insight Tools. In addition to REST API access, the functionalities of the Valossa Training API can be accessed using a graphical user interface in Valossa Portal.

Using the Valossa Training API, you can train the system to detect custom faces. The custom faces will be detected in those videos that you analyze after the training. By training your custom faces, you acknowledge and accept the fact that using custom-trained faces may cause some additional delays in the processing of your video analysis jobs.

The detected face identities will appear in the "similar_to" fields of the "human.face" detections in Valossa Core metadata. Your API key(s) that work for creating new analysis jobs with the Valossa Core API will also work for face training with the Valossa Training API.

Your custom gallery, and the ID (a UUID) of your custom gallery, are created implicitly when you start training custom faces.

The Curl-based request-and-response examples below are easy to adapt into REST API calls for use in your application. Just like with the Core API, the HTTP response code 200 indicates a successful operation, and a non-200 code indicates an error (an error message is provided in that case). The response body is in the JSON format.

As you can see from the examples, any "read data" requests use the HTTP GET method, while any "write data" or "erase data" requests use the HTTP POST method.

The trained faces will be automatically used in your subsequent video analysis jobs.

Uploading images for training

Adding sample images for a face has been designed to work with both file uploads and file downloads. Thus, the file reference mechanism used in the add_face_image request of the Valossa Training API uses an easy URL-based syntax for both file input styles. Currently, only uploads are supported, but download support will be added in the future. Download, obviously, means that our system downloads the image file from an HTTP(S) URL provided by your system. Each uploaded file gets assigned a valossaupload:// URL that uniquely identifies the successfully received file residing in our storage system.

To upload a file, first use the REST function upload_image to transfer the file content. Then, refer to the file by its valossaupload:// URL in the add_face_image request of the actual face identity training.

Before using the service in a way that makes you a "processor" and/or "controller" of the personal data of EU-resident natural persons, you are required to make sure that your actions are compliant with the General Data Protection Regulation. See the Terms and Conditions of Valossa services.

An image must be in the JPG or PNG format. The maximum image file size is 8MB.

At least 10 different sample images of each face, photographed from different angles etc., should be given in order to get good training results. The more images the better. Training may in some cases work even with only a few images, but the results are better with more samples: a lot of clear, diverse, high-quality images of the face to be trained.

Send an HTTP POST to the URL:

https://api.valossa.com/training/1.0/upload_image

Curl example:

curl -F "image_data=@ricky_1.jpg" -F "api_key=kkkkkkkkkkkkkkkkkkkk" https://api.valossa.com/training/1.0/upload_image

Example response in an HTTP 200 OK message:

{
  "uploaded_file_url": "valossaupload://ff357efe-1086-427d-b90c-1d1887fb1017"
}

The Content-Type header of the file upload request must be "multipart/form-data". If you use Curl and its "-F" option, Curl will set this Content-Type by default and will also use POST as the HTTP method in the request. There must be one file per upload_image request.

Note! As you can see from the Curl request example above, the API key must be sent as a form parameter (not URL parameter). This is quite natural, taking into account that the Content-Type of the request is "multipart/form-data".
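For illustration, the same multipart upload in Python (assuming the requests library, which sets the multipart/form-data Content-Type automatically when the files parameter is used; the API key and file name are placeholders):

import requests

API_KEY = "kkkkkkkkkkkkkkkkkkkk"  # placeholder

# The image goes into the "image_data" form field and the API key into "api_key".
with open("ricky_1.jpg", "rb") as image_file:
    response = requests.post(
        "https://api.valossa.com/training/1.0/upload_image",
        files={"image_data": image_file},
        data={"api_key": API_KEY},
        timeout=120,
    )
response.raise_for_status()

uploaded_url = response.json()["uploaded_file_url"]  # a valossaupload:// URL
print(uploaded_url)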

Training face identities and managing the trained identities

All the POST-based REST functions listed below accept a JSON-formatted input string, which contains the parameters of the specific function. The GET-based REST functions read their parameters from the request URL.

Create a new face identity

The string fields "name" and "gender" are optional. We recommend setting at least the name, because a nameless face identity might cause confusion for you later on (however, it is perfectly acceptable to have a nameless face identity, if your application logic requires creating such an identity). The maximum length of the value of "name" is 1024 characters. The gender is "male" or "female". The response contains the unique identifier of the face identity (person).

Send an HTTP POST to the URL:

https://api.valossa.com/training/1.0/create_face_identity

Curl example:

curl --header "Content-Type:application/json" -X POST -d '{"api_key":"kkkkkkkkkkkkkkkkkkkk", "face":{"name":"Ricky Rickson", "gender":"male"}}' https://api.valossa.com/training/1.0/create_face_identity

Example response in an HTTP 200 OK message:

{
  "face_id": "bb254a82-08d6-4498-9ddb-3de4c88f1f66"
}

Save the face ID locally. You will need it when you add images for the face or when you do any other operations with the specific face identity.

Add face images to a specific face identity

Referring to your previously uploaded files, you need to add the correct files to a specific existing face identity, one image file per add_face_image request. The response contains a unique identifier of the processed, accepted training image, from which a sample face has been detected. You need the ID later, if you want to do any operations with this training image that has been added to a specific face identity.

There must be exactly one face visible per image. This REST function may take a few seconds to complete, because the system checks that exactly one face is clearly visible (otherwise, an error response is generated).

In the future, image download URLs will also be usable with the same easy add_face_image call syntax. Currently, only the valossaupload:// URLs created as a result of file uploads are supported.

Please make sure that each of the images is actually an image of the correct person. Typically, checking this involves some human work. Wrong images will deteriorate the quality of face detections.

Send an HTTP POST to the URL:

https://api.valossa.com/training/1.0/add_face_image

Curl example:

curl --header "Content-Type:application/json" -X POST -d '{"api_key":"kkkkkkkkkkkkkkkkkkkk", "face":{"id":"bb254a82-08d6-4498-9ddb-3de4c88f1f66"}, "image":{"url":"valossaupload://ff357efe-1086-427d-b90c-1d1887fb1017"}}' https://api.valossa.com/training/1.0/add_face_image

Example response in an HTTP 200 OK message:

{
  "image_id": "8ac7ab90-44d1-4860-9a2f-2afbb175638a"
}
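Putting the calls together, training one face identity from a set of sample images could look roughly like this Python sketch (requests library assumed; the API key, person name and file names are placeholders):

import requests

API_KEY = "kkkkkkkkkkkkkkkkkkkk"  # placeholder
BASE = "https://api.valossa.com/training/1.0"

# 1. Create the face identity (person) once and keep its face_id.
resp = requests.post(
    f"{BASE}/create_face_identity",
    json={"api_key": API_KEY, "face": {"name": "Ricky Rickson", "gender": "male"}},
    timeout=60,
)
resp.raise_for_status()
face_id = resp.json()["face_id"]

# 2. Upload each sample image and attach it to the face identity.
for filename in ["ricky_1.jpg", "ricky_2.jpg", "ricky_3.jpg"]:  # placeholders
    with open(filename, "rb") as image_file:
        upload = requests.post(
            f"{BASE}/upload_image",
            files={"image_data": image_file},
            data={"api_key": API_KEY},
            timeout=120,
        )
    upload.raise_for_status()
    uploaded_url = upload.json()["uploaded_file_url"]

    add = requests.post(
        f"{BASE}/add_face_image",
        json={
            "api_key": API_KEY,
            "face": {"id": face_id},
            "image": {"url": uploaded_url},
        },
        timeout=120,
    )
    add.raise_for_status()
    print("Added image:", add.json()["image_id"])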

List existing face identities

Send an HTTP GET to the URL:

https://api.valossa.com/training/1.0/list_face_identities

Curl example:

curl 'https://api.valossa.com/training/1.0/list_face_identities?api_key=kkkkkkkkkkkkkkkkkkkk'

Example response in an HTTP 200 OK message:

{
  "face_identities": [
    {
      "id": "a99a59e3-ba33-4b00-8114-8bdd92a71dfa"
    },
    {
      "id": "bb254a82-08d6-4498-9ddb-3de4c88f1f66"
    }
  ]
}

It is also possible to list existing face identities with details.

Curl example:

curl 'https://api.valossa.com/training/1.0/list_face_identities?api_key=kkkkkkkkkkkkkkkkkkkk&show_details=true'

Example response in an HTTP 200 OK message:

{
  "face_identities": [
    {
      "id": "a99a59e3-ba33-4b00-8114-8bdd92a71dfa",
      "name": "Lizzy Blythriver",
      "gender": "female"
    },
    {
      "id": "bb254a82-08d6-4498-9ddb-3de4c88f1f66",
      "name": "Ricky Rickson",
      "gender": "male"
    }
  ]
}

List existing images added for a face identity

Send an HTTP GET to the URL:

https://api.valossa.com/training/1.0/list_face_images

Curl example:

curl 'https://api.valossa.com/training/1.0/list_face_images?api_key=kkkkkkkkkkkkkkkkkkkk&face_id=bb254a82-08d6-4498-9ddb-3de4c88f1f66'

Example response in an HTTP 200 OK message:

{
  "face_images": [
    {
      "id": "8ac7ab90-44d1-4860-9a2f-2afbb175638a"
    },
    {
      "id": "b5559837-62a5-4f10-b250-a554ab2ce54c"
    }
  ]
}

Update face identity

Send an HTTP POST to the URL:

https://api.valossa.com/training/1.0/update_face_identity

The "updates" structure contains one or more face parameters to update. The allowed parameters are "name" and "gender". The data type for these values is string. The maximum length of the value of "name" is 1024 characters. The value for "gender" is "male" or "female".

Note: To unset a field such as name or gender completely, just set it to null in an update_face_identity call. In an update, a value that is not mentioned in the "updates" structure will retain its old value if it had one (in other words, omitting the field from the update does not unset the value of the field, while setting it explicitly to null will unset it).

Curl example:

curl --header "Content-Type:application/json" -X POST -d '{"api_key":"kkkkkkkkkkkkkkkkkkkk", "face":{"id":"bb254a82-08d6-4498-9ddb-3de4c88f1f66", "updates":{"name":"Ricky Rixon-Borgmann"}}}' https://api.valossa.com/training/1.0/update_face_identity

Example response in an HTTP 200 OK message:

{}

Remove image from face identity

Send an HTTP POST to the URL:

https://api.valossa.com/training/1.0/remove_face_image

Curl example:

curl --header "Content-Type:application/json" -X POST -d '{"api_key":"kkkkkkkkkkkkkkkkkkkk", "image":{"id":"b5559837-62a5-4f10-b250-a554ab2ce54c"}}' https://api.valossa.com/training/1.0/remove_face_image

Example response in an HTTP 200 OK message:

{}

Remove face identity

Send an HTTP POST to the URL:

https://api.valossa.com/training/1.0/remove_face_identity

Curl example:

curl --header "Content-Type:application/json" -X POST -d '{"api_key":"kkkkkkkkkkkkkkkkkkkk", "face":{"id":"bb254a82-08d6-4498-9ddb-3de4c88f1f66"}}' https://api.valossa.com/training/1.0/remove_face_identity

Example response in an HTTP 200 OK message:

{}

Visualization of your results in a Valossa Report

Valossa Portal provides an easy-to-use visualization tool, called the Valossa Report (part of the Video Insight Tools product package), for you to get a quick visual overview of the most prominent detections, and also a more detailed heatmap for browsing the results. On the home page, each displayed information box that is related to a successfully analyzed video contains a link to the Valossa Report of the video analysis results. To see examples of Valossa Report, click "Demos" on the home page (you must be logged in to Valossa Portal in order to do this).

Below you'll find example screenshots of Valossa Report.

(Actually the Valossa Report is a tool for viewing your Valossa Core metadata in an easy way for humans. When you're ready to integrate Valossa Core metadata to your application, please see the instructions for machine-reading the Valossa Core metadata.)

Overview

The Valossa Report's Overview gives you a quick visual overview of the analyzed video content.

Charade Valossa Report

The tags are an overview of the detected concepts. By clicking the arrows you can browse through the detections in the video. You can also search for a concept within the video by clicking the magnifying glass symbol.

Charade Valossa Report

Heatmap

The Valossa Report's Heatmap displays the timeline of a video, and detections of concepts are placed on the timeline. Each detection is shown on its own row (its own timeline). Detections are grouped by their detection type such as human.face, visual.context, audio.context, etc. Please note that different colors are given to different detection types for distinguishing them visually.

Within a detection type, detections are grouped by prominence. For example, the most prominent faces are shown first.

Charade Valossa Report

With the Valossa Report controls, you can change the resolution of the timeline (how many distinct timeslots are shown) and the number of detections shown. You can also adjust the confidence threshold for several detection types. The detections below the chosen threshold are hidden.

The depth of color in the colored spots on the timeline for a detection shows how many detections of that concept are in that timeslot and/or how confident the detections are. Click on a colored spot, and the video player on the Valossa Report page will play back the video from the corresponding timecode. Thus, you are able to see the main concepts of the video arranged by time and prominence, and verify their correctness. With the main timeline and the seek bar under the video player, you can also move to any time-position in the video.

Charade Valossa Report

Tag & Train

The Tag & Train naming tool can be used to edit the names and genders of the detected faces. Changes will be saved to the metadata of the video analysis job and automatically indexed into the search. Training functionality that allows the AI to learn from the changes will be available during the Video Insight Tools Preview program.

Click the "Tag & Train" button above the face detections or the pencil next to a person name to open the tool.

Charade Valossa Report

General notes about metadata

Valossa Core metadata is provided in downloadable JSON files, which are available via the REST API (function job_results) or via the results page in Valossa Portal that shows the results and other info about your most recent jobs.

The sizes of the JSON files vary depending on the size of the videos and the number of detections, ranging from a few kilobytes to several megabytes. You should save the metadata JSON in your local database or file system. The metadata will not necessarily be stored perpetually in our system, download count limits may be imposed in the future, and it is also faster for your application to access the metadata from your local storage space.

The version number of the metadata format is continuously updated, when the format changes (version changelog of Valossa Core metadata). The version number is a concatenation of three integers, with a dot (.) as the delimiter: starting from the beginning of the string the version number x.y.z contains a major version number, a minor version number and a patch number. If only the patch version number (z in x.y.z) changes, the changes are purely additions to the structure i.e. they can't break the parsing code.
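If your reader code wants to guard against format changes, one possible (conservative) policy is to accept only the major.minor combination it was written against and to ignore patch-level differences. A minimal Python sketch of that idea:

def is_supported_format(version_string, known_major=1, known_minor=3):
    """Return True if the metadata format version is one this reader was written for.

    Patch-level changes (the z in x.y.z) are purely additive, so they are ignored.
    Major or minor changes may add structure the reader does not know about, so
    this conservative check rejects anything newer than the known minor version.
    """
    major, minor, _patch = (int(part) for part in version_string.split("."))
    return major == known_major and minor <= known_minor

print(is_supported_format("1.3.6"))  # True
print(is_supported_format("1.4.0"))  # False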

Output metadata JSON format

Basics: How (and why) to read Valossa Core metadata in your application code

Valossa Core metadata has been designed to address several needs. It answers questions such as:

  1. What does the video contain?
  2. Where — or when — are all the instances of X in the video?
  3. What is in the video at any specific time position?
  4. What is the video predominantly about?

Please see the images below for a quick explanation of how to read these things from the metadata.

Valossa Core AI addresses needs 1 and 4 by detecting a variety of things and then ranking the most dominant detections from the video, so that the Valossa Core metadata can be used for answering questions such as "What are the visuals about?", "Who are the faces appearing in the video?", "What sounds are in the audio track?", "What are the spoken words about?", "What is the entire video about?", etc. The detections are grouped conveniently by detection type, see more below. Needs 2 and 3 are addressed by Valossa Core AI with a smart time-coding logic that makes it easy to read either all the temporal occurrences of a specific detection or all the detections at a specific time position, whichever way is the most useful for your application.

Detections

A more detailed explanation of the fields "detections" and "by_detection_type" can be found in the subchapter Detections.

Detections are grouped by Valossa Core AI in a way that makes it easy for your application code to iterate over all instances (occurrences) of, for example, cats:

Occurrences

by_second field

By reading the "by_second" field, your application code can easily list everything at a given time position. More details about the "by_second" field are provided in the subchapter Detections.

Using IAB categories, the metadata tells the topics of the video to your application code:

IAB categories

The main JSON structure

Valossa Core metadata about your videos is hierarchical and straightforward to parse for your application code. High-level structure of the current Valossa Core video metadata JSON format, not showing detailed subfields:

{
  "version_info": { "metadata_type": "core", "metadata_format": "...", "backend": "..." },
  "job_info": { "job_id": "...", "request": {...} },
  "media_info": { ... },
  "detections": { ... },
  "detection_groupings": { 
    "by_detection_type": { ... },
    "by_second": [ ... ],
    "by_detection_property": { ... }
  },
  "segmentations": { ... }
}

Currently there are two supported values for the "metadata_type" field: "core" and "frames_faces". The default type is "core" (Valossa Core metadata) — if you need "frames_faces" metadata that contains the bounding box information for the detected faces, you must specify this in your API call when downloading metadata.

The version number of the metadata format (x.y.z, explained above) can be found in the "metadata_format" field under "version_info". Of course, the version numbering of Valossa Core metadata (the files with "core" as the value of the "metadata_type" field) is separate from the version numbering of "frames_faces" metadata for the same video.

You will best understand the details of the metadata structure by viewing an actual metadata JSON file generated from one of your own videos! As the very first step, you'll probably want to view your results using the easy Valossa Report visualization tool.

Note: In order to save storage space, JSON provided by the API does not contain line-breaks or indentations. If you need to view JSON data manually during your software development phase, you can use helper tools in order to get a more human-readable (pretty-printed) version of the JSON. For example, the JSONView plugin for your browser may be of help, if you download JSON metadata from the Portal: the browser plugin will display a pretty-printed, easily navigable version of the JSON. In command-line usage, you can use the "jq" tool or even Python: cat filename.json | python -m json.tool > prettyprinted.json

In the following subchapters, the JSON metadata format is described in more detail.

Detections

All concept detections from the video are listed in the field "detections". This is an associative array, where the key is a detection ID and the value is the corresponding detection. Please note that the detection ID is a string, and you must not assume that the string always represents an integer, even though the IDs often look like "1" or "237". So, the ID is a string, unique within the key space of the "detections" structure, but your code cannot assume that the string has a specific internal format.

The detection IDs are used in "detection_groupings" to refer to the specific detection, so the detailed information about each detection resides in one place in the JSON but may be referenced from multiple places using the ID. Inside the field "detection_groupings", three practical groupings of detections are given for you:

  • The subfield "by_detection_type" has detection type identifiers as the key and the value is an array of detection IDs; the array is sorted by relevance, most relevant detections first! Using "by_detection_type", you can easily for example list all the detected faces, or all the detected audio-based keywords. Want to find out whether there's a cat somewhere in your video? Just loop over the visual.context detections and match detections against Valossa concept identifier (cid) of "cat" ("02g28TYU3dMt"), against the Wikidata concept identifier of "cat" ("Q146"), against the Google Knowledge Graph concept identifier of "cat" ("/m/01yrx"), or even against the human-readable concept label "cat" if you're adventurous. See details below.
  • The subfield "by_second" contains an array, where each item corresponds to one second of the video. Using this array you can answer questions such as "What is detected at 00:45 in the video?". Under each second, there is an array of objects which contain at least the string-valued field "d" (detection ID). Using the detection ID as the index, you will find the detection from the "detections" list. If applicable, there is also the float-valued field "c" (confidence, max. 1.0), currently available only for visual.context and audio.context detections. If the field "o" exists, it contains an array of occurrence identifiers that correspond to this detection in this second.
  • The subfield "by_detection_property", introduced in Valossa Core metadata version 1.3.4, currently contains a convenient structure that helps you when several "human.face" detections have been matched to the same specific face ID in a face gallery. More information about iterating over faces based on the gallery face ID

The following image helps understand the usage of detection IDs as references within the JSON data:

Detection IDs

How to get an overview of the most prominent detections? That's easy: in "by_detection_type", start reading detections from the beginning of the lists under each detection type. Because the detections are sorted with the most relevant ones first, reading e.g. the 20 first detections from "human.face" gives you an overview of the most prominent faces in the video. For an easy and quick overview of detections, you may view the Valossa Report (visualization of detections) of the video in Valossa Portal.
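For example, a small Python sketch along these lines (the file name "metadata.json" is a placeholder for a Valossa Core metadata file you have saved) prints the 20 most prominent faces and visual concepts:

import json

# Placeholder path: a Valossa Core metadata JSON file saved by you earlier.
with open("metadata.json", encoding="utf-8") as f:
    metadata = json.load(f)

detections = metadata["detections"]
by_type = metadata["detection_groupings"]["by_detection_type"]

for det_type in ("human.face", "visual.context"):
    print(f"--- Most prominent {det_type} detections ---")
    # The ID lists are already sorted with the most relevant detections first.
    for det_id in by_type.get(det_type, [])[:20]:
        print(det_id, detections[det_id]["label"])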

However, please note that the "audio.speech" detections (speech-to-text results) are not ordered by prominence, as they are just raw snippets of speech detected from a specific time-range from within the video's audio track. The complete speech-to-text data of a video are also available in the SRT format in Valossa Portal, on the page that lists the most recent analysis results. The content of the downloadable SRT file is generated from the "audio.speech" detections from the Valossa Core metadata JSON file, so the information is the same whether you read the speech-to-text results from the metadata or from the SRT downloaded from Valossa Portal. Please note that the newlines in the generated speech-to-text SRT file are Unix-newlines (LF only, not CRLF).

Every detection in the JSON has, at minimum, the fields "t" (detection type identifier) and "label". The "label" is just the default human-readable label of the detected concept, and for many detection types, more specific information is available in additional data fields. The currently supported detection type identifiers are listed below.

Fields that exist or don't exist in a detection, depending on the detection type and situation, include "occs", "a", "ext_refs", "categ" and "cid".

Detection types

Currently, the following detection types are supported.

visual.context
audio.context
audio.speech
human.face
human.face_group
transcript.keyword.novelty_word
transcript.keyword.name.person
transcript.keyword.name.location
transcript.keyword.name.organization
transcript.keyword.name.general
audio.keyword.novelty_word
audio.keyword.name.person
audio.keyword.name.location
audio.keyword.name.organization
audio.keyword.name.general
external.keyword.novelty_word
external.keyword.name.person
external.keyword.name.location
external.keyword.name.organization
external.keyword.name.general
topic.iab.transcript
topic.iab.visual
topic.iab.audio
explicit_content.nudity
explicit_content.audio.offensive
explicit_content.transcript.offensive
visual.color

The identifiers are mostly self-explanatory. Please note that "visual.context" offers a broad range of visual detections such as objects; "audio.context" offers a broad range of audio-based detections; "topic.iab.*" are IAB categories for the entire video; "external.keyword.*" refers to keywords found in the video description or title; "human.face_group" detections are groups of co-occurring people whose temporal correlation is high enough that they probably have meaningful interaction with each other.

Occurrences

The field "occs" contains the occurrence times of the detection. There is a start time and an end time for each occurrence. For example, a visual object "umbrella" might be detected 2 times: first occurrence from 0.3 seconds to 3.6 seconds, and another occurrence from 64.4 seconds to 68.2 seconds — so there would be 2 items in the "occs" array. Time values are given as seconds "ss" (seconds start) and "se" (seconds end), relative to the beginning of the video.

Detections that are not time-bound (such as topic.iab.* and external.keyword.*) cannot contain "occs".

If applicable to the detection type, occurrences have a maximum confidence ("c_max") detected during the occurrence period. (Because confidence varies at different moments during the occurrence, it makes sense to provide just the maximum value here. To find out the confidence during a particular moment, check out the "c" field of each second in the "by_second" data.) Currently, only visual.context and audio.context detections have "c_max".
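For example, to answer "At what time-sections is a cat visible?", your code can match "visual.context" detections against the Valossa concept ID of "cat" mentioned earlier and read their "occs" arrays. A minimal Python sketch (the metadata file path is a placeholder):

import json

with open("metadata.json", encoding="utf-8") as f:  # placeholder path
    metadata = json.load(f)

CAT_CID = "02g28TYU3dMt"  # Valossa concept ID ("cid") of "cat"

detections = metadata["detections"]
by_type = metadata["detection_groupings"]["by_detection_type"]

for det_id in by_type.get("visual.context", []):
    detection = detections[det_id]
    if detection.get("cid") != CAT_CID:
        continue
    # "occs" may be missing for detection types that are not time-bound.
    for occ in detection.get("occs", []):
        print(f"cat visible from {occ['ss']} s to {occ['se']} s "
              f"(max confidence {occ.get('c_max')})")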

Please note that if you want to answer the question "What is in the video at time xx:xx?", then you should see the "by_second" array in the "detection_groupings". Occurrences, on the other hand, are good when you want to answer the question "At what time-sections is Y detected?"
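For instance, listing everything detected at 00:45 could look roughly like the sketch below (Python; this assumes, as described above, that each element of the "by_second" array is the per-second list of detection references, indexed from the start of the video):

import json

with open("metadata.json", encoding="utf-8") as f:  # placeholder path
    metadata = json.load(f)

second = 45  # "What is detected at 00:45 in the video?"

detections = metadata["detections"]
by_second = metadata["detection_groupings"]["by_second"]

if second < len(by_second):
    for item in by_second[second]:
        detection = detections[item["d"]]
        confidence = item.get("c")  # only visual.context and audio.context have "c"
        print(detection["t"], detection["label"], confidence)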

Other optional data fields of a detection

As you remember, "t" and "label" are always given for a detection. The field "occs" might not be there. Besides "occs", the other optional fields of a detection are "a", "cid", "ext_refs" and "categ".

If present, the object-field "a" contains attributes of the detection. For example, "human.face" detections may have the attributes "gender" (the detected gender), "s_visible" (the total screen-time of the face) and "similar_to" (possible visual similarity matches to persons in a face gallery). The "gender" structure also contains the confidence field "c" (0.0 to 1.0).

If present, the string-field "cid" contains the unique identifier of the concept in the Valossa Concept Ontology. All visual.context and audio.context detections have "cid". However, for example audio.speech detections don't have "cid".

If present, the array-field "ext_refs" contains references to the detected concept in different ontologies. Most visual.context detections have "ext_refs", expressing the concept identity in an external ontology, such as the Wikidata ontology or the Google Knowledge Graph ontology (or several ontologies, depending on the availability of the concept in the various external ontologies). Inside "ext_refs", the ontology identifier for Wikidata is "wikidata" and the ontology identifier for Google Knowledge Graph is "gkg" (see examples). If a specific external ontology reference object such as "wikidata" exists, there is an "id" field inside the object; the "id" field contains the unique identifier of the concept within that external ontology. You may then look up information about the concept from external services such as https://www.wikidata.org/. For "topic.iab.*" detections, the "ext_refs" field contains the ontology identifier "iab", and the ontology reference object describes the topic (IAB category) in the industry-standard IAB classification.

If present, the object-field "categ" contains the key "tags", and under the key "tags" there is an array-field that contains one or more category identifier tags (string-based identifiers such as "flora" or "fauna") for the concept of the detection. For example, a "dog" detection has the category tag array ["fauna"]. As another example, an "amusement park" detection has the category tag array ["place_scene", "nonlive_manmade"]. Many visual.context detections and some audio.context detections have "categ". Note! This is about the categories of a specific detection (a "single concept") — a completely different thing than the categories of the entire video (such as IAB categories).

Currently, the following category tags are supported.

accident
act_of_violence
bomb_explosion
brand_product
content_compliance
event
explicit_content
fauna
flora
food_drink
football
graphics
gun_weapon
human
human_expression
injury
logo
natural_disaster
nonlive_manmade
nonlive_natural
place_scene
sensual
sexual
sport
style
substance_use
threat_of_violence
time
vehicle
video_structure
violence

Tips for reading some detection types

For "audio.speech" (speech-to-text) detections, the detected sentences/words are provided as a string in the "label" field of the detection.

Information related to "human.face" detections: If and only if a face is similar to one or more faces in a face gallery, the "a" field of the specific face detection object will contain a "similar_to" field, which is an array of the closely matching gallery faces. Within each item of the "similar_to" array there is a string-valued "name" field providing the name of the visually similar person, and a float-valued "c" field giving the confidence (0.0 to 1.0) that the face is actually the named person. In "similar_to", the matches are sorted with the best match first. Please note that a face occurrence doesn't directly contain "c" — confidence for faces is only available in the "similar_to" items.

Starting from Valossa Core metadata version 1.3.4, each "similar_to" item also contains a "gallery" field and a "gallery_face" field. The "gallery" field contains an "id" field, the value of which is the face gallery ID (a UUID) of the gallery from which the matched face identity was found (see custom galleries). The "gallery_face" field contains a string-valued "name" field and an "id" field, the value of which is the face ID (a UUID), that is, a unique identifier of the face identity (person) within the specified face gallery. (Note: A "name" field exists in two places, for reader code compatibility with previous metadata formats.) For information about how to get the face coordinates (bounding boxes) of a face and how to find all occurrences of a specific gallery-matched face identity, see the separate subsection.
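As an illustration, a Python sketch that prints all gallery matches found in the metadata (the file path is a placeholder):

import json

with open("metadata.json", encoding="utf-8") as f:  # placeholder path
    metadata = json.load(f)

detections = metadata["detections"]
by_type = metadata["detection_groupings"]["by_detection_type"]

for det_id in by_type.get("human.face", []):
    face = detections[det_id]
    attributes = face.get("a", {})
    # "similar_to" exists only if the face matched at least one gallery face.
    for match in attributes.get("similar_to", []):
        gallery_face = match.get("gallery_face", {})  # present from format 1.3.4 onwards
        print(f"Detection {det_id}: similar to {match['name']} "
              f"(confidence {match['c']}, gallery face ID {gallery_face.get('id')})")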

The detection type "explicit_content.nudity" has two possible detections: "bare skin" and "nsfw". Please note: the detection type "explicit_content.nudity" will soon be deprecated. A more detailed visual explicit-content model (covering nudity-related and violence-related content) already exists and has been integrated into the "visual.context" detections, where the relevant detections carry category tags that make them easy to distinguish from the non-explicit detections. You should already be using the explicit visual content related detections from "visual.context", instead of the legacy "explicit_content.nudity" detections.

The detected dominant colors in the video, per second, are provided in such a way that there is only one "visual.color" detection, which covers the entire video. The colors themselves are provided as RGB values (6 hexadecimal digits in a string) stored as attributes in the "by_second" structure, in those second-based data items where the "d" field refers to the single detection that has the detection type "visual.color". Please note that at the end of a video there might be a second for which the color item is not available, so do not write reader code that assumes a "visual.color" detection reference to exist at absolutely every second of the video.

The attributes are in the "a" field of the color-related data item of the given second, and the "a" field contains an array-valued "rgb" field, where each item is an object describing one detected color. In each of those objects, the "f" field is the float-valued fraction (max. 1.0) of the image area that contains the particular color (or a close-enough approximation of that color), and the "v" field is the RGB value. The "letter digits" (a-f) in the hexadecimal values are in lowercase.

Example of color data at a particular second:

...
{
  "d": "12",
  "o": [
    "13"
  ],
  "a": {
    "rgb": [
      {
        "f": 0.324,
        "v": "112c58"
      },
      {
        "f": 0.301,
        "v": "475676"
      },
      {
        "f": 0.119,
        "v": "9f99a3"
      }
    ]
  }
},
...
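A small Python sketch that picks, for each second, the color covering the largest fraction of the frame (same assumptions about "by_second" as in the earlier sketches; the metadata file path is a placeholder):

import json

with open("metadata.json", encoding="utf-8") as f:  # placeholder path
    metadata = json.load(f)

detections = metadata["detections"]
by_second = metadata["detection_groupings"]["by_second"]

for second, items in enumerate(by_second):
    for item in items:
        if detections[item["d"]]["t"] != "visual.color":
            continue
        # Pick the color covering the largest fraction ("f") of the image area.
        dominant = max(item["a"]["rgb"], key=lambda color: color["f"])
        print(f"{second} s: #{dominant['v']} covers {dominant['f'] * 100:.1f}% of the frame")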

Detection data examples

An example of what is found in "detections", a visual.context detection:

...
"86": {
  "t": "visual.context",
  "label": "hair",
  "cid": "lC4vVLdd5huQ",
  "ext_refs": {
    "wikidata": {
      "id": "Q28472"
    },
    "gkg": {
      "id": "/m/03q69"
    }
  },
  "categ": {
    "tags": [
      "human"
    ]
  },
  "occs": [
    {
      "ss": 60.227,
      "se": 66.191,
      "c_max": 0.80443,
      "id": "267"
    },
    {
      "ss": 163.038,
      "se": 166.166,
      "c_max": 0.72411,
      "id": "268"
    }
  ]
},
...

Another example from "detections", a human.face detection:

...
"64": {
  "t": "human.face",
  "label": "face",
  "a": {
    "gender": {
      "c": 0.929,
      "value": "female"
    },
    "s_visible": 4.4,
    "similar_to": [
      {
        "c": 0.92775,
        "name": "Tina Schlummeister"
        "gallery": {
          "id": "a3ead7b4-8e84-43ac-9e6b-d1727b05f189"
        },
        "gallery_face": {
          "id": "f6a728c6-5991-47da-9c17-b5302bfd0aff",
          "name": "Tina Schlummeister"
        }
      }
    ]
  },
  "occs": [
    {
      "ss": 28.333,
      "se": 33.567,
      "id": "123"
    }
  ]
},
...

An example of an audio.context detection:

...
"12": {
  "t": "audio.context",
  "label": "exciting music",
  "cid": "o7WLKO1GuL5r"
  "ext_refs": {
    "gkg": {
      "id": "/t/dd00035"
    }
  },
  "occs": [
  {
    {
      "ss": 15,
      "se": 49
      "c_max": 0.979,
      "id": "8",
    }
  ],
},
...

An example of an IAB category detection:

...
"173": {
  "t": "topic.iab.transcript",
  "label": "Personal Finance",
  "ext_refs": {
    "iab": {
      "labels_hierarchy": [
        "Personal Finance"
      ],
      "id": "IAB13"
    }
  }
},
...

An example of a keyword detection:

...
"132": {
  "t": "transcript.keyword.name.location",
  "label": "Chillsbury Hills",
  "occs": [
    {
      "ss": 109.075,
      "se": 110.975,
      "id": "460"
    }
  ]
}
...

Please note that transcript keyword occurrence timestamps are based on the input SRT timestamps. In the future, if a non-timecoded transcript is supported, transcript keywords might not have occurrences/timecoding.

How to find all face occurrences of a recognized person, and how to read the face coordinates (bounding boxes)

When viewing your video analysis results, you may have noticed that several different "human.face" detections (under different detection IDs) may be recognized as the same named person from a face gallery. This is natural: to the AI, some face detections look different enough from each other to be classified as separate faces (separate face detections), yet each of those detections is similar enough to a specific face in the gallery, so each of them has a "similar_to" item for the same gallery face. For example, there could be two "human.face" detections that are "similar_to" the gallery face "Steve Jobs", one with confidence 0.61 and the other with confidence 0.98.

Of course, a question arises: is there an easy way to list all the detections of a specific gallery face within a given Valossa Core metadata file? For example, to find all "human.face" detections that were (with some confidence) matched to the gallery face "Steve Jobs"? Yes, there is.

Under "detection_groupings":"by_detection_property", certain types of detections are grouped by their certain shared properties. Currently, the only supported property-based grouping is for "human.face" detections, and for them the only supported property-based grouping has the identifier "similar_to_face_id". As shown in the example below, all detected faces that have at least one "similar_to" item (with a gallery face ID) are listed in the structure, indexed by the gallery face ID. A gallery face ID is a UUID that uniquely identifies the specific person (more precisely: the specific face identity) within the particular face gallery.

Please note that some legacy gallery faces might not have a face ID (UUID) and thus cannot be found in the "similar_to_face_id" structure. This restriction only applies to a few customers whose face gallery was created before the introduction of the "similar_to_face_id" grouping structure, and of course to analysis jobs that were run on an old version of the system: the face IDs and the "similar_to_face_id" grouping structure were introduced in version 1.3.4 of the Valossa Core metadata.

Under each gallery face ID, there is an object that contains the fields "moccs" and "det_ids".

In the "moccs" field, there is an array of objects that are the merged occurrences of the one or more "human.face" detections that share a specific gallery face ID in their "similar_to" items. The naming "moccs" highlights the difference of the format to the "occs" format that can be found in the actual "human.face" detections.

In the "det_ids" field, there is an array of the IDs of the detections that have this specific gallery face ID in their "similar_to" items. Thus, if you want to read all the original corresponding "human.face" detections (including, among other things, the original occurrences separately for each detection in a "non-merged form") for any specific gallery face ID, it is easy.

Of course, if there is only one "human.face" detection having a "similar_to" item with a given gallery face ID, then there is only one detection ID in the "det_ids" array under that gallery face ID, and the "moccs" array of that gallery face originates solely from the occurrences of the single corresponding "human.face" detection.

The name of each face is available in the "similar_to" items of the "human.face" detections, which are referred to by the detection IDs listed in "det_ids". For example, by looking at the detection with the ID "3" in the "detections" field of the metadata, you would see that the face ID "cb6f580b-fa3f-4ed4-94b6-ec88c6267143" is "Steve Jobs". Naturally, the Valossa Report tool provides an easy way to view the merged face information.

Example of "similar_to_face_id" detection groupings data, where the occurrences of the face detections "3" and "4" with similarity to Steve Jobs (cb6f580b-fa3f-4ed4-94b6-ec88c6267143) have been merged into one easy-to-parse "moccs" structure:

...
"detection_groupings": {
  ...
  "by_detection_property": {
    "human.face": {
      "similar_to_face_id": {
        "cb6f580b-fa3f-4ed4-94b6-ec88c6267143": {
          "moccs": [
            {"ss": 5.0, "se": 10.0},
            {"ss": 21.0, "se": 35.0},
            {"ss": 64.0, "se": 88.0},
            {"ss": 93.0, "se": 98.0},
            {"ss": 107.0, "se": 112.0},
            {"ss": 123.0, "se": 137.0},
            {"ss": 157.0, "se": 160.0},
            {"ss": 196.0, "se": 203.0},
            {"ss": 207.0, "se": 212.0}
          ],
          "det_ids": ["3", "4"]
        },
        "648ec86d-4d91-42a6-928d-a25d8dc2691c": {
          "moccs": [
            {"ss": 194.0, "se": 197.0},
            {"ss": 229.0, "se": 237.0}
          ],
          "det_ids": ["19"]
        },
        ...
      }
    }
  },
  ...
},
...
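
As an illustration, the following minimal Python sketch (assuming "metadata" holds the parsed Core metadata JSON) lists the merged occurrences and the source detection IDs for each gallery face ID, and looks up the person name from the first referenced detection:

# Minimal sketch: list the merged occurrences per gallery face ID, assuming
# "metadata" is the parsed Valossa Core metadata JSON.
by_property = metadata["detection_groupings"].get("by_detection_property", {})
face_groupings = by_property.get("human.face", {}).get("similar_to_face_id", {})
for gallery_face_id, grouping in face_groupings.items():
        # Look up the person name from one of the referenced detections
        first_detection = metadata["detections"][grouping["det_ids"][0]]
        name = None
        for similar in first_detection["a"]["similar_to"]:
                if similar.get("gallery_face", {}).get("id") == gallery_face_id:
                        name = similar["name"]
                        break
        print("Gallery face " + gallery_face_id + " (" + str(name) + "):")
        for mocc in grouping["moccs"]:
                print("  visible from " + str(mocc["ss"]) + "s to " + str(mocc["se"]) + "s")
        print("  source detection IDs: " + ", ".join(grouping["det_ids"]))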

Do you need face coordinates, that is, the bounding boxes for each detected face at a specific point in time? They are available from the Valossa Core API, but because of the considerable file size, the bounding boxes are not part of the Valossa Core metadata JSON. The bounding box data must be downloaded as a separate JSON file from the API. The metadata type identifier of this special metadata JSON is "frames_faces" (in "version_info":"metadata_type"). When downloading the metadata, you need to specify the metadata type with the parameter "type=frames_faces" in the job_results call.

The "frames_faces" metadata is easy to parse. The "faces_by_frame" field, which always exists, is an array that is indexed with the frame number so that the information for the first frame is at [0], the information for the next frame at [1] and so on. For each frame, there is an array that contains one bounding box object for each face that was detected in that frame. Of course, a frame without any detected faces is represented by an empty array.

Every bounding box object contains the fields "id", "x", "y", "w", "h". The value of "id" is the same detection ID that the corresponding "human.face" detection has in the Valossa Core metadata file of the same video analysis job. The values of "x" and "y" are the coordinates of the upper-left corner of the bounding box (the x offset from the left edge of the frame, and the y offset from the top of the frame). The values of "w" and "h" are the width and height of the bounding box, respectively. The values of "x", "y", "w" and "h" are all given as float values relative to the frame size, thus normally ranging from 0.0 to 1.0, with the following exception: because a detected face can be partially outside the frame area, some face coordinates may be slightly less than 0.0 or more than 1.0 when the system approximates the edge of the invisible part of a bounding box. For example, the "x" coordinate of a face in such a case could be -0.027.

Example job_results request for "frames_faces" metadata, using HTTP GET:

https://api.valossa.com/core/1.0/job_results?api_key=kkkkkkkkkkkkkkkkkkkk&job_id=167d6a67-fb99-438c-a44c-c22c98229b93&type=frames_faces

Example response in an HTTP 200 OK message:

{
  "version_info": { "metadata_type": "frames_faces", "metadata_format": "...", "backend": "..." },
  "job_info": { ... },
  "media_info": { ... },
  "faces_by_frame": [
    ...
    [],
    [],
    [],
    [],
    [
      {
        "id": "1",
        "x": 0.4453125,
        "y": 0.1944444477558136,
        "w": 0.11953125149011612,
        "h": 0.21388888359069824
      }
    ],
    [
      {
        "id": "1",
        "x": 0.4351562559604645,
        "y": 0.19583334028720856,
        "w": 0.11953125149011612,
        "h": 0.2152777761220932
      }
    ],
    [
      {
        "id": "1",
        "x": 0.42578125,
        "y": 0.19722221791744232,
        "w": 0.12187500298023224,
        "h": 0.22083333134651184
      },
      {
        "id": "5",
        "x": 0.3382812440395355,
        "y": 0.23888888955116272,
        "w": 0.20468750596046448,
        "h": 0.3986110985279083
      }
    ],
    ...
  ]
}
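
A minimal Python sketch along these lines parses a downloaded "frames_faces" JSON file and converts the relative bounding box coordinates to pixels. The file name and the frame dimensions below are illustrative assumptions, not values provided by the API; use the real dimensions of your video frames:

import json

# Minimal sketch: read face bounding boxes from a downloaded "frames_faces" JSON file.
# The file name and the frame dimensions are assumptions made for this example.
FRAME_WIDTH_PX = 1280
FRAME_HEIGHT_PX = 720

with open("your_frames_faces_metadata.json", "r") as jsonfile:
        frames_faces = json.load(jsonfile)

for frame_number, boxes in enumerate(frames_faces["faces_by_frame"]):
        for box in boxes:
                # "id" is the detection ID of the corresponding "human.face" detection
                # in the Core metadata of the same analysis job
                x_px = int(round(box["x"] * FRAME_WIDTH_PX))
                y_px = int(round(box["y"] * FRAME_HEIGHT_PX))
                w_px = int(round(box["w"] * FRAME_WIDTH_PX))
                h_px = int(round(box["h"] * FRAME_HEIGHT_PX))
                print("Frame " + str(frame_number) + ": face detection " + box["id"]
                      + " at (" + str(x_px) + ", " + str(y_px) + "), size "
                      + str(w_px) + "x" + str(h_px) + " pixels")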

Segmentations

In "segmentations", the video is divided into time-based segments using different segmentation rules.

Currently we support automatically detected shot boundaries, so "segmentations" contains "detected_shots". The array "detected_shots" provides the shot boundaries as one object per detected shot, with seconds-based start and end timepoints (float-valued fields "ss", "se") and with start and end frame numbers (integer-valued fields "fs", "fe"). The shot duration in seconds is also provided (float-valued field "sdur"). Note: frame numbers are 0-based, i.e. the first frame in the video has the number 0. All the fields "ss", "se", "fs", "fe" and "sdur" are present in every shot object. The ordering of the shot objects in the "detected_shots" array is the same as the ordering of the detected shots in the video.

Example data:

"segmentations": {
  "detected_shots": [
    {
      "ss": 0.083,
      "se": 5.214,
      "fs": 0,
      "fe": 122,
      "sdur": 5.131
    },
    {
      "ss": 5.214,
      "se": 10.177,
      "fs": 123,
      "fe": 241,
      "sdur": 4.963
    },
    ...
  ]
}
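
For example, a minimal Python sketch (again assuming "metadata" holds the parsed Core metadata JSON) that prints the detected shots could look like this:

# Minimal sketch: print the detected shots, assuming "metadata" is the
# parsed Valossa Core metadata JSON.
for shot_number, shot in enumerate(metadata["segmentations"]["detected_shots"]):
        print("Shot " + str(shot_number) + ": " + str(shot["ss"]) + "s - " + str(shot["se"]) + "s"
              + " (frames " + str(shot["fs"]) + "-" + str(shot["fe"]) + "), duration "
              + str(shot["sdur"]) + "s")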

Code examples for reading metadata

Example code snippet (in Python 3) that illustrates how to access the data fields in Valossa Core metadata JSON:

import json
metadata = None
with open("your_core_metadata.json", "r") as jsonfile:
        metadata = json.loads(jsonfile.read())

# Loop over all detections so that they are grouped by the type
for detection_type, detections_of_this_type in metadata["detection_groupings"]["by_detection_type"].items():
        print("----------")
        print("Detections of the type: " + detection_type + ", most relevant detections first:")
        print()
        for det_id in detections_of_this_type:
                print("Detection ID: " + det_id)
                detection = metadata["detections"][det_id]
                print("Label: " + detection["label"])
                print("Detection, full info:")
                print(detection)

                # Example of accessing attributes (they are detection type specific)
                if detection_type == "human.face":
                        attrs = detection["a"]
                        print("Gender is " + attrs["gender"]["value"] + " with confidence " + str(attrs["gender"]["c"]))
                        if "similar_to" in attrs:
                                for similar in attrs["similar_to"]:
                                        print("Face similar to person " + similar["name"] + " with confidence " + str(similar["c"]))

                # More examples of the properties of detections:

                if detection_type == "visual.context" or detection_type == "audio.context":
                        if "ext_refs" in detection:
                                if "wikidata" in detection["ext_refs"]:
                                        print("Concept ID in Wikidata ontology: " + detection["ext_refs"]["wikidata"]["id"])
                                if "gkg" in detection["ext_refs"]:
                                        print("Concept ID in GKG ontology: " + detection["ext_refs"]["gkg"]["id"])

                if "occs" in detection:
                        for occ in detection["occs"]:
                                print("Occurrence starts at " + str(occ["ss"]) + "s from beginning of video, and ends at " + str(occ["se"]) + "s")
                                if "c_max" in occ:
                                        print("Maximum confidence of detection during this occurrence is " + str(occ["c_max"]))
                                        # If you need the confidence for a particular time at second-level accuracy, see the by_second grouping of detections

                print()
        print()

# Example of listing only audio (speech) based word/phrase detections:
for detection_type, detections_of_this_type in metadata["detection_groupings"]["by_detection_type"].items():
        if detection_type.startswith("audio.keyword."):
                for det_id in detections_of_this_type:
                        detection = metadata["detections"][det_id]
                        print("Label: " + detection["label"])  # etc... You get the idea :)
print()

# Example of listing only detections of a specific detection type:
if "human.face" in metadata["detection_groupings"]["by_detection_type"]:
        for det_id in metadata["detection_groupings"]["by_detection_type"]["human.face"]:
                detection = metadata["detections"][det_id]  # etc...
print()

# Example of listing IAB categories detected from different modalities (visual/audio/transcript) of the video
for detection_type, detections_of_this_type in metadata["detection_groupings"]["by_detection_type"].items():
        if detection_type.startswith("topic.iab."):
                for det_id in detections_of_this_type:
                        detection = metadata["detections"][det_id]  # etc...
                        print("IAB label, simple: " + detection["label"])
                        print("IAB ID: " + detection["ext_refs"]["iab"]["id"])
                        print("IAB hierarchical label structure:")
                        print(detection["ext_refs"]["iab"]["labels_hierarchy"])
print()

# Time-based access: Loop over time (each second of the video) and access detections of each second
sec_index = -1
for secdata in metadata["detection_groupings"]["by_second"]:
        sec_index += 1
        print("----------")
        print("Detected at second " + str(sec_index) + ":")
        print()
        for detdata in secdata:
                det_id = detdata["d"]
                if "c" in detdata:
                        print("At this second, detection has confidence " + str(detdata["c"]))
                if "o" in detdata:
                        # If for some reason you need to know the corresponding occurrence (time period that contains this second-based detection)
                        print("The detection at this second is part of one or more occurrences. The occurrence IDs, suitable for searching within the 'occs' list of the 'detection' object, are:")
                        for occ_id in detdata["o"]:
                                print(occ_id)
                print("Detection ID: " + det_id)
                detection = metadata["detections"][det_id]
                print("Label: " + detection["label"])
                print("Detection of the type " + detection["t"] + ", full info:")
                # Of course, also here you can access attributes, cid, occurrences etc. through the "detection" object
                # just like when you listed detections by their type. In other words, when you just know the ID
                # of the detection, it's easy to read the information about the detection by using the ID.
                print(detection)
                print()

Valossa Core metadata JSON format version changelog

1.3.6: resolution, codecs and bitrates added to technical media information

1.3.5: visual.color added, violence-related concept categories added

1.3.4: detection grouping by_detection_property added, identifier information added for gallery faces

1.3.3: categ added to relevant visual.context and audio.context detections

1.3.2: similar_to in human.face detections supports role names

1.3.1: added metadata type field (supports distinguishing between different types of Valossa metadata in the future)

1.3.0: improved speech-to-text format

1.2.1: speech-to-text

1.2.0: field naming improved

1.1.0: more compact format

1.0.0: large changes, completely deprecated old version 0.6.1.