Preparing Your Data

This page helps you determine which text is valid input and which could skew your results. Our technology performs best on samples of written or spoken language, whether conversational, formal, or informal. Suitable sources include blog posts, survey responses, social media posts, transcribed calls, short text samples, and text messages.

Raw text works best, meaning that it's unnecessary to tokenize, lemmatize, stem, remove stop words, or remove punctuation.

During data preparation, raw text may need to be aggregated or parsed depending on the goal of your analysis. Raw text should be prepared in a way that corresponds with the level of insight you aim to produce. For example, parse raw text into discrete sentences before calling the API if you aim to produce sentence-level insights; aggregate raw text into paragraphs before calling the API if you aim to produce paragraph-level insights. The API will analyze input text submissions as a whole unit.
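As a rough illustration of the two preparation styles described above, the sketch below splits raw text into sentences or groups it into paragraphs before submission. The function names are hypothetical, and the regex-based sentence splitter is a deliberately naive assumption (it will mis-split abbreviations such as "Dr."); a dedicated sentence tokenizer may suit real data better.

```python
import re


def split_into_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def split_into_paragraphs(text: str) -> list[str]:
    """Treat one or more blank lines as a paragraph boundary."""
    parts = re.split(r"\n\s*\n", text.strip())
    # Collapse internal line breaks so each paragraph is a single unit of text.
    return [" ".join(p.split()) for p in parts if p.strip()]


raw = (
    "I loved the onboarding. The billing page confused me!\n"
    "\n"
    "Support replied quickly. Would I recommend it? Probably."
)

sentence_samples = split_into_sentences(raw)    # one sample per sentence
paragraph_samples = split_into_paragraphs(raw)  # one sample per paragraph
```

Each element of `sentence_samples` or `paragraph_samples` would then be submitted as its own text sample, since the API scores each submission as a whole.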

Refer to the table below for details on what to include in, and exclude from, your text before using the Receptiviti API.

| Element | Example(s) | Include? | Action | Comment |
| --- | --- | --- | --- | --- |
| Text encoding | utf-8 | | Encode your text strings in UTF-8. | The API currently accepts only JSON as input. JSON is encoded in Unicode, with UTF-8 as the default encoding. |
| @Mentions | @bigScaryPup | Yes | Leave in your text if this is relevant to your use case. Exclude, if not. | For most use cases, @Mentions are data noise, not natural language, and do not indicate underlying psychology or emotion. Currently, an @Mention adds 1 to the word count (wc). |
| Hashtags | #lolnotlol | Yes | Retain hashtags: we score them. | Hashtags are separated and parsed by the API, and their individual components count towards the word count. For example, #thiswillbescored is split into "this will be scored" and counts as 4 words. Currently, a hashtag adds the number of tokens it contains to wc and 1 to the hashtags category. |
| Emojis | \xf0\x9f\x8c\xbb 😂😡 | Yes | Retain emojis in your text. | Emojis are visual representations of emotions, common objects, and situations. They are powerful tools for uncovering psychological and emotional meaning in language. |
| URLs | http://receptiviti.com | Yes | Leave in your text if this is relevant to your use case. Exclude, if not. | For most use cases, URLs are data noise, not natural language, and do not indicate underlying psychology or emotion. However, if they are relevant to your use case, feel free to leave them in your text. Currently, a URL adds 1 to wc and 1 to the urls category. |
| Email headers | From: [email protected] | No | Remove all email headers and use only the email body as text. If the email body is in HTML, follow the HTML row below to strip the tags. | Email headers are data noise and not natural language for Receptiviti's metrics. They do not indicate underlying psychology or emotion. They count towards the total number of words and thereby skew scores. |
| Email metadata | Mon, 24 Aug 2024 10:16:07 -0700 (PDT) | No | Remove all email metadata and use only the email body as text. If the email body is in HTML, follow the HTML row below to strip the tags. | Email metadata are data noise and not natural language for Receptiviti's metrics. They do not indicate underlying psychology or emotion. They count towards the total number of words and thereby skew scores. |
| Email footers and confidentiality disclaimers | Head office: 150 Bloor St. West, Suite 310, Toronto, Ontario | No | Remove all email footers and legal disclaimers, and use only the email body as text. | Email footers and confidentiality disclaimers are data noise and not natural language for Receptiviti's metrics. They do not indicate underlying psychology or emotion. They count towards the total number of words and thereby skew scores. |
| HTML | <!DOCTYPE html> | No | Strip all HTML tags and retain only the relevant content within them. For example, text within <p> tags could be natural language and therefore valid for analysis. | HTML tags specify formatting, not naturally spoken or written language. The text within some HTML tags may be useful, depending on your application. Tools like BeautifulSoup can help you extract it. |
| Code | Print("Hello World") | No | Remove all code snippets from your text. | Code snippets are not natural language and do not indicate underlying psychology or emotion. They count towards the total number of words and thereby skew scores. |
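Several of the actions in the table above can be sketched with the Python standard library alone. The example below is a minimal illustration, not an official preprocessing pipeline: all function names are hypothetical, the HTML stripping uses `html.parser` rather than BeautifulSoup, the header removal assumes a single-part plain email, and the `"content"` key in the JSON payload is a placeholder rather than a confirmed field name (check the API reference for the actual request schema). Footers and legal disclaimers vary too much for a generic rule and are not handled here.

```python
import json
import re
from email import message_from_string
from html.parser import HTMLParser

# Tags that usually imply a visual break, so neighbouring text is not glued together.
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "td", "h1", "h2", "h3", "h4", "h5", "h6"}


class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores anything inside <script> or <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.chunks.append(" ")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self.chunks.append(" ")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chunks.append(data)


def strip_html(markup):
    """Return only the text content of an HTML fragment, whitespace-normalized."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join("".join(parser.chunks).split())


def email_body(raw_message):
    """Drop the header block (From:, Date:, ...) of a single-part email."""
    return message_from_string(raw_message).get_payload().strip()


def drop_urls_and_mentions(text):
    """Optional step: remove URLs and @mentions when they are noise for the use case."""
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"(?<!\w)@\w+", " ", text)
    return " ".join(text.split())


def to_utf8_json(text):
    """Serialize as UTF-8 JSON; "content" is a placeholder key, not a confirmed field."""
    return json.dumps({"content": text}, ensure_ascii=False).encode("utf-8")


raw_email = (
    "From: sender@example.com\n"
    "Date: Mon, 24 Aug 2024 10:16:07 -0700 (PDT)\n"
    "\n"
    "<html><body><p>Thanks for the demo!</p><p>I'm excited 😂</p></body></html>\n"
)

clean = strip_html(email_body(raw_email))  # headers and tags removed, emoji kept
payload = to_utf8_json(clean)              # UTF-8 encoded bytes
```

Hashtags and emojis are deliberately left untouched, since the table recommends retaining them.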

Language Support in the Receptiviti Platform

The API supports English only. However, it works well with text that has been machine-translated into English from languages sufficiently similar to English (Spanish, French, German, Italian, Dutch, Portuguese, etc.).

Independent research confirms the validity of applying machine translation before analyzing text with Receptiviti measures (see citation below).

This process can be applied across our frameworks, and many of our clients successfully use machine translation in their workflows.

For users looking to integrate machine translation, numerous advanced machine translation services are readily available online.

👉 Download the research paper here

Citation

Understanding Word Count Discrepancies

You might notice that the word count returned in the API response under summary/word_count differs from the count shown by the text editor that contains your samples. This is because the Receptiviti API separates words on hyphens and apostrophes. For example, "he's our go-to guy" is counted as six words: he, s, our, go, to, guy.

From a counting perspective, the Receptiviti API treats words primarily as tokens, and in some cases counts their individual parts rather than the complete, contextual word. It breaks the text into discrete components (for example, "it's" becomes "it" and "s") and then processes these fragments separately.
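To see why the two counts diverge, the sketch below compares a whitespace-based count, similar to what a text editor reports, with a count that also splits on hyphens and apostrophes. It is an approximation under stated assumptions, not the API's actual tokenizer: the function name is hypothetical, and special cases such as hashtags, URLs, and @mentions are not modelled.

```python
import re


def approximate_api_tokens(text: str) -> list[str]:
    """Split on anything that is not a letter, digit or underscore,
    which includes hyphens and straight or curly apostrophes."""
    return re.findall(r"\w+", text)


sample = "he's our go-to guy"

editor_count = len(sample.split())       # whitespace-delimited words
tokens = approximate_api_tokens(sample)  # hyphen- and apostrophe-separated pieces
api_like_count = len(tokens)

print(editor_count, api_like_count, tokens)
```

For this sample the whitespace count is 4 while the token-based count is 6, matching the example above.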