One thing that really makes our software stand out is our world-class analyst support. Time and time again we hear from clients how much they enjoy our real-time chat support. Anytime you need help within Looker, it's only a click away. All of our support conversations feed through Zendesk and eventually make their way through the process described below.
The body of each chat event is stored in a character varying (65535) field within Salesforce. The schema is pretty simple: there's a timestamp, a ticket id, and the body of the chat. The chats have a certain signature/format to them: the first row for a given ticket is the start event, which contains metadata about the chat session. After the start we see some number of dialog rows, followed by a concluding closing-chat event. If the chat extends over a longer period, we'll see multiple rows for a given ticket, since the size of the body column is limited. Typically, though, we just see three rows. The rows we're interested in are the ones where the actual dialog takes place, i.e., everything between the start and close events. When we extract the chat events from the Salesforce tables, they are written as one line per chat. For example, below is a chat session minus the beginning/ending metadata records:
Chat started: 2016-08-22 08:09 AM UTC
(08:09:40 AM) Susan Smith: hey there!
(08:09:53 AM) *** Larry joined the chat ***
(08:09:54 AM) Larry: Hi Susan!
(08:10:15 AM) Susan Smith: I love Looker!
(08:10:30 AM) Larry: Me too!
(08:12:19 AM) *** Larry left the chat ***
Once parsed, the session above breaks down into one row per chat line, with the join/leave metadata events stripped out:
 event_id | ticket_id | created_at          | participant | body
----------+-----------+---------------------+-------------+----------------------------
        1 | 51234     | 2016-08-22 08:09:40 | Susan Smith | hey there!
        2 | 51234     | 2016-08-22 08:09:54 | Larry       | Hi Susan!
        3 | 51234     | 2016-08-22 08:10:15 | Susan Smith | I love Looker!
        4 | 51234     | 2016-08-22 08:10:30 | Larry       | Me too!
The format of the chat event body can be pretty much anything someone can copy and paste or type into the chat dialog window. For example, we often see people pasting in snippets of code and LookML:
(10:06:02 PM) Bob Jones: dimension: MKTG_Qualified_Lead
type: yesno
sql:
${TABLE}.test_lead_c = 'false'
AND ${TABLE}.Email NOT LIKE '@test.com'
So the high-level approach is pretty straightforward and follows a workflow like this:
- Unload the Salesforce data to S3
- Process/parse the chats in S3
- Upload the resulting JSON data to S3
- Load into Redshift via the COPY command
All of the above steps were bundled up into a Drake workflow. I'll go over the more interesting parts of the workflow below. Things like syncing files back and forth, loading into Redshift, etc. are grunt work and don't really need to be covered.
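For flavor, here is a minimal sketch of what such a Drakefile could look like. Every file, bucket, and script name below is hypothetical, not the actual workflow:

```
; Hypothetical Drakefile sketch. Each step is "outputs <- inputs" followed
; by indented shell commands; Drake re-runs a step when its inputs change.

; 1. Pull the scheduled Salesforce unload down from S3
data/chats.csv <-
  aws s3 cp s3://my-bucket/salesforce/chats.csv $OUTPUT

; 2. Parse the chats into newline-delimited JSON
data/chats.json <- data/chats.csv
  lein run -m chats.parse $INPUT $OUTPUT

; 3. Push the JSON back to S3 and COPY it into Redshift
data/loaded.flag <- data/chats.json
  aws s3 cp $INPUT s3://my-bucket/chats/chats.json
  psql -h my-redshift-host -f load_chats.sql
  touch $OUTPUT
```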
Unload Salesforce data to S3
The Salesforce data is unloaded into S3 as a CSV file on a scheduled cadence. I used the control character ETX/^C as the field delimiter instead of, say, a | (pipe). Since there could be (and are) pipes in the chat messages, the safest approach is a non-printing character. When extracting the body from Salesforce, I also decided to replace all the \n characters in the body with SOH/^A. We're unlikely to see control characters in text that a user types, and this also puts all the rows of text on one line, making them easier to read and process. Below is a trimmed-down version of what the data in the CSV looks like after the export:
51119^C2016-08-22 16:58:49^CChat started: 2016-08-22 04:49 PM UTC^A^A(04:49:11 PM) Jen Smith: hi, i have a question on how to add a look to a dashboard^A(04:49:23 PM) *** Steve Jones joined the chat ***^A(04:49:34 PM) Steve Jones: Hi Jen! How's it going today?^A
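To make the encoding concrete, here's a small sketch (in Python, purely for illustration; the actual pipeline is Clojure) of how one exported record splits back apart: fields on ETX (\x03), with SOH (\x01) restored to newlines:

```python
# One exported record: fields separated by ETX (\x03), and the body's
# original newlines replaced with SOH (\x01) so the record stays on one line.
record = ("51119\x032016-08-22 16:58:49\x03"
          "Chat started: 2016-08-22 04:49 PM UTC\x01\x01"
          "(04:49:11 PM) Jen Smith: hi, i have a question\x01"
          "(04:49:23 PM) *** Steve Jones joined the chat ***\x01")

# Split the record into its three fields on the ETX delimiter.
ticket_id, created_at, body = record.split("\x03")

# Restore the body's newlines by mapping SOH back to \n.
body = body.replace("\x01", "\n")

print(ticket_id)             # 51119
print(created_at)            # 2016-08-22 16:58:49
print(body.splitlines()[0])  # Chat started: 2016-08-22 04:49 PM UTC
```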
Process/parse the chats in S3
This was by far the most interesting part of the whole ETL workflow, for a number of reasons. For one, a chat message is defined as being between two timestamps, with timestamps in the format (08:09:40 AM). The chat body/message is everything after a timestamp, up to but not including the next timestamp. The message could be as simple as some text like "Hello Joe", or more complicated, like someone pasting code and/or LookML with embedded newlines and who-knows-what characters.
The "chatter" is defined as everything from the timestamp, up to but not including the colon. We also don't know whether people will go by just their first name, first and last, or other crazy permutations.
Since a chat can run longer than the 65535-character limit of a row, messages do occasionally span multiple rows in the database. The start events are pretty straightforward: they are marked by the text "Chat started" followed by the date and time. Below is an example chat session from the Salesforce database. Note that this chat session started and continued for an extended period of time, so there are multiple row entries that need to be parsed.
48962 | 2016-08-01 21:34:18 |
Chat started: 2016-08-01 09:18 PM UTC
(09:18:11 PM) Bob Smith: hi
(09:18:16 PM) *** Larry joined the chat ***
48962 | 2016-08-02 00:36:07 |
(10:05:00 PM) Bob Smith: hi, I'm back
(10:05:09 PM) *** Larry joined the chat ***
(10:05:14 PM) Larry: Hi Bob!
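Just to make those boundary rules concrete, here's a toy sketch (Python, for illustration only; as you'll see, this naive regex approach is exactly the road I decided not to go down): split a flattened body on the (hh:mm:ss AM/PM) timestamps, then peel the chatter off everything up to the first colon:

```python
import re

# A flattened chat body: messages delimited only by (hh:mm:ss AM/PM) stamps.
body = ("(09:18:11 PM) Bob Smith: hi\n"
        "(09:18:16 PM) *** Larry joined the chat ***\n"
        "(10:05:14 PM) Larry: Hi Bob!")

# A message starts at a timestamp and runs up to (not including) the next one.
stamp = re.compile(r"\((\d{2}:\d{2}:\d{2}\s[AP]M)\)\s")
parts = stamp.split(body)[1:]                  # drop text before the first stamp
messages = list(zip(parts[::2], parts[1::2]))  # (timestamp, rest-of-line) pairs

for ts, rest in messages:
    if rest.startswith("***"):          # join/leave events carry no dialog
        continue
    chatter, text = rest.split(":", 1)  # chatter = everything up to the colon
    print(ts, "|", chatter, "|", text.strip())
# prints: 09:18:11 PM | Bob Smith | hi
#         10:05:14 PM | Larry | Hi Bob!
```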
The light bulb in my head turned on: "Hey, I have a great idea on how to do this! I'll use Regular Expressions..." But, as the saying goes, famously coined by Jamie Zawinski (an early Netscape engineer):
Some people, when confronted with a problem, think “I know, I'll use regular expressions.”
Now they have two problems.
From my own experience, having lived through the pain and suffering behind the above quote, I knew it would be prudent to use regular expressions gingerly. After noodling on this for a bit, I remembered playing around with LPeg to parse log files with Lua and Hadoop Streaming. I know, this sounds batshit crazy, and it was, but Hadoop Streaming allowed me to easily and quickly parallelize my Lua scripts, and LPeg allowed for powerful parsing. So I knew there was a more powerful way to parse text, and decided to dust off the grammar cobwebs and try using Parsing Expression Grammars (PEG) and/or context-free grammars (CFG) again. A CFG is non-deterministic, which means that some input could result in two or more possible parse trees, whereas a PEG is deterministic, meaning that any input can only be parsed one way. I remembered seeing a presentation by Mark Engelberg introducing Instaparse at either Clojure/West or the Conj a few years back. I won't go into all the details of using Instaparse (you can read them on the GitHub page), but I can say that Instaparse, along with working in the Clojure REPL, allowed me to easily and quickly iterate while developing a "rule" that would parse the logs.
After a few hours of tinkering in the REPL with Instaparse, I had a prototype that would parse the Zendesk chats. For example, the following rule:
(insta/parser
"S ::= chat-header? chat-line (<line-delimiter> chat-line)*
<chat-banner> ::= (<'Chat'> <space> <'started'> <[colon]> <[space]> <['on']> <space> <chat-session-start-date> <space> <'UTC'> <line-delimiter>+ )
<lparen> ::= <'('>
<rparen> ::= <')'>
<colon> ::= ':'
<line-delimiter> ::= #'\\u0001'
<chat-line-delimiter> ::= #'\\(\\d{2}:\\d{2}:\\d{2}\\s(AM|PM)\\)'
<new-line> ::= '\n' | '\r\n'
<space> ::= #'\\s+'
<number> ::= #'[0-9]+'
<dash> ::= '-'
created-at-date ::= date
chat-session-start-date ::= date <space> <hour-and-min> <space> <am-or-pm>
hour-and-min ::= number colon number
hour-min-sec ::= number colon number colon number
chat-line-timestamp ::= lparen number colon number colon number space #'AM|PM' rparen
date ::= number dash number dash number
line ::= #'.*?(?=\\u0001\\(\\d{2}:\\d{2}:\\d{2}\\s(AM|PM)\\))' / #'.*'
join-or-left-chat ::= #'(joined|left)'
is-now ::= 'is now known as'
am-or-pm ::= #'(AM|PM)'
<chat-header> ::= chat-banner
chat-line ::= chat-line-timestamp <space> chatter line?
a-chatter ::= #'[\\p{L}-\\'\\.\\@\\(\\)\\+[0-9]]+'
chatter ::= <'*** '> a-chatter (<space?> a-chatter?)* <join-or-left-chat> <space> <'the chat ***'> <new-line>? / a-chatter (<space?> a-chatter?)* <is-now> <space> <a-chatter?>* / a-chatter (<space?> a-chatter?)* <colon> <space>
")
would parse each row of text into an Array Map of chats:
{:tag :chat-line,
:content ({:tag :chat-line-timestamp, :content ("12" ":" "05" ":" "35" " " "AM")}
{:tag :chatter,
:content ({:tag :a-chatter, :content ("Bob")} {:tag :a-chatter, :content ("Jones")})}
{:tag :line, :content ("want to basically see if a student is active between those dates")})}
From here, a couple more transformations were done to add some additional information and finally format the output in JSON.
(defn process-chat-line [l]
  ;; A line is one exported record: ticket-id ^C created-at ^C flattened chat body
  (let [v (clojure.string/split l #"\u0003")
        [ticket-id created-at chat-line] v
        ;; Parse the body with the Instaparse grammar, then normalize the tree
        chats (-> (cfg/parse-chat chat-line)
                  (cfg/xform-chat-line-timestamp)
                  (cfg/xform-chatter)
                  (cfg/xform-chat-line)
                  (cfg/xform-chat-lines)
                  :content)
        ;; Stamp each chat with its ticket/date metadata and clean up the text
        xformed-chats (-> (cfg/add-ticket-id-and-chat-date ticket-id created-at chats)
                          (cfg/filter-empty-chats)
                          (cfg/replace-characters-in-chat-lines)
                          (cfg/created-at))]
    xformed-chats))
So in the end, we had newline-delimited JSON files that we load into Redshift and BigQuery. Example:
{"chat-date":"2016-08-29","ticket-id":"51877","chat-line-timestamp":"06:03:06 PM","chatter":"Bob Jones","chat-line":"Opening up the look...","created-at":"2016-08-29 06:03:06 PM"}
{"chat-date":"2016-08-29","ticket-id":"51877","chat-line-timestamp":"06:03:09 PM","chatter":"Susan Smith","chat-line":"ok","created-at":"2016-08-29 06:03:09 PM"}
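On the Redshift side, newline-delimited JSON like this loads with a single COPY. A hypothetical sketch (table, bucket, and role names are invented here, and dashed JSON keys like "chat-line" would in practice need quoted column names or a jsonpaths file to map onto):

```
-- Hypothetical names throughout; JSON 'auto' matches JSON keys to
-- same-named columns in the target table.
COPY chat_events
FROM 's3://my-bucket/chats/chats.json'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
FORMAT AS JSON 'auto';
```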
From here, we give it a little Looker Luv and we're in business. For example, say you want to know the average wait time for one of our top-notch analysts to jump on a chat?
Or who was the most active analyst in the last week?
So there you have it, Bob's your uncle!