Friday, September 30, 2016

JQ is a Kung fu Swiss Army knife!



If you need to do some basic reformatting of JSON data that's already been processed, try reaching for the Swiss Army knife of command line JSON Kung Fu, JQ. There are a lot of different JSON parsing libraries out there; name your favorite language and there are bound to be at least a few you can use. Sometimes I just want to git 'R dun and not have to deal with writing a program, compiling it, and then deploying it. For those times, the standard Unix toolkit and JQ can get you a long way quickly.


For instance, recently I was adding support for loading our Zendesk chat data into Google Big Query.  We currently load this data into Redshift, but we're moving towards using Big Query as our datastore for more and more datasets.  There were three things that I had to do with the existing data.  First was to transform all the JSON keys to snake case, second was to transform the timestamp field into a timestamp that I could load into Big Query, and third was to select out certain keys I wanted to load into the chat table.


So the JSON looks like this:




{"chat-date":"2016-03-01","ticket-id":"12345","chat-line-timestamp":"05:52:20 PM","chatter":"Mr. Magoo",
"chat-line":"Hello","created-at":"2016-03-01 05:52:20 PM"}
{"chat-date":"2016-03-01","ticket-id":"12345","chat-line-timestamp":"05:52:41 PM","chatter":"Daisy Duke",
"chat-line":"Hey Mr. Magoo","created-at":"2016-03-01 05:52:41 PM"}
{"chat-date":"2016-03-01","ticket-id":"12345","chat-line-timestamp":"05:52:44 PM","chatter":"Daisy Duke",
"chat-line":"How's it going?","created-at":"2016-03-01 05:52:44 PM"}


and once we run it thru the meat grinder we get what we want:




{"ticket_id":"12345","chatter":"Mr. Magoo","chat_line":"Hello","created_at":1456854740}
{"ticket_id":"12345","chatter":"Daisy Duke","chat_line":"Hey Mr. Magoo","created_at":1456854761}
{"ticket_id":"12345","chatter":"Daisy Duke","chat_line":"How's it going?","created_at":1456854764}



A quick sanity check with gawk confirms the converted epoch values line up with the original timestamps:

Dan-Young:~ dan$ gawk 'BEGIN { print strftime("%c", 1456854764); }'
Tue 01 Mar 2016 05:52:44 PM UTC


The easy JQ one-liner that made this all possible:



jq -r 'to_entries
  | map({"key": .key | gsub("-";"_")}
        + {"value": (if (.key == "created-at")
                     then (.value | strptime("%Y-%m-%d %I:%M:%S %p") | mktime)
                     else .value end)})
  | from_entries
  | with_entries(select(.key == ("chatter","created_at","ticket_id","chat_line")))
  | tostring'

So next time you need to Kung fu some JSON, give the command line a try!  It's magic sauce....


So there you have it, Bob's your uncle!


Saturday, September 10, 2016

Parsing Zendesk Chat transcripts with Instaparse


One thing that really makes our software stand out is our world-class analyst support. Time and time again we hear from clients how much they enjoy our real-time chat support.  Anytime you need help within Looker, it's only a click away.  All of our support conversations feed through Zendesk, and eventually make their way through the process described below.
   

  
    The body of the chat event is stored in a character varying (65535) field within Salesforce.  The schema is pretty simple: there's a timestamp, a ticket id, and the body of the chat.  The chats follow a certain signature/format, where the first row for a given ticket is the start event, which contains metadata about the chat session.  After the start we see some number of rows, followed by a concluding closing event.  If the chat extends over a longer period, we'll see multiple rows for a given ticket since the size of the body column is limited; typically though, we just see three rows.  The rows we're interested in are the ones where the actual dialog takes place, i.e. everything between the start and closing events.  When we extract the chat events from the Salesforce tables, they are written out as one line per chat.  Below is an example of a chat session, minus the beginning/ending metadata records:


Chat started: 2016-08-22 08:09 AM UTC

(08:09:40 AM) Susan Smith: hey there!

(08:09:53 AM) *** Larry joined the chat ***

(08:09:54 AM) Larry: Hi Susan!

(08:10:15 AM) Susan Smith: I love Looker!
(08:10:30 AM) Larry: Me too!
(08:12:19 AM) *** Larry left the chat ***



    The final output of the ETL process is a series of rows with an event_id, ticket_id, created_at, participant, and the body of the chat. Using the above as an example, we would have four rows, throwing out the joined/left-the-chat events and the Chat started/ended records.


 event_id | ticket_id |     created_at      |     participant     |      body
----------+-----------+---------------------+---------------------+-----------------
 1        | 51234     | 2016-08-22 08:09:40 | Susan Smith         | hey there!
 2        | 51234     | 2016-08-22 08:09:54 | Larry               | Hi Susan!
 3        | 51234     | 2016-08-22 08:10:15 | Susan Smith         | I love Looker!
 4        | 51234     | 2016-08-22 08:10:30 | Larry               | Me too!

    The format of the chat event body can be pretty much anything someone can cut and paste or type into the chat dialog window.  For example, we often see people pasting in snippets of code and LookML:



(10:06:02 PM) Bob Jones: dimension: MKTG_Qualified_Lead
 type: yesno
   sql:

     ${TABLE}.test_lead_c = 'false'

      AND ${TABLE}.Email NOT LIKE '@test.com'


    So the high-level approach is pretty straightforward and follows a workflow like this:

  1. Unload the Salesforce data to S3
  2. Process/parse the chats in S3
  3. Upload the resulting JSON data to S3
  4. Load into Redshift via the COPY command

    All the above steps were bundled up into a Drake workflow. I'll go over the more interesting parts of the workflow below.  Things like syncing back and forth, loading into Redshift, etc. are grunt work and don't really need to be covered.

Unload Salesforce data to S3

    The Salesforce data is unloaded into S3 as a CSV file on a scheduled cadence. I used the control character ETX/^C as the field delimiter instead of, say, a | (pipe), since there could be, and are, pipes in the chat messages; a non-printing character is the safest choice.  When extracting the body from Salesforce, I also replaced all the \n characters in the body with SOH/^A.  More than likely we will not see control characters in the text a user enters, and this also puts the entire chat body on one line, making it easier to read and process.  Below is a trimmed down version of what the data in the CSV looks like after the export.


51119^C2016-08-22 16:58:49^CChat started: 2016-08-22 04:49 PM UTC^A^A(04:49:11 PM) Jen Smith: hi, i have a question on how to add a look to a dashboard^A(04:49:23 PM) *** Steve Jones joined the chat ***^A(04:49:34 PM) Steve Jones: Hi Jen! How's it going today?^A
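
To make the delimiter scheme concrete, here's a minimal, hypothetical Clojure sketch (not the production code) of pulling one of these exported rows apart; the function and key names are just for illustration:

(require '[clojure.string :as str])

;; Split one exported row on the ETX (\u0003) field delimiter.  The body keeps
;; its SOH (\u0001) characters; the Instaparse grammar later in this post treats
;; \u0001 as the chat line delimiter.
(defn split-exported-row [row]
  (let [[ticket-id created-at body] (str/split row #"\u0003" 3)]
    {:ticket-id  ticket-id
     :created-at created-at
     :body       body}))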


Process/parse the chats in S3

    This was by far the most interesting part of the whole ETL workflow, for a number of reasons. For one, a chat message is defined as being between two timestamps, with the timestamp in the format (08:09:40 AM). The chat body/message is everything after that timestamp up to, but not including, the next timestamp. The message could be as simple as some text like "Hello Joe" or more complicated, like someone pasting code and/or LookML with embedded newlines and who-knows-what characters.
    The "chatter" is defined as everything from the timestamp up to, but not including, the colon. Also, we don't know if people will go by just their first name, first and last, or other sorts of crazy permutations.
    Since a chat can grow longer than the 65535 character limit of a row, messages do occasionally span multiple rows in the database.  The starting event is pretty straightforward: it's defined by the text "Chat started" followed by the date and time. Below is an example chat session from the Salesforce database.  Note that the chat session started and continued for an extended period of time, so we had multiple row entries that needed to be parsed.


48962 | 2016-08-01 21:34:18 | 
Chat started: 2016-08-01 09:18 PM UTC                                                         

(09:18:11 PM) Bob Smith: hi

(09:18:16 PM) *** Larry joined the chat ***


48962 | 2016-08-02 00:36:07 | 

(10:05:00 PM) Bob Smith: hi, I'm back

(10:05:09 PM) *** Larry joined the chat ***

(10:05:14 PM) Larry: Hi Bob!


    The light bulb in my head turned on: "Hey, I have a great idea on how to do this! I'll use Regular Expressions..."  But as the saying goes, famously attributed to Jamie Zawinski (an early Netscape engineer),


Some people, when confronted with a problem, think  “I know, I'll use regular expressions.”   
Now they have two problems.

    From my own experience, having lived through the pain and suffering behind the above quote, I knew it would be prudent to use Regular Expressions gingerly.  After noodling on this for a bit, I remembered playing around with LPeg to parse log files with Lua and Hadoop Streaming.  I know, this sounds bat shit crazy, and it is and was, but Hadoop Streaming allowed me to easily and quickly parallelize my Lua scripts, and LPeg allowed for powerful parsing. So I knew there is a more powerful way to parse text and decided to dust off the grammar cobwebs and try using Parsing Expression Grammars (PEG) and/or Context-Free Grammars (CFG) again.  A CFG is non-deterministic, which means that some input could result in two or more possible parse trees, whereas a PEG is deterministic, meaning that any input can only be parsed one way. I remember seeing a presentation by Mark Engelberg introducing Instaparse at either Clojure West or the Conj a few years back. I won't go into all the details of using Instaparse (you can read the details on the GitHub page), but what I can say is that Instaparse, along with working in the Clojure REPL, allowed me to easily and quickly iterate while developing a "rule" that would parse the logs.
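
As a toy illustration of that REPL workflow (this is not the real grammar, which follows below), a rule for just the "(08:09:40 AM)" timestamp prefix might look like:

(require '[instaparse.core :as insta])

;; Toy parser for the timestamp prefix only, using Instaparse's default hiccup
;; output; the real parser output shown later in the post is the :enlive map format.
(def timestamp-parser
  (insta/parser
    "timestamp ::= <'('> hour <':'> minute <':'> second <' '> meridiem <')'>
     hour      ::= #'\\d{2}'
     minute    ::= #'\\d{2}'
     second    ::= #'\\d{2}'
     meridiem  ::= 'AM' | 'PM'"))

;; (timestamp-parser "(08:09:40 AM)")
;; => [:timestamp [:hour "08"] [:minute "09"] [:second "40"] [:meridiem "AM"]]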


    After a few hours of tinkering in the REPL with Instaparse, I had a prototype that would parse the Zendesk chats.  For example, the following rule:



;; assumes (require '[instaparse.core :as insta]); the :enlive output format
;; produces the {:tag ... :content ...} maps shown below
(def parse-chat
  (insta/parser
    "S ::=  chat-header? chat-line (<line-delimiter> chat-line)*
     <chat-banner> ::= (<'Chat'> <space> <'started'> <[colon]> <[space]> <['on']> <space> <chat-session-start-date> <space> <'UTC'> <line-delimiter>+ )
     <lparen> ::= <'('>
     <rparen> ::= <')'>
     <colon> ::= ':'
     <line-delimiter> ::= #'\\u0001'
     <chat-line-delimiter> ::= #'\\(\\d{2}:\\d{2}:\\d{2}\\s(AM|PM)\\)'
     <new-line> ::= '\n' | '\r\n'
     <space>  ::= #'\\s+'
     <number> ::= #'[0-9]+'
     <dash> ::= '-'
     created-at-date ::= date
     chat-session-start-date ::= date <space> <hour-and-min> <space> <am-or-pm>
     hour-and-min ::= number colon number
     hour-min-sec ::= number colon number colon number
     chat-line-timestamp ::=  lparen number colon number colon number space #'AM|PM' rparen
     date ::= number dash number dash number
     line ::= #'.*?(?=\\u0001\\(\\d{2}:\\d{2}:\\d{2}\\s(AM|PM)\\))' / #'.*'
     join-or-left-chat ::= #'(joined|left)'
     is-now ::= 'is now known as'
     am-or-pm ::= #'(AM|PM)'
     <chat-header> ::= chat-banner
     chat-line ::=  chat-line-timestamp <space> chatter line?
     a-chatter ::= #'[\\p{L}-\\'\\.\\@\\(\\)\\+[0-9]]+'
     chatter ::= <'*** '> a-chatter (<space?> a-chatter?)* <join-or-left-chat> <space> <'the chat ***'> <new-line>? / a-chatter (<space?> a-chatter?)* <is-now> <space> <a-chatter?>* / a-chatter (<space?> a-chatter?)* <colon> <space>
     ")

would parse each row of text into a tree of {:tag ... :content ...} maps, one :chat-line node per chat message:


{:tag :chat-line,
 :content ({:tag :chat-line-timestamp, :content ("12" ":" "05" ":" "35" " " "AM")}
            {:tag :chatter,
             :content ({:tag :a-chatter, :content ("Bob")} {:tag :a-chatter, :content ("Jones")})}
            {:tag :line, :content ("want to basically see if a student is active between those dates")})}

    From here, a couple more transformations were done to add some additional information and finally format the output in JSON.  


;; l is one exported row: ticket id, created-at, and the chat body, separated by \u0003 (ETX)
(defn process-chat-line [l]
  (let [v (clojure.string/split l #"\u0003")
        [ticket-id created-at chat-line] v
        chats (-> (cfg/parse-chat chat-line)
                  (cfg/xform-chat-line-timestamp)
                  (cfg/xform-chatter)
                  (cfg/xform-chat-line)
                  (cfg/xform-chat-lines)
                  :content)

        xformed-chats (-> (cfg/add-ticket-id-and-chat-date ticket-id created-at chats)
                          (cfg/filter-empty-chats)
                          (cfg/replace-characters-in-chat-lines)
                          (cfg/created-at))]
    xformed-chats))
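
The cfg/xform-* helpers referenced above aren't shown here; as a rough, hypothetical sketch of what one of them might do, Instaparse's insta/transform can rewrite tagged nodes in place, for example collapsing a :chat-line-timestamp node back into a plain string:

;; Hypothetical sketch only -- not the actual cfg/xform-chat-line-timestamp.
;; insta/transform replaces each node whose tag matches a key in the map with
;; the result of calling the mapped fn on that node's content.
(def collapse-timestamp
  {:chat-line-timestamp (fn [& pieces] (apply str pieces))})

;; (insta/transform collapse-timestamp parsed-chat)
;; => the same tree, with ("12" ":" "05" ":" "35" " " "AM") collapsed to "12:05:35 AM"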

    So in the end, we had newline-delimited JSON files that we load into Redshift and Big Query.  Example:



{"chat-date":"51877","chat-line-timestamp":"06:03:06 PM","chatter":"Bob Jones","chat-line":"Opening up the look...","created-at":"2016-08-29 06:03:06 PM"}
{"chat-date":"2016-08-29","ticket-id":"51877","chat-line-timestamp":"06:03:09 PM","chatter":"Susan Smith","chat-line":"ok","created-at":"2016-08-29 06:03:09 PM"}





  

From here, we give it a little Looker Luv and we're in business. For example, say you want to know the average wait time for one of our top-notch analysts to jump on chat?





Or who was the most active analyst in the last week?





So there you have it, Bob's your uncle!



Keeping in-sync with Salesforce

Keeping two or more data stores in-sync is always a fun task. With the help of Apache NiFi and some decent tooling, I've been able to...