Talk:Machine Learning: Difference between revisions

Revision as of 03:53, 7 March 2014

Feb. 27, 2014

Folks met and hacked on the noisebridge discuss mailing list. We created a 102MB text dump, and a python script to parse it, File:Py-piper-parser.txt. We wrote pseudo code to implement a Naive Bayesian filter to protect the world from trolls. Will implement soon.

March 6, 2014

Group compared notes on list-scraping code and then delved into details of algorithm via tracing simple example.

user presented w message user labels spam/not spam or scale user presented with rating: 50% naive after user labels, algorithm readjusts

very beginning first step: assign prior (50% of all messages are spam) also have a likelihood that each word is spam or not spam

- combines to total 50%  (also consider log-likelihood)

next step: algorithm decides per message this is end of phase zero (PERFORMANCE PHASE) algorithm performing without human intervention

next: TRAINING PHASE first step of training phase examine message, assign a rating (spamicity ie 7/7 or drama-tag or spam/not spam) then save rating associated with message ie write to vector (each index corresponds to a message)

spamicity vector: msg0 msg1 msg2 msg3 1 0 0 3

then update dictionary what dictionary? temporary dictionary in the first step every dictionary item has a frequency count

now ... after (say) 1000 messages, algorithm guesses mostly correctly 'spam or not-spam' (is this TESTING PHASE ?) training continues via occasional instances of (human) correction update dictionary with each word in (human?) rated message

one possibly viable dictionary structure: {'word':[counted_in_spam, counted_in_not_spam]}

so, algorithm might operate as per this trace:

msg[0]: 'foo bar' SPAM msg[1]: 'foo foo bar foo bar' HAM msg[2]: 'bar bar bar foo' ... WHAT IS THIS ???? Can consult algorithm because now we have SPAM and HAM so can get bayes-informed result

dictionary after msg[0] {'foo': [1, 0], 'bar': [1, 0]}

after msg[1] {'foo': [1, 3], 'bar': [1, 2]}

NOW WE ARE AT msg[2] WHAT IS THIS ???

THIS LOOKS EASIER TO SOLVE THIS TIME !!!!!!!

WE HAVE A VECTOR OF SPAM/HAM IT LOOKS LIKE THIS: ['s', 'h'] OR IF YOU LIKE BINARY [True, False] OR [0, 1] OK I GET IT !!!!

WE HAVE A VECTOR AND A DICTIONARY and a message NOW WHAT ??? {'foo': [1, 3], 'bar': [1, 2]} [0, 1] msg[2]: 'bar bar bar foo' ... WHAT IS THIS ????

A: probability of 'foo' | spam = 1 B: probability of spam = 0.5 C: probability of 'foo' | ham = 3

... wtf ???
this gets normalized later ? maybe 
sam hopes this cancels out without being painful

D: probability of ham = 0.5

= 0.25 likelihood given 'foo'

(1 * 0.5) / ((1 * 0.5) + (3 * 0.5)) A * B / ((A * B) + (C * D)) = 0.25 likelihood given 'foo'

1(.5) 1 = prob of foo in message | spam .5 = prob of any word | spam .5 = prob of 'foo' in (first) word | spam .5 = A (normed)

3 = prob of foo in message(?) | not spam 1/5 = prob of word | ham .6 = prob of foo in (first) word | ham .6 = C (normed)

likelihood given 'bar' (1 * 0.5) / ((1 * 0.5) + (2 * 0.5)) = 0.3333... (1/3)

(1/3.0)**3 * (1/4.0) / (((1/3.0)**3 * (1/4.0)) + ((2/3.0)**3 * (3/4.0))) = 0.04 The way this is not fully bayesian... p(foo) & p(bar) are interacting... Also, are we normalizing correctly?

If we normalized,

we take into account the following:
 avg freq of words in spam
 avg freq of words in ham

but this is not fully bayesian

because ...  so far ...
we have been assuming independence
 between words (at the full-message level)

python to download and decompress nb-discuss archive

import re
from itertools import chain, islice
from StringIO import StringIO
from gzip import GzipFile
from time import gmtime
from urllib import urlopen
from contextlib import closing

def decompress_url(u):
  with closing(urlopen(u)) as f:
    with closing(StringIO(f.read())) as fs:
      with GzipFile(fileobj = fs) as g:
        return g.read()

def date_in_discuss(m, y):
  if 1 <= m <= 12:
    if y > 2007:
      now = gmtime()
      yy, mm = now.tm_year, now.tm_mon
      if (y < yy) or ((y == yy) and (m <= mm)):
        return True
    elif (y == 2007) and (m >= 11):
        return True
  return False
 
def datestr(m, y):
  try:
    ms = ('January', 'February', 'March',
          'April', 'May', 'June', 'July',
          'August', 'September', 'October',
          'November', 'December')[m - 1]
    return '-'.join((str(y), ms))
  except IndexError:
    return None

def nb_gz_url(m, y, listname='noisebridge-discuss'):
  if not date_in_discuss(m, y):
    return None
  a = 'https://www.noisebridge.net/'
  b = 'pipermail/'
  c = '/'.join((listname, ''))
  d = datestr(m, y)
  e = '.txt.gz'
  return ''.join((a, b, c, d, e))

def all_nb_gz_urls():
  now = gmtime()
  yy, mm = now.tm_year, now.tm_mon
  y, m = 2007, 11
  while (y < yy) or ((y == yy) and (m <= mm)):
    yield nb_gz_url(m, y)
    if m < 12:
      m += 1
    else:
      m = 1
      y += 1

def get_month(month, year):
  u = nb_gz_url(month, year)
  s = decompress_url(u)
  return s

def spew():
  for u in all_nb_gz_urls():
    yield decompress_url(u)

def dump_uncompressed(filename='nb_wtf.txt'):
  with open(filename, 'w') as f:
    for s in spew():
      f.write(s)

def compiled_pattern(key, cache={}):
  try:
    return cache[key]
  except KeyError:
    if key == 'msg_start':
      p = msg_start_pattern()
    elif key == 'msg_stop':
      p = msg_stop_pattern()
    else:
      return None
    cache[key] = re.compile(p)
    return cache[key]

def msg_start_pattern():
  # ... and so it begins:
  # 'From jacob at appelbaum.net  Tue Nov 20 20:20:07 2007'  
  # -> r'^From .*\s+\w{3}\s+\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2} \d{4}$'
  # (return compiled regex roughly equivalent to the above)
  space = r'\s+'
  datestr = space.join((r'\w{3}', r'\w{3}', r'\d+',
                        r'\d{2}:\d{2}:\d{2} \d{4}'))
  pattern = ''.join(('^', 'From .*', space, datestr, '$'))
  return re.compile(pattern)

def msg_stop_pattern():
  anchor = lambda s: ''.join(('^', s, '$'))
  htmldelim = anchor('-------------- next part --------------')
  listdelim = anchor('_______________________________________________')
  pattern = '|'.join((htmldelim, listdelim))
  return re.compile(pattern)

def msglists(s):
  # yields list of strings for each msg in string s
  msg = []
  p = compiled_pattern('msg_start')
  for r in s.splitlines():
    if p.match(r):
      if msg:
        yield msg
        msg = []
    msg.append(r)
  if msg:
    yield msg

def msg2dict(msg):
  # msg is list of strings
  # return dict with headers, contents, cruft
  d = dict()
  p = compiled_pattern('msg_start')
  if not (msg and p.match(msg[0])):
    d['bogus'] = msg
    return d 
  cruft = ''
  ss = iter(msg)
  d['fromkey'] = next(ss)
  header_list = []
  for s in ss:
    t = s.split(':', 1)
    if len(t) != 2:
      try:
        header_list[-1][1] += s
      except IndexError:
        print 'this happened ???'
        header_list.append(['bogus_header', s])
    else:
      k, v = t
      header_list.append([k, v.strip()])
      if k == 'Message-ID':
        break
  d['headers'] = dict(header_list)
  # skip blank line(s)
  s = next(ss)
  while not s:
    s = next(ss)
  contents = [s.rstrip()]
  cruft = []
  p = compiled_pattern('msg_stop')
  for s in ss:
    if p.match(s):
      cruft.append(s)
      break
    else:
      contents.append(s.rstrip())
  d['contents'] = contents
  if cruft:
    cruft.extend([s.rstrip() for s in ss])
    d['cruft'] = cruft
  return d

def msg2smtp(msg):
  smtp = dict()
  msgd = msg2dict(msg)
  headers = msgd['headers']
  q = (('From', 'fromline'),
       ('Date', 'dateline'),
       ('Subject', 'subjectline'))
  for k, v in q:
    try:
      smtp[v] = headers[k]
    except KeyError:
      print 'header not found: ', v
      continue
  smtp['messageline'] = '\n'.join(msgd['contents'])
  return smtp

def dicterator(s):
  for msg in msglists(s):
    yield msg2dict(msg)   

def smtperator(s):
  for msg in msglists(s):
    yield msg2smtp(msg)

Word parsing python script

Function 'get_words' takes list of dictionary of emails. Yields lists of words of in the message, for each message:

 def get_words(lst):
   for d in lst:
     m = d['messageline']
     yield m.split()

Plans to improve by using nltk[1]

Talk:Machine Learning: Difference between revisions

Revision as of 03:53, 7 March 2014

Contents

Feb. 27, 2014

March 6, 2014

python to download and decompress nb-discuss archive

Word parsing python script

Navigation menu

@@ Line 2: / Line 2: @@
 Folks met and hacked on the noisebridge discuss mailing list. We created a 102MB text dump, and a python script to parse it, [[File:Py-piper-parser.txt]]. We wrote pseudo code to implement a Naive Bayesian filter to protect the world from trolls. Will implement soon.
+== March 6, 2014 ==
+Group compared notes on list-scraping code and then delved into details of algorithm via tracing simple example.
+user presented w message
+user labels spam/not spam or scale
+user presented with rating: 50% naive
+after user labels, algorithm readjusts
+very beginning
+first step: assign prior (50% of all messages are spam)
+also have a likelihood that each word is spam or not spam
+ - combines to total 50%  (also consider log-likelihood)
+next step: algorithm decides per message
+this is end of phase zero (PERFORMANCE PHASE)
+algorithm performing without human intervention
+next: TRAINING PHASE
+first step of training phase
+examine message, assign a rating (spamicity ie 7/7 or drama-tag or spam/not spam)
+then save rating associated with message
+ie write to vector (each index corresponds to a message)
+spamicity vector:
+msg0 msg1 msg2 msg3
+     0    0    3
+then update dictionary
+what dictionary?
+temporary dictionary in the first step
+every dictionary item has a frequency count
+now ...
+after (say) 1000 messages, algorithm guesses mostly correctly
+'spam or not-spam' (is this TESTING PHASE ?)
+training continues via occasional instances of (human) correction
+update dictionary with each word in (human?) rated message
+one possibly viable dictionary structure:
+{'word':[counted_in_spam, counted_in_not_spam]}
+so, algorithm might operate as per this trace:
+msg[0]: 'foo bar' SPAM
+msg[1]: 'foo foo bar foo bar' HAM
+msg[2]: 'bar bar bar foo' ... WHAT IS THIS ????
+Can consult algorithm because now we have SPAM and HAM
+so can get bayes-informed result
+dictionary
+after msg[0]
+{'foo': [1, 0], 'bar': [1, 0]}
+after msg[1]
+{'foo': [1, 3], 'bar': [1, 2]}
+NOW WE ARE AT msg[2] WHAT IS THIS ???
+THIS LOOKS EASIER TO SOLVE THIS TIME !!!!!!!
+WE HAVE A VECTOR OF SPAM/HAM
+IT LOOKS LIKE THIS: ['s', 'h']
+OR IF YOU LIKE BINARY [True, False]
+OR [0, 1]
+OK I GET IT !!!!
+WE HAVE A VECTOR AND A DICTIONARY and a message NOW WHAT ???
+{'foo': [1, 3], 'bar': [1, 2]}
+[0, 1]
+msg[2]: 'bar bar bar foo' ... WHAT IS THIS ????
+A: probability of 'foo' | spam = 1
+B: probability of spam = 0.5
+C: probability of 'foo' | ham = 3
+ ... wtf ???
+ this gets normalized later ? maybe
+ sam hopes this cancels out without being painful
+D: probability of ham = 0.5
+= 0.25 likelihood given 'foo'
+(1 * 0.5) / ((1 * 0.5) + (3 * 0.5))
+A * B / ((A * B) + (C * D))
+= 0.25 likelihood given 'foo'
+(.5)
+= prob of foo in message | spam
+.5 = prob of any word | spam
+.5 = prob of 'foo' in (first) word | spam
+.5 = A (normed)
+= prob of foo in message(?) | not spam
+/5 = prob of word | ham
+.6 = prob of foo in (first) word | ham
+.6 = C (normed)
+likelihood given 'bar'
+(1 * 0.5) / ((1 * 0.5) + (2 * 0.5))
+= 0.3333... (1/3)
+(1/3.0)**3 * (1/4.0) / (((1/3.0)**3 * (1/4.0)) + ((2/3.0)**3 * (3/4.0)))
+= 0.04
+The way this is not fully bayesian... p(foo) & p(bar) are interacting...
+Also, are we normalizing correctly?
+If we normalized,
+ we take into account the following:
+  avg freq of words in spam
+  avg freq of words in ham
+but this is not fully bayesian
+ because ...  so far ...
+ we have been assuming independence
+  between words (at the full-message level)
 == python to download and decompress nb-discuss archive ==

Talk:Machine Learning: Difference between revisions

Revision as of 03:53, 7 March 2014

Feb. 27, 2014

March 6, 2014

python to download and decompress nb-discuss archive

Word parsing python script

Navigation menu

Search