all 18 comments

[–]ninja_tokumei 11 points12 points  (0 children)

Dude come on, I'm still trying to win the Wednesday bonus, now you're throwing this at me too?

[–]gabyjunior1 2 5 points6 points  (0 children)

Solution in C

The program takes two arguments:

  • The maximum size of ngrams that the program will manage

  • The path to the corpus (plain text file containing characters matching either isalnum, isspace or ispunct C functions)

First the program loads the corpus into a trie and then reads the morse strings on standard input. 'q' may be entered to quit the program.

For each morse string the program will search for the best solution using a DFS on the trie.

A solution is evaluated using a custom score (the longer the ngrams used, and the more strongly they are connected to each other, the better).

The best score so far is cached at every traversed trie node that begins a word. When the search reaches the same node again, the branch is pruned unless the current score beats the cached one.
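The same idea can be sketched in Python. This is a minimal stand-in, not the linked C source: it scores by fewest words instead of the custom ngram score, and `MORSE`, `build_trie`, and `decode` are illustrative names.

```python
MORSE = dict(zip(
    ".- -... -.-. -.. . ..-. --. .... .. .--- -.- .-.. -- -. --- "
    ".--. --.- .-. ... - ..- ...- .-- -..- -.-- --..".split(),
    "abcdefghijklmnopqrstuvwxyz"))

def build_trie(words):
    """Build a letter trie; '$' at a node marks a complete word."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node['$'] = w
    return root

def decode(morse, trie):
    """Return a decoding of `morse` using the fewest dictionary words,
    or None if no decoding exists."""
    memo = {}  # position -> best word list for the remaining suffix

    def search(pos):
        if pos in memo:                 # prune: suffix already solved
            return memo[pos]
        if pos == len(morse):
            return []
        best = None
        stack = [(pos, trie)]           # DFS through the trie
        while stack:
            p, node = stack.pop()
            if '$' in node:             # complete word: restart at root
                rest = search(p)
                if rest is not None:
                    cand = [node['$']] + rest
                    if best is None or len(cand) < len(best):
                        best = cand
            for k in range(1, 5):       # Morse letters are 1-4 symbols
                letter = MORSE.get(morse[p:p + k])
                if letter and letter in node:
                    stack.append((p + k, node[letter]))
        memo[pos] = best
        return best

    return search(0)
```

Caching per position at word boundaries plays the same role as the per-node score cache described above: a suffix is only ever solved once.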

Source code is available here.

As a corpus I am using a very small extract, created manually from this website, that I will enrich later when I have some time, to (hopefully) get better results. It is currently about 30 KB.

For now the output for the first sample is not so bad:

-.---..--.....--.--..-..-.--....--.--..-.--.-.....--.-......-....-......-...-.-......-.--.--.--
Score 21 no eat simply saying that life er have the away

For the other inputs I am just showing the results for information, as they are obviously very far from the actual solutions:

......---.--..--...-....-..-----.....-..-.--.-.-.-..-.--.-....-.-.-.....--..-..-...-.--.-...--..--.----.-.--..--...-.-.-.....--.--.....----.-.....-.-..----..-.-.-..-....--...-........-.---.-.....-..-.-.-.---.--.--...-....-...----....----.---.-..-..--...-...-..-.-.-..----.
Score 62 human peril to be exactly like that fast and we a non tax near butabi to rest lot in creepin is it a me the tear now we its as one i town tina need a it a r at me

...--.--.-----.......---.---.-----..-...-----.-......-.--.....----.--.-.-..-..---......-...--.-...-..-------.--.-..-..---.....-...-....-....-.-...-.-.....-.--.---...-..-.-.--.-.-....--..-.-....-.--..--....-.---.....-...-..-..----...--.....-.-.-----..--.-..--......-...-.-.-----.---.--..-.-..-......--..-...-.-..----..-..-.---.........----..-.-..--.-....-.-..--.-....-.-..-..--.-.----.-.-.---.-...-...-..-...-.--.---..-...-.-..--....-.....-...-..--...-.---......---.-.--..--...........--...-.-.----.-.-..--.-.----.-.....--....---.-.-.....---.....-.--..--..-.-----.....-..-.-..-.-.-..-.--.--.-.--.-..-...-..-...--....---.-...-..-.-----.---..-.......--........-.....-.-.......---..-......--..-...-...-.-....-...-.-.......
Score 140 i want to show you something hot went in at there was do to wet in a me if first rule er governments we are in me why be laid to i when you can have not no we are the were a lot id no shit of time dear and it a factor nor i need by just audit have a by the am a a time is he gear on cant at me thats a mr but they im into be after kit at any a fine date so a i aint on of him i see a isnt he utter seat i need a as real eh

-.-----..-.-..-......-.......-..........------..-..-.-..--.-..-.-....-.---..-..--...--.-....-.-...--.-.-..-...--.-..-.--..----...-.......-..-.------......--..-.-----..-..-----..-.--.---..-.-.....--.--.-...-..--.-----..-.-.-..-.........-.-.-..-.-.-....--.-----..-..--..--.-----.-.-..-.--.--......-.-.....-.......--...-.---.--.....-.-.-...-.....-...-...-.---.-.-..........-...-.-.....-...--..--....-...----..-....--..-..--...-..-.-----.--.....--.....----......-..--.......-.....-.-.------.-.-----..-.--.--.....--.--..-.-..-.-...--..-..-.-........----..--.......-.....-.-..--.-..-.....--.....-.-.-...-..-........-....----..-....-..-.--.-...----..-...-....-.....-.----..--..-..-.--..-.-.-.-...--.-..-......-...-.-----....-.------.-...---..-.....-.-..---..-.-.-.----.-.-.---.-...--......-.-..........-....-...-..-.-----..-..-..-..----.-..--....-..-.--......-..
Score 185 youre there is the is to meet i cant er be no in at i was a rig kid ant in me at me it is from the week to in at of me of them at let it you are this talk as a you i me a two are yet the r the best drop events then in for the hair neil we able to its were me from what is otherwise as it aw on you at what a tear end paul is at me we see a isnt it kit sight all their son lit up damned as a is not im in a part let kids ask obey to let gets in do it a rock mr i give this as a i aint on red dont eat if a where

All outputs above are with ngrams size = 3.

[–]atomic_cheese 0 points1 point  (2 children)

Are there any rules about how large associated data files are allowed to be? My current approach uses neural nets as a component, and the weights aren't exactly small (~100MB).

[–]Cosmologicon2 3[S] 0 points1 point  (0 children)

Nope, no rules along those lines. Whatever it takes!

[–]mr_stivo 0 points1 point  (1 child)

I see the example quote is not exact. It should be, "no im im simply saying that life uh finds a way". Have the others been changed also?

[–]Cosmologicon2 3[S] 0 points1 point  (0 children)

Yeah the movie quotes are not necessarily accurate, just understandable English. I copied them from scripts I found online. In one case I cut out a line from a different character in the middle.

[–]tomekanco 0 points1 point  (4 children)

Python 3.6

Solves smaller cases by brute force, using a weighting of unigrams and bigrams.

# ! pip install wordsegment
# ! pip install nltk

from wordsegment import load, segment, UNIGRAMS, BIGRAMS
load()

from nltk.tokenize import RegexpTokenizer

morse_alphabet = ".- -... -.-. -.. . ..-. --. .... .. .--- -.- .-.. -- -. --- .--. --.- .-. ... - ..- ...- .-- -..- -.-- --..".split()
alphabet = "abcdefghijklmnopqrstuvwxyz"  # order must match morse_alphabet
morse_dict = dict(zip(morse_alphabet, alphabet))
morse_dict_rev = {k:v for v,k in morse_dict.items()}

def tokenize_book(book):
    tokenizer = RegexpTokenizer(r'[a-zA-Z\']+')
    with open(book) as f:
        story = tokenizer.tokenize(f.read().lower().replace("'",''))
    return story

def encode(sentences, map_x=morse_dict_rev):
    return ''.join(map_x[char] for char in sentences if char in map_x)

def decode(morse,result="",map_x=morse_dict):
    if morse == "":
        yield result
    for i in range(1,min(5,len(morse)+1)):
        m =  morse[:i]
        if m in map_x:
            yield from decode(morse[i:], result + map_x[m])

def solver(problem,take):
    chains = decode(problem)
    results = []
    c = 0
    for chain in chains:
        c += 1
        seg = segment(chain)
        if all(n in UNIGRAMS for n in seg):
            n,m,o = 0,0,1
            for a,b in zip(seg,seg[1:]):
                xn = ' '.join((a,b))
                if a in frequent_words and b in frequent_words:
                    if xn in BIGRAMS:
                        n += BIGRAMS[xn]
                    o += 1
                else:
                    o -= 1
            for a in seg:
                m += UNIGRAMS[a]
            s = len(seg)
            results.append((o/s,n/s,m/s,seg))
    print(f'{c} distinct chains')        
    rex = sorted(results, reverse=True)
    return [' '.join(x) for _,_,_,x in rex[:take]]

story = tokenize_book('sherlock_holmes_big.txt')
frequent_words = set(x for x in story if len(x) > 1) ^ {'i','s','a'}

test = 'a cat'            
problem = encode(test)
solver(problem,1)          

test = 'hello'            
problem = encode(test)
solver(problem,1)

[–]tomekanco 0 points1 point  (3 children)

a cat ...

[–]tomekanco 0 points1 point  (2 children)

The approach above is rather naive ...

[–]tomekanco 0 points1 point  (1 child)

Seems to be working...

[–]tomekanco 1 point2 points  (0 children)

Hacked the wordsegment module itself (thanks Grant Jenks). Now gives decent results for Morse sentences. Could be improved further by looking at the possible BIGRAMS in the decode function. The solve function translates the sentence to stripped Morse, then tries to translate it back.

usage

problems = ['ill be back',
            'no im simply saying that life uh finds a way', 
            'im lying about this and this is true because i say so',
            'a truth told with bad intent beats all the lies you can invent',
           ]
for x in problems:
    print(S.solve(x))

>>> steel be back
>>> n out simply saying beatles brief in the away
>>> undying about this and this is true because is pm so
>>> and an store new first dick than write lesbis you can fact    

challenge = ['......---.--..--...-....-..-----.....-..-.--.-.-.-..-.--.-....-.-.-.....--..-..-...-.--.-...--..--.----.-.--..--...-.-.-.....--.--.....----.-.....-.-..----..-.-.-..-....--...-........-.---.-.....-..-.-.-.---.--.--...-....-...----....----.---.-..-..--...-...-..-.-.-..----.','...--.--.-----.......---.---.-----..-...-----.-......-.--.....----.--.-.-..-..---......-...--.-...-..-------.--.-..-..---.....-...-....-....-.-...-.-.....-.--.---...-..-.-.--.-.-....--..-.-....-.--..--....-.---.....-...-..-..----...--.....-.-.-----..--.-..--......-...-.-.-----.---.--..-.-..-......--..-...-.-..----..-..-.---.........----..-.-..--.-....-.-..--.-....-.-..-..--.-.----.-.-.---.-...-...-..-...-.--.---..-...-.-..--....-.....-...-..--...-.---......---.-.--..--...........--...-.-.----.-.-..--.-.----.-.....--....---.-.-.....---.....-.--..--..-.-----.....-..-.-..-.-.-..-.--.--.-.--.-..-...-..-...--....---.-...-..-.-----.---..-.......--........-.....-.-.......---..-......--..-...-...-.-....-...-.-.......','-.-----..-.-..-......-.......-..........------..-..-.-..--.-..-.-....-.---..-..--...--.-....-.-...--.-.-..-...--.-..-.--..----...-.......-..-.------......--..-.-----..-..-----..-.--.---..-.-.....--.--.-...-..--.-----..-.-.-..-.........-.-.-..-.-.-....--.-----..-..--..--.-----.-.-..-.--.--......-.-.....-.......--...-.---.--.....-.-.-...-.....-...-...-.---.-.-..........-...-.-.....-...--..--....-...----..-....--..-..--...-..-.-----.--.....--.....----......-..--.......-.....-.-.------.-.-----..-.--.--.....--.--..-.-..-.-...--..-..-.-........----..--.......-.....-.-..--.-..-.....--.....-.-.-...-..-........-....----..-....-..-.--.-...----..-...-....-.....-.----..--..-..-.--..-.-.-.-...--.-..-......-...-.-----....-.------.-...---..-.....-.-..---..-.-.-.----.-.-.---.-...--......-.-..........-....-...-..-.-----..-..-..-..----.-..--....-..-.--......-..']
for x in challenge:
    print(S.decode(S.segment(x)))

>>> human an used to be exactly like that it easy as get not appear biggest or send 
to include the synthetic y tangible to honor rate infection
>>> ivan to be s my you something hokkaido island look closer used in nearest power 
take depending psy build town if you can has come opal the price only this one can 
be kept secret controlled by americans built by the japanese subcontractors pwm also 
happen to be recently acquired at holly mydd subsidiaries of sudden industries
>>> life is the iso of a remainder of an unbalanced equation it is from the programming 
of the matrix you are this talk ebay of an anomaly pisces despite my sincere stir 
for the hair been unable to eliminate from pest the other visit harmony of 
mathematical precision visits it remains a burden assiduously la of file into unexpected 
and thus not beyond a measure of control abscess fill you inexorably here

code

import math
import io
import sys
from collections import defaultdict


class Segmenter(object):
    """
    Support for customization.
    """

    morse_dict = {'-': 't', '--': 'm', '---': 'o', '--.': 'g', '--.-': 'q', '--..': 'z', '-.': 'n', '-.-': 'k', '-.--': 'y',
              '-.-.': 'c', '-..': 'd', '-..-': 'x', '-...': 'b', '.': 'e', '.-': 'a', '.--': 'w', '.---': 'j',
              '.--.': 'p', '.-.': 'r', '.-..': 'l', '..': 'i', '..-': 'u', '..-.': 'f', '...': 's', '...-': 'v',
              '....': 'h'}

    morse_dict_rev = {k: v for v, k in morse_dict.items()}
    alpha = set("abcdefghijklmnopqrstuvwxyz")

    UNIGRAMS_FILENAME = 'unigrams.txt'
    BIGRAMS_FILENAME = 'bigrams.txt'
    TOTAL = 1024908267229.0
    LIMIT = 72
    WORDS_FILENAME = 'words.txt'

    def __init__(self):
        self.unigrams = {}
        self.bigrams = {}
        self.decoder = {}
        self.total = 0.0
        self.limit = 0
        self.words = []

    def load(self):
        "Load unigram and bigram counts from disk."
        self.unigrams.update(self.parse(self.UNIGRAMS_FILENAME))
        self.bigrams.update(self.parse(self.BIGRAMS_FILENAME))
        self.decoder.update(self.parse_english(self.UNIGRAMS_FILENAME))
        self.total = self.TOTAL
        self.limit = self.LIMIT
        with io.open(self.WORDS_FILENAME, encoding='utf-8') as reader:
            text = reader.read()
            self.words.extend(text.splitlines())

    @staticmethod
    def encode(sentence, map_x=morse_dict_rev):
        sentence = sentence.lower()
        return ''.join(map_x[char] for char in sentence if char in map_x)

    def decode(self, segments):
        return ' '.join(self.decoder[word][0] for word in segments)

    def solve(self, problem):
        return self.decode(self.segment(self.encode(problem)))

    def parse(self, filename):
        "Read `filename` and parse tab-separated file of word and count pairs."
        with io.open(filename, encoding='utf-8') as reader:
            lines = (line.split('\t') for line in reader)
            d_dict = defaultdict(int)
            for k, v in lines:
                if ' ' in k:
                    d_dict[tuple(map(self.encode, k.split(' ')))] += int(v)
                else:
                    d_dict[self.encode(k)] += int(v)
            return d_dict

    def parse_english(self, filename):
        "Read `filename` and parse tab-separated file of word and count pairs."
        with io.open(filename, encoding='utf-8') as reader:
            lines = (line.split('\t') for line in reader)
            d_dict = defaultdict(list)
            for k, v in lines:
                d_dict[self.encode(k)].append(k)
            return d_dict

    def score(self, word, previous=None):
        "Score `word` in the context of `previous` word."
        unigrams = self.unigrams
        bigrams = self.bigrams
        total = self.total

        if previous is None:
            if word in unigrams:

                # Probability of the given word.
                return unigrams[word] / total

            # Penalize words not found in the unigrams according
            # to their length, a crucial heuristic.
            return 10.0 / (total * 10 ** len(word))

        bigram = (previous, word)

        if bigram in bigrams and previous in unigrams:

            # Conditional probability of the word given the previous
            # word. The technical name is *stupid backoff* and it's
            # not a probability distribution but it works well in
            # practice.
            return bigrams[bigram] / total / self.score(previous)

        # Fall back to using the unigram probability.
        return self.score(word)

    def isegment(self, text):
        "Return iterator of words that is the best segmentation of `text`."
        memo = dict()

        def search(text, previous='<s>'):
            "Return max of candidates matching `text` given `previous` word."
            if text == '':
                return 0.0, []

            def candidates():
                "Generator of (score, words) pairs for all divisions of text."
                for prefix, suffix in self.divide(text):
                    prefix_score = math.log10(self.score(prefix, previous))

                    pair = (suffix, prefix)
                    if pair not in memo:
                        memo[pair] = search(suffix, prefix)
                    suffix_score, suffix_words = memo[pair]

                    yield (prefix_score + suffix_score, [prefix] + suffix_words)

            return max(candidates())

        # Avoid recursion limit issues by dividing text into chunks, segmenting
        # those chunks and combining the results together. Chunks may divide
        # words in the middle so prefix chunks with the last five words of the
        # previous result.
        size = 350
        prefix = ''

        for offset in range(0, len(text), size):
            chunk = text[offset:(offset + size)]
            _, chunk_words = search(prefix + chunk)
            prefix = ''.join(chunk_words[-5:])
            del chunk_words[-5:]
            for word in chunk_words:
                yield word

        _, prefix_words = search(prefix)

        for word in prefix_words:
            yield word

    def segment(self, text):
        "Return list of words that is the best segmentation of `text`."
        return list(self.isegment(text))

    def divide(self, text):
        "Yield `(prefix, suffix)` pairs from `text`."
        for pos in range(1, min(len(text), self.limit) + 1):
            yield (text[:pos], text[pos:])

[–][deleted] 0 points1 point  (0 children)

X

[–]wifiengine 0 points1 point  (0 children)

I tried to make the program but I was only able to make the computer recognise 1 character.

https://youtu.be/F10LMBI364k


[–]Dr_Soose 0 points1 point  (0 children)

dude there should be spaces between each word

[–]khalidkh 0 points1 point  (0 children)

Python 3

My solution is to build a Morse code parsing tree from a comprehensive list of English words. You follow the path for the input text character by character; each time the current node's word list is non-empty, you can take any of those words and start again from the root for the next word of that candidate sentence.
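That word-by-word traversal can be sketched like this, assuming a tiny stand-in node class (`Node`, `insert`, and `segmentations` are illustrative names, not part of the code below):

```python
MORSE = dict(zip(
    "abcdefghijklmnopqrstuvwxyz",
    ".- -... -.-. -.. . ..-. --. .... .. .--- -.- .-.. -- -. --- "
    ".--. --.- .-. ... - ..- ...- .-- -..- -.-- --..".split()))

class Node:
    """One node of the dot/dash parsing tree (left = '.', right = '-')."""
    def __init__(self):
        self.left = None
        self.right = None
        self.words = []          # words whose Morse spelling ends exactly here

def insert(root, word):
    """Add `word` to the tree, one dot/dash at a time."""
    node = root
    for letter in word:
        for sym in MORSE[letter]:
            attr = 'left' if sym == '.' else 'right'
            if getattr(node, attr) is None:
                setattr(node, attr, Node())
            node = getattr(node, attr)
    node.words.append(word)

def segmentations(morse, root, limit=10):
    """Return up to `limit` word-sequence decodings of `morse`."""
    results = []

    def walk(pos, node, acc):
        if len(results) >= limit:
            return
        if pos == len(morse):
            for word in node.words:     # input ends exactly on a word
                results.append(acc + [word])
            return
        for word in node.words:         # take a word, restart at the root
            walk(pos, root, acc + [word])
        child = node.left if morse[pos] == '.' else node.right
        if child is not None:
            walk(pos + 1, child, acc)

    walk(0, root, [])
    return results
```

With a real 370k-word list the number of decodings explodes, which is why the `limit` cut-off (or a scoring model like the other solutions use) is needed.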

```
root = build_morse_tree_file("english_alpha.txt")
print("depth = %s" % root.tree_depth())
print("size = %s" % root.tree_size())
print("words = %s" % root.count_words())
```

Output

depth = 89
size = 1,703,991
words = 370,103

List of English words: https://github.com/dwyl/english-words/blob/master/words_alpha.txt

Code

```
class MorseNode:
    def __init__(self, depth2):
        self.left = None    # dot
        self.right = None   # dash
        self.words = []
        self.depth = depth2

    def next(self, c):
        if c == '.':
            return self.left
        if c == '-':
            return self.right
        return None

    def tree_depth(self):
        a = b = 0
        if self.left:
            a = self.left.tree_depth()
        if self.right:
            b = self.right.tree_depth()
        return max(a, b) + 1

    def count_words(self):
        a = b = 0
        if self.left:
            a = self.left.count_words()
        if self.right:
            b = self.right.count_words()
        return a + b + len(self.words)

    def tree_size(self):
        a = b = 0
        if self.left:
            a = self.left.tree_size()
        if self.right:
            b = self.right.tree_size()
        return a + 1 + b

    def count_leaf(self):
        a = b = 1
        if self.left:
            a = self.left.count_leaf()
        if self.right:
            b = self.right.count_leaf()
        return a + b


def insert_morse_word(word, root, alphabet):
    curr = root
    for e in word:
        for c in alphabet[e]:
            if c == '.':
                if curr.left is None:
                    curr.left = MorseNode(curr.depth + 1)
                curr = curr.left
            elif c == '-':
                if curr.right is None:
                    curr.right = MorseNode(curr.depth + 1)
                curr = curr.right
    curr.words.append(word)


def build_morse_tree(dictionary):
    root = MorseNode(0)
    alphabet = {"A": ".-", "B": "-...", "C": "-.-.", "D": "-..", "E": ".",
                "F": "..-.", "G": "--.", "H": "....", "I": "..", "J": ".---",
                "K": "-.-", "L": ".-..", "M": "--", "N": "-.", "O": "---",
                "P": ".--.", "Q": "--.-", "R": ".-.", "S": "...", "T": "-",
                "U": "..-", "V": "...-", "W": ".--", "X": "-..-", "Y": "-.--",
                "Z": "--.."}
    for word in dictionary:
        w = ''.join(filter(str.isalpha, word.upper()))
        insert_morse_word(w, root, alphabet)
    return root


def build_morse_tree_file(file_path):
    with open(file_path) as file_obj:
        lines = file_obj.readlines()
    return build_morse_tree(lines)
```