Language-detection heuristic (English, French or German) based on Unigram and Bigram models











Given a string, for example "I hate AI", I need to determine whether the sentence is in English, German or French. The unigram model makes its prediction from the frequency of each individual character in a training text, while the bigram model makes its prediction from how often one character follows another.
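As a rough sketch of what the two models compute, assuming the add-delta smoothing and the 26-letter alphabet used by the classes further down (the symbols are illustrative notation, not names from the code), the score of a character sequence $c_1 \dots c_n$ under a language $L$ can be written as:

```latex
\text{unigram:}\quad
S_L = \sum_{i=1}^{n} \log_{10}
      \frac{\operatorname{count}_L(c_i) + \delta}{N_L + \delta\,|V|}
\qquad
\text{bigram:}\quad
S_L = \sum_{i=1}^{n-1} \log_{10}
      \frac{\operatorname{count}_L(c_i c_{i+1}) + \delta}
           {\operatorname{count}_L(c_i\,\cdot) + \delta\,|V|}
```

Here $N_L$ is the number of training characters counted for $L$, $\operatorname{count}_L(c_i\,\cdot)$ is the number of training bigrams that start with $c_i$, $|V| = 26$, and any term that involves the separator `'+'` is skipped. Choosing the language with the largest $S_L$ is presumably the final step, although that part of the code is not shown.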



The following code has two methods: getBigramResult() and getUnigramResult().



Both methods take an ArrayList<Character> as a parameter and return a HashMap<Language, Double> whose keys are the languages (French, English, German) and whose values are the log-probability scores accumulated for each language over the given character list. The two methods are almost the same except for:





  1. The for loop:

    for (int j = 0; j < textCharList.size() - 1; j++) // getBigramResult()

    for (int j = 0; j < textCharList.size(); j++) // getUnigramResult()

  2. The if condition:

    if (textCharList.get(j) != '+' && textCharList.get(j + 1) != '+') // getBigramResult()

    if (textCharList.get(j) != '+') // getUnigramResult()

  3. The probability-calculating method:

    getConditionalProbabilty(textCharList.get(j), textCharList.get(j + 1)) // getBigramResult()

    getProbabilty(textCharList.get(j)) // getUnigramResult()

  4. getBigramResult() works on a class called BiGramV2, and getUnigramResult() works on a class called Unigram.
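The `'+'` checks suggest that the input has been normalised before either method is called, although that step is not shown. The sketch below is only a guess at what such a step might look like: `TextPreprocessor` and `toCharList` are hypothetical names, and the rule "lower-case everything, turn anything outside a-z into `'+'`" is an assumption.

```java
import java.util.ArrayList;
import java.util.Locale;

// Hypothetical helper, not part of the code under review.
final class TextPreprocessor {

    // Lower-cases the text and maps every character outside a-z to the separator '+'.
    static ArrayList<Character> toCharList(String text) {
        ArrayList<Character> chars = new ArrayList<>();
        for (char c : text.toLowerCase(Locale.ROOT).toCharArray()) {
            chars.add(c >= 'a' && c <= 'z' ? c : '+');
        }
        return chars;
    }

    public static void main(String[] args) {
        System.out.println(toCharList("I hate AI")); // prints [i, +, h, a, t, e, +, a, i]
    }
}
```

Under this assumption, accented letters such as é or ü would also become `'+'`, which may matter for French and German text.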



The code of the two methods is as follows:



    public static HashMap<Language, Double> getBigramResult(ArrayList<Character> textCharList) {
        HashMap<Language, Double> totalProbabilities = new HashMap<Language, Double>();
        // Slide over every pair of adjacent characters
        for (int j = 0; j < textCharList.size() - 1; j++) {
            // Skip pairs that touch the '+' separator
            if (textCharList.get(j) != '+' && textCharList.get(j + 1) != '+') {
                FileHandler.writeSentences("BIGRAM :" + textCharList.get(j) + "" + textCharList.get(j + 1), false);
                // Score the pair under each language model
                for (int k = 0; k < biGramList.size(); k++) {
                    BiGramV2 temp = biGramList.get(k);
                    double conditionalProbability = Math.log10(temp.getConditionalProbabilty(textCharList.get(j),
                            textCharList.get(j + 1)));
                    updateTotalProbabilities(totalProbabilities, temp.getLanguage(), conditionalProbability);
                    FileHandler.writeSentences(temp.getLanguage().toString() + ": p(" + textCharList.get(j + 1) + "|" + textCharList.get(j) + ") =" + conditionalProbability + "==> log prob of sentence so far: " + totalProbabilities.get(temp.getLanguage()), false);
                }
                FileHandler.writeSentences("", false);
            }
        }
        return totalProbabilities;
    }

    public static HashMap<Language, Double> getUnigramResult(ArrayList<Character> textCharList) {
        HashMap<Language, Double> totalProbabilities = new HashMap<Language, Double>();
        // Visit every single character
        for (int j = 0; j < textCharList.size(); j++) {
            // Skip the '+' separator
            if (textCharList.get(j) != '+') {
                FileHandler.writeSentences("UNIGRAM :" + textCharList.get(j), false);
                // Score the character under each language model
                for (int k = 0; k < uniGramList.size(); k++) {
                    Unigram temp = uniGramList.get(k);
                    double conditionalProbability = Math.log10(temp.getProbabilty(textCharList.get(j)));
                    updateTotalProbabilities(totalProbabilities, temp.getLanguage(), conditionalProbability);
                    FileHandler.writeSentences(temp.getLanguage().toString() + ": p(" + textCharList.get(j) + ") =" + conditionalProbability + "==> log prob of sentence so far: " + totalProbabilities.get(temp.getLanguage()), false);
                }
                FileHandler.writeSentences("", false);
            }
        }
        return totalProbabilities;
    }


The two methods, getBigramResult() and getUnigramResult(), are very similar, and the duplication feels like poor design. However, I have not been able to refactor them because the outer for loop, the if condition and the probability-calculating method all differ.
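One possible direction, offered only as a sketch and not as the definitive refactoring: the three differences all follow from the n-gram length, so a shared interface could expose that length and let a single loop do the rest. Everything named here (`NGramModel`, `order()`, `probability(...)`, `NGramScorer`, `ConstantModel`) is hypothetical, the `Language` enum is a stand-in for the real one, and the logging calls are left out.

```java
import java.util.Arrays;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Stand-in for the Language enum used by the original classes.
enum Language { ENGLISH, FRENCH, GERMAN }

// Hypothetical interface that both model classes could implement.
interface NGramModel {
    Language getLanguage();
    int order(); // 1 for a unigram model, 2 for a bigram model
    double probability(List<Character> text, int start); // n-gram beginning at index start
}

final class NGramScorer {
    private static final Character SEPARATOR = '+';

    // Sums log10 probabilities over every n-gram that contains no separator.
    static Map<Language, Double> score(List<Character> text, List<? extends NGramModel> models) {
        Map<Language, Double> totals = new EnumMap<>(Language.class);
        for (NGramModel model : models) {
            int n = model.order();
            double sum = 0.0;
            for (int j = 0; j + n <= text.size(); j++) {
                if (!text.subList(j, j + n).contains(SEPARATOR)) {
                    sum += Math.log10(model.probability(text, j));
                }
            }
            totals.put(model.getLanguage(), sum);
        }
        return totals;
    }
}

// Toy model with a fixed probability; it exists only to exercise the scorer.
final class ConstantModel implements NGramModel {
    private final Language language;
    private final int order;
    private final double p;

    ConstantModel(Language language, int order, double p) {
        this.language = language;
        this.order = order;
        this.p = p;
    }

    public Language getLanguage() { return language; }
    public int order() { return order; }
    public double probability(List<Character> text, int start) { return p; }
}

final class NGramScorerDemo {
    public static void main(String[] args) {
        List<Character> text = Arrays.asList('a', 'b', '+', 'c');
        // Three separator-free unigrams, but only one separator-free bigram ("ab")
        System.out.println(NGramScorer.score(text, Arrays.asList(new ConstantModel(Language.ENGLISH, 1, 0.1))));
        System.out.println(NGramScorer.score(text, Arrays.asList(new ConstantModel(Language.FRENCH, 2, 0.01))));
    }
}
```

Two behavioural differences from the original methods are worth noting: the loops are swapped (models outside, positions inside), which changes the order of any log output but not the totals, and a language always gets an entry in the map, even when no n-gram was scored.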



My BiGramV2 class:

    import java.util.Arrays;
    import java.util.List;

    public class BiGramV2 {
        public static List<Character> dictCharacters = Arrays.asList('a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k',
                'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z');

        private double delta; // additive smoothing constant
        private Language language;
        // storage[r][c] counts how often character c follows character r
        double[][] storage = new double[dictCharacters.size()][dictCharacters.size()];
        // countOfRows[r] is the number of bigrams that start with character r
        private double[] countOfRows = new double[dictCharacters.size()];

        public BiGramV2(double delta, Language language) {
            this.delta = delta;
            this.language = language;
        }

        public void fit(List<Character> characters) {
            for (int i = 0; i < characters.size() - 1; i++) {
                if (characters.get(i) != '+' && characters.get(i + 1) != '+') {
                    // indexOf returns -1 for anything outside a-z, so the input must contain only a-z and '+'
                    int rowNo = dictCharacters.indexOf(characters.get(i));
                    int columnNo = dictCharacters.indexOf(characters.get(i + 1));
                    storage[rowNo][columnNo]++;
                    countOfRows[rowNo]++;
                }
            }
        }

        public Language getLanguage() {
            return language; // Enum of GERMAN, FRENCH and ENGLISH
        }

        public double getConditionalProbabilty(char first, char second) {
            int rowNo = dictCharacters.indexOf(first);
            int columnNo = dictCharacters.indexOf(second);
            double numerator = storage[rowNo][columnNo] + delta;
            double denominator = countOfRows[rowNo] + (delta * dictCharacters.size());
            return numerator / denominator;
        }
    }


And my Unigram class:

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.List;

    public class Unigram {
        public static List<Character> dictCharacters = Arrays.asList('a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k',
                'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z');

        // Maps each character to the number of times it was seen in training
        HashMap<Character, Integer> storage = new HashMap<Character, Integer>();
        private double delta; // additive smoothing constant
        private Language language;
        private int noOfCharacters = 0; // total number of counted training characters

        public Unigram(double delta, Language language) {
            this.delta = delta;
            this.language = language;
        }

        public Language getLanguage() {
            return language;
        }

        public void fit(List<Character> characters) {
            for (int i = 0; i < characters.size(); i++) {
                if (characters.get(i) != '+') {
                    storage.put(characters.get(i), storage.getOrDefault(characters.get(i), 0) + 1);
                    noOfCharacters++;
                }
            }
        }

        public double getProbabilty(char first) {
            // getOrDefault avoids a NullPointerException for a character never seen in training
            double numerator = storage.getOrDefault(first, 0) + delta;
            double denominator = noOfCharacters + (delta * dictCharacters.size());
            return numerator / denominator;
        }
    }
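To make the smoothing arithmetic concrete, here is a small stand-alone sketch of the same (count + delta) / (total + delta * alphabet size) formula with made-up counts. `AddDeltaDemo` and `smoothed` are hypothetical names, and the counts (3 for 'a', 1 for 'b') and delta = 0.5 are invented for illustration; unseen characters are treated as having a count of 0.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class, not part of the code under review.
final class AddDeltaDemo {
    static final int ALPHABET_SIZE = 26;

    // (count + delta) / (total + delta * alphabet size); unseen characters count as 0.
    static double smoothed(Map<Character, Integer> counts, int total, double delta, char c) {
        return (counts.getOrDefault(c, 0) + delta) / (total + delta * ALPHABET_SIZE);
    }

    public static void main(String[] args) {
        Map<Character, Integer> counts = new HashMap<>();
        counts.put('a', 3);
        counts.put('b', 1);
        int total = 4;
        double delta = 0.5;

        System.out.println(smoothed(counts, total, delta, 'a')); // 3.5 / 17
        System.out.println(smoothed(counts, total, delta, 'z')); // 0.5 / 17, non-zero although 'z' was never seen

        // The smoothed probabilities over a-z still sum to 1 (up to floating-point rounding)
        double sum = 0.0;
        for (char c = 'a'; c <= 'z'; c++) {
            sum += smoothed(counts, total, delta, c);
        }
        System.out.println(sum);
    }
}
```

The same shape applies to each row of the bigram table, with the row total taking the place of the overall character count.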



Any suggestions on my code would be appreciated.










  • Welcome to Code Review! What task does this code accomplish? Please tell us, and also make that the title of the question via edit. Maybe you missed the placeholder on the title element: "State the task that your code accomplishes. Make your title distinctive." Also from How to Ask: "State what your code does in your title, not your main concerns about it."
    – Sᴀᴍ Onᴇᴌᴀ
    Nov 29 at 22:59

  • @SᴀᴍOnᴇᴌᴀ Do you think the edits I made are OK?
    – dividedbyzero
    Nov 30 at 2:58

  • Please update the title to express what the code does, not your concerns about the code.
    – bruglesco
    Nov 30 at 3:08

  • @bruglesco Do you think it's OK now?
    – dividedbyzero
    Nov 30 at 3:14

  • The title is better. I think it's also a good question, but it's a bit of a grey area. I can't vote to reopen, however, so it will be up to the rest of the community. Good luck!
    – bruglesco
    Nov 30 at 3:26















java natural-language-processing














asked Nov 29 at 22:09 by dividedbyzero, edited Nov 30 at 7:23






















