I am trying to perform the following task:
For a given data column (stored as a numpy matrix), "bin" the data in a greedy way, where I test the current value together with the next one and calculate the entropy of that pair.
Pseudocode would look like this:
```
split_data(feature):
    best_values = 0
    best_gain = 0
    for each value in the feature:
        current_gain = InformationGain(Entropy(feature) - Entropy(value + next_value))
        if current_gain > best_gain:
            best_values = value, next_value
            best_gain = current_gain
    return best_values
```
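In Python, a minimal sketch of this greedy pass might look like the following. This is only a sketch of what I have in mind, assuming the feature values and their class labels are parallel lists, and that the entropy of a pair is computed over just those two labels (the helper names `label_entropy` and `split_data` are my own, not from any library):

```python
import math

def entropy_of_probs(probs):
    """Shannon entropy of a list of probabilities (zeros contribute nothing)."""
    return -sum(p * math.log(p, 2) for p in probs if p > 0)

def label_entropy(labels):
    """Entropy of a sequence of class labels."""
    counts = {}
    for lab in labels:
        counts[lab] = counts.get(lab, 0) + 1
    total = len(labels)
    return entropy_of_probs([c / total for c in counts.values()])

def split_data(values, labels):
    """Greedy pass: score each adjacent pair of values and keep the pair
    whose grouping yields the highest information gain."""
    base = label_entropy(labels)        # entropy of the whole feature's labels
    best_gain = float("-inf")
    best_pair = None
    for i in range(len(values) - 1):
        # Entropy over the labels of just this value and the next one
        pair_entropy = label_entropy([labels[i], labels[i + 1]])
        gain = base - pair_entropy
        if gain > best_gain:
            best_gain = gain
            best_pair = (values[i], values[i + 1])
    return best_pair, best_gain
```

For example, `split_data([1, 2, 3, 4], [0, 0, 1, 1])` picks the first pure pair `(1, 2)`, since both of its labels agree.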
I currently have a Python code that looks like the following:
```python
import math
import numpy

# This function finds the total entropy for a given dataset
def entropy(dataset):
    total_entropy = 0
    # Determine the classes, i.e. the unique labels in the last column
    classes = numpy.unique(dataset[:, -1])
    # Loop through each class/label
    for aclass in classes:
        currFreq = 0
        # Walk through each row in the dataset
        for row in dataset:
            # If that row has the same label as the current class, increment the frequency
            if aclass == row[-1]:
                currFreq = currFreq + 1
        # The current probability is the # of occurrences / total occurrences
        currProb = currFreq / len(dataset)
        # A probability of 0 contributes 0 entropy; otherwise use the entropy formula
        if currFreq > 0:
            total_entropy = total_entropy + (-currProb * math.log(currProb, 2))
    # Return the total entropy
    return total_entropy

# This function gets the entropy for a single attribute.
def entropy_by_attribute(dataset, feature):
    # The attribute is the specific column of the dataset
    attribute = dataset[:, feature]
    # The target variables are the unique labels in the last column
    target_variables = numpy.unique(dataset[:, -1])
    # The unique values in the column that we are evaluating
    variables = numpy.unique(attribute)
    # The entropy of the attribute in question
    entropy_attribute = 0
    # Go through each of the possible values
    for variable in variables:
        denominator = 0
        entropy_each_feature = 0
        # For each row in the column, if it equals the current value, increment the denominator
        for row in attribute:
            if row == variable:
                denominator = denominator + 1
        # Now go through each class
        for target_variable in target_variables:
            numerator = 0
            # Go through the dataset: if the current row's feature equals the value
            # being evaluated and its label equals the class being evaluated,
            # increment the numerator
            for row in dataset:
                if row[feature] == variable and row[-1] == target_variable:
                    numerator = numerator + 1
            # Use eps to protect against dividing by 0
            fraction = numerator / (denominator + numpy.finfo(float).eps)
            entropy_each_feature = entropy_each_feature + (-fraction * math.log(fraction + numpy.finfo(float).eps, 2))
        # Weight this value's entropy by how often the value occurs in the dataset
        big_fraction = denominator / len(dataset)
        entropy_attribute = entropy_attribute + (big_fraction * entropy_each_feature)
    # Return that entropy
    return entropy_attribute

# This function calculates the information gain.
def infogain(dataset, feature):
    # Entropy of the total dataset
    total_entropy = entropy(dataset)
    # Entropy of the feature being evaluated
    feature_entropy = entropy_by_attribute(dataset, feature)
    # Calculate and return the information gain
    infogain = float(abs(total_entropy - feature_entropy))
    return infogain
```
However, I'm not sure how to do the following:
1. For a feature, compute its total entropy.
2. For a single feature, determine the entropy using a grouping technique where I test two values at a time.
I can't work out how to write the code to achieve 1 and 2, and I'm struggling a lot. I will keep updating this with any progress I make.
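For point 1, my current idea is to treat the feature's own values as the distribution and apply the same entropy formula to them (the helper name `feature_entropy` is hypothetical, not from any library):

```python
import numpy

def feature_entropy(column):
    # Point 1: entropy over the feature's own values, not the labels
    _, counts = numpy.unique(column, return_counts=True)
    probs = counts / counts.sum()
    return float(-(probs * numpy.log2(probs)).sum())

print(feature_entropy(numpy.array([1, 1, 2, 2])))  # two equally likely values -> 1.0
```

I'm not sure yet whether this is the right notion of "total entropy of a feature" for my task, or whether it should still be computed over the labels.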