# Python – Bin data and calculate entropy using NumPy

I am trying to perform the following task:

For a given data column (stored as a NumPy matrix), bin the data greedily: for each value, I test the current value together with the next one and calculate the entropy of that pair.

Pseudocode would look like this:

``````
split_data(feature):
    BestValues = 0
    BestGain = 0
    For each Value in feature:
        Calculate CurrentGain as InformationGain(Entropy(feature) - Entropy(Value + NextValue))
        If CurrentGain > BestGain:
            Set BestValues = Value, NextValue
            Set BestGain = CurrentGain

    Return BestValues
``````
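One possible reading of the pseudocode above, as a minimal sketch rather than a definitive implementation. It assumes that a candidate bin is a pair of adjacent sorted unique values, that the class label sits in the last column, and that the gain of a bin is the information gain of splitting the rows into "inside the bin" versus "everything else". The helper name `label_entropy`, the extra `dataset` argument, and the toy data are illustrative assumptions, not part of the original code.

```python
import numpy as np


def label_entropy(labels):
    """Shannon entropy (base 2) of a 1-D array of class labels."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return float(-np.sum(probs * np.log2(probs)))


def split_data(dataset, feature):
    """Return (best_pair, best_gain) for one feature column.

    Assumption: each candidate bin is a pair of adjacent sorted unique
    values, scored by the gain of the split "in the bin" vs. "the rest".
    """
    column = dataset[:, feature]
    labels = dataset[:, -1]
    total = label_entropy(labels)
    values = np.unique(column)  # sorted unique values of the feature

    best_pair, best_gain = None, 0.0
    for value, next_value in zip(values[:-1], values[1:]):
        in_bin = (column == value) | (column == next_value)
        weight = in_bin.mean()  # fraction of rows that fall in the bin
        remainder = (weight * label_entropy(labels[in_bin])
                     + (1 - weight) * label_entropy(labels[~in_bin]))
        gain = total - remainder
        if gain > best_gain:
            best_pair, best_gain = (value, next_value), gain
    return best_pair, best_gain


# Toy data (made up for illustration): column 0 is the feature, the last column is the label.
data = np.array([[1, 0], [2, 0], [3, 1], [4, 1]])
best_pair, best_gain = split_data(data, 0)
```

Because the comparison is a strict `>`, ties keep the first pair found, which matches the greedy update in the pseudocode.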

I currently have Python code that looks like this:

``````
import math
import numpy

# This function finds the total entropy for a given data set
def entropy(dataset):
    # Declare variables
    total_entropy = 0
    # Determine the classes (the unique labels in the last column)
    classes = numpy.unique(dataset[:, -1])

    # Loop through each "class", or label
    for aclass in classes:
        # Create temporary variables
        currFreq = 0
        currProb = 0
        # Walk through each row in the data set
        for row in dataset:
            # If that row has the same label as the current class, increment the frequency
            if aclass == row[-1]:
                currFreq = currFreq + 1
        # The current probability is the # of occurrences / total occurrences
        currProb = currFreq / len(dataset)
        # A class with zero frequency contributes 0. Otherwise, use the entropy formula
        if currFreq > 0:
            total_entropy = total_entropy + (-currProb * math.log(currProb, 2))

    # Return the total entropy
    return total_entropy

# This function gets the entropy for a single attribute.
def entropy_by_attribute(dataset, feature):
    # The attribute is the specific feature (column) of the data set.
    attribute = dataset[:, feature]
    # The target_variables are the unique labels in the last column
    target_variables = numpy.unique(dataset[:, -1])
    # The unique values in the column that we are evaluating.
    variables = numpy.unique(attribute)
    # The entropy of the attribute in question.
    entropy_attribute = 0

    # Go through each of the possible values.
    for variable in variables:
        denominator = 0
        entropy_each_feature = 0
        # For each row in the column
        for row in attribute:
            # If it is equal to the current value that we are evaluating, increase the denominator
            if row == variable:
                denominator = denominator + 1

        # Now go through each class
        for target_variable in target_variables:
            numerator = 0
            # Go through the data set
            for row in dataset:
                # If the feature value in this row equals the value being evaluated
                # and the label equals the label being evaluated, increase the numerator
                if row[feature] == variable and row[-1] == target_variable:
                    numerator = numerator + 1

            # Use eps to protect from dividing by 0
            fraction = numerator / (denominator + numpy.finfo(float).eps)
            entropy_each_feature = entropy_each_feature + (-fraction * math.log(fraction + numpy.finfo(float).eps, 2))

        # Add this value's entropy, weighted by how often the value occurs.
        # entropy_each_feature is already non-negative, so the weight is positive.
        big_fraction = denominator / len(dataset)
        entropy_attribute = entropy_attribute + (big_fraction * entropy_each_feature)

    # Return that entropy
    return entropy_attribute

# This function calculates the information gain.
def infogain(dataset, feature):
    # Grab the entropy of the total data set
    total_entropy = entropy(dataset)
    # Grab the entropy for the current feature that is being evaluated
    feature_entropy = entropy_by_attribute(dataset, feature)
    # Calculate the information gain
    gain = float(abs(total_entropy - feature_entropy))

    # Return the information gain
    return gain
``````
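For comparison, the same quantities can be computed more compactly with `numpy.unique(..., return_counts=True)` and boolean masks. This is only a sketch under the same assumption as the code above (the label is the last column). The names `entropy_vectorized` and `infogain_vectorized` and the toy array are hypothetical, introduced here for illustration.

```python
import numpy as np


def entropy_vectorized(labels):
    """Shannon entropy (base 2) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return float(-np.sum(probs * np.log2(probs)))


def infogain_vectorized(dataset, feature):
    """Information gain of one feature column with respect to the last column."""
    column = dataset[:, feature]
    labels = dataset[:, -1]
    conditional = 0.0
    for value in np.unique(column):
        mask = column == value
        # Weight each value's label entropy by the fraction of rows with that value.
        conditional += mask.mean() * entropy_vectorized(labels[mask])
    return entropy_vectorized(labels) - conditional


# Toy data (made up for illustration): the feature perfectly predicts the label.
toy = np.array([[0, 0], [0, 0], [1, 1], [1, 1]])
toy_entropy = entropy_vectorized(toy[:, -1])
toy_gain = infogain_vectorized(toy, 0)
```

Since `numpy.unique` only returns values that actually occur, every probability is strictly positive and no `eps` guard is needed inside the logarithm.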

However, I'm not sure how to do the following:

1. For a feature, compute its total entropy
2. For a single feature, determine the entropy using a binning technique in which I test two values at a time

I cannot work out how to write code that achieves 1 and 2, and I'm struggling a lot. I will keep updating this post with my progress.