java – Counting the number of unique IP addresses in a very large file. Follow-Up #1

This is a follow-up question.
A previous version
of this code was posted on Code Review about two weeks ago.

What was done after the last review

In fact, the entire application has been rewritten from scratch. Here
is the current version, with tests and JavaDoc.

I tried to split the application into small classes that have only one responsibility and can be re-used and extended.

The main approach to solving the problem has not changed. I still map each IP address to one value and set the bit with the corresponding index in the bit array. The required amount of memory remains the same, about 550-600 MB. The speed has increased; I am now practically limited by the performance of my hard disk. It is still assumed that the file contains only valid IPs.

I removed all JavaDoc comments from the code examples because they take up more space than the code itself.

FixedSizeBitVector

I wrote a simple bit array implementation that stores N bits with indexes from 0 to N-1. Three
operations are supported: setting a bit, reading a bit’s value, and getting the number of set bits.
The getCapacity() method is used for testing and may be useful in other cases.

I do not show the BitVector interface here because it is trivial.

package chptr.one;

public class FixedSizeBitVector implements BitVector {

    public static final long MIN_CAPACITY = 1L;
    public static final long MAX_CAPACITY = 1L << 32;

    private final long capacity;
    private final int[] intArray;
    private long cardinality;

    public FixedSizeBitVector(long capacity) {
        if (capacity < MIN_CAPACITY || capacity > MAX_CAPACITY) {
            throw new IllegalArgumentException("Capacity must be in range (1.." + MAX_CAPACITY + ").");
        }
        int arraySize = 1 + (int) ((capacity - 1) >> 5);
        this.intArray = new int[arraySize];
        this.capacity = capacity;
    }

    private void checkBounds(long bitIndex) {
        if (bitIndex < 0 || bitIndex >= capacity) {
            throw new IllegalArgumentException("Bit index must be in range (0.." + (capacity - 1) + ").");
        }
    }

    @Override
    public void setBit(long bitIndex) {
        checkBounds(bitIndex);
        int index = (int) (bitIndex >> 5);
        int bit = 1 << (bitIndex & 31);
        if ((intArray[index] & bit) == 0) {
            cardinality++;
            intArray[index] |= bit;
        }
    }

    @Override
    public boolean getBit(long bitIndex) {
        checkBounds(bitIndex);
        int index = (int) (bitIndex >> 5);
        int bit = 1 << (bitIndex & 31);
        return (intArray[index] & bit) != 0;
    }

    @Override
    public long getCapacity() {
        return capacity;
    }

    @Override
    public long getCardinality() {
        return cardinality;
    }
}

UniqueStringCounter

This class counts the unique lines in an input Iterable<String> sequence. The counter uses a BitVector
implementation. For it to work, the input sequence must have a finite number of possible string values, and that
number must not exceed the maximum capacity of the bit vector used.

It also requires a hash function that maps each String to a unique long value.

package chptr.one;

import javax.validation.constraints.NotNull;
import java.util.Objects;
import java.util.function.ToLongFunction;

public class UniqueStringCounter {

    private final Iterable<String> lines;
    private final ToLongFunction<String> hashFunction;
    private final BitVector bitVector;
    private long linesProcessed;

    public UniqueStringCounter(@NotNull Iterable<String> lines,
                               long capacity,
                               @NotNull ToLongFunction<String> hashFunction) {

        Objects.requireNonNull(lines);
        Objects.requireNonNull(hashFunction);
        this.lines = lines;
        this.hashFunction = hashFunction;
        this.bitVector = new FixedSizeBitVector(capacity);
    }

    public long count() {
        for (String line : lines) {
            long value = hashFunction.applyAsLong(line);
            bitVector.setBit(value);
            linesProcessed++;
        }
        return bitVector.getCardinality();
    }

    public long getLinesProcessed() {
        return linesProcessed;
    }
}

IpStringHashFunction

A hash function that converts a String to a long value. This function must generate a unique value for each unique line;
collisions are not allowed.

I tested several options for converting an IP string to a long value and came to the conclusion that they all run at about the same speed as InetAddress. I see no reason to write my own implementation when the speed of the library function is completely satisfactory.

Exception handling is not really needed here, because the task guarantees that the file contains only valid IPs. I had to add it because otherwise I could not use this function as a parameter for UniqueStringCounter.

package chptr.one;

import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.function.ToLongFunction;

public class IpStringHashFunction implements ToLongFunction<String> {

    @Override
    public long applyAsLong(String value) {
        long result = 0;
        try {
            for (byte b : InetAddress.getByName(value).getAddress())
                result = (result << 8) | (b & 255);
        } catch (UnknownHostException e) {
            throw new RuntimeException(e);
        }
        return result;
    }
}
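For comparison, one of the hand-written conversion options mentioned above could look roughly like the sketch below. This is not the code from the project; the class name Ipv4Parser and its method toLong are hypothetical, and the sketch assumes plain dotted-quad IPv4 input.

```java
final class Ipv4Parser {

    private Ipv4Parser() {
    }

    /**
     * Converts a dotted-quad string such as "145.67.23.4" to a value in 0..2^32-1.
     * Hypothetical helper: malformed input is rejected with an exception.
     */
    static long toLong(String ip) {
        long result = 0;
        int octet = 0;
        int digits = 0;
        int dots = 0;
        for (int i = 0; i < ip.length(); i++) {
            char c = ip.charAt(i);
            if (c >= '0' && c <= '9') {
                octet = octet * 10 + (c - '0');
                digits++;
                if (digits > 3 || octet > 255) {
                    throw new IllegalArgumentException("Invalid octet in: " + ip);
                }
            } else if (c == '.' && digits > 0 && dots < 3) {
                // An octet is complete: append it as the next 8 bits.
                result = (result << 8) | octet;
                octet = 0;
                digits = 0;
                dots++;
            } else {
                throw new IllegalArgumentException("Unexpected character in: " + ip);
            }
        }
        if (dots != 3 || digits == 0) {
            throw new IllegalArgumentException("Expected four octets in: " + ip);
        }
        return (result << 8) | octet;
    }

    public static void main(String[] args) {
        System.out.println(toLong("0.0.0.1"));          // 1
        System.out.println(toLong("255.255.255.255"));  // 4294967295
    }
}
```

A method reference such as Ipv4Parser::toLong fits ToLongFunction<String> directly, because the method declares no checked exceptions.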

BufferedReaderIterable

A simple adapter that lets you work with a BufferedReader as an Iterable. I could not come up with a better way
to pass the contents of the file to UniqueStringCounter, which expects an Iterable<String> as a parameter.

package chptr.one;

import javax.validation.constraints.NotNull;
import java.io.BufferedReader;
import java.io.IOException;
import java.util.Iterator;
import java.util.Objects;

public class BufferedReaderIterable implements Iterable<String> {

    private final Iterator<String> iterator;

    public BufferedReaderIterable(@NotNull BufferedReader bufferedReader) {
        Objects.requireNonNull(bufferedReader);
        iterator = new BufferedReaderIterator(bufferedReader);
    }

    public Iterator<String> iterator() {
        return iterator;
    }

    private static class BufferedReaderIterator implements Iterator<String> {

        private final BufferedReader bufferedReader;

        private BufferedReaderIterator(BufferedReader bufferedReader) {
            this.bufferedReader = bufferedReader;
        }

        @Override
        public boolean hasNext() {
            try {
                return bufferedReader.ready();
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }

        @Override
        public String next() {
            String line;
            try {
                line = bufferedReader.readLine();
                if (line == null) {
                    try {
                        bufferedReader.close();
                    } catch (IOException e) {
                        throw new RuntimeException(e);
                    }
                }
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
            return line;
        }
    }
}
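One possible alternative to a hand-written adapter, shown only as a sketch: BufferedReader.lines() returns a Stream<String>, and a method reference to that stream's iterator() can serve as a one-shot Iterable<String>. The class name ReaderLines and its methods are hypothetical and not part of the project.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.stream.Stream;

final class ReaderLines {

    private ReaderLines() {
    }

    /**
     * Adapts a reader to an Iterable. The result is one-shot: the underlying
     * stream can be consumed only once, so iterator() must not be called twice.
     */
    static Iterable<String> asIterable(BufferedReader reader) {
        Stream<String> lines = reader.lines();
        return lines::iterator;
    }

    /** Counts the lines, only to show the adapter working in a for-each loop. */
    static long countLines(BufferedReader reader) {
        long count = 0;
        for (String line : asIterable(reader)) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        // A StringReader stands in for the real file in this sketch.
        try (BufferedReader reader = new BufferedReader(new StringReader("1.1.1.1\n2.2.2.2\n"))) {
            System.out.println(countLines(reader)); // 2
        }
    }
}
```

With this approach the caller keeps ownership of the reader and closes it with try-with-resources. I/O errors during iteration surface as UncheckedIOException.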

IpCounterApp

The main class of the application. Accepts the file name in the -file parameter.

package chptr.one;

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Objects;

public class IpCounterApp {

    private static String parseFileName(String[] args) {
        Objects.requireNonNull(args, "No arguments found. Use -file file-name to processing file.");
        // Both conditions must hold; with || an empty array would fail on args[0].
        if (args.length == 2 && "-file".equals(args[0])) {
            return args[1];
        }
        return null;
    }

    public static void main(String[] args) {
        String fileName = parseFileName(args);
        if (fileName == null) {
            System.err.println("Wrong arguments. Use -file file-name to processing file.");
            return;
        }

        Path filePath = Paths.get(fileName);
        if (!Files.exists(filePath)) {
            System.err.printf("File %s does not exist.\n", filePath);
            return;
        }

        try {
            System.out.printf("Processing file: %s\n", filePath);
            long startTime = System.nanoTime();
            BufferedReader bufferedReader = Files.newBufferedReader(filePath);
            Iterable<String> strings = new BufferedReaderIterable(bufferedReader);
            UniqueStringCounter counter = new UniqueStringCounter(strings, 1L << 32, new IpStringHashFunction());
            long numberOfUniqueIp = counter.count();
            long linesProcessed = counter.getLinesProcessed();
            long elapsedTime = System.nanoTime() - startTime;
            System.out.printf("Unique IP addresses: %d in total %d.\n", numberOfUniqueIp, linesProcessed);
            System.out.printf("Total time: %d milliseconds.\n", elapsedTime / 1_000_000);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Why am I asking for a new review?

In the previous question
I received a few excellent answers and one excellent bug report. I tried to take into account as many comments as I
could. I also tried to reconsider my approach to class design. I am mostly interested in three questions:

  • How suitable are my new classes for re-use and extension? What other abstractions could be extracted here, and should
    that be done?
  • What should I do about error handling? I know it is bad right now. I just try to fail as early as possible.
  • And most importantly: am I moving in the right direction? I am a newbie in programming, and getting reviews of my code is
    the only way for me to learn how to write high-quality, understandable code.

combinatorics – What counting problem would have this solution?

I need to create a counting problem that has the following formula as a solution, where $n$ and $k$ are positive integers:

${n \choose k} 2^k (n-k)_k$ where $(n)_i = n(n-1)(n-2)\cdots(n-i+1)$ is the falling factorial.

My proposed problem with this solution is one of ordering beads. Suppose we have an inexhaustible supply of black beads, white beads, and beads in $n-k$ other colors (red, blue, green, and so on). (By “inexhaustible” I mean at least $k$ black beads, at least $k$ white beads, and at least one bead of each of the $n-k$ other colors.)

In how many ways/orders can we pick $n$ beads such that exactly $k$ beads are black or white and the remaining $n-k$ beads have no repeated colors?

My understanding is that we take beads, line them up in a certain order, and count the total number of different arrangements. The solution would then be ${n \choose k} 2^k (n-k)_k$. Is this true?

java – Counting the number of unique IP addresses in a very large file

I completed a test task for a Junior Java Developer position. I did not receive any answer from the employer, so I would like to get a review here.

Task description

A simple text file with IPv4 addresses is given. One line is one address, something like this:

145.67.23.4
8.34.5.23
89.54.3.124
89.54.3.124
3.45.71.5.
...

The file size is not limited and can reach tens or hundreds of gigabytes.

The task is to calculate the number of unique IP addresses in this file, consuming as little memory and time as possible. There is a “naive” algorithm for solving this task (read the strings and put them in a HashSet); ideally your implementation should do better than this simple, naive algorithm.

Test case

https://ecwid-vgv-storage.s3.eu-central-1.amazonaws.com/ip_addresses.zip

WARNING! This file is about 20 GB and unpacks to approximately 120 GB. It contains 8 billion lines.

My approach

I start from the fact that there are 2^32 valid unique IP addresses. We can use a bit array of 2^32 bits (one bit per unsigned 32-bit value) and map each bit of this array to one IP address. Such an array takes exactly 512 megabytes of memory and is allocated once at the start of the program. Its size does not depend on the size of the input data.
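The 512 megabyte figure can be checked with a few lines of arithmetic. This is only an illustrative sketch; the class name BitArraySizing and its helpers are hypothetical and not part of the project.

```java
final class BitArraySizing {

    /** Number of distinct IPv4 addresses: 2^32. */
    static final long ADDRESS_COUNT = 1L << 32;

    private BitArraySizing() {
    }

    /** Bytes needed to hold the given number of bits, rounded up. */
    static long bytesNeeded(long bits) {
        return (bits + 7) / 8;
    }

    /** 32-bit words needed to hold the given number of bits, rounded up. */
    static long intWordsNeeded(long bits) {
        return (bits + 31) / 32;
    }

    public static void main(String[] args) {
        long bytes = bytesNeeded(ADDRESS_COUNT);
        System.out.println(bytes);                         // 536870912
        System.out.println(bytes / (1024 * 1024));         // 512 (MiB)
        System.out.println(intWordsNeeded(ADDRESS_COUNT)); // 134217728
    }
}
```

The word count is well below Integer.MAX_VALUE, so a single int[] (or long[]) array can hold the whole bit array, even though the bit indexes themselves do not fit in a signed int.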

Unfortunately, Java does not have an unsigned int type and also there are no convenient bit operations in it, so I use BitSet for my purposes.

GitHub

You can see the full code of the project, with a test file of 1_000_000 addresses, here:

https://github.com/chptr-one/ip-addr-counter

Thank you!

Please feel free to tell me all the flaws that you can find.

Code

Main class

public class IpCounterApp {

    private static String parseFileName(String[] args) {
        String fileName = null;
        if (args.length == 2 && "-file".equals(args[0])) {
            fileName = args[1];
        }
        return fileName;
    }

    public static void main(String[] args) {
        String fileName = parseFileName(args);
        if (fileName == null) {
            System.out.println("Wrong arguments. Use '-file file_name' to specify file for processing");
            return;
        }

        UniqueIpCounter counter = new BitSetUniqueIpCounter();
        long numberOfUniqueIp = counter.countUniqueIp(fileName);
        if (numberOfUniqueIp != -1) {
            System.out.println("Found " + numberOfUniqueIp + " unique IP's");
        } else {
            System.out.println("Some errors here. Check log for details.");
        }
    }
}

UniqueIpCounter interface

public interface UniqueIpCounter {

    /*
    In total, 2 ^ 32 valid IP addresses exist.
     */
    long NUMBER_OF_IP_ADDRESSES = 256L * 256 * 256 * 256;

    /*
    Map string representing the IP address in format 0-255.0-255.0-255.0-255 to number
    in the range of 0..2^32-1 inclusive.
    It is guaranteed that the input string contains a valid IP address.
     */
    static long toLongValue(String ipString) {
        StringBuilder field = new StringBuilder(3);
        int startIndex = 0;
        long result = 0;

        for (int i = 0; i < 3; i++) {
            int spacerPosition = ipString.indexOf('.', startIndex);
            field.append(ipString, startIndex, spacerPosition);
            int fieldValue = Integer.parseInt(field.toString());
            field.setLength(0);
            result += fieldValue * Math.pow(256, 3 - i);
            startIndex = spacerPosition + 1;
        }
        result += Integer.parseInt(ipString.substring(startIndex));

        return result;
    }

    /*
    Returns the number of unique IP addresses in the file whose name is passed as the argument.
    Returns a number from 0 to 2 ^ 32 inclusive.
    Returns -1 in case of any errors.
     */
    long countUniqueIp(String fileName);
}

BitSetUniqueIpCounter implementation

import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.BitSet;
import java.util.logging.Level;
import java.util.logging.Logger;

public class BitSetUniqueIpCounter implements UniqueIpCounter {

    private final Logger logger = Logger.getLogger("BitSetUniqueIpCounter");

    /*
    To count unique IPs we use a bit array, where each bit corresponds to one IP address.
    In Java, there is no unsigned int and the maximum BitSet size is Integer.MAX_VALUE, therefore we use two arrays.
     */
    private final BitSet bitSetLow = new BitSet(Integer.MAX_VALUE); // 0 - 2_147_483_647
    private final BitSet bitSetHi = new BitSet(Integer.MAX_VALUE); // 2_147_483_648 - 4_294_967_295
    private long counter = 0;

    private void registerLongValue(long longValue) {
        int intValue = (int) longValue;
        BitSet workingSet = bitSetLow;
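        if (longValue > Integer.MAX_VALUE) {
            // Shift the upper half down to 0..Integer.MAX_VALUE. Subtracting only
            // Integer.MAX_VALUE would overflow to a negative index for 255.255.255.255.
            intValue = (int) (longValue - Integer.MAX_VALUE - 1);
            workingSet = bitSetHi;
        }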

        if (!workingSet.get(intValue)) {
            counter++;
            workingSet.set(intValue);
        }
    }

    @Override
    public long countUniqueIp(String fileName) {
        logger.log(Level.INFO, "Reading file: " + fileName);
        try (BufferedReader in = new BufferedReader(new FileReader(fileName))) {
            long linesProcessed = 0;
            String line;
            // If we have already counted 2 ^ 32 unique addresses, the rest of the file can only contain duplicates
            while ((line = in.readLine()) != null && counter < NUMBER_OF_IP_ADDRESSES) {
                registerLongValue(UniqueIpCounter.toLongValue(line));
                linesProcessed++;
            }
            logger.log(Level.INFO, "Total lines processed: " + linesProcessed);
        } catch (FileNotFoundException e) {
            logger.log(Level.WARNING, "File '" + fileName + "' not found", e);
            counter = -1;
        } catch (IOException e) {
            logger.log(Level.WARNING, "IOException occurs", e);
            counter = -1;
        }
        return counter;
    }
}

linear algebra – Counting tuples which satisfy certain additive conditions

I have 4-tuples

$(a_1, a_2, a_3, a_4)$ with each $a_i \in \mathbb{N}$ (including 0) such that $a_1 + a_2 + a_3 + a_4 = n$ and $2 \mid a_2$ and $3 \mid a_3$ and $4 \mid a_4$

For example, if $n = 2,$ then we could have $(2,0,0,0)$ or $(0,2,0,0)$. If $n = 3,$ then we have $(0,0,3,0)$ and $(3,0,0,0)$ and $(1,2,0,0)$.

How can I count how many tuples I have for each $n \in \mathbb{N}$? It looks like it really blows up after $n = 5$.
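One way to get the counts for small $n$ is a short dynamic program. Writing $a_2 = 2b$, $a_3 = 3c$, $a_4 = 4d$ turns the condition into $a_1 + 2b + 3c + 4d = n$, which is the same as counting the ways to build $n$ from steps of size 1, 2, 3, and 4 where order does not matter. The sketch below is only an illustration; the class name TupleCounter is hypothetical.

```java
final class TupleCounter {

    private TupleCounter() {
    }

    /**
     * Counts tuples (a1, a2, a3, a4) of non-negative integers with
     * a1 + a2 + a3 + a4 = n, 2 | a2, 3 | a3 and 4 | a4.
     */
    static long count(int n) {
        if (n < 0) {
            throw new IllegalArgumentException("n must be non-negative");
        }
        long[] ways = new long[n + 1];
        ways[0] = 1; // only the all-zero tuple sums to 0
        // Add one step size at a time, so each tuple is counted exactly once.
        for (int step = 1; step <= 4; step++) {
            for (int total = step; total <= n; total++) {
                ways[total] += ways[total - step];
            }
        }
        return ways[n];
    }

    public static void main(String[] args) {
        for (int n = 0; n <= 5; n++) {
            System.out.println(n + " -> " + count(n));
        }
        // Prints: 0 -> 1, 1 -> 1, 2 -> 2, 3 -> 3, 4 -> 5, 5 -> 6 (one pair per line)
    }
}
```

For $n = 2$ and $n = 3$ this agrees with the tuples listed above. The counts grow polynomially in $n$ (roughly like $n^3/144$), so they increase steadily rather than exponentially.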

sql – Snowflake: counting the number of rows present in an hour as a single row

I have a record for every login a user makes. I need to count how many times each user has logged in. However, no matter how many times a user logs in within half an hour, I need to count that as one login.

USER_ID  TIMESTAMP
A1        2021-03-10 10:00:00
A1        2021-03-10 10:01:00
A1        2021-03-10 10:05:00
A1        2021-03-10 10:15:00
A1        2021-03-10 10:32:00
A1        2021-03-10 11:02:00
A1        2021-03-11 12:00:00
A2        2021-03-10 10:01:00

Expected output:

USER_ID     TIMESTAMP
A1            4
A2            1

I am not able to figure out how to use lag and lead in this situation. Any help would be appreciated.

How do I solve the full counting sort problem in HackerRank?

I tried to look it up on the internet, but I find the concept hard to understand. Can someone show me how to do it in Java?
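A minimal sketch of the general idea follows. It assumes the task is to stably sort (key, string) pairs by their integer key and to replace the strings from the first half of the input with "-"; that is a paraphrase of the problem, not its exact statement. The class name, method signature, and keyRange parameter are hypothetical.

```java
final class FullCountingSort {

    private FullCountingSort() {
    }

    /**
     * Stable counting sort of values by their keys (each key in 0..keyRange-1).
     * Values from the first half of the input are replaced with "-".
     */
    static String sort(int[] keys, String[] values, int keyRange) {
        // 1. Count how often each key occurs.
        int[] count = new int[keyRange];
        for (int key : keys) {
            count[key]++;
        }
        // 2. Prefix sums give the first output position for each key.
        int[] start = new int[keyRange];
        for (int k = 1; k < keyRange; k++) {
            start[k] = start[k - 1] + count[k - 1];
        }
        // 3. Place items in input order, which keeps equal keys stable.
        String[] out = new String[keys.length];
        int half = keys.length / 2;
        for (int i = 0; i < keys.length; i++) {
            String value = i < half ? "-" : values[i];
            out[start[keys[i]]++] = value;
        }
        return String.join(" ", out);
    }

    public static void main(String[] args) {
        int[] keys = {1, 0, 1, 0};
        String[] values = {"a", "b", "c", "d"};
        System.out.println(sort(keys, values, 2)); // - d - c
    }
}
```

The sort never compares elements: it counts the keys, turns the counts into starting positions, and copies each item to its slot in input order, so it runs in time proportional to the number of items plus the key range.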

algorithms – Reduction between Parity-SAT and approximate counting

Consider two problems as defined here.

Approximate counting: Given a Boolean function $f(x)$, for $x \in \{0, 1\}^{n}$, distinguish between the two cases:

  1. The number of satisfying assignments for $f(x)$ is $\geq 2^{k+1}$.
  2. The number of satisfying assignments for $f(x)$ is $\leq 2^{k}$.

Parity-SAT: Given a Boolean function $f(x)$, for $x \in \{0, 1\}^{n}$, output $1$ if the number of satisfying assignments to $f(x)$ is even.

Is there a way to reduce Parity-SAT to approximate counting (or vice versa)?

innodb – MySQL – Counting multiple parameters from the same subquery

I have a very large MySQL (~10M rows) database, currently using the InnoDB engine, which stores log events. I am working on a front-end that allows searching through this log database, and one of the features of this front-end is the ability to filter by certain values of the event. For the visual, the filter is a select-multiple element that lists both the possible values and the number of events currently matching that value:

However the queries I’m using to count these values are causing a significant performance hit. Part of the issue I’m having is I currently have 7 different parameters I allow filtering on, as well as more general filters (e.g. time window), so for each of these 7 parameters I’m running something like:

SELECT parameter1, count(parameter1) FROM table WHERE log_time > 'Some Date' AND parameter2 = 'Something' AND parameter3 = ... GROUP BY parameter1

…where parameter 2, 3, etc. are other conditions that have already been set. So I’m applying the same where condition to all 7 count queries, which seems redundant to me. I know I can reformat this to use a subquery instead of the where condition, but is it possible to then count all 7 filters off the same subquery?

Another idea I had was to create a temporary table which gets populated with the results of the subquery, and is then interrogated for the 7 filter counts. In testing this did overall save some time, but seems like overkill to me (especially from a disk IOP perspective).

I also tried switching the table to MyISAM, since count() operations are supposed to be faster there, but didn’t notice a significant difference in query time. I’m guessing this is because I’m using where conditions that force full table scans anyway.

quarantine – Counting days before/after traveling for COVID purposes

How do I count the days to figure out when I should take a COVID test, as it relates to travel? I left the airport around noon on Monday, which was my last contact with the public. I don’t have any symptoms or specific known exposure, except that I was traveling.

The recommendation in my area is to take a COVID test 3-5 days after travel and self-quarantine for 7 days after travel or until a negative test. Obviously I can always safely err on the side of a longer quarantine, but how should I count days to get the most accurate test results? How about the fastest reasonably accurate results?

(Lest anyone fret, this was not a pleasure trip. It involved a house that could have been on Hoarders and I’m the next-of-kin.)

google sheets – Counting rows based on the date it has and the current end of the week

Problem to solve:
Given two columns, one containing the author’s name and one containing a date, I have to count how many rows each author has.
The tricky part is that I have to use the date column and check how many entries that author has within the week of the date entered.

What my data looks like:

enter image description here

Expected output:

enter image description here

Let’s hypothetically say that today is the 19th of February and it’s a Friday.
I need to count each row that has a date between 15th of February (Monday) and 21st of February (Sunday).

Next week, from 22nd of February, I need to start it over and check it again for that week and so on.

So, I’m trying to better understand how to approach a simple / stupid problem that I can’t find a solution to.
I’ve been messing around with the QUERY() function, but with no luck, since I have literally no idea which lead to follow.

Thanks in advance if someone can solve it or at least point me in the right direction.