performance: speeding up tokenization of large text documents with Python and regex

I am currently trying to process a large number of very large text files (>10k words each). In my data pipeline, I identified the gensim tokenize function as my bottleneck. The relevant part is provided in my MWE below:

import re
import urllib.request
import time

url='https://raw.githubusercontent.com/mxw/grmr/master/src/finaltests/bible.txt'
doc=urllib.request.urlopen(url).read().decode('utf-8')

PAT_ALPHABETIC = re.compile(r'[^\W\d]+')

def tokenize(text):
    matches=PAT_ALPHABETIC.finditer(text)
    for match in matches:
        yield match.group()

def preprocessing(doc):
    tokens = [token for token in tokenize(doc)]
    return tokens


start_time = time.time()
preprocessing(doc)
print("--- %s seconds ---" % (time.time() - start_time))

I managed to reduce the runtime by about 50% using PyPy. It takes approximately 0.40 seconds to run the preprocessing function on my mid-2010 consumer notebook.

However, is there anything I can still optimize in my code?
I'd also be interested in the runtimes people get on newer hardware (or hardware suggestions in general).
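One variant I plan to try (a sketch, not yet benchmarked against the MWE above) is dropping the generator indirection and letting the regex engine build the token list directly via findall, which avoids creating a Match object per token:

import re

PAT_ALPHABETIC = re.compile(r'[^\W\d]+')

def tokenize_findall(text):
    # findall returns the matched substrings directly, skipping both the
    # per-match Match objects and the Python-level generator loop
    return PAT_ALPHABETIC.findall(text)

Would that be the right direction, or is there something better?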

Thanks in advance

pr.probability – Terminology: "absolute constant", "large enough"

"Absolute constant" means that it does not depend on anything. For example, $ 3, 10 ^ {12}, pi $ and the Feigenbaum number are absolute constants. They are real numbers. "Large enough" means that
the authors did not care or were unable to calculate or estimate it.
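For illustration, the two phrases typically appear together in statements of the form: "there is an absolute constant $C > 0$ such that $f(n) \le C\, g(n)$ for all $n \ge n_0$", where neither $C$ nor the threshold $n_0$ depends on any parameter of the problem, and "large enough" asserts that such an $n_0$ exists without specifying it.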

Python: advice on taming a slow loop required for viewing large GIS data sets

[screenshot: bokeh plot of a sample of the dataset]

I am working on plotting a large GIS dataset; the screenshot above shows a previously rendered sample of about 1/6 of the data. I am happy with how fast the data loads, and bokeh renders the HTML almost instantly. However, I have come across a fairly expensive loop in my code that is not scaling well as I increase (1) the number of rows and (2) the resolution of the polygons. The process just got killed in the #counts points loop, and I wonder if there isn't a better way to do this?

I found the suggested loop in a GIS readthedocs.io tutorial and was happy with its performance for a few thousand points a couple of months ago. But now the project needs to process a GeoDataFrame with >730,000 rows. Should I be using a better method to count the number of points in each polygon? I have a modern desktop to do the calculation on, but the project also has access to Azure resources; maybe that's what most people who do this type of calculation professionally use? I'd rather do the calculation locally, but that might mean my desktop running at maximum CPU overnight or longer, which is not an exciting prospect. I am using Python 3.8.2 and Conda 4.3.2.

from shapely.geometry import Polygon
import pysal.viz.mapclassify as mc
import geopandas as gpd

def count_points(main_df, geo_grid, levels=5):
    """
    Outputs a gdf of polygons with a column of classifiers to be used for color mapping.
    """
    pts = gpd.GeoDataFrame(main_df["geometry"]).copy()

    # counts points
    pts_in_polys = []
    for i, poly in geo_grid.iterrows():
        pts_in_this_poly = []
        for j, pt in pts.iterrows():
            if poly.geometry.contains(pt.geometry):
                pts_in_this_poly.append(pt.geometry)
                pts = pts.drop([j])  # drop matched points so they are not tested again
        nums = len(pts_in_this_poly)
        pts_in_polys.append(nums)
    geo_grid["number of points"] = gpd.GeoSeries(pts_in_polys)  # Adds number of points in each polygon

    # Adds Quantiles column
    classifier = mc.Quantiles.make(k=levels)
    geo_grid["class"] = geo_grid[["number of points"]].apply(classifier)

    # Adds Polygon grid points to new geodataframe
    geo_grid["x"] = geo_grid.apply(getPolyCoords, geom="geometry", coord_type="x", axis=1)
    geo_grid["y"] = geo_grid.apply(getPolyCoords, geom="geometry", coord_type="y", axis=1)
    polygons = geo_grid.drop("geometry", axis=1).copy()

    return polygons
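For what it's worth, the alternative I am eyeing is replacing the nested loop with a spatial join via geopandas' gpd.sjoin, roughly like the sketch below (untested on the full >730,000 rows; newer geopandas versions spell the op= keyword predicate=). Would that be the better method?

import geopandas as gpd

# join each point to the polygon that contains it, then count matches per polygon
joined = gpd.sjoin(pts, geo_grid, how="inner", op="within")
counts = joined.groupby("index_right").size()

# polygons containing no points get a count of 0
geo_grid["number of points"] = counts.reindex(geo_grid.index, fill_value=0).astype(int)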

Where can I find a large diameter lens with a focal length equal to its physical length?

Some background: I read that for a Stanhope lens "the focal length of the lens equals the length of the lens." However, I am looking for a much larger diameter than a Stanhope, something like 15mm-25mm.
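For reference, my own back-of-the-envelope reasoning (not from the quoted source): for a glass rod of refractive index $n$ with one convex end of radius $R$, parallel light entering the curved end focuses a distance $nR/(n-1)$ behind its vertex, so a rod cut to length $L = nR/(n-1)$ has its focal plane exactly on the flat rear face, which is what the Stanhope design exploits. If that holds, scaling up the diameter should just require scaling $R$ and $L$ together.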

architecture: how to structure large and expandable projects

TL;DR at the bottom.

I want to create a library (I think this is the correct term) for my own reinforcement learning environments (env for short). Most envs would be based on self-implemented games written in pure Python or in C++ with Python bindings. What would be the best way to structure this project so that it is easy to expand and maintain, and makes the most sense? I want to be able to reuse code, like using a general board class for all my board game implementations (e.g. chess, go, gomoku). I plan to make it cross-platform with the help of CMake, and I might even venture to package it as a Conda package.

Through my initial search, I found out that this design (and its variants) is popular, and I decided to trust it.

My initial plan was to structure my library into projects, create a repository for each project, and include one project in another as a git submodule. To create the environment for the game 2048, the structure would be the following (CMakeLists omitted):

(env-vector is based on env-2048, which is based on game-2048, which uses the general-board class)

general-board
├── external
│   └── Catch2/
├── include
│   └── general-board
│       └── file.h
├── src
│   └── file.cpp
└── tests
    └── tests.cpp

game-2048
├── app
│   └── manual_game.cpp
├── external
│   ├── Catch2
│   └── general-board
├── include
│   └── game-2048
│       └── file.h
├── src
│   └── file.cpp
└── tests
    └── tests.cpp

env-2048
├── external
│   ├── Catch2/
│   └── game-2048/
├── include
│   └── env-2048
│       └── file.h
├── src
│   └── file.cpp
└── tests
    └── tests.cpp 

env-vector <---- this would be on the top, bundling the envs together
├── external
│   ├── Catch2/
│   ├── env-2048/ 
│   ├── env-chess/ <---- another board game
│   └── env-go/ <---- another board game
├── include
│   └── env-vector
│       └── file.h
├── python
│   └── pybind11_magic_here
├── src
│   └── file.cpp
└── tests
    └── tests.cpp

After some implementation, I became concerned that the number of submodules and the amount of redundancy were too high. With this structure, the top-level project would contain the general-board project N times (where N is the number of games that depend on a board), and Catch2 would be included even more often. It looks suspicious and error prone.

My second idea was to create one big project and include everything in it in a 'flat' way, rather than 'nested' as before. It would look like this:

(line ending with '/' depicts a folder)
environments_all_in_one
│
├── external
│   └── Catch2/
├── include
│   └── environments_all_in_one
│       └── **not_even_sure_what_to_put_here**      
├── python
│   └── pybind11_magic_here
├── src
│   ├── env_vector
│   ├── envs
│   │   ├── env-2048/
│   │   ├── env-chess/
│   │   └── env-go/
│   ├── games
│   │   ├── game-2048/
│   │   ├── game-chess/
│   │   └── game-go/
│   └── general-board
│       ├── board_abc/
│       ├── board_array/
│       └── board_vector/
└── tests
    └── tests.cpp

In this way, the code would not be present multiple times, which definitely aids transparency. However, as I have no experience here, I have to ask:

Is there a better way to do it?

Java: iterating through a large map dataset on every click – performance?

I need to get values from a map by the adapter position that I get from a RecyclerView.

As you can see every time I click on an album art, I create a new array of album objects.

final Album[] albums = new Album[albumMap.size()];
for (Map.Entry e : albumMap.entrySet()) {
    albums[i++] = (Album) e.getValue();
}

Then I get the album like so: String selectedAlbum = albums[position].getAlbum();

But what if someone has over 10,000 albums on their device? Each time an album is clicked, a new array of Album objects is created and iterated over just to get the album name.

Would this have a performance impact if there are many albums present?

TL;DR: Is this code bad?

Complete code

@Override
public void onClickAlbum(int position, Map albumMap) {
    if (getActivity() != null) {
        int i = 0;
        final Album[] albums = new Album[albumMap.size()];
        for (Map.Entry e : albumMap.entrySet()) {
            albums[i++] = (Album) e.getValue();
        }
        String selectedAlbum = albums[position].getAlbum();
        Main.getInstance().setSongsFilteredBy(SongsLibrary.getInstance().getSongsByAlbum(selectedAlbum));
        Intent intent = new Intent(getActivity(), ListSongsActivity.class);
        intent.putExtra("title", selectedAlbum);
        startActivity(intent);
        Toast.makeText(getActivity(), "test: " + selectedAlbum, Toast.LENGTH_SHORT).show();
    }
}

Pathfinder 1e – Notion of natural reach and adjacency for Large or larger creatures

"Adjacent" means what it does in English, bordering directly on each other. Two squares are adjacent if they share a corner or edge. Two creatures are adjacent if their spaces include two adjacent squares. This is the same as the squares that are within five feet of each other, yes, since the grid is 5 feet. Grid. This does not change with the size of the creature: adjacent always means that the sides must be touching at least one corner or edge.

But all of this comes from what the word adjacent means. That parenthetical might be the only explicit description of it, but it can't be "the official definition" or anything, being a parenthetical on a very specific topic. Rather, the word is not defined at all by the rules of the game, and "the official definition" is found in any dictionary.

Anyway, then, no: Bodyguard's range does not change with reach (the number of adjacent squares does increase with space, however, since there are 8 squares around a 1×1 space and 12 squares around a 2×2 space). That being said, I tend to agree with you that some uses of things that affect adjacent squares would make more sense if they affected the natural reach of a given size (i.e. not subject to any reach bonuses a creature may have beyond its size). But when deciding whether a given feature should use natural reach instead of adjacent squares, to be fair and honest you should decide that in advance and inform players about it. Also keep in mind that Tiny and smaller creatures have a natural reach of 0 feet – should they be able to use the feature only while sharing a space with something? (Maybe! But it's something to think about when making the ruling, since you must be consistent.)

In the event that "it came up in the middle of a fight and now I realize I want to change the rule," my usual approach is to apply the houserule immediately only if it benefits the PCs. If it would hurt the PCs, I apply the houserule after the fight is over, for future fights. In both cases I let the players know, and if a player decides the change makes them regret a feat they had taken or whatever, I look for a way to make that work out for them. (There may be exceptions in very unusual circumstances: maybe a boss fight would be completely neutralized if I didn't act, and that's no fun for the players either, but I can't remember the last time I did that.)

lightroom – Converting a large database of mixed-format files to screen-size JPEG files

If you have the technical skills to install Python 3 and the Python Imaging Library (Pillow) on your computer, you can use this Python script:

#!/usr/bin/python3
import os
import sys
from PIL import Image

size = 128, 128

def thumbnail(fromFile, toFile):
    # only thumbnail recognized image types
    _, ext = os.path.splitext(fromFile)
    if ext.lower() in ('.jpg', '.jpeg', '.png'):
        print('Creating thumbnail', toFile)
        im = Image.open(fromFile)
        im.thumbnail(size)
        im.save(toFile + ".thumbnail", "JPEG")

(_, fromDir, toDir) = sys.argv

# mirror the source directory tree under the destination, then convert files
for root, dirs, files in os.walk(fromDir, topdown=True):
    for name in dirs:
        toPath = os.path.join(toDir, root, name)
        print('Making', toPath)
        os.makedirs(toPath, 0o777, True)

    for name in files:
        fromFile = os.path.join(root, name)
        toFile = os.path.join(toDir, root, name)
        thumbnail(fromFile, toFile)

Run it like this: python3 scriptname sourcedir destdir

Remember that sourcedir must be a relative path name. If you have a file called /usr/traveler/mystuff/images/2009/Mar/IMG3.jpg, go to the directory /usr/traveler/mystuff and use python3 scriptname images thumbnails; the output will be in /usr/traveler/mystuff/thumbnails/images/2009/Mar/IMG3.jpg.

Memory problem when sending a large number of emails from the console command

I have a console command that checks orders and sends emails to customers.

protected function execute(InputInterface $input, OutputInterface $output){
    $orders = $this->getOrders();
    foreach ($orders as $order) {
        $data = ... //prepare some data
        $this->sendMail($order->getCustomerEmail(), $data);
    }
}

private function sendMail($email, $data){
    $postObject = new \Magento\Framework\DataObject();
    $postObject->setData($data);

    if ($this->options["test-email"]) {
        $email = $this->options["test-email"];
    }

    $maskedEmail = substr($email, 0, 1).'***'.substr($email, strpos($email, "@"));
    $this->output->writeln("\tSending mail to: {$maskedEmail}");

    // send mail to recipients
    $this->inlineTranslation->suspend();
    $storeScope = \Magento\Store\Model\ScopeInterface::SCOPE_STORE;
    $transport = $this->transportBuilder
        ->setTemplateIdentifier(
            $this->scopeConfig->getValue(self::EMAIL_TEMPLATE, $storeScope)
        )
        ->setTemplateOptions(
            [
                'area' => \Magento\Framework\App\Area::AREA_FRONTEND,
                'store' => $this->storeManager->getStore()->getId(),
            ]
        )
        ->setTemplateVars(['data' => $postObject])
        ->setFrom(
            $this->scopeConfig->getValue(self::EMAIL_SENDER, $storeScope)
        )
        ->addTo($email)
        ->getTransport();

    $transport->sendMessage();
    $this->inlineTranslation->resume();
}

This works fine, but when I tried it with ~5000 orders, at around the ~2700th email it gave me this error:

PHP Fatal error: Allowed memory size of xxxxxxx bytes exhausted (tried to allocate 344064 bytes) in /app/xxxxxx/vendor/magento/framework/View/TemplateEngine/Php.php on line 66
{..rendered email html content..}
Check https://getcomposer.org/doc/articles/troubleshooting.md#memory-limit-errors for more info on how to handle out of memory errors.

I assume that the rendered HTML remains in memory. I tried unset($transport), but memory usage still increases.

What could be the problem?

Thanks in advance.

pdf: Solid purple color appears when trying to view large files in Preview

I am trying to edit some large uncompressed .tif files (800MB-3.5GB).
While I can open them in Adobe Photoshop and Bridge, I'm having trouble opening the images in Preview: the image starts to resolve when opened and then suddenly turns purple. Below are two screenshots showing the solid color wiping across the image as it resolves from left to right.
Partially covered image
Fully covered image

  • I have checked the images for an alpha channel that Preview might recognize as a mask. None of the images has an alpha channel.
  • I have isolated the layers in Photoshop to extract the image from any baked-in alpha channel and exported them. The problem still occurs.
  • I have converted the image to .PNG; the problem still occurs.
  • I have compressed the image using LZW; the problem still occurs.

I contacted the person who took the images to obtain new copies, to see if the versions I have are damaged in some way, since this message appears when I open them in Adobe Reader:

Error message

However, the images open fine in Windows Image Viewer and IrfanView, which makes me suspect that they are not corrupted.

I'm less inclined to think the problem has something to do with image size, since smaller compressed .PNG and .tif files also exhibit the problem.

I'm currently running macOS 10.15.4 on a 2017 iMac (3.4GHz i5, 64GB 2400MHz DDR4, Radeon Pro 570 4GB).

Is this a problem anyone else has encountered with Preview, and does anyone know what causes it? Better yet, is there a solution?

The ultimate goal is to upload these images to an online archive for students, so making them viewable on any operating system (if that's the problem) would be the best outcome.
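If it does turn out to be a format quirk, my fallback plan is to batch-convert the files with Pillow before uploading, roughly like this sketch (hypothetical filenames; I haven't verified that Pillow copes with these particular multi-gigabyte uncompressed TIFFs):

from PIL import Image

# re-save a problem TIFF as a plain RGB JPEG for the online archive
im = Image.open("scan.tif")
im.convert("RGB").save("scan.jpg", "JPEG", quality=90)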

Cheers,

K.