natural language processing – How to extract the list of References and title from a pdf of a Research paper?

parsing – How to extract data and attribute data from an XMLElement

I am trying to write a function to extract both the data and the attribute data from an XMLElement in the same pass. I have followed the examples from Mathematica’s Transforming XML tutorial and the Mathematica StackExchange solution here

Extract attribute data from an XMLElement

However, the function I wrote returns the empty list. I believe the Cases function traverses the XML string once and, consequently, does not parse the XMLElement for the second time as I thought it would.

My MWE starts by reading fleetXMLString from a file with Import[fName]:

    XMLObject[
  "Document"][{XMLObject["Declaration"]["Version" -> "1.0", 
   "Encoding" -> "utf-8"]}, 
 XMLElement[
  "Fleet", {}, {XMLElement[
    "SomeVehicle", {}, {XMLElement["Name", {}, {"BJ#00"}], 
     XMLElement[
      "Bus", {}, {XMLElement["Shape", {}, {"parallelepiped"}], 
       XMLElement["Length", {"unit" -> "Distance"}, {"0.5"}], 
       XMLElement["Width", {"unit" -> "Distance"}, {"0.4"}], 
       XMLElement["Height", {"unit" -> "Distance"}, {"0.3"}], 
       XMLElement["Density", {"unit" -> "Density"}, {"500.0"}]}]}], 
   XMLElement[
    "SomeVehicle", {}, {XMLElement["Name", {}, {"BJ#01"}], 
     XMLElement[
      "Bus", {}, {XMLElement["Shape", {}, {"parallelepiped"}], 
       XMLElement["Length", {"unit" -> "Distance"}, {"0.5"}], 
       XMLElement["Width", {"unit" -> "Distance"}, {"0.4"}], 
       XMLElement["Height", {"unit" -> "Distance"}, {"0.3"}], 
       XMLElement[
        "Density", {"unit" -> "Density"}, {"500.0"}]}]}]}], {}]

The two functions below give the expected results

BusPhysParam[xmlString_, name_, pName_] :=
 Cases[
  Cases[xmlString,
   XMLElement["SomeVehicle", _, {___,
     XMLElement["Name", _, {name}], ___}], Infinity],
  XMLElement["Bus", _, {___,
     XMLElement[pName, _, {dim_}], ___}] :> ToExpression[dim], 
  Infinity]
BusPhysParam[fleetXMLString, "BJ#00", "Width"]

{0.4}

and

BusPhysParamUnit[xmlString_, name_, pName_] :=
 Cases[
  Cases[xmlString,
   XMLElement["SomeVehicle", _, {___,
     XMLElement["Name", _, {name}], ___}], Infinity],
  XMLElement["Bus", _, {___,
     XMLElement[pName, {___, "unit" -> unit_}, ___], ___}] :> unit, 
  Infinity]
BusPhysParamUnit[fleetXMLString, "BJ#00", "Width"]

{Distance}

However, this function returns the empty list

BusPhysParamMod[xmlString_, name_, pName_] :=
 Cases[
  Cases[xmlString,
   XMLElement["SomeVehicle", _, {___,
     XMLElement["Name", _, {name}], ___}], Infinity],
  XMLElement["Bus", _, {___,
     XMLElement[pName, _, {dim_}], ___,
     XMLElement[pName, {___, "unit" -> unit_}, ___]}] :> {ToExpression[dim], 
    unit}, Infinity]
BusPhysParamMod[fleetXMLString, "BJ#00", "Width"]

Is there a way to extract both the value and attribute at the same time?
Thank you!
B

Search two separate places or range and extract specific keyword or number to populate cell in Google Sheets from a disorganized List

I included a Google Sheets link.

I’m attempting to take apart an unorganized list in a single column, separating it by keywords into columns and rows so that the data is usable, for example to make a graph or analyze trends. The trouble I’m having, in general, is that the keyword I’m searching for may not be on the exact line I expect it to be, but one cell either up or down. Is it possible, if a match isn’t detected in A2, to also check A3?

So far I can only get it to search a single cell, not a range of cells such as A2:A3, and extract the result on row 2 in the appropriate column.

Much appreciated if anyone can point me in the right direction.

asynchronously extract title from urls using python

The following code asynchronously extracts the titles from URLs (saved in book-urls.txt).

from bs4 import BeautifulSoup
import grequests

links = list()

file1 = open("book-urls.txt", "r")
Lines = file1.readlines()
for line in Lines:
    links.append(line)
file1.close()

reqs = (grequests.get(link) for link in links)
resp = grequests.map(reqs)

for r in resp:
    soup = BeautifulSoup(r.text)
    print(soup.title.string)

The input is:

https://learning.oreilly.com/library/view/a-reviewers-handbook/9781118025635/
https://learning.oreilly.com/library/view/accounting-all-in-one-for/9781119453895/
https://learning.oreilly.com/library/view/business-valuation-for/9780470344019/
https://learning.oreilly.com/library/view/the-business-of/9780470444481/

The output is:

A Reviewer's Handbook to Business Valuation: Practical Guidance to the Use and Abuse of a Business Appraisal (Book)
Accounting All-in-One For Dummies, with Online Practice, 2nd Edition (Book)
Business Valuation For Dummies (Book)
The Business of Value Investing: Six Essential Elements to Buying Companies Like Warren Buffett (Book)

Is grequests the best option, or should I use something else?

Please give some suggestions on how I can make this code better.

google sheets query – Array formula to extract N non blank cell in Row 1-16

I have a google form where team coaches input which players played on a given day.

They also input if they were Man of the Match, Yellow/Red Cards Received, and how many Goals they scored.

Google Form Tick Box Grid

When the form is submitted there are blank Columns where players did not play.
There are 11 age groups and 4 teams within each age group, roughly 400 players, therefore 400 columns!

This is team WHITES.

In columns to the right (after the last player of the last team) I need to register only the players that played and omit the non-playing players.

I have the following formula, but it’s not extracting the correct data.

=ARRAYFORMULA(
IF(ISBLANK($A1:$A),"",
IF(ROW($A1:$A)=ROW($A$1),"Player "&COLUMN(A1),
QUERY(TRANSPOSE(M2:JM),"Select Col1 where Col1 is not null limit 1",0))))

As it is automating a Google Form, the first three lines of the formula can be ignored. It’s the part from QUERY( onward that is the problem.

The formula in JN1 is then copied right for 16 Players

Dave should be Player 3 with 1 Goal and FRED should be Player 4 with 2 Goals and the rest should be blank.

Is Query/Transpose the correct way to extract this data, or should I use something else?

malware – Is it safe to extract file from potentially infected disk

I have a hard drive that I have used for years; it holds a Windows installation and many personal files. By “files” I mean images, music, and documents (pdf or docx), but not programs. None of these files was infected initially. As I said in the title, the hard drive may be infected by malware (I did not use it safely).

My question is: can I extract these personal files onto a safe computer without risk of contamination? In other words, could these files be infected and spread malware?

database – Extract data from excel, based on one column condition SQL query should get the values and table name from excel sheet in python

plotting – How can I rotate the LegendLabel in MatrixPlot and extract the color scheme to be used in DensityPlot?

    data = Table[
   Sin[x] Cos[y] + 0.05 x y, {x, 0, 2 Pi, 0.1}, {y, 0, 2 Pi, 0.1}];
MatrixPlot[data, 
 PlotLegends -> 
  Placed[BarLegend[Automatic, LegendMarkerSize -> 100, 
    LegendLabel -> 
     Placed[Style[Text["(Text for Label)"], Red, FontSize -> 22], 
      Top]], Right], AspectRatio -> 2]

I really have two questions:

  1. How can I rotate the red LegendLabel by 90 degrees so that it is aligned with the BarLegend?
  2. I want to extract the color scheme in the MatrixPlot so I can use it in DensityPlot.

fitting – how to extract CountryData[] without String names for use in numeric analysis

I have the following code that extracts selected variables and makes a table. I would like to use these data in statistical analysis such as Fit[...], but I cannot because the String units are extracted together with the raw data.

How can I extract the raw data without String units and export it as an XLS file for use in Linear and Nonlinear Regression analysis?

countLst={"Argentina", "Australia", "Austria", "Belgium", 
  "Bulgaria","Brazil", "Brunei Darussalam", "Canada",  
  "Switzerland", "Chile", "China", "Colombia", "Costa Rica", 
  "Cyprus", "Czech Republic","Germany", "Denmark", "Spain", 
  "Estonia", "Finland", "France", "United Kingdom", "Greece", 
  "Hong Kong", "Croatia", "Hungary", "Indonesia", "India", 
  "Ireland", "Iceland", "Israel", "Italy", "Japan", "Kazakhstan",
  "Cambodia", "South Korea", "Lithuania", "Luxembourg", "Latvia", 
  "Morocco", "Mexico", "Malta", "Malaysia", "Netherlands", 
  "Norway", "New Zealand", "Peru", "Philippines", "Poland", 
  "Portugal", "Romania", "Russian Federation", "Saudi Arabia", 
  "Singapore", "Slovak Republic", "Slovenia", "Sweden", 
  "Thailand", "Tunisia", "Turkey", "Taiwan", "United States", 
  "Vietnam", "South Africa"
 };

Text[Grid[
  Prepend[{CountryData[#, "Name"],
  CountryData[#,"PopulationGrowth"],
  CountryData[#, "GDP"],
  CountryData[#, "TotalFertilityRate"], 
  CountryData[#, "GrossInvestment"], 
  CountryData[#, "InternetUsers"], 
  CountryData[#, "InventoryChange"], 
  CountryData[#, "MedianAge"], 
  CountryData[#, "TradeValueAdded"], 
  CountryData[#, "UnemploymentFraction"]} & /@ countLst, {"", 
  "pop. growth", "GDP", "fertility", "grossInv", "internet", 
  "inventory", "medianAge", "tradeVA", "unempl."}], Frame -> All, 
  Background -> {None, {LightBlue, {LightYellow}}}]
 ]

extract all textcontent in htmlcollection to array with javascript

Declare your variables – whenever you assign to or reference a variable without defining it first, you will either (1) implicitly create a property on the global object (which can result in weird bugs), or (2) throw an error, if you’re running in strict mode. You currently aren’t defining any of your variables. Fix it by putting const (or, when needed, let) in front of them when assigning to them for the first time, eg const htmlObject = $(anycasestr);.
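
A minimal sketch of case (2), runnable outside the browser: the first function opts into strict mode and assigns to a name that was never declared, while the second declares it first. The names assignWithoutDeclaring, assignAfterDeclaring, undeclaredTotal and declaredTotal are hypothetical and only serve the illustration.

```javascript
// Hypothetical demo: assigning to an undeclared name inside a strict-mode function.
function assignWithoutDeclaring() {
  'use strict';
  try {
    undeclaredTotal = 1; // never declared with const, let or var
    return 'assigned';
  } catch (err) {
    return err.name; // name of the error raised by strict mode
  }
}

// Declaring the variable before assigning to it avoids the problem.
function assignAfterDeclaring() {
  'use strict';
  let declaredTotal = 0;
  declaredTotal = 1;
  return declaredTotal;
}

const strictResult = assignWithoutDeclaring();
const declaredResult = assignAfterDeclaring();
console.log(strictResult, declaredResult); // ReferenceError 1
```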

jQuery or DOM methods? You’re using jQuery to turn the string into a jQuery collection of elements, but then you’re using getElementsByTagName to select children. If you’re using jQuery, it is more concise and consistent to also use it to select the <div> children. To find children of an element which match a particular tag name, call .find on the jQuery collection – then, you can use .map to turn the found jQuery elements into a collection of just the text of the elements:

const anycasestr = `<div style="color: rgb(51, 51, 51); background-color: rgb(253, 246, 227); font-family: Menlo, Monaco, &quot;Courier New&quot;, monospace; font-size: 12px; line-height: 18px;"><div>refinement</div><div>decent</div><div>elegant</div></div>`;
const $parent = $(anycasestr);
const arr = $parent.find('div')
  .map((_, child) => child.textContent)
  .get(); // turn the jQuery collection of strings into an array of strings
console.log(arr);
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>

Or, you can use DOMParser instead. Using DOMParser rather than jQuery to turn the text into a collection of elements can avoid accidental execution of malicious scripts. Example exploit using jQuery:

const anycasestr = `<img src="https://codereview.stackexchange.com/" onerror="alert('evil')">`;
const $parent = $(anycasestr);
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>

With DOMParser:

const anycasestr = `<div style="color: rgb(51, 51, 51); background-color: rgb(253, 246, 227); font-family: Menlo, Monaco, &quot;Courier New&quot;, monospace; font-size: 12px; line-height: 18px;"><div>refinement</div><div>decent</div><div>elegant</div></div>`;
const doc = new DOMParser().parseFromString(anycasestr, 'text/html');
const arr = [...doc.querySelectorAll('div > div')]
  .map(div => div.textContent);
console.log(arr);

The query string div > div selects <div> elements which are direct children of another <div>. It works exactly the same way as CSS selectors. querySelectorAll is a great tool for concise selection of elements – it can be easier to write and understand at a glance than other methods (like your original htmlObject[0].getElementsByTagName("div")).

Array.prototype.slice.call is a bit verbose – on non-ancient environments, you can use spread syntax instead, like I did above. Creating an array all at once by mapping is also somewhat more elegant than declaring an array then .pushing onto it.
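
The two conversions can be compared without a DOM. This is a minimal sketch that assumes the arguments object as a stand-in for a DOM collection: like the result of querySelectorAll, it is array-like and iterable but has no .map of its own. The function names collectOldStyle and collectWithSpread are illustrative only.

```javascript
// Older style: borrow Array.prototype.slice to copy an array-like object.
function collectOldStyle() {
  const items = Array.prototype.slice.call(arguments);
  return items.map(text => text.toUpperCase());
}

// Spread syntax: works on anything iterable and reads more directly.
function collectWithSpread() {
  return [...arguments].map(text => text.toUpperCase());
}

const oldStyle = collectOldStyle('refinement', 'decent', 'elegant');
const spread = collectWithSpread('refinement', 'decent', 'elegant');
console.log(oldStyle, spread);
```

Both calls produce the same array; the spread version simply skips the detour through Array.prototype.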

If you had more <div> children and wanted to take only the text from the first 3 of them, it’d be more functional to .slice the array of elements instead of putting an iteration count in a for loop:

const anycasestr = `<div style="color: rgb(51, 51, 51); background-color: rgb(253, 246, 227); font-family: Menlo, Monaco, &quot;Courier New&quot;, monospace; font-size: 12px; line-height: 18px;">
  <div>refinement</div>
  <div>decent</div>
  <div>elegant</div>
  <div>don't include me</div>
  <div>don't include me</div>
  <div>don't include me</div>
</div>`;
const doc = new DOMParser().parseFromString(anycasestr, 'text/html');
const arr = [...doc.querySelectorAll('div > div')]
  .slice(0, 3)
  .map(div => div.textContent);
console.log(arr);

in terms of computational concerns?

Unless the stuff that needs to be parsed is unreasonably large, performance for this sort of thing is not a concern; better to write clean, readable, maintainable code. If you later find that something is taking longer to run than is ideal, you can identify the bottleneck and then figure out how to fix it. (But this almost certainly won’t be the bottleneck.)
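
If a measurement is ever needed, a rough timing helper is enough to start with. The sketch below assumes a hypothetical timeIt helper and a plain array of strings standing in for the parsed elements; Date.now() only has millisecond resolution, so the numbers are a coarse indication rather than a benchmark.

```javascript
// Hypothetical helper: run fn `runs` times and report the elapsed wall-clock time.
function timeIt(fn, runs) {
  const start = Date.now();
  let lastResult;
  for (let i = 0; i < runs; i++) {
    lastResult = fn();
  }
  return { elapsedMs: Date.now() - start, lastResult };
}

// Stand-in data; in the real code this would be the array of <div> elements.
const words = ['refinement', 'decent', 'elegant', 'extra', 'extra'];
const timing = timeIt(() => words.slice(0, 3).map(word => word.trim()), 10000);
console.log(`${timing.elapsedMs} ms for 10000 runs`);
```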