Parse dblp XML and output sums of publications grouped by year and type
The following Go program parses a gzipped XML file (available here) which contains bibliographic information on computer science publications and has the following indicative structure:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp.dtd">
<dblp>
<article mdate="2017-05-28" key="journals/acta/Saxena96">
<author>Sanjeev Saxena</author>
<title>Parallel Integer Sorting and Simulation Amongst CRCW Models.</title>
<pages>607-619</pages>
<year>1996</year>
<volume>33</volume>
<journal>Acta Inf.</journal>
<number>7</number>
<url>db/journals/acta/acta33.html#Saxena96</url>
<ee>https://doi.org/10.1007/BF03036466</ee>
</article>
<article mdate="2017-05-28" key="journals/acta/Simon83">
<author>Hans Ulrich Simon</author>
<title>Pattern Matching in Trees and Nets.</title>
<pages>227-248</pages>
<year>1983</year>
<volume>20</volume>
<journal>Acta Inf.</journal>
<url>db/journals/acta/acta20.html#Simon83</url>
<ee>https://doi.org/10.1007/BF01257084</ee>
</article>
<article mdate="2017-05-28" key="journals/acta/GoodmanS83">
<author>Nathan Goodman</author>
<author>Oded Shmueli</author>
<title>NP-complete Problems Simplified on Tree Schemas.</title>
<pages>171-178</pages>
<year>1983</year>
<volume>20</volume>
<journal>Acta Inf.</journal>
<url>db/journals/acta/acta20.html#GoodmanS83</url>
<ee>https://doi.org/10.1007/BF00289414</ee>
</article>
</dblp>
The XML contains multiple publication types, denoted by the element name (e.g. proceedings, book, phdthesis), and I have defined a separate struct for each of them in my program:
package main
import (
"compress/gzip"
"encoding/csv"
"encoding/xml"
"fmt"
"io"
"log"
"os"
"sort"
"strconv"
"time"
"golang.org/x/text/encoding/charmap"
)
// Dblp is the top-level element of the dblp xml file
type Dblp struct {
XMLName xml.Name `xml:"dblp"`
Dblp Article
}
// Metadata contains the fields shared by all structs
type Metadata struct {
Key string `xml:"key,attr"` // not currently in use
Year string `xml:"year"`
Author string `xml:"author"` // not currently in use
Title string `xml:"title"` // not currently in use
}
// Article and the structs that follow contain the elements we want to parse; each embeds the Metadata struct defined above
type Article struct {
XMLName xml.Name `xml:"article"`
Metadata
}
type InProceedings struct {
XMLName xml.Name `xml:"inproceedings"`
Metadata
}
type Proceedings struct {
XMLName xml.Name `xml:"proceedings"`
Metadata
}
type Book struct {
XMLName xml.Name `xml:"book"`
Metadata
}
type InCollection struct {
XMLName xml.Name `xml:"incollection"`
Metadata
}
type PhdThesis struct {
XMLName xml.Name `xml:"phdthesis"`
Metadata
}
type MastersThesis struct {
XMLName xml.Name `xml:"mastersthesis"`
Metadata
}
type WWW struct {
XMLName xml.Name `xml:"www"`
Metadata
}
// Record stores each publication's type and year and is used as the value type of map m
type Record struct {
UID int
ID int
Type string
Year string
}
// SumRecord is used to store the aggregated publication counts by type and year in the srMap map
// (the count is stored as the map's int value)
type SumRecord struct {
Type string
Year string
}
The program stores each publication in a map structure and finally exports two csv files:
- result.csv, which contains an id, publication type and year for each publication
- sumresult.csv, which contains the sum of each publication type per year
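The CSV export in the code below follows the standard encoding/csv pattern of writing a header row and then one `[]string` per record. A minimal self-contained form of that pattern, writing to a buffer instead of a file (the `writeRows` helper is mine, not part of the program):

```go
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
)

// writeRows renders a header row plus data rows as CSV into a string.
func writeRows(header []string, rows [][]string) string {
	var buf bytes.Buffer
	w := csv.NewWriter(&buf)
	w.Write(header)
	w.WriteAll(rows) // WriteAll flushes the writer for us
	return buf.String()
}

func main() {
	out := writeRows([]string{"type", "year", "sum"}, [][]string{{"article", "1983", "2"}})
	fmt.Print(out)
	// type,year,sum
	// article,1983,2
}
```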
It is the first "complete" program I've written in Go - I'm currently trying to get a grasp on the language and I've needed to ask two questions on Stack Overflow while writing it here and here.
The rest of the code:
func main() {
// Start counting time
start := time.Now()
// Initialize counter variables for each publication type
var articleCounter, InProceedingsCounter, ProceedingsCounter, BookCounter,
InCollectionCounter, PhdThesisCounter, mastersThesisCounter, wwwCounter int
var i = 1
// Initialize hash map
m := make(map[int]Record)
//Open gzipped dblp xml
xmlFile, err := os.Open("dblp.xml.gz")
gz, err := gzip.NewReader(xmlFile)
if err != nil {
log.Fatal(err)
}
defer gz.Close()
//Directly open xml file for testing purposes if needed - be sure to comment out gzip file opening above
//xmlFile, err := os.Open("dblp.xml")
//xmlFile, err := os.Open("TestDblp.xml")
if err != nil {
fmt.Println(err)
} else {
log.Println("Successfully Opened Dblp XML file")
}
// defer the closing of XML file so that we can parse it later on
defer xmlFile.Close()
// Initialize main object from Dblp struct
var articles Dblp
// Create decoder element
decoder := xml.NewDecoder(gz)
// Suppress xml errors
decoder.Strict = false
decoder.CharsetReader = makeCharsetReader
err = decoder.Decode(&articles.Dblp)
if err != nil {
fmt.Println(err)
}
for {
// Read tokens from the XML document in a stream.
t, err := decoder.Token()
// If we reach the end of the file, we are done
if err == io.EOF {
log.Println("XML successfully parsed:", err)
break
} else if err != nil {
log.Fatalf("Error decoding token: %v", err)
} else if t == nil {
break
}
// Here, we inspect the token
switch se := t.(type) {
// We have the start of an element and the token we created above in t:
case xml.StartElement:
switch se.Name.Local {
case "dblp":
case "article":
var p Article
decoder.DecodeElement(&p, &se)
increment(&articleCounter)
m[i] = Record{i, articleCounter, "article", p.Year}
increment(&i)
case "inproceedings":
var p InProceedings
decoder.DecodeElement(&p, &se)
increment(&InProceedingsCounter)
m[i] = Record{i, InProceedingsCounter, "inproceedings", p.Year}
increment(&i)
case "proceedings":
var p Proceedings
decoder.DecodeElement(&p, &se)
increment(&ProceedingsCounter)
m[i] = Record{i, ProceedingsCounter, "proceedings", p.Year}
increment(&i)
case "book":
var p Book
decoder.DecodeElement(&p, &se)
increment(&BookCounter)
m[i] = Record{i, BookCounter, "book", p.Year}
increment(&i)
case "incollection":
var p InCollection
decoder.DecodeElement(&p, &se)
increment(&InCollectionCounter)
m[i] = Record{i, InCollectionCounter, "incollection", p.Year}
increment(&i)
case "phdthesis":
var p PhdThesis
decoder.DecodeElement(&p, &se)
increment(&PhdThesisCounter)
m[i] = Record{i, PhdThesisCounter, "phdthesis", p.Year}
increment(&i)
case "mastersthesis":
var p MastersThesis
decoder.DecodeElement(&p, &se)
increment(&mastersThesisCounter)
m[i] = Record{i, mastersThesisCounter, "mastersthesis", p.Year}
increment(&i)
case "www":
var p WWW
decoder.DecodeElement(&p, &se)
increment(&wwwCounter)
m[i] = Record{i, wwwCounter, "www", p.Year}
increment(&i)
}
}
}
log.Println("Element parsing completed in:", time.Since(start))
// All parsed elements have been added to m := make(map[int]Record)
// We can start processing the map.
// First we create a map and count the number of occurrences of each publication type for a given year.
srMap := make(map[SumRecord]int)
log.Println("Creating sums by article type per year")
for key := range m {
sr := SumRecord{
Type: m[key].Type,
Year: m[key].Year,
}
srMap[sr]++
}
//// Create sum csv
log.Println("Creating sum results csv file")
sumfile, err := os.Create("sumresult.csv")
checkError("Cannot create file", err)
defer sumfile.Close()
sumwriter := csv.NewWriter(sumfile)
defer sumwriter.Flush()
// define column headers
sumheaders := []string{
"type",
"year",
"sum",
}
sumwriter.Write(sumheaders)
var SumString string
// Create sorted map by VALUE (integer)
SortedSrMap := map[int]SumRecord{}
SortedSrMapKeys := []int{}
for key, val := range SortedSrMap {
// SortedSrMap[val] = key
// SortedSrMapKeys = append(SortedSrMapKeys, val)
SumString = strconv.Itoa(key)
fmt.Println("sumstring:", SumString, "value: ", val)
}
sort.Ints(SortedSrMapKeys)
// END Create sorted map by VALUE (integer)
// Export sum csv
for key, val := range srMap {
r := make([]string, 0, 1+len(sumheaders))
SumString = strconv.Itoa(val)
r = append(
r,
key.Type,
key.Year,
SumString,
)
sumwriter.Write(r)
}
sumwriter.Flush()
// CREATE RESULTS CSV
log.Println("Creating results csv file")
file, err := os.Create("result.csv")
checkError("Cannot create file", err)
defer file.Close()
writer := csv.NewWriter(file)
defer writer.Flush()
// define column headers
headers := []string{
"uid",
"id",
"type",
"year",
}
// write column headers
writer.Write(headers)
var idString string
var uidString string
// Create sorted map
var keys []int
for k := range m {
keys = append(keys, k)
}
sort.Ints(keys)
for _, k := range keys {
r := make([]string, 0, 1+len(headers)) // capacity of 4, 1 + the number of properties our struct has & the number of column headers we are passing
// convert the Record.ID and UID ints to string in order to pass into append()
idString = strconv.Itoa(m[k].ID)
uidString = strconv.Itoa(m[k].UID)
r = append(
r,
uidString,
idString,
m[k].Type,
m[k].Year,
)
writer.Write(r)
}
writer.Flush()
// END CREATE RESULTS CSV
// Finally report results - update below line with more counters as desired
log.Println("Articles:", articleCounter, "inproceedings", InProceedingsCounter, "proceedings:", ProceedingsCounter, "book:", BookCounter, "incollection:", InCollectionCounter, "phdthesis:", PhdThesisCounter, "mastersthesis:", mastersThesisCounter, "www:", wwwCounter)
//log.Println("map:", m)
//log.Println("map length:", len(m))
//log.Println("sum map length:", len(srMap))
//fmt.Println("sum map contents:", srMap)
log.Println("XML parsing and csv export executed in:", time.Since(start))
}
func increment(i *int) {
*i = *i + 1
}
func checkError(message string, err error) {
if err != nil {
log.Fatal(message, err)
}
}
func makeCharsetReader(charset string, input io.Reader) (io.Reader, error) {
if charset == "ISO-8859-1" {
// Windows-1252 is a superset of ISO-8859-1, so it should be ok for this case
return charmap.Windows1252.NewDecoder().Reader(input), nil
}
return nil, fmt.Errorf("Unknown charset: %s", charset)
}
Main problems and issues I've identified:
- The parsing is quite slow (about 3 minutes 45 seconds) given the size of the file (474 MB gzipped). Can I improve something to make it faster?
- Can the code be made less verbose, but not at the expense of making it less readable / understandable to a person just starting out with Go? For example, by generalizing the structs which are used to define the different publication types, as well as the switch/case statements?
Tags: parsing, xml, go
$begingroup$
The following Go program parses a gzipped XML file (available here) which contains bibliographic information on computer science publications and has the following indicative structure:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp.dtd">
<dblp>
<article mdate="2017-05-28" key="journals/acta/Saxena96">
<author>Sanjeev Saxena</author>
<title>Parallel Integer Sorting and Simulation Amongst CRCW Models.</title>
<pages>607-619</pages>
<year>1996</year>
<volume>33</volume>
<journal>Acta Inf.</journal>
<number>7</number>
<url>db/journals/acta/acta33.html#Saxena96</url>
<ee>https://doi.org/10.1007/BF03036466</ee>
</article>
<article mdate="2017-05-28" key="journals/acta/Simon83">
<author>Hans Ulrich Simon</author>
<title>Pattern Matching in Trees and Nets.</title>
<pages>227-248</pages>
<year>1983</year>
<volume>20</volume>
<journal>Acta Inf.</journal>
<url>db/journals/acta/acta20.html#Simon83</url>
<ee>https://doi.org/10.1007/BF01257084</ee>
</article>
<article mdate="2017-05-28" key="journals/acta/GoodmanS83">
<author>Nathan Goodman</author>
<author>Oded Shmueli</author>
<title>NP-complete Problems Simplified on Tree Schemas.</title>
<pages>171-178</pages>
<year>1983</year>
<volume>20</volume>
<journal>Acta Inf.</journal>
<url>db/journals/acta/acta20.html#GoodmanS83</url>
<ee>https://doi.org/10.1007/BF00289414</ee>
</article>
</dblp>
The XML has multiple publication types denoted by the title of the element (i.e. proceedings, book, phdthesis) and for each of which I have defined a separate struct in my program:
package main
import (
"compress/gzip"
"encoding/csv"
"encoding/xml"
"fmt"
"io"
"log"
"os"
"sort"
"strconv"
"time"
"golang.org/x/text/encoding/charmap"
)
// Dblp contains the array of articles in the dblp xml file
type Dblp struct {
XMLName xml.Name `xml:"dblp"`
Dblp Article
}
// Metadata contains the fields shared by all structs
type Metadata struct {
Key string `xml:"key,attr"` // not currently in use
Year string `xml:"year"`
Author string `xml:"author"` // not currently in use
Title string `xml:"title"` // not currently in use
}
// Article struct and the following structs contain the elements we want to parse and they "inherit" the metadata struct defined above
type Article struct {
XMLName xml.Name `xml:"article"`
Metadata
}
type InProceedings struct {
XMLName xml.Name `xml:"inproceedings"`
Metadata
}
type Proceedings struct {
XMLName xml.Name `xml:"proceedings"`
Metadata
}
type Book struct {
XMLName xml.Name `xml:"book"`
Metadata
}
type InCollection struct {
XMLName xml.Name `xml:"incollection"`
Metadata
}
type PhdThesis struct {
XMLName xml.Name `xml:"phdthesis"`
Metadata
}
type MastersThesis struct {
XMLName xml.Name `xml:"mastersthesis"`
Metadata
}
type WWW struct {
XMLName xml.Name `xml:"www"`
Metadata
}
// Record is used to store each Article's type and year which will be passed as a value to map m
type Record struct {
UID int
ID int
Type string
Year string
}
// SumRecord is used to store the aggregated articles by year in srMap map
//(count is stored in the map's int which is used as key)
type SumRecord struct {
Type string
Year string
}
The program stores each publication in a map structure and finally exports two csv files:
- results.csv which contains an id, publication type and year for each publication
- sumresults.csv which contains the sum of each publication type per year
It is the first "complete" program I've written in Go - I'm currently trying to get a grasp on the language and I've needed to ask two questions on Stack Overflow while writing it here and here.
The rest of the code:
func main() {
// Start counting time
start := time.Now()
// Initialize counter variables for each publication type
var articleCounter, InProceedingsCounter, ProceedingsCounter, BookCounter,
InCollectionCounter, PhdThesisCounter, mastersThesisCounter, wwwCounter int
var i = 1
// Initialize hash map
m := make(map[int]Record)
//Open gzipped dblp xml
xmlFile, err := os.Open("dblp.xml.gz")
gz, err := gzip.NewReader(xmlFile)
if err != nil {
log.Fatal(err)
}
defer gz.Close()
//Directly open xml file for testing purposes if needed - be sure to comment out gzip file opening above
//xmlFile, err := os.Open("dblp.xml")
//xmlFile, err := os.Open("TestDblp.xml")
if err != nil {
fmt.Println(err)
} else {
log.Println("Successfully Opened Dblp XML file")
}
// defer the closing of XML file so that we can parse it later on
defer xmlFile.Close()
// Initialize main object from Dblp struct
var articles Dblp
// Create decoder element
decoder := xml.NewDecoder(gz)
// Suppress xml errors
decoder.Strict = false
decoder.CharsetReader = makeCharsetReader
err = decoder.Decode(&articles.Dblp)
if err != nil {
fmt.Println(err)
}
for {
// Read tokens from the XML document in a stream.
t, err := decoder.Token()
// If we reach the end of the file, we are done
if err == io.EOF {
log.Println("XML successfully parsed:", err)
break
} else if err != nil {
log.Fatalf("Error decoding token: %t", err)
} else if t == nil {
break
}
// Here, we inspect the token
switch se := t.(type) {
// We have the start of an element and the token we created above in t:
case xml.StartElement:
switch se.Name.Local {
case "dblp":
case "article":
var p Article
decoder.DecodeElement(&p, &se)
increment(&articleCounter)
m[i] = Record{i, articleCounter, "article", p.Year}
increment(&i)
case "inproceedings":
var p InProceedings
decoder.DecodeElement(&p, &se)
increment(&InProceedingsCounter)
m[i] = Record{i, InProceedingsCounter, "inproceedings", p.Year}
increment(&i)
case "proceedings":
var p Proceedings
decoder.DecodeElement(&p, &se)
increment(&ProceedingsCounter)
m[i] = Record{i, ProceedingsCounter, "proceedings", p.Year}
increment(&i)
case "book":
var p Book
decoder.DecodeElement(&p, &se)
increment(&BookCounter)
m[i] = Record{i, BookCounter, "proceedings", p.Year}
increment(&i)
case "incollection":
var p InCollection
decoder.DecodeElement(&p, &se)
increment(&InCollectionCounter)
m[i] = Record{i, InCollectionCounter, "incollection", p.Year}
increment(&i)
case "phdthesis":
var p PhdThesis
decoder.DecodeElement(&p, &se)
increment(&PhdThesisCounter)
m[i] = Record{i, PhdThesisCounter, "phdthesis", p.Year}
increment(&i)
case "mastersthesis":
var p MastersThesis
decoder.DecodeElement(&p, &se)
increment(&mastersThesisCounter)
m[i] = Record{i, mastersThesisCounter, "mastersthesis", p.Year}
increment(&i)
case "www":
var p WWW
decoder.DecodeElement(&p, &se)
increment(&wwwCounter)
m[i] = Record{i, wwwCounter, "www", p.Year}
increment(&i)
}
}
}
log.Println("Element parsing completed in:", time.Since(start))
// All parsed elements have been added to m := make(map[int]Record)
// We can start processing the map.
// First we create a map and count the number of occurences of each publication type for a given year.
srMap := make(map[SumRecord]int)
log.Println("Creating sums by article type per year")
for key := range m {
sr := SumRecord{
Type: m[key].Type,
Year: m[key].Year,
}
srMap[sr]++
}
//// Create sum csv
log.Println("Creating sum results csv file")
sumfile, err := os.Create("sumresult.csv")
checkError("Cannot create file", err)
defer sumfile.Close()
sumwriter := csv.NewWriter(sumfile)
defer sumwriter.Flush()
// define column headers
sumheaders := string{
"type",
"year",
"sum",
}
sumwriter.Write(sumheaders)
var SumString string
// Create sorted map by VALUE (integer)
SortedSrMap := map[int]SumRecord{}
SortedSrMapKeys := int{}
for key, val := range SortedSrMap {
// SortedSrMap[val] = key
// SortedSrMapKeys = append(SortedSrMapKeys, val)
SumString = strconv.Itoa(key)
fmt.Println("sumstring:", SumString, "value: ", val)
}
sort.Ints(SortedSrMapKeys)
// END Create sorted map by VALUE (integer)
// Export sum csv
for key, val := range srMap {
r := make(string, 0, 1+len(sumheaders))
SumString = strconv.Itoa(val)
r = append(
r,
key.Type,
key.Year,
SumString,
)
sumwriter.Write(r)
}
sumwriter.Flush()
// CREATE RESULTS CSV
log.Println("Creating results csv file")
file, err := os.Create("result.csv")
checkError("Cannot create file", err)
defer file.Close()
writer := csv.NewWriter(file)
defer writer.Flush()
// define column headers
headers := string{
"uid",
"id",
"type",
"year",
}
// write column headers
writer.Write(headers)
var idString string
var uidString string
// Create sorted map
var keys int
for k := range m {
keys = append(keys, k)
}
sort.Ints(keys)
for _, k := range keys {
r := make(string, 0, 1+len(headers)) // capacity of 4, 1 + the number of properties our struct has & the number of column headers we are passing
// convert the Record.ID and UID ints to string in order to pass into append()
idString = strconv.Itoa(m[k].ID)
uidString = strconv.Itoa(m[k].UID)
r = append(
r,
uidString,
idString,
m[k].Type,
m[k].Year,
)
writer.Write(r)
}
writer.Flush()
// END CREATE RESULTS CSV
// Finally report results - update below line with more counters as desired
log.Println("Articles:", articleCounter, "inproceedings", InProceedingsCounter, "proceedings:", ProceedingsCounter, "book:", BookCounter, "incollection:", InCollectionCounter, "phdthesis:", PhdThesisCounter, "mastersthesis:", mastersThesisCounter, "www:", wwwCounter)
//log.Println("map:", m)
//log.Println("map length:", len(m))
//log.Println("sum map length:", len(srMap))
//fmt.Println("sum map contents:", srMap)
log.Println("XML parsing and csv export executed in:", time.Since(start))
}
func increment(i *int) {
*i = *i + 1
}
func checkError(message string, err error) {
if err != nil {
log.Fatal(message, err)
}
}
func makeCharsetReader(charset string, input io.Reader) (io.Reader, error) {
if charset == "ISO-8859-1" {
// Windows-1252 is a superset of ISO-8859-1, so it should be ok for this case
return charmap.Windows1252.NewDecoder().Reader(input), nil
}
return nil, fmt.Errorf("Unknown charset: %s", charset)
}
Main problems and issues I've identified:
- The parsing is quite slow (it takes about 3:45 minutes) given the size of the file (474 Mb gzip). Can I improve something to make it faster?
- Can the code be made less verbose but not at the expense of making it less readable / understandable to a person just starting out with Go? For example, by generalizing the structs which are used to define the different publication types as well as the
case
/switch
statements?
parsing xml go
$endgroup$
add a comment |
$begingroup$
The following Go program parses a gzipped XML file (available here) which contains bibliographic information on computer science publications and has the following indicative structure:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp.dtd">
<dblp>
<article mdate="2017-05-28" key="journals/acta/Saxena96">
<author>Sanjeev Saxena</author>
<title>Parallel Integer Sorting and Simulation Amongst CRCW Models.</title>
<pages>607-619</pages>
<year>1996</year>
<volume>33</volume>
<journal>Acta Inf.</journal>
<number>7</number>
<url>db/journals/acta/acta33.html#Saxena96</url>
<ee>https://doi.org/10.1007/BF03036466</ee>
</article>
<article mdate="2017-05-28" key="journals/acta/Simon83">
<author>Hans Ulrich Simon</author>
<title>Pattern Matching in Trees and Nets.</title>
<pages>227-248</pages>
<year>1983</year>
<volume>20</volume>
<journal>Acta Inf.</journal>
<url>db/journals/acta/acta20.html#Simon83</url>
<ee>https://doi.org/10.1007/BF01257084</ee>
</article>
<article mdate="2017-05-28" key="journals/acta/GoodmanS83">
<author>Nathan Goodman</author>
<author>Oded Shmueli</author>
<title>NP-complete Problems Simplified on Tree Schemas.</title>
<pages>171-178</pages>
<year>1983</year>
<volume>20</volume>
<journal>Acta Inf.</journal>
<url>db/journals/acta/acta20.html#GoodmanS83</url>
<ee>https://doi.org/10.1007/BF00289414</ee>
</article>
</dblp>
The XML has multiple publication types denoted by the title of the element (i.e. proceedings, book, phdthesis) and for each of which I have defined a separate struct in my program:
package main
import (
"compress/gzip"
"encoding/csv"
"encoding/xml"
"fmt"
"io"
"log"
"os"
"sort"
"strconv"
"time"
"golang.org/x/text/encoding/charmap"
)
// Dblp contains the array of articles in the dblp xml file
type Dblp struct {
XMLName xml.Name `xml:"dblp"`
Dblp Article
}
// Metadata contains the fields shared by all structs
type Metadata struct {
Key string `xml:"key,attr"` // not currently in use
Year string `xml:"year"`
Author string `xml:"author"` // not currently in use
Title string `xml:"title"` // not currently in use
}
// Article struct and the following structs contain the elements we want to parse and they "inherit" the metadata struct defined above
type Article struct {
XMLName xml.Name `xml:"article"`
Metadata
}
type InProceedings struct {
XMLName xml.Name `xml:"inproceedings"`
Metadata
}
type Proceedings struct {
XMLName xml.Name `xml:"proceedings"`
Metadata
}
type Book struct {
XMLName xml.Name `xml:"book"`
Metadata
}
type InCollection struct {
XMLName xml.Name `xml:"incollection"`
Metadata
}
type PhdThesis struct {
XMLName xml.Name `xml:"phdthesis"`
Metadata
}
type MastersThesis struct {
XMLName xml.Name `xml:"mastersthesis"`
Metadata
}
type WWW struct {
XMLName xml.Name `xml:"www"`
Metadata
}
// Record is used to store each Article's type and year which will be passed as a value to map m
type Record struct {
UID int
ID int
Type string
Year string
}
// SumRecord is used to store the aggregated articles by year in srMap map
//(count is stored in the map's int which is used as key)
type SumRecord struct {
Type string
Year string
}
The program stores each publication in a map structure and finally exports two csv files:
- results.csv which contains an id, publication type and year for each publication
- sumresults.csv which contains the sum of each publication type per year
It is the first "complete" program I've written in Go - I'm currently trying to get a grasp on the language and I've needed to ask two questions on Stack Overflow while writing it here and here.
The rest of the code:
func main() {
// Start counting time
start := time.Now()
// Initialize counter variables for each publication type
var articleCounter, InProceedingsCounter, ProceedingsCounter, BookCounter,
InCollectionCounter, PhdThesisCounter, mastersThesisCounter, wwwCounter int
var i = 1
// Initialize hash map
m := make(map[int]Record)
//Open gzipped dblp xml
xmlFile, err := os.Open("dblp.xml.gz")
gz, err := gzip.NewReader(xmlFile)
if err != nil {
log.Fatal(err)
}
defer gz.Close()
//Directly open xml file for testing purposes if needed - be sure to comment out gzip file opening above
//xmlFile, err := os.Open("dblp.xml")
//xmlFile, err := os.Open("TestDblp.xml")
if err != nil {
fmt.Println(err)
} else {
log.Println("Successfully Opened Dblp XML file")
}
// defer the closing of XML file so that we can parse it later on
defer xmlFile.Close()
// Initialize main object from Dblp struct
var articles Dblp
// Create decoder element
decoder := xml.NewDecoder(gz)
// Suppress xml errors
decoder.Strict = false
decoder.CharsetReader = makeCharsetReader
err = decoder.Decode(&articles.Dblp)
if err != nil {
fmt.Println(err)
}
for {
// Read tokens from the XML document in a stream.
t, err := decoder.Token()
// If we reach the end of the file, we are done
if err == io.EOF {
log.Println("XML successfully parsed:", err)
break
} else if err != nil {
log.Fatalf("Error decoding token: %t", err)
} else if t == nil {
break
}
// Here, we inspect the token
switch se := t.(type) {
// We have the start of an element and the token we created above in t:
case xml.StartElement:
switch se.Name.Local {
case "dblp":
case "article":
var p Article
decoder.DecodeElement(&p, &se)
increment(&articleCounter)
m[i] = Record{i, articleCounter, "article", p.Year}
increment(&i)
case "inproceedings":
var p InProceedings
decoder.DecodeElement(&p, &se)
increment(&InProceedingsCounter)
m[i] = Record{i, InProceedingsCounter, "inproceedings", p.Year}
increment(&i)
case "proceedings":
var p Proceedings
decoder.DecodeElement(&p, &se)
increment(&ProceedingsCounter)
m[i] = Record{i, ProceedingsCounter, "proceedings", p.Year}
increment(&i)
case "book":
var p Book
decoder.DecodeElement(&p, &se)
increment(&BookCounter)
m[i] = Record{i, BookCounter, "proceedings", p.Year}
increment(&i)
case "incollection":
var p InCollection
decoder.DecodeElement(&p, &se)
increment(&InCollectionCounter)
m[i] = Record{i, InCollectionCounter, "incollection", p.Year}
increment(&i)
case "phdthesis":
var p PhdThesis
decoder.DecodeElement(&p, &se)
increment(&PhdThesisCounter)
m[i] = Record{i, PhdThesisCounter, "phdthesis", p.Year}
increment(&i)
case "mastersthesis":
var p MastersThesis
decoder.DecodeElement(&p, &se)
increment(&mastersThesisCounter)
m[i] = Record{i, mastersThesisCounter, "mastersthesis", p.Year}
increment(&i)
case "www":
var p WWW
decoder.DecodeElement(&p, &se)
increment(&wwwCounter)
m[i] = Record{i, wwwCounter, "www", p.Year}
increment(&i)
}
}
}
log.Println("Element parsing completed in:", time.Since(start))
// All parsed elements have been added to m := make(map[int]Record)
// We can start processing the map.
// First we create a map and count the number of occurences of each publication type for a given year.
srMap := make(map[SumRecord]int)
log.Println("Creating sums by article type per year")
for key := range m {
sr := SumRecord{
Type: m[key].Type,
Year: m[key].Year,
}
srMap[sr]++
}
//// Create sum csv
log.Println("Creating sum results csv file")
sumfile, err := os.Create("sumresult.csv")
checkError("Cannot create file", err)
defer sumfile.Close()
sumwriter := csv.NewWriter(sumfile)
defer sumwriter.Flush()
// define column headers
sumheaders := string{
"type",
"year",
"sum",
}
sumwriter.Write(sumheaders)
var SumString string
// Create sorted map by VALUE (integer)
SortedSrMap := map[int]SumRecord{}
SortedSrMapKeys := int{}
for key, val := range SortedSrMap {
// SortedSrMap[val] = key
// SortedSrMapKeys = append(SortedSrMapKeys, val)
SumString = strconv.Itoa(key)
fmt.Println("sumstring:", SumString, "value: ", val)
}
sort.Ints(SortedSrMapKeys)
// END Create sorted map by VALUE (integer)
// Export sum csv
for key, val := range srMap {
r := make(string, 0, 1+len(sumheaders))
SumString = strconv.Itoa(val)
r = append(
r,
key.Type,
key.Year,
SumString,
)
sumwriter.Write(r)
}
sumwriter.Flush()
// CREATE RESULTS CSV
log.Println("Creating results csv file")
file, err := os.Create("result.csv")
checkError("Cannot create file", err)
defer file.Close()
writer := csv.NewWriter(file)
defer writer.Flush()
// define column headers
headers := string{
"uid",
"id",
"type",
"year",
}
// write column headers
writer.Write(headers)
var idString string
var uidString string
// Create sorted map
var keys int
for k := range m {
keys = append(keys, k)
}
sort.Ints(keys)
for _, k := range keys {
r := make(string, 0, 1+len(headers)) // capacity of 4, 1 + the number of properties our struct has & the number of column headers we are passing
// convert the Record.ID and UID ints to string in order to pass into append()
idString = strconv.Itoa(m[k].ID)
uidString = strconv.Itoa(m[k].UID)
r = append(
r,
uidString,
idString,
m[k].Type,
m[k].Year,
)
writer.Write(r)
}
writer.Flush()
// END CREATE RESULTS CSV
// Finally report results - update below line with more counters as desired
log.Println("Articles:", articleCounter, "inproceedings", InProceedingsCounter, "proceedings:", ProceedingsCounter, "book:", BookCounter, "incollection:", InCollectionCounter, "phdthesis:", PhdThesisCounter, "mastersthesis:", mastersThesisCounter, "www:", wwwCounter)
//log.Println("map:", m)
//log.Println("map length:", len(m))
//log.Println("sum map length:", len(srMap))
//fmt.Println("sum map contents:", srMap)
log.Println("XML parsing and csv export executed in:", time.Since(start))
}
func increment(i *int) {
*i = *i + 1
}
func checkError(message string, err error) {
if err != nil {
log.Fatal(message, err)
}
}
func makeCharsetReader(charset string, input io.Reader) (io.Reader, error) {
if charset == "ISO-8859-1" {
// Windows-1252 is a superset of ISO-8859-1, so it should be ok for this case
return charmap.Windows1252.NewDecoder().Reader(input), nil
}
return nil, fmt.Errorf("Unknown charset: %s", charset)
}
Main problems and issues I've identified:
- The parsing is quite slow (it takes about 3:45 minutes) given the size of the file (474 Mb gzip). Can I improve something to make it faster?
- Can the code be made less verbose but not at the expense of making it less readable / understandable to a person just starting out with Go? For example, by generalizing the structs which are used to define the different publication types as well as the
case
/switch
statements?
parsing xml go
$endgroup$
asked Mar 11 at 15:20 by orestisf, edited by Jamal♦
2 Answers
$begingroup$
The decoder.Decode call is unnecessary and in fact throws an error at the moment.

To your second point, yes: the case statements especially can most likely all be compressed down to a single function, since only a few variables differ between them.

Indexing into a hash map (map[int]Record) is not ideal; with the two million elements in that table it is probably causing a slowdown too. Instead you can simply append the elements to a slice, which is then already in order for iteration later on, with no sorting necessary at all.

As for increment(&i): just go ahead and increment the counters directly. Writing functions is fine, but like this it's not helping with readability (i += 1 is much clearer).

make([]string, 0, 1+len(headers)) is valid, but you can simply create the slice with all its elements instead, like []string{uidString, ..., m[k].Year}. It might be even better if you can reuse that slice across loop iterations.

I can't see any other obvious things to change. There's the possibility that getting rid of DecodeElement and doing the whole decoding yourself might improve things, but I'm skeptical. If I, for example, remove the whole switch block, doing essentially nothing but XML decoding, it still takes three minutes for me, just one minute less than with that block included. Meaning that with this library it's just not going to get much quicker overall.
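The slice-instead-of-map suggestion can be sketched roughly as follows. This is a minimal illustration, not code from the post: the `addRecord` helper and the sample calls are my own, and the per-type counters are folded into a map so the eight separate counter variables disappear too.

```go
package main

import "fmt"

// Record mirrors the struct from the question.
type Record struct {
	UID  int
	ID   int
	Type string
	Year string
}

// addRecord appends a new Record, deriving UID from the slice length and
// the per-type ID from the counts map. Because append preserves insertion
// order, no sort.Ints pass over map keys is needed before the CSV export.
func addRecord(records []Record, counts map[string]int, pubType, year string) []Record {
	counts[pubType]++
	return append(records, Record{
		UID:  len(records) + 1,
		ID:   counts[pubType],
		Type: pubType,
		Year: year,
	})
}

func main() {
	var records []Record
	counts := make(map[string]int)
	// In the real program these calls would sit in the token-reading loop.
	records = addRecord(records, counts, "article", "1996")
	records = addRecord(records, counts, "article", "1983")
	records = addRecord(records, counts, "phdthesis", "2001")
	for _, r := range records {
		fmt.Println(r.UID, r.ID, r.Type, r.Year)
	}
}
```

In the full program the switch arms would each collapse to one `records = addRecord(records, counts, se.Name.Local, p.Year)` line.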
$endgroup$
answered Mar 11 at 23:37 by ferada
$begingroup$
Thanks for taking the time to review the code! I've reworked it based on your feedback: removed decoder.Decode; created a single function to process the elements I'm interested in, which does the increments and the map/slice appends; regarding the increment functions, removing them would indeed make the code a bit more readable, but I want to keep them for now for learning's sake; and I'm working on removing the maps and using a slice instead. I was wondering whether it would be possible to "concatenate" the different structs, as the only difference between them is the xml.Name element.
$endgroup$
– orestisf
Mar 17 at 20:01
$begingroup$
From what I know about the encoding/xml package, I don't think there's anything that would make the structs more succinct, unfortunately. You could switch to a generic nested struct and do the decoding without the struct definitions, though.
$endgroup$
– ferada
Mar 17 at 22:25
$begingroup$
I've revisited the code to clean it up a bit and to follow some of the recommendations as I progress with my understanding of the language.
Main points:
Only two structs are now used:
type Metadata struct {
Key string `xml:"key,attr"`
Year string `xml:"year"`
Author string `xml:"author"`
Title string `xml:"title"`
}
type Record struct {
UID int
ID int
Type string
Year string
}
The publications are all processed with the following function:
func ProcessPublication(i Counter, publicationCounter Counter, publicationType string, publicationYear string, m map[int]Record) {
m[i.Incr()] = Record{i.ReturnInt(), int(publicationCounter.Incr()), publicationType, publicationYear}
}
The entire code looks now like this:
package main
import (
"compress/gzip"
"encoding/csv"
"encoding/xml"
"fmt"
"io"
"log"
"os"
"sort"
"strconv"
"time"
"golang.org/x/text/encoding/charmap"
)
// Metadata contains the fields shared by all structs
type Metadata struct {
Key string `xml:"key,attr"` // currently not in use
Year string `xml:"year"`
Author string `xml:"author"` // currently not in use
Title string `xml:"title"` // currently not in use
}
// Record is used to store each Article's type and year which will be passed as a value to map m
type Record struct {
UID int
ID int
Type string
Year string
}
type Count int
type Counter interface {
Incr() int
ReturnInt() int
}
var articleCounter, InProceedingsCounter, ProceedingsCounter, BookCounter,
InCollectionCounter, PhdThesisCounter, mastersThesisCounter, wwwCounter, i Count
func main() {
start := time.Now()
//Open gzipped dblp xml
//xmlFile, err := os.Open("TestDblp.xml.gz")
// Uncomment below for actual xml
xmlFile, err := os.Open("dblp.xml.gz")
gz, err := gzip.NewReader(xmlFile)
if err != nil {
log.Fatal(err)
} else {
log.Println("Successfully Opened Dblp XML file")
}
defer gz.Close()
// Create decoder element
decoder := xml.NewDecoder(gz)
// Suppress xml errors
decoder.Strict = false
decoder.CharsetReader = makeCharsetReader
if err != nil {
log.Fatal(err)
}
m := make(map[int]Record)
var p Metadata
for {
// Read tokens from the XML document in a stream.
t, err := decoder.Token()
// If we reach the end of the file, we are done with parsing.
if err == io.EOF {
log.Println("XML successfully parsed:", err)
break
} else if err != nil {
log.Fatalf("Error decoding token: %v", err)
} else if t == nil {
break
}
// Let's inspect the token
switch se := t.(type) {
// We have the start of an element and the token we created above in t:
case xml.StartElement:
switch se.Name.Local {
case "article":
decoder.DecodeElement(&p, &se)
ProcessPublication(&i, &articleCounter, se.Name.Local, p.Year, m)
case "inproceedings":
decoder.DecodeElement(&p, &se)
ProcessPublication(&i, &InProceedingsCounter, se.Name.Local, p.Year, m)
case "proceedings":
decoder.DecodeElement(&p, &se)
ProcessPublication(&i, &ProceedingsCounter, se.Name.Local, p.Year, m)
case "book":
decoder.DecodeElement(&p, &se)
ProcessPublication(&i, &BookCounter, se.Name.Local, p.Year, m)
case "incollection":
decoder.DecodeElement(&p, &se)
ProcessPublication(&i, &InCollectionCounter, se.Name.Local, p.Year, m)
case "phdthesis":
decoder.DecodeElement(&p, &se)
ProcessPublication(&i, &PhdThesisCounter, se.Name.Local, p.Year, m)
case "mastersthesis":
decoder.DecodeElement(&p, &se)
ProcessPublication(&i, &mastersThesisCounter, se.Name.Local, p.Year, m)
case "www":
decoder.DecodeElement(&p, &se)
ProcessPublication(&i, &wwwCounter, se.Name.Local, p.Year, m)
}
}
}
log.Println("XML parsing done in:", time.Since(start))
// All parsed elements have been added to m := make(map[int]Record)
// We create srMap map object and count the number of occurences of each publication type for a given year.
srMap := make(map[Record]int)
log.Println("Creating sums by article type per year")
for key := range m {
sr := Record{
Type: m[key].Type,
Year: m[key].Year,
}
srMap[sr]++
}
// Create sumresult.csv
log.Println("Creating sum results csv file")
sumfile, err := os.Create("sumresult.csv")
checkError("Cannot create file", err)
defer sumfile.Close()
sumwriter := csv.NewWriter(sumfile)
defer sumwriter.Flush()
sumheaders := []string{
"publicationType",
"year",
"sum",
}
sumwriter.Write(sumheaders)
// Export sumresult.csv
for key, val := range srMap {
r := make([]string, 0, 1+len(sumheaders))
r = append(
r,
key.Type,
key.Year,
strconv.Itoa(val),
)
sumwriter.Write(r)
}
sumwriter.Flush()
// Create result.csv
log.Println("Creating result.csv")
file, err := os.Create("result.csv")
checkError("Cannot create file", err)
defer file.Close()
writer := csv.NewWriter(file)
defer writer.Flush()
headers := []string{
"uid",
"id",
"type",
"year",
}
writer.Write(headers)
// Create sorted map
var keys []int
for k := range m {
keys = append(keys, k)
}
sort.Ints(keys)
for _, k := range keys {
r := make([]string, 0, 1+len(headers))
r = append(
r,
strconv.Itoa(m[k].UID),
strconv.Itoa(m[k].ID),
m[k].Type,
m[k].Year,
)
writer.Write(r)
}
writer.Flush()
// Finally report results
log.Println("Articles:", articleCounter, "inproceedings", InProceedingsCounter, "proceedings:",
ProceedingsCounter, "book:", BookCounter, "incollection:", InCollectionCounter, "phdthesis:",
PhdThesisCounter, "mastersthesis:", mastersThesisCounter, "www:", wwwCounter)
log.Println("Distinct publication map length:", len(m))
log.Println("Sum map length:", len(srMap))
log.Println("XML parsing and csv export executed in:", time.Since(start))
}
func checkError(message string, err error) {
if err != nil {
log.Fatal(message, err)
}
}
func makeCharsetReader(charset string, input io.Reader) (io.Reader, error) {
if charset == "ISO-8859-1" {
// Windows-1252 is a superset of ISO-8859-1, so it should be ok for correctly decoding the dblp.xml
return charmap.Windows1252.NewDecoder().Reader(input), nil
}
return nil, fmt.Errorf("Unknown charset: %s", charset)
}
func (c *Count) Incr() int {
*c = *c + 1
return int(*c)
}
func (c *Count) ReturnInt() int {
return int(*c)
}
func ProcessPublication(i Counter, publicationCounter Counter, publicationType string, publicationYear string, m map[int]Record) {
m[i.Incr()] = Record{i.ReturnInt(), int(publicationCounter.Incr()), publicationType, publicationYear}
}
I feel that the csv generation parts can be further streamlined as they are still a bit messy.
$endgroup$
$begingroup$
I've revisited the code to clean it up a bit and to follow some of the recommendations as I progress with my understanding of the language.
Main points:
Only two structs are now used:
type Metadata struct {
Key string `xml:"key,attr"`
Year string `xml:"year"`
Author string `xml:"author"`
Title string `xml:"title"`
}
type Record struct {
UID int
ID int
Type string
Year string
}
The publications are all processed with the following function:
func ProcessPublication(i Counter, publicationCounter Counter, publicationType string, publicationYear string, m map[int]Record) {
m[i.Incr()] = Record{i.ReturnInt(), int(publicationCounter.Incr()), publicationType, publicationYear}
}
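To see how the Counter interface drives this in isolation (a standalone sketch using the same Count/Counter types as the full listing below), note that the pointer-receiver methods mean a *Count mutates in place when called through the interface:

```go
package main

import "fmt"

// Count is an int whose pointer-receiver methods satisfy Counter.
type Count int

type Counter interface {
	Incr() int
	ReturnInt() int
}

// Incr increments the counter and returns the new value.
func (c *Count) Incr() int {
	*c = *c + 1
	return int(*c)
}

// ReturnInt returns the current value without modifying it.
func (c *Count) ReturnInt() int {
	return int(*c)
}

func main() {
	var global, articles Count
	var i Counter = &global
	var a Counter = &articles

	// Each publication bumps both the global counter (the map key / UID)
	// and its per-type counter (the ID).
	fmt.Println(i.Incr(), a.Incr()) // 1 1
	fmt.Println(i.Incr(), a.Incr()) // 2 2
	fmt.Println(i.ReturnInt())      // 2
}
```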
The entire code looks now like this:
package main
import (
"compress/gzip"
"encoding/csv"
"encoding/xml"
"fmt"
"io"
"log"
"os"
"sort"
"strconv"
"time"
"golang.org/x/text/encoding/charmap"
)
// Metadata contains the fields shared by all structs
type Metadata struct {
Key string `xml:"key,attr"` // currently not in use
Year string `xml:"year"`
Author string `xml:"author"` // currently not in use
Title string `xml:"title"` // currently not in use
}
// Record stores each publication's type and year; it is the value type of map m
type Record struct {
UID int
ID int
Type string
Year string
}
type Count int
type Counter interface {
Incr() int
ReturnInt() int
}
var articleCounter, InProceedingsCounter, ProceedingsCounter, BookCounter,
InCollectionCounter, PhdThesisCounter, mastersThesisCounter, wwwCounter, i Count
func main() {
start := time.Now()
//Open gzipped dblp xml
//xmlFile, err := os.Open("TestDblp.xml.gz")
// Uncomment below for actual xml
xmlFile, err := os.Open("dblp.xml.gz")
gz, err := gzip.NewReader(xmlFile)
if err != nil {
log.Fatal(err)
} else {
log.Println("Successfully Opened Dblp XML file")
}
defer gz.Close()
// Create decoder element
decoder := xml.NewDecoder(gz)
// Suppress xml errors
decoder.Strict = false
decoder.CharsetReader = makeCharsetReader
if err != nil {
log.Fatal(err)
}
m := make(map[int]Record)
var p Metadata
for {
// Read tokens from the XML document in a stream.
t, err := decoder.Token()
// If we reach the end of the file, we are done with parsing.
if err == io.EOF {
log.Println("XML successfully parsed:", err)
break
} else if err != nil {
log.Fatalf("Error decoding token: %v", err)
} else if t == nil {
break
}
// Let's inspect the token
switch se := t.(type) {
// We have the start of an element and the token we created above in t:
case xml.StartElement:
switch se.Name.Local {
case "article":
decoder.DecodeElement(&p, &se)
ProcessPublication(&i, &articleCounter, se.Name.Local, p.Year, m)
case "inproceedings":
decoder.DecodeElement(&p, &se)
ProcessPublication(&i, &InProceedingsCounter, se.Name.Local, p.Year, m)
case "proceedings":
decoder.DecodeElement(&p, &se)
ProcessPublication(&i, &ProceedingsCounter, se.Name.Local, p.Year, m)
case "book":
decoder.DecodeElement(&p, &se)
ProcessPublication(&i, &BookCounter, se.Name.Local, p.Year, m)
case "incollection":
decoder.DecodeElement(&p, &se)
ProcessPublication(&i, &InCollectionCounter, se.Name.Local, p.Year, m)
case "phdthesis":
decoder.DecodeElement(&p, &se)
ProcessPublication(&i, &PhdThesisCounter, se.Name.Local, p.Year, m)
case "mastersthesis":
decoder.DecodeElement(&p, &se)
ProcessPublication(&i, &mastersThesisCounter, se.Name.Local, p.Year, m)
case "www":
decoder.DecodeElement(&p, &se)
ProcessPublication(&i, &wwwCounter, se.Name.Local, p.Year, m)
}
}
}
log.Println("XML parsing done in:", time.Since(start))
// All parsed elements have been added to m := make(map[int]Record)
// We create the srMap map and count the number of occurrences of each publication type for a given year.
srMap := make(map[Record]int)
log.Println("Creating sums by article type per year")
for key := range m {
sr := Record{
Type: m[key].Type,
Year: m[key].Year,
}
srMap[sr]++
}
// Create sumresult.csv
log.Println("Creating sum results csv file")
sumfile, err := os.Create("sumresult.csv")
checkError("Cannot create file", err)
defer sumfile.Close()
sumwriter := csv.NewWriter(sumfile)
defer sumwriter.Flush()
sumheaders := []string{
"publicationType",
"year",
"sum",
}
sumwriter.Write(sumheaders)
// Export sumresult.csv
for key, val := range srMap {
r := make([]string, 0, len(sumheaders))
r = append(
r,
key.Type,
key.Year,
strconv.Itoa(val),
)
sumwriter.Write(r)
}
sumwriter.Flush()
// Create result.csv
log.Println("Creating result.csv")
file, err := os.Create("result.csv")
checkError("Cannot create file", err)
defer file.Close()
writer := csv.NewWriter(file)
defer writer.Flush()
headers := []string{
"uid",
"id",
"type",
"year",
}
writer.Write(headers)
// Create sorted map
var keys []int
for k := range m {
keys = append(keys, k)
}
sort.Ints(keys)
for _, k := range keys {
r := make([]string, 0, len(headers))
r = append(
r,
strconv.Itoa(m[k].UID),
strconv.Itoa(m[k].ID),
m[k].Type,
m[k].Year,
)
writer.Write(r)
}
writer.Flush()
// Finally report results
log.Println("Articles:", articleCounter, "inproceedings", InProceedingsCounter, "proceedings:",
ProceedingsCounter, "book:", BookCounter, "incollection:", InCollectionCounter, "phdthesis:",
PhdThesisCounter, "mastersthesis:", mastersThesisCounter, "www:", wwwCounter)
log.Println("Distinct publication map length:", len(m))
log.Println("Sum map length:", len(srMap))
log.Println("XML parsing and csv export executed in:", time.Since(start))
}
func checkError(message string, err error) {
if err != nil {
log.Fatal(message, err)
}
}
func makeCharsetReader(charset string, input io.Reader) (io.Reader, error) {
if charset == "ISO-8859-1" {
// Windows-1252 is a superset of ISO-8859-1, so it should be ok for correctly decoding the dblp.xml
return charmap.Windows1252.NewDecoder().Reader(input), nil
}
return nil, fmt.Errorf("Unknown charset: %s", charset)
}
func (c *Count) Incr() int {
*c = *c + 1
return int(*c)
}
func (c *Count) ReturnInt() int {
return int(*c)
}
func ProcessPublication(i Counter, publicationCounter Counter, publicationType string, publicationYear string, m map[int]Record) {
m[i.Incr()] = Record{i.ReturnInt(), int(publicationCounter.Incr()), publicationType, publicationYear}
}
I feel that the csv generation parts can be further streamlined as they are still a bit messy.
$endgroup$
answered yesterday
– orestisf
Thanks for contributing an answer to Code Review Stack Exchange!