Parse dblp XML and output sums of publications grouped by year and type












5












$begingroup$


The following Go program parses a gzipped XML file (available here) which contains bibliographic information on computer science publications and has the following indicative structure:



<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp.dtd">
<dblp>
<article mdate="2017-05-28" key="journals/acta/Saxena96">
<author>Sanjeev Saxena</author>
<title>Parallel Integer Sorting and Simulation Amongst CRCW Models.</title>
<pages>607-619</pages>
<year>1996</year>
<volume>33</volume>
<journal>Acta Inf.</journal>
<number>7</number>
<url>db/journals/acta/acta33.html#Saxena96</url>
<ee>https://doi.org/10.1007/BF03036466</ee>
</article>
<article mdate="2017-05-28" key="journals/acta/Simon83">
<author>Hans Ulrich Simon</author>
<title>Pattern Matching in Trees and Nets.</title>
<pages>227-248</pages>
<year>1983</year>
<volume>20</volume>
<journal>Acta Inf.</journal>
<url>db/journals/acta/acta20.html#Simon83</url>
<ee>https://doi.org/10.1007/BF01257084</ee>
</article>
<article mdate="2017-05-28" key="journals/acta/GoodmanS83">
<author>Nathan Goodman</author>
<author>Oded Shmueli</author>
<title>NP-complete Problems Simplified on Tree Schemas.</title>
<pages>171-178</pages>
<year>1983</year>
<volume>20</volume>
<journal>Acta Inf.</journal>
<url>db/journals/acta/acta20.html#GoodmanS83</url>
<ee>https://doi.org/10.1007/BF00289414</ee>
</article>
</dblp>


The XML has multiple publication types denoted by the title of the element (i.e. proceedings, book, phdthesis) and for each of which I have defined a separate struct in my program:



package main

import (
"compress/gzip"
"encoding/csv"
"encoding/xml"
"fmt"
"io"
"log"
"os"
"sort"
"strconv"
"time"

"golang.org/x/text/encoding/charmap"
)

// Dblp contains the array of articles in the dblp xml file
type Dblp struct {
XMLName xml.Name `xml:"dblp"`
Dblp Article
}

// Metadata contains the fields shared by all structs
type Metadata struct {
Key string `xml:"key,attr"` // not currently in use
Year string `xml:"year"`
Author string `xml:"author"` // not currently in use
Title string `xml:"title"` // not currently in use
}

// Article struct and the following structs contain the elements we want to parse and they "inherit" the metadata struct defined above
type Article struct {
XMLName xml.Name `xml:"article"`
Metadata
}

type InProceedings struct {
XMLName xml.Name `xml:"inproceedings"`
Metadata
}

type Proceedings struct {
XMLName xml.Name `xml:"proceedings"`
Metadata
}

type Book struct {
XMLName xml.Name `xml:"book"`
Metadata
}

type InCollection struct {
XMLName xml.Name `xml:"incollection"`
Metadata
}

type PhdThesis struct {
XMLName xml.Name `xml:"phdthesis"`
Metadata
}

type MastersThesis struct {
XMLName xml.Name `xml:"mastersthesis"`
Metadata
}

type WWW struct {
XMLName xml.Name `xml:"www"`
Metadata
}

// Record is used to store each Article's type and year which will be passed as a value to map m
type Record struct {
UID int
ID int
Type string
Year string
}

// SumRecord is used to store the aggregated articles by year in srMap map
//(count is stored in the map's int which is used as key)
type SumRecord struct {
Type string
Year string
}


The program stores each publication in a map structure and finally exports two csv files:




  • results.csv which contains an id, publication type and year for each publication

  • sumresults.csv which contains the sum of each publication type per year


It is the first "complete" program I've written in Go - I'm currently trying to get a grasp on the language and I've needed to ask two questions on Stack Overflow while writing it here and here.



The rest of the code:



func main() {
// Start counting time
start := time.Now()

// Initialize counter variables for each publication type
var articleCounter, InProceedingsCounter, ProceedingsCounter, BookCounter,
InCollectionCounter, PhdThesisCounter, mastersThesisCounter, wwwCounter int
var i = 1

// Initialize hash map
m := make(map[int]Record)

//Open gzipped dblp xml
xmlFile, err := os.Open("dblp.xml.gz")
gz, err := gzip.NewReader(xmlFile)
if err != nil {
log.Fatal(err)

}
defer gz.Close()

//Directly open xml file for testing purposes if needed - be sure to comment out gzip file opening above
//xmlFile, err := os.Open("dblp.xml")
//xmlFile, err := os.Open("TestDblp.xml")
if err != nil {
fmt.Println(err)
} else {
log.Println("Successfully Opened Dblp XML file")
}

// defer the closing of XML file so that we can parse it later on
defer xmlFile.Close()

// Initialize main object from Dblp struct
var articles Dblp

// Create decoder element
decoder := xml.NewDecoder(gz)

// Suppress xml errors
decoder.Strict = false
decoder.CharsetReader = makeCharsetReader
err = decoder.Decode(&articles.Dblp)
if err != nil {
fmt.Println(err)
}

for {
// Read tokens from the XML document in a stream.
t, err := decoder.Token()

// If we reach the end of the file, we are done
if err == io.EOF {
log.Println("XML successfully parsed:", err)
break
} else if err != nil {
log.Fatalf("Error decoding token: %t", err)
} else if t == nil {
break
}

// Here, we inspect the token
switch se := t.(type) {

// We have the start of an element and the token we created above in t:
case xml.StartElement:
switch se.Name.Local {
case "dblp":

case "article":
var p Article
decoder.DecodeElement(&p, &se)
increment(&articleCounter)
m[i] = Record{i, articleCounter, "article", p.Year}
increment(&i)

case "inproceedings":
var p InProceedings
decoder.DecodeElement(&p, &se)
increment(&InProceedingsCounter)
m[i] = Record{i, InProceedingsCounter, "inproceedings", p.Year}
increment(&i)

case "proceedings":
var p Proceedings
decoder.DecodeElement(&p, &se)
increment(&ProceedingsCounter)
m[i] = Record{i, ProceedingsCounter, "proceedings", p.Year}
increment(&i)

case "book":
var p Book
decoder.DecodeElement(&p, &se)
increment(&BookCounter)
m[i] = Record{i, BookCounter, "proceedings", p.Year}
increment(&i)

case "incollection":
var p InCollection
decoder.DecodeElement(&p, &se)
increment(&InCollectionCounter)
m[i] = Record{i, InCollectionCounter, "incollection", p.Year}
increment(&i)

case "phdthesis":
var p PhdThesis
decoder.DecodeElement(&p, &se)
increment(&PhdThesisCounter)
m[i] = Record{i, PhdThesisCounter, "phdthesis", p.Year}
increment(&i)

case "mastersthesis":
var p MastersThesis
decoder.DecodeElement(&p, &se)
increment(&mastersThesisCounter)
m[i] = Record{i, mastersThesisCounter, "mastersthesis", p.Year}
increment(&i)

case "www":
var p WWW
decoder.DecodeElement(&p, &se)
increment(&wwwCounter)
m[i] = Record{i, wwwCounter, "www", p.Year}
increment(&i)
}
}
}
log.Println("Element parsing completed in:", time.Since(start))

// All parsed elements have been added to m := make(map[int]Record)
// We can start processing the map.
// First we create a map and count the number of occurences of each publication type for a given year.

srMap := make(map[SumRecord]int)
log.Println("Creating sums by article type per year")
for key := range m {
sr := SumRecord{
Type: m[key].Type,
Year: m[key].Year,
}
srMap[sr]++
}

//// Create sum csv
log.Println("Creating sum results csv file")
sumfile, err := os.Create("sumresult.csv")
checkError("Cannot create file", err)
defer sumfile.Close()
sumwriter := csv.NewWriter(sumfile)
defer sumwriter.Flush()

// define column headers
sumheaders := string{
"type",
"year",
"sum",
}

sumwriter.Write(sumheaders)
var SumString string

// Create sorted map by VALUE (integer)

SortedSrMap := map[int]SumRecord{}
SortedSrMapKeys := int{}
for key, val := range SortedSrMap {
// SortedSrMap[val] = key
// SortedSrMapKeys = append(SortedSrMapKeys, val)
SumString = strconv.Itoa(key)
fmt.Println("sumstring:", SumString, "value: ", val)
}
sort.Ints(SortedSrMapKeys)

// END Create sorted map by VALUE (integer)

// Export sum csv
for key, val := range srMap {
r := make(string, 0, 1+len(sumheaders))
SumString = strconv.Itoa(val)
r = append(
r,
key.Type,
key.Year,
SumString,
)
sumwriter.Write(r)
}
sumwriter.Flush()

// CREATE RESULTS CSV
log.Println("Creating results csv file")
file, err := os.Create("result.csv")
checkError("Cannot create file", err)
defer file.Close()
writer := csv.NewWriter(file)
defer writer.Flush()

// define column headers
headers := string{
"uid",
"id",
"type",
"year",
}

// write column headers
writer.Write(headers)

var idString string
var uidString string

// Create sorted map
var keys int
for k := range m {
keys = append(keys, k)
}
sort.Ints(keys)

for _, k := range keys {

r := make(string, 0, 1+len(headers)) // capacity of 4, 1 + the number of properties our struct has & the number of column headers we are passing

// convert the Record.ID and UID ints to string in order to pass into append()
idString = strconv.Itoa(m[k].ID)
uidString = strconv.Itoa(m[k].UID)

r = append(
r,
uidString,
idString,
m[k].Type,
m[k].Year,
)
writer.Write(r)
}
writer.Flush()

// END CREATE RESULTS CSV

// Finally report results - update below line with more counters as desired
log.Println("Articles:", articleCounter, "inproceedings", InProceedingsCounter, "proceedings:", ProceedingsCounter, "book:", BookCounter, "incollection:", InCollectionCounter, "phdthesis:", PhdThesisCounter, "mastersthesis:", mastersThesisCounter, "www:", wwwCounter)
//log.Println("map:", m)
//log.Println("map length:", len(m))
//log.Println("sum map length:", len(srMap))
//fmt.Println("sum map contents:", srMap)
log.Println("XML parsing and csv export executed in:", time.Since(start))
}

func increment(i *int) {
*i = *i + 1
}

func checkError(message string, err error) {
if err != nil {
log.Fatal(message, err)
}
}

func makeCharsetReader(charset string, input io.Reader) (io.Reader, error) {
if charset == "ISO-8859-1" {
// Windows-1252 is a superset of ISO-8859-1, so it should be ok for this case
return charmap.Windows1252.NewDecoder().Reader(input), nil
}
return nil, fmt.Errorf("Unknown charset: %s", charset)
}


Main problems and issues I've identified:




  • The parsing is quite slow (it takes about 3:45 minutes) given the size of the file (474 Mb gzip). Can I improve something to make it faster?

  • Can the code be made less verbose but not at the expense of making it less readable / understandable to a person just starting out with Go? For example, by generalizing the structs which are used to define the different publication types as well as the case / switch statements?










share|improve this question











$endgroup$

















    5












    $begingroup$


    The following Go program parses a gzipped XML file (available here) which contains bibliographic information on computer science publications and has the following indicative structure:



    <?xml version="1.0" encoding="ISO-8859-1"?>
    <!DOCTYPE dblp SYSTEM "dblp.dtd">
    <dblp>
    <article mdate="2017-05-28" key="journals/acta/Saxena96">
    <author>Sanjeev Saxena</author>
    <title>Parallel Integer Sorting and Simulation Amongst CRCW Models.</title>
    <pages>607-619</pages>
    <year>1996</year>
    <volume>33</volume>
    <journal>Acta Inf.</journal>
    <number>7</number>
    <url>db/journals/acta/acta33.html#Saxena96</url>
    <ee>https://doi.org/10.1007/BF03036466</ee>
    </article>
    <article mdate="2017-05-28" key="journals/acta/Simon83">
    <author>Hans Ulrich Simon</author>
    <title>Pattern Matching in Trees and Nets.</title>
    <pages>227-248</pages>
    <year>1983</year>
    <volume>20</volume>
    <journal>Acta Inf.</journal>
    <url>db/journals/acta/acta20.html#Simon83</url>
    <ee>https://doi.org/10.1007/BF01257084</ee>
    </article>
    <article mdate="2017-05-28" key="journals/acta/GoodmanS83">
    <author>Nathan Goodman</author>
    <author>Oded Shmueli</author>
    <title>NP-complete Problems Simplified on Tree Schemas.</title>
    <pages>171-178</pages>
    <year>1983</year>
    <volume>20</volume>
    <journal>Acta Inf.</journal>
    <url>db/journals/acta/acta20.html#GoodmanS83</url>
    <ee>https://doi.org/10.1007/BF00289414</ee>
    </article>
    </dblp>


    The XML has multiple publication types denoted by the title of the element (i.e. proceedings, book, phdthesis) and for each of which I have defined a separate struct in my program:



    package main

    import (
    "compress/gzip"
    "encoding/csv"
    "encoding/xml"
    "fmt"
    "io"
    "log"
    "os"
    "sort"
    "strconv"
    "time"

    "golang.org/x/text/encoding/charmap"
    )

    // Dblp contains the array of articles in the dblp xml file
    type Dblp struct {
    XMLName xml.Name `xml:"dblp"`
    Dblp Article
    }

    // Metadata contains the fields shared by all structs
    type Metadata struct {
    Key string `xml:"key,attr"` // not currently in use
    Year string `xml:"year"`
    Author string `xml:"author"` // not currently in use
    Title string `xml:"title"` // not currently in use
    }

    // Article struct and the following structs contain the elements we want to parse and they "inherit" the metadata struct defined above
    type Article struct {
    XMLName xml.Name `xml:"article"`
    Metadata
    }

    type InProceedings struct {
    XMLName xml.Name `xml:"inproceedings"`
    Metadata
    }

    type Proceedings struct {
    XMLName xml.Name `xml:"proceedings"`
    Metadata
    }

    type Book struct {
    XMLName xml.Name `xml:"book"`
    Metadata
    }

    type InCollection struct {
    XMLName xml.Name `xml:"incollection"`
    Metadata
    }

    type PhdThesis struct {
    XMLName xml.Name `xml:"phdthesis"`
    Metadata
    }

    type MastersThesis struct {
    XMLName xml.Name `xml:"mastersthesis"`
    Metadata
    }

    type WWW struct {
    XMLName xml.Name `xml:"www"`
    Metadata
    }

    // Record is used to store each Article's type and year which will be passed as a value to map m
    type Record struct {
    UID int
    ID int
    Type string
    Year string
    }

    // SumRecord is used to store the aggregated articles by year in srMap map
    //(count is stored in the map's int which is used as key)
    type SumRecord struct {
    Type string
    Year string
    }


    The program stores each publication in a map structure and finally exports two csv files:




    • results.csv which contains an id, publication type and year for each publication

    • sumresults.csv which contains the sum of each publication type per year


    It is the first "complete" program I've written in Go - I'm currently trying to get a grasp on the language and I've needed to ask two questions on Stack Overflow while writing it here and here.



    The rest of the code:



    func main() {
    // Start counting time
    start := time.Now()

    // Initialize counter variables for each publication type
    var articleCounter, InProceedingsCounter, ProceedingsCounter, BookCounter,
    InCollectionCounter, PhdThesisCounter, mastersThesisCounter, wwwCounter int
    var i = 1

    // Initialize hash map
    m := make(map[int]Record)

    //Open gzipped dblp xml
    xmlFile, err := os.Open("dblp.xml.gz")
    gz, err := gzip.NewReader(xmlFile)
    if err != nil {
    log.Fatal(err)

    }
    defer gz.Close()

    //Directly open xml file for testing purposes if needed - be sure to comment out gzip file opening above
    //xmlFile, err := os.Open("dblp.xml")
    //xmlFile, err := os.Open("TestDblp.xml")
    if err != nil {
    fmt.Println(err)
    } else {
    log.Println("Successfully Opened Dblp XML file")
    }

    // defer the closing of XML file so that we can parse it later on
    defer xmlFile.Close()

    // Initialize main object from Dblp struct
    var articles Dblp

    // Create decoder element
    decoder := xml.NewDecoder(gz)

    // Suppress xml errors
    decoder.Strict = false
    decoder.CharsetReader = makeCharsetReader
    err = decoder.Decode(&articles.Dblp)
    if err != nil {
    fmt.Println(err)
    }

    for {
    // Read tokens from the XML document in a stream.
    t, err := decoder.Token()

    // If we reach the end of the file, we are done
    if err == io.EOF {
    log.Println("XML successfully parsed:", err)
    break
    } else if err != nil {
    log.Fatalf("Error decoding token: %t", err)
    } else if t == nil {
    break
    }

    // Here, we inspect the token
    switch se := t.(type) {

    // We have the start of an element and the token we created above in t:
    case xml.StartElement:
    switch se.Name.Local {
    case "dblp":

    case "article":
    var p Article
    decoder.DecodeElement(&p, &se)
    increment(&articleCounter)
    m[i] = Record{i, articleCounter, "article", p.Year}
    increment(&i)

    case "inproceedings":
    var p InProceedings
    decoder.DecodeElement(&p, &se)
    increment(&InProceedingsCounter)
    m[i] = Record{i, InProceedingsCounter, "inproceedings", p.Year}
    increment(&i)

    case "proceedings":
    var p Proceedings
    decoder.DecodeElement(&p, &se)
    increment(&ProceedingsCounter)
    m[i] = Record{i, ProceedingsCounter, "proceedings", p.Year}
    increment(&i)

    case "book":
    var p Book
    decoder.DecodeElement(&p, &se)
    increment(&BookCounter)
    m[i] = Record{i, BookCounter, "proceedings", p.Year}
    increment(&i)

    case "incollection":
    var p InCollection
    decoder.DecodeElement(&p, &se)
    increment(&InCollectionCounter)
    m[i] = Record{i, InCollectionCounter, "incollection", p.Year}
    increment(&i)

    case "phdthesis":
    var p PhdThesis
    decoder.DecodeElement(&p, &se)
    increment(&PhdThesisCounter)
    m[i] = Record{i, PhdThesisCounter, "phdthesis", p.Year}
    increment(&i)

    case "mastersthesis":
    var p MastersThesis
    decoder.DecodeElement(&p, &se)
    increment(&mastersThesisCounter)
    m[i] = Record{i, mastersThesisCounter, "mastersthesis", p.Year}
    increment(&i)

    case "www":
    var p WWW
    decoder.DecodeElement(&p, &se)
    increment(&wwwCounter)
    m[i] = Record{i, wwwCounter, "www", p.Year}
    increment(&i)
    }
    }
    }
    log.Println("Element parsing completed in:", time.Since(start))

    // All parsed elements have been added to m := make(map[int]Record)
    // We can start processing the map.
    // First we create a map and count the number of occurences of each publication type for a given year.

    srMap := make(map[SumRecord]int)
    log.Println("Creating sums by article type per year")
    for key := range m {
    sr := SumRecord{
    Type: m[key].Type,
    Year: m[key].Year,
    }
    srMap[sr]++
    }

    //// Create sum csv
    log.Println("Creating sum results csv file")
    sumfile, err := os.Create("sumresult.csv")
    checkError("Cannot create file", err)
    defer sumfile.Close()
    sumwriter := csv.NewWriter(sumfile)
    defer sumwriter.Flush()

    // define column headers
    sumheaders := string{
    "type",
    "year",
    "sum",
    }

    sumwriter.Write(sumheaders)
    var SumString string

    // Create sorted map by VALUE (integer)

    SortedSrMap := map[int]SumRecord{}
    SortedSrMapKeys := int{}
    for key, val := range SortedSrMap {
    // SortedSrMap[val] = key
    // SortedSrMapKeys = append(SortedSrMapKeys, val)
    SumString = strconv.Itoa(key)
    fmt.Println("sumstring:", SumString, "value: ", val)
    }
    sort.Ints(SortedSrMapKeys)

    // END Create sorted map by VALUE (integer)

    // Export sum csv
    for key, val := range srMap {
    r := make(string, 0, 1+len(sumheaders))
    SumString = strconv.Itoa(val)
    r = append(
    r,
    key.Type,
    key.Year,
    SumString,
    )
    sumwriter.Write(r)
    }
    sumwriter.Flush()

    // CREATE RESULTS CSV
    log.Println("Creating results csv file")
    file, err := os.Create("result.csv")
    checkError("Cannot create file", err)
    defer file.Close()
    writer := csv.NewWriter(file)
    defer writer.Flush()

    // define column headers
    headers := string{
    "uid",
    "id",
    "type",
    "year",
    }

    // write column headers
    writer.Write(headers)

    var idString string
    var uidString string

    // Create sorted map
    var keys int
    for k := range m {
    keys = append(keys, k)
    }
    sort.Ints(keys)

    for _, k := range keys {

    r := make(string, 0, 1+len(headers)) // capacity of 4, 1 + the number of properties our struct has & the number of column headers we are passing

    // convert the Record.ID and UID ints to string in order to pass into append()
    idString = strconv.Itoa(m[k].ID)
    uidString = strconv.Itoa(m[k].UID)

    r = append(
    r,
    uidString,
    idString,
    m[k].Type,
    m[k].Year,
    )
    writer.Write(r)
    }
    writer.Flush()

    // END CREATE RESULTS CSV

    // Finally report results - update below line with more counters as desired
    log.Println("Articles:", articleCounter, "inproceedings", InProceedingsCounter, "proceedings:", ProceedingsCounter, "book:", BookCounter, "incollection:", InCollectionCounter, "phdthesis:", PhdThesisCounter, "mastersthesis:", mastersThesisCounter, "www:", wwwCounter)
    //log.Println("map:", m)
    //log.Println("map length:", len(m))
    //log.Println("sum map length:", len(srMap))
    //fmt.Println("sum map contents:", srMap)
    log.Println("XML parsing and csv export executed in:", time.Since(start))
    }

    func increment(i *int) {
    *i = *i + 1
    }

    func checkError(message string, err error) {
    if err != nil {
    log.Fatal(message, err)
    }
    }

    func makeCharsetReader(charset string, input io.Reader) (io.Reader, error) {
    if charset == "ISO-8859-1" {
    // Windows-1252 is a superset of ISO-8859-1, so it should be ok for this case
    return charmap.Windows1252.NewDecoder().Reader(input), nil
    }
    return nil, fmt.Errorf("Unknown charset: %s", charset)
    }


    Main problems and issues I've identified:




    • The parsing is quite slow (it takes about 3:45 minutes) given the size of the file (474 Mb gzip). Can I improve something to make it faster?

    • Can the code be made less verbose but not at the expense of making it less readable / understandable to a person just starting out with Go? For example, by generalizing the structs which are used to define the different publication types as well as the case / switch statements?










    share|improve this question











    $endgroup$















      5












      5








      5





      $begingroup$


      The following Go program parses a gzipped XML file (available here) which contains bibliographic information on computer science publications and has the following indicative structure:



      <?xml version="1.0" encoding="ISO-8859-1"?>
      <!DOCTYPE dblp SYSTEM "dblp.dtd">
      <dblp>
      <article mdate="2017-05-28" key="journals/acta/Saxena96">
      <author>Sanjeev Saxena</author>
      <title>Parallel Integer Sorting and Simulation Amongst CRCW Models.</title>
      <pages>607-619</pages>
      <year>1996</year>
      <volume>33</volume>
      <journal>Acta Inf.</journal>
      <number>7</number>
      <url>db/journals/acta/acta33.html#Saxena96</url>
      <ee>https://doi.org/10.1007/BF03036466</ee>
      </article>
      <article mdate="2017-05-28" key="journals/acta/Simon83">
      <author>Hans Ulrich Simon</author>
      <title>Pattern Matching in Trees and Nets.</title>
      <pages>227-248</pages>
      <year>1983</year>
      <volume>20</volume>
      <journal>Acta Inf.</journal>
      <url>db/journals/acta/acta20.html#Simon83</url>
      <ee>https://doi.org/10.1007/BF01257084</ee>
      </article>
      <article mdate="2017-05-28" key="journals/acta/GoodmanS83">
      <author>Nathan Goodman</author>
      <author>Oded Shmueli</author>
      <title>NP-complete Problems Simplified on Tree Schemas.</title>
      <pages>171-178</pages>
      <year>1983</year>
      <volume>20</volume>
      <journal>Acta Inf.</journal>
      <url>db/journals/acta/acta20.html#GoodmanS83</url>
      <ee>https://doi.org/10.1007/BF00289414</ee>
      </article>
      </dblp>


      The XML has multiple publication types denoted by the title of the element (i.e. proceedings, book, phdthesis) and for each of which I have defined a separate struct in my program:



      package main

      import (
      "compress/gzip"
      "encoding/csv"
      "encoding/xml"
      "fmt"
      "io"
      "log"
      "os"
      "sort"
      "strconv"
      "time"

      "golang.org/x/text/encoding/charmap"
      )

      // Dblp contains the array of articles in the dblp xml file
      type Dblp struct {
      XMLName xml.Name `xml:"dblp"`
      Dblp Article
      }

      // Metadata contains the fields shared by all structs
      type Metadata struct {
      Key string `xml:"key,attr"` // not currently in use
      Year string `xml:"year"`
      Author string `xml:"author"` // not currently in use
      Title string `xml:"title"` // not currently in use
      }

      // Article struct and the following structs contain the elements we want to parse and they "inherit" the metadata struct defined above
      type Article struct {
      XMLName xml.Name `xml:"article"`
      Metadata
      }

      type InProceedings struct {
      XMLName xml.Name `xml:"inproceedings"`
      Metadata
      }

      type Proceedings struct {
      XMLName xml.Name `xml:"proceedings"`
      Metadata
      }

      type Book struct {
      XMLName xml.Name `xml:"book"`
      Metadata
      }

      type InCollection struct {
      XMLName xml.Name `xml:"incollection"`
      Metadata
      }

      type PhdThesis struct {
      XMLName xml.Name `xml:"phdthesis"`
      Metadata
      }

      type MastersThesis struct {
      XMLName xml.Name `xml:"mastersthesis"`
      Metadata
      }

      type WWW struct {
      XMLName xml.Name `xml:"www"`
      Metadata
      }

      // Record is used to store each Article's type and year which will be passed as a value to map m
      type Record struct {
      UID int
      ID int
      Type string
      Year string
      }

      // SumRecord is used to store the aggregated articles by year in srMap map
      //(count is stored in the map's int which is used as key)
      type SumRecord struct {
      Type string
      Year string
      }


      The program stores each publication in a map structure and finally exports two csv files:




      • results.csv which contains an id, publication type and year for each publication

      • sumresults.csv which contains the sum of each publication type per year


      It is the first "complete" program I've written in Go - I'm currently trying to get a grasp on the language and I've needed to ask two questions on Stack Overflow while writing it here and here.



      The rest of the code:



      func main() {
      // Start counting time
      start := time.Now()

      // Initialize counter variables for each publication type
      var articleCounter, InProceedingsCounter, ProceedingsCounter, BookCounter,
      InCollectionCounter, PhdThesisCounter, mastersThesisCounter, wwwCounter int
      var i = 1

      // Initialize hash map
      m := make(map[int]Record)

      //Open gzipped dblp xml
      xmlFile, err := os.Open("dblp.xml.gz")
      gz, err := gzip.NewReader(xmlFile)
      if err != nil {
      log.Fatal(err)

      }
      defer gz.Close()

      //Directly open xml file for testing purposes if needed - be sure to comment out gzip file opening above
      //xmlFile, err := os.Open("dblp.xml")
      //xmlFile, err := os.Open("TestDblp.xml")
      if err != nil {
      fmt.Println(err)
      } else {
      log.Println("Successfully Opened Dblp XML file")
      }

      // defer the closing of XML file so that we can parse it later on
      defer xmlFile.Close()

      // Initialize main object from Dblp struct
      var articles Dblp

      // Create decoder element
      decoder := xml.NewDecoder(gz)

      // Suppress xml errors
      decoder.Strict = false
      decoder.CharsetReader = makeCharsetReader
      err = decoder.Decode(&articles.Dblp)
      if err != nil {
      fmt.Println(err)
      }

      for {
      // Read tokens from the XML document in a stream.
      t, err := decoder.Token()

      // If we reach the end of the file, we are done
      if err == io.EOF {
      log.Println("XML successfully parsed:", err)
      break
      } else if err != nil {
      log.Fatalf("Error decoding token: %t", err)
      } else if t == nil {
      break
      }

      // Here, we inspect the token
      switch se := t.(type) {

      // We have the start of an element and the token we created above in t:
      case xml.StartElement:
      switch se.Name.Local {
      case "dblp":

      case "article":
      var p Article
      decoder.DecodeElement(&p, &se)
      increment(&articleCounter)
      m[i] = Record{i, articleCounter, "article", p.Year}
      increment(&i)

      case "inproceedings":
      var p InProceedings
      decoder.DecodeElement(&p, &se)
      increment(&InProceedingsCounter)
      m[i] = Record{i, InProceedingsCounter, "inproceedings", p.Year}
      increment(&i)

      case "proceedings":
      var p Proceedings
      decoder.DecodeElement(&p, &se)
      increment(&ProceedingsCounter)
      m[i] = Record{i, ProceedingsCounter, "proceedings", p.Year}
      increment(&i)

      case "book":
      var p Book
      decoder.DecodeElement(&p, &se)
      increment(&BookCounter)
      m[i] = Record{i, BookCounter, "proceedings", p.Year}
      increment(&i)

      case "incollection":
      var p InCollection
      decoder.DecodeElement(&p, &se)
      increment(&InCollectionCounter)
      m[i] = Record{i, InCollectionCounter, "incollection", p.Year}
      increment(&i)

      case "phdthesis":
      var p PhdThesis
      decoder.DecodeElement(&p, &se)
      increment(&PhdThesisCounter)
      m[i] = Record{i, PhdThesisCounter, "phdthesis", p.Year}
      increment(&i)

      case "mastersthesis":
      var p MastersThesis
      decoder.DecodeElement(&p, &se)
      increment(&mastersThesisCounter)
      m[i] = Record{i, mastersThesisCounter, "mastersthesis", p.Year}
      increment(&i)

      case "www":
      var p WWW
      decoder.DecodeElement(&p, &se)
      increment(&wwwCounter)
      m[i] = Record{i, wwwCounter, "www", p.Year}
      increment(&i)
      }
      }
      }
      log.Println("Element parsing completed in:", time.Since(start))

      // All parsed elements have been added to m := make(map[int]Record)
      // We can start processing the map.
      // First we create a map and count the number of occurences of each publication type for a given year.

      srMap := make(map[SumRecord]int)
      log.Println("Creating sums by article type per year")
      for key := range m {
      sr := SumRecord{
      Type: m[key].Type,
      Year: m[key].Year,
      }
      srMap[sr]++
      }

      //// Create sum csv
      log.Println("Creating sum results csv file")
      sumfile, err := os.Create("sumresult.csv")
      checkError("Cannot create file", err)
      defer sumfile.Close()
      sumwriter := csv.NewWriter(sumfile)
      defer sumwriter.Flush()

      // define column headers
      sumheaders := string{
      "type",
      "year",
      "sum",
      }

      sumwriter.Write(sumheaders)
      var SumString string

      // Create sorted map by VALUE (integer)

      SortedSrMap := map[int]SumRecord{}
      SortedSrMapKeys := int{}
      for key, val := range SortedSrMap {
      // SortedSrMap[val] = key
      // SortedSrMapKeys = append(SortedSrMapKeys, val)
      SumString = strconv.Itoa(key)
      fmt.Println("sumstring:", SumString, "value: ", val)
      }
      sort.Ints(SortedSrMapKeys)

      // END Create sorted map by VALUE (integer)

      // Export sum csv
      for key, val := range srMap {
      r := make(string, 0, 1+len(sumheaders))
      SumString = strconv.Itoa(val)
      r = append(
      r,
      key.Type,
      key.Year,
      SumString,
      )
      sumwriter.Write(r)
      }
      sumwriter.Flush()

      // CREATE RESULTS CSV
      log.Println("Creating results csv file")
      file, err := os.Create("result.csv")
      checkError("Cannot create file", err)
      defer file.Close()
      writer := csv.NewWriter(file)
      defer writer.Flush()

      // define column headers
      headers := string{
      "uid",
      "id",
      "type",
      "year",
      }

      // write column headers
      writer.Write(headers)

      var idString string
      var uidString string

      // Create sorted map
      var keys int
      for k := range m {
      keys = append(keys, k)
      }
      sort.Ints(keys)

      for _, k := range keys {

      r := make(string, 0, 1+len(headers)) // capacity of 4, 1 + the number of properties our struct has & the number of column headers we are passing

      // convert the Record.ID and UID ints to string in order to pass into append()
      idString = strconv.Itoa(m[k].ID)
      uidString = strconv.Itoa(m[k].UID)

      r = append(
      r,
      uidString,
      idString,
      m[k].Type,
      m[k].Year,
      )
      writer.Write(r)
      }
      writer.Flush()

      // END CREATE RESULTS CSV

      // Finally report results - update below line with more counters as desired
      log.Println("Articles:", articleCounter, "inproceedings", InProceedingsCounter, "proceedings:", ProceedingsCounter, "book:", BookCounter, "incollection:", InCollectionCounter, "phdthesis:", PhdThesisCounter, "mastersthesis:", mastersThesisCounter, "www:", wwwCounter)
      //log.Println("map:", m)
      //log.Println("map length:", len(m))
      //log.Println("sum map length:", len(srMap))
      //fmt.Println("sum map contents:", srMap)
      log.Println("XML parsing and csv export executed in:", time.Since(start))
      }

      func increment(i *int) {
      *i = *i + 1
      }

      func checkError(message string, err error) {
      if err != nil {
      log.Fatal(message, err)
      }
      }

      func makeCharsetReader(charset string, input io.Reader) (io.Reader, error) {
      if charset == "ISO-8859-1" {
      // Windows-1252 is a superset of ISO-8859-1, so it should be ok for this case
      return charmap.Windows1252.NewDecoder().Reader(input), nil
      }
      return nil, fmt.Errorf("Unknown charset: %s", charset)
      }


      Main problems and issues I've identified:




      • The parsing is quite slow (it takes about 3:45 minutes) given the size of the file (474 Mb gzip). Can I improve something to make it faster?

      • Can the code be made less verbose but not at the expense of making it less readable / understandable to a person just starting out with Go? For example, by generalizing the structs which are used to define the different publication types as well as the case / switch statements?










      share|improve this question











      $endgroup$




      The following Go program parses a gzipped XML file (available here) which contains bibliographic information on computer science publications and has the following indicative structure:



      <?xml version="1.0" encoding="ISO-8859-1"?>
      <!DOCTYPE dblp SYSTEM "dblp.dtd">
      <dblp>
      <article mdate="2017-05-28" key="journals/acta/Saxena96">
      <author>Sanjeev Saxena</author>
      <title>Parallel Integer Sorting and Simulation Amongst CRCW Models.</title>
      <pages>607-619</pages>
      <year>1996</year>
      <volume>33</volume>
      <journal>Acta Inf.</journal>
      <number>7</number>
      <url>db/journals/acta/acta33.html#Saxena96</url>
      <ee>https://doi.org/10.1007/BF03036466</ee>
      </article>
      <article mdate="2017-05-28" key="journals/acta/Simon83">
      <author>Hans Ulrich Simon</author>
      <title>Pattern Matching in Trees and Nets.</title>
      <pages>227-248</pages>
      <year>1983</year>
      <volume>20</volume>
      <journal>Acta Inf.</journal>
      <url>db/journals/acta/acta20.html#Simon83</url>
      <ee>https://doi.org/10.1007/BF01257084</ee>
      </article>
      <article mdate="2017-05-28" key="journals/acta/GoodmanS83">
      <author>Nathan Goodman</author>
      <author>Oded Shmueli</author>
      <title>NP-complete Problems Simplified on Tree Schemas.</title>
      <pages>171-178</pages>
      <year>1983</year>
      <volume>20</volume>
      <journal>Acta Inf.</journal>
      <url>db/journals/acta/acta20.html#GoodmanS83</url>
      <ee>https://doi.org/10.1007/BF00289414</ee>
      </article>
      </dblp>


      The XML has multiple publication types denoted by the title of the element (i.e. proceedings, book, phdthesis) and for each of which I have defined a separate struct in my program:



      package main

      import (
      "compress/gzip"
      "encoding/csv"
      "encoding/xml"
      "fmt"
      "io"
      "log"
      "os"
      "sort"
      "strconv"
      "time"

      "golang.org/x/text/encoding/charmap"
      )

      // Dblp contains the array of articles in the dblp xml file
      type Dblp struct {
      XMLName xml.Name `xml:"dblp"`
      Dblp Article
      }

      // Metadata contains the fields shared by all structs
      type Metadata struct {
      Key string `xml:"key,attr"` // not currently in use
      Year string `xml:"year"`
      Author string `xml:"author"` // not currently in use
      Title string `xml:"title"` // not currently in use
      }

      // Article struct and the following structs contain the elements we want to parse and they "inherit" the metadata struct defined above
      type Article struct {
      XMLName xml.Name `xml:"article"`
      Metadata
      }

      type InProceedings struct {
      XMLName xml.Name `xml:"inproceedings"`
      Metadata
      }

      type Proceedings struct {
      XMLName xml.Name `xml:"proceedings"`
      Metadata
      }

      type Book struct {
      XMLName xml.Name `xml:"book"`
      Metadata
      }

      type InCollection struct {
      XMLName xml.Name `xml:"incollection"`
      Metadata
      }

      type PhdThesis struct {
      XMLName xml.Name `xml:"phdthesis"`
      Metadata
      }

      type MastersThesis struct {
      XMLName xml.Name `xml:"mastersthesis"`
      Metadata
      }

      type WWW struct {
      XMLName xml.Name `xml:"www"`
      Metadata
      }

      // Record is used to store each Article's type and year which will be passed as a value to map m
      type Record struct {
      UID int
      ID int
      Type string
      Year string
      }

      // SumRecord is used to store the aggregated articles by year in srMap map
      //(count is stored in the map's int which is used as key)
      type SumRecord struct {
      Type string
      Year string
      }


      The program stores each publication in a map structure and finally exports two csv files:




      • results.csv which contains an id, publication type and year for each publication

      • sumresults.csv which contains the sum of each publication type per year


      It is the first "complete" program I've written in Go - I'm currently trying to get a grasp on the language and I've needed to ask two questions on Stack Overflow while writing it here and here.



      The rest of the code:



      func main() {
      // Start counting time
      start := time.Now()

      // Initialize counter variables for each publication type
      var articleCounter, InProceedingsCounter, ProceedingsCounter, BookCounter,
      InCollectionCounter, PhdThesisCounter, mastersThesisCounter, wwwCounter int
      var i = 1

      // Initialize hash map
      m := make(map[int]Record)

      //Open gzipped dblp xml
      xmlFile, err := os.Open("dblp.xml.gz")
      gz, err := gzip.NewReader(xmlFile)
      if err != nil {
      log.Fatal(err)

      }
      defer gz.Close()

      //Directly open xml file for testing purposes if needed - be sure to comment out gzip file opening above
      //xmlFile, err := os.Open("dblp.xml")
      //xmlFile, err := os.Open("TestDblp.xml")
      if err != nil {
      fmt.Println(err)
      } else {
      log.Println("Successfully Opened Dblp XML file")
      }

      // defer the closing of XML file so that we can parse it later on
      defer xmlFile.Close()

      // Initialize main object from Dblp struct
      var articles Dblp

      // Create decoder element
      decoder := xml.NewDecoder(gz)

      // Suppress xml errors
      decoder.Strict = false
      decoder.CharsetReader = makeCharsetReader
      err = decoder.Decode(&articles.Dblp)
      if err != nil {
      fmt.Println(err)
      }

      for {
      // Read tokens from the XML document in a stream.
      t, err := decoder.Token()

      // If we reach the end of the file, we are done
      if err == io.EOF {
      log.Println("XML successfully parsed:", err)
      break
      } else if err != nil {
      log.Fatalf("Error decoding token: %t", err)
      } else if t == nil {
      break
      }

      // Here, we inspect the token
      switch se := t.(type) {

      // We have the start of an element and the token we created above in t:
      case xml.StartElement:
      switch se.Name.Local {
      case "dblp":

      case "article":
      var p Article
      decoder.DecodeElement(&p, &se)
      increment(&articleCounter)
      m[i] = Record{i, articleCounter, "article", p.Year}
      increment(&i)

      case "inproceedings":
      var p InProceedings
      decoder.DecodeElement(&p, &se)
      increment(&InProceedingsCounter)
      m[i] = Record{i, InProceedingsCounter, "inproceedings", p.Year}
      increment(&i)

      case "proceedings":
      var p Proceedings
      decoder.DecodeElement(&p, &se)
      increment(&ProceedingsCounter)
      m[i] = Record{i, ProceedingsCounter, "proceedings", p.Year}
      increment(&i)

      case "book":
      var p Book
      decoder.DecodeElement(&p, &se)
      increment(&BookCounter)
      m[i] = Record{i, BookCounter, "proceedings", p.Year}
      increment(&i)

      case "incollection":
      var p InCollection
      decoder.DecodeElement(&p, &se)
      increment(&InCollectionCounter)
      m[i] = Record{i, InCollectionCounter, "incollection", p.Year}
      increment(&i)

      case "phdthesis":
      var p PhdThesis
      decoder.DecodeElement(&p, &se)
      increment(&PhdThesisCounter)
      m[i] = Record{i, PhdThesisCounter, "phdthesis", p.Year}
      increment(&i)

      case "mastersthesis":
      var p MastersThesis
      decoder.DecodeElement(&p, &se)
      increment(&mastersThesisCounter)
      m[i] = Record{i, mastersThesisCounter, "mastersthesis", p.Year}
      increment(&i)

      case "www":
      var p WWW
      decoder.DecodeElement(&p, &se)
      increment(&wwwCounter)
      m[i] = Record{i, wwwCounter, "www", p.Year}
      increment(&i)
      }
      }
      }
      log.Println("Element parsing completed in:", time.Since(start))

      // All parsed elements have been added to m := make(map[int]Record)
      // We can start processing the map.
      // First we create a map and count the number of occurences of each publication type for a given year.

      srMap := make(map[SumRecord]int)
      log.Println("Creating sums by article type per year")
      for key := range m {
      sr := SumRecord{
      Type: m[key].Type,
      Year: m[key].Year,
      }
      srMap[sr]++
      }

      //// Create sum csv
      log.Println("Creating sum results csv file")
      sumfile, err := os.Create("sumresult.csv")
      checkError("Cannot create file", err)
      defer sumfile.Close()
      sumwriter := csv.NewWriter(sumfile)
      defer sumwriter.Flush()

      // define column headers
      sumheaders := string{
      "type",
      "year",
      "sum",
      }

      sumwriter.Write(sumheaders)
      var SumString string

      // Create sorted map by VALUE (integer)

      SortedSrMap := map[int]SumRecord{}
      SortedSrMapKeys := int{}
      for key, val := range SortedSrMap {
      // SortedSrMap[val] = key
      // SortedSrMapKeys = append(SortedSrMapKeys, val)
      SumString = strconv.Itoa(key)
      fmt.Println("sumstring:", SumString, "value: ", val)
      }
      sort.Ints(SortedSrMapKeys)

      // END Create sorted map by VALUE (integer)

      // Export sum csv
      for key, val := range srMap {
      r := make(string, 0, 1+len(sumheaders))
      SumString = strconv.Itoa(val)
      r = append(
      r,
      key.Type,
      key.Year,
      SumString,
      )
      sumwriter.Write(r)
      }
      sumwriter.Flush()

      // CREATE RESULTS CSV
      log.Println("Creating results csv file")
      file, err := os.Create("result.csv")
      checkError("Cannot create file", err)
      defer file.Close()
      writer := csv.NewWriter(file)
      defer writer.Flush()

      // define column headers
      headers := string{
      "uid",
      "id",
      "type",
      "year",
      }

      // write column headers
      writer.Write(headers)

      var idString string
      var uidString string

      // Create sorted map
      var keys int
      for k := range m {
      keys = append(keys, k)
      }
      sort.Ints(keys)

      for _, k := range keys {

      r := make(string, 0, 1+len(headers)) // capacity of 4, 1 + the number of properties our struct has & the number of column headers we are passing

      // convert the Record.ID and UID ints to string in order to pass into append()
      idString = strconv.Itoa(m[k].ID)
      uidString = strconv.Itoa(m[k].UID)

      r = append(
      r,
      uidString,
      idString,
      m[k].Type,
      m[k].Year,
      )
      writer.Write(r)
      }
      writer.Flush()

      // END CREATE RESULTS CSV

      // Finally report results - update below line with more counters as desired
      log.Println("Articles:", articleCounter, "inproceedings", InProceedingsCounter, "proceedings:", ProceedingsCounter, "book:", BookCounter, "incollection:", InCollectionCounter, "phdthesis:", PhdThesisCounter, "mastersthesis:", mastersThesisCounter, "www:", wwwCounter)
      //log.Println("map:", m)
      //log.Println("map length:", len(m))
      //log.Println("sum map length:", len(srMap))
      //fmt.Println("sum map contents:", srMap)
      log.Println("XML parsing and csv export executed in:", time.Since(start))
      }

      func increment(i *int) {
      *i = *i + 1
      }

      func checkError(message string, err error) {
      if err != nil {
      log.Fatal(message, err)
      }
      }

      func makeCharsetReader(charset string, input io.Reader) (io.Reader, error) {
      if charset == "ISO-8859-1" {
      // Windows-1252 is a superset of ISO-8859-1, so it should be ok for this case
      return charmap.Windows1252.NewDecoder().Reader(input), nil
      }
      return nil, fmt.Errorf("Unknown charset: %s", charset)
      }


      Main problems and issues I've identified:




      • The parsing is quite slow (it takes about 3:45 minutes) given the size of the file (474 Mb gzip). Can I improve something to make it faster?

      • Can the code be made less verbose but not at the expense of making it less readable / understandable to a person just starting out with Go? For example, by generalizing the structs which are used to define the different publication types as well as the case / switch statements?







      parsing xml go






      share|improve this question















      share|improve this question













      share|improve this question




      share|improve this question








      edited 58 mins ago









      Jamal

      30.4k11121227




      30.4k11121227










      asked Mar 11 at 15:20









      orestisforestisf

      285




      285






















          2 Answers
          2






          active

          oldest

          votes


















          1












          $begingroup$

          The decoder.Decode call is unnecessary and in fact throws an error at
          the moment.



          To your second point, yes, especially the case statements can all be
          compressed down to a single function most likely, since they all only
          have a few variables exchanged.



          The indexing into a hash map map[int]Record is not ideal, in fact
          that's probably causing a slowdown too with the two million elements in
          that table, instead you can simply append the elements to a slice and
          then it's all sorted and fine for iteration later on, no sorting
          necessary at all.



          And for increment(&i) ... just go ahead and increment the counters.
          If you make functions, okay, but like this it's not helping with
          readability (i += 1 is much clearer).



          make(string, 0, 1+len(headers) - well that's valid, but you can
          simply create the array with all elements instead, like
          string{uidString, ..., m[k].Year etc. Might be even better if you
          can reuse that array for all loop iterations.



          Well I can't see any other obvious things to change. There's the
          possibility that getting rid of DecodeElement and doing the whole
          decoding yourself might improve things, but I'm skeptical. If I, for
          example, remove the whole switch block, doing nothing but XML
          decoding essentially, this still takes three minutes for me, essentially
          just one minute less than with that block included! Meaning that with
          this library it's just not going to get much quicker overall.






          share|improve this answer









          $endgroup$









          • 1




            $begingroup$
            Thanks for taking the time to review the code! I've reworked it based on your feedback: - Removed decoder.Decode. - Created single function to process the elements I'm interested in which will do the increments / map / slice appends. - For the increment functions, indeed they would make the code a bit more readable, however I want to keep them for now for learning's sake. - Working on removing the maps and using a slice instead. I was wondering whether it would be possible to "concatenate" the different structs, as the only difference about them is the xml.Name element.
            $endgroup$
            – orestisf
            Mar 17 at 20:01






          • 1




            $begingroup$
            From what I know about the encoding/xml package I don't think there's anything to make more succinct about the structs unfortunately. You could go to a generic nested struct and to the decoding though without the struct definitions.
            $endgroup$
            – ferada
            Mar 17 at 22:25



















          0












          $begingroup$

          I've revisited the code to clean it up a bit and to follow some of the recommendations as I progress with my understanding of the language.



          Main points:



          Only two structs are now used:



          type Metadata struct {
          Key string `xml:"key,attr"`
          Year string `xml:"year"`
          Author string `xml:"author"`
          Title string `xml:"title"`
          }

          type Record struct {
          UID int
          ID int
          Type string
          Year string
          }


          The publications are all processed with the following function:



          func ProcessPublication(i Counter, publicationCounter Counter, publicationType string, publicationYear string, m map[int]Record) {
          m[i.Incr()] = Record{i.ReturnInt(), int(publicationCounter.Incr()), publicationType, publicationYear}
          }


          The entire code looks now like this:



          package main

          import (
          "compress/gzip"
          "encoding/csv"
          "encoding/xml"
          "fmt"
          "io"
          "log"
          "os"
          "sort"
          "strconv"
          "time"

          "golang.org/x/text/encoding/charmap"
          )

          // Metadata contains the fields shared by all structs
          type Metadata struct {
          Key string `xml:"key,attr"` // currently not in use
          Year string `xml:"year"`
          Author string `xml:"author"` // currently not in use
          Title string `xml:"title"` // currently not in use
          }

          // Record is used to store each Article's type and year which will be passed as a value to map m
          type Record struct {
          UID int
          ID int
          Type string
          Year string
          }

          type Count int

          type Counter interface {
          Incr() int
          ReturnInt() int
          }

          var articleCounter, InProceedingsCounter, ProceedingsCounter, BookCounter,
          InCollectionCounter, PhdThesisCounter, mastersThesisCounter, wwwCounter, i Count

          func main() {
          start := time.Now()

          //Open gzipped dblp xml
          //xmlFile, err := os.Open("TestDblp.xml.gz")
          // Uncomment below for actual xml
          xmlFile, err := os.Open("dblp.xml.gz")
          gz, err := gzip.NewReader(xmlFile)
          if err != nil {
          log.Fatal(err)

          } else {
          log.Println("Successfully Opened Dblp XML file")
          }

          defer gz.Close()

          // Create decoder element
          decoder := xml.NewDecoder(gz)

          // Suppress xml errors
          decoder.Strict = false
          decoder.CharsetReader = makeCharsetReader
          if err != nil {
          log.Fatal(err)
          }

          m := make(map[int]Record)
          var p Metadata

          for {
          // Read tokens from the XML document in a stream.
          t, err := decoder.Token()

          // If we reach the end of the file, we are done with parsing.
          if err == io.EOF {
          log.Println("XML successfully parsed:", err)
          break
          } else if err != nil {
          log.Fatalf("Error decoding token: %t", err)
          } else if t == nil {
          break
          }

          // Let's inspect the token
          switch se := t.(type) {

          // We have the start of an element and the token we created above in t:
          case xml.StartElement:
          switch se.Name.Local {

          case "article":
          decoder.DecodeElement(&p, &se)
          ProcessPublication(&i, &articleCounter, se.Name.Local, p.Year, m)

          case "inproceedings":
          decoder.DecodeElement(&p, &se)
          ProcessPublication(&i, &InProceedingsCounter, se.Name.Local, p.Year, m)

          case "proceedings":
          decoder.DecodeElement(&p, &se)
          ProcessPublication(&i, &ProceedingsCounter, se.Name.Local, p.Year, m)

          case "book":
          decoder.DecodeElement(&p, &se)
          ProcessPublication(&i, &BookCounter, se.Name.Local, p.Year, m)

          case "incollection":
          decoder.DecodeElement(&p, &se)
          ProcessPublication(&i, &InCollectionCounter, se.Name.Local, p.Year, m)

          case "phdthesis":
          decoder.DecodeElement(&p, &se)
          ProcessPublication(&i, &PhdThesisCounter, se.Name.Local, p.Year, m)

          case "mastersthesis":
          decoder.DecodeElement(&p, &se)
          ProcessPublication(&i, &mastersThesisCounter, se.Name.Local, p.Year, m)

          case "www":
          decoder.DecodeElement(&p, &se)
          ProcessPublication(&i, &wwwCounter, se.Name.Local, p.Year, m)
          }
          }
          }
          log.Println("XML parsing done in:", time.Since(start))

          // All parsed elements have been added to m := make(map[int]Record)
          // We create srMap map object and count the number of occurences of each publication type for a given year.

          srMap := make(map[Record]int)
          log.Println("Creating sums by article type per year")
          for key := range m {
          sr := Record{
          Type: m[key].Type,
          Year: m[key].Year,
          }
          srMap[sr]++
          }

          // Create sumresult.csv
          log.Println("Creating sum results csv file")
          sumfile, err := os.Create("sumresult.csv")
          checkError("Cannot create file", err)
          defer sumfile.Close()

          sumwriter := csv.NewWriter(sumfile)
          defer sumwriter.Flush()

          sumheaders := string{
          "publicationType",
          "year",
          "sum",
          }

          sumwriter.Write(sumheaders)

          // Export sumresult.csv
          for key, val := range srMap {
          r := make(string, 0, 1+len(sumheaders))
          r = append(
          r,
          key.Type,
          key.Year,
          strconv.Itoa(val),
          )
          sumwriter.Write(r)
          }
          sumwriter.Flush()

          // Create result.csv
          log.Println("Creating result.csv")

          file, err := os.Create("result.csv")
          checkError("Cannot create file", err)
          defer file.Close()

          writer := csv.NewWriter(file)
          defer writer.Flush()

          headers := string{
          "uid",
          "id",
          "type",
          "year",
          }

          writer.Write(headers)

          // Create sorted map
          var keys int
          for k := range m {
          keys = append(keys, k)
          }
          sort.Ints(keys)

          for _, k := range keys {

          r := make(string, 0, 1+len(headers))
          r = append(
          r,
          strconv.Itoa(m[k].UID),
          strconv.Itoa(m[k].ID),
          m[k].Type,
          m[k].Year,
          )
          writer.Write(r)
          }
          writer.Flush()

          // Finally report results
          log.Println("Articles:", articleCounter, "inproceedings", InProceedingsCounter, "proceedings:",
          ProceedingsCounter, "book:", BookCounter, "incollection:", InCollectionCounter, "phdthesis:",
          PhdThesisCounter, "mastersthesis:", mastersThesisCounter, "www:", wwwCounter)
          log.Println("Distinct publication map length:", len(m))
          log.Println("Sum map length:", len(srMap))
          log.Println("XML parsing and csv export executed in:", time.Since(start))
          }

          func checkError(message string, err error) {
          if err != nil {
          log.Fatal(message, err)
          }
          }

          func makeCharsetReader(charset string, input io.Reader) (io.Reader, error) {
          if charset == "ISO-8859-1" {
          // Windows-1252 is a superset of ISO-8859-1, so it should be ok for correctly decoding the dblp.xml
          return charmap.Windows1252.NewDecoder().Reader(input), nil
          }
          return nil, fmt.Errorf("Unknown charset: %s", charset)
          }

          func (c *Count) Incr() int {
          *c = *c + 1
          return int(*c)
          }

          func (c *Count) ReturnInt() int {
          return int(*c)
          }

          func ProcessPublication(i Counter, publicationCounter Counter, publicationType string, publicationYear string, m map[int]Record) {
          m[i.Incr()] = Record{i.ReturnInt(), int(publicationCounter.Incr()), publicationType, publicationYear}
          }


          I feel that the csv generation parts can be further streamlined as they are still a bit messy.






          share|improve this answer









          $endgroup$













            Your Answer





            StackExchange.ifUsing("editor", function () {
            return StackExchange.using("mathjaxEditing", function () {
            StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
            StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
            });
            });
            }, "mathjax-editing");

            StackExchange.ifUsing("editor", function () {
            StackExchange.using("externalEditor", function () {
            StackExchange.using("snippets", function () {
            StackExchange.snippets.init();
            });
            });
            }, "code-snippets");

            StackExchange.ready(function() {
            var channelOptions = {
            tags: "".split(" "),
            id: "196"
            };
            initTagRenderer("".split(" "), "".split(" "), channelOptions);

            StackExchange.using("externalEditor", function() {
            // Have to fire editor after snippets, if snippets enabled
            if (StackExchange.settings.snippets.snippetsEnabled) {
            StackExchange.using("snippets", function() {
            createEditor();
            });
            }
            else {
            createEditor();
            }
            });

            function createEditor() {
            StackExchange.prepareEditor({
            heartbeatType: 'answer',
            autoActivateHeartbeat: false,
            convertImagesToLinks: false,
            noModals: true,
            showLowRepImageUploadWarning: true,
            reputationToPostImages: null,
            bindNavPrevention: true,
            postfix: "",
            imageUploader: {
            brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
            contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
            allowUrls: true
            },
            onDemand: true,
            discardSelector: ".discard-answer"
            ,immediatelyShowMarkdownHelp:true
            });


            }
            });














            draft saved

            draft discarded


















            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f215203%2fparse-dblp-xml-and-output-sums-of-publications-grouped-by-year-and-type%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown

























            2 Answers
            2






            active

            oldest

            votes








            2 Answers
            2






            active

            oldest

            votes









            active

            oldest

            votes






            active

            oldest

            votes









            1












            $begingroup$

            The decoder.Decode call is unnecessary and in fact throws an error at
            the moment.



            To your second point, yes, especially the case statements can all be
            compressed down to a single function most likely, since they all only
            have a few variables exchanged.



            The indexing into a hash map map[int]Record is not ideal, in fact
            that's probably causing a slowdown too with the two million elements in
            that table, instead you can simply append the elements to a slice and
            then it's all sorted and fine for iteration later on, no sorting
            necessary at all.



            And for increment(&i) ... just go ahead and increment the counters.
            If you make functions, okay, but like this it's not helping with
            readability (i += 1 is much clearer).



            make(string, 0, 1+len(headers) - well that's valid, but you can
            simply create the array with all elements instead, like
            string{uidString, ..., m[k].Year etc. Might be even better if you
            can reuse that array for all loop iterations.



            Well I can't see any other obvious things to change. There's the
            possibility that getting rid of DecodeElement and doing the whole
            decoding yourself might improve things, but I'm skeptical. If I, for
            example, remove the whole switch block, doing nothing but XML
            decoding essentially, this still takes three minutes for me, essentially
            just one minute less than with that block included! Meaning that with
            this library it's just not going to get much quicker overall.






            share|improve this answer









            $endgroup$









            • 1




              $begingroup$
              Thanks for taking the time to review the code! I've reworked it based on your feedback: - Removed decoder.Decode. - Created single function to process the elements I'm interested in which will do the increments / map / slice appends. - For the increment functions, indeed they would make the code a bit more readable, however I want to keep them for now for learning's sake. - Working on removing the maps and using a slice instead. I was wondering whether it would be possible to "concatenate" the different structs, as the only difference about them is the xml.Name element.
              $endgroup$
              – orestisf
              Mar 17 at 20:01






            • 1




              $begingroup$
              From what I know about the encoding/xml package I don't think there's anything to make more succinct about the structs unfortunately. You could go to a generic nested struct and to the decoding though without the struct definitions.
              $endgroup$
              – ferada
              Mar 17 at 22:25
















            1












            $begingroup$

            The decoder.Decode call is unnecessary and in fact throws an error at
            the moment.



            To your second point, yes, especially the case statements can all be
            compressed down to a single function most likely, since they all only
            have a few variables exchanged.



            The indexing into a hash map map[int]Record is not ideal, in fact
            that's probably causing a slowdown too with the two million elements in
            that table, instead you can simply append the elements to a slice and
            then it's all sorted and fine for iteration later on, no sorting
            necessary at all.



            And for increment(&i) ... just go ahead and increment the counters.
            If you make functions, okay, but like this it's not helping with
            readability (i += 1 is much clearer).



            make(string, 0, 1+len(headers) - well that's valid, but you can
            simply create the array with all elements instead, like
            string{uidString, ..., m[k].Year etc. Might be even better if you
            can reuse that array for all loop iterations.



            Well I can't see any other obvious things to change. There's the
            possibility that getting rid of DecodeElement and doing the whole
            decoding yourself might improve things, but I'm skeptical. If I, for
            example, remove the whole switch block, doing nothing but XML
            decoding essentially, this still takes three minutes for me, essentially
            just one minute less than with that block included! Meaning that with
            this library it's just not going to get much quicker overall.






            share|improve this answer









            $endgroup$









            • 1




              $begingroup$
              Thanks for taking the time to review the code! I've reworked it based on your feedback: - Removed decoder.Decode. - Created single function to process the elements I'm interested in which will do the increments / map / slice appends. - For the increment functions, indeed they would make the code a bit more readable, however I want to keep them for now for learning's sake. - Working on removing the maps and using a slice instead. I was wondering whether it would be possible to "concatenate" the different structs, as the only difference about them is the xml.Name element.
              $endgroup$
              – orestisf
              Mar 17 at 20:01






            • 1




              $begingroup$
              From what I know about the encoding/xml package I don't think there's anything to make more succinct about the structs unfortunately. You could go to a generic nested struct and to the decoding though without the struct definitions.
              $endgroup$
              – ferada
              Mar 17 at 22:25














            1












            1








            1





            $begingroup$

            The decoder.Decode call is unnecessary and in fact throws an error at
            the moment.



            To your second point, yes, especially the case statements can all be
            compressed down to a single function most likely, since they all only
            have a few variables exchanged.



            The indexing into a hash map map[int]Record is not ideal, in fact
            that's probably causing a slowdown too with the two million elements in
            that table, instead you can simply append the elements to a slice and
            then it's all sorted and fine for iteration later on, no sorting
            necessary at all.



            And for increment(&i) ... just go ahead and increment the counters.
            If you make functions, okay, but like this it's not helping with
            readability (i += 1 is much clearer).



            make(string, 0, 1+len(headers) - well that's valid, but you can
            simply create the array with all elements instead, like
            string{uidString, ..., m[k].Year etc. Might be even better if you
            can reuse that array for all loop iterations.



            Well I can't see any other obvious things to change. There's the
            possibility that getting rid of DecodeElement and doing the whole
            decoding yourself might improve things, but I'm skeptical. If I, for
            example, remove the whole switch block, doing nothing but XML
            decoding essentially, this still takes three minutes for me, essentially
            just one minute less than with that block included! Meaning that with
            this library it's just not going to get much quicker overall.






            share|improve this answer









            $endgroup$



            The decoder.Decode call is unnecessary and in fact throws an error at
            the moment.



            To your second point, yes, especially the case statements can all be
            compressed down to a single function most likely, since they all only
            have a few variables exchanged.



            The indexing into a hash map map[int]Record is not ideal, in fact
            that's probably causing a slowdown too with the two million elements in
            that table, instead you can simply append the elements to a slice and
            then it's all sorted and fine for iteration later on, no sorting
            necessary at all.



            And for increment(&i) ... just go ahead and increment the counters.
            If you make functions, okay, but like this it's not helping with
            readability (i += 1 is much clearer).



            make(string, 0, 1+len(headers) - well that's valid, but you can
            simply create the array with all elements instead, like
            string{uidString, ..., m[k].Year etc. Might be even better if you
            can reuse that array for all loop iterations.



            Well I can't see any other obvious things to change. There's the
            possibility that getting rid of DecodeElement and doing the whole
            decoding yourself might improve things, but I'm skeptical. If I, for
            example, remove the whole switch block, doing nothing but XML
            decoding essentially, this still takes three minutes for me, essentially
            just one minute less than with that block included! Meaning that with
            this library it's just not going to get much quicker overall.







            share|improve this answer












            share|improve this answer



            share|improve this answer










            answered Mar 11 at 23:37









            feradaferada

            9,3161557




            9,3161557








            • 1




              $begingroup$
              Thanks for taking the time to review the code! I've reworked it based on your feedback: - Removed decoder.Decode. - Created single function to process the elements I'm interested in which will do the increments / map / slice appends. - For the increment functions, indeed they would make the code a bit more readable, however I want to keep them for now for learning's sake. - Working on removing the maps and using a slice instead. I was wondering whether it would be possible to "concatenate" the different structs, as the only difference about them is the xml.Name element.
              $endgroup$
              – orestisf
              Mar 17 at 20:01






            • 1




              $begingroup$
              From what I know about the encoding/xml package I don't think there's anything to make more succinct about the structs unfortunately. You could go to a generic nested struct and to the decoding though without the struct definitions.
              $endgroup$
              – ferada
              Mar 17 at 22:25














            • 1




              $begingroup$
              Thanks for taking the time to review the code! I've reworked it based on your feedback: - Removed decoder.Decode. - Created single function to process the elements I'm interested in which will do the increments / map / slice appends. - For the increment functions, indeed they would make the code a bit more readable, however I want to keep them for now for learning's sake. - Working on removing the maps and using a slice instead. I was wondering whether it would be possible to "concatenate" the different structs, as the only difference about them is the xml.Name element.
              $endgroup$
              – orestisf
              Mar 17 at 20:01






            • 1




              $begingroup$
              From what I know about the encoding/xml package I don't think there's anything to make more succinct about the structs unfortunately. You could go to a generic nested struct and to the decoding though without the struct definitions.
              $endgroup$
              – ferada
              Mar 17 at 22:25








            1




            1




            $begingroup$
            Thanks for taking the time to review the code! I've reworked it based on your feedback: - Removed decoder.Decode. - Created single function to process the elements I'm interested in which will do the increments / map / slice appends. - For the increment functions, indeed they would make the code a bit more readable, however I want to keep them for now for learning's sake. - Working on removing the maps and using a slice instead. I was wondering whether it would be possible to "concatenate" the different structs, as the only difference about them is the xml.Name element.
            $endgroup$
            – orestisf
            Mar 17 at 20:01




            $begingroup$
            Thanks for taking the time to review the code! I've reworked it based on your feedback: - Removed decoder.Decode. - Created single function to process the elements I'm interested in which will do the increments / map / slice appends. - For the increment functions, indeed they would make the code a bit more readable, however I want to keep them for now for learning's sake. - Working on removing the maps and using a slice instead. I was wondering whether it would be possible to "concatenate" the different structs, as the only difference about them is the xml.Name element.
            $endgroup$
            – orestisf
            Mar 17 at 20:01




            1




            1




            $begingroup$
            From what I know about the encoding/xml package I don't think there's anything to make more succinct about the structs unfortunately. You could go to a generic nested struct and to the decoding though without the struct definitions.
            $endgroup$
            – ferada
            Mar 17 at 22:25




            $begingroup$
            From what I know about the encoding/xml package I don't think there's anything to make more succinct about the structs unfortunately. You could go to a generic nested struct and to the decoding though without the struct definitions.
            $endgroup$
            – ferada
            Mar 17 at 22:25













            0












            $begingroup$

            I've revisited the code to clean it up a bit and to follow some of the recommendations as I progress with my understanding of the language.



            Main points:



            Only two structs are now used:



            type Metadata struct {
            Key string `xml:"key,attr"`
            Year string `xml:"year"`
            Author string `xml:"author"`
            Title string `xml:"title"`
            }

            type Record struct {
            UID int
            ID int
            Type string
            Year string
            }


            The publications are all processed with the following function:



            func ProcessPublication(i Counter, publicationCounter Counter, publicationType string, publicationYear string, m map[int]Record) {
            m[i.Incr()] = Record{i.ReturnInt(), int(publicationCounter.Incr()), publicationType, publicationYear}
            }


            The entire code looks now like this:



            package main

            import (
            "compress/gzip"
            "encoding/csv"
            "encoding/xml"
            "fmt"
            "io"
            "log"
            "os"
            "sort"
            "strconv"
            "time"

            "golang.org/x/text/encoding/charmap"
            )

            // Metadata contains the fields shared by all structs
            type Metadata struct {
            Key string `xml:"key,attr"` // currently not in use
            Year string `xml:"year"`
            Author string `xml:"author"` // currently not in use
            Title string `xml:"title"` // currently not in use
            }

            // Record is used to store each Article's type and year which will be passed as a value to map m
            type Record struct {
            UID int
            ID int
            Type string
            Year string
            }

            type Count int

            type Counter interface {
            Incr() int
            ReturnInt() int
            }

            var articleCounter, InProceedingsCounter, ProceedingsCounter, BookCounter,
            InCollectionCounter, PhdThesisCounter, mastersThesisCounter, wwwCounter, i Count

            func main() {
            start := time.Now()

            //Open gzipped dblp xml
            //xmlFile, err := os.Open("TestDblp.xml.gz")
            // Uncomment below for actual xml
            xmlFile, err := os.Open("dblp.xml.gz")
            gz, err := gzip.NewReader(xmlFile)
            if err != nil {
            log.Fatal(err)

            } else {
            log.Println("Successfully Opened Dblp XML file")
            }

            defer gz.Close()

            // Create decoder element
            decoder := xml.NewDecoder(gz)

            // Suppress xml errors
            decoder.Strict = false
            decoder.CharsetReader = makeCharsetReader
            if err != nil {
            log.Fatal(err)
            }

            m := make(map[int]Record)
            var p Metadata

            for {
            // Read tokens from the XML document in a stream.
            t, err := decoder.Token()

            // If we reach the end of the file, we are done with parsing.
            if err == io.EOF {
            log.Println("XML successfully parsed:", err)
            break
            } else if err != nil {
            log.Fatalf("Error decoding token: %t", err)
            } else if t == nil {
            break
            }

            // Let's inspect the token
            switch se := t.(type) {

            // We have the start of an element and the token we created above in t:
            case xml.StartElement:
            switch se.Name.Local {

            case "article":
            decoder.DecodeElement(&p, &se)
            ProcessPublication(&i, &articleCounter, se.Name.Local, p.Year, m)

            case "inproceedings":
            decoder.DecodeElement(&p, &se)
            ProcessPublication(&i, &InProceedingsCounter, se.Name.Local, p.Year, m)

            case "proceedings":
            decoder.DecodeElement(&p, &se)
            ProcessPublication(&i, &ProceedingsCounter, se.Name.Local, p.Year, m)

            case "book":
            decoder.DecodeElement(&p, &se)
            ProcessPublication(&i, &BookCounter, se.Name.Local, p.Year, m)

            case "incollection":
            decoder.DecodeElement(&p, &se)
            ProcessPublication(&i, &InCollectionCounter, se.Name.Local, p.Year, m)

            case "phdthesis":
            decoder.DecodeElement(&p, &se)
            ProcessPublication(&i, &PhdThesisCounter, se.Name.Local, p.Year, m)

            case "mastersthesis":
            decoder.DecodeElement(&p, &se)
            ProcessPublication(&i, &mastersThesisCounter, se.Name.Local, p.Year, m)

            case "www":
            decoder.DecodeElement(&p, &se)
            ProcessPublication(&i, &wwwCounter, se.Name.Local, p.Year, m)
            }
            }
            }
            log.Println("XML parsing done in:", time.Since(start))

            // All parsed elements have been added to m := make(map[int]Record)
            // We create srMap map object and count the number of occurences of each publication type for a given year.

            srMap := make(map[Record]int)
            log.Println("Creating sums by article type per year")
            for key := range m {
            sr := Record{
            Type: m[key].Type,
            Year: m[key].Year,
            }
            srMap[sr]++
            }

            // Create sumresult.csv
            log.Println("Creating sum results csv file")
            sumfile, err := os.Create("sumresult.csv")
            checkError("Cannot create file", err)
            defer sumfile.Close()

            sumwriter := csv.NewWriter(sumfile)
            defer sumwriter.Flush()

            sumheaders := string{
            "publicationType",
            "year",
            "sum",
            }

            sumwriter.Write(sumheaders)

            // Export sumresult.csv
            for key, val := range srMap {
            r := make(string, 0, 1+len(sumheaders))
            r = append(
            r,
            key.Type,
            key.Year,
            strconv.Itoa(val),
            )
            sumwriter.Write(r)
            }
            sumwriter.Flush()

            // Create result.csv
            log.Println("Creating result.csv")

            file, err := os.Create("result.csv")
            checkError("Cannot create file", err)
            defer file.Close()

            writer := csv.NewWriter(file)
            defer writer.Flush()

            headers := string{
            "uid",
            "id",
            "type",
            "year",
            }

            writer.Write(headers)

            // Create sorted map
            var keys int
            for k := range m {
            keys = append(keys, k)
            }
            sort.Ints(keys)

            for _, k := range keys {

            r := make(string, 0, 1+len(headers))
            r = append(
            r,
            strconv.Itoa(m[k].UID),
            strconv.Itoa(m[k].ID),
            m[k].Type,
            m[k].Year,
            )
            writer.Write(r)
            }
            writer.Flush()

            // Finally report results
            log.Println("Articles:", articleCounter, "inproceedings", InProceedingsCounter, "proceedings:",
            ProceedingsCounter, "book:", BookCounter, "incollection:", InCollectionCounter, "phdthesis:",
            PhdThesisCounter, "mastersthesis:", mastersThesisCounter, "www:", wwwCounter)
            log.Println("Distinct publication map length:", len(m))
            log.Println("Sum map length:", len(srMap))
            log.Println("XML parsing and csv export executed in:", time.Since(start))
            }

            func checkError(message string, err error) {
            if err != nil {
            log.Fatal(message, err)
            }
            }

            func makeCharsetReader(charset string, input io.Reader) (io.Reader, error) {
            if charset == "ISO-8859-1" {
            // Windows-1252 is a superset of ISO-8859-1, so it should be ok for correctly decoding the dblp.xml
            return charmap.Windows1252.NewDecoder().Reader(input), nil
            }
            return nil, fmt.Errorf("Unknown charset: %s", charset)
            }

            func (c *Count) Incr() int {
            *c = *c + 1
            return int(*c)
            }

            func (c *Count) ReturnInt() int {
            return int(*c)
            }

            func ProcessPublication(i Counter, publicationCounter Counter, publicationType string, publicationYear string, m map[int]Record) {
            m[i.Incr()] = Record{i.ReturnInt(), int(publicationCounter.Incr()), publicationType, publicationYear}
            }


            I feel that the csv generation parts can be further streamlined as they are still a bit messy.






            share|improve this answer









            $endgroup$


















              0












              $begingroup$

              I've revisited the code to clean it up a bit and to follow some of the recommendations as I progress with my understanding of the language.



              Main points:



              Only two structs are now used:



              type Metadata struct {
              Key string `xml:"key,attr"`
              Year string `xml:"year"`
              Author string `xml:"author"`
              Title string `xml:"title"`
              }

              type Record struct {
              UID int
              ID int
              Type string
              Year string
              }


              The publications are all processed with the following function:



              func ProcessPublication(i Counter, publicationCounter Counter, publicationType string, publicationYear string, m map[int]Record) {
              m[i.Incr()] = Record{i.ReturnInt(), int(publicationCounter.Incr()), publicationType, publicationYear}
              }


              The entire code looks now like this:



              package main

              import (
              "compress/gzip"
              "encoding/csv"
              "encoding/xml"
              "fmt"
              "io"
              "log"
              "os"
              "sort"
              "strconv"
              "time"

              "golang.org/x/text/encoding/charmap"
              )

              // Metadata contains the fields shared by all structs
              type Metadata struct {
              Key string `xml:"key,attr"` // currently not in use
              Year string `xml:"year"`
              Author string `xml:"author"` // currently not in use
              Title string `xml:"title"` // currently not in use
              }

              // Record is used to store each Article's type and year which will be passed as a value to map m
              type Record struct {
              UID int
              ID int
              Type string
              Year string
              }

              type Count int

              type Counter interface {
              Incr() int
              ReturnInt() int
              }

              var articleCounter, InProceedingsCounter, ProceedingsCounter, BookCounter,
              InCollectionCounter, PhdThesisCounter, mastersThesisCounter, wwwCounter, i Count

              func main() {
              start := time.Now()

              //Open gzipped dblp xml
              //xmlFile, err := os.Open("TestDblp.xml.gz")
              // Uncomment below for actual xml
              xmlFile, err := os.Open("dblp.xml.gz")
              gz, err := gzip.NewReader(xmlFile)
              if err != nil {
              log.Fatal(err)

              } else {
              log.Println("Successfully Opened Dblp XML file")
              }

              defer gz.Close()

              // Create decoder element
              decoder := xml.NewDecoder(gz)

              // Suppress xml errors
              decoder.Strict = false
              decoder.CharsetReader = makeCharsetReader
              if err != nil {
              log.Fatal(err)
              }

              m := make(map[int]Record)
              var p Metadata

              for {
              // Read tokens from the XML document in a stream.
              t, err := decoder.Token()

              // If we reach the end of the file, we are done with parsing.
              if err == io.EOF {
              log.Println("XML successfully parsed:", err)
              break
              } else if err != nil {
              log.Fatalf("Error decoding token: %t", err)
              } else if t == nil {
              break
              }

              // Let's inspect the token
              switch se := t.(type) {

              // We have the start of an element and the token we created above in t:
              case xml.StartElement:
              switch se.Name.Local {

              case "article":
              decoder.DecodeElement(&p, &se)
              ProcessPublication(&i, &articleCounter, se.Name.Local, p.Year, m)

              case "inproceedings":
              decoder.DecodeElement(&p, &se)
              ProcessPublication(&i, &InProceedingsCounter, se.Name.Local, p.Year, m)

              case "proceedings":
              decoder.DecodeElement(&p, &se)
              ProcessPublication(&i, &ProceedingsCounter, se.Name.Local, p.Year, m)

              case "book":
              decoder.DecodeElement(&p, &se)
              ProcessPublication(&i, &BookCounter, se.Name.Local, p.Year, m)

              case "incollection":
              decoder.DecodeElement(&p, &se)
              ProcessPublication(&i, &InCollectionCounter, se.Name.Local, p.Year, m)

              case "phdthesis":
              decoder.DecodeElement(&p, &se)
              ProcessPublication(&i, &PhdThesisCounter, se.Name.Local, p.Year, m)

              case "mastersthesis":
              decoder.DecodeElement(&p, &se)
              ProcessPublication(&i, &mastersThesisCounter, se.Name.Local, p.Year, m)

              case "www":
              decoder.DecodeElement(&p, &se)
              ProcessPublication(&i, &wwwCounter, se.Name.Local, p.Year, m)
              }
              }
              }
              log.Println("XML parsing done in:", time.Since(start))

              // All parsed elements have been added to m := make(map[int]Record)
              // We create srMap map object and count the number of occurences of each publication type for a given year.

              srMap := make(map[Record]int)
              log.Println("Creating sums by article type per year")
              for key := range m {
              sr := Record{
              Type: m[key].Type,
              Year: m[key].Year,
              }
              srMap[sr]++
              }

              // Create sumresult.csv
              log.Println("Creating sum results csv file")
              sumfile, err := os.Create("sumresult.csv")
              checkError("Cannot create file", err)
              defer sumfile.Close()

              sumwriter := csv.NewWriter(sumfile)
              defer sumwriter.Flush()

              sumheaders := string{
              "publicationType",
              "year",
              "sum",
              }

              sumwriter.Write(sumheaders)

              // Export sumresult.csv
              for key, val := range srMap {
              r := make(string, 0, 1+len(sumheaders))
              r = append(
              r,
              key.Type,
              key.Year,
              strconv.Itoa(val),
              )
              sumwriter.Write(r)
              }
              sumwriter.Flush()

              // Create result.csv
              log.Println("Creating result.csv")

              file, err := os.Create("result.csv")
              checkError("Cannot create file", err)
              defer file.Close()

              writer := csv.NewWriter(file)
              defer writer.Flush()

              headers := string{
              "uid",
              "id",
              "type",
              "year",
              }

              writer.Write(headers)

              // Create sorted map
              var keys int
              for k := range m {
              keys = append(keys, k)
              }
              sort.Ints(keys)

              for _, k := range keys {

              r := make(string, 0, 1+len(headers))
              r = append(
              r,
              strconv.Itoa(m[k].UID),
              strconv.Itoa(m[k].ID),
              m[k].Type,
              m[k].Year,
              )
              writer.Write(r)
              }
              writer.Flush()

              // Finally report results
              log.Println("Articles:", articleCounter, "inproceedings", InProceedingsCounter, "proceedings:",
              ProceedingsCounter, "book:", BookCounter, "incollection:", InCollectionCounter, "phdthesis:",
              PhdThesisCounter, "mastersthesis:", mastersThesisCounter, "www:", wwwCounter)
              log.Println("Distinct publication map length:", len(m))
              log.Println("Sum map length:", len(srMap))
              log.Println("XML parsing and csv export executed in:", time.Since(start))
              }

              func checkError(message string, err error) {
              if err != nil {
              log.Fatal(message, err)
              }
              }

              func makeCharsetReader(charset string, input io.Reader) (io.Reader, error) {
              if charset == "ISO-8859-1" {
              // Windows-1252 is a superset of ISO-8859-1, so it should be ok for correctly decoding the dblp.xml
              return charmap.Windows1252.NewDecoder().Reader(input), nil
              }
              return nil, fmt.Errorf("Unknown charset: %s", charset)
              }

              func (c *Count) Incr() int {
              *c = *c + 1
              return int(*c)
              }

              func (c *Count) ReturnInt() int {
              return int(*c)
              }

              func ProcessPublication(i Counter, publicationCounter Counter, publicationType string, publicationYear string, m map[int]Record) {
              m[i.Incr()] = Record{i.ReturnInt(), int(publicationCounter.Incr()), publicationType, publicationYear}
              }


              I feel that the csv generation parts can be further streamlined as they are still a bit messy.






              share|improve this answer









              $endgroup$
















                0












                0








                0





                $begingroup$

                I've revisited the code to clean it up a bit and to follow some of the recommendations as I progress with my understanding of the language.



                Main points:



                Only two structs are now used:



                type Metadata struct {
                Key string `xml:"key,attr"`
                Year string `xml:"year"`
                Author string `xml:"author"`
                Title string `xml:"title"`
                }

                type Record struct {
                UID int
                ID int
                Type string
                Year string
                }


                The publications are all processed with the following function:



                func ProcessPublication(i Counter, publicationCounter Counter, publicationType string, publicationYear string, m map[int]Record) {
                m[i.Incr()] = Record{i.ReturnInt(), int(publicationCounter.Incr()), publicationType, publicationYear}
                }


                The entire code looks now like this:



                package main

                import (
                "compress/gzip"
                "encoding/csv"
                "encoding/xml"
                "fmt"
                "io"
                "log"
                "os"
                "sort"
                "strconv"
                "time"

                "golang.org/x/text/encoding/charmap"
                )

                // Metadata contains the fields shared by all structs
                type Metadata struct {
                Key string `xml:"key,attr"` // currently not in use
                Year string `xml:"year"`
                Author string `xml:"author"` // currently not in use
                Title string `xml:"title"` // currently not in use
                }

                // Record is used to store each Article's type and year which will be passed as a value to map m
                type Record struct {
                UID int
                ID int
                Type string
                Year string
                }

                type Count int

                type Counter interface {
                Incr() int
                ReturnInt() int
                }

                var articleCounter, InProceedingsCounter, ProceedingsCounter, BookCounter,
                InCollectionCounter, PhdThesisCounter, mastersThesisCounter, wwwCounter, i Count

                func main() {
                start := time.Now()

                //Open gzipped dblp xml
                //xmlFile, err := os.Open("TestDblp.xml.gz")
                // Uncomment below for actual xml
                xmlFile, err := os.Open("dblp.xml.gz")
                gz, err := gzip.NewReader(xmlFile)
                if err != nil {
                log.Fatal(err)

                } else {
                log.Println("Successfully Opened Dblp XML file")
                }

                defer gz.Close()

                // Create decoder element
                decoder := xml.NewDecoder(gz)

                // Suppress xml errors
                decoder.Strict = false
                decoder.CharsetReader = makeCharsetReader
                if err != nil {
                log.Fatal(err)
                }

                m := make(map[int]Record)
                var p Metadata

                for {
                // Read tokens from the XML document in a stream.
                t, err := decoder.Token()

                // If we reach the end of the file, we are done with parsing.
                if err == io.EOF {
                log.Println("XML successfully parsed:", err)
                break
                } else if err != nil {
                log.Fatalf("Error decoding token: %t", err)
                } else if t == nil {
                break
                }

                // Let's inspect the token
                switch se := t.(type) {

                // We have the start of an element and the token we created above in t:
                case xml.StartElement:
                switch se.Name.Local {

                case "article":
                decoder.DecodeElement(&p, &se)
                ProcessPublication(&i, &articleCounter, se.Name.Local, p.Year, m)

                case "inproceedings":
                decoder.DecodeElement(&p, &se)
                ProcessPublication(&i, &InProceedingsCounter, se.Name.Local, p.Year, m)

                case "proceedings":
                decoder.DecodeElement(&p, &se)
                ProcessPublication(&i, &ProceedingsCounter, se.Name.Local, p.Year, m)

                case "book":
                decoder.DecodeElement(&p, &se)
                ProcessPublication(&i, &BookCounter, se.Name.Local, p.Year, m)

                case "incollection":
                decoder.DecodeElement(&p, &se)
                ProcessPublication(&i, &InCollectionCounter, se.Name.Local, p.Year, m)

                case "phdthesis":
                decoder.DecodeElement(&p, &se)
                ProcessPublication(&i, &PhdThesisCounter, se.Name.Local, p.Year, m)

                case "mastersthesis":
                decoder.DecodeElement(&p, &se)
                ProcessPublication(&i, &mastersThesisCounter, se.Name.Local, p.Year, m)

                case "www":
                decoder.DecodeElement(&p, &se)
                ProcessPublication(&i, &wwwCounter, se.Name.Local, p.Year, m)
                }
                }
                }
                log.Println("XML parsing done in:", time.Since(start))

                // All parsed elements have been added to m := make(map[int]Record)
                // We create srMap map object and count the number of occurences of each publication type for a given year.

                srMap := make(map[Record]int)
                log.Println("Creating sums by article type per year")
                for key := range m {
                sr := Record{
                Type: m[key].Type,
                Year: m[key].Year,
                }
                srMap[sr]++
                }

                // Create sumresult.csv
                log.Println("Creating sum results csv file")
                sumfile, err := os.Create("sumresult.csv")
                checkError("Cannot create file", err)
                defer sumfile.Close()

                sumwriter := csv.NewWriter(sumfile)
                defer sumwriter.Flush()

                sumheaders := string{
                "publicationType",
                "year",
                "sum",
                }

                sumwriter.Write(sumheaders)

                // Export sumresult.csv
                for key, val := range srMap {
                r := make(string, 0, 1+len(sumheaders))
                r = append(
                r,
                key.Type,
                key.Year,
                strconv.Itoa(val),
                )
                sumwriter.Write(r)
                }
                sumwriter.Flush()

                // Create result.csv
                log.Println("Creating result.csv")

                file, err := os.Create("result.csv")
                checkError("Cannot create file", err)
                defer file.Close()

                writer := csv.NewWriter(file)
                defer writer.Flush()

                headers := string{
                "uid",
                "id",
                "type",
                "year",
                }

                writer.Write(headers)

                // Create sorted map
                var keys int
                for k := range m {
                keys = append(keys, k)
                }
                sort.Ints(keys)

                for _, k := range keys {

                r := make(string, 0, 1+len(headers))
                r = append(
                r,
                strconv.Itoa(m[k].UID),
                strconv.Itoa(m[k].ID),
                m[k].Type,
                m[k].Year,
                )
                writer.Write(r)
                }
                writer.Flush()

                // Finally report results
                log.Println("Articles:", articleCounter, "inproceedings", InProceedingsCounter, "proceedings:",
                ProceedingsCounter, "book:", BookCounter, "incollection:", InCollectionCounter, "phdthesis:",
                PhdThesisCounter, "mastersthesis:", mastersThesisCounter, "www:", wwwCounter)
                log.Println("Distinct publication map length:", len(m))
                log.Println("Sum map length:", len(srMap))
                log.Println("XML parsing and csv export executed in:", time.Since(start))
                }

                func checkError(message string, err error) {
                if err != nil {
                log.Fatal(message, err)
                }
                }

                func makeCharsetReader(charset string, input io.Reader) (io.Reader, error) {
                if charset == "ISO-8859-1" {
                // Windows-1252 is a superset of ISO-8859-1, so it should be ok for correctly decoding the dblp.xml
                return charmap.Windows1252.NewDecoder().Reader(input), nil
                }
                return nil, fmt.Errorf("Unknown charset: %s", charset)
                }

                func (c *Count) Incr() int {
                *c = *c + 1
                return int(*c)
                }

                func (c *Count) ReturnInt() int {
                return int(*c)
                }

                func ProcessPublication(i Counter, publicationCounter Counter, publicationType string, publicationYear string, m map[int]Record) {
                m[i.Incr()] = Record{i.ReturnInt(), int(publicationCounter.Incr()), publicationType, publicationYear}
                }


                I feel that the csv generation parts can be further streamlined as they are still a bit messy.






                share|improve this answer









                $endgroup$



                I've revisited the code to clean it up a bit and to follow some of the recommendations as I progress with my understanding of the language.



                Main points:



                Only two structs are now used:



                type Metadata struct {
                Key string `xml:"key,attr"`
                Year string `xml:"year"`
                Author string `xml:"author"`
                Title string `xml:"title"`
                }

                type Record struct {
                UID int
                ID int
                Type string
                Year string
                }


                The publications are all processed with the following function:



                func ProcessPublication(i Counter, publicationCounter Counter, publicationType string, publicationYear string, m map[int]Record) {
                m[i.Incr()] = Record{i.ReturnInt(), int(publicationCounter.Incr()), publicationType, publicationYear}
                }


                The entire code looks now like this:



                package main

                import (
                "compress/gzip"
                "encoding/csv"
                "encoding/xml"
                "fmt"
                "io"
                "log"
                "os"
                "sort"
                "strconv"
                "time"

                "golang.org/x/text/encoding/charmap"
                )

                // Metadata contains the fields shared by all structs
                type Metadata struct {
                Key string `xml:"key,attr"` // currently not in use
                Year string `xml:"year"`
                Author string `xml:"author"` // currently not in use
                Title string `xml:"title"` // currently not in use
                }

                // Record is used to store each Article's type and year which will be passed as a value to map m
                type Record struct {
                UID int
                ID int
                Type string
                Year string
                }

                type Count int

                type Counter interface {
                Incr() int
                ReturnInt() int
                }

                var articleCounter, InProceedingsCounter, ProceedingsCounter, BookCounter,
                InCollectionCounter, PhdThesisCounter, mastersThesisCounter, wwwCounter, i Count

                func main() {
                start := time.Now()

                //Open gzipped dblp xml
                //xmlFile, err := os.Open("TestDblp.xml.gz")
                // Uncomment below for actual xml
                xmlFile, err := os.Open("dblp.xml.gz")
                gz, err := gzip.NewReader(xmlFile)
                if err != nil {
                log.Fatal(err)

                } else {
                log.Println("Successfully Opened Dblp XML file")
                }

                defer gz.Close()

                // Create decoder element
                decoder := xml.NewDecoder(gz)

                // Suppress xml errors
                decoder.Strict = false
                decoder.CharsetReader = makeCharsetReader
                if err != nil {
                log.Fatal(err)
                }

                m := make(map[int]Record)
                var p Metadata

                for {
                // Read tokens from the XML document in a stream.
                t, err := decoder.Token()

                // If we reach the end of the file, we are done with parsing.
                if err == io.EOF {
                log.Println("XML successfully parsed:", err)
                break
                } else if err != nil {
                log.Fatalf("Error decoding token: %t", err)
                } else if t == nil {
                break
                }

                // Let's inspect the token
                switch se := t.(type) {

                // We have the start of an element and the token we created above in t:
                case xml.StartElement:
                switch se.Name.Local {

                case "article":
                decoder.DecodeElement(&p, &se)
                ProcessPublication(&i, &articleCounter, se.Name.Local, p.Year, m)

                case "inproceedings":
                decoder.DecodeElement(&p, &se)
                ProcessPublication(&i, &InProceedingsCounter, se.Name.Local, p.Year, m)

                case "proceedings":
                decoder.DecodeElement(&p, &se)
                ProcessPublication(&i, &ProceedingsCounter, se.Name.Local, p.Year, m)

                case "book":
                decoder.DecodeElement(&p, &se)
                ProcessPublication(&i, &BookCounter, se.Name.Local, p.Year, m)

                case "incollection":
                decoder.DecodeElement(&p, &se)
                ProcessPublication(&i, &InCollectionCounter, se.Name.Local, p.Year, m)

                case "phdthesis":
                decoder.DecodeElement(&p, &se)
                ProcessPublication(&i, &PhdThesisCounter, se.Name.Local, p.Year, m)

                case "mastersthesis":
                decoder.DecodeElement(&p, &se)
                ProcessPublication(&i, &mastersThesisCounter, se.Name.Local, p.Year, m)

                case "www":
                decoder.DecodeElement(&p, &se)
                ProcessPublication(&i, &wwwCounter, se.Name.Local, p.Year, m)
                }
                }
                }
                log.Println("XML parsing done in:", time.Since(start))

                // All parsed elements have been added to m := make(map[int]Record)
                // We create srMap map object and count the number of occurences of each publication type for a given year.

                srMap := make(map[Record]int)
                log.Println("Creating sums by article type per year")
                for key := range m {
                sr := Record{
                Type: m[key].Type,
                Year: m[key].Year,
                }
                srMap[sr]++
                }

                // Create sumresult.csv
                log.Println("Creating sum results csv file")
                sumfile, err := os.Create("sumresult.csv")
                checkError("Cannot create file", err)
                defer sumfile.Close()

                sumwriter := csv.NewWriter(sumfile)
                defer sumwriter.Flush()

                sumheaders := string{
                "publicationType",
                "year",
                "sum",
                }

                sumwriter.Write(sumheaders)

                // Export sumresult.csv
                for key, val := range srMap {
                r := make(string, 0, 1+len(sumheaders))
                r = append(
                r,
                key.Type,
                key.Year,
                strconv.Itoa(val),
                )
                sumwriter.Write(r)
                }
                sumwriter.Flush()

                // Create result.csv
                log.Println("Creating result.csv")

                file, err := os.Create("result.csv")
                checkError("Cannot create file", err)
                defer file.Close()

                writer := csv.NewWriter(file)
                defer writer.Flush()

                headers := string{
                "uid",
                "id",
                "type",
                "year",
                }

                writer.Write(headers)

                // Create sorted map
                var keys int
                for k := range m {
                keys = append(keys, k)
                }
                sort.Ints(keys)

                for _, k := range keys {

                r := make(string, 0, 1+len(headers))
                r = append(
                r,
                strconv.Itoa(m[k].UID),
                strconv.Itoa(m[k].ID),
                m[k].Type,
                m[k].Year,
                )
                writer.Write(r)
                }
                writer.Flush()

                // Finally report results
                log.Println("Articles:", articleCounter, "inproceedings", InProceedingsCounter, "proceedings:",
                ProceedingsCounter, "book:", BookCounter, "incollection:", InCollectionCounter, "phdthesis:",
                PhdThesisCounter, "mastersthesis:", mastersThesisCounter, "www:", wwwCounter)
                log.Println("Distinct publication map length:", len(m))
                log.Println("Sum map length:", len(srMap))
                log.Println("XML parsing and csv export executed in:", time.Since(start))
                }

                func checkError(message string, err error) {
                if err != nil {
                log.Fatal(message, err)
                }
                }

                func makeCharsetReader(charset string, input io.Reader) (io.Reader, error) {
                if charset == "ISO-8859-1" {
                // Windows-1252 is a superset of ISO-8859-1, so it should be ok for correctly decoding the dblp.xml
                return charmap.Windows1252.NewDecoder().Reader(input), nil
                }
                return nil, fmt.Errorf("Unknown charset: %s", charset)
                }

                func (c *Count) Incr() int {
                *c = *c + 1
                return int(*c)
                }

                func (c *Count) ReturnInt() int {
                return int(*c)
                }

                func ProcessPublication(i Counter, publicationCounter Counter, publicationType string, publicationYear string, m map[int]Record) {
                m[i.Incr()] = Record{i.ReturnInt(), int(publicationCounter.Incr()), publicationType, publicationYear}
                }


                I feel that the csv generation parts can be further streamlined as they are still a bit messy.







                share|improve this answer












                share|improve this answer



                share|improve this answer










                answered yesterday









                orestisforestisf

                285




                285






























                    draft saved

                    draft discarded




















































                    Thanks for contributing an answer to Code Review Stack Exchange!


                    • Please be sure to answer the question. Provide details and share your research!

                    But avoid



                    • Asking for help, clarification, or responding to other answers.

                    • Making statements based on opinion; back them up with references or personal experience.


                    Use MathJax to format equations. MathJax reference.


                    To learn more, see our tips on writing great answers.




                    draft saved


                    draft discarded














                    StackExchange.ready(
                    function () {
                    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f215203%2fparse-dblp-xml-and-output-sums-of-publications-grouped-by-year-and-type%23new-answer', 'question_page');
                    }
                    );

                    Post as a guest















                    Required, but never shown





















































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown

































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown







                    Popular posts from this blog

                    Сан-Квентин

                    Алькесар

                    Josef Freinademetz