Looking for an algorithm to identify contiguous repeated series of lines in a long string












5














I would like an algorithm that can identify repeated parts of big stack traces like this:



java.lang.StackOverflowError
at transform.Erasure$Eraser.typed1(Erasure.scala:789)
at typechecker.Typers$
Typer.runTyper$1(Typers.scala:5640)
at typechecker.Typers$
Typer.typedInternal(Typers.scala:5672)
at typechecker.Typers$Typer.body$2(Typers.scala:5613)
at typechecker.Typers$Typer.typed(Typers.scala:5618)
at typechecker.Typers$
Typer.$anonfun$typed1$38(Typers.scala:4752)
at typechecker.Typers$
Typer.silent(Typers.scala:700)
at typechecker.Typers$Typer.normalTypedApply$1(Typers.scala:4754)
at typechecker.Typers$Typer.typedApply$1(Typers.scala:4801)
at typechecker.Typers$Typer.typedInAnyMode$1(Typers.scala:5586)
at typechecker.Typers$Typer.typed1(Typers.scala:5603)
at transform.Erasure$
Eraser.typed1(Erasure.scala:789)
at typechecker.Typers$Typer.runTyper$1(Typers.scala:5640)
at typechecker.Typers$Typer.typedInternal(Typers.scala:5672)
at typechecker.Typers$
Typer.body$2(Typers.scala:5613)
at typechecker.Typers$
Typer.typed(Typers.scala:5618)
at typechecker.Typers$Typer.typedQualifier(Typers.scala:5723)
at typechecker.Typers$
Typer.typedQualifier(Typers.scala:5731)
at transform.Erasure$Eraser.adaptMember(Erasure.scala:714)
at transform.Erasure$
Eraser.typed1(Erasure.scala:789)
at typechecker.Typers$Typer.runTyper$1(Typers.scala:5640)
at typechecker.Typers$Typer.typedInternal(Typers.scala:5672)
at typechecker.Typers$
Typer.body$2(Typers.scala:5613)
at typechecker.Typers$
Typer.typed(Typers.scala:5618)
at typechecker.Typers$Typer.$anonfun$typed1$38(Typers.scala:4752)
at typechecker.Typers$Typer.silent(Typers.scala:700)
at typechecker.Typers$
Typer.normalTypedApply$1(Typers.scala:4754)
at typechecker.Typers$
Typer.typedApply$1(Typers.scala:4801)
at typechecker.Typers$
Typer.typedInAnyMode$1(Typers.scala:5586)
at typechecker.Typers$
Typer.typed1(Typers.scala:5603)
at transform.Erasure$Eraser.typed1(Erasure.scala:789)
at typechecker.Typers$
Typer.runTyper$1(Typers.scala:5640)
at typechecker.Typers$
Typer.typedInternal(Typers.scala:5672)
at typechecker.Typers$Typer.body$2(Typers.scala:5613)
at typechecker.Typers$Typer.typed(Typers.scala:5618)
at typechecker.Typers$
Typer.typedQualifier(Typers.scala:5723)
at typechecker.Typers$Typer.typedQualifier(Typers.scala:5731)
at transform.Erasure$
Eraser.adaptMember(Erasure.scala:714)
at transform.Erasure$Eraser.typed1(Erasure.scala:789)
at typechecker.Typers$
Typer.runTyper$1(Typers.scala:5640)
at typechecker.Typers$
Typer.typedInternal(Typers.scala:5672)
at typechecker.Typers$Typer.body$2(Typers.scala:5613)
at typechecker.Typers$Typer.typed(Typers.scala:5618)
at typechecker.Typers$
Typer.$anonfun$typed1$38(Typers.scala:4752)
at typechecker.Typers$
Typer.silent(Typers.scala:700)
at typechecker.Typers$Typer.normalTypedApply$1(Typers.scala:4754)
at typechecker.Typers$Typer.typedApply$1(Typers.scala:4801)
at typechecker.Typers$Typer.typedInAnyMode$1(Typers.scala:5586)
at typechecker.Typers$Typer.typed1(Typers.scala:5603)
at transform.Erasure$
Eraser.typed1(Erasure.scala:789)
at typechecker.Typers$Typer.runTyper$1(Typers.scala:5640)
at typechecker.Typers$Typer.typedInternal(Typers.scala:5672)
at typechecker.Typers$
Typer.body$2(Typers.scala:5613)
at typechecker.Typers$
Typer.typed(Typers.scala:5618)
at typechecker.Typers$Typer.typedQualifier(Typers.scala:5723)
at typechecker.Typers$
Typer.typedQualifier(Typers.scala:5731)
at transform.Erasure$Eraser.adaptMember(Erasure.scala:714)
at transform.Erasure$
Eraser.typed1(Erasure.scala:789)
at typechecker.Typers$Typer.runTyper$1(Typers.scala:5640)
at typechecker.Typers$Typer.typedInternal(Typers.scala:5672)
at typechecker.Typers$
Typer.body$2(Typers.scala:5613)
at typechecker.Typers$
Typer.typed(Typers.scala:5618)
at typechecker.Typers$Typer.$anonfun$typed1$38(Typers.scala:4752)


With a bit of inspection, it's clear that this segment is being repeated three times:



at typechecker.Typers$Typer.runTyper$1(Typers.scala:5640)
at typechecker.Typers$Typer.typedInternal(Typers.scala:5672)
at typechecker.Typers$
Typer.body$2(Typers.scala:5613)
at typechecker.Typers$
Typer.typed(Typers.scala:5618)
at typechecker.Typers$Typer.$anonfun$typed1$38(Typers.scala:4752)
at typechecker.Typers$Typer.silent(Typers.scala:700)
at typechecker.Typers$
Typer.normalTypedApply$1(Typers.scala:4754)
at typechecker.Typers$
Typer.typedApply$1(Typers.scala:4801)
at typechecker.Typers$
Typer.typedInAnyMode$1(Typers.scala:5586)
at typechecker.Typers$
Typer.typed1(Typers.scala:5603)
at transform.Erasure$Eraser.typed1(Erasure.scala:789)
at typechecker.Typers$
Typer.runTyper$1(Typers.scala:5640)
at typechecker.Typers$
Typer.typedInternal(Typers.scala:5672)
at typechecker.Typers$Typer.body$2(Typers.scala:5613)
at typechecker.Typers$Typer.typed(Typers.scala:5618)
at typechecker.Typers$
Typer.typedQualifier(Typers.scala:5723)
at typechecker.Typers$Typer.typedQualifier(Typers.scala:5731)
at transform.Erasure$
Eraser.adaptMember(Erasure.scala:714)
at transform.Erasure$Eraser.typed1(Erasure.scala:789)


The end goal is so I can print out stack traces of recursive functions in a nicer fashion



java.lang.StackOverflowError
at transform.Erasure$Eraser.typed1(Erasure.scala:789)
------------------------- Repeated 3x -------------------------
at typechecker.Typers$
Typer.runTyper$1(Typers.scala:5640)
at typechecker.Typers$
Typer.typedInternal(Typers.scala:5672)
at typechecker.Typers$Typer.body$2(Typers.scala:5613)
at typechecker.Typers$Typer.typed(Typers.scala:5618)
at typechecker.Typers$
Typer.$anonfun$typed1$38(Typers.scala:4752)
at typechecker.Typers$
Typer.silent(Typers.scala:700)
at typechecker.Typers$Typer.normalTypedApply$1(Typers.scala:4754)
at typechecker.Typers$Typer.typedApply$1(Typers.scala:4801)
at typechecker.Typers$Typer.typedInAnyMode$1(Typers.scala:5586)
at typechecker.Typers$Typer.typed1(Typers.scala:5603)
at transform.Erasure$
Eraser.typed1(Erasure.scala:789)
at typechecker.Typers$Typer.runTyper$1(Typers.scala:5640)
at typechecker.Typers$Typer.typedInternal(Typers.scala:5672)
at typechecker.Typers$
Typer.body$2(Typers.scala:5613)
at typechecker.Typers$
Typer.typed(Typers.scala:5618)
at typechecker.Typers$Typer.typedQualifier(Typers.scala:5723)
at typechecker.Typers$
Typer.typedQualifier(Typers.scala:5731)
at transform.Erasure$Eraser.adaptMember(Erasure.scala:714)
at transform.Erasure$
Eraser.typed1(Erasure.scala:789)
-------------------------------------------------------------
at typechecker.Typers$Typer.runTyper$1(Typers.scala:5640)
at typechecker.Typers$Typer.typedInternal(Typers.scala:5672)
at typechecker.Typers$
Typer.body$2(Typers.scala:5613)
at typechecker.Typers$
Typer.typed(Typers.scala:5618)
at typechecker.Typers$Typer.$anonfun$typed1$38(Typers.scala:4752)


I'm not sure how feasible this is, but I'd be happy for any and all possible solutions even if they have their own limitations and constraints. Preliminary Googling didn't find me anything, but quite likely I just don't know what to google










share|cite|improve this question









New contributor




Li Haoyi is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.

























    5














    I would like an algorithm that can identify repeated parts of big stack traces like this:



    java.lang.StackOverflowError
    at transform.Erasure$Eraser.typed1(Erasure.scala:789)
    at typechecker.Typers$
    Typer.runTyper$1(Typers.scala:5640)
    at typechecker.Typers$
    Typer.typedInternal(Typers.scala:5672)
    at typechecker.Typers$Typer.body$2(Typers.scala:5613)
    at typechecker.Typers$Typer.typed(Typers.scala:5618)
    at typechecker.Typers$
    Typer.$anonfun$typed1$38(Typers.scala:4752)
    at typechecker.Typers$
    Typer.silent(Typers.scala:700)
    at typechecker.Typers$Typer.normalTypedApply$1(Typers.scala:4754)
    at typechecker.Typers$Typer.typedApply$1(Typers.scala:4801)
    at typechecker.Typers$Typer.typedInAnyMode$1(Typers.scala:5586)
    at typechecker.Typers$Typer.typed1(Typers.scala:5603)
    at transform.Erasure$
    Eraser.typed1(Erasure.scala:789)
    at typechecker.Typers$Typer.runTyper$1(Typers.scala:5640)
    at typechecker.Typers$Typer.typedInternal(Typers.scala:5672)
    at typechecker.Typers$
    Typer.body$2(Typers.scala:5613)
    at typechecker.Typers$
    Typer.typed(Typers.scala:5618)
    at typechecker.Typers$Typer.typedQualifier(Typers.scala:5723)
    at typechecker.Typers$
    Typer.typedQualifier(Typers.scala:5731)
    at transform.Erasure$Eraser.adaptMember(Erasure.scala:714)
    at transform.Erasure$
    Eraser.typed1(Erasure.scala:789)
    at typechecker.Typers$Typer.runTyper$1(Typers.scala:5640)
    at typechecker.Typers$Typer.typedInternal(Typers.scala:5672)
    at typechecker.Typers$
    Typer.body$2(Typers.scala:5613)
    at typechecker.Typers$
    Typer.typed(Typers.scala:5618)
    at typechecker.Typers$Typer.$anonfun$typed1$38(Typers.scala:4752)
    at typechecker.Typers$Typer.silent(Typers.scala:700)
    at typechecker.Typers$
    Typer.normalTypedApply$1(Typers.scala:4754)
    at typechecker.Typers$
    Typer.typedApply$1(Typers.scala:4801)
    at typechecker.Typers$
    Typer.typedInAnyMode$1(Typers.scala:5586)
    at typechecker.Typers$
    Typer.typed1(Typers.scala:5603)
    at transform.Erasure$Eraser.typed1(Erasure.scala:789)
    at typechecker.Typers$
    Typer.runTyper$1(Typers.scala:5640)
    at typechecker.Typers$
    Typer.typedInternal(Typers.scala:5672)
    at typechecker.Typers$Typer.body$2(Typers.scala:5613)
    at typechecker.Typers$Typer.typed(Typers.scala:5618)
    at typechecker.Typers$
    Typer.typedQualifier(Typers.scala:5723)
    at typechecker.Typers$Typer.typedQualifier(Typers.scala:5731)
    at transform.Erasure$
    Eraser.adaptMember(Erasure.scala:714)
    at transform.Erasure$Eraser.typed1(Erasure.scala:789)
    at typechecker.Typers$
    Typer.runTyper$1(Typers.scala:5640)
    at typechecker.Typers$
    Typer.typedInternal(Typers.scala:5672)
    at typechecker.Typers$Typer.body$2(Typers.scala:5613)
    at typechecker.Typers$Typer.typed(Typers.scala:5618)
    at typechecker.Typers$
    Typer.$anonfun$typed1$38(Typers.scala:4752)
    at typechecker.Typers$
    Typer.silent(Typers.scala:700)
    at typechecker.Typers$Typer.normalTypedApply$1(Typers.scala:4754)
    at typechecker.Typers$Typer.typedApply$1(Typers.scala:4801)
    at typechecker.Typers$Typer.typedInAnyMode$1(Typers.scala:5586)
    at typechecker.Typers$Typer.typed1(Typers.scala:5603)
    at transform.Erasure$
    Eraser.typed1(Erasure.scala:789)
    at typechecker.Typers$Typer.runTyper$1(Typers.scala:5640)
    at typechecker.Typers$Typer.typedInternal(Typers.scala:5672)
    at typechecker.Typers$
    Typer.body$2(Typers.scala:5613)
    at typechecker.Typers$
    Typer.typed(Typers.scala:5618)
    at typechecker.Typers$Typer.typedQualifier(Typers.scala:5723)
    at typechecker.Typers$
    Typer.typedQualifier(Typers.scala:5731)
    at transform.Erasure$Eraser.adaptMember(Erasure.scala:714)
    at transform.Erasure$
    Eraser.typed1(Erasure.scala:789)
    at typechecker.Typers$Typer.runTyper$1(Typers.scala:5640)
    at typechecker.Typers$Typer.typedInternal(Typers.scala:5672)
    at typechecker.Typers$
    Typer.body$2(Typers.scala:5613)
    at typechecker.Typers$
    Typer.typed(Typers.scala:5618)
    at typechecker.Typers$Typer.$anonfun$typed1$38(Typers.scala:4752)


    With a bit of inspection, it's clear that this segment is being repeated three times:



    at typechecker.Typers$Typer.runTyper$1(Typers.scala:5640)
    at typechecker.Typers$Typer.typedInternal(Typers.scala:5672)
    at typechecker.Typers$
    Typer.body$2(Typers.scala:5613)
    at typechecker.Typers$
    Typer.typed(Typers.scala:5618)
    at typechecker.Typers$Typer.$anonfun$typed1$38(Typers.scala:4752)
    at typechecker.Typers$Typer.silent(Typers.scala:700)
    at typechecker.Typers$
    Typer.normalTypedApply$1(Typers.scala:4754)
    at typechecker.Typers$
    Typer.typedApply$1(Typers.scala:4801)
    at typechecker.Typers$
    Typer.typedInAnyMode$1(Typers.scala:5586)
    at typechecker.Typers$
    Typer.typed1(Typers.scala:5603)
    at transform.Erasure$Eraser.typed1(Erasure.scala:789)
    at typechecker.Typers$
    Typer.runTyper$1(Typers.scala:5640)
    at typechecker.Typers$
    Typer.typedInternal(Typers.scala:5672)
    at typechecker.Typers$Typer.body$2(Typers.scala:5613)
    at typechecker.Typers$Typer.typed(Typers.scala:5618)
    at typechecker.Typers$
    Typer.typedQualifier(Typers.scala:5723)
    at typechecker.Typers$Typer.typedQualifier(Typers.scala:5731)
    at transform.Erasure$
    Eraser.adaptMember(Erasure.scala:714)
    at transform.Erasure$Eraser.typed1(Erasure.scala:789)


    The end goal is so I can print out stack traces of recursive functions in a nicer fashion



    java.lang.StackOverflowError
    at transform.Erasure$Eraser.typed1(Erasure.scala:789)
    ------------------------- Repeated 3x -------------------------
    at typechecker.Typers$
    Typer.runTyper$1(Typers.scala:5640)
    at typechecker.Typers$
    Typer.typedInternal(Typers.scala:5672)
    at typechecker.Typers$Typer.body$2(Typers.scala:5613)
    at typechecker.Typers$Typer.typed(Typers.scala:5618)
    at typechecker.Typers$
    Typer.$anonfun$typed1$38(Typers.scala:4752)
    at typechecker.Typers$
    Typer.silent(Typers.scala:700)
    at typechecker.Typers$Typer.normalTypedApply$1(Typers.scala:4754)
    at typechecker.Typers$Typer.typedApply$1(Typers.scala:4801)
    at typechecker.Typers$Typer.typedInAnyMode$1(Typers.scala:5586)
    at typechecker.Typers$Typer.typed1(Typers.scala:5603)
    at transform.Erasure$
    Eraser.typed1(Erasure.scala:789)
    at typechecker.Typers$Typer.runTyper$1(Typers.scala:5640)
    at typechecker.Typers$Typer.typedInternal(Typers.scala:5672)
    at typechecker.Typers$
    Typer.body$2(Typers.scala:5613)
    at typechecker.Typers$
    Typer.typed(Typers.scala:5618)
    at typechecker.Typers$Typer.typedQualifier(Typers.scala:5723)
    at typechecker.Typers$
    Typer.typedQualifier(Typers.scala:5731)
    at transform.Erasure$Eraser.adaptMember(Erasure.scala:714)
    at transform.Erasure$
    Eraser.typed1(Erasure.scala:789)
    -------------------------------------------------------------
    at typechecker.Typers$Typer.runTyper$1(Typers.scala:5640)
    at typechecker.Typers$Typer.typedInternal(Typers.scala:5672)
    at typechecker.Typers$
    Typer.body$2(Typers.scala:5613)
    at typechecker.Typers$
    Typer.typed(Typers.scala:5618)
    at typechecker.Typers$Typer.$anonfun$typed1$38(Typers.scala:4752)


    I'm not sure how feasible this is, but I'd be happy for any and all possible solutions even if they have their own limitations and constraints. Preliminary Googling didn't find me anything, but quite likely I just don't know what to google










    share|cite|improve this question









    New contributor




    Li Haoyi is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
    Check out our Code of Conduct.























      5












      5








      5


      1





      I would like an algorithm that can identify repeated parts of big stack traces like this:



      java.lang.StackOverflowError
      at transform.Erasure$Eraser.typed1(Erasure.scala:789)
      at typechecker.Typers$
      Typer.runTyper$1(Typers.scala:5640)
      at typechecker.Typers$
      Typer.typedInternal(Typers.scala:5672)
      at typechecker.Typers$Typer.body$2(Typers.scala:5613)
      at typechecker.Typers$Typer.typed(Typers.scala:5618)
      at typechecker.Typers$
      Typer.$anonfun$typed1$38(Typers.scala:4752)
      at typechecker.Typers$
      Typer.silent(Typers.scala:700)
      at typechecker.Typers$Typer.normalTypedApply$1(Typers.scala:4754)
      at typechecker.Typers$Typer.typedApply$1(Typers.scala:4801)
      at typechecker.Typers$Typer.typedInAnyMode$1(Typers.scala:5586)
      at typechecker.Typers$Typer.typed1(Typers.scala:5603)
      at transform.Erasure$
      Eraser.typed1(Erasure.scala:789)
      at typechecker.Typers$Typer.runTyper$1(Typers.scala:5640)
      at typechecker.Typers$Typer.typedInternal(Typers.scala:5672)
      at typechecker.Typers$
      Typer.body$2(Typers.scala:5613)
      at typechecker.Typers$
      Typer.typed(Typers.scala:5618)
      at typechecker.Typers$Typer.typedQualifier(Typers.scala:5723)
      at typechecker.Typers$
      Typer.typedQualifier(Typers.scala:5731)
      at transform.Erasure$Eraser.adaptMember(Erasure.scala:714)
      at transform.Erasure$
      Eraser.typed1(Erasure.scala:789)
      at typechecker.Typers$Typer.runTyper$1(Typers.scala:5640)
      at typechecker.Typers$Typer.typedInternal(Typers.scala:5672)
      at typechecker.Typers$
      Typer.body$2(Typers.scala:5613)
      at typechecker.Typers$
      Typer.typed(Typers.scala:5618)
      at typechecker.Typers$Typer.$anonfun$typed1$38(Typers.scala:4752)
      at typechecker.Typers$Typer.silent(Typers.scala:700)
      at typechecker.Typers$
      Typer.normalTypedApply$1(Typers.scala:4754)
      at typechecker.Typers$
      Typer.typedApply$1(Typers.scala:4801)
      at typechecker.Typers$
      Typer.typedInAnyMode$1(Typers.scala:5586)
      at typechecker.Typers$
      Typer.typed1(Typers.scala:5603)
      at transform.Erasure$Eraser.typed1(Erasure.scala:789)
      at typechecker.Typers$
      Typer.runTyper$1(Typers.scala:5640)
      at typechecker.Typers$
      Typer.typedInternal(Typers.scala:5672)
      at typechecker.Typers$Typer.body$2(Typers.scala:5613)
      at typechecker.Typers$Typer.typed(Typers.scala:5618)
      at typechecker.Typers$
      Typer.typedQualifier(Typers.scala:5723)
      at typechecker.Typers$Typer.typedQualifier(Typers.scala:5731)
      at transform.Erasure$
      Eraser.adaptMember(Erasure.scala:714)
      at transform.Erasure$Eraser.typed1(Erasure.scala:789)
      at typechecker.Typers$
      Typer.runTyper$1(Typers.scala:5640)
      at typechecker.Typers$
      Typer.typedInternal(Typers.scala:5672)
      at typechecker.Typers$Typer.body$2(Typers.scala:5613)
      at typechecker.Typers$Typer.typed(Typers.scala:5618)
      at typechecker.Typers$
      Typer.$anonfun$typed1$38(Typers.scala:4752)
      at typechecker.Typers$
      Typer.silent(Typers.scala:700)
      at typechecker.Typers$Typer.normalTypedApply$1(Typers.scala:4754)
      at typechecker.Typers$Typer.typedApply$1(Typers.scala:4801)
      at typechecker.Typers$Typer.typedInAnyMode$1(Typers.scala:5586)
      at typechecker.Typers$Typer.typed1(Typers.scala:5603)
      at transform.Erasure$
      Eraser.typed1(Erasure.scala:789)
      at typechecker.Typers$Typer.runTyper$1(Typers.scala:5640)
      at typechecker.Typers$Typer.typedInternal(Typers.scala:5672)
      at typechecker.Typers$
      Typer.body$2(Typers.scala:5613)
      at typechecker.Typers$
      Typer.typed(Typers.scala:5618)
      at typechecker.Typers$Typer.typedQualifier(Typers.scala:5723)
      at typechecker.Typers$
      Typer.typedQualifier(Typers.scala:5731)
      at transform.Erasure$Eraser.adaptMember(Erasure.scala:714)
      at transform.Erasure$
      Eraser.typed1(Erasure.scala:789)
      at typechecker.Typers$Typer.runTyper$1(Typers.scala:5640)
      at typechecker.Typers$Typer.typedInternal(Typers.scala:5672)
      at typechecker.Typers$
      Typer.body$2(Typers.scala:5613)
      at typechecker.Typers$
      Typer.typed(Typers.scala:5618)
      at typechecker.Typers$Typer.$anonfun$typed1$38(Typers.scala:4752)


      With a bit of inspection, it's clear that this segment is being repeated three times:



      at typechecker.Typers$Typer.runTyper$1(Typers.scala:5640)
      at typechecker.Typers$Typer.typedInternal(Typers.scala:5672)
      at typechecker.Typers$
      Typer.body$2(Typers.scala:5613)
      at typechecker.Typers$
      Typer.typed(Typers.scala:5618)
      at typechecker.Typers$Typer.$anonfun$typed1$38(Typers.scala:4752)
      at typechecker.Typers$Typer.silent(Typers.scala:700)
      at typechecker.Typers$
      Typer.normalTypedApply$1(Typers.scala:4754)
      at typechecker.Typers$
      Typer.typedApply$1(Typers.scala:4801)
      at typechecker.Typers$
      Typer.typedInAnyMode$1(Typers.scala:5586)
      at typechecker.Typers$
      Typer.typed1(Typers.scala:5603)
      at transform.Erasure$Eraser.typed1(Erasure.scala:789)
      at typechecker.Typers$
      Typer.runTyper$1(Typers.scala:5640)
      at typechecker.Typers$
      Typer.typedInternal(Typers.scala:5672)
      at typechecker.Typers$Typer.body$2(Typers.scala:5613)
      at typechecker.Typers$Typer.typed(Typers.scala:5618)
      at typechecker.Typers$
      Typer.typedQualifier(Typers.scala:5723)
      at typechecker.Typers$Typer.typedQualifier(Typers.scala:5731)
      at transform.Erasure$
      Eraser.adaptMember(Erasure.scala:714)
      at transform.Erasure$Eraser.typed1(Erasure.scala:789)


      The end goal is so I can print out stack traces of recursive functions in a nicer fashion



      java.lang.StackOverflowError
      at transform.Erasure$Eraser.typed1(Erasure.scala:789)
      ------------------------- Repeated 3x -------------------------
      at typechecker.Typers$
      Typer.runTyper$1(Typers.scala:5640)
      at typechecker.Typers$
      Typer.typedInternal(Typers.scala:5672)
      at typechecker.Typers$Typer.body$2(Typers.scala:5613)
      at typechecker.Typers$Typer.typed(Typers.scala:5618)
      at typechecker.Typers$
      Typer.$anonfun$typed1$38(Typers.scala:4752)
      at typechecker.Typers$
      Typer.silent(Typers.scala:700)
      at typechecker.Typers$Typer.normalTypedApply$1(Typers.scala:4754)
      at typechecker.Typers$Typer.typedApply$1(Typers.scala:4801)
      at typechecker.Typers$Typer.typedInAnyMode$1(Typers.scala:5586)
      at typechecker.Typers$Typer.typed1(Typers.scala:5603)
      at transform.Erasure$
      Eraser.typed1(Erasure.scala:789)
      at typechecker.Typers$Typer.runTyper$1(Typers.scala:5640)
      at typechecker.Typers$Typer.typedInternal(Typers.scala:5672)
      at typechecker.Typers$
      Typer.body$2(Typers.scala:5613)
      at typechecker.Typers$
      Typer.typed(Typers.scala:5618)
      at typechecker.Typers$Typer.typedQualifier(Typers.scala:5723)
      at typechecker.Typers$
      Typer.typedQualifier(Typers.scala:5731)
      at transform.Erasure$Eraser.adaptMember(Erasure.scala:714)
      at transform.Erasure$
      Eraser.typed1(Erasure.scala:789)
      -------------------------------------------------------------
      at typechecker.Typers$Typer.runTyper$1(Typers.scala:5640)
      at typechecker.Typers$Typer.typedInternal(Typers.scala:5672)
      at typechecker.Typers$
      Typer.body$2(Typers.scala:5613)
      at typechecker.Typers$
      Typer.typed(Typers.scala:5618)
      at typechecker.Typers$Typer.$anonfun$typed1$38(Typers.scala:4752)


      I'm not sure how feasible this is, but I'd be happy for any and all possible solutions even if they have their own limitations and constraints. Preliminary Googling didn't find me anything, but quite likely I just don't know what to google










      share|cite|improve this question









      New contributor




      Li Haoyi is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
      Check out our Code of Conduct.











      I would like an algorithm that can identify repeated parts of big stack traces like this:



      java.lang.StackOverflowError
      at transform.Erasure$Eraser.typed1(Erasure.scala:789)
      at typechecker.Typers$
      Typer.runTyper$1(Typers.scala:5640)
      at typechecker.Typers$
      Typer.typedInternal(Typers.scala:5672)
      at typechecker.Typers$Typer.body$2(Typers.scala:5613)
      at typechecker.Typers$Typer.typed(Typers.scala:5618)
      at typechecker.Typers$
      Typer.$anonfun$typed1$38(Typers.scala:4752)
      at typechecker.Typers$
      Typer.silent(Typers.scala:700)
      at typechecker.Typers$Typer.normalTypedApply$1(Typers.scala:4754)
      at typechecker.Typers$Typer.typedApply$1(Typers.scala:4801)
      at typechecker.Typers$Typer.typedInAnyMode$1(Typers.scala:5586)
      at typechecker.Typers$Typer.typed1(Typers.scala:5603)
      at transform.Erasure$
      Eraser.typed1(Erasure.scala:789)
      at typechecker.Typers$Typer.runTyper$1(Typers.scala:5640)
      at typechecker.Typers$Typer.typedInternal(Typers.scala:5672)
      at typechecker.Typers$
      Typer.body$2(Typers.scala:5613)
      at typechecker.Typers$
      Typer.typed(Typers.scala:5618)
      at typechecker.Typers$Typer.typedQualifier(Typers.scala:5723)
      at typechecker.Typers$
      Typer.typedQualifier(Typers.scala:5731)
      at transform.Erasure$Eraser.adaptMember(Erasure.scala:714)
      at transform.Erasure$
      Eraser.typed1(Erasure.scala:789)
      at typechecker.Typers$Typer.runTyper$1(Typers.scala:5640)
      at typechecker.Typers$Typer.typedInternal(Typers.scala:5672)
      at typechecker.Typers$
      Typer.body$2(Typers.scala:5613)
      at typechecker.Typers$
      Typer.typed(Typers.scala:5618)
      at typechecker.Typers$Typer.$anonfun$typed1$38(Typers.scala:4752)
      at typechecker.Typers$Typer.silent(Typers.scala:700)
      at typechecker.Typers$
      Typer.normalTypedApply$1(Typers.scala:4754)
      at typechecker.Typers$
      Typer.typedApply$1(Typers.scala:4801)
      at typechecker.Typers$
      Typer.typedInAnyMode$1(Typers.scala:5586)
      at typechecker.Typers$
      Typer.typed1(Typers.scala:5603)
      at transform.Erasure$Eraser.typed1(Erasure.scala:789)
      at typechecker.Typers$
      Typer.runTyper$1(Typers.scala:5640)
      at typechecker.Typers$
      Typer.typedInternal(Typers.scala:5672)
      at typechecker.Typers$Typer.body$2(Typers.scala:5613)
      at typechecker.Typers$Typer.typed(Typers.scala:5618)
      at typechecker.Typers$
      Typer.typedQualifier(Typers.scala:5723)
      at typechecker.Typers$Typer.typedQualifier(Typers.scala:5731)
      at transform.Erasure$
      Eraser.adaptMember(Erasure.scala:714)
      at transform.Erasure$Eraser.typed1(Erasure.scala:789)
      at typechecker.Typers$
      Typer.runTyper$1(Typers.scala:5640)
      at typechecker.Typers$
      Typer.typedInternal(Typers.scala:5672)
      at typechecker.Typers$Typer.body$2(Typers.scala:5613)
      at typechecker.Typers$Typer.typed(Typers.scala:5618)
      at typechecker.Typers$
      Typer.$anonfun$typed1$38(Typers.scala:4752)
      at typechecker.Typers$
      Typer.silent(Typers.scala:700)
      at typechecker.Typers$Typer.normalTypedApply$1(Typers.scala:4754)
      at typechecker.Typers$Typer.typedApply$1(Typers.scala:4801)
      at typechecker.Typers$Typer.typedInAnyMode$1(Typers.scala:5586)
      at typechecker.Typers$Typer.typed1(Typers.scala:5603)
      at transform.Erasure$
      Eraser.typed1(Erasure.scala:789)
      at typechecker.Typers$Typer.runTyper$1(Typers.scala:5640)
      at typechecker.Typers$Typer.typedInternal(Typers.scala:5672)
      at typechecker.Typers$
      Typer.body$2(Typers.scala:5613)
      at typechecker.Typers$
      Typer.typed(Typers.scala:5618)
      at typechecker.Typers$Typer.typedQualifier(Typers.scala:5723)
      at typechecker.Typers$
      Typer.typedQualifier(Typers.scala:5731)
      at transform.Erasure$Eraser.adaptMember(Erasure.scala:714)
      at transform.Erasure$
      Eraser.typed1(Erasure.scala:789)
      at typechecker.Typers$Typer.runTyper$1(Typers.scala:5640)
      at typechecker.Typers$Typer.typedInternal(Typers.scala:5672)
      at typechecker.Typers$
      Typer.body$2(Typers.scala:5613)
      at typechecker.Typers$
      Typer.typed(Typers.scala:5618)
      at typechecker.Typers$Typer.$anonfun$typed1$38(Typers.scala:4752)


      With a bit of inspection, it's clear that this segment is being repeated three times:



      at typechecker.Typers$Typer.runTyper$1(Typers.scala:5640)
      at typechecker.Typers$Typer.typedInternal(Typers.scala:5672)
      at typechecker.Typers$
      Typer.body$2(Typers.scala:5613)
      at typechecker.Typers$
      Typer.typed(Typers.scala:5618)
      at typechecker.Typers$Typer.$anonfun$typed1$38(Typers.scala:4752)
      at typechecker.Typers$Typer.silent(Typers.scala:700)
      at typechecker.Typers$
      Typer.normalTypedApply$1(Typers.scala:4754)
      at typechecker.Typers$
      Typer.typedApply$1(Typers.scala:4801)
      at typechecker.Typers$
      Typer.typedInAnyMode$1(Typers.scala:5586)
      at typechecker.Typers$
      Typer.typed1(Typers.scala:5603)
      at transform.Erasure$Eraser.typed1(Erasure.scala:789)
      at typechecker.Typers$
      Typer.runTyper$1(Typers.scala:5640)
      at typechecker.Typers$
      Typer.typedInternal(Typers.scala:5672)
      at typechecker.Typers$Typer.body$2(Typers.scala:5613)
      at typechecker.Typers$Typer.typed(Typers.scala:5618)
      at typechecker.Typers$
      Typer.typedQualifier(Typers.scala:5723)
      at typechecker.Typers$Typer.typedQualifier(Typers.scala:5731)
      at transform.Erasure$
      Eraser.adaptMember(Erasure.scala:714)
      at transform.Erasure$Eraser.typed1(Erasure.scala:789)


      The end goal is so I can print out stack traces of recursive functions in a nicer fashion



      java.lang.StackOverflowError
      at transform.Erasure$Eraser.typed1(Erasure.scala:789)
      ------------------------- Repeated 3x -------------------------
      at typechecker.Typers$
      Typer.runTyper$1(Typers.scala:5640)
      at typechecker.Typers$
      Typer.typedInternal(Typers.scala:5672)
      at typechecker.Typers$Typer.body$2(Typers.scala:5613)
      at typechecker.Typers$Typer.typed(Typers.scala:5618)
      at typechecker.Typers$
      Typer.$anonfun$typed1$38(Typers.scala:4752)
      at typechecker.Typers$
      Typer.silent(Typers.scala:700)
      at typechecker.Typers$Typer.normalTypedApply$1(Typers.scala:4754)
      at typechecker.Typers$Typer.typedApply$1(Typers.scala:4801)
      at typechecker.Typers$Typer.typedInAnyMode$1(Typers.scala:5586)
      at typechecker.Typers$Typer.typed1(Typers.scala:5603)
      at transform.Erasure$
      Eraser.typed1(Erasure.scala:789)
      at typechecker.Typers$Typer.runTyper$1(Typers.scala:5640)
      at typechecker.Typers$Typer.typedInternal(Typers.scala:5672)
      at typechecker.Typers$
      Typer.body$2(Typers.scala:5613)
      at typechecker.Typers$
      Typer.typed(Typers.scala:5618)
      at typechecker.Typers$Typer.typedQualifier(Typers.scala:5723)
      at typechecker.Typers$
      Typer.typedQualifier(Typers.scala:5731)
      at transform.Erasure$Eraser.adaptMember(Erasure.scala:714)
      at transform.Erasure$
      Eraser.typed1(Erasure.scala:789)
      -------------------------------------------------------------
      at typechecker.Typers$Typer.runTyper$1(Typers.scala:5640)
      at typechecker.Typers$Typer.typedInternal(Typers.scala:5672)
      at typechecker.Typers$
      Typer.body$2(Typers.scala:5613)
      at typechecker.Typers$
      Typer.typed(Typers.scala:5618)
      at typechecker.Typers$Typer.$anonfun$typed1$38(Typers.scala:4752)


      I'm not sure how feasible this is, but I'd be happy for any and all possible solutions even if they have their own limitations and constraints. Preliminary Googling didn't find me anything, but quite likely I just don't know what to google







      algorithms






      share|cite|improve this question









      New contributor




      Li Haoyi is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
      Check out our Code of Conduct.











      share|cite|improve this question









      New contributor




      Li Haoyi is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
      Check out our Code of Conduct.









      share|cite|improve this question




      share|cite|improve this question








      edited 2 days ago







      Li Haoyi













      New contributor




      Li Haoyi is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
      Check out our Code of Conduct.









      asked 2 days ago









      Li HaoyiLi Haoyi

      436




      436




      New contributor




      Li Haoyi is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
      Check out our Code of Conduct.





      New contributor





      Li Haoyi is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
      Check out our Code of Conduct.






      Li Haoyi is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
      Check out our Code of Conduct.






















          5 Answers
          5






          active

          oldest

          votes


















          3














          Under normal circumstances, JVM fills only the last 1024 calls in a stacktrace, and in Dotty/Scalac most stackoverflows have a repeating fragment of length ≈ 70 or less. A stacktrace T of a StackOverflowException can be decomposed into three parts S · R^N · P, where R is the repeating part of the stacktrace, S is some suffix of R, and P is either some prefix of R or an unrelated sequence of calls. We are interested in a solution such that the total length C = |S · R^N| of the repeating part and N are both maximal, and |S| is minimal.



          // Scala Pseudocode (beware of for comprehensions)
          //
          // Stack is assumed to be in reverse order,
          // most recent stack frame is last.
          val stack: Array[StackTraceElement]
          val F: Int // Maximum size of R.

          val candidates = for {
          // Enumerate all possible suffixes S.
          S <- ∀ prefix of stack
          if |S| < F

          // Remove the suffix from the stack,
          R <- ∀ non-empty prefix of stack.drop(|S|)
          // Find a fragment that ends with S.
          if R.endsWith(S)

          // Find out how many fragments fit into the stack.
          // (how many times we can remove R from the stack)
          N = coverSize(R, stack.drop(|S|))
          if N >= 2 // Or higher.
          } yield (S, R, N)

          // Best cover has maximum coverage,
          // minimum fragment length,
          // and minimum suffix length.
          val bestCandidate = candidates.maxBy { (S, R, N) =>
          val C = |S| + |R| * N
          return (C, -|R|, -|S|)
          }


          The entire algorithm can be implemented in a way that does not allocate any memory (to handle OOM). It has complexity O(F^2 |T|), but exceptions are rare enough and this is a small constant (F << 1024, T = 1024).



          I have implemented this exact algorithm in my library https://github.com/alexknvl/tracehash (https://github.com/alexknvl/tracehash/blob/master/src/main/java/tracehash/internal/SOCoverSolver.java) for the same purpose of simplifying scalac/dotc errors ;)



          EDIT: Here is an implementation of the same algorithm in Python:



          stack = list(reversed([3, 4, 2, 1, 2, 1, 2, 1, 2, 1, 2]))
          F = 6

          results =
          for slen in range(0, F + 1):
          suffix, stack1 = stack[:slen], stack[slen:]

          for flen in range(1, F + 1):
          fragment = stack1[:flen]

          if fragment[flen - slen:] != suffix:
          continue

          stack2 = stack1[:]
          n = 0
          while stack2[:flen] == fragment:
          stack2 = stack2[flen:]
          n += 1

          if n >= 2: # A heuristic, might want to set it a bit higher.
          results.append((slen, flen, n))

          def cost(t):
          s, r, n = t
          c = s + r * n
          return (c, -r, -s)

          S, R, N = max(results, key=cost)
          print('%s · %s^%d · %s' % (stack[:S], stack[S:S+R], N, stack[S+R*N:]))
          # Prints · [2, 1]^4 · [2, 4, 3]


          EDIT2: Following some of the ideas from mukel's answer, here is a function https://gist.github.com/alexknvl/b041099eb5347d728e2dacd1e8caed8c that solves something along the lines of:



          stack = a[1]^k[1] · a[2]^k[2] · ...
          argmax (sum |a[i]| * k[i] where k[i] >= 2,
          -sum |a[i]| where k[i] >= 2,
          -sum |a[i]| where k[i] == 1)


          It is greedy so it is not necessarily an optimal solution, but it seems to work reasonably well in simple cases, e.g. given



          stack = list(reversed([
          3, 4, 2,
          1, 2, 1, 2, 1, 2, 1, 2,
          5,
          4, 5, 4, 5, 4, 5,
          3, 3, 3, 3]))


          it produces an answer



          [([3], 4), ([5, 4], 3), ([5], 1), ([2, 1], 4), ([2, 4, 3], 1)] 





          share|cite|improve this answer










          New contributor




          alex.knvl is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
          Check out our Code of Conduct.


















          • "The entire algorithm can be implemented in a way that does not allocate any memory." Do you mean that the algorithm can be implemented so that it only uses stack memory? Why is it useful to use stack memory only?
            – Apass.Jack
            yesterday








          • 1




            You might want to avoid allocating anything on the heap in case you catch an OutOfMemoryError.
            – alex.knvl
            yesterday










          • I see. Well, it happens the error in the question is StackOverflowError.
            – Apass.Jack
            yesterday





















          5














          Build suffix tree using Ukkonen's algorithm, this way in $mathcal O(n)$ you will find all substrings in provided text with indices.



          In the case of approximate matching, there is also extended version of Ukkonen's algorithm.






          share|cite|improve this answer





























            3














            If you consider that “a line of stacktrace” = “a character”, you can use:



            http://en.wikipedia.org/wiki/Longest_repeated_substring_problem



            One way to solve this problem efficiently is by constructing a Suffix Trie: http://en.wikipedia.org/wiki/Suffix_tree



            Every time you traverse an already estabilished path you are actually discovering a repetition in your string.



            Now, simply list all the repetions in a separate data structure and extract the longest one... or all if you want.






            share|cite|improve this answer








            New contributor




            n1r3 is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
            Check out our Code of Conduct.














            • 1




              The longest repeated substring problem appears to also find substrings which overlap, or are repeated with gaps between them. Is there a way to make it only find substrings which are repeated consecutively without gaps or overlap?
              – Li Haoyi
              2 days ago










            • from the suffix tree, you can easily list all repeated substrings. I created the following Gist to show my idea: gist.github.com/nremond/fb63a27b6291053ca4dd1de81135de84 I'm a bit ashamed by the quick&dirty code, but if you like the algo, we can work on it.
              – n1r3
              yesterday





















            3














            An efficient string factorization algorithm may help.
            Given a string $S$, $n = |S|$ find maximum $p$ such that $S = T^p$ e.g. $T$ concatenated $p$ times, we call $T$ the seed and p the period. This can be done in $O(n)$ using the prefix function of the (Knuth-)Morris-Pratt algorithm, but don't be scared by the name it's very simple and beautiful algorithm, here's my solution for the corresponding problem SPOJ PERIOD.



            Armed with a fast string factorization we can then proceed to decompose a string as the concatenation of factored chunks (dynamic programming) using a custom cost function e.g. minimize the total length of the seeds, minimize the sum of squares of the seeds' lengths...



            $S = {T_1}^{p_1}{T_2}^{p_2}cdots{T_k}^{p_k}$



            Total complexity is $O(n^2)$.



            If $O(n^2)$ is way too costly you can shift to a greedy like strategy e.g. as soon as you find a factorizable chunk, squeeze it and continue from that point, and also limit the maximum seed size (e.g. $|T|le 200$) and prune the search if you don't find a period $ge 2$ in the first e.g. 400 characters.






            share|cite|improve this answer










            New contributor




            mukel is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
            Check out our Code of Conduct.


























              0














                string contiguous_repeated_string(string A){
              string r(1,A[0]);
              for(int i=0;i<A.size();i++){
              for(int l=2;l<=A.size();l++){
              string s=A.substr(i,l);
              if ((s+s).substr(1,2*s.size()-1).find(s)!=s.size()-1 and s.size()>r.size()){
              r=s;
              }
              }
              }
              return r;
              }


              Call contiguous_repeated_string("abcdefgabcdabcdabcd") you will get abcdabcdabcd.



              Call contiguous_repeated_string("ATCGATCGA") you will get ATCGATCG.



              And then you can use KMP's Longest Proper Prefix Postfix array to fold the resulting string with O(N).



              Total time complexity is O(N^2).






              share|cite|improve this answer








              New contributor




              Wu Fuheng is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
              Check out our Code of Conduct.


















              • "Total time complexity is $O(N^2)$". What about the worst time complexity?
                – Apass.Jack
                14 hours ago











              Your Answer





              StackExchange.ifUsing("editor", function () {
              return StackExchange.using("mathjaxEditing", function () {
              StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
              StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
              });
              });
              }, "mathjax-editing");

              StackExchange.ready(function() {
              var channelOptions = {
              tags: "".split(" "),
              id: "419"
              };
              initTagRenderer("".split(" "), "".split(" "), channelOptions);

              StackExchange.using("externalEditor", function() {
              // Have to fire editor after snippets, if snippets enabled
              if (StackExchange.settings.snippets.snippetsEnabled) {
              StackExchange.using("snippets", function() {
              createEditor();
              });
              }
              else {
              createEditor();
              }
              });

              function createEditor() {
              StackExchange.prepareEditor({
              heartbeatType: 'answer',
              autoActivateHeartbeat: false,
              convertImagesToLinks: false,
              noModals: true,
              showLowRepImageUploadWarning: true,
              reputationToPostImages: null,
              bindNavPrevention: true,
              postfix: "",
              imageUploader: {
              brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
              contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
              allowUrls: true
              },
              onDemand: true,
              discardSelector: ".discard-answer"
              ,immediatelyShowMarkdownHelp:true
              });


              }
              });






              Li Haoyi is a new contributor. Be nice, and check out our Code of Conduct.










              draft saved

              draft discarded


















              StackExchange.ready(
              function () {
              StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcs.stackexchange.com%2fquestions%2f102463%2flooking-for-an-algorithm-to-identify-contiguous-repeated-series-of-lines-in-a-lo%23new-answer', 'question_page');
              }
              );

              Post as a guest















              Required, but never shown

























              5 Answers
              5






              active

              oldest

              votes








              5 Answers
              5






              active

              oldest

              votes









              active

              oldest

              votes






              active

              oldest

              votes









              3














              Under normal circumstances, JVM fills only the last 1024 calls in a stacktrace, and in Dotty/Scalac most stackoverflows have a repeating fragment of length ≈ 70 or less. A stacktrace T of a StackOverflowException can be decomposed into three parts S · R^N · P, where R is the repeating part of the stacktrace, S is some suffix of R, and P is either some prefix of R or an unrelated sequence of calls. We are interested in a solution such that the total length C = |S · R^N| of the repeating part and N are both maximal, and |S| is minimal.



              // Scala Pseudocode (beware of for comprehensions)
              //
              // Stack is assumed to be in reverse order,
              // most recent stack frame is last.
              val stack: Array[StackTraceElement]
              val F: Int // Maximum size of R.

              val candidates = for {
              // Enumerate all possible suffixes S.
              S <- ∀ prefix of stack
              if |S| < F

              // Remove the suffix from the stack,
              R <- ∀ non-empty prefix of stack.drop(|S|)
              // Find a fragment that ends with S.
              if R.endsWith(S)

              // Find out how many fragments fit into the stack.
              // (how many times we can remove R from the stack)
              N = coverSize(R, stack.drop(|S|))
              if N >= 2 // Or higher.
              } yield (S, R, N)

              // Best cover has maximum coverage,
              // minimum fragment length,
              // and minimum suffix length.
              val bestCandidate = candidates.maxBy { (S, R, N) =>
              val C = |S| + |R| * N
              return (C, -|R|, -|S|)
              }


              The entire algorithm can be implemented in a way that does not allocate any memory (to handle OOM). It has complexity O(F^2 |T|), but exceptions are rare enough and this is a small constant (F << 1024, T = 1024).



              I have implemented this exact algorithm in my library https://github.com/alexknvl/tracehash (https://github.com/alexknvl/tracehash/blob/master/src/main/java/tracehash/internal/SOCoverSolver.java) for the same purpose of simplifying scalac/dotc errors ;)



              EDIT: Here is an implementation of the same algorithm in Python:



              stack = list(reversed([3, 4, 2, 1, 2, 1, 2, 1, 2, 1, 2]))
              F = 6

              results =
              for slen in range(0, F + 1):
              suffix, stack1 = stack[:slen], stack[slen:]

              for flen in range(1, F + 1):
              fragment = stack1[:flen]

              if fragment[flen - slen:] != suffix:
              continue

              stack2 = stack1[:]
              n = 0
              while stack2[:flen] == fragment:
              stack2 = stack2[flen:]
              n += 1

              if n >= 2: # A heuristic, might want to set it a bit higher.
              results.append((slen, flen, n))

              def cost(t):
              s, r, n = t
              c = s + r * n
              return (c, -r, -s)

              S, R, N = max(results, key=cost)
              print('%s · %s^%d · %s' % (stack[:S], stack[S:S+R], N, stack[S+R*N:]))
              # Prints · [2, 1]^4 · [2, 4, 3]


              EDIT2: Following some of the ideas from mukel's answer, here is a function https://gist.github.com/alexknvl/b041099eb5347d728e2dacd1e8caed8c that solves something along the lines of:



              stack = a[1]^k[1] · a[2]^k[2] · ...
              argmax (sum |a[i]| * k[i] where k[i] >= 2,
              -sum |a[i]| where k[i] >= 2,
              -sum |a[i]| where k[i] == 1)


              It is greedy so it is not necessarily an optimal solution, but it seems to work reasonably well in simple cases, e.g. given



              stack = list(reversed([
              3, 4, 2,
              1, 2, 1, 2, 1, 2, 1, 2,
              5,
              4, 5, 4, 5, 4, 5,
              3, 3, 3, 3]))


              it produces an answer



              [([3], 4), ([5, 4], 3), ([5], 1), ([2, 1], 4), ([2, 4, 3], 1)] 





              share|cite|improve this answer










              New contributor




              alex.knvl is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
              Check out our Code of Conduct.


















              • "The entire algorithm can be implemented in a way that does not allocate any memory." Do you mean that the algorithm can be implemented so that it only uses stack memory? Why is it useful to use stack memory only?
                – Apass.Jack
                yesterday








              • 1




                You might want to avoid allocating anything on the heap in case you catch an OutOfMemoryError.
                – alex.knvl
                yesterday










              • I see. Well, it happens the error in the question is StackOverflowError.
                – Apass.Jack
                yesterday


















              3














              Under normal circumstances, JVM fills only the last 1024 calls in a stacktrace, and in Dotty/Scalac most stackoverflows have a repeating fragment of length ≈ 70 or less. A stacktrace T of a StackOverflowException can be decomposed into three parts S · R^N · P, where R is the repeating part of the stacktrace, S is some suffix of R, and P is either some prefix of R or an unrelated sequence of calls. We are interested in a solution such that the total length C = |S · R^N| of the repeating part and N are both maximal, and |S| is minimal.



              // Scala Pseudocode (beware of for comprehensions)
              //
              // Stack is assumed to be in reverse order,
              // most recent stack frame is last.
              val stack: Array[StackTraceElement]
              val F: Int // Maximum size of R.

              val candidates = for {
              // Enumerate all possible suffixes S.
              S <- ∀ prefix of stack
              if |S| < F

              // Remove the suffix from the stack,
              R <- ∀ non-empty prefix of stack.drop(|S|)
              // Find a fragment that ends with S.
              if R.endsWith(S)

              // Find out how many fragments fit into the stack.
              // (how many times we can remove R from the stack)
              N = coverSize(R, stack.drop(|S|))
              if N >= 2 // Or higher.
              } yield (S, R, N)

              // Best cover has maximum coverage,
              // minimum fragment length,
              // and minimum suffix length.
              val bestCandidate = candidates.maxBy { (S, R, N) =>
              val C = |S| + |R| * N
              return (C, -|R|, -|S|)
              }


              The entire algorithm can be implemented in a way that does not allocate any memory (to handle OOM). It has complexity O(F^2 |T|), but exceptions are rare enough and this is a small constant (F << 1024, T = 1024).



              I have implemented this exact algorithm in my library https://github.com/alexknvl/tracehash (https://github.com/alexknvl/tracehash/blob/master/src/main/java/tracehash/internal/SOCoverSolver.java) for the same purpose of simplifying scalac/dotc errors ;)



              EDIT: Here is an implementation of the same algorithm in Python:



              stack = list(reversed([3, 4, 2, 1, 2, 1, 2, 1, 2, 1, 2]))
              F = 6

              results =
              for slen in range(0, F + 1):
              suffix, stack1 = stack[:slen], stack[slen:]

              for flen in range(1, F + 1):
              fragment = stack1[:flen]

              if fragment[flen - slen:] != suffix:
              continue

              stack2 = stack1[:]
              n = 0
              while stack2[:flen] == fragment:
              stack2 = stack2[flen:]
              n += 1

              if n >= 2: # A heuristic, might want to set it a bit higher.
              results.append((slen, flen, n))

              def cost(t):
              s, r, n = t
              c = s + r * n
              return (c, -r, -s)

              S, R, N = max(results, key=cost)
              print('%s · %s^%d · %s' % (stack[:S], stack[S:S+R], N, stack[S+R*N:]))
              # Prints · [2, 1]^4 · [2, 4, 3]


              EDIT2: Following some of the ideas from mukel's answer, here is a function https://gist.github.com/alexknvl/b041099eb5347d728e2dacd1e8caed8c that solves something along the lines of:



              stack = a[1]^k[1] · a[2]^k[2] · ...
              argmax (sum |a[i]| * k[i] where k[i] >= 2,
              -sum |a[i]| where k[i] >= 2,
              -sum |a[i]| where k[i] == 1)


              It is greedy so it is not necessarily an optimal solution, but it seems to work reasonably well in simple cases, e.g. given



              stack = list(reversed([
              3, 4, 2,
              1, 2, 1, 2, 1, 2, 1, 2,
              5,
              4, 5, 4, 5, 4, 5,
              3, 3, 3, 3]))


              it produces an answer



              [([3], 4), ([5, 4], 3), ([5], 1), ([2, 1], 4), ([2, 4, 3], 1)] 





              share|cite|improve this answer










              New contributor




              alex.knvl is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
              Check out our Code of Conduct.


















              • "The entire algorithm can be implemented in a way that does not allocate any memory." Do you mean that the algorithm can be implemented so that it only uses stack memory? Why is it useful to use stack memory only?
                – Apass.Jack
                yesterday








              • 1




                You might want to avoid allocating anything on the heap in case you catch an OutOfMemoryError.
                – alex.knvl
                yesterday










              • I see. Well, it happens the error in the question is StackOverflowError.
                – Apass.Jack
                yesterday
















              3












              3








              3






              Under normal circumstances, JVM fills only the last 1024 calls in a stacktrace, and in Dotty/Scalac most stackoverflows have a repeating fragment of length ≈ 70 or less. A stacktrace T of a StackOverflowException can be decomposed into three parts S · R^N · P, where R is the repeating part of the stacktrace, S is some suffix of R, and P is either some prefix of R or an unrelated sequence of calls. We are interested in a solution such that the total length C = |S · R^N| of the repeating part and N are both maximal, and |S| is minimal.



              // Scala Pseudocode (beware of for comprehensions)
              //
              // Stack is assumed to be in reverse order,
              // most recent stack frame is last.
              val stack: Array[StackTraceElement]
              val F: Int // Maximum size of R.

              val candidates = for {
              // Enumerate all possible suffixes S.
              S <- ∀ prefix of stack
              if |S| < F

              // Remove the suffix from the stack,
              R <- ∀ non-empty prefix of stack.drop(|S|)
              // Find a fragment that ends with S.
              if R.endsWith(S)

              // Find out how many fragments fit into the stack.
              // (how many times we can remove R from the stack)
              N = coverSize(R, stack.drop(|S|))
              if N >= 2 // Or higher.
              } yield (S, R, N)

              // Best cover has maximum coverage,
              // minimum fragment length,
              // and minimum suffix length.
              val bestCandidate = candidates.maxBy { (S, R, N) =>
              val C = |S| + |R| * N
              return (C, -|R|, -|S|)
              }


              The entire algorithm can be implemented in a way that does not allocate any memory (to handle OOM). It has complexity O(F^2 |T|), but exceptions are rare enough and this is a small constant (F << 1024, T = 1024).



              I have implemented this exact algorithm in my library https://github.com/alexknvl/tracehash (https://github.com/alexknvl/tracehash/blob/master/src/main/java/tracehash/internal/SOCoverSolver.java) for the same purpose of simplifying scalac/dotc errors ;)



              EDIT: Here is an implementation of the same algorithm in Python:



              stack = list(reversed([3, 4, 2, 1, 2, 1, 2, 1, 2, 1, 2]))
              F = 6

              results =
              for slen in range(0, F + 1):
              suffix, stack1 = stack[:slen], stack[slen:]

              for flen in range(1, F + 1):
              fragment = stack1[:flen]

              if fragment[flen - slen:] != suffix:
              continue

              stack2 = stack1[:]
              n = 0
              while stack2[:flen] == fragment:
              stack2 = stack2[flen:]
              n += 1

              if n >= 2: # A heuristic, might want to set it a bit higher.
              results.append((slen, flen, n))

              def cost(t):
              s, r, n = t
              c = s + r * n
              return (c, -r, -s)

              S, R, N = max(results, key=cost)
              print('%s · %s^%d · %s' % (stack[:S], stack[S:S+R], N, stack[S+R*N:]))
              # Prints · [2, 1]^4 · [2, 4, 3]


              EDIT2: Following some of the ideas from mukel's answer, here is a function https://gist.github.com/alexknvl/b041099eb5347d728e2dacd1e8caed8c that solves something along the lines of:



              stack = a[1]^k[1] · a[2]^k[2] · ...
              argmax (sum |a[i]| * k[i] where k[i] >= 2,
              -sum |a[i]| where k[i] >= 2,
              -sum |a[i]| where k[i] == 1)


              It is greedy so it is not necessarily an optimal solution, but it seems to work reasonably well in simple cases, e.g. given



              stack = list(reversed([
              3, 4, 2,
              1, 2, 1, 2, 1, 2, 1, 2,
              5,
              4, 5, 4, 5, 4, 5,
              3, 3, 3, 3]))


              it produces an answer



              [([3], 4), ([5, 4], 3), ([5], 1), ([2, 1], 4), ([2, 4, 3], 1)] 





              share|cite|improve this answer










              New contributor




              alex.knvl is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
              Check out our Code of Conduct.









              Under normal circumstances, JVM fills only the last 1024 calls in a stacktrace, and in Dotty/Scalac most stackoverflows have a repeating fragment of length ≈ 70 or less. A stacktrace T of a StackOverflowException can be decomposed into three parts S · R^N · P, where R is the repeating part of the stacktrace, S is some suffix of R, and P is either some prefix of R or an unrelated sequence of calls. We are interested in a solution such that the total length C = |S · R^N| of the repeating part and N are both maximal, and |S| is minimal.



              // Scala Pseudocode (beware of for comprehensions)
              //
              // Stack is assumed to be in reverse order,
              // most recent stack frame is last.
              val stack: Array[StackTraceElement]
              val F: Int // Maximum size of R.

              val candidates = for {
              // Enumerate all possible suffixes S.
              S <- ∀ prefix of stack
              if |S| < F

              // Remove the suffix from the stack,
              R <- ∀ non-empty prefix of stack.drop(|S|)
              // Find a fragment that ends with S.
              if R.endsWith(S)

              // Find out how many fragments fit into the stack.
              // (how many times we can remove R from the stack)
              N = coverSize(R, stack.drop(|S|))
              if N >= 2 // Or higher.
              } yield (S, R, N)

              // Best cover has maximum coverage,
              // minimum fragment length,
              // and minimum suffix length.
              val bestCandidate = candidates.maxBy { (S, R, N) =>
              val C = |S| + |R| * N
              return (C, -|R|, -|S|)
              }


              The entire algorithm can be implemented in a way that does not allocate any memory (to handle OOM). It has complexity O(F^2 |T|), but exceptions are rare enough and this is a small constant (F << 1024, T = 1024).



              I have implemented this exact algorithm in my library https://github.com/alexknvl/tracehash (https://github.com/alexknvl/tracehash/blob/master/src/main/java/tracehash/internal/SOCoverSolver.java) for the same purpose of simplifying scalac/dotc errors ;)



              EDIT: Here is an implementation of the same algorithm in Python:



              stack = list(reversed([3, 4, 2, 1, 2, 1, 2, 1, 2, 1, 2]))
              F = 6

              results =
              for slen in range(0, F + 1):
              suffix, stack1 = stack[:slen], stack[slen:]

              for flen in range(1, F + 1):
              fragment = stack1[:flen]

              if fragment[flen - slen:] != suffix:
              continue

              stack2 = stack1[:]
              n = 0
              while stack2[:flen] == fragment:
              stack2 = stack2[flen:]
              n += 1

              if n >= 2: # A heuristic, might want to set it a bit higher.
              results.append((slen, flen, n))

              def cost(t):
              s, r, n = t
              c = s + r * n
              return (c, -r, -s)

              S, R, N = max(results, key=cost)
              print('%s · %s^%d · %s' % (stack[:S], stack[S:S+R], N, stack[S+R*N:]))
              # Prints · [2, 1]^4 · [2, 4, 3]


              EDIT2: Following some of the ideas from mukel's answer, here is a function https://gist.github.com/alexknvl/b041099eb5347d728e2dacd1e8caed8c that solves something along the lines of:



              stack = a[1]^k[1] · a[2]^k[2] · ...
              argmax (sum |a[i]| * k[i] where k[i] >= 2,
              -sum |a[i]| where k[i] >= 2,
              -sum |a[i]| where k[i] == 1)


              It is greedy so it is not necessarily an optimal solution, but it seems to work reasonably well in simple cases, e.g. given



              stack = list(reversed([
              3, 4, 2,
              1, 2, 1, 2, 1, 2, 1, 2,
              5,
              4, 5, 4, 5, 4, 5,
              3, 3, 3, 3]))


              it produces an answer



              [([3], 4), ([5, 4], 3), ([5], 1), ([2, 1], 4), ([2, 4, 3], 1)] 






              share|cite|improve this answer










              New contributor




              alex.knvl is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
              Check out our Code of Conduct.









              share|cite|improve this answer



              share|cite|improve this answer








              edited yesterday





















              New contributor




              alex.knvl is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
              Check out our Code of Conduct.









              answered yesterday









              alex.knvlalex.knvl

              462




              462




              New contributor




              alex.knvl is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
              Check out our Code of Conduct.





              New contributor





              alex.knvl is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
              Check out our Code of Conduct.






              alex.knvl is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
              Check out our Code of Conduct.












              • "The entire algorithm can be implemented in a way that does not allocate any memory." Do you mean that the algorithm can be implemented so that it only uses stack memory? Why is it useful to use stack memory only?
                – Apass.Jack
                yesterday








              • 1




                You might want to avoid allocating anything on the heap in case you catch an OutOfMemoryError.
                – alex.knvl
                yesterday










              • I see. Well, it happens the error in the question is StackOverflowError.
                – Apass.Jack
                yesterday




















              • "The entire algorithm can be implemented in a way that does not allocate any memory." Do you mean that the algorithm can be implemented so that it only uses stack memory? Why is it useful to use stack memory only?
                – Apass.Jack
                yesterday








              • 1




                You might want to avoid allocating anything on the heap in case you catch an OutOfMemoryError.
                – alex.knvl
                yesterday










              • I see. Well, it happens the error in the question is StackOverflowError.
                – Apass.Jack
                yesterday


















              "The entire algorithm can be implemented in a way that does not allocate any memory." Do you mean that the algorithm can be implemented so that it only uses stack memory? Why is it useful to use stack memory only?
              – Apass.Jack
              yesterday






              "The entire algorithm can be implemented in a way that does not allocate any memory." Do you mean that the algorithm can be implemented so that it only uses stack memory? Why is it useful to use stack memory only?
              – Apass.Jack
              yesterday






              1




              1




              You might want to avoid allocating anything on the heap in case you catch an OutOfMemoryError.
              – alex.knvl
              yesterday




              You might want to avoid allocating anything on the heap in case you catch an OutOfMemoryError.
              – alex.knvl
              yesterday












              I see. Well, it happens the error in the question is StackOverflowError.
              – Apass.Jack
              yesterday






              I see. Well, it happens the error in the question is StackOverflowError.
              – Apass.Jack
              yesterday













              5














              Build suffix tree using Ukkonen's algorithm, this way in $mathcal O(n)$ you will find all substrings in provided text with indices.



              In the case of approximate matching, there is also extended version of Ukkonen's algorithm.






              share|cite|improve this answer


























                5














                Build suffix tree using Ukkonen's algorithm, this way in $mathcal O(n)$ you will find all substrings in provided text with indices.



                In the case of approximate matching, there is also extended version of Ukkonen's algorithm.






                share|cite|improve this answer
























                  5












                  5








                  5






                  Build suffix tree using Ukkonen's algorithm, this way in $mathcal O(n)$ you will find all substrings in provided text with indices.



                  In the case of approximate matching, there is also extended version of Ukkonen's algorithm.






                  share|cite|improve this answer












                  Build suffix tree using Ukkonen's algorithm, this way in $mathcal O(n)$ you will find all substrings in provided text with indices.



                  In the case of approximate matching, there is also extended version of Ukkonen's algorithm.







                  share|cite|improve this answer












                  share|cite|improve this answer



                  share|cite|improve this answer










                  answered 2 days ago









                  EvilEvil

                  7,63842245




                  7,63842245























                      3














                      If you consider that “a line of stacktrace” = “a character”, you can use:



                      http://en.wikipedia.org/wiki/Longest_repeated_substring_problem



                      One way to solve this problem efficiently is by constructing a Suffix Trie: http://en.wikipedia.org/wiki/Suffix_tree



                      Every time you traverse an already estabilished path you are actually discovering a repetition in your string.



                      Now, simply list all the repetions in a separate data structure and extract the longest one... or all if you want.






                      share|cite|improve this answer








                      New contributor




                      n1r3 is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                      Check out our Code of Conduct.














                      • 1




                        The longest repeated substring problem appears to also find substrings which overlap, or are repeated with gaps between them. Is there a way to make it only find substrings which are repeated consecutively without gaps or overlap?
                        – Li Haoyi
                        2 days ago










                      • from the suffix tree, you can easily list all repeated substrings. I created the following Gist to show my idea: gist.github.com/nremond/fb63a27b6291053ca4dd1de81135de84 I'm a bit ashamed by the quick&dirty code, but if you like the algo, we can work on it.
                        – n1r3
                        yesterday


















                      3














                      If you consider that “a line of stacktrace” = “a character”, you can use:



                      http://en.wikipedia.org/wiki/Longest_repeated_substring_problem



                      One way to solve this problem efficiently is by constructing a Suffix Trie: http://en.wikipedia.org/wiki/Suffix_tree



                      Every time you traverse an already estabilished path you are actually discovering a repetition in your string.



                      Now, simply list all the repetions in a separate data structure and extract the longest one... or all if you want.






                      share|cite|improve this answer








                      New contributor




                      n1r3 is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                      Check out our Code of Conduct.














                      • 1




                        The longest repeated substring problem appears to also find substrings which overlap, or are repeated with gaps between them. Is there a way to make it only find substrings which are repeated consecutively without gaps or overlap?
                        – Li Haoyi
                        2 days ago










                      • from the suffix tree, you can easily list all repeated substrings. I created the following Gist to show my idea: gist.github.com/nremond/fb63a27b6291053ca4dd1de81135de84 I'm a bit ashamed by the quick&dirty code, but if you like the algo, we can work on it.
                        – n1r3
                        yesterday
















                      3












                      3








                      3






                      If you consider that “a line of stacktrace” = “a character”, you can use:



                      http://en.wikipedia.org/wiki/Longest_repeated_substring_problem



                      One way to solve this problem efficiently is by constructing a Suffix Trie: http://en.wikipedia.org/wiki/Suffix_tree



                      Every time you traverse an already estabilished path you are actually discovering a repetition in your string.



                      Now, simply list all the repetions in a separate data structure and extract the longest one... or all if you want.






                      share|cite|improve this answer








                      New contributor




                      n1r3 is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                      Check out our Code of Conduct.









                      If you consider that “a line of stacktrace” = “a character”, you can use:



                      http://en.wikipedia.org/wiki/Longest_repeated_substring_problem



                      One way to solve this problem efficiently is by constructing a Suffix Trie: http://en.wikipedia.org/wiki/Suffix_tree



                      Every time you traverse an already estabilished path you are actually discovering a repetition in your string.



                      Now, simply list all the repetions in a separate data structure and extract the longest one... or all if you want.







                      share|cite|improve this answer








                      New contributor




                      n1r3 is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                      Check out our Code of Conduct.









                      share|cite|improve this answer



                      share|cite|improve this answer






                      New contributor




                      n1r3 is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                      Check out our Code of Conduct.









                      answered 2 days ago









                      n1r3n1r3

                      1311




                      1311




                      New contributor




                      n1r3 is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                      Check out our Code of Conduct.





                      New contributor





                      n1r3 is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                      Check out our Code of Conduct.






                      n1r3 is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                      Check out our Code of Conduct.








                      • 1




                        The longest repeated substring problem appears to also find substrings which overlap, or are repeated with gaps between them. Is there a way to make it only find substrings which are repeated consecutively without gaps or overlap?
                        – Li Haoyi
                        2 days ago










                      • from the suffix tree, you can easily list all repeated substrings. I created the following Gist to show my idea: gist.github.com/nremond/fb63a27b6291053ca4dd1de81135de84 I'm a bit ashamed by the quick&dirty code, but if you like the algo, we can work on it.
                        – n1r3
                        yesterday
















                      • 1




                        The longest repeated substring problem appears to also find substrings which overlap, or are repeated with gaps between them. Is there a way to make it only find substrings which are repeated consecutively without gaps or overlap?
                        – Li Haoyi
                        2 days ago










                      • from the suffix tree, you can easily list all repeated substrings. I created the following Gist to show my idea: gist.github.com/nremond/fb63a27b6291053ca4dd1de81135de84 I'm a bit ashamed by the quick&dirty code, but if you like the algo, we can work on it.
                        – n1r3
                        yesterday










                      1




                      1




                      The longest repeated substring problem appears to also find substrings which overlap, or are repeated with gaps between them. Is there a way to make it only find substrings which are repeated consecutively without gaps or overlap?
                      – Li Haoyi
                      2 days ago




                      The longest repeated substring problem appears to also find substrings which overlap, or are repeated with gaps between them. Is there a way to make it only find substrings which are repeated consecutively without gaps or overlap?
                      – Li Haoyi
                      2 days ago












                      from the suffix tree, you can easily list all repeated substrings. I created the following Gist to show my idea: gist.github.com/nremond/fb63a27b6291053ca4dd1de81135de84 I'm a bit ashamed by the quick&dirty code, but if you like the algo, we can work on it.
                      – n1r3
                      yesterday






                      from the suffix tree, you can easily list all repeated substrings. I created the following Gist to show my idea: gist.github.com/nremond/fb63a27b6291053ca4dd1de81135de84 I'm a bit ashamed by the quick&dirty code, but if you like the algo, we can work on it.
                      – n1r3
                      yesterday













                      3














                      An efficient string factorization algorithm may help.
                      Given a string $S$, $n = |S|$ find maximum $p$ such that $S = T^p$ e.g. $T$ concatenated $p$ times, we call $T$ the seed and p the period. This can be done in $O(n)$ using the prefix function of the (Knuth-)Morris-Pratt algorithm, but don't be scared by the name it's very simple and beautiful algorithm, here's my solution for the corresponding problem SPOJ PERIOD.



                      Armed with a fast string factorization we can then proceed to decompose a string as the concatenation of factored chunks (dynamic programming) using a custom cost function e.g. minimize the total length of the seeds, minimize the sum of squares of the seeds' lengths...



                      $S = {T_1}^{p_1}{T_2}^{p_2}cdots{T_k}^{p_k}$



                      Total complexity is $O(n^2)$.



                      If $O(n^2)$ is way too costly you can shift to a greedy like strategy e.g. as soon as you find a factorizable chunk, squeeze it and continue from that point, and also limit the maximum seed size (e.g. $|T|le 200$) and prune the search if you don't find a period $ge 2$ in the first e.g. 400 characters.






                      share|cite|improve this answer










                      New contributor




                      mukel is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                      Check out our Code of Conduct.























                        3














                        An efficient string factorization algorithm may help.
                        Given a string $S$, $n = |S|$ find maximum $p$ such that $S = T^p$ e.g. $T$ concatenated $p$ times, we call $T$ the seed and p the period. This can be done in $O(n)$ using the prefix function of the (Knuth-)Morris-Pratt algorithm, but don't be scared by the name it's very simple and beautiful algorithm, here's my solution for the corresponding problem SPOJ PERIOD.



                        Armed with a fast string factorization we can then proceed to decompose a string as the concatenation of factored chunks (dynamic programming) using a custom cost function e.g. minimize the total length of the seeds, minimize the sum of squares of the seeds' lengths...



                        $S = {T_1}^{p_1}{T_2}^{p_2}cdots{T_k}^{p_k}$



                        Total complexity is $O(n^2)$.



                        If $O(n^2)$ is way too costly you can shift to a greedy like strategy e.g. as soon as you find a factorizable chunk, squeeze it and continue from that point, and also limit the maximum seed size (e.g. $|T|le 200$) and prune the search if you don't find a period $ge 2$ in the first e.g. 400 characters.






                        share|cite|improve this answer










                        New contributor




                        mukel is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                        Check out our Code of Conduct.





















                          3












                          3








                          3






                          An efficient string factorization algorithm may help.
                          Given a string $S$, $n = |S|$ find maximum $p$ such that $S = T^p$ e.g. $T$ concatenated $p$ times, we call $T$ the seed and p the period. This can be done in $O(n)$ using the prefix function of the (Knuth-)Morris-Pratt algorithm, but don't be scared by the name it's very simple and beautiful algorithm, here's my solution for the corresponding problem SPOJ PERIOD.



                          Armed with a fast string factorization we can then proceed to decompose a string as the concatenation of factored chunks (dynamic programming) using a custom cost function e.g. minimize the total length of the seeds, minimize the sum of squares of the seeds' lengths...



                          $S = {T_1}^{p_1}{T_2}^{p_2}cdots{T_k}^{p_k}$



                          Total complexity is $O(n^2)$.



                          If $O(n^2)$ is way too costly you can shift to a greedy like strategy e.g. as soon as you find a factorizable chunk, squeeze it and continue from that point, and also limit the maximum seed size (e.g. $|T|le 200$) and prune the search if you don't find a period $ge 2$ in the first e.g. 400 characters.






                          share|cite|improve this answer










                          New contributor




                          mukel is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                          Check out our Code of Conduct.









                          An efficient string factorization algorithm may help.
                          Given a string $S$, $n = |S|$ find maximum $p$ such that $S = T^p$ e.g. $T$ concatenated $p$ times, we call $T$ the seed and p the period. This can be done in $O(n)$ using the prefix function of the (Knuth-)Morris-Pratt algorithm, but don't be scared by the name it's very simple and beautiful algorithm, here's my solution for the corresponding problem SPOJ PERIOD.



                          Armed with a fast string factorization we can then proceed to decompose a string as the concatenation of factored chunks (dynamic programming) using a custom cost function e.g. minimize the total length of the seeds, minimize the sum of squares of the seeds' lengths...



                          $S = {T_1}^{p_1}{T_2}^{p_2}cdots{T_k}^{p_k}$



                          Total complexity is $O(n^2)$.



                          If $O(n^2)$ is way too costly you can shift to a greedy like strategy e.g. as soon as you find a factorizable chunk, squeeze it and continue from that point, and also limit the maximum seed size (e.g. $|T|le 200$) and prune the search if you don't find a period $ge 2$ in the first e.g. 400 characters.







                          share|cite|improve this answer










                          New contributor




                          mukel is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                          Check out our Code of Conduct.









                          share|cite|improve this answer



                          share|cite|improve this answer








                          edited 2 days ago









                          Apass.Jack

                          7,5121633




                          7,5121633






                          New contributor




                          mukel is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                          Check out our Code of Conduct.









                          answered 2 days ago









                          mukelmukel

                          312




                          312




                          New contributor




                          mukel is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                          Check out our Code of Conduct.





                          New contributor





                          mukel is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                          Check out our Code of Conduct.






                          mukel is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                          Check out our Code of Conduct.























                              0














                                string contiguous_repeated_string(string A){
                              string r(1,A[0]);
                              for(int i=0;i<A.size();i++){
                              for(int l=2;l<=A.size();l++){
                              string s=A.substr(i,l);
                              if ((s+s).substr(1,2*s.size()-1).find(s)!=s.size()-1 and s.size()>r.size()){
                              r=s;
                              }
                              }
                              }
                              return r;
                              }


                              Call contiguous_repeated_string("abcdefgabcdabcdabcd") you will get abcdabcdabcd.



                              Call contiguous_repeated_string("ATCGATCGA") you will get ATCGATCG.



                              And then you can use KMP's Longest Proper Prefix Postfix array to fold the resulting string with O(N).



                              Total time complexity is O(N^2).






                              share|cite|improve this answer








                              New contributor




                              Wu Fuheng is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                              Check out our Code of Conduct.


















                              • "Total time complexity is $O(N^2)$". What about the worst time complexity?
                                – Apass.Jack
                                14 hours ago
















                              0














                                string contiguous_repeated_string(string A){
                              string r(1,A[0]);
                              for(int i=0;i<A.size();i++){
                              for(int l=2;l<=A.size();l++){
                              string s=A.substr(i,l);
                              if ((s+s).substr(1,2*s.size()-1).find(s)!=s.size()-1 and s.size()>r.size()){
                              r=s;
                              }
                              }
                              }
                              return r;
                              }


                              Call contiguous_repeated_string("abcdefgabcdabcdabcd") you will get abcdabcdabcd.



                              Call contiguous_repeated_string("ATCGATCGA") you will get ATCGATCG.



                              And then you can use KMP's Longest Proper Prefix Postfix array to fold the resulting string with O(N).



                              Total time complexity is O(N^2).






                              share|cite|improve this answer








                              New contributor




                              Wu Fuheng is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                              Check out our Code of Conduct.


















                              • "Total time complexity is $O(N^2)$". What about the worst time complexity?
                                – Apass.Jack
                                14 hours ago














                              0












                              0








                              0






                                string contiguous_repeated_string(string A){
                              string r(1,A[0]);
                              for(int i=0;i<A.size();i++){
                              for(int l=2;l<=A.size();l++){
                              string s=A.substr(i,l);
                              if ((s+s).substr(1,2*s.size()-1).find(s)!=s.size()-1 and s.size()>r.size()){
                              r=s;
                              }
                              }
                              }
                              return r;
                              }


                              Call contiguous_repeated_string("abcdefgabcdabcdabcd") you will get abcdabcdabcd.



                              Call contiguous_repeated_string("ATCGATCGA") you will get ATCGATCG.



                              And then you can use KMP's Longest Proper Prefix Postfix array to fold the resulting string with O(N).



                              Total time complexity is O(N^2).






                              share|cite|improve this answer








                              New contributor




                              Wu Fuheng is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                              Check out our Code of Conduct.









                                string contiguous_repeated_string(string A){
                              string r(1,A[0]);
                              for(int i=0;i<A.size();i++){
                              for(int l=2;l<=A.size();l++){
                              string s=A.substr(i,l);
                              if ((s+s).substr(1,2*s.size()-1).find(s)!=s.size()-1 and s.size()>r.size()){
                              r=s;
                              }
                              }
                              }
                              return r;
                              }


                              Call contiguous_repeated_string("abcdefgabcdabcdabcd") you will get abcdabcdabcd.



                              Call contiguous_repeated_string("ATCGATCGA") you will get ATCGATCG.



                              And then you can use KMP's Longest Proper Prefix Postfix array to fold the resulting string with O(N).



                              Total time complexity is O(N^2).







                              share|cite|improve this answer








                              New contributor




                              Wu Fuheng is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                              Check out our Code of Conduct.









                              share|cite|improve this answer



                              share|cite|improve this answer






                              New contributor




                              Wu Fuheng is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                              Check out our Code of Conduct.









                              answered yesterday









                              Wu FuhengWu Fuheng

                              11




                              11




                              New contributor




                              Wu Fuheng is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                              Check out our Code of Conduct.





                              New contributor





                              Wu Fuheng is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                              Check out our Code of Conduct.






                              Wu Fuheng is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                              Check out our Code of Conduct.












                              • "Total time complexity is $O(N^2)$". What about the worst time complexity?
                                – Apass.Jack
                                14 hours ago


















                              • "Total time complexity is $O(N^2)$". What about the worst time complexity?
                                – Apass.Jack
                                14 hours ago
















                              "Total time complexity is $O(N^2)$". What about the worst time complexity?
                              – Apass.Jack
                              14 hours ago




                              "Total time complexity is $O(N^2)$". What about the worst time complexity?
                              – Apass.Jack
                              14 hours ago










                              Li Haoyi is a new contributor. Be nice, and check out our Code of Conduct.










                              draft saved

                              draft discarded


















                              Li Haoyi is a new contributor. Be nice, and check out our Code of Conduct.













                              Li Haoyi is a new contributor. Be nice, and check out our Code of Conduct.












                              Li Haoyi is a new contributor. Be nice, and check out our Code of Conduct.
















                              Thanks for contributing an answer to Computer Science Stack Exchange!


                              • Please be sure to answer the question. Provide details and share your research!

                              But avoid



                              • Asking for help, clarification, or responding to other answers.

                              • Making statements based on opinion; back them up with references or personal experience.


                              Use MathJax to format equations. MathJax reference.


                              To learn more, see our tips on writing great answers.





                              Some of your past answers have not been well-received, and you're in danger of being blocked from answering.


                              Please pay close attention to the following guidance:


                              • Please be sure to answer the question. Provide details and share your research!

                              But avoid



                              • Asking for help, clarification, or responding to other answers.

                              • Making statements based on opinion; back them up with references or personal experience.


                              To learn more, see our tips on writing great answers.




                              draft saved


                              draft discarded














                              StackExchange.ready(
                              function () {
                              StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcs.stackexchange.com%2fquestions%2f102463%2flooking-for-an-algorithm-to-identify-contiguous-repeated-series-of-lines-in-a-lo%23new-answer', 'question_page');
                              }
                              );

                              Post as a guest















                              Required, but never shown





















































                              Required, but never shown














                              Required, but never shown












                              Required, but never shown







                              Required, but never shown

































                              Required, but never shown














                              Required, but never shown












                              Required, but never shown







                              Required, but never shown







                              Popular posts from this blog

                              Сан-Квентин

                              8-я гвардейская общевойсковая армия

                              Алькесар